Add full monorepo: virtual-banker, backend, frontend, docs, scripts, deployment

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-02-10 11:32:49 -08:00
parent aafcd913c2
commit 88bc76da91
815 changed files with 125522 additions and 264 deletions
--- a/docs/specs/database/data-lake-schema.md
+++ b/docs/specs/database/data-lake-schema.md
@@ -0,0 +1,294 @@
+# Data Lake Schema Specification
+
+## Overview
+
+This document specifies the data lake schema for long-term storage of blockchain data. Data is stored in S3-compatible object storage in Parquet format and serves analytics, ML, and compliance workloads.
+
+## Storage Structure
+
+### Directory Layout
+
+```
+s3://explorer-data-lake/
+├── raw/
+│   ├── chain_id=138/
+│   │   ├── year=2024/
+│   │   │   ├── month=01/
+│   │   │   │   ├── day=01/
+│   │   │   │   │   ├── blocks.parquet
+│   │   │   │   │   ├── transactions.parquet
+│   │   │   │   │   └── logs.parquet
+│   │   │   │   └── ...
+│   │   │   └── ...
+│   │   └── ...
+│   └── ...
+├── processed/
+│   ├── chain_id=138/
+│   │   ├── daily_aggregates/
+│   │   │   ├── year=2024/
+│   │   │   │   └── month=01/
+│   │   │   │       └── day=01.parquet
+│   │   └── ...
+│   └── ...
+└── archived/
+    └── ...
+```
+
+### Partitioning Strategy
+
+**Partition Keys**:
+- `chain_id`: Chain identifier
+- `year`: Year (YYYY)
+- `month`: Month (MM)
+- `day`: Day (DD)
+
+**Benefits**:
+- Efficient query pruning
+- Parallel processing
+- Easy data management (delete by partition)
+
+## Parquet Schema
+
+### Blocks Parquet Schema
+
+```json
+{
+  "type": "struct",
+  "fields": [
+    {"name": "chain_id", "type": "integer", "nullable": false},
+    {"name": "number", "type": "long", "nullable": false},
+    {"name": "hash", "type": "string", "nullable": false},
+    {"name": "parent_hash", "type": "string", "nullable": false},
+    {"name": "timestamp", "type": "timestamp", "nullable": false},
+    {"name": "miner", "type": "string", "nullable": true},
+    {"name": "gas_used", "type": "long", "nullable": true},
+    {"name": "gas_limit", "type": "long", "nullable": true},
+    {"name": "transaction_count", "type": "integer", "nullable": true},
+    {"name": "size", "type": "integer", "nullable": true}
+  ]
+}
+```
+
+### Transactions Parquet Schema
+
+```json
+{
+  "type": "struct",
+  "fields": [
+    {"name": "chain_id", "type": "integer", "nullable": false},
+    {"name": "hash", "type": "string", "nullable": false},
+    {"name": "block_number", "type": "long", "nullable": false},
+    {"name": "transaction_index", "type": "integer", "nullable": false},
+    {"name": "from_address", "type": "string", "nullable": false},
+    {"name": "to_address", "type": "string", "nullable": true},
+    {"name": "value", "type": "string", "nullable": false},
+    {"name": "gas_price", "type": "long", "nullable": true},
+    {"name": "gas_used", "type": "long", "nullable": true},
+    {"name": "gas_limit", "type": "long", "nullable": false},
+    {"name": "status", "type": "integer", "nullable": true},
+    {"name": "timestamp", "type": "timestamp", "nullable": false}
+  ]
+}
+```
+
+`value` is stored as a decimal string because wei amounts are unsigned 256-bit integers, which overflow `long`.
+
+### Logs Parquet Schema
+
+```json
+{
+  "type": "struct",
+  "fields": [
+    {"name": "chain_id", "type": "integer", "nullable": false},
+    {"name": "transaction_hash", "type": "string", "nullable": false},
+    {"name": "block_number", "type": "long", "nullable": false},
+    {"name": "log_index", "type": "integer", "nullable": false},
+    {"name": "address", "type": "string", "nullable": false},
+    {"name": "topic0", "type": "string", "nullable": true},
+    {"name": "topic1", "type": "string", "nullable": true},
+    {"name": "topic2", "type": "string", "nullable": true},
+    {"name": "topic3", "type": "string", "nullable": true},
+    {"name": "data", "type": "string", "nullable": true},
+    {"name": "timestamp", "type": "timestamp", "nullable": false}
+  ]
+}
+```
+
+### Token Transfers Parquet Schema
+
+```json
+{
+  "type": "struct",
+  "fields": [
+    {"name": "chain_id", "type": "integer", "nullable": false},
+    {"name": "transaction_hash", "type": "string", "nullable": false},
+    {"name": "block_number", "type": "long", "nullable": false},
+    {"name": "token_address", "type": "string", "nullable": false},
+    {"name": "token_type", "type": "string", "nullable": false},
+    {"name": "from_address", "type": "string", "nullable": false},
+    {"name": "to_address", "type": "string", "nullable": false},
+    {"name": "amount", "type": "string", "nullable": true},
+    {"name": "token_id", "type": "string", "nullable": true},
+    {"name": "timestamp", "type": "timestamp", "nullable": false}
+  ]
+}
+```
+
+## Data Ingestion
+
+### ETL Pipeline
+
+**Process**:
+1. Extract: Query PostgreSQL for daily data
+2. Transform: Convert to Parquet format
+3. Load: Upload to S3 with partitioning
+
+**Schedule**: Daily batch job after day ends
+
+**Tools**: Apache Spark, AWS Glue, or custom ETL scripts
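+
+The extract step can be sketched as a half-open range query over one day. This is a minimal sketch, not the pipeline's actual query: it assumes a PostgreSQL `blocks` table whose columns mirror the Blocks Parquet schema above (the real definitions live in `postgres-schema.md`) and uses 2024-01-01 as an example day.
+
+```sql
+-- Hypothetical extract for the partition chain_id=138/year=2024/month=01/day=01.
+-- Table and column names are assumed to mirror the Blocks Parquet schema.
+SELECT
+  chain_id, number, hash, parent_hash, "timestamp",
+  miner, gas_used, gas_limit, transaction_count, size
+FROM blocks
+WHERE chain_id = 138
+  AND "timestamp" >= TIMESTAMP '2024-01-01 00:00:00'
+  AND "timestamp" <  TIMESTAMP '2024-01-02 00:00:00'
+ORDER BY number;
+```
+
+The half-open range (`>=` start, `<` next day) keeps a block stamped exactly at midnight from landing in two partitions.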
+
+### Compression
+
+**Format**: Snappy compression (good balance of speed and compression ratio)
+
+**Alternative**: Gzip (better compression, slower)
+
+### File Sizing
+
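+**Target Size**: 100-500 MB per Parquet file
+- Smaller files: More parallelism, but more S3 requests and per-file overhead
+- Larger files: Better compression and fewer requests, but less parallelism
+
+**Strategy**: Write files of the target size; when one day's data exceeds it, split the partition into multiple files by time range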
+
+## Query Interface
+
+### Amazon Athena / Presto
+
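+**Table Definition** (partition projection requires the partition columns to be declared in `PARTITIONED BY`; the `digits` properties match the zero-padded `month=01` and `day=01` paths; `timestamp` is a reserved word in Athena DDL and is backtick-quoted):
+```sql
+CREATE EXTERNAL TABLE blocks_138 (
+  chain_id int,
+  number bigint,
+  hash string,
+  parent_hash string,
+  `timestamp` timestamp,
+  miner string,
+  gas_used bigint,
+  gas_limit bigint,
+  transaction_count int,
+  size int
+)
+PARTITIONED BY (year int, month int, day int)
+STORED AS PARQUET
+LOCATION 's3://explorer-data-lake/raw/chain_id=138/'
+TBLPROPERTIES (
+  'projection.enabled' = 'true',
+  'projection.year.type' = 'integer',
+  'projection.year.range' = '2020,2030',
+  'projection.month.type' = 'integer',
+  'projection.month.range' = '1,12',
+  'projection.month.digits' = '2',
+  'projection.day.type' = 'integer',
+  'projection.day.range' = '1,31',
+  'projection.day.digits' = '2'
+);
+```
+
+Athena reads every object under a partition location, so this table returns only block rows if each dataset is stored under its own prefix rather than sharing a `day=` directory with `transactions.parquet` and `logs.parquet`.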
+
+### Query Examples
+
+**Daily Transaction Count**:
+```sql
+SELECT 
+  DATE(timestamp) as date,
+  COUNT(*) as transaction_count
+FROM transactions_138
+WHERE year = 2024 AND month = 1
+GROUP BY DATE(timestamp)
+ORDER BY date;
+```
+
+**Token Transfer Analytics**:
+```sql
+SELECT 
+  token_address,
+  COUNT(*) as transfer_count,
+  SUM(CAST(amount AS DOUBLE)) as total_volume  -- approximate: Athena DECIMAL is limited to 38 digits, uint256 amounts need up to 78
+FROM token_transfers_138
+WHERE year = 2024 AND month = 1
+GROUP BY token_address
+ORDER BY total_volume DESC
+LIMIT 100;
+```
+
+## Data Retention
+
+### Retention Policies
+
+- **Raw Data**: 7 years (compliance requirement)
+- **Processed Aggregates**: Indefinite
+- **Archived Data**: Move to Glacier after 1 year
+
+### Lifecycle Policies
+
+**S3 Lifecycle Rules**:
+1. Move to Infrequent Access after 30 days
+2. Move to Glacier after 1 year
+3. Delete after 7 years (raw data)
+
+## Data Processing
+
+### Aggregation Jobs
+
+**Daily Aggregates**:
+- Transaction counts by hour
+- Gas usage statistics
+- Token transfer volumes
+- Address activity metrics
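+
+The hourly transaction counts and gas usage statistics could be computed with a query along these lines. It is a sketch under assumptions: `transactions_138` is taken to be an Athena table over the Transactions Parquet schema, partitioned by `year`/`month`/`day` like `blocks_138`, and the output column names are illustrative.
+
+```sql
+-- Sketch of one day's hourly rollup (example day: 2024-01-01).
+-- transactions_138 and the output column names are assumptions.
+SELECT
+  date_trunc('hour', "timestamp") AS hour_start,
+  COUNT(*) AS transaction_count,
+  SUM(gas_used) AS total_gas_used,
+  AVG(gas_used) AS avg_gas_used,
+  MAX(gas_used) AS max_gas_used
+FROM transactions_138
+WHERE year = 2024 AND month = 1 AND day = 1
+GROUP BY date_trunc('hour', "timestamp")
+ORDER BY hour_start;
+```
+
+Filtering on the partition columns keeps the job's scan limited to the single day being aggregated.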
+
+**Monthly Aggregates**:
+- Network growth metrics
+- Token distribution changes
+- Protocol usage statistics
+
+### ML/Analytics Workflows
+
+**Use Cases**:
+- Anomaly detection
+- Fraud detection
+- Market analysis
+- Network health monitoring
+
+**Tools**: Spark, Pandas, Jupyter notebooks
+
+## Security and Access Control
+
+### Access Control
+
+- **IAM Policies**: Restrict access to specific prefixes
+- **Encryption**: Server-side encryption (SSE-S3 or SSE-KMS)
+- **Audit Logging**: Enable S3 access logging
+
+### Data Classification
+
+- **Public Data**: Blocks, transactions (public blockchain data)
+- **Sensitive Data**: User addresses, labels (requires authentication)
+- **Compliance Data**: Banking/transaction data (strict access control)
+
+## Cost Optimization
+
+### Storage Optimization
+
+**Strategies**:
+- Use appropriate storage classes (Standard, IA, Glacier)
+- Compress data (Parquet + Snappy)
+- Delete old data per retention policy
+- Use S3 Intelligent-Tiering for data with unpredictable access patterns
+
+### Query Optimization
+
+**Strategies**:
+- Partition pruning (query only relevant partitions)
+- Column pruning (select only needed columns)
+- Predicate pushdown (filter early)
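+
+A single query can combine all three strategies. The sketch below assumes the `blocks_138` table is partitioned by `year`, `month`, and `day`; the gas threshold is an arbitrary example value.
+
+```sql
+-- Partition pruning: only the 2024-01-01 partition is read.
+-- Column pruning: only the number and gas_used columns are fetched.
+-- Predicate pushdown: row groups whose gas_used statistics cannot
+-- satisfy the filter can be skipped by the Parquet reader.
+SELECT number, gas_used
+FROM blocks_138
+WHERE year = 2024 AND month = 1 AND day = 1
+  AND gas_used > 10000000;
+```
+
+By contrast, a `SELECT *` that filters only on the `timestamp` column has to open every partition, because the engine cannot map that filter onto the partition keys.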
+
+## References
+
+- Database Schema: See `postgres-schema.md`
+- Analytics: See `../observability/metrics-monitoring.md`
+