Add full monorepo: virtual-banker, backend, frontend, docs, scripts, deployment
Co-authored-by: Cursor <cursoragent@cursor.com>
294
docs/specs/database/data-lake-schema.md
Normal file
@@ -0,0 +1,294 @@
# Data Lake Schema Specification

## Overview

This document specifies the data lake schema for long-term storage of blockchain data in S3-compatible object storage using Parquet format for analytics, ML, and compliance purposes.

## Storage Structure

### Directory Layout

```
s3://explorer-data-lake/
├── raw/
│   ├── chain_id=138/
│   │   ├── year=2024/
│   │   │   ├── month=01/
│   │   │   │   ├── day=01/
│   │   │   │   │   ├── blocks.parquet
│   │   │   │   │   ├── transactions.parquet
│   │   │   │   │   └── logs.parquet
│   │   │   │   └── ...
│   │   │   └── ...
│   │   └── ...
│   └── ...
├── processed/
│   ├── chain_id=138/
│   │   ├── daily_aggregates/
│   │   │   ├── year=2024/
│   │   │   │   └── month=01/
│   │   │   │       └── day=01.parquet
│   │   └── ...
│   └── ...
└── archived/
    └── ...
```

### Partitioning Strategy

**Partition Keys**:
- `chain_id`: Chain identifier
- `year`: Year (YYYY)
- `month`: Month (MM, zero-padded)
- `day`: Day (DD, zero-padded)

**Benefits**:
- Efficient query pruning: queries that filter on partition keys skip unrelated prefixes
- Parallel processing: each partition can be read or written independently
- Easy data management (delete by partition)

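The partition keys above map directly onto object paths. The following is a minimal sketch of that mapping, assuming the bucket name and `raw/` prefix from the directory layout; the helper names `raw_partition_prefix` and `raw_object_key` are hypothetical and not part of the specification.

```python
from datetime import date

BUCKET = "explorer-data-lake"  # bucket name taken from the directory layout


def raw_partition_prefix(chain_id: int, day: date) -> str:
    """Return the Hive-style prefix for one chain and one calendar day."""
    return (
        f"s3://{BUCKET}/raw/"
        f"chain_id={chain_id}/"
        f"year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"
    )


def raw_object_key(chain_id: int, day: date, dataset: str) -> str:
    """Return the full object path for a dataset such as 'blocks' (hypothetical helper)."""
    return f"{raw_partition_prefix(chain_id, day)}{dataset}.parquet"


print(raw_object_key(138, date(2024, 1, 1), "blocks"))
# s3://explorer-data-lake/raw/chain_id=138/year=2024/month=01/day=01/blocks.parquet
```

Zero-padding `month` and `day` keeps prefixes sorted lexicographically in date order, which matches the `month=01` and `day=01` names in the layout.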
## Parquet Schema

### Blocks Parquet Schema

```json
{
  "type": "struct",
  "fields": [
    {"name": "chain_id", "type": "integer", "nullable": false},
    {"name": "number", "type": "long", "nullable": false},
    {"name": "hash", "type": "string", "nullable": false},
    {"name": "parent_hash", "type": "string", "nullable": false},
    {"name": "timestamp", "type": "timestamp", "nullable": false},
    {"name": "miner", "type": "string", "nullable": true},
    {"name": "gas_used", "type": "long", "nullable": true},
    {"name": "gas_limit", "type": "long", "nullable": true},
    {"name": "transaction_count", "type": "integer", "nullable": true},
    {"name": "size", "type": "integer", "nullable": true}
  ]
}
```

### Transactions Parquet Schema

```json
{
  "type": "struct",
  "fields": [
    {"name": "chain_id", "type": "integer", "nullable": false},
    {"name": "hash", "type": "string", "nullable": false},
    {"name": "block_number", "type": "long", "nullable": false},
    {"name": "transaction_index", "type": "integer", "nullable": false},
    {"name": "from_address", "type": "string", "nullable": false},
    {"name": "to_address", "type": "string", "nullable": true},
    {"name": "value", "type": "string", "nullable": false},
    {"name": "gas_price", "type": "long", "nullable": true},
    {"name": "gas_used", "type": "long", "nullable": true},
    {"name": "gas_limit", "type": "long", "nullable": false},
    {"name": "status", "type": "integer", "nullable": true},
    {"name": "timestamp", "type": "timestamp", "nullable": false}
  ]
}
```

`value` holds the decimal amount encoded as a string, since on-chain amounts can exceed the 64-bit range of `long`.

### Logs Parquet Schema

```json
{
  "type": "struct",
  "fields": [
    {"name": "chain_id", "type": "integer", "nullable": false},
    {"name": "transaction_hash", "type": "string", "nullable": false},
    {"name": "block_number", "type": "long", "nullable": false},
    {"name": "log_index", "type": "integer", "nullable": false},
    {"name": "address", "type": "string", "nullable": false},
    {"name": "topic0", "type": "string", "nullable": true},
    {"name": "topic1", "type": "string", "nullable": true},
    {"name": "topic2", "type": "string", "nullable": true},
    {"name": "topic3", "type": "string", "nullable": true},
    {"name": "data", "type": "string", "nullable": true},
    {"name": "timestamp", "type": "timestamp", "nullable": false}
  ]
}
```

### Token Transfers Parquet Schema

```json
{
  "type": "struct",
  "fields": [
    {"name": "chain_id", "type": "integer", "nullable": false},
    {"name": "transaction_hash", "type": "string", "nullable": false},
    {"name": "block_number", "type": "long", "nullable": false},
    {"name": "token_address", "type": "string", "nullable": false},
    {"name": "token_type", "type": "string", "nullable": false},
    {"name": "from_address", "type": "string", "nullable": false},
    {"name": "to_address", "type": "string", "nullable": false},
    {"name": "amount", "type": "string", "nullable": true},
    {"name": "token_id", "type": "string", "nullable": true},
    {"name": "timestamp", "type": "timestamp", "nullable": false}
  ]
}
```

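The schemas in this section can be used to check records before they are written. The sketch below is one possible validator, not a prescribed implementation: `TYPE_CHECKS`, `TOKEN_TRANSFER_FIELDS` (an abbreviated copy of the token transfers schema), and `validate_record` are hypothetical names, and the mapping from schema type names to Python types is an assumption.

```python
from datetime import datetime

# Assumed mapping from the schema's type names to Python types.
TYPE_CHECKS = {
    "integer": int,
    "long": int,
    "string": str,
    "timestamp": datetime,
}

# Abbreviated copy of the token transfers schema, for illustration only.
TOKEN_TRANSFER_FIELDS = [
    {"name": "chain_id", "type": "integer", "nullable": False},
    {"name": "transaction_hash", "type": "string", "nullable": False},
    {"name": "amount", "type": "string", "nullable": True},
    {"name": "timestamp", "type": "timestamp", "nullable": False},
]


def validate_record(record: dict, fields: list) -> list:
    """Return a list of problems; an empty list means the record fits the schema."""
    problems = []
    for field in fields:
        name = field["name"]
        value = record.get(name)
        if value is None:
            if not field["nullable"]:
                problems.append(f"{name}: missing but not nullable")
            continue
        expected = TYPE_CHECKS[field["type"]]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(
                f"{name}: expected {field['type']}, got {type(value).__name__}"
            )
    return problems
```

A record with a null `amount` passes because that field is nullable, while a missing `timestamp` or a `chain_id` passed as a string is reported.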
## Data Ingestion

### ETL Pipeline

**Process**:
1. Extract: Query PostgreSQL for the previous day's data
2. Transform: Convert the rows to Parquet format
3. Load: Upload the files to S3 under the matching partition prefix

**Schedule**: Daily batch job, run after the day ends

**Tools**: Apache Spark, AWS Glue, or custom ETL scripts

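The extract step needs a precise definition of "the day that just ended". The following is a minimal sketch of that window calculation and a matching parameterized query. It assumes days are cut at UTC midnight, that the PostgreSQL tables are named `blocks`, `transactions`, and `logs`, and that each has `chain_id` and `timestamp` columns; none of these are stated in this document, and both function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

ALLOWED_TABLES = {"blocks", "transactions", "logs"}  # assumed table names


def previous_day_window(now: datetime) -> tuple:
    """Return (start, end) for the last complete UTC day before `now`.

    The window is half-open: start <= timestamp < end.
    """
    if now.tzinfo is None:
        raise ValueError("now must be timezone-aware")
    now_utc = now.astimezone(timezone.utc)
    end = now_utc.replace(hour=0, minute=0, second=0, microsecond=0)
    start = end - timedelta(days=1)
    return start, end


def extract_query(table: str, chain_id: int, start: datetime, end: datetime) -> tuple:
    """Build a parameterized query (DB-API '%s' style) and its parameters."""
    if table not in ALLOWED_TABLES:
        raise ValueError(f"unexpected table: {table}")
    sql = (
        f"SELECT * FROM {table} "
        "WHERE chain_id = %s AND timestamp >= %s AND timestamp < %s"
    )
    return sql, (chain_id, start, end)


start, end = previous_day_window(datetime(2024, 1, 2, 3, 15, tzinfo=timezone.utc))
sql, params = extract_query("blocks", 138, start, end)
```

A half-open window avoids counting rows stamped exactly at midnight in two consecutive daily runs.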
### Compression

**Format**: Snappy compression (good balance of speed and compression ratio)

**Alternative**: Gzip (better compression ratio, slower to write and read)

### File Sizing

**Target Size**: 100-500 MB per Parquet file
- Smaller files: Better parallelism
- Larger files: Better compression

**Strategy**: Write files of target size, or split by time ranges

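To make the sizing strategy concrete, the sketch below estimates how many files to write for a partition so that no file exceeds a chosen target. The 256 MB default is an assumption picked from inside the 100-500 MB range, and `plan_file_count` and `approx_file_size` are hypothetical helpers.

```python
import math

MB = 1024 * 1024


def plan_file_count(total_bytes: int, target_bytes: int = 256 * MB) -> int:
    """Return the number of files so that each is at most `target_bytes`."""
    if total_bytes <= 0:
        return 0
    return math.ceil(total_bytes / target_bytes)


def approx_file_size(total_bytes: int, target_bytes: int = 256 * MB) -> float:
    """Return the approximate size of each file when data is split evenly."""
    count = plan_file_count(total_bytes, target_bytes)
    return total_bytes / count if count else 0.0


print(plan_file_count(1200 * MB))        # 5
print(approx_file_size(1200 * MB) / MB)  # 240.0
```

A partition smaller than the target still produces a single file, so low-volume days will fall below the 100 MB lower bound unless several days are combined.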
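## Query Interface

### AWS Athena / Presto

**Table Definition**:
```sql
CREATE EXTERNAL TABLE blocks_138 (
  chain_id int,
  number bigint,
  hash string,
  parent_hash string,
  `timestamp` timestamp,  -- reserved word in Athena DDL, so it is quoted
  miner string,
  gas_used bigint,
  gas_limit bigint,
  transaction_count int,
  size int
)
PARTITIONED BY (year int, month int, day int)
STORED AS PARQUET
LOCATION 's3://explorer-data-lake/raw/chain_id=138/'
TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.year.type' = 'integer',
  'projection.year.range' = '2020,2030',
  'projection.month.type' = 'integer',
  'projection.month.range' = '1,12',
  'projection.month.digits' = '2',
  'projection.day.type' = 'integer',
  'projection.day.range' = '1,31',
  'projection.day.digits' = '2'
);
```

Partition projection only applies to columns declared in `PARTITIONED BY`, and the `digits` properties make the projected values match the zero-padded `month=01` and `day=01` directory names.

Note that Athena reads every file under a partition location. With the directory layout above, where `blocks.parquet`, `transactions.parquet`, and `logs.parquet` share one `day=` directory, this table would also scan the other datasets, so each dataset needs its own prefix before a table per dataset can be defined this way.
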
### Query Examples

**Daily Transaction Count**:
```sql
SELECT
  DATE(timestamp) AS date,
  COUNT(*) AS transaction_count
FROM transactions_138
WHERE year = 2024 AND month = 1
GROUP BY DATE(timestamp)
ORDER BY date;
```

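Filters on `year` and `month` work well for whole months, but a date range that crosses a month boundary needs one clause per day to keep partition pruning effective. The sketch below generates such a predicate. It assumes the partition columns are integers, as in the projection settings, and `partition_predicate` is a hypothetical helper.

```python
from datetime import date, timedelta


def partition_predicate(start: date, end: date) -> str:
    """Build a WHERE fragment selecting every day partition in [start, end]."""
    if end < start:
        raise ValueError("end must not be before start")
    clauses = []
    day = start
    while day <= end:
        clauses.append(
            f"(year = {day.year} AND month = {day.month} AND day = {day.day})"
        )
        day += timedelta(days=1)
    return " OR ".join(clauses)


print(partition_predicate(date(2024, 1, 31), date(2024, 2, 1)))
# (year = 2024 AND month = 1 AND day = 31) OR (year = 2024 AND month = 2 AND day = 1)
```

The values are formatted from `date` objects rather than user-supplied strings, so the fragment can be appended to a query without further escaping.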
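**Token Transfer Analytics**:
```sql
SELECT
  token_address,
  COUNT(*) AS transfer_count,
  -- DECIMAL precision is limited to 38 digits in Athena/Presto
  SUM(CAST(amount AS DECIMAL(38, 0))) AS total_volume
FROM token_transfers_138
WHERE year = 2024 AND month = 1
GROUP BY token_address
ORDER BY total_volume DESC
LIMIT 100;
```

Athena and Presto cap `DECIMAL` precision at 38 digits, while a full 256-bit amount needs up to 78 digits. The cast above therefore fails for amounts wider than 38 digits, and the sum can overflow for very large totals.
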
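Because `amount` is stored as a string, totals can also be computed outside SQL without any precision limit. The following is a minimal sketch using Python's arbitrary-precision integers; it assumes each row is a dict with the token transfers schema's field names, and `total_volume_by_token` is a hypothetical helper.

```python
from collections import defaultdict


def total_volume_by_token(transfers) -> dict:
    """Sum string-encoded amounts per token address using exact integers."""
    totals = defaultdict(int)
    for row in transfers:
        amount = row.get("amount")
        if amount is None:  # `amount` is nullable in the schema
            continue
        totals[row["token_address"]] += int(amount)
    return dict(totals)


MAX_UINT256 = 2**256 - 1
rows = [
    {"token_address": "0xaaa", "amount": str(MAX_UINT256)},
    {"token_address": "0xaaa", "amount": str(MAX_UINT256)},
    {"token_address": "0xbbb", "amount": None},
]
totals = total_volume_by_token(rows)
```

Here the total for `0xaaa` is twice the largest 256-bit value, which no fixed-width SQL numeric type in this stack can hold.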
## Data Retention

### Retention Policies

- **Raw Data**: 7 years (compliance requirement)
- **Processed Aggregates**: Indefinite
- **Archived Data**: Move to Glacier after 1 year

### Lifecycle Policies

**S3 Lifecycle Rules**:
1. Move to Infrequent Access after 30 days
2. Move to Glacier after 1 year
3. Delete after 7 years (raw data)

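The three rules above can be written as a single S3 lifecycle configuration. The sketch below builds the configuration as a plain dict in the shape that boto3's `put_bucket_lifecycle_configuration` accepts for its `LifecycleConfiguration` argument. The rule ID, the `raw/` prefix filter, and the use of `7 * 365` days as an approximation of seven years are assumptions.

```python
def raw_lifecycle_configuration(prefix: str = "raw/") -> dict:
    """Return a lifecycle configuration for raw data (hypothetical helper)."""
    return {
        "Rules": [
            {
                "ID": "raw-tiering-and-expiry",  # assumed rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    # Rule 1: Infrequent Access after 30 days
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    # Rule 2: Glacier after 1 year
                    {"Days": 365, "StorageClass": "GLACIER"},
                ],
                # Rule 3: delete after roughly 7 years (ignores leap days)
                "Expiration": {"Days": 7 * 365},
            }
        ]
    }


config = raw_lifecycle_configuration()
```

Scoping the rule to the `raw/` prefix keeps the expiration away from processed aggregates, which are retained indefinitely.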
## Data Processing

### Aggregation Jobs

**Daily Aggregates**:
- Transaction counts by hour
- Gas usage statistics
- Token transfer volumes
- Address activity metrics

**Monthly Aggregates**:
- Network growth metrics
- Token distribution changes
- Protocol usage statistics

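As an illustration of the first daily aggregate, the sketch below counts transactions per hour from a day's transaction timestamps. It is a small in-memory example rather than the production job, and `transactions_per_hour` is a hypothetical name.

```python
from collections import Counter
from datetime import datetime


def transactions_per_hour(timestamps) -> list:
    """Return 24 counts, one per hour of the day, including empty hours."""
    counts = Counter(ts.hour for ts in timestamps)
    return [counts.get(hour, 0) for hour in range(24)]


sample = [
    datetime(2024, 1, 1, 0, 5),
    datetime(2024, 1, 1, 0, 59),
    datetime(2024, 1, 1, 13, 30),
]
hourly = transactions_per_hour(sample)
```

Returning a fixed-length list with explicit zeros keeps every daily aggregate the same shape, which simplifies comparisons across days.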
### ML/Analytics Workflows

**Use Cases**:
- Anomaly detection
- Fraud detection
- Market analysis
- Network health monitoring

**Tools**: Spark, Pandas, Jupyter notebooks

## Security and Access Control

### Access Control

- **IAM Policies**: Restrict access to specific prefixes
- **Encryption**: Server-side encryption (SSE-S3 or SSE-KMS)
- **Audit Logging**: Enable S3 access logging

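Prefix-level restriction can be expressed as an IAM policy document. The sketch below builds a read-only policy for one prefix as a Python dict. The statement IDs and the choice of `processed` as the example prefix are assumptions, and `read_only_prefix_policy` is a hypothetical helper.

```python
import json


def read_only_prefix_policy(bucket: str, prefix: str) -> dict:
    """Return an IAM policy allowing list and read access under one prefix."""
    prefix = prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListPrefixOnly",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                # Listing is granted on the bucket but limited to the prefix.
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                "Sid": "ReadObjectsUnderPrefix",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
        ],
    }


policy = read_only_prefix_policy("explorer-data-lake", "processed/")
print(json.dumps(policy, indent=2))
```

Listing and reading are separate statements because `s3:ListBucket` applies to the bucket resource while `s3:GetObject` applies to object resources.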
### Data Classification

- **Public Data**: Blocks, transactions (public blockchain data)
- **Sensitive Data**: User addresses, labels (requires authentication)
- **Compliance Data**: Banking/transaction data (strict access control)

## Cost Optimization

### Storage Optimization

**Strategies**:
- Use appropriate storage classes (Standard, IA, Glacier)
- Compress data (Parquet + Snappy)
- Delete old data per retention policy
- Use intelligent tiering

### Query Optimization

**Strategies**:
- Partition pruning (query only relevant partitions)
- Column pruning (select only needed columns)
- Predicate pushdown (filter early)

## References

- Database Schema: See `postgres-schema.md`
- Analytics: See `../observability/metrics-monitoring.md`