Metrics & Monitoring Specification
Overview
This document specifies metrics collection, dashboards, alerting rules, and SLOs for the explorer platform.
Metrics Catalog
API Metrics
- Request rate (requests/second)
- Response time (p50, p95, p99)
- Error rate (by status code)
- Endpoint usage
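As an illustration of the response-time percentiles listed above, here is a minimal sketch that computes p50, p95, and p99 from raw latency samples. It assumes the nearest-rank method and latencies in milliseconds; the function names `percentile` and `latency_summary` are hypothetical and not part of this specification. A production setup would more likely rely on histogram buckets in the metrics backend.

```python
def percentile(samples, q):
    """Nearest-rank percentile for an integer q in 1..100.

    Returns the smallest sample such that at least q% of all samples
    are less than or equal to it.
    """
    if not samples:
        raise ValueError("at least one sample is required")
    if not 1 <= q <= 100:
        raise ValueError("q must be between 1 and 100")
    ordered = sorted(samples)
    # Integer ceiling of q * n / 100, avoiding floating-point rounding.
    rank = -(-q * len(ordered) // 100)
    return ordered[rank - 1]


def latency_summary(samples_ms):
    """Return the three percentiles named in the metrics catalog."""
    return {f"p{q}": percentile(samples_ms, q) for q in (50, 95, 99)}
```

For example, `latency_summary([120, 80, 100, 900, 95])` reports the median as 100 ms while p95 and p99 both land on the 900 ms outlier, which is why the tail percentiles are tracked separately from the median.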
Indexer Metrics
- Blocks processed per minute
- Transactions processed per minute
- Block lag (chain head block number minus last indexed block number)
- Error rate
- Processing time
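The block lag metric above can be sketched as a simple subtraction. This is a minimal example, assuming both values are block heights as integers; the function name `block_lag` is hypothetical, and clamping negative results to zero is an assumption made here (for instance, to tolerate a stale chain-head reading), not something this specification states.

```python
def block_lag(chain_head: int, last_indexed: int) -> int:
    """Number of blocks the indexer is behind the chain head.

    Assumption: a negative difference (stale head reading) is reported
    as zero lag rather than as a negative number.
    """
    return max(0, chain_head - last_indexed)
```

With a chain head of 1,000,050 and a last indexed block of 1,000,000, the lag is 50 blocks.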
Database Metrics
- Query performance
- Connection pool usage
- Slow queries
- Replication lag
Infrastructure Metrics
- CPU usage
- Memory usage
- Disk I/O
- Network I/O
Dashboard Specifications
Key Dashboards
1. System Health:
- Overall system status
- Service health
- Error rates
- Resource usage
2. API Performance:
- Request rates
- Latency percentiles
- Error rates
- Top endpoints
3. Indexer Performance:
- Block processing rate
- Indexer lag
- Error rates
- Chain status
Alerting Rules
Alert Conditions
Critical:
- Service down
- Error rate > 5%
- Indexer lag > 100 blocks
- Database connection failures
Warning:
- Error rate > 1%
- Indexer lag > 10 blocks
- High latency (p95 > 1s)
- High resource usage (> 80%)
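The conditions above can be sketched as a single evaluation function. This is an illustrative sketch only: the `Snapshot` fields, the `Severity` enum, and the `evaluate` function are hypothetical names, rates and usage are assumed to be fractions between 0 and 1, and "database connection failures" is assumed to mean any non-zero failure count in the evaluation window.

```python
from dataclasses import dataclass, replace
from enum import Enum


class Severity(Enum):
    OK = "ok"
    WARNING = "warning"
    CRITICAL = "critical"


@dataclass(frozen=True)
class Snapshot:
    """One evaluation window of metrics (hypothetical shape)."""
    service_up: bool
    error_rate: float            # fraction of failed requests, 0..1
    indexer_lag_blocks: int
    db_connection_failures: int  # failures seen in this window
    p95_latency_s: float
    resource_usage: float        # highest usage fraction, 0..1


def evaluate(s: Snapshot) -> Severity:
    """Map a snapshot to the highest severity whose condition holds."""
    # Critical conditions are checked first so they take precedence.
    if (
        not s.service_up
        or s.error_rate > 0.05
        or s.indexer_lag_blocks > 100
        or s.db_connection_failures > 0
    ):
        return Severity.CRITICAL
    if (
        s.error_rate > 0.01
        or s.indexer_lag_blocks > 10
        or s.p95_latency_s > 1.0
        or s.resource_usage > 0.80
    ):
        return Severity.WARNING
    return Severity.OK


healthy = Snapshot(True, 0.0, 0, 0, 0.2, 0.3)
print(evaluate(replace(healthy, indexer_lag_blocks=42)))
```

The thresholds are strict inequalities, so a lag of exactly 10 blocks or an error rate of exactly 1% does not trigger a warning.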
Alert Channels
- Slack
- PagerDuty (for critical)
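A routing rule consistent with the channel list above might look like the following sketch. It assumes that Slack receives both warning and critical alerts while PagerDuty receives critical alerts only; the specification states only the PagerDuty restriction, so the Slack behaviour is an assumption. The function name `channels_for` and the lowercase severity strings are hypothetical, and no real Slack or PagerDuty API is called.

```python
def channels_for(severity: str) -> list[str]:
    """Return the alert channels for a severity ("warning" or "critical")."""
    if severity == "critical":
        # Critical alerts page the on-call engineer and are mirrored to Slack.
        return ["slack", "pagerduty"]
    if severity == "warning":
        return ["slack"]
    raise ValueError(f"unknown severity: {severity!r}")
```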
SLO Definitions
API SLOs
- Availability: 99.9% uptime
- Latency: p95 < 500ms
- Error Rate: < 0.1%
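To make the API targets concrete, the sketch below converts them into error budgets. It assumes a 30-day measurement window, which this specification does not define, and the function names are hypothetical.

```python
def downtime_budget_minutes(availability_target: float, window_days: int = 30) -> float:
    """Minutes of downtime permitted by an availability target over the window."""
    window_minutes = window_days * 24 * 60
    return (1.0 - availability_target) * window_minutes


def allowed_failed_requests(total_requests: int, max_error_rate: float) -> int:
    """Whole number of failed requests that keeps the error rate within the target."""
    return int(total_requests * max_error_rate)


# 99.9% availability over 30 days leaves roughly 43.2 minutes of downtime.
print(round(downtime_budget_minutes(0.999), 1))
```

Under the same assumptions, an error-rate target of 0.1% over 2,000,000 requests leaves a budget of about 2,000 failed requests.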
Indexer SLOs
- Lag: < 10 blocks behind chain head
- Processing Time: p95 < 5 seconds per block
WebSocket SLOs
- Delivery: 99.9% message delivery
- Latency: < 100ms message delivery
References
- Logging: See logging.md
- Tracing: See tracing.md