Metrics & Monitoring Specification

Overview

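This document specifies the metrics collected from the explorer platform, the dashboards built on them, the alerting rules, and the service level objectives (SLOs).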

Metrics Catalog

API Metrics

  • Request rate (requests/second)
  • Response time (p50, p95, p99)
  • Error rate (by status code)
  • Endpoint usage
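As a rough illustration of how the response-time percentiles above could be derived from raw samples, the sketch below uses the nearest-rank method. The function names and the choice of nearest-rank are assumptions made for this example; the spec does not prescribe a calculation method, and a metrics backend would normally compute these from histograms instead.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of
    all samples at or below it. `p` is an integer in 1..100."""
    if not samples:
        raise ValueError("at least one sample is required")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range 1..100")
    ordered = sorted(samples)
    # 1-based rank; p and len() are integers, so p * n / 100 is computed
    # from an exact integer product before the ceiling is taken.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def summarize_latency(samples_ms):
    """Return the three percentiles listed in the catalog (hypothetical helper)."""
    return {f"p{p}": percentile(samples_ms, p) for p in (50, 95, 99)}
```

For example, `summarize_latency(list(range(1, 101)))` reports the 50th, 95th, and 99th values of a 100-sample series.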

Indexer Metrics

  • Blocks processed per minute
  • Transactions processed per minute
  • Block lag (chain head block number - last indexed block number)
  • Error rate
  • Processing time per block
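A minimal sketch of two of the indexer metrics above, block lag and blocks processed per minute. The `IndexerSample` type and both function names are hypothetical, introduced only for this example; clamping negative lag to zero is also an assumption (it covers the case where the indexer briefly reports a block the chain-head reading has not caught up with yet).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexerSample:
    # Hypothetical observation of indexer progress at a point in time.
    timestamp_s: float
    last_indexed_block: int


def block_lag(chain_head_block: int, last_indexed_block: int) -> int:
    """Current chain head minus last indexed block, never negative."""
    return max(chain_head_block - last_indexed_block, 0)


def blocks_per_minute(earlier: IndexerSample, later: IndexerSample) -> float:
    """Average processing rate between two samples."""
    elapsed_s = later.timestamp_s - earlier.timestamp_s
    if elapsed_s <= 0:
        raise ValueError("samples must be in increasing time order")
    processed = later.last_indexed_block - earlier.last_indexed_block
    return processed * 60.0 / elapsed_s
```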

Database Metrics

  • Query performance
  • Connection pool usage
  • Slow queries
  • Replication lag

Infrastructure Metrics

  • CPU usage
  • Memory usage
  • Disk I/O
  • Network I/O

Dashboard Specifications

Key Dashboards

1. System Health:

  • Overall system status
  • Service health
  • Error rates
  • Resource usage

2. API Performance:

  • Request rates
  • Latency percentiles
  • Error rates
  • Top endpoints

3. Indexer Performance:

  • Block processing rate
  • Indexer lag
  • Error rates
  • Chain status

Alerting Rules

Alert Conditions

Critical:

  • Service down
  • Error rate > 5%
  • Indexer lag > 100 blocks
  • Database connection failures

Warning:

  • Error rate > 1%
  • Indexer lag > 10 blocks
  • High latency (p95 > 1s)
  • High resource usage (> 80%)
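The conditions above can be read as a two-tier classification in which critical conditions take precedence over warnings. The sketch below encodes the listed thresholds; the `Snapshot` fields, the `Severity` enum, and the decision to express rates as fractions are assumptions for illustration, not part of the spec.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    OK = "ok"
    WARNING = "warning"
    CRITICAL = "critical"


@dataclass(frozen=True)
class Snapshot:
    # Hypothetical field names; the spec does not define a data model.
    service_up: bool
    db_connection_failures: int
    error_rate: float          # fraction of failed requests, 0.05 == 5%
    indexer_lag_blocks: int
    p95_latency_s: float
    resource_usage: float      # highest utilisation fraction, 0.80 == 80%


def classify(s: Snapshot) -> Severity:
    """Map a snapshot to the most severe alert level it triggers."""
    critical = (
        not s.service_up
        or s.db_connection_failures > 0
        or s.error_rate > 0.05
        or s.indexer_lag_blocks > 100
    )
    if critical:
        return Severity.CRITICAL
    warning = (
        s.error_rate > 0.01
        or s.indexer_lag_blocks > 10
        or s.p95_latency_s > 1.0
        or s.resource_usage > 0.80
    )
    return Severity.WARNING if warning else Severity.OK
```

All comparisons are strict, matching the ">" in the rules, so an error rate of exactly 5% is still a warning rather than critical.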

Alert Channels

  • Email
  • Slack
  • PagerDuty (for critical)
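One possible routing of alerts to the channels above is sketched below. Only the restriction of PagerDuty to critical alerts comes from the list; sending both severities to email and Slack is an assumption, and `channels_for` is a hypothetical helper name.

```python
# Assumed routing table: the spec only states that PagerDuty is for critical alerts.
ROUTES = {
    "warning": ["email", "slack"],
    "critical": ["email", "slack", "pagerduty"],
}


def channels_for(severity: str) -> list[str]:
    """Return the notification channels for an alert severity."""
    if severity not in ROUTES:
        raise ValueError(f"unknown severity: {severity!r}")
    # Return a copy so callers cannot mutate the routing table.
    return list(ROUTES[severity])
```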

SLO Definitions

API SLOs

  • Availability: 99.9% uptime
  • Latency: p95 < 500ms
  • Error Rate: < 0.1%
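To make the availability target concrete: 99.9% uptime leaves a downtime budget of 0.1% of the measurement window. The sketch below computes that budget and checks the three API SLOs together. The 30-day default window is an assumption, since the spec does not state a measurement period, and both function names are hypothetical.

```python
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(availability_target: float, window_days: int = 30) -> float:
    """Downtime budget implied by an availability target over a window."""
    if not 0 < availability_target <= 1:
        raise ValueError("availability_target must be in (0, 1]")
    return (1 - availability_target) * window_days * MINUTES_PER_DAY


def api_slos_met(availability: float, p95_latency_ms: float, error_rate: float) -> bool:
    """Check measured values against the API SLOs listed above.

    `availability` and `error_rate` are fractions (0.999 == 99.9%).
    """
    return (
        availability >= 0.999
        and p95_latency_ms < 500
        and error_rate < 0.001
    )
```

Under the 30-day assumption, the 99.9% target corresponds to roughly 43.2 minutes of allowed downtime.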

Indexer SLOs

  • Lag: < 10 blocks behind chain head
  • Processing Time: p95 < 5 seconds per block

WebSocket SLOs

  • Delivery: 99.9% of messages delivered
  • Latency: message delivery in < 100ms
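A small sketch of how the WebSocket delivery objective could be evaluated from two counters. The counter names are hypothetical, and the spec does not say how "delivered" is determined (for example, by client acknowledgement), so that part is left to the caller. Integer arithmetic is used to avoid floating-point rounding at the 99.9% boundary.

```python
def delivery_slo_met(messages_sent: int, messages_delivered: int) -> bool:
    """True when at least 99.9% of sent messages were delivered."""
    if messages_sent < 0 or not 0 <= messages_delivered <= messages_sent:
        raise ValueError("counters must satisfy 0 <= delivered <= sent")
    if messages_sent == 0:
        # Assumption: with no traffic, the objective is treated as met.
        return True
    # delivered / sent >= 999 / 1000, rearranged to stay in integers.
    return messages_delivered * 1000 >= messages_sent * 999
```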

References

  • Logging: See logging.md
  • Tracing: See tracing.md