RPC Service Level Objectives (SLO)

Service level objectives for RPC endpoints on ChainID 138.

Overview

This document defines the service level objectives for RPC endpoints serving ChainID 138 (DeFi Oracle Meta Mainnet).

RPC Endpoints

Primary Endpoint

  • URL: https://rpc.d-bis.org
  • Protocol: HTTPS
  • WebSocket: wss://rpc.d-bis.org
  • Location: Azure (Primary region)
  • Infrastructure: Azure Application Gateway + AKS RPC nodes

Secondary Endpoint

  • URL: https://rpc2.d-bis.org
  • Protocol: HTTPS
  • WebSocket: wss://rpc2.d-bis.org
  • Location: Azure (Secondary region)
  • Infrastructure: Azure Application Gateway + AKS RPC nodes
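
A minimal sketch of how either endpoint could be probed, assuming the endpoints speak standard Ethereum JSON-RPC and answer `eth_chainId` (the document states the chain ID but not the RPC methods exposed). The helper names are hypothetical; the functions only build the request body and check a response, so any HTTP client can carry them.

```python
import json

EXPECTED_CHAIN_ID = 138  # ChainID stated in this document


def build_chain_id_request(request_id: int = 1) -> bytes:
    """Encode a JSON-RPC 2.0 eth_chainId request body."""
    payload = {
        "jsonrpc": "2.0",
        "method": "eth_chainId",
        "params": [],
        "id": request_id,
    }
    return json.dumps(payload).encode("utf-8")


def is_expected_chain(response_body: bytes, expected: int = EXPECTED_CHAIN_ID) -> bool:
    """Return True if the response carries the expected chain ID.

    eth_chainId returns the ID as a hex string, e.g. "0x8a" for 138.
    Malformed or error responses count as a failed probe.
    """
    try:
        data = json.loads(response_body)
        return int(data["result"], 16) == expected
    except (ValueError, KeyError, TypeError):
        return False
```

The request body would be sent as an HTTP POST with `Content-Type: application/json` to https://rpc.d-bis.org or https://rpc2.d-bis.org.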

Service Level Objectives

Availability

  • Target: ≥99.9% monthly uptime
  • Measurement: Percentage of time RPC endpoints are accessible, approximated by the request success ratio defined under Uptime SLI
  • Monitoring: Azure Monitor, Prometheus, Status page
  • Alerting: Alert on <99.9% uptime
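
To make the availability target concrete, the sketch below converts an uptime percentage into the downtime it permits per month. The function name and the 30-day default are illustrative assumptions; the document does not say how a month is counted.

```python
def allowed_downtime_minutes(slo_percent: float, days_in_month: int = 30) -> float:
    """Minutes of downtime permitted in a month at the given uptime SLO."""
    if not 0 <= slo_percent <= 100:
        raise ValueError("slo_percent must be between 0 and 100")
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (100.0 - slo_percent) / 100.0


# A 99.9% target over a 30-day month leaves roughly 43.2 minutes of downtime.
budget = allowed_downtime_minutes(99.9)
```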

Latency

  • Target: <200ms p95 latency
  • Measurement: 95th percentile response time
  • Monitoring: Azure Application Insights, Prometheus
  • Alerting: Alert on >200ms p95 latency

Throughput

  • Target: 1000+ requests/second
  • Measurement: Requests per second (RPS)
  • Monitoring: Azure Monitor, Prometheus
  • Alerting: Alert when throughput exceeds 90% of capacity

Error Rate

  • Target: <0.1% error rate
  • Measurement: Percentage of requests that result in errors
  • Monitoring: Azure Monitor, Prometheus
  • Alerting: Alert on >0.1% error rate

Service Level Indicators (SLI)

Uptime SLI

Uptime SLI = (Successful requests / Total requests) * 100

Latency SLI

Latency SLI = p95 response time

Throughput SLI

Throughput SLI = Requests per second

Error Rate SLI

Error Rate SLI = (Error requests / Total requests) * 100
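
The four SLI formulas above can be sketched over a window of request samples. The `Request` record, the function names, and the nearest-rank percentile method are assumptions made for illustration; the document does not specify how p95 is computed or how samples are stored.

```python
import math
from typing import List, NamedTuple


class Request(NamedTuple):
    latency_ms: float  # response time of one request
    ok: bool           # True if the request succeeded


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty list."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def compute_slis(requests: List[Request], window_seconds: float) -> dict:
    """Compute the four SLIs for one measurement window."""
    total = len(requests)
    if total == 0 or window_seconds <= 0:
        raise ValueError("need at least one request and a positive window")
    errors = sum(1 for r in requests if not r.ok)
    return {
        "uptime_pct": (total - errors) / total * 100,
        "latency_p95_ms": percentile([r.latency_ms for r in requests], 95),
        "throughput_rps": total / window_seconds,
        "error_rate_pct": errors / total * 100,
    }
```

In practice these values would more likely come from Prometheus or Azure Monitor queries than from raw samples held in memory.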

Monitoring

Metrics

  • Uptime: Percentage of time endpoints are up
  • Latency: Response time percentiles (p50, p95, p99)
  • Throughput: Requests per second
  • Error Rate: Percentage of errors
  • Availability: Endpoint availability status

Tools

  • Azure Monitor: Cloud monitoring
  • Prometheus: Metrics collection
  • Grafana: Metrics visualization
  • Application Insights: Application performance monitoring
  • Status Page: Public status page

Alerting

Alerts

  • Uptime < 99.9%: Critical alert
  • Latency > 200ms p95: Warning alert
  • Throughput > 90% capacity: Warning alert
  • Error Rate > 0.1%: Critical alert
  • Endpoint down: Critical alert
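
The alert rules above can be expressed as a small evaluation function. This is a sketch only: the function name, parameter names, and the idea of passing capacity as an argument are assumptions, and a real deployment would more likely encode these rules in Prometheus or Azure Monitor alert definitions.

```python
from typing import List, Tuple


def evaluate_alerts(
    uptime_pct: float,
    latency_p95_ms: float,
    throughput_rps: float,
    capacity_rps: float,
    error_rate_pct: float,
    endpoint_up: bool = True,
) -> List[Tuple[str, str]]:
    """Return (alert name, severity) pairs for every rule that fires."""
    alerts = []
    if not endpoint_up:
        alerts.append(("Endpoint down", "critical"))
    if uptime_pct < 99.9:
        alerts.append(("Uptime < 99.9%", "critical"))
    if latency_p95_ms > 200:
        alerts.append(("Latency > 200ms p95", "warning"))
    if throughput_rps > 0.9 * capacity_rps:
        alerts.append(("Throughput > 90% capacity", "warning"))
    if error_rate_pct > 0.1:
        alerts.append(("Error Rate > 0.1%", "critical"))
    return alerts
```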

Notification Channels

  • Email: Operations team
  • Slack: Operations channel
  • PagerDuty: On-call rotation
  • SMS: Critical alerts only

Status Page

Public Status Page

  • URL: https://status.d-bis.org (to be created)
  • Updates: Real-time status updates
  • Incidents: Incident reporting
  • Maintenance: Maintenance windows

Status Indicators

  • Operational: All systems operational
  • Degraded: Some issues, but service available
  • Outage: Service unavailable
  • Maintenance: Scheduled maintenance

Incident Response

Severity Levels

  1. Critical: Service completely down
  2. High: Significant degradation
  3. Medium: Minor issues
  4. Low: Informational

Response Times

  • Critical: 15 minutes
  • High: 1 hour
  • Medium: 4 hours
  • Low: 24 hours
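
As an illustration, the response times above can be turned into a deadline for a given incident. The mapping keys and the `response_deadline` helper are hypothetical names, not part of any existing tooling.

```python
from datetime import datetime, timedelta, timezone

# Response targets per severity, as listed above.
RESPONSE_TARGETS = {
    "critical": timedelta(minutes=15),
    "high": timedelta(hours=1),
    "medium": timedelta(hours=4),
    "low": timedelta(hours=24),
}


def response_deadline(severity: str, detected_at: datetime) -> datetime:
    """Latest time by which a response is due for an incident."""
    return detected_at + RESPONSE_TARGETS[severity.lower()]


detected = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
deadline = response_deadline("Critical", detected)  # 12:15 UTC the same day
```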

Escalation

  1. Level 1: On-call engineer
  2. Level 2: Senior engineer
  3. Level 3: Engineering manager
  4. Level 4: CTO

Disaster Recovery

Backup Endpoints

  • Primary: https://rpc.d-bis.org
  • Secondary: https://rpc2.d-bis.org
  • Tertiary: [To be configured]
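
A client can also treat the endpoint order above as a fallback list. The sketch below is client-side fallback, which is a separate technique from the DNS-based failover described in this document. `first_healthy` and the `probe` callable are hypothetical; the probe is injected so that any health check (for example a JSON-RPC call over HTTPS) can be plugged in.

```python
from typing import Callable, Optional, Sequence

# Priority order from the list above; the tertiary endpoint is not yet configured.
ENDPOINTS = ["https://rpc.d-bis.org", "https://rpc2.d-bis.org"]


def first_healthy(
    endpoints: Sequence[str],
    probe: Callable[[str], bool],
) -> Optional[str]:
    """Return the first endpoint whose probe succeeds, or None if all fail.

    A probe that raises OSError (connection refused, timeout, DNS failure)
    or ValueError (malformed response) is treated as unhealthy.
    """
    for url in endpoints:
        try:
            if probe(url):
                return url
        except (OSError, ValueError):
            continue
    return None
```

Calling `first_healthy(ENDPOINTS, probe)` returns the primary while it is healthy and the secondary otherwise.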

Failover

  • Automatic: DNS-based failover
  • Manual: Operator-initiated failover procedures
  • Testing: Quarterly failover tests

Capacity Planning

Current Capacity

  • RPS: 1000+ requests/second
  • Concurrent Connections: 10,000+
  • Bandwidth: 1 Gbps+

Scaling

  • Horizontal: Add more RPC nodes
  • Vertical: Increase node resources
  • Auto-scaling: Kubernetes auto-scaling
  • Load Balancing: Application Gateway load balancing
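
For horizontal scaling, a rough node count can be derived from peak load. The per-node throughput is a hypothetical input that would have to be measured; the 90% utilization ceiling mirrors the throughput alert threshold in this document.

```python
import math


def nodes_required(
    peak_rps: float,
    per_node_rps: float,
    max_utilization: float = 0.9,
) -> int:
    """Smallest node count that keeps peak load under the utilization ceiling."""
    if per_node_rps <= 0 or not 0 < max_utilization <= 1:
        raise ValueError("per_node_rps must be > 0 and max_utilization in (0, 1]")
    return max(1, math.ceil(peak_rps / (per_node_rps * max_utilization)))


# Example with an assumed 250 RPS per node: 1000 RPS at 90% utilization needs 5 nodes.
example_nodes = nodes_required(1000, 250)
```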

Reporting

Monthly Reports

  • Uptime: Monthly uptime percentage
  • Latency: Average and p95 latency
  • Throughput: Average and peak throughput
  • Error Rate: Error rate percentage
  • Incidents: Number and duration of incidents

Quarterly Reviews

  • SLO Performance: Review measured performance against SLO targets
  • Improvements: Identify improvements
  • Capacity Planning: Plan for capacity increases
  • Disaster Recovery: Review disaster recovery procedures
