d-bis/smom-dbis-138

defiQUG 79750d92e6 Archive legacy status docs and canonicalize genesis entrypoints

2026-04-13 21:45:16 -07:00

RPC Service Level Objectives (SLO)

Service level objectives for RPC endpoints on ChainID 138.

Overview

This document defines the service level objectives for RPC endpoints serving ChainID 138 (DeFi Oracle Meta Mainnet).

RPC Endpoints

Primary Endpoint

URL: https://rpc.d-bis.org
Protocol: HTTPS
WebSocket: wss://rpc.d-bis.org
Location: Azure (Primary region)
Infrastructure: Azure Application Gateway + AKS RPC nodes

Secondary Endpoint

URL: https://rpc2.d-bis.org
Protocol: HTTPS
WebSocket: wss://rpc2.d-bis.org
Location: Azure (Secondary region)
Infrastructure: Azure Application Gateway + AKS RPC nodes
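
A health probe against either endpoint can be sketched as a JSON-RPC 2.0 request and a check on the reply. This is a minimal sketch, assuming the endpoints expose the standard Ethereum `eth_chainId` method, which this document does not state explicitly. The helper names are hypothetical, and the HTTP transport is deliberately left out so the logic can be exercised offline.

```python
import json

CHAIN_ID = 138  # ChainID stated in this document


def build_chain_id_request(request_id: int = 1) -> str:
    """Serialize a JSON-RPC 2.0 eth_chainId request body (assumed method)."""
    return json.dumps(
        {"jsonrpc": "2.0", "method": "eth_chainId", "params": [], "id": request_id}
    )


def is_expected_chain(response_body: str, expected: int = CHAIN_ID) -> bool:
    """Return True if a JSON-RPC response reports the expected chain ID.

    The result field is a hex string such as "0x8a"; any malformed or
    error response is treated as a failed probe.
    """
    try:
        payload = json.loads(response_body)
        return int(payload["result"], 16) == expected
    except (ValueError, KeyError, TypeError):
        return False
```

The request body would be sent as an HTTPS POST to the Primary or Secondary URL; a reply whose `result` decodes to 138 counts as a successful probe.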

Service Level Objectives

Availability

Target: ≥99.9% monthly uptime
Measurement: Percentage of time RPC endpoints are accessible
Monitoring: Azure Monitor, Prometheus, Status page
Alerting: Alert on <99.9% uptime
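
To make the availability target concrete, the allowed downtime (the error budget) can be derived from the percentage. The sketch below assumes a 30-day month by default; the function name is illustrative and not part of any existing tooling.

```python
def allowed_downtime_minutes(target_percent: float, days_in_month: int = 30) -> float:
    """Return the minutes of downtime permitted by a monthly uptime target."""
    if not 0 <= target_percent <= 100:
        raise ValueError("target_percent must be between 0 and 100")
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (100.0 - target_percent) / 100.0


# A 99.9% target over a 30-day month leaves roughly 43.2 minutes of downtime.
budget = round(allowed_downtime_minutes(99.9), 1)
```

Under these assumptions, a single outage longer than about 43 minutes would exhaust the monthly budget on its own.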

Latency

Target: <200ms p95 latency
Measurement: 95th percentile response time
Monitoring: Azure Application Insights, Prometheus
Alerting: Alert on >200ms p95 latency

Throughput

Target: 1000+ requests/second
Measurement: Requests per second (RPS)
Monitoring: Azure Monitor, Prometheus
Alerting: Alert when throughput exceeds 90% of capacity

Error Rate

Target: <0.1% error rate
Measurement: Percentage of requests that result in errors
Monitoring: Azure Monitor, Prometheus
Alerting: Alert on >0.1% error rate

Service Level Indicators (SLI)

Uptime SLI

Uptime SLI = (Minutes endpoints are accessible / Total minutes in the month) * 100

Latency SLI

Latency SLI = p95 response time
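
As an illustration of the latency SLI, the 95th percentile can be computed from raw response-time samples with the nearest-rank method. This is only a sketch: monitoring systems such as Prometheus usually estimate quantiles from histogram buckets instead, so their values may differ slightly. The function name and sample data are hypothetical.

```python
import math
from typing import Sequence


def percentile_nearest_rank(samples: Sequence[float], pct: float) -> float:
    """Return the pct-th percentile of samples using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    # Nearest rank: the smallest 1-based rank covering pct percent of samples.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical response times in milliseconds.
latencies_ms = [120, 80, 150, 90, 300, 110, 95, 100, 130, 85]
p95_ms = percentile_nearest_rank(latencies_ms, 95)
meets_target = p95_ms < 200  # target from this document: <200ms p95
```

With only ten samples the p95 is the largest value, so one slow request is enough to miss the target; in practice the SLI would be computed over a much larger window.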

Throughput SLI

Throughput SLI = Requests per second

Error Rate SLI

Error Rate SLI = (Error requests / Total requests) * 100
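
The error rate formula translates directly into code. What counts as an error request (for example HTTP 5xx responses or JSON-RPC error objects) is not defined in this document, so the counts passed in below are assumed to come from whatever classification the monitoring stack applies.

```python
def error_rate_percent(error_requests: int, total_requests: int) -> float:
    """Return the error rate as a percentage of total requests."""
    if error_requests < 0 or total_requests < 0:
        raise ValueError("request counts must be non-negative")
    if total_requests == 0:
        # No traffic in the window: report 0 rather than divide by zero.
        return 0.0
    return error_requests / total_requests * 100


# 5 errors in 10,000 requests is about 0.05%, inside the <0.1% target.
within_target = error_rate_percent(5, 10_000) < 0.1
```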

Monitoring

Metrics

Uptime: Percentage of time endpoints are up
Latency: Response time percentiles (p50, p95, p99)
Throughput: Requests per second
Error Rate: Percentage of errors
Availability: Endpoint availability status

Tools

Azure Monitor: Cloud monitoring
Prometheus: Metrics collection
Grafana: Metrics visualization
Application Insights: Application performance monitoring
Status Page: Public status page

Alerting

Alerts

Uptime < 99.9%: Critical alert
Latency > 200ms p95: Warning alert
Throughput > 90% of capacity: Warning alert
Error Rate > 0.1%: Critical alert
Endpoint down: Critical alert
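
The alert rules above can be sketched as a single evaluation function. The thresholds and severities are taken from this list; the `SliSnapshot` type and its field names are hypothetical, and a real deployment would more likely express these rules in Prometheus or Azure Monitor alert definitions.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SliSnapshot:
    """Hypothetical container for one evaluation window of SLI values."""
    uptime_percent: float
    p95_latency_ms: float
    throughput_rps: float
    capacity_rps: float
    error_rate_percent: float
    endpoint_up: bool


def evaluate_alerts(s: SliSnapshot) -> List[Tuple[str, str]]:
    """Return (severity, reason) pairs for every rule the snapshot violates."""
    alerts: List[Tuple[str, str]] = []
    if not s.endpoint_up:
        alerts.append(("critical", "Endpoint down"))
    if s.uptime_percent < 99.9:
        alerts.append(("critical", "Uptime < 99.9%"))
    if s.p95_latency_ms > 200:
        alerts.append(("warning", "Latency > 200ms p95"))
    if s.capacity_rps > 0 and s.throughput_rps > 0.9 * s.capacity_rps:
        alerts.append(("warning", "Throughput > 90% of capacity"))
    if s.error_rate_percent > 0.1:
        alerts.append(("critical", "Error Rate > 0.1%"))
    return alerts
```

A snapshot that satisfies every target yields an empty list, so callers only need to route the returned pairs to the appropriate notification channel.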

Notification Channels

Email: Operations team
Slack: Operations channel
PagerDuty: On-call rotation
SMS: Critical alerts only

Status Page

Public Status Page

URL: https://status.d-bis.org (to be created)
Updates: Real-time status updates
Incidents: Incident reporting
Maintenance: Maintenance windows

Status Indicators

Operational: All systems operational
Degraded: Some issues, but service available
Outage: Service unavailable
Maintenance: Scheduled maintenance

Incident Response

Severity Levels

Critical: Service completely down
High: Significant degradation
Medium: Minor issues
Low: Informational

Response Times

Critical: 15 minutes
High: 1 hour
Medium: 4 hours
Low: 24 hours
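
For tooling that tracks incidents, the response times above can be encoded as a lookup table. This is a small sketch with hypothetical names; it assumes the clock starts when the incident is detected, which this document does not specify.

```python
from datetime import datetime, timedelta, timezone

# Response times per severity level, as listed in this document.
RESPONSE_TIMES = {
    "critical": timedelta(minutes=15),
    "high": timedelta(hours=1),
    "medium": timedelta(hours=4),
    "low": timedelta(hours=24),
}


def response_deadline(severity: str, detected_at: datetime) -> datetime:
    """Return the latest time by which an incident should receive a response."""
    return detected_at + RESPONSE_TIMES[severity.lower()]


detected = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
deadline = response_deadline("Critical", detected)  # 12:15 UTC on the same day
```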

Escalation

Level 1: On-call engineer
Level 2: Senior engineer
Level 3: Engineering manager
Level 4: CTO

Disaster Recovery

Backup Endpoints

Primary: https://rpc.d-bis.org
Secondary: https://rpc2.d-bis.org
Tertiary: [To be configured]

Failover

Automatic: DNS-based failover
Manual: Manual failover procedures
Testing: Quarterly failover tests
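
Clients can complement DNS-based failover with their own endpoint selection. The following is a sketch of client-side fallback only, not the failover mechanism this document describes; the `pick_endpoint` helper and the injected `is_healthy` probe are hypothetical, and the probe is passed in so the logic can be tested without network access.

```python
from typing import Callable, Optional, Sequence

# Endpoints in priority order, as listed under Backup Endpoints.
ENDPOINTS = ("https://rpc.d-bis.org", "https://rpc2.d-bis.org")


def pick_endpoint(
    endpoints: Sequence[str], is_healthy: Callable[[str], bool]
) -> Optional[str]:
    """Return the first endpoint whose health probe succeeds, or None."""
    for url in endpoints:
        try:
            if is_healthy(url):
                return url
        except Exception:
            # A probe that raises (timeout, TLS error, ...) counts as unhealthy.
            continue
    return None


# Simulate a primary outage with a stub probe.
chosen = pick_endpoint(ENDPOINTS, lambda url: url != "https://rpc.d-bis.org")
```

Keeping the probe injectable also makes it straightforward to rehearse the quarterly failover tests against stubbed outages before running them against live endpoints.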

Capacity Planning

Current Capacity

RPS: 1000+ requests/second
Concurrent Connections: 10,000+
Bandwidth: 1 Gbps+

Scaling

Horizontal: Add more RPC nodes
Vertical: Increase node resources
Auto-scaling: Kubernetes auto-scaling
Load Balancing: Application Gateway load balancing
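
For horizontal scaling, a rough node count can be estimated from the throughput target. The per-node capacity below is a purely hypothetical figure, since this document does not state one; the 90% utilization ceiling mirrors the throughput alert threshold.

```python
import math


def nodes_required(
    target_rps: float, per_node_rps: float, max_utilization: float = 0.9
) -> int:
    """Return how many RPC nodes keep each node at or below max_utilization."""
    if per_node_rps <= 0:
        raise ValueError("per_node_rps must be positive")
    if not 0 < max_utilization <= 1:
        raise ValueError("max_utilization must be in the range (0, 1]")
    return math.ceil(target_rps / (per_node_rps * max_utilization))


# Assuming (hypothetically) 250 RPS per node, the 1000 RPS target needs 5 nodes.
node_count = nodes_required(1000, 250)
```

This ignores redundancy: losing one node would push the rest above the ceiling, so an extra node per region may be worth budgeting.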

Reporting

Monthly Reports

Uptime: Monthly uptime percentage
Latency: Average and p95 latency
Throughput: Average and peak throughput
Error Rate: Error rate percentage
Incidents: Number and duration of incidents

Quarterly Reviews

SLO Performance: Compare measured SLIs against the SLO targets
Improvements: Identify improvements
Capacity Planning: Plan for capacity increases
Disaster Recovery: Review disaster recovery procedures
