smom-dbis-138/docs/operations/status-reports/RPC_SLO.md
defiQUG 1fb7266469 Add Oracle Aggregator and CCIP Integration
2025-12-12 14:57:48 -08:00


# RPC Service Level Objectives (SLO)
Service level objectives for RPC endpoints on ChainID 138.
## Overview
This document defines the service level objectives for RPC endpoints serving ChainID 138 (DeFi Oracle Meta Mainnet).
## RPC Endpoints
### Primary Endpoint
- **URL**: `https://rpc.d-bis.org`
- **Protocol**: HTTPS
- **WebSocket**: `wss://rpc.d-bis.org`
- **Location**: Azure (Primary region)
- **Infrastructure**: Azure Application Gateway + AKS RPC nodes
### Secondary Endpoint
- **URL**: `https://rpc2.d-bis.org`
- **Protocol**: HTTPS
- **WebSocket**: `wss://rpc2.d-bis.org`
- **Location**: Azure (Secondary region)
- **Infrastructure**: Azure Application Gateway + AKS RPC nodes
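A minimal sketch of a reachability probe for either endpoint, assuming both accept unauthenticated JSON-RPC 2.0 POST requests. It uses the standard `eth_chainId` method and checks that the reply decodes to 138. The function names and the 5-second timeout are illustrative assumptions, not part of this document's tooling.

```python
import json
import urllib.request

EXPECTED_CHAIN_ID = 138  # ChainID stated in this document


def build_chain_id_request(request_id: int = 1) -> bytes:
    """Encode a standard eth_chainId JSON-RPC 2.0 request body."""
    payload = {"jsonrpc": "2.0", "method": "eth_chainId", "params": [], "id": request_id}
    return json.dumps(payload).encode("utf-8")


def parse_chain_id(response_body: bytes) -> int:
    """Decode the hex-quantity result (for example "0x8a") into an integer."""
    reply = json.loads(response_body)
    if "error" in reply:
        raise ValueError(f"RPC error: {reply['error']}")
    return int(reply["result"], 16)


def probe(url: str, timeout_s: float = 5.0) -> bool:
    """Return True if the endpoint answers and reports the expected chain ID."""
    request = urllib.request.Request(
        url,
        data=build_chain_id_request(),
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout_s) as response:
            return parse_chain_id(response.read()) == EXPECTED_CHAIN_ID
    except (OSError, ValueError, KeyError, TypeError):
        # Network failures, malformed JSON, and RPC errors all count as unhealthy.
        return False
```

Calling `probe("https://rpc.d-bis.org")` and `probe("https://rpc2.d-bis.org")` on a schedule would give a simple up/down signal per endpoint.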
## Service Level Objectives
### Availability
- **Target**: ≥99.9% monthly uptime
- **Measurement**: Percentage of time RPC endpoints are accessible
- **Monitoring**: Azure Monitor, Prometheus, Status page
- **Alerting**: Alert on <99.9% uptime
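To make the 99.9% target concrete, the arithmetic below converts it into a monthly downtime budget. This is a sketch that assumes uptime is measured over calendar months; the function name is illustrative.

```python
def downtime_budget_minutes(slo_percent: float, days_in_month: int) -> float:
    """Minutes of downtime allowed in a month while still meeting the SLO."""
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (100.0 - slo_percent) / 100.0


# A 30-day month has 43,200 minutes, so 0.1% of it is about 43.2 minutes.
print(round(downtime_budget_minutes(99.9, 30), 2))  # 43.2
print(round(downtime_budget_minutes(99.9, 31), 2))  # 44.64
```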
### Latency
- **Target**: <200ms p95 latency
- **Measurement**: 95th percentile response time
- **Monitoring**: Azure Application Insights, Prometheus
- **Alerting**: Alert on >200ms p95 latency
### Throughput
- **Target**: 1000+ requests/second
- **Measurement**: Requests per second (RPS)
- **Monitoring**: Azure Monitor, Prometheus
- **Alerting**: Alert when throughput exceeds 90% of capacity
### Error Rate
- **Target**: <0.1% error rate
- **Measurement**: Percentage of requests that result in errors
- **Monitoring**: Azure Monitor, Prometheus
- **Alerting**: Alert on >0.1% error rate
## Service Level Indicators (SLI)
### Uptime SLI
```
Uptime SLI = (Time endpoints are accessible / Total time) * 100
```
### Latency SLI
```
Latency SLI = p95 response time
```
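As an illustration, p95 can be computed from raw response-time samples with the nearest-rank method. This is one of several percentile definitions, and monitoring tools such as Prometheus typically estimate percentiles from histogram buckets instead, so treat this as a sketch rather than the method used in production.

```python
import math
from typing import Sequence


def percentile_nearest_rank(samples_ms: Sequence[float], percentile: float) -> float:
    """Return the nearest-rank percentile of the given latency samples."""
    if not samples_ms:
        raise ValueError("at least one sample is required")
    if not 0 < percentile <= 100:
        raise ValueError("percentile must be in (0, 100]")
    ordered = sorted(samples_ms)
    # Nearest rank: the smallest value with at least `percentile` percent
    # of the samples at or below it.
    rank = math.ceil(percentile * len(ordered) / 100)
    return ordered[rank - 1]


def meets_latency_slo(samples_ms: Sequence[float], target_ms: float = 200.0) -> bool:
    """Check the <200ms p95 target from this document against the samples."""
    return percentile_nearest_rank(samples_ms, 95) < target_ms
```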
### Throughput SLI
```
Throughput SLI = Requests per second
```
### Error Rate SLI
```
Error Rate SLI = (Error requests / Total requests) * 100
```
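The error rate formula can be evaluated directly from request counters. The sketch below assumes the counters cover one measurement window; what counts as an "error request" (for example, HTTP 5xx responses or JSON-RPC error replies) is not defined in this document and is left to the caller.

```python
SLO_ERROR_RATE_PERCENT = 0.1  # target from this document: <0.1%


def error_rate_percent(error_requests: int, total_requests: int) -> float:
    """Error Rate SLI: error requests as a percentage of total requests."""
    if total_requests == 0:
        return 0.0  # no traffic in the window, so no observed errors
    return 100 * error_requests / total_requests


def meets_error_rate_slo(error_requests: int, total_requests: int) -> bool:
    """True when the observed error rate is strictly below the target."""
    return error_rate_percent(error_requests, total_requests) < SLO_ERROR_RATE_PERCENT
```

For example, 5 errors in 10,000 requests is a 0.05% error rate and meets the target, while 10 errors in 10,000 requests is exactly 0.1% and does not.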
## Monitoring
### Metrics
- **Uptime**: Percentage of time endpoints are up
- **Latency**: Response time percentiles (p50, p95, p99)
- **Throughput**: Requests per second
- **Error Rate**: Percentage of errors
- **Availability**: Endpoint availability status
### Tools
- **Azure Monitor**: Cloud monitoring
- **Prometheus**: Metrics collection
- **Grafana**: Metrics visualization
- **Application Insights**: Application performance monitoring
- **Status Page**: Public status page
## Alerting
### Alerts
- **Uptime < 99.9%**: Critical alert
- **Latency > 200ms p95**: Warning alert
- **Throughput > 90% capacity**: Warning alert
- **Error Rate > 0.1%**: Critical alert
- **Endpoint down**: Critical alert
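The alert rules above can be sketched as a single evaluation function. The `Snapshot` fields and the way metrics are collected are assumptions for illustration; only the thresholds and severities come from the list above.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical point-in-time metrics for one endpoint."""
    uptime_percent: float
    p95_latency_ms: float
    throughput_rps: float
    capacity_rps: float
    error_rate_percent: float
    endpoint_up: bool


def evaluate_alerts(s: Snapshot) -> List[Tuple[str, str]]:
    """Return (severity, rule) pairs for every rule the snapshot violates."""
    alerts = []
    if not s.endpoint_up:
        alerts.append(("critical", "Endpoint down"))
    if s.uptime_percent < 99.9:
        alerts.append(("critical", "Uptime < 99.9%"))
    if s.error_rate_percent > 0.1:
        alerts.append(("critical", "Error Rate > 0.1%"))
    if s.p95_latency_ms > 200:
        alerts.append(("warning", "Latency > 200ms p95"))
    if s.throughput_rps > 0.9 * s.capacity_rps:
        alerts.append(("warning", "Throughput > 90% capacity"))
    return alerts
```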
### Notification Channels
- **Email**: Operations team
- **Slack**: Operations channel
- **PagerDuty**: On-call rotation
- **SMS**: Critical alerts only
## Status Page
### Public Status Page
- **URL**: `https://status.d-bis.org` (to be created)
- **Updates**: Real-time status updates
- **Incidents**: Incident reporting
- **Maintenance**: Maintenance windows
### Status Indicators
- **Operational**: All systems operational
- **Degraded**: Some issues, but service available
- **Outage**: Service unavailable
- **Maintenance**: Scheduled maintenance
## Incident Response
### Severity Levels
1. **Critical**: Service completely down
2. **High**: Significant degradation
3. **Medium**: Minor issues
4. **Low**: Informational
### Response Times
- **Critical**: 15 minutes
- **High**: 1 hour
- **Medium**: 4 hours
- **Low**: 24 hours
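A small sketch of how the response times above translate into a deadline for a given incident. It assumes the response clock starts when the incident is opened, which this document does not state explicitly; the function name is illustrative.

```python
from datetime import datetime, timedelta

# Response times per severity level, as listed above.
RESPONSE_TIMES = {
    "Critical": timedelta(minutes=15),
    "High": timedelta(hours=1),
    "Medium": timedelta(hours=4),
    "Low": timedelta(hours=24),
}


def respond_by(severity: str, opened_at: datetime) -> datetime:
    """Return the latest time by which the incident should receive a response."""
    try:
        return opened_at + RESPONSE_TIMES[severity]
    except KeyError:
        raise ValueError(f"unknown severity: {severity!r}") from None
```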
### Escalation
1. **Level 1**: On-call engineer
2. **Level 2**: Senior engineer
3. **Level 3**: Engineering manager
4. **Level 4**: CTO
## Disaster Recovery
### Backup Endpoints
- **Primary**: `https://rpc.d-bis.org`
- **Secondary**: `https://rpc2.d-bis.org`
- **Tertiary**: [To be configured]
### Failover
- **Automatic**: DNS-based failover
- **Manual**: Manual failover procedures
- **Testing**: Quarterly failover tests
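The automatic failover named above is DNS-based. As a complement, a client can also fall back between the listed endpoints itself. The sketch below shows that client-side approach, which is a different technique from DNS failover; the health check is passed in as a callable, and its implementation is left as an assumption.

```python
from typing import Callable, Optional, Sequence

# Endpoints in priority order, as listed above (the tertiary is not yet configured).
ENDPOINTS = ("https://rpc.d-bis.org", "https://rpc2.d-bis.org")


def first_healthy(
    endpoints: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first endpoint that passes the health check, or None."""
    for url in endpoints:
        if is_healthy(url):
            return url
    return None
```

A quarterly failover test could, for example, call `first_healthy` with a health check that reports the primary as down and confirm that the secondary is selected.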
## Capacity Planning
### Current Capacity
- **RPS**: 1000+ requests/second
- **Concurrent Connections**: 10,000+
- **Bandwidth**: 1 Gbps+
### Scaling
- **Horizontal**: Add more RPC nodes
- **Vertical**: Increase node resources
- **Auto-scaling**: Kubernetes auto-scaling
- **Load Balancing**: Application Gateway load balancing
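For horizontal scaling, the number of RPC nodes needed can be estimated from a target request rate. This sketch assumes a per-node capacity figure, which is hypothetical and would have to come from load testing, and it keeps utilization under the 90% threshold used for the throughput alert.

```python
import math


def nodes_needed(
    target_rps: float,
    per_node_rps: float,  # hypothetical: measured capacity of one RPC node
    max_utilization: float = 0.9,  # stay below the 90% capacity alert threshold
) -> int:
    """Estimate how many nodes keep utilization at or below the given limit."""
    if per_node_rps <= 0 or not 0 < max_utilization <= 1:
        raise ValueError("per_node_rps must be positive and max_utilization in (0, 1]")
    return max(1, math.ceil(target_rps / (per_node_rps * max_utilization)))


# With an assumed 250 RPS per node, 1000 RPS needs 5 nodes at 90% utilization.
print(nodes_needed(1000, 250))  # 5
```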
## Reporting
### Monthly Reports
- **Uptime**: Monthly uptime percentage
- **Latency**: Average and p95 latency
- **Throughput**: Average and peak throughput
- **Error Rate**: Error rate percentage
- **Incidents**: Number and duration of incidents
### Quarterly Reviews
- **SLO Performance**: Compare measured SLIs against the SLO targets
- **Improvements**: Identify reliability and performance improvements
- **Capacity Planning**: Plan for capacity increases
- **Disaster Recovery**: Review disaster recovery procedures
## References
- [Azure Monitor](https://azure.microsoft.com/services/monitor/)
- [Prometheus](https://prometheus.io)
- [Grafana](https://grafana.com)
- [Status Page](https://statuspage.io)