# RPC Service Level Objectives (SLO)

Service level objectives for RPC endpoints on ChainID 138.

## Overview

This document defines the service level objectives for RPC endpoints serving ChainID 138 (DeFi Oracle Meta Mainnet).

## RPC Endpoints

### Primary Endpoint

- **URL**: `https://rpc.d-bis.org`
- **Protocol**: HTTPS
- **WebSocket**: `wss://rpc.d-bis.org`
- **Location**: Azure (Primary region)
- **Infrastructure**: Azure Application Gateway + AKS RPC nodes

### Secondary Endpoint

- **URL**: `https://rpc2.d-bis.org`
- **Protocol**: HTTPS
- **WebSocket**: `wss://rpc2.d-bis.org`
- **Location**: Azure (Secondary region)
- **Infrastructure**: Azure Application Gateway + AKS RPC nodes

## Service Level Objectives

### Availability

- **Target**: ≥99.9% monthly uptime
- **Measurement**: Percentage of time RPC endpoints are accessible
- **Monitoring**: Azure Monitor, Prometheus, Status page
- **Alerting**: Alert when uptime falls below 99.9%

### Latency

- **Target**: <200ms p95 latency
- **Measurement**: 95th percentile response time
- **Monitoring**: Azure Application Insights, Prometheus
- **Alerting**: Alert when p95 latency exceeds 200ms

### Throughput

- **Target**: 1000+ requests/second
- **Measurement**: Requests per second (RPS)
- **Monitoring**: Azure Monitor, Prometheus
- **Alerting**: Alert when throughput exceeds 90% of capacity

### Error Rate

- **Target**: <0.1% error rate
- **Measurement**: Percentage of requests that result in errors
- **Monitoring**: Azure Monitor, Prometheus
- **Alerting**: Alert when the error rate exceeds 0.1%

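As a worked example of the availability target, the 99.9% monthly uptime objective can be converted into a downtime budget. The sketch below assumes calendar months of 28 to 31 days; the function name `downtime_budget_minutes` is illustrative and not part of any existing tooling.

```python
def downtime_budget_minutes(slo_percent: float, days_in_month: int = 30) -> float:
    """Return the minutes of downtime allowed in one month at the given uptime SLO."""
    total_minutes = days_in_month * 24 * 60
    # The budget is the fraction of the month that may be spent unavailable.
    return total_minutes * (100.0 - slo_percent) / 100.0


if __name__ == "__main__":
    for days in (28, 30, 31):
        print(f"{days}-day month: {downtime_budget_minutes(99.9, days):.2f} minutes")
```

For a 30-day month this works out to roughly 43.2 minutes of allowed downtime.
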
## Service Level Indicators (SLI)

### Uptime SLI

```
Uptime SLI = (Successful requests / Total requests) * 100
```

### Latency SLI

```
Latency SLI = p95 response time
```

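One way to compute a p95 value from raw response-time samples is the nearest-rank method. The sketch below assumes that method and a list of latencies in milliseconds; monitoring backends may use different estimators (for example interpolation over histogram buckets), and the names `percentile_nearest_rank` and `meets_latency_slo` are hypothetical.

```python
from typing import Sequence


def percentile_nearest_rank(samples_ms: Sequence[float], pct: int) -> float:
    """Return the pct-th percentile (1..100) using the nearest-rank method."""
    if not samples_ms:
        raise ValueError("at least one sample is required")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be an integer between 1 and 100")
    ordered = sorted(samples_ms)
    # Integer ceiling of pct/100 * n, avoiding floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


def meets_latency_slo(samples_ms: Sequence[float], target_ms: float = 200.0) -> bool:
    """True when the p95 latency is strictly below the target (<200ms in this document)."""
    return percentile_nearest_rank(samples_ms, 95) < target_ms
```

With nearest-rank, a small sample set is dominated by its slowest requests, so the SLI is most meaningful over windows with many samples.
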
### Throughput SLI

```
Throughput SLI = Requests per second
```

### Error Rate SLI

```
Error Rate SLI = (Error requests / Total requests) * 100
```

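The uptime, throughput, and error rate formulas above can be evaluated from simple request counters over a measurement window. This is a minimal sketch that assumes every request is either successful or an error; the `WindowCounts` type and `compute_slis` function are illustrative names, not part of the deployed monitoring stack.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class WindowCounts:
    total_requests: int    # all requests seen in the window
    error_requests: int    # requests that resulted in errors
    window_seconds: float  # length of the measurement window


def compute_slis(counts: WindowCounts) -> Dict[str, float]:
    """Apply the SLI formulas to one window of counters."""
    if counts.total_requests <= 0 or counts.window_seconds <= 0:
        raise ValueError("window must contain requests and have positive length")
    successful = counts.total_requests - counts.error_requests
    return {
        "uptime_pct": successful / counts.total_requests * 100,
        "error_rate_pct": counts.error_requests / counts.total_requests * 100,
        "throughput_rps": counts.total_requests / counts.window_seconds,
    }
```

For example, 1,000,000 requests with 500 errors over a 600-second window gives an error rate of 0.05% and an uptime SLI of 99.95%, both within the targets in this document.
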
## Monitoring

### Metrics

- **Uptime**: Percentage of time endpoints are up
- **Latency**: Response time percentiles (p50, p95, p99)
- **Throughput**: Requests per second
- **Error Rate**: Percentage of errors
- **Availability**: Endpoint availability status

### Tools

- **Azure Monitor**: Cloud monitoring
- **Prometheus**: Metrics collection
- **Grafana**: Metrics visualization
- **Application Insights**: Application performance monitoring
- **Status Page**: Public status page

## Alerting

### Alerts

- **Uptime < 99.9%**: Critical alert
- **Latency > 200ms p95**: Warning alert
- **Throughput > 90% capacity**: Warning alert
- **Error Rate > 0.1%**: Critical alert
- **Endpoint down**: Critical alert

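The alert conditions listed above can be expressed as a small evaluation function. This is a sketch only: the function name, its parameters, and the `(severity, message)` tuple format are assumptions for illustration, and a real deployment would more likely encode these rules in Prometheus or Azure Monitor alert definitions.

```python
from typing import List, Tuple


def evaluate_alerts(
    uptime_pct: float,
    p95_latency_ms: float,
    throughput_rps: float,
    capacity_rps: float,
    error_rate_pct: float,
    endpoint_up: bool,
) -> List[Tuple[str, str]]:
    """Return (severity, message) pairs for every threshold that is breached."""
    alerts: List[Tuple[str, str]] = []
    # Critical conditions first, then warnings.
    if not endpoint_up:
        alerts.append(("critical", "Endpoint down"))
    if uptime_pct < 99.9:
        alerts.append(("critical", "Uptime < 99.9%"))
    if error_rate_pct > 0.1:
        alerts.append(("critical", "Error Rate > 0.1%"))
    if p95_latency_ms > 200:
        alerts.append(("warning", "Latency > 200ms p95"))
    if throughput_rps > 0.9 * capacity_rps:
        alerts.append(("warning", "Throughput > 90% capacity"))
    return alerts
```

A healthy window returns an empty list, which makes the function easy to use as a gate in scripted checks.
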
### Notification Channels

- **Email**: Operations team
- **Slack**: Operations channel
- **PagerDuty**: On-call rotation
- **SMS**: Critical alerts only

## Status Page

### Public Status Page

- **URL**: `https://status.d-bis.org` (to be created)
- **Updates**: Real-time status updates
- **Incidents**: Incident reporting
- **Maintenance**: Maintenance windows

### Status Indicators

- **Operational**: All systems operational
- **Degraded**: Some issues, but service available
- **Outage**: Service unavailable
- **Maintenance**: Scheduled maintenance

## Incident Response

### Severity Levels

1. **Critical**: Service completely down
2. **High**: Significant degradation
3. **Medium**: Minor issues
4. **Low**: Informational

### Response Times

- **Critical**: 15 minutes
- **High**: 1 hour
- **Medium**: 4 hours
- **Low**: 24 hours

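The response times above map directly to a response deadline for each incident. The following sketch assumes incidents carry a severity label and an opening timestamp; `RESPONSE_TIMES` and `response_deadline` are hypothetical names used only for illustration.

```python
from datetime import datetime, timedelta, timezone

# Response-time targets per severity level, as listed in this document.
RESPONSE_TIMES = {
    "critical": timedelta(minutes=15),
    "high": timedelta(hours=1),
    "medium": timedelta(hours=4),
    "low": timedelta(hours=24),
}


def response_deadline(severity: str, opened_at: datetime) -> datetime:
    """Return the latest time by which a first response is due."""
    try:
        return opened_at + RESPONSE_TIMES[severity.lower()]
    except KeyError:
        raise ValueError(f"unknown severity: {severity!r}") from None


if __name__ == "__main__":
    opened = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    print(response_deadline("Critical", opened).isoformat())
```
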
### Escalation

1. **Level 1**: On-call engineer
2. **Level 2**: Senior engineer
3. **Level 3**: Engineering manager
4. **Level 4**: CTO

## Disaster Recovery

### Backup Endpoints

- **Primary**: `https://rpc.d-bis.org`
- **Secondary**: `https://rpc2.d-bis.org`
- **Tertiary**: [To be configured]

### Failover

- **Automatic**: DNS-based failover
- **Manual**: Manual failover procedures
- **Testing**: Quarterly failover tests

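During failover tests it can help to probe the listed endpoints in order and confirm which one answers for the expected chain. The sketch below is a client-side check, not the DNS-based failover mechanism itself. It assumes the endpoints expose the standard Ethereum JSON-RPC method `eth_chainId` over HTTPS; the helper names and the injectable `post` parameter are illustrative.

```python
import json
import urllib.request
from typing import Callable, Optional, Sequence

ENDPOINTS = ["https://rpc.d-bis.org", "https://rpc2.d-bis.org"]
EXPECTED_CHAIN_ID = 138  # ChainID from this document (0x8a in hex)


def http_post_json(url: str, payload: dict, timeout: float = 5.0) -> dict:
    """POST a JSON body and return the decoded JSON reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def first_healthy_endpoint(
    endpoints: Sequence[str],
    post: Callable[[str, dict], dict] = http_post_json,
) -> Optional[str]:
    """Return the first endpoint that reports the expected chain ID, or None."""
    payload = {"jsonrpc": "2.0", "method": "eth_chainId", "params": [], "id": 1}
    for url in endpoints:
        try:
            reply = post(url, payload)
            if int(reply["result"], 16) == EXPECTED_CHAIN_ID:
                return url
        except (OSError, ValueError, KeyError, TypeError):
            # Network failure or malformed reply: try the next endpoint.
            continue
    return None
```

Passing a stub for `post` allows the selection logic to be exercised without network access, for example in a quarterly failover drill rehearsal.
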
## Capacity Planning

### Current Capacity

- **RPS**: 1000+ requests/second
- **Concurrent Connections**: 10,000+
- **Bandwidth**: 1 Gbps+

### Scaling

- **Horizontal**: Add more RPC nodes
- **Vertical**: Increase node resources
- **Auto-scaling**: Kubernetes auto-scaling
- **Load Balancing**: Application Gateway load balancing

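For horizontal scaling, a rough node count can be derived from the throughput target and the 90% capacity alert threshold. The sketch below assumes each RPC node sustains a known request rate; the per-node figure of 300 RPS in the usage example is a placeholder, not a measured value, and `nodes_required` is a hypothetical helper.

```python
import math


def nodes_required(target_rps: float, per_node_rps: float, max_utilization: float = 0.9) -> int:
    """Return the number of nodes needed to serve target_rps below the utilization cap."""
    if per_node_rps <= 0:
        raise ValueError("per_node_rps must be positive")
    if not 0 < max_utilization <= 1:
        raise ValueError("max_utilization must be in (0, 1]")
    # Each node is only planned up to max_utilization of its capacity.
    usable_rps_per_node = per_node_rps * max_utilization
    return math.ceil(target_rps / usable_rps_per_node)


if __name__ == "__main__":
    # Placeholder per-node capacity; replace with a load-test result.
    print(nodes_required(target_rps=1000, per_node_rps=300))
```

With the placeholder figure, serving 1000 RPS while staying under 90% utilization requires 4 nodes.
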
## Reporting

### Monthly Reports

- **Uptime**: Monthly uptime percentage
- **Latency**: Average and p95 latency
- **Throughput**: Average and peak throughput
- **Error Rate**: Error rate percentage
- **Incidents**: Number and duration of incidents

### Quarterly Reviews

- **SLO Performance**: Review SLO performance
- **Improvements**: Identify improvements
- **Capacity Planning**: Plan for capacity increases
- **Disaster Recovery**: Review disaster recovery procedures

## References

- [Azure Monitor](https://azure.microsoft.com/services/monitor/)
- [Prometheus](https://prometheus.io)
- [Grafana](https://grafana.com)
- [Status Page](https://statuspage.io)