Disaster Recovery Runbook
Overview
This runbook describes disaster recovery procedures for the DeFi Oracle Meta Mainnet (ChainID 138) network.
Recovery Objectives
RTO (Recovery Time Objective)
- Critical Services: 1 hour
- Non-Critical Services: 4 hours
- Full Recovery: 24 hours
RPO (Recovery Point Objective)
- Chaindata: 24 hours
- Configuration: 1 hour
- Keys: Real-time (Key Vault)
Disaster Scenarios
Scenario 1: Complete Cluster Failure
Symptoms:
- All pods unavailable
- Cluster unresponsive
- No network connectivity
Recovery Steps:
- Assess damage
- Restore infrastructure
- Restore chaindata from backups
- Restore configuration
- Restore keys from Key Vault
- Restart services
- Verify network operation
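The final check above ("Verify network operation") can be scripted. A minimal sketch, assuming the recovered Besu RPC endpoint is reachable at `RPC_URL` (an illustrative variable, not a name from this repo) and that `jq` is installed:

```shell
# Sketch: confirm the network is operational by checking that the
# chain head advances between two samples. RPC_URL is an assumed
# endpoint; point it at any recovered Besu RPC service.
RPC_URL="${RPC_URL:-http://localhost:8545}"

hex_to_dec() {
  # JSON-RPC returns quantities as 0x-prefixed hex strings.
  printf '%d' "$1"
}

blocks_advancing() {
  # Succeeds (exit 0) only if the second sample is strictly higher.
  [ "$(hex_to_dec "$2")" -gt "$(hex_to_dec "$1")" ]
}

fetch_block_number() {
  curl -s -X POST "$RPC_URL" -H 'Content-Type: application/json' \
    -d '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' \
    | jq -r '.result'
}

# Usage against a live node:
#   b1=$(fetch_block_number); sleep 15; b2=$(fetch_block_number)
#   blocks_advancing "$b1" "$b2" && echo "chain head is advancing"
```

Two samples taken ~15 seconds apart are enough to distinguish a producing chain from a stalled one without waiting for full sync.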
Scenario 2: Data Loss
Symptoms:
- Chaindata corrupted
- Blocks missing
- Database errors
Recovery Steps:
- Stop affected services
- Restore from backup
- Verify data integrity
- Restart services
- Verify synchronization
- Monitor for issues
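The "Verify synchronization" step can be quantified rather than eyeballed. A sketch of the decision logic, assuming the standard `eth_syncing` JSON-RPC response (a literal `false` once synced, otherwise an object with `currentBlock`/`highestBlock` hex quantities); `RPC_URL` is an illustrative variable:

```shell
# Sketch: measure how far a restored node still has to sync.
sync_gap() {
  # $1 = currentBlock, $2 = highestBlock (0x-prefixed hex).
  # Prints the number of blocks remaining.
  printf '%d\n' "$(( $2 - $1 ))"
}

is_synced() {
  # eth_syncing returns the literal `false` once fully synced.
  [ "$1" = "false" ]
}

# Usage against a live node:
#   res=$(curl -s -X POST "$RPC_URL" -H 'Content-Type: application/json' \
#     -d '{"jsonrpc":"2.0","method":"eth_syncing","params":[],"id":1}' \
#     | jq -c '.result')
#   if is_synced "$res"; then echo "fully synced"; else
#     sync_gap "$(jq -r '.currentBlock' <<<"$res")" \
#              "$(jq -r '.highestBlock' <<<"$res")"
#   fi
```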
Scenario 3: Key Compromise
Symptoms:
- Unauthorized transactions
- Suspicious activity
- Key exposure
Recovery Steps:
- Isolate affected components
- Rotate compromised keys
- Update validator set
- Update configuration
- Restart services
- Monitor for issues
Scenario 4: Network Partition
Symptoms:
- Validators split into groups
- Conflicting blocks
- Network instability
Recovery Steps:
- Identify partition
- Stop minority partition
- Continue with majority
- Resolve conflicts
- Restart stopped validators
- Verify consensus
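Deciding which side of a partition may keep producing blocks should not be a head-count guess. A sketch, assuming an IBFT2/QBFT-style BFT network where a partition needs at least 2f+1 of n = 3f+1 validators (f = floor((n-1)/3)) to reach consensus; verify this formula against your actual consensus configuration before relying on it:

```shell
# Sketch: decide which partition can safely continue producing blocks.
bft_quorum() {
  # Minimum validators needed for BFT consensus: n - floor((n-1)/3).
  local n=$1
  echo $(( n - (n - 1) / 3 ))
}

partition_can_progress() {
  # $1 = validators in this partition, $2 = total validators.
  [ "$1" -ge "$(bft_quorum "$2")" ]
}

# Usage: with 7 validators split 5/2, only the 5-node side may continue:
#   partition_can_progress 5 7 && echo "quorum side: keep running"
#   partition_can_progress 2 7 || echo "minority side: stop validators"
```

Note that in a BFT network a bare majority is not enough: with 7 validators, a 4-node partition still cannot finalize blocks.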
Recovery Procedures
Infrastructure Recovery
- Restore Terraform State:

  ```bash
  terraform init -backend-config="..."
  terraform plan
  terraform apply
  ```

- Restore Kubernetes Cluster:

  ```bash
  # Restore from backup
  kubectl apply -f backup/cluster-backup.yaml
  ```

- Restore Network Configuration:

  ```bash
  kubectl apply -f k8s/base/
  ```
Data Recovery
- Restore Chaindata:

  ```bash
  ./scripts/backup/restore-chaindata.sh <backup-file> <pod-name>
  ```

- Restore Database:

  ```bash
  # Restore Blockscout database
  kubectl exec -i -n besu-network blockscout-db-0 -- \
    psql -U blockscout -d blockscout < backup/blockscout-backup.sql
  ```

- Restore Configuration:

  ```bash
  kubectl apply -f config/
  ```
Key Recovery
- Restore from Key Vault:

  ```bash
  az keyvault secret show --vault-name defi-oracle-kv --name validator-key-1
  ```

- Update Kubernetes Secrets:

  ```bash
  kubectl create secret generic besu-validator-keys \
    --from-literal=key-1=<key-from-vault> \
    -n besu-network
  ```

- Restart Validators:

  ```bash
  kubectl rollout restart statefulset/besu-validator -n besu-network
  ```
Backup Procedures
Daily Backups
- Chaindata Backup:

  ```bash
  ./scripts/backup/backup-chaindata.sh
  ```

- Database Backup:

  ```bash
  kubectl exec -n besu-network blockscout-db-0 -- \
    pg_dump -U blockscout blockscout > backup/blockscout-$(date +%Y%m%d).sql
  ```

- Configuration Backup:

  ```bash
  kubectl get all -n besu-network -o yaml > backup/cluster-$(date +%Y%m%d).yaml
  ```
Backup Retention
- Daily: 7 days
- Weekly: 4 weeks
- Monthly: 12 months
- Yearly: 7 years
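The daily tier of this retention policy can be enforced with a small prune script. A sketch, assuming backups live in a flat directory and follow the date-stamped file names used by the backup commands above; promoting files into the weekly/monthly/yearly tiers is left out of this sketch:

```shell
# Sketch: delete daily backups older than the 7-day retention window.
prune_daily_backups() {
  local dir="${1:-backup}"
  # -mtime +7 matches files last modified more than 7 days ago
  # (GNU findutils semantics assumed).
  find "$dir" -maxdepth 1 -type f \
    \( -name 'blockscout-*.sql' -o -name 'cluster-*.yaml' \) \
    -mtime +7 -print -delete
}

# Usage: prune_daily_backups /var/backups/besu
```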
Testing Recovery
Test Procedures
- Test Backup Restoration:
  - Monthly backup restoration test
  - Verify data integrity
  - Document results
- Test Disaster Recovery:
  - Quarterly disaster recovery drill
  - Simulate failure scenarios
  - Measure recovery time
  - Document lessons learned
- Test Key Rotation:
  - Monthly key rotation test
  - Verify key update process
  - Document results
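For the "verify data integrity" part of the restoration test, comparing checksums across a backup/restore round trip gives an objective pass/fail. A minimal sketch; the file paths in the usage comment are illustrative, not paths from this repo:

```shell
# Sketch: verify a restore by comparing SHA-256 checksums of the
# original backup artifact and the restored copy.
verify_restore() {
  local original=$1 restored=$2
  [ "$(sha256sum "$original" | awk '{print $1}')" = \
    "$(sha256sum "$restored" | awk '{print $1}')" ]
}

# Usage:
#   verify_restore backup/chaindata.tar.gz /tmp/restore-test/chaindata.tar.gz \
#     && echo "restore verified" || echo "checksum mismatch"
```

Record the checksums alongside the drill results so successive monthly tests can be compared.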
Monitoring Recovery
Recovery Metrics
- Recovery time
- Data loss
- Service availability
- Error rates
Post-Recovery Checklist
- All services running
- Network operational
- Blocks being produced
- RPC endpoints responding
- Monitoring working
- Alerts configured
- Documentation updated
Contacts
- On-Call: Check PagerDuty
- Engineering Lead: engineering@d-bis.org
- Emergency: +1-XXX-XXX-XXXX