Disaster Recovery Runbook

Overview

This runbook provides procedures for disaster recovery in the DeFi Oracle Meta Mainnet (ChainID 138) network.

Recovery Objectives

RTO (Recovery Time Objective)

  • Critical Services: 1 hour
  • Non-Critical Services: 4 hours
  • Full Recovery: 24 hours

RPO (Recovery Point Objective)

  • Chaindata: 24 hours
  • Configuration: 1 hour
  • Keys: Real-time (Key Vault)

Disaster Scenarios

Scenario 1: Complete Cluster Failure

Symptoms:

  • All pods unavailable
  • Cluster unresponsive
  • No network connectivity

Recovery Steps:

  1. Assess damage
  2. Restore infrastructure
  3. Restore chaindata from backups
  4. Restore configuration
  5. Restore keys from Key Vault
  6. Restart services
  7. Verify network operation
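
Step 7 can be scripted. A minimal sketch, assuming a JSON-RPC endpoint is reachable; the `RPC_URL` default below is a placeholder for the cluster's actual RPC ingress:

```shell
#!/bin/sh
# Sketch for step 7: confirm the restored network answers JSON-RPC.
# RPC_URL is a placeholder; point it at the cluster's RPC ingress.

# Pull the hex block number out of an eth_blockNumber response.
extract_block_number() {
  sed -n 's/.*"result" *: *"\(0x[0-9a-fA-F]*\)".*/\1/p'
}

check_network() {
  rpc_url="${1:-http://localhost:8545}"
  hex=$(curl -s -X POST "$rpc_url" \
    -H 'Content-Type: application/json' \
    -d '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' \
    | extract_block_number)
  [ -n "$hex" ] || { echo "RPC endpoint not responding" >&2; return 1; }
  echo "latest block: $((hex))"
}
```

A non-empty block number that advances between runs indicates the chain is producing blocks again.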

Scenario 2: Data Loss

Symptoms:

  • Chaindata corrupted
  • Blocks missing
  • Database errors

Recovery Steps:

  1. Stop affected services
  2. Restore from backup
  3. Verify data integrity
  4. Restart services
  5. Verify synchronization
  6. Monitor for issues
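
Step 3 (verify data integrity) is simplest when backups ship with checksums. A sketch, assuming each archive has a sibling `.sha256` file; that sidecar convention is an assumption about the backup layout, not a documented one:

```shell
#!/bin/sh
# Sketch for step 3: check a backup archive against its checksum
# before restoring it. The .sha256 sidecar file is an assumed convention.
verify_backup() {
  archive="$1"
  [ -f "$archive" ] || { echo "missing archive: $archive" >&2; return 1; }
  [ -f "$archive.sha256" ] || { echo "missing checksum: $archive.sha256" >&2; return 1; }
  # sha256sum -c expects to run in the directory the checksum was created in.
  ( cd "$(dirname "$archive")" && sha256sum -c "$(basename "$archive").sha256" )
}
```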

Scenario 3: Key Compromise

Symptoms:

  • Unauthorized transactions
  • Suspicious activity
  • Key exposure

Recovery Steps:

  1. Isolate affected components
  2. Rotate compromised keys
  3. Update validator set
  4. Update configuration
  5. Restart services
  6. Monitor for issues
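
If the validators run Besu's QBFT consensus (an assumption; substitute `ibft_proposeValidatorVote` for IBFT 2.0), step 3 can be done by voting the compromised validator out of the set. A sketch; the RPC URL and validator address are placeholders:

```shell
#!/bin/sh
# Sketch for step 3: vote a compromised validator out of the set.
# Assumes QBFT consensus; URL and address passed in are placeholders.

# Build the JSON-RPC payload; "false" votes to remove, "true" to add.
vote_payload() {
  printf '{"jsonrpc":"2.0","method":"qbft_proposeValidatorVote","params":["%s",%s],"id":1}' "$1" "$2"
}

propose_removal() {
  rpc_url="$1"; validator="$2"
  curl -s -X POST "$rpc_url" \
    -H 'Content-Type: application/json' \
    -d "$(vote_payload "$validator" false)"
}
```

Note that a majority of the remaining validators must cast the same vote before the validator set actually changes.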

Scenario 4: Network Partition

Symptoms:

  • Validators split into groups
  • Conflicting blocks
  • Network instability

Recovery Steps:

  1. Identify partition
  2. Stop minority partition
  3. Continue with majority
  4. Resolve conflicts
  5. Restart stopped validators
  6. Verify consensus
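
Step 1 can be approached by comparing each validator's peer count: nodes in the minority partition report fewer peers. A sketch; the pod naming assumes a `besu-validator` StatefulSet and should be adjusted to the actual deployment:

```shell
#!/bin/sh
# Sketch for step 1: compare peer counts across validator pods.
# Pod names assume a besu-validator StatefulSet; adjust as needed.

hex_to_dec() { printf '%d\n' "$(($1))"; }

peer_count() {
  pod="$1"
  hex=$(kubectl exec -n besu-network "$pod" -- \
    curl -s -X POST http://localhost:8545 \
    -H 'Content-Type: application/json' \
    -d '{"jsonrpc":"2.0","method":"net_peerCount","params":[],"id":1}' \
    | sed -n 's/.*"result" *: *"\(0x[0-9a-fA-F]*\)".*/\1/p')
  hex_to_dec "$hex"
}

# Usage: for i in 0 1 2 3; do
#   echo "besu-validator-$i: $(peer_count besu-validator-$i)"
# done
```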

Recovery Procedures

Infrastructure Recovery

  1. Restore Terraform State

    terraform init -backend-config="..."
    terraform plan
    terraform apply
    
  2. Restore Kubernetes Cluster

    # Restore from backup
    kubectl apply -f backup/cluster-backup.yaml
    
  3. Restore Network Configuration

    kubectl apply -f k8s/base/
    

Data Recovery

  1. Restore Chaindata

    ./scripts/backup/restore-chaindata.sh <backup-file> <pod-name>
    
  2. Restore Database

    # Restore Blockscout database
    kubectl exec -i -n besu-network blockscout-db-0 -- \
      psql -U blockscout -d blockscout < backup/blockscout-backup.sql
    
  3. Restore Configuration

    kubectl apply -f config/
    

Key Recovery

  1. Restore from Key Vault

    az keyvault secret show --vault-name defi-oracle-kv \
      --name validator-key-1 --query value -o tsv
    
  2. Update Kubernetes Secrets

    # Pipe through apply so the command also succeeds when the secret exists
    kubectl create secret generic besu-validator-keys \
      --from-literal=key-1=<key-from-vault> \
      -n besu-network --dry-run=client -o yaml | kubectl apply -f -
    
  3. Restart Validators

    kubectl rollout restart statefulset/besu-validator -n besu-network
    

Backup Procedures

Daily Backups

  1. Chaindata Backup

    ./scripts/backup/backup-chaindata.sh
    
  2. Database Backup

    kubectl exec -n besu-network blockscout-db-0 -- \
      pg_dump -U blockscout blockscout > backup/blockscout-$(date +%Y%m%d).sql
    
  3. Configuration Backup

    kubectl get all -n besu-network -o yaml > backup/cluster-$(date +%Y%m%d).yaml
    

Backup Retention

  • Daily: 7 days
  • Weekly: 4 weeks
  • Monthly: 12 months
  • Yearly: 7 years
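
The retention tiers above can be enforced with a pruning helper. A minimal sketch, assuming each tier is kept in its own directory (an assumption about layout); the daily tier uses a 7-day window, and the weekly, monthly, and yearly tiers would use roughly -mtime +28, +365, and +2555:

```shell
#!/bin/sh
# Sketch: delete backups in one tier directory older than its
# retention window. One-directory-per-tier layout is an assumption.
prune_tier() {
  dir="${1:?backup dir required}"
  days="${2:?retention days required}"
  find "$dir" -maxdepth 1 -type f -mtime "+$days" -print -delete
}

# Usage:
#   prune_tier /backups/daily 7
#   prune_tier /backups/weekly 28
```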

Testing Recovery

Test Procedures

  1. Test Backup Restoration

    • Monthly backup restoration test
    • Verify data integrity
    • Document results
  2. Test Disaster Recovery

    • Quarterly disaster recovery drill
    • Simulate failure scenarios
    • Measure recovery time
    • Document lessons learned
  3. Test Key Rotation

    • Monthly key rotation test
    • Verify key update process
    • Document results

Monitoring Recovery

Recovery Metrics

  • Recovery time
  • Data loss
  • Service availability
  • Error rates

Post-Recovery Checklist

  • All services running
  • Network operational
  • Blocks being produced
  • RPC endpoints responding
  • Monitoring working
  • Alerts configured
  • Documentation updated
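
The "network operational" and "blocks being produced" items can be checked together by sampling the chain head twice. A sketch; `RPC_URL` is a placeholder for the JSON-RPC endpoint:

```shell
#!/bin/sh
# Sketch: checklist items "network operational" / "blocks being produced".
# The RPC URL passed in is a placeholder for your JSON-RPC endpoint.

block_number() {
  curl -s -X POST "$1" \
    -H 'Content-Type: application/json' \
    -d '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' \
    | sed -n 's/.*"result" *: *"\(0x[0-9a-fA-F]*\)".*/\1/p'
}

# Succeeds only if the head advanced between two samples.
blocks_advancing() {
  rpc_url="$1"; wait_s="${2:-15}"
  b1=$(block_number "$rpc_url")
  sleep "$wait_s"
  b2=$(block_number "$rpc_url")
  [ -n "$b1" ] && [ -n "$b2" ] && [ "$((b2))" -gt "$((b1))" ]
}
```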

Contacts

Resources