CCIP Recovery Procedures
Overview
This document provides recovery procedures for CCIP failures and outages.
Recovery Scenarios
Scenario 1: Router Failure
Symptoms: Router service is down or unresponsive
Recovery Steps:
-
Check Router Status
    kubectl get pods -n besu-network -l app=ccip-router
    kubectl describe pod <router-pod> -n besu-network
-
Restart Router
    kubectl delete pod <router-pod> -n besu-network
    # Wait for new pod to start
    kubectl get pods -n besu-network -l app=ccip-router
-
Verify Recovery
    # Test router connectivity
    cast call $CCIP_ROUTER "getSupportedTokens(uint64)" $CHAIN_SELECTOR --rpc-url $RPC_URL
-
Resume Operations
- Monitor message sending
- Verify message delivery
- Check for backlog
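A freshly restarted router pod may need a short time before it answers calls, so the verification step can be wrapped in a retry loop to avoid a false alarm. The following is a minimal sketch in POSIX shell; the `retry` helper, the attempt count, and the delay are illustrative assumptions and not part of the existing tooling.

```shell
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND until it succeeds or ATTEMPTS is exhausted.
retry() {
  retry_attempts=$1
  retry_delay=$2
  shift 2
  retry_i=1
  while [ "$retry_i" -le "$retry_attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep only between attempts, not after the last failure.
    if [ "$retry_i" -lt "$retry_attempts" ]; then
      sleep "$retry_delay"
    fi
    retry_i=$((retry_i + 1))
  done
  return 1
}

# Example (assumed values: 5 attempts, 10 seconds apart):
#   retry 5 10 cast call $CCIP_ROUTER "getSupportedTokens(uint64)" \
#     $CHAIN_SELECTOR --rpc-url $RPC_URL
```

If the call still fails after the last attempt, `retry` returns a non-zero status, which is the point at which to go back to the router status check.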
Scenario 2: Contract Failure
Symptoms: Sender or receiver contract is not functioning
Recovery Steps:
-
Identify Issue
    # Check contract state
    cast call $SENDER "paused()" --rpc-url $RPC_URL
    cast call $RECEIVER "processedMessages(bytes32)" $MESSAGE_ID --rpc-url $RPC_URL
-
Pause Operations (if needed)
    // Call pause function if available
    sender.pause();
-
Fix Contract
- Update configuration
- Fix bugs if any
- Deploy new version if needed
-
Resume Operations
    // Unpause if paused
    sender.unpause();
-
Verify Recovery
- Test message sending
- Verify message receiving
- Monitor for issues
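When the `paused()` check is scripted, its output has to be interpreted. As far as I know, `cast call` prints the raw 32-byte return word when the signature has no return type, and a decoded `true` or `false` when the signature is written as `"paused()(bool)"`. The sketch below handles both forms; `decode_bool` is a hypothetical helper name, and the sender contract is assumed to expose a standard boolean `paused()` getter.

```shell
# decode_bool RAW
# Hypothetical helper: maps the output of `cast call ... "paused()"` to
# "true" or "false". Accepts the raw ABI word (0x00...00 / 0x00...01)
# as well as the already decoded form ("true" / "false").
decode_bool() {
  case "$1" in
    true) echo true ;;
    false) echo false ;;
    0x*)
      # Strip the 0x prefix and all leading zeros; an empty remainder is false.
      word=${1#0x}
      stripped=$(printf '%s' "$word" | sed 's/^0*//')
      if [ -z "$stripped" ]; then
        echo false
      elif [ "$stripped" = "1" ]; then
        echo true
      else
        echo "unexpected value: $1" >&2
        return 1
      fi
      ;;
    *)
      echo "unexpected value: $1" >&2
      return 1
      ;;
  esac
}

# Example:
#   decode_bool "$(cast call $SENDER "paused()" --rpc-url $RPC_URL)"
```

Rejecting anything other than a clean 0 or 1 keeps a reverted or malformed call from being read as "not paused".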
Scenario 3: Message Backlog
Symptoms: Messages queued but not being processed
Recovery Steps:
-
Assess Backlog
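    # Check pending messages in the last 1000 blocks
    # (cast expects a block number or tag, so compute the start block first)
    LATEST=$(cast block-number --rpc-url $RPC_URL)
    cast logs --from-block $((LATEST - 1000)) --address $SENDER --rpc-url $RPC_URL | grep MessageSent
-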
Identify Cause
- Check router status
- Verify LINK balance
- Check target chain status
-
Clear Backlog
- Fix underlying issue
- Process messages in order
- Monitor processing
-
Prevent Future Backlog
- Increase processing capacity
- Improve error handling
- Add monitoring
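To size the backlog and replay it in order, one option is to compare the IDs that were sent against the IDs already processed. The sketch below assumes two plain-text files with one message ID per line; how those files are produced (for example from sender logs and the receiver's `processedMessages` lookup) is left open, and both the file layout and the `pending_messages` name are hypothetical.

```shell
# pending_messages SENT_FILE PROCESSED_FILE
# Prints IDs from SENT_FILE that do not appear in PROCESSED_FILE,
# preserving the order of SENT_FILE so the backlog can be replayed in order.
# Both files are assumed to hold one message ID per line.
pending_messages() {
  if [ -s "$2" ]; then
    # -F fixed strings, -x whole-line match, -v invert, -f patterns from file.
    # grep exits 1 when nothing is pending; treat that as success.
    grep -vxFf "$2" "$1" || [ "$?" -eq 1 ]
  else
    # Nothing processed yet: everything sent is still pending.
    cat "$1"
  fi
}

# Example:
#   pending_messages sent_ids.txt processed_ids.txt | wc -l
```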
Scenario 4: Data Corruption
Symptoms: Invalid messages or corrupted data
Recovery Steps:
-
Identify Corrupted Messages
- Review message logs
- Check for invalid formats
- Identify affected messages
-
Isolate Issue
- Pause message processing
- Prevent further corruption
- Assess impact
-
Recover Data
- Resend valid messages
- Skip corrupted messages
- Update data if possible
-
Prevent Recurrence
- Fix encoding/decoding
- Add validation
- Improve error handling
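The "check for invalid formats" and "add validation" steps can start with a cheap structural check before any message is resent. The sketch below assumes message IDs are `bytes32` values, as the `processedMessages(bytes32)` call in Scenario 2 suggests, written as `0x` followed by 64 hex digits; the helper names are hypothetical.

```shell
# is_bytes32 VALUE
# Succeeds when VALUE is "0x" followed by exactly 64 hex digits.
is_bytes32() {
  printf '%s\n' "$1" | grep -Eq '^0x[0-9a-fA-F]{64}$'
}

# filter_valid_ids: reads IDs on stdin, prints well-formed ones to stdout and
# reports malformed ones on stderr so they can be reviewed or skipped.
filter_valid_ids() {
  while IFS= read -r id; do
    if is_bytes32 "$id"; then
      printf '%s\n' "$id"
    else
      printf 'skipping malformed id: %s\n' "$id" >&2
    fi
  done
}

# Example:
#   filter_valid_ids < candidate_ids.txt > resend_ids.txt 2> rejected.log
```

This only catches malformed identifiers; payload-level corruption still needs the encoding and decoding fixes listed above.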
Scenario 5: Network Partition
Symptoms: Cannot communicate with target chain
Recovery Steps:
-
Verify Connectivity
    # Test target chain connectivity
    curl -X POST $TARGET_CHAIN_RPC_URL -H "Content-Type: application/json" \
      -d '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}'
-
Wait for Recovery
- Monitor network status
- Check chain status pages
- Wait for partition to resolve
-
Resume Operations
- Verify connectivity restored
- Process queued messages
- Resume normal operations
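To confirm that connectivity is restored, it helps to check that the `eth_blockNumber` response actually carries a block height rather than an error object, and that the height advances between two calls. The sketch below parses the JSON-RPC response with `sed` to avoid extra dependencies; `block_number_from_response` is a hypothetical helper, and a dedicated JSON parser would be more robust if one is available.

```shell
# block_number_from_response JSON
# Extracts the hex "result" from an eth_blockNumber JSON-RPC response and
# prints it as a decimal block height. Fails if no hex result is present
# (for example when the node returns an "error" object instead).
block_number_from_response() {
  hex=$(printf '%s' "$1" |
    sed -n 's/.*"result"[[:space:]]*:[[:space:]]*"\(0x[0-9a-fA-F]\{1,\}\)".*/\1/p')
  [ -n "$hex" ] || return 1
  printf '%d\n' "$hex"
}

# Example (same request as in the connectivity check):
#   resp=$(curl -s -X POST $TARGET_CHAIN_RPC_URL -H "Content-Type: application/json" \
#     -d '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}')
#   block_number_from_response "$resp"
```

Running the example twice a few block times apart and comparing the two heights gives a simple signal that the target chain is producing blocks again.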
Recovery Verification
Post-Recovery Checks
-
Functionality
- Test message sending
- Verify message receiving
- Check fee calculation
-
Performance
- Check message latency
- Verify throughput
- Monitor error rates
-
Data Integrity
- Verify message order
- Check for duplicates
- Validate message content
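The data integrity checks for duplicates and ordering can be scripted with standard text tools. The sketch below assumes the delivered messages have been exported as plain text, one message ID (or one integer sequence number) per line; that export format and both helper names are assumptions made for illustration.

```shell
# find_duplicates: reads message IDs on stdin and prints each ID that occurs
# more than once (one line per duplicated ID).
find_duplicates() {
  sort | uniq -d
}

# is_ascending: reads integer sequence numbers on stdin, one per line, and
# succeeds only if each value is strictly greater than the previous one.
# Values are compared as awk numbers, so very large values lose precision.
is_ascending() {
  awk 'NR > 1 && $1 + 0 <= prev { bad = 1; exit }
       { prev = $1 + 0 }
       END { exit bad + 0 }'
}

# Example:
#   find_duplicates < delivered_ids.txt
#   is_ascending < delivered_sequence_numbers.txt || echo "out-of-order delivery"
```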
Prevention
Best Practices
-
Monitoring
- Set up comprehensive monitoring
- Configure alerts
- Regular health checks
-
Redundancy
- Deploy multiple router instances
- Use backup contracts
- Maintain backup configurations
-
Testing
- Regular disaster recovery drills
- Test recovery procedures
- Update procedures based on tests
-
Documentation
- Keep runbooks updated
- Document lessons learned
- Share knowledge