WIP: HYBX OMNL and deployment documentation updates
docs/ccip-integration/operations/CCIP_MONITORING.md
@@ -0,0 +1,252 @@
# CCIP Monitoring Guide for ChainID 138

**Date**: 2025-01-27
**Network**: ChainID 138 (DeFi Oracle Meta Mainnet)

---

## Overview

This guide provides monitoring setup and best practices for CCIP infrastructure on ChainID 138.

---

## Monitoring Components

### 1. CCIP Router Monitoring

#### Events to Monitor

- `MessageSent`: Track all outgoing messages
  - Parameters: `messageId`, `destinationChainSelector`, `sender`, `receiver`, `data`, `tokenAmounts`, `feeToken`, `extraArgs`
- `MessageReceived`: Track all incoming messages
  - Parameters: `messageId`, `sourceChainSelector`, `sender`, `data`, `tokenAmounts`

#### Metrics to Track

- Message volume (sent/received per hour/day)
- Fee collection amounts
- Average message size
- Success/failure rates
- Destination chain distribution

#### Alerts

- High message failure rate (>5%)
- Unusual fee collection patterns
- Router contract errors
- Unsupported chain access attempts
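
The failure-rate alert above reduces to a ratio check over a time window. Here is a minimal sketch, assuming you already tally total and failed messages per window. The function name and its inputs are hypothetical and are not part of the router or of `cast`.

```bash
#!/bin/bash
# Sketch: decide whether the failure rate in a window breaches a threshold.
# TOTAL and FAILED are assumed to come from your own event tally.

# failure_rate_exceeds TOTAL FAILED THRESHOLD_PERCENT
# Returns 0 (true) when FAILED/TOTAL is strictly greater than THRESHOLD_PERCENT.
failure_rate_exceeds() {
  local total="$1" failed="$2" threshold_pct="$3"
  # No traffic in the window: nothing to alert on.
  [ "$total" -gt 0 ] || return 1
  # Cross-multiply to stay in integer arithmetic: failed/total > pct/100.
  [ $(( failed * 100 )) -gt $(( threshold_pct * total )) ]
}
```

For example, `failure_rate_exceeds 200 11 5` succeeds (5.5% is above 5%), while `failure_rate_exceeds 200 10 5` does not, because exactly 5% is not a breach.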

### 2. Bridge Monitoring

#### Events to Monitor

**CCIPWETH9Bridge & CCIPWETH10Bridge**:
- `CrossChainTransferInitiated`: Track outgoing transfers
  - Parameters: `messageId`, `sender`, `destinationChainSelector`, `recipient`, `amount`, `nonce`
- `CrossChainTransferCompleted`: Track completed transfers
  - Parameters: `messageId`, `sourceChainSelector`, `recipient`, `amount`
- `DestinationAdded`: Track configuration changes
- `DestinationRemoved`: Track configuration changes
- `DestinationUpdated`: Track configuration changes

#### Metrics to Track

- Transfer volume (amount and count)
- Average transfer size
- Transfer success rate
- Time to completion
- Fee costs per transfer
- Destination chain usage

#### Alerts

- Failed transfers
- Stuck transfers (no completion after X hours)
- Unusual transfer patterns
- Configuration changes
- Insufficient fee errors
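
One way to detect stuck transfers is to compare the message IDs seen in `CrossChainTransferInitiated` with those seen in `CrossChainTransferCompleted`. The sketch below assumes you export both to plain text files in a hypothetical format: `<messageId> <unix_timestamp>` per initiated transfer and `<messageId>` per completion. Completions likely have to be collected from the bridge on the destination chain, since that event carries a `sourceChainSelector`. The age threshold stands in for the "X hours" above and is yours to choose.

```bash
#!/bin/bash
# Sketch: list transfers that were initiated but not completed within a max age.
# File formats and the function name are assumptions made for illustration.

# stuck_transfers INITIATED_FILE COMPLETED_FILE MAX_AGE_SECONDS [NOW_EPOCH]
stuck_transfers() {
  local initiated="$1" completed="$2" max_age="$3" now="${4:-$(date +%s)}"
  awk -v now="$now" -v max_age="$max_age" '
    # First file: remember every completed message ID.
    FILENAME == ARGV[1] { completed_ids[$1] = 1; next }
    # Second file: print IDs that are old enough and never completed.
    !($1 in completed_ids) && (now - $2) > max_age { print $1 }
  ' "$completed" "$initiated"
}

# Example: report transfers pending for more than 4 hours.
# stuck_transfers initiated.txt completed.txt $(( 4 * 3600 ))
```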

---

## Monitoring Setup

### Option 1: Event Logging Script

Create a script to monitor events:

```bash
#!/bin/bash
# Monitor CCIP events
set -euo pipefail

RPC_URL="${RPC_URL_138:-http://localhost:8545}"
# Fail early if an address is missing instead of querying an empty address
ROUTER="${CCIP_CHAIN138_ROUTER:?set CCIP_CHAIN138_ROUTER}"
BRIDGE9="${CCIPWETH9_BRIDGE_CHAIN138:?set CCIPWETH9_BRIDGE_CHAIN138}"
BRIDGE10="${CCIPWETH10_BRIDGE_CHAIN138:?set CCIPWETH10_BRIDGE_CHAIN138}"

# Each `cast logs` call queries once and exits; rerun on a schedule to keep monitoring.

# Monitor router events
# The signature must match the deployed router ABI exactly, or the topic hash will not match.
# Assumption: `receiver` and `data` are `bytes`, and `tokenAmounts` is an array of
# (address token, uint256 amount) structs; adjust if the contract differs.
cast logs --from-block latest \
  --address "$ROUTER" \
  --rpc-url "$RPC_URL" \
  "MessageSent(bytes32,uint64,address,bytes,bytes,(address,uint256)[],address,bytes)"

# Monitor bridge events on both WETH bridges
for bridge in "$BRIDGE9" "$BRIDGE10"; do
  cast logs --from-block latest \
    --address "$bridge" \
    --rpc-url "$RPC_URL" \
    "CrossChainTransferInitiated(bytes32,address,uint64,address,uint256,uint256)"
done
```
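A one-shot query only sees logs that exist at the moment it runs, so continuous monitoring needs some form of polling. The sketch below walks forward from the last processed block in bounded windows. It assumes `RPC_URL` is set as in the script above. The helper names, the 1000-block window, and the 10-second interval are illustrative choices, not project settings.

```bash
#!/bin/bash
# Sketch: poll for new logs in bounded block windows.

# next_block_range LAST_SEEN HEAD MAX_SPAN
# Prints "FROM TO" for the next window to query, or nothing when caught up.
next_block_range() {
  local last_seen="$1" head="$2" max_span="$3"
  local from=$(( last_seen + 1 ))
  [ "$from" -le "$head" ] || return 0
  local to=$(( from + max_span - 1 ))
  [ "$to" -le "$head" ] || to="$head"
  echo "$from $to"
}

# poll_events ADDRESS EVENT_SIGNATURE START_BLOCK
# Loops forever; each pass fetches logs for blocks not yet processed.
poll_events() {
  local address="$1" signature="$2" last_seen="$3"
  local head range from to
  while true; do
    if head="$(cast block-number --rpc-url "$RPC_URL")"; then
      while range="$(next_block_range "$last_seen" "$head" 1000)" && [ -n "$range" ]; do
        read -r from to <<< "$range"
        cast logs --from-block "$from" --to-block "$to" \
          --address "$address" --rpc-url "$RPC_URL" "$signature" || break
        last_seen="$to"
      done
    fi
    sleep 10
  done
}
```

Tracking `last_seen` explicitly means a failed query is retried on the next pass rather than skipped, at the cost of keeping that value somewhere durable if the script restarts.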

### Option 2: Prometheus Integration

Set up Prometheus to scrape CCIP metrics:

```yaml
# prometheus.yml
scrape_configs:
  - job_name: 'ccip-router'
    static_configs:
      - targets: ['localhost:9545']
    metrics_path: '/metrics'
```
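The scrape job needs an endpoint that actually exposes CCIP-level numbers, and the guide does not say what serves them. One possible approach, sketched here, is to write counters in the Prometheus text exposition format to a file that node_exporter's textfile collector picks up. The metric names, the label, and the function are hypothetical.

```bash
#!/bin/bash
# Sketch: publish event counts through node_exporter's textfile collector.
# Metric names and labels below are illustrative, not an existing convention.

# write_ccip_metrics OUT_FILE SENT_COUNT FAILED_COUNT
write_ccip_metrics() {
  local out="$1" sent="$2" failed="$3" tmp
  tmp="$(mktemp "${out}.XXXXXX")" || return 1
  {
    echo '# HELP ccip_messages_sent_total MessageSent events observed on ChainID 138.'
    echo '# TYPE ccip_messages_sent_total counter'
    echo "ccip_messages_sent_total{chain_id=\"138\"} ${sent}"
    echo '# HELP ccip_messages_failed_total Messages counted as failed.'
    echo '# TYPE ccip_messages_failed_total counter'
    echo "ccip_messages_failed_total{chain_id=\"138\"} ${failed}"
  } > "$tmp"
  # mktemp creates the file with mode 0600; make it readable for the exporter.
  chmod 644 "$tmp"
  # Rename last so the collector never reads a half-written file.
  mv "$tmp" "$out"
}
```

With node_exporter started with `--collector.textfile.directory` pointing at a directory of your choice, a call such as `write_ccip_metrics "$TEXTFILE_DIR/ccip.prom" 42 3` makes the values available on the exporter's own `/metrics` endpoint. The scrape target would then be node_exporter rather than the address shown above.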

### Option 3: Grafana Dashboards

Create dashboards for:
- Message volume over time
- Transfer amounts and counts
- Fee collection
- Success/failure rates
- Destination chain distribution

---

## Alerting Rules

### Critical Alerts

1. **Router Down**: Router contract becomes unresponsive
2. **Bridge Failure**: Bridge fails to process transfers
3. **High Failure Rate**: >10% message/transfer failures
4. **Configuration Change**: Unauthorized configuration changes

### Warning Alerts

1. **High Volume**: Unusual message/transfer volume
2. **Fee Anomaly**: Unusual fee collection patterns
3. **Slow Processing**: Messages taking longer than expected

---

## Logging

### Recommended Log Levels

- **INFO**: Normal operations (messages sent/received, transfers)
- **WARN**: Recoverable errors, configuration changes
- **ERROR**: Failed operations, contract errors
- **DEBUG**: Detailed operation logs (for troubleshooting)

### Log Retention

- **Event Logs**: Retain for 90 days
- **Error Logs**: Retain for 180 days
- **Audit Logs**: Retain for 1 year
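
If logs are kept as files on disk, the retention periods above can be enforced with a scheduled cleanup. This is a minimal sketch that assumes one directory per log class and a `.log` suffix. The directory paths in the example are hypothetical, and a log platform such as the ELK Stack would use its own retention settings instead.

```bash
#!/bin/bash
# Sketch: delete log files older than a retention period.
# Directory layout and file suffix are assumptions made for illustration.

# prune_logs DIR DAYS
# Removes *.log files under DIR whose modification time is more than DAYS days ago.
prune_logs() {
  local dir="$1" days="$2"
  # A missing directory is not an error: there is simply nothing to prune.
  [ -d "$dir" ] || return 0
  find "$dir" -type f -name '*.log' -mtime +"$days" -delete
}

# Example (hypothetical paths), matching the retention periods listed above:
# prune_logs /var/log/ccip/events 90
# prune_logs /var/log/ccip/errors 180
# prune_logs /var/log/ccip/audit  365
```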

---

## Health Checks

### Router Health Check

```bash
# Assumes ROUTER and RPC_URL are set as in the event logging script

# Check router is responsive
cast call "$ROUTER" "feeToken()" --rpc-url "$RPC_URL"

# Check supported chains (5009297550715157269 is the CCIP chain selector for Ethereum mainnet)
cast call "$ROUTER" "supportedChains(uint64)" "5009297550715157269" --rpc-url "$RPC_URL"
```

### Bridge Health Check

```bash
# Assumes BRIDGE9 and RPC_URL are set as in the event logging script

# Check bridge router connection
cast call "$BRIDGE9" "ccipRouter()" --rpc-url "$RPC_URL"

# Check destinations (same Ethereum mainnet chain selector as above)
cast call "$BRIDGE9" "destinations(uint64)" "5009297550715157269" --rpc-url "$RPC_URL"
```
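For scheduled runs, the individual calls above can be wrapped so that the script exits non-zero when any check fails, which is what most cron wrappers and alerting hooks expect. This is a sketch. It assumes `ROUTER`, `BRIDGE9`, and `RPC_URL` are already exported, and the helper names are hypothetical.

```bash
#!/bin/bash
# Sketch: run the health checks and report an overall status.

# check_call LABEL ADDRESS SIGNATURE [ARGS...]
# Runs one `cast call` and prints OK or FAIL with its output.
check_call() {
  local label="$1"; shift
  local out
  if out="$(cast call "$@" --rpc-url "$RPC_URL" 2>&1)"; then
    echo "OK   ${label}: ${out}"
  else
    echo "FAIL ${label}: ${out}"
    return 1
  fi
}

# run_health_checks
# Returns 0 only if every check succeeded.
run_health_checks() {
  local failures=0
  check_call "router feeToken" "$ROUTER" "feeToken()" || failures=$(( failures + 1 ))
  check_call "bridge ccipRouter" "$BRIDGE9" "ccipRouter()" || failures=$(( failures + 1 ))
  [ "$failures" -eq 0 ]
}

# Example: run_health_checks || echo "CCIP health check failed" >&2
```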

---

## Performance Metrics

### Key Performance Indicators (KPIs)

1. **Message Throughput**: Messages per second
2. **Transfer Throughput**: Transfers per hour
3. **Average Latency**: Time from send to receive
4. **Success Rate**: Percentage of successful operations
5. **Fee Efficiency**: Average fee per operation

### Target Metrics

- Message success rate: >99%
- Average latency: <5 minutes
- Transfer success rate: >99.5%
- System uptime: >99.9%
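
The latency target can be checked from paired timestamps. The sketch below assumes a hypothetical input file with one `<sent_unix_ts> <received_unix_ts>` pair per line, for example block timestamps of the send and receive events for the same `messageId`. It compares the average against a limit in seconds, so 300 corresponds to the 5-minute target.

```bash
#!/bin/bash
# Sketch: compute average send-to-receive latency and compare it with a target.
# The input file format is an assumption made for illustration.

# average_latency_seconds FILE
# Prints the average latency in whole seconds; fails if FILE has no samples.
average_latency_seconds() {
  awk 'NF >= 2 { sum += $2 - $1; n++ }
       END { if (n > 0) printf "%d\n", sum / n; else exit 1 }' "$1"
}

# meets_latency_target FILE MAX_SECONDS
# Returns 0 when the average latency is strictly below MAX_SECONDS.
meets_latency_target() {
  local avg
  avg="$(average_latency_seconds "$1")" || return 1
  [ "$avg" -lt "$2" ]
}

# Example: meets_latency_target latencies.txt 300 || echo "latency above target" >&2
```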

---

## Incident Response

### Escalation Procedures

1. **Level 1**: Automated alerts → On-call engineer
2. **Level 2**: Critical failures → Team lead
3. **Level 3**: System-wide issues → CTO/Management

### Response Playbook

1. **Router Failure**:
   - Check contract status
   - Verify RPC connectivity
   - Review recent transactions
   - Check for configuration changes

2. **Bridge Failure**:
   - Verify router connectivity
   - Check destination configuration
   - Review transfer logs
   - Verify fee payment

3. **High Failure Rate**:
   - Analyze failure patterns
   - Check network conditions
   - Review recent changes
   - Escalate if needed

---

## Monitoring Tools

### Recommended Tools

- **Prometheus**: Metrics collection
- **Grafana**: Visualization and dashboards
- **ELK Stack**: Log aggregation
- **PagerDuty**: Alerting and on-call
- **Custom Scripts**: Event monitoring

---

## Related Documentation

- [CCIP Deployment Guide](../ccip/DEPLOYMENT_GUIDE_CHAIN138.md)
- [CCIP Review](../CCIP_CHAIN138_REVIEW.md)
- [Operations Runbooks](CCIP_RUNBOOKS.md)

---

**Last Updated**: 2025-01-27