WIP: HYBX OMNL and deployment documentation updates

2026-06-02 06:09:56 -07:00
parent f04a7cb7c8
commit d31aad7d66
33 changed files with 78 additions and 2878 deletions
--- a/docs/ccip-integration/operations/CCIP_MONITORING.md
+++ b/docs/ccip-integration/operations/CCIP_MONITORING.md
@@ -0,0 +1,252 @@
+# CCIP Monitoring Guide for ChainID 138
+
+**Date**: 2025-01-27  
+**Network**: ChainID 138 (DeFi Oracle Meta Mainnet)
+
+---
+
+## Overview
+
+This guide describes what to monitor, how to set up monitoring, and how to respond to incidents for the CCIP router and WETH bridge contracts on ChainID 138.
+
+---
+
+## Monitoring Components
+
+### 1. CCIP Router Monitoring
+
+#### Events to Monitor
+
+- `MessageSent`: Track all outgoing messages
+  - Parameters: `messageId`, `destinationChainSelector`, `sender`, `receiver`, `data`, `tokenAmounts`, `feeToken`, `extraArgs`
+- `MessageReceived`: Track all incoming messages
+  - Parameters: `messageId`, `sourceChainSelector`, `sender`, `data`, `tokenAmounts`
+
+#### Metrics to Track
+
+- Message volume (sent/received per hour/day)
+- Fee collection amounts
+- Average message size
+- Success/failure rates
+- Destination chain distribution
+
+#### Alerts
+
+- High message failure rate (>5%)
+- Unusual fee collection patterns
+- Router contract errors
+- Unsupported chain access attempts
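+
+The failure-rate alert above could be expressed as a Prometheus alerting rule. The following is a minimal sketch, not a definitive rule: the metric names `ccip_messages_failed_total` and `ccip_messages_sent_total` are hypothetical counters that an exporter would have to publish, and the `warning` severity, the 1-hour window, and the 15-minute hold are assumptions.
+
+```yaml
+# ccip_router_alerts.yml (sketch; metric names are hypothetical)
+groups:
+  - name: ccip-router
+    rules:
+      - alert: CCIPRouterHighFailureRate
+        # Share of failed messages over the last hour, compared with the 5% threshold.
+        expr: |
+          sum(rate(ccip_messages_failed_total[1h]))
+            /
+          sum(rate(ccip_messages_sent_total[1h])) > 0.05
+        for: 15m
+        labels:
+          severity: warning   # assumed severity
+        annotations:
+          summary: "CCIP router message failure rate is above 5%"
+```
+
+Raising the threshold to `0.10` and the severity to `critical` would give the critical variant listed under Alerting Rules.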
+
+### 2. Bridge Monitoring
+
+#### Events to Monitor
+
+**CCIPWETH9Bridge & CCIPWETH10Bridge**:
+- `CrossChainTransferInitiated`: Track outgoing transfers
+  - Parameters: `messageId`, `sender`, `destinationChainSelector`, `recipient`, `amount`, `nonce`
+- `CrossChainTransferCompleted`: Track completed transfers
+  - Parameters: `messageId`, `sourceChainSelector`, `recipient`, `amount`
+- `DestinationAdded`: Track configuration changes
+- `DestinationRemoved`: Track configuration changes
+- `DestinationUpdated`: Track configuration changes
+
+#### Metrics to Track
+
+- Transfer volume (amount and count)
+- Average transfer size
+- Transfer success rate
+- Time to completion
+- Fee costs per transfer
+- Destination chain usage
+
+#### Alerts
+
+- Failed transfers
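+- Stuck transfers (no `CrossChainTransferCompleted` event within X hours of `CrossChainTransferInitiated`, where X is a threshold to be configured)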
+- Unusual transfer patterns
+- Configuration changes
+- Insufficient fee errors
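+
+A stuck-transfer alert could look like the sketch below. It assumes a hypothetical gauge, `ccip_bridge_oldest_pending_transfer_age_seconds`, that an exporter would derive by pairing `CrossChainTransferInitiated` and `CrossChainTransferCompleted` events by `messageId`. The 4-hour value is only a placeholder for the unspecified "X hours" threshold.
+
+```yaml
+# ccip_bridge_alerts.yml (sketch; the metric name and threshold are hypothetical)
+groups:
+  - name: ccip-bridge
+    rules:
+      - alert: CCIPBridgeStuckTransfer
+        # Fires when the oldest transfer without a completion event exceeds the placeholder threshold.
+        expr: ccip_bridge_oldest_pending_transfer_age_seconds > 4 * 3600
+        for: 10m
+        labels:
+          severity: warning   # assumed severity
+        annotations:
+          summary: "A bridge transfer has been pending longer than the configured threshold"
+```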
+
+---
+
+## Monitoring Setup
+
+### Option 1: Event Logging Script
+
+Create a script to monitor events:
+
+```bash
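+#!/bin/bash
+# Print recent CCIP router and bridge events (one-shot query, not a long-running watcher).
+set -euo pipefail
+
+RPC_URL="${RPC_URL_138:-http://localhost:8545}"
+# Contract addresses have no safe default, so fail fast if one is missing.
+ROUTER="${CCIP_CHAIN138_ROUTER:?set CCIP_CHAIN138_ROUTER}"
+BRIDGE9="${CCIPWETH9_BRIDGE_CHAIN138:?set CCIPWETH9_BRIDGE_CHAIN138}"
+BRIDGE10="${CCIPWETH10_BRIDGE_CHAIN138:?set CCIPWETH10_BRIDGE_CHAIN138}"
+
+# cast needs canonical ABI types, so the `tuple[]` shorthand must be expanded.
+# ASSUMPTION: receiver and data are both `bytes`, and tokenAmounts is an array of
+# (address token, uint256 amount). Verify against the router ABI before use.
+MESSAGE_SENT_SIG="MessageSent(bytes32,uint64,address,bytes,bytes,(address,uint256)[],address,bytes)"
+
+# Monitor router events
+cast logs --from-block latest \
+  --address "$ROUTER" \
+  --rpc-url "$RPC_URL" \
+  "$MESSAGE_SENT_SIG"
+
+# Monitor bridge events on both WETH bridges
+for bridge in "$BRIDGE9" "$BRIDGE10"; do
+  cast logs --from-block latest \
+    --address "$bridge" \
+    --rpc-url "$RPC_URL" \
+    "CrossChainTransferInitiated(bytes32,address,uint64,address,uint256,uint256)"
+done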
+```
+
+### Option 2: Prometheus Integration
+
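+Set up Prometheus to scrape CCIP metrics. Contracts do not serve a `/metrics` endpoint themselves, so the target below must be a node or exporter endpoint that publishes the metrics: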
+
+```yaml
+# prometheus.yml
+scrape_configs:
+  - job_name: 'ccip-router'
+    static_configs:
+      - targets: ['localhost:9545']
+    metrics_path: '/metrics'
+```
+
+### Option 3: Grafana Dashboards
+
+Create dashboards for:
+- Message volume over time
+- Transfer amounts and counts
+- Fee collection
+- Success/failure rates
+- Destination chain distribution
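+
+Dashboards like these need a data source to query. As a minimal sketch, Grafana can be pointed at Prometheus through a provisioning file. The file path and the Prometheus URL below are assumptions, not values taken from this deployment.
+
+```yaml
+# provisioning/datasources/prometheus.yml (sketch; path and URL are assumed)
+apiVersion: 1
+datasources:
+  - name: Prometheus
+    type: prometheus
+    access: proxy
+    url: http://localhost:9090   # assumed Prometheus address
+    isDefault: true
+```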
+
+---
+
+## Alerting Rules
+
+### Critical Alerts
+
+1. **Router Down**: Calls to the router contract fail or time out
+2. **Bridge Failure**: Bridge fails to process transfers
+3. **High Failure Rate**: >10% message/transfer failures
+4. **Configuration Change**: Unauthorized configuration changes
+
+### Warning Alerts
+
+1. **High Volume**: Unusual message/transfer volume
+2. **Fee Anomaly**: Unusual fee collection patterns
+3. **Slow Processing**: Messages taking longer than expected
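+
+Two of these rules can be sketched as Prometheus alerts with `severity` labels. This is an illustration under stated assumptions: `up{job="ccip-router"}` only shows that the scrape target from the Prometheus job is reachable, so it is a proxy for "Router Down" rather than a direct contract check. `ccip_message_latency_seconds` is a hypothetical histogram, and the 300-second threshold borrows the 5-minute latency target from this guide.
+
+```yaml
+# ccip_severity_alerts.yml (sketch; the latency metric is hypothetical)
+groups:
+  - name: ccip-availability
+    rules:
+      - alert: CCIPRouterDown
+        # `up` is set by Prometheus itself: 0 means the last scrape of the job failed.
+        expr: up{job="ccip-router"} == 0
+        for: 5m
+        labels:
+          severity: critical
+        annotations:
+          summary: "The ccip-router scrape target is unreachable"
+      - alert: CCIPSlowProcessing
+        # Average send-to-receive latency over 30 minutes, compared with 300 seconds.
+        expr: |
+          sum(rate(ccip_message_latency_seconds_sum[30m]))
+            /
+          sum(rate(ccip_message_latency_seconds_count[30m])) > 300
+        for: 15m
+        labels:
+          severity: warning
+        annotations:
+          summary: "Average CCIP message latency is above 5 minutes"
+```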
+
+---
+
+## Logging
+
+### Recommended Log Levels
+
+- **INFO**: Normal operations (messages sent/received, transfers)
+- **WARN**: Recoverable errors, configuration changes
+- **ERROR**: Failed operations, contract errors
+- **DEBUG**: Detailed operation logs (for troubleshooting)
+
+### Log Retention
+
+- **Event Logs**: Retain for 90 days
+- **Error Logs**: Retain for 180 days
+- **Audit Logs**: Retain for 1 year
+
+---
+
+## Health Checks
+
+### Router Health Check
+
+```bash
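+# Assumes ROUTER and RPC_URL are set as in the event logging script above.
+# Check router is responsive (a successful call prints the raw return data)
+cast call "$ROUTER" "feeToken()" --rpc-url "$RPC_URL"
+
+# Check supported chains (5009297550715157269 is the CCIP chain selector for Ethereum mainnet)
+cast call "$ROUTER" "supportedChains(uint64)" "5009297550715157269" --rpc-url "$RPC_URL"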
+```
+
+### Bridge Health Check
+
+```bash
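+# Assumes BRIDGE9 and RPC_URL are set as in the event logging script above.
+# Check bridge router connection (should return the router address the bridge is wired to)
+cast call "$BRIDGE9" "ccipRouter()" --rpc-url "$RPC_URL"
+
+# Check destinations (5009297550715157269 is the CCIP chain selector for Ethereum mainnet)
+cast call "$BRIDGE9" "destinations(uint64)" "5009297550715157269" --rpc-url "$RPC_URL"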
+```
+
+---
+
+## Performance Metrics
+
+### Key Performance Indicators (KPIs)
+
+1. **Message Throughput**: Messages per second
+2. **Transfer Throughput**: Transfers per hour
+3. **Average Latency**: Time from send to receive
+4. **Success Rate**: Percentage of successful operations
+5. **Fee Efficiency**: Average fee per operation
+
+### Target Metrics
+
+- Message success rate: >99%
+- Average latency: <5 minutes
+- Transfer success rate: >99.5%
+- System uptime: >99.9%
+
+---
+
+## Incident Response
+
+### Escalation Procedures
+
+1. **Level 1**: Automated alerts → On-call engineer
+2. **Level 2**: Critical failures → Team lead
+3. **Level 3**: System-wide issues → CTO/Management
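+
+The first two levels could be wired into Alertmanager roughly as follows. This is a sketch under assumptions: the receiver names and routing keys are hypothetical placeholders, and alerts are assumed to carry a `severity` label. Level 3 is left as a manual escalation.
+
+```yaml
+# alertmanager.yml (sketch; receiver names and keys are hypothetical)
+route:
+  receiver: on-call-engineer        # Level 1: default for all automated alerts
+  group_by: ['alertname']
+  routes:
+    - matchers:
+        - severity="critical"
+      receiver: team-lead           # Level 2: critical failures
+receivers:
+  - name: on-call-engineer
+    pagerduty_configs:
+      - routing_key: 'REPLACE_WITH_ONCALL_KEY'
+  - name: team-lead
+    pagerduty_configs:
+      - routing_key: 'REPLACE_WITH_TEAM_LEAD_KEY'
+```
+
+With this layout a critical alert goes to the team lead only. Notifying the on-call engineer as well would need a second child route.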
+
+### Response Playbook
+
+1. **Router Failure**:
+   - Check contract status
+   - Verify RPC connectivity
+   - Review recent transactions
+   - Check for configuration changes
+
+2. **Bridge Failure**:
+   - Verify router connectivity
+   - Check destination configuration
+   - Review transfer logs
+   - Verify fee payment
+
+3. **High Failure Rate**:
+   - Analyze failure patterns
+   - Check network conditions
+   - Review recent changes
+   - Escalate if needed
+
+---
+
+## Monitoring Tools
+
+### Recommended Tools
+
+- **Prometheus**: Metrics collection
+- **Grafana**: Visualization and dashboards
+- **ELK Stack**: Log aggregation
+- **PagerDuty**: Alerting and on-call
+- **Custom Scripts**: Event monitoring
+
+---
+
+## Related Documentation
+
+- [CCIP Deployment Guide](../ccip/DEPLOYMENT_GUIDE_CHAIN138.md)
+- [CCIP Review](../CCIP_CHAIN138_REVIEW.md)
+- [Operations Runbooks](CCIP_RUNBOOKS.md)
+
+---
+
+**Last Updated**: 2025-01-27
+