CCIP Monitoring Guide for ChainID 138

Date: 2025-01-27
Network: ChainID 138 (DeFi Oracle Meta Mainnet)


Overview

This guide describes how to monitor the CCIP infrastructure on ChainID 138: which events and metrics to track, how to set up collection and alerting, and how to respond to incidents.


Monitoring Components

1. CCIP Router Monitoring

Events to Monitor

  • MessageSent: Track all outgoing messages
    • Parameters: messageId, destinationChainSelector, sender, receiver, data, tokenAmounts, feeToken, extraArgs
  • MessageReceived: Track all incoming messages
    • Parameters: messageId, sourceChainSelector, sender, data, tokenAmounts

Metrics to Track

  • Message volume (sent/received per hour/day)
  • Fee collection amounts
  • Average message size
  • Success/failure rates
  • Destination chain distribution

Alerts

  • High message failure rate (>5%; rates above 10% are a critical alert, see Alerting Rules)
  • Unusual fee collection patterns
  • Router contract errors
  • Unsupported chain access attempts
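
The failure-rate alert above can be evaluated from two counters. The following is a minimal sketch, assuming the sent and failed counts have already been collected by some other means; the function names `failure_rate_pct` and `exceeds_threshold` are hypothetical and not part of any CCIP tooling.

```shell
#!/bin/sh
# failure_rate_pct SENT FAILED
# Prints the failure percentage with two decimals; prints 0.00 when SENT is 0.
failure_rate_pct() {
  awk -v sent="$1" -v failed="$2" 'BEGIN {
    if (sent <= 0) { printf "0.00\n"; exit }
    printf "%.2f\n", (failed / sent) * 100
  }'
}

# exceeds_threshold RATE THRESHOLD
# Returns 0 (true) only when RATE is strictly greater than THRESHOLD.
exceeds_threshold() {
  awk -v r="$1" -v t="$2" 'BEGIN { exit !(r > t) }'
}

# Example: alert when more than 5% of messages in the window failed.
# rate=$(failure_rate_pct 200 12)
# exceeds_threshold "$rate" 5 && echo "ALERT: failure rate ${rate}%"
```

The comparison is done in awk rather than with `[ ... ]` because shell integer tests cannot compare fractional percentages.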

2. Bridge Monitoring

Events to Monitor

CCIPWETH9Bridge & CCIPWETH10Bridge:

  • CrossChainTransferInitiated: Track outgoing transfers
    • Parameters: messageId, sender, destinationChainSelector, recipient, amount, nonce
  • CrossChainTransferCompleted: Track completed transfers
    • Parameters: messageId, sourceChainSelector, recipient, amount
  • DestinationAdded: Track configuration changes
  • DestinationRemoved: Track configuration changes
  • DestinationUpdated: Track configuration changes

Metrics to Track

  • Transfer volume (amount and count)
  • Average transfer size
  • Transfer success rate
  • Time to completion
  • Fee costs per transfer
  • Destination chain usage

Alerts

  • Failed transfers
  • Stuck transfers (no completion after X hours)
  • Unusual transfer patterns
  • Configuration changes
  • Insufficient fee errors
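
Stuck transfers can be found by matching `CrossChainTransferInitiated` message IDs against `CrossChainTransferCompleted` message IDs. The sketch below assumes the events have already been exported to two plain-text files (one `messageId unix_timestamp` pair per line for initiated transfers, one `messageId` per line for completed ones); that file layout and the function name `find_stuck` are assumptions made for illustration. The age threshold is a parameter because this guide leaves X unspecified.

```shell
#!/bin/sh
# find_stuck INITIATED_FILE COMPLETED_FILE NOW MAX_AGE_SECONDS
# Prints every initiated messageId that is older than MAX_AGE_SECONDS
# and has no matching entry in COMPLETED_FILE.
find_stuck() {
  awk -v now="$3" -v max_age="$4" '
    # Records from the first file argument are completed message IDs.
    FILENAME == ARGV[1] { completed[$1] = 1; next }
    # Remaining records are initiated transfers: "messageId timestamp".
    !($1 in completed) && (now - $2) > max_age { print $1 }
  ' "$2" "$1"
}

# Example: report transfers still open after 2 hours.
# find_stuck initiated.txt completed.txt "$(date +%s)" 7200
```

Comparing on `FILENAME` instead of the common `NR == FNR` idiom keeps the result correct when the completed file is empty.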

Monitoring Setup

Option 1: Event Logging Script

Create a script to monitor events:

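#!/bin/bash
# Monitor CCIP events on ChainID 138 (requires Foundry's cast)
set -euo pipefail

RPC_URL="${RPC_URL_138:-http://localhost:8545}"
# Contract addresses have no sensible default, so fail early if one is unset
ROUTER="${CCIP_CHAIN138_ROUTER:?set CCIP_CHAIN138_ROUTER}"
BRIDGE9="${CCIPWETH9_BRIDGE_CHAIN138:?set CCIPWETH9_BRIDGE_CHAIN138}"
BRIDGE10="${CCIPWETH10_BRIDGE_CHAIN138:?set CCIPWETH10_BRIDGE_CHAIN138}"

# Monitor router events
# NOTE: the signature must match the deployed ABI exactly. It lists one type
# per MessageSent parameter documented above; the tokenAmounts element type
# (address,uint256) is an assumption, so confirm it against the router ABI.
cast logs --from-block latest \
  --address "$ROUTER" \
  --rpc-url "$RPC_URL" \
  "MessageSent(bytes32,uint64,address,bytes,bytes,(address,uint256)[],address,bytes)"

# Monitor bridge events on both the WETH9 and WETH10 bridges
for BRIDGE in "$BRIDGE9" "$BRIDGE10"; do
  cast logs --from-block latest \
    --address "$BRIDGE" \
    --rpc-url "$RPC_URL" \
    "CrossChainTransferInitiated(bytes32,address,uint64,address,uint256,uint256)"
done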

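The script above queries each filter once, so it only sees logs that exist when it runs. For continuous monitoring, one approach is to poll in bounded block ranges and remember the last block processed. The sketch below is an illustration under assumptions: `next_range` and `poll_events` are hypothetical helper names, and the 1000-block span and 5-second sleep are arbitrary values, not recommendations from this guide.

```shell
#!/bin/sh
# next_range LAST_PROCESSED TIP MAX_SPAN
# Prints "FROM TO" for the next query, or nothing when already caught up.
next_range() {
  nr_from=$(($1 + 1))
  [ "$nr_from" -le "$2" ] || return 0
  nr_to=$((nr_from + $3 - 1))
  [ "$nr_to" -le "$2" ] || nr_to="$2"
  echo "$nr_from $nr_to"
}

# poll_events ADDRESS SIGNATURE START_BLOCK
# Loops forever. Requires Foundry's cast and RPC_URL in the environment.
poll_events() {
  pe_addr="$1"; pe_sig="$2"; pe_last="$3"
  while :; do
    pe_tip=$(cast block-number --rpc-url "$RPC_URL") || { sleep 5; continue; }
    pe_range=$(next_range "$pe_last" "$pe_tip" 1000)
    if [ -n "$pe_range" ]; then
      pe_from=${pe_range% *}
      pe_to=${pe_range#* }
      # Advance the cursor only when the query succeeded.
      cast logs --from-block "$pe_from" --to-block "$pe_to" \
        --address "$pe_addr" --rpc-url "$RPC_URL" "$pe_sig" \
        && pe_last="$pe_to"
    fi
    sleep 5
  done
}
```

Capping the range keeps each request small, which matters on RPC endpoints that limit how many blocks a single log query may span.
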
Option 2: Prometheus Integration

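Set up Prometheus to scrape CCIP metrics. Prometheus only pulls from HTTP endpoints, so the target below must already expose the metrics at /metrics; on-chain events are not exported automatically: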

# prometheus.yml
scrape_configs:
  - job_name: 'ccip-router'
    static_configs:
      - targets: ['localhost:9545']
    metrics_path: '/metrics'
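
The scrape config above presumes that something serves CCIP counters in the Prometheus text exposition format. As a rough sketch of producing that format from a script: the metric names (`ccip_messages_sent_total` and so on), the `chain_id` label, and the helper names are all hypothetical, and how the resulting file is served (for example through node_exporter's textfile collector) is a deployment choice this guide does not specify.

```shell
#!/bin/sh
# render_metric NAME TYPE HELP VALUE
# Prints one metric in Prometheus text exposition format.
render_metric() {
  printf '# HELP %s %s\n# TYPE %s %s\n%s{chain_id="138"} %s\n' \
    "$1" "$3" "$1" "$2" "$1" "$4"
}

# write_metrics OUT_FILE SENT RECEIVED FAILED
# Writes all counters to a temp file, then renames it into place so a
# reader never sees a half-written file.
write_metrics() {
  wm_out="$1"
  wm_tmp="$wm_out.$$"
  {
    render_metric ccip_messages_sent_total counter "MessageSent events observed" "$2"
    render_metric ccip_messages_received_total counter "MessageReceived events observed" "$3"
    render_metric ccip_messages_failed_total counter "Messages that failed" "$4"
  } > "$wm_tmp" && mv "$wm_tmp" "$wm_out"
}

# Example:
# write_metrics /var/lib/node_exporter/ccip.prom 42 40 2
```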

Option 3: Grafana Dashboards

Create dashboards for:

  • Message volume over time
  • Transfer amounts and counts
  • Fee collection
  • Success/failure rates
  • Destination chain distribution

Alerting Rules

Critical Alerts

  1. Router Down: Router contract becomes unresponsive
  2. Bridge Failure: Bridge fails to process transfers
  3. High Failure Rate: >10% message/transfer failures
  4. Configuration Change: Unauthorized configuration changes

Warning Alerts

  1. High Volume: Unusual message/transfer volume
  2. Fee Anomaly: Unusual fee collection patterns
  3. Slow Processing: Messages taking longer than expected

Logging

  • INFO: Normal operations (messages sent/received, transfers)
  • WARN: Recoverable errors, configuration changes
  • ERROR: Failed operations, contract errors
  • DEBUG: Detailed operation logs (for troubleshooting)

Log Retention

  • Event Logs: Retain for 90 days
  • Error Logs: Retain for 180 days
  • Audit Logs: Retain for 1 year

Health Checks

Router Health Check

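# Check router is responsive (the "(address)" suffix decodes the return value)
cast call "$ROUTER" "feeToken()(address)" --rpc-url "$RPC_URL"

# Check supported chains (5009297550715157269 is the CCIP chain selector for Ethereum mainnet)
cast call "$ROUTER" "supportedChains(uint64)" "5009297550715157269" --rpc-url "$RPC_URL"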

Bridge Health Check

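# Check bridge router connection (the "(address)" suffix decodes the return value)
cast call "$BRIDGE9" "ccipRouter()(address)" --rpc-url "$RPC_URL"

# Check destinations (5009297550715157269 is the CCIP chain selector for Ethereum mainnet)
cast call "$BRIDGE9" "destinations(uint64)" "5009297550715157269" --rpc-url "$RPC_URL"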
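
The individual checks above can be combined into one command whose exit status an alerting tool or cron job can act on. This is a minimal sketch: `check` and `run_health_checks` are hypothetical helper names, and the script expects `ROUTER`, `BRIDGE9` and `RPC_URL` to be set as in the event logging script.

```shell
#!/bin/sh
# check NAME COMMAND [ARGS...]
# Runs the command, prints an OK or FAIL line, and returns the command status.
check() {
  ck_name="$1"; shift
  if ck_out=$("$@" 2>&1); then
    echo "OK   $ck_name"
  else
    echo "FAIL $ck_name: $ck_out"
    return 1
  fi
}

# run_health_checks
# Runs every check even if an earlier one fails; returns non-zero on any failure.
run_health_checks() {
  hc_failures=0
  check "router feeToken" \
    cast call "$ROUTER" "feeToken()" --rpc-url "$RPC_URL" \
    || hc_failures=$((hc_failures + 1))
  check "bridge ccipRouter" \
    cast call "$BRIDGE9" "ccipRouter()" --rpc-url "$RPC_URL" \
    || hc_failures=$((hc_failures + 1))
  [ "$hc_failures" -eq 0 ]
}

# Example:
# run_health_checks || echo "CCIP health check failed"
```

Running all checks before returning gives the on-call engineer the full picture in one pass instead of stopping at the first failure.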

Performance Metrics

Key Performance Indicators (KPIs)

  1. Message Throughput: Messages per second
  2. Transfer Throughput: Transfers per hour
  3. Average Latency: Time from send to receive
  4. Success Rate: Percentage of successful operations
  5. Fee Efficiency: Average fee per operation

Target Metrics

  • Message success rate: >99%
  • Average latency: <5 minutes
  • Transfer success rate: >99.5%
  • System uptime: >99.9%
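
The latency and success-rate targets above can be checked mechanically. The sketch below assumes each message has been reduced to a `sent_timestamp received_timestamp` pair in Unix seconds, one pair per line on standard input; that input format and the function names are assumptions for illustration.

```shell
#!/bin/sh
# avg_latency_seconds
# Reads "sent_ts received_ts" pairs from stdin and prints the mean latency
# in whole seconds (0 when there is no input).
avg_latency_seconds() {
  awk '
    { total += $2 - $1; n++ }
    END {
      if (n > 0) { printf "%.0f\n", total / n }
      else       { print 0 }
    }'
}

# meets_targets SUCCESS_PCT MIN_SUCCESS_PCT AVG_LATENCY_S MAX_LATENCY_S
# Returns 0 (true) when success is above the minimum and latency is below the maximum.
meets_targets() {
  awk -v s="$1" -v min_s="$2" -v l="$3" -v max_l="$4" \
    'BEGIN { exit !(s > min_s && l < max_l) }'
}

# Example: message targets from this guide (>99% success, <5 minutes latency).
# latency=$(avg_latency_seconds < latencies.txt)
# meets_targets 99.6 99 "$latency" 300 || echo "KPI target missed"
```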

Incident Response

Escalation Procedures

  1. Level 1: Automated alerts → On-call engineer
  2. Level 2: Critical failures → Team lead
  3. Level 3: System-wide issues → CTO/Management

Response Playbook

  1. Router Failure:

    • Check contract status
    • Verify RPC connectivity
    • Review recent transactions
    • Check for configuration changes
  2. Bridge Failure:

    • Verify router connectivity
    • Check destination configuration
    • Review transfer logs
    • Verify fee payment
  3. High Failure Rate:

    • Analyze failure patterns
    • Check network conditions
    • Review recent changes
    • Escalate if needed

Monitoring Tools

  • Prometheus: Metrics collection
  • Grafana: Visualization and dashboards
  • ELK Stack: Log aggregation
  • PagerDuty: Alerting and on-call
  • Custom Scripts: Event monitoring


Last Updated: 2025-01-27