# CCIP Relay Service

Off-chain relay for forwarding Chain 138 `MessageSent` events to destination relay routers/bridges.

## Current Topology

Source (Chain 138) — match `.env.bsc` / operator deploy:
- Router: `0x42DAb7b888Dd382bD5Adcf9E038dBF1fD03b4817`
- WETH9 bridge: `0xcacfd227A040002e49e2e01626363071324f820a`

Destinations:
- BSC relay router: `0x4d9Bc6c74ba65E37c4139F0aEC9fc5Ddff28Dcc4`
- BSC relay bridge: `0x886C6A4ABC064dbf74E7caEc460b7eeC31F1b78C`
- AVAX relay router: `0x2a0023Ad5ce1Ac6072B454575996DfFb1BB11b16`
- AVAX relay bridge: `0x3f8C409C6072a2B6a4Ff17071927bA70F80c725F`

Direct first-hop support from Chain 138 is intentionally narrow today:
- Mainnet: supported with the default `.env` profile
- BSC: supported with `.env.bsc`
- Avalanche: supported with `.env.avax`
- Gnosis / Cronos / Celo / Polygon / Arbitrum / Optimism / Base: treat as `via Mainnet hub` unless a dedicated relay router + relay profile are added and proven live

Important: on 2026-04-04, a direct `138 -> Arbitrum` WETH send produced a real source `MessageSent` event but no destination delivery because the live relay worker was running a Mainnet-only destination profile. There is currently no tracked `.env.arbitrum` profile in this folder.

## Env Profiles

Use the prebuilt env files in this folder:
- `.env.mainnet-cw` — Chain 138 cW → **Ethereum mainnet** (`CW_BRIDGE_MAINNET`)
- `.env.mainnet-weth` — WETH lane to mainnet
- `.env.bsc` (template: `.env.bsc.example`)
- `.env.bsc-cw` — Chain 138 cW → **BSC** (`CWMultiTokenBridgeL2` @ `0x0909Fc58…`)
- `.env.avax-cw` — cW → Avalanche
- `.env.avax` — WETH → Avalanche
- `.env` (default/fallback)
- `.env.local` — **only** when running without a named profile, or set `RELAY_ALLOW_ENV_LOCAL=1`

**Pre-flight (required before restart):**

```bash
./scripts/verify/validate-relay-profiles.sh
./scripts/verify/diagnose-cw-mesh-ccip-relay.sh   # mainnet cW lane + balances
```

Named profiles **do not** load `.env.local` (prevents mainnet router + Avalanche RPC mismatch).

Each profile sets destination RPC, selector, relay router/bridge, and destination WETH token.

### `START_BLOCK` after catch-up

When historical `MessageSent` logs are fully relayed, set **`START_BLOCK=latest`** in `.env.bsc` (or your profile) so a cold start only scans from **~current head − 1** instead of re-queuing the whole backfill range. To replay from an old height again, set an explicit decimal block (e.g. `3012930`) and restart.
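The two `START_BLOCK` modes can be sketched as a small resolver. This is an illustration only: `resolve_start_block` is a hypothetical helper, not a function from the relay, and the relay's own handling may differ in detail.

```bash
# Hypothetical helper (not part of the relay): resolve START_BLOCK to the
# first source block to scan.
# Usage: resolve_start_block <START_BLOCK value> <current head block>
resolve_start_block() {
  local start_block="$1" head="$2"
  case "$start_block" in
    latest)
      # Cold start: scan from roughly head - 1, never below block 0.
      if [ "$head" -gt 0 ]; then echo $(( head - 1 )); else echo 0; fi
      ;;
    ''|*[!0-9]*)
      # Anything other than "latest" or a plain decimal is rejected.
      echo "invalid START_BLOCK: $start_block" >&2
      return 1
      ;;
    *)
      # Explicit decimal height: replay from that block.
      echo "$start_block"
      ;;
  esac
}

resolve_start_block latest 5623000    # prints 5622999
resolve_start_block 3012930 5623000   # prints 3012930
```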

**BSC RPC:** Prefer a node that accepts short `eth_getLogs` windows (e.g. `https://bsc.publicnode.com`). Some Binance seeds return `-32005` for log queries the relay uses for destination checks.

### Fund BSC relay bridge (WETH)

From repo root (loads `smom-dbis-138/.env` and relay `.env.bsc` for addresses):

```bash
./scripts/bridge/fund-bsc-relay-bridge.sh --dry-run
./scripts/bridge/fund-bsc-relay-bridge.sh          # send the deployer's full WETH balance to the bridge
# ./scripts/bridge/fund-bsc-relay-bridge.sh 1000000000000000  # explicit amount in wei (0.001 WETH)
```

Wrap BNB to WETH on the deployer first (`cast send <WETH> "deposit()" --value ...` on BSC) if needed.

### Fund Mainnet relay bridge (WETH)

From repo root:

```bash
./scripts/bridge/fund-mainnet-relay-bridge.sh --dry-run
./scripts/bridge/fund-mainnet-relay-bridge.sh          # send the deployer's full WETH balance to the bridge
# ./scripts/bridge/fund-mainnet-relay-bridge.sh 1000000000000000  # explicit amount in wei (0.001 WETH)
```

## Destination tx confirmation timeout

| Env | Default | Purpose |
|-----|---------|---------|
| `RELAY_TX_CONFIRM_TIMEOUT_MS` | `180000` (3 min) | Max wait for `tx.wait()` on destination relay txs (e.g. mainnet). On timeout, the message is retried instead of blocking the queue processor indefinitely. |

## Relay shedding (save destination gas)

When **no** 138→Mainnet (or configured destination) relay deliveries are needed, pause **destination-chain** transactions so the relayer does not spend native gas on `relayMessage` / direct `ccipReceive`:

| Variable | Meaning |
|----------|---------|
| `RELAY_SHEDDING=1` | **On** — shedding active (`true` / `yes` / `on` also work). |
| `RELAY_DELIVERY_ENABLED=0` | Same as shedding on (`false` / `no` / `off`). |
| `RELAY_SHEDDING_SOURCE_POLL_INTERVAL_MS` | Source router log poll interval while shedding (default **60000** ms, min 5000). Reduces Chain 138 RPC usage. |
| `RELAY_SHEDDING_QUEUE_POLL_MS` | Idle interval for the queue loop while shedding (default **5000** ms, min 1000). |

**Behavior:** Source `MessageSent` logs are still ingested and messages queue locally. Pending queue state is now persisted to `services/relay/data/queue-state.json` by default (override with `RELAY_QUEUE_STATE_PATH`), so a restart no longer drops queued work. For production, still plan shedding around low bridge traffic so the persisted backlog stays small and intentional.
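The flag semantics in the table can be sketched in shell as follows. `is_shedding` and `clamp_interval_ms` are hypothetical names for illustration, not the relay's own code; treating the values case-insensitively is an assumption.

```bash
# Hypothetical helpers mirroring the shedding table; not the relay's own code.

# Succeeds (exit 0) when either variable requests shedding.
# Assumption: values are matched case-insensitively.
is_shedding() {
  case "$(printf '%s' "${RELAY_SHEDDING:-}" | tr '[:upper:]' '[:lower:]')" in
    1|true|yes|on) return 0 ;;
  esac
  case "$(printf '%s' "${RELAY_DELIVERY_ENABLED:-}" | tr '[:upper:]' '[:lower:]')" in
    0|false|no|off) return 0 ;;
  esac
  return 1
}

# Prints an interval in ms: the configured value, or the default when the
# value is unset or not a plain decimal, raised to the minimum if too small.
# Usage: clamp_interval_ms <value> <default> <min>
clamp_interval_ms() {
  local value="$1" default="$2" min="$3"
  case "$value" in
    ''|*[!0-9]*) value="$default" ;;
  esac
  if [ "$value" -lt "$min" ]; then value="$min"; fi
  echo "$value"
}

RELAY_SHEDDING=on
if is_shedding; then
  echo "shedding active, source poll every $(clamp_interval_ms "${RELAY_SHEDDING_SOURCE_POLL_INTERVAL_MS:-}" 60000 5000) ms"
fi
```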

## Skip specific message IDs

Use `RELAY_SKIP_MESSAGE_IDS` as a comma-separated list of source `MessageSent.messageId` values that the relay should intentionally ignore.

This is the safest operational way to park an already-confirmed source message when:
- destination relay inventory is below the requested release amount
- you do not want the relay to keep retrying it after service restarts
- there is no on-chain cancel / refund path on the source bridge

Example:

```bash
RELAY_SKIP_MESSAGE_IDS=0xf718c9895c0a5442349996383184d017d2fa041af7aaeb9f0c0675d3ceed756b
```

The relay checks this list during live event ingestion, historical replay, and queue processing.
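As a rough illustration of that membership check, the sketch below tests a message ID against the comma-separated list. `should_skip_message` is a hypothetical helper; ignoring whitespace and letter case is an assumption, and the relay's own matching rules may be stricter.

```bash
# Hypothetical helper: succeeds when the given message ID appears in
# RELAY_SKIP_MESSAGE_IDS (comma-separated).
# Assumption: whitespace and hex letter case are ignored when comparing.
should_skip_message() {
  local id list
  id=$(printf '%s' "$1" | tr -d '[:space:]' | tr '[:upper:]' '[:lower:]')
  list=$(printf '%s' "${RELAY_SKIP_MESSAGE_IDS:-}" | tr -d '[:space:]' | tr '[:upper:]' '[:lower:]')
  # An empty ID never matches, even against an empty list.
  [ -n "$id" ] || return 1
  # Wrap both sides in commas so only whole entries match.
  case ",$list," in
    *",$id,"*) return 0 ;;
  esac
  return 1
}

RELAY_SKIP_MESSAGE_IDS=0xf718c9895c0a5442349996383184d017d2fa041af7aaeb9f0c0675d3ceed756b
if should_skip_message 0xf718c9895c0a5442349996383184d017d2fa041af7aaeb9f0c0675d3ceed756b; then
  echo "message is parked, not relaying"
fi
```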

For the current Mainnet WETH backlog policy, see:

- [`docs/03-deployment/MAINNET_WETH_RELAY_BACKLOG_POLICY.md`](../../../docs/03-deployment/MAINNET_WETH_RELAY_BACKLOG_POLICY.md)

### On-chain pause (`CCIPRelayRouter`)

The destination **CCIPRelayRouter** inherits OpenZeppelin **`Pausable`**: admins with `DEFAULT_ADMIN_ROLE` may call **`pause()`** / **`unpause()`**. While paused, **`relayMessage` reverts** (no delivery through the router).

**Relay service:** Before sending `relayMessage`, the worker calls **`paused()`** on the destination router (router mode only). If paused, it **re-queues** the message and waits 15s instead of broadcasting a reverting tx. Older routers without `paused()` skip this check (call errors are logged at debug).

**Important:** If you `pause()` the router but leave the relay **process** running **without** `RELAY_SHEDDING=1`, failed txs are much less likely thanks to the check above, but off-chain activity (source polling, queue growth) still runs. Prefer **`RELAY_SHEDDING=1`** (or stop the service) whenever the router is paused for an extended period.

**Direct-delivery** mode (`DEST_DELIVERY_MODE=direct`) calls the bridge’s `ccipReceive` directly and **does not** go through the router, so pausing the router alone does not stop that path; use shedding or revoke `ROUTER_ROLE` on the bridge as appropriate.
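An operator can approximate the worker's `paused()` pre-check from a shell. The sketch below assumes Foundry's `cast` is on `PATH`; `router_pause_state` is a hypothetical helper name, and mapping a failed call to `unknown` mirrors the "older routers without `paused()`" case only loosely.

```bash
# Hypothetical operator helper; requires Foundry's `cast` on PATH.
# Prints "paused", "active", or "unknown" (call failed, e.g. an older
# router without paused(), or an unreachable RPC).
# Usage: router_pause_state <router address> <rpc url>
router_pause_state() {
  local router="$1" rpc_url="$2" out
  if ! out=$(cast call "$router" "paused()(bool)" --rpc-url "$rpc_url" 2>/dev/null); then
    echo "unknown"
    return 0
  fi
  case "$out" in
    true)  echo "paused" ;;
    false) echo "active" ;;
    *)     echo "unknown" ;;
  esac
}

# Example, using the BSC relay router from Current Topology:
# router_pause_state 0x4d9Bc6c74ba65E37c4139F0aEC9fc5Ddff28Dcc4 https://bsc.publicnode.com
```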

## Start Relay

```bash
cd /home/intlc/projects/proxmox/smom-dbis-138/services/relay
npm install

# BSC relay profile
./start-relay.sh bsc

# AVAX relay profile
./start-relay.sh avax

# Default profile
./start-relay.sh
```

`start-relay.sh` loads env in this order:
1. `.env.<profile>` (if a profile argument is provided)
2. `.env.local` (only when no profile is given, or when `RELAY_ALLOW_ENV_LOCAL=1`)
3. `.env`

If the parent project's `.env` defines `PRIVATE_KEY`, `${PRIVATE_KEY}` references in relay env files are expanded.
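A minimal sketch of the file selection, assuming the rule from the Env Profiles section that named profiles skip `.env.local` unless `RELAY_ALLOW_ENV_LOCAL=1`. `env_files_for_profile` is a hypothetical name and `start-relay.sh` may implement this differently; the function only prints file names and does not source anything.

```bash
# Hypothetical sketch; start-relay.sh itself may differ.
# Prints, one per line and in load order, the env files considered for the
# given profile argument (empty or omitted means the default profile).
env_files_for_profile() {
  local profile="${1:-}"
  if [ -n "$profile" ]; then
    echo ".env.$profile"
    # Named profiles skip .env.local unless explicitly allowed.
    if [ "${RELAY_ALLOW_ENV_LOCAL:-0}" = "1" ]; then
      echo ".env.local"
    fi
  else
    echo ".env.local"
  fi
  echo ".env"
}

env_files_for_profile bsc
```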

## Relay Health Endpoint

The relay now exposes a lightweight JSON status endpoint for explorer / mission-control monitoring.

- Default listen address: `0.0.0.0`
- Default port: `9860`
- Endpoints: `GET /healthz`, `GET /health`, `GET /status`
- Health payload includes `concurrency.active_relay_tasks`, `concurrency.max_concurrent`, and `queue.in_flight`

Optional env overrides:

```bash
RELAY_HEALTH_ENABLED=1
RELAY_HEALTH_HOST=0.0.0.0
RELAY_HEALTH_PORT=9860
```

### Fleet health monitor (all lanes)

From repo root:

```bash
pnpm relay:monitor-health
RELAY_MONITOR_STRICT=1 pnpm relay:monitor-health   # exit 1 on alerts
pnpm relay:check-eth                              # relayer ETH balance on mainnet (min 0.05 ETH)
pnpm relay:audit-env                              # START_BLOCK / shedding / concurrency audit
```

Endpoint registry: `config/ccip-relay-health-endpoints.v1.json`

### Unstick stuck messages (mainnet-cw / bsc-cw)

```bash
# Dry-run
./scripts/deployment/unstick-ccip-relay-profile.sh --profile mainnet-cw --start-block 5623000

# Execute: stop, scrub failedIds, replay, drain, reset START_BLOCK=latest
./scripts/deployment/unstick-ccip-relay-profile.sh --profile mainnet-cw --start-block 5623000 --execute
```

## Throughput and RPC optimization

| Variable | Default | Purpose |
|----------|---------|---------|
| `RELAY_MAX_CONCURRENT` | `1` | Parallel queue workers (1–12). Mainnet cW profile uses `3`; BSC cW uses `2`. |
| `DEST_RPC_URL_POOL` | — | Comma-separated destination RPC URLs for read calls (`processed`, inventory probes). Round-robin with failover. |
| `RELAY_DEST_SUBMIT_RPC_URL` | — | Dedicated RPC for **submitting** relay txs (overrides pool for broadcasts). |
| `BLINK_RPC_URL` / `MEV_BLOCKER_RPC_URL` | — | If set in parent `.env`, used as submit RPC when `RELAY_DEST_SUBMIT_RPC_URL` is unset. |

**Behavior:** Each concurrent worker pulls the next queue message, uses `NonceManager` for ordered destination txs, and shares the same retry / shedding rules. Read probes (`isDeliveredOnDestination`, bridge inventory) use the RPC pool; writes use the submit URL when configured.
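The read/write RPC split can be sketched as two small selectors. Both function names are hypothetical. Checking `BLINK_RPC_URL` before `MEV_BLOCKER_RPC_URL` is an assumption, as is leaving the "no submit URL configured" fallback to the caller.

```bash
# Hypothetical sketch; variable names come from the table, everything else
# is illustrative.

# Prints the RPC URL used for broadcasting relay txs.
# Assumption: BLINK_RPC_URL is checked before MEV_BLOCKER_RPC_URL.
# Returns 1 when none is set, so the caller can fall back to its default RPC.
pick_submit_rpc() {
  local url
  for url in "${RELAY_DEST_SUBMIT_RPC_URL:-}" "${BLINK_RPC_URL:-}" "${MEV_BLOCKER_RPC_URL:-}"; do
    if [ -n "$url" ]; then
      echo "$url"
      return 0
    fi
  done
  return 1
}

# Prints the read RPC for attempt N (0-based) from the comma-separated
# DEST_RPC_URL_POOL, wrapping around. A caller gets failover by retrying
# with N+1 after an error.
pool_rpc_for_attempt() {
  local attempt="$1" count
  count=$(printf '%s\n' "${DEST_RPC_URL_POOL:-}" | tr ',' '\n' | grep -c . || true)
  [ "$count" -gt 0 ] || return 1
  printf '%s\n' "$DEST_RPC_URL_POOL" | tr ',' '\n' | grep . | sed -n "$(( attempt % count + 1 ))p"
}
```

With a two-entry pool, attempts 0, 1, 2 map to the first, second, and first URL again.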

Idle lanes (Avalanche WETH/cW, Avax→138) set `RELAY_SHEDDING=1` and a slower `POLL_INTERVAL` to reduce RPC and gas spend until traffic resumes.

Example health-endpoint query from another LAN host:

```bash
curl http://192.168.11.11:9860/healthz | jq .
```

Example explorer backend wiring:

```bash
CCIP_RELAY_HEALTH_URL=http://192.168.11.11:9860/healthz
CCIP_RELAY_HEALTH_URLS=mainnet-weth=http://192.168.11.11:9860/healthz,mainnet-cw=http://192.168.11.11:9863/healthz,bsc=http://192.168.11.11:9861/healthz,avax=http://192.168.11.11:9862/healthz
```
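The same `lane=url` list format can drive a quick manual sweep of every lane. This is a sketch only: `check_relay_health_urls` is a hypothetical helper (the repo's own `pnpm relay:monitor-health` is the supported monitor), and it checks only that each endpoint answers with an HTTP success status.

```bash
# Hypothetical helper; requires curl. Takes a comma-separated list of
# lane=url pairs and prints "<lane> ok" or "<lane> DOWN" per entry.
check_relay_health_urls() {
  printf '%s\n' "$1" | tr ',' '\n' | while IFS='=' read -r lane url; do
    # Skip empty entries (e.g. a trailing comma).
    [ -n "$lane" ] || continue
    # -f: treat HTTP errors as failures; --max-time: do not hang on dead hosts.
    if curl -fsS --max-time 5 "$url" >/dev/null 2>&1; then
      echo "$lane ok"
    else
      echo "$lane DOWN"
    fi
  done
}

# Example:
# check_relay_health_urls "$CCIP_RELAY_HEALTH_URLS"
```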

Recommended health-endpoint ports (`RELAY_HEALTH_PORT`) when running multiple relay workers under systemd on the same host:

- Mainnet WETH (default `.env`): `9860`
- Mainnet cW (`ccip-relay-mainnet-cw.service`): `9863`
- BSC WETH: `9861`
- BSC cW (`ccip-relay-bsc-cw.service` on r630-04): `9867`
- Avalanche: `9862`

### BSC profile (`start-relay.sh bsc`)

- **Source:** Chain 138 public RPC (`RPC_URL_138` in `.env.bsc`).
- **Destination:** `BSC_RELAY_RPC_URL` in `smom-dbis-138/.env` (Infura BSC; defaults to `BSC_MAINNET_RPC` / `BSC_RPC_URL`).
- **Upstream (not used for relay txs):** `BSC_RPC_URL` / Infura — for operator `cast` and health cross-checks.
- Sync + restart on r630-01: `../../../../scripts/deployment/sync-ccip-relay-bsc-r630-01.sh`
- Verify: `../../../../scripts/verify/check-bsc-relay-rpc.sh`

## Critical Requirements

- Relayer key must hold native gas on destination chain.
- Destination relay bridge must hold enough WETH for payouts.
- Explicit profile token overrides like `DEST_WETH9_ADDRESS` win over the generic multichain token map. This keeps relay-backed destinations pointed at their bridge-managed wrapped token instead of a public native wrapped asset.
- Source bridge destination mapping must point to the correct destination relay bridge.
- Source router `feeToken()` must be a deployed ERC20 with sufficient deployer balance.

## Fast Status Checks

Check source destination mappings:
```bash
cast call 0xcacfd227A040002e49e2e01626363071324f820a "destinations(uint64)" 11344663589394136015 --rpc-url https://rpc.public-0138.defi-oracle.io
cast call 0xcacfd227A040002e49e2e01626363071324f820a "destinations(uint64)" 6433500567565415381 --rpc-url https://rpc.public-0138.defi-oracle.io
```

Check message settlement:
```bash
cast call 0x886C6A4ABC064dbf74E7caEc460b7eeC31F1b78C "processedTransfers(bytes32)(bool)" <bsc_message_id> --rpc-url https://bsc.publicnode.com
cast call 0x3f8C409C6072a2B6a4Ff17071927bA70F80c725F "processedTransfers(bytes32)(bool)" <avax_message_id> --rpc-url https://avalanche-c-chain.publicnode.com
```

Check destination bridge liquidity:
```bash
cast call <dest_weth> "balanceOf(address)(uint256)" <dest_relay_bridge> --rpc-url <dest_rpc>
```