Operations Runbook - Complete System
Date: Operations Runbook
Status: ✅ COMPLETE
Overview
This runbook provides operational procedures for:
- Vault System Operations
- ISO-4217 W Token System Operations
- Bridge System Operations
- Emergency Procedures
1. Daily Operations
1.1 Vault System Monitoring
Health Check
# Check vault health ratios
cast call $LEDGER_ADDRESS "getVaultHealth(address)" $VAULT_ADDRESS --rpc-url $RPC_URL
# Check total collateral
cast call $LEDGER_ADDRESS "totalCollateral(address)" $ASSET_ADDRESS --rpc-url $RPC_URL
# Check total debt
cast call $LEDGER_ADDRESS "totalDebt(address)" $CURRENCY_ADDRESS --rpc-url $RPC_URL
Alert Thresholds
- Health Ratio < 120%: Warning alert
- Health Ratio < 110%: Critical alert (liquidation threshold)
- Debt Ceiling Utilization > 90%: Warning alert
- Oracle Staleness > 1 hour: Critical alert
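The vault thresholds above can be applied mechanically once a health ratio is available. This is a minimal sketch. It assumes the health ratio is collateral value divided by debt value, expressed as a percentage, and that both values are already priced in the same unit. `health_ratio_pct` and `classify_health` are hypothetical helpers, not functions of the ledger contract, which may define health differently.

```python
from decimal import Decimal

WARNING_HEALTH_PCT = Decimal("120")   # Health Ratio < 120%: warning
CRITICAL_HEALTH_PCT = Decimal("110")  # Health Ratio < 110%: critical (liquidation threshold)


def health_ratio_pct(collateral_value: Decimal, debt_value: Decimal) -> Decimal:
    """Assumed definition: collateral value / debt value, as a percentage."""
    if debt_value == 0:
        raise ValueError("vault has no debt; health ratio is undefined")
    return collateral_value / debt_value * 100


def classify_health(ratio_pct: Decimal) -> str:
    """Map a health ratio to the runbook's alert level."""
    if ratio_pct < CRITICAL_HEALTH_PCT:
        return "critical"
    if ratio_pct < WARNING_HEALTH_PCT:
        return "warning"
    return "ok"


if __name__ == "__main__":
    ratio = health_ratio_pct(Decimal("11500"), Decimal("10000"))
    print(ratio, classify_health(ratio))
```

For example, a vault with 11,500 of collateral value against 10,000 of debt has a 115% ratio and falls in the warning band.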
1.2 ISO-4217 W Token Monitoring
Reserve Verification
# Check reserve sufficiency for USDW
cast call $USDW_ADDRESS "isReserveSufficient()" --rpc-url $RPC_URL
# Get reserve balance
cast call $USDW_ADDRESS "verifiedReserve()" --rpc-url $RPC_URL
# Get total supply
cast call $USDW_ADDRESS "totalSupply()" --rpc-url $RPC_URL
# Calculate reserve ratio
# Reserve Ratio = (verifiedReserve / totalSupply) * 100
Daily Reserve Check
1. Check Reserve Oracle Reports
   cast call $RESERVE_ORACLE "getVerifiedReserve(address)" $USDW_ADDRESS --rpc-url $RPC_URL
2. Verify Quorum
   cast call $RESERVE_ORACLE "isQuorumMet(address)" $USDW_ADDRESS --rpc-url $RPC_URL
3. Check for Stale Reports
   - Reports older than 1 hour should be removed
   - If quorum not met, investigate oracle issues
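You can dry-run the stale-report and quorum checks off-chain before acting on-chain. This sketch makes several assumptions. Each report is assumed to carry a reporter address, a reserve amount and a Unix timestamp. Quorum is taken to be a simple count of distinct fresh reporters, with the median of their latest reports used as the consensus value. The `Report` type, the `QUORUM` value and the median rule are all illustrative and do not describe the ReserveOracle's actual logic.

```python
import statistics
import time
from dataclasses import dataclass
from typing import Optional

MAX_REPORT_AGE_S = 3600  # reports older than 1 hour are treated as stale
QUORUM = 3               # hypothetical minimum number of distinct fresh reporters


@dataclass(frozen=True)
class Report:
    reporter: str
    reserve: int     # reserve amount in token base units
    timestamp: int   # Unix seconds


def fresh_reports(reports: list[Report], now: int) -> list[Report]:
    """Drop reports older than MAX_REPORT_AGE_S."""
    return [r for r in reports if now - r.timestamp <= MAX_REPORT_AGE_S]


def consensus_reserve(reports: list[Report], now: Optional[int] = None) -> Optional[int]:
    """Return the median of fresh reports, or None if quorum is not met."""
    now = int(time.time()) if now is None else now
    # Count at most one report per reporter, keeping the most recent one
    latest: dict[str, Report] = {}
    for r in fresh_reports(reports, now):
        if r.reporter not in latest or r.timestamp > latest[r.reporter].timestamp:
            latest[r.reporter] = r
    if len(latest) < QUORUM:
        return None  # quorum not met: investigate oracle issues
    return int(statistics.median(r.reserve for r in latest.values()))
```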
Alert Thresholds
- Reserve Ratio < 100%: CRITICAL - Minting must halt
- Reserve Ratio < 105%: Warning alert
- Oracle Quorum Not Met: Critical alert
- Stale Reports Detected: Warning alert
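The sketch below applies the reserve-ratio formula and the thresholds above. It assumes that `verifiedReserve()` and `totalSupply()` return integers with the same number of decimals, so the ratio needs no rescaling. It works in basis points with integer arithmetic to avoid floating-point rounding near the 100% boundary. The function names are illustrative.

```python
def reserve_ratio_bps(verified_reserve: int, total_supply: int) -> int:
    """Reserve ratio in basis points: (verifiedReserve / totalSupply) * 10_000, rounded down."""
    if total_supply == 0:
        raise ValueError("total supply is zero; reserve ratio is undefined")
    return verified_reserve * 10_000 // total_supply


def classify_reserve(ratio_bps: int) -> str:
    """Map a reserve ratio to the runbook's alert level."""
    if ratio_bps < 10_000:   # < 100%: CRITICAL, minting must halt
        return "critical"
    if ratio_bps < 10_500:   # < 105%: warning
        return "warning"
    return "ok"
```

Rounding the division down keeps the 100% test exact: the result is below 10,000 bps if and only if the reserve is below the supply.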
1.3 Bridge System Monitoring
Bridge Health Metrics
# Check bridge success rate
# Query bridge events for success/failure counts
# Check settlement times
# Monitor TransferStatusUpdated events
# Check reserve verification failures
# Monitor ReserveVerified events with sufficient=false
Alert Thresholds
- Success Rate < 95%: Warning alert
- Success Rate < 90%: Critical alert
- Settlement Time > 1 hour: Warning alert
- Reserve Verification Failures: Critical alert
- Compliance Violations: Critical alert
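You can compute the bridge metrics above from decoded events. This sketch assumes each transfer has already been reduced, from `TransferStatusUpdated` events, to a final status plus initiated and settled timestamps. It also assumes the number of `ReserveVerified` events with `sufficient=false` is known. The record shape, the status strings, and the choice to alert on the median settlement time are all assumptions.

```python
import statistics
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Transfer:
    transfer_id: str
    status: str                # hypothetical values: "completed" or "failed"
    initiated_at: int          # Unix seconds
    settled_at: Optional[int]  # None if not settled


def bridge_alerts(transfers: list[Transfer], reserve_verification_failures: int = 0) -> list[str]:
    alerts: list[str] = []
    finished = [t for t in transfers if t.status in ("completed", "failed")]
    if finished:
        rate = sum(t.status == "completed" for t in finished) / len(finished)
        if rate < 0.90:
            alerts.append(f"CRITICAL: success rate {rate:.1%}")
        elif rate < 0.95:
            alerts.append(f"WARNING: success rate {rate:.1%}")
    durations = [t.settled_at - t.initiated_at for t in transfers
                 if t.status == "completed" and t.settled_at is not None]
    if durations and statistics.median(durations) > 3600:  # settlement time > 1 hour
        alerts.append("WARNING: median settlement time above 1 hour")
    if reserve_verification_failures > 0:
        alerts.append(f"CRITICAL: {reserve_verification_failures} reserve verification failure(s)")
    return alerts
```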
1.4 Reserve and Stabilization Policies (VAULT_SYSTEM_MASTER_TECHNICAL_PLAN)
The following formulas and checklists are from VAULT_SYSTEM_MASTER_TECHNICAL_PLAN. Use them for sizing and operational verification.
Reserve Sizing Model
- Variables: PeakMinuteOutflow = P, StabilizationWindow = T (minutes).
- Required reserve: Reserve ≥ P × T.
- Recommended safety factor: multiply the P × T minimum by 3–5×.
- Example: P = 10,000, T = 5 min → Reserve ≥ 50,000; with 3× safety → 150,000.
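The sizing model reduces to one multiplication. The sketch below mirrors the worked example above. `required_reserve` is a hypothetical helper, and P is assumed to be measured per minute so that P × T is an amount.

```python
def required_reserve(peak_minute_outflow: float, window_minutes: float,
                     safety_factor: float = 1.0) -> float:
    """Reserve >= P x T, scaled by a safety factor (the plan recommends 3-5x)."""
    if peak_minute_outflow < 0 or window_minutes < 0:
        raise ValueError("outflow and window must be non-negative")
    if safety_factor < 1:
        raise ValueError("a safety factor below 1 would undersize the reserve")
    return peak_minute_outflow * window_minutes * safety_factor
```

For example, `required_reserve(10_000, 5)` gives the 50,000 minimum, and `required_reserve(10_000, 5, 3)` gives 150,000.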
Cantilever Stabilization Model
- Condition: s × f ≥ Δ (s = micro trade size, f = micro trade frequency, Δ = net imbalance per minute).
- Dynamic rule: If deviation > θ, set s = k × deviation (eliminates fixed frequency dependency).
- Use: Choose the size and frequency of stabilization trades so that their throughput offsets the macro flow.
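The sketch below expresses both stabilization rules in code. It assumes s, f and Δ use consistent units: s in tokens per trade, f in trades per minute, and Δ in tokens per minute. It treats k and θ as operator-chosen tuning parameters. It also assumes deviation is handled as a magnitude, with trade direction (buy or sell) decided elsewhere. The function names are hypothetical.

```python
import math


def satisfies_condition(trade_size: float, freq_per_min: float, imbalance_per_min: float) -> bool:
    """Condition s * f >= Δ: stabilization throughput covers the net imbalance."""
    return trade_size * freq_per_min >= imbalance_per_min


def min_frequency(trade_size: float, imbalance_per_min: float) -> int:
    """Smallest whole number of trades per minute f with s * f >= Δ."""
    if trade_size <= 0:
        raise ValueError("trade size must be positive")
    return math.ceil(imbalance_per_min / trade_size)


def dynamic_trade_size(deviation: float, theta: float, k: float) -> float:
    """Dynamic rule: if |deviation| > θ, size each trade as s = k * |deviation|; else no trade."""
    magnitude = abs(deviation)
    return k * magnitude if magnitude > theta else 0.0
```

The dynamic rule scales trade size with the deviation itself, so a larger imbalance is met with larger trades rather than with a higher, fixed trade frequency.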
Bridge Liquidity Buffer
- Rule: BridgeReserve ≥ PeakBridgeOutflow × Latency (where Latency = bridge settlement time).
- Use: Ensure cross-chain bridge buffers satisfy this so outflows do not exhaust reserves during settlement.
Cross-chain parity and bridge buffer
- Objective: Maintain |Price138 − Price651940| < ArbitrageThreshold (see CROSS_CHAIN_ARBITRAGE_DESIGN). Cross-chain private arbitrage bots execute when deviation exceeds threshold; bridge reserve must be sized so outflows do not exhaust reserves during settlement.
- Bridge buffer formula: BridgeReserve ≥ PeakBridgeOutflow × Latency.
- PeakBridgeOutflow: Measure from bridge events (e.g. lock/release or TransferInitiated volume) over a rolling window (e.g. peak hourly or daily outflow in USD or token units). Then express it as a rate in the same time unit as Latency (e.g. outflow per minute when Latency is in minutes) so that the product is an amount.
- Latency: Bridge settlement time (e.g. typical time from lock on the source chain to release on the destination, in minutes or blocks). Use the historical median or P95.
- Sizing steps: (1) Query bridge contract events for initiated/released amounts per time window; (2) compute peak outflow; (3) measure typical settlement latency; (4) set minimum reserve = Peak × Latency; (5) add safety factor (e.g. 1.5–2×) and document in runbook.
- Alerting: If BridgeReserve < PeakBridgeOutflow × Latency (or falls below the safety threshold), trigger a Warning alert. If the reserve is falling and may breach the minimum within one settlement window, escalate to Critical. Integrate with existing monitoring (e.g. Prometheus + PagerDuty once the monitoring stack is deployed). See VAULT_SYSTEM_MASTER_TECHNICAL_PLAN §9.
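The sizing steps and alert rule can be sketched as follows. The sketch assumes outflow samples are amounts per minute and latency samples are settlement times in minutes, so their product is an amount. It uses P95 latency and a 1.5× safety factor, taken from the ranges above. The Critical escalation assumes the reserve's recent rate of change continues linearly for one settlement window. All names are illustrative.

```python
def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile (rank = ceil(0.95 * n), computed with integers)."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def min_bridge_reserve(outflow_per_min: list[float], latency_min: list[float],
                       safety_factor: float = 1.5) -> tuple[float, float]:
    """Return (minimum, safe level): Peak x Latency, and that minimum x safety factor."""
    minimum = max(outflow_per_min) * p95(latency_min)
    return minimum, minimum * safety_factor


def bridge_reserve_alert(reserve: float, minimum: float, safe_level: float,
                         reserve_change_per_min: float, latency_min: float) -> str:
    # Project the reserve one settlement window ahead (linear assumption)
    projected = reserve + reserve_change_per_min * latency_min
    if reserve_change_per_min < 0 and projected < minimum:
        return "critical"  # falling and may breach within one settlement window
    if reserve < safe_level:
        return "warning"   # below the minimum or the safety threshold
    return "ok"
```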
Flash Loan Containment Checklist
- Use TWAP deviation detection (not single-block price).
- Ignore single-block imbalance for stabilizer triggers.
- Require sustained deviation for N blocks before rebalancing.
- Cap per-block stabilization volume to limit flash-driven execution.
- Target: Flash drain recovery <3 blocks (per Master Plan §16).
- On-chain: The Stabilizer (Phase 3 + 6) implements block delay, sustained-deviation buffer, per-block volume cap, and slippage/gas checks; deploy and configure per CONTRACT_DEPLOYMENT_RUNBOOK § Stabilizer.
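The checklist can be expressed as a trigger function. This is an off-chain sketch of the same ideas, not the on-chain Stabilizer's code. It combines a TWAP over a block window, an N-block sustained-deviation requirement and a per-block volume cap. The class and parameter names are assumptions. The "TWAP" here is a plain average of one price per block, which matches a time-weighted average only when blocks are evenly spaced.

```python
from collections import deque


class StabilizerTrigger:
    """Act only on sustained TWAP deviation, with a per-block volume cap."""

    def __init__(self, peg: float, twap_window: int, threshold: float,
                 sustain_blocks: int, max_volume_per_block: float):
        self.peg = peg
        self.prices = deque(maxlen=twap_window)  # one observed price per block
        self.threshold = threshold               # relative deviation, e.g. 0.01 = 1%
        self.sustain_blocks = sustain_blocks     # N blocks of sustained deviation
        self.max_volume = max_volume_per_block
        self.deviating_for = 0

    def on_block(self, price: float, desired_volume: float) -> float:
        """Record this block's price; return the volume allowed to execute (0 if none)."""
        self.prices.append(price)
        twap = sum(self.prices) / len(self.prices)
        deviation = abs(twap - self.peg) / self.peg
        # Count consecutive blocks with TWAP deviation above the threshold
        if deviation > self.threshold:
            self.deviating_for += 1
        else:
            self.deviating_for = 0
        if self.deviating_for < self.sustain_blocks:
            return 0.0
        return min(desired_volume, self.max_volume)
```

With a 10-block window, a one-block spike moves the average by only a tenth of its size, which is how the TWAP dampens single-block imbalance. The N-block counter then filters out deviations that persist for only a few blocks.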
2. Weekly Operations
2.1 Reserve Attestation
Weekly Reserve Report
1. Collect Custodial Balances
   - USDW: Check USD custodial account
   - EURW: Check EUR custodial account
   - GBPW: Check GBP custodial account
2. Submit Oracle Reports
   reserveOracle.submitReserveReport(tokenAddress, reserveBalance, block.timestamp);
3. Verify Consensus
   - Ensure quorum is met
   - Verify consensus matches custodial balance
4. Publish Proof-of-Reserves
   - Generate Merkle tree of reserves
   - Publish on-chain hash
   - Update public dashboard
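The proof-of-reserves step can be sketched as a Merkle root over per-currency reserve entries. Several details here are assumptions for illustration: the leaf encoding (`SYMBOL:balance`), SHA-256 as the hash, the 0x00/0x01 leaf/node prefixes borrowed from RFC 6962, and promoting an odd node unchanged. An on-chain verifier would more likely use keccak256 over ABI-encoded leaves. Python's `hashlib.sha3_256` does not reproduce keccak256, because standardized SHA-3 uses different padding.

```python
import hashlib


def _leaf(data: bytes) -> bytes:
    return hashlib.sha256(b"\x00" + data).digest()  # 0x00 prefix marks a leaf


def _node(left: bytes, right: bytes) -> bytes:
    return hashlib.sha256(b"\x01" + left + right).digest()  # 0x01 prefix marks an inner node


def merkle_root(reserves: dict[str, int]) -> bytes:
    """Merkle root over sorted 'SYMBOL:balance' leaves; an odd node is promoted unchanged."""
    if not reserves:
        raise ValueError("no reserve entries")
    level = [_leaf(f"{sym}:{bal}".encode()) for sym, bal in sorted(reserves.items())]
    while len(level) > 1:
        nxt = [_node(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])
        level = nxt
    return level[0]
```

Publishing `merkle_root({...}).hex()` on-chain commits to the weekly balances. Anyone given the leaf list can recompute the root and compare.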
2.2 System Health Review
Review Metrics
- Total vaults created
- Total collateral locked
- Total debt issued
- W token supply per currency
- Reserve ratios
- Bridge operations count
- Success rates
Generate Report
- Weekly operations report
- Reserve attestation report
- Compliance status report
3. Monthly Operations
3.1 Security Review
Access Control Audit
- Review all role assignments
- Verify principle of least privilege
- Check for unused roles
- Review multi-sig configurations
Compliance Audit
- Verify money multiplier = 1.0 (all W tokens)
- Verify GRU isolation (no GRU conversions)
- Verify ISO-4217 compliance
- Review reserve attestations
Code Review
- Review recent changes
- Check for security updates
- Review dependency updates
- Verify test coverage
3.2 Performance Review
Gas Optimization
- Review gas usage trends
- Identify optimization opportunities
- Test optimization proposals
System Performance
- Review transaction throughput
- Check oracle update frequency
- Review bridge settlement times
- Analyze user patterns
4. Emergency Procedures
4.1 Reserve Shortfall (W Tokens)
Symptoms
- Reserve < Supply for any W token
- Money multiplier > 1.0 (supply exceeds reserve)
- Reserve verification fails
Immediate Actions
1. Halt Minting
   // Revoke MINTER_ROLE from the minter so the mint controller can no longer mint
   mintController.revokeRole(keccak256("MINTER_ROLE"), minterAddress);
2. Alert Team
   - Notify operations team
   - Notify compliance team
   - Prepare public statement
3. Investigate
   - Check custodial account balance
   - Verify oracle reports
   - Check for accounting errors
4. Remediation
   - If accounting error: Correct and resume
   - If actual shortfall: Add reserves or halt operations
   - If oracle issue: Fix oracle and resume
Recovery Steps
- Verify reserve restored
- Re-enable minting
- Resume normal operations
- Post-mortem review
4.2 Vault Liquidation Event
Symptoms
- Vault health ratio < 110%
- Liquidation triggered
Immediate Actions
1. Verify Liquidation
   cast call $LIQUIDATION_ADDRESS "canLiquidate(address)" $VAULT_ADDRESS --rpc-url $RPC_URL
2. Monitor Liquidation
   - Track liquidation events
   - Verify collateral seized
   - Verify debt repaid
3. Post-Liquidation
   - Check remaining vault health
   - Verify system stability
   - Notify vault owner
4.3 Bridge Failure
Symptoms
- Bridge transaction fails
- Settlement timeout
- Reserve verification fails on bridge
Immediate Actions
1. Check Bridge Status
   cast call $BRIDGE_REGISTRY "destinations(uint256)" $CHAIN_ID --rpc-url $RPC_URL
2. Investigate Failure
   - Check transaction logs
   - Verify destination chain status
   - Check reserve verification
3. Initiate Refund (if timeout)
   bridgeEscrowVault.initiateRefund(refundRequest, hsmSigner);
   bridgeEscrowVault.executeRefund(transferId);
4. Resume Operations
   - Fix underlying issue
   - Re-enable bridge route
   - Resume normal operations
4.4 Oracle Failure
Symptoms
- Oracle staleness detected
- Quorum not met
- Price feed failure
Immediate Actions
1. Check Oracle Status
   cast call $XAU_ORACLE "isFrozen()" --rpc-url $RPC_URL
   cast call $RESERVE_ORACLE "isQuorumMet(address)" $TOKEN_ADDRESS --rpc-url $RPC_URL
2. Freeze System (if critical)
   xauOracle.freeze();
   // Pause vault operations if needed
3. Fix Oracle
   - Add new oracle feeds
   - Remove stale reports
   - Restore quorum
4. Resume Operations
   xauOracle.unfreeze();
4.5 Compliance Violation
Symptoms
- Money multiplier > 1.0 detected
- GRU conversion detected
- ISO-4217 violation
Immediate Actions
1. Halt Operations
   - Pause minting
   - Pause bridging
   - Freeze affected tokens
2. Investigate
   - Review transaction history
   - Identify violation source
   - Check compliance guard logs
3. Remediation
   - Fix violation
   - Restore compliance
   - Resume operations
4. Post-Mortem
   - Document violation
   - Update compliance rules
   - Prevent recurrence
5. Incident Response
5.1 Incident Classification
Severity Levels
CRITICAL (P0):
- Reserve < Supply (money multiplier violation)
- System compromise
- Complete system failure
HIGH (P1):
- Reserve ratio < 105%
- Bridge failures > 10%
- Oracle quorum failure
MEDIUM (P2):
- Reserve ratio < 110%
- Bridge failures 5-10%
- Single oracle failure
LOW (P3):
- Minor performance issues
- Non-critical alerts
- Documentation updates
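The metric-based severity bands above overlap; for example, a ratio below 105% is also below 110%. They can therefore be read as "highest matching severity wins". The sketch below makes that precedence explicit. It covers only the measurable triggers. System compromise and complete system failure still have to be classified as P0 by a human. The function and parameter names are hypothetical, and returning P3 for everything else is a simplification.

```python
def classify_incident(reserve_ratio_pct: float, bridge_failure_pct: float,
                      quorum_met: bool, failed_oracles: int) -> str:
    """Return the highest matching severity; check P0 first because the bands overlap."""
    if reserve_ratio_pct < 100:   # Reserve < Supply (money multiplier violation)
        return "P0"
    if reserve_ratio_pct < 105 or bridge_failure_pct > 10 or not quorum_met:
        return "P1"
    if reserve_ratio_pct < 110 or bridge_failure_pct >= 5 or failed_oracles >= 1:
        return "P2"
    return "P3"
```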
5.2 Incident Response Process
Step 1: Detection
- Monitor alerts
- Review logs
- User reports
Step 2: Assessment
- Classify severity
- Assess impact
- Identify root cause
Step 3: Containment
- Apply emergency procedures
- Halt affected operations
- Isolate issue
Step 4: Resolution
- Fix root cause
- Restore operations
- Verify fix
Step 5: Post-Mortem
- Document incident
- Identify improvements
- Update procedures
6. Backup & Recovery
6.1 Backup Procedures
Daily Backups
- Contract state snapshots
- Configuration backups
- Access control backups
Weekly Backups
- Complete system state
- Oracle configuration
- Compliance rules
Monthly Backups
- Full system archive
- Historical data
- Audit logs
6.2 Recovery Procedures
Contract State Recovery
- Identify backup point
- Restore contract state
- Verify restoration
- Resume operations
Configuration Recovery
- Restore configuration files
- Verify settings
- Test functionality
- Resume operations
7. Monitoring Setup
7.1 Key Metrics
Vault System Metrics
- Total vaults
- Total collateral (by asset)
- Total debt (by currency)
- Average health ratio
- Liquidation events
W Token Metrics
- Supply per token (USDW, EURW, etc.)
- Reserve balance per token
- Reserve ratio per token
- Mint/burn events
- Redemption events
Bridge Metrics
- Bridge success rate
- Average settlement time
- Reserve verification success rate
- Compliance check success rate
- Transfer volume
7.2 Alert Configuration
Critical Alerts
- name: Reserve Shortfall
  condition: reserveRatio < 100%
  action: halt_minting
- name: Money Multiplier Violation
  condition: reserve < supply
  action: emergency_pause
- name: Bridge Failure Rate High
  condition: successRate < 90%
  action: alert_team
Warning Alerts
- name: Reserve Ratio Low
  condition: reserveRatio < 105%
  action: alert_team
- name: Vault Health Low
  condition: healthRatio < 120%
  action: alert_team
- name: Oracle Staleness
  condition: reportAge > 1hour
  action: alert_team
8. Operational Checklists
8.1 Daily Checklist
- Check all reserve ratios (W tokens)
- Verify oracle quorum status
- Check vault health ratios
- Review bridge success rates
- Check for critical alerts
- Review error logs
8.2 Weekly Checklist
- Submit reserve attestations
- Review system metrics
- Check access control roles
- Review compliance status
- Generate weekly report
- Update documentation
8.3 Monthly Checklist
- Security review
- Compliance audit
- Performance review
- Backup verification
- Update procedures
- Team training
9. Contact Information
Emergency Contacts
- Operations Team: [Contact Info]
- Security Team: [Contact Info]
- Compliance Team: [Contact Info]
- On-Call Engineer: [Contact Info]
Escalation Path
- Operations Team (First Response)
- Security Team (Security Issues)
- Compliance Team (Compliance Issues)
- Management (Critical Issues)
Last Updated: Operations Runbook Complete