Maintenance Scripts

health-check-rpc-2101.sh — Health check for Besu RPC on VMID 2101: container status, besu-rpc service, port 8545, eth_chainId, eth_blockNumber. Run from project root (LAN). See docs/09-troubleshooting/RPC_NODES_BLOCK_PRODUCTION_FIX.md.
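The two JSON-RPC probes named above (eth_chainId, eth_blockNumber) can also be run by hand. The following is a minimal sketch, not the code in health-check-rpc-2101.sh: the helper names are hypothetical, the endpoint URL is supplied by the caller (the address of VMID 2101 is not given here), and curl is assumed to be installed.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; not taken from health-check-rpc-2101.sh.

# Build a JSON-RPC 2.0 request body for a method that takes no parameters.
rpc_payload() {
  printf '{"jsonrpc":"2.0","method":"%s","params":[],"id":1}' "$1"
}

# POST the request to an endpoint; the URL is supplied by the caller.
rpc_call() {
  local url="$1" method="$2"
  curl -sS -m 5 -H 'Content-Type: application/json' \
    --data "$(rpc_payload "$method")" "$url"
}

# Read a JSON-RPC response on stdin and print its "result" string (no jq needed).
rpc_result() {
  sed -n 's/.*"result"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Convert a 0x-prefixed hex quantity to decimal.
hex_to_dec() {
  printf '%d\n' "$1"
}
```

For example, `hex_to_dec "$(rpc_call "http://<rpc-host>:8545" eth_blockNumber | rpc_result)"` prints the current block height as a decimal number; `<rpc-host>` is a placeholder.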

fix-core-rpc-2101.sh — One-command fix for Core RPC 2101: start CT if stopped, restart Besu, verify RPC. Options: --dry-run, --apply (mutations when PROXMOX_SAFE_DEFAULTS=1), --restart-only. Optional PROXMOX_OPS_ALLOWED_VMIDS. If Besu fails with JNA/NoClassDefFoundError, run fix-rpc-2101-jna-reinstall.sh first.

fix-rpc-2101-jna-reinstall.sh — Reinstall Besu in CT 2101 to fix JNA/NoClassDefFoundError; then re-run fix-core-rpc-2101.sh. Use --dry-run to print steps only.

check-disk-all-vmids.sh — Check root disk usage in all running containers on ml110, r630-01, r630-02. Use --csv for tab-separated output. Useful for prevention and audits.

run-all-maintenance-via-proxmox-ssh.sh — Run all maintenance/fix scripts that use SSH to Proxmox VE (r630-01, ml110, r630-02). Runs make-rpc-vmids-writable-via-ssh.sh --apply first (so 2101, 2500-2505 are writable), then resolve-and-fix-all, fix-rpc-2101-jna-reinstall, install-besu-permanent-on-missing-nodes, address-all-remaining-502s; optional E2E with --e2e. Use --no-npm to skip NPM proxy update, --dry-run to print steps only, --verbose to show all step output (no stderr hidden). Step 2 (2101 fix) has optional timeout: STEP2_TIMEOUT=900 (default) or STEP2_TIMEOUT=0 to disable. Run from project root (LAN).
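The STEP2_TIMEOUT behavior described above can be pictured as a thin wrapper around coreutils `timeout`. This is only a sketch: the function name is hypothetical, the value is assumed to be in seconds, and the real script may implement the limit differently.

```shell
# Hypothetical wrapper; not copied from run-all-maintenance-via-proxmox-ssh.sh.
# Assumes STEP2_TIMEOUT is in seconds and that 0 means "no limit".
run_step_with_timeout() {
  local limit="${STEP2_TIMEOUT:-900}"
  if [ "$limit" -gt 0 ] 2>/dev/null; then
    # coreutils timeout exits with status 124 when the limit is hit
    timeout "$limit" "$@"
  else
    "$@"
  fi
}
```

A caller would wrap the step command, for example `STEP2_TIMEOUT=0 run_step_with_timeout bash scripts/maintenance/fix-rpc-2101-jna-reinstall.sh --dry-run`.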

make-rpc-vmids-writable-via-ssh.sh — SSHs to r630-01 and for each VMID (default 2101; override with BESU_WRITABLE_VMIDS): stops the CT, runs e2fsck -f -y on the rootfs LV, starts the CT. Use before fix-rpc-2101 or install-besu-permanent when CTs are read-only. --dry-run / --apply; with PROXMOX_SAFE_DEFAULTS=1, default is dry-run unless --apply or PROXMOX_OPS_APPLY=1. Optional PROXMOX_OPS_ALLOWED_VMIDS. Run from project root (LAN).
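The dry-run/apply guard described above can be sketched as follows. The function names are hypothetical, and several details are assumptions because they are not documented here: the behavior when PROXMOX_SAFE_DEFAULTS is not 1, the allowlist format (comma- or space-separated), and the treatment of an empty allowlist as "all VMIDs permitted".

```shell
# Hypothetical sketch of the guard; not copied from the maintenance scripts.

# Print "apply" or "dry-run" for the given command-line flags.
resolve_mode() {
  local arg
  for arg in "$@"; do
    case "$arg" in
      --dry-run) echo "dry-run"; return ;;
      --apply)   echo "apply";   return ;;
    esac
  done
  if [ "${PROXMOX_OPS_APPLY:-0}" = "1" ]; then
    echo "apply"
  elif [ "${PROXMOX_SAFE_DEFAULTS:-0}" = "1" ]; then
    echo "dry-run"
  else
    echo "apply"   # assumption: behavior without safe defaults is not documented
  fi
}

# Succeed if the VMID may be touched.
# Assumes a comma- or space-separated allowlist; empty or unset allows all.
vmid_allowed() {
  local vmid="$1" allowed
  allowed="$(printf '%s' "${PROXMOX_OPS_ALLOWED_VMIDS:-}" | tr ',' ' ')"
  [ -z "$allowed" ] && return 0
  case " $allowed " in
    *" $vmid "*) return 0 ;;
    *)           return 1 ;;
  esac
}
```

With PROXMOX_SAFE_DEFAULTS=1 and no flags, `resolve_mode` prints `dry-run`; adding `--apply` or setting PROXMOX_OPS_APPLY=1 switches it to `apply`.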

make-validator-vmids-writable-via-ssh.sh — SSHs to r630-01 (1000, 1001, 1002) and r630-03 (1003, 1004); stops each validator CT, runs e2fsck -f -y on rootfs, starts the CT. Fixes "Read-only file system" / JNA crash loop on validators. Then run fix-all-validators-and-txpool.sh. See docs/08-monitoring/RPC_AND_VALIDATOR_TESTING_RUNBOOK.md.

Sentries 1500-1502 (r630-01) — If deploy-besu-node-lists or set-all-besu-max-peers-32 reports Skip/fail or "Read-only file system" for 1500-1502, they have the same read-only root issue. On the host: pct stop 1500; e2fsck -f -y /dev/pve/vm-1500-disk-0; pct start 1500 (repeat for 1501, 1502). Then re-run deploy and max-peers/restart.
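The per-sentry repair above can be generated for all three VMIDs at once. This sketch only prints the commands so they can be reviewed first; the function name is hypothetical, and it assumes the rootfs LVs for 1501 and 1502 follow the same /dev/pve/vm-<vmid>-disk-0 naming shown for 1500.

```shell
# Print the stop / fsck / start commands for each VMID given as an argument.
# Hypothetical helper; assumes the LV naming pattern shown for VMID 1500.
sentry_fsck_commands() {
  local vmid
  for vmid in "$@"; do
    echo "pct stop ${vmid}"
    echo "e2fsck -f -y /dev/pve/vm-${vmid}-disk-0"
    echo "pct start ${vmid}"
  done
}
```

Run `sentry_fsck_commands 1500 1501 1502` to review the list, then pipe it to `sh` on the Proxmox host. Avoid `sh -e` here: e2fsck exits with status 1 when it has corrected errors, which would stop the run before `pct start`.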

address-all-remaining-502s.sh — One flow to address remaining E2E 502s: runs fix-all-502s-comprehensive.sh, then (if NPM_PASSWORD set) NPMplus proxy update, then RPC diagnostics (diagnose-rpc-502s.sh), optionally fix-all-besu-nodes.sh and E2E. Use --no-npm, --run-besu-fix, --e2e, --dry-run (print steps only). Run from LAN.

diagnose-rpc-502s.sh — Collects for VMIDs 2101 and 2500-2505: ss -tlnp and journalctl -u besu-rpc / besu. Pipe to a file or use from address-all-remaining-502s.sh.

fix-all-502s-comprehensive.sh — Starts/serves backends for 10130, 10150/10151, 2101, 2500-2505, Cacti (Python stubs if needed). Use --dry-run to print actions without SSH. Does not update NPMplus; use update-npmplus-proxy-hosts-api.sh from LAN for that.

daily-weekly-checks.sh — Runs the daily checks (explorer, indexer lag, RPC) and the weekly checks (config API, thin pool, log reminder).
schedule-daily-weekly-cron.sh — Install cron: daily 08:00, weekly Sun 09:00. Run from a persistent host checkout; set CRON_PROJECT_ROOT=/srv/proxmox when installing on a Proxmox node.
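For reference, the schedule above (daily 08:00, weekly Sunday 09:00) corresponds to crontab entries of roughly the following shape. This is a sketch, not the installer's output: the function name is hypothetical, the `weekly` argument is assumed by analogy with the `daily` argument used in the wrapper example in this README, and falling back to the current directory when CRON_PROJECT_ROOT is unset is also an assumption.

```shell
# Print crontab lines matching the documented schedule.
# Hypothetical sketch; the exact lines written by the installer may differ.
cron_lines() {
  local root="${CRON_PROJECT_ROOT:-$PWD}"   # assumption: default to current checkout
  local cmd="cd $root && bash scripts/maintenance/daily-weekly-checks.sh"
  local log="logs/daily-weekly-checks.log"
  echo "0 8 * * * $cmd daily >> $log 2>&1"    # every day at 08:00
  echo "0 9 * * 0 $cmd weekly >> $log 2>&1"   # Sundays at 09:00
}
```

Running `CRON_PROJECT_ROOT=/srv/proxmox cron_lines` prints both entries with the persistent path substituted.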

ensure-firefly-primary-via-ssh.sh — SSHs to r630-02 and normalizes /opt/firefly/docker-compose.yml on VMID 6200, installs an idempotent helper-backed firefly.service, and verifies /api/v1/status. It is safe for the current mixed stack where firefly-core already exists outside compose while Postgres and IPFS remain compose-managed. Use --dry-run to print actions only.

ensure-fabric-sample-network-via-ssh.sh — SSHs to r630-02 and ensures VMID 6000 has nested-LXC features, a boot-time fabric-sample-network.service, and a queryable mychannel. Use --dry-run to print actions only.

ensure-legacy-monitor-networkd-via-ssh.sh — SSHs to r630-01 and fixes the legacy 3000-3003 monitor/RPC-adjacent LXCs so systemd-networkd is enabled host-side and started in-guest. This is the safe path for unprivileged guests where systemctl enable fails from inside the CT. --dry-run / --apply; same PROXMOX_SAFE_DEFAULTS behavior as other guarded maintenance scripts.

check-and-fix-explorer-lag.sh — Checks RPC vs Blockscout block; if lag > threshold (default 500), runs fix-explorer-indexer-lag.sh (restart Blockscout).
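The lag test above reduces to one comparison. A minimal sketch, assuming both block heights are already available as decimal numbers; the function name is hypothetical and the real script may obtain and compare the values differently.

```shell
# Succeed (exit 0) when the explorer is more than the threshold behind the RPC head.
# Hypothetical helper: $1 = RPC block, $2 = Blockscout block, $3 = threshold (default 500).
lag_exceeds_threshold() {
  local rpc_block="$1" explorer_block="$2" threshold="${3:-500}"
  local lag=$((rpc_block - explorer_block))
  [ "$lag" -gt "$threshold" ]
}
```

A lag of exactly 500 does not trigger the fix under this reading, since the condition is "lag > threshold".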
schedule-explorer-lag-cron.sh — Install cron for lag check-and-fix: every 6 hours (0, 6, 12, 18). Log: logs/explorer-lag-fix.log. Use --show to print the line, --install to add to crontab, --remove to remove. Run from a persistent host checkout; set CRON_PROJECT_ROOT=/srv/proxmox when installing on a Proxmox node.

All schedule-*.sh installers — Refuse transient roots such as /tmp/.... Install from a persistent checkout only.
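The transient-root refusal can be pictured as a simple path check. This is a sketch with a hypothetical function name; only /tmp is named in this README, so treating /var/tmp and /run as transient is an added assumption.

```shell
# Succeed when the given project root looks transient.
# Hypothetical helper; /var/tmp and /run are assumptions, only /tmp is documented.
is_transient_root() {
  case "$1" in
    /tmp|/tmp/*|/var/tmp|/var/tmp/*|/run|/run/*) return 0 ;;
    *) return 1 ;;
  esac
}
```

An installer could then bail out early with something like `is_transient_root "$PWD" && { echo "refusing transient root" >&2; exit 1; }`.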

Optional: Alerting on failures

When run, the daily/weekly script writes a metric file to the path in MAINTENANCE_METRIC_FILE, or to logs/maintenance-checks.metric by default:

maintenance_checks_failed 0
maintenance_checks_timestamp 1739123456
  • Use in cron: After the check, if maintenance_checks_failed > 0, send an alert.
  • Example wrapper (email on failure):
    cd /path/to/proxmox && bash scripts/maintenance/daily-weekly-checks.sh daily >> logs/daily-weekly-checks.log 2>&1
    FAILED=$(grep '^maintenance_checks_failed' logs/maintenance-checks.metric 2>/dev/null | awk '{print $2}')
    [ -n "$FAILED" ] && [ "$FAILED" -gt 0 ] && echo "Maintenance checks failed: $FAILED" | mail -s "Explorer/maintenance alert" ops@example.com
    
  • Slack: Use a small script that reads the metric file and posts to a webhook when maintenance_checks_failed > 0.
  • Prometheus/Grafana: Scrape the metric file or run a node_exporter textfile collector on logs/maintenance-checks.metric.
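The Slack option above could look roughly like this. It is a sketch, not a script shipped in this directory: the function names and the SLACK_WEBHOOK_URL variable are hypothetical, and it assumes a standard Slack incoming webhook that accepts a JSON body with a `text` field.

```shell
# Print the failure count from the metric file (0 if the file is missing or unreadable).
read_failed_count() {
  local file="${1:-logs/maintenance-checks.metric}"
  [ -r "$file" ] || { echo 0; return; }
  awk '$1 == "maintenance_checks_failed" { n = $2 } END { print n + 0 }' "$file"
}

# Post to a Slack incoming webhook only when the count is positive.
# SLACK_WEBHOOK_URL is a hypothetical variable supplied by the operator.
notify_slack_on_failure() {
  local failed
  failed="$(read_failed_count "$1")"
  [ "$failed" -gt 0 ] || return 0
  curl -sS -X POST -H 'Content-Type: application/json' \
    --data "{\"text\":\"Maintenance checks failed: $failed\"}" \
    "${SLACK_WEBHOOK_URL:?set SLACK_WEBHOOK_URL}"
}
```

Call `notify_slack_on_failure logs/maintenance-checks.metric` right after the check in the cron wrapper; it does nothing when the count is 0.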

To disable the metric file, set MAINTENANCE_METRIC_FILE= (empty) before running the script.
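For the node_exporter textfile collector route mentioned in the list above, the metric file needs to land in the collector directory with a .prom extension. A minimal sketch with a hypothetical function name; the default directory is an assumption and must match the `--collector.textfile.directory` setting of the local node_exporter.

```shell
# Copy the metric file into a node_exporter textfile directory as a .prom file.
# Hypothetical helper; the default directory below is an assumption.
publish_metric() {
  local src="${1:-logs/maintenance-checks.metric}"
  local dir="${2:-/var/lib/node_exporter/textfile_collector}"
  local tmp="$dir/.maintenance-checks.prom.$$"
  # Write to a temp name first, then rename, so a scrape never sees a partial file.
  cp "$src" "$tmp" && mv "$tmp" "$dir/maintenance-checks.prom"
}
```

The temporary name does not end in .prom, so the collector ignores it until the rename completes.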