# 502 Deep Dive: Root Causes and Fixes

Last updated: 2026-02-14

This document maps each E2E 502 to its backend, root cause, and fix. Use from LAN with SSH to Proxmox.

## Full maintenance (all RPC + 502 in one run)

From project root on LAN (SSH to r630-01, ml110, r630-02):

```shell
./scripts/maintenance/run-all-maintenance-via-proxmox-ssh.sh --e2e
```

This runs, in order: (0) make RPC VMIDs 2101 and 2500-2505 writable (e2fsck); (1) resolve-and-fix (Dev VM IP, start containers, DBIS); (2) fix 2101 via JNA reinstall; (3) install Besu on missing nodes (2500-2505, 1505-1508); (4) address-all-502s (backends + NPM + RPC diagnostics); (5) E2E verification. Use --verbose to see all step output; set STEP2_TIMEOUT=0 to disable the step-2 timeout. See MAINTENANCE_SCRIPTS_REVIEW.md and CHECK_ALL_UPDATES_AND_CLOUDFLARE_TUNNELS.md §9.
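
The step order above can be sketched as a small echo-only wrapper. This is a hedged illustration, not the runner itself: the step-1 script path (`resolve-and-fix.sh`) is an assumption; the other paths are the ones named elsewhere in this document, and the real runner wires these steps internally.

```shell
#!/usr/bin/env bash
# Echo-only sketch of the full-maintenance step order; set DRY_RUN=0 to execute.
# The step-1 path is hypothetical -- the real runner chains these internally.
set -euo pipefail
DRY_RUN="${DRY_RUN:-1}"

STEPS=(
  "./scripts/maintenance/make-rpc-vmids-writable-via-ssh.sh"     # 0: e2fsck VMIDs 2101, 2500-2505
  "./scripts/maintenance/resolve-and-fix.sh"                     # 1: Dev VM IP, containers, DBIS (path assumed)
  "./scripts/maintenance/fix-rpc-2101-jna-reinstall.sh"          # 2: 2101 JNA reinstall
  "./scripts/besu/install-besu-permanent-on-missing-nodes.sh"    # 3: install Besu on 2500-2505, 1505-1508
  "./scripts/maintenance/address-all-remaining-502s.sh"          # 4: backends + NPM + RPC diagnostics
  "./scripts/verify/verify-end-to-end-routing.sh"                # 5: E2E verification
)

for step in "${STEPS[@]}"; do
  if [ "$DRY_RUN" = "1" ]; then
    echo "WOULD RUN: $step"   # dry-run: show the order without touching hosts
  else
    "$step"
  fi
done
```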

## Backend map (domain → IP:port → VMID, host)

| Domain(s) | Backend | VMID | Proxmox host | Service to start |
|---|---|---|---|---|
| dbis-admin.d-bis.org, secure.d-bis.org | 192.168.11.130:80 | 10130 | r630-01 (192.168.11.11) | nginx |
| dbis-api.d-bis.org, dbis-api-2.d-bis.org | 192.168.11.155:3000, .156:3000 | 10150, 10151 | r630-01 | node |
| rpc-http-prv.d-bis.org, rpc-ws-prv.d-bis.org | 192.168.11.211:8545/8546 | 2101 | r630-01 | besu |
| mim4u.org, www.mim4u.org, secure.mim4u.org, training.mim4u.org | 192.168.11.37:80 | 7810 | r630-02 (192.168.11.12) | nginx (or python stub in fix-all-502s-comprehensive.sh) |
| rpc-alltra*.d-bis.org (3) | 192.168.11.172/173/174:8545 | 2500, 2501, 2502 | r630-01 | besu |
| rpc-hybx*.d-bis.org (3) | 192.168.11.246/247/248:8545 | 2503, 2504, 2505 | r630-01 or ml110 | besu |
| cacti-1 (if proxied) | 192.168.11.80:80 | 5200 | r630-02 | nginx/apache2 |
| cacti-alltra.d-bis.org | 192.168.11.177:80 | 5201 | r630-02 | nginx/apache2 |
| cacti-hybx.d-bis.org | 192.168.11.251:80 | 5202 | r630-02 | nginx/apache2 |
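
Before starting anything, a quick probe of every backend in the map shows which targets actually answer. This is a sketch to run from a LAN host; the IP:port list is copied from the table above, and the loop is left commented so the script is safe to source anywhere.

```shell
#!/usr/bin/env bash
# Probe each backend from the map; prints "<ip:port> <http_code>" (000 = unreachable).
set -u
BACKENDS=(
  "192.168.11.130:80"    # 10130 dbis-admin / secure
  "192.168.11.155:3000"  # 10150 dbis-api
  "192.168.11.156:3000"  # 10151 dbis-api-2
  "192.168.11.211:8545"  # 2101  rpc-http-prv
  "192.168.11.37:80"     # 7810  mim4u
  "192.168.11.172:8545" "192.168.11.173:8545" "192.168.11.174:8545"  # 2500-2502 alltra
  "192.168.11.246:8545" "192.168.11.247:8545" "192.168.11.248:8545"  # 2503-2505 hybx
  "192.168.11.80:80" "192.168.11.177:80" "192.168.11.251:80"         # 5200-5202 cacti
)

probe() {  # probe <ip:port> -> prints target and HTTP status code
  local t="$1"
  printf '%s %s\n' "$t" "$(curl -s -o /dev/null -m 5 -w '%{http_code}' "http://$t/")"
}

# for t in "${BACKENDS[@]}"; do probe "$t"; done   # uncomment on the LAN
```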

## One-command: address all remaining 502s

From a host on the LAN (can reach NPMplus and Proxmox):

```shell
# Full flow: backends + NPMplus proxy update (if NPM_PASSWORD set) + RPC diagnostics
./scripts/maintenance/address-all-remaining-502s.sh

# Skip NPMplus update (e.g. no .env yet)
./scripts/maintenance/address-all-remaining-502s.sh --no-npm

# Also run Besu mass-fix (config + restart) and E2E at the end
./scripts/maintenance/address-all-remaining-502s.sh --run-besu-fix --e2e
```

This runs in order: (1) fix-all-502s-comprehensive.sh, (2) NPMplus proxy update when NPM_PASSWORD is set, (3) diagnose-rpc-502s.sh (saves report under docs/04-configuration/verification-evidence/), (4) optional fix-all-besu-nodes.sh, (5) optional E2E.

## Per-step diagnose and fix

From a host that can SSH to Proxmox (r630-01, r630-02, ml110):

```shell
# Comprehensive fix (DBIS 10130 Python, dbis-api, 2101, 2500-2505 Besu, Cacti Python)
./scripts/maintenance/fix-all-502s-comprehensive.sh

# RPC diagnostics only (2101, 2500-2505): ss -tlnp + journalctl, to file
./scripts/maintenance/diagnose-rpc-502s.sh | tee docs/04-configuration/verification-evidence/rpc-502-diagnostics.txt

# Diagnose only (no starts)
./scripts/maintenance/diagnose-and-fix-502s-via-ssh.sh --diagnose-only

# Apply fixes per-backend (start containers + nginx/node/besu)
./scripts/maintenance/diagnose-and-fix-502s-via-ssh.sh
```

The comprehensive fix script will:

- For each backend: SSH to the host, check `pct status <vmid>`, and start the container if it is stopped.
- If the container is running: curl from the host to the backend IP:port; on 000/failure, run `systemctl start nginx` / `node` / `besu` as appropriate and show in-CT `ss -tlnp`.
- HYBX (2503-2505): if ml110 has no such VMID, try r630-01.
- Cacti: VMID 5200 (cacti-1), 5201 (cacti-alltra), 5202 (cacti-hybx) on r630-02 (migrated 2026-02-15).
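
The per-backend logic above can be sketched as a shell function. The function and variable names here are illustrative, not the script's own, and every command is echoed rather than executed, so the sketch is safe to run and inspect.

```shell
#!/usr/bin/env bash
# Echo-only sketch of the per-backend check/start flow in the comprehensive fix.
set -u

fix_backend() {  # fix_backend <proxmox_host> <vmid> <ip:port> <service>
  local host="$1" vmid="$2" ip_port="$3" svc="$4"
  # 1. Start the container if pct reports it stopped.
  echo "ssh root@$host 'pct status $vmid | grep -q stopped && pct start $vmid'"
  # 2. Probe the backend from the Proxmox host; on 000/failure, start the
  #    service inside the CT and show what is listening.
  echo "ssh root@$host \"curl -s -o /dev/null -m 5 -w '%{http_code}' http://$ip_port/\""
  echo "ssh root@$host 'pct exec $vmid -- systemctl start $svc'"
  echo "ssh root@$host 'pct exec $vmid -- ss -tlnp'"
}

# Example row from the backend map (dbis-admin on r630-01):
fix_backend 192.168.11.11 10130 192.168.11.130:80 nginx
```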

## Root cause summary (typical)

| 502 | Typical cause | Fix |
|---|---|---|
| dbis-admin, secure | Container 10130 stopped or nginx not running | `pct start 10130` on r630-01; inside CT: `systemctl start nginx` |
| dbis-api, dbis-api-2 | Containers 10150/10151 stopped or Node app not running | `pct start` on r630-01; inside CT: `systemctl start node` |
| rpc-http-prv | Container 2101 stopped or Besu not listening on 8545 | `pct start 2101`; inside CT: `systemctl start besu` (allow 30-60s) |
| rpc-alltra*, rpc-hybx* | Containers 2500-2505 stopped or Besu not running | Same: `pct start <vmid>`; inside CT: `systemctl start besu` |
| cacti-alltra, cacti-hybx, cacti-1 | 5200/5201/5202 stopped or web server not running | On r630-02: `pct start 5200/5201/5202`; inside CT: `systemctl start nginx` or `apache2` |
| mim4u.org, www/secure/training.mim4u.org | Container 7810 stopped or nothing on port 80 | On r630-02: `pct start 7810`; inside CT: `systemctl start nginx` or run the python stub on 80 (see fix-all-502s-comprehensive.sh) |

## VMID 2400 (ThirdWeb RPC primary, 192.168.11.240)

Host: ml110 (192.168.11.10). Service: besu-rpc (config: /etc/besu/config-rpc-thirdweb.toml). Nginx on 443/80.

Intermittent RPC timeouts: if eth_chainId against :8545 sometimes fails, Besu may be hitting the Vert.x BlockedThreadChecker (a worker thread blocked >60s during heavy operations). Fix applied: in /etc/systemd/system/besu-rpc.service, BESU_OPTS was extended with -Dvertx.options.blockedThreadCheckInterval=120000 (120s) so occasional slow operations (e.g. trace, compaction) don't trigger warnings as quickly. Restart: `pct exec 2400 -- systemctl restart besu-rpc.service`. After a restart, Besu may run RocksDB compaction before binding 8545; allow 5-15 minutes, then re-check RPC. The config already has host-allowlist=["*"]. If the node is down, check `pct exec 2400 -- journalctl -u besu-rpc -n 30` (look for "Compacting database" or "JSON-RPC service started").
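
An equivalent way to carry the same setting is a systemd drop-in instead of editing the unit file in place. This is a sketch, not what was actually applied: the document says BESU_OPTS was edited in besu-rpc.service directly, so the `Environment=` drop-in form below is an assumption about how the unit consumes BESU_OPTS.

```shell
#!/usr/bin/env bash
# Sketch: carry the BlockedThreadChecker interval as a systemd drop-in.
# ASSUMPTION: besu-rpc.service reads BESU_OPTS from its environment.
set -u

DROPIN='[Service]
Environment="BESU_OPTS=-Dvertx.options.blockedThreadCheckInterval=120000"'

# On the Proxmox host (ml110), roughly:
#   pct exec 2400 -- mkdir -p /etc/systemd/system/besu-rpc.service.d
#   pct exec 2400 -- bash -c "cat > /etc/systemd/system/besu-rpc.service.d/vertx.conf" <<<"$DROPIN"
#   pct exec 2400 -- systemctl daemon-reload
#   pct exec 2400 -- systemctl restart besu-rpc.service

printf '%s\n' "$DROPIN"   # show the drop-in content
```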

## If 502 persists after running the script

  1. Backends verified in-container but public still 502 (dbis-admin, secure, dbis-api, dbis-api-2):
    The origin (76.53.10.36) routes by hostname. Refresh the NPMplus proxy targets from the LAN so the proxy forwards to .130:80 and .155/.156:3000:
    `NPM_PASSWORD=xxx ./scripts/nginx-proxy-manager/update-npmplus-proxy-hosts-api.sh`
    Then purge the Cloudflare cache for those hostnames if needed.

  2. From the Proxmox host (e.g. SSH to 192.168.11.11):

    - `pct exec <vmid> -- ss -tlnp` — see what is listening.
    - `pct exec <vmid> -- systemctl status nginx` (or node, besu) — check the unit name and errors.
  3. NPMplus must be able to reach the backend IP. From the NPMplus host: `curl -s -o /dev/null -w '%{http_code}' http://<backend_ip>:<port>/`.

  4. RPC (2101, 2500-2505): if Besu still does not respond after 90s:

    - Run ./scripts/maintenance/diagnose-rpc-502s.sh and check the report (or `pct exec <vmid> -- ss -tlnp` and `journalctl -u besu-rpc` / `besu`).
    - Fix config/nodekey/genesis per the journal errors.
    - Run ./scripts/besu/fix-all-besu-nodes.sh from the project root (optionally with --no-restart first to only fix configs), or use ./scripts/maintenance/address-all-remaining-502s.sh --run-besu-fix.

Known infrastructure causes and fixes:

  - 2101: if the journal shows NoClassDefFoundError: com.sun.jna.Native, "JNA/Udev", or "Read-only file system" for JNA/libjnidispatch, run from the project root (LAN):
    ./scripts/maintenance/fix-rpc-2101-jna-reinstall.sh — reinstalls Besu and points JNA at /data/besu/tmp. If the script exits with "Container … /tmp is not writable", make the CT writable (see "Read-only CT" below), then re-run.
    The fix script also sets p2p-host in /etc/besu/config-rpc.toml to 192.168.11.211 (RPC_CORE_1). If 2101 had p2p-host="192.168.11.250" (RPC_ALLTRA_1), other nodes would see the wrong advertised address; the correct node lists are in the repo under config/besu-node-lists/static-nodes.json and permissions-nodes.toml (2101 = .211).
  - 2500-2505: if the journal shows "Failed to locate executable /opt/besu/bin/besu", install Besu in each CT:
    ./scripts/besu/install-besu-permanent-on-missing-nodes.sh — installs Besu (23.10.3) in 1505-1508 and 2500-2505 where missing, deploys config/genesis/node lists, and enables and starts the service. Allow ~5-10 minutes per node. Use --dry-run to see which VMIDs would be updated. If the install fails with "Read-only file system", make the CT writable first.

## VMID 2101: checklist of causes

When 2101 (Core RPC at 192.168.11.211) is down or crash-looping, check in order:

| Cause | What to check | Fix |
|---|---|---|
| Read-only root (emergency_ro) | `pct exec 2101 -- mount \| grep 'on / '` — if ro or emergency_ro, root is read-only (e.g. after ext4 errors). | Run ./scripts/maintenance/make-rpc-vmids-writable-via-ssh.sh (stops 2101, e2fsck on host, starts CT). Or on the host: stop 2101, `e2fsck -f -y /dev/pve/vm-2101-disk-0`, start 2101. |
| Wrong p2p-host | `pct exec 2101 -- grep p2p-host /etc/besu/config-rpc.toml` — must be 192.168.11.211 (not .250). | Run ./scripts/maintenance/fix-rpc-2101-jna-reinstall.sh (it sets p2p-host to RPC_CORE_1). Or fix manually with sed on /etc/besu/config-rpc.toml. |
| Static / permissioned node lists | In the CT, /etc/besu/static-nodes.json and /etc/besu/permissions-nodes.toml should list 2101 as ...@192.168.11.211:30303. Repo copy: config/besu-node-lists/. | Deploy from the repo: the fix script copies static-nodes.json and permissions-nodes.toml when present. Or run ./scripts/deploy-besu-node-lists-to-all.sh. |
| No space / RocksDB compaction | Journal: "No space left on device" during "Compacting database". Host thin pool: `lvs` on r630-01. | Free the thin pool (see "LVM thin pool full" below). If root was emergency_ro, fix that first; then restart besu-rpc. Optionally start with a fresh /data/besu to resync. |
| JNA / Besu binary | Journal: NoClassDefFoundError: com.sun.jna.Native or missing /opt/besu/bin/besu. | Run ./scripts/maintenance/fix-rpc-2101-jna-reinstall.sh (reinstalls Besu in the CT). |

After any fix: `pct exec 2101 -- systemctl restart besu-rpc`, then wait ~60s and run `curl -s -X POST -H 'Content-Type: application/json' -d '{"jsonrpc":"2.0","method":"eth_chainId","params":[],"id":1}' http://192.168.11.211:8545/`.
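
Rather than a fixed 60s sleep, the post-restart check can poll until the RPC answers (useful because Besu may compact RocksDB before binding 8545). The helper below is illustrative, not from the repo.

```shell
#!/usr/bin/env bash
# wait_for_rpc <url> [timeout_s]: poll eth_chainId until it answers or time out.
set -u

wait_for_rpc() {
  local url="$1" timeout="${2:-90}" t=0 body
  while [ "$t" -lt "$timeout" ]; do
    body=$(curl -s -m 3 -X POST -H 'Content-Type: application/json' \
      -d '{"jsonrpc":"2.0","method":"eth_chainId","params":[],"id":1}' "$url" || true)
    case "$body" in
      *'"result"'*) echo "$body"; return 0 ;;   # got a chain id: node is up
    esac
    sleep 3; t=$((t + 3))
  done
  return 1   # never answered within the timeout
}

# wait_for_rpc http://192.168.11.211:8545/ 90 || echo "2101 still not answering"
```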

## Read-only CT (2101, 2500-2505)

If fix or install scripts fail with "Read-only file system" (e.g. when creating files in /root, /tmp, or /opt), the container's root (or key mounts) is read-only. Besu/JNA also needs a writable java.io.tmpdir (e.g. /data/besu/tmp); the install and fix scripts set that when they can write to the CT.

Make all RPC VMIDs writable in one go (from the project root, on the LAN):
./scripts/maintenance/make-rpc-vmids-writable-via-ssh.sh — SSHes to r630-01 and, for each of 2101 and 2500-2505, runs: stop CT, `e2fsck -f -y` on the rootfs LV, start CT. Then re-run the fix or install script. The full maintenance runner (run-all-maintenance-via-proxmox-ssh.sh) runs this step first automatically.
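
The stop → e2fsck → start cycle that script performs per VMID can be sketched as below. The LV path pattern `/dev/pve/vm-<vmid>-disk-0` is taken from the 2101 checklist above and is an assumption for the other VMIDs; the sketch only prints the commands, for pasting into a root shell on r630-01.

```shell
#!/usr/bin/env bash
# Print the per-VMID writable-fix commands (run the output on r630-01 as root).
# LV naming (/dev/pve/vm-<vmid>-disk-0) is assumed to match the 2101 example.
set -u

CMDS=""
for vmid in 2101 2500 2501 2502 2503 2504 2505; do
  CMDS+="pct stop $vmid
e2fsck -f -y /dev/pve/vm-$vmid-disk-0
pct start $vmid
"
done

printf '%s' "$CMDS"
```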

Make a single CT writable (from the Proxmox host):

  1. Check the mount: `pct exec <vmid> -- mount | grep 'on / '` — if you see ro, root is mounted read-only.
  2. Remount from inside (if allowed): `pct exec <vmid> -- mount -o remount,rw /`.
    If that fails (e.g. "Operation not permitted"), the CT may be running with a read-only rootfs by design.
  3. From the host: inspect the CT config with `pct config <vmid>`. If rootfs has an option making it read-only, remove or change it (Proxmox UI: CT → Hardware → Root disk; or `pct set <vmid> --rootfs <storage>:<size>` to recreate, only if you have a backup).
  4. Alternative: ensure at least /tmp and /opt are writable (e.g. bind-mount writable storage or a tmpfs for /tmp). Then re-run the fix/install script.

After the CT is writable, run ./scripts/maintenance/fix-rpc-2101-jna-reinstall.sh (2101) or ./scripts/besu/install-besu-permanent-on-missing-nodes.sh (2500-2505) again.

## LVM thin pool full (2101 / 2500-2505 "No space left on device")

If Besu fails with "No space left on device" on /data/besu/database/*.dbtmp while df inside the CT shows free space, the host's LVM thin pool is full. The CT's disk is thin-provisioned; writes fail once the pool has no free space.

Check on the Proxmox host (e.g. r630-01):

```shell
lvs -o lv_name,data_percent,metadata_percent  # data at 100% = pool full
```

Fix: Free space in the thin pool on that host:

  - Remove or shrink unused CT/VM disks, or move VMs to another storage.
  - Optionally expand the thin pool (add a PV or resize).
  - After freeing space, restart the affected service: `pct exec <vmid> -- systemctl restart besu-rpc` (or besu).

Until the pool has free space, Besu on 2101 (and any other CT on that host that does large writes) will keep failing with "No space left on device".
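
A hedged triage sequence for the full-pool case, printed rather than executed (default Proxmox "pve" VG layout assumed; the fstrim loop mirrors the 2026-02-15 action recorded in this section):

```shell
#!/usr/bin/env bash
# Print thin-pool triage commands in the order worth running on the host.
set -u

TRIAGE=(
  "lvs -o lv_name,lv_size,data_percent,metadata_percent --units g"  # pool data% at 100 = full
  "lvs --sort -lv_size | head -20"                                  # biggest LVs: move/destroy candidates
  "pct list"                                                        # map vm-<vmid>-disk-* LVs back to CTs
  "for v in \$(pct list | awk 'NR>1 && \$2==\"running\"{print \$1}'); do pct exec \$v -- fstrim -v /; done"
)

printf '%s\n' "${TRIAGE[@]}"
```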

2026-02-15 actions on r630-01: ran fstrim in all running CTs (pool 100% → 98.33%). Destroyed six stopped CTs to free thin pool space: 106, 107, 108, 10000, 10001, 10020 (with purge). Migrated 5200-5202, 6000-6002, 6400-6402, and 5700 to r630-02. Pool now at 74.48%. If 2101 still crash-loops during RocksDB compaction, retry systemctl restart besu-rpc or start Besu with a fresh /data/besu (resync). See MIGRATE_CT_R630_01_TO_R630_02.md.

## Re-run E2E after fixes

```shell
./scripts/verify/verify-end-to-end-routing.sh
```

Report: docs/04-configuration/verification-evidence/e2e-verification-<timestamp>/verification_report.md.

To allow exit 0 when only 502s remain (e.g. in CI):

```shell
E2E_ACCEPT_502_INTERNAL=1 ./scripts/verify/verify-end-to-end-routing.sh
```

See also: NEXT_STEPS_FOR_YOU.md §3 (LAN steps), STEPS_FROM_PROXMOX_OR_LAN_WITH_SECRETS.md §3 (fix 502s), NEXT_STEPS_OPERATOR.md (quick commands).