# 502 Deep Dive: Root Causes and Fixes

Last updated: 2026-02-14

This document maps each E2E 502 to its backend, root cause, and fix. Use from LAN with SSH to Proxmox.

## Full maintenance (all RPC + 502 in one run)

From project root on LAN (SSH to r630-01, ml110, r630-02):

```shell
./scripts/maintenance/run-all-maintenance-via-proxmox-ssh.sh --e2e
```

This runs, in order: (0) make RPC VMIDs 2101 and 2500-2505 writable (e2fsck); (1) resolve-and-fix (Dev VM IP, start containers, DBIS); (2) fix 2101 via JNA reinstall; (3) install Besu on missing nodes (2500-2505, 1505-1508); (4) address-all-502s (backends + NPM + RPC diagnostics); (5) E2E verification. Use --verbose to see all step output; set STEP2_TIMEOUT=0 to disable the step-2 timeout. See MAINTENANCE_SCRIPTS_REVIEW.md and CHECK_ALL_UPDATES_AND_CLOUDFLARE_TUNNELS.md §9.
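
The step order above can be sketched as a small echo-only wrapper. This is a hedged illustration, not the runner itself: the step-1 script path (`resolve-and-fix.sh`) is an assumption; the other paths are the ones named elsewhere in this document, and the real runner wires these steps internally.

```shell
#!/usr/bin/env bash
# Echo-only sketch of the full-maintenance step order; set DRY_RUN=0 to execute.
# The step-1 path is hypothetical -- the real runner chains these internally.
set -euo pipefail
DRY_RUN="${DRY_RUN:-1}"

STEPS=(
  "./scripts/maintenance/make-rpc-vmids-writable-via-ssh.sh"     # 0: e2fsck VMIDs 2101, 2500-2505
  "./scripts/maintenance/resolve-and-fix.sh"                     # 1: Dev VM IP, containers, DBIS (path assumed)
  "./scripts/maintenance/fix-rpc-2101-jna-reinstall.sh"          # 2: 2101 JNA reinstall
  "./scripts/besu/install-besu-permanent-on-missing-nodes.sh"    # 3: install Besu on 2500-2505, 1505-1508
  "./scripts/maintenance/address-all-remaining-502s.sh"          # 4: backends + NPM + RPC diagnostics
  "./scripts/verify/verify-end-to-end-routing.sh"                # 5: E2E verification
)

for step in "${STEPS[@]}"; do
  if [ "$DRY_RUN" = "1" ]; then
    echo "WOULD RUN: $step"   # dry-run: show the order without touching hosts
  else
    "$step"
  fi
done
```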

## Backend map (domain → IP:port → VMID, host)

| Domain(s) | Backend | VMID | Proxmox host | Service to start |
|---|---|---|---|---|
| dbis-admin.d-bis.org, secure.d-bis.org | 192.168.11.130:80 | 10130 | r630-01 (192.168.11.11) | nginx |
| dbis-api.d-bis.org, dbis-api-2.d-bis.org | 192.168.11.155:3000, .156:3000 | 10150, 10151 | r630-01 | node |
| rpc-http-prv.d-bis.org, rpc-ws-prv.d-bis.org | 192.168.11.211:8545/8546 | 2101 | r630-01 | besu |
| mim4u.org, www.mim4u.org, secure.mim4u.org, training.mim4u.org | 192.168.11.37:80 | 7810 | r630-02 (192.168.11.12) | nginx (or python stub in fix-all-502s-comprehensive.sh) |
| rpc-alltra*.d-bis.org (3) | 192.168.11.172/173/174:8545 | 2500, 2501, 2502 | r630-01 | besu |
| rpc-hybx*.d-bis.org (3) | 192.168.11.246/247/248:8545 | 2503, 2504, 2505 | r630-01 or ml110 | besu |
| cacti-1 (if proxied) | 192.168.11.80:80 | 5200 | r630-02 | nginx/apache2 |
| cacti-alltra.d-bis.org | 192.168.11.177:80 | 5201 | r630-02 | nginx/apache2 |
| cacti-hybx.d-bis.org | 192.168.11.251:80 | 5202 | r630-02 | nginx/apache2 |
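
Before starting anything, a quick probe of every backend in the map shows which targets actually answer. This is a sketch to run from a LAN host; the IP:port list is copied from the table above, and the loop is left commented so the script is safe to source anywhere.

```shell
#!/usr/bin/env bash
# Probe each backend from the map; prints "<ip:port> <http_code>" (000 = unreachable).
set -u
BACKENDS=(
  "192.168.11.130:80"    # 10130 dbis-admin / secure
  "192.168.11.155:3000"  # 10150 dbis-api
  "192.168.11.156:3000"  # 10151 dbis-api-2
  "192.168.11.211:8545"  # 2101  rpc-http-prv
  "192.168.11.37:80"     # 7810  mim4u
  "192.168.11.172:8545" "192.168.11.173:8545" "192.168.11.174:8545"  # 2500-2502 alltra
  "192.168.11.246:8545" "192.168.11.247:8545" "192.168.11.248:8545"  # 2503-2505 hybx
  "192.168.11.80:80" "192.168.11.177:80" "192.168.11.251:80"         # 5200-5202 cacti
)

probe() {  # probe <ip:port> -> prints target and HTTP status code
  local t="$1"
  printf '%s %s\n' "$t" "$(curl -s -o /dev/null -m 5 -w '%{http_code}' "http://$t/")"
}

# for t in "${BACKENDS[@]}"; do probe "$t"; done   # uncomment on the LAN
```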

## One-command: address all remaining 502s

From a host on the LAN (can reach NPMplus and Proxmox):

```shell
# Full flow: backends + NPMplus proxy update (if NPM_PASSWORD set) + RPC diagnostics
./scripts/maintenance/address-all-remaining-502s.sh

# Skip NPMplus update (e.g. no .env yet)
./scripts/maintenance/address-all-remaining-502s.sh --no-npm

# Also run Besu mass-fix (config + restart) and E2E at the end
./scripts/maintenance/address-all-remaining-502s.sh --run-besu-fix --e2e
```

This runs in order: (1) fix-all-502s-comprehensive.sh, (2) NPMplus proxy update when NPM_PASSWORD is set, (3) diagnose-rpc-502s.sh (saves report under docs/04-configuration/verification-evidence/), (4) optional fix-all-besu-nodes.sh, (5) optional E2E.

## Per-step diagnose and fix

From a host that can SSH to Proxmox (r630-01, r630-02, ml110):

```shell
# Comprehensive fix (DBIS 10130 Python, dbis-api, 2101, 2500-2505 Besu, Cacti Python)
./scripts/maintenance/fix-all-502s-comprehensive.sh

# RPC diagnostics only (2101, 2500-2505): ss -tlnp + journalctl, to file
./scripts/maintenance/diagnose-rpc-502s.sh | tee docs/04-configuration/verification-evidence/rpc-502-diagnostics.txt

# Diagnose only (no starts)
./scripts/maintenance/diagnose-and-fix-502s-via-ssh.sh --diagnose-only

# Apply fixes per-backend (start containers + nginx/node/besu)
./scripts/maintenance/diagnose-and-fix-502s-via-ssh.sh
```

The comprehensive fix script will:

- For each backend: SSH to the host, check `pct status <vmid>`, and start the container if it is stopped.
- If the container is running: curl from the host to the backend IP:port; on 000/failure, run `systemctl start nginx` / `node` / `besu` as appropriate and show in-CT `ss -tlnp`.
- HYBX (2503-2505): if ml110 has no such VMID, try r630-01.
- Cacti: VMID 5200 (cacti-1), 5201 (cacti-alltra), 5202 (cacti-hybx) on r630-02 (migrated 2026-02-15).
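
The per-backend logic above can be sketched as a shell function. The function and variable names here are illustrative, not the script's own, and every command is echoed rather than executed, so the sketch is safe to run and inspect.

```shell
#!/usr/bin/env bash
# Echo-only sketch of the per-backend check/start flow in the comprehensive fix.
set -u

fix_backend() {  # fix_backend <proxmox_host> <vmid> <ip:port> <service>
  local host="$1" vmid="$2" ip_port="$3" svc="$4"
  # 1. Start the container if pct reports it stopped.
  echo "ssh root@$host 'pct status $vmid | grep -q stopped && pct start $vmid'"
  # 2. Probe the backend from the Proxmox host; on 000/failure, start the
  #    service inside the CT and show what is listening.
  echo "ssh root@$host \"curl -s -o /dev/null -m 5 -w '%{http_code}' http://$ip_port/\""
  echo "ssh root@$host 'pct exec $vmid -- systemctl start $svc'"
  echo "ssh root@$host 'pct exec $vmid -- ss -tlnp'"
}

# Example row from the backend map (dbis-admin on r630-01):
fix_backend 192.168.11.11 10130 192.168.11.130:80 nginx
```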

## Root cause summary (typical)

| 502 | Typical cause | Fix |
|---|---|---|
| dbis-admin, secure | Container 10130 stopped or nginx not running | `pct start 10130` on r630-01; inside CT: `systemctl start nginx` |
| dbis-api, dbis-api-2 | Containers 10150/10151 stopped or Node app not running | `pct start` on r630-01; inside CT: `systemctl start node` |
| rpc-http-prv | Container 2101 stopped or Besu not listening on 8545 | `pct start 2101`; inside CT: `systemctl start besu` (allow 30-60s) |
| rpc-alltra*, rpc-hybx* | Containers 2500-2505 stopped or Besu not running | Same: `pct start <vmid>`; inside CT: `systemctl start besu` |
| cacti-alltra, cacti-hybx, cacti-1 | 5200/5201/5202 stopped or web server not running | On r630-02: `pct start 5200/5201/5202`; inside CT: `systemctl start nginx` or `apache2` |
| mim4u.org, www/secure/training.mim4u.org | Container 7810 stopped or nothing on port 80 | On r630-02: `pct start 7810`; inside CT: `systemctl start nginx` or run the python stub on 80 (see fix-all-502s-comprehensive.sh) |

## VMID 2400 (ThirdWeb RPC primary, 192.168.11.240)

Host: ml110 (192.168.11.10). Service: besu-rpc (config: /etc/besu/config-rpc-thirdweb.toml). Nginx on 443/80.

Intermittent RPC timeouts: if eth_chainId against :8545 sometimes fails, Besu may be hitting the Vert.x BlockedThreadChecker (a worker thread blocked >60s during heavy operations). Fix applied: in /etc/systemd/system/besu-rpc.service, BESU_OPTS was extended with -Dvertx.options.blockedThreadCheckInterval=120000 (120s) so occasional slow operations (e.g. trace, compaction) don't trigger warnings as quickly. Restart: `pct exec 2400 -- systemctl restart besu-rpc.service`. After a restart, Besu may run RocksDB compaction before binding 8545; allow 5-15 minutes, then re-check RPC. The config already has host-allowlist=["*"]. If the node is down, check `pct exec 2400 -- journalctl -u besu-rpc -n 30` (look for "Compacting database" or "JSON-RPC service started").
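
An equivalent way to carry the same setting is a systemd drop-in instead of editing the unit file in place. This is a sketch, not what was actually applied: the document says BESU_OPTS was edited in besu-rpc.service directly, so the `Environment=` drop-in form below is an assumption about how the unit consumes BESU_OPTS.

```shell
#!/usr/bin/env bash
# Sketch: carry the BlockedThreadChecker interval as a systemd drop-in.
# ASSUMPTION: besu-rpc.service reads BESU_OPTS from its environment.
set -u

DROPIN='[Service]
Environment="BESU_OPTS=-Dvertx.options.blockedThreadCheckInterval=120000"'

# On the Proxmox host (ml110), roughly:
#   pct exec 2400 -- mkdir -p /etc/systemd/system/besu-rpc.service.d
#   pct exec 2400 -- bash -c "cat > /etc/systemd/system/besu-rpc.service.d/vertx.conf" <<<"$DROPIN"
#   pct exec 2400 -- systemctl daemon-reload
#   pct exec 2400 -- systemctl restart besu-rpc.service

printf '%s\n' "$DROPIN"   # show the drop-in content
```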

## If 502 persists after running the script

  1. Backends verified in-container but public still 502 (dbis-admin, secure, dbis-api, dbis-api-2):
    The origin (76.53.10.36) routes by hostname. Refresh the NPMplus proxy targets from the LAN so the proxy forwards to .130:80 and .155/.156:3000:
    `NPM_PASSWORD=xxx ./scripts/nginx-proxy-manager/update-npmplus-proxy-hosts-api.sh`
    Then purge the Cloudflare cache for those hostnames if needed.

  2. From the Proxmox host (e.g. SSH to 192.168.11.11):

    - `pct exec <vmid> -- ss -tlnp` — see what is listening.
    - `pct exec <vmid> -- systemctl status nginx` (or node, besu) — check the unit name and errors.
  3. NPMplus must be able to reach the backend IP. From the NPMplus host: `curl -s -o /dev/null -w '%{http_code}' http://<backend_ip>:<port>/`.

  4. RPC (2101, 2500-2505): if Besu still does not respond after 90s:

    - Run ./scripts/maintenance/diagnose-rpc-502s.sh and check the report (or `pct exec <vmid> -- ss -tlnp` and `journalctl -u besu-rpc` / `besu`).
    - Fix config/nodekey/genesis per the journal errors.
    - Run ./scripts/besu/fix-all-besu-nodes.sh from the project root (optionally with --no-restart first to only fix configs), or use ./scripts/maintenance/address-all-remaining-502s.sh --run-besu-fix.

Known infrastructure causes and fixes:

  - 2101: if the journal shows NoClassDefFoundError: com.sun.jna.Native, "JNA/Udev", or "Read-only file system" for JNA/libjnidispatch, run from the project root (LAN):
    ./scripts/maintenance/fix-rpc-2101-jna-reinstall.sh — reinstalls Besu and points JNA at /data/besu/tmp. If the script exits with "Container … /tmp is not writable", make the CT writable (see "Read-only CT" below), then re-run.
    The fix script also sets p2p-host in /etc/besu/config-rpc.toml to 192.168.11.211 (RPC_CORE_1). If 2101 had p2p-host="192.168.11.250" (RPC_ALLTRA_1), other nodes would see the wrong advertised address; the correct node lists are in the repo under config/besu-node-lists/static-nodes.json and permissions-nodes.toml (2101 = .211).
  - 2500-2505: if the journal shows "Failed to locate executable /opt/besu/bin/besu", install Besu in each CT:
    ./scripts/besu/install-besu-permanent-on-missing-nodes.sh — installs Besu (23.10.3) in 1505-1508 and 2500-2505 where missing, deploys config/genesis/node lists, and enables and starts the service. Allow ~5-10 minutes per node. Use --dry-run to see which VMIDs would be updated. If the install fails with "Read-only file system", make the CT writable first.

## VMID 2101: checklist of causes

When 2101 (Core RPC at 192.168.11.211) is down or crash-looping, check in order:

| Cause | What to check | Fix |
|---|---|---|
| Read-only root (emergency_ro) | `pct exec 2101 -- mount \| grep 'on / '` — if ro or emergency_ro, root is read-only (e.g. after ext4 errors). | Run ./scripts/maintenance/make-rpc-vmids-writable-via-ssh.sh (stops 2101, e2fsck on host, starts CT). Or on the host: stop 2101, `e2fsck -f -y /dev/pve/vm-2101-disk-0`, start 2101. |
| Wrong p2p-host | `pct exec 2101 -- grep p2p-host /etc/besu/config-rpc.toml` — must be 192.168.11.211 (not .250). | Run ./scripts/maintenance/fix-rpc-2101-jna-reinstall.sh (it sets p2p-host to RPC_CORE_1). Or fix manually with sed on /etc/besu/config-rpc.toml. |
| Static / permissioned node lists | In the CT, /etc/besu/static-nodes.json and /etc/besu/permissions-nodes.toml should list 2101 as ...@192.168.11.211:30303. Repo copy: config/besu-node-lists/. | Deploy from the repo: the fix script copies static-nodes.json and permissions-nodes.toml when present. Or run ./scripts/deploy-besu-node-lists-to-all.sh. |
| No space / RocksDB compaction | Journal: "No space left on device" during "Compacting database". Host thin pool: `lvs` on r630-01. | Free the thin pool (see "LVM thin pool full" below). If root was emergency_ro, fix that first; then restart besu-rpc. Optionally start with a fresh /data/besu to resync. |
| JNA / Besu binary | Journal: NoClassDefFoundError: com.sun.jna.Native or missing /opt/besu/bin/besu. | Run ./scripts/maintenance/fix-rpc-2101-jna-reinstall.sh (reinstalls Besu in the CT). |

After any fix: `pct exec 2101 -- systemctl restart besu-rpc`, then wait ~60s and run `curl -s -X POST -H 'Content-Type: application/json' -d '{"jsonrpc":"2.0","method":"eth_chainId","params":[],"id":1}' http://192.168.11.211:8545/`.
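
Rather than a fixed 60s sleep, the post-restart check can poll until the RPC answers (useful because Besu may compact RocksDB before binding 8545). The helper below is illustrative, not from the repo.

```shell
#!/usr/bin/env bash
# wait_for_rpc <url> [timeout_s]: poll eth_chainId until it answers or time out.
set -u

wait_for_rpc() {
  local url="$1" timeout="${2:-90}" t=0 body
  while [ "$t" -lt "$timeout" ]; do
    body=$(curl -s -m 3 -X POST -H 'Content-Type: application/json' \
      -d '{"jsonrpc":"2.0","method":"eth_chainId","params":[],"id":1}' "$url" || true)
    case "$body" in
      *'"result"'*) echo "$body"; return 0 ;;   # got a chain id: node is up
    esac
    sleep 3; t=$((t + 3))
  done
  return 1   # never answered within the timeout
}

# wait_for_rpc http://192.168.11.211:8545/ 90 || echo "2101 still not answering"
```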

## Read-only CT (2101, 2500-2505)

If fix or install scripts fail with "Read-only file system" (e.g. when creating files in /root, /tmp, or /opt), the container's root (or key mounts) is read-only. Besu/JNA also needs a writable java.io.tmpdir (e.g. /data/besu/tmp); the install and fix scripts set that when they can write to the CT.

Make all RPC VMIDs writable in one go (from the project root, on the LAN):
./scripts/maintenance/make-rpc-vmids-writable-via-ssh.sh — SSHes to r630-01 and, for each of 2101 and 2500-2505, runs: stop CT, `e2fsck -f -y` on the rootfs LV, start CT. Then re-run the fix or install script. The full maintenance runner (run-all-maintenance-via-proxmox-ssh.sh) runs this step first automatically.
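
The stop → e2fsck → start cycle that script performs per VMID can be sketched as below. The LV path pattern `/dev/pve/vm-<vmid>-disk-0` is taken from the 2101 checklist above and is an assumption for the other VMIDs; the sketch only prints the commands, for pasting into a root shell on r630-01.

```shell
#!/usr/bin/env bash
# Print the per-VMID writable-fix commands (run the output on r630-01 as root).
# LV naming (/dev/pve/vm-<vmid>-disk-0) is assumed to match the 2101 example.
set -u

CMDS=""
for vmid in 2101 2500 2501 2502 2503 2504 2505; do
  CMDS+="pct stop $vmid
e2fsck -f -y /dev/pve/vm-$vmid-disk-0
pct start $vmid
"
done

printf '%s' "$CMDS"
```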

Make a single CT writable (from the Proxmox host):

  1. Check the mount: `pct exec <vmid> -- mount | grep 'on / '` — if you see ro, root is mounted read-only.
  2. Remount from inside (if allowed): `pct exec <vmid> -- mount -o remount,rw /`.
    If that fails (e.g. "Operation not permitted"), the CT may be running with a read-only rootfs by design.
  3. From the host: inspect the CT config with `pct config <vmid>`. If rootfs has an option making it read-only, remove or change it (Proxmox UI: CT → Hardware → Root disk; or `pct set <vmid> --rootfs <storage>:<size>` to recreate, only if you have a backup).
  4. Alternative: ensure at least /tmp and /opt are writable (e.g. bind-mount writable storage or a tmpfs for /tmp). Then re-run the fix/install script.

After the CT is writable, run ./scripts/maintenance/fix-rpc-2101-jna-reinstall.sh (2101) or ./scripts/besu/install-besu-permanent-on-missing-nodes.sh (2500-2505) again.

## LVM thin pool full (2101 / 2500-2505 "No space left on device")

If Besu fails with "No space left on device" on /data/besu/database/*.dbtmp while df inside the CT shows free space, the host's LVM thin pool is full. The CT's disk is thin-provisioned; writes fail once the pool has no free space.

Check on the Proxmox host (e.g. r630-01):

```shell
lvs -o lv_name,data_percent,metadata_percent  # data at 100% = pool full
```

Fix: Free space in the thin pool on that host:

  - Remove or shrink unused CT/VM disks, or move VMs to another storage.
  - Optionally expand the thin pool (add a PV or resize).
  - After freeing space, restart the affected service: `pct exec <vmid> -- systemctl restart besu-rpc` (or besu).

Until the pool has free space, Besu on 2101 (and any other CT on that host that does large writes) will keep failing with "No space left on device".
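
A hedged triage sequence for the full-pool case, printed rather than executed (default Proxmox "pve" VG layout assumed; the fstrim loop mirrors the 2026-02-15 action recorded in this section):

```shell
#!/usr/bin/env bash
# Print thin-pool triage commands in the order worth running on the host.
set -u

TRIAGE=(
  "lvs -o lv_name,lv_size,data_percent,metadata_percent --units g"  # pool data% at 100 = full
  "lvs --sort -lv_size | head -20"                                  # biggest LVs: move/destroy candidates
  "pct list"                                                        # map vm-<vmid>-disk-* LVs back to CTs
  "for v in \$(pct list | awk 'NR>1 && \$2==\"running\"{print \$1}'); do pct exec \$v -- fstrim -v /; done"
)

printf '%s\n' "${TRIAGE[@]}"
```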

2026-02-15 actions on r630-01: ran fstrim in all running CTs (pool 100% → 98.33%). Destroyed six stopped CTs to free thin pool space: 106, 107, 108, 10000, 10001, 10020 (with purge). Migrated 5200-5202, 6000-6002, 6400-6402, and 5700 to r630-02. Pool now at 74.48%. If 2101 still crash-loops during RocksDB compaction, retry systemctl restart besu-rpc or start Besu with a fresh /data/besu (resync). See MIGRATE_CT_R630_01_TO_R630_02.md.

## Re-run E2E after fixes

```shell
./scripts/verify/verify-end-to-end-routing.sh
```

Report: docs/04-configuration/verification-evidence/e2e-verification-<timestamp>/verification_report.md.

To allow exit 0 when only 502s remain (e.g. in CI):

```shell
E2E_ACCEPT_502_INTERNAL=1 ./scripts/verify/verify-end-to-end-routing.sh
```

See also: NEXT_STEPS_FOR_YOU.md §3 (LAN steps), STEPS_FROM_PROXMOX_OR_LAN_WITH_SECRETS.md §3 (fix 502s), NEXT_STEPS_OPERATOR.md (quick commands).