#588 lezama hub tunnel stale-bind restart thrash (llmmsg-lezama-tunnel.service, counter 3124)

#588 · llmmsg-srv

lezama hub tunnel stale-bind restart thrash (llmmsg-lezama-tunnel.service, counter 3124)

Ref: #588 (#588)
Project: llmmsg-srv
Status: done
Priority: high
Type: bug
Assigned: pm-llmmsgsrv-cc
Created by: —
Created: 2026-05-29T09:36:24.886Z
Updated: 2026-06-06T07:00:43.990Z
Closed: 2026-06-06T07:00:43.984Z

Sub-items

No sub-items.

+ Add sub-item

Questions

No questions.

Event log

2026-05-29T09:36:24.886Z · created

wi cli
2026-05-29T09:36:34.338Z · progress

ROOT CAUSE (pm diag 2026-05-29, Elazar hit intermittent hub timeout from lezama): lezama's ONLY hub path = whey->lezama ssh -R reverse tunnel (llmmsg-lezama-tunnel.service, user unit on whey, one-hub cutover 2026-05-21). Verified lezama has NO alt route: ZeroTier/public DNS llmmsg-hub.pensanta.com->172.27.178.27 GCABA-blocked (ping+curl timeout); whey ppp0 10.105.0.8 unreachable from lezama (100% loss). Tunnel thrashes: when ssh -R drops, lezama sshd keeps :9703 bound (orphan), restart can't rebind -> 'remote port forwarding failed' exit 255 -> Restart=always/RestartSec=5 loop, counter 3124, until orphan TCP-times-out (minutes) = the outage window. NOTE: lezama-watchdog.sh comment claims '9703 forward dropped 2026-05-26, lezama uses public DNS' - that's WRONG, public DNS is GCABA-unreachable; the tunnel unit is the real+only path and was never actually retired. FIX 2-part: (1) lezama sshd ClientAliveInterval 15/CountMax 3 (GCABA root, Elazar GO pending) so orphan reaped in ~45s; (2) whey unit ExecStartPre stale-bind clear + RestartSec backoff (hub-llmmsgsrv-cc). Parent epic #531.
2026-05-29T09:44:12.491Z · decision

FINDING (hub-llmmsgsrv-cc): the lezama:9703 orphan listener is a ROOT-OWNED sshd privsep child - rob cannot discover its pid without sudo (lezama sudo needs password, no NOPASSWD), so a no-root whey-side ExecStartPre fuser/kill is IMPOSSIBLE. Therefore the lezama sshd ClientAliveInterval cure is MANDATORY (the only thing that frees the port), not optional; whey-unit changes can only reduce thrash rate. APPROVED + applying: RestartSec 5->20, StartLimitIntervalSec=300/Burst=20 (never-trips = stays self-recovering, no failed-state since nothing re-arms this user unit), drop misleading keepalive comment. Unit gets a repo home: /opt/llmmsg-srv/scripts/llmmsg-lezama-tunnel.service. Also authorized hub to fix the stale lezama-watchdog.sh L14-15 comment (the '9703 dropped/use public DNS' line that caused this), comment-only. STILL OPEN: Elazar GO on lezama sshd ClientAliveInterval 15/CountMax 3 (GCABA root) = the real cure.
2026-05-29T09:46:31.863Z · progress

WHEY-SIDE DONE (hub-llmmsgsrv-cc, commit cd048f5 verified: scripts/llmmsg-lezama-tunnel.service tracked + scripts/lezama-watchdog.sh stale-comment fix; unit active, NRestarts=0). This is a thrash-softener only (RestartSec 5->20). WI stays OPEN pending the actual cure: lezama sshd ClientAliveInterval 15/CountMax 3 (GCABA root) - Elazar GO requested with reload-safe one-liner; the root-owned-orphan finding makes it mandatory.
2026-06-05T09:08:52.759Z · note · wi-cli-whey

LIVE RECURRENCE 2026-06-05 ~06:00-06:06: zombie :9703 on lezama caused llmmsg-lezama-tunnel.service thrash (restart counter 108, 'remote port forwarding failed'). Whole lezama fleet fell off roster; Elazar saw pm-mba-l idle post-/compact and reported 'stuck again' (red herring - it was transport, not compact). Tunnel rebound 06:06:32 when zombie expired (luck, not intervention); pm-mba-l 2.2.0 shim self-recovered + drained stranded DM the instant transport returned. Assigned hub-llmmsgsrv-cc the permanent fix: lift the 9704 watchdog's zombie-kill (sudo kill PID holding :9703) into the 9703 tunnel path as ExecStartPre/wrapper, so it actively clears the stale bind instead of thrashing until ClientAlive expiry (GCABA-root, unreachable). Diff review before deploy.
2026-06-05T09:17:02.534Z · note · wi-cli-whey

RESOLUTION DIRECTION 2026-06-05: May-29 'root-owned unkillable orphan' finding CORRECTED - the :9703 forward socket is owned by rob's login session (uid 1000, ss -lntpe cgroup session-NNN.scope), not the root daemon. Two cures now in motion: (a) NO-ROOT: loginctl terminate-session <holder> - validated live (throwaway -R :19703, rc=0, port released, zero sudo); hub-llmmsgsrv-cc drafting whey-unit ExecStartPre (runs before own ssh -R binds = any holder is stale). (b) SYSTEMIC: lezama sshd ClientAliveInterval 15/CountMax 3 drop-in (reaps all dead sessions incl 9704 in ~45s) - nw-lezama-cc HAS the root pw, Elazar directed it to apply now. Plan: (b) primary systemic cure, (a) ships as defense-in-depth behind it (survives GCABA reimage, no GCABA dependency). No longer blocked on Elazar GCABA-GO - the agent holds the pw.
2026-06-05T09:19:31.185Z · note · wi-cli-whey

PRIMARY CURE LIVE 2026-06-05 06:18 (pm-llmmsgsrv-cc applied directly, sudo pw from ltm): /etc/ssh/sshd_config.d/99-clientalive.conf (ClientAliveInterval 15/CountMax 3) written on lezama, ssh reloaded, VERIFIED via sshd -T (clientaliveinterval 15, clientalivecountmax 3, service active, no lockout). lezama sshd now reaps dead sessions in ~45s -> orphaned :9703 releases fast, hub tunnel rebinds in ~45s vs multi-minute; also covers the 9704 chat-duo-web tunnel. The week-old 'Elazar GCABA-root GO' blocker is RESOLVED. Defense-in-depth (hub-llmmsgsrv-cc): scripts/llmmsg-lezama-zombie-sweep.sh (no-sudo loginctl terminate-session) + ExecStartPre in llmmsg-lezama-tunnel.service v2->v3 - committed to repo, deploy HELD to avoid churning recovered lezama (deploy in quiet window). #588 primary cure done; only the optional DiD restart-deploy remains.
2026-06-06T07:00:29.328Z · commented · wi-cli-whey

Whey-side DiD layer DEPLOYED + verified 2026-06-06: installed v3 llmmsg-lezama-tunnel.service with ExecStartPre=-llmmsg-lezama-zombie-sweep.sh (pings lezama, ss -lntpe finds the :9703-holding session, loginctl terminate-session no-root). ExecStartPre ran status=0 (non-fatal -prefix); tunnel healthy post-restart, lezama curls hub /health ok (v2.9.24, roster 39). Sweep correctly no-op'd on the clean restart (no orphan). Discovery live-confirmed read-only: session 1768 = rob/Remote/sshd, terminable no-root. NOTE: the previously-INSTALLED v2 unit carried a wrong comment ('holder root-owned, not no-root clearable'); live evidence + repo v3 disprove it - holder is uid:1000 session-scope. lezama sshd ClientAliveInterval drop-in (server-side layer) already in since Jun 5, so both DiD layers now live.
2026-06-06T07:00:43.990Z · completed · wi-cli-whey

Both DiD layers live: (1) lezama sshd ClientAliveInterval=15/CountMax=3 drop-in (server-side, reaps stale 9703+9704 binds ~45s, installed Jun 5 by nw-lezama-cc); (2) whey-side ExecStartPre zombie-sweep on llmmsg-lezama-tunnel.service v3 (deployed+verified 2026-06-06, terminates a stale :9703 session no-root before re-bind). Restart counter already fell 3124->749 under v2 backoff; v3+sshd now self-heal the stale bind that caused the thrash. Reopen if counter climbs again (watchdog #530 alerts cover it).