#589 ·
llmmsg-srvlezama split-brain: island-b hub (llmmsg-srv.service) auto-started, isolated lezama agents from whey hub
- Ref
#589(#589)- Project
llmmsg-srv- Status
- backlog
- Priority
- high
- Type
- bug
- Assigned
- pm-llmmsgsrv-cc
- Created by
- —
- Created
- 2026-05-29T17:42:05.626Z
- Updated
- 2026-05-29T17:45:58.072Z
Questions
No questions.
Event log
-
wi cli
-
INCIDENT 2026-05-29 14:39 (Elazar: 'most agents in lezama cant contact me'). ROOT CAUSE: a full llmmsg-srv hub is running ON lezama - systemd llmmsg-srv.service 'Island B - lezama', node /opt/llmmsg-srv/hub/hub.mjs pid 1403248, started ~11:52 (likely reboot auto-start), own DB /var/lib/llmmsg-srv/island-b.sqlite, health {version 2.9.0, island lezama, roster 11, messages 540}. SPLIT-BRAIN: lezama agents registered to island-b, not whey's hub, so their msgs never reached Elazar (his chat-duo handle is on whey hub). Confirmed by pm-mba-l-cc transcript: its roster shows 'only 11 agents, no elazar-*'. SECONDARY: island-b binds 0.0.0.0:9703 -> blocks the whey->lezama reverse tunnel (llmmsg-lezama-tunnel.service thrashing, NRestarts 232) from binding. This is DRIFT vs one-hub cutover 2026-05-21 (lezama = thin client, hosts NO hub). FIX dispatched to Elazar (GCABA root): 'sudo systemctl disable --now llmmsg-srv.service' - frees port, tunnel binds, lezama agents reconnect to whey hub. island-b.sqlite preserved. FOLLOWUP: find WHY the island unit was still enabled/started post-cutover (should have been disabled 2026-05-21); ensure it stays disabled across reboots. Relates #588 (tunnel).
-
RESOLVED 14:44 - Elazar ran 'sudo systemctl disable --now llmmsg-srv.service' on lezama (STOPPED-AND-DISABLED). Verified: whey->lezama tunnel bound (llmmsg-lezama-tunnel.service active, ssh -R 9703 pid live, NRestarts stopped at 243); lezama curl 127.0.0.1:9703 now returns whey hub (island 'main', v2.sqlite, 27ms tunnel hop, not 0.7ms local island-b). Lezama agents auto-reattaching as shims reconnect: coder-mba-l-cc + coderhelp01-mba-l back ONLINE on whey within seconds; rest follow on reconnect/next-prompt (bootstrap hook re-registers). Elazar confirmed 'it was like that' = he didn't start island-b; it auto-started ~11:52 (likely reboot). island-b.sqlite (540 msgs) preserved. REMAINING FOLLOWUP: (1) disable alone stops boot auto-start but consider 'systemctl mask llmmsg-srv.service' on lezama for bulletproof prevention; (2) determine WHY the unit was still enabled post-2026-05-21 cutover (leftover two-island install) + whether venus has the same latent unit; (3) confirm always-on nw-lezama-cc + cc-context-monitor-lezama re-register on whey.