#454 · agent-ops
llmmsg-srv roster & PM resilience: pruneStale soft-offline, offline buffering, PM re-election
- Ref
- #454
- Project
- agent-ops
- Status
- inProgress
- Priority
- high
- Type
- feature
- Assigned
- llmmsg-srv-cc pm
- Created by
- —
- Created
- 2026-05-21T16:24:25.376Z
- Updated
- 2026-05-22T03:52:27.230Z
Sub-items (2/3 done · 67%)
| ref | title | status | priority | assignee |
|---|---|---|---|---|
| #455 | PM-liveness watchdog: cc-context-monitor detects a PM-less ARO, DMs Elazar | blocked | high | bin-whey-cc |
| #456 | chat-duo: render origin_aro-tagged messages in their ARO room, stop spawning a room per DM | done | high | coder-chatduo-cc |
| #462 | chat-duo side of always-a-PM (brainstorm items 1+7): drop client-side PM resolution + aro_config write, send to=pm:X, render resolved_pm in bubble, queued=true inline confirmation | done | normal | coder-chatduo-cc |
Questions
No questions.
Event log
-
wi cli
-
assigned to llmmsg-srv-cc
-
Consolidates the converged brainstorm (nw-venus-cc, coder-chatduo-cc, bin-whey-cc; aro:llmmsg-srv-engineering, 2026-05-21). Greenlit by Elazar 2026-05-21. Design note: /gdrive/llmmsg-srv-roster-resilience-design.md
Root cause (source-grounded, coder-chatduo-cc read hub.mjs): the roster IS a persistent v2.sqlite table (hub.mjs:52) and survives restarts. The culprit is pruneStale() (hub.mjs:249), which hard-DELETEs stale agents and their ARO memberships (stmtDeleteRosterRow hub.mjs:179; hub.mjs:246). A stale agent's identity is erased, not just marked offline.
Scope - 6 items:
1. [hub] pruneStale: soft-offline (keep row + ARO memberships + PM role + offline flag), not hard-DELETE. LINCHPIN.
2. [hub] send buffers for any agent with a roster row, incl. soft-offline; 400 only for a name that never had a row.
3. [hub] Sweeper soft-offlining a PM triggers PM re-election (today auto-election fires only on aro_leave, which a pruned agent never sends).
4. [hub] Explicit decommission/unregister that genuinely DELETEs - roster growth control.
5. [MCP shim] MCP-client auto re-register: cache name+AROs, transparently re-register + re-join on a not_registered error. Complements item 1.
6. [bin-whey-cc] PM-liveness watchdog in cc-context-monitor.sh - detect a PM-less ARO, DM Elazar. Tracked as child WI 455.
Lanes: items 1-5 = llmmsg-srv-cc/ca; item 6 = bin-whey-cc (WI 455).
Dependency: item 1 is prerequisite for 2 and 3.
Sequence: 1 -> (2,3) -> 4; 5 and 6 parallel.
First task: manual relaunch of llmmsg-srv-cc - every hub/MCP item needs it and it is currently the stale/pruned agent; its first task is item 1.
Authored by nw-venus-cc; greenlit by Elazar 2026-05-21; filed by nw-whey-cc.
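
A minimal sketch of scope items 1 and 2, written in Python with the standard sqlite3 module rather than the hub's JavaScript. The table layout, the `offline` column, and the names `init_db`, `prune_stale`, and `can_buffer` are assumptions for illustration only; none of them are taken from hub.mjs.

```python
import sqlite3


def init_db(conn):
    # Hypothetical schema: one roster row per agent, one row per ARO membership.
    conn.executescript("""
        CREATE TABLE roster (
            name      TEXT PRIMARY KEY,
            last_seen REAL NOT NULL,
            offline   INTEGER NOT NULL DEFAULT 0
        );
        CREATE TABLE aro_members (
            aro   TEXT NOT NULL,
            agent TEXT NOT NULL,
            PRIMARY KEY (aro, agent)
        );
    """)


def prune_stale(conn, now, max_age):
    """Item 1: flag stale agents as offline instead of deleting them.

    Roster rows and ARO memberships are left untouched. Returns the names
    that were newly flagged, so a caller could trigger PM re-election.
    """
    cutoff = now - max_age
    rows = conn.execute(
        "SELECT name FROM roster WHERE offline = 0 AND last_seen < ? ORDER BY name",
        (cutoff,),
    ).fetchall()
    names = [row[0] for row in rows]
    conn.executemany(
        "UPDATE roster SET offline = 1 WHERE name = ?",
        [(name,) for name in names],
    )
    conn.commit()
    return names


def can_buffer(conn, name):
    """Item 2: buffer for any agent with a roster row, online or soft-offline.

    A False result corresponds to the 'never had a row' case (the 400).
    """
    row = conn.execute("SELECT 1 FROM roster WHERE name = ?", (name,)).fetchone()
    return row is not None
```

Because the sweep only updates a flag, a second sweep over the same data flags nothing new, and a soft-offline agent still resolves as a valid send target.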
-
Scope update (nw-venus-cc, 2026-05-21 13:27): added item 7 - [hub] ARO->PM HTTP endpoint. Add an HTTP route (extend /roster, or a focused /aro_pm, or HTTP /aro_list) returning per-ARO the PM agent + its liveness (online/soft-offline). Needed because cc-context-monitor.sh is curl-only and aro_list is MCP-only (404 over HTTP). Item 7 is hub lane (llmmsg-srv-cc/ca). Parent WI 454 now carries items 1-5 + 7; item 6 = child WI 455. Item 6 (watchdog) depends on item 1 AND item 7 - it is NOT parallel (correction to the original note). Design note /gdrive/llmmsg-srv-roster-resilience-design.md confirmed on the gdrive remote, 6464 bytes.
-
Scope addition - item 8 (hub DM-routing policy), 2026-05-21 ~13:36, from the aro:llmmsg-srv-engineering brainstorm with Elazar (he reported 3 stray DM rooms in chat-duo from proxy-mba-w-cc / coder-chatduo-cc / researcher-ha-cc; 12h of prose 'fixes' failed).
Item 8 [hub] - hub-enforced agent->Elazar DM policy. The hub must gate sends with to=elazar-the-user-human-llmmsg-srv: allow only a reply (re=) to an Elazar-originated message, or a genuine decision request; reject/reroute everything else (status, ack, unsolicited DM) to the sender's origin_aro. Prose routing rules in CLAUDE.md/llmmsg-protocol.md cannot bind autonomous agents - the hub is the only chokepoint that enforces. Reject-vs-reroute pending Elazar's call; default reroute (content not lost). Lane: llmmsg-srv-cc/ca.
Diagnostic finding (nw-whey-cc, aro_list check 13:36): nw-venus-cc's hypothesis that the DMs were a FALLBACK from agents dropped out of their AROs is DISPROVEN - proxy-mba-w-cc is in aro:mba-l, coder-chatduo-cc in aro:10-nightwatch + aro:llmmsg-srv-engineering, researcher-ha-cc in aro:ha. All three had a working ARO and DMed Elazar by choice. So WI 454 item 1 (soft-offline keeps memberships) does NOT address this symptom - item 8 is the actual and only fix.
Item 9 [chat-duo] - tracked as child WI 456: chat-duo must render any message carrying origin_aro inside that ARO room, not as a per-DM room. Movable now (coder-chatduo-cc online); hub item 8 is the bootstrap-blocked piece.
-
Item 7 note (bin-whey-cc finding, 2026-05-21): aro_list (MCP) ALREADY returns aros_pm - the ARO->PM map with PM names. Item 7 (ARO->PM HTTP endpoint) is therefore trivial - just expose the existing aros_pm over an HTTP route, no new computation.
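
A rough sketch of the payload such a route might return, in Python for illustration. It assumes `aros_pm` is a plain mapping of ARO name to PM name (or None) and that the set of soft-offline agents is available separately; the field names `pm` and `liveness` and both function names are hypothetical, not the hub's actual response shape.

```python
def aro_pm_payload(aros_pm, offline_agents):
    """Attach a liveness label to each ARO's PM without recomputing the map."""
    payload = {}
    for aro, pm in sorted(aros_pm.items()):
        if pm is None:
            payload[aro] = {"pm": None, "liveness": None}
        else:
            liveness = "soft-offline" if pm in offline_agents else "online"
            payload[aro] = {"pm": pm, "liveness": liveness}
    return payload


def pm_less_aros(payload):
    """What a curl-only watchdog could look for: AROs with no PM at all."""
    return [aro for aro, entry in payload.items() if entry["pm"] is None]
```

The second helper shows why the watchdog (WI 455) depends on this route: once the map is reachable over HTTP, detecting a PM-less ARO is a simple filter on the response.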
-
Item 8 (hub kind-gating of non-sanctioned agent->elazar DMs): Elazar decided REROUTE, not hard-reject. Spec locked: a non-sanctioned agent->elazar DM is rerouted to the sender's origin_aro. Implementer llmmsg-srv-cc to build as reroute.
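
The reroute rule can be sketched as a single routing function. This is an illustration in Python, not the hub's code: the `Message` shape, the `kind` value `"decision_request"`, and the `elazar_message_ids` lookup are assumptions, and the WI does not say what happens when a message has no origin_aro, so that branch is a placeholder.

```python
from dataclasses import dataclass, replace
from typing import Optional

ELAZAR = "elazar-the-user-human-llmmsg-srv"


@dataclass(frozen=True)
class Message:
    sender: str
    to: str
    origin_aro: Optional[str] = None
    re: Optional[str] = None      # id of the message being replied to, if any
    kind: str = "status"          # hypothetical kind tag used for gating


def route(msg, elazar_message_ids):
    """Return the message as it should be delivered.

    Sanctioned agent->Elazar DMs pass through; everything else addressed to
    Elazar is rerouted to the sender's origin_aro, so no content is lost.
    """
    if msg.to != ELAZAR:
        return msg
    is_reply_to_elazar = msg.re is not None and msg.re in elazar_message_ids
    if is_reply_to_elazar or msg.kind == "decision_request":
        return msg
    if msg.origin_aro is None:
        # Unspecified in the WI; delivered unchanged here as a placeholder.
        return msg
    return replace(msg, to=msg.origin_aro)
```

Keeping the check in one place reflects the point made in the item 8 note: only the hub sees every send, so only the hub can enforce the policy.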
-
Item-7 (always-a-PM railguard) ratified design folded into this WI. "Item 7" here follows the aro:chatduo-engineering brainstorm numbering (see WI 462); it is not scope item 7, the ARO->PM HTTP endpoint. Hub contract: new to=pm:<aro> send target - the hub resolves aro_config.pm_agent, elects the most-recently-active non-spectator if it is unset/stale, queues per-ARO and flushes on join if there is no eligible member, and never errors. /send returns resolved_pm or queued=true. Election also runs as a membership invariant on aro_join/aro_leave. chat-duo drops client-side PM resolution (gui.py:1192/1193 raises) and sends to=pm:X. Source: aro:chatduo-engineering brainstorm 2026-05-22, PM llmmsg-srv-cc + coder-chatduo-cc.
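
The contract can be sketched in memory as follows. This is a Python illustration under stated assumptions, not the hub implementation: `Member`, `Aro`, and the function names are hypothetical, "stale" is modeled as a simple `online` flag, and flushing only counts the queued messages instead of delivering them.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Member:
    name: str
    last_active: float
    spectator: bool = False
    online: bool = True


@dataclass
class Aro:
    members: Dict[str, Member] = field(default_factory=dict)
    pm_agent: Optional[str] = None
    pending: List[str] = field(default_factory=list)


def elect_pm(aro):
    """Pick the most recently active online non-spectator, or None."""
    eligible = [m for m in aro.members.values() if m.online and not m.spectator]
    aro.pm_agent = max(eligible, key=lambda m: m.last_active).name if eligible else None
    return aro.pm_agent


def resolve_pm(aro):
    """Keep the configured PM if still eligible, otherwise elect a new one."""
    pm = aro.members.get(aro.pm_agent) if aro.pm_agent else None
    if pm is None or not pm.online or pm.spectator:
        return elect_pm(aro)
    return aro.pm_agent


def send_to_pm(aro, body):
    """to=pm:<aro>: never errors; either resolves a PM or queues the message."""
    pm = resolve_pm(aro)
    if pm is None:
        aro.pending.append(body)
        return {"queued": True}
    return {"resolved_pm": pm}


def aro_join(aro, member):
    """Election runs on join; queued messages flush once a PM exists."""
    aro.members[member.name] = member
    pm = resolve_pm(aro)
    flushed = 0
    if pm is not None:
        flushed = len(aro.pending)
        aro.pending.clear()  # a real hub would deliver these to the PM
    return {"pm": pm, "flushed": flushed}


def aro_leave(aro, name):
    """Election runs on leave as well, so a departing PM is replaced."""
    aro.members.pop(name, None)
    return {"pm": resolve_pm(aro)}
```

In this sketch a send to an empty ARO is queued, the first eligible member to join becomes PM and receives the backlog, and the last member leaving leaves the ARO with no PM.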
-
assigned to llmmsg-srv-cc
-
Starting hub to=pm:X contract (item-7 acceptance spec): new send target, election-on-send, membership-invariant election, per-ARO pending queue + flush-on-join, resolved_pm/queued response fields.
-
Item-7 hub side IMPLEMENTED + tested live on whey (hub v2.4.1). Added: to=pm:X send target, electPm/resolvePm/flushPmQueue, pm_pending_queue table, election-on-aro_join (+flush) and re-election-on-aro_leave. 6/6 live tests pass: queued path (no member -> queued:true + row), join elects PM + flushes queue (flushed:1), resolved_pm returned on populated ARO, leave-by-PM re-elects (empty -> pm:null). Deployed (hub restarted), NOT yet committed. chat-duo side = WI 462.
-
Item-7 always-a-PM railguard SHIPPED in commit 5741b17 (hub.mjs + init-db.sh), live + 6/6 tested. WI stays open for remaining resilience scope: pruneStale soft-offline, offline buffering.