#634 · llmmsg-srv
Durable GC for stale bridge registrations (TTL prune) - no more retired-codex corpses in registrations.json
- Ref
- #634
- Project
- llmmsg-srv
- Status
- deferred
- Priority
- normal
- Type
- task
- Assigned
- coder-llmmsgsrv-cc coder
- Created by
- —
- Created
- 2026-06-04T15:40:33.051Z
- Updated
- 2026-06-15T08:39:09.338Z
Questions
No questions.
Event log
-
wi cli
-
Ties to WI 471 (retired llmmsg-srv-ca) + 572 (codex ca= revival). Today's one-shot: purged the dead llmmsg-srv-ca key (30.5d) from venus:/opt/llmmsg-srv/bridge/registrations.json (now {}, backup .bak-20260604-154018 kept). Whey's copy was already trashed; lezama unchecked. Root gap: NOTHING GCs bridge registrations - every retired/dead codex thread leaves a corpse that re-prints in llmlist forever. Durable fix (home = bridge or scripts/llmmsg-purge-ghosts.mjs, NOT llmlist - a lister must not mutate): prune registrations.json entries whose lastActivityAt exceeds TTL (default 7d, env-tunable). Wire it as: (a) bridge.mjs prune-on-startup + periodic sweep when the bridge is revived under #572, AND/OR (b) extend scripts/llmmsg-purge-ghosts.mjs (already GCs roster ghosts) to also sweep bridge registrations so it works while the bridge is dormant. Validated TTL logic already (the venus one-shot used the same >7d filter, preserves live keys). Low priority - bridge is dormant; do it as part of the #572 codex revival or sooner if corpses re-accumulate.
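The TTL filter described in this entry could look roughly like the sketch below. It is not the code used for the venus one-shot. The function names, the `REG_TTL_DAYS` env var name, and the assumption that registrations.json is an object keyed by thread with an ISO-8601 `lastActivityAt` string are all illustrative.

```javascript
// Sketch only: names and the registrations shape are assumptions,
// not taken from bridge.mjs or llmmsg-purge-ghosts.mjs.
const DAY_MS = 24 * 60 * 60 * 1000;

// Read the TTL from an env-like object; REG_TTL_DAYS is a hypothetical name.
// Anything missing, non-numeric, or non-positive falls back to the 7d default.
function ttlMsFromEnv(env) {
  const days = Number(env.REG_TTL_DAYS);
  return (Number.isFinite(days) && days > 0 ? days : 7) * DAY_MS;
}

// Split registrations into kept entries and dropped keys.
// Assumes each entry carries an ISO-8601 lastActivityAt string.
function pruneByTtl(registrations, nowMs, ttlMs) {
  const kept = {};
  const dropped = [];
  for (const [key, entry] of Object.entries(registrations)) {
    const last = Date.parse(entry?.lastActivityAt);
    // A missing or unparseable timestamp is kept: never delete on doubt.
    if (Number.isFinite(last) && nowMs - last > ttlMs) {
      dropped.push(key);
    } else {
      kept[key] = entry;
    }
  }
  return { kept, dropped };
}
```

Returning the dropped keys separately lets a caller log or dry-run the result before anything is written back to disk.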
-
STRENGTHENED per Elazar (verbatim intent): 'if there is no -ca agent live, no -ca agents in the files' - registry must reflect reality; a codex/-ca thread not running leaves ZERO entry, promptly, not after a week. TTL alone is insufficient. 3-tier design (owner hub-llmmsgsrv-cc; home split below):
1. PRIMARY - prune-on-exit (in bridge.mjs): the bridge removes a thread's registrations.json entry the moment that codex App-Server thread exits/disconnects. Event-accurate: agent gone -> row gone immediately during normal operation. Lands with the #572 codex revival (bridge is dormant now).
2. PRIMARY - startup reconcile (in bridge.mjs): on bridge start, drop any registration whose thread isn't re-established (covers a crash that skipped prune-on-exit).
3. SECONDARY - liveness sweep (in scripts/llmmsg-purge-ghosts.mjs, already GCs roster ghosts, runs host-side independent of bridge): for each registration, if no live codex process/thread backs it, drop it. GUARD against revival races: only purge when non-liveness is positively confirmed (bridge reports thread unknown, OR bridge down AND entry past a short grace ~5-10min) - NOT a coarse 'bridge is up?' check that could delete an entry a thread just wrote.
4. FALLBACK - TTL (>7d, env-tunable): last-resort backstop for anything liveness can't classify.
NOT in llmlist (lister must not mutate). Today's one-shot already purged the existing corpses fleet-wide (venus key dropped, whey/lezama clean). The liveness sweep (3) CAN ship independent of #572 via purge-ghosts.mjs; prune-on-exit (1/2) is coupled to the bridge being live (#572). Coordinate with WI #572.
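A minimal sketch of tiers 1-2 and the tier-3 revival-race guard, written as pure functions over the registrations object. The function names, the argument shapes, and the idea that the bridge can report whether it knows a thread are assumptions for illustration; none of this is taken from bridge.mjs.

```javascript
// Sketch only: all names below are hypothetical.

// Tier 1 (prune-on-exit): drop one thread's entry the moment it exits.
// Returns a new object; the input is left untouched.
function pruneOnExit(registrations, threadKey) {
  const { [threadKey]: _gone, ...rest } = registrations;
  return rest;
}

// Tier 2 (startup reconcile): on bridge start, keep only the entries
// whose thread was actually re-established.
function reconcileOnStartup(registrations, reestablishedKeys) {
  const live = new Set(reestablishedKeys);
  return Object.fromEntries(
    Object.entries(registrations).filter(([key]) => live.has(key))
  );
}

// Tier 3 guard: purge only when non-liveness is positively confirmed.
//  - bridge up:   only if the bridge explicitly reports the thread unknown
//  - bridge down: only once the entry is older than the grace window
function mayPurge({ bridgeUp, bridgeKnowsThread, ageMs, graceMs }) {
  if (bridgeUp) return bridgeKnowsThread === false;
  return ageMs > graceMs;
}
```

The strict `=== false` comparison is deliberate: a missing or undefined answer from the bridge is treated as "not confirmed", so an entry that a thread has just written is never deleted on a coarse bridge-is-up check.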
-
Priority intent: Elazar wants prompt reflection, not weekly - treat as pri 2 coupled to #572, but the purge-ghosts liveness sweep (tier 3) may land sooner as a standing safety net.
-
Tier-3 LANDED: standing host-side liveness net live + recurring on venus. purge-ghosts.mjs --bridge (VERSION 2.0.0, 3312d1e) + .service/.timer (8084b1e/37a2941), synced venus. FOLD1 bridge-up=report-only/bridge-down=atomic-surgery (verified: dead-ca dropped, fresh-ca grace-kept, {} no-op). FOLD2 timer-follows-bridge-host documented (redeploy where #572 lands app-server). Ladder: 10min grace -> ps-attached-CLI -> app-server threads -> 7d TTL. Timer OnCalendar=*:0/5 lingering, auto-fired Result=success. Tiers 1-2 (prune-on-exit + startup-reconcile) remain coupled to #572.
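One possible reading of the ladder and the FOLD1 bridge-up/bridge-down split, as a sketch. This is not the shipped purge-ghosts.mjs code. The probe shape (true = live, false = confirmed dead, null = could not tell), the function names, the ordering of the final two rungs, and the ISO-8601 `lastActivityAt` field are all assumptions.

```javascript
// Sketch only: one interpretation of
// "10min grace -> ps-attached-CLI -> app-server threads -> 7d TTL".
const GRACE_MS = 10 * 60 * 1000;
const TTL_MS = 7 * 24 * 60 * 60 * 1000;

// probes: { cliAttached, appServerThread }, each true | false | null.
function classify(entry, probes, nowMs) {
  const last = Date.parse(entry?.lastActivityAt);
  if (!Number.isFinite(last)) return { action: 'keep', reason: 'no-timestamp' };
  const ageMs = nowMs - last;
  if (ageMs <= GRACE_MS) return { action: 'keep', reason: 'grace' };
  if (probes.cliAttached === true) return { action: 'keep', reason: 'cli-attached' };
  if (probes.appServerThread === true) return { action: 'keep', reason: 'app-server-thread' };
  if (probes.cliAttached === false && probes.appServerThread === false) {
    return { action: 'drop', reason: 'confirmed-dead' };
  }
  // Liveness could not be classified: fall back to the TTL backstop.
  if (ageMs > TTL_MS) return { action: 'drop', reason: 'ttl' };
  return { action: 'keep', reason: 'unclassified' };
}

// Bridge up: report only, registrations pass through unchanged.
// Bridge down: apply the verdicts and return the pruned object.
function sweep(registrations, probeFor, { bridgeUp, nowMs }) {
  const report = [];
  const next = {};
  for (const [key, entry] of Object.entries(registrations)) {
    const verdict = classify(entry, probeFor(key), nowMs);
    report.push({ key, ...verdict });
    if (bridgeUp || verdict.action === 'keep') next[key] = entry;
  }
  return { mode: bridgeUp ? 'report-only' : 'apply', report, next };
}
```

In apply mode a caller would presumably persist `next` by writing a temporary file and renaming it over registrations.json, which is one common meaning of "atomic surgery"; that detail is a guess, not something the log confirms.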
-
Tier-3 done + recurring; tiers 1-2 deferred pending #572 (ca.sh codex revival - bridge prune-on-exit/startup-reconcile only meaningful once the app-server is back).
-
coder-llmmsgsrv-cc / coder