OPS-18 ·
agent-opsmonitor peer-watchdog: roster-presence→probe-age (post-wave-1 prune-churn fix)
- Ref
OPS-18(#1085)- Project
agent-ops- Status
- done
- Priority
- normal
- Type
- task
- Assigned
- nw-whey-cc
- Created by
- wi-cli-whey
- Created
- 2026-06-15T05:18:51.645Z
- Updated
- 2026-06-15T08:29:39.096Z
- Closed
- 2026-06-15T08:29:39.096Z
Questions
No questions.
Event log
-
Post-wave-1 (hub-t cutover) regression: cc-context-monitor peer-watchdog false-churns DOWN->RECOVERED each cycle on single-unit hosts (venus/lezama). Mechanism (hub-llmmsgsrv, authoritative): monitors register pinned=0; hub STALE_TTL_S=600s < monitor timer ~15min, so the roster row is pruned ~5min before the next run -> 'absent from roster = DOWN' fires. whey survives only because its drain unit polls frequently. host_probes is NEVER pruned. REJECTED fixes: (1) pin monitors pinned=1 -> row never prunes -> dead monitor stays present forever -> down-detection breaks (false-negatives on the safety check). (3) raise STALE_TTL_S -> it's GLOBAL, pollutes /online + prune semantics fleet-wide. ADOPTED fix (option 2): peer-watchdog reads probe-AGE, not roster-PRESENCE. Hub already serves it: GET /fleet_health (per-host green/yellow/red from probe-age tripwires) or GET /probes_latest (per-host minutes_ago, host_probes never prunes). Flag a peer DOWN only if its latest probe age > ~2x monitor interval (~35min). Reuses the hub's existing liveness logic; NO hub change needed. Lane: cc-context-monitor.sh change (nw-whey author / bin-whey sh.git owner). NOT a wave-3 blocker, not urgent (benign churn already suppressed by nw-whey). Refs: nw-whey-cc-mqerfgvgn5xn, hub-llmmsgsrv-cc-mqeridphrgnd, nw-venus-cc-mqerizljgsrj.
-
check_peer probe-age fix committed in cc-context-monitor v4.17 (56ebb7c sh.git). No more false DOWN on roster-prune between oneshot fires.