OPS-18 monitor peer-watchdog: roster-presence→probe-age (post-wave-1 prune-churn fix)

Ref: OPS-18 (#1085)
Project: agent-ops
Status: done
Priority: normal
Type: task
Assigned: nw-whey-cc
Created by: wi-cli-whey
Created: 2026-06-15T05:18:51.645Z
Updated: 2026-06-15T08:29:39.096Z
Closed: 2026-06-15T08:29:39.096Z

Sub-items

No sub-items.

Questions

No questions.

Event log

2026-06-15T05:18:51.645Z · created · wi-cli-whey
2026-06-15T05:19:02.245Z · note · wi-cli-whey

Post-wave-1 (hub-t cutover) regression: the cc-context-monitor peer-watchdog false-churns DOWN->RECOVERED each cycle on single-unit hosts (venus/lezama).

Mechanism (hub-llmmsgsrv, authoritative): monitors register with pinned=0, and the hub's STALE_TTL_S=600s is shorter than the monitor timer (~15min), so the roster row is pruned ~5min before the next run and the 'absent from roster = DOWN' check fires. whey survives only because its drain unit polls frequently. host_probes is NEVER pruned.

REJECTED fixes:
(1) Pin monitors with pinned=1 -> row never prunes -> a dead monitor stays present forever -> down-detection breaks (false negatives on the safety check).
(3) Raise STALE_TTL_S -> it is GLOBAL, so it pollutes /online and prune semantics fleet-wide.

ADOPTED fix (option 2): peer-watchdog reads probe-AGE, not roster-PRESENCE. The hub already serves it: GET /fleet_health (per-host green/yellow/red from probe-age tripwires) or GET /probes_latest (per-host minutes_ago; host_probes never prunes). Flag a peer DOWN only if its latest probe age exceeds ~2x the monitor interval (~35min). This reuses the hub's existing liveness logic; NO hub change needed.

Lane: cc-context-monitor.sh change (nw-whey author / bin-whey sh.git owner). NOT a wave-3 blocker and not urgent (benign churn already suppressed by nw-whey).

Refs: nw-whey-cc-mqerfgvgn5xn, hub-llmmsgsrv-cc-mqeridphrgnd, nw-venus-cc-mqerizljgsrj.
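Illustrative sketch of the probe-age decision described in the note, not the committed v4.17 code. The function name peer_status, the DOWN_AFTER_MIN variable, and the UNKNOWN state are hypothetical; it assumes the caller has already extracted an integer minutes_ago for the peer from GET /probes_latest (the response shape is not given in this ticket, so no parsing is shown).

```shell
#!/bin/sh
# Hypothetical sketch: names are illustrative, not taken from cc-context-monitor.sh.

# The note gives a ~15min monitor timer and a DOWN threshold of ~35min
# (roughly 2x the interval). Overridable via the environment (assumption).
DOWN_AFTER_MIN=${DOWN_AFTER_MIN:-35}

# peer_status HOST MINUTES_AGO
# Prints "HOST DOWN" when the latest probe is older than the threshold,
# "HOST UNKNOWN" when no whole-number age was supplied, otherwise "HOST UP".
# Roster presence is never consulted, so a pruned roster row cannot
# produce a false DOWN.
peer_status() {
    host=$1
    age=$2
    case $age in
        ''|*[!0-9]*)
            # Missing or non-integer age: report it rather than guess.
            echo "$host UNKNOWN"
            return 0
            ;;
    esac
    if [ "$age" -gt "$DOWN_AFTER_MIN" ]; then
        echo "$host DOWN"
    else
        echo "$host UP"
    fi
}
```

With the default threshold, `peer_status venus 12` prints `venus UP` and `peer_status lezama 50` prints `lezama DOWN`. A probe aged between the 600s roster TTL and the threshold still reads as UP, which is the window where the roster-presence check used to false-fire.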
2026-06-15T08:29:39.096Z · completed · wi-cli-whey

check_peer probe-age fix committed in cc-context-monitor v4.17 (56ebb7c sh.git). No more false DOWN on roster-prune between oneshot fires.