#608 ·
agent-opslezama host-dark recurrence: root-cause + harden against full-dark (2nd occurrence; died 2026-06-01 18:03 ART, clean cliff)
- Ref
#608(#608)- Project
agent-ops- Status
- backlog
- Priority
- high
- Type
- task
- Assigned
- nw-lezama-cc auditor
- Created by
- —
- Created
- 2026-06-02T00:43:53.837Z
- Updated
- 2026-06-02T00:45:20.455Z
Sub-items (2/8 done · 25%)
Questions
No questions.
Event log
-
wi cli
-
Data (whey hub v2.sqlite host_probes + ping): probes every 5min, latency flat 37-50ms, ssh_r_active=1, healthy to LAST probe 2026-06-01 18:03:07 ART, then abrupt silence (clean cliff = kernel panic OR power loss; small-ping 100% loss rules out PMTU + userland-zombie; whey ppp peer 200.16.89.2 stayed UP). ssh_r_active=1 was PHANTOM (whey had no actual reverse tunnel). ZeroTier path NEVER worked: zt_whey_visible=0 across all 5435 probes since 2026-05-23 -> single public-dns route, zero redundancy. Recurrence: prior ~2.8d gap 05-29 14:54->06-01 10:07, ran ~8h, died again. Detection gap ALREADY FIXED: hub v2.9.16 (#604/#531) yellow>1h/red>3h + zero-agent guard -> lezama now alerts RED in minutes. Contributors: nw-whey-cc, nw-venus-cc, support-llmmsgsrv-cc, coder-chatduo-cc. Elazar works on-site ~10:30 06-02.
-
RECONCILE with pm-llmmsgsrv's #531 set (created in parallel): #604 detection fix DONE (hub v2.9.16). Cross-team items #605 (host-dark ~15min alert), #606 (ZeroTier 2nd path + phantom-flag fix), #607 (OOB hardware reset) live under #531 - proposing pm reparent them under #608 for one incident tree. Unique on-site children here: #609 reboot forensics, #610 recovery-SW post-mortem, #613 root-cause-gated remediation (blocked on #609). My #611/#612 canceled as dups of #606/#607.