
#608 · agent-ops

lezama host-dark recurrence: root-cause + harden against full-dark (2nd occurrence; died 2026-06-01 18:03 ART, clean cliff)

Ref: #608
Project: agent-ops
Status: backlog
Priority: high
Type: task
Assigned: nw-lezama-cc auditor
Created by:
Created: 2026-06-02T00:43:53.837Z
Updated: 2026-06-02T00:45:20.455Z

Sub-items (2/8 done · 25%)

ref · title · status · priority · assignee
#609 · Reboot forensics: journalctl -b -1 -k (panic trace=crash / empty=power loss) + full -b -1 (OOM?), dmesg | tail -50, df -h, last -x reboot, smartctl -a · backlog · high · nw-lezama-cc
#610 · Validate the installed recovery software (failed twice): identify unit, check prior-boot status+journal (fired/tried/failed?), confirm dead-kernel/power is below userland scope; validate reverse-tunnel health EXTERNALLY from whey, never trust the ssh_r_active self-report flag · backlog · high · nw-lezama-cc
#605 · lezama host-dark alert: watchdog fires notify-elazar+eq after 3 consecutive 'unreachable' skips (~15min), instead of silently skipping forever · backlog · normal · nw-whey-cc
#606 · lezama single-path fragility: all 5435 probes since May23 are public-dns with zt_whey_visible=0 (ZeroTier fallback never worked); make ZT a real 2nd route + fix phantom ssh_r_active self-report · backlog · normal · nw-whey-cc
#607 · lezama out-of-band reset: a dead kernel cannot self-recover; scope IPMI/iDRAC vs smart-PDU for remote host-dark power-cycle (blocked on Elazar: does box have IPMI?) · backlog · low · pm-llmmsgsrv-cc
#613 · Root-cause-gated remediation (after forensics): OOM->add swap + tune vm.overcommit_memory; disk-full->monitor cron + notify-elazar; kernel panic->enable kdump for post-mortem · blocked · normal · nw-lezama-cc
#611 · ZeroTier transport redundancy: zt_whey_visible=0 across all 5435 probes -> single public-dns path. Bring ZT up (zerotier-cli listpeers, whey visible) as a real fallback route · canceled · normal · nw-lezama-cc
#612 · Scope a HARDWARE remote-reset path for dead-kernel recovery (autossh/userland watchdog cannot revive a dead kernel): IPMI/iDRAC if present, else smart PDU - only no-physical-access option · canceled · normal · nw-lezama-cc
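
Triage sketch for #609 -> #613: a minimal, hypothetical way to turn the prior-boot evidence into the root-cause gate that #613 depends on. It assumes the output of `journalctl -b -1 -k` and `journalctl -b -1` has already been captured as text and that disk usage was read from `df -h`. The function name, the regex patterns and the 99% disk threshold are illustrative assumptions, not part of any installed tooling; real kernel messages vary by version, and an absent panic trace is only consistent with power loss, not proof of it.

```python
import re

# Assumed marker patterns (illustrative only; kernel wording varies by version).
PANIC_RE = re.compile(r"Kernel panic|Oops:", re.IGNORECASE)
OOM_RE = re.compile(r"Out of memory|invoked oom-killer", re.IGNORECASE)

# Hypothetical threshold for treating the root filesystem as full.
DISK_FULL_PCT = 99.0

# Remediation text taken from sub-item #613; power loss has no userland fix
# (that case is the out-of-band reset question in #607), so it is not listed.
REMEDIATION = {
    "kernel-panic": "enable kdump for post-mortem",
    "oom": "add swap + tune vm.overcommit_memory",
    "disk-full": "monitor cron + notify-elazar",
}


def classify_prior_boot(kernel_log: str, full_log: str, disk_use_pct: float) -> str:
    """Return a coarse root-cause label from prior-boot evidence.

    kernel_log   -- text of `journalctl -b -1 -k`
    full_log     -- text of `journalctl -b -1`
    disk_use_pct -- usage of the relevant filesystem, in percent
    """
    if PANIC_RE.search(kernel_log):
        return "kernel-panic"
    if OOM_RE.search(full_log):
        return "oom"
    if disk_use_pct >= DISK_FULL_PCT:
        return "disk-full"
    # No panic trace, no OOM, disk fine: matches the "empty = power loss" reading.
    return "power-loss-suspected"
```

Usage: `REMEDIATION.get(classify_prior_boot(klog, flog, pct))` returns the #613 action for that cause, or `None` when the label is `power-loss-suspected`.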
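
Escalation sketch for #605: one possible shape for the "3 consecutive 'unreachable' skips" rule, written as a small state machine. The class and method names are hypothetical, and the sketch does not model how the existing watchdog on whey probes lezama or calls notify-elazar; it only shows the counting logic, firing once per outage and re-arming when the host is seen again.

```python
from dataclasses import dataclass

# Threshold from #605. The ~15 min figure there suggests a probe roughly
# every 5 minutes, but that interval is an inference, not a known setting.
ALERT_AFTER = 3


@dataclass
class HostDarkWatchdog:
    threshold: int = ALERT_AFTER
    consecutive: int = 0   # current run of 'unreachable' results
    alerted: bool = False  # True once this outage has already been reported

    def observe(self, reachable: bool) -> bool:
        """Record one probe result; return True only when an alert should fire."""
        if reachable:
            # Host is back: reset the run and re-arm the alert.
            self.consecutive = 0
            self.alerted = False
            return False
        self.consecutive += 1
        if self.consecutive >= self.threshold and not self.alerted:
            self.alerted = True
            return True
        # Either below threshold, or this outage was already reported.
        return False
```

The caller would invoke the notification path whenever `observe(...)` returns `True`; later skips in the same outage stay quiet instead of re-alerting on every probe.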

Questions

No questions.

Event log