#539 ·
llmmsg-srv · child of #531Venus-side watchdog + alert standardization (mirror lezama pattern)
- Ref
#539(#539)- Project
llmmsg-srv- Parent
- backlog #531 Transport reliability + observability program (whey/venus/lezama)
- Status
- inProgress
- Priority
- normal
- Type
- task
- Assigned
- nw-venus-cc
- Created by
- —
- Created
- 2026-05-23T18:46:26.278Z
- Updated
- 2026-05-29T05:30:24.377Z
Questions
No questions.
Event log
-
wi cli; parent=#531
-
status=inProgress
-
Watchdog cadence consensus 3/3 (nw-lezama/nw-venus/nw-whey), Elazar-prompted 2026-05-29. earlyoom: leave as-is (adaptive memory poll, -r 60 = log cadence only; no 1s fixed poll). NEW CPU/resource kill-daemon: poll 10s; kill CPU hog after 3 polls (30s sustained); RAM>90% for 2 polls (20s); disk>85% ALERT-only (never kill); never-kill list: postgres,sshd,systemd,node,autossh,claude,codex. Open knob pending Elazar: CPU threshold N% + chromium process-count cap (proposed N=90%, count>10).
-
Baseline collector launched. resource-baseline.sh v1.0 in sh.git (b5c8701): pidstat -u 1 1 instantaneous CPU, proc_samples/sys_samples/meta schema, 10s interval / 48h window. whey collecting (PID 2869021, ~/resource-baseline.sqlite, ends ~2026-05-31). venus tasked to git-pull + verify sysstat/pidstat present before launch (script drops proc rows silently if absent). lezama held (GCABA). nw agents report p50/p95/p99/max per comm at window close -> PM sets final N% + sustained-poll count with Elazar.
-
Both hosts confirmed collecting (verified data). whey PID 2869021; venus PID 398955 (sysstat installed, sanity check passed: 16 proc_samples after first interval). 48h/10s each, ~/resource-baseline.sqlite. Window closes ~2026-05-31 02:30. Awaiting percentile reports.