MARS-80
marserrscan failed to alert on a real 6h prod outage (isDemo regression, 2026-06-09): 43 error-events + 2 unhandled-server-error met BOTH the spike-≥10 AND prod-down triggers, yet neither Elazar nor pm-mars-cc got a DM. errscan state is whey-local (NOT in Mars Supabase; db check confirmed no seen_signatures table there). Diagnose on whey: did the cron/daemon fire today? Did it connect to the Mars DB and read appEvents? Is alert-routing/DM broken? Owner: bin-whey-cc (was offline when filed). Elazar 06-09.
- Ref: MARS-80 (#855)
- Project: mars
- Status: done
- Priority: high
- Type: task
- Assigned: bin-whey-cc
- Created by: wi-cli-venus
- Created: 2026-06-09T16:52:47.445Z
- Updated: 2026-06-09T17:00:19.162Z
- Closed: 2026-06-09T17:00:19.162Z
Questions
No questions.
Event log
- ROOT CAUSE found by nw-whey-cc (read-only, whey). NOT dead-cron: systemd timer evolutiva-errscan@mars.timer fires every ~2h fine. TWO bugs in v1.2 (commit 47896f4, today): (1) line 241 of /home/rob/.local/bin/evolutiva-errscan.sh references $vol (in 'spike>=${vol}') but vol is never assigned in the shell → under set -u, every run that reaches the escalation branch has ABORTED there since ~08:11 ART, killing exactly the alert path today's outage needed. State frozen: watermark stuck at 2026-06-09 04:05:13Z, state-db (/opt/evolutiva-errscan/errscan-state.sqlite) mtime 06:10 ART. (2) Independent: hub_send DMs to pm-mars-cc + nw-venus-cc log 'failed' (recipient name/routing) even pre-crash. Both fixes are bin-whey-cc's domain. Fix = assign vol + correct the hub_send recipient routing.
- Fixed by bin-whey-cc, errscan v1.3 (f59c01d). Bug1: $vol was awk-only, so set -u crashed every escalation-eligible run → promoted to SPIKE_VOL=10 shell var. Bug2: the systemd run had no CC session, so sender bin-whey-cc was unregistered (400 not_registered) → added idempotent hub_register() before sends. Watermark advanced to 2026-06-09T18:00Z (skips the resolved isDemo outage). Verified clean run + DM routing. Next sweep self-registers + delivers.
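For anyone reading this ticket later, here is a minimal sketch of the two failure classes in the event log and the shape of their fixes. It is not the real evolutiva-errscan.sh. The hub is faked with a temp directory, the bodies and argument order of hub_register and hub_send are assumptions (the ticket only names them), and the event count of 45 is illustrative.

```shell
#!/usr/bin/env bash
# Sketch only, not the real evolutiva-errscan.sh. The hub is faked with a
# temp directory and the event count (45) is illustrative.
set -u

# v1.2 shape: the escalation branch expands $vol, which no shell code assigns.
buggy_check() {
  bash -c '
    set -u
    if [ "$1" -ge 10 ]; then
      echo "ALERT spike>=${vol} (count=$1)"   # aborts here: vol is unbound
    fi
    echo "run completed"
  ' _ "$1" 2>/dev/null
}

# v1.3 shape: the threshold is a shell variable, so it is bound under set -u.
SPIKE_VOL=10

# Hypothetical stand-in for the hub's registry of known senders.
HUB_DIR=$(mktemp -d)

# Idempotent: registering an already-registered sender changes nothing.
hub_register() { mkdir -p "$HUB_DIR/$1"; }

# $1 sender, $2 recipient, $3 message (argument order is an assumption).
hub_send() {
  if [ ! -d "$HUB_DIR/$1" ]; then
    echo "failed: 400 not_registered ($1)"
    return 1
  fi
  echo "sent to $2: $3"
}

fixed_check() {
  if [ "$1" -ge "$SPIKE_VOL" ]; then
    hub_register bin-whey-cc   # register first; no CC session is assumed
    hub_send bin-whey-cc pm-mars-cc "ALERT spike>=${SPIKE_VOL} (count=$1)"
  fi
  echo "run completed"
}

quiet_out=$(buggy_check 3)             # below threshold: the bug stays hidden
spike_status=0
spike_out=$(buggy_check 45) || spike_status=$?
unregistered_out=$(hub_send bin-whey-cc pm-mars-cc "test")
fixed_out=$(fixed_check 45)
fixed_again=$(fixed_check 45)          # re-registering is harmless

printf 'buggy quiet run : %s\n' "$quiet_out"
printf 'buggy spike run : status=%s output=[%s]\n' "$spike_status" "$spike_out"
printf 'unregistered    : %s\n' "$unregistered_out"
printf 'fixed spike run : %s\n' "$fixed_out"
rm -rf "$HUB_DIR"
```

The quiet run shows why the regression could go unnoticed: the unbound expansion is only evaluated when the threshold is crossed, so the script dies only on the runs that should have alerted.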