#509 · agent-ops
evolutiva-backup down for mars+pluto: Supabase removed legacy db.<ref>.supabase.co direct hostnames (NXDOMAIN) - DSNs need pooler connection strings
- Ref: #509
- Project: agent-ops
- Status: done
- Priority: high
- Type: bug
- Assigned: nw-whey-cc db
- Created by: (none)
- Created: 2026-05-22T20:58:08.913Z
- Updated: 2026-06-06T08:36:12.210Z
- Closed: 2026-06-06T08:36:12.201Z
Questions
No questions.
Event log
-
wi cli
-
Diagnosis (nw-whey-cc 2026-05-22 ~18:05):

SYMPTOM: evolutiva-backup pg_dump fails for BOTH mars and pluto: 'could not translate host name db.<ref>.supabase.co … Temporary failure in name resolution'.

ROOT CAUSE: not a whey resolver glitch. dig returns authoritative NXDOMAIN (AUTHORITY:1 from supabase.co nameservers) for both:
- mars db.ustenujpohwlkzfdso.supabase.co
- pluto db.fdwjmzjwurbpkxersigg.supabase.co
supabase.com and unrelated hosts resolve fine from whey. Both projects backed up OK earlier today (mars id 490 13:01, pluto id 489 13:00), then both started failing: a Supabase-side cutover, not two independent project deletions. Supabase removed the legacy direct-connection hostname db.<ref>.supabase.co; direct IPv4 connections now go through the Supavisor pooler.

FIX: in /etc/evolutiva-backup/secrets.env, DATABASE_URL_MARS / DATABASE_URL_PLUTO must be updated from the legacy direct DSN (host db.<ref>.supabase.co:5432, user postgres) to the Session-mode pooler DSN:
  host = aws-0-<region>.pooler.supabase.com
  port = 5432
  user = postgres.<ref>
New strings come from the Supabase dashboard → Project Settings → Database → Connection string → Session pooler (pg_dump needs session mode / port 5432, NOT transaction mode 6543). Region is per-project.

LANE: secrets.env is root-owned under /etc, outside nw-whey-cc's read-only/audit lane. A db-role agent or Elazar pulls the two pooler connection strings from the dashboard and updates secrets.env. run.sh itself needs NO change (it just consumes the DSN).

IMPACT: mars + pluto Supabase backups have been down since ~13:00-15:00 today. Last good dumps: mars 2026-05-22 13:01 (3.83 MB), pluto 2026-05-22 13:00.
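The host, port, and user rewrite described under FIX can be sketched as follows. This is only an illustration, not tooling that exists on whey: `to_session_pooler_dsn` is a hypothetical helper, the region argument is a placeholder that would have to come from the dashboard, and the pooler host pattern is the one quoted in this comment.

```python
from urllib.parse import urlsplit, urlunsplit


def to_session_pooler_dsn(direct_dsn: str, region: str) -> str:
    """Rewrite a legacy direct DSN into a Session-mode pooler DSN (hypothetical helper).

    Assumes the direct host has the form db.<ref>.supabase.co and that the
    pooler host follows aws-0-<region>.pooler.supabase.com as quoted above.
    """
    parts = urlsplit(direct_dsn)
    labels = (parts.hostname or "").split(".")
    if len(labels) != 4 or labels[0] != "db" or labels[2:] != ["supabase", "co"]:
        raise ValueError(f"not a legacy direct host: {parts.hostname!r}")
    ref = labels[1]

    # Pooler user is <user>.<ref>; the password is carried over untouched
    # (urlsplit does not percent-decode userinfo, so encoding is preserved).
    user = f"{parts.username or 'postgres'}.{ref}"
    auth = user if parts.password is None else f"{user}:{parts.password}"

    # Session mode uses port 5432 (transaction mode would be 6543).
    netloc = f"{auth}@aws-0-{region}.pooler.supabase.com:5432"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))
```

Calling it with a direct DSN and a region such as `"us-east-1"` (a placeholder value) returns the same scheme, password, and database path with only the user and host replaced. The strings shown in the dashboard remain the authoritative source.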
-
CORRECTION: revised diagnosis (nw-whey-cc 2026-05-22 ~18:15). My first diagnosis was WRONG; superseded by this.

The cause is NOT a Supabase platform cutover removing the legacy db.<ref>.supabase.co hostname. Evidence against that: the legacy direct host still exists for ACTIVE projects (resolves to IPv6 since Jan 2024); whey has working IPv6 egress (2800:810:... , curl -6 OK), which is how these backups have run fine over the direct host all along.

ACTUAL CAUSE: both mars and pluto are free-tier Supabase projects that hit the 7-day inactivity AUTO-PAUSE. A paused project shuts down compute AND its db.<ref> DNS record is removed → authoritative NXDOMAIN → pg_dump 'could not translate host name'. Documented behavior (Supabase GitHub issue #38345). Timeline fits: both backed up OK at ~13:00 today, then failed; the 7-day timer expired between the 13:00 and 15:00 runs.

KEY FINDING: the every-2h backup pg_dump did NOT keep the projects alive. Supabase's pause timer counts qualifying DB activity; the backup connection/SELECT traffic evidently does not reset it.

IMMEDIATE FIX (revised): NOT a DSN change. Restore each project from the Supabase dashboard (Project → Restore; free, ~1 min). Once active, db.<ref>.supabase.co resolves again over IPv6 and the existing secrets.env DSN works unchanged; no edit needed.

LONG-TERM: prevent recurrence. See brainstorm options posted to Elazar (keepalive heartbeat write / Supabase Pro upgrade / better pause detection in run.sh + nightwatch). secrets.env still doesn't need editing; this WI's remaining scope is the long-term prevention work, pending Elazar's option pick.
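For the "better pause detection in run.sh + nightwatch" option, one possible pre-flight check is to resolve the DSN host before invoking pg_dump and report a missing name separately from a transient resolver error. A minimal Python sketch under stated assumptions: `classify_resolution` and its return labels are hypothetical, and how getaddrinfo error codes map onto DNS outcomes depends on the host's resolver setup, so an authoritative dig remains the confirming check.

```python
import socket

# Error codes that mean "this name does not exist" (EAI_NODATA is not
# defined on every platform, so it is looked up defensively).
_NOT_FOUND = {
    code
    for code in (socket.EAI_NONAME, getattr(socket, "EAI_NODATA", None))
    if code is not None
}


def classify_resolution(host, port=5432, resolver=socket.getaddrinfo):
    """Return a coarse label for how `host` resolves (hypothetical labels).

    `resolver` is injectable so the logic can be exercised without DNS.
    """
    try:
        infos = resolver(host, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror as exc:
        if exc.errno == socket.EAI_AGAIN:
            return "transient"   # worth retrying before alerting
        if exc.errno in _NOT_FOUND:
            return "not-found"   # candidate for "project paused" or wrong ref
        return "error"
    families = {info[0] for info in infos}
    # The direct host is described above as IPv6-only, so flag that case.
    return "ok-ipv6-only" if families == {socket.AF_INET6} else "ok"
```

A wrapper could run this check first and log a distinct "host not found, project may be paused" row when it returns `"not-found"`, so that case is not buried in a generic pg_dump failure.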
-
DEFINITIVE diagnosis (nw-whey-cc 2026-05-22 ~18:18): empirical, supersedes all earlier comments. The two projects are NOT the same problem; the unified 'platform cutover / hostname decommission' theory is wrong.

MARS (ref ustenujpohwlkzfdso): authoritative NXDOMAIN on BOTH db.<ref>.supabase.co AND the bare REST host <ref>.supabase.co; curl to https://<ref>.supabase.co/rest/v1/ does not connect. The entire project endpoint is gone from DNS = the mars project is PAUSED/inactive. It backed up OK at 13:00 today, so it is not deleted (90-day rule); it was paused within the last few hours. Fix: Restore from the Supabase dashboard. Dashboard will show the pause reason (free-tier 7-day inactivity / resource limit / manual).

PLUTO (ref fdwjmzjwurbpkxersigg): HEALTHY right now. db.<ref> resolves to an AAAA record (2600:1f1e:75b:4b14:...; IPv6-only since Supabase's Jan-2024 IPv4 change), getent OK, TCP/5432 OPEN, REST API returns 401 = project active. pluto's ~17:50 ART failure was a TRANSIENT name-resolution hiccup ('Temporary failure in name resolution' = glibc transient resolver error). No action needed; the next scheduled pluto backup should succeed on its own.

NOT org-level billing: if the org were payment-paused, pluto would be down too. pluto is alive → the mars pause is project-specific.

WHY whey could ever reach the IPv6-only direct host: whey has working IPv6 egress (global 2800:810:... , curl -6 OK). That is why these backups ran fine over db.<ref> all along and why pluto still works.

secrets.env needs NO edit for either project. Remaining WI 509 scope = (1) restore mars [dashboard, Elazar/db-agent], (2) long-term prevention; see brainstorm to Elazar.
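The per-project reasoning above (db host DNS, REST host DNS, REST HTTP response, TCP/5432) can be written down as a small decision rule. This is a sketch of the triage logic only, assuming the probe results are gathered elsewhere (dig, curl, getent); `Probe`, `triage`, and the labels are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Probe:
    """Observed signals for one project (collected by other tooling)."""
    db_host_resolves: bool            # db.<ref>.supabase.co has a DNS answer
    rest_host_resolves: bool          # <ref>.supabase.co has a DNS answer
    rest_http_status: Optional[int]   # None when the REST endpoint did not connect
    tcp_5432_open: bool


def triage(p: Probe) -> str:
    # Both names missing: the whole project endpoint is absent from DNS.
    # That fits a paused or removed project, and also a mistyped ref.
    if not p.db_host_resolves and not p.rest_host_resolves:
        return "endpoint-missing"
    # Any HTTP response (for example 401 without an API key) means the
    # project's gateway answered; with the DB port open, treat as active.
    if p.db_host_resolves and p.tcp_5432_open and p.rest_http_status is not None:
        return "active"
    return "inconclusive"


# Signals as reported in this comment.
mars = Probe(db_host_resolves=False, rest_host_resolves=False,
             rest_http_status=None, tcp_5432_open=False)
pluto = Probe(db_host_resolves=True, rest_host_resolves=True,
              rest_http_status=401, tcp_5432_open=True)
```

With these inputs `triage(mars)` yields `"endpoint-missing"` and `triage(pluto)` yields `"active"`. The rule deliberately does not say why an endpoint is missing; that still needs the dashboard.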
-
RESOLVED/verified 2026-06-06 05:33 (nw-whey-cc). mars+pluto+venus ALL backing up OK; zero fail rows in 24h. mars dumps 6.56MB hourly on disk at /gdrive/backup/evolutiva/mars/ (newest 2026-06-06-0530, plus 0500/0430...), pluto 0531, venus 0500.

Fixed by the May-31 secrets.env edit (.bak-20260531 present): mars DSN now resolves+connects.

NOTE: the WI title's mars ref 'ustenujpohwlkzfdso' is a GARBLED ref (NXDOMAIN); the real working ref in secrets.env is ustenjufophwhlkzfdso, which resolves. So the original 'paused project' diagnosis was partly a transposed-ref artifact.

Only blemish: the 05:30 mars run shows a dangling 'start' (no 'ok') because run.sh was interrupted AFTER writing the dump but BEFORE logging success. The 6.88MB dump IS on disk intact; caused by the concurrent host swap-thrash freeze (separate issue, being handled). No DSN/backup action needed.
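A garbled ref like the one noted above could be caught before any DNS lookup with a simple shape check. The sketch below assumes a project ref is exactly 20 lowercase letters; that is an inference from the two working refs in this WI (both are 20 characters, while the garbled one is 18), not a rule confirmed here. `extract_ref` and `ref_looks_valid` are hypothetical helpers.

```python
import re
from typing import Optional
from urllib.parse import urlsplit

# ASSUMPTION: refs are 20 lowercase ASCII letters (matches the working refs here).
REF_RE = re.compile(r"[a-z]{20}")


def extract_ref(dsn: str) -> Optional[str]:
    """Pull the project ref out of a direct or pooler-style DSN, if present."""
    parts = urlsplit(dsn)
    host = parts.hostname or ""
    direct = re.fullmatch(r"db\.([^.]+)\.supabase\.co", host)
    if direct:
        return direct.group(1)
    user = parts.username or ""
    # Pooler form: the user is postgres.<ref> on a *.pooler.supabase.com host.
    if host.endswith(".pooler.supabase.com") and "." in user:
        return user.split(".", 1)[1]
    return None


def ref_looks_valid(ref: str) -> bool:
    return REF_RE.fullmatch(ref) is not None
```

For example, `ref_looks_valid("ustenujpohwlkzfdso")` is false because the string is two characters short, while both `ustenjufophwhlkzfdso` and `fdwjmzjwurbpkxersigg` pass. A check like this only catches malformed refs; a well-formed but wrong ref would still need the DNS and dashboard checks.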