fix(ops): sandbox refresh-from-prod actually restores (pg_restore -U + self-heal pause) #29

Merged
arcodange merged 1 commit from claude/sandbox-lifecycle-restore-fix into main 2026-06-30 07:00:04 +02:00
Owner

Surfaced while resetting the sandbox to a clean iso-prod checkpoint: refresh-from-prod was structurally broken and silently no-op'd the restore, leaving the sandbox DB empty after the wipe.

Two bugs

  1. pg_restore connected as root. It lacked -U, so the postgres image used its OS user (root) → password authentication failed for user "root". The failure was swallowed by … || echo "completed with ignorable warnings", so the script reported success after DROP OWNED had already emptied the DB. (E2's original seed was a manual process, so this code path had never really executed.)
    → pass -h $PGHOST -U $SB_PGUSER; don't trust pg_restore's exit code (it exits non-zero even when the only error is the harmless schema "public" already exists). Instead, verify by counting restored llx_* tables and fail the Job if the count is < 250.
  2. ArgoCD self-heal fought the scale-down. erp-sandbox has selfHeal: true, which reverts kubectl scale --replicas=0 within seconds, so the seed ran with Dolibarr still connected (its table re-creation collided with the restore).
    → pause self-heal for the duration and re-arm it; app restore + self-heal + secret cleanup are guarded by an EXIT trap so an interrupt can't strand the sandbox at replicas=0 / self-heal off.
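
The self-heal pause and EXIT-trap guard from bug 2 can be sketched roughly as follows. This is a minimal illustration, not the script merged in this PR: the ArgoCD namespace, the workload namespace, the deployment name, the temporary secret name, the replica count restored on exit, and all function names are assumptions. Only erp-sandbox, selfHeal and kubectl scale --replicas=0 come from the description above.

```shell
#!/usr/bin/env bash
# Sketch only: namespaces, resource names and function names are hypothetical.
set -u

KUBECTL="${KUBECTL:-kubectl}"                        # overridable, e.g. for dry runs
ARGO_NS="${ARGO_NS:-argocd}"                         # assumed ArgoCD namespace
APP="${APP:-erp-sandbox}"                            # ArgoCD Application named in the PR
SB_NS="${SB_NS:-erp-sandbox}"                        # assumed workload namespace
DEPLOY="${DEPLOY:-deployment/dolibarr}"              # assumed workload name
TMP_SECRET="${TMP_SECRET:-refresh-from-prod-creds}"  # assumed temporary secret

# Toggle automated self-heal on the Application; $1 is true or false.
set_self_heal() {
  "$KUBECTL" -n "$ARGO_NS" patch application "$APP" --type merge \
    -p "{\"spec\":{\"syncPolicy\":{\"automated\":{\"selfHeal\":$1}}}}"
}

# Runs on every exit path; each step tolerates failure so later steps still run.
cleanup() {
  "$KUBECTL" -n "$SB_NS" scale "$DEPLOY" --replicas=1 || true   # assumed replica count
  set_self_heal true || true
  "$KUBECTL" -n "$SB_NS" delete secret "$TMP_SECRET" --ignore-not-found || true
}

# Pause self-heal, scale down, then run the wipe + restore command given as "$@".
refresh() {
  trap cleanup EXIT
  # Turn signals into a normal exit so the EXIT trap also fires on interrupt.
  trap 'exit 130' INT TERM
  set_self_heal false
  "$KUBECTL" -n "$SB_NS" scale "$DEPLOY" --replicas=0
  "$@"
}
```

Calling refresh with the restore command as its arguments (for example a hypothetical ./seed.sh) pauses self-heal first, so ArgoCD no longer reverts the scale-down, and the trap puts everything back whether the command succeeds, fails, or is interrupted.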

Validated end-to-end on the live sandbox

restore: pg_restore rc=1 — verifying by table count, not exit code
llx tables=295  company=Arcodange  lang=fr_FR  owner=erp_sandbox_role
→ self-heal re-armed (true), pod 1/1 Running, documents synced
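
For reference, the "verify by table count, not exit code" check behind the output above could look something like this. It is a hedged sketch rather than the merged script: SB_PGDATABASE, DUMP_FILE, MIN_LLX_TABLES and the function names are assumptions, and the real script may pass further pg_restore options. PGHOST, SB_PGUSER and the 250-table threshold are taken from the description.

```shell
#!/usr/bin/env bash
# Sketch only: SB_PGDATABASE, DUMP_FILE and the function names are hypothetical.
set -u

MIN_LLX_TABLES="${MIN_LLX_TABLES:-250}"   # threshold stated in the PR

# Pure check: succeed only if the restored table count reaches the threshold.
check_table_count() {
  local count="$1" min="${2:-$MIN_LLX_TABLES}"
  # Reject empty or non-numeric input (for example when psql itself failed).
  case "$count" in
    ''|*[!0-9]*) echo "restore: could not count llx_* tables" >&2; return 1 ;;
  esac
  if [ "$count" -lt "$min" ]; then
    echo "restore: only $count llx_* tables (< $min), failing" >&2
    return 1
  fi
  echo "restore: llx tables=$count"
}

# Restore with an explicit host and user, then judge success by table count.
restore_and_verify() {
  local rc=0 count
  # Without -U the client falls back to the OS user (root in the postgres image).
  pg_restore -h "$PGHOST" -U "$SB_PGUSER" -d "$SB_PGDATABASE" "$DUMP_FILE" || rc=$?
  echo "restore: pg_restore rc=$rc, verifying by table count, not exit code"
  count="$(psql -h "$PGHOST" -U "$SB_PGUSER" -d "$SB_PGDATABASE" -At \
    -c "SELECT count(*) FROM pg_tables WHERE tablename LIKE 'llx\_%'")" || count=""
  check_table_count "$count"
}
```

Because check_table_count returns non-zero on an empty or too-small count, a Job that ends with restore_and_verify fails loudly instead of reporting success over an empty database.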

README now documents the self-heal pause and the iso-prod consequence (a refresh wipes ai_agent_sandbox → re-run provisionSandbox.ts).

🤖 Generated with Claude Code

arcodange added 1 commit 2026-06-30 06:59:56 +02:00
refresh-from-prod was structurally broken and silently no-op'd the restore:

1. pg_restore lacked -U, so the postgres image connected as its OS user `root`
   and auth-failed. The failure was swallowed by `|| echo "ignorable warnings"`,
   so the script reported success while the DROP OWNED had already emptied the DB.
   E2's original seed was a manual process, so this path had never really run.
   Fix: pass `-h $PGHOST -U $SB_PGUSER`; don't trust pg_restore's exit code (it
   returns non-zero on the harmless "schema public already exists" notice) — verify
   by counting restored llx_* tables and FAIL the Job if < 250.

2. erp-sandbox is ArgoCD-managed with self-heal ON, which reverts the
   `kubectl scale --replicas=0` within seconds — so the seed ran with Dolibarr
   still connected. Fix: pause self-heal for the duration, re-arm it after; app
   restore + self-heal restoration + secret cleanup are guarded by an EXIT trap so
   an interrupt can't strand the sandbox at replicas=0 / self-heal off.

Validated end-to-end on the live sandbox: 295 llx tables, company=Arcodange,
owner=erp_sandbox_role, self-heal re-armed, pod 1/1. README documents the self-heal
pause and the iso-prod consequence (ai_agent_sandbox is wiped → re-provision).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
arcodange merged commit d31e995acd into main 2026-06-30 07:00:04 +02:00
arcodange deleted branch claude/sandbox-lifecycle-restore-fix 2026-06-30 07:00:05 +02:00
Reference: arcodange-org/erp#29