fix(ops): sandbox refresh-from-prod actually restores now (pg_restore -U + self-heal pause)

refresh-from-prod was structurally broken and silently no-op'd the restore:

1. pg_restore lacked -U, so the postgres image connected as its OS user `root`
   and auth-failed. The failure was swallowed by `|| echo "ignorable warnings"`,
   so the script reported success while the DROP OWNED had already emptied the DB.
   E2's original seed was a manual process, so this path had never really run.
   Fix: pass `-h $PGHOST -U $SB_PGUSER`; don't trust pg_restore's exit code (it
   returns non-zero on the harmless "schema public already exists" notice) — verify
   by counting restored llx_* tables and FAIL the Job if < 250.

2. erp-sandbox is ArgoCD-managed with self-heal ON, which reverts the
   `kubectl scale --replicas=0` within seconds — so the seed ran with Dolibarr
   still connected. Fix: pause self-heal for the duration, re-arm it after; app
   restore + self-heal restoration + secret cleanup are guarded by an EXIT trap so
   an interrupt can't strand the sandbox at replicas=0 / self-heal off.

Validated end-to-end on the live sandbox: 295 llx tables, company=Arcodange,
owner=erp_sandbox_role, self-heal re-armed, pod 1/1. README documents the self-heal
pause and the iso-prod consequence (ai_agent_sandbox is wiped → re-provision).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-30 06:59:39 +02:00
parent 0688e3d7fd
commit 434be7488d
2 changed files with 49 additions and 14 deletions

@@ -27,12 +27,28 @@ in `factory postgres/iac/providers.tf`, used **only** in the human-gated
./sandbox-lifecycle.sh refresh # both, in order
```
`refresh-from-prod` scales the sandbox pod to 0, dumps the full prod `public`
schema (read-only), wipes the sandbox's app objects, restores, and scales back
up. It dumps the **whole** schema (not just `llx_*`) so app helper functions and
their triggers (e.g. `update_modified_column_tms()`) come over; it filters out
the provisioner-owned `user_lookup` pgbouncer function from the restore TOC
because that object already exists per-environment and is not app data.
`refresh-from-prod` pauses ArgoCD self-heal on the `erp-sandbox` Application (else
self-heal reverts the scale-down within seconds and the seed would run with the app
still connected), scales the sandbox pod to 0, dumps the full prod `public` schema
(read-only), wipes the sandbox's app objects, restores, scales back up and re-arms
self-heal. App restore and self-heal restoration are guarded by an EXIT trap, so an
interrupt can't leave the sandbox scaled to 0 with self-heal off. It dumps the
**whole** schema (not just `llx_*`) so app helper functions and their triggers (e.g.
`update_modified_column_tms()`) come over; it filters out the provisioner-owned
`user_lookup` pgbouncer function from the restore TOC because that object already
exists per-environment and is not app data.
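
The pause / scale-down / re-arm sequence described above can be sketched as follows. This is a minimal illustration, not the actual contents of `sandbox-lifecycle.sh`: the `argocd` namespace, the `erp-sandbox` deployment namespace, the `deployment/dolibarr` target and every function name are assumptions. It relies on the ArgoCD Application field `spec.syncPolicy.automated.selfHeal`.

```shell
#!/bin/sh
set -eu

# Illustrative defaults; all of these names are assumptions, not the script's real values.
KUBECTL="${KUBECTL:-kubectl}"
ARGO_NS="${ARGO_NS:-argocd}"
APP="${APP:-erp-sandbox}"
DEPLOY_NS="${DEPLOY_NS:-erp-sandbox}"
DEPLOY="${DEPLOY:-deployment/dolibarr}"

# Toggle ArgoCD self-heal on the Application. $1 = true | false
set_self_heal() {
  "$KUBECTL" -n "$ARGO_NS" patch application "$APP" --type merge \
    -p "{\"spec\":{\"syncPolicy\":{\"automated\":{\"selfHeal\":$1}}}}"
}

# Scale the sandbox workload. $1 = replica count
scale_app() {
  "$KUBECTL" -n "$DEPLOY_NS" scale "$DEPLOY" --replicas="$1"
}

# Bring the app back and re-arm self-heal; each step is best-effort so that
# one failure does not prevent the other from running.
cleanup() {
  scale_app 1 || true
  set_self_heal true || true
}

refresh() {
  # Register the trap before changing anything, so every exit path
  # (success, a failing step under `set -e`) runs cleanup.
  trap cleanup EXIT
  set_self_heal false
  scale_app 0
  # ... dump prod, wipe sandbox app objects, restore, verify ...
}
```

Registering the trap before the first mutation is the point of the design: there is no window in which the sandbox is scaled down or self-heal is off without a cleanup already armed.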
> Note: `pg_restore` runs with an explicit `-U` (the sandbox role); without it the
> postgres image connects as its OS user `root` and auth-fails. Its exit code is not
> trusted (it returns non-zero on the harmless "schema public already exists"
> notice); success is verified by counting restored `llx_*` tables, and the Job fails
> if fewer than 250 are found.
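
A hedged sketch of the restore and verification steps, under stated assumptions: `SB_PGDATABASE`, `MIN_TABLES` and all function names are hypothetical, `PGHOST` and `SB_PGUSER` are taken from the commit message, and the password is assumed to come from the environment (for example `PGPASSWORD`). The TOC filter simply drops any line mentioning `user_lookup`; the real script may match more precisely.

```shell
#!/bin/sh
set -eu

MIN_TABLES="${MIN_TABLES:-250}"   # threshold below which the Job should fail

# Read a `pg_restore -l` listing on stdin and drop the user_lookup entry.
filter_toc() {
  grep -v 'user_lookup'
}

# Count llx_* tables in the sandbox's public schema.
count_llx_tables() {
  psql -h "$PGHOST" -U "$SB_PGUSER" -d "$SB_PGDATABASE" -At -c \
    "SELECT count(*) FROM information_schema.tables
      WHERE table_schema = 'public' AND left(table_name, 4) = 'llx_'"
}

# Fail when the observed count is below the threshold. $1 = observed count
check_table_count() {
  if [ "$1" -lt "$MIN_TABLES" ]; then
    echo "restore verification FAILED: $1 llx_* tables (< $MIN_TABLES)" >&2
    return 1
  fi
  echo "restore verified: $1 llx_* tables"
}

# $1 = dump file (custom format), $2 = scratch path for the filtered TOC
restore_and_verify() {
  pg_restore -l "$1" | filter_toc > "$2"
  # The exit code is deliberately ignored; the table count is the real check.
  pg_restore -h "$PGHOST" -U "$SB_PGUSER" -d "$SB_PGDATABASE" -L "$2" "$1" \
    || echo "pg_restore exited non-zero; verifying by table count" >&2
  check_table_count "$(count_llx_tables)"
}
```

Because `check_table_count` returns non-zero below the threshold, a caller running under `set -e` exits with a failure status, which is what makes the Job fail instead of reporting a false success.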
> Heads-up: a full refresh is **iso-prod**, so it overwrites `llx_user` with prod's
> and wipes the `ai_agent_sandbox` write user + its API key (and resets
> `DOLI_INSTANCE_UNIQUE_ID` to prod's, invalidating any prior key). After a refresh,
> re-run `test/provisionSandbox.ts` to recreate the agent (it re-grants its rights,
> incl. `banque lire`) and refresh the `dolibarr-sandbox-write` skill `.env` from the
> new key file.
## Two fidelity caveats (by design — see ADR-0003)