Gabriel Radureau 434be7488d fix(ops): sandbox refresh-from-prod actually restores now (pg_restore -U + self-heal pause)

refresh-from-prod was structurally broken and silently no-op'd the restore:

1. pg_restore lacked -U, so the postgres image connected as its OS user `root`
   and auth-failed. The failure was swallowed by `|| echo "ignorable warnings"`,
   so the script reported success while the DROP OWNED had already emptied the DB.
   E2's original seed was a manual process, so this path had never really run.
   Fix: pass `-h $PGHOST -U $SB_PGUSER`; don't trust pg_restore's exit code (it
   returns non-zero on the harmless "schema public already exists" notice) — verify
   by counting restored llx_* tables and FAIL the Job if < 250.

2. erp-sandbox is ArgoCD-managed with self-heal ON, which reverts the
   `kubectl scale --replicas=0` within seconds — so the seed ran with Dolibarr
   still connected. Fix: pause self-heal for the duration, re-arm it after; app
   restore + self-heal restoration + secret cleanup are guarded by an EXIT trap so
   an interrupt can't strand the sandbox at replicas=0 / self-heal off.

Validated end-to-end on the live sandbox: 295 llx tables, company=Arcodange,
owner=erp_sandbox_role, self-heal re-armed, pod 1/1. README documents the self-heal
pause and the iso-prod consequence (ai_agent_sandbox is wiped → re-provision).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-06-30 06:59:39 +02:00

README.md

sandbox-lifecycle.sh

README.md

erp-sandbox lifecycle ops

Tooling to make erp-sandbox iso-prod and to reset it, implementing ADR-0003 (sandbox state lifecycle). The sandbox exists so AI agents can rehearse Dolibarr write operations against a faithful copy of prod, with a structural guarantee that the rehearsal path can never mutate prod.

The prod-integrity guarantee (why this is safe)

Layer	Enforcement
prod is read-only during a refresh	`pg_dump` runs with `default_transaction_read_only=on`
the restore can only write the sandbox	it uses the sandbox's own dynamic creds — a member of `erp_sandbox_role`, which owns only `erp-sandbox`
no database is dropped/created	wipe is `DROP OWNED BY erp_sandbox_role CASCADE`; reload is `pg_restore` (no `CREATEDB`, no superuser)
prod is structurally undroppable	`DROP DATABASE` needs ownership; `erp_sandbox_role` does not own prod `erp` (owned by `erp_role`)

The only prod-capable credential on the platform is the superuser=true provider in factory postgres/iac/providers.tf, used only in the human-gated postgres.yaml CI. This tooling never touches it.
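The read-only dump and the ownership-scoped wipe from the table above can be sketched as follows. This is a minimal illustration, not the contents of sandbox-lifecycle.sh: it assumes the read-only setting is passed through libpq's PGOPTIONS variable, and PROD_PGUSER, PROD_DB, SB_DB and the function names are hypothetical placeholders.

```shell
#!/usr/bin/env bash
# Sketch only. PROD_PGUSER, PROD_DB, SB_DB and the function names are
# hypothetical; credentials are expected to reach libpq via PGPASSWORD
# or ~/.pgpass.
PROD_DB="${PROD_DB:-erp}"
SB_DB="${SB_DB:-erp-sandbox}"

# Dump prod's public schema. PGOPTIONS makes transactions in this session
# read-only by default, so the dump connection is not expected to write.
dump_prod_public() {  # $1 = output file
  PGOPTIONS='-c default_transaction_read_only=on' \
    pg_dump -h "$PGHOST" -U "$PROD_PGUSER" -d "$PROD_DB" \
      --schema=public --format=custom --file="$1"
}

# Wipe only what the sandbox role owns: no DROP DATABASE, no superuser.
wipe_sandbox() {
  psql -h "$PGHOST" -U "$SB_PGUSER" -d "$SB_DB" -v ON_ERROR_STOP=1 \
    -c 'DROP OWNED BY erp_sandbox_role CASCADE'
}
```

Because the wipe is expressed as DROP OWNED BY rather than DROP DATABASE, a connection that is only a member of erp_sandbox_role has nothing in prod that the statement could reach.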

Usage

./sandbox-lifecycle.sh refresh-from-prod   # clone prod DB (data + config) into erp-sandbox
./sandbox-lifecycle.sh sync-documents      # copy mycompany/ uploads (company logo, PDFs)
./sandbox-lifecycle.sh refresh             # both, in order

refresh-from-prod runs the following steps in order: it pauses ArgoCD self-heal on the erp-sandbox Application (otherwise self-heal reverts the scale-down within seconds and the seed runs with the app still connected), scales the sandbox pod to 0, dumps the full prod public schema (read-only), wipes the sandbox's app objects, restores, scales back up, and re-arms self-heal. App restore and self-heal restoration are guarded by an EXIT trap, so an interrupt can't leave the sandbox scaled to 0 with self-heal off. The dump covers the whole schema (not just llx_*) so that app helper functions and their triggers (e.g. update_modified_column_tms()) come over. The provisioner-owned user_lookup pgbouncer function is filtered out of the restore TOC because that object already exists per environment and is not app data.
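The pause, scale-down, and trap-guarded restoration described above could look roughly like this. It is a sketch under stated assumptions: the argocd namespace, the deployment name, the replica count of 1, and the function names are all hypothetical, and the real script may structure the trap differently.

```shell
#!/usr/bin/env bash
# Sketch only. ARGO_NS, SB_NS, DEPLOY and the function names are assumptions.
ARGO_NS="${ARGO_NS:-argocd}"
APP="${APP:-erp-sandbox}"
SB_NS="${SB_NS:-erp-sandbox}"
DEPLOY="${DEPLOY:-deployment/erp-sandbox}"

# Toggle spec.syncPolicy.automated.selfHeal on the ArgoCD Application.
set_self_heal() {  # $1 = true|false
  kubectl -n "$ARGO_NS" patch application "$APP" --type merge \
    -p "{\"spec\":{\"syncPolicy\":{\"automated\":{\"selfHeal\":$1}}}}"
}

# Bring the app back and re-arm self-heal.
restore_app() {
  kubectl -n "$SB_NS" scale "$DEPLOY" --replicas=1
  set_self_heal true
}

# Run "$@" with self-heal off and the app at 0 replicas. The work happens
# in a subshell whose EXIT trap restores the app even if "$@" fails or the
# subshell is interrupted.
with_app_paused() {
  (
    trap restore_app EXIT
    set_self_heal false
    kubectl -n "$SB_NS" scale "$DEPLOY" --replicas=0
    "$@"
  )
}
```

A caller would wrap the seed step, for example `with_app_paused some_seed_function`, where some_seed_function is whatever performs the dump, wipe, and restore.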

Note: pg_restore runs with an explicit -U (the sandbox role); without it, the postgres image connects as its OS user root and authentication fails. Its exit code is not trusted, because it returns non-zero on the harmless "schema public already exists" notice. Success is instead verified by counting the restored llx_* tables, and the Job fails if fewer than 250 are found.
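A hedged sketch of the restore step: filter the TOC, restore with an explicit user, then decide success by table count. The 250 threshold comes from the commit message; SB_DB, MIN_TABLES, the grep pattern used to match user_lookup, and the function names are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch only. SB_DB, MIN_TABLES and the function names are hypothetical;
# PGHOST and SB_PGUSER follow the names used in the commit message.
SB_DB="${SB_DB:-erp-sandbox}"
MIN_TABLES="${MIN_TABLES:-250}"

# Drop TOC entries that mention the provisioner-owned user_lookup function.
# Reads a `pg_restore -l` listing on stdin, writes the kept entries to stdout.
filter_toc() {
  grep -v 'user_lookup('
}

# Count llx_* tables in the sandbox's public schema.
count_llx_tables() {
  psql -h "$PGHOST" -U "$SB_PGUSER" -d "$SB_DB" -At \
    -c "SELECT count(*) FROM pg_tables WHERE schemaname = 'public' AND tablename LIKE 'llx\_%'"
}

# Restore from a custom-format dump, then judge success by table count
# rather than by pg_restore's exit code.
restore_and_verify() {  # $1 = dump file
  toc="$(mktemp)" || return 1
  pg_restore -l "$1" | filter_toc > "$toc"
  pg_restore -h "$PGHOST" -U "$SB_PGUSER" -d "$SB_DB" -L "$toc" "$1" \
    || echo "pg_restore exited non-zero; checking table count instead" >&2
  rm -f "$toc"
  n="$(count_llx_tables)" || return 1
  if [ "${n:-0}" -ge "$MIN_TABLES" ] 2>/dev/null; then
    echo "restore OK: $n llx_ tables"
    return 0
  fi
  echo "restore FAILED: ${n:-0} llx_ tables (need >= $MIN_TABLES)" >&2
  return 1
}
```

Checking for "at least MIN_TABLES" rather than "fewer than" means a non-numeric or empty count also lands on the failure path instead of being reported as success.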

Heads-up: a full refresh is iso-prod, so it overwrites llx_user with prod's and wipes the ai_agent_sandbox write user + its API key (and resets DOLI_INSTANCE_UNIQUE_ID to prod's, invalidating any prior key). After a refresh, re-run test/provisionSandbox.ts to recreate the agent (it re-grants its rights, incl. banque lire) and refresh the dolibarr-sandbox-write skill .env from the new key file.

Two fidelity caveats (by design — see ADR-0003)

Encryption. Dolibarr ties some encrypted fields (notably API keys) to DOLI_INSTANCE_UNIQUE_ID. The sandbox has its own uuid, so prod-encrypted values won't decrypt there. This is why the write-scoped ai_agent_sandbox API key must be generated inside the sandbox (see ../../test/ POC), not copied from prod. Most data is plaintext and unaffected.
Uploaded files live on the PVC, not the DB. A DB refresh copies the logo const (MAIN_INFO_SOCIETE_LOGO) but not the image; sync-documents copies the documents/mycompany tree so the logo + attachments actually render.
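One way the documents copy could be done is to stream a tar archive between the two pods. This is only a sketch: the README does not say how sync-documents moves the files, and the namespaces, workload names, and DOC_ROOT path below are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only. Namespaces, workload names and DOC_ROOT are assumptions.
PROD_NS="${PROD_NS:-erp}"
SB_NS="${SB_NS:-erp-sandbox}"
PROD_TARGET="${PROD_TARGET:-deploy/erp}"
SB_TARGET="${SB_TARGET:-deploy/erp-sandbox}"
DOC_ROOT="${DOC_ROOT:-/var/www/documents}"

# Stream the mycompany/ tree from prod to the sandbox as a tar archive.
# The prod side only reads; the sandbox side unpacks onto its own PVC.
sync_documents() {
  kubectl -n "$PROD_NS" exec "$PROD_TARGET" -- \
      tar -cf - -C "$DOC_ROOT" mycompany \
    | kubectl -n "$SB_NS" exec -i "$SB_TARGET" -- \
      tar -xf - -C "$DOC_ROOT"
}
```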

BDD reset loop (E4)

For repeated rehearsals, refresh-from-prod is the "reset to prod state". A faster checkpoint/reset that avoids re-reading prod each time (cache a golden dump on a small PVC, then DROP OWNED + pg_restore from it) is the documented next optimization — see ADR-0003 §Decision/Consequences.
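The checkpoint/reset idea could be sketched as below. This illustrates the proposed optimization and is not something the current script does; GOLDEN, PROD_PGUSER, PROD_DB, SB_DB and reset_from_golden are hypothetical names, and the cache path assumes a PVC mounted at /cache.

```shell
#!/usr/bin/env bash
# Sketch of a possible golden-cache reset. All names here are hypothetical.
GOLDEN="${GOLDEN:-/cache/golden.dump}"
PROD_DB="${PROD_DB:-erp}"
SB_DB="${SB_DB:-erp-sandbox}"

reset_from_golden() {
  # Read prod only when no cached dump exists yet. Writing to a temp name
  # first avoids leaving a truncated cache behind if the dump fails.
  if [ ! -s "$GOLDEN" ]; then
    PGOPTIONS='-c default_transaction_read_only=on' \
      pg_dump -h "$PGHOST" -U "$PROD_PGUSER" -d "$PROD_DB" \
        --schema=public --format=custom --file="$GOLDEN.tmp" || return 1
    mv "$GOLDEN.tmp" "$GOLDEN" || return 1
  fi
  # Wipe and reload the sandbox, sourced from the cache instead of prod.
  psql -h "$PGHOST" -U "$SB_PGUSER" -d "$SB_DB" -v ON_ERROR_STOP=1 \
    -c 'DROP OWNED BY erp_sandbox_role CASCADE' || return 1
  pg_restore -h "$PGHOST" -U "$SB_PGUSER" -d "$SB_DB" "$GOLDEN" \
    || echo "pg_restore exited non-zero; verify by table count" >&2
}
```

Deleting the cached file would force the next reset to re-read prod, which gives a simple way to refresh the checkpoint.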

Hardening backlog

Replace the transient copy of prod's read+write creds with a dedicated read-only Postgres role (issued via a Vault dynamic role) so the dump path is least-privilege by construction, not just by default_transaction_read_only.
Provision a golden-cache PVC for fast BDD resets.