refresh-from-prod was structurally broken and silently no-op'd the restore:

1. pg_restore lacked -U, so the postgres image connected as its OS user `root` and auth-failed. The failure was swallowed by `|| echo "ignorable warnings"`, so the script reported success while the DROP OWNED had already emptied the DB. E2's original seed was a manual process, so this path had never really run.
   Fix: pass `-h $PGHOST -U $SB_PGUSER`; don't trust pg_restore's exit code (it returns non-zero on the harmless "schema public already exists" notice); verify by counting restored llx_* tables and FAIL the Job if < 250.
2. erp-sandbox is ArgoCD-managed with self-heal ON, which reverts the `kubectl scale --replicas=0` within seconds, so the seed ran with Dolibarr still connected.
   Fix: pause self-heal for the duration, re-arm it after; app restore + self-heal restoration + secret cleanup are guarded by an EXIT trap so an interrupt can't strand the sandbox at replicas=0 / self-heal off.

Validated end-to-end on the live sandbox: 295 llx tables, company=Arcodange, owner=erp_sandbox_role, self-heal re-armed, pod 1/1.

README documents the self-heal pause and the iso-prod consequence (ai_agent_sandbox is wiped → re-provision).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
erp-sandbox lifecycle ops
Tooling to make erp-sandbox iso-prod and to reset it, implementing ADR-0003
(sandbox state lifecycle). The sandbox exists so AI agents can rehearse Dolibarr
write operations against a faithful copy of prod, with a structural guarantee
that the rehearsal path can never mutate prod.
The prod-integrity guarantee (why this is safe)
| Layer | Enforcement |
|---|---|
| prod is read-only during a refresh | pg_dump runs with default_transaction_read_only=on |
| the restore can only write the sandbox | it uses the sandbox's own dynamic creds — a member of erp_sandbox_role, which owns only erp-sandbox |
| no database is dropped/created | wipe is DROP OWNED BY erp_sandbox_role CASCADE; reload is pg_restore (no CREATEDB, no superuser) |
| prod is structurally undroppable | DROP DATABASE needs ownership; erp_sandbox_role does not own prod erp (owned by erp_role) |
The only prod-capable credential on the platform is the superuser=true provider
in factory postgres/iac/providers.tf, used only in the human-gated
postgres.yaml CI. This tooling never touches it.
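As a rough illustration of the first and third table rows, the dump and the wipe could look like the sketch below. It assumes libpq's `PGOPTIONS` variable is how `default_transaction_read_only=on` reaches the dump session; the `PG_DUMP` / `PSQL` overrides and both function names are hypothetical and are not taken from sandbox-lifecycle.sh.

```shell
#!/usr/bin/env bash
# Sketch only. PG_DUMP, PSQL and both function names are hypothetical;
# the real script may pass these settings differently.
PG_DUMP="${PG_DUMP:-pg_dump}"
PSQL="${PSQL:-psql}"

# Dump prod's public schema in custom format. Transactions in this session
# default to read-only, so a stray write fails instead of landing in prod.
dump_prod_readonly() {  # $1 = prod database name, $2 = output file
  PGOPTIONS="-c default_transaction_read_only=on" \
    "$PG_DUMP" --format=custom --schema=public --file="$2" "$1"
}

# Empty the sandbox without DROP DATABASE: remove only the objects owned by
# erp_sandbox_role, which owns nothing in prod.
wipe_sandbox() {  # $1 = sandbox database name
  "$PSQL" -v ON_ERROR_STOP=1 -d "$1" \
    -c "DROP OWNED BY erp_sandbox_role CASCADE"
}

# Usage (connection variables come from the respective credentials):
#   dump_prod_readonly erp /tmp/prod.dump
#   wipe_sandbox erp-sandbox
```

Keeping the wipe to `DROP OWNED BY` means the statement is bounded by role ownership rather than by which database the session happens to be connected to.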
Usage
```shell
./sandbox-lifecycle.sh refresh-from-prod   # clone prod DB (data + config) into erp-sandbox
./sandbox-lifecycle.sh sync-documents      # copy mycompany/ uploads (company logo, PDFs)
./sandbox-lifecycle.sh refresh             # both, in order
```
refresh-from-prod pauses ArgoCD self-heal on the erp-sandbox Application (else
self-heal reverts the scale-down within seconds and the seed would run with the app
still connected), scales the sandbox pod to 0, dumps the full prod public schema
(read-only), wipes the sandbox's app objects, restores, scales back up and re-arms
self-heal. App restore and self-heal restoration are guarded by an EXIT trap, so an
interrupt can't leave the sandbox scaled to 0 with self-heal off. It dumps the
whole schema (not just llx_*) so app helper functions and their triggers (e.g.
update_modified_column_tms()) come over; it filters out the provisioner-owned
user_lookup pgbouncer function from the restore TOC because that object already
exists per-environment and is not app data.
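The pause, scale-down and guarded restore described above can be sketched as follows. This is a minimal illustration, not the script's actual code: the `argocd` namespace, the `deploy/erp-sandbox` workload name, the single replica on restore, and every function name are assumptions, and secret cleanup is left out.

```shell
#!/usr/bin/env bash
# Sketch only: values marked (assumed) are not taken from the real script.
KUBECTL="${KUBECTL:-kubectl}"
ARGO_NS="${ARGO_NS:-argocd}"             # (assumed) namespace of the Application
APP_NS="${APP_NS:-erp-sandbox}"          # (assumed) namespace of the workload
DEPLOY="${DEPLOY:-deploy/erp-sandbox}"   # (assumed) workload name

# Toggle automated self-heal on the erp-sandbox Application.
set_self_heal() {  # $1 = true|false
  "$KUBECTL" -n "$ARGO_NS" patch application erp-sandbox --type merge \
    -p "{\"spec\":{\"syncPolicy\":{\"automated\":{\"selfHeal\":$1}}}}"
}

scale_app() {  # $1 = replica count
  "$KUBECTL" -n "$APP_NS" scale "$DEPLOY" --replicas="$1"
}

# Cleanup: bring the app back and re-arm self-heal, even if one step fails.
restore_app() {
  scale_app 1 || true
  set_self_heal true || true
}

# Run the seed command ("$@") with self-heal paused and the app scaled to 0.
refresh_guarded() {
  trap restore_app EXIT        # runs on normal exit and on failure
  trap 'exit 1' INT TERM       # route signals through the EXIT trap as well
  set_self_heal false
  scale_app 0
  "$@"
}

# Usage: refresh_guarded ./seed-step.sh   (seed-step.sh is a placeholder)
```

The trap is installed before the first mutating call, so there is no window in which self-heal is off without a cleanup handler registered.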
Note: `pg_restore` runs with an explicit `-U` (the sandbox role); without it the postgres image connects as its OS user `root` and auth-fails. Its exit code is not trusted (it returns non-zero on the harmless "schema public already exists" notice); success is verified by counting restored `llx_*` tables.
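One way to picture the restore step (TOC filtering, then verification by table count instead of exit code) is the sketch below. The function names, the `SB_DB` variable, the exact TOC pattern for `user_lookup`, and the counting query are assumptions; the default floor of 250 mirrors the value cited in the fix description.

```shell
#!/usr/bin/env bash
# Sketch only: function names, SB_DB and MIN_LLX_TABLES are hypothetical.
MIN_LLX_TABLES="${MIN_LLX_TABLES:-250}"

# Read a `pg_restore -l` listing on stdin and drop the pgbouncer user_lookup
# function so the provisioner-owned object is not restored over.
filter_toc() {
  grep -v 'FUNCTION public user_lookup(' || true
}

# Counts llx_* tables in the public schema.
LLX_COUNT_SQL="SELECT count(*) FROM pg_tables WHERE schemaname = 'public' AND tablename ~ '^llx_'"

# Succeed only if the restored table count reaches the minimum.
check_llx_count() {  # $1 = count reported by psql, $2 = minimum (optional)
  local count="$1" min="${2:-$MIN_LLX_TABLES}"
  case "$count" in
    ''|*[!0-9]*) echo "restore check: unusable count '$count'" >&2; return 2 ;;
  esac
  if [ "$count" -lt "$min" ]; then
    echo "restore check: $count llx_ tables, expected at least $min" >&2
    return 1
  fi
  echo "restore check: $count llx_ tables"
}

# Usage (needs the sandbox credentials in the environment):
#   pg_restore -l prod.dump | filter_toc > toc.list
#   pg_restore -h "$PGHOST" -U "$SB_PGUSER" -d "$SB_DB" -L toc.list prod.dump || true
#   count=$(psql -h "$PGHOST" -U "$SB_PGUSER" -d "$SB_DB" -Atc "$LLX_COUNT_SQL")
#   check_llx_count "$count" || exit 1
```

Ignoring the `pg_restore` status is only safe because the count check follows it; an empty or non-numeric count (for example after an auth failure) is treated as a failure rather than as zero tables.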
Heads-up: a full refresh is iso-prod, so it overwrites `llx_user` with prod's and wipes the `ai_agent_sandbox` write user + its API key (and resets `DOLI_INSTANCE_UNIQUE_ID` to prod's, invalidating any prior key). After a refresh, re-run `test/provisionSandbox.ts` to recreate the agent (it re-grants its rights, incl. `banque lire`) and refresh the `dolibarr-sandbox-write` skill `.env` from the new key file.
Two fidelity caveats (by design — see ADR-0003)
- Encryption. Dolibarr ties some encrypted fields (notably API keys) to
  `DOLI_INSTANCE_UNIQUE_ID`. The sandbox has its own uuid, so prod-encrypted
  values won't decrypt there. This is why the write-scoped `ai_agent_sandbox`
  API key must be generated inside the sandbox (see `../../test/POC`), not
  copied from prod. Most data is plaintext and unaffected.
- Uploaded files live on the PVC, not the DB. A DB refresh copies the logo
  const (`MAIN_INFO_SOCIETE_LOGO`) but not the image; `sync-documents` copies
  the `documents/mycompany` tree so the logo + attachments actually render.
BDD reset loop (E4)
For repeated rehearsals, refresh-from-prod is the "reset to prod state". A
faster checkpoint/reset that avoids re-reading prod each time (cache a golden
dump on a small PVC, then DROP OWNED + pg_restore from it) is the documented
next optimization — see ADR-0003 §Decision/Consequences.
Hardening backlog
- Replace the transient copy of prod's read+write creds with a dedicated
  read-only Postgres role (issued via a Vault dynamic role) so the dump path is
  least-privilege by construction, not just by `default_transaction_read_only`.
- Provision a golden-cache PVC for fast BDD resets.