refresh-from-prod was structurally broken and silently no-op'd the restore:

1. pg_restore lacked -U, so the postgres image connected as its OS user `root` and auth-failed. The failure was swallowed by `|| echo "ignorable warnings"`, so the script reported success while the DROP OWNED had already emptied the DB. E2's original seed was a manual process, so this path had never really run.
   Fix: pass `-h $PGHOST -U $SB_PGUSER`; don't trust pg_restore's exit code (it returns non-zero on the harmless "schema public already exists" notice); verify by counting restored `llx_*` tables and FAIL the Job if < 250.

2. erp-sandbox is ArgoCD-managed with self-heal ON, which reverts the `kubectl scale --replicas=0` within seconds, so the seed ran with Dolibarr still connected.
   Fix: pause self-heal for the duration, re-arm it after; app restore + self-heal restoration + secret cleanup are guarded by an EXIT trap so an interrupt can't strand the sandbox at replicas=0 / self-heal off.

Validated end-to-end on the live sandbox: 295 llx tables, company=Arcodange, owner=erp_sandbox_role, self-heal re-armed, pod 1/1. README documents the self-heal pause and the iso-prod consequence (ai_agent_sandbox is wiped → re-provision).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

# erp-sandbox lifecycle ops

Tooling to make `erp-sandbox` **iso-prod** and to reset it, implementing
[ADR-0003](https://gitea.arcodange.lab/arcodange-org/factory/src/branch/main/vibe/ADR/0003-sandbox-state-lifecycle.md)
(sandbox state lifecycle). The sandbox exists so AI agents can rehearse Dolibarr
**write** operations against a faithful copy of prod, with a structural guarantee
that the rehearsal path can never mutate prod.

## The prod-integrity guarantee (why this is safe)

| Layer | Enforcement |
| --- | --- |
| prod is **read-only** during a refresh | `pg_dump` runs with `default_transaction_read_only=on` |
| the restore can only write the sandbox | it uses the sandbox's own dynamic creds (a member of `erp_sandbox_role`, which **owns only `erp-sandbox`**) |
| no database is dropped/created | wipe is `DROP OWNED BY erp_sandbox_role CASCADE`; reload is `pg_restore` (no `CREATEDB`, no superuser) |
| prod is structurally undroppable | `DROP DATABASE` needs ownership; `erp_sandbox_role` does not own prod `erp` (owned by `erp_role`) |

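
As an illustration of the first row, a session-level read-only default can be passed to `pg_dump` through libpq's `PGOPTIONS` environment variable. The sketch below is a dry-run helper that only prints such a command. The function name and its arguments (host, user, database, output path) are hypothetical and not taken from `sandbox-lifecycle.sh`, and the `-n public -Fc` flags are an assumption based on this README's description of the dump.

```sh
# Hypothetical dry-run helper: print (do not run) a pg_dump invocation whose
# session starts with default_transaction_read_only=on.
# Args: host, user, database, output file (all illustrative values).
print_readonly_dump_cmd() {
  host=$1 user=$2 db=$3 out=$4
  # PGOPTIONS hands "-c name=value" settings to the server at connect time.
  printf "PGOPTIONS='-c default_transaction_read_only=on' "
  # -n public: dump only that schema; -Fc: custom format, readable by pg_restore.
  printf 'pg_dump -h %s -U %s -d %s -n public -Fc -f %s\n' \
    "$host" "$user" "$db" "$out"
}
```

Printing the command first makes it easy to review exactly what would touch prod before anything is executed.
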
The only prod-capable credential on the platform is the `superuser=true` provider
in `factory postgres/iac/providers.tf`, used **only** in the human-gated
`postgres.yaml` CI. This tooling never touches it.

## Usage

```sh
./sandbox-lifecycle.sh refresh-from-prod   # clone prod DB (data + config) into erp-sandbox
./sandbox-lifecycle.sh sync-documents      # copy mycompany/ uploads (company logo, PDFs)
./sandbox-lifecycle.sh refresh             # both, in order
```

`refresh-from-prod` pauses ArgoCD self-heal on the `erp-sandbox` Application
(otherwise self-heal reverts the scale-down within seconds and the seed runs with
the app still connected), scales the sandbox pod to 0, dumps the full prod `public`
schema (read-only), wipes the sandbox's app objects, restores, scales back up, and
re-arms self-heal. App restore and self-heal restoration are guarded by an EXIT
trap, so an interrupt can't leave the sandbox scaled to 0 with self-heal off. It
dumps the **whole** schema (not just `llx_*`) so app helper functions and their
triggers (e.g. `update_modified_column_tms()`) come over; it filters the
provisioner-owned `user_lookup` pgbouncer function out of the restore TOC because
that object already exists per-environment and is not app data.
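
The EXIT-trap guard described above can be sketched as follows. This is a minimal illustration, not the script's actual code: `scale_up` and `resume_self_heal` are placeholder functions that only print a message, standing in for the real `kubectl` and ArgoCD calls, and `refresh` takes the dump/wipe/restore step as a hypothetical command argument.

```sh
# Placeholders for the real undo operations (assumptions, not real calls).
scale_up()         { echo "sandbox scaled back to 1"; }
resume_self_heal() { echo "self-heal re-armed"; }

# Runs on every exit path: success, failure, or a trapped signal.
cleanup() {
  status=$?        # keep the exit status that triggered the trap
  trap - EXIT      # do not re-enter cleanup when we exit below
  scale_up
  resume_self_heal
  exit "$status"
}

refresh() {
  trap cleanup EXIT
  # Convert signals into a normal exit so the EXIT trap fires.
  trap 'exit 130' INT
  trap 'exit 143' TERM
  echo "self-heal paused"
  echo "sandbox scaled to 0"
  "$@"             # the dump/wipe/restore step (hypothetical argument)
}
```

Run it in a subshell, for example `( refresh false )`, so the traps stay scoped to one refresh. Some `sh` implementations do not run the EXIT trap when the shell is killed by a signal, which is why the sketch also traps `INT` and `TERM` explicitly.
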

> Note: `pg_restore` runs with an explicit `-U` (the sandbox role); without it, the
> postgres image connects as its OS user `root` and authentication fails. Its exit
> code is not trusted (it returns non-zero on the harmless "schema public already
> exists" notice); success is verified by counting restored `llx_*` tables.

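
The restore-then-verify approach in the note, together with the `user_lookup` TOC filter described earlier, can be sketched like this. Only `PGHOST`, `SB_PGUSER`, and the 250-table threshold appear in the commit message for this fix; `SB_PGDATABASE`, `MIN_TABLES`, and all function names are hypothetical, credentials are assumed to be in the environment, and any other `pg_restore` flags the real script uses are omitted.

```sh
MIN_TABLES=${MIN_TABLES:-250}   # threshold taken from the commit message

# Reads a `pg_restore -l` listing on stdin and drops entries naming user_lookup.
filter_toc() {
  grep -v 'user_lookup' || true   # grep exits 1 when nothing is left; not an error here
}

# Succeeds only when the count is a number that reaches the threshold.
check_table_count() {
  count=$1
  case $count in
    ''|*[!0-9]*) echo "FAIL: table count unavailable" >&2; return 1 ;;
  esac
  if [ "$count" -lt "$MIN_TABLES" ]; then
    echo "FAIL: only $count llx_* tables restored (need >= $MIN_TABLES)" >&2
    return 1
  fi
  echo "OK: $count llx_* tables restored"
}

restore_and_verify() {
  dump=$1
  toc=$(mktemp)
  pg_restore -l "$dump" | filter_toc > "$toc"
  # Exit status deliberately ignored; the table count below is the real check.
  pg_restore -h "$PGHOST" -U "$SB_PGUSER" -d "$SB_PGDATABASE" \
    -L "$toc" "$dump" || true
  rm -f "$toc"
  count=$(psql -h "$PGHOST" -U "$SB_PGUSER" -d "$SB_PGDATABASE" -Atc \
    "SELECT count(*) FROM pg_tables WHERE schemaname = 'public' AND tablename LIKE 'llx\\_%'")
  check_table_count "$count"
}
```

Treating an empty or non-numeric count as a failure matters here: if `psql` itself cannot connect, the check fails loudly instead of passing by accident.
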

> Heads-up: a full refresh is **iso-prod**, so it overwrites `llx_user` with prod's
> and wipes the `ai_agent_sandbox` write user + its API key (and resets
> `DOLI_INSTANCE_UNIQUE_ID` to prod's, invalidating any prior key). After a refresh,
> re-run `test/provisionSandbox.ts` to recreate the agent (it re-grants the agent's
> rights, incl. `banque lire`), then refresh the `dolibarr-sandbox-write` skill
> `.env` from the new key file.

## Two fidelity caveats (by design; see ADR-0003)

1. **Encryption.** Dolibarr ties some encrypted fields (notably API keys) to
   `DOLI_INSTANCE_UNIQUE_ID`. The sandbox has its **own** uuid, so prod-encrypted
   values won't decrypt there. This is why the write-scoped `ai_agent_sandbox`
   API key must be **generated inside the sandbox** (see `../../test/` POC),
   not copied from prod. Most data is plaintext and unaffected.
2. **Uploaded files live on the PVC, not the DB.** A DB refresh copies the logo
   *const* (`MAIN_INFO_SOCIETE_LOGO`) but not the image; `sync-documents` copies
   the `documents/mycompany` tree so the logo + attachments actually render.

## BDD reset loop (E4)

For repeated rehearsals, `refresh-from-prod` is the "reset to prod state". A
faster checkpoint/reset that avoids re-reading prod each time (cache a golden
dump on a small PVC, then `DROP OWNED + pg_restore` from it) is the documented
next optimization; see ADR-0003 §Decision/Consequences.

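
One possible shape for the golden-dump check, assuming the cache is a single file on the PVC and that a maximum age decides when to go back to prod. The path, the 24-hour default, and both function names are assumptions for illustration; ADR-0003 may specify a different policy. `find -mmin` is available in GNU and BSD `find` but is not required by POSIX.

```sh
GOLDEN_DUMP=${GOLDEN_DUMP:-/cache/erp-golden.dump}   # hypothetical PVC path
MAX_AGE_MIN=${MAX_AGE_MIN:-1440}                     # hypothetical: 24 hours

# Returns 0 if the cached dump exists, is non-empty, and is recent enough.
golden_is_fresh() {
  [ -s "$GOLDEN_DUMP" ] || return 1
  # find prints the path only if it was modified within the last MAX_AGE_MIN minutes.
  [ -n "$(find "$GOLDEN_DUMP" -mmin "-$MAX_AGE_MIN" 2>/dev/null)" ]
}

# Prints which source a reset should restore from.
pick_reset_source() {
  if golden_is_fresh; then
    echo "golden"
  else
    echo "prod"
  fi
}
```

A reset loop could call `pick_reset_source` first and only fall back to the full `refresh-from-prod` path when it prints `prod`.
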
## Hardening backlog

- Replace the transient copy of prod's read+write creds with a **dedicated
  read-only Postgres role** (issued via a Vault dynamic role) so the dump path is
  least-privilege by construction, not just by `default_transaction_read_only`.
- Provision a golden-cache PVC for fast BDD resets.