docs(adr): ADR-0003 — sandbox state lifecycle (iso-prod seed, reset & prod-write isolation) #19

Merged
arcodange merged 2 commits from claude/adr-0003-sandbox-reset into main 2026-06-28 20:21:57 +02:00

Summary

Pins down how erp-sandbox's data is seeded, reset, and kept structurally incapable of harming prod — the application-data-layer complement to ADR-0001 (which rejected an in-cluster sandbox for infra rehearsal) and the lifecycle for the erp-sandbox instance ADR-0002 stood up.

Drafted via a clean-context agent, then the reset mechanism was refined and validated against the live erp-sandbox (see below).

The decision

  1. Iso-prod seed — read-only pg_dump of prod erp, app-scoped to llx_*, as a reusable golden checkpoint.
  2. Reset = DROP OWNED BY erp_sandbox_role CASCADE + pg_restore --no-owner --role=erp_sandbox_role into the existing DB — no DROP/CREATE DATABASE, no CREATEDB, no superuser. Provisioner-owned infra objects (the pgbouncer user_lookup function) are deliberately left untouched.
  3. Prod-write isolation as a structural invariant (not policy): superuser only behind the human-gated postgres.yaml CI; DROP DATABASE gated by ownership (erp_sandbox_role owns only erp-sandbox, never prod erp/erp_role); sandbox-scoped Dolibarr key; membership-only runtime creds; host-guard; resettability.
  4. Human-gated promote — rehearse in sandbox → capture a reviewable diff via the read-only dolibarr-data-snapshot skill → human approves → replay against prod under a separate promote-time credential never held by the agent.
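
To make steps 1 and 2 concrete, here is a minimal sketch that only builds the command lines and does not run them. The `golden.dump` file name and the helper names are hypothetical. It also assumes the connection credentials come from the usual `PG*` environment variables and belong to a login that is a member of `erp_sandbox_role`. The flags are standard `pg_dump`, `psql` and `pg_restore` options; the actual Job in Phase E2 may be wired differently.

```python
import re
import shlex

SANDBOX_DB = "erp-sandbox"         # database name taken from the ADR
SANDBOX_ROLE = "erp_sandbox_role"  # owning role taken from the ADR
GOLDEN_DUMP = "golden.dump"        # hypothetical file name for the checkpoint


def seed_command(prod_db="erp", dump=GOLDEN_DUMP):
    """Build the pg_dump call for the golden checkpoint, limited to llx_* relations."""
    return [
        "pg_dump",
        "--format=custom",   # custom archive format, readable by pg_restore
        "--table=llx_*",     # app-scoped: infra objects such as user_lookup are not matched
        f"--file={dump}",
        f"--dbname={prod_db}",
    ]


def reset_commands(db=SANDBOX_DB, role=SANDBOX_ROLE, dump=GOLDEN_DUMP):
    """Build the two reset steps: wipe what the app role owns, then restore the golden."""
    # The role name is interpolated into SQL, so accept plain identifiers only.
    if not re.fullmatch(r"[a-z_][a-z0-9_]*", role):
        raise ValueError(f"unexpected role name: {role!r}")
    wipe = ["psql", f"--dbname={db}", f"--command=DROP OWNED BY {role} CASCADE;"]
    restore = [
        "pg_restore",
        "--exit-on-error",
        "--no-owner",        # do not try to restore the ownership recorded in the dump
        f"--role={role}",    # SET ROLE before restoring, so new objects belong to the app role
        f"--dbname={db}",    # restore into the existing database; no --clean, no --create
        dump,
    ]
    return [wipe, restore]


if __name__ == "__main__":
    for argv in [seed_command(), *reset_commands()]:
        print(shlex.join(argv))
```

Neither reset step needs `CREATEDB` or superuser: the wipe only touches objects owned by the app role, and the restore targets a database that already exists.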

Validated against the live sandbox (not just argued)

I prototyped the reset on erp-sandbox end-to-end:

checkpoint: app-scoped pg_dump (llx_*), 13 objects (infra user_lookup excluded)
drift: +1 invoice row, +1 canary table
reset: DROP OWNED BY erp_sandbox_role CASCADE → 0 tables  →  pg_restore (app-scoped)
verify: facture back to golden ✓  drift row gone ✓  canary gone ✓  owner=erp_sandbox_role ✓
integrity: superuser=false  createdb=false  member-of-erp_role=false
           erp-sandbox→erp_sandbox_role,  erp(prod)→erp_role  ⇒ cannot touch prod

The prototype also caught the refinement now baked into §2: a naive pg_restore --clean/DROP SCHEMA fails or over-reaches on the provisioner-owned pgbouncer function — so the golden is scoped to llx_* and the wipe is DROP OWNED BY <app role>. (Prototype artifacts were throwaway k8s Jobs; the sandbox DB was returned to its empty pre-install state and the pod restored.)
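
The `integrity` line of the log can be expressed as a small check. This is a sketch under assumptions: the two SQL strings show one way to read the facts from the `pg_roles` and `pg_database` catalogs, and the `RoleFacts` and `isolation_violations` names are hypothetical. The prototype's actual queries are not recorded in this PR.

```python
from dataclasses import dataclass

# Assumed queries; they use the standard catalogs and the built-in pg_has_role().
ROLE_ATTRS_SQL = """
SELECT rolsuper, rolcreatedb,
       pg_has_role('erp_sandbox_role', 'erp_role', 'MEMBER') AS member_of_prod_role
FROM pg_roles
WHERE rolname = 'erp_sandbox_role';
"""
DB_OWNERS_SQL = """
SELECT datname, pg_get_userbyid(datdba) AS owner
FROM pg_database
WHERE datname IN ('erp', 'erp-sandbox');
"""


@dataclass(frozen=True)
class RoleFacts:
    superuser: bool
    createdb: bool
    member_of_prod_role: bool
    db_owners: dict  # database name -> owning role


def isolation_violations(facts, sandbox_db="erp-sandbox",
                         sandbox_role="erp_sandbox_role", prod_db="erp"):
    """Return a list of broken invariants; an empty list means the check passes."""
    problems = []
    if facts.superuser:
        problems.append("sandbox role is a superuser")
    if facts.createdb:
        problems.append("sandbox role has CREATEDB")
    if facts.member_of_prod_role:
        problems.append("sandbox role is a member of the prod role")
    if facts.db_owners.get(sandbox_db) != sandbox_role:
        problems.append("sandbox database is not owned by the sandbox role")
    if facts.db_owners.get(prod_db) == sandbox_role:
        problems.append("prod database is owned by the sandbox role")
    return problems


# Facts as reported in the prototype log above.
observed = RoleFacts(
    superuser=False,
    createdb=False,
    member_of_prod_role=False,
    db_owners={"erp-sandbox": "erp_sandbox_role", "erp": "erp_role"},
)
assert isolation_violations(observed) == []
```

Running a check like this before each reset would turn the invariant into something a Job can fail on, rather than a one-off observation.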

Follow-on (not in this PR)

Implementation is Phase E: E1 (enable the Dolibarr API module and create the ai_agent_sandbox user; your UI step), E2 (productionize the seed/reset as a k8s Job, plus a read-only prod dump role), E3 (V9 write skill + host-guard), E4 (BDD harness), E5 (promote).

🤖 Generated with Claude Code

arcodange added 1 commit 2026-06-28 20:21:27 +02:00
Records how erp-sandbox's DATA is seeded, reset, and kept structurally
incapable of harming prod — the application-data-layer complement to ADR-0001
(which rejected an in-cluster sandbox for INFRA rehearsal) and the lifecycle for
the erp-sandbox instance ADR-0002 stood up.

Decision: (1) iso-prod golden via read-only pg_dump of prod erp, app-scoped to
llx_*; (2) reset = DROP OWNED BY erp_sandbox_role CASCADE + pg_restore
--no-owner --role=erp_sandbox_role into the EXISTING db (no DROP/CREATE DATABASE,
no CREATEDB, no superuser; provisioner-owned infra objects like the pgbouncer
user_lookup function are left untouched); (3) prod-write isolation as a
structural invariant (superuser only in human-gated postgres.yaml CI; DROP
DATABASE gated by ownership — erp_sandbox_role owns only erp-sandbox, never prod
erp/erp_role; sandbox-scoped Dolibarr key; membership-only runtime creds;
host-guard; resettability); plus a human-gated promote via the read-only
dolibarr-data-snapshot diff under a separate prod-write credential.

The reset mechanism + the integrity invariant were validated against the live
erp-sandbox: DROP OWNED BY erp_sandbox_role + app-scoped pg_restore round-trips
to the golden checkpoint using only erp_sandbox_role membership (superuser=false,
createdb=false, not a member of erp_role), proving prod is structurally
unreachable from the sandbox credential.

Drafted via a clean-context agent; mechanism refined from a live prototype.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
arcodange added 1 commit 2026-06-28 20:21:50 +02:00
arcodange merged commit 9a42346852 into main 2026-06-28 20:21:56 +02:00
arcodange deleted branch claude/adr-0003-sandbox-reset 2026-06-28 20:21:58 +02:00

Reference: arcodange-org/factory#19