Merge pull request 'docs(adr): ADR-0003 — sandbox state lifecycle (iso-prod seed, reset & prod-write isolation)' (#19) from claude/adr-0003-sandbox-reset into main

2026-06-28 20:21:54 +02:00
parent 5c60677171 8e69004b4c
commit 9a42346852
2 changed files with 97 additions and 1 deletions
--- a/vibe/ADR/0003-sandbox-state-lifecycle.md
+++ b/vibe/ADR/0003-sandbox-state-lifecycle.md
@@ -0,0 +1,95 @@
+[vibe](../README.md) > [ADR](README.md) > **0003 · Sandbox state lifecycle**
+
+# ADR-0003: Sandbox state lifecycle — iso-prod seed, reset & prod-write isolation
+
+> **Status**: Accepted
+> **Date**: 2026-06-28
+> **Deciders**: @arcodange
+
+## Context
+
+[ADR-0002](0002-per-application-environments.md) introduced the `<env>` coordinate and stood up `erp-sandbox` in-cluster: its own Postgres database `erp-sandbox` owned by `erp_sandbox_role`, its own Vault auth role with dynamic credentials at `postgres/creds/erp-sandbox` and KV config at `kvv2/erp-sandbox/config`, its own ArgoCD Application, reachable at `https://erp-sandbox.arcodange.lab`. That ADR created the *place*. It deliberately left open *how that place's data is filled, refreshed, and kept incapable of harming prod* — the lifecycle of the sandbox's state.
+
+The motivating workload is the write-capable AI-agent skill foreshadowed by ADR-0002 (the future "V9" Dolibarr write skill): auto-creating supplier invoices, fixing thirdparty records, and similar mutations. These writes are to be rehearsed in the sandbox before they touch prod, and for that rehearsal to predict anything, three forces must be satisfied at once:
+
+| Force | Pressure it creates |
+| --- | --- |
+| Rehearsal must run against prod-shaped data | A sandbox seeded with synthetic data predicts nothing about how a write behaves on the real accounting set. |
+| Rehearsal must be repeatable and disposable | An agent (and BDD suite) must run writes, observe, and roll back to a known-good state many times without manual cleanup. |
+| The rehearsal path must be structurally unable to write prod | "Same app, different env" puts a sibling instance one API call away from the production database; intent alone is not a fence. |
+
+The reach matters. The agent's only surface is the **Dolibarr REST API** against `erp-sandbox.arcodange.lab` — it never touches kubectl, the Vault root, the Postgres superuser, ArgoCD, Longhorn, or DNS. That is precisely the boundary ADR-0002 established, and it is what makes an application-data rehearsal safe to operate in-cluster.
+
+### Why this is not what ADR-0001 rejected
+
+[ADR-0001](0001-safe-prod-like-environment.md) rejected its Alternative 3 — a "sandbox namespace on the real cluster" — for shared blast radius, and chose a **local-only** safe environment (k3d / arm64 VMs) instead. That rejection is scoped to rehearsing **infrastructure / platform** change-classes: Ansible playbooks, Vault policy / auth / mount changes, Postgres superuser migrations, ArgoCD prune / selfHeal, Longhorn ops, DNS / email. Those couplings share fleet-wide control planes, so an in-cluster sandbox cannot isolate them, and a sandbox that *looks* like prod at the infra layer gives false confidence — it cannot faithfully mirror a three-node fleet, its Longhorn, or its single Vault.
+
+ADR-0003 operates one layer up, at the **application-data layer**. The question here is not "is this Terraform/Ansible change safe to apply to the fleet" but "is this Dolibarr write safe to apply to the accounting data." At that layer the agent's reach is **API-only**, the state is a single Postgres database plus an uploads PVC, and a wrong write's blast radius is bounded to one app's data — all of which a sibling environment *can* faithfully carry, because it runs the real Dolibarr API against a real copy of prod's rows. This ADR does not reverse ADR-0001; it addresses a different problem at a different altitude. ADR-0001 isolates the *operator* from breaking the *fleet*; ADR-0003 defines how the *AI agent* rehearses *one app's data* without a structural path to prod.
+
+## Decision
+
+We will define the `erp-sandbox` state lifecycle around three mechanisms — an iso-prod seed, an object-level reset, and structural prod-write isolation — plus a human-gated promote step that carries a reviewed change from sandbox to prod.
+
+### 1 · Iso-prod seed (the golden checkpoint)
+
+We will produce a "golden" copy of production data with a **read-only** `pg_dump` of the prod `erp` database and store it as a reusable artifact. Seeding or refreshing the sandbox loads that golden into `erp-sandbox`. The dump is the source of business fidelity; Dolibarr's uploaded `documents/` PVC may *optionally* be rsync'd alongside it for file-level fidelity, but the database carries the data the rehearsal asserts against. `pg_dump` reads — it never writes prod — so producing the golden is itself a safe operation.
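A minimal sketch of how the golden dump described above could be invoked, assuming connection settings come from the standard libpq environment variables (`PGHOST`, `PGUSER`, and so on). The helper name `golden_dump_argv` and the output path are hypothetical, and the custom archive format is an assumption chosen because `pg_restore` only reads non-plain-text archives.

```python
from typing import List


def golden_dump_argv(database: str, out_path: str,
                     table_pattern: str = "llx_*") -> List[str]:
    """Build a pg_dump argv for an app-scoped golden dump (hypothetical helper).

    The list form is meant to be passed to subprocess.run without a shell,
    so the `llx_*` pattern reaches pg_dump unexpanded.
    """
    return [
        "pg_dump",
        "--format=custom",            # archive format readable by pg_restore
        f"--table={table_pattern}",   # restrict the dump to Dolibarr's llx_* relations
        f"--file={out_path}",         # where the golden artifact is written
        database,                     # source database; pg_dump only reads it
    ]


if __name__ == "__main__":
    # Hypothetical output path, for illustration only.
    print(" ".join(golden_dump_argv("erp", "golden/erp.dump")))
```

The dump is scoped by pattern rather than by schema because, as the ADR notes later, provisioner-owned objects share the `public` schema with the app tables.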
+
+### 2 · Reset via object-level wipe-and-reload — not `DROP/CREATE DATABASE`
+
+We will reset the sandbox by restoring an **app-scoped** golden dump **into the existing `erp-sandbox` database**, not by dropping and recreating the database. Concretely, the golden is a `pg_dump` scoped to the application's own objects (Dolibarr prefixes every table `llx_*`), and reset is two steps. First, **`DROP OWNED BY erp_sandbox_role CASCADE`** removes every object owned by the app role: the app tables *and* any drift a rehearsal created, regardless of name. Second, `pg_restore --no-owner --role=erp_sandbox_role` reloads the golden. Both steps run with the sandbox's **own dynamic credentials** (a short-lived login role that is a member of `erp_sandbox_role`, which owns the objects), so reset needs **no `CREATEDB`, no superuser**, and is structurally confined to objects the app role owns.
+
+Infrastructure objects that share the `public` schema but are owned by the *provisioner* rather than the app role (notably the pgbouncer `user_lookup` function created per-database by `postgres/iac`) are deliberately left untouched: they are identical across environments, are not part of the app's data, and the app credential cannot (and must not) drop or recreate them. This is why the golden is scoped to `llx_*` and the wipe is `DROP OWNED BY <app role>` rather than a blanket `pg_restore --clean` (which would try to recreate the provisioner-owned function and fail on ownership) or a `DROP SCHEMA public CASCADE` (which would take the infra function with it). The Dolibarr pod is scaled to 0 (or its backends terminated) for the duration of the restore so that the restore has exclusive access to the database. A faster `DROP/CREATE DATABASE … TEMPLATE` variant exists but requires a `CREATEDB` role; it is deferred, and discussed under Alternatives considered below.
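The reset sequence in this section can be sketched as an ordered list of commands. This is an illustration under assumptions, not the project's tooling: the helper `reset_plan`, the Kubernetes namespace, and the deployment name are hypothetical, and the database credentials are assumed to arrive through libpq environment variables populated from `postgres/creds/erp-sandbox`.

```python
from typing import List

APP_ROLE = "erp_sandbox_role"   # owner role named in the ADR
DATABASE = "erp-sandbox"        # sandbox database named in the ADR


def reset_plan(golden_path: str,
               deployment: str = "deployment/dolibarr",   # hypothetical name
               namespace: str = "erp-sandbox") -> List[List[str]]:
    """Return the reset steps as argv lists, in execution order."""
    scale = ["kubectl", "--namespace", namespace, "scale", deployment]
    return [
        # 1. Stop the app so the restore has exclusive access.
        scale + ["--replicas=0"],
        # 2. Wipe everything the app role owns, including rehearsal drift.
        ["psql", "--dbname", DATABASE, "--set", "ON_ERROR_STOP=1",
         "--command", f"DROP OWNED BY {APP_ROLE} CASCADE"],
        # 3. Reload the golden; objects are created as the app role.
        ["pg_restore", "--dbname", DATABASE, "--no-owner",
         f"--role={APP_ROLE}", "--exit-on-error", golden_path],
        # 4. Bring the app back.
        scale + ["--replicas=1"],
    ]


if __name__ == "__main__":
    for step in reset_plan("golden/erp.dump"):
        print(" ".join(step))
```

Each argv list can be handed to `subprocess.run(step, check=True)` so that a failing step aborts the sequence. PostgreSQL documents that `DROP OWNED` also revokes privileges granted to the named role on objects in the current database and on shared objects, so any explicit grants the provisioner made to `erp_sandbox_role` are worth re-checking after a reset.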
+
+### 3 · Prod-write isolation — defense in depth
+
+We record the following as the integrity invariant of the sandbox: **no path the agent can reach can mutate prod.** Each layer is enforced *structurally* — by ownership and credential scope, not by policy or convention — so they hold even on a misdirected command:
+
+- **The only super-credential lives behind the human gate.** The sole credential that can create or drop databases or otherwise reach prod is the Postgres provider configured `superuser = true` in `postgres/iac/providers.tf`. It authenticates via Vault JWT and is exercised **only** inside the human-gated `postgres.yaml` CI run (Gitea OIDC handoff + PR merge). No standing or autonomous credential holds it.
+- **`DROP DATABASE` requires ownership.** The sandbox owner role `erp_sandbox_role` owns **only** `erp-sandbox`; it is structurally incapable of dropping prod `erp`, which is owned by `erp_role`. Ownership — not a deny rule — is the fence.
+- **The write skill is sandbox-scoped at the application layer.** The V9 write skill authenticates with a Dolibarr user and API key valid only on `erp-sandbox.arcodange.lab`. Prod stays read-only: the `ai_agent` account has no Dolibarr write permissions, so even a write misdirected at prod is rejected by Dolibarr itself.
+- **The runtime DB creds carry no prod rights.** The sandbox runtime credentials (`postgres/creds/erp-sandbox`) grant only membership in `erp_sandbox_role` — no rights on the prod database.
+- **A host-guard refuses non-sandbox targets.** The write tooling refuses any operation whose target host is not `erp-sandbox.*`.
+- **Resettability is itself a safety layer.** Any mistake made in the sandbox is reverted by the next reset, so the cost of a wrong sandbox write is bounded to "reset and retry."
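The host-guard layer listed above might look like the following sketch. The function names are hypothetical, and the check implements only the `erp-sandbox.*` host pattern stated in the ADR; how the real write tooling enforces it is not specified in the source.

```python
from urllib.parse import urlsplit


def is_sandbox_target(url: str) -> bool:
    """True only when the URL's host component starts with 'erp-sandbox.'.

    urlsplit isolates the hostname, so userinfo, paths, and query strings
    that merely contain the sandbox name do not pass.
    """
    host = (urlsplit(url).hostname or "").lower()
    return host.startswith("erp-sandbox.")


def require_sandbox(url: str) -> str:
    """Return the URL unchanged, or refuse it before any request is made."""
    if not is_sandbox_target(url):
        raise PermissionError(f"refusing non-sandbox target: {url!r}")
    return url
```

A URL without a scheme has no parsed hostname and is therefore refused, which keeps the guard failing closed. A stricter variant could compare against the single exact hostname `erp-sandbox.arcodange.lab` instead of a prefix.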
+
+### 4 · Human-in-the-loop promote
+
+After rehearsing in the sandbox, the change is captured as a reviewable diff using the existing read-only `dolibarr-data-snapshot` skill (in the `erp` repo), which produces content-addressable before/after snapshots. A human approves the diff, and only then are the **same** operations applied to prod under a **separate, deliberately-scoped prod-write credential** used exclusively at promote time. That credential is never part of the agent's standing credentials — the agent authors and rehearses; promotion to prod is a distinct, human-initiated act.
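To illustrate what a reviewable before/after diff could contain, here is a small sketch. The snapshot shape (a mapping from record id to a dict of fields) and the function `diff_snapshots` are assumptions made for the example; the actual output format of `dolibarr-data-snapshot` is not described in this ADR.

```python
from typing import Any, Dict, List, Tuple

# Assumed shape: record id -> field name -> value.
Snapshot = Dict[str, Dict[str, Any]]


def diff_snapshots(before: Snapshot, after: Snapshot) -> Dict[str, List]:
    """Summarise added, removed, and changed records between two snapshots."""
    added = sorted(k for k in after if k not in before)
    removed = sorted(k for k in before if k not in after)
    changed: List[Tuple[str, List[str]]] = []
    for key in sorted(before.keys() & after.keys()):
        old, new = before[key], after[key]
        # Names of fields whose values differ, including fields present on one side only.
        fields = sorted(f for f in old.keys() | new.keys()
                        if old.get(f) != new.get(f))
        if fields:
            changed.append((key, fields))
    return {"added": added, "removed": removed, "changed": changed}
```

A reviewer approving a promote would then be looking at a short list of record ids and field names rather than at raw dumps.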
+
+## Consequences
+
+- **+** Autonomous agents get a faithful, disposable rehearsal target — real prod-shaped data via the real Dolibarr API — with zero structural path to prod writes.
+- **+** The existing read-only skill family (`dolibarr-tva-summary`, `dolibarr-payments-state`, `dolibarr-invoice-audit`, `dolibarr-thirdparty-completeness`) becomes the BDD assertion library: each skill is a ready-made check the rehearsal can run before and after a write.
+- **+** Reset reuses the same Vault + Postgres scoping that already protects prod — no new privilege surface is introduced for the lifecycle; the sandbox's own dynamic creds suffice.
+- **+** The promote path keeps the prod-write credential out of the agent's hands entirely, so the only writes prod ever sees are human-approved replays.
+- **−** Encryption fidelity is imperfect. Dolibarr ties some encrypted fields to `DOLI_INSTANCE_UNIQUE_ID`; the sandbox has its own UUID, so those fields will not decrypt unless prod's UUID and key are copied into the sandbox KV. That "high-fidelity" mode is opt-in because it brings a prod secret into the sandbox; the default is the sandbox's own UUID, accepting the minor breakage of a few undecryptable fields.
+- **−** Reset requires scaling the Dolibarr pod to 0 briefly, so the sandbox is unavailable for the duration of the restore.
+- **−** `pg_restore` cost grows with database size; a large enough golden makes reset slow.
+- **→** If reset becomes slow, introduce a `CREATEDB`-scoped role that owns only the sandbox and golden databases and switch reset to the `DROP/CREATE DATABASE … TEMPLATE` clone path — still structurally unable to drop prod, because it does not own `erp`.
+- **→** Optional `documents/` PVC rsync is a door left open for file-level fidelity if a rehearsal ever needs to assert on uploaded attachments, not just the database rows.
+
+## Alternatives considered
+
+| Option | Why not |
+| --- | --- |
+| `DROP/CREATE DATABASE … TEMPLATE` for fast reset | Rejected as the **default** because it requires a `CREATEDB` role. Acceptable **later** only via a dedicated role that owns only the sandbox + golden databases — ownership keeps prod undroppable — and documented here as the escape hatch for scale, not the day-one path. |
+| Use the human-gated CI superuser path (`postgres.yaml`) for resets | Rejected for autonomous / BDD use: that credential can reach prod, so it must stay behind the human merge gate. Wiring it into an automated reset loop would put a prod-capable credential on the agent's hot path — exactly what the integrity invariant forbids. |
+| A fully separate cluster ([ADR-0001](0001-safe-prod-like-environment.md)'s model) | The right answer for **infra** rehearsal, but overkill here. The agent's reach is API-only and the state is one database plus a PVC; a sibling in-cluster environment carries that data faithfully without a second cluster to operate. |
+| Synthetic / fixture seed data instead of an iso-prod dump | Cheaper and carries no prod secrets, but predicts nothing about how a write behaves on the real accounting set — the rehearsal's whole point is prod-shaped data. The encryption-fidelity trade-off is accepted instead. |
+
+## QA & validation
+
+- **Reset round-trip gate** — seed `erp-sandbox` from the golden, run a known write via the V9 skill, reset, and assert the sandbox state hashes back to the golden checkpoint (via the content-addressable `dolibarr-data-snapshot` hash). A reset that does not return to the golden hash is a failure.
+- **No-superuser proof** — the reset path runs end to end using only `postgres/creds/erp-sandbox` (membership in `erp_sandbox_role`); it must succeed with **no** `CREATEDB` and **no** superuser. If it needs either, the object-level mechanism is not confined as claimed.
+- **Prod-undroppable proof** — attempting `DROP DATABASE erp` (or any object write on prod) with the sandbox runtime credential must be rejected by Postgres on ownership grounds, and a write to `erp.arcodange.lab` with the sandbox Dolibarr key must be rejected by Dolibarr itself: the key is valid only on the sandbox, and the prod `ai_agent` account holds no write permissions.
+- **Host-guard check** — the write tooling refuses any target host not matching `erp-sandbox.*`.
+- **Promotion gate** — no AI-authored write reaches prod until it has been rehearsed in `erp-sandbox`, captured as a reviewed before/after snapshot diff, and explicitly replayed against prod under the separate promote-time credential with human confirmation.
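The reset round-trip gate above depends on a content-addressable hash of the sandbox state. A minimal sketch of such a check follows, assuming a snapshot is JSON-serialisable data; the canonicalisation (sorted keys, compact separators, SHA-256) is an illustrative choice and not necessarily what `dolibarr-data-snapshot` does.

```python
import hashlib
import json
from typing import Any


def snapshot_hash(snapshot: Any) -> str:
    """Hash a snapshot so that equal content always yields the same digest."""
    canonical = json.dumps(snapshot, sort_keys=True,
                           separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def reset_round_trip_ok(golden: Any, after_reset: Any) -> bool:
    """True when the post-reset state hashes back to the golden checkpoint."""
    return snapshot_hash(golden) == snapshot_hash(after_reset)
```

Sorting keys makes the digest independent of dict ordering, so only a genuine content difference fails the gate. List order is preserved by this canonicalisation, so a real snapshot would need a stable row order.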
+
+## References
+
+- [ADR-0001 · Safe, production-like environment](0001-safe-prod-like-environment.md) — the local-only safe environment for **infra** rehearsal; this ADR addresses the **application-data** layer and does not supersede it.
+- [ADR-0002 · Per-application environments](0002-per-application-environments.md) — established the `<env>` coordinate and stood up the `erp-sandbox` instance whose state lifecycle this ADR defines.
+- `factory` `postgres/iac/providers.tf` — the `superuser = true` Postgres provider, the sole prod-capable credential, exercised only in the human-gated `postgres.yaml` CI run.
+- `factory` `postgres/iac/main.tf` — the per-instance flatten that owns each database by its `<app>_role` / `<app>_<env>_role`; `erp-sandbox` is owned by `erp_sandbox_role`, prod `erp` by `erp_role`, which is why the sandbox cannot drop prod.
+- `tools` `hashicorp-vault/iac/modules/app_roles/main.tf` — the dynamic-credential role whose creation statement grants only `GRANT <app>_role TO {{name}}` (membership only), so `postgres/creds/erp-sandbox` carries no rights on the prod database.
+- `erp` `.claude/skills/dolibarr-data-snapshot/` — the read-only, content-addressable snapshot skill used to capture the reviewable before/after diff at promote time and to verify the reset round-trip.
+- PRs: this ADR is introduced by [PR factory#19](https://gitea.arcodange.lab/arcodange-org/factory/pulls/19) (links back to this file).
--- a/vibe/ADR/README.md
+++ b/vibe/ADR/README.md
@@ -3,7 +3,7 @@
 # Architecture Decision Records

 > **Status**: 🟢 Active
-> **Last Updated**: 2026-06-25
+> **Last Updated**: 2026-06-28
 > **Related**: [vibe/PRD](../PRD/README.md) · [vibe/Investigations](../investigations/README.md)
 > **Historical**: [doc/adr](../../doc/adr/README.md) (foundational infra) · [ansible/.../docs/adr](../../ansible/arcodange/factory/docs/adr/) (dated infra ADRs)

@@ -35,6 +35,7 @@ When a new decision *supersedes* one of the historical records, write the new AD
 | --- | --- | --- | --- |
 | [0001](0001-safe-prod-like-environment.md) | Safe, production-like environment | 🟢 Accepted | 2026-06-23 |
 | [0002](0002-per-application-environments.md) | Per-application environments | 🟢 Accepted | 2026-06-25 |
+| [0003](0003-sandbox-state-lifecycle.md) | Sandbox state lifecycle | 🟢 Accepted | 2026-06-28 |

 ## Rules to contribute