factory/vibe/ADR/0003-sandbox-state-lifecycle.md
2026-06-28 20:21:45 +02:00

ADR-0003: Sandbox state lifecycle — iso-prod seed, reset & prod-write isolation

Status: Accepted
Date: 2026-06-28
Deciders: @arcodange

Context

ADR-0002 introduced the <env> coordinate and stood up erp-sandbox in-cluster: its own Postgres database erp-sandbox owned by erp_sandbox_role, its own Vault auth role with dynamic credentials at postgres/creds/erp-sandbox and KV config at kvv2/erp-sandbox/config, its own ArgoCD Application, reachable at https://erp-sandbox.arcodange.lab. That ADR created the place. It deliberately left open how that place's data is filled, refreshed, and kept incapable of harming prod — the lifecycle of the sandbox's state.

The motivating workload is the write-capable AI-agent skill foreshadowed by ADR-0002 (the future "V9" Dolibarr write skill): auto-creating supplier invoices, fixing thirdparty records, and similar mutations. For a sandbox rehearsal of those writes to predict anything, three forces must be satisfied at once:

| Force | Pressure it creates |
| --- | --- |
| Rehearsal must run against prod-shaped data | A sandbox seeded with synthetic data predicts nothing about how a write behaves on the real accounting set. |
| Rehearsal must be repeatable and disposable | An agent (and BDD suite) must run writes, observe, and roll back to a known-good state many times without manual cleanup. |
| The rehearsal path must be structurally unable to write prod | "Same app, different env" puts a sibling instance one API call away from the production database; intent alone is not a fence. |

The reach matters. The agent's only surface is the Dolibarr REST API against erp-sandbox.arcodange.lab — it never touches kubectl, the Vault root, the Postgres superuser, ArgoCD, Longhorn, or DNS. That is precisely the boundary ADR-0002 established, and it is what makes an application-data rehearsal safe to operate in-cluster.

Why this is not what ADR-0001 rejected

ADR-0001 rejected its Alternative 3 — a "sandbox namespace on the real cluster" — for shared blast radius, and chose a local-only safe environment (k3d / arm64 VMs) instead. That rejection is scoped to rehearsing infrastructure / platform change-classes: Ansible playbooks, Vault policy / auth / mount changes, Postgres superuser migrations, ArgoCD prune / selfHeal, Longhorn ops, DNS / email. Those couplings share fleet-wide control planes, so an in-cluster sandbox cannot isolate them, and a sandbox that looks like prod at the infra layer gives false confidence — it cannot faithfully mirror a three-node fleet, its Longhorn, or its single Vault.

ADR-0003 operates one layer up, at the application-data layer. The question here is not "is this Terraform/Ansible change safe to apply to the fleet" but "is this Dolibarr write safe to apply to the accounting data." At that layer the agent's reach is API-only, the state is a single Postgres database plus an uploads PVC, and a wrong write's blast radius is bounded to one app's data — all of which a sibling environment can faithfully carry, because it runs the real Dolibarr API against a real copy of prod's rows. This ADR does not reverse ADR-0001; it addresses a different problem at a different altitude. ADR-0001 isolates the operator from breaking the fleet; ADR-0003 defines how the AI agent rehearses one app's data without a structural path to prod.

Decision

We will define the erp-sandbox state lifecycle around three mechanisms — an iso-prod seed, an object-level reset, and structural prod-write isolation — plus a human-gated promote step that carries a reviewed change from sandbox to prod.

1 · Iso-prod seed (the golden checkpoint)

We will produce a "golden" copy of production data with a read-only pg_dump of the prod erp database and store it as a reusable artifact. Seeding or refreshing the sandbox loads that golden into erp-sandbox. The dump is the source of business fidelity; Dolibarr's uploaded documents/ PVC may optionally be rsync'd alongside it for file-level fidelity, but the database carries the data the rehearsal asserts against. pg_dump reads — it never writes prod — so producing the golden is itself a safe operation.
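A minimal sketch of how the golden could be produced, assuming a custom-format archive (the format pg_restore reads) scoped to the llx_* tables. The function name, the output path, and the choice to take connection details from libpq environment variables are assumptions for illustration, not the repository's actual tooling.

```python
import shlex


def golden_dump_command(dbname: str = "erp",
                        out_path: str = "golden.dump") -> list[str]:
    """Build a read-only pg_dump invocation for the app-scoped golden.

    Connection details (PGHOST, PGUSER, PGPASSWORD) are assumed to come from
    the environment; out_path is a hypothetical artifact location.
    """
    return [
        "pg_dump",
        "--format=custom",    # archive format that pg_restore can load
        "--table=llx_*",      # only Dolibarr's own tables
        f"--file={out_path}",
        f"--dbname={dbname}",
    ]


if __name__ == "__main__":
    # Print the command for review; an operator would run it with
    # subprocess.run(cmd, check=True) under a read-only prod credential.
    print(shlex.join(golden_dump_command()))
```

Ownership is not stripped here: pg_dump ignores --no-owner for archive formats, so that choice is made at restore time.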

2 · Reset via object-level wipe-and-reload — not DROP/CREATE DATABASE

We will reset the sandbox by restoring an app-scoped golden dump into the existing erp-sandbox database, not by dropping and recreating the database. Concretely, the golden is a pg_dump scoped to the application's own objects (Dolibarr prefixes every table llx_*), and reset is DROP OWNED BY erp_sandbox_role CASCADE followed by pg_restore --no-owner --role=erp_sandbox_role. The DROP OWNED step removes every object owned by the app role inside the database, that is, the app tables and any drift a rehearsal created, regardless of name; the erp-sandbox database itself, although owned by the same role, stays in place because DROP OWNED does not remove databases. The reset runs with the sandbox's own dynamic credentials (a short-lived login role that is a member of erp_sandbox_role, which owns the objects), so it needs no CREATEDB, no superuser, and is structurally confined to objects the app role owns.

Infrastructure objects that share the public schema but are owned by the provisioner rather than the app role (notably the pgbouncer user_lookup function created per-database by postgres/iac) are deliberately left untouched: they are identical across environments, are not part of the app's data, and the app credential cannot (and must not) drop or recreate them. This is why the golden is scoped to llx_* and the wipe is DROP OWNED BY <app role> rather than a blanket pg_restore --clean (which would try to recreate the provisioner-owned function and fail on ownership) or a DROP SCHEMA public CASCADE (which would take the infra function with it). The Dolibarr pod is scaled to 0 (or its backends terminated) for the duration of the restore, so that the restore has exclusive access to the database. A faster DROP/CREATE DATABASE … TEMPLATE variant exists but requires a CREATEDB role; it is covered under Alternatives considered below.
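The reset sequence described above can be sketched as an ordered list of commands. This is an illustration under stated assumptions: the workload name and namespace are hypothetical, kubectl is only one possible way to scale the pod (the ADR does not say how scaling is performed), and the database connection is assumed to use the sandbox's dynamic credentials through libpq environment variables.

```python
APP_ROLE = "erp_sandbox_role"   # owner role named in this ADR
SANDBOX_DB = "erp-sandbox"      # sandbox database named in this ADR


def reset_plan(golden_path: str = "golden.dump",
               workload: str = "deployment/dolibarr",      # hypothetical name
               namespace: str = "erp-sandbox") -> list[list[str]]:  # hypothetical
    """Return the ordered commands for an object-level wipe-and-reload."""
    return [
        # 1. Stop the app so the restore has exclusive access.
        ["kubectl", "scale", workload, "--replicas=0",
         f"--namespace={namespace}"],
        # 2. Drop everything the app role owns: app tables plus rehearsal drift.
        ["psql", f"--dbname={SANDBOX_DB}", "--set=ON_ERROR_STOP=1",
         f"--command=DROP OWNED BY {APP_ROLE} CASCADE"],
        # 3. Reload the golden as the app role, ignoring recorded ownership.
        ["pg_restore", "--no-owner", f"--role={APP_ROLE}", "--exit-on-error",
         f"--dbname={SANDBOX_DB}", golden_path],
        # 4. Bring the app back.
        ["kubectl", "scale", workload, "--replicas=1",
         f"--namespace={namespace}"],
    ]


if __name__ == "__main__":
    for step in reset_plan():
        print(" ".join(step))
```

No step in the plan needs CREATEDB or superuser. One point worth checking on a first run: PostgreSQL documents DROP OWNED as also revoking privileges that were granted to the role on objects in the current database, so any grants the provisioner made to erp_sandbox_role may need to be verified after the wipe.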

3 · Prod-write isolation — defense in depth

We record the following as the integrity invariant of the sandbox: no path the agent can reach can mutate prod. Each layer is enforced structurally — by ownership and credential scope, not by policy or convention — so they hold even on a misdirected command:

  • The only super-credential lives behind the human gate. The sole credential that can create or drop databases or otherwise reach prod is the Postgres provider configured superuser = true in postgres/iac/providers.tf. It authenticates via Vault JWT and is exercised only inside the human-gated postgres.yaml CI run (Gitea OIDC handoff + PR merge). No standing or autonomous credential holds it.
  • DROP DATABASE requires ownership. The sandbox owner role erp_sandbox_role owns only erp-sandbox; it is structurally incapable of dropping prod erp, which is owned by erp_role. Ownership — not a deny rule — is the fence.
  • The write skill is sandbox-scoped at the application layer. The V9 write skill authenticates with a Dolibarr user and API key valid only on erp-sandbox.arcodange.lab. Prod stays read-only: the ai_agent account has no Dolibarr write permissions, so even a write misdirected at prod is rejected by Dolibarr itself.
  • The runtime DB creds carry no prod rights. The sandbox runtime credentials (postgres/creds/erp-sandbox) grant only membership in erp_sandbox_role — no rights on the prod database.
  • A host-guard refuses non-sandbox targets. The write tooling refuses any operation whose target host is not erp-sandbox.*.
  • Resettability is itself a safety layer. Any mistake made in the sandbox is reverted by the next reset, so the cost of a wrong sandbox write is bounded to "reset and retry."
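The host-guard layer listed above can be illustrated with a small check that fails closed. This is a sketch, not the write tooling's actual code: the function and exception names are hypothetical, and only the erp-sandbox.* pattern comes from this ADR.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

ALLOWED_HOST_PATTERN = "erp-sandbox.*"   # pattern stated in this ADR


class NonSandboxTarget(Exception):
    """Raised when a write is aimed at anything other than the sandbox."""


def assert_sandbox_target(url: str) -> str:
    """Return the hostname if it matches the sandbox pattern, else raise.

    The hostname is parsed out of the URL rather than searched for as a
    substring, so userinfo or path tricks do not satisfy the guard. A URL
    without a parseable host yields an empty hostname and is refused.
    """
    host = (urlsplit(url).hostname or "").lower()
    if not fnmatchcase(host, ALLOWED_HOST_PATTERN):
        raise NonSandboxTarget(f"refusing write to host {host!r}")
    return host
```

The glob is deliberately kept as written in the ADR, which means any host beginning with erp-sandbox. passes; pinning the exact hostname would be stricter. The guard is one layer among several and relies on the credential and ownership fences for the cases it cannot see.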

4 · Human-in-the-loop promote

After rehearsing in the sandbox, the change is captured as a reviewable diff using the existing read-only dolibarr-data-snapshot skill (in the erp repo), which produces content-addressable before/after snapshots. A human approves the diff, and only then are the same operations applied to prod under a separate, deliberately-scoped prod-write credential used exclusively at promote time. That credential is never part of the agent's standing credentials — the agent authors and rehearses; promotion to prod is a distinct, human-initiated act.
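To make the idea of a reviewable before/after diff concrete, here is a minimal sketch. It assumes a snapshot can be represented as a mapping from a record identifier to that record's fields; this shape, and both function names, are hypothetical and do not describe the actual output of the dolibarr-data-snapshot skill.

```python
import hashlib
import json


def content_hash(snapshot: dict) -> str:
    """Hash a snapshot by content: key order does not affect the result."""
    canonical = json.dumps(snapshot, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def snapshot_diff(before: dict, after: dict) -> dict:
    """Summarise which records a rehearsal added, removed, or changed."""
    keys_before, keys_after = set(before), set(after)
    return {
        "added": sorted(keys_after - keys_before),
        "removed": sorted(keys_before - keys_after),
        "changed": sorted(
            k for k in keys_before & keys_after if before[k] != after[k]
        ),
    }
```

A reviewer would be shown the diff together with the two hashes, so that the approved change can later be matched against exactly the snapshots that were reviewed.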

Consequences

  • + Autonomous agents get a faithful, disposable rehearsal target — real prod-shaped data via the real Dolibarr API — with zero structural path to prod writes.
  • + The existing read-only skill family (dolibarr-tva-summary, dolibarr-payments-state, dolibarr-invoice-audit, dolibarr-thirdparty-completeness) becomes the BDD assertion library: each skill is a ready-made check the rehearsal can run before and after a write.
  • + Reset reuses the same Vault + Postgres scoping that already protects prod — no new privilege surface is introduced for the lifecycle; the sandbox's own dynamic creds suffice.
  • + The promote path keeps the prod-write credential out of the agent's hands entirely, so the only writes prod ever sees are human-approved replays.
  • Encryption fidelity is imperfect. Dolibarr ties some encrypted fields to DOLI_INSTANCE_UNIQUE_ID; the sandbox has its own uuid, so a few encrypted fields will not decrypt unless prod's uuid and key are copied into the sandbox KV. That "high-fidelity" mode is opt-in because it brings a prod secret into the sandbox; the default is the sandbox's own uuid, accepting the minor breakage of a few undecryptable fields.
  • Reset requires scaling the Dolibarr pod to 0 briefly, so the sandbox is unavailable for the duration of the restore.
  • pg_restore cost grows with database size; a large enough golden makes reset slow.
  • If reset becomes slow, introduce a CREATEDB-scoped role that owns only the sandbox and golden databases and switch reset to the DROP/CREATE DATABASE … TEMPLATE clone path — still structurally unable to drop prod, because it does not own erp.
  • Optional documents/ PVC rsync is a door left open for file-level fidelity if a rehearsal ever needs to assert on uploaded attachments, not just the database rows.

Alternatives considered

| Option | Why not |
| --- | --- |
| DROP/CREATE DATABASE … TEMPLATE for fast reset | Rejected as the default because it requires a CREATEDB role. Acceptable later only via a dedicated role that owns only the sandbox + golden databases (ownership keeps prod undroppable), and documented here as the escape hatch for scale, not the day-one path. |
| Use the human-gated CI superuser path (postgres.yaml) for resets | Rejected for autonomous / BDD use: that credential can reach prod, so it must stay behind the human merge gate. Wiring it into an automated reset loop would put a prod-capable credential on the agent's hot path, which is exactly what the integrity invariant forbids. |
| A fully separate cluster (ADR-0001's model) | The right answer for infra rehearsal, but overkill here. The agent's reach is API-only and the state is one database plus a PVC; a sibling in-cluster environment carries that data faithfully without a second cluster to operate. |
| Synthetic / fixture seed data instead of an iso-prod dump | Cheaper and carries no prod secrets, but predicts nothing about how a write behaves on the real accounting set, and the rehearsal's whole point is prod-shaped data. The encryption-fidelity trade-off is accepted instead. |

QA & validation

  • Reset round-trip gate — seed erp-sandbox from the golden, run a known write via the V9 skill, reset, and assert the sandbox state hashes back to the golden checkpoint (via the content-addressable dolibarr-data-snapshot hash). A reset that does not return to the golden hash is a failure.
  • No-superuser proof — the reset path runs end to end using only postgres/creds/erp-sandbox (membership in erp_sandbox_role); it must succeed with no CREATEDB and no superuser. If it needs either, the object-level mechanism is not confined as claimed.
  • Prod-undroppable proof — attempting DROP DATABASE erp (or any object write on prod) with the sandbox runtime credential must be rejected by Postgres on ownership grounds, and a write to erp.arcodange.lab with the sandbox Dolibarr key must be rejected by Dolibarr's permission model.
  • Host-guard check — the write tooling refuses any target host not matching erp-sandbox.*.
  • Promotion gate — no AI-authored write reaches prod until it has been rehearsed in erp-sandbox, captured as a reviewed before/after snapshot diff, and explicitly replayed against prod under the separate promote-time credential with human confirmation.
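The reset round-trip gate from the list above can be expressed as a small harness. This is a sketch with injected callables: state_hash, write, and reset are hypothetical stand-ins for the snapshot hash, the known V9 write, and the reset procedure, none of which are specified at code level in this ADR.

```python
from typing import Callable


def reset_round_trip_ok(state_hash: Callable[[], str],
                        write: Callable[[], None],
                        reset: Callable[[], None]) -> bool:
    """Check that a reset returns the sandbox to the golden hash.

    Raises if the known write leaves the hash unchanged, because a gate
    that never observes a change cannot prove that reset undoes one.
    """
    golden = state_hash()
    write()
    if state_hash() == golden:
        raise RuntimeError("known write did not change state; gate is vacuous")
    reset()
    return state_hash() == golden
```

Passing the three operations in as callables keeps the check independent of how snapshots and resets are implemented, so the same harness could run against an in-memory fake in unit tests and against erp-sandbox in the BDD suite.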

References

  • ADR-0001 · Safe, production-like environment: the local-only safe environment for infra rehearsal; this ADR addresses the application-data layer and does not supersede it.
  • ADR-0002 · Per-application environments: established the <env> coordinate and stood up the erp-sandbox instance whose state lifecycle this ADR defines.
  • factory postgres/iac/providers.tf: the superuser = true Postgres provider, the sole prod-capable credential, exercised only in the human-gated postgres.yaml CI run.
  • factory postgres/iac/main.tf: the per-instance flatten that assigns each database to its owner role, <app>_role or <app>_<env>_role; erp-sandbox is owned by erp_sandbox_role, prod erp by erp_role, which is why the sandbox cannot drop prod.
  • tools hashicorp-vault/iac/modules/app_roles/main.tf: the dynamic-credential role whose creation statement issues a single grant, GRANT <app>_role TO {{name}} (membership only), so postgres/creds/erp-sandbox carries no rights on the prod database.
  • erp .claude/skills/dolibarr-data-snapshot/: the read-only, content-addressable snapshot skill used to capture the reviewable before/after diff at promote time and to verify the reset round-trip.
  • PRs: this ADR is introduced by PR factory#19 (links back to this file).