ADR-0002: Per-application environments via an env coordinate
Status: Accepted · Date: 2026-06-25 · Deciders: @arcodange
Context
The <app> join key threads one kebab-case identifier identically through every system that makes up an application: the Gitea repo, the Postgres database + <app>_role, Vault (postgres/creds/<app>, the k8s auth role <app>, the policies <app> / <app>-ops, the CI JWT role gitea_cicd_<app>), the k8s namespace + ServiceAccount, the ArgoCD Application, the GCS state prefix <app>/main, and DNS (<app>.arcodange.lab). Bricks wire together by name convention, not explicit config.
That convention conflates two ideas it never separated: an application and a deployed instance of it. There is exactly one of everything per app — one namespace, one database, one Vault creds path, one DNS host. The model cannot express "the same app, a second time, somewhere else."
The motivating need makes the gap concrete. The Arcodange Dolibarr ERP is growing a write-capable AI-agent skill — auto-creating supplier invoices from ingested emails, fixing thirdparty data, and similar mutations. Before such writes touch the production accounting database, the operator needs a place where the agent can run write operations autonomously, a human reviews the result, and only then the same operation is promoted to prod. That requires a second deployed instance of the same application: the same Dolibarr chart, the same version, the same conventions — differing only in where it runs and which data it touches.
| Force | Pressure it creates |
|---|---|
| One identifier per app, no env coordinate | "Same app, different environment" is inexpressible without inventing a whole second app. |
| Write-capable AI agent landing on the prod ERP | A wrong autonomous write corrupts live accounting data with no rehearsal surface. |
| Fidelity requirement for the rehearsal surface | The sandbox must run the real Dolibarr API against prod-like data, or the rehearsal predicts nothing. |
| ADR-0001 rejected an in-cluster sandbox | Its Alternative 3 ("sandbox namespace on the real cluster") was rejected for shared blast radius — so any in-cluster sibling instance must be reconciled against that, not pretended away. |
Treating the sandbox as a wholly separate app would fork the chart, the repo, the runbook chain, and the Vault wiring — four things that then drift apart over time, defeating the "same app, same version" fidelity the rehearsal depends on.
Decision
We will extend the <app> convention with a second coordinate, <env>, governed by an elision rule so that adding the coordinate changes nothing for any existing app.
- env defaults to prod, and prod elides. When env == prod, no suffix is added: every derived name is character-for-character identical to today's single-env output. The instance name equals the app name (local.instance == local.name), so every existing app's tofu plan is a no-op.
- Non-prod envs take the <app>-<env> suffix in kebab-case everywhere (namespace, Vault paths / roles / policies, ArgoCD Application, DNS host, GCS-state sub-prefix), with one exception: the Postgres owner role stays snake-case as <app>_<env>_role, matching the existing _role suffix convention.
- One repo and one chart serve every env of an app. Per-env differences are overlaid via values-<env>.yaml; the chart's instance-specific values are .Values-driven, not hardcoded literals, so the same chart renders any instance.
- One CI JWT role (gitea_cicd_<app>) per repo covers all its envs. Its ops policy is widened to the <app>-* path family. Each running instance keeps its own runtime Vault policy.
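The elision rule and the owner-role exception can be illustrated with a minimal Python sketch. It is not the OpenTofu module code: `InstanceNames` and `derive_names` are hypothetical names introduced here, and only a subset of the derived coordinates is shown. The sketch substitutes `app` and `env` verbatim into the role name, so how hyphens in a multi-word app name are handled for the snake-case role is left open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceNames:
    """Hypothetical container for a few names derived from (app, env)."""
    instance: str          # kebab-case instance name
    namespace: str         # k8s namespace + ServiceAccount
    postgres_role: str     # snake-case owner role (the one exception)
    vault_creds_path: str  # Vault dynamic DB creds path
    dns_host: str          # internal DNS host
    ci_jwt_role: str       # shared per repo, never env-suffixed


def derive_names(app: str, env: str = "prod") -> InstanceNames:
    """Derive instance names; prod elides, non-prod takes the -<env> suffix."""
    is_prod = env == "prod"
    # Elision rule: for prod the instance name equals the app name.
    instance = app if is_prod else f"{app}-{env}"
    # Owner-role exception: <app>_role for prod, <app>_<env>_role otherwise.
    role = f"{app}_role" if is_prod else f"{app}_{env}_role"
    return InstanceNames(
        instance=instance,
        namespace=instance,
        postgres_role=role,
        vault_creds_path=f"postgres/creds/{instance}",
        dns_host=f"{instance}.arcodange.lab",
        ci_jwt_role=f"gitea_cicd_{app}",
    )


if __name__ == "__main__":
    print(derive_names("erp"))
    print(derive_names("erp", "sandbox"))
```

Calling `derive_names("erp")` yields the same names a single-env app has today, which is the property the no-op plan depends on.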
Worked example: erp and erp-sandbox
| Coordinate | erp (env = prod, elided) | erp-sandbox (env = sandbox) |
|---|---|---|
| Postgres database | erp | erp-sandbox |
| Postgres owner role | erp_role | erp_sandbox_role |
| k8s namespace + ServiceAccount | erp | erp-sandbox |
| Vault dynamic DB creds | postgres/creds/erp | postgres/creds/erp-sandbox |
| Vault KV config | kvv2/erp/config | kvv2/erp-sandbox/config |
| ArgoCD Application | erp | erp-sandbox |
| Internal DNS | erp.arcodange.lab | erp-sandbox.arcodange.lab |
| Gitea repo | arcodange-org/erp | arcodange-org/erp (shared) |
| Helm chart | one chart | one chart (shared) |
| CI JWT role | gitea_cicd_erp | gitea_cicd_erp (shared) |
Why this is not what ADR-0001 rejected
ADR-0001 chose a local-only safe environment (k3d / arm64 VMs) and rejected its Alternative 3, an in-cluster "sandbox namespace on the real cluster," for shared blast radius. ADR-0002 introduces an in-cluster sibling instance (erp-sandbox), which looks like the very thing that was rejected. The two stand together because they operate at different layers.
ADR-0001's rejection is scoped to rehearsing infrastructure / platform change-classes — Ansible playbooks, Vault policy / auth / mount changes, Postgres superuser migrations, ArgoCD prune / selfHeal, Longhorn ops, DNS / email. Those couplings share fleet-wide control planes, so an in-cluster sandbox cannot isolate them; only a separate cluster + Vault + state + DNS zone can. That is exactly why ADR-0001 is local-only.
ADR-0002 operates one layer up. The AI agent's only reach is the Dolibarr HTTP API, holding a write-scoped, app-specific API key against an isolated database — erp-sandbox on its own erp_sandbox_role, its own namespace, its own Vault creds path. The agent never touches kubectl, the Vault root, the Postgres superuser, ArgoCD, Longhorn, or DNS. The fleet-level blast radius that doomed Alternative 3 for infra rehearsal is simply not in the agent's reach; the blast radius of a wrong AI write is bounded to the sandbox app's own data.
The two ADRs are therefore complementary, not contradictory, and ADR-0002 does not supersede ADR-0001. ADR-0001 isolates the operator from breaking the fleet. ADR-0002 isolates the AI agent from corrupting one app's production data, while preserving the prod-like API surface and real-data fidelity that the local k3d sandbox — which carries no prod data — cannot offer.
Consequences
- + Every existing app (webapp, erp, crowdsec, plausible, dance-lessons-coach, cms) is unaffected: the elision rule makes the prod instance's derived names byte-identical, so adoption ships with zero migration and a no-op plan.
- + A second instance of an app is now a values-<env>.yaml overlay plus an envs entry, not a forked repo, chart, and runbook chain, so prod and sandbox share one source of truth and stay on the same version by construction.
- + The AI-agent write skill gets a prod-like rehearsal surface with real-shaped data: the same Dolibarr API and chart, an isolated database, a bounded blast radius.
- + The convention chain (db + role → Vault creds + policy → namespace + SA → ArgoCD → DNS) is reused verbatim for the -sandbox instance, so runbooks read identically for any env.
- − Names are no longer a flat app list: every consumer must reason about the instance == app (prod) versus app-env (non-prod) distinction, and the snake-case owner-role exception (<app>_<env>_role) is a special case that must be carried in the modules.
- − A single shared Vault CI policy widened to <app>-* means the CI role for a repo can write the ops paths of all that repo's envs, a deliberately looser ops scope than one-policy-per-instance.
- − A single shared OpenTofu state per repo holds every env's resources together, so the envs of one app share a blast radius at the state layer (mitigated by for_each, accepted at current scale; see Alternatives).
- → The AI-agent promotion workflow this unlocks: the agent runs writes against erp-sandbox autonomously, emits a structured changeset, a human reviews it, and the same operation is re-applied to prod only with explicit confirmation, never auto-applied by the agent. The read/write skills resolve their target by an env switch (e.g. DOLIBARR_TARGET=prod|sandbox, defaulting to prod).
- → Rollout is additive and phased, each phase gated by a no-op tofu plan against existing apps: (A) the tools repo adds an optional env/envs parameter to the shared app_roles and app_policy Vault modules; (B) the factory repo gains the envs schema in postgres/iac tfvars, renders one ArgoCD Application per env, and documents the elision rule in conventions.md; (C) the erp chart literals are templated to .Values; (D) erp + factory activate erp-sandbox; (E) DNS + ArgoCD registration.
- → Per-env state separation (<app>/<env> prefixes) is a door left open: if env-to-env blast-radius isolation at the state layer becomes warranted, the prefix scheme can be revisited without changing the naming model.
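The promotion workflow and the env switch described in the list above might look roughly like the following Python sketch. Only the DOLIBARR_TARGET variable and its prod default come from this ADR; `resolve_target`, `Changeset`, and `promote` are hypothetical names, and a real skill would call the Dolibarr HTTP API rather than return a list.

```python
import os
from dataclasses import dataclass

# Maps the env switch value to the instance it targets.
TARGETS = {"prod": "erp", "sandbox": "erp-sandbox"}


def resolve_target(environ=None) -> str:
    """Resolve the target instance from DOLIBARR_TARGET, defaulting to prod."""
    environ = os.environ if environ is None else environ
    target = environ.get("DOLIBARR_TARGET", "prod")
    if target not in TARGETS:
        raise ValueError(f"unknown DOLIBARR_TARGET: {target!r}")
    return TARGETS[target]


@dataclass
class Changeset:
    """Hypothetical structured record of writes the agent made on the sandbox."""
    operations: list
    reviewed: bool = False  # set by a human reviewer, never by the agent


def promote(changeset: Changeset, human_confirmed: bool) -> list:
    """Return the operations to re-apply on prod, or refuse.

    Both gates must hold: the changeset was reviewed, and a human
    explicitly confirmed the re-apply.
    """
    if not changeset.reviewed:
        raise PermissionError("changeset has not been reviewed")
    if not human_confirmed:
        raise PermissionError("prod re-apply requires explicit human confirmation")
    return list(changeset.operations)
```

Keeping the review flag and the confirmation as two separate checks mirrors the rule that the agent never auto-applies to prod, even when a changeset has already been reviewed.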
Alternatives considered
| Option | Why not |
|---|---|
| Treat erp-sandbox as a wholly separate <app> (own repo, own chart copy) | Forks the chart, the repo, and the runbook chain; the two copies drift over time; defeats the "same app, same version" fidelity the rehearsal depends on. |
| Use the ADR-0001 local-only sandbox (k3d / VMs) for the AI-agent writes | That environment carries no production data; the write-rehearsal needs prod-like data and the real Dolibarr API surface to be meaningful. Complementary to ADR-0001, not a substitute for it. |
| Per-env OpenTofu state (<app>/<env> prefixes) instead of one shared state per repo | Buys more env-to-env blast-radius isolation, but at the cost of more CI plumbing and cross-env output wiring than current scale warrants; one shared state with for_each keeps runbooks simple. A real decision point: the chosen path is single shared state per repo, with the prefix scheme left as a future door. |
| No elision: always suffix, even prod (<app>-prod) | Breaks every existing derived name, forcing a fleet-wide rename plus tofu resource moves; rejected in favour of the elision rule's zero-migration property. |
QA & validation
- Backwards-compat no-op gate: after the module change, tofu plan against every existing app (webapp, erp, crowdsec, plausible, dance-lessons-coach, cms) reports zero changes. The elision rule guarantees local.instance == local.name for env == prod, so no prod resource moves.
- Byte-identical chart render: helm template erp chart/ before versus after the literal-templating refactor diffs to nothing (verified: 10857 bytes on both sides, diff exit 0).
- tofu fmt -check + tofu validate are clean on the module changes.
- Sandbox activation gate: when erp-sandbox is stood up, the new-web-app convention chain must resolve end to end for the -sandbox instance (db + role → Vault creds + policy → namespace + SA → ArgoCD Healthy/Synced → VSO injects → pod Running), exactly as the prod instance does.
- Promotion gate: no AI-authored write reaches the prod ERP until it has been applied to erp-sandbox, produced a reviewed changeset, and been explicitly re-applied with human confirmation.
References
- ADR-0001 · Safe, production-like environment — the local-only safe environment for infra rehearsal that this ADR complements (it stands; this does not supersede it).
- PRD · Safe, production-like environment — the product view this work relates to, and its isolation-boundary leaf detailing the cluster/Vault/state/DNS boundary.
- new-web-app conventions: the single-env <app> convention this ADR extends with the env coordinate.
- Phase A, tools Vault module env parameter: adds the optional env/envs parameter to the shared app_roles and app_policy modules.
- Phase C, erp chart literal templating: templates the chart's single-env literals to .Values so one chart renders any instance.
- PR factory#15, this ADR: the change that introduces ADR-0002 (links back to this file).