Records the decision to extend the <app> join key with a second coordinate, <env>, governed by an elision rule: env=prod elides, so every existing app's derived names are byte-identical and its tofu plan is a no-op; non-prod envs take the <app>-<env> suffix, with the Postgres owner role staying snake-case as <app>_<env>_role. Motivated by the ERP's incoming write-capable AI-agent skill: it needs an in-cluster sandbox instance (erp-sandbox) with a prod-like Dolibarr API and an isolated database to rehearse writes before a human promotes them to prod.

The ADR reconciles this against ADR-0001 honestly. ADR-0001 rejected an in-cluster sandbox for INFRA-change rehearsal (shared fleet-wide control planes); ADR-0002 operates one layer up, where the agent's only reach is the app's HTTP API against an isolated DB, so the fleet blast radius is not in scope. The two are complementary; ADR-0002 does not supersede ADR-0001.

Also:
- vibe/ADR/README.md: index row for 0002 + Last Updated 2026-06-25
- PRD safe-prod-like-environment README: bidirectional back-link to ADR-0002 on the Adjacent line + Last Updated 2026-06-25

Authored via the ADR Scribe persona, validated via the Continuity Warden checklist (no-tombstone, breadcrumb, MADR-lite sections, dead-link scan, bidirectional links).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ADR-0002: Per-application environments via an env coordinate
Status: Accepted · Date: 2026-06-25 · Deciders: @arcodange
Context
The <app> join key threads one kebab-case identifier identically through every system that makes up an application: the Gitea repo, the Postgres database + <app>_role, Vault (postgres/creds/<app>, the k8s auth role <app>, the policies <app> / <app>-ops, the CI JWT role gitea_cicd_<app>), the k8s namespace + ServiceAccount, the ArgoCD Application, the GCS state prefix <app>/main, and DNS (<app>.arcodange.lab). Bricks wire together by name convention, not explicit config.
That convention conflates two ideas it never separated: an application and a deployed instance of it. There is exactly one of everything per app — one namespace, one database, one Vault creds path, one DNS host. The model cannot express "the same app, a second time, somewhere else."
The motivating need makes the gap concrete. The Arcodange Dolibarr ERP is growing a write-capable AI-agent skill: auto-creating supplier invoices from ingested emails, fixing thirdparty data, and similar mutations. Before such writes touch the production accounting database, the operator needs a place where the agent can run write operations autonomously and a human can review the result; only then is the same operation promoted to prod. That requires a second deployed instance of the same application: the same Dolibarr chart, the same version, the same conventions, differing only in where it runs and which data it touches.
| Force | Pressure it creates |
|---|---|
| One identifier per app, no env coordinate | "Same app, different environment" is inexpressible without inventing a whole second app. |
| Write-capable AI agent landing on the prod ERP | A wrong autonomous write corrupts live accounting data with no rehearsal surface. |
| Fidelity requirement for the rehearsal surface | The sandbox must run the real Dolibarr API against prod-like data, or the rehearsal predicts nothing. |
| ADR-0001 rejected an in-cluster sandbox | Its Alternative 3 ("sandbox namespace on the real cluster") was rejected for shared blast radius — so any in-cluster sibling instance must be reconciled against that, not pretended away. |
Treating the sandbox as a wholly separate app would fork the chart, the repo, the runbook chain, and the Vault wiring — four things that then drift apart over time, defeating the "same app, same version" fidelity the rehearsal depends on.
Decision
We will extend the <app> convention with a second coordinate, <env>, governed by an elision rule so that adding the coordinate changes nothing for any existing app.
- env defaults to prod, and prod elides. When env == prod, no suffix is added: every derived name is character-for-character identical to today's single-env output. The instance name equals the app name (local.instance == local.name), so every existing app's tofu plan is a no-op.
- Non-prod envs take the <app>-<env> suffix in kebab-case everywhere (namespace, Vault paths / roles / policies, ArgoCD Application, DNS host, GCS-state sub-prefix), with one exception: the Postgres owner role stays snake-case as <app>_<env>_role, matching the existing _role suffix convention.
- One repo and one chart serve every env of an app. Per-env differences are overlaid via values-<env>.yaml; the chart's instance-specific values are .Values-driven, not hardcoded literals, so the same chart renders any instance.
- One CI JWT role (gitea_cicd_<app>) per repo covers all its envs. Its ops policy is widened to the <app>-* path family. Each running instance keeps its own runtime Vault policy.
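A minimal sketch of how a values-<env>.yaml overlay composes with the chart's defaults, written in Python for illustration rather than as Helm itself. It assumes Helm's documented layering for values files passed with -f/--values: later files win, nested maps merge key by key, and scalars and lists are replaced outright. The keys (instance, ingress.host, vault.credsPath, replicas) are hypothetical stand-ins, not the erp chart's actual schema.

```python
from copy import deepcopy
from typing import Any


def overlay(base: dict[str, Any], override: dict[str, Any]) -> dict[str, Any]:
    """Layer `override` on top of `base` the way Helm layers values files:
    nested maps merge key by key; scalars and lists in `override` replace
    the value from `base` outright. Neither input is mutated."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = overlay(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical chart defaults (values.yaml): the prod instance, env elided.
values = {
    "instance": "erp",
    "ingress": {"host": "erp.arcodange.lab", "tls": True},
    "vault": {"credsPath": "postgres/creds/erp"},
    "replicas": 1,
}

# Hypothetical values-sandbox.yaml: only the instance-specific keys change.
values_sandbox = {
    "instance": "erp-sandbox",
    "ingress": {"host": "erp-sandbox.arcodange.lab"},
    "vault": {"credsPath": "postgres/creds/erp-sandbox"},
}

rendered = overlay(values, values_sandbox)
print(rendered["ingress"])  # {'host': 'erp-sandbox.arcodange.lab', 'tls': True}
```

Only the instance-specific keys appear in the overlay; everything else (here tls and replicas) is inherited, which is what keeps prod and sandbox on one chart.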
Worked example: erp and erp-sandbox
| Coordinate | erp (env = prod, elided) | erp-sandbox (env = sandbox) |
|---|---|---|
| Postgres database | erp | erp-sandbox |
| Postgres owner role | erp_role | erp_sandbox_role |
| k8s namespace + ServiceAccount | erp | erp-sandbox |
| Vault dynamic DB creds | postgres/creds/erp | postgres/creds/erp-sandbox |
| Vault KV config | kvv2/erp/config | kvv2/erp-sandbox/config |
| ArgoCD Application | erp | erp-sandbox |
| Internal DNS | erp.arcodange.lab | erp-sandbox.arcodange.lab |
| Gitea repo | arcodange-org/erp | arcodange-org/erp (shared) |
| Helm chart | one chart | one chart (shared) |
| CI JWT role | gitea_cicd_erp | gitea_cicd_erp (shared) |
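The elision rule can be sketched as a pure function of (app, env) that reproduces the table above. This is an illustration in Python, not the OpenTofu modules that actually derive the names (local.instance / local.name); the function and key names are hypothetical, and the owner-role template substitutes <app> verbatim, exactly as <app>_role reads.

```python
PROD = "prod"


def instance_name(app: str, env: str = PROD) -> str:
    """Elision rule: prod adds no suffix; any other env takes a kebab-case -<env>."""
    return app if env == PROD else f"{app}-{env}"


def owner_role(app: str, env: str = PROD) -> str:
    """The one snake-case exception: <app>_role for prod, <app>_<env>_role otherwise.
    Assumption: <app> is substituted verbatim, as the <app>_role template reads."""
    return f"{app}_role" if env == PROD else f"{app}_{env}_role"


def coordinates(app: str, env: str = PROD) -> dict[str, str]:
    """Every per-instance name from the worked example, plus the per-repo
    coordinates that all envs of the app share (hypothetical key names)."""
    inst = instance_name(app, env)
    return {
        "postgres_database": inst,
        "postgres_owner_role": owner_role(app, env),
        "k8s_namespace": inst,
        "service_account": inst,
        "vault_db_creds": f"postgres/creds/{inst}",
        "vault_kv_config": f"kvv2/{inst}/config",
        "argocd_application": inst,
        "internal_dns": f"{inst}.arcodange.lab",
        # Shared across envs: keyed by the app alone, never by the instance.
        "gitea_repo": f"arcodange-org/{app}",
        "ci_jwt_role": f"gitea_cicd_{app}",
    }


for env in (PROD, "sandbox"):
    # prints postgres/creds/erp, then postgres/creds/erp-sandbox
    print(coordinates("erp", env)["vault_db_creds"])
```

Because coordinates("erp") and coordinates("erp", "prod") return today's single-env names unchanged, adding the env coordinate cannot rename any prod resource, which is the property the no-op plan gate checks.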
Why this is not what ADR-0001 rejected
ADR-0001 chose a local-only safe environment (k3d / arm64 VMs) and rejected its Alternative 3, an in-cluster "sandbox namespace on the real cluster," for shared blast radius. ADR-0002 introduces an in-cluster sibling instance (erp-sandbox), which looks like the very thing that was rejected. The two stand together because they operate at different layers.
ADR-0001's rejection is scoped to rehearsing infrastructure / platform change-classes — Ansible playbooks, Vault policy / auth / mount changes, Postgres superuser migrations, ArgoCD prune / selfHeal, Longhorn ops, DNS / email. Those couplings share fleet-wide control planes, so an in-cluster sandbox cannot isolate them; only a separate cluster + Vault + state + DNS zone can. That is exactly why ADR-0001 is local-only.
ADR-0002 operates one layer up. The AI agent's only reach is the Dolibarr HTTP API, holding a write-scoped, app-specific API key against an isolated database — erp-sandbox on its own erp_sandbox_role, its own namespace, its own Vault creds path. The agent never touches kubectl, the Vault root, the Postgres superuser, ArgoCD, Longhorn, or DNS. The fleet-level blast radius that doomed Alternative 3 for infra rehearsal is simply not in the agent's reach; the blast radius of a wrong AI write is bounded to the sandbox app's own data.
The two ADRs are therefore complementary, not contradictory, and ADR-0002 does not supersede ADR-0001. ADR-0001 isolates the operator from breaking the fleet. ADR-0002 isolates the AI agent from corrupting one app's production data, while preserving the prod-like API surface and real-data fidelity that the local k3d sandbox — which carries no prod data — cannot offer.
Consequences
- + Every existing app (webapp, erp, crowdsec, plausible, dance-lessons-coach, cms) is unaffected: the elision rule makes the prod instance's derived names byte-identical, so adoption ships with zero migration and a no-op plan.
- + A second instance of an app is now a values-<env>.yaml overlay plus an envs entry, not a forked repo, chart, and runbook chain, so prod and sandbox share one source of truth and stay on the same version by construction.
- + The AI-agent write skill gets a prod-like rehearsal surface with real-shaped data: the same Dolibarr API and chart, an isolated database, a bounded blast radius.
- + The convention chain (db + role → Vault creds + policy → namespace + SA → ArgoCD → DNS) is reused verbatim for the -sandbox instance, so runbooks read identically for any env.
- − Names are no longer a flat app list: every consumer must reason about the instance == app (prod) versus app-env (non-prod) distinction, and the snake-case owner-role exception (<app>_<env>_role) is a special case that must be carried in the modules.
- − A single shared Vault CI policy widened to <app>-* means the CI role for a repo can write the ops paths of all that repo's envs: a deliberately looser ops scope than one policy per instance.
- − A single shared OpenTofu state per repo holds every env's resources together, so the envs of one app share a blast radius at the state layer (mitigated by for_each, accepted at current scale; see Alternatives).
- → The AI-agent promotion workflow this unlocks: the agent runs writes against erp-sandbox autonomously and emits a structured changeset, a human reviews it, and the same operation is re-applied to prod only with explicit confirmation, never auto-applied by the agent. The read/write skills resolve their target by an env switch (e.g. DOLIBARR_TARGET=prod|sandbox, defaulting to prod).
- → Rollout is additive and phased, each phase gated by a no-op tofu plan against existing apps: (A) the tools repo adds an optional env/envs parameter to the shared app_roles and app_policy Vault modules; (B) the factory repo gains the envs schema in postgres/iac tfvars, renders one ArgoCD Application per env, and documents the elision rule in conventions.md; (C) the erp chart literals are templated to .Values; (D) erp + factory activate erp-sandbox; (E) DNS + ArgoCD registration.
- → Per-env state separation (<app>/<env> prefixes) is a door left open: if env-to-env blast-radius isolation at the state layer becomes warranted, the prefix scheme can be revisited without changing the naming model.
Alternatives considered
| Option | Why not |
|---|---|
| Treat erp-sandbox as a wholly separate <app> (own repo, own chart copy) | Forks the chart, the repo, and the runbook chain; the two copies drift over time; defeats the "same app, same version" fidelity the rehearsal depends on. |
| Use the ADR-0001 local-only sandbox (k3d / VMs) for the AI-agent writes | That environment carries no production data — the write-rehearsal needs prod-like data and the real Dolibarr API surface to be meaningful. Complementary to ADR-0001, not a substitute for it. |
| Per-env OpenTofu state (<app>/<env> prefixes) instead of one shared state per repo | Buys more env-to-env blast-radius isolation, but at the cost of more CI plumbing and cross-env output wiring than current scale warrants; one shared state with for_each keeps runbooks simple. A real decision point: the chosen path is single shared state per repo, with the prefix scheme left as a future door. |
| No elision: always suffix, even prod (<app>-prod) | Breaks every existing derived name, forcing a fleet-wide rename plus tofu resource moves; rejected in favour of the elision rule's zero-migration property. |
QA & validation
- Backwards-compat no-op gate: after the module change, tofu plan against every existing app (webapp, erp, crowdsec, plausible, dance-lessons-coach, cms) reports zero changes. The elision rule guarantees local.instance == local.name for env == prod, so no prod resource moves.
- Byte-identical chart render: helm template erp chart/ before versus after the literal-templating refactor diffs to nothing (verified: 10857 bytes on both sides, diff exit 0).
- tofu fmt -check + tofu validate are clean on the module changes.
- Sandbox activation gate: when erp-sandbox is stood up, the new-web-app convention chain must resolve end to end for the -sandbox instance (db + role → Vault creds + policy → namespace + SA → ArgoCD Healthy/Synced → VSO injects → pod Running), exactly as the prod instance does.
- Promotion gate: no AI-authored write reaches the prod ERP until it has been applied to erp-sandbox, produced a reviewed changeset, and been explicitly re-applied with human confirmation.
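The backwards-compat no-op gate can be checked mechanically. The quickest form is tofu plan -detailed-exitcode, which exits 0 when there are no changes and 2 when there are; parsing the JSON plan additionally names the resources that would move. The sketch below assumes the machine-readable plan format emitted by tofu show -json on a saved plan (a resource_changes list whose entries carry change.actions); the resource address in the sample is hypothetical.

```python
import json
from typing import Any


def changed_resources(plan: dict[str, Any]) -> list[str]:
    """Addresses of resources whose planned actions are anything other than
    ["no-op"] in an OpenTofu JSON plan (create, update, delete, replace, read)."""
    return [
        rc["address"]
        for rc in plan.get("resource_changes", [])
        if rc["change"]["actions"] != ["no-op"]
    ]


def assert_noop(plan_json: str) -> None:
    """Fail the gate, listing offenders, if the plan would touch any resource.
    Produce the input with: tofu plan -out=plan.out && tofu show -json plan.out"""
    changed = changed_resources(json.loads(plan_json))
    if changed:
        raise SystemExit(f"no-op gate failed; planned changes: {', '.join(changed)}")


# Trimmed sample of the JSON plan shape; the address is hypothetical.
sample = json.dumps({
    "resource_changes": [
        {"address": 'module.app_policy.vault_policy.this["erp"]',
         "change": {"actions": ["no-op"]}},
    ]
})
assert_noop(sample)  # passes silently: nothing moves
```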
References
- ADR-0001 · Safe, production-like environment: the local-only safe environment for infra rehearsal that this ADR complements (it stands; this does not supersede it).
- PRD · Safe, production-like environment: the product view this work relates to, and its isolation-boundary leaf detailing the cluster/Vault/state/DNS boundary.
- new-web-app conventions: the single-env <app> convention this ADR extends with the env coordinate.
- Phase A · tools Vault module env parameter: adds the optional env/envs parameter to the shared app_roles and app_policy modules.
- Phase C · erp chart literal templating: templates the chart's single-env literals to .Values so one chart renders any instance.
- PR that introduces this ADR: to be linked once opened (the PR description must link back to this ADR, keeping the links bidirectional).