[vibe](../README.md) > [ADR](README.md) > **0002 · Per-application environments** # ADR-0002: Per-application environments via an env coordinate > **Status**: Accepted > **Date**: 2026-06-25 > **Deciders**: @arcodange ## Context The [`` join key](../../doc/runbooks/new-web-app/conventions.md) threads one kebab-case identifier identically through every system that makes up an application: the Gitea repo, the Postgres database + `_role`, Vault (`postgres/creds/`, the k8s auth role ``, the policies `` / `-ops`, the CI JWT role `gitea_cicd_`), the k8s namespace + ServiceAccount, the ArgoCD Application, the GCS state prefix `/main`, and DNS (`.arcodange.lab`). Bricks wire together by name convention, not explicit config. That convention conflates two ideas it never separated: an **application** and a **deployed instance** of it. There is exactly one of everything per app — one namespace, one database, one Vault creds path, one DNS host. The model cannot express "the same app, a second time, somewhere else." The motivating need makes the gap concrete. The Arcodange Dolibarr ERP is growing a write-capable AI-agent skill — auto-creating supplier invoices from ingested emails, fixing thirdparty data, and similar mutations. Before such writes touch the production accounting database, the operator needs a place where the agent can run write operations autonomously, a human reviews the result, and only then the same operation is promoted to prod. That requires a **second deployed instance of the same application**: the same Dolibarr chart, the same version, the same conventions — differing only in *where* it runs and *which data* it touches. | Force | Pressure it creates | | --- | --- | | One identifier per app, no env coordinate | "Same app, different environment" is inexpressible without inventing a whole second app. | | Write-capable AI agent landing on the prod ERP | A wrong autonomous write corrupts live accounting data with no rehearsal surface. | | Fidelity requirement for the rehearsal surface | The sandbox must run the *real* Dolibarr API against *prod-like* data, or the rehearsal predicts nothing. | | [ADR-0001](0001-safe-prod-like-environment.md) rejected an in-cluster sandbox | Its Alternative 3 ("sandbox namespace on the real cluster") was rejected for shared blast radius — so any in-cluster sibling instance must be reconciled against that, not pretended away. | Treating the sandbox as a wholly separate app would fork the chart, the repo, the runbook chain, and the Vault wiring — four things that then drift apart over time, defeating the "same app, same version" fidelity the rehearsal depends on. ## Decision We will extend the `` convention with a second coordinate, ``, governed by an **elision rule** so that adding the coordinate changes nothing for any existing app. - **`env` defaults to `prod`, and `prod` elides.** When `env == prod`, no suffix is added: every derived name is character-for-character identical to today's single-env output. The instance name equals the app name (`local.instance == local.name`), so every existing app's `tofu plan` is a no-op. - **Non-prod envs take the `-` suffix** in kebab-case everywhere — namespace, Vault paths / roles / policies, ArgoCD Application, DNS host, GCS-state sub-prefix — with one exception: the Postgres owner role stays snake-case as `__role`, matching the existing `_role` suffix convention. - **One repo and one chart serve every env of an app.** Per-env differences are overlaid via `values-.yaml`; the chart's instance-specific values are `.Values`-driven, not hardcoded literals, so the same chart renders any instance. - **One CI JWT role (`gitea_cicd_`) per repo covers all its envs.** Its ops policy is widened to the `-*` path family. Each running instance keeps its own runtime Vault policy. ### Worked example: `erp` and `erp-sandbox` | Coordinate | `erp` (env = prod, elided) | `erp-sandbox` (env = sandbox) | | --- | --- | --- | | Postgres database | `erp` | `erp-sandbox` | | Postgres owner role | `erp_role` | `erp_sandbox_role` | | k8s namespace + ServiceAccount | `erp` | `erp-sandbox` | | Vault dynamic DB creds | `postgres/creds/erp` | `postgres/creds/erp-sandbox` | | Vault KV config | `kvv2/erp/config` | `kvv2/erp-sandbox/config` | | ArgoCD Application | `erp` | `erp-sandbox` | | Internal DNS | `erp.arcodange.lab` | `erp-sandbox.arcodange.lab` | | Gitea repo | `arcodange-org/erp` | `arcodange-org/erp` (shared) | | Helm chart | one chart | one chart (shared) | | CI JWT role | `gitea_cicd_erp` | `gitea_cicd_erp` (shared) | ### Why this is not what ADR-0001 rejected [ADR-0001](0001-safe-prod-like-environment.md) chose a **local-only** safe environment (k3d / arm64 VMs) and rejected its Alternative 3, an in-cluster "sandbox namespace on the real cluster," for shared blast radius. ADR-0002 introduces an in-cluster sibling instance (`erp-sandbox`), which looks like the very thing that was rejected. The two stand together because they operate at **different layers**. ADR-0001's rejection is scoped to rehearsing **infrastructure / platform** change-classes — Ansible playbooks, Vault policy / auth / mount changes, Postgres superuser migrations, ArgoCD prune / selfHeal, Longhorn ops, DNS / email. Those couplings share fleet-wide control planes, so an in-cluster sandbox cannot isolate them; only a separate cluster + Vault + state + DNS zone can. That is exactly why ADR-0001 is local-only. ADR-0002 operates one layer up. The AI agent's only reach is the **Dolibarr HTTP API**, holding a write-scoped, app-specific API key against an isolated database — `erp-sandbox` on its own `erp_sandbox_role`, its own namespace, its own Vault creds path. The agent never touches kubectl, the Vault root, the Postgres superuser, ArgoCD, Longhorn, or DNS. The fleet-level blast radius that doomed Alternative 3 for infra rehearsal is simply **not in the agent's reach**; the blast radius of a wrong AI write is bounded to the sandbox app's own data. The two ADRs are therefore complementary, not contradictory, and ADR-0002 does not supersede ADR-0001. ADR-0001 isolates the *operator* from breaking the *fleet*. ADR-0002 isolates the *AI agent* from corrupting *one app's production data*, while preserving the prod-like API surface and real-data fidelity that the local k3d sandbox — which carries no prod data — cannot offer. ## Consequences - **+** Every existing app (webapp, erp, crowdsec, plausible, dance-lessons-coach, cms) is unaffected: the elision rule makes the prod instance's derived names byte-identical, so adoption ships with zero migration and a no-op plan. - **+** A second instance of an app is now a `values-.yaml` overlay plus an `envs` entry — not a forked repo, chart, and runbook chain — so prod and sandbox share one source of truth and stay on the same version by construction. - **+** The AI-agent write skill gets a prod-like rehearsal surface with real-shaped data: the *same* Dolibarr API and chart, an *isolated* database, a bounded blast radius. - **+** The convention chain (db + role → Vault creds + policy → namespace + SA → ArgoCD → DNS) is reused verbatim for the `-sandbox` instance, so runbooks read identically for any env. - **−** Names are no longer a flat app list: every consumer must reason about the `instance == app` (prod) versus `app-env` (non-prod) distinction, and the snake-case owner-role exception (`__role`) is a special case that must be carried in the modules. - **−** A single shared Vault CI policy widened to `-*` means the CI role for a repo can write the ops paths of *all* that repo's envs — a deliberately looser ops scope than one-policy-per-instance. - **−** A single shared OpenTofu state per repo holds every env's resources together, so the envs of one app share a blast radius at the state layer (mitigated by `for_each`, accepted at current scale — see Alternatives). - **→** The AI-agent promotion workflow this unlocks: the agent runs writes against `erp-sandbox` autonomously, emits a structured changeset, a human reviews it, and the **same** operation is re-applied to prod only with explicit confirmation — never auto-applied by the agent. The read/write skills resolve their target by an env switch (e.g. `DOLIBARR_TARGET=prod|sandbox`, defaulting to `prod`). - **→** Rollout is additive and phased, each phase gated by a no-op `tofu plan` against existing apps: **(A)** the `tools` repo adds an optional `env` / `envs` parameter to the shared `app_roles` and `app_policy` Vault modules; **(B)** the `factory` repo gains the `envs` schema in `postgres/iac` tfvars, renders one ArgoCD Application per env, and documents the elision rule in `conventions.md`; **(C)** the `erp` chart literals are templated to `.Values`; **(D)** `erp` + `factory` activate `erp-sandbox`; **(E)** DNS + ArgoCD registration. - **→** Per-env state separation (`/` prefixes) is a door left open: if env-to-env blast-radius isolation at the state layer becomes warranted, the prefix scheme can be revisited without changing the naming model. ## Alternatives considered | Option | Why not | | --- | --- | | Treat `erp-sandbox` as a wholly separate `` (own repo, own chart copy) | Forks the chart, the repo, and the runbook chain; the two copies drift over time; defeats the "same app, same version" fidelity the rehearsal depends on. | | Use the [ADR-0001](0001-safe-prod-like-environment.md) local-only sandbox (k3d / VMs) for the AI-agent writes | That environment carries **no production data** — the write-rehearsal needs prod-like data and the real Dolibarr API surface to be meaningful. Complementary to ADR-0001, not a substitute for it. | | Per-env OpenTofu state (`/` prefixes) instead of one shared state per repo | Buys more env-to-env blast-radius isolation, but at the cost of more CI plumbing and cross-env output wiring than current scale warrants; one shared state with `for_each` keeps runbooks simple. A real decision point — the chosen path is single shared state per repo, with the prefix scheme left as a future door. | | No elision — always suffix, even prod (`-prod`) | Breaks every existing derived name, forcing a fleet-wide rename plus `tofu` resource moves; rejected in favour of the elision rule's zero-migration property. | ## QA & validation - **Backwards-compat no-op gate** — after the module change, `tofu plan` against every existing app (webapp, erp, crowdsec, plausible, dance-lessons-coach, cms) reports zero changes. The elision rule guarantees `local.instance == local.name` for `env == prod`, so no prod resource moves. - **Byte-identical chart render** — `helm template erp chart/` before versus after the literal-templating refactor diffs to nothing (verified: 10857 bytes on both sides, `diff` exit 0). - **`tofu fmt -check` + `tofu validate`** are clean on the module changes. - **Sandbox activation gate** — when `erp-sandbox` is stood up, the [new-web-app convention chain](../../doc/runbooks/new-web-app/conventions.md) must resolve end to end for the `-sandbox` instance (db + role → Vault creds + policy → namespace + SA → ArgoCD Healthy/Synced → VSO injects → pod Running), exactly as the prod instance does. - **Promotion gate** — no AI-authored write reaches the prod ERP until it has been applied to `erp-sandbox`, produced a reviewed changeset, and been explicitly re-applied with human confirmation. ## References - [ADR-0001 · Safe, production-like environment](0001-safe-prod-like-environment.md) — the local-only safe environment for infra rehearsal that this ADR complements (it stands; this does not supersede it). - [PRD · Safe, production-like environment](../PRD/safe-prod-like-environment/README.md) — the product view this work relates to, and its [isolation-boundary leaf](../PRD/safe-prod-like-environment/isolation-boundary.md) detailing the cluster/Vault/state/DNS boundary. - [new-web-app conventions](../../doc/runbooks/new-web-app/conventions.md) — the single-env `` convention this ADR extends with the env coordinate. - [Phase A — `tools` Vault module env parameter](https://gitea.arcodange.lab/arcodange-org/tools/pulls/2) — adds the optional `env` / `envs` parameter to the shared `app_roles` and `app_policy` modules. - [Phase C — `erp` chart literal templating](https://gitea.arcodange.lab/arcodange-org/erp/pulls/11) — templates the chart's single-env literals to `.Values` so one chart renders any instance. - [PR factory#15 — this ADR](https://gitea.arcodange.lab/arcodange-org/factory/pulls/15) — the change that introduces ADR-0002 (links back to this file).