Replaces the placeholder References line with the PR URL so the
ADR↔PR crosslink is bidirectional per the AGENTS.md rule.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Records the decision to extend the <app> join key with a second
coordinate <env>, governed by an elision rule (env=prod elides → every
existing app's derived names are byte-identical and its tofu plan is a
no-op; non-prod envs take the <app>-<env> suffix, with the Postgres
owner role staying snake-case <app>_<env>_role).
Motivated by the ERP's incoming write-capable AI-agent skill: it needs
an in-cluster sandbox instance (erp-sandbox) with a prod-like Dolibarr
API + isolated database to rehearse writes before a human promotes them
to prod. The ADR reconciles this against ADR-0001 honestly — ADR-0001
rejected an in-cluster sandbox for INFRA-change rehearsal (shared
fleet-wide control planes); ADR-0002 operates one layer up where the
agent's only reach is the app's HTTP API against an isolated DB, so the
fleet blast radius is not in scope. The two are complementary; ADR-0002
does not supersede ADR-0001.
Also:
- vibe/ADR/README.md: index row for 0002 + Last Updated 2026-06-25
- PRD safe-prod-like-environment README: bidirectional back-link to
ADR-0002 on the Adjacent line + Last Updated 2026-06-25
Authored via the ADR Scribe persona, validated via the Continuity Warden
checklist (no-tombstone, breadcrumb, MADR-lite sections, dead-link scan,
bidirectional links).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-25 14:55:19 +02:00
3 changed files with 101 additions and 3 deletions
# ADR-0002: Per-application environments via an env coordinate
> **Status**: Accepted
> **Date**: 2026-06-25
> **Deciders**: @arcodange
## Context
The [`<app>` join key](../../doc/runbooks/new-web-app/conventions.md) threads one kebab-case identifier identically through every system that makes up an application: the Gitea repo, the Postgres database + `<app>_role`, Vault (`postgres/creds/<app>`, the k8s auth role `<app>`, the policies `<app>` / `<app>-ops`, the CI JWT role `gitea_cicd_<app>`), the k8s namespace + ServiceAccount, the ArgoCD Application, the GCS state prefix `<app>/main`, and DNS (`<app>.arcodange.lab`). Bricks wire together by name convention, not explicit config.
That convention conflates two ideas it never separated: an **application** and a **deployed instance** of it. There is exactly one of everything per app — one namespace, one database, one Vault creds path, one DNS host. The model cannot express "the same app, a second time, somewhere else."
The motivating need makes the gap concrete. The Arcodange Dolibarr ERP is growing a write-capable AI-agent skill — auto-creating supplier invoices from ingested emails, fixing thirdparty data, and similar mutations. Before such writes touch the production accounting database, the operator needs a place where the agent can run write operations autonomously, a human reviews the result, and only then the same operation is promoted to prod. That requires a **second deployed instance of the same application**: the same Dolibarr chart, the same version, the same conventions — differing only in *where* it runs and *which data* it touches.
| Force | Pressure it creates |
| --- | --- |
| One identifier per app, no env coordinate | "Same app, different environment" is inexpressible without inventing a whole second app. |
| Write-capable AI agent landing on the prod ERP | A wrong autonomous write corrupts live accounting data with no rehearsal surface. |
| Fidelity requirement for the rehearsal surface | The sandbox must run the *real* Dolibarr API against *prod-like* data, or the rehearsal predicts nothing. |
| [ADR-0001](0001-safe-prod-like-environment.md) rejected an in-cluster sandbox | Its Alternative 3 ("sandbox namespace on the real cluster") was rejected for shared blast radius — so any in-cluster sibling instance must be reconciled against that, not pretended away. |
Treating the sandbox as a wholly separate app would fork the chart, the repo, the runbook chain, and the Vault wiring — four things that then drift apart over time, defeating the "same app, same version" fidelity the rehearsal depends on.
## Decision
We will extend the `<app>` convention with a second coordinate, `<env>`, governed by an **elision rule** so that adding the coordinate changes nothing for any existing app.
- **`env` defaults to `prod`, and `prod` elides.** When `env == prod`, no suffix is added: every derived name is character-for-character identical to today's single-env output. The instance name equals the app name (`local.instance == local.name`), so every existing app's `tofu plan` is a no-op.
- **Non-prod envs take the `<app>-<env>` suffix** in kebab-case everywhere — namespace, Vault paths / roles / policies, ArgoCD Application, DNS host, GCS-state sub-prefix — with one exception: the Postgres owner role stays snake-case as `<app>_<env>_role`, matching the existing `_role` suffix convention.
- **One repo and one chart serve every env of an app.** Per-env differences are overlaid via `values-<env>.yaml`; the chart's instance-specific values are `.Values`-driven, not hardcoded literals, so the same chart renders any instance.
- **One CI JWT role (`gitea_cicd_<app>`) per repo covers all its envs.** Its ops policy is widened to the `<app>-*` path family. Each running instance keeps its own runtime Vault policy.
| Derived name | `erp` (prod) | `erp-sandbox` |
| --- | --- | --- |
| CI JWT role | `gitea_cicd_erp` | `gitea_cicd_erp` (shared) |
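
The elision rule above can be sketched as a small name-derivation function. This is an illustrative Python sketch, not the actual OpenTofu module code: the function names and dictionary keys are hypothetical, and the non-prod Vault creds path and DNS host are assumed to follow the `<app>-<env>` suffix the bullets describe. The GCS state sub-prefix is left out because its exact non-prod shape is not spelled out here.

```python
PROD = "prod"


def instance_name(app: str, env: str = PROD) -> str:
    """Kebab-case instance name: prod elides, other envs take the -<env> suffix."""
    return app if env == PROD else f"{app}-{env}"


def derived_names(app: str, env: str = PROD) -> dict[str, str]:
    """Derive the per-system names for one (app, env) instance.

    Hypothetical helper; the keys are illustrative labels, not module outputs.
    """
    instance = instance_name(app, env)
    # The one exception to kebab-case: the Postgres owner role stays snake-case.
    postgres_role = f"{app}_role" if env == PROD else f"{app}_{env}_role"
    return {
        "namespace": instance,
        "postgres_role": postgres_role,
        "vault_creds_path": f"postgres/creds/{instance}",  # assumed suffixing
        "vault_k8s_role": instance,
        "vault_policy": instance,
        "argocd_application": instance,
        "dns_host": f"{instance}.arcodange.lab",  # assumed suffixing
        # One CI JWT role per repo, shared by every env of the app.
        "ci_jwt_role": f"gitea_cicd_{app}",
    }


if __name__ == "__main__":
    for env in ("prod", "sandbox"):
        print(env, derived_names("erp", env))
```

Calling `derived_names("erp")` yields today's single-env names unchanged, while `derived_names("erp", "sandbox")` yields the suffixed `erp-sandbox` family with `erp_sandbox_role` as the owner role.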
### Why this is not what ADR-0001 rejected
[ADR-0001](0001-safe-prod-like-environment.md) chose a **local-only** safe environment (k3d / arm64 VMs) and rejected its Alternative 3, an in-cluster "sandbox namespace on the real cluster," for shared blast radius. ADR-0002 introduces an in-cluster sibling instance (`erp-sandbox`), which looks like the very thing that was rejected. The two stand together because they operate at **different layers**.
ADR-0001's rejection is scoped to rehearsing **infrastructure / platform** change-classes — Ansible playbooks, Vault policy / auth / mount changes, Postgres superuser migrations, ArgoCD prune / selfHeal, Longhorn ops, DNS / email. Those couplings share fleet-wide control planes, so an in-cluster sandbox cannot isolate them; only a separate cluster + Vault + state + DNS zone can. That is exactly why ADR-0001 is local-only.
ADR-0002 operates one layer up. The AI agent's only reach is the **Dolibarr HTTP API**, holding a write-scoped, app-specific API key against an isolated database — `erp-sandbox` on its own `erp_sandbox_role`, its own namespace, its own Vault creds path. The agent never touches kubectl, the Vault root, the Postgres superuser, ArgoCD, Longhorn, or DNS. The fleet-level blast radius that doomed Alternative 3 for infra rehearsal is simply **not in the agent's reach**; the blast radius of a wrong AI write is bounded to the sandbox app's own data.
The two ADRs are therefore complementary, not contradictory, and ADR-0002 does not supersede ADR-0001. ADR-0001 isolates the *operator* from breaking the *fleet*. ADR-0002 isolates the *AI agent* from corrupting *one app's production data*, while preserving the prod-like API surface and real-data fidelity that the local k3d sandbox — which carries no prod data — cannot offer.
## Consequences
- **+** Every existing app (webapp, erp, crowdsec, plausible, dance-lessons-coach, cms) is unaffected: the elision rule makes the prod instance's derived names byte-identical, so adoption ships with zero migration and a no-op plan.
- **+** A second instance of an app is now a `values-<env>.yaml` overlay plus an `envs` entry — not a forked repo, chart, and runbook chain — so prod and sandbox share one source of truth and stay on the same version by construction.
- **+** The AI-agent write skill gets a rehearsal surface with a prod-like API and prod-like data: the *same* Dolibarr API and chart, an *isolated* database, and a bounded blast radius.
- **+** The convention chain (db + role → Vault creds + policy → namespace + SA → ArgoCD → DNS) is reused verbatim for the `-sandbox` instance, so runbooks read identically for any env.
- **−** Names are no longer a flat app list: every consumer must reason about the `instance == app` (prod) versus `app-env` (non-prod) distinction, and the snake-case owner-role exception (`<app>_<env>_role`) is a special case that must be carried in the modules.
- **−** A single shared Vault CI policy widened to `<app>-*` means the CI role for a repo can write the ops paths of *all* that repo's envs — a deliberately looser ops scope than one-policy-per-instance.
- **−** A single shared OpenTofu state per repo holds every env's resources together, so the envs of one app share a blast radius at the state layer (mitigated by `for_each`, accepted at current scale — see Alternatives).
- **→** The AI-agent promotion workflow this unlocks: the agent runs writes against `erp-sandbox` autonomously, emits a structured changeset, a human reviews it, and the **same** operation is re-applied to prod only with explicit confirmation — never auto-applied by the agent. The read/write skills resolve their target by an env switch (e.g. `DOLIBARR_TARGET=prod|sandbox`, defaulting to `prod`).
- **→** Rollout is additive and phased, each phase gated by a no-op `tofu plan` against existing apps: **(A)** the `tools` repo adds an optional `env` / `envs` parameter to the shared `app_roles` and `app_policy` Vault modules; **(B)** the `factory` repo gains the `envs` schema in `postgres/iac` tfvars, renders one ArgoCD Application per env, and documents the elision rule in `conventions.md`; **(C)** the `erp` chart literals are templated to `.Values`; **(D)** `erp` + `factory` activate `erp-sandbox`; **(E)** DNS + ArgoCD registration.
- **→** Per-env state separation (`<app>/<env>` prefixes) is a door left open: if env-to-env blast-radius isolation at the state layer becomes warranted, the prefix scheme can be revisited without changing the naming model.
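
The env switch mentioned in the promotion-workflow consequence could look roughly like the following. This is a minimal sketch under stated assumptions: only the `DOLIBARR_TARGET` variable, its `prod|sandbox` values, and its `prod` default come from this ADR; the function names and the mapping from target to DNS host are hypothetical.

```python
import os

ALLOWED_TARGETS = ("prod", "sandbox")


def resolve_target(environ=None) -> str:
    """Read DOLIBARR_TARGET, defaulting to prod and rejecting unknown values."""
    environ = os.environ if environ is None else environ
    target = environ.get("DOLIBARR_TARGET", "prod")
    if target not in ALLOWED_TARGETS:
        raise ValueError(
            f"DOLIBARR_TARGET must be one of {ALLOWED_TARGETS}, got {target!r}"
        )
    return target


def target_host(target: str, app: str = "erp") -> str:
    """Map a target to its DNS host; prod elides the env suffix (assumed mapping)."""
    instance = app if target == "prod" else f"{app}-{target}"
    return f"{instance}.arcodange.lab"


if __name__ == "__main__":
    print(target_host(resolve_target()))
```

Failing loudly on an unknown value, rather than silently falling back to `prod`, is a design choice of this sketch: a typo in the variable should never redirect a write to the production ERP.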
## Alternatives considered
| Option | Why not |
| --- | --- |
| Treat `erp-sandbox` as a wholly separate `<app>` (own repo, own chart copy) | Forks the chart, the repo, and the runbook chain; the two copies drift over time; defeats the "same app, same version" fidelity the rehearsal depends on. |
| Use the [ADR-0001](0001-safe-prod-like-environment.md) local-only sandbox (k3d / VMs) for the AI-agent writes | That environment carries **no production data** — the write-rehearsal needs prod-like data and the real Dolibarr API surface to be meaningful. Complementary to ADR-0001, not a substitute for it. |
| Per-env OpenTofu state (`<app>/<env>` prefixes) instead of one shared state per repo | Buys more env-to-env blast-radius isolation, but at the cost of more CI plumbing and cross-env output wiring than current scale warrants; one shared state with `for_each` keeps runbooks simple. A real decision point — the chosen path is single shared state per repo, with the prefix scheme left as a future door. |
| No elision — always suffix, even prod (`<app>-prod`) | Breaks every existing derived name, forcing a fleet-wide rename plus `tofu` resource moves; rejected in favour of the elision rule's zero-migration property. |
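
The difference between the elision rule and the rejected always-suffix alternative can be illustrated by counting how many existing apps would be renamed under each. The sketch below is illustrative Python, not project code; the app list is the one named in this ADR, and the function names are hypothetical.

```python
EXISTING_APPS = [
    "webapp", "erp", "crowdsec", "plausible", "dance-lessons-coach", "cms",
]


def with_elision(app: str, env: str = "prod") -> str:
    """Chosen rule: prod elides, so the instance name equals the app name."""
    return app if env == "prod" else f"{app}-{env}"


def always_suffix(app: str, env: str = "prod") -> str:
    """Rejected alternative: every env, including prod, is suffixed."""
    return f"{app}-{env}"


def renamed_apps(rule) -> list[str]:
    """Apps whose prod instance name would differ from today's name."""
    return [app for app in EXISTING_APPS if rule(app) != app]


if __name__ == "__main__":
    print("elision renames:", renamed_apps(with_elision))
    print("always-suffix renames:", renamed_apps(always_suffix))
```

Under the elision rule the renamed list is empty, which is the zero-migration property; under always-suffix every existing app appears in it.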
## QA & validation
- **Backwards-compat no-op gate** — after the module change, `tofu plan` against every existing app (webapp, erp, crowdsec, plausible, dance-lessons-coach, cms) reports zero changes. The elision rule guarantees `local.instance == local.name` for `env == prod`, so no prod resource moves.
- **Byte-identical chart render** — `helm template erp chart/` before versus after the literal-templating refactor diffs to nothing (verified: 10857 bytes on both sides, `diff` exit 0).
- **`tofu fmt -check` + `tofu validate`** are clean on the module changes.
- **Sandbox activation gate** — when `erp-sandbox` is stood up, the [new-web-app convention chain](../../doc/runbooks/new-web-app/conventions.md) must resolve end to end for the `-sandbox` instance (db + role → Vault creds + policy → namespace + SA → ArgoCD Healthy/Synced → VSO injects → pod Running), exactly as the prod instance does.
- **Promotion gate** — no AI-authored write reaches the prod ERP until it has been applied to `erp-sandbox`, produced a reviewed changeset, and been explicitly re-applied with human confirmation.
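
The promotion gate can be read as a three-condition check that must hold before any write is re-applied to prod. The following is a hedged Python sketch of that logic only; the `Changeset` type, its field names, and the `apply` callback are hypothetical and do not describe the agent skill's real interface.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Changeset:
    """Hypothetical record of one AI-authored write operation."""
    operation: str
    applied_to_sandbox: bool = False
    reviewed: bool = False
    human_confirmed: bool = False


def may_apply_to_prod(changeset: Changeset) -> bool:
    """All three gate conditions must hold before a prod write is allowed."""
    return (
        changeset.applied_to_sandbox
        and changeset.reviewed
        and changeset.human_confirmed
    )


def apply_to_prod(changeset: Changeset, apply: Callable[[str], str]) -> str:
    """Re-apply the same operation to prod, refusing if the gate is not met."""
    if not may_apply_to_prod(changeset):
        raise PermissionError(
            f"promotion gate not satisfied for {changeset.operation!r}"
        )
    return apply(changeset.operation)


if __name__ == "__main__":
    cs = Changeset("create supplier invoice", applied_to_sandbox=True, reviewed=True)
    print(may_apply_to_prod(cs))  # human confirmation is still missing
```

Keeping the check in one pure function makes it easy to test that no combination short of all three conditions lets a write through.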
## References
- [ADR-0001 · Safe, production-like environment](0001-safe-prod-like-environment.md) — the local-only safe environment for infra rehearsal that this ADR complements (it stands; this does not supersede it).
- [PRD · Safe, production-like environment](../PRD/safe-prod-like-environment/README.md) — the product view this work relates to, and its [isolation-boundary leaf](../PRD/safe-prod-like-environment/isolation-boundary.md) detailing the cluster/Vault/state/DNS boundary.
- [new-web-app conventions](../../doc/runbooks/new-web-app/conventions.md) — the single-env `<app>` convention this ADR extends with the env coordinate.
- [Phase A — `tools` Vault module env parameter](https://gitea.arcodange.lab/arcodange-org/tools/pulls/2) — adds the optional `env` / `envs` parameter to the shared `app_roles` and `app_policy` modules.
- [Phase C — `erp` chart literal templating](https://gitea.arcodange.lab/arcodange-org/erp/pulls/11) — templates the chart's single-env literals to `.Values` so one chart renders any instance.
- [PR factory#15 — this ADR](https://gitea.arcodange.lab/arcodange-org/factory/pulls/15) — the change that introduces ADR-0002 (links back to this file).