factory/vibe/ADR/0002-per-application-environments.md
Gabriel Radureau 3961914613 docs(adr): ADR-0002 — per-application environments via an env coordinate
Records the decision to extend the <app> join key with a second
coordinate <env>, governed by an elision rule (env=prod elides → every
existing app's derived names are byte-identical and its tofu plan is a
no-op; non-prod envs take the <app>-<env> suffix, with the Postgres
owner role staying snake-case <app>_<env>_role).

Motivated by the ERP's incoming write-capable AI-agent skill: it needs
an in-cluster sandbox instance (erp-sandbox) with a prod-like Dolibarr
API + isolated database to rehearse writes before a human promotes them
to prod. The ADR reconciles this against ADR-0001 honestly — ADR-0001
rejected an in-cluster sandbox for INFRA-change rehearsal (shared
fleet-wide control planes); ADR-0002 operates one layer up where the
agent's only reach is the app's HTTP API against an isolated DB, so the
fleet blast radius is not in scope. The two are complementary; ADR-0002
does not supersede ADR-0001.

Also:
- vibe/ADR/README.md: index row for 0002 + Last Updated 2026-06-25
- PRD safe-prod-like-environment README: bidirectional back-link to
  ADR-0002 on the Adjacent line + Last Updated 2026-06-25

Authored via the ADR Scribe persona, validated via the Continuity Warden
checklist (no-tombstone, breadcrumb, MADR-lite sections, dead-link scan,
bidirectional links).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-25 14:55:19 +02:00



ADR-0002: Per-application environments via an env coordinate

Status: Accepted
Date: 2026-06-25
Deciders: @arcodange

Context

The <app> join key threads one kebab-case identifier identically through every system that makes up an application: the Gitea repo, the Postgres database + <app>_role, Vault (postgres/creds/<app>, the k8s auth role <app>, the policies <app> / <app>-ops, the CI JWT role gitea_cicd_<app>), the k8s namespace + ServiceAccount, the ArgoCD Application, the GCS state prefix <app>/main, and DNS (<app>.arcodange.lab). Bricks wire together by name convention, not explicit config.

That convention conflates two ideas it never separated: an application and a deployed instance of it. There is exactly one of everything per app — one namespace, one database, one Vault creds path, one DNS host. The model cannot express "the same app, a second time, somewhere else."

The motivating need makes the gap concrete. The Arcodange Dolibarr ERP is growing a write-capable AI-agent skill — auto-creating supplier invoices from ingested emails, fixing thirdparty data, and similar mutations. Before such writes touch the production accounting database, the operator needs a place where the agent can run write operations autonomously, a human reviews the result, and only then the same operation is promoted to prod. That requires a second deployed instance of the same application: the same Dolibarr chart, the same version, the same conventions — differing only in where it runs and which data it touches.

| Force | Pressure it creates |
|---|---|
| One identifier per app, no env coordinate | "Same app, different environment" is inexpressible without inventing a whole second app. |
| Write-capable AI agent landing on the prod ERP | A wrong autonomous write corrupts live accounting data with no rehearsal surface. |
| Fidelity requirement for the rehearsal surface | The sandbox must run the real Dolibarr API against prod-like data, or the rehearsal predicts nothing. |
| ADR-0001 rejected an in-cluster sandbox | Its Alternative 3 ("sandbox namespace on the real cluster") was rejected for shared blast radius, so any in-cluster sibling instance must be reconciled against that, not pretended away. |

Treating the sandbox as a wholly separate app would fork the chart, the repo, the runbook chain, and the Vault wiring — four things that then drift apart over time, defeating the "same app, same version" fidelity the rehearsal depends on.

Decision

We will extend the <app> convention with a second coordinate, <env>, governed by an elision rule so that adding the coordinate changes nothing for any existing app.

  • env defaults to prod, and prod elides. When env == prod, no suffix is added: every derived name is character-for-character identical to today's single-env output. The instance name equals the app name (local.instance == local.name), so every existing app's tofu plan is a no-op.
  • Non-prod envs take the <app>-<env> suffix in kebab-case everywhere — namespace, Vault paths / roles / policies, ArgoCD Application, DNS host, GCS-state sub-prefix — with one exception: the Postgres owner role stays snake-case as <app>_<env>_role, matching the existing _role suffix convention.
  • One repo and one chart serve every env of an app. Per-env differences are overlaid via values-<env>.yaml; the chart's instance-specific values are .Values-driven, not hardcoded literals, so the same chart renders any instance.
  • One CI JWT role (gitea_cicd_<app>) per repo covers all its envs. Its ops policy is widened to the <app>-* path family. Each running instance keeps its own runtime Vault policy.
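
The elision rule above can be sketched in a few lines. This is a minimal illustration, not the module code: the function names `instance_name` and `owner_role` are hypothetical, and because the ADR does not say how hyphens inside an app or env name are snake-cased for the role, the sketch leaves them untouched.

```python
PROD = "prod"


def instance_name(app: str, env: str = PROD) -> str:
    """Kebab-case instance name: prod elides, any other env adds an -<env> suffix."""
    return app if env == PROD else f"{app}-{env}"


def owner_role(app: str, env: str = PROD) -> str:
    """Postgres owner role: the one snake-case exception to the kebab suffix.

    Assumption: hyphens inside app/env are left as-is; the ADR does not
    specify how they are converted.
    """
    return f"{app}_role" if env == PROD else f"{app}_{env}_role"


if __name__ == "__main__":
    for app, env in [("erp", "prod"), ("erp", "sandbox")]:
        print(f"{app}/{env}: instance={instance_name(app, env)} "
              f"role={owner_role(app, env)}")
```

Because `env` defaults to `prod` and prod elides, a caller that never passes `env` gets exactly the names it gets today, which is what makes the change a no-op for existing apps.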

Worked example: erp and erp-sandbox

| Coordinate | erp (env = prod, elided) | erp-sandbox (env = sandbox) |
|---|---|---|
| Postgres database | erp | erp-sandbox |
| Postgres owner role | erp_role | erp_sandbox_role |
| k8s namespace + ServiceAccount | erp | erp-sandbox |
| Vault dynamic DB creds | postgres/creds/erp | postgres/creds/erp-sandbox |
| Vault KV config | kvv2/erp/config | kvv2/erp-sandbox/config |
| ArgoCD Application | erp | erp-sandbox |
| Internal DNS | erp.arcodange.lab | erp-sandbox.arcodange.lab |
| Gitea repo | arcodange-org/erp | arcodange-org/erp (shared) |
| Helm chart | one chart | one chart (shared) |
| CI JWT role | gitea_cicd_erp | gitea_cicd_erp (shared) |
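
The worked example can be reproduced mechanically from the two inputs `app` and `env`. The sketch below is illustrative only: `derive_names` and its dictionary keys are hypothetical labels, and the `org` and `domain` defaults are taken from the erp rows rather than from any module code.

```python
def derive_names(app: str, env: str = "prod",
                 org: str = "arcodange-org",
                 domain: str = "arcodange.lab") -> dict:
    """Derive every per-instance and shared name for one (app, env) pair."""
    is_prod = env == "prod"
    instance = app if is_prod else f"{app}-{env}"
    return {
        # Per-instance coordinates: keyed on the instance name.
        "postgres_database": instance,
        "postgres_owner_role": f"{app}_role" if is_prod else f"{app}_{env}_role",
        "k8s_namespace": instance,
        "vault_db_creds": f"postgres/creds/{instance}",
        "vault_kv_config": f"kvv2/{instance}/config",
        "argocd_application": instance,
        "dns_host": f"{instance}.{domain}",
        # Shared coordinates: keyed on the app, identical for every env.
        "gitea_repo": f"{org}/{app}",
        "ci_jwt_role": f"gitea_cicd_{app}",
    }


if __name__ == "__main__":
    prod = derive_names("erp")
    sandbox = derive_names("erp", "sandbox")
    for key in prod:
        print(f"{key:22} {prod[key]:28} {sandbox[key]}")
```

Only the per-instance rows differ between the two columns; the repo and CI role depend on `app` alone, which is how one repo and one CI role serve every env.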

Why this is not what ADR-0001 rejected

ADR-0001 chose a local-only safe environment (k3d / arm64 VMs) and rejected its Alternative 3, an in-cluster "sandbox namespace on the real cluster," for shared blast radius. ADR-0002 introduces an in-cluster sibling instance (erp-sandbox), which looks like the very thing that was rejected. The two stand together because they operate at different layers.

ADR-0001's rejection is scoped to rehearsing infrastructure / platform change-classes — Ansible playbooks, Vault policy / auth / mount changes, Postgres superuser migrations, ArgoCD prune / selfHeal, Longhorn ops, DNS / email. Those couplings share fleet-wide control planes, so an in-cluster sandbox cannot isolate them; only a separate cluster + Vault + state + DNS zone can. That is exactly why ADR-0001 is local-only.

ADR-0002 operates one layer up. The AI agent's only reach is the Dolibarr HTTP API, holding a write-scoped, app-specific API key against an isolated database — erp-sandbox on its own erp_sandbox_role, its own namespace, its own Vault creds path. The agent never touches kubectl, the Vault root, the Postgres superuser, ArgoCD, Longhorn, or DNS. The fleet-level blast radius that doomed Alternative 3 for infra rehearsal is simply not in the agent's reach; the blast radius of a wrong AI write is bounded to the sandbox app's own data.

The two ADRs are therefore complementary, not contradictory, and ADR-0002 does not supersede ADR-0001. ADR-0001 isolates the operator from breaking the fleet. ADR-0002 isolates the AI agent from corrupting one app's production data, while preserving the prod-like API surface and real-data fidelity that the local k3d sandbox — which carries no prod data — cannot offer.

Consequences

  • + Every existing app (webapp, erp, crowdsec, plausible, dance-lessons-coach, cms) is unaffected: the elision rule makes the prod instance's derived names byte-identical, so adoption ships with zero migration and a no-op plan.
  • + A second instance of an app is now a values-<env>.yaml overlay plus an envs entry — not a forked repo, chart, and runbook chain — so prod and sandbox share one source of truth and stay on the same version by construction.
  • + The AI-agent write skill gets a prod-like rehearsal surface with real-shaped data: the same Dolibarr API and chart, an isolated database, a bounded blast radius.
  • + The convention chain (db + role → Vault creds + policy → namespace + SA → ArgoCD → DNS) is reused verbatim for the -sandbox instance, so runbooks read identically for any env.
  • Names are no longer a flat app list: every consumer must reason about the instance == app (prod) versus app-env (non-prod) distinction, and the snake-case owner-role exception (<app>_<env>_role) is a special case that must be carried in the modules.
  • A single shared Vault CI policy widened to <app>-* means the CI role for a repo can write the ops paths of all that repo's envs — a deliberately looser ops scope than one-policy-per-instance.
  • A single shared OpenTofu state per repo holds every env's resources together, so the envs of one app share a blast radius at the state layer (mitigated by for_each, accepted at current scale — see Alternatives).
  • The AI-agent promotion workflow this unlocks: the agent runs writes against erp-sandbox autonomously, emits a structured changeset, a human reviews it, and the same operation is re-applied to prod only with explicit confirmation — never auto-applied by the agent. The read/write skills resolve their target by an env switch (e.g. DOLIBARR_TARGET=prod|sandbox, defaulting to prod).
  • Rollout is additive and phased, each phase gated by a no-op tofu plan against existing apps: (A) the tools repo adds an optional env / envs parameter to the shared app_roles and app_policy Vault modules; (B) the factory repo gains the envs schema in postgres/iac tfvars, renders one ArgoCD Application per env, and documents the elision rule in conventions.md; (C) the erp chart literals are templated to .Values; (D) erp + factory activate erp-sandbox; (E) DNS + ArgoCD registration.
  • Per-env state separation (<app>/<env> prefixes) is a door left open: if env-to-env blast-radius isolation at the state layer becomes warranted, the prefix scheme can be revisited without changing the naming model.
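
The env switch mentioned in the promotion-workflow consequence could look roughly like this. It is a sketch under assumptions: only the variable name `DOLIBARR_TARGET`, its two values, and the prod default come from this ADR; `resolve_target` and `target_host` are hypothetical helpers, and rejecting unknown values is a design choice of the sketch, not a stated requirement.

```python
import os

VALID_TARGETS = ("prod", "sandbox")


def resolve_target(environ=None) -> str:
    """Read DOLIBARR_TARGET, defaulting to prod when it is unset."""
    env = os.environ if environ is None else environ
    target = env.get("DOLIBARR_TARGET", "prod")
    if target not in VALID_TARGETS:
        # Fail loudly rather than silently falling back to prod.
        raise ValueError(f"DOLIBARR_TARGET must be one of {VALID_TARGETS}, got {target!r}")
    return target


def target_host(target: str, app: str = "erp", domain: str = "arcodange.lab") -> str:
    """Map the target onto the instance's DNS host using the elision rule."""
    instance = app if target == "prod" else f"{app}-{target}"
    return f"{instance}.{domain}"


if __name__ == "__main__":
    target = resolve_target()
    print(f"target={target} host={target_host(target)}")
```

Passing `environ` explicitly keeps the resolver testable without mutating the process environment.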

Alternatives considered

| Option | Why not |
|---|---|
| Treat erp-sandbox as a wholly separate <app> (own repo, own chart copy) | Forks the chart, the repo, and the runbook chain; the two copies drift over time; defeats the "same app, same version" fidelity the rehearsal depends on. |
| Use the ADR-0001 local-only sandbox (k3d / VMs) for the AI-agent writes | That environment carries no production data; the write rehearsal needs prod-like data and the real Dolibarr API surface to be meaningful. Complementary to ADR-0001, not a substitute for it. |
| Per-env OpenTofu state (<app>/<env> prefixes) instead of one shared state per repo | Buys more env-to-env blast-radius isolation, but at the cost of more CI plumbing and cross-env output wiring than current scale warrants; one shared state with for_each keeps runbooks simple. A real decision point: the chosen path is single shared state per repo, with the prefix scheme left as a future door. |
| No elision: always suffix, even prod (<app>-prod) | Breaks every existing derived name, forcing a fleet-wide rename plus tofu resource moves; rejected in favour of the elision rule's zero-migration property. |

QA & validation

  • Backwards-compat no-op gate — after the module change, tofu plan against every existing app (webapp, erp, crowdsec, plausible, dance-lessons-coach, cms) reports zero changes. The elision rule guarantees local.instance == local.name for env == prod, so no prod resource moves.
  • Byte-identical chart render: helm template erp chart/ before versus after the literal-templating refactor diffs to nothing (verified: 10857 bytes on both sides, diff exit 0).
  • tofu fmt -check + tofu validate are clean on the module changes.
  • Sandbox activation gate — when erp-sandbox is stood up, the new-web-app convention chain must resolve end to end for the -sandbox instance (db + role → Vault creds + policy → namespace + SA → ArgoCD Healthy/Synced → VSO injects → pod Running), exactly as the prod instance does.
  • Promotion gate — no AI-authored write reaches the prod ERP until it has been applied to erp-sandbox, produced a reviewed changeset, and been explicitly re-applied with human confirmation.
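
The promotion gate reduces to a three-part precondition, which the following sketch makes explicit. The `Changeset` record and its field names are hypothetical; the ADR only requires that a write was applied to erp-sandbox, produced a reviewed changeset, and was explicitly confirmed by a human before it is re-applied to prod.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Changeset:
    """Hypothetical record of one AI-authored write and its review state."""
    operation: str
    applied_to_sandbox: bool = False
    reviewed_by: Optional[str] = None  # name of the human reviewer, if any
    human_confirmed: bool = False      # explicit confirmation to re-apply on prod


def may_apply_to_prod(cs: Changeset) -> bool:
    """True only when all three promotion conditions hold."""
    return cs.applied_to_sandbox and bool(cs.reviewed_by) and cs.human_confirmed


if __name__ == "__main__":
    draft = Changeset("create supplier invoice", applied_to_sandbox=True)
    approved = Changeset("create supplier invoice", applied_to_sandbox=True,
                         reviewed_by="operator", human_confirmed=True)
    print(may_apply_to_prod(draft), may_apply_to_prod(approved))
```

Keeping the confirmation as a separate field from the review reflects the rule that a reviewed changeset is still never auto-applied by the agent.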

References