factory/vibe/ADR/0002-per-application-environments.md
Gabriel Radureau c35b510040 docs(adr): fill the ADR-0002 ↔ PR backlink (factory#15)
Replaces the placeholder References line with the PR URL so the
ADR↔PR crosslink is bidirectional per the AGENTS.md rule.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-25 14:56:09 +02:00


ADR-0002: Per-application environments via an env coordinate

Status: Accepted · Date: 2026-06-25 · Deciders: @arcodange

Context

The <app> join key threads one kebab-case identifier identically through every system that makes up an application: the Gitea repo, the Postgres database + <app>_role, Vault (postgres/creds/<app>, the k8s auth role <app>, the policies <app> / <app>-ops, the CI JWT role gitea_cicd_<app>), the k8s namespace + ServiceAccount, the ArgoCD Application, the GCS state prefix <app>/main, and DNS (<app>.arcodange.lab). Bricks wire together by name convention, not explicit config.

That convention conflates two ideas it never separated: an application and a deployed instance of it. There is exactly one of everything per app — one namespace, one database, one Vault creds path, one DNS host. The model cannot express "the same app, a second time, somewhere else."

The motivating need makes the gap concrete. The Arcodange Dolibarr ERP is growing a write-capable AI-agent skill — auto-creating supplier invoices from ingested emails, fixing thirdparty data, and similar mutations. Before such writes touch the production accounting database, the operator needs a place where the agent can run write operations autonomously, a human reviews the result, and only then the same operation is promoted to prod. That requires a second deployed instance of the same application: the same Dolibarr chart, the same version, the same conventions — differing only in where it runs and which data it touches.

| Force | Pressure it creates |
|---|---|
| One identifier per app, no env coordinate | "Same app, different environment" is inexpressible without inventing a whole second app. |
| Write-capable AI agent landing on the prod ERP | A wrong autonomous write corrupts live accounting data with no rehearsal surface. |
| Fidelity requirement for the rehearsal surface | The sandbox must run the real Dolibarr API against prod-like data, or the rehearsal predicts nothing. |
| ADR-0001 rejected an in-cluster sandbox | Its Alternative 3 ("sandbox namespace on the real cluster") was rejected for shared blast radius, so any in-cluster sibling instance must be reconciled against that, not pretended away. |

Treating the sandbox as a wholly separate app would fork the chart, the repo, the runbook chain, and the Vault wiring — four things that then drift apart over time, defeating the "same app, same version" fidelity the rehearsal depends on.

Decision

We will extend the <app> convention with a second coordinate, <env>, governed by an elision rule so that adding the coordinate changes nothing for any existing app.

  • env defaults to prod, and prod elides. When env == prod, no suffix is added: every derived name is character-for-character identical to today's single-env output. The instance name equals the app name (local.instance == local.name), so every existing app's tofu plan is a no-op.
  • Non-prod envs take the <app>-<env> suffix in kebab-case everywhere — namespace, Vault paths / roles / policies, ArgoCD Application, DNS host, GCS-state sub-prefix — with one exception: the Postgres owner role stays snake-case as <app>_<env>_role, matching the existing _role suffix convention.
  • One repo and one chart serve every env of an app. Per-env differences are overlaid via values-<env>.yaml; the chart's instance-specific values are .Values-driven, not hardcoded literals, so the same chart renders any instance.
  • One CI JWT role (gitea_cicd_<app>) per repo covers all its envs. Its ops policy is widened to the <app>-* path family. Each running instance keeps its own runtime Vault policy.
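
The elision rule above can be sketched as a small naming function. This is a minimal illustration in Python, not the module code: the function names (`instance_name`, `owner_role`, `derived_names`) and the dictionary keys are hypothetical, and the path shapes are copied from the names this ADR lists rather than from the actual OpenTofu modules.

```python
from typing import Dict

PROD = "prod"  # the default env, which elides


def instance_name(app: str, env: str = PROD) -> str:
    """Bare <app> for prod (elided), <app>-<env> for any other env."""
    return app if env == PROD else f"{app}-{env}"


def owner_role(app: str, env: str = PROD) -> str:
    """Postgres owner role: the one snake-case exception to the kebab-case suffix."""
    return f"{app}_role" if env == PROD else f"{app}_{env}_role"


def derived_names(app: str, env: str = PROD) -> Dict[str, str]:
    """Derive per-instance names; the keys are illustrative labels only."""
    inst = instance_name(app, env)
    return {
        "database": inst,
        "owner_role": owner_role(app, env),
        "namespace": inst,
        "vault_db_creds": f"postgres/creds/{inst}",
        "vault_kv_config": f"kvv2/{inst}/config",
        "argocd_application": inst,
        "dns_host": f"{inst}.arcodange.lab",
        # Shared across every env of the app: keyed on app, not instance.
        "ci_jwt_role": f"gitea_cicd_{app}",
    }
```

Calling `derived_names("erp")` yields the names in use today, while `derived_names("erp", "sandbox")` yields the suffixed sibling instance. Only the CI JWT role is computed from the app rather than the instance, which is what makes it shared.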

Worked example: erp and erp-sandbox

| Coordinate | erp (env = prod, elided) | erp-sandbox (env = sandbox) |
|---|---|---|
| Postgres database | erp | erp-sandbox |
| Postgres owner role | erp_role | erp_sandbox_role |
| k8s namespace + ServiceAccount | erp | erp-sandbox |
| Vault dynamic DB creds | postgres/creds/erp | postgres/creds/erp-sandbox |
| Vault KV config | kvv2/erp/config | kvv2/erp-sandbox/config |
| ArgoCD Application | erp | erp-sandbox |
| Internal DNS | erp.arcodange.lab | erp-sandbox.arcodange.lab |
| Gitea repo | arcodange-org/erp | arcodange-org/erp (shared) |
| Helm chart | one chart | one chart (shared) |
| CI JWT role | gitea_cicd_erp | gitea_cicd_erp (shared) |

Why this is not what ADR-0001 rejected

ADR-0001 chose a local-only safe environment (k3d / arm64 VMs) and rejected its Alternative 3, an in-cluster "sandbox namespace on the real cluster," for shared blast radius. ADR-0002 introduces an in-cluster sibling instance (erp-sandbox), which looks like the very thing that was rejected. The two stand together because they operate at different layers.

ADR-0001's rejection is scoped to rehearsing infrastructure / platform change-classes — Ansible playbooks, Vault policy / auth / mount changes, Postgres superuser migrations, ArgoCD prune / selfHeal, Longhorn ops, DNS / email. Those couplings share fleet-wide control planes, so an in-cluster sandbox cannot isolate them; only a separate cluster + Vault + state + DNS zone can. That is exactly why ADR-0001 is local-only.

ADR-0002 operates one layer up. The AI agent's only reach is the Dolibarr HTTP API, holding a write-scoped, app-specific API key against an isolated database — erp-sandbox on its own erp_sandbox_role, its own namespace, its own Vault creds path. The agent never touches kubectl, the Vault root, the Postgres superuser, ArgoCD, Longhorn, or DNS. The fleet-level blast radius that doomed Alternative 3 for infra rehearsal is simply not in the agent's reach; the blast radius of a wrong AI write is bounded to the sandbox app's own data.

The two ADRs are therefore complementary, not contradictory, and ADR-0002 does not supersede ADR-0001. ADR-0001 isolates the operator from breaking the fleet. ADR-0002 isolates the AI agent from corrupting one app's production data, while preserving the prod-like API surface and real-data fidelity that the local k3d sandbox — which carries no prod data — cannot offer.

Consequences

  • + Every existing app (webapp, erp, crowdsec, plausible, dance-lessons-coach, cms) is unaffected: the elision rule makes the prod instance's derived names byte-identical, so adoption ships with zero migration and a no-op plan.
  • + A second instance of an app is now a values-<env>.yaml overlay plus an envs entry — not a forked repo, chart, and runbook chain — so prod and sandbox share one source of truth and stay on the same version by construction.
  • + The AI-agent write skill gets a prod-like rehearsal surface with real-shaped data: the same Dolibarr API and chart, an isolated database, a bounded blast radius.
  • + The convention chain (db + role → Vault creds + policy → namespace + SA → ArgoCD → DNS) is reused verbatim for the -sandbox instance, so runbooks read identically for any env.
  • Names are no longer a flat app list: every consumer must reason about the instance == app (prod) versus app-env (non-prod) distinction, and the snake-case owner-role exception (<app>_<env>_role) is a special case that must be carried in the modules.
  • A single shared Vault CI policy widened to <app>-* means the CI role for a repo can write the ops paths of all that repo's envs — a deliberately looser ops scope than one-policy-per-instance.
  • A single shared OpenTofu state per repo holds every env's resources together, so the envs of one app share a blast radius at the state layer (mitigated by for_each, accepted at current scale — see Alternatives).
  • The AI-agent promotion workflow this unlocks: the agent runs writes against erp-sandbox autonomously, emits a structured changeset, a human reviews it, and the same operation is re-applied to prod only with explicit confirmation — never auto-applied by the agent. The read/write skills resolve their target by an env switch (e.g. DOLIBARR_TARGET=prod|sandbox, defaulting to prod).
  • Rollout is additive and phased, each phase gated by a no-op tofu plan against existing apps: (A) the tools repo adds an optional env / envs parameter to the shared app_roles and app_policy Vault modules; (B) the factory repo gains the envs schema in postgres/iac tfvars, renders one ArgoCD Application per env, and documents the elision rule in conventions.md; (C) the erp chart literals are templated to .Values; (D) erp + factory activate erp-sandbox; (E) DNS + ArgoCD registration.
  • Per-env state separation (<app>/<env> prefixes) is a door left open: if env-to-env blast-radius isolation at the state layer becomes warranted, the prefix scheme can be revisited without changing the naming model.
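
To make the widened CI ops scope from the list above concrete, the sketch below checks which instance names fall inside it. It rests on two loud assumptions: Python's `fnmatchcase` stands in for the policy engine's own path matching, and the bare `<app>` name is assumed to stay in scope alongside the `<app>-*` family (the ADR states the widening, not the exact path list). The helper names are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import List


def ops_scope_patterns(app: str) -> List[str]:
    """Name patterns the shared CI role is assumed to cover for one repo."""
    # Assumption: prod (elided, bare <app>) remains covered next to <app>-*.
    return [app, f"{app}-*"]


def in_ci_ops_scope(app: str, instance: str) -> bool:
    """True if the instance name matches any pattern in the app's ops scope."""
    return any(fnmatchcase(instance, pattern) for pattern in ops_scope_patterns(app))
```

Under these assumptions, `in_ci_ops_scope("erp", "erp-sandbox")` and `in_ci_ops_scope("erp", "erp")` both hold, while `in_ci_ops_scope("erp", "webapp")` does not. That is the "looser than one-policy-per-instance" trade-off in miniature: one role reaches every env of its own repo and nothing outside it.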

Alternatives considered

| Option | Why not |
|---|---|
| Treat erp-sandbox as a wholly separate <app> (own repo, own chart copy) | Forks the chart, the repo, and the runbook chain; the two copies drift over time; defeats the "same app, same version" fidelity the rehearsal depends on. |
| Use the ADR-0001 local-only sandbox (k3d / VMs) for the AI-agent writes | That environment carries no production data; the write-rehearsal needs prod-like data and the real Dolibarr API surface to be meaningful. Complementary to ADR-0001, not a substitute for it. |
| Per-env OpenTofu state (<app>/<env> prefixes) instead of one shared state per repo | Buys more env-to-env blast-radius isolation, but at the cost of more CI plumbing and cross-env output wiring than current scale warrants; one shared state with for_each keeps runbooks simple. A real decision point: the chosen path is single shared state per repo, with the prefix scheme left as a future door. |
| No elision: always suffix, even prod (<app>-prod) | Breaks every existing derived name, forcing a fleet-wide rename plus tofu resource moves; rejected in favour of the elision rule's zero-migration property. |

QA & validation

  • Backwards-compat no-op gate — after the module change, tofu plan against every existing app (webapp, erp, crowdsec, plausible, dance-lessons-coach, cms) reports zero changes. The elision rule guarantees local.instance == local.name for env == prod, so no prod resource moves.
  • Byte-identical chart render: helm template erp chart/ before versus after the literal-templating refactor diffs to nothing (verified: 10857 bytes on both sides, diff exit 0).
  • tofu fmt -check + tofu validate are clean on the module changes.
  • Sandbox activation gate — when erp-sandbox is stood up, the new-web-app convention chain must resolve end to end for the -sandbox instance (db + role → Vault creds + policy → namespace + SA → ArgoCD Healthy/Synced → VSO injects → pod Running), exactly as the prod instance does.
  • Promotion gate — no AI-authored write reaches the prod ERP until it has been applied to erp-sandbox, produced a reviewed changeset, and been explicitly re-applied with human confirmation.
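
The promotion gate and the env switch it relies on can be sketched as follows. This is an illustrative outline under stated assumptions, not the skill's implementation: only the `DOLIBARR_TARGET` variable and its `prod` default come from this ADR, while the `Changeset` fields and the `resolve_target` / `may_apply` helpers are hypothetical names.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

VALID_TARGETS = ("prod", "sandbox")


def resolve_target(environ: Optional[Mapping[str, str]] = None) -> str:
    """Read DOLIBARR_TARGET, defaulting to prod; reject unknown values."""
    env = os.environ if environ is None else environ
    target = env.get("DOLIBARR_TARGET", "prod")
    if target not in VALID_TARGETS:
        raise ValueError(f"unknown DOLIBARR_TARGET: {target!r}")
    return target


@dataclass(frozen=True)
class Changeset:
    """Hypothetical record of one AI-authored write operation."""
    operation: str
    applied_to_sandbox: bool = False
    reviewed: bool = False


def may_apply(changeset: Changeset, target: str, human_confirmed: bool = False) -> bool:
    """Sandbox writes run autonomously; prod writes need all three gates."""
    if target == "sandbox":
        return True
    return changeset.applied_to_sandbox and changeset.reviewed and human_confirmed
```

In this sketch a prod write is refused unless the changeset has already run against the sandbox, has been reviewed, and carries an explicit human confirmation for this specific application, so the agent alone can never satisfy the gate.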

References