diff --git a/vibe/ADR/0002-per-application-environments.md b/vibe/ADR/0002-per-application-environments.md
new file mode 100644
index 0000000..8d2e85e
--- /dev/null
+++ b/vibe/ADR/0002-per-application-environments.md
@@ -0,0 +1,97 @@
+[vibe](../README.md) > [ADR](README.md) > **0002 · Per-application environments**
+
+# ADR-0002: Per-application environments via an env coordinate
+
+> **Status**: Accepted
+> **Date**: 2026-06-25
+> **Deciders**: @arcodange
+
+## Context
+
+The [`` join key](../../doc/runbooks/new-web-app/conventions.md) threads one kebab-case identifier identically through every system that makes up an application: the Gitea repo, the Postgres database + `_role`, Vault (`postgres/creds/`, the k8s auth role ``, the policies `` / `-ops`, the CI JWT role `gitea_cicd_`), the k8s namespace + ServiceAccount, the ArgoCD Application, the GCS state prefix `/main`, and DNS (`.arcodange.lab`). Bricks wire together by name convention, not explicit config.
+
+That convention conflates two ideas it never separated: an **application** and a **deployed instance** of it. There is exactly one of everything per app — one namespace, one database, one Vault creds path, one DNS host. The model cannot express "the same app, a second time, somewhere else."
+
+The motivating need makes the gap concrete. The Arcodange Dolibarr ERP is growing a write-capable AI-agent skill — auto-creating supplier invoices from ingested emails, fixing thirdparty data, and similar mutations. Before such writes touch the production accounting database, the operator needs a place where the agent can run write operations autonomously, a human reviews the result, and only then the same operation is promoted to prod. That requires a **second deployed instance of the same application**: the same Dolibarr chart, the same version, the same conventions — differing only in *where* it runs and *which data* it touches.
+
+| Force | Pressure it creates |
+| --- | --- |
+| One identifier per app, no env coordinate | "Same app, different environment" is inexpressible without inventing a whole second app. |
+| Write-capable AI agent landing on the prod ERP | A wrong autonomous write corrupts live accounting data with no rehearsal surface. |
+| Fidelity requirement for the rehearsal surface | The sandbox must run the *real* Dolibarr API against *prod-like* data, or the rehearsal predicts nothing. |
+| [ADR-0001](0001-safe-prod-like-environment.md) rejected an in-cluster sandbox | Its Alternative 3 ("sandbox namespace on the real cluster") was rejected for shared blast radius — so any in-cluster sibling instance must be reconciled against that, not pretended away. |
+
+Treating the sandbox as a wholly separate app would fork the chart, the repo, the runbook chain, and the Vault wiring — four things that then drift apart over time, defeating the "same app, same version" fidelity the rehearsal depends on.
+
+## Decision
+
+We will extend the `` convention with a second coordinate, ``, governed by an **elision rule** so that adding the coordinate changes nothing for any existing app.
+
+- **`env` defaults to `prod`, and `prod` elides.** When `env == prod`, no suffix is added: every derived name is character-for-character identical to today's single-env output. The instance name equals the app name (`local.instance == local.name`), so every existing app's `tofu plan` is a no-op.
+- **Non-prod envs take the `-` suffix** in kebab-case everywhere — namespace, Vault paths / roles / policies, ArgoCD Application, DNS host, GCS-state sub-prefix — with one exception: the Postgres owner role stays snake-case as `__role`, matching the existing `_role` suffix convention.
+- **One repo and one chart serve every env of an app.** Per-env differences are overlaid via `values-.yaml`; the chart's instance-specific values are `.Values`-driven, not hardcoded literals, so the same chart renders any instance.
+- **One CI JWT role (`gitea_cicd_`) per repo covers all its envs.** Its ops policy is widened to the `-*` path family. Each running instance keeps its own runtime Vault policy.
+
+### Worked example: `erp` and `erp-sandbox`
+
+| Coordinate | `erp` (env = prod, elided) | `erp-sandbox` (env = sandbox) |
+| --- | --- | --- |
+| Postgres database | `erp` | `erp-sandbox` |
+| Postgres owner role | `erp_role` | `erp_sandbox_role` |
+| k8s namespace + ServiceAccount | `erp` | `erp-sandbox` |
+| Vault dynamic DB creds | `postgres/creds/erp` | `postgres/creds/erp-sandbox` |
+| Vault KV config | `kvv2/erp/config` | `kvv2/erp-sandbox/config` |
+| ArgoCD Application | `erp` | `erp-sandbox` |
+| Internal DNS | `erp.arcodange.lab` | `erp-sandbox.arcodange.lab` |
+| Gitea repo | `arcodange-org/erp` | `arcodange-org/erp` (shared) |
+| Helm chart | one chart | one chart (shared) |
+| CI JWT role | `gitea_cicd_erp` | `gitea_cicd_erp` (shared) |
+
+### Why this is not what ADR-0001 rejected
+
+[ADR-0001](0001-safe-prod-like-environment.md) chose a **local-only** safe environment (k3d / arm64 VMs) and rejected its Alternative 3, an in-cluster "sandbox namespace on the real cluster," for shared blast radius. ADR-0002 introduces an in-cluster sibling instance (`erp-sandbox`), which looks like the very thing that was rejected. The two stand together because they operate at **different layers**.
+
+ADR-0001's rejection is scoped to rehearsing **infrastructure / platform** change-classes — Ansible playbooks, Vault policy / auth / mount changes, Postgres superuser migrations, ArgoCD prune / selfHeal, Longhorn ops, DNS / email. Those couplings share fleet-wide control planes, so an in-cluster sandbox cannot isolate them; only a separate cluster + Vault + state + DNS zone can. That is exactly why ADR-0001 is local-only.
+
+ADR-0002 operates one layer up. The AI agent's only reach is the **Dolibarr HTTP API**, holding a write-scoped, app-specific API key against an isolated database — `erp-sandbox` on its own `erp_sandbox_role`, its own namespace, its own Vault creds path. The agent never touches kubectl, the Vault root, the Postgres superuser, ArgoCD, Longhorn, or DNS. The fleet-level blast radius that doomed Alternative 3 for infra rehearsal is simply **not in the agent's reach**; the blast radius of a wrong AI write is bounded to the sandbox app's own data.
+
+The two ADRs are therefore complementary, not contradictory, and ADR-0002 does not supersede ADR-0001. ADR-0001 isolates the *operator* from breaking the *fleet*. ADR-0002 isolates the *AI agent* from corrupting *one app's production data*, while preserving the prod-like API surface and real-data fidelity that the local k3d sandbox — which carries no prod data — cannot offer.
+
+## Consequences
+
+- **+** Every existing app (webapp, erp, crowdsec, plausible, dance-lessons-coach, cms) is unaffected: the elision rule makes the prod instance's derived names byte-identical, so adoption ships with zero migration and a no-op plan.
+- **+** A second instance of an app is now a `values-.yaml` overlay plus an `envs` entry — not a forked repo, chart, and runbook chain — so prod and sandbox share one source of truth and stay on the same version by construction.
+- **+** The AI-agent write skill gets a prod-like rehearsal surface with real-shaped data: the *same* Dolibarr API and chart, an *isolated* database, a bounded blast radius.
+- **+** The convention chain (db + role → Vault creds + policy → namespace + SA → ArgoCD → DNS) is reused verbatim for the `-sandbox` instance, so runbooks read identically for any env.
+- **−** Names are no longer a flat app list: every consumer must reason about the `instance == app` (prod) versus `app-env` (non-prod) distinction, and the snake-case owner-role exception (`__role`) is a special case that must be carried in the modules.
+- **−** A single shared Vault CI policy widened to `-*` means the CI role for a repo can write the ops paths of *all* that repo's envs — a deliberately looser ops scope than one-policy-per-instance.
+- **−** A single shared OpenTofu state per repo holds every env's resources together, so the envs of one app share a blast radius at the state layer (mitigated by `for_each`, accepted at current scale — see Alternatives).
+- **→** The AI-agent promotion workflow this unlocks: the agent runs writes against `erp-sandbox` autonomously, emits a structured changeset, a human reviews it, and the **same** operation is re-applied to prod only with explicit confirmation — never auto-applied by the agent. The read/write skills resolve their target by an env switch (e.g. `DOLIBARR_TARGET=prod|sandbox`, defaulting to `prod`).
+- **→** Rollout is additive and phased, each phase gated by a no-op `tofu plan` against existing apps: **(A)** the `tools` repo adds an optional `env` / `envs` parameter to the shared `app_roles` and `app_policy` Vault modules; **(B)** the `factory` repo gains the `envs` schema in `postgres/iac` tfvars, renders one ArgoCD Application per env, and documents the elision rule in `conventions.md`; **(C)** the `erp` chart literals are templated to `.Values`; **(D)** `erp` + `factory` activate `erp-sandbox`; **(E)** DNS + ArgoCD registration.
+- **→** Per-env state separation (`/` prefixes) is a door left open: if env-to-env blast-radius isolation at the state layer becomes warranted, the prefix scheme can be revisited without changing the naming model.
+
+## Alternatives considered
+
+| Option | Why not |
+| --- | --- |
+| Treat `erp-sandbox` as a wholly separate `` (own repo, own chart copy) | Forks the chart, the repo, and the runbook chain; the two copies drift over time; defeats the "same app, same version" fidelity the rehearsal depends on. |
+| Use the [ADR-0001](0001-safe-prod-like-environment.md) local-only sandbox (k3d / VMs) for the AI-agent writes | That environment carries **no production data** — the write-rehearsal needs prod-like data and the real Dolibarr API surface to be meaningful. Complementary to ADR-0001, not a substitute for it. |
+| Per-env OpenTofu state (`/` prefixes) instead of one shared state per repo | Buys more env-to-env blast-radius isolation, but at the cost of more CI plumbing and cross-env output wiring than current scale warrants; one shared state with `for_each` keeps runbooks simple. A real decision point — the chosen path is single shared state per repo, with the prefix scheme left as a future door. |
+| No elision — always suffix, even prod (`-prod`) | Breaks every existing derived name, forcing a fleet-wide rename plus `tofu` resource moves; rejected in favour of the elision rule's zero-migration property. |
+
+## QA & validation
+
+- **Backwards-compat no-op gate** — after the module change, `tofu plan` against every existing app (webapp, erp, crowdsec, plausible, dance-lessons-coach, cms) reports zero changes. The elision rule guarantees `local.instance == local.name` for `env == prod`, so no prod resource moves.
+- **Byte-identical chart render** — `helm template erp chart/` before versus after the literal-templating refactor diffs to nothing (verified: 10857 bytes on both sides, `diff` exit 0).
+- **`tofu fmt -check` + `tofu validate`** are clean on the module changes.
+- **Sandbox activation gate** — when `erp-sandbox` is stood up, the [new-web-app convention chain](../../doc/runbooks/new-web-app/conventions.md) must resolve end to end for the `-sandbox` instance (db + role → Vault creds + policy → namespace + SA → ArgoCD Healthy/Synced → VSO injects → pod Running), exactly as the prod instance does.
+- **Promotion gate** — no AI-authored write reaches the prod ERP until it has been applied to `erp-sandbox`, produced a reviewed changeset, and been explicitly re-applied with human confirmation.
+
+## References
+
+- [ADR-0001 · Safe, production-like environment](0001-safe-prod-like-environment.md) — the local-only safe environment for infra rehearsal that this ADR complements (it stands; this does not supersede it).
+- [PRD · Safe, production-like environment](../PRD/safe-prod-like-environment/README.md) — the product view this work relates to, and its [isolation-boundary leaf](../PRD/safe-prod-like-environment/isolation-boundary.md) detailing the cluster/Vault/state/DNS boundary.
+- [new-web-app conventions](../../doc/runbooks/new-web-app/conventions.md) — the single-env `` convention this ADR extends with the env coordinate.
+- [Phase A — `tools` Vault module env parameter](https://gitea.arcodange.lab/arcodange-org/tools/pulls/2) — adds the optional `env` / `envs` parameter to the shared `app_roles` and `app_policy` modules.
+- [Phase C — `erp` chart literal templating](https://gitea.arcodange.lab/arcodange-org/erp/pulls/11) — templates the chart's single-env literals to `.Values` so one chart renders any instance.
+- [PR factory#15 — this ADR](https://gitea.arcodange.lab/arcodange-org/factory/pulls/15) — the change that introduces ADR-0002 (links back to this file).
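
Review note: the elision rule in the Decision section of ADR-0002 can be sketched in a few lines, which may help when checking the worked-example table by hand. This is a minimal illustration only, not the module code from the `tools` or `factory` repos. The function name `derive_names`, the parameter names `app` and `env`, and the dictionary keys are hypothetical; the name patterns are read off the `erp` / `erp-sandbox` table above.

```python
PROD = "prod"  # the env that elides, per the ADR's elision rule


def derive_names(app: str, env: str = PROD) -> dict[str, str]:
    """Derive per-instance names for one (app, env) pair.

    Hypothetical helper: patterns are taken from the ADR's worked example,
    not from the real OpenTofu modules.
    """
    is_prod = env == PROD
    # prod elides: the instance name equals the app name.
    instance = app if is_prod else f"{app}-{env}"
    # The one exception to kebab-case: the Postgres owner role is snake-case.
    owner_role = f"{app}_role" if is_prod else f"{app}_{env}_role"
    return {
        "instance": instance,
        "postgres_database": instance,
        "postgres_owner_role": owner_role,
        "k8s_namespace": instance,
        "vault_db_creds": f"postgres/creds/{instance}",
        "vault_kv_config": f"kvv2/{instance}/config",
        "argocd_application": instance,
        "dns_host": f"{instance}.arcodange.lab",
        # Shared across every env of the repo, so it never takes a suffix.
        "ci_jwt_role": f"gitea_cicd_{app}",
    }


if __name__ == "__main__":
    for env in ("prod", "sandbox"):
        print(env, derive_names("erp", env))
```

Calling `derive_names("erp")` and `derive_names("erp", "sandbox")` should reproduce the two columns of the worked example. Keeping the prod branch free of any suffix logic is what makes the no-op `tofu plan` gate plausible for existing apps in this sketch.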
diff --git a/vibe/ADR/README.md b/vibe/ADR/README.md
index e2378ec..2ac1866 100644
--- a/vibe/ADR/README.md
+++ b/vibe/ADR/README.md
@@ -3,7 +3,7 @@
 # Architecture Decision Records
 
 > **Status**: 🟢 Active
-> **Last Updated**: 2026-06-23
+> **Last Updated**: 2026-06-25
 > **Related**: [vibe/PRD](../PRD/README.md) · [vibe/Investigations](../investigations/README.md)
 > **Historical**: [doc/adr](../../doc/adr/README.md) (foundational infra) · [ansible/.../docs/adr](../../ansible/arcodange/factory/docs/adr/) (dated infra ADRs)
 
@@ -34,6 +34,7 @@ When a new decision *supersedes* one of the historical records, write the new AD
 | # | Title | Status | Date |
 | --- | --- | --- | --- |
 | [0001](0001-safe-prod-like-environment.md) | Safe, production-like environment | 🟢 Accepted | 2026-06-23 |
+| [0002](0002-per-application-environments.md) | Per-application environments | 🟢 Accepted | 2026-06-25 |
 
 ## Rules to contribute
 
diff --git a/vibe/PRD/safe-prod-like-environment/README.md b/vibe/PRD/safe-prod-like-environment/README.md
index a821950..feb7b19 100644
--- a/vibe/PRD/safe-prod-like-environment/README.md
+++ b/vibe/PRD/safe-prod-like-environment/README.md
@@ -3,9 +3,9 @@
 # Safe, production-like environment
 
 > **Status:** In design
-> **Last Updated:** 2026-06-23
+> **Last Updated:** 2026-06-25
 > **Design record:** [ADR 0001 — Safe, production-like environment](../../ADR/0001-safe-prod-like-environment.md)
-> **Adjacent:** [INV-001 — prod blast-radius couplings](../../investigations/INV-001-prod-blast-radius-couplings.md)
+> **Adjacent:** [INV-001 — prod blast-radius couplings](../../investigations/INV-001-prod-blast-radius-couplings.md) · [ADR 0002 — per-application environments](../../ADR/0002-per-application-environments.md) (the application-data-layer counterpart)
 > **Map:** [Lab ecosystem guidebook](../../guidebooks/lab-ecosystem/README.md)
 
 ## Problem