Safe, production-like environment
Status: In design
Last Updated: 2026-06-25
Design record: ADR 0001 — Safe, production-like environment
Adjacent: INV-001 — prod blast-radius couplings · ADR 0002 — per-application environments (the application-data-layer counterpart)
Map: Lab ecosystem guidebook
Problem
The lab doubles as company production. The same three Raspberry Pis that host experiments also run the public CMS (arcodange.fr), Zoho-backed email, the Dolibarr ERP (accounting and business records), the url-shortener, the telegram-gateway, and the dance-lessons-coach. A single operator administers all of it from one MacBook that holds the kubeconfig, the Vault root, and the cloud admin tokens.
There is no separation between "where I experiment" and "where the business runs": every risky change is currently tested directly in prod. The 2026-04-13 power-cut proved that recovery is manual, non-trivial, and validated only by real incidents. A wrong Ansible play, Vault policy, Postgres migration, ArgoCD sync, or DNS record can silently break the business for days.
Users & personas
A single operator wearing two hats:
- The inner-loop developer — iterating on an app, a Helm chart, a Vault policy, or an ArgoCD app. Wants a fast feedback cycle (seconds, not a fleet round-trip) and zero fear of touching the wrong thing.
- The game-day / recovery operator — rehearsing dangerous change-classes and disaster recovery (node-kill, Vault-seal, Longhorn corruption, DB drop, full power-cut). Wants high fidelity to the real topology so a drill actually predicts prod behaviour, and a runbook that gets validated against the sandbox instead of waiting for the next outage.
Both hats belong to the same person on the same laptop. The environment must serve the fast loop and the faithful loop without ever letting either reach into prod.
Goals & non-goals
Goals
- Rehearse every dangerous change-class (Ansible, Vault, Postgres, ArgoCD, Cloudflare/DNS, Longhorn, recovery drills, PKI) with zero prod blast radius.
- Validate `CLUSTER_RECOVERY.md` against the sandbox, not against prod — turn the recovery runbook from incident-validated into drill-validated.
- A fast inner loop for app / Vault / ArgoCD iteration, seeded from the same GitOps repos as prod.
- Run on the existing control node (the MacBook) at $0 marginal cost.
Non-goals
- No real `arcodange.fr` DNS or email tests. DNS/email modules run plan-only against a throwaway zone; real public DNS/ACME end-to-end is out of scope.
- No physical-node tier (4th Pi / mini-PC) — explicitly rejected, not deferred.
- No cloud tier for disposable real-DNS clusters — explicitly rejected, not deferred.
- Not a performance-benchmark environment — fidelity is about topology and the convention chain, not throughput numbers on laptop-class hardware.
Requirements
Functional:
- One-command bring-up, seeded from the same GitOps repos as prod.
- Sandbox inventory + guards — a separate `inventory/sandbox/hosts.yml` plus a pre-task guard that aborts on any prod IP (192.168.1.201-203) unless `i_mean_prod=true`. See Isolation boundary.
- Parity manifest — same k3s `v1.34.3+k3s1`, same Longhorn/Vault/VSO, same app-of-apps list, Postgres + Gitea outside k3s, three nodes. Drift = fail.
- Two modes seeded from the same repos via the sandbox inventory:
- (a) k3d single-node fast inner loop (~60s bring-up) for app / Vault / ArgoCD iteration.
- (b) 3 arm64 VMs (multipass or Vagrant on the M4) reproducing the 3-node topology — Postgres + Gitea as docker-compose outside k3s on the pi2-equivalent VM, Longhorn across the three VM disks — for Ansible / Longhorn / recovery work.
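The "Drift = fail" rule of the parity manifest can be pictured as a plain comparison between an expected manifest and what the sandbox reports. The sketch below is illustrative only: the key names and the `parity_drift` helper are hypothetical, and only the k3s version, the node count, and the "outside k3s" placement come from the requirement itself. Component versions the requirement does not state are left out.

```python
# Hypothetical manifest keys; only the values are taken from the requirement.
EXPECTED = {
    "k3s_version": "v1.34.3+k3s1",
    "node_count": 3,
    "postgres_outside_k3s": True,
    "gitea_outside_k3s": True,
}


def parity_drift(expected, observed):
    """Return {key: (expected, observed)} for every key that does not match.

    A missing key in `observed` counts as drift and is reported as None.
    """
    return {
        key: (want, observed.get(key))
        for key, want in expected.items()
        if observed.get(key) != want
    }


if __name__ == "__main__":
    # Example observation, as a sandbox probe might report it (made-up data).
    observed = dict(EXPECTED, k3s_version="v1.33.0+k3s1")
    drift = parity_drift(EXPECTED, observed)
    if drift:
        raise SystemExit(f"parity drift, failing the gate: {drift}")
```

An empty result means the sandbox matches the manifest; any non-empty result would fail the gate.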
The isolation boundary (the table mapping each prod coupling to its sandbox control) and the <app> naming note are detailed in isolation-boundary.md.
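As a rough illustration of the pre-task guard from the requirements, the logic is "abort on any prod IP unless the operator explicitly opts in". The real guard would presumably live in an Ansible pre-task; this Python sketch only mirrors the decision. The function name, the exception type, and the input shape (a list of IP strings) are assumptions, and the sandbox address in the usage note is made up.

```python
import ipaddress

# Prod node addresses from the requirement (192.168.1.201-203).
PROD_IPS = {ipaddress.ip_address(f"192.168.1.{n}") for n in range(201, 204)}


class ProdTargetError(RuntimeError):
    """Raised when a run would touch a prod node without explicit opt-in."""


def assert_not_prod(target_hosts, i_mean_prod=False):
    """Abort if any target is a prod IP, unless i_mean_prod is set.

    target_hosts: iterable of IP address strings (hypothetical input shape).
    """
    if i_mean_prod:
        return  # The operator explicitly asked for prod.
    hits = sorted(
        str(ip)
        for ip in map(ipaddress.ip_address, target_hosts)
        if ip in PROD_IPS
    )
    if hits:
        raise ProdTargetError(
            f"refusing to run against prod IPs: {', '.join(hits)}"
        )
```

Under these assumptions, `assert_not_prod(["192.168.64.10"])` returns silently, while `assert_not_prod(["192.168.1.202"])` raises `ProdTargetError` unless `i_mean_prod=True` is passed.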
QA strategy
Fidelity gates (parity manifest, a canary provisioning-parity test that drives the new-web-app runbook end to end, an ansible-playbook --check --diff changed=0 idempotence gate, and a tofu plan diff gate) plus chaos drills mapped section-by-section to CLUSTER_RECOVERY.md, a recurring monthly game-day, and a promotion gate: no infra / Vault / storage / DNS change reaches prod until it has been applied to the sandbox and survived the matching drill. Full detail in qa-strategy.md.
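One way to picture the `changed=0` idempotence gate is to scan the PLAY RECAP that `ansible-playbook` prints and fail if any host reports a non-zero `changed` counter. This is a minimal sketch, not the gate defined in qa-strategy.md: it assumes uncoloured output in the usual `host : ok=N changed=N ...` recap layout, and the `changed_hosts` helper and the host names in the sample are hypothetical.

```python
import re

# One recap line: "<host> : ok=12 changed=0 unreachable=0 ..."
RECAP_LINE = re.compile(r"^(?P<host>\S+)\s*:\s*(?P<counters>(?:\w+=\d+\s*)+)$")


def changed_hosts(ansible_output):
    """Return {host: changed_count} for hosts with changed > 0 in the recap."""
    offenders = {}
    in_recap = False
    for line in ansible_output.splitlines():
        if line.startswith("PLAY RECAP"):
            in_recap = True
            continue
        if not in_recap:
            continue
        match = RECAP_LINE.match(line.strip())
        if not match:
            continue
        counters = dict(
            pair.split("=") for pair in match.group("counters").split()
        )
        changed = int(counters.get("changed", 0))
        if changed > 0:
            offenders[match.group("host")] = changed
    return offenders


if __name__ == "__main__":
    # Made-up recap for two sandbox VMs.
    sample = (
        "PLAY RECAP *********************************************\n"
        "vm1 : ok=12 changed=0 unreachable=0 failed=0 skipped=2 rescued=0 ignored=0\n"
        "vm2 : ok=12 changed=3 unreachable=0 failed=0 skipped=2 rescued=0 ignored=0\n"
    )
    offenders = changed_hosts(sample)
    if offenders:
        raise SystemExit(f"idempotence gate failed: {offenders}")
```

An empty result corresponds to `changed=0` on every host, which is the condition the gate requires.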
Implementation status
Rollout is phased (Phase 0 guardrails → Phase 1 k3d → Phase 2 3-VM → Phase 3 game-day; Phase 4 out of scope). Live tracker: STATUS.md.
Leaves
| Page | Summary | Status |
|---|---|---|
| Isolation boundary | Prod-coupling → sandbox-control table; the <app> naming note; the token caution. | 🟡 In design |
| QA strategy | Fidelity gates, chaos-drill table, game-day cadence, promotion gate, evidence trail. | 🟡 In design |
| STATUS | Phase tracker (all not-started) and PR log. | ⬜ Not started |