Files
factory/vibe/ADR/0001-safe-prod-like-environment.md

85 lines
8.4 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
[vibe](../README.md) > [ADR](README.md) > **0001 · Safe, production-like environment**
# ADR-0001: Safe, production-like environment for the lab
> **Status**: Accepted
> **Date**: 2026-06-23
> **Deciders**: @arcodange
## Context
The Arcodange lab doubles as company production. The same three Raspberry Pis and one MacBook control node run the public CMS (`arcodange.fr`), Zoho-backed business email, the Dolibarr ERP holding accounting and business records, the url-shortener, the telegram-gateway, and the dance-lessons-coach. A single operator administers all of it from one laptop holding the kubeconfig, the Vault root token, and every cloud admin token.
There is no separation between *where I experiment* and *where the business runs*. Every risky change is tested directly in production. The 2026-04-13 power-cut proved recovery is manual, multi-step, and only ever validated by a real incident — never rehearsed.
The danger is concentrated in a handful of change-classes. Each one can cause silent, fleet-wide, or data-losing damage when applied to the live environment:
| Change-class | Blast radius if wrong |
| --- | --- |
| Ansible playbook edits | Can wipe disks, reset k3s, or corrupt Longhorn across the fleet. |
| Vault policy / auth / mount changes | Lock out the Vault Secrets Operator → fleet-wide secret outage; a botched init could overwrite the single unseal key. |
| Postgres migrations / role changes | The superuser provider on `192.168.1.202` can drop or alter live databases → ERP data loss. |
| ArgoCD sync / app-of-apps changes | `prune` + `selfHeal` auto-prunes live resources fleet-wide. |
| Cloudflare / DNS / email changes | A wrong MX/SPF/DKIM silently breaks `arcodange.fr` mail for days. |
| Longhorn / storage ops | Volume recreation orphans replicas via new engine IDs. |
| Recovery drills | The runbook is only validated by real incidents, never rehearsed. |
| Cert / PKI re-init | Rotates the internal CA, invalidating every issued `*.arcodange.lab` cert. |
A change-management process is not enough: the operator needs a place to *make the mistake first*, where the mistake cannot reach production.
## Decision
We will build a **local-only safe environment on the MacBook control node**, seeded from the *same* GitOps repos via a dedicated sandbox inventory, with two modes:
- **(a) k3d single-node "fast inner loop"** (~60s bring-up) for app, Vault, and ArgoCD iteration.
- **(b) Three arm64 VMs** (multipass or Vagrant on the M4) reproducing the three-node topology — Postgres + Gitea as docker-compose *outside* k3s on the "pi2-equivalent" VM, Longhorn across the three VM disks — for Ansible, Longhorn, and recovery work.
The load-bearing requirement is the **isolation boundary**: the sandbox must be *unable* to mutate real production even on a wrong command. Each production coupling maps to a concrete guardrail — a separate sandbox inventory with a prod-IP abort guard, sandbox GCS state prefixes, a separate sandbox Vault with its own unseal-key path, a sandbox Postgres host check, and plan-only DNS against a throwaway zone. The real `arcodange.fr` Cloudflare/Zoho tokens are never exported into a sandbox shell. Because the `<app>` convention keys everything *within* a cluster/Vault/DB, the sandbox reuses identical `<app>` names with no collision — the boundary is the cluster + Vault + state + DNS zone, not the names, so runbooks read identically in both environments.
See the [isolation-boundary leaf](../PRD/safe-prod-like-environment/isolation-boundary.md) for the full coupling→control mapping, and the [PRD](../PRD/safe-prod-like-environment/README.md) for the complete product view.
## Consequences
- **+** $0 inner loop that runs on the existing control node — no new hardware.
- **+** Rehearses every dangerous change-class except real public DNS/email.
- **+** The 2026-04-13 recovery sequence becomes a repeatable drill instead of a once-per-incident gamble.
- **+** Identical `<app>` names mean runbooks are environment-agnostic.
- **** x86/ARM nuance must be handled (use arm64 VMs/images on the M4).
- **** New guardrail and parity-manifest maintenance burden.
- **** Single-laptop resource limits — k3d for speed, VMs only when multi-node fidelity is actually needed.
- **→** Real public DNS/ACME and physical-ARM always-on testing remain unsolved by design; revisit only if recurring game-days demand them.
## Alternatives considered
| Option | Fidelity | Isolation | $ / effort | Verdict |
| --- | --- | --- | --- | --- |
| 1 · Ephemeral local cluster (k3d/kind) | Medium (single-node) | Full (separate cluster) | $0 / low | ✅ **Chosen** as the fast mode. |
| 2 · Three arm64 VMs reproducing the topology | High (3-node, PG+Gitea outside k3s, Longhorn) | Full (separate VMs) | $0 / medium | ✅ **Chosen** for fidelity. |
| 3 · Sandbox namespace on the real cluster | High | None — shared Vault/PG/Longhorn/ArgoCD | $0 / low | ❌ **Rejected**: shared blast radius fails the core isolation requirement. |
| 4 · Dedicated physical node (4th Pi / mini-PC) | High (real ARM, always-on) | Full | $$ hardware / medium | ⛔ **Out of scope**: hardware cost; revisit only for recurring always-on ARM game-days. |
| 5 · Disposable cloud k3s for real public DNS/ACME | High infra, but arch drift | Full | $ recurring / medium | ⛔ **Out of scope**: cost + ARM drift; its only unique value is real DNS/email, which we explicitly do not test. |
## QA & validation
- **Parity manifest** — k3s `v1.34.3+k3s1`, same Longhorn/Vault/VSO, same app-of-apps list, PG + Gitea outside k3s, three nodes. Any drift from this manifest is a failure.
- **Provisioning-parity test** — run the new-web-app runbook for a throwaway `<app>` "canary" and assert the convention chain resolves end to end: Gitea repo → PG db + role → Vault creds + policies → ArgoCD Healthy/Synced → VSO injects → pod Running.
- **Idempotence gate** — `ansible-playbook --check --diff` reports `changed=0` on the converged sandbox *before* any change is promoted to prod.
- **`tofu plan` diff gate** — plan against sandbox state; for DNS, assert it touches *only* the throwaway zone.
- **Chaos drills** (mapped to CLUSTER_RECOVERY.md sections): node-kill; Vault-seal (unseal via the *sandbox* key; VSO re-auths); Longhorn volume corruption (run `recover/longhorn*.yml`, validate engine-ID re-association per the [Longhorn PVC recovery ADR](../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md)); DB drop/restore; full power-cut simulation (execute CLUSTER_RECOVERY.md top-to-bottom against the sandbox to green); ArgoCD bad-sync (observe prune/selfHeal, author a rollback runbook section); cert/PKI re-issue.
- **Monthly game-day** — the operator follows *only* the runbook; any improvised step becomes a runbook PR. This is how CLUSTER_RECOVERY.md gets validated against the sandbox instead of waiting for the next real incident.
- **Promotion gate** — no infra/Vault/storage/DNS change reaches prod until it has been applied to the sandbox *and* survived the matching drill. Each drill records time-to-recover and which step failed or was improvised.
See the [QA strategy leaf](../PRD/safe-prod-like-environment/qa-strategy.md) for the detailed drill table and evidence trail.
## References
- [PRD · Safe, production-like environment](../PRD/safe-prod-like-environment/README.md) — full product view, requirements, and phased rollout.
- [INV-001 · Prod blast-radius couplings](../investigations/INV-001-prod-blast-radius-couplings.md) — the investigation that mapped every prod coupling.
- [Guidebook · Lab ecosystem](../guidebooks/lab-ecosystem/README.md) — the end-to-end map of the prod topology this sandbox mirrors.
- [Guidebook · Storage and recovery](../guidebooks/lab-ecosystem/storage-and-recovery.md) — how Longhorn and the recovery sequence work today.
- [doc/adr index](../../doc/adr/README.md) — foundational infrastructure ADRs (read-only history).
- [Longhorn PVC recovery ADR](../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md) — engine-ID re-association, exercised by the Longhorn chaos drill.
- [new-web-app conventions](../../doc/runbooks/new-web-app/conventions.md) — the `<app>` convention reused identically across sandbox and prod.
- CLUSTER_RECOVERY.md — the tested power-cut recovery sequence, lives at the lab root (outside this repo); the chaos drills rehearse it section by section.
- PRs: [#10 — bootstrap vibe/ tree + ecosystem AGENTS.md](https://gitea.arcodange.lab/arcodange-org/factory/pulls/10).