Gabriel Radureau b886f06824 docs(vibe): backfill PR #10 crosslink into ADR-0001 + PRD STATUS

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-23 11:53:39 +02:00

ADR-0001: Safe, production-like environment for the lab

Status: Accepted · Date: 2026-06-23 · Deciders: @arcodange

Context

The Arcodange lab doubles as company production. The same three Raspberry Pis and one MacBook control node run the public CMS (arcodange.fr), Zoho-backed business email, the Dolibarr ERP holding accounting and business records, the url-shortener, the telegram-gateway, and the dance-lessons-coach. A single operator administers all of it from one laptop holding the kubeconfig, the Vault root token, and every cloud admin token.

There is no separation between where the operator experiments and where the business runs. Every risky change is tested directly in production. The 2026-04-13 power cut proved that recovery is manual, multi-step, and has only ever been validated by a real incident, never rehearsed.

The danger is concentrated in a handful of change-classes. Each one can cause silent, fleet-wide, or data-losing damage when applied to the live environment:

| Change-class | Blast radius if wrong |
| --- | --- |
| Ansible playbook edits | Can wipe disks, reset k3s, or corrupt Longhorn across the fleet. |
| Vault policy / auth / mount changes | Lock out the Vault Secrets Operator → fleet-wide secret outage; a botched init could overwrite the single unseal key. |
| Postgres migrations / role changes | The superuser provider on `192.168.1.202` can drop or alter live databases → ERP data loss. |
| ArgoCD sync / app-of-apps changes | `prune` + `selfHeal` auto-prunes live resources fleet-wide. |
| Cloudflare / DNS / email changes | A wrong MX/SPF/DKIM silently breaks `arcodange.fr` mail for days. |
| Longhorn / storage ops | Volume recreation orphans replicas via new engine IDs. |
| Recovery drills | The runbook is only validated by real incidents, never rehearsed. |
| Cert / PKI re-init | Rotates the internal CA, invalidating every issued `*.arcodange.lab` cert. |

A change-management process is not enough: the operator needs a place to make the mistake first, where the mistake cannot reach production.

Decision

We will build a local-only safe environment on the MacBook control node, seeded from the same GitOps repos via a dedicated sandbox inventory, with two modes:

(a) k3d single-node "fast inner loop" (~60s bring-up) for app, Vault, and ArgoCD iteration.
(b) Three arm64 VMs (multipass or Vagrant on the M4) reproducing the three-node topology — Postgres + Gitea as docker-compose outside k3s on the "pi2-equivalent" VM, Longhorn across the three VM disks — for Ansible, Longhorn, and recovery work.

The load-bearing requirement is the isolation boundary: the sandbox must be unable to mutate real production even on a wrong command. Each production coupling maps to a concrete guardrail — a separate sandbox inventory with a prod-IP abort guard, sandbox GCS state prefixes, a separate sandbox Vault with its own unseal-key path, a sandbox Postgres host check, and plan-only DNS against a throwaway zone. The real arcodange.fr Cloudflare/Zoho tokens are never exported into a sandbox shell. Because the <app> convention keys everything within a cluster/Vault/DB, the sandbox reuses identical <app> names with no collision — the boundary is the cluster + Vault + state + DNS zone, not the names, so runbooks read identically in both environments.

See the isolation-boundary leaf for the full coupling→control mapping, and the PRD for the complete product view.
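
The prod-IP abort guard named above can be sketched in a few lines. This is only an illustration under stated assumptions: the sandbox host names and the `192.168.64.x` addresses are hypothetical, and the only production address taken from this ADR is `192.168.1.202`; a real guard would load the full list from the production inventory.

```python
import ipaddress

# Assumption: only 192.168.1.202 appears in this ADR. A real guard would load
# every production address from the prod inventory instead of hard-coding one.
PROD_ADDRESSES = {ipaddress.ip_address("192.168.1.202")}


class ProdTargetError(RuntimeError):
    """Raised when a sandbox inventory resolves to a production address."""


def assert_sandbox_only(inventory_hosts, prod_addresses=PROD_ADDRESSES):
    """Abort if any host in the sandbox inventory points at production.

    inventory_hosts maps an inventory host name to its address. A value that
    is not a valid IP raises ValueError, which also stops the run.
    """
    offenders = {
        name: addr
        for name, addr in inventory_hosts.items()
        if ipaddress.ip_address(addr) in prod_addresses
    }
    if offenders:
        raise ProdTargetError(f"refusing to run, prod addresses found: {offenders}")
    return True


if __name__ == "__main__":
    # Hypothetical sandbox inventory; passes because no address is production.
    assert_sandbox_only({"sandbox-pi1": "192.168.64.11", "sandbox-pi2": "192.168.64.12"})
```

Running such a check before any playbook keeps the failure mode closed: the run stops on the first production address rather than relying on the operator to notice it.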

Consequences

+ $0 inner loop that runs on the existing control node — no new hardware.
+ Rehearses every dangerous change-class except real public DNS/email.
+ The 2026-04-13 recovery sequence becomes a repeatable drill instead of a once-per-incident gamble.
+ Identical <app> names mean runbooks are environment-agnostic.
− Architecture mismatches must be handled: VMs and images on the M4 have to be arm64, and x86-only images cannot be assumed to run.
− New guardrail and parity-manifest maintenance burden.
− Single-laptop resource limits — k3d for speed, VMs only when multi-node fidelity is actually needed.
→ Real public DNS/ACME and physical-ARM always-on testing remain unsolved by design; revisit only if recurring game-days demand them.

Alternatives considered

| Option | Fidelity | Isolation | $ / effort | Verdict |
| --- | --- | --- | --- | --- |
| 1 · Ephemeral local cluster (k3d/kind) | Medium (single-node) | Full (separate cluster) | $0 / low | ✅ Chosen as the fast mode. |
| 2 · Three arm64 VMs reproducing the topology | High (3-node, PG+Gitea outside k3s, Longhorn) | Full (separate VMs) | $0 / medium | ✅ Chosen for fidelity. |
| 3 · Sandbox namespace on the real cluster | High | None (shared Vault/PG/Longhorn/ArgoCD) | $0 / low | ❌ Rejected: shared blast radius fails the core isolation requirement. |
| 4 · Dedicated physical node (4th Pi / mini-PC) | High (real ARM, always-on) | Full | $$ hardware / medium | ⛔ Out of scope: hardware cost; revisit only for recurring always-on ARM game-days. |
| 5 · Disposable cloud k3s for real public DNS/ACME | High infra, but arch drift | Full | $ recurring / medium | ⛔ Out of scope: cost + ARM drift; its only unique value is real DNS/email, which we explicitly do not test. |

QA & validation

Parity manifest — k3s v1.34.3+k3s1, same Longhorn/Vault/VSO, same app-of-apps list, PG + Gitea outside k3s, three nodes. Any drift from this manifest is a failure.
Provisioning-parity test — run the new-web-app runbook for a throwaway <app> "canary" and assert the convention chain resolves end to end: Gitea repo → PG db + role → Vault creds + policies → ArgoCD Healthy/Synced → VSO injects → pod Running.
Idempotence gate — ansible-playbook --check --diff reports changed=0 on the converged sandbox before any change is promoted to prod.
tofu plan diff gate — plan against sandbox state; for DNS, assert it touches only the throwaway zone.
Chaos drills (mapped to CLUSTER_RECOVERY.md sections): node-kill; Vault-seal (unseal via the sandbox key; VSO re-auths); Longhorn volume corruption (run recover/longhorn*.yml, validate engine-ID re-association per the Longhorn PVC recovery ADR); DB drop/restore; full power-cut simulation (execute CLUSTER_RECOVERY.md top-to-bottom against the sandbox to green); ArgoCD bad-sync (observe prune/selfHeal, author a rollback runbook section); cert/PKI re-issue.
Monthly game-day — the operator follows only the runbook; any improvised step becomes a runbook PR. This is how CLUSTER_RECOVERY.md gets validated against the sandbox instead of waiting for the next real incident.
Promotion gate — no infra/Vault/storage/DNS change reaches prod until it has been applied to the sandbox and survived the matching drill. Each drill records time-to-recover and which step failed or was improvised.
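
The idempotence gate in the list above could be automated roughly as follows. This is a minimal sketch, not the lab's tooling: it assumes the captured stdout of `ansible-playbook --check --diff` is available as text and reads the per-host counters from the PLAY RECAP section. The function names and the sample host name are hypothetical.

```python
import re

# Matches one host line of the PLAY RECAP printed by ansible-playbook, e.g.
# "sandbox-pi1 : ok=12 changed=0 unreachable=0 failed=0 skipped=1 ..."
RECAP_LINE = re.compile(r"^(?P<host>\S+)\s*:\s*(?P<counters>(?:\w+=\d+\s*)+)$")


def parse_recap(output):
    """Return {host: {counter: int}} from ansible-playbook stdout."""
    recap = {}
    in_recap = False
    for line in output.splitlines():
        if line.startswith("PLAY RECAP"):
            in_recap = True
            continue
        match = RECAP_LINE.match(line.strip()) if in_recap else None
        if match:
            pairs = (item.split("=") for item in match["counters"].split())
            recap[match["host"]] = {key: int(value) for key, value in pairs}
    return recap


def is_idempotent(output):
    """True only if every host reports changed=0, failed=0 and unreachable=0.

    An empty recap counts as a failure, so a run that never reached the
    recap cannot pass the gate by accident.
    """
    recap = parse_recap(output)
    return bool(recap) and all(
        host["changed"] == 0 and host["failed"] == 0 and host["unreachable"] == 0
        for host in recap.values()
    )


if __name__ == "__main__":
    sample = (
        "PLAY RECAP *********\n"
        "sandbox-pi1 : ok=12 changed=0 unreachable=0 failed=0 skipped=1 rescued=0 ignored=0\n"
    )
    print(is_idempotent(sample))
```

The sketch also treats failed or unreachable hosts as a gate failure, which is slightly stricter than the `changed=0` wording; that choice is an assumption, made so that an unreachable node cannot pass as "converged".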

See the QA strategy leaf for the detailed drill table and evidence trail.
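
For the tofu plan diff gate, one possible check reads the JSON form of a saved plan (as produced by `tofu show -json <planfile>`) and lists any DNS change that targets a zone other than the throwaway one. This is a sketch under assumptions: the resource type names and the zone ID below are placeholders chosen for illustration and do not come from this ADR.

```python
import json

# Assumptions for illustration only: the record resource types of the
# Cloudflare provider and a placeholder ID for the throwaway zone.
DNS_RESOURCE_TYPES = {"cloudflare_record", "cloudflare_dns_record"}
SANDBOX_ZONE_ID = "sandbox-zone-id-placeholder"


def dns_changes_outside_zone(plan_json, allowed_zone_id=SANDBOX_ZONE_ID):
    """Return addresses of planned DNS changes that target any other zone.

    plan_json is the JSON text of a saved plan. Records whose zone cannot be
    read from the plan are reported too, so the gate fails closed.
    """
    plan = json.loads(plan_json)
    offenders = []
    for rc in plan.get("resource_changes", []):
        change = rc["change"]
        if rc["type"] not in DNS_RESOURCE_TYPES or change["actions"] == ["no-op"]:
            continue
        # A delete has no "after" state, so fall back to "before".
        state = change.get("after") or change.get("before") or {}
        if state.get("zone_id") != allowed_zone_id:
            offenders.append(rc["address"])
    return offenders


if __name__ == "__main__":
    empty_plan = json.dumps({"resource_changes": []})
    print(dns_changes_outside_zone(empty_plan))
```

An empty result means the plan touches only the throwaway zone; any returned address is a reason to stop before applying.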

References

PRD · Safe, production-like environment — full product view, requirements, and phased rollout.
INV-001 · Prod blast-radius couplings — the investigation that mapped every prod coupling.
Guidebook · Lab ecosystem — the end-to-end map of the prod topology this sandbox mirrors.
Guidebook · Storage and recovery — how Longhorn and the recovery sequence work today.
doc/adr index — foundational infrastructure ADRs (read-only history).
Longhorn PVC recovery ADR — engine-ID re-association, exercised by the Longhorn chaos drill.
new-web-app conventions — the <app> convention reused identically across sandbox and prod.
CLUSTER_RECOVERY.md — the tested power-cut recovery sequence, lives at the lab root (outside this repo); the chaos drills rehearse it section by section.
PRs: #10 — bootstrap vibe/ tree + ecosystem AGENTS.md.
