factory/vibe/PRD/safe-prod-like-environment
Gabriel Radureau 7647a68cdc docs(vibe): bootstrap vibe/ knowledge tree + ecosystem AGENTS.md
Add a root AGENTS.md (ecosystem map of factory/tools/cms + agent operating
rules + the persona cohort & workflow) and a new vibe/ knowledge base for LLM
agents, modeled on tree-docs conventions and the factory house style.

vibe/ folders (each with a README hub + contribution rules):
- ADR/      optimized MADR-lite; canonical home going forward (doc/adr stays historical)
- PRD/      one subfolder per PRD, mandatory STATUS.md, QA strategy for big ones
- investigations/  single INV-NNN-slug.md, or stub + folder w/ notebooks
- guidebooks/      tree-docs maps; lab-ecosystem guidebook of factory+tools+cms
- runbooks/        [AGENT]/[HUMAN] step procedures (EN; doc/runbooks stays FR)
- shareouts/       dated FR handouts (decks/mp4)

Seed content (first ADR + PRD): a safe, production-like environment to rehearse
risky changes and recovery without touching real prod — local-only sandbox
(k3d + arm64 VMs) with a hard prod/sandbox isolation boundary. Includes
INV-001 (prod blast-radius couplings), the ecosystem guidebook, and a FR shareout.

Conventions enforced: no-tombstone rule, breadcrumb spine, bidirectional
cross-links, theme:base mermaid (MCP-validated) + ordered-list-after-diagram.
Built with a Workflow + persona cohort; 24 files, zero dead links.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 11:52:37 +02:00

vibe > PRD > Safe, production-like environment

Safe, production-like environment

Status: In design
Last Updated: 2026-06-23
Design record: ADR 0001 — Safe, production-like environment
Adjacent: INV-001 — prod blast-radius couplings
Map: Lab ecosystem guidebook

Problem

The lab doubles as company production. The same three Raspberry Pis that host experiments also run the public CMS (arcodange.fr), Zoho-backed email, the Dolibarr ERP (accounting and business records), the url-shortener, the telegram-gateway, and the dance-lessons-coach. A single operator administers all of it from one MacBook that holds the kubeconfig, the Vault root, and the cloud admin tokens.

There is no separation between "where I experiment" and "where the business runs": every risky change is currently tested directly in prod. The 2026-04-13 power-cut proved that recovery is manual, non-trivial, and validated only by real incidents. A wrong Ansible play, Vault policy, Postgres migration, ArgoCD sync, or DNS record can silently break the business for days.

Users & personas

A single operator wearing two hats:

  • The inner-loop developer — iterating on an app, a Helm chart, a Vault policy, or an ArgoCD app. Wants a fast feedback cycle (seconds, not a fleet round-trip) and zero fear of touching the wrong thing.
  • The game-day / recovery operator — rehearsing dangerous change-classes and disaster recovery (node-kill, Vault-seal, Longhorn corruption, DB drop, full power-cut). Wants high fidelity to the real topology so a drill actually predicts prod behaviour, and a runbook that gets validated against the sandbox instead of waiting for the next outage.

Both hats belong to the same person on the same laptop. The environment must serve the fast loop and the faithful loop without ever letting either reach into prod.

Goals & non-goals

Goals

  • Rehearse every dangerous change-class (Ansible, Vault, Postgres, ArgoCD, Cloudflare/DNS, Longhorn, recovery drills, PKI) with zero prod blast radius.
  • Validate CLUSTER_RECOVERY.md against the sandbox, not against prod — turn the recovery runbook from incident-validated into drill-validated.
  • A fast inner loop for app / Vault / ArgoCD iteration, seeded from the same GitOps repos as prod.
  • Run on the existing control node (the MacBook) at $0 marginal cost.

Non-goals

  • No real arcodange.fr DNS or email tests. DNS/email modules run plan-only against a throwaway zone; real public DNS/ACME end-to-end is out of scope.
  • No physical-node tier (4th Pi / mini-PC) — explicitly rejected, not deferred.
  • No cloud tier for disposable real-DNS clusters — explicitly rejected, not deferred.
  • Not a performance-benchmark environment — fidelity is about topology and the convention chain, not throughput numbers on laptop-class hardware.

Requirements

Functional:

  • One-command bring-up, seeded from the same GitOps repos as prod.
  • Sandbox inventory + guards — a separate inventory/sandbox/hosts.yml plus a pre-task guard that aborts on any prod IP (192.168.1.201-203) unless i_mean_prod=true. See Isolation boundary.
  • Parity manifest — same k3s v1.34.3+k3s1, same Longhorn/Vault/VSO, same app-of-apps list, Postgres + Gitea outside k3s, three nodes. Drift = fail.
  • Two modes seeded from the same repos via the sandbox inventory:
    • (a) k3d single-node fast inner loop (~60s bring-up) for app / Vault / ArgoCD iteration.
    • (b) 3 arm64 VMs (multipass or Vagrant on the M4) reproducing the 3-node topology — Postgres + Gitea as docker-compose outside k3s on the pi2-equivalent VM, Longhorn across the three VM disks — for Ansible / Longhorn / recovery work.
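The "Drift = fail" rule of the parity manifest can be illustrated with a minimal Python sketch. It assumes the manifest is a flat key/value mapping; the key names, the `find_drift` and `check_parity` helpers, and the way observed values are collected are all hypothetical. Only the k3s version string and the three-node count come from the requirement above.

```python
# Hypothetical parity manifest. Key names are assumptions for illustration;
# the k3s version and node count are taken from the requirements list.
EXPECTED = {
    "k3s": "v1.34.3+k3s1",
    "node_count": 3,
    "postgres_outside_k3s": True,
    "gitea_outside_k3s": True,
}

_MISSING = object()  # sentinel so a missing key is distinguishable from any value


def find_drift(expected: dict, observed: dict) -> dict:
    """Return {key: (expected, observed)} for every mismatching or missing key."""
    drift = {}
    for key, want in expected.items():
        got = observed.get(key, _MISSING)
        if got is _MISSING:
            drift[key] = (want, None)  # key absent from the sandbox report
        elif got != want:
            drift[key] = (want, got)
    return drift


def check_parity(expected: dict, observed: dict) -> None:
    """Exit non-zero on any drift, mirroring 'Drift = fail'."""
    drift = find_drift(expected, observed)
    if drift:
        raise SystemExit(f"parity drift detected: {drift}")
```

In this sketch, `check_parity(EXPECTED, observed)` would run at the end of bring-up for either mode, with `observed` built from whatever the sandbox actually reports.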

The isolation boundary (the table mapping each prod coupling to its sandbox control) and the <app> naming note are detailed in isolation-boundary.md.
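The logic of the prod-IP guard from the requirements can be sketched in Python as follows. The real guard is described as an Ansible pre-task, so this is only an illustration of the decision rule, not the playbook itself. The function names are hypothetical; the IP range 192.168.1.201-203 and the `i_mean_prod` override come from the requirements. The sketch matches literal IPs only and does not resolve hostnames.

```python
import ipaddress

# Prod range stated in the requirements: 192.168.1.201 through 192.168.1.203.
PROD_IPS = {ipaddress.ip_address(f"192.168.1.{n}") for n in range(201, 204)}


def prod_hosts(hosts):
    """Return the inventory entries that are literal prod IPs."""
    hits = []
    for host in hosts:
        try:
            addr = ipaddress.ip_address(host)
        except ValueError:
            continue  # not an IP literal (e.g. a hostname); not resolved here
        if addr in PROD_IPS:
            hits.append(host)
    return hits


def guard(hosts, i_mean_prod=False):
    """Abort when any prod IP is targeted, unless the operator opted in explicitly."""
    hits = prod_hosts(hosts)
    if hits and not i_mean_prod:
        raise SystemExit(f"refusing to run: prod IPs in inventory: {hits}")
```

Defaulting `i_mean_prod` to `False` keeps the safe path as the default: reaching prod takes a deliberate extra argument.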

QA strategy

The QA strategy combines four elements: fidelity gates (the parity manifest, a canary provisioning-parity test that drives the new-web-app runbook end to end, an ansible-playbook --check --diff idempotence gate that must report changed=0, and a tofu plan diff gate); chaos drills mapped section by section to CLUSTER_RECOVERY.md; a recurring monthly game-day; and a promotion gate. Under the promotion gate, no infra / Vault / storage / DNS change reaches prod until it has been applied to the sandbox and has survived the matching drill. Full detail in qa-strategy.md.
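The promotion gate can be read as a simple predicate over a change record. The Python sketch below is one possible reading, under stated assumptions: the `Change` record, its field names, and the lowercase class labels are hypothetical. Treating changes outside the four gated classes as promotable is also an assumption, since the text only describes the gate for infra, Vault, storage, and DNS changes.

```python
from dataclasses import dataclass

# Change classes named by the promotion gate (labels are illustrative).
GATED_CLASSES = {"infra", "vault", "storage", "dns"}


@dataclass
class Change:
    change_class: str          # e.g. "vault"; hypothetical field
    applied_to_sandbox: bool   # the change was applied to the sandbox
    drill_passed: bool         # the matching chaos drill survived


def may_promote(change: Change) -> bool:
    """True only if a gated change was applied to the sandbox and survived its drill."""
    if change.change_class not in GATED_CLASSES:
        return True  # assumption: ungated classes are outside this gate
    return change.applied_to_sandbox and change.drill_passed
```

For example, `may_promote(Change("vault", True, False))` evaluates to `False`, because the change reached the sandbox but did not survive the matching drill.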

Implementation status

Rollout is phased (Phase 0 guardrails → Phase 1 k3d → Phase 2 3-VM → Phase 3 game-day; Phase 4 out of scope). Live tracker: STATUS.md.

Leaves

| Page | Summary | Status |
|---|---|---|
| Isolation boundary | Prod-coupling → sandbox-control table; the <app> naming note; the token caution. | 🟡 In design |
| QA strategy | Fidelity gates, chaos-drill table, game-day cadence, promotion gate, evidence trail. | 🟡 In design |
| STATUS | Phase tracker (all not-started) and PR log. | Not started |