docs(vibe): bootstrap vibe/ knowledge tree + ecosystem AGENTS.md

Add a root AGENTS.md (ecosystem map of factory/tools/cms + agent operating rules + the persona cohort & workflow) and a new vibe/ knowledge base for LLM agents, modeled on tree-docs conventions and the factory house style.

vibe/ folders (each with a README hub + contribution rules):
- ADR/ optimized MADR-lite; canonical home going forward (doc/adr stays historical)
- PRD/ one subfolder per PRD, mandatory STATUS.md, QA strategy for big ones
- investigations/ single INV-NNN-slug.md, or stub + folder w/ notebooks
- guidebooks/ tree-docs maps; lab-ecosystem guidebook of factory+tools+cms
- runbooks/ [AGENT]/[HUMAN] step procedures (EN; doc/runbooks stays FR)
- shareouts/ dated FR handouts (decks/mp4)

Seed content (first ADR + PRD): a safe, production-like environment to rehearse risky changes and recovery without touching real prod — local-only sandbox (k3d + arm64 VMs) with a hard prod/sandbox isolation boundary. Includes INV-001 (prod blast-radius couplings), the ecosystem guidebook, and a FR shareout.

Conventions enforced: no-tombstone rule, breadcrumb spine, bidirectional cross-links, theme:base mermaid (MCP-validated) + ordered-list-after-diagram.

Built with a Workflow + persona cohort; 24 files, zero dead links.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 11:52:37 +02:00
parent 827af6b392
commit 7647a68cdc
25 changed files with 1878 additions and 0 deletions
--- a/vibe/PRD/safe-prod-like-environment/README.md
+++ b/vibe/PRD/safe-prod-like-environment/README.md
@@ -0,0 +1,69 @@
+[vibe](../../README.md) > [PRD](../README.md) > **Safe, production-like environment**
+
+# Safe, production-like environment
+
+> **Status:** In design
+> **Last Updated:** 2026-06-23
+> **Design record:** [ADR 0001 — Safe, production-like environment](../../ADR/0001-safe-prod-like-environment.md)
+> **Adjacent:** [INV-001 — prod blast-radius couplings](../../investigations/INV-001-prod-blast-radius-couplings.md)
+> **Map:** [Lab ecosystem guidebook](../../guidebooks/lab-ecosystem/README.md)
+
+## Problem
+
+The lab doubles as company production. The same three Raspberry Pis that host experiments also run the public CMS ([arcodange.fr](https://gitea.arcodange.lab/arcodange-org/cms)), Zoho-backed email, the Dolibarr ERP (accounting and business records), the url-shortener, the telegram-gateway, and the dance-lessons-coach. A single operator administers all of it from one MacBook that holds the kubeconfig, the Vault root, and the cloud admin tokens.
+
+There is no separation between "where I experiment" and "where the business runs": every risky change is currently tested directly in prod. The 2026-04-13 power-cut proved that recovery is manual, non-trivial, and validated only by real incidents. A wrong Ansible play, Vault policy, Postgres migration, ArgoCD sync, or DNS record can silently break the business for days.
+
+## Users & personas
+
+A **single operator wearing two hats**:
+
+- **The inner-loop developer** — iterating on an app, a Helm chart, a Vault policy, or an ArgoCD app. Wants a fast feedback cycle (seconds, not a fleet round-trip) and zero fear of touching the wrong thing.
+- **The game-day / recovery operator** — rehearsing dangerous change-classes and disaster recovery (node-kill, Vault-seal, Longhorn corruption, DB drop, full power-cut). Wants high fidelity to the real topology so a drill actually predicts prod behaviour, and a runbook that gets validated against the sandbox instead of waiting for the next outage.
+
+Both hats belong to the same person on the same laptop. The environment must serve the fast loop and the faithful loop without ever letting either reach into prod.
+
+## Goals & non-goals
+
+**Goals**
+
+- Rehearse **every dangerous change-class** (Ansible, Vault, Postgres, ArgoCD, Cloudflare/DNS, Longhorn, recovery drills, PKI) with **zero prod blast radius**.
+- Validate `CLUSTER_RECOVERY.md` against the **sandbox**, not against prod — turn the recovery runbook from incident-validated into drill-validated.
+- A **fast inner loop** for app / Vault / ArgoCD iteration, seeded from the same GitOps repos as prod.
+- Run on the **existing control node** (the MacBook) at **$0** marginal cost.
+
+**Non-goals**
+
+- No real `arcodange.fr` DNS or email tests. DNS/email modules run **plan-only against a throwaway zone**; real public DNS/ACME end-to-end is out of scope.
+- No **physical-node tier** (4th Pi / mini-PC) — explicitly rejected, not deferred.
+- No **cloud tier** for disposable real-DNS clusters — explicitly rejected, not deferred.
+- Not a **performance-benchmark** environment — fidelity is about topology and the convention chain, not throughput numbers on laptop-class hardware.
+
+## Requirements
+
+Functional:
+
+- **One-command bring-up**, seeded from the same GitOps repos as prod.
+- **Sandbox inventory + guards** — a separate `inventory/sandbox/hosts.yml` plus a pre-task guard that aborts on any prod IP (`192.168.1.201-203`) unless `i_mean_prod=true`. See [Isolation boundary](isolation-boundary.md).
+- **Parity manifest** — same k3s `v1.34.3+k3s1`, same Longhorn/Vault/VSO, same app-of-apps list, Postgres + Gitea outside k3s, three nodes. Drift = fail.
+- **Two modes** seeded from the same repos via the sandbox inventory:
+  - **(a) k3d single-node fast inner loop** (~60s bring-up) for app / Vault / ArgoCD iteration.
+  - **(b) 3 arm64 VMs** (multipass or Vagrant on the M4) reproducing the 3-node topology — Postgres + Gitea as docker-compose outside k3s on the pi2-equivalent VM, Longhorn across the three VM disks — for Ansible / Longhorn / recovery work.
+
+The **isolation boundary** (the table mapping each prod coupling to its sandbox control) and the **`<app>` naming note** are detailed in [isolation-boundary.md](isolation-boundary.md).
+
+## QA strategy
+
+The strategy combines three things. **Fidelity gates**: the parity manifest, a `canary` provisioning-parity test that drives the new-web-app runbook end to end, an `ansible-playbook --check --diff` `changed=0` idempotence gate, and a `tofu plan` diff gate. **Rehearsal**: chaos drills mapped section by section to `CLUSTER_RECOVERY.md`, plus a recurring monthly game-day. **A promotion gate**: no infra / Vault / storage / DNS change reaches prod until it has been applied to the sandbox **and** survived the matching drill. Full detail in [qa-strategy.md](qa-strategy.md).
+
+## Implementation status
+
+Rollout is phased (Phase 0 guardrails → Phase 1 k3d → Phase 2 3-VM → Phase 3 game-day; Phase 4 out of scope). Live tracker: [STATUS.md](STATUS.md).
+
+## Leaves
+
+| Page | Summary | Status |
+| --- | --- | --- |
+| [Isolation boundary](isolation-boundary.md) | Prod-coupling → sandbox-control table; the `<app>` naming note; the token caution. | 🟡 In design |
+| [QA strategy](qa-strategy.md) | Fidelity gates, chaos-drill table, game-day cadence, promotion gate, evidence trail. | 🟡 In design |
+| [STATUS](STATUS.md) | Phase tracker (all not-started) and PR log. | ⬜ Not started |
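As a rough illustration of the prod-IP abort guard from the Requirements section of `README.md` above, the check could be sketched in Python as follows. The real guard is described as an Ansible pre-task, so this is only a standalone model of the logic. The names `ProdTargetError`, `is_prod_ip`, and `assert_sandbox_targets` are hypothetical, and hostnames are deliberately not resolved.

```python
import ipaddress

# Prod Pi addresses named in the PRD (192.168.1.201-203).
PROD_FIRST = ipaddress.IPv4Address("192.168.1.201")
PROD_LAST = ipaddress.IPv4Address("192.168.1.203")


class ProdTargetError(RuntimeError):
    """Raised when a run would touch a prod IP without explicit consent."""


def is_prod_ip(address: str) -> bool:
    """Return True if `address` is an IPv4 literal inside the prod range."""
    try:
        ip = ipaddress.IPv4Address(address)
    except ipaddress.AddressValueError:
        # Assumption: only IP literals are checked; hostnames are not resolved.
        return False
    return PROD_FIRST <= ip <= PROD_LAST


def assert_sandbox_targets(targets: list[str], i_mean_prod: bool = False) -> None:
    """Abort if any target is a prod IP, unless i_mean_prod is set explicitly."""
    offenders = [t for t in targets if is_prod_ip(t)]
    if offenders and not i_mean_prod:
        raise ProdTargetError(f"refusing to run against prod IPs: {offenders}")
```

Calling `assert_sandbox_targets(["192.168.1.202"])` raises, while the same call with `i_mean_prod=True` passes. A production version would probably also need to resolve hostnames before checking, since an inventory can reach a prod Pi by name.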
--- a/vibe/PRD/safe-prod-like-environment/STATUS.md
+++ b/vibe/PRD/safe-prod-like-environment/STATUS.md
@@ -0,0 +1,52 @@
+[vibe](../../README.md) > [PRD](../README.md) > [Safe, production-like environment](README.md) > **STATUS**
+
+# STATUS — Safe, production-like environment
+
+> **Last Updated:** 2026-06-23
+
+Legend: ⬜ not started · 🟡 in progress · ✅ done
+
+> [!IMPORTANT]
+> This file MUST be updated whenever something ships. Every PR that advances a phase crosslinks back here (and the matching checkbox flips), and the [PRs](#prs) table gets a row.
+
+## Phase 0 — Isolation guardrails
+
+*Must land before any sandbox run.*
+
+- [ ] ⬜ Sandbox inventory `inventory/sandbox/hosts.yml` (sandbox VM hosts only)
+- [ ] ⬜ Prod-IP abort guard (aborts on `192.168.1.201-203` unless `i_mean_prod=true`)
+- [ ] ⬜ Sandbox GCS state prefixes (`sandbox/...`) or `gs://arcodange-tf-sandbox`
+- [ ] ⬜ Sandbox Vault unseal-key path (`~/.arcodange/sandbox/cluster-keys.json`)
+- [ ] ⬜ Sandbox env profile / plan-only DNS against a throwaway zone
+
+## Phase 1 — Tier-1 k3d fast mode
+
+- [ ] ⬜ One-command bring-up seeded from GitOps
+- [ ] ⬜ Parity manifest v1
+- [ ] ⬜ Canary provisioning-parity test
+- [ ] ⬜ `changed=0` idempotence gate documented
+
+## Phase 2 — Tier-1 3-VM cluster
+
+- [ ] ⬜ Three arm64 VMs (multipass / Vagrant on the M4)
+- [ ] ⬜ Same `system_k3s`; Postgres + Gitea outside k3s on the pi2-equivalent VM
+- [ ] ⬜ Longhorn across the three VM disks
+- [ ] ⬜ Chaos drills: node-kill / Vault-seal / DB-drop
+- [ ] ⬜ First full `CLUSTER_RECOVERY` dry-run against the sandbox
+
+## Phase 3 — Game-day operationalization
+
+- [ ] ⬜ Monthly cadence + promotion gate in the PR checklist
+- [ ] ⬜ Longhorn engine-ID drill
+- [ ] ⬜ ArgoCD bad-sync rollback runbook
+- [ ] ⬜ Evidence trail for ≥1 cycle
+
+## Phase 4 — out of scope
+
+Not planned: dedicated physical node (4th Pi / mini-PC) and disposable cloud k3s for real public DNS/ACME. See [ADR 0001](../../ADR/0001-safe-prod-like-environment.md) for the rejected-alternatives rationale.
+
+## PRs
+
+| PR | Scope | Phase | Merged |
+| --- | --- | --- | --- |
+| _pending_ | Bootstrap the PRD tree (this `vibe/` set) — backfilled on open | — | ⬜ |
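The note under Phase 0 in `STATUS.md` above ("Must land before any sandbox run") could be checked mechanically by reading the checkboxes. This is a minimal sketch, assuming the tracker keeps the `- [ ]` / `- [x]` checklist format and the `## Phase N` headings shown in the diff. The helpers `checklist` and `phase_landed` are hypothetical and not part of the repo.

```python
import re

# One Markdown task-list item: "- [ ] text" or "- [x] text".
CHECKBOX = re.compile(r"^- \[( |x|X)\] (.*)$")


def checklist(status_markdown: str, phase_heading: str) -> list[tuple[bool, str]]:
    """Return (done, text) for each checkbox under the given '## ' heading."""
    items = []
    in_phase = False
    for line in status_markdown.splitlines():
        if line.startswith("## "):
            # Entering a new section: track whether it is the requested phase.
            in_phase = line.startswith(phase_heading)
            continue
        match = CHECKBOX.match(line)
        if in_phase and match:
            items.append((match.group(1) != " ", match.group(2)))
    return items


def phase_landed(status_markdown: str, phase_heading: str = "## Phase 0") -> bool:
    """True only if the phase has at least one item and all items are ticked."""
    items = checklist(status_markdown, phase_heading)
    # A missing or empty section counts as "not landed", never as a pass.
    return bool(items) and all(done for done, _ in items)
```

A bring-up script could call `phase_landed(open("STATUS.md").read())` and refuse to start the sandbox while it returns `False`.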
--- a/vibe/PRD/safe-prod-like-environment/isolation-boundary.md
+++ b/vibe/PRD/safe-prod-like-environment/isolation-boundary.md
@@ -0,0 +1,31 @@
+[vibe](../../README.md) > [PRD](../README.md) > [Safe, production-like environment](README.md) > **Isolation boundary**
+
+# Isolation boundary
+
+> **Status:** In design
+> **Last Updated:** 2026-06-23
+> **Upstream:** [Safe, production-like environment](README.md)
+> **Related:** [ADR 0001](../../ADR/0001-safe-prod-like-environment.md) · [INV-001 — prod blast-radius couplings](../../investigations/INV-001-prod-blast-radius-couplings.md)
+
+The isolation boundary is the load-bearing part of this PRD: the sandbox must be **unable to mutate real prod even on a wrong command**. Every prod coupling that a sandbox run could touch is mapped below to a concrete control. The boundary is the **cluster + Vault + state + DNS zone** — not the names (see the naming note).
+
+## Prod couplings → sandbox controls
+
+| Prod coupling | What it can break in prod | Sandbox control |
+| --- | --- | --- |
+| Ansible inventory `hosts.yml` → `192.168.1.201-203` | Wipe disks, reset k3s, corrupt Longhorn on the live Pis. | Separate `inventory/sandbox/hosts.yml` (sandbox VM hosts only) **plus** a pre-task guard that **aborts** if any target IP is in `192.168.1.201-203` unless `i_mean_prod=true` is set explicitly. |
+| OpenTofu state in `gs://arcodange-tf` (prefixes) | A sandbox apply rewrites live state and re-plans prod resources. | A sandbox prefix family (`sandbox/factory/main`, `sandbox/tools/...`, `sandbox/factory/postgres`) via a backend-config override, **or** a separate bucket `gs://arcodange-tf-sandbox`. Sandbox runs never touch prod state. |
+| Gitea provider `base_url` `gitea.arcodange.lab` + ArgoCD `repoURL` / `targetRevision` | Sandbox commits/pushes into the prod forge; ArgoCD syncs sandbox refs onto the prod cluster. | Sandbox Gitea on the sandbox cluster (or org `arcodange-sandbox`); the sandbox app-of-apps points at a **sandbox branch** so the sandbox cluster syncs only sandbox refs. |
+| Vault provider `address` `vault.arcodange.lab` + unseal key `~/.arcodange/cluster-keys.json` | Sandbox writes clobber prod policies/auth/mounts; a botched init overwrites the prod unseal key. | A **separate sandbox Vault**; override the unseal-key path to `~/.arcodange/sandbox/cluster-keys.json` so prod's key can never be overwritten. |
+| PostgreSQL provider `host` `192.168.1.202` (superuser) | Drop or alter live DBs — including ERP business records. | Sandbox PG is the docker-compose on the sandbox pi2-equivalent; a guard **refuses apply** if `host == 192.168.1.202` and `workspace != prod`. |
+| Cloudflare account / OVH `arcodange.fr` / Zoho live mail | A wrong MX/SPF/DKIM silently breaks `arcodange.fr` mail for days. | DNS/email modules run **plan-only** against a throwaway zone/subdomain with a separate token. The real `arcodange.fr` token is **never** exported into a sandbox shell. Real public DNS/ACME is out of scope. |
+| Longhorn backup bucket | A restore drill overwrites prod backups. | Sandbox backup target is a **separate bucket/prefix** so restore drills cannot overwrite prod backups. |
+
+## The `<app>` naming note
+
+The `<app>` key threads one kebab-case identifier through the Gitea repo, the PG db + role, the Vault paths/policies, the k8s namespace + SA, the ArgoCD Application, the GCS state prefix, and DNS — see [conventions](../../../doc/runbooks/new-web-app/conventions.md).
+
+Because `<app>` keys everything **within** a cluster / Vault / DB / zone, the sandbox can reuse **identical `<app>` names with no collision**. The isolation boundary is the cluster + Vault + state + DNS zone, not the names. This is deliberate: runbooks read **identically** in both environments, so a drill exercises the exact same convention chain an operator runs in prod.
+
+> [!CAUTION]
+> The real `arcodange.fr` Cloudflare token must **never** be exported into a sandbox shell. DNS/email work in the sandbox is plan-only against a throwaway zone with its own separate token. Exporting the prod token into a sandbox session would defeat the entire isolation boundary — a single `tofu apply` could rewrite live public DNS or mail records.
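Two rows of the isolation-boundary table above lend themselves to a pre-apply check: the PostgreSQL guard (`host == 192.168.1.202` and `workspace != prod`) and the sandbox unseal-key path. The sketch below models both under stated assumptions. The function name `boundary_violations` and the idea of passing the workspace as a plain string are hypothetical, and the unseal-key check is an extension of the table's intent rather than something the PRD specifies.

```python
from pathlib import PurePosixPath

# Values taken from the isolation-boundary table.
PROD_PG_HOST = "192.168.1.202"
PROD_UNSEAL_KEYS = PurePosixPath("~/.arcodange/cluster-keys.json")


def boundary_violations(workspace: str, pg_host: str, unseal_key_path: str) -> list[str]:
    """List the reasons a non-prod run must be refused (empty list = OK)."""
    if workspace == "prod":
        # Prod runs are out of scope for this sandbox guard.
        return []
    violations = []
    if pg_host == PROD_PG_HOST:
        violations.append(f"PostgreSQL host {pg_host} is prod")
    # Paths are compared textually; '~' is not expanded in this sketch.
    if PurePosixPath(unseal_key_path) == PROD_UNSEAL_KEYS:
        violations.append(f"unseal-key path {unseal_key_path} is the prod key")
    return violations
```

A wrapper around `tofu apply` could print the returned list and exit non-zero when it is not empty, so a sandbox session pointed at prod by mistake fails before any provider call is made.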
--- a/vibe/PRD/safe-prod-like-environment/qa-strategy.md
+++ b/vibe/PRD/safe-prod-like-environment/qa-strategy.md
@@ -0,0 +1,49 @@
+[vibe](../../README.md) > [PRD](../README.md) > [Safe, production-like environment](README.md) > **QA strategy**
+
+# QA strategy
+
+> **Status:** In design
+> **Last Updated:** 2026-06-23
+> **Upstream:** [Safe, production-like environment](README.md)
+> **Related:** [ADR 0001](../../ADR/0001-safe-prod-like-environment.md) · [Isolation boundary](isolation-boundary.md) · [INV-001](../../investigations/INV-001-prod-blast-radius-couplings.md)
+
+The sandbox is only useful if a green run there reliably predicts a green run in prod. That requires two things: the sandbox must stay **faithful** to prod (fidelity gates), and dangerous change-classes must be **rehearsed** before they ship (chaos drills + promotion gate).
+
+## Fidelity gates
+
+- **Parity manifest** — the sandbox must match prod on: k3s `v1.34.3+k3s1`, the same Longhorn / Vault / VSO versions, the same app-of-apps list, Postgres + Gitea running **outside** k3s, and three nodes. Any drift from this manifest is a failed gate.
+- **Provisioning-parity canary test** — run the [new-web-app runbook](../../../doc/runbooks/new-web-app/conventions.md) for a throwaway `<app>` named `canary` and assert the convention chain resolves end to end: Gitea repo → PG db + role → Vault creds + policies → ArgoCD `Healthy`/`Synced` → VSO injects → pod `Running`. One typo anywhere in the chain fails this test.
+- **Idempotence gate (changed=0)** — once a change has been applied to the sandbox, a follow-up `ansible-playbook --check --diff` run must report `changed=0` on every host **before** the same change is promoted to prod. A non-zero `changed` count means the play is not idempotent and is not ready.
+- **`tofu plan` diff gate** — `tofu plan` runs against **sandbox** state; for DNS it must assert the plan touches **only** the throwaway zone. A plan that proposes to touch anything else fails the gate.
+
+## Chaos drills
+
+Each drill maps to a section of `CLUSTER_RECOVERY.md`. The drill is "passed" only when the acceptance condition is met by following the runbook — any improvised step becomes a runbook PR.
+
+| Drill | Action | Acceptance | Recovery section |
+| --- | --- | --- | --- |
+| Node-kill | Stop one sandbox VM (agent, then server). | Workloads reschedule; cluster returns to `Ready`/`Healthy`. | Node-loss / reschedule section. |
+| Vault-seal | Seal the **sandbox** Vault. | Unseal via the **sandbox** key (`~/.arcodange/sandbox/cluster-keys.json`); VSO re-authenticates and resumes secret injection. | Vault unseal + VSO re-auth section. |
+| Longhorn volume corruption | Corrupt/recreate a sandbox volume. | Run `recover/longhorn*.yml`; validate engine-ID re-association per the [Longhorn PVC recovery ADR](../../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md). | Longhorn restore section. |
+| DB drop/restore | Drop a sandbox DB. | Restore from the sandbox backup bucket; app reconnects, data intact. | Postgres restore section. |
+| Full power-cut simulation | Cold-stop all three sandbox VMs. | Execute `CLUSTER_RECOVERY.md` top-to-bottom against the sandbox to green (Longhorn restore → Vault unseal → VSO re-auth → ERP scaled up last). | Whole runbook. |
+| ArgoCD bad-sync | Push a deliberately broken sandbox ref. | Observe `prune`/`selfHeal`; author/validate a rollback runbook section. | ArgoCD rollback (to be authored). |
+| Cert/PKI re-issue | Re-init the sandbox Step-CA. | Internal `*.arcodange.lab` (sandbox) certs re-issue and chains validate. | PKI re-issue section. |
+
+1. **Node-kill** stops a sandbox VM and confirms workloads reschedule and the cluster returns to a healthy state.
+2. **Vault-seal** seals the sandbox Vault, then unseals it with the **sandbox** key and confirms VSO re-authenticates and resumes injecting secrets.
+3. **Longhorn corruption** corrupts a sandbox volume, runs the `recover/longhorn*.yml` playbooks, and validates engine-ID re-association against the Longhorn PVC recovery ADR.
+4. **DB drop/restore** drops a sandbox database and restores it from the sandbox backup bucket, confirming the app reconnects with data intact.
+5. **Full power-cut** cold-stops all three sandbox VMs and runs `CLUSTER_RECOVERY.md` end to end to green, with the ERP scaled up last.
+6. **ArgoCD bad-sync** pushes a broken sandbox ref, observes `prune`/`selfHeal` behaviour, and produces a rollback runbook section.
+7. **Cert/PKI re-issue** re-initialises the sandbox Step-CA and confirms internal certificates re-issue and validate.
+
+## Recurring game-day & promotion gate
+
+The operator runs a **recurring monthly game-day**, following **only** the runbook. Any improvised step becomes a runbook PR. This is how `CLUSTER_RECOVERY.md` gets validated against the sandbox instead of waiting for the next real incident.
+
+**Promotion gate:** no infra / Vault / storage / DNS change reaches prod until it has been applied to the sandbox **and** survived the matching drill. This gate belongs in the PR checklist and crosslinks to [STATUS.md](STATUS.md).
+
+## Evidence trail
+
+Each drill records **time-to-recover** and **which step failed or was improvised**. The evidence trail accumulates over at least one game-day cycle and is the proof that a change-class is rehearsed and that the recovery runbook is current. Failed/improvised steps feed directly back into runbook PRs.
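The idempotence gate from the Fidelity gates section of `qa-strategy.md` above could be automated by reading the `PLAY RECAP` that `ansible-playbook` prints at the end of a run. This is a minimal sketch, assuming the default recap layout (`host : ok=N changed=N ...`). `changed_counts` and `idempotence_gate_passes` are hypothetical helper names.

```python
import re

# Matches one host line of a PLAY RECAP, for example:
#   sandbox-vm-1 : ok=12 changed=0 unreachable=0 failed=0
RECAP_LINE = re.compile(r"^(?P<host>\S+)\s*:\s.*\bchanged=(?P<changed>\d+)\b")


def changed_counts(ansible_output: str) -> dict[str, int]:
    """Map each host in the PLAY RECAP to its `changed` count."""
    counts = {}
    in_recap = False
    for line in ansible_output.splitlines():
        if line.startswith("PLAY RECAP"):
            in_recap = True
            continue
        # Only lines after the recap banner are considered host lines.
        match = RECAP_LINE.match(line) if in_recap else None
        if match:
            counts[match.group("host")] = int(match.group("changed"))
    return counts


def idempotence_gate_passes(ansible_output: str) -> bool:
    """True only if a recap was found and every host reports changed=0."""
    counts = changed_counts(ansible_output)
    # An empty recap is treated as a failure: no evidence is not a pass.
    return bool(counts) and all(n == 0 for n in counts.values())
```

The per-host counts returned by `changed_counts` could also be stored alongside the drill's time-to-recover as part of the evidence trail.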