docs(vibe): bootstrap vibe/ knowledge tree + ecosystem AGENTS.md

Add a root AGENTS.md (ecosystem map of factory/tools/cms + agent operating
rules + the persona cohort & workflow) and a new vibe/ knowledge base for LLM
agents, modeled on tree-docs conventions and the factory house style.

vibe/ folders (each with a README hub + contribution rules):
- ADR/      optimized MADR-lite; canonical home going forward (doc/adr stays historical)
- PRD/      one subfolder per PRD, mandatory STATUS.md, QA strategy for big ones
- investigations/  single INV-NNN-slug.md, or stub + folder w/ notebooks
- guidebooks/      tree-docs maps; lab-ecosystem guidebook of factory+tools+cms
- runbooks/        [AGENT]/[HUMAN] step procedures (EN; doc/runbooks stays FR)
- shareouts/       dated FR handouts (decks/mp4)

Seed content (first ADR + PRD): a safe, production-like environment to rehearse
risky changes and recovery without touching real prod — local-only sandbox
(k3d + arm64 VMs) with a hard prod/sandbox isolation boundary. Includes
INV-001 (prod blast-radius couplings), the ecosystem guidebook, and a FR shareout.

Conventions enforced: no-tombstone rule, breadcrumb spine, bidirectional
cross-links, theme:base mermaid (MCP-validated) + ordered-list-after-diagram.
Built with a Workflow + persona cohort; 24 files, zero dead links.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
2026-06-23 11:52:37 +02:00
parent 827af6b392
commit 7647a68cdc
25 changed files with 1878 additions and 0 deletions

View File

@@ -0,0 +1,69 @@
[vibe](../../README.md) > [PRD](../README.md) > **Safe, production-like environment**
# Safe, production-like environment
> **Status:** In design
> **Last Updated:** 2026-06-23
> **Design record:** [ADR 0001 — Safe, production-like environment](../../ADR/0001-safe-prod-like-environment.md)
> **Adjacent:** [INV-001 — prod blast-radius couplings](../../investigations/INV-001-prod-blast-radius-couplings.md)
> **Map:** [Lab ecosystem guidebook](../../guidebooks/lab-ecosystem/README.md)
## Problem
The lab doubles as company production. The same three Raspberry Pis that host experiments also run the public CMS ([arcodange.fr](https://gitea.arcodange.lab/arcodange-org/cms)), Zoho-backed email, the Dolibarr ERP (accounting and business records), the url-shortener, the telegram-gateway, and the dance-lessons-coach. A single operator administers all of it from one MacBook that holds the kubeconfig, the Vault root, and the cloud admin tokens.
There is no separation between "where I experiment" and "where the business runs": every risky change is currently tested directly in prod. The 2026-04-13 power-cut proved that recovery is manual, non-trivial, and validated only by real incidents. A wrong Ansible play, Vault policy, Postgres migration, ArgoCD sync, or DNS record can silently break the business for days.
## Users & personas
A **single operator wearing two hats**:
- **The inner-loop developer** — iterating on an app, a Helm chart, a Vault policy, or an ArgoCD app. Wants a fast feedback cycle (seconds, not a fleet round-trip) and zero fear of touching the wrong thing.
- **The game-day / recovery operator** — rehearsing dangerous change-classes and disaster recovery (node-kill, Vault-seal, Longhorn corruption, DB drop, full power-cut). Wants high fidelity to the real topology so a drill actually predicts prod behaviour, and a runbook that gets validated against the sandbox instead of waiting for the next outage.
Both hats belong to the same person on the same laptop. The environment must serve the fast loop and the faithful loop without ever letting either reach into prod.
## Goals & non-goals
**Goals**
- Rehearse **every dangerous change-class** (Ansible, Vault, Postgres, ArgoCD, Cloudflare/DNS, Longhorn, recovery drills, PKI) with **zero prod blast radius**.
- Validate `CLUSTER_RECOVERY.md` against the **sandbox**, not against prod — turn the recovery runbook from incident-validated into drill-validated.
- A **fast inner loop** for app / Vault / ArgoCD iteration, seeded from the same GitOps repos as prod.
- Run on the **existing control node** (the MacBook) at **$0** marginal cost.
**Non-goals**
- No real `arcodange.fr` DNS or email tests. DNS/email modules run **plan-only against a throwaway zone**; real public DNS/ACME end-to-end is out of scope.
- No **physical-node tier** (4th Pi / mini-PC) — explicitly rejected, not deferred.
- No **cloud tier** for disposable real-DNS clusters — explicitly rejected, not deferred.
- Not a **performance-benchmark** environment — fidelity is about topology and the convention chain, not throughput numbers on laptop-class hardware.
## Requirements
Functional:
- **One-command bring-up**, seeded from the same GitOps repos as prod.
- **Sandbox inventory + guards** — a separate `inventory/sandbox/hosts.yml` plus a pre-task guard that aborts on any prod IP (`192.168.1.201-203`) unless `i_mean_prod=true`. See [Isolation boundary](isolation-boundary.md).
- **Parity manifest** — same k3s `v1.34.3+k3s1`, same Longhorn/Vault/VSO, same app-of-apps list, Postgres + Gitea outside k3s, three nodes. Drift = fail.
- **Two modes** seeded from the same repos via the sandbox inventory:
- **(a) k3d single-node fast inner loop** (~60s bring-up) for app / Vault / ArgoCD iteration.
- **(b) 3 arm64 VMs** (multipass or Vagrant on the M4) reproducing the 3-node topology — Postgres + Gitea as docker-compose outside k3s on the pi2-equivalent VM, Longhorn across the three VM disks — for Ansible / Longhorn / recovery work.
The **isolation boundary** (the table mapping each prod coupling to its sandbox control) and the **`<app>` naming note** are detailed in [isolation-boundary.md](isolation-boundary.md).
## QA strategy
Fidelity gates (parity manifest, a `canary` provisioning-parity test that drives the new-web-app runbook end to end, an `ansible-playbook --check --diff` changed=0 idempotence gate, and a `tofu plan` diff gate) plus chaos drills mapped section-by-section to `CLUSTER_RECOVERY.md`, a recurring monthly game-day, and a promotion gate: no infra / Vault / storage / DNS change reaches prod until it has been applied to the sandbox **and** survived the matching drill. Full detail in [qa-strategy.md](qa-strategy.md).
## Implementation status
Rollout is phased (Phase 0 guardrails → Phase 1 k3d → Phase 2 3-VM → Phase 3 game-day; Phase 4 out of scope). Live tracker: [STATUS.md](STATUS.md).
## Leaves
| Page | Summary | Status |
| --- | --- | --- |
| [Isolation boundary](isolation-boundary.md) | Prod-coupling → sandbox-control table; the `<app>` naming note; the token caution. | 🟡 In design |
| [QA strategy](qa-strategy.md) | Fidelity gates, chaos-drill table, game-day cadence, promotion gate, evidence trail. | 🟡 In design |
| [STATUS](STATUS.md) | Phase tracker (all not-started) and PR log. | ⬜ Not started |

View File

@@ -0,0 +1,52 @@
[vibe](../../README.md) > [PRD](../README.md) > [Safe, production-like environment](README.md) > **STATUS**
# STATUS — Safe, production-like environment
> **Last Updated:** 2026-06-23
Legend: ⬜ not started · 🟡 in progress · ✅ done
> [!IMPORTANT]
> This file MUST be updated whenever something ships. Every PR that advances a phase crosslinks back here (and the matching checkbox flips), and the [PRs](#prs) table gets a row.
## Phase 0 — Isolation guardrails
*Must land before any sandbox run.*
- [ ] ⬜ Sandbox inventory `inventory/sandbox/hosts.yml` (VM/cloud hosts only)
- [ ] ⬜ Prod-IP abort guard (aborts on `192.168.1.201-203` unless `i_mean_prod=true`)
- [ ] ⬜ Sandbox GCS state prefixes (`sandbox/...`) or `gs://arcodange-tf-sandbox`
- [ ] ⬜ Sandbox Vault unseal-key path (`~/.arcodange/sandbox/cluster-keys.json`)
- [ ] ⬜ Sandbox env profile / plan-only DNS against a throwaway zone
## Phase 1 — Tier-1 k3d fast mode
- [ ] ⬜ One-command bring-up seeded from GitOps
- [ ] ⬜ Parity manifest v1
- [ ] ⬜ Canary provisioning-parity test
- [ ]`changed=0` idempotence gate documented
## Phase 2 — Tier-1 3-VM cluster
- [ ] ⬜ Three arm64 VMs (multipass / Vagrant on the M4)
- [ ] ⬜ Same `system_k3s`; Postgres + Gitea outside k3s on the pi2-equivalent VM
- [ ] ⬜ Longhorn across the three VM disks
- [ ] ⬜ Chaos drills: node-kill / Vault-seal / DB-drop
- [ ] ⬜ First full `CLUSTER_RECOVERY` dry-run against the sandbox
## Phase 3 — Game-day operationalization
- [ ] ⬜ Monthly cadence + promotion gate in the PR checklist
- [ ] ⬜ Longhorn engine-ID drill
- [ ] ⬜ ArgoCD bad-sync rollback runbook
- [ ] ⬜ Evidence trail for ≥1 cycle
## Phase 4 — out of scope
Not planned: dedicated physical node (4th Pi / mini-PC) and disposable cloud k3s for real public DNS/ACME. See [ADR 0001](../../ADR/0001-safe-prod-like-environment.md) for the rejected-alternatives rationale.
## PRs
| PR | Scope | Phase | Merged |
| --- | --- | --- | --- |
| _pending_ | Bootstrap the PRD tree (this `vibe/` set) — backfilled on open | — | ⬜ |

View File

@@ -0,0 +1,31 @@
[vibe](../../README.md) > [PRD](../README.md) > [Safe, production-like environment](README.md) > **Isolation boundary**
# Isolation boundary
> **Status:** In design
> **Last Updated:** 2026-06-23
> **Upstream:** [Safe, production-like environment](README.md)
> **Related:** [ADR 0001](../../ADR/0001-safe-prod-like-environment.md) · [INV-001 — prod blast-radius couplings](../../investigations/INV-001-prod-blast-radius-couplings.md)
The isolation boundary is the load-bearing part of this PRD: the sandbox must be **unable to mutate real prod even on a wrong command**. Every prod coupling that a sandbox run could touch is mapped below to a concrete control. The boundary is the **cluster + Vault + state + DNS zone** — not the names (see the naming note).
## Prod couplings → sandbox controls
| Prod coupling | What it can break in prod | Sandbox control |
| --- | --- | --- |
| Ansible inventory `hosts.yml``192.168.1.201-203` | Wipe disks, reset k3s, corrupt Longhorn on the live Pis. | Separate `inventory/sandbox/hosts.yml` (VM/cloud hosts only) **plus** a pre-task guard that **aborts** if any target IP is in `192.168.1.201-203` unless `i_mean_prod=true` is set explicitly. |
| OpenTofu state in `gs://arcodange-tf` (prefixes) | A sandbox apply rewrites live state and re-plans prod resources. | A sandbox prefix family (`sandbox/factory/main`, `sandbox/tools/...`, `sandbox/factory/postgres`) via a backend-config override, **or** a separate bucket `gs://arcodange-tf-sandbox`. Sandbox runs never touch prod state. |
| Gitea provider `base_url` `gitea.arcodange.lab` + ArgoCD `repoURL` / `targetRevision` | Sandbox commits/pushes into the prod forge; ArgoCD syncs sandbox refs onto the prod cluster. | Sandbox Gitea on the sandbox cluster (or org `arcodange-sandbox`); the sandbox app-of-apps points at a **sandbox branch** so the sandbox cluster syncs only sandbox refs. |
| Vault provider `address` `vault.arcodange.lab` + unseal key `~/.arcodange/cluster-keys.json` | Sandbox writes clobber prod policies/auth/mounts; a botched init overwrites the prod unseal key. | A **separate sandbox Vault**; override the unseal-key path to `~/.arcodange/sandbox/cluster-keys.json` so prod's key can never be overwritten. |
| PostgreSQL provider `host` `192.168.1.202` (superuser) | Drop or alter live DBs — including ERP business records. | Sandbox PG is the docker-compose on the sandbox pi2-equivalent; a guard **refuses apply** if `host == 192.168.1.202` and `workspace != prod`. |
| Cloudflare account / OVH `arcodange.fr` / Zoho live mail | A wrong MX/SPF/DKIM silently breaks `arcodange.fr` mail for days. | DNS/email modules run **plan-only** against a throwaway zone/subdomain with a separate token. The real `arcodange.fr` token is **never** exported into a sandbox shell. Real public DNS/ACME is out of scope. |
| Longhorn backup bucket | A restore drill overwrites prod backups. | Sandbox backup target is a **separate bucket/prefix** so restore drills cannot overwrite prod backups. |
## The `<app>` naming note
The `<app>` key threads one kebab-case identifier through the Gitea repo, the PG db + role, the Vault paths/policies, the k8s namespace + SA, the ArgoCD Application, the GCS state prefix, and DNS — see [conventions](../../../doc/runbooks/new-web-app/conventions.md).
Because `<app>` keys everything **within** a cluster / Vault / DB / zone, the sandbox can reuse **identical `<app>` names with no collision**. The isolation boundary is the cluster + Vault + state + DNS zone, not the names. This is deliberate: runbooks read **identically** in both environments, so a drill exercises the exact same convention chain an operator runs in prod.
> [!CAUTION]
> The real `arcodange.fr` Cloudflare token must **never** be exported into a sandbox shell. DNS/email work in the sandbox is plan-only against a throwaway zone with its own separate token. Exporting the prod token into a sandbox session would defeat the entire isolation boundary — a single `tofu apply` could rewrite live public DNS or mail records.

View File

@@ -0,0 +1,49 @@
[vibe](../../README.md) > [PRD](../README.md) > [Safe, production-like environment](README.md) > **QA strategy**
# QA strategy
> **Status:** In design
> **Last Updated:** 2026-06-23
> **Upstream:** [Safe, production-like environment](README.md)
> **Related:** [ADR 0001](../../ADR/0001-safe-prod-like-environment.md) · [Isolation boundary](isolation-boundary.md) · [INV-001](../../investigations/INV-001-prod-blast-radius-couplings.md)
The sandbox is only useful if a green run there reliably predicts a green run in prod. That requires two things: the sandbox must stay **faithful** to prod (fidelity gates), and dangerous change-classes must be **rehearsed** before they ship (chaos drills + promotion gate).
## Fidelity gates
- **Parity manifest** — the sandbox must match prod on: k3s `v1.34.3+k3s1`, the same Longhorn / Vault / VSO versions, the same app-of-apps list, Postgres + Gitea running **outside** k3s, and three nodes. Any drift from this manifest is a failed gate.
- **Provisioning-parity canary test** — run the [new-web-app runbook](../../../doc/runbooks/new-web-app/conventions.md) for a throwaway `<app>` named `canary` and assert the convention chain resolves end to end: Gitea repo → PG db + role → Vault creds + policies → ArgoCD `Healthy`/`Synced` → VSO injects → pod `Running`. One typo anywhere in the chain fails this test.
- **Idempotence gate (changed=0)** — `ansible-playbook --check --diff` must report `changed=0` on the converged sandbox **before** the same change is promoted to prod. A non-zero diff means the play is not idempotent and is not ready.
- **`tofu plan` diff gate** — `tofu plan` runs against **sandbox** state; for DNS it must assert the plan touches **only** the throwaway zone. A plan that proposes to touch anything else fails the gate.
## Chaos drills
Each drill maps to a section of `CLUSTER_RECOVERY.md`. The drill is "passed" only when the acceptance condition is met by following the runbook — any improvised step becomes a runbook PR.
| Drill | Action | Acceptance | Recovery section |
| --- | --- | --- | --- |
| Node-kill | Stop one sandbox VM (agent, then server). | Workloads reschedule; cluster returns to `Ready`/`Healthy`. | Node-loss / reschedule section. |
| Vault-seal | Seal the **sandbox** Vault. | Unseal via the **sandbox** key (`~/.arcodange/sandbox/cluster-keys.json`); VSO re-authenticates and resumes secret injection. | Vault unseal + VSO re-auth section. |
| Longhorn volume corruption | Corrupt/recreate a sandbox volume. | Run `recover/longhorn*.yml`; validate engine-ID re-association per the [Longhorn PVC recovery ADR](../../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md). | Longhorn restore section. |
| DB drop/restore | Drop a sandbox DB. | Restore from the sandbox backup bucket; app reconnects, data intact. | Postgres restore section. |
| Full power-cut simulation | Cold-stop all three sandbox VMs. | Execute `CLUSTER_RECOVERY.md` top-to-bottom against the sandbox to green (Longhorn restore → Vault unseal → VSO re-auth → ERP scaled up last). | Whole runbook. |
| ArgoCD bad-sync | Push a deliberately broken sandbox ref. | Observe `prune`/`selfHeal`; author/validate a rollback runbook section. | ArgoCD rollback (to be authored). |
| Cert/PKI re-issue | Re-init the sandbox Step-CA. | Internal `*.arcodange.lab` (sandbox) certs re-issue and chains validate. | PKI re-issue section. |
1. **Node-kill** stops a sandbox VM and confirms workloads reschedule and the cluster returns to a healthy state.
2. **Vault-seal** seals the sandbox Vault, then unseals it with the **sandbox** key and confirms VSO re-authenticates and resumes injecting secrets.
3. **Longhorn corruption** corrupts a sandbox volume, runs the `recover/longhorn*.yml` playbooks, and validates engine-ID re-association against the Longhorn PVC recovery ADR.
4. **DB drop/restore** drops a sandbox database and restores it from the sandbox backup bucket, confirming the app reconnects with data intact.
5. **Full power-cut** cold-stops all three sandbox VMs and runs `CLUSTER_RECOVERY.md` end to end to green, with the ERP scaled up last.
6. **ArgoCD bad-sync** pushes a broken sandbox ref, observes `prune`/`selfHeal` behaviour, and produces a rollback runbook section.
7. **Cert/PKI re-issue** re-initialises the sandbox Step-CA and confirms internal certificates re-issue and validate.
## Recurring game-day & promotion gate
A **recurring monthly game-day** where the operator follows **only** the runbook. Any improvised step becomes a runbook PR — this is how `CLUSTER_RECOVERY.md` gets validated against the sandbox instead of waiting for the next real incident.
**Promotion gate:** no infra / Vault / storage / DNS change reaches prod until it has been applied to the sandbox **and** survived the matching drill. This gate belongs in the PR checklist and crosslinks to [STATUS.md](STATUS.md).
## Evidence trail
Each drill records **time-to-recover** and **which step failed or was improvised**. The evidence trail accumulates over at least one game-day cycle and is the proof that a change-class is rehearsed and that the recovery runbook is current. Failed/improvised steps feed directly back into runbook PRs.