docs(vibe): bootstrap vibe/ knowledge tree + ecosystem AGENTS.md

Add a root AGENTS.md (ecosystem map of factory/tools/cms + agent operating rules + the persona cohort & workflow) and a new vibe/ knowledge base for LLM agents, modeled on tree-docs conventions and the factory house style.

vibe/ folders (each with a README hub + contribution rules):
- ADR/ optimized MADR-lite; canonical home going forward (doc/adr stays historical)
- PRD/ one subfolder per PRD, mandatory STATUS.md, QA strategy for big ones
- investigations/ single INV-NNN-slug.md, or stub + folder w/ notebooks
- guidebooks/ tree-docs maps; lab-ecosystem guidebook of factory+tools+cms
- runbooks/ [AGENT]/[HUMAN] step procedures (EN; doc/runbooks stays FR)
- shareouts/ dated FR handouts (decks/mp4)

Seed content (first ADR + PRD): a safe, production-like environment to rehearse risky changes and recovery without touching real prod, built as a local-only sandbox (k3d + arm64 VMs) with a hard prod/sandbox isolation boundary. Includes INV-001 (prod blast-radius couplings), the ecosystem guidebook, and a FR shareout.

Conventions enforced: no-tombstone rule, breadcrumb spine, bidirectional cross-links, theme:base mermaid (MCP-validated) + ordered-list-after-diagram.

Built with a Workflow + persona cohort; 24 files, zero dead links.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 11:52:37 +02:00
parent 827af6b392
commit 7647a68cdc
25 changed files with 1878 additions and 0 deletions
--- a/vibe/guidebooks/lab-ecosystem/storage-and-recovery.md
+++ b/vibe/guidebooks/lab-ecosystem/storage-and-recovery.md
@@ -0,0 +1,76 @@
+[vibe](../../README.md) > [Guidebooks](../README.md) > [Lab ecosystem](README.md) > **Storage & recovery**
+
+# Storage & recovery
+
+> **Status**: 🟢 Active
+> **Last Updated**: 2026-06-23
+> **Related**: [Lab ecosystem](README.md) · [Secrets & Vault](secrets-and-vault.md) · [Factory brick](01-factory.md) · [PRD — QA strategy](../../PRD/safe-prod-like-environment/qa-strategy.md)
+> **Decision**: [ADR 0001 — Safe, production-like environment](../../ADR/0001-safe-prod-like-environment.md)
+> **Upstream (incident ADR)**: [Longhorn PVC recovery](../../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md)
+
+## TL;DR
+
+The lab keeps state in **two** places, on purpose. **Longhorn** provides distributed block storage *inside* k3s for everything cluster-native (app PVCs, Traefik's `acme.json`, the backup volume itself). **PostgreSQL and Gitea** deliberately persist on **pi2's local disk, outside k3s**, as plain docker-compose: they are the platform's own foundations and must not depend on the cluster they help run. This split survives a full power cut, but Longhorn has one sharp edge: when its CRDs are wiped and recreated, it assigns **new engine IDs** and cannot automatically re-associate the surviving on-disk replica files with the new volumes. The 2026-04-13 power cut taught us a **fixed startup order** (Longhorn first, then Vault unseal, then VSO re-auth, with **ERP scaled up last**) that brings the cluster back deterministically. That order is now rehearsed as a drill.
+
+## Two storage tiers, on purpose
+
+| Tier | Backing | What lives there | Why here |
+| --- | --- | --- | --- |
+| **In-cluster** | **Longhorn** (distributed block storage inside k3s, replicated across pi1/pi2/pi3) | App PVCs, Traefik certificates (`acme.json`), the cluster backup volume (`backups-rwx`) | Cluster-native workloads get replicated, snapshot-able volumes that follow the pod. |
+| **Outside the cluster** | **docker-compose on pi2's local disk** | PostgreSQL + Gitea | These are *foundations*: Gitea serves the GitOps source and Postgres backs the apps. They must survive — and start — **without** k3s being healthy, so they cannot live inside it. |
+
+This separation is the reason the platform can bootstrap itself: Gitea and Postgres come up on pi2 independently, and only then does the cluster (which pulls its config from Gitea) have something to sync against. See the [Factory brick](01-factory.md) for how the Ansible playbooks and the ArgoCD app-of-apps consume those foundations.
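
The bootstrap ordering described above can be pictured as a small dependency graph. The sketch below is illustrative only: the node names (`gitea`, `postgres`, `k3s-sync`, `apps`) are hypothetical labels chosen for this example, not identifiers from the lab's playbooks, and it uses Python's standard `graphlib` module to derive one valid start order.

```python
from graphlib import TopologicalSorter

# Hypothetical labels; each key maps to the set of things it needs first.
DEPENDS_ON = {
    "gitea": set(),                    # docker-compose on pi2, no cluster needed
    "postgres": set(),                 # docker-compose on pi2, no cluster needed
    "k3s-sync": {"gitea"},             # the cluster pulls its config from Gitea
    "apps": {"k3s-sync", "postgres"},  # apps run in-cluster, backed by Postgres
}


def bootstrap_order(graph):
    """Return one valid start order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(graph).static_order())


order = bootstrap_order(DEPENDS_ON)
print(order)
```

If the cluster were ever made a prerequisite of Gitea, the graph would contain a cycle and `static_order` would raise `CycleError`, which is the kind of circular dependency the two-tier split is meant to avoid.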
+
+## The Longhorn engine-ID re-association failure mode
+
+Longhorn stores each replica's data on a node in a directory named **`<volume-name>-<engine-id>`**. The raw `volume-head-*.img` files are durable: they remain intact on disk through a power cut. The danger is in the *metadata*, not the data:
+
+1. A power cut drops the Longhorn CSI driver.
+2. Recovering the stuck pods forces deletion of Longhorn's CRDs (Volume / Engine / Replica), because a webhook circular dependency makes a clean shutdown impossible.
+3. Reinstalling Longhorn recreates the Volume CRDs, but with **new engine IDs**.
+4. Longhorn creates **new, empty** replica directories under the new engine IDs and **does not adopt** the old, data-bearing directories.
+
+The result: the real data sits in an orphaned `…-<old-id>/` directory while Longhorn happily serves an empty `…-<new-id>/`. Worse, a naive directory rename can backfire — Longhorn reconciliation may find a `Dirty: true` orphan alongside a clean empty replica and **silently rebuild from the empty one, destroying the data**. The proven safe path is the automated **block-device injection** (Method D): create a fresh volume, attach it in maintenance mode, and `rsync` the recovered, layer-merged image into the live device — never renaming the orphaned directories. The full method comparison, the `playbooks/recover/longhorn_data.yml` automation, and the prevention work (the backup playbook now captures Longhorn Volume CRDs for fast `kubectl apply` restore) are documented in the [Longhorn PVC recovery ADR](../../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md).
+
+> [!CAUTION]
+> Do **not** recover a Longhorn volume by renaming the orphaned replica directory to the new engine ID. Reconciliation can pick the empty replica as the rebuild source and overwrite your data. Use the block-device injection playbook instead.
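
As a rough illustration of the failure mode, the sketch below groups replica directory names by volume and flags any volume that has more than one directory, which is the symptom of an orphaned, data-bearing replica sitting next to a freshly created one. It assumes the `<volume-name>-<engine-id>` naming described above and, as a further assumption, that the ID is whatever follows the last hyphen. The sample names are made up, and the code only reads names: it never renames or touches anything.

```python
from collections import defaultdict


def split_replica_dir(name):
    """Split '<volume-name>-<engine-id>' at the last hyphen (an assumption)."""
    volume, sep, engine_id = name.rpartition("-")
    if not sep or not volume or not engine_id:
        raise ValueError(f"not a replica directory name: {name!r}")
    return volume, engine_id


def find_suspect_volumes(dir_names):
    """Map each volume that has several replica directories to its sorted IDs."""
    by_volume = defaultdict(list)
    for name in dir_names:
        volume, engine_id = split_replica_dir(name)
        by_volume[volume].append(engine_id)
    return {v: sorted(ids) for v, ids in by_volume.items() if len(ids) > 1}


# Made-up listing of one node's replica directory.
sample = ["pvc-erp-data-1a2b3c4d", "pvc-erp-data-9f8e7d6c", "pvc-traefik-55aa66bb"]
suspects = find_suspect_volumes(sample)
print(suspects)
```

The name alone cannot tell which directory holds the real data, so the output is only a prompt to inspect the directories and follow the recovery ADR.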
+
+## The tested 2026-04-13 power-cut recovery
+
+The cluster was recovered end to end after the 2026-04-13 power cut, and the sequence was distilled into a deterministic startup order. The order is not arbitrary: each step is a **dependency gate** for the next:
+
+```mermaid
+%%{init: {'theme':'base'}}%%
+flowchart TD
+    PC["Power cut<br/>(cluster down, disks intact)"]:::dead
+
+    PC --> LH["1 · Restore Longhorn<br/>volumes (block-device<br/>injection if engine IDs changed)"]:::store
+    LH --> VU["2 · Unseal Vault<br/>(1 key, threshold 1,<br/>key on the Mac)"]:::proc
+    VU --> VSO["3 · VSO re-auth<br/>(k8s auth → fresh<br/>dynamic creds)"]:::proc
+    VSO --> ERP["4 · Scale up ERP<br/>last (depends on DB +<br/>injected secrets)"]:::src
+
+    classDef dead fill:#6b7280,stroke:#4b5563,color:#fff
+    classDef store fill:#7c3aed,stroke:#6d28d9,color:#fff
+    classDef proc fill:#059669,stroke:#047857,color:#fff
+    classDef src fill:#2563eb,stroke:#1e40af,color:#fff
+```
+
+1. **Restore Longhorn first.** Persistent volumes must be back and attachable before any stateful workload starts. If the engine IDs changed (the failure mode above), recover the data with the block-device injection playbook before proceeding. Nothing that mounts a PVC can come up until this is done.
+2. **Unseal Vault.** Vault restarts **sealed** and serves nothing until a human unseals it with the single key from `~/.arcodange/cluster-keys.json` (threshold 1). This is the secret-flow chokepoint — see [Secrets & Vault](secrets-and-vault.md). No secret consumer recovers before this step.
+3. **VSO re-authenticates.** Once Vault is unsealed, the Vault Secrets Operator re-authenticates over the Kubernetes auth backend and re-issues the dynamic credentials (notably fresh `postgres/creds/<app>` leases) that workloads need. Until VSO has re-populated the Kubernetes Secrets, apps would start with stale or missing credentials.
+4. **Scale up ERP last.** ERP is the most dependency-heavy app — it needs both the database (on pi2) reachable and its Vault-injected secrets present. Bringing it up only after steps 1–3 are confirmed avoids a crash-loop against a half-recovered platform.
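
The dependency-gate idea can be sketched as a tiny runner that executes checks in the fixed order and stops at the first one that fails, so a later step is never attempted against a half-recovered platform. The check functions here are placeholders returning hard-coded booleans; in a real drill each would wrap whatever probe the operator trusts. All names below are hypothetical.

```python
from typing import Callable, List, Tuple

Gate = Tuple[str, Callable[[], bool]]


def run_gates(gates: List[Gate]) -> List[str]:
    """Run gates in order; return the names that passed, stopping at the first failure."""
    passed = []
    for name, check in gates:
        if not check():
            print(f"gate failed: {name}; stopping, later steps not attempted")
            break
        print(f"gate passed: {name}")
        passed.append(name)
    return passed


# Placeholder checks: in this pretend run, Vault is still sealed.
gates = [
    ("longhorn volumes restored", lambda: True),
    ("vault unsealed", lambda: False),
    ("vso re-authenticated", lambda: True),
    ("erp scaled up", lambda: True),
]
result = run_gates(gates)
```

Because the runner breaks on the first failed gate, the VSO and ERP checks are never called while Vault is sealed, which mirrors the ordering rule in the list above.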
+
+The single source of record for this drill (Longhorn restore, Vault unseal, VSO re-auth, ERP scaled up last, plus the 1-key/threshold-1 unseal detail) is CLUSTER_RECOVERY.md, kept at the lab root, outside this repo.
+
+## Why this is rehearsed in the sandbox
+
+A recovery procedure that has only been run once, under the stress of a real outage, is a liability. The production-like sandbox exists partly so this exact sequence can be **rehearsed deliberately** — kill the cluster, lose the engine IDs on a test volume, force a sealed Vault, and walk the four-step order back to green — without risking production data or a live ERP. That makes the drill a routine QA exercise rather than a one-shot incident memory. The QA approach for these drills is laid out in the PRD's [QA strategy](../../PRD/safe-prod-like-environment/qa-strategy.md), and the overall decision to maintain such an environment is [ADR 0001](../../ADR/0001-safe-prod-like-environment.md).
+
+## See also
+
+- [Longhorn PVC recovery ADR](../../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md) — engine-ID failure mode, the five recovery methods, and the block-device injection automation.
+- [Secrets & Vault](secrets-and-vault.md) — the unseal model and why it gates step 2 of the recovery order.
+- [Factory brick](01-factory.md) — the Ansible recover/ playbooks, the ArgoCD app-of-apps, and the Postgres-on-pi2 foundation.
+- [PRD — QA strategy](../../PRD/safe-prod-like-environment/qa-strategy.md) — how recovery drills become routine QA.
+- [ADR 0001 — Safe, production-like environment](../../ADR/0001-safe-prod-like-environment.md).
+- CLUSTER_RECOVERY.md — the tested power-cut recovery record (lab root, outside this repo).