factory/vibe/guidebooks/lab-ecosystem/storage-and-recovery.md
Gabriel Radureau 7647a68cdc docs(vibe): bootstrap vibe/ knowledge tree + ecosystem AGENTS.md
2026-06-23 11:52:37 +02:00

[vibe](../../README.md) > [Guidebooks](../README.md) > [Lab ecosystem](README.md) > **Storage & recovery**
# Storage & recovery
> **Status**: 🟢 Active
> **Last Updated**: 2026-06-23
> **Related**: [Lab ecosystem](README.md) · [Secrets & Vault](secrets-and-vault.md) · [Factory brick](01-factory.md) · [PRD — QA strategy](../../PRD/safe-prod-like-environment/qa-strategy.md)
> **Decision**: [ADR 0001 — Safe, production-like environment](../../ADR/0001-safe-prod-like-environment.md)
> **Upstream (incident ADR)**: [Longhorn PVC recovery](../../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md)
## TL;DR
The lab keeps state in **two** places, on purpose. **Longhorn** provides distributed block storage *inside* k3s for everything cluster-native (app PVCs, Traefik's `acme.json`, the backup volume itself). **PostgreSQL and Gitea** deliberately persist on **pi2's local disk, outside k3s**, as plain docker-compose — they are the platform's own foundations and must not depend on the cluster they help run. This split survives a full power cut, but Longhorn has one sharp edge: when its CRDs are wiped and recreated, it assigns **new engine IDs** and cannot automatically re-associate the surviving on-disk replica files with the new volumes. The 2026-04-13 power-cut taught us a **fixed startup order** — Longhorn first, then Vault unseal, then VSO re-auth, with **ERP scaled up last** — that brings the cluster back deterministically. That order is now rehearsed as a drill.
## Two storage tiers, on purpose
| Tier | Backing | What lives there | Why here |
| --- | --- | --- | --- |
| **In-cluster** | **Longhorn** (distributed block storage inside k3s, replicated across pi1/pi2/pi3) | App PVCs, Traefik certificates (`acme.json`), the cluster backup volume (`backups-rwx`) | Cluster-native workloads get replicated, snapshot-able volumes that follow the pod. |
| **Outside the cluster** | **docker-compose on pi2's local disk** | PostgreSQL + Gitea | These are *foundations*: Gitea serves the GitOps source and Postgres backs the apps. They must survive — and start — **without** k3s being healthy, so they cannot live inside it. |
This separation is the reason the platform can bootstrap itself: Gitea and Postgres come up on pi2 independently, and only then does the cluster (which pulls its config from Gitea) have something to sync against. See the [Factory brick](01-factory.md) for how the Ansible playbooks and the ArgoCD app-of-apps consume those foundations.
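The bootstrap argument above can be sketched as a small dependency graph. This is an illustrative model only: the node names (`gitea`, `postgres`, `k3s`, `gitops-sync`, `apps`) and the edges between them are assumptions inferred from the prose, not read from the Ansible playbooks or the ArgoCD configuration.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical dependency map: each key may start only after all of its
# values are up. Edges are inferred from the prose, not from real config.
DEPENDS_ON = {
    "gitea": set(),                      # docker-compose on pi2, no cluster needed
    "postgres": set(),                   # docker-compose on pi2, no cluster needed
    "k3s": set(),                        # the cluster itself
    "gitops-sync": {"k3s", "gitea"},     # the cluster pulls its config from Gitea
    "apps": {"gitops-sync", "postgres"},
}


def bootstrap_order(depends_on):
    """Return one valid start order; raises CycleError if none exists."""
    return list(TopologicalSorter(depends_on).static_order())


def has_cycle(depends_on):
    """True when the graph cannot be bootstrapped from a cold start."""
    try:
        bootstrap_order(depends_on)
    except CycleError:
        return True
    return False


if __name__ == "__main__":
    print(bootstrap_order(DEPENDS_ON))
    # Counterfactual: Gitea deployed as a GitOps-managed workload in k3s.
    in_cluster = dict(DEPENDS_ON, gitea={"gitops-sync"})
    print("cycle if Gitea lived in the cluster:", has_cycle(in_cluster))
```

In this model, moving Gitea inside the cluster makes it depend on the sync that itself depends on Gitea, so no cold-start order exists. That is the circularity the two-tier split avoids.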
## The Longhorn engine-ID re-association failure mode
Longhorn stores each replica's data on a node in a directory named **`<volume-name>-<engine-id>`**. The raw `volume-head-*.img` files are durable — they survive a power cut on the disk. The danger is in the *metadata*, not the data:
1. A power cut drops the Longhorn CSI driver.
2. Recovering the stuck pods forces a delete of Longhorn's CRDs (Volume / Engine / Replica) — a webhook circular dependency makes a clean shutdown impossible.
3. Reinstalling Longhorn recreates the Volume CRDs, but with **new engine IDs**.
4. Longhorn creates **new, empty** replica directories under the new engine IDs and **does not adopt** the old, data-bearing directories.
The result: the real data sits in an orphaned `…-<old-id>/` directory while Longhorn happily serves an empty `…-<new-id>/`. Worse, a naive directory rename can backfire — Longhorn reconciliation may find a `Dirty: true` orphan alongside a clean empty replica and **silently rebuild from the empty one, destroying the data**. The proven safe path is the automated **block-device injection** (Method D): create a fresh volume, attach it in maintenance mode, and `rsync` the recovered, layer-merged image into the live device — never renaming the orphaned directories. The full method comparison, the `playbooks/recover/longhorn_data.yml` automation, and the prevention work (the backup playbook now captures Longhorn Volume CRDs for fast `kubectl apply` restore) are documented in the [Longhorn PVC recovery ADR](../../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md).
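As a read-only aid for spotting this situation, the sketch below groups replica directory names by volume and flags the ones whose suffix does not match the engine ID Longhorn currently reports. It assumes the `<volume-name>-<engine-id>` layout described above and that the ID itself contains no hyphen. The function names and the sample volume names are hypothetical, and the code never renames or modifies anything.

```python
from collections import defaultdict


def split_replica_dir(name):
    """Split '<volume-name>-<engine-id>' on the LAST hyphen (assumed layout)."""
    volume, sep, engine_id = name.rpartition("-")
    if not sep or not volume or not engine_id:
        raise ValueError(f"not a replica directory name: {name!r}")
    return volume, engine_id


def find_orphan_candidates(dir_names, current_engine_ids):
    """Report directories whose ID is not the one currently in use.

    dir_names: directory names found under a node's replica path.
    current_engine_ids: {volume_name: engine_id} as reported now.
    Returns {volume_name: [orphan candidate directory names]}.
    Directories for volumes missing from current_engine_ids are also
    reported, since nothing currently claims them.
    """
    orphans = defaultdict(list)
    for name in dir_names:
        volume, engine_id = split_replica_dir(name)
        if current_engine_ids.get(volume) != engine_id:
            orphans[volume].append(name)
    return dict(orphans)


if __name__ == "__main__":
    # Made-up names for illustration only.
    dirs = ["pvc-erp-data-aaaa1111", "pvc-erp-data-bbbb2222", "pvc-traefik-cccc3333"]
    current = {"pvc-erp-data": "bbbb2222", "pvc-traefik": "cccc3333"}
    print(find_orphan_candidates(dirs, current))
```

A flagged directory is only a candidate for holding the real data. The recovery itself still goes through the block-device injection playbook referenced above.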
> [!CAUTION]
> Do **not** recover a Longhorn volume by renaming the orphaned replica directory to the new engine ID. Reconciliation can pick the empty replica as the rebuild source and overwrite your data. Use the block-device injection playbook instead.
## The tested 2026-04-13 power-cut recovery
The April 13, 2026 power cut was recovered end to end and the sequence was distilled into a deterministic startup order. The order is not arbitrary — each step is a **dependency gate** for the next:
```mermaid
%%{init: {'theme':'base'}}%%
flowchart TD
PC["Power cut<br/>(cluster down, disks intact)"]:::dead
PC --> LH["1 · Restore Longhorn<br/>volumes (block-device<br/>injection if engine IDs changed)"]:::store
LH --> VU["2 · Unseal Vault<br/>(1 key, threshold 1,<br/>key on the Mac)"]:::proc
VU --> VSO["3 · VSO re-auth<br/>(k8s auth → fresh<br/>dynamic creds)"]:::proc
VSO --> ERP["4 · Scale up ERP<br/>last (depends on DB +<br/>injected secrets)"]:::src
classDef dead fill:#6b7280,stroke:#4b5563,color:#fff
classDef store fill:#7c3aed,stroke:#6d28d9,color:#fff
classDef proc fill:#059669,stroke:#047857,color:#fff
classDef src fill:#2563eb,stroke:#1e40af,color:#fff
```
1. **Restore Longhorn first.** Persistent volumes must be back and attachable before any stateful workload starts. If the engine IDs changed (the failure mode above), recover the data with the block-device injection playbook before proceeding. Nothing that mounts a PVC can come up until this is done.
2. **Unseal Vault.** Vault restarts **sealed** and serves nothing until a human unseals it with the single key from `~/.arcodange/cluster-keys.json` (threshold 1). This is the secret-flow chokepoint — see [Secrets & Vault](secrets-and-vault.md). No secret consumer recovers before this step.
3. **VSO re-authenticates.** Once Vault is unsealed, the Vault Secrets Operator re-auths over the Kubernetes auth backend and re-issues the dynamic credentials (notably fresh `postgres/creds/<app>` leases) that workloads need. Until VSO has re-populated the Kubernetes Secrets, apps would start with stale or missing credentials.
4. **Scale up ERP last.** ERP is the most dependency-heavy app: it needs both the database (on pi2) reachable and its Vault-injected secrets present. Bringing it up only after steps 1 to 3 are confirmed avoids a crash-loop against a half-recovered platform.
The source record for this drill (Longhorn restore, Vault unseal, VSO re-auth, ERP scaled up last, plus the 1-key/threshold-1 unseal detail) is CLUSTER_RECOVERY.md, kept at the lab root, outside this repo.
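The dependency-gate idea behind the four steps can be sketched as a tiny ordering check. This is a minimal model for a drill checklist, not the lab's tooling: the step identifiers and both helper functions are hypothetical, and confirming a step is left to the operator.

```python
# Hypothetical identifiers for the four steps, in the documented order.
RECOVERY_ORDER = [
    "longhorn_restored",
    "vault_unsealed",
    "vso_reauthenticated",
    "erp_scaled_up",
]


def next_step(confirmed):
    """Return the first step not yet confirmed, or None when all are done."""
    for step in RECOVERY_ORDER:
        if step not in confirmed:
            return step
    return None


def may_start(step, confirmed):
    """A step may start only when every earlier step has been confirmed."""
    idx = RECOVERY_ORDER.index(step)
    return all(prev in confirmed for prev in RECOVERY_ORDER[:idx])


if __name__ == "__main__":
    confirmed = {"longhorn_restored", "vault_unsealed"}
    print("next:", next_step(confirmed))
    print("ERP allowed yet:", may_start("erp_scaled_up", confirmed))
```

Because `next_step` always returns the earliest unconfirmed gate, a step confirmed out of order (for example Vault unsealed while Longhorn is still down) does not advance the drill.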
## Why this is rehearsed in the sandbox
A recovery procedure that has only been run once, under the stress of a real outage, is a liability. The production-like sandbox exists partly so this exact sequence can be **rehearsed deliberately** — kill the cluster, lose the engine IDs on a test volume, force a sealed Vault, and walk the four-step order back to green — without risking production data or a live ERP. That makes the drill a routine QA exercise rather than a one-shot incident memory. The QA approach for these drills is laid out in the PRD's [QA strategy](../../PRD/safe-prod-like-environment/qa-strategy.md), and the overall decision to maintain such an environment is [ADR 0001](../../ADR/0001-safe-prod-like-environment.md).
## See also
- [Longhorn PVC recovery ADR](../../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md) — engine-ID failure mode, the five recovery methods, and the block-device injection automation.
- [Secrets & Vault](secrets-and-vault.md) — the unseal model and why it gates step 2 of the recovery order.
- [Factory brick](01-factory.md) — the Ansible recover/ playbooks, the ArgoCD app-of-apps, and the Postgres-on-pi2 foundation.
- [PRD — QA strategy](../../PRD/safe-prod-like-environment/qa-strategy.md) — how recovery drills become routine QA.
- [ADR 0001 — Safe, production-like environment](../../ADR/0001-safe-prod-like-environment.md).
- CLUSTER_RECOVERY.md — the tested power-cut recovery record (lab root, outside this repo).