factory/vibe/guidebooks/lab-ecosystem/storage-and-recovery.md
Gabriel Radureau 7647a68cdc docs(vibe): bootstrap vibe/ knowledge tree + ecosystem AGENTS.md
Add a root AGENTS.md (ecosystem map of factory/tools/cms + agent operating
rules + the persona cohort & workflow) and a new vibe/ knowledge base for LLM
agents, modeled on tree-docs conventions and the factory house style.

vibe/ folders (each with a README hub + contribution rules):
- ADR/      optimized MADR-lite; canonical home going forward (doc/adr stays historical)
- PRD/      one subfolder per PRD, mandatory STATUS.md, QA strategy for big ones
- investigations/  single INV-NNN-slug.md, or stub + folder w/ notebooks
- guidebooks/      tree-docs maps; lab-ecosystem guidebook of factory+tools+cms
- runbooks/        [AGENT]/[HUMAN] step procedures (EN; doc/runbooks stays FR)
- shareouts/       dated FR handouts (decks/mp4)

Seed content (first ADR + PRD): a safe, production-like environment to rehearse
risky changes and recovery without touching real prod — local-only sandbox
(k3d + arm64 VMs) with a hard prod/sandbox isolation boundary. Includes
INV-001 (prod blast-radius couplings), the ecosystem guidebook, and a FR shareout.

Conventions enforced: no-tombstone rule, breadcrumb spine, bidirectional
cross-links, theme:base mermaid (MCP-validated) + ordered-list-after-diagram.
Built with a Workflow + persona cohort; 24 files, zero dead links.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 11:52:37 +02:00


vibe > Guidebooks > Lab ecosystem > Storage & recovery

Storage & recovery

Status: 🟢 Active
Last Updated: 2026-06-23
Related: Lab ecosystem · Secrets & Vault · Factory brick · PRD: QA strategy
Decision: ADR 0001: Safe, production-like environment
Upstream (incident ADR): Longhorn PVC recovery

TL;DR

The lab keeps state in two places, on purpose. Longhorn provides distributed block storage inside k3s for everything cluster-native (app PVCs, Traefik's acme.json, the backup volume itself). PostgreSQL and Gitea deliberately persist on pi2's local disk, outside k3s, as plain docker-compose: they are the platform's own foundations and must not depend on the cluster they help run. This split survives a full power cut, but Longhorn has one sharp edge: when its CRDs are wiped and recreated, it assigns new engine IDs and cannot automatically re-associate the surviving on-disk replica files with the new volumes. The 2026-04-13 power cut taught us a fixed startup order (Longhorn first, then Vault unseal, then VSO re-auth, with ERP scaled up last) that brings the cluster back deterministically. That order is now rehearsed as a drill.

Two storage tiers, on purpose

| Tier | Backing | What lives there | Why here |
| --- | --- | --- | --- |
| In-cluster | Longhorn (distributed block storage inside k3s, replicated across pi1/pi2/pi3) | App PVCs, Traefik certificates (acme.json), the cluster backup volume (backups-rwx) | Cluster-native workloads get replicated, snapshot-able volumes that follow the pod. |
| Outside the cluster | docker-compose on pi2's local disk | PostgreSQL + Gitea | These are foundations: Gitea serves the GitOps source and Postgres backs the apps. They must survive, and start, without k3s being healthy, so they cannot live inside it. |

This separation is the reason the platform can bootstrap itself: Gitea and Postgres come up on pi2 independently, and only then does the cluster (which pulls its config from Gitea) have something to sync against. See the Factory brick for how the Ansible playbooks and the ArgoCD app-of-apps consume those foundations.
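
The bootstrap gate described above can be sketched as a small pre-flight check: confirm that both foundations answer on pi2 before expecting the cluster to sync. This is an illustrative sketch, not part of the playbooks. The `FOUNDATIONS` mapping, the function names, and the ports (3000 for Gitea, 5432 for PostgreSQL, which are the upstream defaults) are assumptions; the lab's real endpoints may differ.

```python
import socket
from typing import Callable, Dict, List, Tuple

Endpoint = Tuple[str, int]

# Hypothetical endpoints: host name and ports are assumptions
# (upstream defaults), not values taken from the lab configuration.
FOUNDATIONS: Dict[str, Endpoint] = {
    "gitea": ("pi2", 3000),
    "postgres": ("pi2", 5432),
}


def tcp_probe(endpoint: Endpoint, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection(endpoint, timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution errors.
        return False


def missing_foundations(
    endpoints: Dict[str, Endpoint],
    probe: Callable[[Endpoint], bool] = tcp_probe,
) -> List[str]:
    """Names of foundation services that do not answer, sorted for stable output."""
    return sorted(name for name, ep in endpoints.items() if not probe(ep))
```

An empty list from `missing_foundations(FOUNDATIONS)` would mean both services accept TCP connections. It says nothing about whether they are healthy at the application level, so it is a first gate only.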

The Longhorn engine-ID re-association failure mode

Longhorn stores each replica's data on a node in a directory named <volume-name>-<engine-id>. The raw volume-head-*.img files are durable — they survive a power cut on the disk. The danger is in the metadata, not the data:

  1. A power cut drops the Longhorn CSI driver.
  2. Recovering the stuck pods forces a delete of Longhorn's CRDs (Volume / Engine / Replica) — a webhook circular dependency makes a clean shutdown impossible.
  3. Reinstalling Longhorn recreates the Volume CRDs, but with new engine IDs.
  4. Longhorn creates new, empty replica directories under the new engine IDs and does not adopt the old, data-bearing directories.

The result: the real data sits in an orphaned …-<old-id>/ directory while Longhorn happily serves an empty …-<new-id>/. Worse, a naive directory rename can backfire — Longhorn reconciliation may find a Dirty: true orphan alongside a clean empty replica and silently rebuild from the empty one, destroying the data. The proven safe path is the automated block-device injection (Method D): create a fresh volume, attach it in maintenance mode, and rsync the recovered, layer-merged image into the live device — never renaming the orphaned directories. The full method comparison, the playbooks/recover/longhorn_data.yml automation, and the prevention work (the backup playbook now captures Longhorn Volume CRDs for fast kubectl apply restore) are documented in the Longhorn PVC recovery ADR.

Caution

Do not recover a Longhorn volume by renaming the orphaned replica directory to the new engine ID. Reconciliation can pick the empty replica as the rebuild source and overwrite your data. Use the block-device injection playbook instead.
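
To make the failure mode easier to spot, the following read-only sketch groups replica directory names by volume and reports any volume that has more than one directory on a node. It assumes the `<volume-name>-<engine-id>` layout described above, with the engine ID being everything after the last hyphen. The function names and the idea of passing in a name-to-size mapping are illustrative assumptions. It only reports; it never renames or deletes anything.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def split_replica_dir(dirname: str) -> Tuple[str, str]:
    """Split '<volume-name>-<engine-id>' on the last hyphen (assumed layout)."""
    volume, _, engine_id = dirname.rpartition("-")
    return volume, engine_id


def find_split_volumes(dir_sizes: Dict[str, int]) -> Dict[str, List[str]]:
    """Map each volume with more than one replica directory to its
    directory names, largest first.

    dir_sizes maps a replica directory name to its size in bytes.
    Size is only a hint for a human reviewer: a large directory next to
    a near-empty one suggests an orphan, but it proves nothing by itself.
    """
    by_volume: Dict[str, List[str]] = defaultdict(list)
    for dirname in dir_sizes:
        volume, _ = split_replica_dir(dirname)
        by_volume[volume].append(dirname)
    return {
        volume: sorted(dirs, key=lambda d: dir_sizes[d], reverse=True)
        for volume, dirs in by_volume.items()
        if len(dirs) > 1
    }
```

A volume that appears in the result is a candidate for the block-device injection playbook, not for a manual rename.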

The tested 2026-04-13 power-cut recovery

The April 13, 2026 power cut was recovered end to end and the sequence was distilled into a deterministic startup order. The order is not arbitrary — each step is a dependency gate for the next:

```mermaid
%%{init: {'theme':'base'}}%%
flowchart TD
    PC["Power cut<br/>(cluster down, disks intact)"]:::dead

    PC --> LH["1 · Restore Longhorn<br/>volumes (block-device<br/>injection if engine IDs changed)"]:::store
    LH --> VU["2 · Unseal Vault<br/>(1 key, threshold 1,<br/>key on the Mac)"]:::proc
    VU --> VSO["3 · VSO re-auth<br/>(k8s auth → fresh<br/>dynamic creds)"]:::proc
    VSO --> ERP["4 · Scale up ERP<br/>last (depends on DB +<br/>injected secrets)"]:::src

    classDef dead fill:#6b7280,stroke:#4b5563,color:#fff
    classDef store fill:#7c3aed,stroke:#6d28d9,color:#fff
    classDef proc fill:#059669,stroke:#047857,color:#fff
    classDef src fill:#2563eb,stroke:#1e40af,color:#fff
```
  1. Restore Longhorn first. Persistent volumes must be back and attachable before any stateful workload starts. If the engine IDs changed (the failure mode above), recover the data with the block-device injection playbook before proceeding. Nothing that mounts a PVC can come up until this is done.
  2. Unseal Vault. Vault restarts sealed and serves nothing until a human unseals it with the single key from ~/.arcodange/cluster-keys.json (threshold 1). This is the secret-flow chokepoint — see Secrets & Vault. No secret consumer recovers before this step.
  3. VSO re-authenticates. Once Vault is unsealed, the Vault Secrets Operator re-auths over the Kubernetes auth backend and re-issues the dynamic credentials (notably fresh postgres/creds/<app> leases) that workloads need. Until VSO has re-populated the Kubernetes Secrets, apps would start with stale or missing credentials.
  4. Scale up ERP last. ERP is the most dependency-heavy app: it needs both the database (on pi2) reachable and its Vault-injected secrets present. Bringing it up only after steps 1-3 are confirmed avoids a crash-loop against a half-recovered platform.
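
The "each step is a dependency gate for the next" rule in the four steps above can be sketched as a tiny gate runner: checks run in order, and the run stops at the first gate that is not satisfied. This is a minimal illustration of the ordering logic only. The gate names and the placeholder checks are hypothetical; real checks would query Longhorn, Vault, VSO, and the ERP deployment.

```python
from typing import Callable, List, Optional, Tuple

Gate = Tuple[str, Callable[[], bool]]


def run_gates(gates: List[Gate]) -> Tuple[List[str], Optional[str]]:
    """Run checks in order and stop at the first one that fails.

    Returns (names that passed, name of the blocking gate or None).
    Later gates are never evaluated once an earlier one fails.
    """
    passed: List[str] = []
    for name, check in gates:
        if not check():
            return passed, name
        passed.append(name)
    return passed, None


if __name__ == "__main__":
    # Placeholder checks (assumptions): replace each lambda with a real probe.
    recovery_order: List[Gate] = [
        ("longhorn-volumes-restored", lambda: True),
        ("vault-unsealed", lambda: False),
        ("vso-reauthenticated", lambda: True),
        ("erp-scaled-up", lambda: True),
    ]
    done, blocked = run_gates(recovery_order)
    print("passed:", done, "blocked at:", blocked)
```

Stopping at the first failed gate, rather than collecting all failures, mirrors the procedure: there is no point probing VSO or ERP while Vault is still sealed.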

The source of record for this drill (Longhorn restore, Vault unseal, VSO re-auth, ERP scaled up last, plus the 1-key/threshold-1 unseal detail) is CLUSTER_RECOVERY.md, kept at the lab root, outside this repo.
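
As a hedged illustration of the unseal gate, the sketch below summarizes a Vault seal-status response. It assumes the JSON body returned by Vault's `sys/seal-status` endpoint, which carries `initialized`, `sealed`, `t` (threshold), and `n` (key shares). The function name and the sample payload in the note that follows are illustrative, not captured from the lab.

```python
import json
from typing import Any, Dict


def summarize_seal_status(payload: str) -> Dict[str, Any]:
    """Reduce a seal-status JSON body to the fields the drill cares about.

    A missing 'sealed' field is treated as sealed, so an unexpected
    response never reports the gate as open.
    """
    status = json.loads(payload)
    return {
        "ready": bool(status.get("initialized")) and not status.get("sealed", True),
        "threshold": status.get("t"),
        "shares": status.get("n"),
    }
```

With a 1-key, threshold-1 setup, a sealed Vault would be expected to report `threshold` 1 and `ready` False until the single key has been submitted.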

Why this is rehearsed in the sandbox

A recovery procedure that has only been run once, under the stress of a real outage, is a liability. The production-like sandbox exists partly so this exact sequence can be rehearsed deliberately — kill the cluster, lose the engine IDs on a test volume, force a sealed Vault, and walk the four-step order back to green — without risking production data or a live ERP. That makes the drill a routine QA exercise rather than a one-shot incident memory. The QA approach for these drills is laid out in the PRD's QA strategy, and the overall decision to maintain such an environment is ADR 0001.

See also