vibe > Guidebooks > Lab ecosystem > Storage & recovery
Storage & recovery
Status: 🟢 Active
Last Updated: 2026-06-23
Related: Lab ecosystem · Secrets & Vault · Factory brick · PRD — QA strategy
Decision: ADR 0001 — Safe, production-like environment
Upstream (incident ADR): Longhorn PVC recovery
TL;DR
The lab keeps state in two places, on purpose. Longhorn provides distributed block storage inside k3s for everything cluster-native (app PVCs, Traefik's acme.json, the backup volume itself). PostgreSQL and Gitea deliberately persist on pi2's local disk, outside k3s, as plain docker-compose — they are the platform's own foundations and must not depend on the cluster they help run. This split survives a full power cut, but Longhorn has one sharp edge: when its CRDs are wiped and recreated, it assigns new engine IDs and cannot automatically re-associate the surviving on-disk replica files with the new volumes. The 2026-04-13 power-cut taught us a fixed startup order — Longhorn first, then Vault unseal, then VSO re-auth, with ERP scaled up last — that brings the cluster back deterministically. That order is now rehearsed as a drill.
Two storage tiers, on purpose
| Tier | Backing | What lives there | Why here |
|---|---|---|---|
| In-cluster | Longhorn (distributed block storage inside k3s, replicated across pi1/pi2/pi3) | App PVCs, Traefik certificates (acme.json), the cluster backup volume (backups-rwx) | Cluster-native workloads get replicated, snapshot-able volumes that follow the pod. |
| Outside the cluster | docker-compose on pi2's local disk | PostgreSQL + Gitea | These are foundations: Gitea serves the GitOps source and Postgres backs the apps. They must survive — and start — without k3s being healthy, so they cannot live inside it. |
This separation is the reason the platform can bootstrap itself: Gitea and Postgres come up on pi2 independently, and only then does the cluster (which pulls its config from Gitea) have something to sync against. See the Factory brick for how the Ansible playbooks and the ArgoCD app-of-apps consume those foundations.
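The bootstrap ordering above can be verified before a cluster sync is attempted. The following is a minimal sketch, assuming Postgres and Gitea listen on their upstream default ports (5432 and 3000) and that `pi2` resolves from wherever the check runs. The port map and both function names are illustrative and are not taken from the lab's playbooks.

```python
import socket

# Assumed ports: upstream defaults for PostgreSQL and Gitea, not confirmed for this lab.
FOUNDATIONS = {"postgres": 5432, "gitea": 3000}


def is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def missing_foundations(host: str, ports: dict[str, int]) -> list[str]:
    """Names of foundation services that do not accept TCP connections on host."""
    return [name for name, port in ports.items() if not is_reachable(host, port)]
```

Calling `missing_foundations("pi2", FOUNDATIONS)` returns an empty list when both foundations accept connections. A TCP connect only shows that something is listening, so it is a coarse gate rather than a health check.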
The Longhorn engine-ID re-association failure mode
Longhorn stores each replica's data on a node in a directory named <volume-name>-<engine-id>. The raw volume-head-*.img files are durable — they survive a power cut on the disk. The danger is in the metadata, not the data:
- A power cut drops the Longhorn CSI driver.
- Recovering the stuck pods forces a delete of Longhorn's CRDs (Volume / Engine / Replica) — a webhook circular dependency makes a clean shutdown impossible.
- Reinstalling Longhorn recreates the Volume CRDs, but with new engine IDs.
- Longhorn creates new, empty replica directories under the new engine IDs and does not adopt the old, data-bearing directories.
The result: the real data sits in an orphaned …-<old-id>/ directory while Longhorn happily serves an empty …-<new-id>/. Worse, a naive directory rename can backfire — Longhorn reconciliation may find a Dirty: true orphan alongside a clean empty replica and silently rebuild from the empty one, destroying the data. The proven safe path is the automated block-device injection (Method D): create a fresh volume, attach it in maintenance mode, and rsync the recovered, layer-merged image into the live device — never renaming the orphaned directories. The full method comparison, the playbooks/recover/longhorn_data.yml automation, and the prevention work (the backup playbook now captures Longhorn Volume CRDs for fast kubectl apply restore) are documented in the Longhorn PVC recovery ADR.
Caution
Do not recover a Longhorn volume by renaming the orphaned replica directory to the new engine ID. Reconciliation can pick the empty replica as the rebuild source and overwrite your data. Use the block-device injection playbook instead.
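To make the orphaned-directory situation concrete, here is a hedged, read-only sketch that flags replica directories whose engine-ID suffix no longer matches the current one. It assumes the `<volume-name>-<engine-id>` naming described above and additionally assumes the engine-ID suffix contains no hyphen. The function names, the sample directory names, and the `current_engine` mapping are hypothetical. It is not part of `playbooks/recover/longhorn_data.yml`, and it deliberately renames nothing.

```python
from collections import defaultdict


def split_replica_dir(dirname: str) -> tuple[str, str]:
    """Split '<volume-name>-<engine-id>' on the LAST hyphen.

    Assumption: the engine-id suffix itself contains no hyphen, while the
    volume name may (for example 'pvc-1234').
    """
    volume, _, engine_id = dirname.rpartition("-")
    if not volume or not engine_id:
        raise ValueError(f"not a replica directory name: {dirname!r}")
    return volume, engine_id


def find_orphans(
    dirnames: list[str], current_engine: dict[str, str]
) -> dict[str, list[str]]:
    """Map each volume to replica directories whose engine ID is not current.

    The result lists candidates only: the function reads nothing from disk
    and never renames or deletes anything.
    """
    orphans: dict[str, list[str]] = defaultdict(list)
    for name in dirnames:
        volume, engine_id = split_replica_dir(name)
        if volume in current_engine and engine_id != current_engine[volume]:
            orphans[volume].append(name)
    return dict(orphans)


# Hypothetical example: one volume has an old and a new directory side by side.
dirs = ["pvc-aaa-old1", "pvc-aaa-new1", "pvc-bbb-new2"]
current = {"pvc-aaa": "new1", "pvc-bbb": "new2"}
print(find_orphans(dirs, current))
```

A listing like this is only a way to locate the data-bearing directory as input to the block-device injection path; it is not a step towards renaming it.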
The tested 2026-04-13 power-cut recovery
The April 13, 2026 power cut was recovered end to end and the sequence was distilled into a deterministic startup order. The order is not arbitrary — each step is a dependency gate for the next:
%%{init: {'theme':'base'}}%%
flowchart TD
PC["Power cut<br/>(cluster down, disks intact)"]:::dead
PC --> LH["1 · Restore Longhorn<br/>volumes (block-device<br/>injection if engine IDs changed)"]:::store
LH --> VU["2 · Unseal Vault<br/>(1 key, threshold 1,<br/>key on the Mac)"]:::proc
VU --> VSO["3 · VSO re-auth<br/>(k8s auth → fresh<br/>dynamic creds)"]:::proc
VSO --> ERP["4 · Scale up ERP<br/>last (depends on DB +<br/>injected secrets)"]:::src
classDef dead fill:#6b7280,stroke:#4b5563,color:#fff
classDef store fill:#7c3aed,stroke:#6d28d9,color:#fff
classDef proc fill:#059669,stroke:#047857,color:#fff
classDef src fill:#2563eb,stroke:#1e40af,color:#fff
- Restore Longhorn first. Persistent volumes must be back and attachable before any stateful workload starts. If the engine IDs changed (the failure mode above), recover the data with the block-device injection playbook before proceeding. Nothing that mounts a PVC can come up until this is done.
- Unseal Vault. Vault restarts sealed and serves nothing until a human unseals it with the single key from ~/.arcodange/cluster-keys.json (threshold 1). This is the secret-flow chokepoint (see Secrets & Vault). No secret consumer recovers before this step.
- VSO re-authenticates. Once Vault is unsealed, the Vault Secrets Operator re-auths over the Kubernetes auth backend and re-issues the dynamic credentials (notably fresh postgres/creds/<app> leases) that workloads need. Until VSO has re-populated the Kubernetes Secrets, apps would start with stale or missing credentials.
- Scale up ERP last. ERP is the most dependency-heavy app: it needs both the database (on pi2) reachable and its Vault-injected secrets present. Bringing it up only after steps 1–3 are confirmed avoids a crash-loop against a half-recovered platform.
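The "each step is a dependency gate for the next" idea in the four-step order can be sketched as a small gate runner. This is an illustration only: the state keys and the `run_gates` / `recovery_gates` helpers are hypothetical, and the boolean flags stand in for real probes that the source does not describe.

```python
from typing import Callable, Optional

# A gate is a (step name, check) pair; checks are placeholder callables.
Gate = tuple[str, Callable[[], bool]]


def run_gates(gates: list[Gate]) -> tuple[list[str], Optional[str]]:
    """Evaluate gates in order and stop at the first one that is not satisfied.

    Returns (names of gates that passed, name of the blocking gate or None).
    """
    passed: list[str] = []
    for name, check in gates:
        if not check():
            return passed, name
        passed.append(name)
    return passed, None


def recovery_gates(state: dict[str, bool]) -> list[Gate]:
    """Build the documented four-step order from observed flags (hypothetical keys)."""
    order = ["longhorn_restored", "vault_unsealed", "vso_reauthed", "erp_scaled_up"]
    # Bind each key as a default argument so every lambda checks its own step.
    return [(key, lambda k=key: state.get(k, False)) for key in order]


# Example: volumes are back, but Vault is still sealed.
state = {"longhorn_restored": True, "vault_unsealed": False, "vso_reauthed": True}
passed, blocked = run_gates(recovery_gates(state))
print(passed, blocked)
```

Because evaluation stops at the first unsatisfied gate, a sealed Vault blocks everything after step 1 even when a later flag looks healthy, which mirrors the rule that no secret consumer recovers before the unseal.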
The source of record for this drill is CLUSTER_RECOVERY.md (kept at the lab root, outside this repo). It records the four-step order (Longhorn restore, Vault unseal, VSO re-auth, ERP scaled up last) and the 1-key/threshold-1 unseal detail.
Why this is rehearsed in the sandbox
A recovery procedure that has only been run once, under the stress of a real outage, is a liability. The production-like sandbox exists partly so this exact sequence can be rehearsed deliberately — kill the cluster, lose the engine IDs on a test volume, force a sealed Vault, and walk the four-step order back to green — without risking production data or a live ERP. That makes the drill a routine QA exercise rather than a one-shot incident memory. The QA approach for these drills is laid out in the PRD's QA strategy, and the overall decision to maintain such an environment is ADR 0001.
See also
- Longhorn PVC recovery ADR — engine-ID failure mode, the five recovery methods, and the block-device injection automation.
- Secrets & Vault — the unseal model and why it gates step 2 of the recovery order.
- Factory brick — the Ansible recover/ playbooks, the ArgoCD app-of-apps, and the Postgres-on-pi2 foundation.
- PRD — QA strategy — how recovery drills become routine QA.
- ADR 0001 — Safe, production-like environment.
- CLUSTER_RECOVERY.md — the tested power-cut recovery record (lab root, outside this repo).