# Storage & recovery

> **Status**: 🟢 Active
> **Last Updated**: 2026-06-23
> **Related**: [Lab ecosystem](README.md) · [Secrets & Vault](secrets-and-vault.md) · [Factory brick](01-factory.md) · [PRD: QA strategy](../../PRD/safe-prod-like-environment/qa-strategy.md)
> **Decision**: [ADR 0001: Safe, production-like environment](../../ADR/0001-safe-prod-like-environment.md)
> **Upstream (incident ADR)**: [Longhorn PVC recovery](../../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md)

## TL;DR

The lab keeps state in **two** places, on purpose. **Longhorn** provides distributed block storage *inside* k3s for everything cluster-native (app PVCs, Traefik's `acme.json`, the backup volume itself). **PostgreSQL and Gitea** deliberately persist on **pi2's local disk, outside k3s**, as plain docker-compose: they are the platform's own foundations and must not depend on the cluster they help run. This split survives a full power cut, but Longhorn has one sharp edge: when its CRDs are wiped and recreated, it assigns **new engine IDs** and cannot automatically re-associate the surviving on-disk replica files with the new volumes. The 2026-04-13 power cut taught us a **fixed startup order** (Longhorn first, then Vault unseal, then VSO re-auth, with **ERP scaled up last**) that brings the cluster back deterministically. That order is now rehearsed as a drill.

## Two storage tiers, on purpose

| Tier | Backing | What lives there | Why here |
| --- | --- | --- | --- |
| **In-cluster** | **Longhorn** (distributed block storage inside k3s, replicated across pi1/pi2/pi3) | App PVCs, Traefik certificates (`acme.json`), the cluster backup volume (`backups-rwx`) | Cluster-native workloads get replicated, snapshot-able volumes that follow the pod. |
| **Outside the cluster** | **docker-compose on pi2's local disk** | PostgreSQL + Gitea | These are *foundations*: Gitea serves the GitOps source and Postgres backs the apps. They must survive, and start, **without** k3s being healthy, so they cannot live inside it. |

This separation is the reason the platform can bootstrap itself: Gitea and Postgres come up on pi2 independently, and only then does the cluster (which pulls its config from Gitea) have something to sync against. See the [Factory brick](01-factory.md) for how the Ansible playbooks and the ArgoCD app-of-apps consume those foundations.

## The Longhorn engine-ID re-association failure mode

Longhorn stores each replica's data on a node in a directory named **`-`**. The raw `volume-head-*.img` files are durable: they survive a power cut on the disk. The danger is in the *metadata*, not the data:

1. A power cut drops the Longhorn CSI driver.
2. Recovering the stuck pods forces a delete of Longhorn's CRDs (Volume / Engine / Replica); a webhook circular dependency makes a clean shutdown impossible.
3. Reinstalling Longhorn recreates the Volume CRDs, but with **new engine IDs**.
4. Longhorn creates **new, empty** replica directories under the new engine IDs and **does not adopt** the old, data-bearing directories.

The result: the real data sits in an orphaned `…-/` directory while Longhorn happily serves an empty `…-/`. Worse, a naive directory rename can backfire: Longhorn reconciliation may find a `Dirty: true` orphan alongside a clean empty replica and **silently rebuild from the empty one, destroying the data**.

The proven safe path is the automated **block-device injection** (Method D): create a fresh volume, attach it in maintenance mode, and `rsync` the recovered, layer-merged image into the live device, never renaming the orphaned directories.
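
Before any recovery it can help to see which replica directories Longhorn no longer references. The following is a minimal read-only sketch, not the ADR's automation: the function name, the `replica_root` argument, and the idea of passing in the set of directory names used by the current Replica objects are assumptions made for this example. It only lists candidates and never renames or modifies anything.

```python
from pathlib import Path
from typing import List, Set


def find_orphan_replica_dirs(replica_root: Path, live_dir_names: Set[str]) -> List[Path]:
    """List replica directories that hold volume-head images but are not live.

    `live_dir_names` is assumed to be collected separately (for example from
    the recreated Replica objects); this function does not talk to Kubernetes.
    """
    orphans: List[Path] = []
    for entry in sorted(replica_root.iterdir()):
        # Skip plain files and every directory Longhorn currently references.
        if not entry.is_dir() or entry.name in live_dir_names:
            continue
        # A directory with at least one volume-head image is a recovery candidate.
        if any(entry.glob("volume-head-*.img")):
            orphans.append(entry)
    return orphans
```

The output is only a list of candidates to feed into the block-device injection path. Image size is deliberately not used as a signal here, because that would be an extra assumption about how the image files are allocated on disk.
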
The full method comparison, the `playbooks/recover/longhorn_data.yml` automation, and the prevention work (the backup playbook now captures Longhorn Volume CRDs for fast `kubectl apply` restore) are documented in the [Longhorn PVC recovery ADR](../../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md).

> [!CAUTION]
> Do **not** recover a Longhorn volume by renaming the orphaned replica directory to the new engine ID. Reconciliation can pick the empty replica as the rebuild source and overwrite your data. Use the block-device injection playbook instead.

## The tested 2026-04-13 power-cut recovery

The April 13, 2026 power cut was recovered end to end and the sequence was distilled into a deterministic startup order. The order is not arbitrary: each step is a **dependency gate** for the next:

```mermaid
%%{init: {'theme':'base'}}%%
flowchart TD
    PC["Power cut<br/>(cluster down, disks intact)"]:::dead
    PC --> LH["1 · Restore Longhorn<br/>volumes (block-device<br/>injection if engine IDs changed)"]:::store
    LH --> VU["2 · Unseal Vault<br/>(1 key, threshold 1,<br/>key on the Mac)"]:::proc
    VU --> VSO["3 · VSO re-auth<br/>(k8s auth → fresh<br/>dynamic creds)"]:::proc
    VSO --> ERP["4 · Scale up ERP<br/>last (depends on DB +<br/>injected secrets)"]:::src
    classDef dead fill:#6b7280,stroke:#4b5563,color:#fff
    classDef store fill:#7c3aed,stroke:#6d28d9,color:#fff
    classDef proc fill:#059669,stroke:#047857,color:#fff
    classDef src fill:#2563eb,stroke:#1e40af,color:#fff
```

1. **Restore Longhorn first.** Persistent volumes must be back and attachable before any stateful workload starts. If the engine IDs changed (the failure mode above), recover the data with the block-device injection playbook before proceeding. Nothing that mounts a PVC can come up until this is done.
2. **Unseal Vault.** Vault restarts **sealed** and serves nothing until a human unseals it with the single key from `~/.arcodange/cluster-keys.json` (threshold 1). This is the secret-flow chokepoint; see [Secrets & Vault](secrets-and-vault.md). No secret consumer recovers before this step.
3. **VSO re-authenticates.** Once Vault is unsealed, the Vault Secrets Operator re-auths over the Kubernetes auth backend and re-issues the dynamic credentials (notably fresh `postgres/creds/` leases) that workloads need. Until VSO has re-populated the Kubernetes Secrets, apps would start with stale or missing credentials.
4. **Scale up ERP last.** ERP is the most dependency-heavy app: it needs both the database (on pi2) reachable and its Vault-injected secrets present. Bringing it up only after steps 1–3 are confirmed avoids a crash-loop against a half-recovered platform.

The single backing record for this drill (Longhorn restore, Vault unseal, VSO re-auth, ERP scaled up last, plus the 1-key/threshold-1 unseal detail) is CLUSTER_RECOVERY.md (kept at the lab root, outside this repo).

## Why this is rehearsed in the sandbox

A recovery procedure that has only been run once, under the stress of a real outage, is a liability.
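
For drill purposes, the four-step order can be thought of as a chain of gates where each step runs only if the previous one passed. The sketch below is purely illustrative: the step labels in `RECOVERY_ORDER`, the `run_recovery` helper, and the idea of expressing each gate as a boolean callable are assumptions for this example, not part of the lab's playbooks. The real checks (volume attachment, Vault seal status, refreshed Kubernetes Secrets, ERP readiness) would be supplied by the caller.

```python
from typing import Callable, List, Sequence, Tuple

# Illustrative labels for the four gates of the recovery order.
RECOVERY_ORDER = (
    "longhorn-volumes-restored",
    "vault-unsealed",
    "vso-reauthenticated",
    "erp-scaled-up",
)

Gate = Callable[[], bool]


def run_recovery(steps: Sequence[Tuple[str, Gate]]) -> List[str]:
    """Evaluate gates in order and stop at the first one that fails.

    Returns the names of the steps that passed, so a drill can report
    exactly where the walk back to green stopped.
    """
    passed: List[str] = []
    for name, gate in steps:
        if not gate():
            break  # Later steps depend on this one, so do not attempt them.
        passed.append(name)
    return passed
```

Stopping at the first failed gate mirrors the dependency chain: with Vault still sealed, for instance, the VSO and ERP gates are never evaluated.
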
The production-like sandbox exists partly so this exact sequence can be **rehearsed deliberately** (kill the cluster, lose the engine IDs on a test volume, force a sealed Vault, and walk the four-step order back to green) without risking production data or a live ERP. That makes the drill a routine QA exercise rather than a one-shot incident memory.

The QA approach for these drills is laid out in the PRD's [QA strategy](../../PRD/safe-prod-like-environment/qa-strategy.md), and the overall decision to maintain such an environment is [ADR 0001](../../ADR/0001-safe-prod-like-environment.md).

## See also

- [Longhorn PVC recovery ADR](../../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md): engine-ID failure mode, the five recovery methods, and the block-device injection automation.
- [Secrets & Vault](secrets-and-vault.md): the unseal model and why it gates step 2 of the recovery order.
- [Factory brick](01-factory.md): the Ansible recover/ playbooks, the ArgoCD app-of-apps, and the Postgres-on-pi2 foundation.
- [PRD: QA strategy](../../PRD/safe-prod-like-environment/qa-strategy.md): how recovery drills become routine QA.
- [ADR 0001: Safe, production-like environment](../../ADR/0001-safe-prod-like-environment.md).
- CLUSTER_RECOVERY.md: the tested power-cut recovery record (lab root, outside this repo).