factory/vibe/guidebooks/lab-ecosystem/storage-and-recovery.md

# Storage & recovery

> **Status**: 🟢 Active
> **Last Updated**: 2026-06-23
> **Related**: [Lab ecosystem](README.md) · [Secrets & Vault](secrets-and-vault.md) · [Factory brick](01-factory.md) · [PRD — QA strategy](../../PRD/safe-prod-like-environment/qa-strategy.md)
> **Decision**: [ADR 0001 — Safe, production-like environment](../../ADR/0001-safe-prod-like-environment.md)
> **Upstream (incident ADR)**: [Longhorn PVC recovery](../../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md)

## TL;DR

The lab keeps state in **two** places, on purpose. **Longhorn** provides distributed block storage *inside* k3s for everything cluster-native (app PVCs, Traefik's `acme.json`, the backup volume itself). **PostgreSQL and Gitea** deliberately persist on **pi2's local disk, outside k3s**, as plain docker-compose: they are the platform's own foundations and must not depend on the cluster they help run. This split survives a full power cut, but Longhorn has one sharp edge: when its CRDs are wiped and recreated, it assigns **new engine IDs** and cannot automatically re-associate the surviving on-disk replica files with the new volumes. The 2026-04-13 power cut taught us a **fixed startup order** (Longhorn first, then Vault unseal, then VSO re-auth, with **ERP scaled up last**) that brings the cluster back deterministically. That order is now rehearsed as a drill.

## Two storage tiers, on purpose

| Tier | Backing | What lives there | Why here |
| --- | --- | --- | --- |
| **In-cluster** | **Longhorn** (distributed block storage inside k3s, replicated across pi1/pi2/pi3) | App PVCs, Traefik certificates (`acme.json`), the cluster backup volume (`backups-rwx`) | Cluster-native workloads get replicated, snapshot-able volumes that follow the pod. |
| **Outside the cluster** | **docker-compose on pi2's local disk** | PostgreSQL + Gitea | These are *foundations*: Gitea serves the GitOps source and Postgres backs the apps. They must survive — and start — **without** k3s being healthy, so they cannot live inside it. |

This separation is the reason the platform can bootstrap itself: Gitea and Postgres come up on pi2 independently, and only then does the cluster (which pulls its config from Gitea) have something to sync against. See the [Factory brick](01-factory.md) for how the Ansible playbooks and the ArgoCD app-of-apps consume those foundations.
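The ordering described above can be checked mechanically before the cluster is asked to sync. The sketch below is illustrative only: the host name `pi2` and the ports 5432 (PostgreSQL's default) and 3000 (Gitea's default HTTP port) are assumptions rather than values taken from the lab's configuration, and `wait_for_tcp` and `foundations_ready` are hypothetical helpers. A successful TCP connect only shows that something is listening, not that the service is healthy.

```python
import socket
import time
from typing import Dict, Tuple

# Assumed endpoints: default PostgreSQL and Gitea ports on a host named "pi2".
FOUNDATIONS: Dict[str, Tuple[str, int]] = {
    "postgres": ("pi2", 5432),
    "gitea": ("pi2", 3000),
}


def wait_for_tcp(host: str, port: int, timeout_s: float = 60.0,
                 interval_s: float = 2.0) -> bool:
    """Return True as soon as a TCP connect succeeds, False once timeout_s has elapsed."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # Open and immediately close a connection; only reachability matters here.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)


def foundations_ready(targets: Dict[str, Tuple[str, int]],
                      timeout_s: float = 60.0) -> Dict[str, bool]:
    """Check each foundation in turn and report which ones accept connections."""
    return {
        name: wait_for_tcp(host, port, timeout_s)
        for name, (host, port) in targets.items()
    }

# Example call (not executed here): foundations_ready(FOUNDATIONS, timeout_s=120)
```

A real check would likely go one step further, for example an authenticated query against Postgres or an HTTP request to Gitea, but the port probe is enough to gate "is pi2 serving at all" before starting k3s work.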

## The Longhorn engine-ID re-association failure mode

Longhorn stores each replica's data on a node in a directory named **`<volume-name>-<engine-id>`**. The raw `volume-head-*.img` files are durable — they survive a power cut on the disk. The danger is in the *metadata*, not the data:

1. A power cut drops the Longhorn CSI driver.
2. Recovering the stuck pods requires force-deleting Longhorn's CRDs (Volume / Engine / Replica), because a circular webhook dependency makes a clean shutdown impossible.
3. Reinstalling Longhorn recreates the Volume CRDs, but with **new engine IDs**.
4. Longhorn creates **new, empty** replica directories under the new engine IDs and **does not adopt** the old, data-bearing directories.

The result: the real data sits in an orphaned `…-<old-id>/` directory while Longhorn happily serves an empty `…-<new-id>/`. Worse, a naive directory rename can backfire: Longhorn reconciliation may find a `Dirty: true` orphan alongside a clean empty replica and **silently rebuild from the empty one, destroying the data**.

The proven safe path is the automated **block-device injection** (Method D): create a fresh volume, attach it in maintenance mode, and `rsync` the recovered, layer-merged image into the live device, never renaming the orphaned directories. The full method comparison, the `playbooks/recover/longhorn_data.yml` automation, and the prevention work (the backup playbook now captures Longhorn Volume CRDs for fast `kubectl apply` restore) are documented in the [Longhorn PVC recovery ADR](../../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md).
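Before any recovery is attempted, it helps to know which directories are the orphans. The following read-only sketch is one way to list them, under stated assumptions: replica directories follow the `<volume-name>-<engine-id>` naming described above, the caller already knows which IDs Longhorn currently references (the `live_ids` set, gathered by whatever means is available), and the function names are hypothetical. It renames and modifies nothing. Sizes are measured in allocated blocks rather than apparent file size, because image files can be sparse and an empty replica may otherwise look as large as a full one.

```python
import os
from pathlib import Path
from typing import Dict, List, Set, Tuple


def split_replica_dir(name: str) -> Tuple[str, str]:
    """Split '<volume-name>-<engine-id>' at the last hyphen."""
    volume, sep, suffix = name.rpartition("-")
    if not (sep and volume and suffix):
        raise ValueError(f"not a '<volume-name>-<id>' directory name: {name!r}")
    return volume, suffix


def allocated_bytes(directory: Path) -> int:
    """Disk space actually allocated under a directory (st_blocks is in 512-byte units)."""
    total = 0
    for root, _dirs, files in os.walk(directory):
        for filename in files:
            total += os.stat(os.path.join(root, filename)).st_blocks * 512
    return total


def orphan_candidates(dir_sizes: Dict[str, int],
                      live_ids: Set[str]) -> Dict[str, List[str]]:
    """Group data-bearing directories whose ID is not live, largest first per volume."""
    orphans: Dict[str, List[str]] = {}
    for name in sorted(dir_sizes, key=dir_sizes.get, reverse=True):
        volume, suffix = split_replica_dir(name)
        if suffix not in live_ids and dir_sizes[name] > 0:
            orphans.setdefault(volume, []).append(name)
    return orphans
```

The output is only a list of candidates to feed into the recovery playbook. Which orphan holds the most recent data still has to be confirmed by inspection.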

> [!CAUTION]
> Do **not** recover a Longhorn volume by renaming the orphaned replica directory to the new engine ID. Reconciliation can pick the empty replica as the rebuild source and overwrite your data. Use the block-device injection playbook instead.

## The tested 2026-04-13 power-cut recovery

The cluster was recovered end to end after the 2026-04-13 power cut, and the sequence was distilled into a deterministic startup order. The order is not arbitrary. Each step is a **dependency gate** for the next:

```mermaid
%%{init: {'theme':'base'}}%%
flowchart TD
    PC["Power cut<br/>(cluster down, disks intact)"]:::dead

    PC --> LH["1 · Restore Longhorn<br/>volumes (block-device<br/>injection if engine IDs changed)"]:::store
    LH --> VU["2 · Unseal Vault<br/>(1 key, threshold 1,<br/>key on the Mac)"]:::proc
    VU --> VSO["3 · VSO re-auth<br/>(k8s auth → fresh<br/>dynamic creds)"]:::proc
    VSO --> ERP["4 · Scale up ERP<br/>last (depends on DB +<br/>injected secrets)"]:::src

    classDef dead fill:#6b7280,stroke:#4b5563,color:#fff
    classDef store fill:#7c3aed,stroke:#6d28d9,color:#fff
    classDef proc fill:#059669,stroke:#047857,color:#fff
    classDef src fill:#2563eb,stroke:#1e40af,color:#fff
```

1. **Restore Longhorn first.** Persistent volumes must be back and attachable before any stateful workload starts. If the engine IDs changed (the failure mode above), recover the data with the block-device injection playbook before proceeding. Nothing that mounts a PVC can come up until this is done.
2. **Unseal Vault.** Vault restarts **sealed** and serves nothing until a human unseals it with the single key from `~/.arcodange/cluster-keys.json` (threshold 1). This is the secret-flow chokepoint — see [Secrets & Vault](secrets-and-vault.md). No secret consumer recovers before this step.
3. **VSO re-authenticates.** Once Vault is unsealed, the Vault Secrets Operator re-authenticates over the Kubernetes auth backend and requests the fresh dynamic credentials (notably new `postgres/creds/<app>` leases) that workloads need. Until VSO has repopulated the Kubernetes Secrets, apps would start with stale or missing credentials.
4. **Scale up ERP last.** ERP is the most dependency-heavy app — it needs both the database (on pi2) reachable and its Vault-injected secrets present. Bringing it up only after steps 1–3 are confirmed avoids a crash-loop against a half-recovered platform.
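The four steps above behave like a chain of gates: each one must be confirmed before the next is attempted. A minimal sketch of that idea follows. It is not the lab's actual tooling: `Gate`, `run_gates`, and the check functions are hypothetical, and the stub checks stand in for real probes. The one external detail it relies on is that Vault's `GET /v1/sys/seal-status` endpoint returns a JSON body with a boolean `sealed` field.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Gate:
    name: str
    check: Callable[[], bool]  # returns True once the step is confirmed healthy


def vault_unsealed(seal_status: dict) -> bool:
    """Interpret the JSON body of Vault's GET /v1/sys/seal-status."""
    # A missing field is treated as "still sealed" so the gate fails closed.
    return seal_status.get("sealed") is False


def run_gates(gates: Iterable[Gate]) -> Tuple[List[str], Optional[str]]:
    """Walk the gates in order and stop at the first one that is not satisfied.

    Returns (names of gates passed, name of the blocking gate or None).
    """
    passed: List[str] = []
    for gate in gates:
        if not gate.check():
            return passed, gate.name
        passed.append(gate.name)
    return passed, None


if __name__ == "__main__":
    # Stub checks: Longhorn is back, but Vault is still sealed.
    order = [
        Gate("longhorn", lambda: True),
        Gate("vault-unseal", lambda: vault_unsealed({"sealed": True, "t": 1, "n": 1})),
        Gate("vso-reauth", lambda: True),
        Gate("erp-scale-up", lambda: True),
    ]
    print(run_gates(order))  # prints (['longhorn'], 'vault-unseal')
```

Stopping at the first failed gate, rather than checking everything and reporting afterwards, mirrors the recovery order itself: later checks are never evaluated, so ERP is never touched while Vault is sealed.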

The source record for this drill (Longhorn restore, Vault unseal, VSO re-auth, ERP scaled up last, and the 1-key/threshold-1 unseal detail) is CLUSTER_RECOVERY.md, kept at the lab root, outside this repo.

## Why this is rehearsed in the sandbox

A recovery procedure that has only been run once, under the stress of a real outage, is a liability. The production-like sandbox exists partly so this exact sequence can be **rehearsed deliberately** — kill the cluster, lose the engine IDs on a test volume, force a sealed Vault, and walk the four-step order back to green — without risking production data or a live ERP. That makes the drill a routine QA exercise rather than a one-shot incident memory. The QA approach for these drills is laid out in the PRD's [QA strategy](../../PRD/safe-prod-like-environment/qa-strategy.md), and the overall decision to maintain such an environment is [ADR 0001](../../ADR/0001-safe-prod-like-environment.md).

## See also

- [Longhorn PVC recovery ADR](../../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md) — engine-ID failure mode, the five recovery methods, and the block-device injection automation.
- [Secrets & Vault](secrets-and-vault.md) — the unseal model and why it gates step 2 of the recovery order.
- [Factory brick](01-factory.md) — the Ansible recover/ playbooks, the ArgoCD app-of-apps, and the Postgres-on-pi2 foundation.
- [PRD — QA strategy](../../PRD/safe-prod-like-environment/qa-strategy.md) — how recovery drills become routine QA.
- [ADR 0001 — Safe, production-like environment](../../ADR/0001-safe-prod-like-environment.md).
- CLUSTER_RECOVERY.md — the tested power-cut recovery record (lab root, outside this repo).