[vibe](../../../README.md) > [Guidebooks](../../README.md) > [Factory provisioning](../README.md) > [Ansible](README.md) > **06 · Recover**

# 06 · Recover — Longhorn disaster recovery

> [!NOTE]
> **Status:** 🟡 beta · **Last Updated:** 2026-06-23
> **Upstream:** [Ansible sub-hub](README.md) · [Factory provisioning hub](../README.md)
> **Downstream:** [05 · Backup](05-backup.md): the dumps these playbooks consume
> **Related:** [Storage & recovery](../../lab-ecosystem/storage-and-recovery.md) · [PRD: QA strategy](../../../PRD/safe-prod-like-environment/qa-strategy.md) · [Longhorn PVC recovery ADR](../../../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md)

The `recover/` playbooks are **not** part of the linear `01..05` pipeline. They are an **on-demand disaster-recovery branch**, invoked only after a power cut or data loss. There are two, and which one you run depends on a single question: **do the Longhorn Volume CRDs still exist?**

> [!IMPORTANT]
> **Decision: pick the right playbook before you start.**
> - **Volume CRDs still present** (e.g. they were captured by the [05 · Backup k3s_pvc dump](05-backup.md), or never wiped) → run [`recover/longhorn.yml`](../../../../ansible/arcodange/factory/playbooks/recover/longhorn.yml). Fast: it re-applies the CRDs and the surviving on-disk replicas are re-adopted.
> - **Volume CRDs are GONE** (a nuclear Longhorn reinstall assigned new engine IDs) but the raw replica `.img` files survive on disk → run [`recover/longhorn_data.yml`](../../../../ansible/arcodange/factory/playbooks/recover/longhorn_data.yml). Slow: it merges replica layers at the block-device level and injects the data into a fresh volume.
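The decision above can be written down as a tiny lookup. This is only an illustrative sketch: the function name `choose_playbook` and its two boolean inputs are hypothetical and not part of the playbooks, which leave this choice to the operator.

```python
from typing import Optional


def choose_playbook(crds_available: bool, replica_imgs_survive: bool) -> Optional[str]:
    """Return the recovery playbook to run, or None if nothing can be recovered.

    Hypothetical helper mirroring the decision in this guide. Both inputs are
    answers the operator has to establish by inspecting the cluster and disks.
    """
    if crds_available:
        # Fast path: re-apply the Volume CRDs so surviving replicas are re-adopted.
        return "recover/longhorn.yml"
    if replica_imgs_survive:
        # Slow path: merge replica layers at the block level into a fresh volume.
        return "recover/longhorn_data.yml"
    # CRDs gone and replicas zeroed: no playbook covers this case.
    return None


if __name__ == "__main__":
    for crds, imgs in [(True, True), (False, True), (False, False)]:
        print(f"crds={crds} imgs={imgs} -> {choose_playbook(crds, imgs)}")
```

The CRD check comes first on purpose: when the CRDs are available, the state of the raw `.img` files does not change which playbook to start with.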
```mermaid
%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'14px'}}}%%
flowchart TD
    classDef q fill:#5f4a1e,stroke:#d97706,color:#fffbeb;
    classDef fast fill:#14532d,stroke:#22c55e,color:#f0fdf4;
    classDef slow fill:#5f1e1e,stroke:#ef4444,color:#fef2f2;
    classDef dead fill:#6b7280,stroke:#4b5563,color:#fff;

    Q{"Do the Longhorn<br/>Volume CRDs<br/>still exist?"}:::q
    F["longhorn.yml<br/>CSI/CRD recovery (fast)"]:::fast
    S{"Raw replica<br/>.img files<br/>survive?"}:::q
    D["longhorn_data.yml<br/>block-device recovery (slow)"]:::slow
    X["Data unrecoverable<br/>(replicas zeroed)"]:::dead

    Q -- "yes" --> F
    Q -- "no" --> S
    S -- "yes" --> D
    S -- "no" --> X
```

1. **CRDs present?** Yes → `longhorn.yml` re-applies the Volume CRDs and the on-disk replicas re-attach. Done fast.
2. **CRDs gone?** Then ask whether the raw replica `.img` files survived on disk.
3. **Replicas survive?** Yes → `longhorn_data.yml` reconstructs the filesystem at the block level and injects it into a new volume.
4. **Replicas zeroed** by Longhorn reconciliation → the data is unrecoverable; there is no playbook for this.

> [!NOTE]
> This branch sits at step 1 of the broader tested startup order: **Longhorn first, then Vault unseal, then VSO re-auth, ERP scaled up last**. The full order, the engine-ID failure mode, and the once-real-once-rehearsed history are in [Storage & recovery](../../lab-ecosystem/storage-and-recovery.md). The single tested-recovery record (1-key/threshold-1 unseal, the four-step order) lives in CLUSTER_RECOVERY.md, kept at the lab root outside this repo.

---

## `longhorn.yml` — CSI/CRD recovery (CRDs present)

Runs against `raspberries:&local` as root. It diagnoses how broken Longhorn is and applies the **least invasive** fix that works, escalating only if needed. Most logic runs `run_once` on `pi1`, delegating cluster reads to `localhost`.

| Phase | What it does |
| --- | --- |
| **0 · Pre-flight** | Verifies the data dir `/mnt/arcodange/longhorn` exists on `pi1` (fails hard if missing) and that at least one `backup_*.volumes` dump exists in the primary or fallback backup dir. |
| **1 · Diagnosis** | Checks the `longhorn-system` namespace, the `driver.longhorn.io` **CSIDriver** registration, and the `longhorn-manager` pods, then sets `recovery_phase` = `soft` (CSI driver gone), `hard` (managers unhealthy), or `none`. |
| **2 · Soft** | Touches `longhorn-install.yaml` to make k3s reconcile the HelmChart, waits, and checks pods recreate. |
| **3 · Hard** | Force-deletes the `longhorn-driver-deployer` pods so the HelmChart recreates them. |
| **4 · Nuclear** | Full reinstall: delete the HelmChart, strip finalizers off all Longhorn CRs / PVCs / the namespace, delete + redeploy the `longhorn-install` HelmChart manifest (`v1.9.1`, `defaultDataPath` preserved), wait for pods. |
| **5 · Restore** | Waits for managers to be ready, then `kubectl apply`s the latest `backup_*.volumes` dump (PV/PVC + Longhorn CRDs) and any `longhorn_metadata_*.yaml`. |
| **6 · Verify** | Polls until the CSIDriver is registered, ≥3 managers are Running, the CSI socket exists, and the replica data dir is present; prints a summary. |

> [!IMPORTANT]
> Phase 5 is exactly where the [05 · Backup k3s_pvc dump](05-backup.md) pays off: re-applying the captured **Volume CRDs** lets Longhorn re-adopt the surviving replica directories instead of forcing the block-device path.

The playbook is **idempotent**: it re-diagnoses and escalates only as far as needed, so re-running after a partial recovery is safe.

---

## `longhorn_data.yml` — block-device data recovery (CRDs gone)

This is the fallback when a nuclear reinstall has destroyed the Volume CRDs and assigned new engine IDs, leaving the real data in **orphaned** replica directories. It bypasses Kubernetes objects entirely and reconstructs the filesystem at the block level.

It is **driven by a vars file** (`vars/recovery_volumes.yml`, one entry per volume), and the format is documented in [`longhorn_data_vars.example.yml`](../../../../ansible/arcodange/factory/playbooks/recover/longhorn_data_vars.example.yml).
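As a rough illustration of how one entry in the vars file might resolve to a replica source, the sketch below combines the explicit `source_node`/`source_dir` override with the auto-discovery rule described in the Phase 0 walkthrough further down (largest replica directory above a size threshold, skipping replicas that are still `Rebuilding`). Everything other than `source_node` and `source_dir` is an assumption made for the example: the `ReplicaDir` type, the `pick_source` function, the `rebuilding` flag, and the `min_bytes` parameter are hypothetical, and the real playbook's data shapes and threshold may differ.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class ReplicaDir:
    """Hypothetical record for one replica directory found on a node."""
    node: str          # e.g. "pi1"
    path: str          # replica directory on that node
    size_bytes: int    # on-disk size of the directory
    rebuilding: bool   # True while Longhorn is still rebuilding this replica


def pick_source(
    entry: Dict[str, str],
    candidates: List[ReplicaDir],
    min_bytes: int,
) -> Optional[Tuple[str, str]]:
    """Return (node, dir) to recover from, or None if nothing qualifies."""
    # An explicit override in the vars entry wins over auto-discovery.
    node, path = entry.get("source_node"), entry.get("source_dir")
    if node and path:
        return node, path

    # Otherwise keep only replicas that are not rebuilding and exceed the threshold.
    usable = [c for c in candidates if not c.rebuilding and c.size_bytes > min_bytes]
    if not usable:
        return None

    # Largest surviving directory is assumed to hold the most complete data.
    best = max(usable, key=lambda c: c.size_bytes)
    return best.node, best.path


if __name__ == "__main__":
    found = [
        ReplicaDir("pi1", "/replicas/example-a", 100, False),
        ReplicaDir("pi2", "/replicas/example-b", 500, True),   # skipped: rebuilding
        ReplicaDir("pi3", "/replicas/example-c", 300, False),
    ]
    print(pick_source({}, found, min_bytes=50))
```

Passing the threshold in as `min_bytes` rather than hard-coding it keeps the sketch neutral about the exact cut-off the playbook uses.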
```sh
ansible-playbook -i inventory/hosts.yml \
  playbooks/recover/longhorn_data.yml \
  -e @vars/recovery_volumes.yml
```

```mermaid
%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'14px'}}}%%
flowchart TD
    classDef pre fill:#5f4a1e,stroke:#d97706,color:#fffbeb;
    classDef merge fill:#4c1d95,stroke:#7c3aed,color:#f5f3ff;
    classDef k8s fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb;
    classDef done fill:#14532d,stroke:#22c55e,color:#f0fdf4;

    P0["Pre-flight + Phase 0:<br/>auto-discover largest replica dir (>16K)"]:::pre
    P1["Phase 1: back up untouched replica dir<br/>(safe copy before any op)"]:::merge
    P2["Phase 2: merge-longhorn-layers.py<br/>→ single .img · test-mount RO"]:::merge
    P3["Phase 3: create Volume CRD<br/>(scale down workload, clear stuck PVCs)"]:::k8s
    P5["Phase 5: attach via maintenance ticket<br/>→ /dev/longhorn/<pv>"]:::k8s
    P6["Phase 6: mkfs + rsync merged image<br/>into live block device"]:::merge
    P8["Phase 8: recreate PV (Retain) + PVC<br/>pinned by volumeName"]:::k8s
    P9["Phase 9: scale workload up · verify"]:::done

    P0 --> P1 --> P2 --> P3 --> P5 --> P6 --> P8 --> P9
```

1. **Pre-flight + Phase 0.** Fail fast if no volumes are defined, the merge tool is missing, or Longhorn managers aren't Running. Then **auto-discover** the best replica source for each volume: the **largest dir >16 MiB** across `pi1/pi2/pi3`, skipping any replica still `Rebuilding`. `source_node`/`source_dir` in the vars file override this.
2. **Phase 1.** `cp -a` the untouched replica dir to a backup location *before* touching anything, and verify it contains `volume.meta`.
3. **Phase 2.** Run `merge-longhorn-layers.py` to collapse the snapshot + head `.img` layers into one image, then test-mount it read-only to confirm the filesystem is sound.
4. **Phase 3.** Scale the workload to 0 and clear any stuck `Terminating` PV/PVCs *before* creating a fresh Longhorn `Volume` CRD (order matters: StatefulSet controllers re-provision empty PVCs otherwise).
5. **Phase 5.** Attach the volume via a Longhorn `VolumeAttachment` **maintenance ticket** so `/dev/longhorn/<pv>` appears on the source node, with the frontend enabled.
6. **Phase 6.** `mkfs.ext4` the live block device if unformatted, then `rsync` the merged recovery image into it (`--ignore-errors`; rsync rc=23 partial-transfer is treated as success for power-cut partitions).
7. **Phase 8.** Detach the recovery ticket, recreate the PV (`Retain`, no `claimRef`) and a PVC pinned by `volumeName`, and wait for Bound.
8. **Phase 9.** Scale the workload back up, wait for ready replicas, and run the optional per-volume `verify_cmd` inside the pod.

> [!CAUTION]
> The `merge-longhorn-layers.py` tool is invoked **per replica dir via `dmsetup`** to stack the copy-on-write layers correctly. Never recover by simply renaming the orphaned replica directory to the new engine ID, because Longhorn reconciliation can pick the *empty* new replica as the rebuild source and **overwrite your data**. The block-device injection is the only proven-safe path. The full method comparison is in the [Longhorn PVC recovery ADR](../../../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md).

> [!NOTE]
> **Tested 2026-04-13 power-cut.** This block-device path was proven end to end recovering the **url-shortener's SQLite database** after that power cut forced a nuclear Longhorn reinstall (verified `2026-04-14` with `sqlite3 … 'SELECT COUNT(*) FROM urls;'`). That scenario is the worked example in [`longhorn_data_vars.example.yml`](../../../../ansible/arcodange/factory/playbooks/recover/longhorn_data_vars.example.yml).

---

## Gotchas

> [!WARNING]
> - **Run `longhorn.yml` first if there is any chance the CRDs survived.** It is fast and idempotent; falling straight to `longhorn_data.yml` is unnecessary block-level work when a `kubectl apply` would have sufficed.
> - **`longhorn_data.yml` needs a healthy Longhorn control plane.** Its pre-flight aborts unless ≥1 `longhorn-manager` is Running. It recovers *data into* a working Longhorn; it does not bring Longhorn back. Use `longhorn.yml` for that.
> - **Process volumes one at a time first.** The example vars file recommends validating a single volume before batching, because a misidentified `source_dir` can pin the PVC to the wrong (empty) replica.
> - **`python3` on every node.** Phase 0's replica scan and the merge tool both require `python3` on `pi1/pi2/pi3`.
> - **The merge tool path is repo-relative.** `longhorn_data.yml` resolves `merge-longhorn-layers.py` from `docs/incidents/2026-04-13-power-cut/tools/` and `scp`s it to the source node, so run the playbook from inside the collection for that path to resolve.

---

## Why this is rehearsed

A recovery procedure run once under outage stress is a liability.
These two playbooks (and the CRDs-present-vs-gone decision) are **rehearsed deliberately in the production-like sandbox**: kill the cluster, lose the engine IDs on a test volume, and walk both recovery paths back to green without risking production data. That turns the drill into routine QA rather than one-shot incident memory. See the PRD's [QA strategy](../../../PRD/safe-prod-like-environment/qa-strategy.md) for how recovery drills become a regular exercise, and [Storage & recovery](../../lab-ecosystem/storage-and-recovery.md) for the full startup order these drills validate.

---

## Where this branch sits

```mermaid
%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'14px'}}}%%
flowchart LR
    classDef done fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb;
    classDef here fill:#5f1e1e,stroke:#ef4444,color:#fef2f2;

    s05["05 · Backup<br/>(produces .volumes dump)"]:::done
    rec["recover/*<br/>longhorn.yml · longhorn_data.yml"]:::here
    s01["01 · System<br/>(rejoin pipeline)"]:::done

    s05 -. "on disaster" .-> rec
    rec -. "once recovered" .-> s01
```

1. **05 · Backup** produced the `.volumes` dump that `longhorn.yml`'s restore phase replays.
2. **recover/** (this page) is invoked only on disaster: pick `longhorn.yml` (CRDs present) or `longhorn_data.yml` (CRDs gone).
3. Once volumes are healthy, the cluster **re-enters the normal pipeline** at [01 · System](01-system.md), and you re-run a fresh [05 · Backup](05-backup.md) once everything is green.