06 · Recover — Longhorn disaster recovery
Note
Status: 🟡 beta · Last Updated: 2026-06-23
Upstream: Ansible sub-hub · Factory provisioning hub
Downstream: 05 · Backup (the dumps these playbooks consume)
Related: Storage & recovery · PRD: QA strategy · Longhorn PVC recovery ADR
The recover/ playbooks are not part of the linear 01..05 pipeline — they are an on-demand disaster-recovery branch, invoked only after a power cut or data loss. There are two, and which one you run depends on a single question: do the Longhorn Volume CRDs still exist?
Important
Decision: pick the right playbook before you start.
- Volume CRDs still present (e.g. they were captured by the 05 · Backup k3s_pvc dump, or never wiped) → run recover/longhorn.yml. Fast: it re-applies the CRDs and the surviving on-disk replicas are re-adopted.
- Volume CRDs are gone (a nuclear Longhorn reinstall assigned new engine IDs) but the raw replica .img files survive on disk → run recover/longhorn_data.yml. Slow: it merges replica layers at the block-device level and injects the data into a fresh volume.
```mermaid
%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'14px'}}}%%
flowchart TD
classDef q fill:#5f4a1e,stroke:#d97706,color:#fffbeb;
classDef fast fill:#14532d,stroke:#22c55e,color:#f0fdf4;
classDef slow fill:#5f1e1e,stroke:#ef4444,color:#fef2f2;
classDef dead fill:#6b7280,stroke:#4b5563,color:#fff;
Q{"Do the Longhorn<br/>Volume CRDs<br/>still exist?"}:::q
F["longhorn.yml<br/>CSI/CRD recovery (fast)"]:::fast
S{"Raw replica<br/>.img files<br/>survive?"}:::q
D["longhorn_data.yml<br/>block-device recovery (slow)"]:::slow
X["Data unrecoverable<br/>(replicas zeroed)"]:::dead
Q -- "yes" --> F
Q -- "no" --> S
S -- "yes" --> D
S -- "no" --> X
```
- CRDs present? Yes → longhorn.yml re-applies the Volume CRDs and the on-disk replicas re-attach. Done fast.
- CRDs gone? Then ask whether the raw replica .img files survived on disk.
- Replicas survive? Yes → longhorn_data.yml reconstructs the filesystem at the block level and injects it into a new volume.
- Replicas zeroed by Longhorn reconciliation → the data is unrecoverable; there is no playbook for this.
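The decision tree above can be written down as a tiny function. This is only an illustrative sketch: the names `RecoveryPath` and `choose_recovery_path` are hypothetical and do not exist in the playbooks, and the two boolean inputs stand in for whatever checks you perform by hand.

```python
from enum import Enum


class RecoveryPath(Enum):
    """Outcomes of the decision tree (hypothetical names, for illustration)."""
    CRD_RECOVERY = "recover/longhorn.yml"        # fast path
    BLOCK_DEVICE = "recover/longhorn_data.yml"   # slow path
    UNRECOVERABLE = "none"                       # no playbook covers this case


def choose_recovery_path(crds_present: bool, replica_images_survive: bool) -> RecoveryPath:
    """Mirror the flowchart: check the Volume CRDs first, then the raw .img files."""
    if crds_present:
        return RecoveryPath.CRD_RECOVERY
    if replica_images_survive:
        return RecoveryPath.BLOCK_DEVICE
    return RecoveryPath.UNRECOVERABLE
```

Note that the CRD question is asked first on purpose: when the CRDs survive, the state of the replica files does not change which playbook to start with.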
Note
This branch sits at step 1 of the broader tested startup order — Longhorn first, then Vault unseal, then VSO re-auth, ERP scaled up last. The full order, the engine-ID failure mode, and the once-real-once-rehearsed history are in Storage & recovery. The single tested-recovery record (1-key/threshold-1 unseal, the four-step order) lives in CLUSTER_RECOVERY.md, kept at the lab root outside this repo.
longhorn.yml — CSI/CRD recovery (CRDs present)
Runs against raspberries:&local as root. It diagnoses how broken Longhorn is and applies the least invasive fix that works, escalating only if needed. Most of the logic runs with run_once on pi1, delegating cluster reads to localhost.
| Phase | What it does |
|---|---|
| 0 · Pre-flight | Verifies the data dir /mnt/arcodange/longhorn exists on pi1 (fails hard if missing) and that at least one backup_*.volumes dump exists in the primary or fallback backup dir. |
| 1 · Diagnosis | Checks the longhorn-system namespace, the driver.longhorn.io CSIDriver registration, and the longhorn-manager pods, then sets recovery_phase = soft (CSI driver gone), hard (managers unhealthy), or none. |
| 2 · Soft | Touches longhorn-install.yaml to make k3s reconcile the HelmChart, waits, and checks pods recreate. |
| 3 · Hard | Force-deletes the longhorn-driver-deployer pods so the HelmChart recreates them. |
| 4 · Nuclear | Full reinstall: delete the HelmChart, strip finalizers off all Longhorn CRs / PVCs / the namespace, delete + redeploy the longhorn-install HelmChart manifest (v1.9.1, defaultDataPath preserved), wait for pods. |
| 5 · Restore | Waits for managers to be ready, then applies (kubectl apply) the latest backup_*.volumes dump (PV/PVC + Longhorn CRDs) and any longhorn_metadata_*.yaml. |
| 6 · Verify | Polls until the CSIDriver is registered, ≥3 managers are Running, the CSI socket exists, and the replica data dir is present; prints a summary. |
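The Phase 6 verification amounts to polling four conditions until all hold. The sketch below shows that shape in Python under stated assumptions: `LonghornStatus`, `is_recovered`, and `poll_until_recovered` are hypothetical names, the retry count and delay are placeholders rather than the playbook's values, and `probe` stands in for however the real checks are gathered.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class LonghornStatus:
    """Snapshot of the four conditions Phase 6 waits for (hypothetical type)."""
    csi_driver_registered: bool
    running_managers: int
    csi_socket_exists: bool
    replica_dir_present: bool


def is_recovered(status: LonghornStatus, min_managers: int = 3) -> bool:
    """True only when every verification condition from the table holds."""
    return (
        status.csi_driver_registered
        and status.running_managers >= min_managers
        and status.csi_socket_exists
        and status.replica_dir_present
    )


def poll_until_recovered(
    probe: Callable[[], LonghornStatus],
    retries: int = 30,          # assumption: placeholder, not the playbook's value
    delay_s: float = 10.0,      # assumption: placeholder, not the playbook's value
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe() up to `retries` times, sleeping between failed attempts."""
    for attempt in range(retries):
        if is_recovered(probe()):
            return True
        if attempt < retries - 1:
            sleep(delay_s)
    return False
```

Passing `sleep` as a parameter keeps the loop testable without real waiting.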
Important
Phase 5 is exactly where the 05 · Backup k3s_pvc dump pays off: re-applying the captured Volume CRDs lets Longhorn re-adopt the surviving replica directories instead of forcing the block-device path. The playbook is idempotent — it re-diagnoses and escalates only as far as needed, so re-running after a partial recovery is safe.
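To make the "latest dump" selection concrete, here is a minimal sketch of how one might pick the file to re-apply. It rests on two assumptions that this page does not confirm: that "latest" means most recently modified, and that the primary directory is preferred over the fallback whenever it holds at least one dump. The function name `latest_volumes_dump` is hypothetical.

```python
from pathlib import Path
from typing import Optional, Sequence


def latest_volumes_dump(backup_dirs: Sequence[Path]) -> Optional[Path]:
    """Return the newest backup_*.volumes file from the first directory that has one.

    `backup_dirs` is ordered by preference (primary first, then fallback).
    """
    for directory in backup_dirs:
        if not directory.is_dir():
            continue
        dumps = list(directory.glob("backup_*.volumes"))
        if dumps:
            # Assumption: "latest" is decided by modification time.
            return max(dumps, key=lambda p: p.stat().st_mtime)
    return None
```

A `None` result corresponds to the Phase 0 pre-flight failure: with no dump in either directory there is nothing for the restore phase to replay.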
longhorn_data.yml — block-device data recovery (CRDs gone)
This is the fallback when a nuclear reinstall has destroyed the Volume CRDs and assigned new engine IDs, leaving the real data in orphaned replica directories. It bypasses Kubernetes objects entirely and reconstructs the filesystem at the block level. It is driven by a vars file — vars/recovery_volumes.yml, one entry per volume — and the format is documented in longhorn_data_vars.example.yml.
```shell
ansible-playbook -i inventory/hosts.yml \
  playbooks/recover/longhorn_data.yml \
  -e @vars/recovery_volumes.yml
```
```mermaid
%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'14px'}}}%%
flowchart TD
classDef pre fill:#5f4a1e,stroke:#d97706,color:#fffbeb;
classDef merge fill:#4c1d95,stroke:#7c3aed,color:#f5f3ff;
classDef k8s fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb;
classDef done fill:#14532d,stroke:#22c55e,color:#f0fdf4;
P0["Pre-flight + Phase 0:<br/>auto-discover largest replica dir (>16K)"]:::pre
P1["Phase 1: back up untouched replica dir<br/>(safe copy before any op)"]:::merge
P2["Phase 2: merge-longhorn-layers.py<br/>→ single .img · test-mount RO"]:::merge
P3["Phase 3: create Volume CRD<br/>(scale down workload, clear stuck PVCs)"]:::k8s
P5["Phase 5: attach via maintenance ticket<br/>→ /dev/longhorn/<pv>"]:::k8s
P6["Phase 6: mkfs + rsync merged image<br/>into live block device"]:::merge
P8["Phase 8: recreate PV (Retain) + PVC<br/>pinned by volumeName"]:::k8s
P9["Phase 9: scale workload up · verify"]:::done
P0 --> P1 --> P2 --> P3 --> P5 --> P6 --> P8 --> P9
```
- Pre-flight + Phase 0. Fail fast if no volumes are defined, the merge tool is missing, or Longhorn managers aren't Running. Then auto-discover the best replica source for each volume: the largest dir >16 MiB across pi1/pi2/pi3, skipping any replica still Rebuilding. source_node/source_dir in the vars file override this.
- Phase 1. cp -a the untouched replica dir to a backup location before touching anything, and verify it contains volume.meta.
- Phase 2. Run merge-longhorn-layers.py to collapse the snapshot + head .img layers into one image, then test-mount it read-only to confirm the filesystem is sound.
- Phase 3. Scale the workload to 0 and clear any stuck Terminating PV/PVCs before creating a fresh Longhorn Volume CRD (order matters: StatefulSet controllers re-provision empty PVCs otherwise).
- Phase 5. Attach the volume via a Longhorn VolumeAttachment maintenance ticket so /dev/longhorn/<pv> appears on the source node, with the frontend enabled.
- Phase 6. mkfs.ext4 the live block device if unformatted, then rsync the merged recovery image into it (--ignore-errors; rsync rc=23 partial-transfer is treated as success for power-cut partitions).
- Phase 8. Detach the recovery ticket, recreate the PV (Retain, no claimRef) and a PVC pinned by volumeName, and wait for Bound.
- Phase 9. Scale the workload back up, wait for ready replicas, and run the optional per-volume verify_cmd inside the pod.
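The Phase 0 selection rule (largest usable replica dir, skip anything still Rebuilding, let the vars file override) can be sketched as below. This is not the playbook's code: `ReplicaDir` and `pick_replica_source` are hypothetical names, and the size threshold is a parameter set here to the 16 MiB figure from the list above.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Assumption: threshold taken from the Phase 0 description; adjust to the real value.
MIN_REPLICA_BYTES = 16 * 1024 * 1024


@dataclass(frozen=True)
class ReplicaDir:
    """One candidate replica directory found on a node (hypothetical type)."""
    node: str
    path: str
    size_bytes: int
    rebuilding: bool = False


def pick_replica_source(
    candidates: Iterable[ReplicaDir],
    override: Optional[ReplicaDir] = None,
    min_bytes: int = MIN_REPLICA_BYTES,
) -> Optional[ReplicaDir]:
    """Return the explicit override if given, else the largest usable candidate."""
    if override is not None:
        # Mirrors source_node/source_dir in the vars file taking precedence.
        return override
    usable = [c for c in candidates if not c.rebuilding and c.size_bytes > min_bytes]
    return max(usable, key=lambda c: c.size_bytes, default=None)
```

Returning `None` when nothing qualifies gives the caller a clear signal to stop rather than proceed with an empty or half-rebuilt replica.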
Caution
The merge-longhorn-layers.py tool is invoked per replica dir via dmsetup to stack the copy-on-write layers correctly. Never recover by simply renaming the orphaned replica directory to the new engine ID: Longhorn reconciliation can pick the empty new replica as the rebuild source and overwrite your data. The block-device injection is the only proven-safe path. The full method comparison is in the Longhorn PVC recovery ADR.
Note
Tested: 2026-04-13 power cut. This block-device path was proven end to end recovering the url-shortener's SQLite database after that power cut forced a nuclear Longhorn reinstall (verified 2026-04-14 with sqlite3 … 'SELECT COUNT(*) FROM urls;'). That scenario is the worked example in longhorn_data_vars.example.yml.
Gotchas
Warning
- Run longhorn.yml first if there is any chance the CRDs survived. It is fast and idempotent; falling straight to longhorn_data.yml is unnecessary block-level work when a kubectl apply would have sufficed.
- longhorn_data.yml needs a healthy Longhorn control plane. Its pre-flight aborts unless ≥1 longhorn-manager is Running: it recovers data into a working Longhorn, it does not bring Longhorn back. Use longhorn.yml for that.
- Process volumes one at a time first. The example vars file recommends validating a single volume before batching; a misidentified source_dir can pin the PVC to the wrong (empty) replica.
- python3 on every node. Phase 0's replica scan and the merge tool both require python3 on pi1/pi2/pi3.
- The merge tool path is repo-relative. longhorn_data.yml resolves merge-longhorn-layers.py from docs/incidents/2026-04-13-power-cut/tools/ and copies it (scp) to the source node, so run the playbook from inside the collection so that path resolves.
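Several of these gotchas are things the pre-flight is described as checking, and they can be rehearsed locally before launching the playbook. The helper below is a hypothetical sketch, not part of the collection: `preflight_errors` and its parameters are illustrative, and the manager count is assumed to come from whatever cluster query you already use.

```python
from pathlib import Path
from typing import List, Sequence


def preflight_errors(
    volumes: Sequence[dict],
    merge_tool: Path,
    running_managers: int,
) -> List[str]:
    """Collect every pre-flight problem instead of stopping at the first one."""
    errors: List[str] = []
    if not volumes:
        errors.append("no volumes defined in the vars file")
    if not merge_tool.is_file():
        # The path is repo-relative, so this also catches a wrong working directory.
        errors.append(f"merge tool not found: {merge_tool}")
    if running_managers < 1:
        errors.append("no longhorn-manager pod is Running")
    return errors
```

Collecting all failures at once is a design choice for operator convenience; the playbook itself is described as failing fast.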
Why this is rehearsed
A recovery procedure run once under outage stress is a liability. These two playbooks — and the CRDs-present-vs-gone decision — are rehearsed deliberately in the production-like sandbox: kill the cluster, lose the engine IDs on a test volume, and walk both recovery paths back to green without risking production data. That turns the drill into routine QA rather than one-shot incident memory. See the PRD's QA strategy for how recovery drills become a regular exercise, and Storage & recovery for the full startup order these drills validate.
Where this branch sits
```mermaid
%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'14px'}}}%%
flowchart LR
classDef done fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb;
classDef here fill:#5f1e1e,stroke:#ef4444,color:#fef2f2;
s05["05 · Backup<br/>(produces .volumes dump)"]:::done
rec["recover/*<br/>longhorn.yml · longhorn_data.yml"]:::here
s01["01 · System<br/>(rejoin pipeline)"]:::done
s05 -. "on disaster" .-> rec
rec -. "once recovered" .-> s01
```
- 05 · Backup produced the .volumes dump that longhorn.yml's restore phase replays.
- recover/ (this page) is invoked only on disaster: pick longhorn.yml (CRDs present) or longhorn_data.yml (CRDs gone).
- Once volumes are healthy, the cluster re-enters the normal pipeline at 01 · System, and you re-run a fresh 05 · Backup once everything is green.