factory/vibe/guidebooks/factory-provisioning/ansible/06-recover.md
Gabriel Radureau dbe32161dc docs(vibe): add factory-provisioning guidebook (Ansible + OpenTofu)
Deep, code-grounded tree-docs guidebook under vibe/guidebooks/factory-provisioning/,
explored from the actual playbooks/roles and tofu code:

- Hub: the two provisioning engines (operator-run Ansible vs CI-applied OpenTofu),
  a green-field bring-up flow, master index, maintenance rule.
- ansible/ sub-tree: ordered pages 01-system .. 06-recover, an inventory & variables
  concept page, and a Tier-1/Tier-2 roles reference (hashicorp_vault, step_ca,
  crowdsec, pihole, deploy_docker_compose + the gitea_* family and helpers).
- opentofu/ sub-tree: factory-iac (Cloudflare/OVH/GCP/Gitea/Vault edge +
  cloudflare_token module), postgres-iac (per-app DB/role/pgbouncer lookup),
  ci-apply-flow (Gitea OIDC-JWT -> Vault -> auto-approve apply).

Cross-linked bidirectionally with the lab-ecosystem guidebook and the safe-env
ADR/PRD (the sandbox rehearses exactly these engines). 14 mermaid diagrams
MCP-validated; zero dead links. Authored by the Lab Cartographer cohort.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 21:11:51 +02:00



06 · Recover — Longhorn disaster recovery

Note

Status: 🟡 beta · Last Updated: 2026-06-23
Upstream: Ansible sub-hub · Factory provisioning hub
Downstream: 05 · Backup — the dumps these playbooks consume
Related: Storage & recovery · PRD — QA strategy · Longhorn PVC recovery ADR

The recover/ playbooks are not part of the linear 01..05 pipeline — they are an on-demand disaster-recovery branch, invoked only after a power cut or data loss. There are two, and which one you run depends on a single question: do the Longhorn Volume CRDs still exist?

Important

Decision — pick the right playbook before you start:

  • Volume CRDs still present (e.g. they were captured by the 05 · Backup k3s_pvc dump, or never wiped) → run recover/longhorn.yml. Fast: it re-applies the CRDs and the surviving on-disk replicas are re-adopted.
  • Volume CRDs are GONE (a nuclear Longhorn reinstall assigned new engine IDs) but the raw replica .img files survive on disk → run recover/longhorn_data.yml. Slow: it merges replica layers at the block-device level and injects the data into a fresh volume.

%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'14px'}}}%%
flowchart TD
  classDef q fill:#5f4a1e,stroke:#d97706,color:#fffbeb;
  classDef fast fill:#14532d,stroke:#22c55e,color:#f0fdf4;
  classDef slow fill:#5f1e1e,stroke:#ef4444,color:#fef2f2;
  classDef dead fill:#6b7280,stroke:#4b5563,color:#fff;

  Q{"Do the Longhorn<br/>Volume CRDs<br/>still exist?"}:::q
  F["longhorn.yml<br/>CSI/CRD recovery (fast)"]:::fast
  S{"Raw replica<br/>.img files<br/>survive?"}:::q
  D["longhorn_data.yml<br/>block-device recovery (slow)"]:::slow
  X["Data unrecoverable<br/>(replicas zeroed)"]:::dead

  Q -- "yes" --> F
  Q -- "no" --> S
  S -- "yes" --> D
  S -- "no" --> X

  1. CRDs present? Yes → longhorn.yml re-applies the Volume CRDs and the on-disk replicas re-attach. Done fast.
  2. CRDs gone? Then ask whether the raw replica .img files survived on disk.
  3. Replicas survive? Yes → longhorn_data.yml reconstructs the filesystem at the block level and injects it into a new volume.
  4. Replicas zeroed by Longhorn reconciliation → the data is unrecoverable; there is no playbook for this.
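
The decision above can be sketched as a small shell function. This is an illustrative helper, not part of the playbooks: the function name and its yes/no arguments are hypothetical, and the longhorn.yml path is assumed to sit next to longhorn_data.yml under playbooks/recover/.

```shell
# Hypothetical helper: prints the playbook that matches the two questions
# in the decision tree. It only encodes the branching; it does not inspect
# the cluster or the disks itself.
pick_recovery_playbook() {
  # $1: "yes" if the Longhorn Volume CRDs still exist
  # $2: "yes" if the raw replica .img files survive on disk
  if [ "$1" = "yes" ]; then
    echo "playbooks/recover/longhorn.yml"
  elif [ "$2" = "yes" ]; then
    echo "playbooks/recover/longhorn_data.yml"
  else
    # Replicas zeroed: no playbook covers this case.
    echo "unrecoverable: replicas zeroed, no playbook applies" >&2
    return 1
  fi
}
```

For example, `pick_recovery_playbook no yes` prints playbooks/recover/longhorn_data.yml. One way to answer the first question by hand is to list Longhorn's Volume objects with `kubectl -n longhorn-system get volumes.longhorn.io`.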

Note

This branch sits at step 1 of the broader tested startup order — Longhorn first, then Vault unseal, then VSO re-auth, ERP scaled up last. The full order, the engine-ID failure mode, and the once-real-once-rehearsed history are in Storage & recovery. The single tested-recovery record (1-key/threshold-1 unseal, the four-step order) lives in CLUSTER_RECOVERY.md, kept at the lab root outside this repo.


longhorn.yml — CSI/CRD recovery (CRDs present)

Runs against raspberries:&local as root. It diagnoses how broken Longhorn is and applies the least invasive fix that works, escalating only if needed. Most of the logic runs once (run_once) on pi1, delegating cluster reads to localhost.

| Phase | What it does |
| --- | --- |
| 0 · Pre-flight | Verifies the data dir /mnt/arcodange/longhorn exists on pi1 (fails hard if missing) and that at least one backup_*.volumes dump exists in the primary or fallback backup dir. |
| 1 · Diagnosis | Checks the longhorn-system namespace, the driver.longhorn.io CSIDriver registration, and the longhorn-manager pods, then sets recovery_phase = soft (CSI driver gone), hard (managers unhealthy), or none. |
| 2 · Soft | Touches longhorn-install.yaml to make k3s reconcile the HelmChart, waits, and checks that the pods are recreated. |
| 3 · Hard | Force-deletes the longhorn-driver-deployer pods so the HelmChart recreates them. |
| 4 · Nuclear | Full reinstall: delete the HelmChart, strip finalizers off all Longhorn CRs / PVCs / the namespace, delete + redeploy the longhorn-install HelmChart manifest (v1.9.1, defaultDataPath preserved), wait for pods. |
| 5 · Restore | Waits for managers to be ready, then runs kubectl apply on the latest backup_*.volumes dump (PV/PVC + Longhorn CRDs) and any longhorn_metadata_*.yaml. |
| 6 · Verify | Polls until the CSIDriver is registered, ≥3 managers are Running, the CSI socket exists, and the replica data dir is present; prints a summary. |
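
The Phase 1 outcome can be pictured as a three-way classification. The following is a minimal sketch, not the playbook's actual task logic: the function name and its yes/no inputs are hypothetical, and it assumes the CSI driver check is evaluated before the manager check, which this page does not state.

```shell
# Hypothetical classifier mirroring the recovery_phase values named in the
# Diagnosis row: soft (CSI driver gone), hard (managers unhealthy), or none.
classify_recovery_phase() {
  # $1: "yes" if the driver.longhorn.io CSIDriver is registered
  # $2: "yes" if the longhorn-manager pods are healthy
  if [ "$1" != "yes" ]; then
    echo soft   # assumption: a missing CSI driver is tested first
  elif [ "$2" != "yes" ]; then
    echo hard
  else
    echo none
  fi
}
```

`classify_recovery_phase yes no` prints hard, and `classify_recovery_phase yes yes` prints none, in which case no repair phase would need to run.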

Important

Phase 5 is exactly where the 05 · Backup k3s_pvc dump pays off: re-applying the captured Volume CRDs lets Longhorn re-adopt the surviving replica directories instead of forcing the block-device path. The playbook is idempotent — it re-diagnoses and escalates only as far as needed, so re-running after a partial recovery is safe.
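
To make the "latest dump" selection concrete, here is a minimal sketch of picking the newest backup_*.volumes file from a primary directory and then a fallback. The function name is hypothetical, the directories are passed in by the caller because this page does not give their paths, and "latest" is assumed to mean newest modification time; the playbook may decide differently.

```shell
# Hypothetical helper: prints the newest backup_*.volumes file found in the
# first directory (in argument order) that contains one.
latest_volumes_dump() {
  for dir in "$@"; do
    # ls -t sorts newest first by modification time; head keeps the first.
    dump=$(ls -t "$dir"/backup_*.volumes 2>/dev/null | head -n 1) || true
    if [ -n "$dump" ]; then
      echo "$dump"
      return 0
    fi
  done
  return 1   # no dump anywhere: the restore cannot proceed
}
```

The printed path is what a manual restore would hand to `kubectl apply -f`. Parsing `ls` output is acceptable here only because the dump names are simple; file names containing newlines would break it.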


longhorn_data.yml — block-device data recovery (CRDs gone)

This is the fallback when a nuclear reinstall has destroyed the Volume CRDs and assigned new engine IDs, leaving the real data in orphaned replica directories. It bypasses Kubernetes objects entirely and reconstructs the filesystem at the block level. It is driven by a vars file, vars/recovery_volumes.yml, with one entry per volume; the format is documented in longhorn_data_vars.example.yml.

ansible-playbook -i inventory/hosts.yml \
  playbooks/recover/longhorn_data.yml \
  -e @vars/recovery_volumes.yml

%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'14px'}}}%%
flowchart TD
  classDef pre fill:#5f4a1e,stroke:#d97706,color:#fffbeb;
  classDef merge fill:#4c1d95,stroke:#7c3aed,color:#f5f3ff;
  classDef k8s fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb;
  classDef done fill:#14532d,stroke:#22c55e,color:#f0fdf4;

  P0["Pre-flight + Phase 0:<br/>auto-discover largest replica dir (>16K)"]:::pre
  P1["Phase 1: back up untouched replica dir<br/>(safe copy before any op)"]:::merge
  P2["Phase 2: merge-longhorn-layers.py<br/>→ single .img · test-mount RO"]:::merge
  P3["Phase 3: create Volume CRD<br/>(scale down workload, clear stuck PVCs)"]:::k8s
  P5["Phase 5: attach via maintenance ticket<br/>→ /dev/longhorn/&lt;pv&gt;"]:::k8s
  P6["Phase 6: mkfs + rsync merged image<br/>into live block device"]:::merge
  P8["Phase 8: recreate PV (Retain) + PVC<br/>pinned by volumeName"]:::k8s
  P9["Phase 9: scale workload up · verify"]:::done

  P0 --> P1 --> P2 --> P3 --> P5 --> P6 --> P8 --> P9

  1. Pre-flight + Phase 0. Fail fast if no volumes are defined, the merge tool is missing, or Longhorn managers aren't Running. Then auto-discover the best replica source for each volume: the largest dir >16 MiB across pi1/pi2/pi3, skipping any replica still Rebuilding. source_node/source_dir in the vars file override this.
  2. Phase 1. cp -a the untouched replica dir to a backup location before touching anything, and verify it contains volume.meta.
  3. Phase 2. Run merge-longhorn-layers.py to collapse the snapshot + head .img layers into one image, then test-mount it read-only to confirm the filesystem is sound.
  4. Phase 3. Scale the workload to 0 and clear any stuck Terminating PV/PVCs before creating a fresh Longhorn Volume CRD (order matters — StatefulSet controllers re-provision empty PVCs otherwise).
  5. Phase 5. Attach the volume via a Longhorn VolumeAttachment maintenance ticket so /dev/longhorn/<pv> appears on the source node, with the frontend enabled.
  6. Phase 6. mkfs.ext4 the live block device if unformatted, then rsync the merged recovery image into it (--ignore-errors; rsync rc=23 partial-transfer is treated as success for power-cut partitions).
  7. Phase 8. Detach the recovery ticket, recreate the PV (Retain, no claimRef) and a PVC pinned by volumeName, and wait for Bound.
  8. Phase 9. Scale the workload back up, wait for ready replicas, and run the optional per-volume verify_cmd inside the pod.
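
The Phase 8 objects can be illustrated with a manifest generator. This is a sketch under stated assumptions rather than the playbook's own template: the function name and arguments are hypothetical, the PV name is assumed to equal the Longhorn volume name, and storageClassName longhorn plus ReadWriteOnce are assumed defaults. What it does take from the steps above is the Retain policy, the absence of a claimRef on the PV, and a PVC pinned by volumeName.

```shell
# Hypothetical generator for a Retain PV (no claimRef) and a PVC that is
# pinned to it by volumeName.
render_pv_and_pvc() {
  # $1: Longhorn volume / PV name   $2: namespace
  # $3: PVC name                    $4: size, e.g. 1Gi
  cat <<EOF
apiVersion: v1
kind: PersistentVolume
metadata:
  name: $1
spec:
  capacity:
    storage: $4
  volumeMode: Filesystem
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: longhorn
  csi:
    driver: driver.longhorn.io
    fsType: ext4
    volumeHandle: $1
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: $3
  namespace: $2
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn
  volumeName: $1
  resources:
    requests:
      storage: $4
EOF
}
```

Piping the output into `kubectl apply -f -` would create both objects. Because the PVC names the PV explicitly, the binding does not depend on dynamic provisioning, which is the point of pinning by volumeName.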

Caution

The merge-longhorn-layers.py tool is invoked per replica dir via dmsetup to stack the copy-on-write layers correctly. Never recover by simply renaming the orphaned replica directory to the new engine ID — Longhorn reconciliation can pick the empty new replica as the rebuild source and overwrite your data. The block-device injection is the only proven-safe path. The full method comparison is in the Longhorn PVC recovery ADR.

Note

Tested against the 2026-04-13 power cut. This block-device path was proven end to end by recovering the url-shortener's SQLite database after that power cut forced a nuclear Longhorn reinstall (verified 2026-04-14 with sqlite3 … 'SELECT COUNT(*) FROM urls;'). That scenario is the worked example in longhorn_data_vars.example.yml.


Gotchas

Warning

  • Run longhorn.yml first if there is any chance the CRDs survived. It is fast and idempotent; falling straight to longhorn_data.yml is unnecessary block-level work when a kubectl apply would have sufficed.
  • longhorn_data.yml needs a healthy Longhorn control plane. Its pre-flight aborts unless ≥1 longhorn-manager is Running — it recovers data into a working Longhorn, it does not bring Longhorn back. Use longhorn.yml for that.
  • Process volumes one at a time first. The example vars file recommends validating a single volume before batching — a misidentified source_dir can pin the PVC to the wrong (empty) replica.
  • python3 on every node. Phase 0's replica scan and the merge tool both require python3 on pi1/pi2/pi3.
  • The merge tool path is repo-relative. longhorn_data.yml resolves merge-longhorn-layers.py from docs/incidents/2026-04-13-power-cut/tools/ and copies it to the source node with scp, so run the playbook from inside the collection for that path to resolve.
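
When validating a single volume by hand, it can help to check which replica directory actually holds data before trusting a source_dir. The sketch below works on one node only and is not the playbook's Phase 0 scan. Its assumptions: the function name and parameters are hypothetical, replica directories are named after the volume with a suffix, the size threshold is supplied by the caller in KiB, and a rebuilding replica is recognised by a "Rebuilding": true field in its volume.meta.

```shell
# Hypothetical helper: prints the largest replica dir for a volume on this
# node, ignoring dirs at or below a size threshold and dirs still rebuilding.
largest_replica_dir() {
  min_kib=$1         # minimum on-disk size in KiB (caller-chosen threshold)
  replicas_root=$2   # directory that contains the replica dirs
  volume=$3          # volume name prefix of the replica dirs
  for dir in "$replicas_root"/"$volume"-*; do
    [ -d "$dir" ] || continue
    # Assumption: volume.meta is JSON with a boolean "Rebuilding" field.
    if grep -q '"Rebuilding": *true' "$dir/volume.meta" 2>/dev/null; then
      continue
    fi
    du -sk "$dir"    # prints "<KiB><TAB><path>"
  done | sort -rn | awk -v min="$min_kib" '$1 > min { print $2; exit }'
}
```

An empty result means no candidate passed the threshold on this node, which is a signal to check the other nodes before setting source_node and source_dir. The awk step splits on whitespace, so paths containing spaces are not supported.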

Why this is rehearsed

A recovery procedure run once under outage stress is a liability. These two playbooks — and the CRDs-present-vs-gone decision — are rehearsed deliberately in the production-like sandbox: kill the cluster, lose the engine IDs on a test volume, and walk both recovery paths back to green without risking production data. That turns the drill into routine QA rather than one-shot incident memory. See the PRD's QA strategy for how recovery drills become a regular exercise, and Storage & recovery for the full startup order these drills validate.


Where this branch sits

%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'14px'}}}%%
flowchart LR
  classDef done fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb;
  classDef here fill:#5f1e1e,stroke:#ef4444,color:#fef2f2;

  s05["05 · Backup<br/>(produces .volumes dump)"]:::done
  rec["recover/*<br/>longhorn.yml · longhorn_data.yml"]:::here
  s01["01 · System<br/>(rejoin pipeline)"]:::done

  s05 -. "on disaster" .-> rec
  rec -. "once recovered" .-> s01

  1. 05 · Backup produced the .volumes dump that longhorn.yml's restore phase replays.
  2. recover/ (this page) is invoked only on disaster — pick longhorn.yml (CRDs present) or longhorn_data.yml (CRDs gone).
  3. Once volumes are healthy, the cluster re-enters the normal pipeline at 01 · System, and you re-run a fresh 05 · Backup once everything is green.