docs(vibe): add factory-provisioning guidebook (Ansible + OpenTofu)

Deep, code-grounded tree-docs guidebook under vibe/guidebooks/factory-provisioning/, explored from the actual playbooks/roles and tofu code:

- Hub: the two provisioning engines (operator-run Ansible vs CI-applied OpenTofu), a green-field bring-up flow, master index, maintenance rule.
- ansible/ sub-tree: ordered pages 01-system .. 06-recover, an inventory & variables concept page, and a Tier-1/Tier-2 roles reference (hashicorp_vault, step_ca, crowdsec, pihole, deploy_docker_compose + the gitea_* family and helpers).
- opentofu/ sub-tree: factory-iac (Cloudflare/OVH/GCP/Gitea/Vault edge + cloudflare_token module), postgres-iac (per-app DB/role/pgbouncer lookup), ci-apply-flow (Gitea OIDC-JWT -> Vault -> auto-approve apply).

Cross-linked bidirectionally with the lab-ecosystem guidebook and the safe-env ADR/PRD (the sandbox rehearses exactly these engines).

14 mermaid diagrams MCP-validated; zero dead links. Authored by the Lab Cartographer cohort.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 21:11:51 +02:00
parent b886f06824
commit dbe32161dc
16 changed files with 1571 additions and 0 deletions
--- a/vibe/guidebooks/factory-provisioning/ansible/06-recover.md
+++ b/vibe/guidebooks/factory-provisioning/ansible/06-recover.md
@@ -0,0 +1,149 @@
+[vibe](../../../README.md) > [Guidebooks](../../README.md) > [Factory provisioning](../README.md) > [Ansible](README.md) > **06 · Recover**
+
+# 06 · Recover — Longhorn disaster recovery
+
+> [!NOTE]
+> **Status:** 🟡 beta · **Last Updated:** 2026-06-23
+> **Upstream:** [Ansible sub-hub](README.md) · [Factory provisioning hub](../README.md)
+> **Downstream:** [05 · Backup](05-backup.md) — the dumps these playbooks consume
+> **Related:** [Storage & recovery](../../lab-ecosystem/storage-and-recovery.md) · [PRD — QA strategy](../../../PRD/safe-prod-like-environment/qa-strategy.md) · [Longhorn PVC recovery ADR](../../../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md)
+
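+The `recover/` playbooks are **not** part of the linear `01..05` pipeline; they are an **on-demand disaster-recovery branch**, invoked only after a power cut or data loss. There are two, and which one you run depends on a single question: **do the Longhorn Volume CRDs still exist?** (Strictly, these are the per-volume `Volume` custom resources rather than the CustomResourceDefinition itself; this page uses "Volume CRDs" for them throughout.)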
+
+> [!IMPORTANT]
+> **Decision — pick the right playbook before you start:**
+> - **Volume CRDs still present** (e.g. they were captured by the [05 · Backup k3s_pvc dump](05-backup.md), or never wiped) → run [`recover/longhorn.yml`](../../../../ansible/arcodange/factory/playbooks/recover/longhorn.yml). Fast: it re-applies the CRDs and the surviving on-disk replicas are re-adopted.
+> - **Volume CRDs are GONE** (a nuclear Longhorn reinstall assigned new engine IDs) but the raw replica `.img` files survive on disk → run [`recover/longhorn_data.yml`](../../../../ansible/arcodange/factory/playbooks/recover/longhorn_data.yml). Slow: it merges replica layers at the block-device level and injects the data into a fresh volume.
+
+```mermaid
+%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'14px'}}}%%
+flowchart TD
+  classDef q fill:#5f4a1e,stroke:#d97706,color:#fffbeb;
+  classDef fast fill:#14532d,stroke:#22c55e,color:#f0fdf4;
+  classDef slow fill:#5f1e1e,stroke:#ef4444,color:#fef2f2;
+  classDef dead fill:#6b7280,stroke:#4b5563,color:#fff;
+
+  Q{"Do the Longhorn<br/>Volume CRDs<br/>still exist?"}:::q
+  F["longhorn.yml<br/>CSI/CRD recovery (fast)"]:::fast
+  S{"Raw replica<br/>.img files<br/>survive?"}:::q
+  D["longhorn_data.yml<br/>block-device recovery (slow)"]:::slow
+  X["Data unrecoverable<br/>(replicas zeroed)"]:::dead
+
+  Q -- "yes" --> F
+  Q -- "no" --> S
+  S -- "yes" --> D
+  S -- "no" --> X
+```
+
+1. **CRDs present?** Yes → `longhorn.yml` re-applies the Volume CRDs and the on-disk replicas re-attach. This is the fast path.
+2. **CRDs gone?** Then ask whether the raw replica `.img` files survived on disk.
+3. **Replicas survive?** Yes → `longhorn_data.yml` reconstructs the filesystem at the block level and injects it into a new volume.
+4. **Replicas zeroed** by Longhorn reconciliation → the data is unrecoverable; there is no playbook for this.
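+
+The first question in the list above can be checked read-only against the live cluster. The play below is a minimal sketch, not a task taken from `recover/longhorn.yml`: it assumes the `kubernetes.core` collection is installed, that a working kubeconfig is available on the control host, and that Longhorn serves its `Volume` objects under `longhorn.io/v1beta2` (an assumption for this cluster's Longhorn version). An empty result only describes the live cluster; Volume objects captured in a [05 · Backup](05-backup.md) dump still count as present.
+
+```yaml
+# Sketch only: a read-only check, not part of the recover/ playbooks.
+- name: Check whether Longhorn Volume objects still exist
+  hosts: localhost
+  gather_facts: false
+  tasks:
+    - name: List Longhorn Volume custom resources
+      kubernetes.core.k8s_info:
+        api_version: longhorn.io/v1beta2   # assumed API version
+        kind: Volume
+        namespace: longhorn-system
+      register: longhorn_volumes
+
+    - name: Report which recovery path to look at first
+      ansible.builtin.debug:
+        msg: >-
+          {{ 'Volume objects found: start with recover/longhorn.yml'
+             if (longhorn_volumes.resources | length > 0)
+             else 'No Volume objects in the live cluster: check the backup dump and the replica .img files' }}
+```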
+
+> [!NOTE]
+> This branch sits at step 1 of the broader tested startup order — **Longhorn first, then Vault unseal, then VSO re-auth, ERP scaled up last**. The full order, the engine-ID failure mode, and the once-real-once-rehearsed history are in [Storage & recovery](../../lab-ecosystem/storage-and-recovery.md). The single tested-recovery record (1-key/threshold-1 unseal, the four-step order) lives in CLUSTER_RECOVERY.md, kept at the lab root outside this repo.
+
+---
+
+## `longhorn.yml` — CSI/CRD recovery (CRDs present)
+
+Runs against `raspberries:&local` as root. It diagnoses how broken Longhorn is and applies the **least invasive** fix that works, escalating only if needed. Most logic runs `run_once` on `pi1`, delegating cluster reads to `localhost`.
+
+| Phase | What it does |
+| --- | --- |
+| **0 · Pre-flight** | Verifies the data dir `/mnt/arcodange/longhorn` exists on `pi1` (fails hard if missing) and that at least one `backup_*.volumes` dump exists in the primary or fallback backup dir. |
+| **1 · Diagnosis** | Checks the `longhorn-system` namespace, the `driver.longhorn.io` **CSIDriver** registration, and the `longhorn-manager` pods, then sets `recovery_phase` = `soft` (CSI driver gone), `hard` (managers unhealthy), or `none`. |
+| **2 · Soft** | Touches `longhorn-install.yaml` to make k3s reconcile the HelmChart, waits, and checks that the pods are recreated. |
+| **3 · Hard** | Force-deletes the `longhorn-driver-deployer` pods so the HelmChart recreates them. |
+| **4 · Nuclear** | Full reinstall: delete the HelmChart, strip finalizers off all Longhorn CRs / PVCs / the namespace, delete + redeploy the `longhorn-install` HelmChart manifest (`v1.9.1`, `defaultDataPath` preserved), wait for pods. |
+| **5 · Restore** | Waits for managers to be ready, then `kubectl apply`s the latest `backup_*.volumes` dump (PV/PVC + Longhorn CRDs) and any `longhorn_metadata_*.yaml`. |
+| **6 · Verify** | Polls until the CSIDriver is registered, ≥3 managers are Running, the CSI socket exists, and the replica data dir is present; prints a summary. |
+
+> [!IMPORTANT]
+> Phase 5 is exactly where the [05 · Backup k3s_pvc dump](05-backup.md) pays off: re-applying the captured **Volume CRDs** lets Longhorn re-adopt the surviving replica directories instead of forcing the block-device path. The playbook is **idempotent** — it re-diagnoses and escalates only as far as needed, so re-running after a partial recovery is safe.
+
+---
+
+## `longhorn_data.yml` — block-device data recovery (CRDs gone)
+
+This is the fallback when a nuclear reinstall has destroyed the Volume CRDs and assigned new engine IDs, leaving the real data in **orphaned** replica directories. It bypasses Kubernetes objects entirely and reconstructs the filesystem at the block level. It is **driven by a vars file** — `vars/recovery_volumes.yml`, one entry per volume — and the format is documented in [`longhorn_data_vars.example.yml`](../../../../ansible/arcodange/factory/playbooks/recover/longhorn_data_vars.example.yml).
+
+```sh
+ansible-playbook -i inventory/hosts.yml \
+  playbooks/recover/longhorn_data.yml \
+  -e @vars/recovery_volumes.yml
+```
+
+```mermaid
+%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'14px'}}}%%
+flowchart TD
+  classDef pre fill:#5f4a1e,stroke:#d97706,color:#fffbeb;
+  classDef merge fill:#4c1d95,stroke:#7c3aed,color:#f5f3ff;
+  classDef k8s fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb;
+  classDef done fill:#14532d,stroke:#22c55e,color:#f0fdf4;
+
+  P0["Pre-flight + Phase 0:<br/>auto-discover largest replica dir (>16K)"]:::pre
+  P1["Phase 1: back up untouched replica dir<br/>(safe copy before any op)"]:::merge
+  P2["Phase 2: merge-longhorn-layers.py<br/>→ single .img · test-mount RO"]:::merge
+  P3["Phase 3: create Volume CRD<br/>(scale down workload, clear stuck PVCs)"]:::k8s
+  P5["Phase 5: attach via maintenance ticket<br/>→ /dev/longhorn/&lt;pv&gt;"]:::k8s
+  P6["Phase 6: mkfs + rsync merged image<br/>into live block device"]:::merge
+  P8["Phase 8: recreate PV (Retain) + PVC<br/>pinned by volumeName"]:::k8s
+  P9["Phase 9: scale workload up · verify"]:::done
+
+  P0 --> P1 --> P2 --> P3 --> P5 --> P6 --> P8 --> P9
+```
+
+1. **Pre-flight + Phase 0.** Fail fast if no volumes are defined, the merge tool is missing, or Longhorn managers aren't Running. Then **auto-discover** the best replica source for each volume — the **largest dir >16 MiB** across `pi1/pi2/pi3`, skipping any replica still `Rebuilding`. `source_node`/`source_dir` in the vars file override this.
+2. **Phase 1.** `cp -a` the untouched replica dir to a backup location *before* touching anything, and verify it contains `volume.meta`.
+3. **Phase 2.** Run `merge-longhorn-layers.py` to collapse the snapshot + head `.img` layers into one image, then test-mount it read-only to confirm the filesystem is sound.
+4. **Phase 3.** Scale the workload to 0 and clear any stuck `Terminating` PV/PVCs *before* creating a fresh Longhorn `Volume` CRD (order matters — StatefulSet controllers re-provision empty PVCs otherwise).
+5. **Phase 5.** Attach the volume via a Longhorn `VolumeAttachment` **maintenance ticket** so `/dev/longhorn/<pv>` appears on the source node, with the frontend enabled.
+6. **Phase 6.** `mkfs.ext4` the live block device if it is unformatted, then `rsync` the merged recovery image into it (`--ignore-errors`; rsync exit code 23, a partial transfer, is treated as success because a partition damaged by the power cut can contain files that cannot be copied).
+7. **Phase 8.** Detach the recovery ticket, recreate the PV (`Retain`, no `claimRef`) and a PVC pinned by `volumeName`, and wait for Bound.
+8. **Phase 9.** Scale the workload back up, wait for ready replicas, and run the optional per-volume `verify_cmd` inside the pod.
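+
+The Phase 8 objects can be pictured as the pair of manifests below. This is a hedged sketch of the general Kubernetes pattern (a `Retain` PV with no `claimRef`, and a PVC pinned to it by `volumeName`), not the manifests that `longhorn_data.yml` actually renders. Every name, the namespace, the size, and the `longhorn` storage class are hypothetical placeholders; the `volumeHandle` is assumed to equal the Longhorn volume name.
+
+```yaml
+# Sketch only: all names, sizes and the storage class are placeholders.
+apiVersion: v1
+kind: PersistentVolume
+metadata:
+  name: pvc-example                 # hypothetical; matches the Longhorn volume name
+spec:
+  capacity:
+    storage: 1Gi                    # hypothetical size
+  accessModes:
+    - ReadWriteOnce
+  persistentVolumeReclaimPolicy: Retain   # keep the data if the claim is deleted
+  storageClassName: longhorn        # assumed storage class name
+  csi:
+    driver: driver.longhorn.io
+    fsType: ext4
+    volumeHandle: pvc-example       # assumed to equal the Longhorn volume name
+  # no claimRef: the PV stays Available until the PVC below names it
+---
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+  name: data-example                # hypothetical claim name
+  namespace: example                # hypothetical namespace
+spec:
+  accessModes:
+    - ReadWriteOnce
+  storageClassName: longhorn        # must match the PV for the bind to succeed
+  volumeName: pvc-example           # pins the claim to the recovered PV
+  resources:
+    requests:
+      storage: 1Gi                  # must not exceed the PV capacity
+```
+
+Pinning by `volumeName` prevents dynamic provisioning from handing the workload a new, empty volume, which is the failure mode the scale-down in Phase 3 also guards against.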
+
+> [!CAUTION]
+> The `merge-longhorn-layers.py` tool is invoked **per replica dir via `dmsetup`** to stack the copy-on-write layers correctly. Never recover by simply renaming the orphaned replica directory to the new engine ID — Longhorn reconciliation can pick the *empty* new replica as the rebuild source and **overwrite your data**. The block-device injection is the only proven-safe path. The full method comparison is in the [Longhorn PVC recovery ADR](../../../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md).
+
+> [!NOTE]
+> **Tested after the 2026-04-13 power cut.** This block-device path was proven end to end by recovering the **url-shortener's SQLite database** after that power cut forced a nuclear Longhorn reinstall (verified on `2026-04-14` with `sqlite3 … 'SELECT COUNT(*) FROM urls;'`). That scenario is the worked example in [`longhorn_data_vars.example.yml`](../../../../ansible/arcodange/factory/playbooks/recover/longhorn_data_vars.example.yml).
+
+---
+
+## Gotchas
+
+> [!WARNING]
+> - **Run `longhorn.yml` first if there is any chance the CRDs survived.** It is fast and idempotent; falling straight to `longhorn_data.yml` is unnecessary block-level work when a `kubectl apply` would have sufficed.
+> - **`longhorn_data.yml` needs a healthy Longhorn control plane.** Its pre-flight aborts unless ≥1 `longhorn-manager` is Running — it recovers *data into* a working Longhorn, it does not bring Longhorn back. Use `longhorn.yml` for that.
+> - **Process volumes one at a time first.** The example vars file recommends validating a single volume before batching — a misidentified `source_dir` can pin the PVC to the wrong (empty) replica.
+> - **`python3` on every node.** Phase 0's replica scan and the merge tool both require `python3` on `pi1/pi2/pi3`.
+> - **The merge tool path is repo-relative.** `longhorn_data.yml` resolves `merge-longhorn-layers.py` from `docs/incidents/2026-04-13-power-cut/tools/` and `scp`s it to the source node — run the playbook from inside the collection so that path resolves.
+
+---
+
+## Why this is rehearsed
+
+A recovery procedure run once under outage stress is a liability. These two playbooks — and the CRDs-present-vs-gone decision — are **rehearsed deliberately in the production-like sandbox**: kill the cluster, lose the engine IDs on a test volume, and walk both recovery paths back to green without risking production data. That turns the drill into routine QA rather than one-shot incident memory. See the PRD's [QA strategy](../../../PRD/safe-prod-like-environment/qa-strategy.md) for how recovery drills become a regular exercise, and [Storage & recovery](../../lab-ecosystem/storage-and-recovery.md) for the full startup order these drills validate.
+
+---
+
+## Where this branch sits
+
+```mermaid
+%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'14px'}}}%%
+flowchart LR
+  classDef done fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb;
+  classDef here fill:#5f1e1e,stroke:#ef4444,color:#fef2f2;
+
+  s05["05 · Backup<br/>(produces .volumes dump)"]:::done
+  rec["recover/*<br/>longhorn.yml · longhorn_data.yml"]:::here
+  s01["01 · System<br/>(rejoin pipeline)"]:::done
+
+  s05 -. "on disaster" .-> rec
+  rec -. "once recovered" .-> s01
+```
+
+1. **05 · Backup** produced the `.volumes` dump that `longhorn.yml`'s restore phase replays.
+2. **recover/** (this page) is invoked only on disaster — pick `longhorn.yml` (CRDs present) or `longhorn_data.yml` (CRDs gone).
+3. Once volumes are healthy, the cluster **re-enters the normal pipeline** at [01 · System](01-system.md), and you re-run a fresh [05 · Backup](05-backup.md) once everything is green.