docs(vibe): add factory-provisioning guidebook (Ansible + OpenTofu)

Deep, code-grounded tree-docs guidebook under vibe/guidebooks/factory-provisioning/, explored from the actual playbooks/roles and tofu code:

- Hub: the two provisioning engines (operator-run Ansible vs CI-applied OpenTofu), a green-field bring-up flow, master index, maintenance rule.
- ansible/ sub-tree: ordered pages 01-system .. 06-recover, an inventory & variables concept page, and a Tier-1/Tier-2 roles reference (hashicorp_vault, step_ca, crowdsec, pihole, deploy_docker_compose + the gitea_* family and helpers).
- opentofu/ sub-tree: factory-iac (Cloudflare/OVH/GCP/Gitea/Vault edge + cloudflare_token module), postgres-iac (per-app DB/role/pgbouncer lookup), ci-apply-flow (Gitea OIDC-JWT -> Vault -> auto-approve apply).

Cross-linked bidirectionally with the lab-ecosystem guidebook and the safe-env ADR/PRD (the sandbox rehearses exactly these engines). 14 mermaid diagrams MCP-validated; zero dead links.

Authored by the Lab Cartographer cohort.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
vibe/guidebooks/factory-provisioning/ansible/06-recover.md
[vibe](../../../README.md) > [Guidebooks](../../README.md) > [Factory provisioning](../README.md) > [Ansible](README.md) > **06 · Recover**

# 06 · Recover — Longhorn disaster recovery

> [!NOTE]
> **Status:** 🟡 beta · **Last Updated:** 2026-06-23
> **Upstream:** [Ansible sub-hub](README.md) · [Factory provisioning hub](../README.md)
> **Downstream:** [05 · Backup](05-backup.md): the dumps these playbooks consume
> **Related:** [Storage & recovery](../../lab-ecosystem/storage-and-recovery.md) · [PRD — QA strategy](../../../PRD/safe-prod-like-environment/qa-strategy.md) · [Longhorn PVC recovery ADR](../../../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md)

The `recover/` playbooks are **not** part of the linear `01..05` pipeline. They are an **on-demand disaster-recovery branch**, invoked only after a power cut or data loss. There are two, and which one you run depends on a single question: **do the Longhorn Volume CRDs still exist?**

> [!IMPORTANT]
> **Decision: pick the right playbook before you start.**
> - **Volume CRDs still present** (e.g. they were captured by the [05 · Backup k3s_pvc dump](05-backup.md), or never wiped) → run [`recover/longhorn.yml`](../../../../ansible/arcodange/factory/playbooks/recover/longhorn.yml). Fast: it re-applies the CRDs and the surviving on-disk replicas are re-adopted.
> - **Volume CRDs are GONE** (a nuclear Longhorn reinstall assigned new engine IDs) but the raw replica `.img` files survive on disk → run [`recover/longhorn_data.yml`](../../../../ansible/arcodange/factory/playbooks/recover/longhorn_data.yml). Slow: it merges replica layers at the block-device level and injects the data into a fresh volume.

```mermaid
%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'14px'}}}%%
flowchart TD
classDef q fill:#5f4a1e,stroke:#d97706,color:#fffbeb;
classDef fast fill:#14532d,stroke:#22c55e,color:#f0fdf4;
classDef slow fill:#5f1e1e,stroke:#ef4444,color:#fef2f2;
classDef dead fill:#6b7280,stroke:#4b5563,color:#fff;

Q{"Do the Longhorn<br/>Volume CRDs<br/>still exist?"}:::q
F["longhorn.yml<br/>CSI/CRD recovery (fast)"]:::fast
S{"Raw replica<br/>.img files<br/>survive?"}:::q
D["longhorn_data.yml<br/>block-device recovery (slow)"]:::slow
X["Data unrecoverable<br/>(replicas zeroed)"]:::dead

Q -- "yes" --> F
Q -- "no" --> S
S -- "yes" --> D
S -- "no" --> X
```

1. **CRDs present?** Yes → `longhorn.yml` re-applies the Volume CRDs and the on-disk replicas re-attach. This is the fast path.
2. **CRDs gone?** Then ask whether the raw replica `.img` files survived on disk.
3. **Replicas survive?** Yes → `longhorn_data.yml` reconstructs the filesystem at the block level and injects it into a new volume.
4. **Replicas zeroed** by Longhorn reconciliation → the data is unrecoverable, and there is no playbook for this.
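
The decision above can be sketched as a small shell helper. This is a minimal sketch, not something the playbooks ship: all three function names are hypothetical, and the `replicas/` subdirectory is an assumption based on Longhorn's default on-disk layout. It only queries the live API server, so CRDs that survive solely in a `backup_*.volumes` dump are not counted, and counting `.img` files cannot tell whether a replica was zeroed.

```sh
# Hypothetical helpers; none of these functions ship with the repo.

# Count Longhorn Volume objects still known to the API server.
count_volume_crs() {
  kubectl -n longhorn-system get volumes.longhorn.io --no-headers 2>/dev/null \
    | wc -l | tr -d ' '
}

# Count raw replica images on this node (run it on each Pi).
# Assumption: replicas live under <data dir>/replicas.
count_replica_imgs() {
  find "${1:-/mnt/arcodange/longhorn}/replicas" -name '*.img' 2>/dev/null \
    | wc -l | tr -d ' '
}

# $1 = number of Volume objects, $2 = number of surviving .img files.
# Prints the playbook to try first; returns non-zero if nothing is left to recover.
choose_playbook() {
  if [ "$1" -gt 0 ]; then
    echo "recover/longhorn.yml"
  elif [ "$2" -gt 0 ]; then
    echo "recover/longhorn_data.yml"
  else
    echo "unrecoverable"
    return 1
  fi
}
```

For example, `choose_playbook "$(count_volume_crs)" "$(count_replica_imgs)"` run on a node prints the playbook to try first.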

> [!NOTE]
> This branch sits at step 1 of the broader tested startup order: **Longhorn first, then Vault unseal, then VSO re-auth, ERP scaled up last**. The full order, the engine-ID failure mode, and the once-real-once-rehearsed history are in [Storage & recovery](../../lab-ecosystem/storage-and-recovery.md). The single tested-recovery record (1-key/threshold-1 unseal, the four-step order) lives in CLUSTER_RECOVERY.md, kept at the lab root outside this repo.

---

## `longhorn.yml` — CSI/CRD recovery (CRDs present)

Runs against `raspberries:&local` as root. It diagnoses how broken Longhorn is and applies the **least invasive** fix that works, escalating only if needed. Most logic runs `run_once` on `pi1`, delegating cluster reads to `localhost`.

| Phase | What it does |
| --- | --- |
| **0 · Pre-flight** | Verifies the data dir `/mnt/arcodange/longhorn` exists on `pi1` (fails hard if missing) and that at least one `backup_*.volumes` dump exists in the primary or fallback backup dir. |
| **1 · Diagnosis** | Checks the `longhorn-system` namespace, the `driver.longhorn.io` **CSIDriver** registration, and the `longhorn-manager` pods, then sets `recovery_phase` = `soft` (CSI driver gone), `hard` (managers unhealthy), or `none`. |
| **2 · Soft** | Touches `longhorn-install.yaml` to make k3s reconcile the HelmChart, waits, and checks pods recreate. |
| **3 · Hard** | Force-deletes the `longhorn-driver-deployer` pods so the HelmChart recreates them. |
| **4 · Nuclear** | Full reinstall: delete the HelmChart, strip finalizers off all Longhorn CRs / PVCs / the namespace, delete + redeploy the `longhorn-install` HelmChart manifest (`v1.9.1`, `defaultDataPath` preserved), wait for pods. |
| **5 · Restore** | Waits for managers to be ready, then `kubectl apply`s the latest `backup_*.volumes` dump (PV/PVC + Longhorn CRDs) and any `longhorn_metadata_*.yaml`. |
| **6 · Verify** | Polls until the CSIDriver is registered, ≥3 managers are Running, the CSI socket exists, and the replica data dir is present; prints a summary. |
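
The Diagnosis phase boils down to a few read-only cluster queries. The sketch below shows equivalent `kubectl` checks and one plausible way to map them onto `recovery_phase`. The function names, the "expected manager count" argument, and the precedence of `soft` over `hard` when both conditions hold are assumptions made for illustration; the playbook itself is the source of truth.

```sh
# Hypothetical read-only checks mirroring the Diagnosis phase.

# Succeeds if the Longhorn CSIDriver object is registered.
csi_driver_registered() {
  kubectl get csidriver driver.longhorn.io >/dev/null 2>&1
}

# Prints the number of longhorn-manager pods in phase Running.
running_managers() {
  kubectl -n longhorn-system get pods -l app=longhorn-manager \
    --field-selector=status.phase=Running --no-headers 2>/dev/null \
    | wc -l | tr -d ' '
}

# $1 = "yes" or "no" (CSIDriver registered)
# $2 = Running managers, $3 = expected managers
# Assumption: a missing CSIDriver wins over unhealthy managers.
classify_phase() {
  if [ "$1" != "yes" ]; then
    echo soft
  elif [ "$2" -lt "$3" ]; then
    echo hard
  else
    echo none
  fi
}
```

A call such as `classify_phase yes "$(running_managers)" 3` prints `hard` when fewer than three managers are Running.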

> [!IMPORTANT]
> Phase 5 is exactly where the [05 · Backup k3s_pvc dump](05-backup.md) pays off: re-applying the captured **Volume CRDs** lets Longhorn re-adopt the surviving replica directories instead of forcing the block-device path. The playbook is **idempotent**: it re-diagnoses and escalates only as far as needed, so re-running after a partial recovery is safe.
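
To make the restore step concrete, here is a minimal sketch of locating the newest dump by hand. The function name and the `BACKUP_DIR` placeholder are hypothetical, and "latest" is assumed to mean most recently modified; the playbook may select the dump differently.

```sh
# Hypothetical helper: print the newest backup_*.volumes dump in directory $1.
# Assumption: "latest" means most recently modified (ls -t ordering).
# Prints nothing if the directory holds no matching dump.
latest_volumes_dump() {
  ls -1t "$1"/backup_*.volumes 2>/dev/null | head -n 1
}

# Replaying it by hand would then look like this (not executed here;
# BACKUP_DIR is a placeholder for your primary or fallback backup dir):
#   kubectl apply -f "$(latest_volumes_dump "$BACKUP_DIR")"
```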

---

## `longhorn_data.yml` — block-device data recovery (CRDs gone)

This is the fallback when a nuclear reinstall has destroyed the Volume CRDs and assigned new engine IDs, leaving the real data in **orphaned** replica directories. It does not depend on the lost Kubernetes objects: it reconstructs the filesystem at the block level and injects it into a fresh volume. It is **driven by a vars file** (`vars/recovery_volumes.yml`, one entry per volume) whose format is documented in [`longhorn_data_vars.example.yml`](../../../../ansible/arcodange/factory/playbooks/recover/longhorn_data_vars.example.yml).

```sh
ansible-playbook -i inventory/hosts.yml \
  playbooks/recover/longhorn_data.yml \
  -e @vars/recovery_volumes.yml
```

```mermaid
%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'14px'}}}%%
flowchart TD
classDef pre fill:#5f4a1e,stroke:#d97706,color:#fffbeb;
classDef merge fill:#4c1d95,stroke:#7c3aed,color:#f5f3ff;
classDef k8s fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb;
classDef done fill:#14532d,stroke:#22c55e,color:#f0fdf4;

P0["Pre-flight + Phase 0:<br/>auto-discover largest replica dir (>16K)"]:::pre
P1["Phase 1: back up untouched replica dir<br/>(safe copy before any op)"]:::merge
P2["Phase 2: merge-longhorn-layers.py<br/>→ single .img · test-mount RO"]:::merge
P3["Phase 3: create Volume CRD<br/>(scale down workload, clear stuck PVCs)"]:::k8s
P5["Phase 5: attach via maintenance ticket<br/>→ /dev/longhorn/<pv>"]:::k8s
P6["Phase 6: mkfs + rsync merged image<br/>into live block device"]:::merge
P8["Phase 8: recreate PV (Retain) + PVC<br/>pinned by volumeName"]:::k8s
P9["Phase 9: scale workload up · verify"]:::done

P0 --> P1 --> P2 --> P3 --> P5 --> P6 --> P8 --> P9
```

1. **Pre-flight + Phase 0.** Fail fast if no volumes are defined, the merge tool is missing, or Longhorn managers aren't Running. Then **auto-discover** the best replica source for each volume: the **largest dir (>16K)** across `pi1/pi2/pi3`, skipping any replica still `Rebuilding`. `source_node`/`source_dir` in the vars file override this.
2. **Phase 1.** `cp -a` the untouched replica dir to a backup location *before* touching anything, and verify it contains `volume.meta`.
3. **Phase 2.** Run `merge-longhorn-layers.py` to collapse the snapshot + head `.img` layers into one image, then test-mount it read-only to confirm the filesystem is sound.
4. **Phase 3.** Scale the workload to 0 and clear any stuck `Terminating` PV/PVCs *before* creating a fresh Longhorn `Volume` CRD (order matters: StatefulSet controllers re-provision empty PVCs otherwise).
5. **Phase 5.** Attach the volume via a Longhorn `VolumeAttachment` **maintenance ticket** so `/dev/longhorn/<pv>` appears on the source node, with the frontend enabled.
6. **Phase 6.** `mkfs.ext4` the live block device if unformatted, then `rsync` the merged recovery image into it (`--ignore-errors`; rsync rc=23 partial-transfer is treated as success for power-cut partitions).
7. **Phase 8.** Detach the recovery ticket, recreate the PV (`Retain`, no `claimRef`) and a PVC pinned by `volumeName`, and wait for Bound.
8. **Phase 9.** Scale the workload back up, wait for ready replicas, and run the optional per-volume `verify_cmd` inside the pod.
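
Two pieces of the list above are easy to get wrong by hand: choosing the replica source in Phase 0 and judging the rsync result in Phase 6. The sketch below illustrates both under stated assumptions. The input format (`node dir size_kib state`), both function names, and passing the size threshold as a parameter are hypothetical; the playbook's own implementation may differ.

```sh
# Hypothetical Phase 0 helper.
# stdin: one replica per line: "<node> <dir> <size_kib> <state>"
# $1:    minimum size in KiB; a replica must be strictly larger to qualify
# Prints "<node> <dir>" for the largest qualifying replica.
# Returns non-zero if no replica qualifies.
pick_source() {
  awk -v min="$1" '
    $4 != "Rebuilding" && $3 + 0 > min + 0 && $3 + 0 > best {
      best = $3 + 0
      pick = $1 " " $2
    }
    END { if (pick == "") exit 1; print pick }
  '
}

# Hypothetical Phase 6 helper.
# rsync exit code 23 means "partial transfer due to error"; the playbook
# treats it as success, so this accepts 0 and 23 and rejects everything else.
rsync_rc_ok() {
  [ "$1" -eq 0 ] || [ "$1" -eq 23 ]
}
```

Feeding `pick_source 16` the lines `pi1 /r/a 16 Healthy`, `pi2 /r/b 204800 Rebuilding`, and `pi3 /r/c 51200 Healthy` selects `pi3 /r/c`: the first replica is not above the threshold and the second is still rebuilding, even though it is the largest.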

> [!CAUTION]
> The `merge-longhorn-layers.py` tool is invoked **per replica dir via `dmsetup`** to stack the copy-on-write layers correctly. Never recover by simply renaming the orphaned replica directory to the new engine ID: Longhorn reconciliation can pick the *empty* new replica as the rebuild source and **overwrite your data**. The block-device injection is the only proven-safe path. The full method comparison is in the [Longhorn PVC recovery ADR](../../../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md).

> [!NOTE]
> **Tested after the 2026-04-13 power cut.** This block-device path was proven end to end by recovering the **url-shortener's SQLite database** after that power cut forced a nuclear Longhorn reinstall (verified `2026-04-14` with `sqlite3 … 'SELECT COUNT(*) FROM urls;'`). That scenario is the worked example in [`longhorn_data_vars.example.yml`](../../../../ansible/arcodange/factory/playbooks/recover/longhorn_data_vars.example.yml).

---

## Gotchas

> [!WARNING]
> - **Run `longhorn.yml` first if there is any chance the CRDs survived.** It is fast and idempotent; falling straight to `longhorn_data.yml` is unnecessary block-level work when a `kubectl apply` would have sufficed.
> - **`longhorn_data.yml` needs a healthy Longhorn control plane.** Its pre-flight aborts unless ≥1 `longhorn-manager` is Running: it recovers *data into* a working Longhorn, it does not bring Longhorn back. Use `longhorn.yml` for that.
> - **Process volumes one at a time first.** The example vars file recommends validating a single volume before batching, because a misidentified `source_dir` can pin the PVC to the wrong (empty) replica.
> - **`python3` on every node.** Phase 0's replica scan and the merge tool both require `python3` on `pi1/pi2/pi3`.
> - **The merge tool path is repo-relative.** `longhorn_data.yml` resolves `merge-longhorn-layers.py` from `docs/incidents/2026-04-13-power-cut/tools/` and `scp`s it to the source node. Run the playbook from inside the collection so that path resolves.
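
Several of these gotchas can be checked before starting. The following is a minimal sketch with hypothetical function names. It assumes the inventory path shown earlier on this page, and that the `raspberries:&local` pattern used by `longhorn.yml` also selects the nodes that matter here.

```sh
# Hypothetical pre-flight helpers; none of them ship with the repo.

# $1 = collection root (defaults to the current directory).
# Succeeds only if the merge tool resolves at its repo-relative path.
check_merge_tool() {
  tool="${1:-.}/docs/incidents/2026-04-13-power-cut/tools/merge-longhorn-layers.py"
  if [ -f "$tool" ]; then
    echo "ok: $tool"
  else
    echo "missing: $tool" >&2
    return 1
  fi
}

# Ad-hoc Ansible run that fails on any node where python3 does not run.
check_python3_on_nodes() {
  ansible -i inventory/hosts.yml 'raspberries:&local' \
    -m command -a 'python3 --version'
}

# Succeeds if at least one longhorn-manager pod is Running.
check_manager_running() {
  n=$(kubectl -n longhorn-system get pods -l app=longhorn-manager \
        --field-selector=status.phase=Running --no-headers 2>/dev/null \
        | wc -l | tr -d ' ')
  [ "$n" -ge 1 ]
}
```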

---

## Why this is rehearsed

A recovery procedure run only once, under outage stress, is a liability. These two playbooks, and the CRDs-present-vs-gone decision, are **rehearsed deliberately in the production-like sandbox**: kill the cluster, lose the engine IDs on a test volume, and walk both recovery paths back to green without risking production data. That turns the drill into routine QA rather than one-shot incident memory. See the PRD's [QA strategy](../../../PRD/safe-prod-like-environment/qa-strategy.md) for how recovery drills become a regular exercise, and [Storage & recovery](../../lab-ecosystem/storage-and-recovery.md) for the full startup order these drills validate.

---

## Where this branch sits

```mermaid
%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'14px'}}}%%
flowchart LR
classDef done fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb;
classDef here fill:#5f1e1e,stroke:#ef4444,color:#fef2f2;

s05["05 · Backup<br/>(produces .volumes dump)"]:::done
rec["recover/*<br/>longhorn.yml · longhorn_data.yml"]:::here
s01["01 · System<br/>(rejoin pipeline)"]:::done

s05 -. "on disaster" .-> rec
rec -. "once recovered" .-> s01
```

1. **05 · Backup** produced the `.volumes` dump that `longhorn.yml`'s restore phase replays.
2. **recover/** (this page) is invoked only on disaster: pick `longhorn.yml` (CRDs present) or `longhorn_data.yml` (CRDs gone).
3. Once volumes are healthy, the cluster **re-enters the normal pipeline** at [01 · System](01-system.md), and you re-run a fresh [05 · Backup](05-backup.md) once everything is green.