docs(longhorn): document 2026-04-13 power-cut recovery + add data-recovery tooling

Captures the post-mortem of the April 13 power cut: incident timeline,
retrospective, and architecture/role diagrams. Adds an ADR explaining why
Longhorn cannot re-associate orphaned replica directories after a nuclear
reinstall (engine-id naming), plus block-device recovery runbooks and the
`playbooks/recover/longhorn_data.yml` automation that wires `merge-longhorn-layers.py`
to rebuild PVCs from raw `volume-head-*.img` chains.

Also extends the k3s_pvc backup to capture Longhorn `volumes`/`settings` CRDs
(needed for the fast-path restore) and rewrites the restore script with a
fallback dir + English messages.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

# Cluster Recovery Agent Instructions
You are recovering the Arcodange homelab k3s cluster after an outage (power cut, node failure, or
Longhorn reinstall). Your job is to assess damage, run the appropriate Ansible playbooks and
kubectl commands, and bring the cluster back to a fully healthy state.
You do NOT need to modify any code. All recovery tooling already exists.
---
## Cluster Overview
| Component | Details |
|-----------|---------|
| Nodes | pi1, pi2, pi3 (Raspberry Pi, SSH via `pi<N>.home`) |
| k8s distribution | k3s |
| Storage | Longhorn (`/mnt/arcodange/longhorn/`) |
| GitOps | ArgoCD (apps auto-sync from `gitea.arcodange.lab/arcodange-org/`) |
| Secrets | HashiCorp Vault (`tools` namespace, manual unseal) |
| Ingress | Traefik + CrowdSec bouncer |
| Working dir | `/Users/gabrielradureau/Work/Arcodange/factory/ansible/arcodange/factory/` |
| Inventory | `inventory/hosts.yml` |
**Critical dependency:** ERP (Dolibarr) uses Vault-rotated DB credentials written to its PVC.
**Always recover and unseal Vault before scaling ERP up.**
---
## Step 0 — Assess Damage
Run these first to understand what is broken:
```bash
# Overall pod health
kubectl get pods -A | grep -v Running | grep -v Completed
# PVC health (anything not Bound is a problem)
kubectl get pvc -A | grep -v Bound
# Longhorn volume states
kubectl get volumes.longhorn.io -n longhorn-system
# Longhorn manager health (prerequisite for all recovery)
kubectl get pods -n longhorn-system -l app=longhorn-manager
```
---
## Step 1 — Longhorn Volume Recovery
### Path A — Fast path (backup file exists, Volume CRDs were backed up)
Check if a recent backup exists on pi1:
```bash
ssh pi1.home "ls -lt /mnt/backups/k3s_pvc/backup_*.volumes | head -5"
```
If a backup file exists and is recent (from before the incident):
```bash
ssh pi1.home "kubectl apply -f /mnt/backups/k3s_pvc/backup_<YYYYMMDD>.volumes"
```
Then verify the PVCs are Bound and skip to Step 2.
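If re-binding takes a moment, a simple poll confirms progress:
```bash
# Loop until only Bound PVCs remain (prints the stragglers on each pass)
while kubectl get pvc -A --no-headers | grep -v Bound; do sleep 5; done
```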
### Path B — Block-device injection (no usable backup, raw replica files intact)
Use this when PVCs are `Lost`/`Terminating` and no Volume CRD backup is available.
**Check which volumes need recovery:**
```bash
# Volumes with no PVC or Lost/Terminating PVC
kubectl get pvc -A | grep -v Bound
```
**For each failed volume, create a vars file** following the pattern in:
`playbooks/recover/longhorn_data_vars.example.yml`
Existing vars files from the 2026-04-13 incident (reusable as references):
- `playbooks/recover/longhorn_data_vars_remaining.yml` — prometheus, alertmanager, redis, backups-rwx
- `playbooks/recover/longhorn_data_vars_erp_vault.yml` — erp, hashicorp-vault (audit + data)
- `playbooks/recover/longhorn_data_vars_clickhouse.yml` — clickhouse
**Key rules for the vars file** (a sample shape is sketched after this list):
- `source_node`/`source_dir` can be omitted — Phase 0 auto-discovers the largest non-Rebuilding replica
- Set `workload_name: ""` for ERP — it must not scale up until Vault is unsealed
- For StatefulSets with multiple PVCs (e.g. Vault), set `workload_name: ""` on all but the last entry
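For orientation, a sketch of the shape such a vars file can take. The list name `longhorn_recovery_volumes` and the `source_node`/`source_dir`/`workload_name` keys are referenced elsewhere in this document; the remaining key names are illustrative guesses, so confirm everything against the `.example.yml` template first:
```bash
# Illustrative only: pv_name/pvc_name/namespace/workload_kind are guessed key names;
# confirm against playbooks/recover/longhorn_data_vars.example.yml before running
cat > playbooks/recover/longhorn_data_vars_myvolume.yml <<'EOF'
longhorn_recovery_volumes:
  - pv_name: pvc-<uuid>         # Longhorn volume / PV name
    pvc_name: <pvc-name>
    namespace: <namespace>
    workload_kind: statefulset
    workload_name: ""           # empty = leave scaled down (e.g. ERP before Vault unseal)
    # source_node / source_dir omitted: Phase 0 auto-discovers the best replica
EOF
```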
**Run the recovery playbook:**
```bash
ansible-playbook -i inventory/hosts.yml playbooks/recover/longhorn_data.yml \
-e @playbooks/recover/longhorn_data_vars_<NAME>.yml
```
The playbook is **idempotent** — safe to re-run if it fails midway.
**Playbook phases (for context when troubleshooting):**
| Phase | What it does |
|-------|-------------|
| 0 | Auto-discovers best replica dir (skips `Rebuilding: true`) |
| 1 | Backs up untouched replica dir to `/home/pi/arcodange/backups/longhorn-recovery/` |
| 2 | Merges snapshot+head layers into a single `.img` via `merge-longhorn-layers.py` |
| 3 | **Scales down workloads first**, then clears stuck Terminating PVCs, creates Volume CRD |
| 4 | Scales down workloads (second pass, idempotent) |
| 5 | Attaches volume via maintenance ticket to source node |
| 6 | `mkfs.ext4` (if unformatted) + `rsync` from merged image into live block device |
| 7 | Removes maintenance ticket (volume detaches) |
| 8 | Creates PV (Retain, no claimRef) + PVC pinned to PV |
| 9 | Scales up workloads, waits for readyReplicas ≥ 1 (failures here are `ignore_errors: yes`) |
**Common Phase 8 failure — StatefulSet re-creates PVCs before they can be pinned:**
The playbook handles this automatically (scales down before finalizer removal). If you still hit it:
```bash
kubectl scale statefulset <name> -n <namespace> --replicas=0
kubectl patch pvc <pvc-name> -n <namespace> --type=merge -p '{"metadata":{"finalizers":null}}'
kubectl delete pvc <pvc-name> -n <namespace>
# Then re-run the playbook
```
---
## Step 2 — Unseal HashiCorp Vault
After Vault's PVCs are recovered, the pod boots **sealed**. Check:
```bash
kubectl get pod hashicorp-vault-0 -n tools
kubectl exec hashicorp-vault-0 -n tools -- vault status 2>/dev/null | grep Sealed
```
If sealed, run the unseal playbook (requires an interactive terminal for the Gitea password prompt):
```bash
ansible-playbook -i inventory/hosts.yml playbooks/tools/hashicorp_vault.yml
```
Unseal keys are at `~/.arcodange/cluster-keys.json` on the local machine. The playbook reads them automatically.
After the playbook completes, verify:
```bash
kubectl get pod hashicorp-vault-0 -n tools # must be 1/1 Ready
kubectl exec hashicorp-vault-0 -n tools -- vault status | grep Sealed # must be false
```
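If the playbook cannot be run (no interactive terminal for the Gitea prompt, for example), the keys can be fed manually. This sketch assumes the key file uses the `unseal_keys_b64` array emitted by `vault operator init -format=json`; verify the field name and threshold before relying on it:
```bash
# Feed the first three stored keys to the sealed pod (typical threshold: 3 of 5);
# "unseal_keys_b64" is an assumed field name, check cluster-keys.json first
for key in $(jq -r '.unseal_keys_b64[0:3][]' ~/.arcodange/cluster-keys.json); do
  kubectl exec hashicorp-vault-0 -n tools -- vault operator unseal "$key"
done
```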
---
## Step 3 — Scale Up ERP
Only after Vault is unsealed and Ready:
```bash
kubectl scale deployment erp -n erp --replicas=1
kubectl rollout status deployment/erp -n erp
```
---
## Step 4 — Reconfigure Tools (CrowdSec, etc.)
Run if CrowdSec bouncer or Traefik middleware needs reconfiguring:
```bash
# Standard run (bouncer key + Traefik middleware + restart)
ansible-playbook -i inventory/hosts.yml playbooks/tools/crowdsec.yml
# Include captcha HTML injection (use when captcha page is broken)
ansible-playbook -i inventory/hosts.yml playbooks/tools/crowdsec.yml --tags never,all
```
If crowdsec-agent or crowdsec-appsec pods are stuck in `Error` after a long outage,
the playbook restarts them automatically.
---
## Step 5 — Re-enable ArgoCD selfHeal
Check if `selfHeal` was disabled during recovery (look for `selfHeal: false` in the tools app):
```bash
grep -A5 "tools:" /Users/gabrielradureau/Work/Arcodange/factory/argocd/values.yaml
```
If disabled, re-enable it by editing `argocd/values.yaml` and setting `selfHeal: true`,
then confirm the app syncs the change:
```bash
kubectl get app tools -n argocd
```
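If auto-sync does not pick up the change, ArgoCD's standard refresh annotation can force a re-evaluation against Git:
```bash
# Trigger a hard refresh of the tools app (ArgoCD removes the annotation itself)
kubectl annotate app tools -n argocd argocd.argoproj.io/refresh=hard --overwrite
```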
---
## Step 6 — Final Verification
```bash
# All pods running
kubectl get pods -A | grep -v Running | grep -v Completed | grep -v "^NAME"
# All PVCs bound
kubectl get pvc -A | grep -v Bound
# All Longhorn volumes healthy
kubectl get volumes.longhorn.io -n longhorn-system
# Run a fresh backup to capture the recovered state
ansible-playbook -i inventory/hosts.yml playbooks/backup/backup.yml \
-e backup_root_dir=/mnt/backups
```
---
## Key Files Reference
| File | Purpose |
|------|---------|
| `playbooks/recover/longhorn_data.yml` | Main block-device recovery playbook |
| `playbooks/recover/longhorn.yml` | Recovery when Volume CRDs still exist |
| `playbooks/recover/longhorn_data_vars.example.yml` | Template for recovery vars |
| `playbooks/recover/longhorn_data_vars_erp_vault.yml` | Vars for erp + vault (2026-04-13 incident) |
| `playbooks/recover/longhorn_data_vars_remaining.yml` | Vars for other volumes (2026-04-13 incident) |
| `playbooks/backup/backup.yml` | Full backup (postgres + gitea + k3s PVCs + Longhorn CRDs) |
| `playbooks/backup/k3s_pvc.yml` | PV/PVC/Longhorn Volume CRD backup |
| `playbooks/tools/hashicorp_vault.yml` | Vault unseal + OIDC reconfiguration |
| `playbooks/tools/crowdsec.yml` | CrowdSec bouncer + Traefik middleware setup |
| `docs/adr/20260414-longhorn-pvc-recovery.md` | Full incident ADR with all recovery methods |
| `~/.arcodange/cluster-keys.json` | Vault unseal keys (local machine only) |
---
## Decision Tree
```
Cluster down after outage
├─ kubectl works? ──No──▶ Check k3s: `systemctl status k3s` on pi1/pi2/pi3
└─ Yes
├─ PVCs all Bound? ──Yes──▶ Skip to Step 2 (check Vault)
└─ No
├─ Recent .volumes backup on pi1? ──Yes──▶ Path A (kubectl apply backup)
└─ No
├─ Longhorn Volume CRDs exist? ──Yes──▶ playbooks/recover/longhorn.yml
└─ No ──▶ Path B (longhorn_data.yml block-device injection)
Check replica dirs exist first:
ssh pi{1,2,3}.home "sudo du -sh /mnt/arcodange/longhorn/replicas/pvc-*"
```

# Runbook: Longhorn Block-Device Data Recovery
**When to use:** Longhorn has been fully reinstalled (nuclear cleanup). Volume CRDs are gone.
Application PVCs are stuck `Terminating` or `Lost`. The raw replica `.img` files still exist
on disk across the nodes. kubectl/k8s objects cannot help — we must work directly with the
Longhorn replica directories and block devices.
**Automated version:** `playbooks/recover/longhorn_data.yml`
---
## Mental Model
Longhorn stores each replica as a chain of sparse raw image files inside a directory named
`<pv-name>-<random-hex>` under `<longhorn_data_path>/replicas/`. Each directory contains:
```
volume.meta — engine state (Head filename, Parent snapshot, Dirty flag)
volume-head-NNN.img — active write log (sparse, only changed blocks)
volume-head-NNN.img.meta — head metadata
volume-snap-<uuid>.img — snapshot at a point in time (sparse, full state)
volume-snap-<uuid>.img.meta — snapshot metadata
revision.counter — monotonically increasing write counter
```
After a nuclear cleanup + reinstall, Longhorn creates **new empty replica directories** with
new random hex suffixes. The old directories (with data) are left on disk but orphaned.
**Why directory-swap fails:** the old `volume.meta` has a different engine generation and
`Dirty: true`. Longhorn detects the inconsistency across replicas and rebuilds from the
"cleanest" source (the new empty pi1 replica), overwriting the old data.
**What works:** extract the filesystem from the untouched replica directory directly, then
inject the data files into the live Longhorn block device while the volume is temporarily
attached in maintenance mode.
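To see this state directly, `volume.meta` is plain JSON and can be dumped from any node (pi2 here is just an example):
```bash
# Inspect a replica's engine state: note the Head filename and the Dirty flag
ssh pi2.home "sudo cat /mnt/arcodange/longhorn/replicas/<pv-name>-<old-hex>/volume.meta" \
  | python3 -m json.tool
```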
---
## Decision Tree
```
Are Volume CRDs present in Longhorn?
├── YES → normal PV/PVC restore is enough, use playbooks/recover/longhorn.yml
└── NO
└── Are replica directories present on disk?
├── NO → data is lost, provision fresh volumes
└── YES
└── Is there an untouched replica dir (timestamps from before the incident)?
├── NO → data likely unrecoverable (all dirs were zeroed during reconciliation)
└── YES → follow this runbook
```
---
## Step 0 — Pre-flight: Inventory Surviving Replica Directories
On each node, list replica dirs and their sizes. Dirs with actual data are large (>16K).
New empty dirs created by Longhorn are always exactly 16K.
```bash
for node in pi1 pi2 pi3; do
echo "=== $node ==="
ssh $node "sudo du -sh /mnt/arcodange/longhorn/replicas/pvc-<VOLUME>-* 2>/dev/null"
done
```
**Key rule:** identify the replica dir that was **never touched** by the reinstall — it has
old timestamps (from before the incident) and its size matches the original volume usage.
This is your recovery source. **Back it up before touching anything.**
```bash
# On the node that has the untouched dir:
sudo mkdir -p /home/pi/arcodange/backups/longhorn-recovery/<pvc-name>/
sudo cp -a /mnt/arcodange/longhorn/replicas/<pv-name>-<old-hex>/ \
/home/pi/arcodange/backups/longhorn-recovery/<pvc-name>/
```
---
## Step 1 — Reconstruct the Filesystem
The replica directory contains a snapshot chain. Each layer is a sparse raw image: unchanged
blocks appear as zeroed sparse regions; only written blocks contain data. To reconstruct the
full filesystem, the layers must be merged, with the head taking priority over the snapshots
beneath it.
Use `docs/incidents/2026-04-13-power-cut/tools/merge-longhorn-layers.py`:
```bash
# On the node holding the backup:
sudo python3 merge-longhorn-layers.py \
/home/pi/arcodange/backups/longhorn-recovery/<pvc-name>/<pv-name>-<old-hex>/ \
/tmp/<pvc-name>-merged.img
# Verify the filesystem mounts
sudo mkdir -p /mnt/recovery-<pvc-name>
sudo mount -o loop /tmp/<pvc-name>-merged.img /mnt/recovery-<pvc-name>
sudo ls -lah /mnt/recovery-<pvc-name>/
sudo umount /mnt/recovery-<pvc-name>
```
If mount fails with "wrong fs type" or "bad superblock":
- The snapshot `.img` is all-zero (was overwritten by a prior Longhorn reconciliation)
- Try the next oldest replica dir from another node
- Check with `sudo od -A x -t x1z -v snap.img | grep -v ' 00 00...' | head -5`
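A dirty ext4 journal from the unclean shutdown also produces "bad superblock" on plain read-only loop mounts; skipping journal replay gives read-only access for inspection (same fix as in the Pitfalls table):
```bash
# Read-only mount that skips ext4 journal replay (inspection only)
sudo mount -o loop,ro,noload /tmp/<pvc-name>-merged.img /mnt/recovery-<pvc-name>
```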
---
## Step 2 — Create the Longhorn Volume CRD
Longhorn needs to know about the volume before its block device can be used.
```bash
kubectl apply -f - <<EOF
apiVersion: longhorn.io/v1beta2
kind: Volume
metadata:
name: <pv-name>
namespace: longhorn-system
spec:
accessMode: rwo # or rwx
dataEngine: v1
frontend: blockdev
numberOfReplicas: 3
size: "<size-in-bytes>" # e.g. "134217728" for 128Mi
EOF
```
Wait for replicas to appear:
```bash
kubectl get replicas.longhorn.io -n longhorn-system | grep <pv-name>
# Expect 3 replicas in "stopped" state
```
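The `size` in the manifest above must match the original volume size in bytes. If the old PVC object still exists (even stuck Terminating), the request can be read back and converted; the conversion below assumes GNU coreutils `numfmt` is available:
```bash
# Read the original request from the surviving PVC object, e.g. "128Mi"
kubectl get pvc <pvc-name> -n <namespace> \
  -o jsonpath='{.spec.resources.requests.storage}'
# Convert an IEC suffix to bytes (GNU coreutils)
numfmt --from=iec-i 128Mi   # prints 134217728
```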
---
## Step 3 — Attach the Volume in Maintenance Mode
Longhorn only creates the block device (`/dev/longhorn/<pv-name>`) when the volume is
attached to a node. Use a `VolumeAttachment` ticket to attach without a pod.
Choose `<target-node>` = the same node where the backup/merged image is stored (avoids
copying large files across the network).
```bash
kubectl apply -f - <<EOF
apiVersion: longhorn.io/v1beta2
kind: VolumeAttachment
metadata:
name: <pv-name>
namespace: longhorn-system
spec:
attachmentTickets:
recovery:
generation: 0
id: recovery
nodeID: <target-node>
parameters:
disableFrontend: "false"
type: longhorn-api
volume: <pv-name>
EOF
kubectl wait --for=jsonpath='{.status.state}'=attached \
volumes.longhorn.io/<pv-name> -n longhorn-system --timeout=120s
```
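Once the wait succeeds, the block device should exist on the target node:
```bash
# The device only exists while the volume is attached
ssh <target-node> "ls -l /dev/longhorn/<pv-name>"
```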
---
## Step 4 — Scale Down the Workload
Always stop the workload before touching the data to prevent concurrent writes and filesystem
corruption.
```bash
# For a Deployment:
kubectl scale deployment <name> -n <namespace> --replicas=0
# For a StatefulSet:
kubectl scale statefulset <name> -n <namespace> --replicas=0
```
---
## Step 5 — Inject Data Files via Block Device
```bash
ssh <target-node> bash <<'SHELL'
# Mount the live block device
sudo mkdir -p /mnt/recovery-live
sudo mount /dev/longhorn/<pv-name> /mnt/recovery-live
# Mount the reconstructed image (if not already mounted)
sudo mkdir -p /mnt/recovery-src
sudo mount -o loop /tmp/<pvc-name>-merged.img /mnt/recovery-src
# Sync: only the application data files, not lost+found
sudo rsync -av --exclude='lost+found' /mnt/recovery-src/ /mnt/recovery-live/
# Verify
sudo ls -lah /mnt/recovery-live/
# Unmount both
sudo umount /mnt/recovery-src
sudo umount /mnt/recovery-live
SHELL
```
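If the sync exits with code 23 (partial transfer, typically "Structure needs cleaning" on blocks damaged by the power cut), remount if needed and re-run it with `--ignore-errors`; rc=23 means some files were skipped, not that the transfer failed outright (see Pitfalls):
```bash
# Retry tolerating unreadable source blocks
sudo rsync -av --ignore-errors --exclude='lost+found' /mnt/recovery-src/ /mnt/recovery-live/
```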
---
## Step 6 — Detach the Volume
```bash
kubectl patch volumeattachments.longhorn.io <pv-name> \
-n longhorn-system --type json \
-p '[{"op":"remove","path":"/spec/attachmentTickets/recovery"}]'
kubectl wait --for=jsonpath='{.status.state}'=detached \
volumes.longhorn.io/<pv-name> -n longhorn-system --timeout=60s
```
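If the `detached` wait times out because a workload grabbed its own ticket immediately, wait for the `recovery` ticket to disappear instead (see Pitfalls):
```bash
# Poll until the recovery ticket is gone from spec.attachmentTickets
until [ -z "$(kubectl get volumeattachments.longhorn.io <pv-name> -n longhorn-system \
  -o jsonpath='{.spec.attachmentTickets.recovery}')" ]; do sleep 2; done
```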
---
## Step 7 — Restore PV and PVC
Clear stuck Terminating PV/PVC finalizers first if they exist:
```bash
kubectl patch pv <pv-name> --type=merge -p '{"metadata":{"finalizers":null}}' 2>/dev/null
kubectl patch pvc <pvc-name> -n <namespace> --type=merge \
-p '{"metadata":{"finalizers":null}}' 2>/dev/null
# Wait a moment for them to delete
```
Recreate the PV with `Retain` policy and no `claimRef`:
```bash
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolume
metadata:
name: <pv-name>
annotations:
pv.kubernetes.io/provisioned-by: driver.longhorn.io
spec:
accessModes: [ReadWriteOnce] # match original
capacity:
storage: <size> # e.g. 128Mi
csi:
driver: driver.longhorn.io
fsType: ext4
volumeHandle: <pv-name>
volumeAttributes:
dataEngine: v1
dataLocality: disabled
disableRevisionCounter: "true"
numberOfReplicas: "3"
staleReplicaTimeout: "30"
persistentVolumeReclaimPolicy: Retain
storageClassName: longhorn
volumeMode: Filesystem
EOF
```
Recreate the PVC pinned to this PV:
```bash
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: <pvc-name>
namespace: <namespace>
spec:
accessModes: [ReadWriteOnce]
resources:
requests:
storage: <size>
storageClassName: longhorn
volumeMode: Filesystem
volumeName: <pv-name>
EOF
```
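Because `volumeName` pins the claim to the PV, binding should be near-immediate:
```bash
kubectl get pvc <pvc-name> -n <namespace>   # STATUS must be Bound
```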
---
## Step 8 — Scale Up and Verify
```bash
kubectl scale deployment <name> -n <namespace> --replicas=1
kubectl wait --for=condition=Ready pod -l app=<name> -n <namespace> --timeout=120s
```
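For a StatefulSet, scale the StatefulSet and wait on `readyReplicas` rather than pod labels (see Pitfalls):
```bash
kubectl scale statefulset <name> -n <namespace> --replicas=1
# StatefulSet pods may not carry an app=<name> label; wait on the object itself
kubectl wait --for=jsonpath='{.status.readyReplicas}'=1 \
  statefulset/<name> -n <namespace> --timeout=180s
```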
---
## Pitfalls Learned During 2026-04-13 Recovery
| Pitfall | What happened | Prevention |
|---------|--------------|------------|
| **Directory swap corrupts data** | Longhorn found old `Dirty: true` volume.meta + empty pi1 replica → rebuilt from empty source | Never swap dirs. Use merge tool + block device injection instead |
| **Snapshot is zeroed after swap** | Longhorn reconciliation overwrote snapshot images when rebuilding from empty replica | Back up the untouched dir FIRST before any rename |
| **Multiple dirs per volume on pi3** | Rebuild attempts during the incident created extra dirs | Identify the untouched dir by timestamp AND verify non-zero content with `od` |
| **`Rebuilding: true` replica → all-zeros merged image** | Phase 0 picked a replica mid-rebuild (1.3 GiB actual data, sparse files look large) — merge tool produced an all-zeros image | Check `volume.meta` and skip any dir with `"Rebuilding": true` before merging |
| **`du -sb` gives misleading apparent sizes** | Sparse replica files (8 GiB file, 1.3 GiB actual) appeared larger than healthy 11 GiB replicas | Use `du -sk` (actual disk blocks) not `du -sb` (apparent/logical size) to rank replicas |
| **Dirty journal prevents ro mount** | `mount -o loop,ro` fails with "bad superblock" on an ext4 with unclean shutdown | Use `mount -o loop,ro,noload` to skip journal replay for read-only access |
| **New volume is unformatted** | `mount /dev/longhorn/<pv>` fails with "wrong fs type" on a freshly created volume | Run `mkfs.ext4 -F` before mounting; guard with `blkid` to skip if already formatted |
| **rsync rc=23 on power-cut partitions** | Some filesystem blocks were unreadable ("Structure needs cleaning") → rsync exits 23 | Use `rsync --ignore-errors`; rc=23 is a partial transfer, not a total failure |
| **pod blocks volume re-attach** | Old Error-state pod held a volume attachment claim | Delete old Error pods before scaling up new ones |
| **`kubectl cp` needs `tar`** | Distroless container had no `tar` binary | Mount block device directly on the node instead |
| **VolumeAttachment ticket removal** | Deleting a VolumeAttachment object causes Longhorn to immediately recreate it | Patch the `recovery` key out of `spec.attachmentTickets` instead of deleting the object |
| **Phase 7 wait for `detached` times out** | After removing the recovery ticket, a workload may immediately create its own ticket | Wait for the `recovery` ticket to disappear from `spec.attachmentTickets`, not for full detach |
| **StatefulSet pods not found by label** | `kubectl get pod -l app=<name>` returns nothing for StatefulSet pods | Wait on `readyReplicas ≥ 1` on the StatefulSet object, not on pod labels |
| **`set_fact` overridden by `-e @file`** | Ansible extra vars have highest precedence — `set_fact: longhorn_recovery_volumes` was silently ignored | Use a different variable name (`_volumes`) for the resolved list, never reassign the extra var name |
---
## Identifying the Right Replica Directory
When multiple old dirs exist for the same volume on a node, use these checks to pick the recovery source:
1. **Skip `Rebuilding: true`:** check `volume.meta` first — a dir that was being rebuilt when
the incident happened has incomplete data (sparse files are allocated but mostly zeroed):
```bash
python3 -c "import json; d=json.load(open('volume.meta')); print('Rebuilding:', d['Rebuilding'])"
```
Only consider dirs where `Rebuilding: false`.
2. **Actual size:** `sudo du -sk <dir>` (actual disk usage in KB — not `du -sb` which returns
apparent/logical size and is misleading for sparse files). Pick the largest actual size.
3. **Timestamps:** prefer the most recently modified before the incident date.
4. **Snapshot chain:** if Rebuilding is false on multiple dirs, check `volume.meta` for
`"Dirty": false` (clean shutdown) vs `"Dirty": true`. Prefer clean if available.
5. **Content check:** verify the snapshot is not all zeros:
```bash
sudo od -A x -t x1z -v volume-snap-*.img | grep -v ' 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00' | head -3
```
If the output is empty (all zeros), the snapshot was overwritten. Try another node.
**Summary rule:** `Rebuilding: false` → largest `du -sk` → non-zero snapshot content.
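A loop that applies the rule mechanically (a sketch; run on each node holding candidate dirs):
```bash
# Rank non-Rebuilding candidates by actual disk usage; largest wins
for d in /mnt/arcodange/longhorn/replicas/<pv-prefix>-*/; do
  rb=$(sudo python3 -c "import json; print(json.load(open('$d/volume.meta'))['Rebuilding'])")
  [ "$rb" = "True" ] && continue       # skip mid-rebuild replicas
  sudo du -sk "$d"
done | sort -rn | head -1              # then verify non-zero snapshot content
```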
---
## Reference: Key Commands
```bash
# List all replica dirs for a volume across all nodes
for n in pi1 pi2 pi3; do echo "==$n=="; ssh $n "sudo ls /mnt/arcodange/longhorn/replicas/ | grep <pv-prefix>"; done
# Check Longhorn volume state
kubectl get volumes.longhorn.io -n longhorn-system <pv-name>
# Check VolumeAttachment tickets
kubectl get volumeattachments.longhorn.io -n longhorn-system <pv-name> \
-o jsonpath='{.spec.attachmentTickets}'
# Check Longhorn block device existence on a node
ssh <node> "ls /dev/longhorn/<pv-name>"
# Verify filesystem content without starting the app
ssh <node> "sudo mount /dev/longhorn/<pv-name> /mnt/check && sudo ls /mnt/check && sudo umount /mnt/check"
```