docs(longhorn): document 2026-04-13 power-cut recovery + add data-recovery tooling

Captures the post-mortem of the April 13 power cut: incident timeline,
retrospective, and architecture/role diagrams. Adds an ADR explaining why
Longhorn cannot re-associate orphaned replica directories after a nuclear
reinstall (engine-id naming), plus block-device recovery runbooks and the
`playbooks/recover/longhorn_data.yml` automation that wires `merge-longhorn-layers.py`
to rebuild PVCs from raw `volume-head-*.img` chains.

Also extends the k3s_pvc backup to capture Longhorn `volumes`/`settings` CRDs
(needed for the fast-path restore) and rewrites the restore script with a
fallback dir + English messages.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

# Cluster Recovery Agent Instructions
You are recovering the Arcodange homelab k3s cluster after an outage (power cut, node failure, or
Longhorn reinstall). Your job is to assess damage, run the appropriate Ansible playbooks and
kubectl commands, and bring the cluster back to a fully healthy state.
You do NOT need to modify any code. All recovery tooling already exists.
---
## Cluster Overview
| Component | Details |
|-----------|---------|
| Nodes | pi1, pi2, pi3 (Raspberry Pi, SSH via `pi<N>.home`) |
| k8s distribution | k3s |
| Storage | Longhorn (`/mnt/arcodange/longhorn/`) |
| GitOps | ArgoCD (apps auto-sync from `gitea.arcodange.lab/arcodange-org/`) |
| Secrets | HashiCorp Vault (`tools` namespace, manual unseal) |
| Ingress | Traefik + CrowdSec bouncer |
| Working dir | `/Users/gabrielradureau/Work/Arcodange/factory/ansible/arcodange/factory/` |
| Inventory | `inventory/hosts.yml` |
**Critical dependency:** ERP (Dolibarr) uses Vault-rotated DB credentials written to its PVC.
**Always recover and unseal Vault before scaling ERP up.**
---
## Step 0 — Assess Damage
Run these first to understand what is broken:
```bash
# Overall pod health
kubectl get pods -A | grep -v Running | grep -v Completed
# PVC health (anything not Bound is a problem)
kubectl get pvc -A | grep -v Bound
# Longhorn volume states
kubectl get volumes.longhorn.io -n longhorn-system
# Longhorn manager health (prerequisite for all recovery)
kubectl get pods -n longhorn-system -l app=longhorn-manager
```
---
## Step 1 — Longhorn Volume Recovery
### Path A — Fast path (backup file exists, Volume CRDs were backed up)
Check if a recent backup exists on pi1:
```bash
ssh pi1.home "ls -lt /mnt/backups/k3s_pvc/backup_*.volumes | head -5"
```
If a backup file exists and is recent (from before the incident):
```bash
ssh pi1.home "kubectl apply -f /mnt/backups/k3s_pvc/backup_<YYYYMMDD>.volumes"
```
Then verify the PVCs are Bound and skip to Step 2.
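If re-binding takes a moment, a simple poll confirms progress:
```bash
# Loop until only Bound PVCs remain (prints the stragglers on each pass)
while kubectl get pvc -A --no-headers | grep -v Bound; do sleep 5; done
```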
### Path B — Block-device injection (no usable backup, raw replica files intact)
Use this when PVCs are `Lost`/`Terminating` and no Volume CRD backup is available.
**Check which volumes need recovery:**
```bash
# Volumes with no PVC or Lost/Terminating PVC
kubectl get pvc -A | grep -v Bound
```
**For each failed volume, create a vars file** following the pattern in:
`playbooks/recover/longhorn_data_vars.example.yml`
Existing vars files from the 2026-04-13 incident (reusable as references):
- `playbooks/recover/longhorn_data_vars_remaining.yml` — prometheus, alertmanager, redis, backups-rwx
- `playbooks/recover/longhorn_data_vars_erp_vault.yml` — erp, hashicorp-vault (audit + data)
- `playbooks/recover/longhorn_data_vars_clickhouse.yml` — clickhouse
**Key rules for the vars file** (a sample shape is sketched after this list):
- `source_node`/`source_dir` can be omitted — Phase 0 auto-discovers the largest non-Rebuilding replica
- Set `workload_name: ""` for ERP — it must not scale up until Vault is unsealed
- For StatefulSets with multiple PVCs (e.g. Vault), set `workload_name: ""` on all but the last entry
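For orientation, a sketch of the shape such a vars file can take. The list name `longhorn_recovery_volumes` and the `source_node`/`source_dir`/`workload_name` keys are referenced elsewhere in this document; the remaining key names are illustrative guesses, so confirm everything against the `.example.yml` template first:
```bash
# Illustrative only: pv_name/pvc_name/namespace/workload_kind are guessed key names;
# confirm against playbooks/recover/longhorn_data_vars.example.yml before running
cat > playbooks/recover/longhorn_data_vars_myvolume.yml <<'EOF'
longhorn_recovery_volumes:
  - pv_name: pvc-<uuid>         # Longhorn volume / PV name
    pvc_name: <pvc-name>
    namespace: <namespace>
    workload_kind: statefulset
    workload_name: ""           # empty = leave scaled down (e.g. ERP before Vault unseal)
    # source_node / source_dir omitted: Phase 0 auto-discovers the best replica
EOF
```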
**Run the recovery playbook:**
```bash
ansible-playbook -i inventory/hosts.yml playbooks/recover/longhorn_data.yml \
-e @playbooks/recover/longhorn_data_vars_<NAME>.yml
```
The playbook is **idempotent** — safe to re-run if it fails midway.
**Playbook phases (for context when troubleshooting):**
| Phase | What it does |
|-------|-------------|
| 0 | Auto-discovers best replica dir (skips `Rebuilding: true`) |
| 1 | Backs up untouched replica dir to `/home/pi/arcodange/backups/longhorn-recovery/` |
| 2 | Merges snapshot+head layers into a single `.img` via `merge-longhorn-layers.py` |
| 3 | **Scales down workloads first**, then clears stuck Terminating PVCs, creates Volume CRD |
| 4 | Scales down workloads (second pass, idempotent) |
| 5 | Attaches volume via maintenance ticket to source node |
| 6 | `mkfs.ext4` (if unformatted) + `rsync` from merged image into live block device |
| 7 | Removes maintenance ticket (volume detaches) |
| 8 | Creates PV (Retain, no claimRef) + PVC pinned to PV |
| 9 | Scales up workloads, waits for readyReplicas ≥ 1 (failures here are `ignore_errors: yes`) |
**Common Phase 8 failure — StatefulSet re-creates PVCs before they can be pinned:**
The playbook handles this automatically (scales down before finalizer removal). If you still hit it:
```bash
kubectl scale statefulset <name> -n <namespace> --replicas=0
kubectl patch pvc <pvc-name> -n <namespace> --type=merge -p '{"metadata":{"finalizers":null}}'
kubectl delete pvc <pvc-name> -n <namespace>
# Then re-run the playbook
```
---
## Step 2 — Unseal HashiCorp Vault
After Vault's PVCs are recovered, the pod boots **sealed**. Check:
```bash
kubectl get pod hashicorp-vault-0 -n tools
kubectl exec hashicorp-vault-0 -n tools -- vault status 2>/dev/null | grep Sealed
```
If sealed, run the unseal playbook (requires an interactive terminal for the Gitea password prompt):
```bash
ansible-playbook -i inventory/hosts.yml playbooks/tools/hashicorp_vault.yml
```
Unseal keys are at `~/.arcodange/cluster-keys.json` on the local machine. The playbook reads them automatically.
After the playbook completes, verify:
```bash
kubectl get pod hashicorp-vault-0 -n tools # must be 1/1 Ready
kubectl exec hashicorp-vault-0 -n tools -- vault status | grep Sealed # must be false
```
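If the playbook cannot be run (no interactive terminal for the Gitea prompt, for example), the keys can be fed manually. This sketch assumes the key file uses the `unseal_keys_b64` array emitted by `vault operator init -format=json`; verify the field name and threshold before relying on it:
```bash
# Feed the first three stored keys to the sealed pod (typical threshold: 3 of 5);
# "unseal_keys_b64" is an assumed field name, check cluster-keys.json first
for key in $(jq -r '.unseal_keys_b64[0:3][]' ~/.arcodange/cluster-keys.json); do
  kubectl exec hashicorp-vault-0 -n tools -- vault operator unseal "$key"
done
```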
---
## Step 3 — Scale Up ERP
Only after Vault is unsealed and Ready:
```bash
kubectl scale deployment erp -n erp --replicas=1
kubectl rollout status deployment/erp -n erp
```
---
## Step 4 — Reconfigure Tools (CrowdSec, etc.)
Run if CrowdSec bouncer or Traefik middleware needs reconfiguring:
```bash
# Standard run (bouncer key + Traefik middleware + restart)
ansible-playbook -i inventory/hosts.yml playbooks/tools/crowdsec.yml
# Include captcha HTML injection (use when captcha page is broken)
ansible-playbook -i inventory/hosts.yml playbooks/tools/crowdsec.yml --tags never,all
```
If crowdsec-agent or crowdsec-appsec pods are stuck in `Error` after a long outage,
the playbook restarts them automatically.
---
## Step 5 — Re-enable ArgoCD selfHeal
Check if `selfHeal` was disabled during recovery (look for `selfHeal: false` in the tools app):
```bash
grep -A5 "tools:" /Users/gabrielradureau/Work/Arcodange/factory/argocd/values.yaml
```
If disabled, re-enable it by editing `argocd/values.yaml` and setting `selfHeal: true`,
then confirm the app syncs the change:
```bash
kubectl get app tools -n argocd
```
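If auto-sync does not pick up the change, ArgoCD's standard refresh annotation can force a re-evaluation against Git:
```bash
# Trigger a hard refresh of the tools app (ArgoCD removes the annotation itself)
kubectl annotate app tools -n argocd argocd.argoproj.io/refresh=hard --overwrite
```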
---
## Step 6 — Final Verification
```bash
# All pods running
kubectl get pods -A | grep -v Running | grep -v Completed | grep -v "^NAME"
# All PVCs bound
kubectl get pvc -A | grep -v Bound
# All Longhorn volumes healthy
kubectl get volumes.longhorn.io -n longhorn-system
# Run a fresh backup to capture the recovered state
ansible-playbook -i inventory/hosts.yml playbooks/backup/backup.yml \
-e backup_root_dir=/mnt/backups
```
---
## Key Files Reference
| File | Purpose |
|------|---------|
| `playbooks/recover/longhorn_data.yml` | Main block-device recovery playbook |
| `playbooks/recover/longhorn.yml` | Recovery when Volume CRDs still exist |
| `playbooks/recover/longhorn_data_vars.example.yml` | Template for recovery vars |
| `playbooks/recover/longhorn_data_vars_erp_vault.yml` | Vars for erp + vault (2026-04-13 incident) |
| `playbooks/recover/longhorn_data_vars_remaining.yml` | Vars for other volumes (2026-04-13 incident) |
| `playbooks/backup/backup.yml` | Full backup (postgres + gitea + k3s PVCs + Longhorn CRDs) |
| `playbooks/backup/k3s_pvc.yml` | PV/PVC/Longhorn Volume CRD backup |
| `playbooks/tools/hashicorp_vault.yml` | Vault unseal + OIDC reconfiguration |
| `playbooks/tools/crowdsec.yml` | CrowdSec bouncer + Traefik middleware setup |
| `docs/adr/20260414-longhorn-pvc-recovery.md` | Full incident ADR with all recovery methods |
| `~/.arcodange/cluster-keys.json` | Vault unseal keys (local machine only) |
---
## Decision Tree
```
Cluster down after outage
├─ kubectl works? ──No──▶ Check k3s: `systemctl status k3s` on pi1/pi2/pi3
└─ Yes
├─ PVCs all Bound? ──Yes──▶ Skip to Step 2 (check Vault)
└─ No
├─ Recent .volumes backup on pi1? ──Yes──▶ Path A (kubectl apply backup)
└─ No
├─ Longhorn Volume CRDs exist? ──Yes──▶ playbooks/recover/longhorn.yml
└─ No ──▶ Path B (longhorn_data.yml block-device injection)
Check replica dirs exist first:
ssh pi{1,2,3}.home "sudo du -sh /mnt/arcodange/longhorn/replicas/pvc-*"
```

# Runbook: Longhorn Block-Device Data Recovery
**When to use:** Longhorn has been fully reinstalled (nuclear cleanup). Volume CRDs are gone.
Application PVCs are stuck `Terminating` or `Lost`. The raw replica `.img` files still exist
on disk across the nodes. kubectl/k8s objects cannot help — we must work directly with the
Longhorn replica directories and block devices.
**Automated version:** `playbooks/recover/longhorn_data.yml`
---
## Mental Model
Longhorn stores each replica as a chain of sparse raw image files inside a directory named
`<pv-name>-<random-hex>` under `<longhorn_data_path>/replicas/`. Each directory contains:
```
volume.meta — engine state (Head filename, Parent snapshot, Dirty flag)
volume-head-NNN.img — active write log (sparse, only changed blocks)
volume-head-NNN.img.meta — head metadata
volume-snap-<uuid>.img — snapshot at a point in time (sparse, full state)
volume-snap-<uuid>.img.meta — snapshot metadata
revision.counter — monotonically increasing write counter
```
After a nuclear cleanup + reinstall, Longhorn creates **new empty replica directories** with
new random hex suffixes. The old directories (with data) are left on disk but orphaned.
**Why directory-swap fails:** the old `volume.meta` has a different engine generation and
`Dirty: true`. Longhorn detects the inconsistency across replicas and rebuilds from the
"cleanest" source (the new empty pi1 replica), overwriting the old data.
**What works:** extract the filesystem from the untouched replica directory directly, then
inject the data files into the live Longhorn block device while the volume is temporarily
attached in maintenance mode.
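To see this state directly, `volume.meta` is plain JSON and can be dumped from any node (pi2 here is just an example):
```bash
# Inspect a replica's engine state: note the Head filename and the Dirty flag
ssh pi2.home "sudo cat /mnt/arcodange/longhorn/replicas/<pv-name>-<old-hex>/volume.meta" \
  | python3 -m json.tool
```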
---
## Decision Tree
```
Are Volume CRDs present in Longhorn?
├── YES → normal PV/PVC restore is enough, use playbooks/recover/longhorn.yml
└── NO
└── Are replica directories present on disk?
├── NO → data is lost, provision fresh volumes
└── YES
└── Is there an untouched replica dir (timestamps from before the incident)?
├── NO → data likely unrecoverable (all dirs were zeroed during reconciliation)
└── YES → follow this runbook
```
---
## Step 0 — Pre-flight: Inventory Surviving Replica Directories
On each node, list replica dirs and their sizes. Dirs with actual data are large (>16K).
New empty dirs created by Longhorn are always exactly 16K.
```bash
for node in pi1 pi2 pi3; do
echo "=== $node ==="
ssh $node "sudo du -sh /mnt/arcodange/longhorn/replicas/pvc-<VOLUME>-* 2>/dev/null"
done
```
**Key rule:** identify the replica dir that was **never touched** by the reinstall — it has
old timestamps (from before the incident) and its size matches the original volume usage.
This is your recovery source. **Back it up before touching anything.**
```bash
# On the node that has the untouched dir:
sudo mkdir -p /home/pi/arcodange/backups/longhorn-recovery/<pvc-name>/
sudo cp -a /mnt/arcodange/longhorn/replicas/<pv-name>-<old-hex>/ \
/home/pi/arcodange/backups/longhorn-recovery/<pvc-name>/
```
---
## Step 1 — Reconstruct the Filesystem
The replica directory contains a snapshot chain. Each layer is a sparse raw image: unchanged
blocks appear as zeroed sparse regions; only written blocks contain data. To reconstruct the
full filesystem, the layers must be merged, with the head taking priority over the snapshots
beneath it.
Use `docs/incidents/2026-04-13-power-cut/tools/merge-longhorn-layers.py`:
```bash
# On the node holding the backup:
sudo python3 merge-longhorn-layers.py \
/home/pi/arcodange/backups/longhorn-recovery/<pvc-name>/<pv-name>-<old-hex>/ \
/tmp/<pvc-name>-merged.img
# Verify the filesystem mounts
sudo mkdir -p /mnt/recovery-<pvc-name>
sudo mount -o loop /tmp/<pvc-name>-merged.img /mnt/recovery-<pvc-name>
sudo ls -lah /mnt/recovery-<pvc-name>/
sudo umount /mnt/recovery-<pvc-name>
```
If mount fails with "wrong fs type" or "bad superblock":
- The snapshot `.img` is all-zero (was overwritten by a prior Longhorn reconciliation)
- Try the next oldest replica dir from another node
- Check with `sudo od -A x -t x1z -v snap.img | grep -v ' 00 00...' | head -5`
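A dirty ext4 journal from the unclean shutdown also produces "bad superblock" on plain read-only loop mounts; skipping journal replay gives read-only access for inspection (same fix as in the Pitfalls table):
```bash
# Read-only mount that skips ext4 journal replay (inspection only)
sudo mount -o loop,ro,noload /tmp/<pvc-name>-merged.img /mnt/recovery-<pvc-name>
```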
---
## Step 2 — Create the Longhorn Volume CRD
Longhorn needs to know about the volume before its block device can be used.
```bash
kubectl apply -f - <<EOF
apiVersion: longhorn.io/v1beta2
kind: Volume
metadata:
name: <pv-name>
namespace: longhorn-system
spec:
accessMode: rwo # or rwx
dataEngine: v1
frontend: blockdev
numberOfReplicas: 3
size: "<size-in-bytes>" # e.g. "134217728" for 128Mi
EOF
```
Wait for replicas to appear:
```bash
kubectl get replicas.longhorn.io -n longhorn-system | grep <pv-name>
# Expect 3 replicas in "stopped" state
```
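The `size` in the manifest above must match the original volume size in bytes. If the old PVC object still exists (even stuck Terminating), the request can be read back and converted; the conversion below assumes GNU coreutils `numfmt` is available:
```bash
# Read the original request from the surviving PVC object, e.g. "128Mi"
kubectl get pvc <pvc-name> -n <namespace> \
  -o jsonpath='{.spec.resources.requests.storage}'
# Convert an IEC suffix to bytes (GNU coreutils)
numfmt --from=iec-i 128Mi   # prints 134217728
```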
---
## Step 3 — Attach the Volume in Maintenance Mode
Longhorn only creates the block device (`/dev/longhorn/<pv-name>`) when the volume is
attached to a node. Use a `VolumeAttachment` ticket to attach without a pod.
Choose `<target-node>` = the same node where the backup/merged image is stored (avoids
copying large files across the network).
```bash
kubectl apply -f - <<EOF
apiVersion: longhorn.io/v1beta2
kind: VolumeAttachment
metadata:
name: <pv-name>
namespace: longhorn-system
spec:
attachmentTickets:
recovery:
generation: 0
id: recovery
nodeID: <target-node>
parameters:
disableFrontend: "false"
type: longhorn-api
volume: <pv-name>
EOF
kubectl wait --for=jsonpath='{.status.state}'=attached \
volumes.longhorn.io/<pv-name> -n longhorn-system --timeout=120s
```
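Once the wait succeeds, the block device should exist on the target node:
```bash
# The device only exists while the volume is attached
ssh <target-node> "ls -l /dev/longhorn/<pv-name>"
```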
---
## Step 4 — Scale Down the Workload
Always stop the workload before touching the data to prevent concurrent writes and filesystem
corruption.
```bash
# For a Deployment:
kubectl scale deployment <name> -n <namespace> --replicas=0
# For a StatefulSet:
kubectl scale statefulset <name> -n <namespace> --replicas=0
```
---
## Step 5 — Inject Data Files via Block Device
```bash
ssh <target-node> bash <<'SHELL'
# Mount the live block device
sudo mkdir -p /mnt/recovery-live
sudo mount /dev/longhorn/<pv-name> /mnt/recovery-live
# Mount the reconstructed image (if not already mounted)
sudo mkdir -p /mnt/recovery-src
sudo mount -o loop /tmp/<pvc-name>-merged.img /mnt/recovery-src
# Sync: only the application data files, not lost+found
sudo rsync -av --exclude='lost+found' /mnt/recovery-src/ /mnt/recovery-live/
# Verify
sudo ls -lah /mnt/recovery-live/
# Unmount both
sudo umount /mnt/recovery-src
sudo umount /mnt/recovery-live
SHELL
```
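If the sync exits with code 23 (partial transfer, typically "Structure needs cleaning" on blocks damaged by the power cut), remount if needed and re-run it with `--ignore-errors`; rc=23 means some files were skipped, not that the transfer failed outright (see Pitfalls):
```bash
# Retry tolerating unreadable source blocks
sudo rsync -av --ignore-errors --exclude='lost+found' /mnt/recovery-src/ /mnt/recovery-live/
```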
---
## Step 6 — Detach the Volume
```bash
kubectl patch volumeattachments.longhorn.io <pv-name> \
-n longhorn-system --type json \
-p '[{"op":"remove","path":"/spec/attachmentTickets/recovery"}]'
kubectl wait --for=jsonpath='{.status.state}'=detached \
volumes.longhorn.io/<pv-name> -n longhorn-system --timeout=60s
```
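If the `detached` wait times out because a workload grabbed its own ticket immediately, wait for the `recovery` ticket to disappear instead (see Pitfalls):
```bash
# Poll until the recovery ticket is gone from spec.attachmentTickets
until [ -z "$(kubectl get volumeattachments.longhorn.io <pv-name> -n longhorn-system \
  -o jsonpath='{.spec.attachmentTickets.recovery}')" ]; do sleep 2; done
```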
---
## Step 7 — Restore PV and PVC
Clear stuck Terminating PV/PVC finalizers first if they exist:
```bash
kubectl patch pv <pv-name> --type=merge -p '{"metadata":{"finalizers":null}}' 2>/dev/null
kubectl patch pvc <pvc-name> -n <namespace> --type=merge \
-p '{"metadata":{"finalizers":null}}' 2>/dev/null
# Wait a moment for them to delete
```
Recreate the PV with `Retain` policy and no `claimRef`:
```bash
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolume
metadata:
name: <pv-name>
annotations:
pv.kubernetes.io/provisioned-by: driver.longhorn.io
spec:
accessModes: [ReadWriteOnce] # match original
capacity:
storage: <size> # e.g. 128Mi
csi:
driver: driver.longhorn.io
fsType: ext4
volumeHandle: <pv-name>
volumeAttributes:
dataEngine: v1
dataLocality: disabled
disableRevisionCounter: "true"
numberOfReplicas: "3"
staleReplicaTimeout: "30"
persistentVolumeReclaimPolicy: Retain
storageClassName: longhorn
volumeMode: Filesystem
EOF
```
Recreate the PVC pinned to this PV:
```bash
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: <pvc-name>
namespace: <namespace>
spec:
accessModes: [ReadWriteOnce]
resources:
requests:
storage: <size>
storageClassName: longhorn
volumeMode: Filesystem
volumeName: <pv-name>
EOF
```
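Because `volumeName` pins the claim to the PV, binding should be near-immediate:
```bash
kubectl get pvc <pvc-name> -n <namespace>   # STATUS must be Bound
```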
---
## Step 8 — Scale Up and Verify
```bash
kubectl scale deployment <name> -n <namespace> --replicas=1
kubectl wait --for=condition=Ready pod -l app=<name> -n <namespace> --timeout=120s
```
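For a StatefulSet, scale the StatefulSet and wait on `readyReplicas` rather than pod labels (see Pitfalls):
```bash
kubectl scale statefulset <name> -n <namespace> --replicas=1
# StatefulSet pods may not carry an app=<name> label; wait on the object itself
kubectl wait --for=jsonpath='{.status.readyReplicas}'=1 \
  statefulset/<name> -n <namespace> --timeout=180s
```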
---
## Pitfalls Learned During 2026-04-13 Recovery
| Pitfall | What happened | Prevention |
|---------|--------------|------------|
| **Directory swap corrupts data** | Longhorn found old `Dirty: true` volume.meta + empty pi1 replica → rebuilt from empty source | Never swap dirs. Use merge tool + block device injection instead |
| **Snapshot is zeroed after swap** | Longhorn reconciliation overwrote snapshot images when rebuilding from empty replica | Back up the untouched dir FIRST before any rename |
| **Multiple dirs per volume on pi3** | Rebuild attempts during the incident created extra dirs | Identify the untouched dir by timestamp AND verify non-zero content with `od` |
| **`Rebuilding: true` replica → all-zeros merged image** | Phase 0 picked a replica mid-rebuild (1.3 GiB actual data, sparse files look large) — merge tool produced an all-zeros image | Check `volume.meta` and skip any dir with `"Rebuilding": true` before merging |
| **`du -sb` gives misleading apparent sizes** | Sparse replica files (8 GiB file, 1.3 GiB actual) appeared larger than healthy 11 GiB replicas | Use `du -sk` (actual disk blocks) not `du -sb` (apparent/logical size) to rank replicas |
| **Dirty journal prevents ro mount** | `mount -o loop,ro` fails with "bad superblock" on an ext4 with unclean shutdown | Use `mount -o loop,ro,noload` to skip journal replay for read-only access |
| **New volume is unformatted** | `mount /dev/longhorn/<pv>` fails with "wrong fs type" on a freshly created volume | Run `mkfs.ext4 -F` before mounting; guard with `blkid` to skip if already formatted |
| **rsync rc=23 on power-cut partitions** | Some filesystem blocks were unreadable ("Structure needs cleaning") → rsync exits 23 | Use `rsync --ignore-errors`; rc=23 is a partial transfer, not a total failure |
| **pod blocks volume re-attach** | Old Error-state pod held a volume attachment claim | Delete old Error pods before scaling up new ones |
| **`kubectl cp` needs `tar`** | Distroless container had no `tar` binary | Mount block device directly on the node instead |
| **VolumeAttachment ticket removal** | Deleting a VolumeAttachment object causes Longhorn to immediately recreate it | Patch the `recovery` key out of `spec.attachmentTickets` instead of deleting the object |
| **Phase 7 wait for `detached` times out** | After removing the recovery ticket, a workload may immediately create its own ticket | Wait for the `recovery` ticket to disappear from `spec.attachmentTickets`, not for full detach |
| **StatefulSet pods not found by label** | `kubectl get pod -l app=<name>` returns nothing for StatefulSet pods | Wait on `readyReplicas ≥ 1` on the StatefulSet object, not on pod labels |
| **`set_fact` overridden by `-e @file`** | Ansible extra vars have highest precedence — `set_fact: longhorn_recovery_volumes` was silently ignored | Use a different variable name (`_volumes`) for the resolved list, never reassign the extra var name |
---
## Identifying the Right Replica Directory
When multiple old dirs exist for the same volume on a node, use these checks to pick the recovery source:
1. **Skip `Rebuilding: true`:** check `volume.meta` first — a dir that was being rebuilt when
the incident happened has incomplete data (sparse files are allocated but mostly zeroed):
```bash
python3 -c "import json; d=json.load(open('volume.meta')); print('Rebuilding:', d['Rebuilding'])"
```
Only consider dirs where `Rebuilding: false`.
2. **Actual size:** `sudo du -sk <dir>` (actual disk usage in KB — not `du -sb` which returns
apparent/logical size and is misleading for sparse files). Pick the largest actual size.
3. **Timestamps:** prefer the most recently modified before the incident date.
4. **Snapshot chain:** if Rebuilding is false on multiple dirs, check `volume.meta` for
`"Dirty": false` (clean shutdown) vs `"Dirty": true`. Prefer clean if available.
5. **Content check:** verify the snapshot is not all zeros:
```bash
sudo od -A x -t x1z -v volume-snap-*.img | grep -v ' 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00' | head -3
```
If the output is empty (all zeros), the snapshot was overwritten. Try another node.
**Summary rule:** `Rebuilding: false` → largest `du -sk` → non-zero snapshot content.
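A loop that applies the rule mechanically (a sketch; run on each node holding candidate dirs):
```bash
# Rank non-Rebuilding candidates by actual disk usage; largest wins
for d in /mnt/arcodange/longhorn/replicas/<pv-prefix>-*/; do
  rb=$(sudo python3 -c "import json; print(json.load(open('$d/volume.meta'))['Rebuilding'])")
  [ "$rb" = "True" ] && continue       # skip mid-rebuild replicas
  sudo du -sk "$d"
done | sort -rn | head -1              # then verify non-zero snapshot content
```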
---
## Reference: Key Commands
```bash
# List all replica dirs for a volume across all nodes
for n in pi1 pi2 pi3; do echo "==$n=="; ssh $n "sudo ls /mnt/arcodange/longhorn/replicas/ | grep <pv-prefix>"; done
# Check Longhorn volume state
kubectl get volumes.longhorn.io -n longhorn-system <pv-name>
# Check VolumeAttachment tickets
kubectl get volumeattachments.longhorn.io -n longhorn-system <pv-name> \
-o jsonpath='{.spec.attachmentTickets}'
# Check Longhorn block device existence on a node
ssh <node> "ls /dev/longhorn/<pv-name>"
# Verify filesystem content without starting the app
ssh <node> "sudo mount /dev/longhorn/<pv-name> /mnt/check && sudo ls /mnt/check && sudo umount /mnt/check"
```