docs(longhorn): document 2026-04-13 power-cut recovery + add data-recovery tooling
Captures the post-mortem of the April 13 power-cut: incident timeline, retrospective, and architecture/role diagrams. Adds an ADR explaining why Longhorn cannot re-associate orphaned replica directories after a nuclear reinstall (engine-id naming), plus block-device recovery runbooks and the `playbooks/recover/longhorn_data.yml` automation that wires `merge-longhorn-layers.py` to rebuild PVCs from raw `volume-head-*.img` chains. Also extends the k3s_pvc backup to capture Longhorn `volumes`/`settings` CRDs (needed for the fast-path restore) and rewrites the restore script with a fallback dir + English messages. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,244 @@

# Cluster Recovery Agent Instructions

You are recovering the Arcodange homelab k3s cluster after an outage (power cut, node failure, or
Longhorn reinstall). Your job is to assess damage, run the appropriate Ansible playbooks and
kubectl commands, and bring the cluster back to a fully healthy state.

You do NOT need to modify any code. All recovery tooling already exists.

---

## Cluster Overview

| Component | Details |
|-----------|---------|
| Nodes | pi1, pi2, pi3 (Raspberry Pi, SSH via `pi<N>.home`) |
| k8s distribution | k3s |
| Storage | Longhorn (`/mnt/arcodange/longhorn/`) |
| GitOps | ArgoCD (apps auto-sync from `gitea.arcodange.lab/arcodange-org/`) |
| Secrets | HashiCorp Vault (`tools` namespace, manual unseal) |
| Ingress | Traefik + CrowdSec bouncer |
| Working dir | `/Users/gabrielradureau/Work/Arcodange/factory/ansible/arcodange/factory/` |
| Inventory | `inventory/hosts.yml` |

**Critical dependency:** ERP (Dolibarr) uses Vault-rotated DB credentials written to its PVC.
**Always recover and unseal Vault before scaling ERP up.**

---

## Step 0 — Assess Damage

Run these first to understand what is broken:

```bash
# Overall pod health
kubectl get pods -A | grep -v Running | grep -v Completed

# PVC health (anything not Bound is a problem)
kubectl get pvc -A | grep -v Bound

# Longhorn volume states
kubectl get volumes.longhorn.io -n longhorn-system

# Longhorn manager health (prerequisite for all recovery)
kubectl get pods -n longhorn-system -l app=longhorn-manager
```

---

## Step 1 — Longhorn Volume Recovery

### Path A — Fast path (backup file exists, Volume CRDs were backed up)

Check if a recent backup exists on pi1:

```bash
ssh pi1.home "ls -lt /mnt/backups/k3s_pvc/backup_*.volumes | head -5"
```

If a backup file exists and is recent (from before the incident):

```bash
ssh pi1.home "kubectl apply -f /mnt/backups/k3s_pvc/backup_<YYYYMMDD>.volumes"
```

Then verify the PVCs are Bound and skip to Step 2.
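
If you want to script the fast path, a minimal sketch (it blindly picks the newest
backup file, so check its date against the incident before applying):

```bash
# Hypothetical helper: confirm the newest backup predates the incident first
latest=$(ssh pi1.home "ls -t /mnt/backups/k3s_pvc/backup_*.volumes 2>/dev/null | head -1")
echo "Newest backup: $latest"
ssh pi1.home "kubectl apply -f $latest"
kubectl get pvc -A | grep -v Bound   # only the header line = all PVCs Bound
```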

### Path B — Block-device injection (no usable backup, raw replica files intact)

Use this when PVCs are `Lost`/`Terminating` and no Volume CRD backup is available.

**Check which volumes need recovery:**

```bash
# Volumes with no PVC or a Lost/Terminating PVC
kubectl get pvc -A | grep -v Bound
```

**For each failed volume, create a vars file** following the pattern in:
`playbooks/recover/longhorn_data_vars.example.yml`

Existing vars files from the 2026-04-13 incident (reusable as references):
- `playbooks/recover/longhorn_data_vars_remaining.yml` — prometheus, alertmanager, redis, backups-rwx
- `playbooks/recover/longhorn_data_vars_erp_vault.yml` — erp, hashicorp-vault (audit + data)
- `playbooks/recover/longhorn_data_vars_clickhouse.yml` — clickhouse

**Key rules for the vars file** (a sketch follows this list):
- `source_node`/`source_dir` can be omitted — Phase 0 auto-discovers the largest non-Rebuilding replica
- Set `workload_name: ""` for ERP — it must not scale up until Vault is unsealed
- For StatefulSets with multiple PVCs (e.g. Vault), set `workload_name: ""` on all but the last entry
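
A minimal sketch of such a vars file (every key except `workload_name`, `source_node`,
`source_dir`, and `longhorn_recovery_volumes` is a guess here; copy the real key names
from `longhorn_data_vars.example.yml`):

```bash
# Hypothetical field names: verify against the example template before use
cat > playbooks/recover/longhorn_data_vars_myapp.yml <<'EOF'
longhorn_recovery_volumes:
  - pv_name: pvc-<uuid>          # Longhorn volume / PV name
    pvc_name: data-myapp-0       # PVC to recreate
    namespace: myapp
    size: "10737418240"          # bytes
    workload_kind: statefulset
    workload_name: ""            # empty = never scale this workload up
    # source_node / source_dir omitted: Phase 0 auto-discovers the replica
EOF
```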

**Run the recovery playbook:**

```bash
ansible-playbook -i inventory/hosts.yml playbooks/recover/longhorn_data.yml \
  -e @playbooks/recover/longhorn_data_vars_<NAME>.yml
```

The playbook is **idempotent** — safe to re-run if it fails midway.

**Playbook phases (for context when troubleshooting):**

| Phase | What it does |
|-------|-------------|
| 0 | Auto-discovers the best replica dir (skips `Rebuilding: true`) |
| 1 | Backs up the untouched replica dir to `/home/pi/arcodange/backups/longhorn-recovery/` |
| 2 | Merges snapshot+head layers into a single `.img` via `merge-longhorn-layers.py` |
| 3 | **Scales down workloads first**, then clears stuck Terminating PVCs, creates the Volume CRD |
| 4 | Scales down (second pass, idempotent) |
| 5 | Attaches the volume via a maintenance ticket to the source node |
| 6 | `mkfs.ext4` (if unformatted) + `rsync` from the merged image into the live block device |
| 7 | Removes the maintenance ticket (volume detaches) |
| 8 | Creates the PV (Retain, no claimRef) + a PVC pinned to the PV |
| 9 | Scales up workloads, waits for readyReplicas ≥ 1 (failures here are `ignore_errors: yes`) |

**Common Phase 8 failure — a StatefulSet re-creates PVCs before they can be pinned:**
The playbook handles this automatically (it scales down before finalizer removal). If you still hit it:

```bash
kubectl scale statefulset <name> -n <namespace> --replicas=0
kubectl patch pvc <pvc-name> -n <namespace> --type=merge -p '{"metadata":{"finalizers":null}}'
kubectl delete pvc <pvc-name> -n <namespace>
# Then re-run the playbook
```

---

## Step 2 — Unseal HashiCorp Vault

After Vault's PVCs are recovered, the pod boots **sealed**. Check:

```bash
kubectl get pod hashicorp-vault-0 -n tools
kubectl exec hashicorp-vault-0 -n tools -- vault status 2>/dev/null | grep Sealed
```

If sealed, run the unseal playbook (requires an interactive terminal for the Gitea password prompt):

```bash
ansible-playbook -i inventory/hosts.yml playbooks/tools/hashicorp_vault.yml
```

Unseal keys are at `~/.arcodange/cluster-keys.json` on the local machine. The playbook reads them automatically.
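
If the playbook cannot run, a manual fallback sketch (this assumes the keys file has the
standard `vault operator init -format=json` layout with an `unseal_keys_b64` array; check
yours first):

```bash
# Feed each unseal key to the sealed pod; stop once "Sealed: false" appears
for key in $(jq -r '.unseal_keys_b64[]' ~/.arcodange/cluster-keys.json); do
  kubectl exec hashicorp-vault-0 -n tools -- vault operator unseal "$key"
done
```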

After the playbook completes, verify:

```bash
kubectl get pod hashicorp-vault-0 -n tools                              # must be 1/1 Ready
kubectl exec hashicorp-vault-0 -n tools -- vault status | grep Sealed   # must be false
```

---

## Step 3 — Scale Up ERP

Only after Vault is unsealed and Ready:

```bash
kubectl scale deployment erp -n erp --replicas=1
kubectl rollout status deployment/erp -n erp
```

---

## Step 4 — Reconfigure Tools (CrowdSec, etc.)

Run this if the CrowdSec bouncer or the Traefik middleware needs reconfiguring:

```bash
# Standard run (bouncer key + Traefik middleware + restart)
ansible-playbook -i inventory/hosts.yml playbooks/tools/crowdsec.yml

# Include captcha HTML injection (use when the captcha page is broken)
ansible-playbook -i inventory/hosts.yml playbooks/tools/crowdsec.yml --tags never,all
```

If crowdsec-agent or crowdsec-appsec pods are stuck in `Error` after a long outage,
the playbook handles restarting them automatically.

---

## Step 5 — Re-enable ArgoCD selfHeal

Check whether `selfHeal` was disabled during recovery (look for `selfHeal: false` in the tools app):

```bash
grep -A5 "tools:" /Users/gabrielradureau/Work/Arcodange/factory/argocd/values.yaml
```

If disabled, re-enable it by editing `argocd/values.yaml` and setting `selfHeal: true`,
then let ArgoCD sync the change and confirm the app converges:

```bash
kubectl get app tools -n argocd
```
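
To read the live sync policy without grepping files, query the Application object directly
(`.spec.syncPolicy.automated.selfHeal` is the standard Argo CD field; empty output means
automation is off):

```bash
kubectl get app tools -n argocd \
  -o jsonpath='{.spec.syncPolicy.automated.selfHeal}'
```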

---

## Step 6 — Final Verification

```bash
# All pods running
kubectl get pods -A | grep -v Running | grep -v Completed | grep -v "^NAME"

# All PVCs bound
kubectl get pvc -A | grep -v Bound

# All Longhorn volumes healthy
kubectl get volumes.longhorn.io -n longhorn-system

# Run a fresh backup to capture the recovered state
ansible-playbook -i inventory/hosts.yml playbooks/backup/backup.yml \
  -e backup_root_dir=/mnt/backups
```

---

## Key Files Reference

| File | Purpose |
|------|---------|
| `playbooks/recover/longhorn_data.yml` | Main block-device recovery playbook |
| `playbooks/recover/longhorn.yml` | Recovery when Volume CRDs still exist |
| `playbooks/recover/longhorn_data_vars.example.yml` | Template for recovery vars |
| `playbooks/recover/longhorn_data_vars_erp_vault.yml` | Vars for erp + vault (2026-04-13 incident) |
| `playbooks/recover/longhorn_data_vars_remaining.yml` | Vars for other volumes (2026-04-13 incident) |
| `playbooks/backup/backup.yml` | Full backup (postgres + gitea + k3s PVCs + Longhorn CRDs) |
| `playbooks/backup/k3s_pvc.yml` | PV/PVC/Longhorn Volume CRD backup |
| `playbooks/tools/hashicorp_vault.yml` | Vault unseal + OIDC reconfiguration |
| `playbooks/tools/crowdsec.yml` | CrowdSec bouncer + Traefik middleware setup |
| `docs/adr/20260414-longhorn-pvc-recovery.md` | Full incident ADR with all recovery methods |
| `~/.arcodange/cluster-keys.json` | Vault unseal keys (local machine only) |

---

## Decision Tree

```
Cluster down after outage
│
├─ kubectl works? ──No──▶ Check k3s: `systemctl status k3s` on pi1/pi2/pi3
│
└─ Yes
   │
   ├─ PVCs all Bound? ──Yes──▶ Skip to Step 2 (check Vault)
   │
   └─ No
      │
      ├─ Recent .volumes backup on pi1? ──Yes──▶ Path A (kubectl apply backup)
      │
      └─ No
         │
         ├─ Longhorn Volume CRDs exist? ──Yes──▶ playbooks/recover/longhorn.yml
         │
         └─ No ──▶ Path B (longhorn_data.yml block-device injection)
                   Check replica dirs exist first:
                   ssh pi{1,2,3}.home "sudo du -sh /mnt/arcodange/longhorn/replicas/pvc-*"
```

@@ -0,0 +1,360 @@

# Runbook: Longhorn Block-Device Data Recovery

**When to use:** Longhorn has been fully reinstalled (nuclear cleanup). Volume CRDs are gone.
Application PVCs are stuck `Terminating` or `Lost`. The raw replica `.img` files still exist
on disk across the nodes. kubectl/k8s objects cannot help — we must work directly with the
Longhorn replica directories and block devices.

**Automated version:** `playbooks/recover/longhorn_data.yml`

---

## Mental Model

Longhorn stores each replica as a chain of sparse raw image files inside a directory named
`<pv-name>-<random-hex>` under `<longhorn_data_path>/replicas/`. Each directory contains:

```
volume.meta                  — engine state (Head filename, Parent snapshot, Dirty flag)
volume-head-NNN.img          — active write log (sparse, only changed blocks)
volume-head-NNN.img.meta     — head metadata
volume-snap-<uuid>.img       — snapshot at a point in time (sparse, full state)
volume-snap-<uuid>.img.meta  — snapshot metadata
revision.counter             — monotonically increasing write counter
```
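
To peek at a replica's engine state before deciding anything, `volume.meta` is plain JSON
(the `Rebuilding` check later in this runbook relies on that), so a read-only look is cheap:

```bash
# <node> / <pv-name>-<hex> placeholders as elsewhere in this runbook
ssh <node> "sudo cat /mnt/arcodange/longhorn/replicas/<pv-name>-<hex>/volume.meta" \
  | python3 -m json.tool
```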

After a nuclear cleanup + reinstall, Longhorn creates **new empty replica directories** with
new random hex suffixes. The old directories (with data) are left on disk but orphaned.

**Why directory-swap fails:** the old `volume.meta` has a different engine generation and
`Dirty: true`. Longhorn detects the inconsistency across replicas and rebuilds from the
"cleanest" source (the new empty pi1 replica), overwriting the old data.

**What works:** extract the filesystem from the untouched replica directory directly, then
inject the data files into the live Longhorn block device while the volume is temporarily
attached in maintenance mode.

---

## Decision Tree

```
Are Volume CRDs present in Longhorn?
├── YES → normal PV/PVC restore is enough, use playbooks/recover/longhorn.yml
└── NO
    └── Are replica directories present on disk?
        ├── NO → data is lost, provision fresh volumes
        └── YES
            └── Is there an untouched replica dir (timestamps from before the incident)?
                ├── NO → data likely unrecoverable (all dirs were zeroed during reconciliation)
                └── YES → follow this runbook
```

---

## Step 0 — Pre-flight: Inventory Surviving Replica Directories

On each node, list replica dirs and their sizes. Dirs with actual data are large (>16K).
New empty dirs created by Longhorn are always exactly 16K.

```bash
for node in pi1 pi2 pi3; do
  echo "=== $node ==="
  ssh $node "sudo du -sh /mnt/arcodange/longhorn/replicas/pvc-<VOLUME>-* 2>/dev/null"
done
```

**Key rule:** identify the replica dir that was **never touched** by the reinstall — it has
old timestamps (from before the incident) and its size matches the original volume usage.
This is your recovery source. **Back it up before touching anything.**

```bash
# On the node that has the untouched dir:
sudo mkdir -p /home/pi/arcodange/backups/longhorn-recovery/<pvc-name>/
sudo cp -a /mnt/arcodange/longhorn/replicas/<pv-name>-<old-hex>/ \
  /home/pi/arcodange/backups/longhorn-recovery/<pvc-name>/
```

---

## Step 1 — Reconstruct the Filesystem

The replica directory contains a snapshot chain. Each layer is a sparse raw image — unchanged
blocks appear as zeroed sparse regions; only written blocks contain data. To reconstruct the
full filesystem, the layers must be merged: the head wins, and each older snapshot fills in
only the blocks that no newer layer wrote.
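
Before running the merge tool you can eyeball the chain it will flatten. A sketch, assuming
each `.img.meta` is JSON with a `Parent` field mirroring `volume.meta` (the field names here
are an assumption; verify against your files):

```bash
# Print the layer chain, newest (head) first, by following Parent links
cd /mnt/arcodange/longhorn/replicas/<pv-name>-<old-hex>/
layer=$(sudo python3 -c "import json; print(json.load(open('volume.meta'))['Head'])")
while [ -n "$layer" ]; do
  echo "$layer"
  layer=$(sudo python3 -c "import json; print(json.load(open('$layer.meta')).get('Parent', ''))")
done
```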

Use `docs/incidents/2026-04-13-power-cut/tools/merge-longhorn-layers.py`:

```bash
# On the node holding the backup:
sudo python3 merge-longhorn-layers.py \
  /home/pi/arcodange/backups/longhorn-recovery/<pvc-name>/<pv-name>-<old-hex>/ \
  /tmp/<pvc-name>-merged.img

# Verify the filesystem mounts
sudo mkdir -p /mnt/recovery-<pvc-name>
sudo mount -o loop /tmp/<pvc-name>-merged.img /mnt/recovery-<pvc-name>
sudo ls -lah /mnt/recovery-<pvc-name>/
sudo umount /mnt/recovery-<pvc-name>
```

If the mount fails with "wrong fs type" or "bad superblock":
- The snapshot `.img` may be all-zero (overwritten by a prior Longhorn reconciliation)
- Try the next-oldest replica dir from another node
- Check with `sudo od -A x -t x1z -v snap.img | grep -v ' 00 00...' | head -5`

---

## Step 2 — Create the Longhorn Volume CRD

Longhorn needs to know about the volume before its block device can be used.

```bash
kubectl apply -f - <<EOF
apiVersion: longhorn.io/v1beta2
kind: Volume
metadata:
  name: <pv-name>
  namespace: longhorn-system
spec:
  accessMode: rwo           # or rwx
  dataEngine: v1
  frontend: blockdev
  numberOfReplicas: 3
  size: "<size-in-bytes>"   # e.g. "134217728" for 128Mi
EOF
```

Wait for replicas to appear:

```bash
kubectl get replicas.longhorn.io -n longhorn-system | grep <pv-name>
# Expect 3 replicas in the "stopped" state
```

---

## Step 3 — Attach the Volume in Maintenance Mode

Longhorn only creates the block device (`/dev/longhorn/<pv-name>`) when the volume is
attached to a node. Use a `VolumeAttachment` ticket to attach without a pod.

Choose `<target-node>` = the same node where the backup/merged image is stored (avoids
copying large files across the network).

```bash
kubectl apply -f - <<EOF
apiVersion: longhorn.io/v1beta2
kind: VolumeAttachment
metadata:
  name: <pv-name>
  namespace: longhorn-system
spec:
  attachmentTickets:
    recovery:
      generation: 0
      id: recovery
      nodeID: <target-node>
      parameters:
        disableFrontend: "false"
      type: longhorn-api
  volume: <pv-name>
EOF

kubectl wait --for=jsonpath='{.status.state}'=attached \
  volumes.longhorn.io/<pv-name> -n longhorn-system --timeout=120s
```

---

## Step 4 — Scale Down the Workload

Always stop the workload before touching the data, to prevent concurrent writes and filesystem
corruption.

```bash
# For a Deployment:
kubectl scale deployment <name> -n <namespace> --replicas=0

# For a StatefulSet:
kubectl scale statefulset <name> -n <namespace> --replicas=0
```

---

## Step 5 — Inject Data Files via Block Device

```bash
ssh <target-node> bash <<'SHELL'
# Mount the live block device
sudo mkdir -p /mnt/recovery-live
sudo mount /dev/longhorn/<pv-name> /mnt/recovery-live

# Mount the reconstructed image (if not already mounted)
sudo mkdir -p /mnt/recovery-src
sudo mount -o loop /tmp/<pvc-name>-merged.img /mnt/recovery-src

# Sync: only the application data files, not lost+found
sudo rsync -av --exclude='lost+found' /mnt/recovery-src/ /mnt/recovery-live/

# Verify
sudo ls -lah /mnt/recovery-live/

# Unmount both
sudo umount /mnt/recovery-src
sudo umount /mnt/recovery-live
SHELL
```

---

## Step 6 — Detach the Volume

```bash
kubectl patch volumeattachments.longhorn.io <pv-name> \
  -n longhorn-system --type json \
  -p '[{"op":"remove","path":"/spec/attachmentTickets/recovery"}]'

kubectl wait --for=jsonpath='{.status.state}'=detached \
  volumes.longhorn.io/<pv-name> -n longhorn-system --timeout=60s
```
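
If the wait for `detached` times out because a workload immediately files its own ticket
(see the pitfalls table below), wait for the `recovery` ticket to vanish instead:

```bash
# Poll until the recovery ticket is gone from spec.attachmentTickets
until [ -z "$(kubectl get volumeattachments.longhorn.io <pv-name> -n longhorn-system \
      -o jsonpath='{.spec.attachmentTickets.recovery}')" ]; do
  sleep 2
done
```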

---

## Step 7 — Restore PV and PVC

Clear stuck Terminating PV/PVC finalizers first, if they exist:

```bash
kubectl patch pv <pv-name> --type=merge -p '{"metadata":{"finalizers":null}}' 2>/dev/null
kubectl patch pvc <pvc-name> -n <namespace> --type=merge \
  -p '{"metadata":{"finalizers":null}}' 2>/dev/null
# Wait a moment for them to delete
```
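
To make "wait a moment" deterministic, `kubectl wait --for=delete` returns as soon as the
objects are gone (and immediately if they never existed):

```bash
kubectl wait --for=delete pvc/<pvc-name> -n <namespace> --timeout=60s 2>/dev/null || true
kubectl wait --for=delete pv/<pv-name> --timeout=60s 2>/dev/null || true
```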

Recreate the PV with `Retain` policy and no `claimRef`:

```bash
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolume
metadata:
  name: <pv-name>
  annotations:
    pv.kubernetes.io/provisioned-by: driver.longhorn.io
spec:
  accessModes: [ReadWriteOnce]   # match original
  capacity:
    storage: <size>              # e.g. 128Mi
  csi:
    driver: driver.longhorn.io
    fsType: ext4
    volumeHandle: <pv-name>
    volumeAttributes:
      dataEngine: v1
      dataLocality: disabled
      disableRevisionCounter: "true"
      numberOfReplicas: "3"
      staleReplicaTimeout: "30"
  persistentVolumeReclaimPolicy: Retain
  storageClassName: longhorn
  volumeMode: Filesystem
EOF
```

Recreate the PVC pinned to this PV:

```bash
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: <pvc-name>
  namespace: <namespace>
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: <size>
  storageClassName: longhorn
  volumeMode: Filesystem
  volumeName: <pv-name>
EOF
```

---

## Step 8 — Scale Up and Verify

```bash
kubectl scale deployment <name> -n <namespace> --replicas=1
kubectl wait --for=condition=Ready pod -l app=<name> -n <namespace> --timeout=120s
```
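
For a StatefulSet, wait on `readyReplicas` rather than pod labels, which may not match
(see the pitfalls table below):

```bash
kubectl scale statefulset <name> -n <namespace> --replicas=1
kubectl wait --for=jsonpath='{.status.readyReplicas}'=1 \
  statefulset/<name> -n <namespace> --timeout=180s
```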

---

## Pitfalls Learned During the 2026-04-13 Recovery

| Pitfall | What happened | Prevention |
|---------|--------------|------------|
| **Directory swap corrupts data** | Longhorn found the old `Dirty: true` volume.meta + an empty pi1 replica → rebuilt from the empty source | Never swap dirs. Use the merge tool + block-device injection instead |
| **Snapshot is zeroed after swap** | Longhorn reconciliation overwrote snapshot images when rebuilding from the empty replica | Back up the untouched dir FIRST, before any rename |
| **Multiple dirs per volume on pi3** | Rebuild attempts during the incident created extra dirs | Identify the untouched dir by timestamp AND verify non-zero content with `od` |
| **`Rebuilding: true` replica → all-zeros merged image** | Phase 0 picked a replica mid-rebuild (1.3 GiB actual data; sparse files look large) — the merge tool produced an all-zeros image | Check `volume.meta` and skip any dir with `"Rebuilding": true` before merging |
| **`du -sb` gives misleading apparent sizes** | Sparse replica files (8 GiB file, 1.3 GiB actual) appeared larger than healthy 11 GiB replicas | Use `du -sk` (actual disk blocks), not `du -sb` (apparent/logical size), to rank replicas |
| **Dirty journal prevents ro mount** | `mount -o loop,ro` fails with "bad superblock" on an ext4 with an unclean shutdown | Use `mount -o loop,ro,noload` to skip journal replay for read-only access |
| **New volume is unformatted** | `mount /dev/longhorn/<pv>` fails with "wrong fs type" on a freshly created volume | Run `mkfs.ext4 -F` before mounting; guard with `blkid` to skip if already formatted |
| **rsync rc=23 on power-cut partitions** | Some filesystem blocks were unreadable ("Structure needs cleaning") → rsync exits 23 | Use `rsync --ignore-errors`; rc=23 is a partial transfer, not a total failure |
| **Pod blocks volume re-attach** | An old Error-state pod held a volume attachment claim | Delete old Error pods before scaling up new ones |
| **`kubectl cp` needs `tar`** | A distroless container had no `tar` binary | Mount the block device directly on the node instead |
| **VolumeAttachment ticket removal** | Deleting a VolumeAttachment object causes Longhorn to immediately recreate it | Patch the `recovery` key out of `spec.attachmentTickets` instead of deleting the object |
| **Phase 7 wait for `detached` times out** | After removing the recovery ticket, a workload may immediately create its own ticket | Wait for the `recovery` ticket to disappear from `spec.attachmentTickets`, not for full detach |
| **StatefulSet pods not found by label** | `kubectl get pod -l app=<name>` returns nothing for StatefulSet pods | Wait on `readyReplicas ≥ 1` on the StatefulSet object, not on pod labels |
| **`set_fact` overridden by `-e @file`** | Ansible extra vars have the highest precedence — `set_fact: longhorn_recovery_volumes` was silently ignored | Use a different variable name (`_volumes`) for the resolved list; never reassign the extra-var name |

---

## Identifying the Right Replica Directory

When multiple old dirs exist for the same volume on a node, pick the one to use for recovery
(a ranking sketch follows this list):

1. **Skip `Rebuilding: true`:** check `volume.meta` first — a dir that was being rebuilt when
   the incident happened has incomplete data (sparse files are allocated but mostly zeroed):
   ```bash
   python3 -c "import json; d=json.load(open('volume.meta')); print('Rebuilding:', d['Rebuilding'])"
   ```
   Only consider dirs where `Rebuilding: false`.

2. **Actual size:** `sudo du -sk <dir>` (actual disk usage in KB — not `du -sb`, which returns
   the apparent/logical size and is misleading for sparse files). Pick the largest actual size.

3. **Timestamps:** prefer the dir most recently modified before the incident date.

4. **Snapshot chain:** if `Rebuilding` is false on multiple dirs, check `volume.meta` for
   `"Dirty": false` (clean shutdown) vs `"Dirty": true`. Prefer clean if available.

5. **Content check:** verify the snapshot is not all zeros:
   ```bash
   sudo od -A x -t x1z -v volume-snap-*.img | grep -v ' 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00' | head -3
   ```
   If the output is empty (all zeros), the snapshot was overwritten. Try another node.

**Summary rule:** `Rebuilding: false` → largest `du -sk` → non-zero snapshot content.
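
A sketch that applies rules 1 and 2 in one pass (run it on the node where the candidate
dirs live; the largest clean candidate prints first):

```bash
# Filter out Rebuilding dirs, then rank the rest by actual on-disk size (KB)
for d in /mnt/arcodange/longhorn/replicas/<pv-prefix>-*/; do
  reb=$(sudo python3 -c "import json; print(json.load(open('$d/volume.meta'))['Rebuilding'])" 2>/dev/null)
  [ "$reb" = "False" ] && sudo du -sk "$d"
done | sort -rn
```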

---

## Reference: Key Commands

```bash
# List all replica dirs for a volume across all nodes
for n in pi1 pi2 pi3; do echo "==$n=="; ssh $n "sudo ls /mnt/arcodange/longhorn/replicas/ | grep <pv-prefix>"; done

# Check Longhorn volume state
kubectl get volumes.longhorn.io -n longhorn-system <pv-name>

# Check VolumeAttachment tickets
kubectl get volumeattachments.longhorn.io -n longhorn-system <pv-name> \
  -o jsonpath='{.spec.attachmentTickets}'

# Check Longhorn block device existence on a node
ssh <node> "ls /dev/longhorn/<pv-name>"

# Verify filesystem content without starting the app
ssh <node> "sudo mount /dev/longhorn/<pv-name> /mnt/check && sudo ls /mnt/check && sudo umount /mnt/check"
```