Cluster Recovery Agent Instructions
You are recovering the Arcodange homelab k3s cluster after an outage (power cut, node failure, or Longhorn reinstall). Your job is to assess damage, run the appropriate Ansible playbooks and kubectl commands, and bring the cluster back to a fully healthy state.
You do NOT need to modify any code. All recovery tooling already exists.
Cluster Overview
| Component | Details |
|---|---|
| Nodes | pi1, pi2, pi3 (Raspberry Pi, SSH via pi<N>.home) |
| k8s distribution | k3s |
| Storage | Longhorn (/mnt/arcodange/longhorn/) |
| GitOps | ArgoCD (apps auto-sync from gitea.arcodange.lab/arcodange-org/) |
| Secrets | HashiCorp Vault (tools namespace, manual unseal) |
| Ingress | Traefik + CrowdSec bouncer |
| Working dir | /Users/gabrielradureau/Work/Arcodange/factory/ansible/arcodange/factory/ |
| Inventory | inventory/hosts.yml |
Critical dependency: ERP (Dolibarr) uses Vault-rotated DB credentials written to its PVC. Always recover and unseal Vault before scaling ERP up.
Step 0 — Assess Damage
Run these first to understand what is broken:
# Overall pod health
kubectl get pods -A | grep -v Running | grep -v Completed
# PVC health (anything not Bound is a problem)
kubectl get pvc -A | grep -v Bound
# Longhorn volume states
kubectl get volumes.longhorn.io -n longhorn-system
# Longhorn manager health (prerequisite for all recovery)
kubectl get pods -n longhorn-system -l app=longhorn-manager
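As a triage aid, the output of the first command can be grouped by namespace to see where the damage is concentrated. The sketch below is illustrative and not part of the existing tooling; the function name `unhealthy_by_namespace` is hypothetical, and it assumes the default `kubectl get pods -A` column order (NAMESPACE, NAME, READY, STATUS, ...).

```shell
#!/usr/bin/env bash
# Count pods per namespace whose STATUS is neither Running nor Completed.
# Reads the output of `kubectl get pods -A` on stdin (header line included).
unhealthy_by_namespace() {
  awk 'NR > 1 && $4 != "Running" && $4 != "Completed" { n[$1]++ }
       END { for (ns in n) print ns, n[ns] }' | sort
}

# Usage (assumes kubectl is already configured for the cluster):
#   kubectl get pods -A | unhealthy_by_namespace
```

Namespaces with the highest counts are usually the ones whose PVCs need attention first.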
Step 1 — Longhorn Volume Recovery
Path A — Fast path (backup file exists, Volume CRDs were backed up)
Check if a recent backup exists on pi1:
ssh pi1.home "ls -lt /mnt/backups/k3s_pvc/backup_*.volumes | head -5"
If a backup file exists and is recent (from before the incident):
ssh pi1.home "kubectl apply -f /mnt/backups/k3s_pvc/backup_<YYYYMMDD>.volumes"
Then verify that the PVCs are Bound and skip to Step 2.
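Choosing the right backup file can be sketched as a small filter over the `ls` output. This is a hedged illustration: the function name `latest_backup_before` is hypothetical, and it assumes the `backup_<YYYYMMDD>.volumes` naming shown above. It conservatively excludes backups taken on the incident date itself, since those may postdate the outage.

```shell
#!/usr/bin/env bash
# Print the date (YYYYMMDD) of the newest backup strictly older than the
# incident date. Reads file names or `ls -l` lines on stdin.
latest_backup_before() {
  local incident="$1"
  sed -n 's/.*backup_\([0-9]\{8\}\)\.volumes$/\1/p' \
    | awk -v d="$incident" '$1 < d' \
    | sort | tail -n 1
}

# Usage (the incident date is supplied by the operator):
#   ssh pi1.home "ls /mnt/backups/k3s_pvc/" | latest_backup_before 20260413
```

An empty result means no backup predates the incident, in which case Path A does not apply.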
Path B — Block-device injection (no usable backup, raw replica files intact)
Use this when PVCs are Lost/Terminating and no Volume CRD backup is available.
Check which volumes need recovery:
# Volumes with no PVC or Lost/Terminating PVC
kubectl get pvc -A | grep -v Bound
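To turn that listing into a worklist, the non-Bound PVCs can be reduced to `namespace/name status` pairs. A minimal sketch, assuming the default `kubectl get pvc -A` column order (NAMESPACE, NAME, STATUS, ...); the function name `unbound_pvcs` is hypothetical.

```shell
#!/usr/bin/env bash
# List PVCs that are not Bound as "namespace/name status".
# Reads the output of `kubectl get pvc -A` on stdin (header line included).
unbound_pvcs() {
  awk 'NR > 1 && $3 != "Bound" { print $1 "/" $2, $3 }'
}

# Usage:
#   kubectl get pvc -A | unbound_pvcs
```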
For each failed volume, create a vars file following the pattern in:
playbooks/recover/longhorn_data_vars.example.yml
Existing vars files from the 2026-04-13 incident (reusable as references):
- playbooks/recover/longhorn_data_vars_remaining.yml: prometheus, alertmanager, redis, backups-rwx
- playbooks/recover/longhorn_data_vars_erp_vault.yml: erp, hashicorp-vault (audit + data)
- playbooks/recover/longhorn_data_vars_clickhouse.yml: clickhouse
Key rules for the vars file:
- source_node/source_dir can be omitted; Phase 0 auto-discovers the largest non-Rebuilding replica
- Set workload_name: "" for ERP; it must not scale up until Vault is unsealed
- For StatefulSets with multiple PVCs (e.g. Vault), set workload_name: "" on all but the last entry
Run the recovery playbook:
ansible-playbook -i inventory/hosts.yml playbooks/recover/longhorn_data.yml \
-e @playbooks/recover/longhorn_data_vars_<NAME>.yml
The playbook is idempotent — safe to re-run if it fails midway.
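Because re-running is safe, a generic retry wrapper is one way to ride out transient failures (for example a node that is still booting). This is a sketch, not part of the existing tooling; `retry` is a hypothetical helper and the attempt count is arbitrary.

```shell
#!/usr/bin/env bash
# Run a command up to N times, stopping at the first success.
# Only appropriate for idempotent commands.
retry() {
  local attempts="$1" i=1
  shift
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    echo "attempt $i/$attempts failed: $*" >&2
    i=$((i + 1))
  done
  return 1
}

# Usage (replace <NAME> with the vars file suffix):
#   retry 3 ansible-playbook -i inventory/hosts.yml \
#     playbooks/recover/longhorn_data.yml \
#     -e @playbooks/recover/longhorn_data_vars_<NAME>.yml
```

If all attempts fail, read the failing phase in the Ansible output before retrying again; repeated blind retries will not fix a missing replica directory.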
Playbook phases (for context when troubleshooting):
| Phase | What it does |
|---|---|
| 0 | Auto-discovers best replica dir (skips Rebuilding: true) |
| 1 | Backs up untouched replica dir to /home/pi/arcodange/backups/longhorn-recovery/ |
| 2 | Merges snapshot+head layers into a single .img via merge-longhorn-layers.py |
| 3 | Scales down workloads first, then clears stuck Terminating PVCs, creates Volume CRD |
| 4 | Scale down (second pass, idempotent) |
| 5 | Attaches volume via maintenance ticket to source node |
| 6 | mkfs.ext4 (if unformatted) + rsync from merged image into live block device |
| 7 | Removes maintenance ticket (volume detaches) |
| 8 | Creates PV (Retain, no claimRef) + PVC pinned to PV |
| 9 | Scales up workloads, waits for readyReplicas ≥ 1 (failures here are ignore_errors: yes) |
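The Phase 0 selection rule (largest replica directory that is not marked Rebuilding) can be pictured with the sketch below. It is not the playbook's actual implementation. It assumes each replica directory is named `<volume-name>-<suffix>` and contains a `volume.meta` JSON file with a `Rebuilding` field; the function name `best_replica_dir` is hypothetical.

```shell
#!/usr/bin/env bash
# Print the largest replica directory for a volume, skipping replicas whose
# volume.meta marks them as Rebuilding.
#   $1: replicas root (e.g. /mnt/arcodange/longhorn/replicas)
#   $2: volume name   (e.g. pvc-<uuid>)
best_replica_dir() {
  local best="" best_size=-1 dir size
  for dir in "$1"/"$2"-*/; do
    # Skip unmatched globs and directories without metadata.
    [ -f "${dir}volume.meta" ] || continue
    if grep -Eq '"Rebuilding"[[:space:]]*:[[:space:]]*true' "${dir}volume.meta"; then
      continue
    fi
    # Disk usage in KiB; replica images are sparse, so this is real usage.
    size="$(du -sk "$dir" | cut -f1)"
    if [ "$size" -gt "$best_size" ]; then
      best="${dir%/}"
      best_size="$size"
    fi
  done
  [ -n "$best" ] && printf '%s\n' "$best"
}

# Usage (run on a node, typically with sudo):
#   best_replica_dir /mnt/arcodange/longhorn/replicas pvc-<uuid>
```

Running it on each node and comparing the results by hand is a reasonable cross-check when setting source_node/source_dir explicitly.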
Common Phase 8 failure: a StatefulSet re-creates its PVCs before they can be pinned. The playbook handles this automatically (it scales the workload down before removing finalizers). If you still hit it:
kubectl scale statefulset <name> -n <namespace> --replicas=0
kubectl patch pvc <pvc-name> -n <namespace> --type=merge -p '{"metadata":{"finalizers":null}}'
kubectl delete pvc <pvc-name> -n <namespace>
# Then re-run the playbook
Step 2 — Unseal HashiCorp Vault
After Vault's PVCs are recovered, the pod boots sealed. Check:
kubectl get pod hashicorp-vault-0 -n tools
kubectl exec hashicorp-vault-0 -n tools -- vault status 2>/dev/null | grep Sealed
If sealed, run the unseal playbook (requires an interactive terminal for the Gitea password prompt):
ansible-playbook -i inventory/hosts.yml playbooks/tools/hashicorp_vault.yml
Unseal keys are at ~/.arcodange/cluster-keys.json on the local machine. The playbook reads them automatically.
After the playbook completes, verify:
kubectl get pod hashicorp-vault-0 -n tools # must be 1/1 Ready
kubectl exec hashicorp-vault-0 -n tools -- vault status | grep Sealed # must be false
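For scripted checks, the `Sealed` line can be converted into an exit code instead of being read by eye. A minimal sketch, assuming the default table output of `vault status` (a `Sealed` key followed by `true` or `false`); the function name `vault_is_unsealed` is hypothetical.

```shell
#!/usr/bin/env bash
# Succeed only if the `vault status` output on stdin reports "Sealed false".
# Missing or unreadable output counts as sealed.
vault_is_unsealed() {
  awk '$1 == "Sealed" { found = 1; ok = ($2 == "false") }
       END { exit !(found && ok) }'
}

# Usage:
#   kubectl exec hashicorp-vault-0 -n tools -- vault status 2>/dev/null \
#     | vault_is_unsealed && echo "Vault is unsealed"
```

Treating missing output as sealed keeps the check fail-safe, which matters because ERP must not start against a sealed Vault.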
Step 3 — Scale Up ERP
Only after Vault is unsealed and Ready:
kubectl scale deployment erp -n erp --replicas=1
kubectl rollout status deployment/erp -n erp
Step 4 — Reconfigure Tools (CrowdSec, etc.)
Run if CrowdSec bouncer or Traefik middleware needs reconfiguring:
# Standard run (bouncer key + Traefik middleware + restart)
ansible-playbook -i inventory/hosts.yml playbooks/tools/crowdsec.yml
# Include captcha HTML injection (use when captcha page is broken)
ansible-playbook -i inventory/hosts.yml playbooks/tools/crowdsec.yml --tags never,all
If crowdsec-agent or crowdsec-appsec pods are stuck in Error after a long outage,
the playbook handles restarting them automatically.
Step 5 — Re-enable ArgoCD selfHeal
Check if selfHeal was disabled during recovery (look for selfHeal: false in the tools app):
grep -A5 "tools:" /Users/gabrielradureau/Work/Arcodange/factory/argocd/values.yaml
If disabled, re-enable it by editing argocd/values.yaml and setting selfHeal: true,
then check the sync and health status of the ArgoCD app:
kubectl get app tools -n argocd
Step 6 — Final Verification
# All pods running
kubectl get pods -A | grep -v Running | grep -v Completed | grep -v "^NAME"
# All PVCs bound
kubectl get pvc -A | grep -v Bound
# All Longhorn volumes healthy
kubectl get volumes.longhorn.io -n longhorn-system
# Run a fresh backup to capture the recovered state
ansible-playbook -i inventory/hosts.yml playbooks/backup/backup.yml \
-e backup_root_dir=/mnt/backups
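The Longhorn volume check can also be reduced to a list of volumes that still need attention. The sketch below assumes the Longhorn Volume CRD exposes `.status.state` and `.status.robustness`, with `degraded` and `faulted` as the unhealthy robustness values; the function name `unhealthy_volumes` is hypothetical. Detached volumes report a different robustness value and are not flagged here.

```shell
#!/usr/bin/env bash
# Print "name robustness" for volumes reported as degraded or faulted.
# Expects "NAME STATE ROBUSTNESS" lines on stdin, without a header.
unhealthy_volumes() {
  awk '$3 == "degraded" || $3 == "faulted" { print $1, $3 }'
}

# Usage:
#   kubectl get volumes.longhorn.io -n longhorn-system --no-headers \
#     -o custom-columns=NAME:.metadata.name,STATE:.status.state,ROBUSTNESS:.status.robustness \
#     | unhealthy_volumes
```

An empty result is the expected state before taking the fresh backup.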
Key Files Reference
| File | Purpose |
|---|---|
| playbooks/recover/longhorn_data.yml | Main block-device recovery playbook |
| playbooks/recover/longhorn.yml | Recovery when Volume CRDs still exist |
| playbooks/recover/longhorn_data_vars.example.yml | Template for recovery vars |
| playbooks/recover/longhorn_data_vars_erp_vault.yml | Vars for erp + vault (2026-04-13 incident) |
| playbooks/recover/longhorn_data_vars_remaining.yml | Vars for other volumes (2026-04-13 incident) |
| playbooks/backup/backup.yml | Full backup (postgres + gitea + k3s PVCs + Longhorn CRDs) |
| playbooks/backup/k3s_pvc.yml | PV/PVC/Longhorn Volume CRD backup |
| playbooks/tools/hashicorp_vault.yml | Vault unseal + OIDC reconfiguration |
| playbooks/tools/crowdsec.yml | CrowdSec bouncer + Traefik middleware setup |
| docs/adr/20260414-longhorn-pvc-recovery.md | Full incident ADR with all recovery methods |
| ~/.arcodange/cluster-keys.json | Vault unseal keys (local machine only) |
Decision Tree
Cluster down after outage
│
├─ kubectl works? ──No──▶ Check k3s: `systemctl status k3s` on pi1/pi2/pi3
│
└─ Yes
    │
    ├─ PVCs all Bound? ──Yes──▶ Skip to Step 2 (check Vault)
    │
    └─ No
        │
        ├─ Recent .volumes backup on pi1? ──Yes──▶ Path A (kubectl apply backup)
        │
        └─ No
            │
            ├─ Longhorn Volume CRDs exist? ──Yes──▶ playbooks/recover/longhorn.yml
            │
            └─ No ──▶ Path B (longhorn_data.yml block-device injection)
                       Check replica dirs exist first:
                       for n in 1 2 3; do ssh pi$n.home "sudo du -sh /mnt/arcodange/longhorn/replicas/pvc-*"; done
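The same decision tree can be written as a small function, which is convenient when an agent records its findings as yes/no answers. This is only a sketch of the logic above; the function name `choose_recovery_path` and its argument order are hypothetical.

```shell
#!/usr/bin/env bash
# Map four yes/no observations to the recovery action from the decision tree.
#   $1: does kubectl work?            $2: are all PVCs Bound?
#   $3: recent .volumes backup on pi1?  $4: do Longhorn Volume CRDs exist?
choose_recovery_path() {
  local kubectl_ok="$1" pvcs_bound="$2" backup_recent="$3" crds_exist="$4"
  if [ "$kubectl_ok" != yes ]; then
    echo "Check k3s on pi1/pi2/pi3"
  elif [ "$pvcs_bound" = yes ]; then
    echo "Skip to Step 2"
  elif [ "$backup_recent" = yes ]; then
    echo "Path A"
  elif [ "$crds_exist" = yes ]; then
    echo "playbooks/recover/longhorn.yml"
  else
    echo "Path B"
  fi
}

# Usage:
#   choose_recovery_path yes no no no
```

Later arguments are ignored once an earlier branch matches, mirroring the top-down order of the tree.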