factory/ansible/arcodange/factory/docs/runbooks/cluster-recovery-agent.md
Gabriel Radureau 1ae28cb944 docs(longhorn): document 2026-04-13 power-cut recovery + add data-recovery tooling
Captures the post-mortem of the April 13 power-cut: incident timeline,
retrospective, and architecture/role diagrams. Adds an ADR explaining why
Longhorn cannot re-associate orphaned replica directories after a nuclear
reinstall (engine-id naming), plus block-device recovery runbooks and the
`playbooks/recover/longhorn_data.yml` automation that wires `merge-longhorn-layers.py`
to rebuild PVCs from raw `volume-head-*.img` chains.

Also extends the k3s_pvc backup to capture Longhorn `volumes`/`settings` CRDs
(needed for the fast-path restore) and rewrites the restore script with a
fallback dir + English messages.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 12:55:18 +02:00

Cluster Recovery Agent Instructions

You are recovering the Arcodange homelab k3s cluster after an outage (power cut, node failure, or Longhorn reinstall). Your job is to assess damage, run the appropriate Ansible playbooks and kubectl commands, and bring the cluster back to a fully healthy state.

You do NOT need to modify any code. All recovery tooling already exists.


Cluster Overview

Component         Details
Nodes             pi1, pi2, pi3 (Raspberry Pi, SSH via pi<N>.home)
k8s distribution  k3s
Storage           Longhorn (/mnt/arcodange/longhorn/)
GitOps            ArgoCD (apps auto-sync from gitea.arcodange.lab/arcodange-org/)
Secrets           HashiCorp Vault (tools namespace, manual unseal)
Ingress           Traefik + CrowdSec bouncer
Working dir       /Users/gabrielradureau/Work/Arcodange/factory/ansible/arcodange/factory/
Inventory         inventory/hosts.yml

Critical dependency: ERP (Dolibarr) uses Vault-rotated DB credentials written to its PVC. Always recover and unseal Vault before scaling ERP up.


Step 0 — Assess Damage

Run these first to understand what is broken:

# Overall pod health
kubectl get pods -A | grep -v Running | grep -v Completed

# PVC health (anything not Bound is a problem)
kubectl get pvc -A | grep -v Bound

# Longhorn volume states
kubectl get volumes.longhorn.io -n longhorn-system

# Longhorn manager health (prerequisite for all recovery)
kubectl get pods -n longhorn-system -l app=longhorn-manager
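
The pod filter above hides pods that report Running but are not fully Ready (for example a sealed Vault pod at 0/1). A minimal sketch of a stricter filter, assuming the default column layout of `kubectl get pods -A --no-headers`; the function name unhealthy_pods is hypothetical and not part of the repository:

```shell
# Reads `kubectl get pods -A --no-headers` output on stdin.
# Default columns: NAMESPACE NAME READY STATUS RESTARTS AGE
# Prints every pod that is not Completed and is either not Running
# or Running with fewer ready containers than expected (READY "n/m", n != m).
unhealthy_pods() {
  awk '$4 == "Completed" { next }
       { split($3, ready, "/") }
       $4 != "Running" || ready[1] != ready[2]'
}

# Usage (needs cluster access):
#   kubectl get pods -A --no-headers | unhealthy_pods
```

An empty result means every pod is either Completed or Running with all containers ready.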

Step 1 — Longhorn Volume Recovery

Path A — Fast path (backup file exists, Volume CRDs were backed up)

Check if a recent backup exists on pi1:

ssh pi1.home "ls -lt /mnt/backups/k3s_pvc/backup_*.volumes | head -5"

If a backup file exists and is recent (from before the incident):

ssh pi1.home "kubectl apply -f /mnt/backups/k3s_pvc/backup_<YYYYMMDD>.volumes"

Then verify that all PVCs are Bound and skip to Step 2.
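
Choosing which backup to apply can be scripted. A minimal sketch, assuming file names follow the backup_<YYYYMMDD>.volumes pattern used above and that "before the incident" means a date strictly earlier than the incident day; latest_backup_before is a hypothetical helper, not part of the repository:

```shell
# Reads backup file paths on stdin (one per line) and prints the newest one
# whose YYYYMMDD stamp is strictly earlier than the cutoff date given as $1.
# Runs in a subshell so its variables do not leak into the caller.
latest_backup_before() (
  cutoff="$1"
  best=""
  best_date=""
  while IFS= read -r path; do
    name="${path##*/}"            # strip the directory part
    date="${name#backup_}"        # strip the "backup_" prefix
    date="${date%.volumes}"       # strip the ".volumes" suffix
    case "$date" in
      [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]) ;;  # exactly eight digits
      *) continue ;;                                 # ignore anything else
    esac
    if [ "$date" -lt "$cutoff" ] &&
       { [ -z "$best_date" ] || [ "$date" -gt "$best_date" ]; }; then
      best="$path"
      best_date="$date"
    fi
  done
  [ -n "$best" ] && printf '%s\n' "$best"
)

# Usage (needs SSH access to pi1):
#   ssh pi1.home "ls /mnt/backups/k3s_pvc/backup_*.volumes" | latest_backup_before 20260413
```

The function prints nothing and returns non-zero when no backup predates the cutoff, which is the signal to fall through to Path B.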

Path B — Block-device injection (no usable backup, raw replica files intact)

Use this when PVCs are Lost/Terminating and no Volume CRD backup is available.

Check which volumes need recovery:

# Volumes with no PVC or Lost/Terminating PVC
kubectl get pvc -A | grep -v Bound

For each failed volume, create a vars file following the pattern in: playbooks/recover/longhorn_data_vars.example.yml

Existing vars files from the 2026-04-13 incident (reusable as references):

  • playbooks/recover/longhorn_data_vars_remaining.yml — prometheus, alertmanager, redis, backups-rwx
  • playbooks/recover/longhorn_data_vars_erp_vault.yml — erp, hashicorp-vault (audit + data)
  • playbooks/recover/longhorn_data_vars_clickhouse.yml — clickhouse

Key rules for the vars file:

  • source_node/source_dir can be omitted — Phase 0 auto-discovers the largest non-Rebuilding replica
  • Set workload_name: "" for ERP — it must not scale up until Vault is unsealed
  • For StatefulSets with multiple PVCs (e.g. Vault), set workload_name: "" on all but the last entry
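
The auto-discovery rule in the first bullet amounts to "the largest replica that is not rebuilding". A rough sketch of that selection, not the playbook's actual implementation; the input format (node, size in KiB, rebuilding flag, directory per line) and the name pick_replica are assumptions made for illustration:

```shell
# stdin: one candidate per line as "<node> <size_kib> <true|false> <dir>"
#        (hypothetical format; the third field is the replica's Rebuilding flag)
# stdout: "<node> <dir>" of the largest candidate that is not rebuilding
pick_replica() {
  awk '$3 == "false"' |      # drop replicas that are still rebuilding
    sort -k2,2nr |           # largest size first
    head -n 1 |
    awk '{ print $1, $4 }'
}
```

Setting source_node/source_dir explicitly in the vars file is still the way to override the choice when the largest replica is known to be stale.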

Run the recovery playbook:

ansible-playbook -i inventory/hosts.yml playbooks/recover/longhorn_data.yml \
  -e @playbooks/recover/longhorn_data_vars_<NAME>.yml

The playbook is idempotent — safe to re-run if it fails midway.
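
When several vars files are needed, the command above can be wrapped in a loop that stops at the first failure, so the failing file can simply be re-run. A sketch; recover_volumes and the ANSIBLE_PLAYBOOK override are hypothetical conveniences, not part of the repository:

```shell
# The runner is overridable so the loop can be exercised without Ansible,
# e.g. ANSIBLE_PLAYBOOK=true for a dry run.
ANSIBLE_PLAYBOOK="${ANSIBLE_PLAYBOOK:-ansible-playbook}"

# Runs the recovery playbook once per vars file, in the order given,
# and returns non-zero as soon as one run fails.
recover_volumes() {
  for vars_file in "$@"; do
    echo ">>> recovering with $vars_file"
    "$ANSIBLE_PLAYBOOK" -i inventory/hosts.yml playbooks/recover/longhorn_data.yml \
      -e "@$vars_file" || return 1
  done
}

# Usage (from the working dir):
#   recover_volumes playbooks/recover/longhorn_data_vars_erp_vault.yml \
#                   playbooks/recover/longhorn_data_vars_remaining.yml
```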

Playbook phases (for context when troubleshooting):

Phase  What it does
0      Auto-discovers best replica dir (skips Rebuilding: true)
1      Backs up untouched replica dir to /home/pi/arcodange/backups/longhorn-recovery/
2      Merges snapshot+head layers into a single .img via merge-longhorn-layers.py
3      Scales down workloads first, then clears stuck Terminating PVCs, creates Volume CRD
4      Scale down (second pass, idempotent)
5      Attaches volume via maintenance ticket to source node
6      mkfs.ext4 (if unformatted) + rsync from merged image into live block device
7      Removes maintenance ticket (volume detaches)
8      Creates PV (Retain, no claimRef) + PVC pinned to PV
9      Scales up workloads, waits for readyReplicas ≥ 1 (failures here are ignore_errors: yes)

Common Phase 8 failure: a StatefulSet re-creates its PVCs before they can be pinned. The playbook handles this automatically (it scales down before removing finalizers). If you still hit it:

kubectl scale statefulset <name> -n <namespace> --replicas=0
kubectl patch pvc <pvc-name> -n <namespace> --type=merge -p '{"metadata":{"finalizers":null}}'
kubectl delete pvc <pvc-name> -n <namespace>
# Then re-run the playbook

Step 2 — Unseal HashiCorp Vault

After Vault's PVCs are recovered, the pod boots sealed. Check:

kubectl get pod hashicorp-vault-0 -n tools
kubectl exec hashicorp-vault-0 -n tools -- vault status 2>/dev/null | grep Sealed

If sealed, run the unseal playbook (requires interactive terminal for the Gitea password prompt):

ansible-playbook -i inventory/hosts.yml playbooks/tools/hashicorp_vault.yml

Unseal keys are at ~/.arcodange/cluster-keys.json on the local machine. The playbook reads them automatically.

After the playbook completes, verify:

kubectl get pod hashicorp-vault-0 -n tools   # must be 1/1 Ready
kubectl exec hashicorp-vault-0 -n tools -- vault status | grep Sealed  # must be false

Step 3 — Scale Up ERP

Only after Vault is unsealed and Ready:

kubectl scale deployment erp -n erp --replicas=1
kubectl rollout status deployment/erp -n erp
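
Because ERP depends on Vault being unsealed, the scale-up can be gated on the Sealed row of `vault status`. A minimal sketch, assuming the default table output of `vault status` (a row reading "Sealed" followed by true or false); vault_unsealed is a hypothetical helper name:

```shell
# Reads `vault status` table output on stdin.
# Succeeds only if a "Sealed" row is present and its value is "false";
# missing output (pod not reachable) counts as not unsealed.
vault_unsealed() {
  awk '$1 == "Sealed" { state = $2 }
       END { if (state == "false") exit 0; exit 1 }'
}

# Usage (needs cluster access):
#   kubectl exec hashicorp-vault-0 -n tools -- vault status | vault_unsealed \
#     && kubectl scale deployment erp -n erp --replicas=1
```

The gate fails closed: if the `kubectl exec` produces no output, ERP is not scaled.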

Step 4 — Reconfigure Tools (CrowdSec, etc.)

Run if CrowdSec bouncer or Traefik middleware needs reconfiguring:

# Standard run (bouncer key + Traefik middleware + restart)
ansible-playbook -i inventory/hosts.yml playbooks/tools/crowdsec.yml

# Include captcha HTML injection (use when captcha page is broken)
ansible-playbook -i inventory/hosts.yml playbooks/tools/crowdsec.yml --tags never,all

If crowdsec-agent or crowdsec-appsec pods are stuck in Error after a long outage, the playbook restarts them automatically.


Step 5 — Re-enable ArgoCD selfHeal

Check if selfHeal was disabled during recovery (look for selfHeal: false in the tools app):

grep -A5 "tools:" /Users/gabrielradureau/Work/Arcodange/factory/argocd/values.yaml

If disabled, re-enable it by editing argocd/values.yaml and setting selfHeal: true, then sync the ArgoCD app and check its sync and health status:

kubectl get app tools -n argocd

Step 6 — Final Verification

# All pods running
kubectl get pods -A | grep -v Running | grep -v Completed | grep -v "^NAME"

# All PVCs bound
kubectl get pvc -A | grep -v Bound

# All Longhorn volumes healthy
kubectl get volumes.longhorn.io -n longhorn-system

# Run a fresh backup to capture the recovered state
ansible-playbook -i inventory/hosts.yml playbooks/backup/backup.yml \
  -e backup_root_dir=/mnt/backups
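
To avoid capturing a half-recovered state, the fresh backup can be made conditional on every PVC being Bound. A sketch under the assumption that `kubectl get pvc -A --no-headers` prints NAMESPACE, NAME and STATUS as its first three columns; unbound_pvcs is a hypothetical helper:

```shell
# Reads `kubectl get pvc -A --no-headers` output on stdin.
# Prints "namespace/name status" for every PVC that is not Bound
# and returns non-zero if at least one was found.
unbound_pvcs() {
  awk '$3 != "Bound" { print $1 "/" $2, $3; bad = 1 }
       END { exit bad + 0 }'
}

# Usage (needs cluster access, run from the working dir):
#   kubectl get pvc -A --no-headers | unbound_pvcs \
#     && ansible-playbook -i inventory/hosts.yml playbooks/backup/backup.yml \
#          -e backup_root_dir=/mnt/backups
```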

Key Files Reference

File                                                Purpose
playbooks/recover/longhorn_data.yml                 Main block-device recovery playbook
playbooks/recover/longhorn.yml                      Recovery when Volume CRDs still exist
playbooks/recover/longhorn_data_vars.example.yml    Template for recovery vars
playbooks/recover/longhorn_data_vars_erp_vault.yml  Vars for erp + vault (2026-04-13 incident)
playbooks/recover/longhorn_data_vars_remaining.yml  Vars for other volumes (2026-04-13 incident)
playbooks/backup/backup.yml                         Full backup (postgres + gitea + k3s PVCs + Longhorn CRDs)
playbooks/backup/k3s_pvc.yml                        PV/PVC/Longhorn Volume CRD backup
playbooks/tools/hashicorp_vault.yml                 Vault unseal + OIDC reconfiguration
playbooks/tools/crowdsec.yml                        CrowdSec bouncer + Traefik middleware setup
docs/adr/20260414-longhorn-pvc-recovery.md          Full incident ADR with all recovery methods
~/.arcodange/cluster-keys.json                      Vault unseal keys (local machine only)

Decision Tree

Cluster down after outage
│
├─ kubectl works? ──No──▶ Check k3s: `systemctl status k3s` on pi1/pi2/pi3
│
└─ Yes
   │
   ├─ PVCs all Bound? ──Yes──▶ Skip to Step 2 (check Vault)
   │
   └─ No
      │
      ├─ Recent .volumes backup on pi1? ──Yes──▶ Path A (kubectl apply backup)
      │
      └─ No
         │
         ├─ Longhorn Volume CRDs exist? ──Yes──▶ playbooks/recover/longhorn.yml
         │
         └─ No ──▶ Path B (longhorn_data.yml block-device injection)
                   Check replica dirs exist first (on each node):
                   for n in 1 2 3; do ssh pi$n.home "sudo du -sh /mnt/arcodange/longhorn/replicas/pvc-*"; done
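
The tree above can also be read as a chain of yes/no checks evaluated in order. A small sketch that encodes it; recovery_path is a hypothetical helper, and the four answers are supplied by the operator rather than detected automatically:

```shell
# Arguments are "yes" or "no", in the order the tree asks its questions:
#   $1 kubectl works?   $2 all PVCs Bound?
#   $3 recent .volumes backup on pi1?   $4 Longhorn Volume CRDs exist?
# Prints the branch of the decision tree that applies.
recovery_path() (
  kubectl_works="$1"
  pvcs_bound="$2"
  backup_recent="$3"
  crds_exist="$4"
  if [ "$kubectl_works" != yes ]; then
    echo "Check k3s: systemctl status k3s on pi1/pi2/pi3"
  elif [ "$pvcs_bound" = yes ]; then
    echo "Skip to Step 2 (check Vault)"
  elif [ "$backup_recent" = yes ]; then
    echo "Path A (kubectl apply backup)"
  elif [ "$crds_exist" = yes ]; then
    echo "playbooks/recover/longhorn.yml"
  else
    echo "Path B (longhorn_data.yml block-device injection)"
  fi
)

# Example: kubectl works, PVCs are not Bound, no backup, no Volume CRDs
#   recovery_path yes no no no
```

Later answers are ignored once an earlier question settles the branch, which mirrors how the tree is walked top to bottom.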