Gabriel Radureau 1ae28cb944 docs(longhorn): document 2026-04-13 power-cut recovery + add data-recovery tooling

Captures the post-mortem of the April 13 power-cut: incident timeline,
retrospective, and architecture/role diagrams. Adds an ADR explaining why
Longhorn cannot re-associate orphaned replica directories after a nuclear
reinstall (engine-id naming), plus block-device recovery runbooks and the
`playbooks/recover/longhorn_data.yml` automation that wires `merge-longhorn-layers.py`
to rebuild PVCs from raw `volume-head-*.img` chains.

Also extends the k3s_pvc backup to capture Longhorn `volumes`/`settings` CRDs
(needed for the fast-path restore) and rewrites the restore script with a
fallback dir + English messages.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-06 12:55:18 +02:00

Cluster Recovery Agent Instructions

You are recovering the Arcodange homelab k3s cluster after an outage (power cut, node failure, or Longhorn reinstall). Your job is to assess damage, run the appropriate Ansible playbooks and kubectl commands, and bring the cluster back to a fully healthy state.

You do NOT need to modify any code. All recovery tooling already exists.

Cluster Overview

Component	Details
Nodes	pi1, pi2, pi3 (Raspberry Pi, SSH via `pi<N>.home`)
k8s distribution	k3s
Storage	Longhorn (`/mnt/arcodange/longhorn/`)
GitOps	ArgoCD (apps auto-sync from `gitea.arcodange.lab/arcodange-org/`)
Secrets	HashiCorp Vault (`tools` namespace, manual unseal)
Ingress	Traefik + CrowdSec bouncer
Working dir	`/Users/gabrielradureau/Work/Arcodange/factory/ansible/arcodange/factory/`
Inventory	`inventory/hosts.yml`

Critical dependency: ERP (Dolibarr) uses Vault-rotated DB credentials written to its PVC. Always recover and unseal Vault before scaling ERP up.

Step 0 — Assess Damage

Run these first to understand what is broken:

# Overall pod health
kubectl get pods -A | grep -v Running | grep -v Completed

# PVC health (anything not Bound is a problem)
kubectl get pvc -A | grep -v Bound

# Longhorn volume states
kubectl get volumes.longhorn.io -n longhorn-system

# Longhorn manager health (prerequisite for all recovery)
kubectl get pods -n longhorn-system -l app=longhorn-manager
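The pod check above misses pods that are Running but not ready, which is the state a sealed Vault pod sits in. The following is a minimal sketch, not part of the repository: unhealthy_pods is a hypothetical helper that reads the default `kubectl get pods -A` table on stdin and assumes the usual column order (NAMESPACE, NAME, READY, STATUS).

```shell
# Hypothetical helper (not in the repo): list pods that need attention.
# Input: the default table printed by `kubectl get pods -A`, on stdin.
# Output: one line per problem pod, as "<namespace>/<name> <status> <ready>".
unhealthy_pods() {
  awk 'NR > 1 {
    split($3, ready, "/")                      # READY column, e.g. "0/1"
    if ($4 == "Completed") next                # finished jobs are fine
    if ($4 == "Running" && ready[1] == ready[2]) next
    print $1 "/" $2 " " $4 " " $3
  }'
}

# Usage: kubectl get pods -A | unhealthy_pods
```

An empty result means every pod is either Completed or Running with all containers ready.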

Step 1 — Longhorn Volume Recovery

Path A — Fast path (backup file exists, Volume CRDs were backed up)

Check if a recent backup exists on pi1:

ssh pi1.home "ls -lt /mnt/backups/k3s_pvc/backup_*.volumes | head -5"

If a backup file exists and is recent (from before the incident):

ssh pi1.home "kubectl apply -f /mnt/backups/k3s_pvc/backup_<YYYYMMDD>.volumes"

Then verify that the PVCs are Bound and skip to Step 2.
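Picking the right file can be scripted. The sketch below assumes the backup_<YYYYMMDD>.volumes naming shown above; latest_backup_before is a hypothetical helper, not an existing script. It reads candidate paths on stdin and prints the newest one dated strictly before the incident.

```shell
# Hypothetical helper (not in the repo): choose the newest
# backup_<YYYYMMDD>.volumes file dated strictly before the incident.
# $1: incident date as YYYYMMDD. Paths are read from stdin, one per line.
# Exits non-zero when no file qualifies.
latest_backup_before() {
  awk -v cutoff="$1" '
    match($0, /backup_[0-9]+\.volumes$/) {
      # Digits between "backup_" (7 chars) and ".volumes" (8 chars).
      stamp = substr($0, RSTART + 7, RLENGTH - 15)
      if (length(stamp) == 8 && stamp + 0 < cutoff + 0 && stamp + 0 > best) {
        best = stamp + 0
        path = $0
      }
    }
    END { if (path != "") print path; else exit 1 }'
}

# Usage:
#   ssh pi1.home "ls /mnt/backups/k3s_pvc/backup_*.volumes" | latest_backup_before 20260413
```

A non-zero exit means no backup predates the incident, in which case Path A does not apply.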

Path B — Block-device injection (no usable backup, raw replica files intact)

Use this when PVCs are Lost/Terminating and no Volume CRD backup is available.

Check which volumes need recovery:

# Volumes with no PVC or Lost/Terminating PVC
kubectl get pvc -A | grep -v Bound

For each failed volume, create a vars file following the pattern in: playbooks/recover/longhorn_data_vars.example.yml

Existing vars files from the 2026-04-13 incident (reusable as references):

playbooks/recover/longhorn_data_vars_remaining.yml — prometheus, alertmanager, redis, backups-rwx
playbooks/recover/longhorn_data_vars_erp_vault.yml — erp, hashicorp-vault (audit + data)
playbooks/recover/longhorn_data_vars_clickhouse.yml — clickhouse

Key rules for the vars file:

source_node/source_dir can be omitted — Phase 0 auto-discovers the largest non-Rebuilding replica
Set workload_name: "" for ERP — it must not scale up until Vault is unsealed
For StatefulSets with multiple PVCs (e.g. Vault), set workload_name: "" on all but the last entry

Run the recovery playbook:

ansible-playbook -i inventory/hosts.yml playbooks/recover/longhorn_data.yml \
  -e @playbooks/recover/longhorn_data_vars_<NAME>.yml

The playbook is idempotent — safe to re-run if it fails midway.
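Because re-running is safe, a bounded retry loop is one way to ride out transient failures. This is only a sketch: retry is a hypothetical wrapper, and the attempt count is an arbitrary choice. Read the failing phase's output before retrying blindly.

```shell
# Hypothetical wrapper (not in the repo): run a command up to N times,
# stopping at the first success.
# $1: maximum number of attempts; remaining arguments: the command to run.
retry() {
  local attempts="$1" n=1
  shift
  while [ "$n" -le "$attempts" ]; do
    "$@" && return 0
    echo "attempt $n/$attempts failed: $*" >&2
    n=$((n + 1))
  done
  return 1
}

# Usage (vars file name is a placeholder, as above):
#   retry 3 ansible-playbook -i inventory/hosts.yml playbooks/recover/longhorn_data.yml \
#     -e @playbooks/recover/longhorn_data_vars_<NAME>.yml
```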

Playbook phases (for context when troubleshooting):

Phase	What it does
0	Auto-discovers best replica dir (skips `Rebuilding: true`)
1	Backs up untouched replica dir to `/home/pi/arcodange/backups/longhorn-recovery/`
2	Merges snapshot+head layers into a single `.img` via `merge-longhorn-layers.py`
3	Scales down workloads first, then clears stuck Terminating PVCs, creates Volume CRD
4	Scales down workloads again (second pass, idempotent)
5	Attaches volume via maintenance ticket to source node
6	`mkfs.ext4` (if unformatted) + `rsync` from merged image into live block device
7	Removes maintenance ticket (volume detaches)
8	Creates PV (Retain, no claimRef) + PVC pinned to PV
9	Scales up workloads, waits for readyReplicas ≥ 1 (failures here are `ignore_errors: yes`)

Common Phase 8 failure: the StatefulSet re-creates PVCs before they can be pinned. The playbook handles this automatically by scaling down before it removes finalizers. If you still hit it, scale the StatefulSet down and delete the PVC by hand:

kubectl scale statefulset <name> -n <namespace> --replicas=0
kubectl patch pvc <pvc-name> -n <namespace> --type=merge -p '{"metadata":{"finalizers":null}}'
kubectl delete pvc <pvc-name> -n <namespace>
# Then re-run the playbook

Step 2 — Unseal HashiCorp Vault

After Vault's PVCs are recovered, the pod boots sealed. Check:

kubectl get pod hashicorp-vault-0 -n tools
kubectl exec hashicorp-vault-0 -n tools -- vault status 2>/dev/null | grep Sealed

If sealed, run the unseal playbook (requires an interactive terminal for the Gitea password prompt):

ansible-playbook -i inventory/hosts.yml playbooks/tools/hashicorp_vault.yml

Unseal keys are at ~/.arcodange/cluster-keys.json on the local machine. The playbook reads them automatically.

After the playbook completes, verify:

kubectl get pod hashicorp-vault-0 -n tools   # must be 1/1 Ready
kubectl exec hashicorp-vault-0 -n tools -- vault status | grep Sealed  # must be false

Step 3 — Scale Up ERP

Only after Vault is unsealed and Ready:

kubectl scale deployment erp -n erp --replicas=1
kubectl rollout status deployment/erp -n erp
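The Vault-before-ERP ordering can be enforced rather than remembered. The sketch below wraps the commands already shown in this runbook in a guard; scale_erp_when_vault_ready is a hypothetical function name, and the match on a line starting with "Sealed" followed by "false" assumes the default table output of vault status.

```shell
# Hypothetical guard (not in the repo): scale ERP up only when Vault reports
# "Sealed false". Any other outcome (sealed, pod missing, exec failure)
# leaves ERP scaled down and returns non-zero.
scale_erp_when_vault_ready() {
  if kubectl exec hashicorp-vault-0 -n tools -- vault status 2>/dev/null \
      | grep -Eq '^Sealed[[:space:]]+false'; then
    kubectl scale deployment erp -n erp --replicas=1 &&
      kubectl rollout status deployment/erp -n erp
  else
    echo "Vault is sealed or unreachable; leaving ERP scaled down" >&2
    return 1
  fi
}
```

Treating "unknown" the same as "sealed" is deliberate: the guard refuses to start ERP unless it has positive evidence that Vault is unsealed.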

Step 4 — Reconfigure Tools (CrowdSec, etc.)

Run if CrowdSec bouncer or Traefik middleware needs reconfiguring:

# Standard run (bouncer key + Traefik middleware + restart)
ansible-playbook -i inventory/hosts.yml playbooks/tools/crowdsec.yml

# Include captcha HTML injection (use when captcha page is broken)
ansible-playbook -i inventory/hosts.yml playbooks/tools/crowdsec.yml --tags never,all

If crowdsec-agent or crowdsec-appsec pods are stuck in Error after a long outage, the playbook restarts them automatically.

Step 5 — Re-enable ArgoCD selfHeal

Check if selfHeal was disabled during recovery (look for selfHeal: false in the tools app):

grep -A5 "tools:" /Users/gabrielradureau/Work/Arcodange/factory/argocd/values.yaml
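The grep above needs a human to read its output. As a sketch, the same check can return an exit status instead. selfheal_disabled is a hypothetical helper, and it assumes, like the grep above, that the selfHeal key sits within five lines of the tools: key; the real layout of values.yaml may differ.

```shell
# Hypothetical helper (not in the repo): exit 0 when "selfHeal: false"
# appears within 5 lines after a "tools:" key in the YAML read from stdin.
selfheal_disabled() {
  grep -A5 'tools:' | grep -Eq 'selfHeal:[[:space:]]*false'
}

# Usage:
#   selfheal_disabled < /Users/gabrielradureau/Work/Arcodange/factory/argocd/values.yaml \
#     && echo "selfHeal is disabled for tools"
```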

If selfHeal is disabled, edit argocd/values.yaml, set selfHeal: true, and sync the ArgoCD app. The command below only reports the app's sync and health status, so use it to confirm the change took effect:

kubectl get app tools -n argocd

Step 6 — Final Verification

# All pods running
kubectl get pods -A | grep -v Running | grep -v Completed | grep -v "^NAME"

# All PVCs bound
kubectl get pvc -A | grep -v Bound

# All Longhorn volumes healthy
kubectl get volumes.longhorn.io -n longhorn-system

# Run a fresh backup to capture the recovered state
ansible-playbook -i inventory/hosts.yml playbooks/backup/backup.yml \
  -e backup_root_dir=/mnt/backups
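The `grep -v Bound` check always prints the header row and exits 0, so it cannot gate a script. A minimal sketch of a check that can: pvcs_all_bound is a hypothetical helper that reads the default `kubectl get pvc -A` table on stdin, assuming STATUS is the third column.

```shell
# Hypothetical check (not in the repo): print every PVC that is not Bound
# and exit non-zero if there is at least one.
# Input: the default table printed by `kubectl get pvc -A`, on stdin.
pvcs_all_bound() {
  awk 'NR > 1 && $3 != "Bound" { print $1 "/" $2 " is " $3; bad = 1 }
       END { exit bad + 0 }'
}

# Usage: kubectl get pvc -A | pvcs_all_bound && echo "all PVCs Bound"
```

Running it before the fresh backup avoids capturing a state that still has Lost or Pending claims.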

Key Files Reference

File	Purpose
`playbooks/recover/longhorn_data.yml`	Main block-device recovery playbook
`playbooks/recover/longhorn.yml`	Recovery when Volume CRDs still exist
`playbooks/recover/longhorn_data_vars.example.yml`	Template for recovery vars
`playbooks/recover/longhorn_data_vars_erp_vault.yml`	Vars for erp + vault (2026-04-13 incident)
`playbooks/recover/longhorn_data_vars_remaining.yml`	Vars for other volumes (2026-04-13 incident)
`playbooks/backup/backup.yml`	Full backup (postgres + gitea + k3s PVCs + Longhorn CRDs)
`playbooks/backup/k3s_pvc.yml`	PV/PVC/Longhorn Volume CRD backup
`playbooks/tools/hashicorp_vault.yml`	Vault unseal + OIDC reconfiguration
`playbooks/tools/crowdsec.yml`	CrowdSec bouncer + Traefik middleware setup
`docs/adr/20260414-longhorn-pvc-recovery.md`	Full incident ADR with all recovery methods
`~/.arcodange/cluster-keys.json`	Vault unseal keys (local machine only)

Decision Tree

Cluster down after outage
│
├─ kubectl works? ──No──▶ Check k3s: `systemctl status k3s` on pi1/pi2/pi3
│
└─ Yes
   │
   ├─ PVCs all Bound? ──Yes──▶ Skip to Step 2 (check Vault)
   │
   └─ No
      │
      ├─ Recent .volumes backup on pi1? ──Yes──▶ Path A (kubectl apply backup)
      │
      └─ No
         │
         ├─ Longhorn Volume CRDs exist? ──Yes──▶ playbooks/recover/longhorn.yml
         │
         └─ No ──▶ Path B (longhorn_data.yml block-device injection)
                   Check replica dirs exist first:
                   for n in 1 2 3; do ssh "pi$n.home" "sudo du -sh /mnt/arcodange/longhorn/replicas/pvc-*"; done
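
The tree above can also be written as a function, which makes the order of the questions explicit. This is an illustrative sketch only: next_step is a hypothetical name, and the caller supplies each answer as "yes" or "no" after running the corresponding check by hand.

```shell
# Hypothetical encoding of the decision tree (not in the repo).
# Arguments, each "yes" or "no" (missing arguments count as "no"):
#   $1 kubectl works            $2 all PVCs Bound
#   $3 recent .volumes backup   $4 Longhorn Volume CRDs exist
next_step() {
  if [ "${1:-no}" != yes ]; then
    echo "check k3s service on pi1/pi2/pi3"
  elif [ "${2:-no}" = yes ]; then
    echo "Step 2 (check Vault)"
  elif [ "${3:-no}" = yes ]; then
    echo "Path A (kubectl apply backup)"
  elif [ "${4:-no}" = yes ]; then
    echo "playbooks/recover/longhorn.yml"
  else
    echo "Path B (playbooks/recover/longhorn_data.yml)"
  fi
}

# Usage: next_step yes no no no
```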
