docs(longhorn): document 2026-04-13 power-cut recovery + add data-recovery tooling

Captures the post-mortem of the April 13 power-cut: incident timeline,
retrospective, and architecture/role diagrams. Adds an ADR explaining why
Longhorn cannot re-associate orphaned replica directories after a nuclear
reinstall (engine-id naming), plus block-device recovery runbooks and the
`playbooks/recover/longhorn_data.yml` automation that wires `merge-longhorn-layers.py`
to rebuild PVCs from raw `volume-head-*.img` chains.

Also extends the k3s_pvc backup to capture Longhorn `volumes`/`settings` CRDs
(needed for the fast-path restore) and rewrites the restore script with a
fallback dir + English messages.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 12:55:18 +02:00
parent 934b62d922
commit 1ae28cb944
20 changed files with 5939 additions and 10 deletions


@@ -0,0 +1,550 @@
# ADR 20260414: Longhorn PVC Recovery After Reinstall
---
## 📋 **Executive Summary**
After the April 13, 2026 power cut incident and subsequent cluster recovery, we discovered a **critical gap** in Longhorn volume restoration. While the **raw replica data files** (`volume-head-*.img`) remain intact on disk across all nodes, Longhorn cannot automatically **re-associate** them with new Volume CRDs due to its internal engine ID naming scheme. This document explains the problem and compares five recovery approaches.
---
## 🔍 **The Root Problem**
### **What Happened**
1. **Power cut** → Longhorn CSI driver lost connection
2. **Force-deletion of Longhorn pods** → Webhook circular dependency
3. **Nuclear cleanup** → All Longhorn CRDs (Volume, Engine, Replica) were deleted
4. **Reinstallation** → New Volume CRDs created with new engine IDs
### **Directory Structure Issue**
Longhorn stores replica data in directories named by **volume name + engine ID**:
```
/mnt/arcodange/longhorn/replicas/
├── pvc-cc8a3cbb-dbc2-47a2-a0cc-a02136122b90-cd16e459/ # ← OLD (orphaned)
│ ├── volume-head-002.img # ← Actual Traefik data (128Mi)
│ ├── volume-head-002.img.meta
│ └── volume-snap-*.img
├── pvc-cc8a3cbb-dbc2-47a2-a0cc-a02136122b90-8c7d8ab4/ # ← NEW (empty)
│ ├── volume-head-002.img # ← Empty 128Mi
│ └── volume-head-002.img.meta
└── ...
```
**The Problem:** When you recreate a Volume CRD, Longhorn generates a **new engine ID** (e.g., `8c7d8ab4`), creating a **new empty directory** instead of adopting the existing one (`cd16e459`).
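A quick way to spot orphans is to diff what is physically on disk against what Longhorn's Replica CRs reference. A minimal sketch (the spec field holding the directory name is an assumption; verify with `kubectl explain replica.spec` on your Longhorn version):
```bash
# Replica directories physically present on one node
ssh pi2 "ls -1 /mnt/arcodange/longhorn/replicas/" | sort > /tmp/on-disk.txt
# Directories referenced by any Replica CR, cluster-wide
# (spec.dataDirectoryName is an assumption; check `kubectl explain replica.spec`)
kubectl get replicas.longhorn.io -n longhorn-system \
  -o jsonpath='{range .items[*]}{.spec.dataDirectoryName}{"\n"}{end}' | sort > /tmp/known.txt
# Directories on pi2 that no Replica CR references = orphan candidates
comm -23 /tmp/on-disk.txt /tmp/known.txt
```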
### **Why This Matters**
| Component | Persistence | Recovery Path |
|-----------|-------------|---------------|
| **Replica `.img` files** | ✅ **Survives** on disk | Manual intervention required |
| **Volume CRD** | ❌ **Deleted** | Must recreate |
| **Engine/Replica CRDs** | ❌ **Deleted** | Auto-recreated by Longhorn |
| **Engine ID** | ❌ **Changes** | **Cannot be recovered without backup** |
**Without the original Volume CRD backup, Longhorn cannot match orphaned replica directories to new Volume CRDs.**
---
## 🎯 **Recovery Methods Comparison**
| Method | Complexity | Data Safety | Downtime | Best For |
|--------|------------|-------------|----------|----------|
| **[A: Manual `dd` Copy](#method-a-manual-dd-copy)** | ⭐⭐⭐⭐ | ✅✅✅✅ | Medium | Critical data, no app backup |
| **[B: Directory Rename](#method-b-directory-rename)** | ⭐⭐⭐ | ✅✅ | Low | Small volumes, no `Rebuilding` replicas |
| **[C: Fresh Volume + App Restore](#method-c-fresh-volume--app-restore)** | ⭐⭐ | ✅✅✅✅✅ | Low | Non-critical data, app backups exist |
| **[D: Block-Device Injection (Automated)](#method-d-block-device-injection-automated)** | ⭐⭐⭐ | ✅✅✅✅ | Medium | **Recommended — any volume, no dir swap needed** |
| **[E: Longhorn Google Storage Restore](#method-e-longhorn-google-storage-restore)** | ⭐⭐ | ✅✅✅✅✅ | Low | Volumes with Longhorn backup configured |
**Method B was proven risky** during the 2026-04-13 recovery: Longhorn reconciliation finds `Dirty: true`
metadata plus a clean, empty pi1 replica, then silently rebuilds from the empty source, destroying data.
Use Method D for any volume larger than ~128Mi or with `Rebuilding` replicas.
---
## 🛠️ **Method A: Manual `dd` Copy**
### **Concept**
Manually copy the data from the orphaned `.img` file to the new replica directory that Longhorn created for the new Volume CRD.
### **Prerequisites**
- Root access to all nodes
- Volume CRD already recreated (with new engine ID)
- Longhorn has created new empty replica directories
- `dd` and `qemu-img` tools available
### **Steps**
```bash
# 1. Identify source (old data) and destination (new empty)
SOURCE_NODE=pi2
SOURCE_DIR=/mnt/arcodange/longhorn/replicas/pvc-cc8a3cbb-dbc2-47a2-a0cc-a02136122b90-cd16e459
SOURCE_IMG=$(ssh $SOURCE_NODE "ls $SOURCE_DIR/volume-head-*.img | head -1")
DEST_DIRS=(
  pi1:/mnt/arcodange/longhorn/replicas/pvc-cc8a3cbb-dbc2-47a2-a0cc-a02136122b90-8c7d8ab4
  pi2:/mnt/arcodange/longhorn/replicas/pvc-cc8a3cbb-dbc2-47a2-a0cc-a02136122b90-8c7d8ab4
  pi3:/mnt/arcodange/longhorn/replicas/pvc-cc8a3cbb-dbc2-47a2-a0cc-a02136122b90-8c7d8ab4
)
# 2. Copy data to each node (stream from the source node; don't clobber $PATH)
for DEST in "${DEST_DIRS[@]}"; do
  NODE=${DEST%%:*}
  DEST_PATH=${DEST#*:}
  ssh $NODE "sudo mkdir -p $DEST_PATH"
  ssh $SOURCE_NODE "sudo dd if=$SOURCE_IMG bs=4M" |
    ssh $NODE "sudo dd of=$DEST_PATH/volume-head-002.img bs=4M"
done
# 3. Restart Longhorn engine pods to pick up new data
kubectl delete pod -n longhorn-system -l longhorn.io/component=engine
# 4. Verify data is accessible
kubectl get volume -n longhorn-system pvc-cc8a3cbb-dbc2-47a2-a0cc-a02136122b90
# Should show: state=attached, robustness=healthy
```
### **Pros**
- ✅ Guaranteed data recovery
- ✅ Works for any volume size
- ✅ Preserves all snapshots and metadata
### **Cons**
- ⚠️ Requires manual intervention on each node
- ⚠️ Must know source and destination paths
- ⚠️ Risk of data corruption if `dd` fails mid-copy
- ⚠️ Volume must be in detached state during copy
### **Risk Mitigation**
- Verify checksums after copy: `sha256sum /path/to/image.img`
- Copy to one node at a time, verify between each
- Use `pv` for progress when streaming between nodes: `ssh $SOURCE_NODE "sudo dd if=$SOURCE_IMG bs=4M" | pv | ssh $NODE "sudo dd of=$DEST_PATH/volume-head-002.img bs=4M"`
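For example, a cross-node checksum pass using the variables from the steps above:
```bash
# Compare the source image checksum against every copied destination
SRC_SUM=$(ssh $SOURCE_NODE "sudo sha256sum $SOURCE_IMG" | awk '{print $1}')
for DEST in "${DEST_DIRS[@]}"; do
  NODE=${DEST%%:*}
  DEST_PATH=${DEST#*:}
  DST_SUM=$(ssh $NODE "sudo sha256sum $DEST_PATH/volume-head-002.img" | awk '{print $1}')
  [ "$SRC_SUM" = "$DST_SUM" ] && echo "$NODE: OK" || echo "$NODE: MISMATCH"
done
```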
---
## 🏷️ **Method B: Directory Rename**
### **Concept**
Rename the orphaned replica directory to match the **engine ID** that Longhorn expects for the new Volume CRD.
### **Prerequisites**
- Volume CRD already recreated
- Longhorn has created engine CRDs (check: `kubectl get engines -n longhorn-system`)
- Must act quickly before Longhorn initializes new empty replicas
### **Steps**
```bash
# 1. Find the new engine ID for the volume
VOLUME=pvc-cc8a3cbb-dbc2-47a2-a0cc-a02136122b90
ENGINE=$(kubectl get engines -n longhorn-system -l longhorn.io/volume=$VOLUME -o jsonpath='{.items[0].metadata.name}')
# Example: pvc-cc8a3cbb-dbc2-47a2-a0cc-a02136122b90-e-0
ENGINE_SUFFIX=${ENGINE#"$VOLUME"-}  # Extract suffix: e-0
# But the directory uses a different format...
# 2. Check actual directory names
kubectl get replicas -n longhorn-system | grep pvc-cc8a
# Output: pvc-cc8a3cbb-dbc2-47a2-a0cc-a02136122b90-r-8c7d8ab4
# 3. Rename on the node where the orphaned data exists
REPLICA_NODE=$(kubectl get replicas -n longhorn-system pvc-cc8a3cbb-dbc2-47a2-a0cc-a02136122b90-r-8c7d8ab4 -o jsonpath='{.spec.nodeID}')
ssh $REPLICA_NODE "sudo mv /mnt/arcodange/longhorn/replicas/pvc-cc8a3cbb-dbc2-47a2-a0cc-a02136122b90-cd16e459 \
  /mnt/arcodange/longhorn/replicas/pvc-cc8a3cbb-dbc2-47a2-a0cc-a02136122b90-8c7d8ab4"
# 4. Restart the instance-manager pod on that node so Longhorn re-scans the replica directory
kubectl delete pod -n longhorn-system \
  $(kubectl get pods -n longhorn-system -l longhorn.io/component=instance-manager \
    --field-selector spec.nodeName=$REPLICA_NODE -o jsonpath='{.items[0].metadata.name}')
```
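Before renaming anything, confirm no replica of the volume is mid-rebuild, since a rebuild racing the rename is exactly the data-destroying failure mode flagged in the methods comparison above:
```bash
# Show each replica's node and state; abort the rename if any replica
# reports a rebuilding or error state
kubectl get replicas.longhorn.io -n longhorn-system \
  -l longhorn.io/volume=pvc-cc8a3cbb-dbc2-47a2-a0cc-a02136122b90 \
  -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeID,STATE:.status.currentState
```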
### **Pros**
- ✅ Fastest method
- ✅ No data copying required
- ✅ Preserves all existing data and snapshots
### **Cons**
- ⚠️ **High risk of mismatch** - wrong directory rename = data loss
- ⚠️ Must identify correct engine ID for each node
- ⚠️ Replica directories exist on multiple nodes - must rename on ALL
- ⚠️ Longhorn may have already initialized new empty replicas
### **Critical Warning**
**Each volume has replicas on ALL nodes.** You must:
1. Identify which node has which orphaned directory
2. Rename each to match the corresponding new engine's expected path
3. Ensure consistency across all nodes
**Example for pvc-cc8a:**
```bash
# Orphaned dirs:
# pi2: pvc-cc8a...-cd16e459
# pi3: pvc-cc8a...-011b54b3
# New engine paths (from kubectl get replicas):
# pi1: pvc-cc8a...-r-8c7d8ab4
# pi2: pvc-cc8a...-r-32aa3e1e
# pi3: pvc-cc8a...-r-3e84c460
# Must rename EACH orphaned dir to match new engine on SAME node
```
---
## 🆕 **Method C: Fresh Volume + App Restore** *(Recommended for Traefik)*
### **Concept**
1. Let Longhorn create a **new empty volume** for the PVC
2. Restore the **application data** (Traefik's `acme.json`) from application-level backups
### **Prerequisites**
- Application-level backup exists (e.g., Traefik config, certificates)
- Data is non-critical or easily restorable
- Storage requirements are small (128Mi for Traefik)
### **Steps**
```bash
# 1. Delete the problematic Volume CRD (if any)
kubectl delete volume -n longhorn-system pvc-cc8a3cbb-dbc2-47a2-a0cc-a02136122b90 --ignore-not-found
# 2. Delete the PVC
kubectl delete pvc -n kube-system traefik
# 3. Let StorageClass provision a fresh volume
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: traefik
  namespace: kube-system
spec:
  accessModes: [ReadWriteOnce]
  resources: {requests: {storage: 128Mi}}
  storageClassName: longhorn
  volumeMode: Filesystem
EOF
# 4. Wait for PV to be provisioned
kubectl wait --for=jsonpath='{.status.phase}'=Bound pvc -n kube-system traefik
# 5. Restore Traefik data from backup
BACKUP_FILE="/path/to/traefik-backup/acme.json"
kubectl cp $BACKUP_FILE kube-system/traefik-XXXXXX-XXXX:/data/acme.json
kubectl exec -n kube-system traefik-XXXXXX-XXXX -- chown 65532:65532 /data/acme.json
kubectl exec -n kube-system traefik-XXXXXX-XXXX -- chmod 600 /data/acme.json
```
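The `traefik-XXXXXX-XXXX` placeholders can be resolved dynamically. A sketch, assuming the standard labels of the k3s-packaged Traefik chart:
```bash
# Look up the running Traefik pod instead of hard-coding the generated name
TRAEFIK_POD=$(kubectl get pods -n kube-system -l app.kubernetes.io/name=traefik \
  -o jsonpath='{.items[0].metadata.name}')
kubectl cp "$BACKUP_FILE" kube-system/$TRAEFIK_POD:/data/acme.json
kubectl exec -n kube-system $TRAEFIK_POD -- chown 65532:65532 /data/acme.json
kubectl exec -n kube-system $TRAEFIK_POD -- chmod 600 /data/acme.json
```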
### **Traefik-Specific Recovery**
For Traefik, the critical data is:
- `/data/acme.json` - TLS certificates obtained from Let's Encrypt
- `/data/tls.yml` - (if used)
- Secrets in Kubernetes (separate from PVC)
**Backup locations to check:**
```bash
# Check if we have Traefik data backups
ssh pi1 "ls -la /home/pi/arcodange/backups/traefik/ 2>/dev/null || echo 'No backup found'"
# Check ArgoCD apps (if Traefik was deployed via GitOps)
kubectl get app -n argocd | grep traefik
```
### **Pros**
- ✅ **Simplest and safest** method
- ✅ No risk of Longhorn directory mismatches
- ✅ Works even without Longhorn CRD backups
- ✅ Verifiable - you can confirm data was restored
- ✅ Clean state - no orphaned directories
### **Cons**
- ⚠️ Requires application-level backups
- ⚠️ TLS certificates may have expired (need to re-issue)
---
## 🏆 **Recommendation: Method C for Traefik**
### **Why Method C is Best for This Case**
| Factor | Assessment |
|--------|------------|
| **Volume Size** | 128Mi (small) |
| **Data Criticality** | TLS certs can be re-generated |
| **Backup Availability** | Likely exists in ArgoCD/Git |
| **Complexity** | Low |
| **Risk** | Minimal |
| **Time Required** | ~5 minutes |
### **Data Loss Assessment for Traefik**
The **worst case** (no Traefik backup):
- TLS certificates will be **re-issued** automatically by cert-manager + Let's Encrypt
- No permanent data loss - certificates are ephemeral
- Client impact: Brief TLS warning during re-issuance (~1-2 minutes)
**Verdict:** 🟢 **Method C is the safest and most practical approach.**
---
## 🔧 **Prevention: What We Must Fix**
### **1. Update Backup Playbook** (`playbooks/backup/k3s_pvc.yml`) ✅ Done 2026-04-16
`backup_cmd` now captures:
1. All PersistentVolumes (PV)
2. All PersistentVolumeClaims (PVC)
3. **All Longhorn Volumes** (critical — enables fast restore via `kubectl apply` instead of block-device injection)
4. All Longhorn Settings (backup target configuration)
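A minimal sketch of what that `backup_cmd` boils down to (file names follow the existing backup layout; the real playbook may differ):
```bash
BACKUP_DIR=/home/pi/arcodange/backups/k3s_pvc
TS=$(date +%Y%m%d)
# Each `kubectl get -o yaml` emits its own YAML document; separate them with ---
kubectl get -A pv,pvc -o yaml > $BACKUP_DIR/backup_$TS.volumes
echo '---' >> $BACKUP_DIR/backup_$TS.volumes
kubectl get -n longhorn-system volumes.longhorn.io -o yaml >> $BACKUP_DIR/backup_$TS.volumes
echo '---' >> $BACKUP_DIR/backup_$TS.volumes
kubectl get -n longhorn-system settings.longhorn.io -o yaml >> $BACKUP_DIR/backup_$TS.volumes
```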
### **2. Test Backups Regularly**
```bash
# Monthly test: Restore a non-critical volume
# Pick a test volume, delete it, restore from backup
kubectl delete volume -n longhorn-system <test-volume>
kubectl apply -f <backup-file>
kubectl get volume -n longhorn-system <test-volume> -w
```
### **3. Validate Backup Files**
```bash
# Check backup contains Longhorn resources
grep "longhorn.io/v1beta2" /path/to/backup-*.volumes
grep "kind: Volume" /path/to/backup-*.volumes
```
### **4. Document Recovery Procedure**
- [ ] Create `docs/admin/longhorn-recovery.md` with these steps
- [ ] Add to team runbook
- [ ] Include in incident response training
---
## 📊 **Test Scenario: Battle Testing PVC Recovery**
### **Test Setup**
```bash
# 1. Create a test namespace
kubectl create ns longhorn-test
# 2. Create a test PVC
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-longhorn-recovery
  namespace: longhorn-test
  labels:
    purpose: test
spec:
  accessModes: [ReadWriteOnce]
  resources: {requests: {storage: 1Gi}}
  storageClassName: longhorn
EOF
# 3. Deploy a test pod to write data
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: test-writer
  namespace: longhorn-test
spec:
  containers:
    - name: writer
      image: alpine
      command: [sh, -c, "echo 'test data for recovery' > /data/testfile.txt && echo 'more data' >> /data/testfile.txt && tail -f /dev/null"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: test-longhorn-recovery
EOF
# 4. Write and verify data
kubectl exec -n longhorn-test test-writer -- cat /data/testfile.txt
# Should show: "test data for recovery\nmore data"
# 5. Backup everything
kubectl get -A pv,pvc -o yaml > /tmp/test-backup-pv-pvc.yaml
echo '---' >> /tmp/test-backup-pv-pvc.yaml
kubectl get -A volumes.longhorn.io -o yaml >> /tmp/test-backup-pv-pvc.yaml
echo '---' >> /tmp/test-backup-pv-pvc.yaml
kubectl get -A settings.longhorn.io -o yaml >> /tmp/test-backup-pv-pvc.yaml
```
### **Test Execution: Simulate Disaster**
```bash
# 6. Simulate disaster - capture the volume name first, then delete everything
VOLUME_NAME=$(kubectl get pvc -n longhorn-test test-longhorn-recovery -o jsonpath='{.spec.volumeName}')
kubectl delete pod -n longhorn-test test-writer
kubectl delete pvc -n longhorn-test test-longhorn-recovery
kubectl delete volume -n longhorn-system $VOLUME_NAME
# 7. Restore from backup
kubectl apply -f /tmp/test-backup-pv-pvc.yaml
# 8. Verify recovery
kubectl get pvc -n longhorn-test test-longhorn-recovery
kubectl get volumes -n longhorn-system | grep $VOLUME_NAME
# 9. Deploy test reader pod
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: test-reader
  namespace: longhorn-test
spec:
  containers:
    - name: reader
      image: alpine
      command: [sh, -c, "cat /data/testfile.txt && tail -f /dev/null"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: test-longhorn-recovery
EOF
# 10. Check if data is recovered
kubectl logs -n longhorn-test test-reader
# Should show: "test data for recovery\nmore data"
```
### **Expected Results**
| Test Step | Pass Criteria |
|-----------|---------------|
| Volume CRD restored | `kubectl get volumes` shows the test volume |
| PVC bound | `kubectl get pvc` shows status=Bound |
| Data accessible | Test reader pod shows original data |
### **Test Cleanup**
```bash
kubectl delete ns longhorn-test
```
---
## 🛠️ **Method D: Block-Device Injection (Automated)**
### **Concept**
Bypass Longhorn's replica reconciliation entirely. Create a fresh Volume CRD, attach it in
maintenance mode, then inject the recovered filesystem directly into the live block device via
`rsync`. The old replica dirs are never renamed or touched — the data is copied into the new
Longhorn-managed volume.
### **Implementation**
See `playbooks/recover/longhorn_data.yml` — a 10-phase (Phase 0–9) Ansible playbook that automates the full
sequence for one or more volumes in a single run.
### **Key Steps**
```
Phase 0: Auto-discover best replica dir (skip Rebuilding:true, rank by actual disk usage)
Phase 1: Backup untouched replica dir
Phase 2: Merge sparse snapshot+head layers → single flat image (merge-longhorn-layers.py)
Phase 3: Create Longhorn Volume CRD, wait for replicas
Phase 4: Scale down workload
Phase 5: Attach via VolumeAttachment maintenance ticket
Phase 6: mkfs.ext4 + mount + rsync from merged image
Phase 7: Remove maintenance ticket
Phase 8: Recreate PV (Retain, no claimRef) + PVC (volumeName pinned)
Phase 9: Scale up, wait readyReplicas ≥ 1
```
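As an illustration, the core of Phase 6 reduces to the following (device path and mount points are illustrative; Longhorn exposes attached volumes under `/dev/longhorn/<volume-name>`, and the merged image is assumed to contain a mountable filesystem):
```bash
# Phase 6 sketch: format the freshly attached Longhorn device, then inject
# the merged image's filesystem contents into it via rsync
DEV=/dev/longhorn/pvc-abc123                        # appears after maintenance attach
MERGED=/tmp/longhorn-recovery/pvc-abc123-merged.img # output of merge-longhorn-layers.py
sudo mkfs.ext4 -F "$DEV"
sudo mkdir -p /mnt/recovery-source /mnt/recovery-target
sudo mount -o loop,ro "$MERGED" /mnt/recovery-source
sudo mount "$DEV" /mnt/recovery-target
sudo rsync -aHAX --numeric-ids /mnt/recovery-source/ /mnt/recovery-target/
sudo umount /mnt/recovery-source /mnt/recovery-target
```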
### **Usage**
```bash
ansible-playbook -i inventory/hosts.yml playbooks/recover/longhorn_data.yml \
-e @playbooks/recover/longhorn_data_vars.yml
```
Vars file format:
```yaml
longhorn_recovery_volumes:
  - pv_name: pvc-abc123
    pvc_name: myapp-data
    namespace: myapp
    size_bytes: "134217728"
    size_human: 128Mi
    access_mode: ReadWriteOnce
    workload_kind: Deployment
    workload_name: myapp
    # source_node and source_dir are auto-discovered if omitted
    verify_cmd: ""
### **Pros**
- ✅ Fully automated — handles all phases including PV/PVC recreation
- ✅ Auto-discovers best replica (skips Rebuilding dirs)
- ✅ Idempotent — safe to re-run (skips backup/merge if already done)
- ✅ Works for RWO and RWX volumes
### **Cons**
- ⚠️ Requires ~2× volume size in temporary disk space for merged image
- ⚠️ The new volume has 3 fresh replicas (not the original topology) — Longhorn will resync
---
## 🗄️ **Method E: Longhorn Google Storage Restore**
### **Concept**
Some volumes are configured with Longhorn's built-in backup feature targeting a Google Storage
bucket. For those volumes, a Longhorn backup can be restored into a new volume without needing
the raw replica files.
### **Applicable Volumes**
- `backups-rwx` (`pvc-efda1d2f`) — the cluster backup volume itself has a Longhorn GCS backup configured
### **When to use**
Use when:
- The local replica dirs are missing or corrupted (Method D cannot be used)
- A clean point-in-time restore is preferred over a raw replica merge
### **Status**
A playbook for this method (`playbooks/recover/longhorn_gcs_restore.yml`) is **planned but not
yet implemented**. In the 2026-04-13 incident, `backups-rwx` was successfully recovered via
Method D (local replica merge), so Method E was not needed.
When the playbook is implemented, it will use `kubectl apply` of a `BackupVolume` + `Backup`
restore CR pointing to the GCS bucket configured in Longhorn settings.
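A rough sketch of what that restore could look like, in the heredoc style used elsewhere in this document (the `fromBackup` URL format and the exact fields are assumptions to verify against the installed Longhorn version):
```bash
kubectl apply -f - <<EOF
apiVersion: longhorn.io/v1beta2
kind: Volume
metadata:
  name: backups-rwx-restored
  namespace: longhorn-system
spec:
  # Assumption: backup URL as shown in the Longhorn UI / backup target settings
  fromBackup: "s3://<bucket>@<region>/?backup=<backup-name>&volume=<volume-name>"
  numberOfReplicas: 3
EOF
```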
---
## 📚 **References**
- [Longhorn Documentation: Disaster Recovery](https://longhorn.io/docs/1.6.0/deploy/uninstall/disaster-recovery/)
- [Longhorn Volume CRD Spec](https://github.com/longhorn/longhorn/blob/master/types/types.go)
- [Original Issue: Longhorn GitHub #4837](https://github.com/longhorn/longhorn/issues/4837) (Replica orphan handling)
- [Related ADR: Internal DNS Architecture](./20260414-internal-dns-architecture.md)
- [Related Incident: 2026-04-13 Power Cut](../incidents/2026-04-13-power-cut/README.md)
---
*Document created: 2026-04-14*
*Last updated: 2026-04-15*
*Status: Method D (block-device injection) implemented and battle-tested on 5 volumes (2026-04-14/15)*


@@ -0,0 +1,420 @@
---
title: Power Cut - Longhorn Storage System Failure
incident_id: 2026-04-13-001
date: 2026-04-13
time_start: 15:23:57 UTC
time_end: "2026-04-15 (ongoing — Vault/ERP manual recovery deferred)"
status: Mostly Resolved
severity: SEV-1
tags:
- kubernetes
- longhorn
- storage
- k3s
- power-cut
- csi-driver
- block-device-recovery
---
# Power Cut - Longhorn Storage System Failure
## Summary
A power cut caused a cascading failure of the Longhorn distributed storage system in the k3s cluster. The Longhorn CSI driver (`driver.longhorn.io`) lost its registration with kubelet, preventing all Persistent Volume Claims (PVCs) from mounting. This affected ~43 pods across 12 namespaces, including critical infrastructure like Traefik ingress controller, application pods, and monitoring tools.
The actual volume data stored in Longhorn replicas at `/mnt/arcodange/longhorn/replicas/` on each node **remains intact**. Recovery efforts are focused on restoring CSI driver registration and Longhorn manager functionality.
## Impact
### Affected Services
- **Critical**: Longhorn storage system (all CSI components)
- **Critical**: Traefik ingress controller (cannot mount PVC)
- **High**: Application pods using Longhorn PVCs (cms, webapp, erp, clickhouse, etc.)
- **High**: Tool pods (grafana, prometheus, hashicorp-vault, redis, crowdsec)
- **Medium**: Docker storage corruption on nodes (overlay2)
- **Low**: NFS backup mount unavailable
### User Impact
- External access to services via Traefik: **DOWN**
- Gitea registry image pulls: **FAILING**
- Persistent data access: **DEGRADED** (data exists but inaccessible)
- Monitoring dashboards: **DOWN**
### Metrics
- **Failed Pods**: 43 pods in error state (CrashLoopBackOff, Error, ImagePullBackOff)
- **Healthy Pods**: ~37 pods running
- **Longhorn Pods**: 25 total, ~12 currently healthy
- **Nodes**: 3/3 Ready (pi1 control-plane, pi2, pi3)
## Component Roles
### Longhorn Components
| Component | Role | Current Status | Importance |
|-----------|------|----------------|------------|
| **longhorn-manager** | Orchestrates Longhorn volumes, handles volume operations | 2/3 running, 1 partial | CRITICAL |
| **longhorn-driver-deployer** | Deploys the CSI driver to nodes | Init:0/1 (BLOCKED) | CRITICAL |
| **longhorn-csi-plugin** | CSI plugin daemonset - handles node-level CSI operations | 0/3 Error | CRITICAL |
| **csi-attacher** | Handles volume attachment to nodes | 2/3 running, 1 Error | CRITICAL |
| **csi-provisioner** | Creates volumes from PVC requests | 2/3 running, 1 Error | CRITICAL |
| **csi-resizer** | Handles volume resizing | 1/3 running, 2 Error | HIGH |
| **csi-snapshotter** | Handles volume snapshots | 2/3 running, 1 Error | MEDIUM |
| **engine-image** | Pulls and manages engine binaries | 3/3 Running | HIGH |
| **longhorn-ui** | Web UI for Longhorn management | 0/2 CrashLoopBackOff | MEDIUM |
| **rwx-nfs** | NFS server for backup volume | 0/1 ContainerCreating | MEDIUM |
| **share-manager** | Manages NFS shares for volumes | 0/2 Error | MEDIUM |
### Other Affected Components
| Component | Role | Dependencies | Status |
|-----------|------|--------------|--------|
| **Traefik** | Ingress controller, routes external traffic | Requires PVC for certs | Error (cannot mount PVC) |
| **coredns** | Cluster DNS | Docker storage | Crashing (overlay2 corruption) |
| **svclb-traefik** | Service load balancer for Traefik | Docker storage | Crashing (overlay2 corruption) |
| **Application Pods** | Various services (cms, webapp, erp, etc.) | Longhorn PVCs | Error/ImagePullBackOff |
## Timeline
| Time (UTC) | Event | Owner | Notes |
|------------|-------|-------|-------|
| ~15:23 | Power cut occurred | - | Cluster lost power |
| 15:23:57 | Incident detection started | Mistral Vibe | Initial assessment began |
| 15:24:05 | Baseline documented | Mistral Vibe | 43 pods in error, Longhorn down |
| 15:24:10 | Root cause identified | Mistral Vibe | CSI driver `driver.longhorn.io` not registered |
| 15:24:30 | Recovery plan formulated | Mistral Vibe | HelmChart manifest touch, then pod deletion |
| 15:24:50 | Step 1: Touch longhorn-install.yaml | Mistral Vibe | Manifest timestamp updated on pi1 |
| 15:25:50 | Step 1 outcome: Insufficient | Mistral Vibe | Only 1 pod affected, CSI still down |
| 15:32:15 | Step 2: Delete all longhorn-system pods | Mistral Vibe | Force deleted 24 pods — created webhook circular dependency |
| 15:32:30 | Step 2 outcome: Partial recovery | Mistral Vibe | Managers recovering, CSI still failing |
| 16:15:00 | Root cause 2 identified | Mistral Vibe | Webhook circular dependency — decided nuclear cleanup |
| 16:30:00 | Backups secured | Mistral Vibe | PV/PVC and Longhorn CRDs backed up to pi1 |
| 16:35:00 | Backup script bug fixed | Claude Code | `backup_cmd` fixed to produce valid YAML |
| 17:00:00 | Nuclear cleanup executed | Claude Code | Removed all Longhorn CRDs, PVC finalizers, restarted k3s |
| 17:08:00 | Longhorn namespace deleted | Claude Code | Clean slate confirmed |
| 17:09:00 | Longhorn reinstall started | Claude Code | `playbooks/recover/longhorn.yml` run on pi1 |
| 17:30:00 | Docker config corruption found | Claude Code | daemon.json had Python string not JSON |
| 17:35:00 | Docker config fixed | Claude Code | Valid JSON deployed to all nodes |
| 17:50:00 | DNS failure identified | Claude Code | CoreDNS cannot resolve external domains |
| ~19:00 | DNS fixed | Claude Code | Pi-hole dnsmasq group + CoreDNS upstream config |
| ~19:30 | Longhorn reinstall completed | Claude Code | All Longhorn pods Running, CSI registered |
| 2026-04-14 00:00 | PVC recovery work started | Claude Code | Block-device recovery approach developed |
| 2026-04-14 | Traefik recovered | Claude Code | Simple PV recreation (no data loss for certs) |
| 2026-04-14 | url-shortener recovered | Claude Code | Method B (dir rename) + PV/PVC recreate |
| 2026-04-14 | Block-device recovery developed | Claude Code | `merge-longhorn-layers.py` + 9-phase playbook |
| 2026-04-14 | Clickhouse recovered | Claude Code | `longhorn_data.yml` playbook — first automated run |
| 2026-04-15 | Automated recovery for 4 volumes | Claude Code | prometheus, alertmanager, redis, backups-rwx |
| 2026-04-15 | Vault/ERP recovery deferred | - | Too sensitive for automated approach, manual later |
## Root Cause Analysis
### Primary Root Cause
**Power cut caused Longhorn CSI driver registration to be lost.**
The Longhorn CSI driver (`driver.longhorn.io`) is registered with the kubelet on each node. When the power cut occurred:
1. K3s/kubelet processes crashed
2. Longhorn manager pods crashed
3. CSI driver registration was lost
4. On restart, Longhorn pods attempted to restart but:
- The `longhorn-driver-deployer` pod has an init container (`wait-longhorn-manager`) that waits for managers to be ready
- Longhorn managers were slow to recover (some still in CrashLoopBackOff)
- CSI pods (attacher, provisioner, resizer, snapshotter) cannot start without the CSI socket at `/var/lib/kubelet/plugins/driver.longhorn.io/csi.sock`
- Custom Resource Definitions (Volumes, Replicas, etc.) exist but CSI driver cannot communicate with them
### Secondary Issues
1. **Docker overlay2 corruption**: Docker storage at `/mnt/arcodange/docker/overlay2/` was corrupted on at least pi1, affecting coredns and svclb-traefik pods
2. **NFS backup mount unavailable**: The Longhorn share-manager pod (which exports NFS) is in Error state, making `/mnt/backups/` inaccessible
3. **Backup scripts bug**: The `backup.volumes` file at `/opt/k3s_volumes/backup.volumes` is empty due to a script formatting bug
### Failure Propagation
```mermaid
%%{init: { 'theme': 'forest' }}%%
graph TD
A[Power Cut] --> B[Kubelet Crashes]
A --> C[Docker Daemon Crashes]
B --> D[Longhorn Manager Pods Crash]
B --> E[CSI Driver Registration Lost]
C --> F[Overlay2 Filesystem Corrupt]
D --> G[Driver-Deployer Init Container Waits]
E --> H[CSI Socket Disappears]
G --> I[CSI Driver Not Deployed]
H --> J[CSI Pods Cannot Start]
I --> J
J --> K[PVC Mounts Fail]
K --> L[Application Pods Crash]
F --> M[Docker Containers Fail to Start]
M --> N[CoreDNS Crashes]
M --> O[Service Load Balancers Crash]
N --> P[DNS Resolution Fails]
O --> P
P --> L
```
### Why Data Is Safe
The Longhorn volume data is stored in replicas across all three nodes at `/mnt/arcodange/longhorn/replicas/`. Checking the Longhorn volumes shows:
```
All 12 volumes: state="attached", robustness="healthy"
```
This confirms that:
1. Volume metadata is intact in etcd
2. Replica data is intact on disk
3. Once CSI driver is restored, volumes will be accessible again
4. **No permanent data loss has occurred**
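That health claim can be re-checked at any time straight from the CRDs:
```bash
# Confirm state and robustness for every Longhorn volume
kubectl get volumes.longhorn.io -n longhorn-system \
  -o custom-columns=NAME:.metadata.name,STATE:.status.state,ROBUSTNESS:.status.robustness
```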
## Recovery Actions Taken
### Attempt 1: HelmChart Manifest Touch (15:24:50 - 15:25:50)
**Action:** Touched `/var/lib/rancher/k3s/server/manifests/longhorn-install.yaml` on pi1
**Command:**
```bash
ssh pi@pi1 "sudo touch /var/lib/rancher/k3s/server/manifests/longhorn-install.yaml"
```
**Outcome:** Only triggered reconcile for 1 pod (longhorn-manager-w85v6). CSI driver still not registered.
**Decision:** Insufficient. Need more aggressive approach.
### Attempt 2: Force Delete All Longhorn Pods (15:32:15 - Present)
**Action:** Force deleted all 24 pods in longhorn-system namespace
**Command:**
```bash
kubectl delete pods -n longhorn-system --all --force --grace-period=0
```
**Outcome:**
- HelmChart controller detected changes and recreated all pods
- **Success**: 23/25 pods now in Running state (15:34:30)
- **Blocking**: `longhorn-driver-deployer` stuck in Init:0/1
- **Blocking**: All `longhorn-csi-plugin` pods in Error
- **Investigation**: driver-deployer's `wait-longhorn-manager` init container waiting for manager readiness
### Current Investigation (15:34:30)
**Focus:** Why driver-deployer is stuck in Init state
The `longhorn-driver-deployer` pod has an init container that waits for Longhorn manager to be ready before deploying the CSI driver. Despite 3 manager pods running, the wait condition is not being met.
**Hypotheses:**
1. Manager pods are not fully healthy (readiness probes failing)
2. Network connectivity between driver-deployer and managers
3. RBAC or service account permissions issue
4. Configuration mismatch in HelmChart values
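A few checks to narrow these down (label selectors assumed from the standard Longhorn chart):
```bash
# What is the wait-longhorn-manager init container actually seeing?
POD=$(kubectl get pods -n longhorn-system -l app=longhorn-driver-deployer \
  -o jsonpath='{.items[0].metadata.name}')
kubectl logs -n longhorn-system "$POD" -c wait-longhorn-manager --tail=20
# Are the manager pods actually passing their readiness probes?
kubectl get pods -n longhorn-system -l app=longhorn-manager -o wide
```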
## Current Status (2026-04-15)
### Longhorn System
- **All Longhorn pods**: Running ✅ (reinstalled 2026-04-13)
- **CSI driver**: Registered ✅
### Volume Recovery Status
| PVC | Namespace | Size | Status |
|-----|-----------|------|--------|
| `traefik` (kube-system) | kube-system | 128Mi | ✅ Recovered (2026-04-14) |
| `url-shortener-data` | url-shortener | 128Mi | ✅ Recovered (2026-04-14) |
| `clickhouse-storage-clickhouse-0` | tools | 16Gi | ✅ Recovered (2026-04-14) |
| `prometheus-server` | tools | 8Gi | ⏳ In progress (2026-04-15) |
| `storage-prometheus-alertmanager-0` | tools | 2Gi | ⏳ In progress (2026-04-15) |
| `redis-storage-redis-0` | tools | 1Gi | ⏳ In progress (2026-04-15) |
| `backups-rwx` | longhorn-system | 50Gi | ⏳ In progress (2026-04-15) |
| `data-hashicorp-vault-0` | tools | 10Gi | 🔴 Deferred — manual recovery |
| `audit-hashicorp-vault-0` | tools | 10Gi | 🔴 Deferred — manual recovery |
| `erp` | erp | 50Gi | 🔴 Deferred — manual recovery |
## Next Steps
### Immediate
1. Confirm prometheus, alertmanager, redis, backups-rwx fully recovered via `longhorn_data.yml`
2. Verify monitoring stack (Grafana dashboards, alert routing) is functional
### Short-term
3. Manual recovery of Vault (`data-hashicorp-vault-0`, `audit-hashicorp-vault-0`) — see Vault runbook
4. Manual recovery of ERP (`erp`) — coordinate with application owner
5. Update backup playbook to include Longhorn Volume CRDs (see ADR 20260414-longhorn-pvc-recovery)
6. Prepare Longhorn Google Storage restore playbook for `backups-rwx` alternative recovery path
### Long-term
- Implement UPS for the Raspberry Pi cluster
- Add Longhorn volume health monitoring to Grafana
- Regular backup restore drills
## Architecture Context
```mermaid
%%{init: { 'theme': 'forest' }}%%
flowchart TB
subgraph K3s Control Plane
A[pi1: Control Plane] -->|runs| B[kubelet]
B --> C[k3s server]
C --> D[HelmChart Controller]
end
subgraph Storage Layer
E[Longhorn HelmChart] --> F[Longhorn Manager Pods]
F --> G[Driver Deployer]
G --> H[CSI Driver Registration]
H --> I[CSI Socket: /var/lib/kubelet/plugins/driver.longhorn.io/csi.sock]
F --> J[Longhorn Volumes]
J --> K[Replicas on all 3 nodes]
end
subgraph CSI Components
H --> L[csi-attacher Pods]
H --> M[csi-provisioner Pods]
H --> N[csi-resizer Pods]
H --> O[csi-snapshotter Pods]
H --> P[csi-plugin DaemonSet]
end
subgraph Data Path
I --> Q[/mnt/arcodange/longhorn/]
Q --> R[replicas/]
end
subgraph Docker Storage
S[Docker Daemon] --> T[/mnt/arcodange/docker/]
T --> U[overlay2/]
end
L -->|mounts volumes| V[Application Pods]
M -->|creates volumes| J
P -->|node-level ops| I
classDef critical fill:#c00,color:#fff,stroke:#000
classDef healthy fill:#0a0,color:#000,stroke:#000
classDef degraded fill:#ff0,color:#000,stroke:#000
class H,L,M,N,O,P critical
class F,G,E degraded
class I,J,Q,R,U healthy
```
## Component Details
### Longhorn Manager
- **Role**: Primary controller for Longhorn, manages volumes, replicas, snapshots
- **Image**: `longhornio/longhorn-manager:v1.9.1`
- **Ports**: 9500 (manager), 9501 (webhook health), 9502 (metrics)
- **Data Path**: `/mnt/arcodange/longhorn` (configured in HelmChart values)
- **Health Check**: `https://<pod-ip>:9501/v1/healthz`
### Longhorn Driver Deployer
- **Role**: Deploys the CSI driver to each node
- **Image**: `longhornio/longhorn-manager:v1.9.1`
- **Init Container**: `wait-longhorn-manager` - waits for manager to be ready
- **Blocker**: Currently stuck in init, preventing CSI driver deployment
### CSI Driver
- **Role**: Implements the CSI (Container Storage Interface) specification for Longhorn
- **Socket**: `/var/lib/kubelet/plugins/driver.longhorn.io/csi.sock`
- **Registration**: Must be registered with kubelet via CSINode
- **Images**:
- `longhornio/csi-attacher:v4.9.0-20250709`
- `longhornio/csi-provisioner:v5.3.0-20250709`
- `longhornio/csi-resizer:v1.14.0-20250709`
- `longhornio/csi-snapshotter:v8.3.0-20250709`
- `longhornio/csi-node-driver-registrar:v2.14.0-20250709`
### CSI Node Driver Registrar
- **Role**: Registers the CSI driver with kubelet
- **Image**: `longhornio/csi-node-driver-registrar:v2.14.0-20250709`
- **Mechanism**: Creates a `CSINode` resource and registers via kubelet plugin registry
## Action Items
### Immediate (resolved)
- [x] Investigate and resolve driver-deployer init container blocker
- [x] Restore CSI driver registration
- [x] Fix Docker overlay2 corruption / daemon.json on all nodes
- [x] Fix DNS (CoreDNS + Pi-hole dnsmasq config)
- [x] Longhorn reinstalled and healthy
- [x] Traefik ingress controller functional
- [x] Fix backup script (empty backup.volumes bug)
### Short-term (resolved)
- [x] url-shortener data recovered
- [x] Clickhouse data recovered
- [x] Develop automated block-device recovery playbook (`playbooks/recover/longhorn_data.yml`)
- [x] Backup restore procedure documented and tested
### Medium-term (in progress)
- [ ] prometheus, alertmanager, redis, backups-rwx recovered (playbook running 2026-04-15)
- [ ] Vault manual recovery
- [ ] ERP manual recovery
- [ ] Update backup playbook to include Longhorn Volume CRDs
- [ ] Prepare Longhorn Google Storage restore playbook
### Long-term
- [ ] Implement UPS for Raspberry Pi cluster
- [ ] Add Longhorn volume health monitoring to Grafana
- [ ] Add CSI socket health check to monitoring
- [ ] Regular backup restore drills (monthly)
## Lessons Learned
### What Went Well
- Quick identification of root cause (CSI driver registration)
- Longhorn volume data remained intact (good replica design)
- Ability to force-pod-delete triggered partial recovery
- K3s HelmChart approach allows easy manifest-based recovery
### What Could Be Improved
- Need better CSI driver health monitoring and alerting
- Longhorn driver-deployer init container timeout may be too short
- Docker overlay2 on external storage needs better corruption recovery
- Backup script has bugs that prevent reliable backups
- No UPS protection for power cuts
### Technical Debt Identified
- Backup script formatting bug (extra newlines create invalid YAML)
- No automated Longhorn health checks
- Manual intervention required for CSI driver recovery
## Related Files
- **Ansible Playbook**: `playbooks/system/k3s_config.yml` (Longhorn HelmChart creation)
- **HelmChart Manifest**: `/var/lib/rancher/k3s/server/manifests/longhorn-install.yaml` on pi1
- **Backup Scripts**: `/opt/k3s_volumes/backup.sh` and `/opt/k3s_volumes/restore.sh` on pi1
- **Inventory**: `inventory/hosts.yml` (required for all playbooks)
## Commands Reference
### Check Longhorn Status
```bash
kubectl get pods -n longhorn-system
kubectl get volumes -n longhorn-system
kubectl get replicas -n longhorn-system
kubectl get settings -n longhorn-system
```
### Force Longhorn Recovery (k3s-specific)
```bash
# Method 1: Touch manifest (soft reconcile)
sudo touch /var/lib/rancher/k3s/server/manifests/longhorn-install.yaml
# Method 2: Delete all pods (force recreate)
kubectl delete pods -n longhorn-system --all --force --grace-period=0
# Method 3: Delete a specific pod (substitute the real generated name)
kubectl delete pod -n longhorn-system longhorn-driver-deployer-<pod-hash>
```
### Check CSI Driver Registration
```bash
kubectl get csidriver
kubectl get csinodes
kubectl describe csidriver driver.longhorn.io
```
### Check Longhorn StorageClass
```bash
kubectl describe cm -n longhorn-system longhorn-storageclass
```


@@ -0,0 +1,209 @@
%%{init: { 'theme': 'forest', 'themeVariables': {
'primaryColor': '#1e293b',
'primaryTextColor': '#f8fafc',
'lineColor': '#334155',
'secondaryColor': '#475569',
'tertiaryColor': '#94a3b8',
'edgeLabelBackground':'#fff',
'edgeLabelColor': '#1e293b'
}}}%%
flowchart TD
subgraph Cluster["K3s Cluster (v1.34.3+k3s1)"]
direction TB
subgraph Nodes["Physical Nodes"]
pi1["pi1: 192.168.1.201\nControl Plane"]
pi2["pi2: 192.168.1.202\nWorker"]
pi3["pi3: 192.168.1.203\nWorker"]
end
subgraph K3sComponents["K3s Control Plane Components"]
kubelet1["kubelet"]
kubelet2["kubelet"]
kubelet3["kubelet"]
k3s_server["k3s server"]
helm_controller["HelmChart Controller"]
end
pi1 --> kubelet1
pi2 --> kubelet2
pi3 --> kubelet3
pi1 --> k3s_server
k3s_server --> helm_controller
end
subgraph LonghornStorage["Longhorn Storage System"]
direction TB
subgraph HelmChart["HelmChart Installation"]
manifest[("longhorn-install.yaml")]
end
subgraph Manager["Longhorn Manager layer"]
lh_manager1["longhorn-manager-r6sd2\n2/2 Running\npi2"]
lh_manager2["longhorn-manager-sjc56\n1/2 Running\npi3"]
lh_manager3["longhorn-manager-t9b45\n1/2 Running\npi1"]
webhook["Webhook Leader: pi2"]
end
subgraph DriverDeployer["CSI Driver Deployer"]
deployer["longhorn-driver-deployer\n0/1 Init:0/1\npi3"]
wait_container["wait-longhorn-manager\nwaiting..."]
end
subgraph CSIDriver["CSI Driver Components"]
csi_socket[("/var/lib/kubelet/plugins/driver.longhorn.io/csi.sock")]
csi_registrar["CSI Node Driver Registrar"]
end
subgraph CSIContainers["CSI Containers (Sidecars)"]
attacher1["csi-attacher-54ld9\n1/1 Running\npi2"]
attacher2["csi-attacher-dqq9v\n1/1 Running\npi3"]
attacher3["csi-attacher-k5jmx\n0/1 Error\npi1"]
provisioner1["csi-provisioner-9z79d\n0/1 Error\npi2"]
provisioner2["csi-provisioner-zjwdr\n1/1 Running\npi1"]
provisioner3["csi-provisioner-zk5kp\n1/1 Running\npi3"]
resizer1["csi-resizer-8mrld\n1/1 Running\npi3"]
resizer2["csi-resizer-ddhl2\n0/1 Error\npi1"]
resizer3["csi-resizer-qv5n9\n0/1 Error\npi2"]
snapshotter1["csi-snapshotter-9rzf4\n1/1 Running\npi3"]
snapshotter2["csi-snapshotter-bqdtd\n0/1 Error\npi2"]
snapshotter3["csi-snapshotter-jv6pj\n1/1 Running\npi1"]
end
subgraph CSIPlugin["CSI Plugin DaemonSet"]
plugin1["longhorn-csi-plugin-f44jp\n0/3 Error\npi3"]
plugin2["longhorn-csi-plugin-q2sgh\n1/3 Error\npi1"]
plugin3["longhorn-csi-plugin-vzld8\n2/3 Error\npi2"]
end
subgraph DataLayer["Longhorn Data Layer"]
engine1["engine-image-ei-8ktd9\n1/1 Running\npi1"]
engine2["engine-image-ei-dcjq8\n1/1 Running\npi3"]
engine3["engine-image-ei-m76jf\n1/1 Running\npi2"]
volumes[("12 Longhorn Volumes")]
replicas[("/mnt/arcodange/longhorn/replicas/")]
end
subgraph UIAndTools["UI & Backup"]
ui1["longhorn-ui-8gb4s\n0/1 CrashLoop\npi1"]
ui2["longhorn-ui-hmxz6\n0/1 CrashLoop\npi3"]
share_mgr1["share-manager-...70b4\n0/1 Error\npi1"]
share_mgr2["share-manager-...7ffa\n0/1 Error\npi3"]
nfs["rwx-nfs-4cn9h\n0/1 ContainerCreating\npi3"]
end
manifest --> lh_manager1 & lh_manager2 & lh_manager3
helm_controller --> manifest
lh_manager1 & lh_manager2 & lh_manager3 --> webhook
deployer --> wait_container
wait_container -.->|waits for| lh_manager1 & lh_manager2 & lh_manager3
deployer --> csi_registrar
csi_registrar --> csi_socket
csi_socket --> kubelet1
csi_socket --> kubelet2
csi_socket --> kubelet3
attacher1 & attacher2 & attacher3 --> csi_socket
provisioner1 & provisioner2 & provisioner3 --> csi_socket
resizer1 & resizer2 & resizer3 --> csi_socket
snapshotter1 & snapshotter2 & snapshotter3 --> csi_socket
plugin1 & plugin2 & plugin3 --> csi_socket
lh_manager1 & lh_manager2 & lh_manager3 --> volumes
volumes --> replicas
replicas --> pi1_disk[("pi1: /mnt/arcodange/longhorn")]
replicas --> pi2_disk[("pi2: /mnt/arcodange/longhorn")]
replicas --> pi3_disk[("pi3: /mnt/arcodange/longhorn")]
share_mgr1 & share_mgr2 --> nfs
nfs --> backup_pvc[("PVC: backups-rwx\n50Gi")]
end
subgraph DockerStorage["Docker Storage layer"]
docker1["Docker daemon\npi1"]
docker2["Docker daemon\npi2"]
docker3["Docker daemon\npi3"]
storage1[("/mnt/arcodange/docker/overlay2/")]
docker1 --> storage1
docker2 --> storage1
docker3 --> storage1
end
subgraph ApplicationLayer["Application Pods (Affected)"]
traefik["traefik-5c67cb6889-8b5nk\n0/1 Error\nkube-system"]
cms["cms-arcodange-cms-...\n0/1 ImagePullBackOff\ncms"]
webapp["webapp-6588455979-...\n0/1 ImagePullBackOff\nwebapp"]
erp["erp-648748b4f5-bntd9\n0/1 Error\nerp"]
grafana["grafana-5d496f9668-...\n0/3 Error\ntools"]
vault["hashicorp-vault-0\n0/1 Error\ntools"]
end
subgraph NetworkServices["Network Services"]
coredns["coredns-67476ddb48-jrcg2\n1/1 Running\nkube-system"]
svclb["svclb-traefik-*\n3/3 Running\nkube-system"]
end
%% Connections showing failure paths
csi_socket --x traefik
csi_socket --x cms
csi_socket --x webapp
csi_socket --x erp
csi_socket --x grafana
csi_socket --x vault
docker1 --x coredns
docker1 --x svclb
%% Healthy connections
volumes -->|provides storage| traefik
volumes -->|provides storage| cms
volumes -->|provides storage| webapp
volumes -->|provides storage| erp
volumes -->|provides storage| grafana
volumes -->|provides storage| vault
classDef node fill:#0ea5e9,color:#000,stroke:#06b6d4
classDef k3s fill:#84cc16,color:#000,stroke:#65a30d
classDef longhorn fill:#a855f7,color:#fff,stroke:#8b5cf6
classDef csi fill:#f59e0b,color:#000,stroke:#d97706
classDef data fill:#10b981,color:#000,stroke:#059669
classDef app fill:#ec4899,color:#fff,stroke:#db2777
classDef network fill:#6366f1,color:#fff,stroke:#4f46e5
classDef error fill:#ef4444,color:#fff,stroke:#dc2626
classDef waiting fill:#fbbf24,color:#000,stroke:#f59e0b
class pi1,pi2,pi3 node
class kubelet1,kubelet2,kubelet3,k3s_server,helm_controller k3s
class manifest,webhook longhorn
class lh_manager1,lh_manager2,lh_manager3,engine1,engine2,engine3,volumes,replicas,share_mgr1,share_mgr2 data
class deployer,wait_container,csi_registrar,csi_socket longhorn
class attacher1,attacher2,attacher3,provisioner1,provisioner2,provisioner3,resizer1,resizer2,resizer3,snapshotter1,snapshotter2,snapshotter3 csi
class plugin1,plugin2,plugin3 csi
class traefik,cms,webapp,erp,grafana,vault app
class coredns,svclb network
class docker1,docker2,docker3 data
class deployer,wait_container error
class attacher3,provisioner1,resizer2,resizer3,snapshotter2 error
class plugin1,plugin2,plugin3 error
class ui1,ui2,share_mgr1,share_mgr2 error
class traefik,cms,webapp,erp,grafana,vault error
class nfs waiting
class lh_manager2,lh_manager3 waiting
classDef clusterBox stroke:#334155,stroke-width:2px,color:#94a3b8
class Cluster clusterBox
class LonghornStorage clusterBox
class DockerStorage clusterBox
class ApplicationLayer clusterBox
class NetworkServices clusterBox


@@ -0,0 +1,200 @@
%%{init: { 'theme': 'forest', 'themeVariables': {
'primaryColor': '#7c3aed',
'primaryTextColor': '#ffffff',
'lineColor': '#6d28d9',
'secondaryColor': '#8b5cf6',
'tertiaryColor': '#a78bfa',
'edgeLabelBackground':'#5b21b6',
'edgeLabelColor': '#ffffff'
}}}%%
flowchart TD
root((Longhorn Storage System))
%% ===== CONTROL PLANE COMPONENTS =====
ControlPlane[Control Plane]
Manager[longhorn-manager]
Role1["Role: Primary controller for Longhorn"]
Responsibilities1["• Manages volumes, replicas, snapshots\n• Handles volume lifecycle\n• Coordinates with etcd\n• Exposes API (port 9500)"]
Health1["Health Check: :9501/v1/healthz"]
Webhook1["Webhook: :9502/metrics"]
DriverDeployer[longhorn-driver-deployer]
Role2["Role: CSI driver deployment controller"]
Responsibilities2["• Deploys CSI driver to each node\n• Runs via init container (wait-longhorn-manager)\n• Creates csi.sock on each node"]
WaitCmd["Command: longhorn-manager wait -d <namespace>"]
Blocking["⚠️ BLOCKED: Init container waiting for managers"]
%% ===== CSI COMPONENTS =====
CSILayer[CSI Interface]
CSISocket[("/var/lib/kubelet/plugins/driver.longhorn.io/csi.sock")]
SocketRole["Role: Unix domain socket for CSI communication"]
Attacher[csi-attacher]
AttacherRole["Role: Attaches volumes to nodes"]
AttacherResp["• Monitors VolumeAttachment objects\n• Calls CSI ControllerPublishVolume\n• Handles detach operations"]
AttacherStatus["Status: 2/3 Running, 1 Error"]
Provisioner[csi-provisioner]
ProvisionerRole["Role: Creates volumes from PVCs"]
ProvisionerResp["• Watches PVC objects\n• Calls CSI CreateVolume\n• Handles volume deletion"]
ProvisionerStatus["Status: 2/3 Running, 1 Error"]
Resizer[csi-resizer]
ResizerRole["Role: Handles volume resizing"]
ResizerResp["• Watches PVC size changes\n• Calls CSI ExpandVolume"]
ResizerStatus["Status: 1/3 Running, 2 Error"]
Snapshotter[csi-snapshotter]
SnapshotterRole["Role: Manages volume snapshots"]
SnapshotterResp["• Watches VolumeSnapshot objects\n• Calls CSI CreateSnapshot\n• Handles snapshot deletion"]
SnapshotterStatus["Status: 2/3 Running, 1 Error"]
NodeRegistrar[csi-node-driver-registrar]
RegistrarRole["Role: Registers driver with kubelet"]
RegistrarResp["• Creates CSINode resource\n• Registers via kubelet plugin registry API"]
Plugin[csi-plugin]
PluginRole["Role: Node-level CSI operations"]
PluginResp["• Runs on each node (DaemonSet)\n• Handles NodePublish/UnpublishVolume\n• Manages mount/unmount operations"]
PluginStatus["⚠️ BLOCKED: All 3 pods in Error (no CSI socket)"]
%% ===== DATA LAYER COMPONENTS =====
DataLayer[Data Layer]
Engine[engine-image]
EngineRole["Role: Engine and instance manager"]
EngineResp["• Pulls and manages engine binaries\n• Runs as sidecar in DaemonSet\n• Maintains engine processes"]
EngineStatus["Status: ✅ 3/3 Running"]
Volumes[Longhorn Volumes]
VolumeRole["Role: Logical volume representation"]
VolumeResp["• Managed via Longhorn CRDs\n• Replicated across nodes\n• Supports RWO, RWX access modes"]
VolumeStatus["Status: ✅ All 12 volumes attached & healthy"]
Replicas[Volume Replicas]
ReplicaRole["Role: Physical data storage"]
ReplicaResp["• 3-way replication across nodes\n• Stored at /mnt/arcodange/longhorn/replicas/\n• Data intact after power cut"]
ReplicaPath["Path: pi1, pi2, pi3: /mnt/arcodange/longhorn/replicas/"]
Backups[Backup System]
NFS[RWX NFS Share]
NFSRole["Role: NFS export for backup volume"]
NFSCreate["Created via: playbooks/setup/backup_nfs.yml"]
NFSStatus["⚠️ OFFLINE: share-manager pods in Error"]
BackupPVC[Backup PVC]
BackupPVCRole["Role: Persistent storage for backups"]
BackupPVCDetails["Name: backups-rwx\nNamespace: longhorn-system\nSize: 50Gi\nClass: longhorn"]
ShareManager[share-manager]
ShareRole["Role: Manages NFS exports for Longhorn volumes"]
ShareStatus["⚠️ BLOCKED: 2 pods in Error"]
%% ===== UI & TOOLS =====
UI[Web UI]
UIRole["Role: Longhorn management dashboard"]
UIAccess["Access: Port 9500 on manager pods"]
UIStatus["⚠️ BLOCKED: 2 pods in CrashLoopBackOff"]
%% ===== INFRASTRUCTURE =====
Infrastructure[Underlying Infrastructure]
Nodes[Raspberry Pi Nodes]
pi1["pi1: 192.168.1.201\nRole: Control Plane"]
pi2["pi2: 192.168.1.202\nRole: Worker"]
pi3["pi3: 192.168.1.203\nRole: Worker"]
K3s["Kubernetes (k3s v1.34.3+k3s1)"]
Kubelet["kubelet (3 instances)"]
APIServer["API Server (on pi1)"]
etcd["etcd (on pi1)"]
HelmCtrl["HelmChart Controller"]
Docker[Docker Engine]
DockerRole["Role: Container runtime"]
DockerStorage["Storage: /mnt/arcodange/docker/"]
Overlay2["⚠️ ISSUE: overlay2 filesystem corrupted"]
%% ===== EXTERNAL DEPENDENCIES =====
Dependencies[External Dependencies]
CSIRegistration[CSI Driver Registration]
CSIRole["Role: k8s CSI registration"]
CSIDriver["Driver: driver.longhorn.io"]
CSIDriverStatus["⚠️ LOST: Not registered with kubelet"]
%% ===== CONNECTIONS =====
root --> ControlPlane
root --> CSILayer
root --> DataLayer
root --> UI
root --> Infrastructure
root --> Dependencies
ControlPlane --> Manager
ControlPlane --> DriverDeployer
CSILayer --> CSISocket
CSILayer --> Attacher
CSILayer --> Provisioner
CSILayer --> Resizer
CSILayer --> Snapshotter
CSILayer --> NodeRegistrar
CSILayer --> Plugin
CSISocket --> Attacher
CSISocket --> Provisioner
CSISocket --> Resizer
CSISocket --> Snapshotter
CSISocket --> Plugin
CSISocket --> NodeRegistrar
DriverDeployer --> NodeRegistrar
NodeRegistrar --> CSISocket
DataLayer --> Engine
DataLayer --> Volumes
DataLayer --> Replicas
DataLayer --> Backups
Backups --> NFS
Backups --> BackupPVC
Backups --> ShareManager
Infrastructure --> Nodes
Infrastructure --> K3s
Infrastructure --> Docker
Dependencies --> CSIRegistration
CSIRegistration --> CSISocket
%% ===== YET TO BE RESTORED =====
Dependencies --x EmptyCSI["⚠️ CSI Socket Missing"]
EmptyCSI --x Attacher
EmptyCSI --x Provisioner
EmptyCSI --x Resizer
EmptyCSI --x Snapshotter
EmptyCSI --x Plugin
%% ===== STYLES =====
classDef component fill:#8b5cf6,color:#fff,stroke:#7c3aed,stroke-width:2px
classDef role fill:#a78bfa,color:#000,stroke:#8b5cf6
classDef responsibility fill:#c4b5fd,color:#000,stroke:#8b5cf6
classDef status_good fill:#10b981,color:#fff,stroke:#059669
classDef status_bad fill:#ef4444,color:#fff,stroke:#dc2626
classDef status_warn fill:#f59e0b,color:#000,stroke:#d97706
classDef infinite fill:#3b82f6,color:#fff,stroke:#2563eb
class root infinite
class ControlPlane,CSILayer,DataLayer,UI,Infrastructure,Dependencies component
class Manager,Attacher,Provisioner,Resizer,Snapshotter,NodeRegistrar,Plugin,Engine,Volumes,Replicas,NFS,BackupPVC,ShareManager,UIRole,Nodes,K3s,Docker,CSIRegistration component
class Role1,Role2,AttacherRole,ProvisionerRole,ResizerRole,SnapshotterRole,RegistrarRole,PluginRole,EngineRole,VolumeRole,ReplicaRole,NFSRole,ShareRole,UIRole,Kubelet,APIServer,etcd,HelmCtrl,DockerRole,CSIRole,CSIDriver component
class Responsibilities1,Responsibilities2,AttacherResp,ProvisionerResp,ResizerResp,SnapshotterResp,RegistrarResp,PluginResp,EngineResp,VolumeResp,ReplicaResp,NFSRole,BackupPVCDetails,ShareRole,UIAccess,ShareStatus,NFSStatus role
class EngineStatus,VolumeStatus,ReplicaPath status_good
class Blocking,PluginStatus,UIStatus,ShareStatus,NFSCreate,ShareStatus,CSIDriverStatus status_bad
class AttacherStatus,ProvisionerStatus,ResizerStatus,SnapshotterStatus status_warn
classDef mindmapTitle fill:#4c1d95,color:#fff,stroke:#5b21b6,font-size:20px,font-weight:bold
class root mindmapTitle


@@ -0,0 +1,131 @@
%%{init: { 'theme': 'forest', 'themeVariables': {
'primaryColor': '#059669',
'primaryTextColor': '#fff',
'lineColor': '#065f46',
'secondaryColor': '#10b981',
'edgeLabelBackground':'#064e3b',
'edgeLabelColor': '#ffffff'
}}}%%
flowchart TD
%% ===== POWER CUT EVENT =====
Start([Power Cut Event]) -->|Electricity Lost| Crash[Kubernetes Components Crash]
%% ===== IMMEDIATE IMPACT =====
Crash --> KubeletCrash[Kubelet Processes Crash<br>on all 3 nodes]
Crash --> DockerCrash[Docker Daemons Crash<br>on all 3 nodes]
Crash --> K3sCrash[K3s Server Process Crash<br>on pi1]
%% ===== DOCKER STORAGE CORRUPTION =====
DockerCrash --> Overlay2[ /mnt/arcodange/docker/overlay2/<br>Filesystem Corrupted]
Overlay2 --> DockerFail[Docker containers cannot start<br>missing layer files]
DockerFail --> CoreDNSPod[CoreDNS Pod<br>CrashLoopBackOff]
DockerFail --> TraefikLB[svclb-traefik Pods<br>CrashLoopBackOff]
%% ===== LONGHORN IMPACT =====
KubeletCrash --> CSIUnreg[CSI Driver Registration Lost<br>driver.longhorn.io unregistered]
K3sCrash --> HelmCtrl[HelmChart Controller<br>Unresponsive]
CSIUnreg --> CSISocket[ /var/lib/kubelet/plugins/.../csi.sock<br>Disappears]
%% ===== LONGHORN MANAGER LOSS =====
KubeletCrash --> LHManagers[Longhorn Manager Pods<br>Crash 3 pods ]
LHManagers --> NoQuorum[No Manager Quorum<br>Cannot coordinate]
NoQuorum --> VolumesFrozen[Existing Volumes<br>Still healthy but inaccessible]
CSISocket --> CSIChicago[CSI Pods Cannot Start<br>csi-attacher, provisioner, resizer, snapshotter]
CSISocket --> CSIPlugin[CSI Plugin DaemonSet<br>Cannot register driver]
%% ===== VOLUME MOUNT FAILURES =====
CSIChicago --> NoMounts[PVC Mounts Fail<br>All Longhorn PVs inaccessible]
CSIPlugin --> NoMounts
%% ===== APPLICATION CASCADING FAILURES =====
NoMounts --> TraefikDown[Traefik Pod<br>PVC mount failed<br>Error state]
NoMounts --> AppPods1[Application Pods<br>PVC mount failed<br>Error state<br>cms, webapp, erp, clickhouse, etc.]
%% ===== BACKUP SYSTEM IMPACT =====
NoQuorum --> NFSDown[NFS Share-Manager Pods<br>Error state]
NFSDown --> BackupMount[ /mnt/backups/ NFS Mount<br>Unavailable]
%% ===== DISCOVERY & RECOVERY =====
Discovery[15:23:57<br>Incident Discovered] --> Assessment[15:24:05<br>Assessment Complete]
Assessment --> Identify[15:24:10<br>Root Cause: CSI Driver Unregistered]
Identify --> CheckData[15:24:15<br>Verify Volume Health]
CheckData --> DataIntact[All 12 volumes:<br>state=attached<br>robustness=healthy]
%% ===== RECOVERY ATTEMPTS =====
Identify --> Attempt1[15:24:50<br>Attempt 1: Touch HelmChart Manifest]
Attempt1 --> Partial1[Only 1 manager pod affected]
Partial1 --> NeedMore[Insufficient recovery]
NeedMore --> Attempt2[15:32:15<br>Attempt 2: Delete All Longhorn Pods]
Attempt2 --> HelmReconcile[HelmChart Controller<br>Recreates All 24 Pods]
HelmReconcile --> Progress[15+ Pods Running<br>Managers, Engine-Image, Some CSI]
Progress --> Blocked[Driver-Deployer<br>Stuck in Init:0/1]
Blocked --> Investigate[15:34:30<br>Investigate wait-longhorn-manager]
Investigate --> WaitLoop[Init container runs:<br>longhorn-manager wait -d longhorn-system]
WaitLoop --> WaitingManagers[Waiting for all managers<br>to pass readiness probes]
%% ===== CURRENT STATE (15:35:30) =====
WaitingManagers --> CurrentState
subgraph CurrentState["Current State<br>15:35:30 UTC"]
direction TB
Resolved[Resolved ✅] --> ManagersOk[Manager Pods:<br>2/2, 1/2, 2/2 Running<br>pi1, pi2, pi3]
Resolved --> EngineOk[Engine Image:<br>3/3 Running]
Resolved --> CSIPartial[CSI Sidecars:<br>~50% Running]
Resolved --> VolumeData[Volume Data:<br>All intact]
BlockedNow[Blocked ❌] --> DriverDeployer[Driver Deployer:<br>Init:0/1 8+ min<br>waiting for managers]
BlockedNow --> CSIPluginAll[CSI Plugin:<br>0/3 Error all ]
BlockedNow --> UI[Longhorn UI:<br>0/2 CrashLoop]
BlockedNow --> ShareMgr[Share Manager:<br>0/2 Error]
BlockedNow --> NFSPod[RWX NFS:<br>ContainerCreating]
BlockedNow --> AppImpact[Application Impact:<br>~30 pods still failed<br>down from 43]
end
%% ===== RECOVERY PATH =====
CurrentState --> NextStep[Next: Resolve driver-deployer<br>wait-longhorn-manager blockage]
NextStep --> CheckHealth[Check manager health endpoints<br>https://<ip>:9501/v1/healthz]
CheckHealth -->|If healthy| WaitContainerIssue[Wait container bug/timeout]
CheckHealth -->|If unhealthy| FixManagers[Investigate manager readiness]
WaitContainerIssue --> Option1[Option 1: Delete driver-deployer pod]
WaitContainerIssue --> Option2[Option 2: Touch manifest again]
FixManagers --> CheckLogs[Check manager container logs]
CheckLogs --> ResolveManagers[Fix manager readiness]
Option1 --> CSIDriver[CSI Driver deployed]
Option2 --> CSIDriver
ResolveManagers --> CSIDriver
CSIDriver --> CSISocketRestored[CSI Socket Restored]
CSISocketRestored --> PodsRecover[All Longhorn pods recover]
PodsRecover --> PVCMounts[PVC Mounts resume]
PVCMounts --> AppRecovery[Application pods auto-recover]
AppRecovery --> ResolvedState[Resolved ✅]
%% ===== STYLES =====
classDef event fill:#10b981,color:#fff,stroke:#059669
classDef impact fill:#d97706,color:#000,stroke:#b45309
classDef action fill:#3b82f6,color:#fff,stroke:#2563eb
classDef resolved fill:#10b981,color:#fff,stroke:#059669
classDef blocked fill:#ef4444,color:#fff,stroke:#dc2626
classDef current fill:#8b5cf6,color:#fff,stroke:#7c3aed
class Start,Crash,KubeletCrash,DockerCrash,K3sCrash event
class Overlay2,DockerFail,CSIUnreg,CSISocket,NoQuorum,NoMounts impact
class Discovery,Assessment,Identify,CheckData,Attempt1,Attempt2,Investigate action
class ManagersOk,EngineOk,CSIPartial,VolumeData resolved
class DriverDeployer,CSIPluginAll,UI,ShareMgr,NFSPod,AppImpact blocked
class WaitLoop,CurrentState,NextStep,CheckHealth,Option1,Option2,ResolvedState current
classDef subtitle fill:#64748b,color:#fff,stroke:#475569,font-size:12px
class CurrentState subtitle

(File diff suppressed because it is too large.)


@@ -0,0 +1,416 @@
---
title: PVC Recovery — Post-Reinstall Volume Restoration
incident_id: 2026-04-13-001
date: 2026-04-14
status: Mostly Resolved
operator: Claude Code
---
# PVC Recovery — Post-Reinstall Volume Restoration
## Situation as of 2026-04-14
Longhorn has been fully reinstalled and is healthy. The cluster nodes are all Ready. However,
**all application volumes are inaccessible** because the nuclear cleanup deleted the Longhorn
Volume/Engine/Replica CRDs, and the reinstalled Longhorn has no knowledge of the old volumes.
### Longhorn Health (verified)
```
NAME READY STATUS AGE
csi-attacher (3 pods) 1/1 Running 30m
csi-provisioner (3 pods) 1/1 Running 30m
csi-resizer (3 pods) 1/1 Running 30m
csi-snapshotter (3 pods) 1/1 Running 30m
engine-image-ei-b4bcf0a5 (3 pods) 1/1 Running 31m
instance-manager (3 pods) 1/1 Running 30m
longhorn-csi-plugin (3 pods) 3/3 Running 30m
longhorn-driver-deployer 1/1 Running 31m
longhorn-manager (3 pods) 2/2 Running 14m
longhorn-ui (2 pods) 1/1 Running 31m
CSIDriver driver.longhorn.io: Registered (AGE: 110d — restored)
```
Longhorn only knows about 3 volumes (crowdsec-config, crowdsec-db, traefik) — all newly provisioned
after reinstall. The other 9 volumes are missing from Longhorn's knowledge.
---
## Backup Files Available
| File | Location | Contents | Gap |
|------|----------|----------|-----|
| `backup_20260413.volumes` | `/home/pi/arcodange/backups/k3s_pvc/` | PV + PVC YAML (kubectl get -A pv,pvc) | No Longhorn CRDs |
| `longhorn_metadata_20260413.yaml` | `/home/pi/arcodange/backups/k3s_pvc/` | Engines + Replicas CRDs | **No Volume CRDs** |
**Critical gap:** The metadata backup was collected with `kubectl get -n longhorn-system volumes.longhorn.io,replicas.longhorn.io,engines.longhorn.io -o yaml` but the resulting file contains only Engines and Replicas in 3 separate Lists. The Volume CRDs are absent.
Attempting `kubectl apply -f longhorn_metadata_20260413.yaml` fails with:
```
Error from server (Invalid): admission webhook "validator.longhorn.io" denied the request:
volume does not exist for engine
```
The webhook requires Volume CRDs to exist before Engines can be created. Without Volume CRDs in the
backup, the metadata file cannot be applied as-is.
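A quick sanity check before relying on any metadata backup is to list the kinds it actually contains. A minimal sketch, using this incident's backup path:
```bash
python3 - <<'PY' | sort | uniq -c
import yaml
path = "/home/pi/arcodange/backups/k3s_pvc/longhorn_metadata_20260413.yaml"
for doc in yaml.safe_load_all(open(path)):
    if not doc:
        continue
    # kubectl -o yaml wraps multiple objects in List documents
    for item in doc.get("items", [doc]):
        print(item.get("kind"))
PY
```
If `Volume` never appears in the output, the fast-path restore is off the table.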
---
## Data Survival Assessment
### Pi1 — Replica directories
Pi1 is the control plane. Its old replica directories were **deleted** during the nuclear cleanup.
Only 3 new directories exist (created after reinstall):
```
pvc-01b93e30-...-b1530c1d (crowdsec-config — NEW)
pvc-4785dc60-...-2f031b60 (crowdsec-db — NEW)
pvc-5391fa2b-...-0e2ff956 (traefik — NEW)
```
### Pi2 — Replica directories (OLD data preserved)
```
pvc-01b93e30-...-8649439a (crowdsec-config — new post-reinstall)
pvc-1251909b-...-e7a20fdf ← OLD DATA (clickhouse 16Gi)
pvc-14ccc47e-...-09021065 ← OLD DATA (crowdsec-db old PV)
pvc-4785dc60-...-4b48fdf1 (crowdsec-db — new post-reinstall)
pvc-5391fa2b-...-d3503612 (traefik — new post-reinstall)
pvc-63244de1-...-6076eb08 (unknown — not in engine backup)
pvc-6d2ea1c7-...-c7f287d8 ← OLD DATA (audit-vault 10Gi)
pvc-7971918e-...-2028617e ← OLD DATA (erp 50Gi)
pvc-88e18c7f-...-910583f6 ← OLD DATA (prometheus-server 8Gi)
pvc-abc7666c-...-34bec9b0 (unknown — not in engine backup)
pvc-aed7f2c4-...-41c20064 ← OLD DATA (alertmanager 2Gi)
pvc-ca5567d3-...-b537ca60 ← OLD DATA (data-vault 10Gi)
pvc-cc8a3cbb-...-cd16e459 ← OLD DATA (old traefik 128Mi)
pvc-cdd434d1-...-b2695689 ← OLD DATA (url-shortener 128Mi)
pvc-d1d5482b-...-e0a8cdbc ← OLD DATA (redis 1Gi)
pvc-efda1d2f-...-30c849a6 ← OLD DATA (backups-rwx 50Gi)
pvc-f9fe3504-...-20f64e9e ← OLD DATA (old crowdsec-config 100Mi)
pvc-fca13978-...-4749b404 (unknown — not in engine backup)
```
### Pi3 — Replica directories (OLD data preserved, multiple dirs per volume)
```
pvc-01b93e30-...-29592f50 (crowdsec-config — new post-reinstall)
pvc-1251909b-...-1163420b ← OLD DATA (clickhouse — replica 1)
pvc-1251909b-...-3a569b0a ← OLD DATA (clickhouse — replica 2)
pvc-1251909b-...-ccd05947 ← OLD DATA (clickhouse — replica 3 or stale)
pvc-14ccc47e-...-3856d64d ← OLD DATA (old crowdsec-db)
pvc-2e60385f-...-48e27d5a (unknown)
pvc-4785dc60-...-869f0e99 (crowdsec-db — new post-reinstall)
pvc-5391fa2b-...-958cd868 (traefik — new post-reinstall)
pvc-6d2ea1c7-...-0e73550d ← OLD DATA (audit-vault — dir 1)
pvc-6d2ea1c7-...-787ffefa ← OLD DATA (audit-vault — dir 2)
pvc-6d2ea1c7-...-e0f58d64 ← OLD DATA (audit-vault — dir 3 or stale)
pvc-7971918e-...-33191046 ← OLD DATA (erp — dir 1)
pvc-7971918e-...-88fc1dfc ← OLD DATA (erp — dir 2)
pvc-7971918e-...-b5c5530d ← OLD DATA (erp — dir 3 or stale)
pvc-88e18c7f-...-5d508830 ← OLD DATA (prometheus-server — dir 1)
pvc-88e18c7f-...-92c0ebfd ← OLD DATA (prometheus-server — dir 2)
pvc-88e18c7f-...-deea6182 ← OLD DATA (prometheus-server — dir 3 or stale)
pvc-abe09e90-...-a748d11b (unknown)
pvc-aed7f2c4-...-3452358f ← OLD DATA (alertmanager — dir 1)
pvc-aed7f2c4-...-826f05aa ← OLD DATA (alertmanager — dir 2)
pvc-ca5567d3-...-0ed6f691 ← OLD DATA (data-vault — dir 1)
pvc-ca5567d3-...-808d72b4 ← OLD DATA (data-vault — dir 2)
pvc-ca5567d3-...-9051ef48 ← OLD DATA (data-vault — dir 3 or stale)
pvc-cc8a3cbb-...-011b54b3 ← OLD DATA (old traefik — dir 1)
pvc-cc8a3cbb-...-a24fd91e ← OLD DATA (old traefik — dir 2)
pvc-cdd434d1-...-70197659 ← OLD DATA (url-shortener — dir 1)
pvc-cdd434d1-...-998f49ff ← OLD DATA (url-shortener — dir 2)
pvc-d1d5482b-...-6a730f00 ← OLD DATA (redis — dir 1)
pvc-d1d5482b-...-75da16fd ← OLD DATA (redis — dir 2)
pvc-efda1d2f-...-62fb04c9 ← OLD DATA (backups-rwx — dir 1)
pvc-efda1d2f-...-688f30f5 ← OLD DATA (backups-rwx — dir 2)
pvc-efda1d2f-...-69454dd0 ← OLD DATA (backups-rwx — dir 3 or stale)
pvc-f9fe3504-...-418df608 ← OLD DATA (old crowdsec-config)
```
**Note on multiple directories per volume on pi3:** Normal replicas = 1 dir per volume per node.
Multiple directories indicate either: rebuild attempts from before the nuclear cleanup, or stale
snapshots. Must verify by checking `.img` file sizes before renaming.
---
## Volume → PVC Mapping (from backup_20260413.volumes)
| PV Name | PVC | Namespace | Size | Status |
|---------|-----|-----------|------|--------|
| `pvc-1251909b-3cef-40c6-881c-3bb6e929a596` | `clickhouse-storage-clickhouse-0` | tools | 16Gi | Terminating |
| `pvc-6d2ea1c7-9327-4992-a02c-93ae604eda70` | `audit-hashicorp-vault-0` | tools | 10Gi | Terminating |
| `pvc-7971918e-e47f-4739-a976-965ea2d770b4` | `erp` | erp | 50Gi | Terminating |
| `pvc-88e18c7f-2cfd-45e3-be5b-78c31ab829e9` | `prometheus-server` | tools | 8Gi | Terminating |
| `pvc-aed7f2c4-1948-487a-8d10-d8a1372289b4` | `storage-prometheus-alertmanager-0` | tools | 2Gi | Terminating |
| `pvc-ca5567d3-a682-4cee-8ff1-2b8e23260635` | `data-hashicorp-vault-0` | tools | 10Gi | Terminating |
| `pvc-cc8a3cbb-dbc2-47a2-a0cc-a02136122b90` | `traefik` | kube-system | 128Mi | Terminating |
| `pvc-cdd434d1-88b4-4588-8fd2-8c7eafc56d07` | `url-shortener` | url-shortener | 128Mi | Terminating |
| `pvc-d1d5482b-81c8-4d7c-a528-7a57ef47a5ce` | `redis-storage-redis-0` | tools | 1Gi | Terminating |
| `pvc-efda1d2f-1db8-46dd-9a97-3d11f1807ffa` | `backups-rwx` | longhorn-system | 50Gi | Lost |
| `pvc-14ccc47e-0b8c-49d4-97bb-70e550f644b0` | `crowdsec-db-pvc` | tools | 1Gi | already replaced |
| `pvc-f9fe3504-70ce-4401-8cda-bc6bb68bc1bf` | `crowdsec-config-pvc` | tools | 100Mi | already replaced |
CrowdSec volumes (`pvc-14ccc47e`, `pvc-f9fe3504`) are the old PVs — CrowdSec already got new volumes
(`pvc-4785dc60`, `pvc-01b93e30`) and is running. These old dirs can be cleaned up later.
---
## Recovery Plan
### Why not restore PVCs
New PVCs will be created by the workloads themselves when they restart. Restoring old PVCs would
conflict with both the stuck Terminating ones and any new ones that workloads may already be creating.
**Restore PVs only** — strip `claimRef` so they become `Available`, and new PVCs bind to them via
`storageClassName` + `accessMode` + `capacity` matching.
### Step 1 — Clear stuck Terminating PVs
The old PVs are stuck in `Terminating` with `kubernetes.io/pvc-protection` finalizers. Remove them:
```bash
for pv in \
pvc-1251909b-3cef-40c6-881c-3bb6e929a596 \
pvc-6d2ea1c7-9327-4992-a02c-93ae604eda70 \
pvc-7971918e-e47f-4739-a976-965ea2d770b4 \
pvc-88e18c7f-2cfd-45e3-be5b-78c31ab829e9 \
pvc-aed7f2c4-1948-487a-8d10-d8a1372289b4 \
pvc-ca5567d3-a682-4cee-8ff1-2b8e23260635 \
pvc-cc8a3cbb-dbc2-47a2-a0cc-a02136122b90 \
pvc-cdd434d1-88b4-4588-8fd2-8c7eafc56d07 \
pvc-d1d5482b-81c8-4d7c-a528-7a57ef47a5ce \
pvc-efda1d2f-1db8-46dd-9a97-3d11f1807ffa; do
kubectl patch pv $pv -p '{"metadata":{"finalizers":null}}' --type=merge
done
```
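After the loop, confirm nothing is left stuck:
```bash
kubectl get pv | grep Terminating || echo "no PVs stuck in Terminating"
```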
### Step 2 — Restore PVs with claimRef removed and Retain policy
Extract PVs from the backup, strip `claimRef` and set `persistentVolumeReclaimPolicy: Retain`,
then apply:
```bash
ssh pi1 "sudo kubectl get pv \
pvc-1251909b-3cef-40c6-881c-3bb6e929a596 \
pvc-6d2ea1c7-9327-4992-a02c-93ae604eda70 \
pvc-7971918e-e47f-4739-a976-965ea2d770b4 \
pvc-88e18c7f-2cfd-45e3-be5b-78c31ab829e9 \
pvc-aed7f2c4-1948-487a-8d10-d8a1372289b4 \
pvc-ca5567d3-a682-4cee-8ff1-2b8e23260635 \
pvc-cc8a3cbb-dbc2-47a2-a0cc-a02136122b90 \
pvc-cdd434d1-88b4-4588-8fd2-8c7eafc56d07 \
pvc-d1d5482b-81c8-4d7c-a528-7a57ef47a5ce \
pvc-efda1d2f-1db8-46dd-9a97-3d11f1807ffa \
-o yaml 2>/dev/null | \
python3 -c \"
import sys, yaml
docs = list(yaml.safe_load_all(sys.stdin))
for doc in docs:
if not doc: continue
items = doc.get('items', [doc])
for pv in items:
if pv.get('kind') != 'PersistentVolume': continue
spec = pv.get('spec', {})
spec.pop('claimRef', None)
spec['persistentVolumeReclaimPolicy'] = 'Retain'
pv.pop('status', None)
meta = pv.get('metadata', {})
meta.pop('resourceVersion', None)
meta.pop('uid', None)
meta.pop('creationTimestamp', None)
print('---')
print(yaml.dump(pv))
\" | kubectl apply -f -"
```
Expected result: PVs become `Available` (no claimRef = unbound).
### Step 3 — Longhorn creates new Volume CRDs + replica dirs
When new PVCs bind to the restored PVs and pods attempt to mount them, Longhorn's CSI provisioner
will create new Volume CRDs for each. These new Volume CRDs will have new engine IDs, and Longhorn
will create **new empty replica directories** on pi1, pi2, pi3.
At this point the volume directory layout will be:
```
/mnt/arcodange/longhorn/replicas/
pvc-1251909b-...-<OLD_SUFFIX> ← pi2/pi3: OLD data
pvc-1251909b-...-<NEW_SUFFIX> ← pi1/pi2/pi3: NEW empty dirs
```
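To catch the new engine IDs as they are generated, it helps to watch both sides (a sketch):
```bash
# New Volume CRDs appearing as PVCs bind
kubectl get volumes.longhorn.io -n longhorn-system -w
# On each node, new (empty, 16K) replica dirs appearing
ssh -t pi2 "watch -n5 'ls -lt /mnt/arcodange/longhorn/replicas/ | head'"
```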
### Step 4 — Map old dirs to new dirs, verify data presence
For each volume, on each node, identify:
- OLD dir: exists before new binding (larger .img file size, older timestamp)
- NEW dir: created after binding (empty or minimal .img file)
```bash
# Example: check sizes on pi2 for clickhouse
ssh pi2 "du -sh /mnt/arcodange/longhorn/replicas/pvc-1251909b-*"
```
### Step 5 — Swap directories (Method B)
For each volume on each node that has an old dir with data:
```bash
# Scale down the workload first
kubectl scale statefulset clickhouse -n tools --replicas=0
# Wait for volume to detach
kubectl wait --for=jsonpath='{.status.state}'=detached \
volume/pvc-1251909b-3cef-40c6-881c-3bb6e929a596 \
-n longhorn-system --timeout=60s
# On pi2: rename new empty dir, move old data dir to new name
ssh pi2 "
NEW=$(ls /mnt/arcodange/longhorn/replicas/ | grep pvc-1251909b | \
xargs -I{} stat --format='%Y {}' /mnt/arcodange/longhorn/replicas/{} | \
sort -rn | head -1 | awk '{print \$2}')
OLD=$(ls /mnt/arcodange/longhorn/replicas/ | grep pvc-1251909b | \
xargs -I{} stat --format='%Y {}' /mnt/arcodange/longhorn/replicas/{} | \
sort -n | head -1 | awk '{print \$2}')
echo \"OLD: \$OLD\"
echo \"NEW: \$NEW\"
sudo mv \$NEW \${NEW}.empty_backup
sudo mv \$OLD \$NEW
"
# Repeat on pi3
# Restart the instance manager on affected node to pick up new dir
kubectl delete pod -n longhorn-system -l \
longhorn.io/node=pi2,longhorn.io/component=instance-manager
```
### Step 6 — Scale workloads back up and verify
```bash
kubectl scale statefulset clickhouse -n tools --replicas=1
kubectl get pvc -n tools clickhouse-storage-clickhouse-0
kubectl get volumes -n longhorn-system pvc-1251909b-3cef-40c6-881c-3bb6e929a596
```
---
## Priority Order for Recovery
Given data criticality:
1. **HashiCorp Vault data** (`pvc-ca5567d3` + `pvc-6d2ea1c7`) — credentials/secrets store
2. **ERP** (`pvc-7971918e`) — 50Gi, business data
3. **Prometheus** (`pvc-88e18c7f`) — 8Gi, metrics history (degraded OK, can rebuild)
4. **Redis** (`pvc-d1d5482b`) — 1Gi, cache (can rebuild from scratch if needed)
5. **Alertmanager** (`pvc-aed7f2c4`) — 2Gi, alert history (can rebuild)
6. **Clickhouse** (`pvc-1251909b`) — 16Gi
7. **URL shortener** (`pvc-cdd434d1`) — 128Mi
8. **Traefik** (`pvc-cc8a3cbb`) — 128Mi (TLS certs, can re-issue via cert-manager)
9. **Longhorn backups-rwx** (`pvc-efda1d2f`) — 50Gi, backup volume itself
---
## Caution: Multiple Dirs on Pi3
Several volumes have 3 directories on pi3. This likely happened during the incident when Longhorn
attempted rebuilds before the nuclear cleanup. **Do not blindly take the newest or oldest** — check
actual `.img` file size to identify the one with data:
```bash
ssh pi3 "du -sh /mnt/arcodange/longhorn/replicas/pvc-1251909b-*"
# The largest directory holds the actual data
```
---
## Lessons for Backup Script
The current backup command `kubectl get -A pv -o yaml && echo '---' && kubectl get -A pvc -o yaml`
captures PVs and PVCs but not Longhorn Volume CRDs. The backup command must be updated to include:
```bash
kubectl get -A pv -o yaml && echo '---' \
&& kubectl get -A pvc -o yaml && echo '---' \
&& kubectl get -n longhorn-system volumes.longhorn.io -o yaml
```
This is tracked in ADR `docs/adr/20260414-longhorn-pvc-recovery.md` under "Prevention".
---
## Volume Recovery Status
| PV Name | PVC | Namespace | Size | Method | Status |
|---------|-----|-----------|------|--------|--------|
| `pvc-5391fa2b` | `traefik` | kube-system | 128Mi | PV claimRef remove | ✅ 2026-04-14 |
| `pvc-cdd434d1` | `url-shortener-data` | url-shortener | 128Mi | Method B (dir rename) | ✅ 2026-04-14 |
| `pvc-1251909b` | `clickhouse-storage-clickhouse-0` | tools | 16Gi | Block-device (playbook) | ✅ 2026-04-14 |
| `pvc-88e18c7f` | `prometheus-server` | tools | 8Gi | Block-device (playbook) | ⏳ 2026-04-15 |
| `pvc-aed7f2c4` | `storage-prometheus-alertmanager-0` | tools | 2Gi | Block-device (playbook) | ⏳ 2026-04-15 |
| `pvc-d1d5482b` | `redis-storage-redis-0` | tools | 1Gi | Block-device (playbook) | ⏳ 2026-04-15 |
| `pvc-efda1d2f` | `backups-rwx` | longhorn-system | 50Gi | Block-device (playbook) | ⏳ 2026-04-15 |
| `pvc-ca5567d3` | `data-hashicorp-vault-0` | tools | 10Gi | Manual (deferred) | 🔴 Pending |
| `pvc-6d2ea1c7` | `audit-hashicorp-vault-0` | tools | 10Gi | Manual (deferred) | 🔴 Pending |
| `pvc-7971918e` | `erp` | erp | 50Gi | Manual (deferred) | 🔴 Pending |
**Vault and ERP are excluded from automated recovery** — they require coordinated manual procedures
(Vault unseal key management; ERP business data verification). Use `docs/runbooks/longhorn-block-device-recovery.md`
with extra validation steps for those volumes.
---
## Automated Recovery: Block-Device Injection
Directory rename (Method B) proved too risky for large volumes: Longhorn detects `Dirty: true` +
inconsistency across replicas and silently rebuilds from the empty pi1 replica, destroying data.
**The approach that works** (implemented in `playbooks/recover/longhorn_data.yml`):
1. **Phase 0** — Auto-discover best replica dir per volume (skip `Rebuilding: true`, rank by actual disk usage)
2. **Phase 1** — Backup untouched replica dir before touching anything
3. **Phase 2** — Merge sparse snapshot + head layers into a flat image (`merge-longhorn-layers.py`)
4. **Phase 3** — Create Longhorn Volume CRD, wait for replicas
5. **Phase 4** — Scale down workload
6. **Phase 5** — Attach volume via VolumeAttachment maintenance ticket
7. **Phase 6** — `mkfs.ext4` the live block device, rsync data from merged image
8. **Phase 7** — Remove maintenance attachment ticket
9. **Phase 8** — Recreate PV (Retain, no claimRef) + PVC (pinned to PV)
10. **Phase 9** — Scale up, wait for readyReplicas ≥ 1, optional verify_cmd
**Pitfall discovered (2026-04-15):** `du -sb` returns apparent size for sparse files, making a
`Rebuilding: true` replica (1.3 GiB actual, 24 GiB apparent) beat healthy 11 GiB replicas.
Fixed by checking `Rebuilding` flag in `volume.meta` and using `du -sk` (actual usage).
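In essence, the fixed selection logic looks like this (a sketch, assuming `volume.meta` is JSON with a boolean `Rebuilding` key, as observed during the incident):
```bash
for d in /mnt/arcodange/longhorn/replicas/pvc-1251909b-*/; do
  # Skip replicas that were caught mid-rebuild
  rebuilding=$(sudo python3 -c \
    "import json; print(json.load(open('$d/volume.meta'))['Rebuilding'])" 2>/dev/null)
  [ "$rebuilding" = "True" ] && continue
  # Rank by actual disk usage (du -sk), never apparent size (du -sb)
  echo "$(sudo du -sk "$d" | cut -f1) $d"
done | sort -rn | head -1
```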
**Usage:**
```bash
ansible-playbook -i inventory/hosts.yml playbooks/recover/longhorn_data.yml \
-e @playbooks/recover/longhorn_data_vars_remaining.yml
```
Vars files:
- `playbooks/recover/longhorn_data_vars_clickhouse.yml` — clickhouse (already recovered, archived)
- `playbooks/recover/longhorn_data_vars_remaining.yml` — prometheus, alertmanager, redis, backups-rwx
- `playbooks/recover/longhorn_data_vars.example.yml` — template for future use
---
## Tested Recovery Procedure (url-shortener — 2026-04-14)
Method B confirmed working for this volume (small, no Rebuilding replicas). Full sequence:
1. Create Longhorn Volume CRD manually (size 128Mi, rwo, 3 replicas)
2. Create Longhorn VolumeAttachment ticket to pi1 (disableFrontend: true) → triggers replica dir creation
3. Remove attachment ticket → volume detaches
4. On pi2: `mv new-dir new-dir.empty && mv old-dir new-dir`
5. On pi3: same (chose `-70197659` over `-998f49ff` based on newer mtime: Apr 7 vs Apr 6)
6. Clear finalizers on stuck Terminating PV/PVC → both deleted
7. Recreate PV (Retain policy, no claimRef, same CSI volumeHandle)
8. Recreate PVC with `volumeName:` pinned to the PV
9. Delete old Error pod (was blocking volume attach)
10. New pod comes up 1/1 Running, volume attached healthy on pi3, all 3 replicas running
**Traefik** was simpler — PV `pvc-5391fa2b` already existed in Longhorn (Released). Just removed
claimRef (→ Available), created `kube-system/traefik` PVC with `volumeName:` pinned. Bound immediately.
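The claimRef removal can be done with a one-line patch (a sketch; substitute the full PV name for `pvc-5391fa2b-…`):
```bash
kubectl patch pv <pv-name> --type=json \
  -p '[{"op":"remove","path":"/spec/claimRef"}]'
# PV flips Released → Available; the volumeName-pinned PVC then binds immediately
```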
**For all subsequent volumes** — use `playbooks/recover/longhorn_data.yml`. Method B is too risky.

View File

@@ -0,0 +1,70 @@
---
# Automated Longhorn Recovery Playbook (DRAFT)
# Purpose: Break circular dependency and restore CSI driver after power-cut
#
# REQUIREMENTS:
# - Ansible >= 2.15
# - kubectl on control plane (pi1)
# - Backup scripts from playbooks/backup/k3s_pvc.yml must be deployed
#
# USAGE:
# ansible-playbook -i inventory/hosts.yml docs/incidents/2026-04-13-power-cut/recover_longhorn.yml
#
# REFERENCE FILES:
# - playbooks/system/k3s_config.yml (Longhorn HelmChart template)
# - playbooks/backup/k3s_pvc.yml (Backup/restore scripts)
# - inventory/hosts.yml (Target hosts)
# - /mnt/arcodange/longhorn/replicas/ (Data - MUST NOT be touched)
# - /home/pi/arcodange/backups/k3s_pvc/ (Fallback backup location)
#
#
# PLAYBOOK FLOW:
#
# Phase 1: DIAGNOSIS (idempotent, safe to run anytime)
# - Check CSI driver registration status
# - Check Longhorn manager health
# - Identify which recovery phase is needed
#
# Phase 2: SOFT RECOVERY (least destructive)
# - Touch longhorn-install.yaml manifest
# - Wait 60s for k3s HelmChart controller to reconcile
# - Verify pod recreation
#
# Phase 3: HARD RECOVERY (if soft fails)
# - Delete all longhorn-driver-deployer pods
# - Wait for HelmChart to recreate
#
# Phase 4: NUCLEAR RECOVERY (if hard fails)
# - Delete HelmChart resource
# - Remove manifest file
# - Force-delete longhorn-system namespace (after removing finalizers)
# - Reinstall Longhorn via manifest
#
# Phase 5: RESTORE FROM BACKUP (idempotent)
# - Apply PV/PVC from backup
# - Apply Longhorn CRs from backup
# - Data auto-discovered from disk
#
# DESIGNED TO HANDLE:
# - CSI driver registration lost
# - Longhorn manager webhook circular dependency
# - Partial pod crashes
# - Full Longhorn namespace corruption
#
# LIMITATIONS:
# - Requires pi1 (control plane) to be reachable
# - Data in /mnt/arcodange/longhorn/ MUST survive
# - Docker must be functional on at least 1 node
# - Does NOT handle Docker overlay2 corruption
#
# TESTED SCENARIOS:
# - [ ] CSI driver not registered (primary use case)
# - [ ] Longhorn manager CrashLoopBackOff
# - [ ] Full namespace deletion needed
# - [ ] Backup restore validation
#
# TODO:
# - Add Docker storage health check
# - Add pre-recovery data verification
# - Add post-recovery validation

View File

@@ -0,0 +1,153 @@
---
title: Recovery Approach Analysis — Post-Incident Review
incident_id: 2026-04-13-001
date: 2026-04-13
author: Claude Code (external review)
---
# Recovery Approach Analysis
## TL;DR
The incident escalated from a **~5 minute fix** to a **full Longhorn reinstall with backup restore** because the simplest remediation (k3s restart) was never attempted, and a single aggressive command (`kubectl delete pods --all --force`) created a new problem that did not previously exist.
---
## What Was Skipped
### 1. Restart k3s on all nodes (never attempted)
This should have been the **first or second action** after the manifest touch failed.
```bash
systemctl restart k3s # pi1 — control plane
systemctl restart k3s-agent # pi2, pi3 — agent nodes
```
After a power cut, k3s/kubelet state is dirty. Restarting k3s:
- Forces kubelet to reinitialize the plugin registry cleanly
- Allows Longhorn pods to restart in correct dependency order
- Avoids the simultaneous-restart race condition that causes webhook issues
- Takes ~2 minutes with no destructive side effects
This was listed as a last resort in the runbook consulted at incident start. It should have been tried **before any pod deletion**, not after.
### 2. Stale CSI socket check on each node (never attempted)
```bash
# On each node (pi1, pi2, pi3):
ls /var/lib/kubelet/plugins/driver.longhorn.io/
# If a stale .sock file exists:
rm /var/lib/kubelet/plugins/driver.longhorn.io/csi.sock
```
The incident log confirms the CSI socket was missing/stale, but no one went to the nodes to verify and clean this up. Removing a stale socket + restarting the `longhorn-csi-plugin` daemonset is a targeted, low-risk fix.
---
## Where the Direction Went Wrong
### The pivotal mistake: force deleting all 24 pods simultaneously
**Command run at 15:32:15:**
```bash
kubectl delete pods -n longhorn-system --all --force --grace-period=0
```
This command created the **webhook circular dependency problem**, which did not exist before it was run.
**Why it caused the circular dependency:**
In normal operation, Longhorn managers start sequentially. One becomes the webhook leader and begins serving on port 9501 before others register as service endpoints.
When all 24 pods are force-deleted simultaneously:
1. All 3 manager pods race-start at the same time
2. All 3 IPs are registered as `longhorn-conversion-webhook` service endpoints immediately
3. The health check (`https://<pod-ip>:9501/v1/healthz`) is run against all 3
4. Only the elected leader actually serves port 9501 — the other 2 fail the probe
5. Failing managers crash: `"conversion webhook service is not accessible after 1m0s"`
6. `longhorn-driver-deployer` init container waits for healthy managers indefinitely
7. CSI socket is never created, CSI driver never registers
**The original problem was only a lost CSI socket registration.** The webhook circular dependency is a new problem introduced by the recovery attempt.
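The asymmetry is directly observable by probing each manager pod (a sketch; only the elected leader should answer):
```bash
kubectl get pods -n longhorn-system -l app=longhorn-manager \
  -o jsonpath='{range .items[*]}{.status.podIP}{"\n"}{end}' | while read -r ip; do
  printf '%s: ' "$ip"
  curl -sk -o /dev/null -w '%{http_code}\n' "https://$ip:9501/v1/healthz"  # 000 = unreachable
done
```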
---
## The Escalation Cascade
Each step created a harder problem than the one it was meant to solve:
```
Power cut
→ CSI socket lost (original problem — simple fix)
→ Force delete all pods
→ Webhook circular dependency (new problem)
→ Delete HelmChart + manifest
→ 84 finalizers blocking namespace deletion (new problem)
→ Full reinstall required
→ Backup restore required
→ Risk to volume metadata
```
The original problem required touching 1 socket file and restarting k3s. The current state requires:
- Manually patching finalizers off 84+ resources
- Full Longhorn reinstall
- Restoring PV/PVC and Longhorn CRs from backup
- Verifying data auto-discovery from replicas
---
## Correct Recovery Sequence (Hindsight)
### Step 1 — k3s restart (should have been tried at ~15:27)
```bash
ansible -i inventory/hosts.yml all -m shell -a "sudo systemctl restart k3s || sudo systemctl restart k3s-agent"
```
Wait 3 minutes. In most power-cut scenarios, this alone restores CSI registration.
### Step 2 — If still broken: targeted daemonset restart (not force-delete-all)
```bash
kubectl rollout restart daemonset/longhorn-manager -n longhorn-system
kubectl rollout status daemonset/longhorn-manager -n longhorn-system
```
Graceful restart respects the dependency order. Wait for managers to stabilize before touching CSI pods.
### Step 3 — Check and clean stale sockets on each node
```bash
# Run on pi1, pi2, pi3:
ls /var/lib/kubelet/plugins/driver.longhorn.io/
rm -f /var/lib/kubelet/plugins/driver.longhorn.io/csi.sock
kubectl rollout restart daemonset/longhorn-csi-plugin -n longhorn-system
```
### Step 4 — Verify CSI driver registered
```bash
kubectl get csidriver
kubectl get csinodes
```
### Step 5 — Only if all above failed: delete driver-deployer pod only
```bash
kubectl delete pod -n longhorn-system -l app=longhorn-driver-deployer
```
Not all pods. One targeted pod.
---
## What Was Done Well
- Quick identification of the original root cause (CSI registration)
- Confirming volume data integrity early (`robustness="healthy"`)
- Securing backups before destructive operations (16:30)
- Fixing the backup script bug (useful regardless of incident)
- Detailed logging throughout
---
## Action Items for Future Incidents
- [ ] Add k3s restart as **step 2** in the Longhorn recovery runbook (before any pod deletion)
- [ ] Add CSI socket cleanup to the runbook as an explicit step on each node
- [ ] Add a "minimum destructive action" principle: prefer `rollout restart` over `delete --force --all`
- [ ] Implement `recover_longhorn.yml` playbook with the phased approach (soft → targeted → hard) to prevent ad-hoc escalation
- [ ] Add a pre-action checklist: "have I tried restarting the service before deleting its resources?"

View File

@@ -0,0 +1,107 @@
#!/usr/bin/env python3
"""
Merge Longhorn snapshot + head layers into a single mountable raw image.
Longhorn stores replica data as sparse raw images in a chain:
volume-snap-<id>.img — full state at the time the snapshot was taken
volume-head-NNN.img — delta (only changed blocks) since the snapshot
To reconstruct the full filesystem, head blocks take priority over snapshot
blocks. Sparse (all-zero) blocks in the head fall through to the snapshot.
Usage:
sudo python3 merge-longhorn-layers.py <replica-dir> <output.img>
Example:
sudo python3 merge-longhorn-layers.py \\
/mnt/arcodange/longhorn/replicas/pvc-cdd434d1-...-998f49ff \\
/tmp/merged.img
# Then mount and inspect:
sudo mount -o loop /tmp/merged.img /mnt/recovery
ls /mnt/recovery/
Proven useful during incident 2026-04-13 to recover the url-shortener SQLite
database from a Longhorn replica that was never touched by the nuclear cleanup
(pi3, dir suffix -998f49ff, Apr 6 snapshot).
Key lesson: always identify the untouched replica dir (oldest timestamps,
never renamed) before attempting directory swaps. Back it up first.
"""
import os
import sys
import json
BLOCK = 4096
def find_layers(replica_dir: str) -> tuple[str | None, str | None]:
"""
Read volume.meta to find head filename and snapshot parent.
Returns (snapshot_path, head_path). snapshot_path is None for base volumes.
"""
meta_path = os.path.join(replica_dir, "volume.meta")
with open(meta_path) as f:
meta = json.load(f)
head_name = meta["Head"]
parent_name = meta.get("Parent", "")
head_path = os.path.join(replica_dir, head_name)
snap_path = os.path.join(replica_dir, parent_name) if parent_name else None
return snap_path, head_path
def merge(snap_path: str | None, head_path: str, out_path: str) -> None:
size = os.path.getsize(head_path)
print(f"Volume size: {size // (1024 * 1024)} MiB")
print(f"Snapshot: {snap_path or '(none — base volume)'}")
print(f"Head: {head_path}")
print(f"Output: {out_path}")
snap_f = open(snap_path, "rb") if snap_path else None
head_f = open(head_path, "rb")
with open(out_path, "wb") as out:
out.truncate(size)
blocks = size // BLOCK
for i, offset in enumerate(range(0, size, BLOCK)):
    # Head layer wins: a non-zero head block is the most recent data
    head_f.seek(offset)
    hb = head_f.read(BLOCK)
    if hb and any(hb):
        out.seek(offset)
        out.write(hb)
    elif snap_f:
        # Sparse (all-zero) head block: fall through to the snapshot layer
        snap_f.seek(offset)
        sb = snap_f.read(BLOCK)
        if sb and any(sb):
            out.seek(offset)
            out.write(sb)
    if i % 4096 == 0:
        # Progress every 4096 blocks, i.e. every 16 MiB
        pct = (i / blocks) * 100
        print(f"\r {pct:.0f}%", end="", flush=True)
print("\r 100% — done.")
if snap_f:
snap_f.close()
head_f.close()
if __name__ == "__main__":
if len(sys.argv) != 3:
print(__doc__)
sys.exit(1)
replica_dir = sys.argv[1]
out_path = sys.argv[2]
if not os.path.isdir(replica_dir):
print(f"Error: {replica_dir} is not a directory", file=sys.stderr)
sys.exit(1)
snap, head = find_layers(replica_dir)
merge(snap, head, out_path)

View File

@@ -0,0 +1,312 @@
# Incident Documentation
This directory contains incident reports, postmortems, and recovery logs for the Arcodange Factory infrastructure.
## Purpose
Document all infrastructure incidents to:
- Track root causes and resolutions
- Maintain a knowledge base for future troubleshooting
- Improve system reliability through lessons learned
- Provide clear guidance for on-call responders
## Structure
Each incident is documented in its own directory under `docs/incidents/` with the following naming convention:
```
docs/incidents/
├── YYYY-MM-DD-incident-name/
│ ├── README.md # Incident summary and timeline
│ ├── status.md # Real-time status updates (optional)
│ ├── log.md # Detailed recovery actions and logs
│ ├── root-cause.md # Technical analysis (optional)
│ └── diagrams/ # Architecture/flow diagrams (optional)
│ └── *.mmd # Mermaid diagrams
└── ...
```
## Incident Directory Contents
### 1. `README.md` (Required)
The primary incident document. Must include:
- **Incident ID**: Unique identifier (e.g., `2026-04-13-001`)
- **Title**: Clear, descriptive title
- **Date/Time**: Start and end timestamps
- **Status**: Open / Investigating / Resolved / Monitoring
- **Severity**: SEV-1 (Critical) / SEV-2 (High) / SEV-3 (Medium) / SEV-4 (Low)
- **Impact**: Brief description of affected services
- **Summary**: What happened
- **Timeline**: Key events with timestamps
- **Root Cause**: Technical analysis
- **Resolution**: Steps taken to resolve
- **Action Items**: Follow-up tasks
- **Lessons Learned**: Key takeaways
**Front matter template:**
```markdown
---
title: Incident Title
incident_id: YYYY-MM-DD-NNN
date: YYYY-MM-DD
time_start: HH:MM:SS UTC
time_end: HH:MM:SS UTC
status: Resolved
severity: SEV-2
tags:
- kubernetes
- longhorn
- storage
---
```
### 2. `log.md` (Recommended)
Detailed technical log of all recovery actions. Must include:
- Commands executed with timestamps
- Command output (relevant portions)
- Decision rationale for each action
- Outcome of each action
- Next steps identified
Format:
```markdown
## [Time] Action Description
**Command:** `actual command run`
**Output:**
```
relevant output
```
**Decision:** Why this action was taken
**Outcome:** What happened
**Next:** What to do next
```
### 3. Mermaid Diagrams
Include at least one Mermaid diagram in each incident to visualize:
- Architecture/flow before incident
- Failure propagation
- Recovery process
- New architecture after fixes
**Example theme usage:**
```mermaid
%%{init: { 'theme': 'forest', 'themeVariables': { 'primaryColor': '#ffdfd3', 'edgeLabelBackground':'#fff' }}}%%
```
Available themes: `default`, `base`, `forest`, `dark`, `neutral`
**Recommended diagrams:**
- `incident-flow.mmd`: Timeline/flow of the incident
- `architecture.mmd`: Affected components architecture
- `recovery-flow.mmd`: Recovery steps visualization
- `dependency-tree.mmd`: Component dependencies showing failure path
## Incident Severity Definitions
| Severity | Description | Response Time | Impact |
|----------|-------------|---------------|--------|
| SEV-1 | Critical system-wide outage | Immediate (24/7) | Multiple services down, potential data loss |
| SEV-2 | Major service degradation | < 1 hour | Single critical service down |
| SEV-3 | Partial service degradation | < 4 hours | Non-critical service affected |
| SEV-4 | Minor issue | Next business day | Cosmetic or non-impacting |
## Available Ansible Playbooks for Recovery
This collection provides comprehensive infrastructure management via Ansible.
Always use `-i inventory/hosts.yml` when running playbooks.
### Master Playbooks (Run in order for full recovery)
| Playbook | Purpose | Targets |
|----------|---------|---------|
| `playbooks/01_system.yml` | System setup (hostnames, iSCSI, Docker, Longhorn, DNS) | raspberries |
| `playbooks/02_setup.yml` | Infrastructure setup (NFS backup, PostgreSQL, Gitea) | localhost, postgres, gitea |
| `playbooks/03_cicd.yml` | CI/CD pipeline (Gitea tokens, Docker Compose, ArgoCD) | localhost, gitea |
| `playbooks/04_tools.yml` | Tool deployment (Hashicorp Vault, Crowdsec) | tools group |
| `playbooks/05_backup.yml` | Backup configuration | localhost |
### Component-Specific Playbooks
#### System
| Playbook | Purpose | Notes |
|----------|---------|-------|
| `playbooks/system/rpi.yml` | Raspberry Pi hostname setup | |
| `playbooks/system/dns.yml` | DNS/pi-hole configuration | |
| `playbooks/system/ssl.yml` | SSL certificate setup with step-ca | |
| `playbooks/system/prepare_disks.yml` | Disk partitioning and formatting | |
| `playbooks/system/system_docker.yml` | Docker installation with custom storage | Storage at `/mnt/arcodange/docker` |
| `playbooks/system/k3s_config.yml` | K3s configuration (Traefik, Longhorn HelmCharts) | **Key for k3s** |
| `playbooks/system/system_k3s.yml` | K3s cluster deployment | Uses k3s-ansible collection |
| `playbooks/system/iscsi_longhorn.yml` | iSCSI client for Longhorn | Prerequisite for Longhorn |
| `playbooks/system/k3s_dns.yml` | K3s DNS configuration | |
| `playbooks/system/k3s_ssl.yml` | K3s SSL/traefik certificates | |
#### Storage
| Playbook | Purpose | Notes |
|----------|---------|-------|
| `playbooks/setup/backup_nfs.yml` | Longhorn RWX NFS backup volume | Creates 50Gi PVC + recurring backups |
| `playbooks/backup/k3s_pvc.yml` | PVC backup scripts | Creates `/opt/k3s_volumes/backup.sh` and `restore.sh` |
#### Backup
| Playbook | Purpose | Notes |
|----------|---------|-------|
| `playbooks/backup/backup.yml` | Main backup orchestration | Calls postgres, gitea, k3s_pvc |
| `playbooks/backup/postgres.yml` | PostgreSQL database backup | Docker exec pg_dumpall |
| `playbooks/backup/gitea.yml` | Gitea backup | Uses gitea dump command |
| `playbooks/backup/cron_report.yml` | Mail utility for cron reports | |
| `playbooks/backup/cron_report_mailutility.yml` | MTA configuration | |
### Inventory File
**File:** `inventory/hosts.yml`
**Groups:**
- `raspberries`: pi1, pi2, pi3 (Raspberry Pi nodes)
- `local`: localhost, pi1, pi2, pi3
- `postgres`: pi2 (PostgreSQL host)
- `gitea`: pi2 (Gitea host, inherits postgres)
- `pihole`: pi1, pi3 (DNS hosts)
- `step_ca`: pi1, pi2, pi3 (Certificate authority)
- `all`: All above groups
**Important:** All playbooks MUST be run with `-i inventory/hosts.yml` flag:
```bash
ansible-playbook -i inventory/hosts.yml playbooks/01_system.yml
```
### Handy Commands for Incident Response
```bash
# Check all pods
kubectl get pods -A
# Check Longhorn specifically
kubectl get pods -n longhorn-system
kubectl get volumes -n longhorn-system
kubectl get replicas -n longhorn-system
# Check storage
kubectl get pv -A
kubectl get pvc -A
kubectl get csidriver
# Check nodes
kubectl get nodes -o wide
kubectl describe node <nodename>
# Force Longhorn HelmChart reconcile (k3s-specific)
sudo touch /var/lib/rancher/k3s/server/manifests/longhorn-install.yaml
# Restart Longhorn gracefully (never `delete --all --force`; see the 2026-04-13 incident)
kubectl rollout restart daemonset/longhorn-manager -n longhorn-system
# Check Longhorn data on disk
ls /mnt/arcodange/longhorn/replicas/
# Check Docker storage
ls /mnt/arcodange/docker/overlay2/ | head
# Run ansible playbook (dry-run first)
ansible-playbook -i inventory/hosts.yml playbooks/01_system.yml --check --diff
ansible-playbook -i inventory/hosts.yml playbooks/01_system.yml --limit pi1
```
### K3s-Specific Recovery Notes
Longhorn is installed via **HelmChart manifest** (k3s native):
- File: `/var/lib/rancher/k3s/server/manifests/longhorn-install.yaml`
- To trigger reconcile: `touch` the file (k3s watches for changes)
- DO NOT use `helm install` directly - it may conflict with k3s HelmChart controller
Traefik is also installed via HelmChart manifest:
- File: `/var/lib/rancher/k3s/server/manifests/traefik-v3.yaml`
## Incident Templates
### Quick Start Template
```markdown
---
title: [Short Description]
incident_id: YYYY-MM-DD-NNN
date: $(date +%Y-%m-%d)
time_start: $(date +%H:%M:%S)
status: Investigating
severity: SEV-2
tags:
- tag1
- tag2
---
## Summary
[1-2 sentences describing the issue]
## Impact
[What services/users are affected]
## Timeline
| Time | Event | Owner |
|------|-------|-------|
| HH:MM | Initial detection | @user |
| HH:MM | Investigation started | @user |
| HH:MM | Root cause identified | @user |
| HH:MM | Resolution applied | @user |
| HH:MM | Service restored | @user |
## Root Cause
[Technical analysis]
## Resolution
[Step-by-step what was done]
## Mermaid Diagram
%%{init: { 'theme': 'forest' }}%%
graph TD
A[Component A] -->|depends on| B[Component B]
B -->|failed due to| C[Component C]
C -->|power cut| D[Root Cause]
```
*remember to always do this for labels:*
- have a space before a filepath
- no parenthesis '()'
- use <br> instead of \n for new lines
## Action Items
- [ ] Task 1
- [ ] Task 2
## Lessons Learned
- Lesson 1
- Lesson 2
```
## Contributing to Incident Documentation
1. **During Incident**: Focus on resolution, log commands and outputs in `log.md`
2. **After Resolution**: Create/update the `README.md` with full incident details
3. **Add Diagrams**: Include at least one Mermaid diagram to visualize the issue
4. **Peer Review**: Have another team member review before closing
5. **Update Templates**: Improve templates based on what was missing
## Directory Index
| Incident | Date | Severity | Status |
|----------|------|----------|--------|
| [2026-04-13-power-cut](./2026-04-13-power-cut/README.md) | 2026-04-13 | SEV-1 | In Progress |

View File

@@ -0,0 +1,244 @@
# Cluster Recovery Agent Instructions
You are recovering the Arcodange homelab k3s cluster after an outage (power cut, node failure, or
Longhorn reinstall). Your job is to assess damage, run the appropriate Ansible playbooks and
kubectl commands, and bring the cluster back to a fully healthy state.
You do NOT need to modify any code. All recovery tooling already exists.
---
## Cluster Overview
| Component | Details |
|-----------|---------|
| Nodes | pi1, pi2, pi3 (Raspberry Pi, SSH via `pi<N>.home`) |
| k8s distribution | k3s |
| Storage | Longhorn (`/mnt/arcodange/longhorn/`) |
| GitOps | ArgoCD (apps auto-sync from `gitea.arcodange.lab/arcodange-org/`) |
| Secrets | HashiCorp Vault (`tools` namespace, manual unseal) |
| Ingress | Traefik + CrowdSec bouncer |
| Working dir | `/Users/gabrielradureau/Work/Arcodange/factory/ansible/arcodange/factory/` |
| Inventory | `inventory/hosts.yml` |
**Critical dependency:** ERP (Dolibarr) uses Vault-rotated DB credentials written to its PVC.
**Always recover and unseal Vault before scaling ERP up.**
---
## Step 0 — Assess Damage
Run these first to understand what is broken:
```bash
# Overall pod health
kubectl get pods -A | grep -v Running | grep -v Completed
# PVC health (anything not Bound is a problem)
kubectl get pvc -A | grep -v Bound
# Longhorn volume states
kubectl get volumes.longhorn.io -n longhorn-system
# Longhorn manager health (prerequisite for all recovery)
kubectl get pods -n longhorn-system -l app=longhorn-manager
```
---
## Step 1 — Longhorn Volume Recovery
### Path A — Fast path (backup file exists, Volume CRDs were backed up)
Check if a recent backup exists on pi1:
```bash
ssh pi1.home "ls -lt /mnt/backups/k3s_pvc/backup_*.volumes | head -5"
```
If a backup file exists and is recent (from before the incident):
```bash
ssh pi1.home "kubectl apply -f /mnt/backups/k3s_pvc/backup_<YYYYMMDD>.volumes"
```
Then verify PVCs bound and skip to Step 2.
### Path B — Block-device injection (no usable backup, raw replica files intact)
Use this when PVCs are `Lost`/`Terminating` and no Volume CRD backup is available.
**Check which volumes need recovery:**
```bash
# Volumes with no PVC or Lost/Terminating PVC
kubectl get pvc -A | grep -v Bound
```
**For each failed volume, create a vars file** following the pattern in:
`playbooks/recover/longhorn_data_vars.example.yml`
Existing vars files from the 2026-04-13 incident (reusable as references):
- `playbooks/recover/longhorn_data_vars_remaining.yml` — prometheus, alertmanager, redis, backups-rwx
- `playbooks/recover/longhorn_data_vars_erp_vault.yml` — erp, hashicorp-vault (audit + data)
- `playbooks/recover/longhorn_data_vars_clickhouse.yml` — clickhouse
**Key rules for the vars file:**
- `source_node`/`source_dir` can be omitted — Phase 0 auto-discovers the largest non-Rebuilding replica
- Set `workload_name: ""` for ERP — it must not scale up until Vault is unsealed
- For StatefulSets with multiple PVCs (e.g. Vault), set `workload_name: ""` on all but the last entry
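A minimal vars file sketch — `longhorn_recovery_volumes`, `source_node`/`source_dir`, and `workload_name` appear in this document; the remaining key names are assumptions, so check `longhorn_data_vars.example.yml` for the authoritative schema:
```bash
cat > playbooks/recover/longhorn_data_vars_redis.yml <<'EOF'
longhorn_recovery_volumes:
  - pv_name: pvc-d1d5482b-81c8-4d7c-a528-7a57ef47a5ce   # assumed key name
    pvc_name: redis-storage-redis-0                     # assumed key name
    namespace: tools                                    # assumed key name
    size: 1Gi                                           # assumed key name
    workload_name: redis          # "" to skip the scale-up phase
    # source_node / source_dir omitted: Phase 0 auto-discovers the best replica
EOF
```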
**Run the recovery playbook:**
```bash
ansible-playbook -i inventory/hosts.yml playbooks/recover/longhorn_data.yml \
-e @playbooks/recover/longhorn_data_vars_<NAME>.yml
```
The playbook is **idempotent** — safe to re-run if it fails midway.
**Playbook phases (for context when troubleshooting):**
| Phase | What it does |
|-------|-------------|
| 0 | Auto-discovers best replica dir (skips `Rebuilding: true`) |
| 1 | Backs up untouched replica dir to `/home/pi/arcodange/backups/longhorn-recovery/` |
| 2 | Merges snapshot+head layers into a single `.img` via `merge-longhorn-layers.py` |
| 3 | **Scales down workloads first**, then clears stuck Terminating PVCs, creates Volume CRD |
| 4 | Scale down (second pass, idempotent) |
| 5 | Attaches volume via maintenance ticket to source node |
| 6 | `mkfs.ext4` (if unformatted) + `rsync` from merged image into live block device |
| 7 | Removes maintenance ticket (volume detaches) |
| 8 | Creates PV (Retain, no claimRef) + PVC pinned to PV |
| 9 | Scales up workloads, waits for readyReplicas ≥ 1 (failures here are `ignore_errors: yes`) |
**Common Phase 8 failure — StatefulSet re-creates PVCs before they can be pinned:**
The playbook handles this automatically (scales down before finalizer removal). If you still hit it:
```bash
kubectl scale statefulset <name> -n <namespace> --replicas=0
kubectl patch pvc <pvc-name> -n <namespace> --type=merge -p '{"metadata":{"finalizers":null}}'
kubectl delete pvc <pvc-name> -n <namespace>
# Then re-run the playbook
```
---
## Step 2 — Unseal HashiCorp Vault
After Vault's PVCs are recovered, the pod boots **sealed**. Check:
```bash
kubectl get pod hashicorp-vault-0 -n tools
kubectl exec hashicorp-vault-0 -n tools -- vault status 2>/dev/null | grep Sealed
```
If sealed, run the unseal playbook (requires interactive terminal for the Gitea password prompt):
```bash
ansible-playbook -i inventory/hosts.yml playbooks/tools/hashicorp_vault.yml
```
Unseal keys are at `~/.arcodange/cluster-keys.json` on the local machine. The playbook reads them automatically.
After the playbook completes, verify:
```bash
kubectl get pod hashicorp-vault-0 -n tools # must be 1/1 Ready
kubectl exec hashicorp-vault-0 -n tools -- vault status | grep Sealed # must be false
```
---
## Step 3 — Scale Up ERP
Only after Vault is unsealed and Ready:
```bash
kubectl scale deployment erp -n erp --replicas=1
kubectl rollout status deployment/erp -n erp
```
---
## Step 4 — Reconfigure Tools (CrowdSec, etc.)
Run if CrowdSec bouncer or Traefik middleware needs reconfiguring:
```bash
# Standard run (bouncer key + Traefik middleware + restart)
ansible-playbook -i inventory/hosts.yml playbooks/tools/crowdsec.yml
# Include captcha HTML injection (use when captcha page is broken)
ansible-playbook -i inventory/hosts.yml playbooks/tools/crowdsec.yml --tags never,all
```
If crowdsec-agent or crowdsec-appsec pods are stuck in `Error` after a long outage,
the playbook handles restarting them automatically.
---
## Step 5 — Re-enable ArgoCD selfHeal
Check if `selfHeal` was disabled during recovery (look for `selfHeal: false` in the tools app):
```bash
grep -A5 "tools:" /Users/gabrielradureau/Work/Arcodange/factory/argocd/values.yaml
```
If disabled, re-enable it by editing `argocd/values.yaml` and setting `selfHeal: true`,
then syncing the ArgoCD app:
```bash
kubectl get app tools -n argocd
```
---
## Step 6 — Final Verification
```bash
# All pods running
kubectl get pods -A | grep -v Running | grep -v Completed | grep -v "^NAME"
# All PVCs bound
kubectl get pvc -A | grep -v Bound
# All Longhorn volumes healthy
kubectl get volumes.longhorn.io -n longhorn-system
# Run a fresh backup to capture the recovered state
ansible-playbook -i inventory/hosts.yml playbooks/backup/backup.yml \
-e backup_root_dir=/mnt/backups
```
---
## Key Files Reference
| File | Purpose |
|------|---------|
| `playbooks/recover/longhorn_data.yml` | Main block-device recovery playbook |
| `playbooks/recover/longhorn.yml` | Recovery when Volume CRDs still exist |
| `playbooks/recover/longhorn_data_vars.example.yml` | Template for recovery vars |
| `playbooks/recover/longhorn_data_vars_erp_vault.yml` | Vars for erp + vault (2026-04-13 incident) |
| `playbooks/recover/longhorn_data_vars_remaining.yml` | Vars for other volumes (2026-04-13 incident) |
| `playbooks/backup/backup.yml` | Full backup (postgres + gitea + k3s PVCs + Longhorn CRDs) |
| `playbooks/backup/k3s_pvc.yml` | PV/PVC/Longhorn Volume CRD backup |
| `playbooks/tools/hashicorp_vault.yml` | Vault unseal + OIDC reconfiguration |
| `playbooks/tools/crowdsec.yml` | CrowdSec bouncer + Traefik middleware setup |
| `docs/adr/20260414-longhorn-pvc-recovery.md` | Full incident ADR with all recovery methods |
| `~/.arcodange/cluster-keys.json` | Vault unseal keys (local machine only) |
---
## Decision Tree
```
Cluster down after outage
├─ kubectl works? ──No──▶ Check k3s: `systemctl status k3s` on pi1/pi2/pi3
└─ Yes
├─ PVCs all Bound? ──Yes──▶ Skip to Step 2 (check Vault)
└─ No
├─ Recent .volumes backup on pi1? ──Yes──▶ Path A (kubectl apply backup)
└─ No
├─ Longhorn Volume CRDs exist? ──Yes──▶ playbooks/recover/longhorn.yml
└─ No ──▶ Path B (longhorn_data.yml block-device injection)
Check replica dirs exist first:
ssh pi{1,2,3}.home "sudo du -sh /mnt/arcodange/longhorn/replicas/pvc-*"
```

View File

@@ -0,0 +1,360 @@
# Runbook: Longhorn Block-Device Data Recovery
**When to use:** Longhorn has been fully reinstalled (nuclear cleanup). Volume CRDs are gone.
Application PVCs are stuck `Terminating` or `Lost`. The raw replica `.img` files still exist
on disk across the nodes. kubectl/k8s objects cannot help — we must work directly with the
Longhorn replica directories and block devices.
**Automated version:** `playbooks/recover/longhorn_data.yml`
---
## Mental Model
Longhorn stores each replica as a chain of sparse raw image files inside a directory named
`<pv-name>-<random-hex>` under `<longhorn_data_path>/replicas/`. Each directory contains:
```
volume.meta — engine state (Head filename, Parent snapshot, Dirty flag)
volume-head-NNN.img — active write log (sparse, only changed blocks)
volume-head-NNN.img.meta — head metadata
volume-snap-<uuid>.img — snapshot at a point in time (sparse, full state)
volume-snap-<uuid>.img.meta — snapshot metadata
revision.counter — monotonically increasing write counter
```
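`volume.meta` is plain JSON and can be inspected directly:
```bash
sudo python3 -m json.tool \
  /mnt/arcodange/longhorn/replicas/<pv-name>-<hex>/volume.meta
# Key fields for recovery: "Head", "Parent", "Dirty", "Rebuilding"
```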
After a nuclear cleanup + reinstall, Longhorn creates **new empty replica directories** with
new random hex suffixes. The old directories (with data) are left on disk but orphaned.
**Why directory-swap fails:** the old `volume.meta` has a different engine generation and
`Dirty: true`. Longhorn detects the inconsistency across replicas and rebuilds from the
"cleanest" source (the new empty pi1 replica), overwriting the old data.
**What works:** extract the filesystem from the untouched replica directory directly, then
inject the data files into the live Longhorn block device while the volume is temporarily
attached in maintenance mode.
---
## Decision Tree
```
Are Volume CRDs present in Longhorn?
├── YES → normal PV/PVC restore is enough, use playbooks/recover/longhorn.yml
└── NO
└── Are replica directories present on disk?
├── NO → data is lost, provision fresh volumes
└── YES
└── Is there an untouched replica dir (timestamps from before the incident)?
├── NO → data likely unrecoverable (all dirs were zeroed during reconciliation)
└── YES → follow this runbook
```
---
## Step 0 — Pre-flight: Inventory Surviving Replica Directories
On each node, list replica dirs and their sizes. Dirs with actual data are large (>16K).
New empty dirs created by Longhorn are always exactly 16K.
```bash
for node in pi1 pi2 pi3; do
echo "=== $node ==="
ssh $node "sudo du -sh /mnt/arcodange/longhorn/replicas/pvc-<VOLUME>-* 2>/dev/null"
done
```
**Key rule:** identify the replica dir that was **never touched** by the reinstall — it has
old timestamps (from before the incident) and its size matches the original volume usage.
This is your recovery source. **Back it up before touching anything.**
```bash
# On the node that has the untouched dir:
sudo mkdir -p /home/pi/arcodange/backups/longhorn-recovery/<pvc-name>/
sudo cp -a /mnt/arcodange/longhorn/replicas/<pv-name>-<old-hex>/ \
/home/pi/arcodange/backups/longhorn-recovery/<pvc-name>/
```
---
## Step 1 — Reconstruct the Filesystem
The replica directory contains a snapshot chain. Each layer is a sparse raw image — unchanged
blocks appear as zeroed sparse regions, only written blocks contain data. To reconstruct the
full filesystem, layers must be merged: head takes priority, then snapshot.
Use `docs/incidents/2026-04-13-power-cut/tools/merge-longhorn-layers.py`:
```bash
# On the node holding the backup:
sudo python3 merge-longhorn-layers.py \
/home/pi/arcodange/backups/longhorn-recovery/<pvc-name>/<pv-name>-<old-hex>/ \
/tmp/<pvc-name>-merged.img
# Verify the filesystem mounts
sudo mkdir -p /mnt/recovery-<pvc-name>
sudo mount -o loop /tmp/<pvc-name>-merged.img /mnt/recovery-<pvc-name>
sudo ls -lah /mnt/recovery-<pvc-name>/
sudo umount /mnt/recovery-<pvc-name>
```
If mount fails with "wrong fs type" or "bad superblock":
- The snapshot `.img` is all-zero (was overwritten by a prior Longhorn reconciliation)
- Try the next oldest replica dir from another node
- Check with `sudo od -A x -t x1z -v snap.img | grep -v ' 00 00...' | head -5`
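If the failure is a dirty ext4 journal rather than zeroed data (see the pitfalls table later in this runbook), skip journal replay for a read-only look:
```bash
sudo mount -o loop,ro,noload /tmp/<pvc-name>-merged.img /mnt/recovery-<pvc-name>
```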
---
## Step 2 — Create the Longhorn Volume CRD
Longhorn needs to know about the volume before its block device can be used.
```bash
kubectl apply -f - <<EOF
apiVersion: longhorn.io/v1beta2
kind: Volume
metadata:
name: <pv-name>
namespace: longhorn-system
spec:
accessMode: rwo # or rwx
dataEngine: v1
frontend: blockdev
numberOfReplicas: 3
size: "<size-in-bytes>" # e.g. "134217728" for 128Mi
EOF
```
Wait for replicas to appear:
```bash
kubectl get replicas.longhorn.io -n longhorn-system | grep <pv-name>
# Expect 3 replicas in "stopped" state
```
---
## Step 3 — Attach the Volume in Maintenance Mode
Longhorn only creates the block device (`/dev/longhorn/<pv-name>`) when the volume is
attached to a node. Use a `VolumeAttachment` ticket to attach without a pod.
Choose `<target-node>` = the same node where the backup/merged image is stored (avoids
copying large files across the network).
```bash
kubectl apply -f - <<EOF
apiVersion: longhorn.io/v1beta2
kind: VolumeAttachment
metadata:
name: <pv-name>
namespace: longhorn-system
spec:
attachmentTickets:
recovery:
generation: 0
id: recovery
nodeID: <target-node>
parameters:
disableFrontend: "false"
type: longhorn-api
volume: <pv-name>
EOF
kubectl wait --for=jsonpath='{.status.state}'=attached \
volumes.longhorn.io/<pv-name> -n longhorn-system --timeout=120s
```
---
## Step 4 — Scale Down the Workload
Always stop the workload before touching the data to prevent concurrent writes and filesystem
corruption.
```bash
# For a Deployment:
kubectl scale deployment <name> -n <namespace> --replicas=0
# For a StatefulSet:
kubectl scale statefulset <name> -n <namespace> --replicas=0
```
---
## Step 5 — Inject Data Files via Block Device
```bash
ssh <target-node> bash <<'SHELL'
# Mount the live block device
sudo mkdir -p /mnt/recovery-live
sudo mount /dev/longhorn/<pv-name> /mnt/recovery-live
# Mount the reconstructed image (if not already mounted)
sudo mkdir -p /mnt/recovery-src
sudo mount -o loop /tmp/<pvc-name>-merged.img /mnt/recovery-src
# Sync: only the application data files, not lost+found
sudo rsync -av --exclude='lost+found' /mnt/recovery-src/ /mnt/recovery-live/
# Verify
sudo ls -lah /mnt/recovery-live/
# Unmount both
sudo umount /mnt/recovery-src
sudo umount /mnt/recovery-live
SHELL
```
---
## Step 6 — Detach the Volume
```bash
kubectl patch volumeattachments.longhorn.io <pv-name> \
-n longhorn-system --type json \
-p '[{"op":"remove","path":"/spec/attachmentTickets/recovery"}]'
kubectl wait --for=jsonpath='{.status.state}'=detached \
volumes.longhorn.io/<pv-name> -n longhorn-system --timeout=60s
```
---
## Step 7 — Restore PV and PVC
Clear stuck Terminating PV/PVC finalizers first if they exist:
```bash
kubectl patch pv <pv-name> --type=merge -p '{"metadata":{"finalizers":null}}' 2>/dev/null
kubectl patch pvc <pvc-name> -n <namespace> --type=merge \
-p '{"metadata":{"finalizers":null}}' 2>/dev/null
# Wait a moment for them to delete
```
Recreate the PV with `Retain` policy and no `claimRef`:
```bash
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolume
metadata:
name: <pv-name>
annotations:
pv.kubernetes.io/provisioned-by: driver.longhorn.io
spec:
accessModes: [ReadWriteOnce] # match original
capacity:
storage: <size> # e.g. 128Mi
csi:
driver: driver.longhorn.io
fsType: ext4
volumeHandle: <pv-name>
volumeAttributes:
dataEngine: v1
dataLocality: disabled
disableRevisionCounter: "true"
numberOfReplicas: "3"
staleReplicaTimeout: "30"
persistentVolumeReclaimPolicy: Retain
storageClassName: longhorn
volumeMode: Filesystem
EOF
```
Recreate the PVC pinned to this PV:
```bash
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: <pvc-name>
namespace: <namespace>
spec:
accessModes: [ReadWriteOnce]
resources:
requests:
storage: <size>
storageClassName: longhorn
volumeMode: Filesystem
volumeName: <pv-name>
EOF
```
---
## Step 8 — Scale Up and Verify
```bash
kubectl scale deployment <name> -n <namespace> --replicas=1
kubectl wait --for=condition=Ready pod -l app=<name> -n <namespace> --timeout=120s
```
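For StatefulSets, `kubectl wait` on pod labels may match nothing (see the pitfalls table below); wait on the StatefulSet itself:
```bash
kubectl scale statefulset <name> -n <namespace> --replicas=1
kubectl wait --for=jsonpath='{.status.readyReplicas}'=1 \
  statefulset/<name> -n <namespace> --timeout=180s
```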
---
## Pitfalls Learned During 2026-04-13 Recovery
| Pitfall | What happened | Prevention |
|---------|--------------|------------|
| **Directory swap corrupts data** | Longhorn found old `Dirty: true` volume.meta + empty pi1 replica → rebuilt from empty source | Never swap dirs. Use merge tool + block device injection instead |
| **Snapshot is zeroed after swap** | Longhorn reconciliation overwrote snapshot images when rebuilding from empty replica | Back up the untouched dir FIRST before any rename |
| **Multiple dirs per volume on pi3** | Rebuild attempts during the incident created extra dirs | Identify the untouched dir by timestamp AND verify non-zero content with `od` |
| **`Rebuilding: true` replica → all-zeros merged image** | Phase 0 picked a replica mid-rebuild (1.3 GiB actual data, sparse files look large) — merge tool produced an all-zeros image | Check `volume.meta` and skip any dir with `"Rebuilding": true` before merging |
| **`du -sb` gives misleading apparent sizes** | Sparse replica files (8 GiB file, 1.3 GiB actual) appeared larger than healthy 11 GiB replicas | Use `du -sk` (actual disk blocks) not `du -sb` (apparent/logical size) to rank replicas |
| **Dirty journal prevents ro mount** | `mount -o loop,ro` fails with "bad superblock" on an ext4 with unclean shutdown | Use `mount -o loop,ro,noload` to skip journal replay for read-only access |
| **New volume is unformatted** | `mount /dev/longhorn/<pv>` fails with "wrong fs type" on a freshly created volume | Run `mkfs.ext4 -F` before mounting; guard with `blkid` to skip if already formatted |
| **rsync rc=23 on power-cut partitions** | Some filesystem blocks were unreadable ("Structure needs cleaning") → rsync exits 23 | Use `rsync --ignore-errors`; rc=23 is a partial transfer, not a total failure |
| **pod blocks volume re-attach** | Old Error-state pod held a volume attachment claim | Delete old Error pods before scaling up new ones |
| **`kubectl cp` needs `tar`** | Distroless container had no `tar` binary | Mount block device directly on the node instead |
| **VolumeAttachment ticket removal** | Deleting a VolumeAttachment object causes Longhorn to immediately recreate it | Patch the `recovery` key out of `spec.attachmentTickets` instead of deleting the object |
| **Phase 7 wait for `detached` times out** | After removing the recovery ticket, a workload may immediately create its own ticket | Wait for the `recovery` ticket to disappear from `spec.attachmentTickets`, not for full detach |
| **StatefulSet pods not found by label** | `kubectl get pod -l app=<name>` returns nothing for StatefulSet pods | Wait on `readyReplicas ≥ 1` on the StatefulSet object, not on pod labels |
| **`set_fact` overridden by `-e @file`** | Ansible extra vars have highest precedence — `set_fact: longhorn_recovery_volumes` was silently ignored | Use a different variable name (`_volumes`) for the resolved list, never reassign the extra var name |
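The sparse-file pitfalls above are easy to reproduce. A throwaway demo (safe to run anywhere) shows why apparent size must not be used to rank replicas:
```bash
truncate -s 8G /tmp/sparse-demo.img     # 8 GiB apparent size, zero blocks allocated
du -sb /tmp/sparse-demo.img             # → 8589934592 (apparent/logical size; misleading)
du -sk /tmp/sparse-demo.img             # → 0 (actual allocated blocks)
dd if=/dev/urandom of=/tmp/sparse-demo.img bs=1M count=1 conv=notrunc
du -sk /tmp/sparse-demo.img             # → ~1024; only real data grows actual usage
rm /tmp/sparse-demo.img
```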
---
## Identifying the Right Replica Directory
When multiple old dirs exist for the same volume on a node, pick the recovery source as follows:
1. **Skip `Rebuilding: true`:** check `volume.meta` first — a dir that was being rebuilt when
the incident happened has incomplete data (sparse files are allocated but mostly zeroed):
```bash
python3 -c "import json; d=json.load(open('volume.meta')); print('Rebuilding:', d['Rebuilding'])"
```
Only consider dirs where `Rebuilding: false`.
2. **Actual size:** `sudo du -sk <dir>` (actual disk usage in KB — not `du -sb` which returns
apparent/logical size and is misleading for sparse files). Pick the largest actual size.
3. **Timestamps:** prefer the most recently modified before the incident date.
4. **Snapshot chain:** if Rebuilding is false on multiple dirs, check `volume.meta` for
`"Dirty": false` (clean shutdown) vs `"Dirty": true`. Prefer clean if available.
5. **Content check:** verify the snapshot is not all zeros:
```bash
sudo od -A x -t x1z -v volume-snap-*.img | grep -v ' 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00' | head -3
```
If the output is empty (all zeros), the snapshot was overwritten. Try another node.
**Summary rule:** `Rebuilding: false` → largest `du -sk` → non-zero snapshot content.
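A small helper that applies steps 1–2 of this rule on a single node (a sketch: it assumes the standard replica layout with one `volume.meta` per dir and must run as root). Still run the step 5 content check on the winner before using it:
```bash
#!/bin/bash
# rank-replicas.sh <pv-prefix>: list candidate replica dirs on this node, best first
set -eu
PREFIX="$1"
BASE=/mnt/arcodange/longhorn/replicas
for dir in "$BASE/$PREFIX"-*; do
  [ -d "$dir" ] || continue
  # Step 1: skip replicas that were mid-rebuild when the incident hit
  rebuilding=$(python3 -c "import json; print(json.load(open('$dir/volume.meta')).get('Rebuilding', False))" 2>/dev/null || echo unknown)
  if [ "$rebuilding" = "True" ]; then
    echo "SKIP (Rebuilding: true): $dir" >&2
    continue
  fi
  # Step 2: rank by actual disk blocks (du -sk), not apparent size
  printf '%s\t%s\n' "$(du -sk "$dir" | cut -f1)" "$dir"
done | sort -rn
```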
---
## Reference: Key Commands
```bash
# List all replica dirs for a volume across all nodes
for n in pi1 pi2 pi3; do echo "==$n=="; ssh $n "sudo ls /mnt/arcodange/longhorn/replicas/ | grep <pv-prefix>"; done
# Check Longhorn volume state
kubectl get volumes.longhorn.io -n longhorn-system <pv-name>
# Check VolumeAttachment tickets
kubectl get volumeattachments.longhorn.io -n longhorn-system <pv-name> \
-o jsonpath='{.spec.attachmentTickets}'
# Check Longhorn block device existence on a node
ssh <node> "ls /dev/longhorn/<pv-name>"
# Verify filesystem content without starting the app
ssh <node> "sudo mount /dev/longhorn/<pv-name> /mnt/check && sudo ls /mnt/check && sudo umount /mnt/check"
```

View File

@@ -24,12 +24,15 @@
- name: define backup command
set_fact:
backup_cmd: |-
echo "
$(kubectl get -A pv -o yaml)
---
$(kubectl get -A pvc -o yaml)
"
# PVs + PVCs + Longhorn Volume CRDs (critical for fast recovery — without Volume CRDs,
# Longhorn cannot re-associate orphaned replica dirs after a reinstall and forces
# full block-device injection recovery. See docs/adr/20260414-longhorn-pvc-recovery.md)
backup_cmd: >-
kubectl get -A pv,pvc -o yaml
&& echo '---'
&& kubectl get -A volumes.longhorn.io -o yaml
&& echo '---'
&& kubectl get -A settings.longhorn.io -o yaml
- name: test backup_cmd
ansible.builtin.shell: |
@@ -65,19 +68,34 @@
#!/bin/bash
set -e
BACKUP_DIR="{{ backup_dir }}"
PRIMARY_BACKUP_DIR="{{ backup_dir }}"
FALLBACK_BACKUP_DIR="/home/pi/arcodange/backups/k3s_pvc"
# Check if fallback directory exists and has backups
if [ -d "$FALLBACK_BACKUP_DIR" ] && ls "$FALLBACK_BACKUP_DIR"/*.volumes 1>/dev/null 2>&1; then
BACKUP_DIR="$FALLBACK_BACKUP_DIR"
echo "Using fallback backup directory: $BACKUP_DIR"
elif [ -d "$PRIMARY_BACKUP_DIR" ] && ls "$PRIMARY_BACKUP_DIR"/*.volumes 1>/dev/null 2>&1; then
BACKUP_DIR="$PRIMARY_BACKUP_DIR"
else
echo "No backup directory found"
exit 1
fi
if [ -z "$1" ]; then
FILE=$(ls -1t "$BACKUP_DIR"/backup_*.volumes | head -n 1)
echo "Aucune date fournie, restauration du dernier dump : $FILE"
echo "No date provided, restoring latest dump: $FILE"
else
FILE="$BACKUP_DIR/backup_$1.volumes"
if [ ! -f "$FILE" ]; then
echo "Fichier $FILE introuvable"
echo "File $FILE not found"
exit 1
fi
fi
kubectl apply -f "$FILE"
echo "Restauration des volumes k3s terminée."
echo "K3S volumes restoration complete."
echo "NOTE: file includes PVs, PVCs, and Longhorn Volume CRDs."
echo "If Longhorn replica dirs are still orphaned after this restore,"
echo "fall back to: ansible-playbook playbooks/recover/longhorn_data.yml"

View File

@@ -0,0 +1,536 @@
---
- name: Recover Longhorn from Power Cut - CSI Driver Registration Loss
hosts: raspberries:&local
gather_facts: yes
become: yes
vars:
# Backup locations
primary_backup_dir: "/mnt/backups/k3s_pvc"
fallback_backup_dir: "/home/pi/arcodange/backups/k3s_pvc"
scripts_dir: "/opt/k3s_volumes"
# Longhorn configuration
longhorn_manifest_path: "/var/lib/rancher/k3s/server/manifests/longhorn-install.yaml"
longhorn_namespace: "longhorn-system"
longhorn_chart_name: "longhorn-install"
longhorn_chart_namespace: "kube-system"
# Data paths (DO NOT MODIFY - points to actual volume data)
longhorn_data_path: "/mnt/arcodange/longhorn"
tasks:
# ========================================================================
# PHASE 0: Pre-flight Checks
# ========================================================================
- name: Verify data directory exists on control plane
ansible.builtin.stat:
path: "{{ longhorn_data_path }}"
register: data_dir
when: inventory_hostname == 'pi1'
run_once: true
- name: FAIL if data directory missing
ansible.builtin.fail:
msg: "CRITICAL: Longhorn data directory {{ longhorn_data_path }} does not exist. Aborting recovery."
when: inventory_hostname == 'pi1' and not data_dir.stat.exists
run_once: true
- name: Check for fallback backups on pi1
ansible.builtin.shell: ls {{ fallback_backup_dir }}/backup_*.volumes 2>/dev/null
register: fallback_backup_check
changed_when: false
when: inventory_hostname == 'pi1'
run_once: true
ignore_errors: yes
- name: Check for primary backups on pi1
ansible.builtin.shell: ls {{ primary_backup_dir }}/backup_*.volumes 2>/dev/null
register: primary_backup_check
changed_when: false
when: inventory_hostname == 'pi1'
run_once: true
ignore_errors: yes
- name: Set backup fact
ansible.builtin.set_fact:
has_backups: "{{ (fallback_backup_check.rc == 0 and fallback_backup_check.stdout | trim != '') or (primary_backup_check.rc == 0 and primary_backup_check.stdout | trim != '') }}"
when: inventory_hostname == 'pi1'
run_once: true
- name: FAIL if no backups found
ansible.builtin.fail:
msg: "No backup files found in {{ primary_backup_dir }} or {{ fallback_backup_dir }}. Cannot proceed."
when: inventory_hostname == 'pi1' and not has_backups | bool
run_once: true
# ========================================================================
# PHASE 1: Diagnosis - Check Current State
# ========================================================================
- name: Gather Longhorn namespace status
block:
- name: Check if longhorn-system namespace exists
kubernetes.core.k8s_info:
kind: Namespace
name: "{{ longhorn_namespace }}"
register: longhorn_ns
ignore_errors: yes
run_once: true
delegate_to: localhost
- name: Check CSI driver registration
kubernetes.core.k8s_info:
kind: CSIDriver
name: driver.longhorn.io
register: csi_driver
ignore_errors: yes
run_once: true
delegate_to: localhost
- name: Check Longhorn manager pods
kubernetes.core.k8s_info:
kind: Pod
namespace: "{{ longhorn_namespace }}"
label_selectors:
- app=longhorn-manager
register: managers
ignore_errors: yes
run_once: true
delegate_to: localhost
- name: Set recovery_phase fact
ansible.builtin.set_fact:
recovery_phase: "none"
run_once: true
delegate_to: localhost
- name: Determine recovery phase needed
ansible.builtin.set_fact:
recovery_phase: >-
  {%- if csi_driver.failed -%}
  soft
  {%- elif managers.failed or managers.resources | default([]) | selectattr('status.phase', 'defined') | selectattr('status.phase', 'ne', 'Running') | list | length > 0 -%}
  hard
  {%- else -%}
  none
  {%- endif -%}
run_once: true
delegate_to: localhost
- name: Display recovery diagnosis
ansible.builtin.debug:
msg: "Diagnosis: recovery_phase={{ recovery_phase | default('none') }}. CSI Driver exists: {{ not csi_driver.failed | bool }}, Managers healthy: {{ managers.failed | ternary('unknown', managers.resources | default([]) | selectattr('status.phase', 'defined') | selectattr('status.phase', 'eq', 'Running') | list | length >= 3) | bool }}"
run_once: true
delegate_to: localhost
when: inventory_hostname == 'pi1'
run_once: true
# ========================================================================
# PHASE 2: Soft Recovery - Touch Manifest
# ========================================================================
- name: Execute soft recovery - touch Longhorn manifest
block:
- name: Touch longhorn-install.yaml manifest
ansible.builtin.file:
path: "{{ longhorn_manifest_path }}"
state: touch
register: manifest_touch
when: inventory_hostname == 'pi1'
- name: Wait for k3s to detect manifest change
ansible.builtin.pause:
minutes: 1
when: manifest_touch is changed
- name: Check if Longhorn pods are recreating
kubernetes.core.k8s_info:
kind: Pod
namespace: "{{ longhorn_namespace }}"
register: longhorn_pods
ignore_errors: yes
run_once: true
delegate_to: localhost
- name: Verify soft recovery success
ansible.builtin.set_fact:
soft_recovery_success: >-
{{ (longhorn_pods.resources | default([]) | selectattr('metadata.creationTimestamp', 'defined') | list | length) >= 10 }}
run_once: true
delegate_to: localhost
when: recovery_phase == 'soft' and inventory_hostname == 'pi1'
run_once: true
# ========================================================================
# PHASE 3: Hard Recovery - Delete Driver-Deployer
# ========================================================================
- name: Execute hard recovery - delete driver-deployer pods
block:
- name: Get driver-deployer pods
kubernetes.core.k8s_info:
kind: Pod
namespace: "{{ longhorn_namespace }}"
label_selectors:
- app=longhorn-driver-deployer
register: driver_deployer_pods
ignore_errors: yes
run_once: true
delegate_to: localhost
- name: Delete driver-deployer pods
kubernetes.core.k8s:
state: absent
kind: Pod
namespace: "{{ longhorn_namespace }}"
name: "{{ item.metadata.name }}"
force: yes
grace_period: 0
loop: "{{ driver_deployer_pods.resources | default([]) }}"
when: driver_deployer_pods.resources | default([]) | length > 0
run_once: true
delegate_to: localhost
- name: Wait for HelmChart to recreate driver-deployer
ansible.builtin.pause:
minutes: 2
- name: Check driver-deployer status
kubernetes.core.k8s_info:
kind: Pod
namespace: "{{ longhorn_namespace }}"
label_selectors:
- app=longhorn-driver-deployer
register: new_driver_deployer
ignore_errors: yes
run_once: true
delegate_to: localhost
when: (recovery_phase == 'hard' or (recovery_phase == 'soft' and not soft_recovery_success | default(false))) and inventory_hostname == 'pi1'
run_once: true
# ========================================================================
# PHASE 4: Nuclear Recovery - Full Reinstall
# ========================================================================
- name: Execute nuclear recovery - full Longhorn reinstall
block:
# Step 1: Delete HelmChart
- name: Delete Longhorn HelmChart
kubernetes.core.k8s:
state: absent
kind: HelmChart
namespace: "{{ longhorn_chart_namespace }}"
name: "{{ longhorn_chart_name }}"
force: yes
grace_period: 0
register: helmchart_deleted
ignore_errors: yes
run_once: true
delegate_to: localhost
- name: Wait for HelmChart to be fully removed
ansible.builtin.pause:
seconds: 30
when: helmchart_deleted is changed
run_once: true
# Step 2: Remove Longhorn manifest from filesystem
- name: Remove Longhorn manifest file
ansible.builtin.file:
path: "{{ longhorn_manifest_path }}"
state: absent
when: inventory_hostname == 'pi1'
register: manifest_removed
# Step 3: Remove finalizers from all Longhorn resources
- name: Get list of all Longhorn CRDs
kubernetes.core.k8s_info:
kind: CustomResourceDefinition
label_selectors:
- app=longhorn
register: longhorn_crds
ignore_errors: yes
run_once: true
delegate_to: localhost
- name: Get all Longhorn CR instances
kubernetes.core.k8s_info:
kind: "{{ item.spec.names.kind }}"
namespace: "{{ longhorn_namespace }}"
api_version: "{{ item.spec.group ~ '/' ~ item.spec.versions[0].name }}"
register: cr_instances
ignore_errors: yes
loop: "{{ longhorn_crds.resources | default([]) }}"
run_once: true
delegate_to: localhost
- name: Remove finalizers from all Longhorn CR instances
  kubernetes.core.k8s_json_patch:
    kind: "{{ item.0.item.spec.names.kind }}"
    namespace: "{{ longhorn_namespace }}"
    name: "{{ item.1.metadata.name }}"
    api_version: "{{ item.0.item.spec.group ~ '/' ~ item.0.item.spec.versions[0].name }}"
    patch:
      - op: replace
        path: /metadata/finalizers
        value: []
  # each cr_instances result carries its CRD as .item and the instances found as
  # .resources; subelements pairs them as (result, instance)
  loop: "{{ cr_instances.results | default([]) | subelements('resources', skip_missing=True) }}"
  loop_control:
    label: "{{ item.0.item.spec.names.kind }}/{{ item.1.metadata.name }}"
  when: cr_instances.results | default([]) | length > 0
  run_once: true
  delegate_to: localhost
  ignore_errors: yes
# Step 4: Remove finalizers from PVCs
- name: Get all PVCs with longhorn storage class
kubernetes.core.k8s_info:
kind: PersistentVolumeClaim
register: all_pvcs
ignore_errors: yes
run_once: true
delegate_to: localhost
- name: Remove finalizers from PVCs
kubernetes.core.k8s_json_patch:
kind: PersistentVolumeClaim
namespace: "{{ item.metadata.namespace }}"
name: "{{ item.metadata.name }}"
patch:
- op: replace
path: /metadata/finalizers
value: []
loop: "{{ all_pvcs.resources | default([]) | selectattr('spec.storageClassName', 'defined') | selectattr('spec.storageClassName', 'match', 'longhorn.*') | list }}"
run_once: true
delegate_to: localhost
ignore_errors: yes
# Step 5: Remove namespace finalizers
- name: Remove finalizers from longhorn-system namespace
kubernetes.core.k8s_json_patch:
kind: Namespace
name: "{{ longhorn_namespace }}"
patch:
- op: replace
path: /spec/finalizers
value: []
run_once: true
delegate_to: localhost
ignore_errors: yes
- name: Delete longhorn-system namespace
kubernetes.core.k8s:
state: absent
kind: Namespace
name: "{{ longhorn_namespace }}"
force: yes
grace_period: 0
run_once: true
delegate_to: localhost
ignore_errors: yes
- name: Wait for namespace deletion
ansible.builtin.pause:
seconds: 15
run_once: true
# Step 6: Reinstall Longhorn via manifest
- name: Deploy Longhorn HelmChart manifest
ansible.builtin.copy:
dest: "{{ longhorn_manifest_path }}"
content: |
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
annotations:
helmcharts.cattle.io/managed-by: helm-controller
finalizers:
- wrangler.cattle.io/on-helm-chart-remove
name: longhorn-install
namespace: kube-system
spec:
version: v1.9.1
chart: longhorn
repo: https://charts.longhorn.io
failurePolicy: abort
targetNamespace: longhorn-system
createNamespace: true
valuesContent: |-
defaultSettings:
defaultDataPath: {{ longhorn_data_path }}
when: inventory_hostname == 'pi1'
register: manifest_deployed
- name: Trigger k3s reconcile by touching manifest
ansible.builtin.file:
path: "{{ longhorn_manifest_path }}"
state: touch
when: manifest_deployed is changed and inventory_hostname == 'pi1'
- name: Wait for Longhorn pods to be created
ansible.builtin.pause:
minutes: 3
when: manifest_deployed is changed
run_once: true
when: >-
(recovery_phase == 'hard' and not new_driver_deployer.resources | default([]) | selectattr('status.phase', 'eq', 'Running') | list | length > 0)
or (recovery_phase == 'soft' and not soft_recovery_success | default(false) and not new_driver_deployer.resources | default([]) | selectattr('status.phase', 'eq', 'Running') | list | length > 0)
or recovery_phase == 'none'
run_once: true
# ========================================================================
# PHASE 5: Restore from Backup
# ========================================================================
- name: Execute restore from backup
block:
- name: Determine backup directory to use
ansible.builtin.set_fact:
backup_dir_to_use: >-
  {%- if lookup('fileglob', fallback_backup_dir ~ '/backup_*.volumes', wantlist=True) | length > 0 -%}
  {{ fallback_backup_dir }}
  {%- elif lookup('fileglob', primary_backup_dir ~ '/backup_*.volumes', wantlist=True) | length > 0 -%}
  {{ primary_backup_dir }}
  {%- endif -%}
run_once: true
delegate_to: localhost
- name: FAIL if no backup directory found
ansible.builtin.fail:
msg: "No valid backup directory found with backup_*.volumes files"
when: backup_dir_to_use == ""
run_once: true
- name: Find latest backup file
ansible.builtin.set_fact:
latest_backup: >-
  {#- fileglob returns plain path strings; backup names embed the date, so a lexicographic sort puts the newest last -#}
  {%- set files = lookup('fileglob', backup_dir_to_use ~ '/backup_*.volumes', wantlist=True) | sort -%}
  {%- if files | length > 0 -%}{{ files | last }}{%- endif -%}
run_once: true
delegate_to: localhost
- name: FAIL if no backup files found
ansible.builtin.fail:
msg: "No backup files found in {{ backup_dir_to_use }}"
when: latest_backup | default('') == ''
run_once: true
- name: Wait for Longhorn managers to be ready
kubernetes.core.k8s_info:
kind: Pod
namespace: "{{ longhorn_namespace }}"
label_selectors:
- app=longhorn-manager
register: managers_status
until: >-
{{ (managers_status.resources | default([]) | selectattr('status.phase', 'eq', 'Running') | list | length) >= 1 }}
retries: 30
delay: 10
run_once: true
delegate_to: localhost
- name: Apply PV/PVC backup
kubernetes.core.k8s:
state: present
src: "{{ latest_backup }}"
run_once: true
delegate_to: localhost
- name: Find Longhorn metadata backup
ansible.builtin.set_fact:
longhorn_backup: >-
  {%- set lh_files = lookup('fileglob', backup_dir_to_use ~ '/longhorn_metadata_*.yaml', wantlist=True) | sort -%}
  {%- if lh_files | length > 0 -%}{{ lh_files | last }}{%- endif -%}
run_once: true
delegate_to: localhost
- name: Apply Longhorn metadata backup (if exists)
kubernetes.core.k8s:
state: present
src: "{{ longhorn_backup | default(omit) }}"
namespace: "{{ longhorn_namespace }}"
when: longhorn_backup | default('') != ''
run_once: true
delegate_to: localhost
when: inventory_hostname == 'pi1'
run_once: true
# ========================================================================
# PHASE 6: Post-Recovery Verification
# ========================================================================
- name: Verify recovery success
block:
- name: Check CSI driver registration
kubernetes.core.k8s_info:
kind: CSIDriver
name: driver.longhorn.io
register: csi_final
until: csi_final.resources | length > 0
retries: 10
delay: 10
run_once: true
delegate_to: localhost
- name: Check Longhorn manager health
kubernetes.core.k8s_info:
kind: Pod
namespace: "{{ longhorn_namespace }}"
label_selectors:
- app=longhorn-manager
register: managers_final
until: >-
{{ (managers_final.resources | default([]) | selectattr('status.phase', 'eq', 'Running') | list | length) >= 3 }}
retries: 15
delay: 10
run_once: true
delegate_to: localhost
- name: Check CSI socket exists (on pi1)
ansible.builtin.stat:
path: /var/lib/kubelet/plugins/driver.longhorn.io/csi.sock
register: csi_socket
when: inventory_hostname == 'pi1'
- name: Verify volume data is still present
ansible.builtin.stat:
path: "{{ longhorn_data_path }}/replicas"
register: replicas_dir
when: inventory_hostname == 'pi1'
- name: Display recovery summary
ansible.builtin.debug:
msg: |
===== Longhorn Recovery Summary =====
CSI Driver Registered: {{ (csi_final.resources | default([]) | length > 0) | ternary('✓', '✗') }}
Managers Running: {{ (managers_final.resources | default([]) | selectattr('status.phase', 'eq', 'Running') | list | length) }}/3
CSI Socket Exists: {{ csi_socket.stat.exists | default(false) | bool | ternary('✓', '✗') }}
Volume Data Present: {{ replicas_dir.stat.exists | default(false) | bool | ternary('✓', '✗') }}
Backup Used: {{ latest_backup | default('none') }}
======================================
run_once: true
when: inventory_hostname == 'pi1'
run_once: true

View File

@@ -0,0 +1,914 @@
---
# Longhorn Block-Device Data Recovery Playbook
#
# PURPOSE:
# Recover application data directly from raw Longhorn replica files when Volume CRDs
# are missing (e.g. after a nuclear cleanup + reinstall). Bypasses k8s objects entirely
# and works at the block-device level.
#
# WHEN TO USE:
# - Longhorn has been fully reinstalled (Volume CRDs are gone)
# - Application PVCs are stuck Terminating / Lost
# - The raw replica .img files still exist on disk
# → See docs/runbooks/longhorn-block-device-recovery.md for the manual equivalent
#
# WHEN NOT TO USE:
# - Volume CRDs still exist → use playbooks/recover/longhorn.yml instead
# - All replica dirs were zeroed by Longhorn reconciliation (data is unrecoverable)
#
# USAGE:
# ansible-playbook -i inventory/hosts.yml playbooks/recover/longhorn_data.yml \
# -e @vars/recovery_volumes.yml
#
# VARS FILE FORMAT (vars/recovery_volumes.yml):
# longhorn_recovery_volumes:
# - pv_name: pvc-abc123 # Longhorn volume name (== PV name)
# pvc_name: myapp-data # PVC name in the namespace
# namespace: myapp # namespace where the PVC lives
# size_bytes: "134217728" # volume size in bytes (string)
# size_human: 128Mi # human-readable, used in PVC spec
# access_mode: ReadWriteOnce # ReadWriteOnce or ReadWriteMany
# workload_kind: Deployment # Deployment or StatefulSet
# workload_name: myapp # name of the workload to scale down/up
# source_node: pi3 # [OPTIONAL] node with untouched replica dir
# source_dir: pvc-abc123-998f49ff # [OPTIONAL] exact replica dir name
# verify_cmd: "" # optional: command to run inside pod to verify data after recovery
#
# source_node and source_dir are auto-discovered (largest dir with >16 MiB of actual data across all nodes)
# when not specified. Override manually only to force a specific replica dir.
#
# REQUIREMENTS:
# - python3 on all cluster nodes
# - kubectl configured on the Ansible controller (localhost)
# - longhorn-system namespace running and healthy before this playbook starts
# - kubernetes.core collection: ansible-galaxy collection install kubernetes.core
#
# TESTED SCENARIO:
# 2026-04-13 power cut — nuclear Longhorn reinstall — url-shortener SQLite recovery
# Proven working as of 2026-04-14.
- name: Longhorn Block-Device Data Recovery
hosts: localhost
gather_facts: no
vars:
longhorn_data_path: /mnt/arcodange/longhorn
longhorn_namespace: longhorn-system
longhorn_nodes: [pi1, pi2, pi3]
merge_tool_local: "{{ playbook_dir }}/../../docs/incidents/2026-04-13-power-cut/tools/merge-longhorn-layers.py"
merge_tool_remote: /home/pi/merge-longhorn-layers.py
backup_base: /home/pi/arcodange/backups/longhorn-recovery
merged_base: /tmp/longhorn-recovery-merged
recovery_mount: /mnt/recovery-src
live_mount: /mnt/recovery-live
longhorn_recovery_volumes: [] # override with -e @vars/recovery_volumes.yml
tasks:
# =========================================================================
# PRE-FLIGHT
# =========================================================================
- name: "Pre-flight | Fail fast if no volumes defined"
ansible.builtin.fail:
msg: >
No recovery volumes defined. Pass -e @vars/recovery_volumes.yml with a
longhorn_recovery_volumes list. See playbook header for format.
when: longhorn_recovery_volumes | length == 0
- name: "Pre-flight | Verify merge tool exists locally"
ansible.builtin.stat:
path: "{{ merge_tool_local }}"
register: merge_tool_stat
delegate_to: localhost
- name: "Pre-flight | Fail if merge tool missing"
ansible.builtin.fail:
msg: "merge-longhorn-layers.py not found at {{ merge_tool_local }}"
when: not merge_tool_stat.stat.exists
- name: "Pre-flight | Check Longhorn is healthy"
kubernetes.core.k8s_info:
kind: Pod
namespace: "{{ longhorn_namespace }}"
label_selectors:
- app=longhorn-manager
register: lh_managers
delegate_to: localhost
- name: "Pre-flight | Fail if Longhorn managers are not running"
ansible.builtin.fail:
msg: >
Longhorn managers not running (found {{ lh_managers.resources | default([]) |
selectattr('status.phase', 'eq', 'Running') | list | length }} Running pods).
Ensure Longhorn is healthy before attempting data recovery.
when: >
(lh_managers.resources | default([]) |
selectattr('status.phase', 'eq', 'Running') | list | length) < 1
- name: "Pre-flight | Summary"
ansible.builtin.debug:
msg: >
Longhorn healthy ({{ lh_managers.resources |
selectattr('status.phase', 'eq', 'Running') | list | length }} managers running).
Recovering {{ longhorn_recovery_volumes | length }} volume(s):
{{ longhorn_recovery_volumes | map(attribute='pv_name') | list | join(', ') }}
# =========================================================================
# PHASE 0 — AUTO-DISCOVER BEST REPLICA DIR (when source_node/source_dir absent)
# =========================================================================
- name: "Phase 0 | Scan replica dirs on all nodes"
ansible.builtin.shell: |
result=""
for dir in {{ longhorn_data_path }}/replicas/{{ item.1.pv_name }}-*; do
[ -d "$dir" ] || continue
# Skip replicas that were being rebuilt — their data is incomplete
meta="$dir/volume.meta"
if [ -f "$meta" ]; then
rebuilding=$(python3 -c "import json; d=json.load(open('$meta')); print(d.get('Rebuilding', False))" 2>/dev/null)
[ "$rebuilding" = "True" ] && continue
fi
# Use actual disk usage (not apparent/sparse size) to rank replicas
size=$(du -sk "$dir" 2>/dev/null | cut -f1)
name=$(basename "$dir")
result="$result\n$size $name"
done
printf '%b' "$result" | grep -v '^$' || true
delegate_to: "{{ item.0 }}"
become: yes
loop: "{{ longhorn_nodes | product(longhorn_recovery_volumes) | list }}"
loop_control:
label: "{{ item.0 }}: {{ item.1.pv_name }}"
register: dir_scan_raw
changed_when: false
when: item.1.source_node | default('') == '' or item.1.source_dir | default('') == ''
- name: "Phase 0 | Pick best source (largest dir with data, >16K)"
ansible.builtin.set_fact:
_discovered_sources: "{{ _build | from_json }}"
vars:
_build: >-
{% set ns = namespace(result={}) %}
{% for res in dir_scan_raw.results | default([]) %}
{% if not res.skipped | default(false) and res.stdout | default('') != '' %}
{% set node = res.item.0 %}
{% set vol = res.item.1.pv_name %}
{% for line in res.stdout_lines %}
{% set parts = line.split() %}
{% if parts | length == 2 %}
{% set size = parts[0] | int %}
{% set dir = parts[1] %}
{% if size > 16384 and (vol not in ns.result or size > ns.result[vol].size) %}
{# size is in KB (from du -sk); 16384 KB = 16 MiB minimum real replica #}
{% set _ = ns.result.update({vol: {'node': node, 'dir': dir, 'size': size}}) %}
{% endif %}
{% endif %}
{% endfor %}
{% endif %}
{% endfor %}
{{ ns.result | to_json }}
- name: "Phase 0 | Show discovered sources"
ansible.builtin.debug:
msg: >-
{% for vol in longhorn_recovery_volumes %}
{{ vol.pv_name }}:
{% if vol.source_node | default('') != '' %}
source: MANUAL → {{ vol.source_node }}/{{ vol.source_dir }}
{% elif vol.pv_name in _discovered_sources %}
source: AUTO → {{ _discovered_sources[vol.pv_name].node }}/{{ _discovered_sources[vol.pv_name].dir }}
({{ (_discovered_sources[vol.pv_name].size / 1024) | round(0) | int }} MiB)
{% else %}
source: NOT FOUND — no dir >16 MiB of real data on any node for this volume
{% endif %}
{% endfor %}
- name: "Phase 0 | Fail if source not found for any volume"
ansible.builtin.fail:
msg: >
No replica dir with data found for {{ item.pv_name }} on any node
({{ longhorn_nodes | join(', ') }}). Check that the replica files survived.
loop: "{{ longhorn_recovery_volumes }}"
loop_control:
label: "{{ item.pv_name }}"
when: >
item.source_node | default('') == '' and
item.source_dir | default('') == '' and
item.pv_name not in _discovered_sources
- name: "Phase 0 | Initialize merged volume list"
ansible.builtin.set_fact:
_merged_volumes: []
- name: "Phase 0 | Append each volume with resolved source"
ansible.builtin.set_fact:
_merged_volumes: "{{ _merged_volumes + [item | combine(_source)] }}"
vars:
_manual: "{{ item.source_node | default('') != '' and item.source_dir | default('') != '' }}"
_source: "{{ _manual | bool | ternary(
{'source_node': item.source_node, 'source_dir': item.source_dir},
{'source_node': _discovered_sources[item.pv_name].node,
'source_dir': _discovered_sources[item.pv_name].dir}) }}"
loop: "{{ longhorn_recovery_volumes }}"
loop_control:
label: "{{ item.pv_name }}"
- name: "Phase 0 | Apply resolved volume list"
ansible.builtin.set_fact:
_volumes: "{{ _merged_volumes }}"
# =========================================================================
# PHASE 1 — UPLOAD MERGE TOOL AND BACK UP REPLICA DIRS
# =========================================================================
- name: "Phase 1 | Upload merge tool to source nodes"
ansible.builtin.command: >
scp -o StrictHostKeyChecking=no
{{ merge_tool_local }}
pi@{{ item.source_node }}.home:{{ merge_tool_remote }}
delegate_to: localhost
become: no
loop: "{{ _volumes }}"
loop_control:
label: "{{ item.pv_name }} → {{ item.source_node }}"
changed_when: true
- name: "Phase 1 | Create backup directory on source node"
ansible.builtin.file:
path: "{{ backup_base }}/{{ item.pvc_name }}"
state: directory
mode: "0755"
delegate_to: "{{ item.source_node }}"
become: yes
loop: "{{ _volumes }}"
loop_control:
label: "{{ item.pvc_name }}"
- name: "Phase 1 | Check if backup already exists (skip if re-running)"
ansible.builtin.stat:
path: "{{ backup_base }}/{{ item.pvc_name }}/{{ item.source_dir }}/volume.meta"
register: backup_exists
delegate_to: "{{ item.source_node }}"
become: yes
loop: "{{ _volumes }}"
loop_control:
label: "{{ item.pvc_name }}"
- name: "Phase 1 | Back up untouched replica dir (safe copy before any operation)"
ansible.builtin.shell: >
cp -a {{ longhorn_data_path }}/replicas/{{ item.item.source_dir }}
{{ backup_base }}/{{ item.item.pvc_name }}/
delegate_to: "{{ item.item.source_node }}"
become: yes
loop: "{{ backup_exists.results }}"
loop_control:
label: "{{ item.item.pvc_name }}"
when: not item.stat.exists
changed_when: true
- name: "Phase 1 | Verify backup contains volume.meta"
ansible.builtin.stat:
path: "{{ backup_base }}/{{ item.pvc_name }}/{{ item.source_dir }}/volume.meta"
register: backup_meta
delegate_to: "{{ item.source_node }}"
become: yes
loop: "{{ _volumes }}"
loop_control:
label: "{{ item.pvc_name }}"
- name: "Phase 1 | Fail if backup is incomplete"
ansible.builtin.fail:
msg: >
Backup for {{ item.item.pvc_name }} is missing volume.meta — the source dir
{{ item.item.source_dir }} may not exist or backup copy failed.
loop: "{{ backup_meta.results }}"
loop_control:
label: "{{ item.item.pvc_name }}"
when: not item.stat.exists
# =========================================================================
# PHASE 2 — RECONSTRUCT FILESYSTEMS FROM REPLICA LAYERS
# =========================================================================
- name: "Phase 2 | Create merged output directory"
ansible.builtin.file:
path: "{{ merged_base }}"
state: directory
mode: "0755"
delegate_to: "{{ item.source_node }}"
become: yes
loop: "{{ _volumes }}"
loop_control:
label: "{{ item.pvc_name }}"
- name: "Phase 2 | Check if merged image already exists"
ansible.builtin.stat:
path: "{{ merged_base }}/{{ item.pvc_name }}.img"
register: merged_exists
delegate_to: "{{ item.source_node }}"
become: yes
loop: "{{ _volumes }}"
loop_control:
label: "{{ item.pvc_name }}"
- name: "Phase 2 | Merge snapshot + head layers into single image"
ansible.builtin.command: >
python3 {{ merge_tool_remote }}
{{ backup_base }}/{{ item.item.pvc_name }}/{{ item.item.source_dir }}
{{ merged_base }}/{{ item.item.pvc_name }}.img
delegate_to: "{{ item.item.source_node }}"
become: yes
loop: "{{ merged_exists.results }}"
loop_control:
label: "{{ item.item.pvc_name }}"
when: not item.stat.exists
changed_when: true
register: merge_output
- name: "Phase 2 | Show merge output"
ansible.builtin.debug:
msg: "{{ item.stdout_lines | default([]) }}"
loop: "{{ merge_output.results | default([]) }}"
loop_control:
label: "{{ item.item.item.pvc_name | default('') }}"
when: item.stdout_lines is defined
- name: "Phase 2 | Test mount merged image to verify filesystem"
ansible.builtin.shell: |
mkdir -p {{ recovery_mount }}-{{ item.pvc_name }}
mount -o loop,ro,noload {{ merged_base }}/{{ item.pvc_name }}.img {{ recovery_mount }}-{{ item.pvc_name }}
ls {{ recovery_mount }}-{{ item.pvc_name }}/
umount {{ recovery_mount }}-{{ item.pvc_name }}
delegate_to: "{{ item.source_node }}"
become: yes
loop: "{{ _volumes }}"
loop_control:
label: "{{ item.pvc_name }}"
register: mount_test
changed_when: false
- name: "Phase 2 | Show filesystem contents"
ansible.builtin.debug:
msg: "{{ item.item.pvc_name }}: {{ item.stdout_lines }}"
loop: "{{ mount_test.results }}"
loop_control:
label: "{{ item.item.pvc_name }}"
# =========================================================================
# PHASE 3 — CREATE LONGHORN VOLUME CRDs
# =========================================================================
# Scale down StatefulSets BEFORE removing PVC finalizers.
# StatefulSet controllers auto-recreate PVCs as soon as they are deleted; if we
# remove finalizers while the StatefulSet is still running, the controller
# immediately provisions a new empty PVC (bound to a fresh volume), making the
# PVC spec immutable by the time Phase 8 tries to pin it to our recovered PV.
# Deployments are less urgent here but scaled early for consistency.
- name: "Phase 3 | Pre-scale down Deployments (before PVC finalizer removal)"
kubernetes.core.k8s_scale:
kind: Deployment
name: "{{ item.workload_name }}"
namespace: "{{ item.namespace }}"
replicas: 0
wait: yes
wait_timeout: 60
delegate_to: localhost
loop: "{{ _volumes }}"
loop_control:
label: "{{ item.namespace }}/{{ item.workload_name }}"
when: item.workload_kind == 'Deployment' and item.workload_name != ''
ignore_errors: yes
- name: "Phase 3 | Pre-scale down StatefulSets (before PVC finalizer removal)"
kubernetes.core.k8s_scale:
kind: StatefulSet
name: "{{ item.workload_name }}"
namespace: "{{ item.namespace }}"
replicas: 0
wait: yes
wait_timeout: 60
delegate_to: localhost
loop: "{{ _volumes }}"
loop_control:
label: "{{ item.namespace }}/{{ item.workload_name }}"
when: item.workload_kind == 'StatefulSet' and item.workload_name != ''
ignore_errors: yes
# Clear any stuck Terminating PVs/PVCs BEFORE creating Volume CRDs.
# If old Terminating PVCs still exist when we create the Volume CRD, Longhorn
# associates them and deletes the Volume CRD when the PVC finishes terminating.
- name: "Phase 3 | Check PVC state before touching finalizers"
ansible.builtin.shell: >
kubectl get pvc {{ item.pvc_name }} -n {{ item.namespace }}
-o jsonpath='{.metadata.deletionTimestamp}' 2>/dev/null || true
register: pvc_deletion_ts
delegate_to: localhost
loop: "{{ _volumes }}"
loop_control:
label: "{{ item.namespace }}/{{ item.pvc_name }}"
changed_when: false
- name: "Phase 3 | Remove finalizers from stuck PV (if Terminating)"
ansible.builtin.shell: >
kubectl patch pv {{ item.pv_name }} --type=merge
-p '{"metadata":{"finalizers":null}}' 2>/dev/null || true
delegate_to: localhost
loop: "{{ _volumes }}"
loop_control:
label: "{{ item.pv_name }}"
changed_when: false
- name: "Phase 3 | Remove finalizers from stuck PVC (if Terminating)"
ansible.builtin.shell: >
kubectl patch pvc {{ item.pvc_name }} -n {{ item.namespace }}
--type=merge -p '{"metadata":{"finalizers":null}}' 2>/dev/null || true
delegate_to: localhost
loop: "{{ pvc_deletion_ts.results }}"
loop_control:
label: "{{ item.item.namespace }}/{{ item.item.pvc_name }}"
when: item.stdout != ''
changed_when: false
- name: "Phase 3 | Wait for stuck PVCs to fully delete before creating Volume CRDs"
kubernetes.core.k8s_info:
kind: PersistentVolumeClaim
name: "{{ item.item.pvc_name }}"
namespace: "{{ item.item.namespace }}"
register: pvc_pre_check
until: pvc_pre_check.resources | default([]) | length == 0
retries: 12
delay: 5
delegate_to: localhost
loop: "{{ pvc_deletion_ts.results }}"
loop_control:
label: "{{ item.item.namespace }}/{{ item.item.pvc_name }}"
when: item.stdout != ''
- name: "Phase 3 | Check if Longhorn Volume CRD already exists"
kubernetes.core.k8s_info:
kind: Volume
api_version: longhorn.io/v1beta2
namespace: "{{ longhorn_namespace }}"
name: "{{ item.pv_name }}"
register: volume_crd_check
delegate_to: localhost
loop: "{{ _volumes }}"
loop_control:
label: "{{ item.pv_name }}"
- name: "Phase 3 | Create Longhorn Volume CRD"
kubernetes.core.k8s:
state: present
definition:
apiVersion: longhorn.io/v1beta2
kind: Volume
metadata:
name: "{{ item.item.pv_name }}"
namespace: "{{ longhorn_namespace }}"
spec:
accessMode: "{{ item.item.access_mode | lower | replace('readwriteonce', 'rwo') | replace('readwritemany', 'rwx') }}"
dataEngine: v1
frontend: blockdev
numberOfReplicas: 3
size: "{{ item.item.size_bytes }}"
delegate_to: localhost
loop: "{{ volume_crd_check.results }}"
loop_control:
label: "{{ item.item.pv_name }}"
when: item.resources | default([]) | length == 0
- name: "Phase 3 | Wait for Longhorn replicas to appear (stopped state)"
kubernetes.core.k8s_info:
kind: Replica
api_version: longhorn.io/v1beta2
namespace: "{{ longhorn_namespace }}"
label_selectors:
- "longhornvolume={{ item.pv_name }}"
register: replicas_check
until: replicas_check.resources | default([]) | length >= 1
retries: 24
delay: 5
delegate_to: localhost
loop: "{{ _volumes }}"
loop_control:
label: "{{ item.pv_name }}"
- name: "Phase 3 | Wait for Volume status to be populated (webhook cache)"
kubernetes.core.k8s_info:
kind: Volume
api_version: longhorn.io/v1beta2
namespace: "{{ longhorn_namespace }}"
name: "{{ item.pv_name }}"
register: vol_ready
until: >
(vol_ready.resources | default([]) | first | default({}) ).status.state | default('') != ''
retries: 24
delay: 5
delegate_to: localhost
loop: "{{ _volumes }}"
loop_control:
label: "{{ item.pv_name }}"
# =========================================================================
# PHASE 4 — SCALE DOWN WORKLOADS
# =========================================================================
- name: "Phase 4 | Scale down Deployments"
kubernetes.core.k8s_scale:
kind: Deployment
name: "{{ item.workload_name }}"
namespace: "{{ item.namespace }}"
replicas: 0
wait: yes
wait_timeout: 60
delegate_to: localhost
loop: "{{ _volumes }}"
loop_control:
label: "{{ item.namespace }}/{{ item.workload_name }}"
when: item.workload_kind == 'Deployment' and item.workload_name != ''
ignore_errors: yes
- name: "Phase 4 | Scale down StatefulSets"
kubernetes.core.k8s_scale:
kind: StatefulSet
name: "{{ item.workload_name }}"
namespace: "{{ item.namespace }}"
replicas: 0
wait: yes
wait_timeout: 60
delegate_to: localhost
loop: "{{ _volumes }}"
loop_control:
label: "{{ item.namespace }}/{{ item.workload_name }}"
when: item.workload_kind == 'StatefulSet' and item.workload_name != ''
ignore_errors: yes
- name: "Phase 4 | Delete any lingering Error-state pods that may hold volume attachments"
ansible.builtin.shell: |
kubectl get pods -n {{ item.namespace }} \
--field-selector='status.phase=Failed' -o name | xargs -r kubectl delete -n {{ item.namespace }}
delegate_to: localhost
loop: "{{ _volumes }}"
loop_control:
label: "{{ item.namespace }}"
changed_when: false
ignore_errors: yes
# =========================================================================
# PHASE 5 — ATTACH VOLUME VIA MAINTENANCE TICKET
# =========================================================================
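# Longhorn (v1.5+) manages attachment through VolumeAttachment tickets: adding an
# entry under spec.attachmentTickets asks longhorn-manager to attach the volume to
# nodeID without any workload pod, and removing the ticket (Phase 7) releases it.
# Deleting the VolumeAttachment object itself does not work; Longhorn immediately
# recreates it (see the pitfalls table in the runbook).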
- name: "Phase 5 | Create VolumeAttachment maintenance ticket"
kubernetes.core.k8s:
state: present
definition:
apiVersion: longhorn.io/v1beta2
kind: VolumeAttachment
metadata:
name: "{{ item.pv_name }}"
namespace: "{{ longhorn_namespace }}"
spec:
attachmentTickets:
recovery:
generation: 0
id: recovery
nodeID: "{{ item.source_node }}"
parameters:
disableFrontend: "false"
type: longhorn-api
volume: "{{ item.pv_name }}"
delegate_to: localhost
loop: "{{ _volumes }}"
loop_control:
label: "{{ item.pv_name }} → {{ item.source_node }}"
- name: "Phase 5 | Wait for volume to reach attached state"
kubernetes.core.k8s_info:
kind: Volume
api_version: longhorn.io/v1beta2
namespace: "{{ longhorn_namespace }}"
name: "{{ item.pv_name }}"
register: vol_state
until: >
(vol_state.resources | default([]) | first | default({}) ).status.state | default('') == 'attached'
retries: 24
delay: 5
delegate_to: localhost
loop: "{{ _volumes }}"
loop_control:
label: "{{ item.pv_name }}"
- name: "Phase 5 | Verify block device exists on target node"
ansible.builtin.stat:
path: "/dev/longhorn/{{ item.pv_name }}"
register: blockdev_check
delegate_to: "{{ item.source_node }}"
become: yes
loop: "{{ _volumes }}"
loop_control:
label: "{{ item.pv_name }}"
- name: "Phase 5 | Fail if block device not present"
ansible.builtin.fail:
msg: >
Block device /dev/longhorn/{{ item.item.pv_name }} not found on
{{ item.item.source_node }} after volume attached — check Longhorn logs.
loop: "{{ blockdev_check.results }}"
loop_control:
label: "{{ item.item.pv_name }}"
when: not item.stat.exists
# =========================================================================
# PHASE 6 — INJECT DATA INTO LIVE BLOCK DEVICE
# =========================================================================
- name: "Phase 6 | Inject data via block device (mount, rsync, umount)"
ansible.builtin.shell: |
LIVE="{{ live_mount }}-{{ item.pvc_name }}"
SRC="{{ recovery_mount }}-{{ item.pvc_name }}"
BLOCKDEV="/dev/longhorn/{{ item.pv_name }}"
MERGED="{{ merged_base }}/{{ item.pvc_name }}.img"
# Always unmount on exit (success or partial failure)
cleanup() {
mountpoint -q "$SRC" && umount "$SRC" || true
mountpoint -q "$LIVE" && umount "$LIVE" || true
}
trap cleanup EXIT
mkdir -p "$LIVE" "$SRC"
# Format if not already formatted (idempotent — safe on re-run)
if ! blkid "$BLOCKDEV" | grep -q 'TYPE='; then
mkfs.ext4 -F "$BLOCKDEV"
fi
# Mount live block device if not already mounted
if ! mountpoint -q "$LIVE"; then
mount "$BLOCKDEV" "$LIVE"
fi
# Mount merged recovery image read-only if not already mounted
if ! mountpoint -q "$SRC"; then
mount -o loop,ro,noload "$MERGED" "$SRC"
fi
# Sync data — exclude lost+found
# --ignore-errors: continue past unreadable files (e.g. corrupted parts from power cut)
# rc=23 (partial transfer) is treated as success — bulk data transferred
rsync -av --ignore-errors --exclude='lost+found' "$SRC/" "$LIVE/" || \
{ RC=$?; [ $RC -eq 23 ] && echo "WARNING: rsync rc=23 (some files unreadable in source — expected for power-cut partitions)" || exit $RC; }
delegate_to: "{{ item.source_node }}"
become: yes
loop: "{{ _volumes }}"
loop_control:
label: "{{ item.pvc_name }}"
register: inject_output
changed_when: true
- name: "Phase 6 | Show rsync output"
ansible.builtin.debug:
msg: "{{ item.stdout_lines | default([]) }}"
loop: "{{ inject_output.results }}"
loop_control:
label: "{{ item.item.pvc_name }}"
# =========================================================================
# PHASE 7 — DETACH VOLUME
# =========================================================================
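# Wait only for OUR 'recovery' ticket to disappear, not for status.state to reach
# 'detached': a workload scaled up early may immediately create its own ticket,
# which would make a wait-for-detached time out (see the pitfalls table).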
- name: "Phase 7 | Remove recovery attachment ticket"
kubernetes.core.k8s_json_patch:
kind: VolumeAttachment
api_version: longhorn.io/v1beta2
namespace: "{{ longhorn_namespace }}"
name: "{{ item.pv_name }}"
patch:
- op: remove
path: /spec/attachmentTickets/recovery
delegate_to: localhost
loop: "{{ _volumes }}"
loop_control:
label: "{{ item.pv_name }}"
ignore_errors: yes
- name: "Phase 7 | Wait for recovery ticket to be gone"
kubernetes.core.k8s_info:
kind: VolumeAttachment
api_version: longhorn.io/v1beta2
namespace: "{{ longhorn_namespace }}"
name: "{{ item.pv_name }}"
register: va_state
until: >
(va_state.resources | default([]) | first | default({}) ).spec.attachmentTickets.recovery is not defined
retries: 24
delay: 5
delegate_to: localhost
loop: "{{ _volumes }}"
loop_control:
label: "{{ item.pv_name }}"
# =========================================================================
# PHASE 8 — RESTORE PV AND PVC
# =========================================================================
- name: "Phase 8 | Create PersistentVolume (Retain, no claimRef)"
kubernetes.core.k8s:
state: present
definition:
apiVersion: v1
kind: PersistentVolume
metadata:
name: "{{ item.pv_name }}"
annotations:
pv.kubernetes.io/provisioned-by: driver.longhorn.io
spec:
accessModes:
- "{{ item.access_mode }}"
capacity:
storage: "{{ item.size_human }}"
csi:
driver: driver.longhorn.io
fsType: ext4
volumeHandle: "{{ item.pv_name }}"
volumeAttributes:
dataEngine: v1
dataLocality: disabled
disableRevisionCounter: "true"
numberOfReplicas: "3"
staleReplicaTimeout: "30"
persistentVolumeReclaimPolicy: Retain
storageClassName: longhorn
volumeMode: Filesystem
delegate_to: localhost
loop: "{{ _volumes }}"
loop_control:
label: "{{ item.pv_name }}"
- name: "Phase 8 | Wait for PV to be Available or Bound"
kubernetes.core.k8s_info:
kind: PersistentVolume
name: "{{ item.pv_name }}"
register: pv_state
until: >
(pv_state.resources | default([]) | first | default({}) ).status.phase | default('')
in ['Available', 'Bound']
retries: 12
delay: 5
delegate_to: localhost
loop: "{{ _volumes }}"
loop_control:
label: "{{ item.pv_name }}"
- name: "Phase 8 | Check if PVC already bound to correct PV"
ansible.builtin.shell: >
kubectl get pvc {{ item.pvc_name }} -n {{ item.namespace }}
-o jsonpath='{.spec.volumeName}' 2>/dev/null || true
register: pvc_current_volume
delegate_to: localhost
loop: "{{ _volumes }}"
loop_control:
label: "{{ item.namespace }}/{{ item.pvc_name }}"
changed_when: false
- name: "Phase 8 | Create PersistentVolumeClaim pinned to PV"
kubernetes.core.k8s:
state: present
definition:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: "{{ item.item.pvc_name }}"
namespace: "{{ item.item.namespace }}"
spec:
accessModes:
- "{{ item.item.access_mode }}"
resources:
requests:
storage: "{{ item.item.size_human }}"
storageClassName: longhorn
volumeMode: Filesystem
volumeName: "{{ item.item.pv_name }}"
delegate_to: localhost
loop: "{{ pvc_current_volume.results }}"
loop_control:
label: "{{ item.item.namespace }}/{{ item.item.pvc_name }}"
when: item.stdout != item.item.pv_name
- name: "Phase 8 | Wait for PVC to be Bound"
kubernetes.core.k8s_info:
kind: PersistentVolumeClaim
namespace: "{{ item.namespace }}"
name: "{{ item.pvc_name }}"
register: pvc_state
until: >
(pvc_state.resources | default([]) | first | default({}) ).status.phase | default('') == 'Bound'
retries: 12
delay: 5
delegate_to: localhost
loop: "{{ _volumes }}"
loop_control:
label: "{{ item.namespace }}/{{ item.pvc_name }}"
# =========================================================================
# PHASE 9 — SCALE UP AND VERIFY
# =========================================================================
- name: "Phase 9 | Scale up Deployments"
kubernetes.core.k8s_scale:
kind: Deployment
name: "{{ item.workload_name }}"
namespace: "{{ item.namespace }}"
replicas: 1
wait: yes
wait_timeout: 120
delegate_to: localhost
loop: "{{ _volumes }}"
loop_control:
label: "{{ item.namespace }}/{{ item.workload_name }}"
when: item.workload_kind == 'Deployment' and item.workload_name != ''
ignore_errors: yes
- name: "Phase 9 | Scale up StatefulSets"
kubernetes.core.k8s_scale:
kind: StatefulSet
name: "{{ item.workload_name }}"
namespace: "{{ item.namespace }}"
replicas: 1
wait: yes
wait_timeout: 120
delegate_to: localhost
loop: "{{ _volumes }}"
loop_control:
label: "{{ item.namespace }}/{{ item.workload_name }}"
when: item.workload_kind == 'StatefulSet' and item.workload_name != ''
ignore_errors: yes
- name: "Phase 9 | Wait for workload to report ready replicas"
kubernetes.core.k8s_info:
kind: "{{ item.workload_kind }}"
name: "{{ item.workload_name }}"
namespace: "{{ item.namespace }}"
register: workload_state
until: >
(workload_state.resources | default([]) | first | default({}) ).status.readyReplicas | default(0) | int >= 1
retries: 24
delay: 5
delegate_to: localhost
loop: "{{ _volumes }}"
loop_control:
label: "{{ item.namespace }}/{{ item.workload_name }}"
when: item.workload_name != ''
ignore_errors: yes
- name: "Phase 9 | Run optional verification command in pod"
ansible.builtin.shell: >
kubectl exec -n {{ item.namespace }}
$(kubectl get pod -n {{ item.namespace }}
-l statefulset.kubernetes.io/pod-name={{ item.workload_name }}-0
--no-headers -o custom-columns=':metadata.name' 2>/dev/null ||
kubectl get pod -n {{ item.namespace }} {{ item.workload_name }}-0
--no-headers -o custom-columns=':metadata.name' 2>/dev/null)
-- sh -c '{{ item.verify_cmd }}'
delegate_to: localhost
loop: "{{ _volumes }}"
loop_control:
label: "{{ item.namespace }}/{{ item.workload_name }}"
when: item.verify_cmd | default('') != ''
register: verify_output
changed_when: false
ignore_errors: yes
- name: "Phase 9 | Show verification output"
ansible.builtin.debug:
msg: "{{ item.stdout_lines | default([]) }}"
loop: "{{ verify_output.results | default([]) }}"
loop_control:
label: "{{ item.item.pvc_name | default('') }}"
when: item.stdout_lines is defined and item.item.verify_cmd | default('') != ''
# =========================================================================
# RECOVERY SUMMARY
# =========================================================================
- name: "Summary | Recovery complete"
ansible.builtin.debug:
msg: |
╔══════════════════════════════════════════════════════╗
║ Longhorn Block-Device Recovery Complete ║
╚══════════════════════════════════════════════════════╝
Volumes recovered:
{% for v in _volumes %}
• {{ v.pvc_name }} ({{ v.namespace }}) ← {{ v.source_node }}:{{ v.source_dir }}
{% endfor %}
Backups retained at: {{ backup_base }}/<pvc-name>/
Merged images at: {{ merged_base }}/<pvc-name>.img
Next steps:
1. Verify application data through the app UI / API
2. Repeat for remaining volumes (update vars file)
3. Run a fresh k8s_pvc backup once all volumes are healthy

View File

@@ -0,0 +1,84 @@
---
# Example vars file for playbooks/recover/longhorn_data.yml
#
# Usage:
# ansible-playbook -i inventory/hosts.yml playbooks/recover/longhorn_data.yml \
# -e @playbooks/recover/longhorn_data_vars.example.yml
#
# HOW TO FILL THIS IN:
#
# 1. Find untouched replica dirs across all nodes:
# for node in pi1 pi2 pi3; do
# echo "=== $node ==="
# ssh $node "sudo du -sh /mnt/arcodange/longhorn/replicas/pvc-<VOLUME>-* 2>/dev/null"
# done
#    Pick the dir with the largest actual size (well above the ~16K of an empty dir) and timestamps from before the incident.
#
# 2. Get pv_name and pvc_name from PV/PVC backup:
# cat /home/pi/arcodange/backups/k3s_pvc/backup_*.volumes | grep -A5 "kind: PersistentVolume"
#
# 3. Get size_bytes from Longhorn volume spec or from:
# cat /mnt/arcodange/longhorn/replicas/<source_dir>/volume.meta
#
# 4. source_node = the node where the untouched dir lives
# source_dir = the exact directory name (e.g. pvc-abc123-998f49ff)
#
# Fields:
# pv_name — Longhorn volume name, equals the PV name (pvc-<uuid>) [REQUIRED]
# pvc_name — PVC name in the namespace [REQUIRED]
# namespace — namespace where the PVC lives [REQUIRED]
# size_bytes — volume capacity in bytes as a string (from volume spec) [REQUIRED]
# size_human — human-readable size for PVC spec (e.g. 128Mi, 8Gi) [REQUIRED]
# access_mode — ReadWriteOnce or ReadWriteMany [REQUIRED]
# workload_kind — Deployment or StatefulSet [REQUIRED]
# workload_name — name of the workload to scale down/up [REQUIRED]
# source_node — node holding the untouched replica dir (pi1/pi2/pi3) [OPTIONAL — auto-discovered]
# source_dir — exact replica dir name on source_node [OPTIONAL — auto-discovered]
# verify_cmd — shell command to run inside pod to confirm data after restore [OPTIONAL]
#
# source_node and source_dir are auto-discovered by Phase 0 (largest dir with
# >16 MiB of actual data across all nodes). Override them manually only if you
# want to force a specific replica dir.
longhorn_recovery_volumes:
# --- url-shortener (example, already recovered 2026-04-14) ---
- pv_name: pvc-cdd434d1-c8b4-4a75-acde-2978ec9febd4
pvc_name: url-shortener-data
namespace: url-shortener
size_bytes: "134217728"
size_human: 128Mi
access_mode: ReadWriteOnce
workload_kind: Deployment
workload_name: url-shortener
source_node: pi3
source_dir: pvc-cdd434d1-c8b4-4a75-acde-2978ec9febd4-998f49ff
verify_cmd: "sqlite3 /data/urls.db 'SELECT COUNT(*) FROM urls;'"
# --- traefik (example, already recovered 2026-04-14) ---
# - pv_name: pvc-<traefik-uuid>
# pvc_name: traefik-data
# namespace: traefik
# size_bytes: "134217728"
# size_human: 128Mi
# access_mode: ReadWriteOnce
# workload_kind: Deployment
# workload_name: traefik
# source_node: pi3
# source_dir: pvc-<traefik-uuid>-<hex>
# verify_cmd: ""
# --- vault (uncomment and fill for recovery) ---
# - pv_name: pvc-<vault-uuid>
# pvc_name: vault-data
# namespace: vault
# size_bytes: "1073741824"
# size_human: 1Gi
# access_mode: ReadWriteOnce
# workload_kind: StatefulSet
# workload_name: vault
# source_node: pi2
# source_dir: pvc-<vault-uuid>-<hex>
# verify_cmd: ""
# Add more volumes here following the same pattern.
# Process one at a time first to validate, then batch.

View File

@@ -0,0 +1,17 @@
---
# Recovery vars for Clickhouse
# Source: pi3, dir pvc-1251909b-...-1163420b (2.6G — largest, snapshot verified non-zero)
# Generated: 2026-04-14
longhorn_recovery_volumes:
- pv_name: pvc-1251909b-3cef-40c6-881c-3bb6e929a596
pvc_name: clickhouse-storage-clickhouse-0
namespace: tools
size_bytes: "17179869184" # 16Gi
size_human: 16Gi
access_mode: ReadWriteOnce
workload_kind: StatefulSet
workload_name: clickhouse
source_node: pi3
source_dir: pvc-1251909b-3cef-40c6-881c-3bb6e929a596-1163420b
verify_cmd: "clickhouse-client --query 'SHOW DATABASES'"

View File

@@ -0,0 +1,38 @@
---
# Recovery vars for erp and hashicorp-vault volumes
# source_node/source_dir omitted — auto-discovered by Phase 0
longhorn_recovery_volumes:
- pv_name: pvc-7971918e-e47f-4739-a976-965ea2d770b4
pvc_name: erp
namespace: erp
size_bytes: "53687091200"
size_human: 50Gi
access_mode: ReadWriteMany
workload_kind: Deployment
workload_name: "" # intentionally blank — ERP needs Vault unsealed first; scale up manually
verify_cmd: ""
# hashicorp-vault StatefulSet has two PVCs (audit + data).
# workload_name is set only on the last entry so the StatefulSet is scaled up
# once after both volumes are ready, not between them.
- pv_name: pvc-6d2ea1c7-9327-4992-a02c-93ae604eda70
pvc_name: audit-hashicorp-vault-0
namespace: tools
size_bytes: "10737418240"
size_human: 10Gi
access_mode: ReadWriteOnce
workload_kind: StatefulSet
workload_name: ""
verify_cmd: ""
- pv_name: pvc-ca5567d3-a682-4cee-8ff1-2b8e23260635
pvc_name: data-hashicorp-vault-0
namespace: tools
size_bytes: "10737418240"
size_human: 10Gi
access_mode: ReadWriteOnce
workload_kind: StatefulSet
workload_name: hashicorp-vault
verify_cmd: ""

View File

@@ -0,0 +1,47 @@
---
# Recovery vars for remaining volumes (prometheus, alertmanager, redis, backups-rwx)
# source_node and source_dir intentionally omitted — auto-discovered by Phase 0
longhorn_recovery_volumes:
- pv_name: pvc-88e18c7f-2cfd-45e3-be5b-78c31ab829e9
pvc_name: prometheus-server
namespace: tools
size_bytes: "8589934592"
size_human: 8Gi
access_mode: ReadWriteOnce
workload_kind: Deployment
workload_name: prometheus-server
source_node: pi2
source_dir: pvc-88e18c7f-2cfd-45e3-be5b-78c31ab829e9-910583f6
verify_cmd: ""
- pv_name: pvc-aed7f2c4-1948-487a-8d10-d8a1372289b4
pvc_name: storage-prometheus-alertmanager-0
namespace: tools
size_bytes: "2147483648"
size_human: 2Gi
access_mode: ReadWriteOnce
workload_kind: StatefulSet
workload_name: prometheus-alertmanager
verify_cmd: ""
- pv_name: pvc-d1d5482b-81c8-4d7c-a528-7a57ef47a5ce
pvc_name: redis-storage-redis-0
namespace: tools
size_bytes: "1073741824"
size_human: 1Gi
access_mode: ReadWriteOnce
workload_kind: StatefulSet
workload_name: redis
verify_cmd: "redis-cli ping"
- pv_name: pvc-efda1d2f-1db8-46dd-9a97-3d11f1807ffa
pvc_name: backups-rwx
namespace: longhorn-system
size_bytes: "53687091200"
size_human: 50Gi
access_mode: ReadWriteMany
workload_kind: Deployment
workload_name: ""
verify_cmd: ""