Incident Documentation
This directory contains incident reports, postmortems, and recovery logs for the Arcodange Factory infrastructure.
Purpose
Document all infrastructure incidents to:
- Track root causes and resolutions
- Maintain a knowledge base for future troubleshooting
- Improve system reliability through lessons learned
- Provide clear guidance for on-call responders
Structure
Each incident is documented in its own directory under docs/incidents/ with the following naming convention:
docs/incidents/
├── YYYY-MM-DD-incident-name/
│   ├── README.md       # Incident summary and timeline
│   ├── status.md       # Real-time status updates (optional)
│   ├── log.md          # Detailed recovery actions and logs
│   ├── root-cause.md   # Technical analysis (optional)
│   └── diagrams/       # Architecture/flow diagrams (optional)
│       └── *.mmd       # Mermaid diagrams
└── ...
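The layout above can be scaffolded with a small helper. This is a minimal sketch, not part of the repository: the function names `incident_dir_name` and `scaffold_incident` are hypothetical, and the slug rule (lowercase words joined by hyphens) is an assumption inferred from the `2026-04-13-power-cut` entry in the directory index.

```python
import re
from pathlib import Path

_DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def incident_dir_name(date: str, name: str) -> str:
    """Build 'YYYY-MM-DD-incident-name' from a date and a free-form name."""
    if not _DATE_RE.match(date):
        raise ValueError(f"date must be YYYY-MM-DD, got {date!r}")
    # Assumption: slugs are lowercase alphanumerics joined by hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")
    if not slug:
        raise ValueError("incident name must contain letters or digits")
    return f"{date}-{slug}"


def scaffold_incident(base: Path, date: str, name: str) -> Path:
    """Create the incident directory with README.md, log.md and diagrams/."""
    root = base / incident_dir_name(date, name)
    (root / "diagrams").mkdir(parents=True, exist_ok=True)
    for filename in ("README.md", "log.md"):
        (root / filename).touch(exist_ok=True)  # leaves existing notes untouched
    return root
```

For example, `scaffold_incident(Path("docs/incidents"), "2026-04-13", "power cut")` would create `docs/incidents/2026-04-13-power-cut/`. The optional `status.md` and `root-cause.md` are left for the responder to add when needed.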
Incident Directory Contents
1. README.md (Required)
The primary incident document. Must include:
- Incident ID: Unique identifier (e.g., 2026-04-13-001)
- Title: Clear, descriptive title
- Date/Time: Start and end timestamps
- Status: Open / Investigating / Resolved / Monitoring
- Severity: SEV-1 (Critical) / SEV-2 (High) / SEV-3 (Medium) / SEV-4 (Low)
- Impact: Brief description of affected services
- Summary: What happened
- Timeline: Key events with timestamps
- Root Cause: Technical analysis
- Resolution: Steps taken to resolve
- Action Items: Follow-up tasks
- Lessons Learned: Key takeaways
Front matter template:
---
title: Incident Title
incident_id: YYYY-MM-DD-NNN
date: YYYY-MM-DD
time_start: HH:MM:SS UTC
time_end: HH:MM:SS UTC
status: Resolved
severity: SEV-2
tags:
- kubernetes
- longhorn
- storage
---
2. log.md (Recommended)
Detailed technical log of all recovery actions. When present, it must include:
- Commands executed with timestamps
- Command output (relevant portions)
- Decision rationale for each action
- Outcome of each action
- Next steps identified
Format:
## [Time] Action Description
**Command:** `actual command run`
**Output:**
relevant output
**Decision:** Why this action was taken
**Outcome:** What happened
**Next:** What to do next
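Writing entries in this format by hand is error-prone in the middle of an incident, so a tiny formatter can help. This is an illustrative sketch only: `format_log_entry` and `append_entry` are hypothetical names, and it assumes the square brackets around the time are kept literally in the heading.

```python
def format_log_entry(*, time, action, command, output, decision, outcome, next_step):
    """Render one log.md entry in the format described above."""
    return "\n".join([
        f"## [{time}] {action}",
        f"**Command:** `{command}`",
        "**Output:**",
        output.strip(),  # keep only the relevant portion before passing it in
        f"**Decision:** {decision}",
        f"**Outcome:** {outcome}",
        f"**Next:** {next_step}",
    ]) + "\n"


def append_entry(log_path, entry_markdown):
    """Append an entry to log.md, separated from the next one by a blank line."""
    with open(log_path, "a", encoding="utf-8") as handle:
        handle.write(entry_markdown + "\n")
```

Appending rather than rewriting the file keeps earlier entries intact, which matters when several responders log to the same `log.md`.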
3. Mermaid Diagrams
Include at least one Mermaid diagram in each incident to visualize:
- Architecture/flow before incident
- Failure propagation
- Recovery process
- New architecture after fixes
Example theme usage:
%%{init: { 'theme': 'forest', 'themeVariables': { 'primaryColor': '#ffdfd3', 'edgeLabelBackground':'#fff' }}}%%
Available themes: default, base, forest, dark, neutral
Recommended diagrams:
- incident-flow.mmd: Timeline/flow of the incident
- architecture.mmd: Affected components architecture
- recovery-flow.mmd: Recovery steps visualization
- dependency-tree.mmd: Component dependencies showing failure path
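A file such as dependency-tree.mmd can also be generated from a list of edges. The sketch below is an assumption-laden illustration: `mermaid_label` and `dependency_diagram` are hypothetical helpers, the node ids (`N0`, `N1`, ...) are arbitrary, and the label clean-up follows the notes in the quick start template (no parentheses, an HTML line break instead of a newline), assuming the line-break token is `<br>`.

```python
INIT_DIRECTIVE = "%%{init: { 'theme': 'forest' }}%%"


def mermaid_label(text: str) -> str:
    """Strip parentheses and turn newlines into <br> for use in a label."""
    cleaned = text.replace("(", "").replace(")", "")
    parts = [part.strip() for part in cleaned.splitlines() if part.strip()]
    return "<br>".join(parts)


def dependency_diagram(edges) -> str:
    """Render (source, label, target) triples as a top-down flowchart."""
    lines = [INIT_DIRECTIVE, "graph TD"]
    ids = {}
    for source, label, target in edges:
        for name in (source, target):
            if name not in ids:
                ids[name] = f"N{len(ids)}"  # stable id per distinct node name
        lines.append(
            f"    {ids[source]}[{mermaid_label(source)}] "
            f"-->|{mermaid_label(label)}| {ids[target]}[{mermaid_label(target)}]"
        )
    return "\n".join(lines) + "\n"
```

The returned string can be written directly to a `.mmd` file under the incident's `diagrams/` directory.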
Incident Severity Definitions
| Severity | Description | Response Time | Impact |
|---|---|---|---|
| SEV-1 | Critical system-wide outage | Immediate (24/7) | Multiple services down, potential data loss |
| SEV-2 | Major service degradation | < 1 hour | Single critical service down |
| SEV-3 | Partial service degradation | < 4 hours | Non-critical service affected |
| SEV-4 | Minor issue | Next business day | Cosmetic or non-impacting |
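The response times in the table can be turned into a concrete deadline. This is a hedged sketch rather than an agreed policy: `respond_by` is a hypothetical helper, SEV-1 is treated as "respond now", and "next business day" is assumed to mean the same clock time on the next Monday-to-Friday day, ignoring public holidays.

```python
from datetime import datetime, timedelta

# Windows copied from the severity table; None marks "next business day".
RESPONSE_WINDOW = {
    "SEV-1": timedelta(0),
    "SEV-2": timedelta(hours=1),
    "SEV-3": timedelta(hours=4),
    "SEV-4": None,
}


def respond_by(severity: str, detected_at: datetime) -> datetime:
    """Return the latest time a response is expected for this severity."""
    window = RESPONSE_WINDOW[severity]  # raises KeyError for unknown severities
    if window is not None:
        return detected_at + window
    # Assumption: business days are Monday to Friday.
    day = detected_at + timedelta(days=1)
    while day.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        day += timedelta(days=1)
    return day
```

For a SEV-2 detected at 10:00, `respond_by` returns 11:00 the same day; a SEV-4 detected on a Friday rolls over to the following Monday.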
Available Ansible Playbooks for Recovery
This collection provides comprehensive infrastructure management via Ansible.
Always use -i inventory/hosts.yml when running playbooks.
Master Playbooks (Run in order for full recovery)
| Playbook | Purpose | Targets |
|---|---|---|
| `playbooks/01_system.yml` | System setup (hostnames, iSCSI, Docker, Longhorn, DNS) | raspberries |
| `playbooks/02_setup.yml` | Infrastructure setup (NFS backup, PostgreSQL, Gitea) | localhost, postgres, gitea |
| `playbooks/03_cicd.yml` | CI/CD pipeline (Gitea tokens, Docker Compose, ArgoCD) | localhost, gitea |
| `playbooks/04_tools.yml` | Tool deployment (HashiCorp Vault, CrowdSec) | tools group |
| `playbooks/05_backup.yml` | Backup configuration | localhost |
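Because the master playbooks must run in order, it can help to generate the exact command lines instead of typing them. The sketch below is illustrative: `recovery_commands` is a hypothetical helper that only builds argument lists (it does not execute anything), and it uses the `--check`, `--diff` and `--limit` flags that appear in the handy commands section of this document.

```python
import shlex

INVENTORY = "inventory/hosts.yml"
MASTER_PLAYBOOKS = [
    "playbooks/01_system.yml",
    "playbooks/02_setup.yml",
    "playbooks/03_cicd.yml",
    "playbooks/04_tools.yml",
    "playbooks/05_backup.yml",
]


def recovery_commands(dry_run: bool = True, limit: str = "") -> list:
    """Return one ansible-playbook argv per master playbook, in run order."""
    commands = []
    for playbook in MASTER_PLAYBOOKS:
        argv = ["ansible-playbook", "-i", INVENTORY, playbook]
        if dry_run:
            argv += ["--check", "--diff"]  # preview changes without applying them
        if limit:
            argv += ["--limit", limit]  # restrict the run to one host or group
        commands.append(argv)
    return commands


if __name__ == "__main__":
    for argv in recovery_commands(dry_run=True):
        print(shlex.join(argv))
```

Defaulting to `dry_run=True` means the printed commands are safe to paste first; a real run requires passing `dry_run=False` explicitly.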
Component-Specific Playbooks
System
| Playbook | Purpose | Notes |
|---|---|---|
| `playbooks/system/rpi.yml` | Raspberry Pi hostname setup | |
| `playbooks/system/dns.yml` | DNS/Pi-hole configuration | |
| `playbooks/system/ssl.yml` | SSL certificate setup with step-ca | |
| `playbooks/system/prepare_disks.yml` | Disk partitioning and formatting | |
| `playbooks/system/system_docker.yml` | Docker installation with custom storage | Storage at /mnt/arcodange/docker |
| `playbooks/system/k3s_config.yml` | K3s configuration (Traefik, Longhorn HelmCharts) | Key for k3s |
| `playbooks/system/system_k3s.yml` | K3s cluster deployment | Uses k3s-ansible collection |
| `playbooks/system/iscsi_longhorn.yml` | iSCSI client for Longhorn | Prerequisite for Longhorn |
| `playbooks/system/k3s_dns.yml` | K3s DNS configuration | |
| `playbooks/system/k3s_ssl.yml` | K3s SSL/Traefik certificates | |
Storage
| Playbook | Purpose | Notes |
|---|---|---|
| `playbooks/setup/backup_nfs.yml` | Longhorn RWX NFS backup volume | Creates 50Gi PVC + recurring backups |
| `playbooks/backup/k3s_pvc.yml` | PVC backup scripts | Creates /opt/k3s_volumes/backup.sh and restore.sh |
Backup
| Playbook | Purpose | Notes |
|---|---|---|
| `playbooks/backup/backup.yml` | Main backup orchestration | Calls postgres, gitea, k3s_pvc |
| `playbooks/backup/postgres.yml` | PostgreSQL database backup | Docker exec pg_dumpall |
| `playbooks/backup/gitea.yml` | Gitea backup | Uses gitea dump command |
| `playbooks/backup/cron_report.yml` | Mail utility for cron reports | |
| `playbooks/backup/cron_report_mailutility.yml` | MTA configuration | |
Inventory File
File: inventory/hosts.yml
Groups:
- raspberries: pi1, pi2, pi3 (Raspberry Pi nodes)
- local: localhost, pi1, pi2, pi3
- postgres: pi2 (PostgreSQL host)
- gitea: pi2 (Gitea host, inherits postgres)
- pihole: pi1, pi3 (DNS hosts)
- step_ca: pi1, pi2, pi3 (Certificate authority)
- all: All above groups
Important: All playbooks MUST be run with the -i inventory/hosts.yml flag:
ansible-playbook -i inventory/hosts.yml playbooks/01_system.yml
Handy Commands for Incident Response
# Check all pods
kubectl get pods -A
# Check Longhorn specifically
kubectl get pods -n longhorn-system
kubectl get volumes -n longhorn-system
kubectl get replicas -n longhorn-system
# Check storage
kubectl get pv -A
kubectl get pvc -A
kubectl get csidriver
# Check nodes
kubectl get nodes -o wide
kubectl describe node <nodename>
# Force Longhorn HelmChart reconcile (k3s-specific)
sudo touch /var/lib/rancher/k3s/server/manifests/longhorn-install.yaml
# Restart Longhorn
kubectl delete pods -n longhorn-system --all --force --grace-period=0
# Check Longhorn data on disk
ls /mnt/arcodange/longhorn/replicas/
# Check Docker storage
ls /mnt/arcodange/docker/overlay2/ | head
# Run ansible playbook (dry-run first)
ansible-playbook -i inventory/hosts.yml playbooks/01_system.yml --check --diff
ansible-playbook -i inventory/hosts.yml playbooks/01_system.yml --limit pi1
K3s-Specific Recovery Notes
Longhorn is installed via HelmChart manifest (k3s native):
- File: `/var/lib/rancher/k3s/server/manifests/longhorn-install.yaml`
- To trigger a reconcile: `touch` the file (k3s watches for changes)
- DO NOT use `helm install` directly: it may conflict with the k3s HelmChart controller
Traefik is also installed via HelmChart manifest:
- File: `/var/lib/rancher/k3s/server/manifests/traefik-v3.yaml`
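The reconcile trigger described in these notes amounts to bumping the manifest's modification time. A minimal sketch, assuming it runs as root on a k3s server node; `trigger_reconcile` is a hypothetical helper, and unlike a bare `touch` it refuses to create the file when the path is wrong, so a typo cannot leave an empty manifest behind.

```python
import os
import time
from pathlib import Path

LONGHORN_MANIFEST = Path("/var/lib/rancher/k3s/server/manifests/longhorn-install.yaml")


def trigger_reconcile(manifest: Path = LONGHORN_MANIFEST) -> float:
    """Bump the manifest's modification time and return the new mtime."""
    if not manifest.is_file():
        raise FileNotFoundError(f"manifest not found: {manifest}")
    now = time.time()
    os.utime(manifest, (now, now))  # (atime, mtime); file content is unchanged
    return manifest.stat().st_mtime
```

The same function can be pointed at the Traefik manifest by passing its path explicitly.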
Incident Templates
Quick Start Template
---
title: [Short Description]
incident_id: YYYY-MM-DD-NNN
date: $(date +%Y-%m-%d)
time_start: $(date +%H:%M:%S)
status: Investigating
severity: SEV-2
tags:
- tag1
- tag2
---
## Summary
[1-2 sentences describing the issue]
## Impact
[What services/users are affected]
## Timeline
| Time | Event | Owner |
|------|-------|-------|
| HH:MM | Initial detection | @user |
| HH:MM | Investigation started | @user |
| HH:MM | Root cause identified | @user |
| HH:MM | Resolution applied | @user |
| HH:MM | Service restored | @user |
## Root Cause
[Technical analysis]
## Resolution
[Step-by-step what was done]
## Mermaid Diagram
%%{init: { 'theme': 'forest' }}%%
graph TD
A[Component A] -->|depends on| B[Component B]
B -->|failed due to| C[Component C]
C -->|power cut| D[Root Cause]
Remember to always do this for labels:
- have a space before a filepath
- no parentheses '()'
- use `<br>` instead of `\n` for new lines
## Action Items
- Task 1
- Task 2
## Lessons Learned
- Lesson 1
- Lesson 2
## Contributing to Incident Documentation
1. **During Incident**: Focus on resolution, log commands and outputs in `log.md`
2. **After Resolution**: Create or update the `README.md` with full incident details
3. **Add Diagrams**: Include at least one Mermaid diagram to visualize the issue
4. **Peer Review**: Have another team member review before closing
5. **Update Templates**: Improve templates based on what was missing
## Directory Index
| Incident | Date | Severity | Status |
|----------|------|----------|--------|
| [2026-04-13-power-cut](./2026-04-13-power-cut/README.md) | 2026-04-13 | SEV-1 | In Progress |