Gabriel Radureau 1ae28cb944 docs(longhorn): document 2026-04-13 power-cut recovery + add data-recovery tooling

Captures the post-mortem of the April 13 power-cut: incident timeline,
retrospective, and architecture/role diagrams. Adds an ADR explaining why
Longhorn cannot re-associate orphaned replica directories after a nuclear
reinstall (engine-id naming), plus block-device recovery runbooks and the
`playbooks/recover/longhorn_data.yml` automation that wires `merge-longhorn-layers.py`
to rebuild PVCs from raw `volume-head-*.img` chains.

Also extends the k3s_pvc backup to capture Longhorn `volumes`/`settings` CRDs
(needed for the fast-path restore) and rewrites the restore script with a
fallback dir + English messages.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-06 12:55:18 +02:00

Incident Documentation

This directory contains incident reports, postmortems, and recovery logs for the Arcodange Factory infrastructure.

Purpose

Document all infrastructure incidents to:

Track root causes and resolutions
Maintain a knowledge base for future troubleshooting
Improve system reliability through lessons learned
Provide clear guidance for on-call responders

Structure

Each incident is documented in its own directory under docs/incidents/ with the following naming convention:

docs/incidents/
├── YYYY-MM-DD-incident-name/
│   ├── README.md              # Incident summary and timeline
│   ├── status.md              # Real-time status updates (optional)
│   ├── log.md                 # Detailed recovery actions and logs
│   ├── root-cause.md          # Technical analysis (optional)
│   └── diagrams/              # Architecture/flow diagrams (optional)
│       └── *.mmd              # Mermaid diagrams
└── ...
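The layout above can be scaffolded with a few lines of shell. This is a minimal sketch: `new_incident` is a hypothetical helper, not a script that exists in this repository, and it creates only the required `README.md`, the recommended `log.md`, and an empty `diagrams/` directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper: scaffold docs/incidents/YYYY-MM-DD-<name>/ as described above.
new_incident() {
  local name="${1:-}"                    # e.g. "power-cut"
  local root="${2:-docs/incidents}"      # base directory from this README
  if [ -z "$name" ]; then
    echo "usage: new_incident <incident-name> [root]" >&2
    return 1
  fi

  local dir
  dir="${root}/$(date +%Y-%m-%d)-${name}"
  if [ -e "$dir" ]; then
    echo "refusing to overwrite existing ${dir}" >&2
    return 1
  fi

  mkdir -p "${dir}/diagrams" || return 1
  : > "${dir}/README.md"                 # required incident summary
  : > "${dir}/log.md"                    # recommended recovery log
  echo "$dir"                            # print the path for follow-up commands
}
```

For example, `new_incident power-cut` run from the repository root would create a directory named after today's date followed by `-power-cut`.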

Incident Directory Contents

1. `README.md` (Required)

The primary incident document. Must include:

Incident ID: Unique identifier (e.g., 2026-04-13-001)
Title: Clear, descriptive title
Date/Time: Start and end timestamps
Status: Open / Investigating / Resolved / Monitoring
Severity: SEV-1 (Critical) / SEV-2 (High) / SEV-3 (Medium) / SEV-4 (Low)
Impact: Brief description of affected services
Summary: What happened
Timeline: Key events with timestamps
Root Cause: Technical analysis
Resolution: Steps taken to resolve
Action Items: Follow-up tasks
Lessons Learned: Key takeaways

Front matter template:

---
title: Incident Title
incident_id: YYYY-MM-DD-NNN
date: YYYY-MM-DD
time_start: HH:MM:SS UTC
time_end: HH:MM:SS UTC
status: Resolved
severity: SEV-2
tags:
  - kubernetes
  - longhorn
  - storage
---
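A quick sanity check of the front matter can catch missing fields before peer review. The sketch below is an assumption-laden illustration: `check_front_matter` is a hypothetical name, the three-digit `NNN` suffix is inferred from the `2026-04-13-001` example, and the function scans the whole file with `grep` rather than parsing YAML.

```shell
#!/usr/bin/env bash
# Hypothetical helper: verify that an incident README.md carries the expected keys.
# Simplification: lines are matched anywhere in the file, not only between the --- markers.
check_front_matter() {
  local file="${1:-}" key failed=0
  if [ ! -f "$file" ]; then
    echo "file not found: ${file}" >&2
    return 1
  fi

  # Keys that should be present and non-empty (time_end may be absent while open).
  for key in title incident_id date time_start status severity; do
    if ! grep -Eq "^${key}:[[:space:]]*[^[:space:]]" "$file"; then
      echo "missing or empty key: ${key}" >&2
      failed=1
    fi
  done

  # incident_id format assumed to be YYYY-MM-DD-NNN.
  if ! grep -Eq '^incident_id:[[:space:]]*[0-9]{4}-[0-9]{2}-[0-9]{2}-[0-9]{3}[[:space:]]*$' "$file"; then
    echo "incident_id does not match YYYY-MM-DD-NNN" >&2
    failed=1
  fi

  # Severity must be one of SEV-1 .. SEV-4.
  if ! grep -Eq '^severity:[[:space:]]*SEV-[1-4][[:space:]]*$' "$file"; then
    echo "severity must be SEV-1, SEV-2, SEV-3 or SEV-4" >&2
    failed=1
  fi
  return "$failed"
}
```

Running `check_front_matter docs/incidents/2026-04-13-power-cut/README.md` returns a non-zero status and prints one line per problem found.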

2. `log.md` (Recommended)

Detailed technical log of all recovery actions. Should include:

Commands executed with timestamps
Command output (relevant portions)
Decision rationale for each action
Outcome of each action
Next steps identified

Format:

## [Time] Action Description

**Command:** `actual command run`

**Output:**

relevant output


**Decision:** Why this action was taken

**Outcome:** What happened

**Next:** What to do next
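Entries in this format are easier to keep up to date when the command and its output are captured automatically. The following sketch assumes a hypothetical `log_action` helper and UTC timestamps (matching the front matter). It fills in the command, output, and exit status, and leaves the decision and next step as `TODO` for the responder to write.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command and append a log.md entry in the format above.
# Usage: log_action <log-file> <description> <command> [args...]
log_action() {
  local log_file="$1" description="$2"
  shift 2
  local fence='```' output status

  # Capture stdout and stderr together; remember the exit status either way.
  output="$("$@" 2>&1)" && status=0 || status=$?

  {
    printf '## [%s] %s\n\n' "$(date -u +%H:%M:%S)" "$description"
    printf '**Command:** `%s`\n\n' "$*"
    printf '**Output:**\n\n%s\n%s\n%s\n\n' "$fence" "$output" "$fence"
    printf '**Decision:** TODO\n\n'
    printf '**Outcome:** exit status %s\n\n' "$status"
    printf '**Next:** TODO\n\n'
  } >> "$log_file"

  return "$status"
}
```

For instance, `log_action log.md "List Longhorn pods" kubectl get pods -n longhorn-system` appends one entry and propagates the exit status of `kubectl`.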

3. Mermaid Diagrams

Include at least one Mermaid diagram in each incident to visualize:

Architecture/flow before incident
Failure propagation
Recovery process
New architecture after fixes

Example theme usage:

%%{init: { 'theme': 'forest', 'themeVariables': { 'primaryColor': '#ffdfd3', 'edgeLabelBackground':'#fff' }}}%%

Available themes: default, base, forest, dark, neutral

Recommended diagrams:

incident-flow.mmd: Timeline/flow of the incident
architecture.mmd: Affected components architecture
recovery-flow.mmd: Recovery steps visualization
dependency-tree.mmd: Component dependencies showing failure path

Incident Severity Definitions

Severity	Description	Response Time	Impact
SEV-1	Critical system-wide outage	Immediate (24/7)	Multiple services down, potential data loss
SEV-2	Major service degradation	< 1 hour	Single critical service down
SEV-3	Partial service degradation	< 4 hours	Non-critical service affected
SEV-4	Minor issue	Next business day	Cosmetic or non-impacting
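For scripts that page or notify responders, the table above can be encoded as a small lookup. This is only a sketch; `severity_response_time` is a hypothetical function name and the strings are copied from the table.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a severity label to the response time listed in the table above.
severity_response_time() {
  case "${1:-}" in
    SEV-1) echo "Immediate (24/7)" ;;
    SEV-2) echo "< 1 hour" ;;
    SEV-3) echo "< 4 hours" ;;
    SEV-4) echo "Next business day" ;;
    *)
      echo "unknown severity: ${1:-<empty>}" >&2
      return 1
      ;;
  esac
}
```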

Available Ansible Playbooks for Recovery

This collection provides comprehensive infrastructure management via Ansible. Always use -i inventory/hosts.yml when running playbooks.

Master Playbooks (Run in order for full recovery)

Playbook	Purpose	Targets
`playbooks/01_system.yml`	System setup (hostnames, iSCSI, Docker, Longhorn, DNS)	raspberries
`playbooks/02_setup.yml`	Infrastructure setup (NFS backup, PostgreSQL, Gitea)	localhost, postgres, gitea
`playbooks/03_cicd.yml`	CI/CD pipeline (Gitea tokens, Docker Compose, ArgoCD)	localhost, gitea
`playbooks/04_tools.yml`	Tool deployment (HashiCorp Vault, CrowdSec)	tools group
`playbooks/05_backup.yml`	Backup configuration	localhost
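Running the master playbooks in order, and stopping at the first failure, might look like the sketch below. `run_master_playbooks` and the `ANSIBLE_PLAYBOOK` / `INVENTORY` override variables are hypothetical conveniences (useful for dry runs and testing). Only the playbook paths and the inventory path come from this document.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run the five master playbooks in the documented order.
# Extra arguments (e.g. --check --diff) are passed through to every run.
run_master_playbooks() {
  local inventory="${INVENTORY:-inventory/hosts.yml}"
  local runner="${ANSIBLE_PLAYBOOK:-ansible-playbook}"
  local playbook

  for playbook in 01_system 02_setup 03_cicd 04_tools 05_backup; do
    echo ">>> playbooks/${playbook}.yml" >&2
    if ! "$runner" -i "$inventory" "playbooks/${playbook}.yml" "$@"; then
      # Later playbooks depend on earlier ones, so do not continue.
      echo "stopped: playbooks/${playbook}.yml failed" >&2
      return 1
    fi
  done
}
```

A cautious full recovery would start with `run_master_playbooks --check --diff` and only then run without `--check`.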

Component-Specific Playbooks

System

Playbook	Purpose	Notes
`playbooks/system/rpi.yml`	Raspberry Pi hostname setup
`playbooks/system/dns.yml`	DNS/Pi-hole configuration
`playbooks/system/ssl.yml`	SSL certificate setup with step-ca
`playbooks/system/prepare_disks.yml`	Disk partitioning and formatting
`playbooks/system/system_docker.yml`	Docker installation with custom storage	Storage at `/mnt/arcodange/docker`
`playbooks/system/k3s_config.yml`	K3s configuration (Traefik, Longhorn HelmCharts)	Key for k3s
`playbooks/system/system_k3s.yml`	K3s cluster deployment	Uses k3s-ansible collection
`playbooks/system/iscsi_longhorn.yml`	iSCSI client for Longhorn	Prerequisite for Longhorn
`playbooks/system/k3s_dns.yml`	K3s DNS configuration
`playbooks/system/k3s_ssl.yml`	K3s SSL/Traefik certificates

Storage

Playbook	Purpose	Notes
`playbooks/setup/backup_nfs.yml`	Longhorn RWX NFS backup volume	Creates 50Gi PVC + recurring backups
`playbooks/backup/k3s_pvc.yml`	PVC backup scripts	Creates `/opt/k3s_volumes/backup.sh` and `restore.sh`

Backup

Playbook	Purpose	Notes
`playbooks/backup/backup.yml`	Main backup orchestration	Calls postgres, gitea, k3s_pvc
`playbooks/backup/postgres.yml`	PostgreSQL database backup	Docker exec pg_dumpall
`playbooks/backup/gitea.yml`	Gitea backup	Uses gitea dump command
`playbooks/backup/cron_report.yml`	Mail utility for cron reports
`playbooks/backup/cron_report_mailutility.yml`	MTA configuration

Inventory File

File: inventory/hosts.yml

Groups:

raspberries: pi1, pi2, pi3 (Raspberry Pi nodes)
local: localhost, pi1, pi2, pi3
postgres: pi2 (PostgreSQL host)
gitea: pi2 (Gitea host, inherits postgres)
pihole: pi1, pi3 (DNS hosts)
step_ca: pi1, pi2, pi3 (Certificate authority)
all: All above groups

Important: All playbooks MUST be run with the -i inventory/hosts.yml flag:

ansible-playbook -i inventory/hosts.yml playbooks/01_system.yml
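Before running a playbook against a group, it can help to confirm which hosts the group resolves to. The sketch below wraps the `--list-hosts` option of the `ansible` CLI. `list_group_hosts` and the `ANSIBLE` / `INVENTORY` override variables are hypothetical names introduced for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: show which hosts an inventory group (or pattern) resolves to.
list_group_hosts() {
  local group="${1:-}"
  if [ -z "$group" ]; then
    echo "usage: list_group_hosts <group>" >&2
    return 1
  fi
  # --list-hosts prints the matching hosts and executes nothing on them.
  "${ANSIBLE:-ansible}" -i "${INVENTORY:-inventory/hosts.yml}" "$group" --list-hosts
}
```

With the groups listed above, `list_group_hosts pihole` would be expected to report pi1 and pi3.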

Handy Commands for Incident Response

# Check all pods
kubectl get pods -A

# Check Longhorn specifically
kubectl get pods -n longhorn-system
kubectl get volumes -n longhorn-system
kubectl get replicas -n longhorn-system

# Check storage
kubectl get pv -A
kubectl get pvc -A
kubectl get csidriver

# Check nodes
kubectl get nodes -o wide
kubectl describe node <nodename>

# Force Longhorn HelmChart reconcile (k3s-specific)
sudo touch /var/lib/rancher/k3s/server/manifests/longhorn-install.yaml

# Restart Longhorn (last resort: force-deletes every pod in the namespace without graceful shutdown)
kubectl delete pods -n longhorn-system --all --force --grace-period=0

# Check Longhorn data on disk
ls /mnt/arcodange/longhorn/replicas/

# Check Docker storage
ls /mnt/arcodange/docker/overlay2/ | head

# Run ansible playbook (dry-run first)
ansible-playbook -i inventory/hosts.yml playbooks/01_system.yml --check --diff
ansible-playbook -i inventory/hosts.yml playbooks/01_system.yml --limit pi1
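After restarting Longhorn, it is useful to block until the pods report Ready before touching volumes again. The following is a minimal sketch built on `kubectl wait`. `wait_longhorn_ready`, the `KUBECTL` override variable, and the 300-second default timeout are assumptions, and pods left over from completed Jobs may never become Ready, in which case the wait times out.

```shell
#!/usr/bin/env bash
# Hypothetical helper: wait until all pods in longhorn-system report Ready.
# Usage: wait_longhorn_ready [timeout]   (timeout defaults to 300s, an arbitrary choice)
wait_longhorn_ready() {
  local timeout="${1:-300s}"
  "${KUBECTL:-kubectl}" wait --for=condition=Ready pods --all \
    -n longhorn-system --timeout="$timeout"
}
```

A non-zero exit status means at least one pod did not become Ready within the timeout; `kubectl get pods -n longhorn-system` then shows which one.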

K3s-Specific Recovery Notes

Longhorn is installed via HelmChart manifest (k3s native):

File: /var/lib/rancher/k3s/server/manifests/longhorn-install.yaml
To trigger reconcile: touch the file (k3s watches for changes)
DO NOT use helm install directly: it may conflict with the k3s HelmChart controller

Traefik is also installed via HelmChart manifest:

File: /var/lib/rancher/k3s/server/manifests/traefik-v3.yaml
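The reconcile trigger described above can be wrapped so that a typo in the manifest path fails loudly instead of silently creating a new file. In this sketch, `reconcile_k3s_manifest` and the `SUDO` / `KUBECTL` override variables are hypothetical. Listing `helmcharts.helm.cattle.io` afterwards is one way to confirm that the HelmChart resources still exist.

```shell
#!/usr/bin/env bash
# Hypothetical helper: touch a k3s manifest to trigger a reconcile, then list HelmChart resources.
# Usage: reconcile_k3s_manifest [manifest-path]   (defaults to the Longhorn manifest)
reconcile_k3s_manifest() {
  local manifest="${1:-/var/lib/rancher/k3s/server/manifests/longhorn-install.yaml}"
  local sudo_cmd="${SUDO-sudo}"          # set SUDO= (empty) when already running as root

  if [ ! -f "$manifest" ]; then
    # Plain `touch` would create an empty manifest here, which is never what we want.
    echo "manifest not found: ${manifest}" >&2
    return 1
  fi

  # Intentionally unquoted so that an empty SUDO expands to nothing.
  $sudo_cmd touch "$manifest" || return 1

  "${KUBECTL:-kubectl}" get helmcharts.helm.cattle.io -A
}
```

The same function applies to Traefik by passing `/var/lib/rancher/k3s/server/manifests/traefik-v3.yaml` as the argument.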

Incident Templates

Quick Start Template

---
title: [Short Description]
incident_id: YYYY-MM-DD-NNN
date: $(date +%Y-%m-%d)
time_start: $(date +%H:%M:%S)
status: Investigating
severity: SEV-2
tags:
  - tag1
  - tag2
---

## Summary

[1-2 sentences describing the issue]

## Impact

[What services/users are affected]

## Timeline

| Time | Event | Owner |
|------|-------|-------|
| HH:MM | Initial detection | @user |
| HH:MM | Investigation started | @user |
| HH:MM | Root cause identified | @user |
| HH:MM | Resolution applied | @user |
| HH:MM | Service restored | @user |

## Root Cause

[Technical analysis]

## Resolution

[Step-by-step what was done]

## Mermaid Diagram

%%{init: { 'theme': 'forest' }}%%
graph TD
    A[Component A] -->|depends on| B[Component B]
    B -->|failed due to| C[Component C]
    C -->|power cut| D[Root Cause]

Remember to always do this for Mermaid labels:

put a space before a file path
avoid parentheses '()'
use <br> instead of \n for new lines

Action Items

Task 1
Task 2

Lessons Learned

Lesson 1
Lesson 2
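The `$(date ...)` placeholders in the Quick Start Template front matter expand only when the text passes through a shell. One way to do that is an unquoted here-document, sketched below. `render_front_matter` is a hypothetical helper, and using UTC (`date -u`) is an assumption made to match the `time_start: HH:MM:SS UTC` convention. Titles containing YAML-special characters such as `:` would need quoting by hand.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print Quick Start front matter with date and start time filled in.
# Usage: render_front_matter <title> <incident-id> [severity]
render_front_matter() {
  local title="${1:-}" incident_id="${2:-}" severity="${3:-SEV-2}"
  if [ -z "$title" ] || [ -z "$incident_id" ]; then
    echo "usage: render_front_matter <title> <incident-id> [severity]" >&2
    return 1
  fi

  # Unquoted EOF delimiter: variables and $(...) are expanded inside the here-document.
  cat <<EOF
---
title: ${title}
incident_id: ${incident_id}
date: $(date -u +%Y-%m-%d)
time_start: $(date -u +%H:%M:%S) UTC
status: Investigating
severity: ${severity}
---
EOF
}
```

For example, `render_front_matter "Longhorn volumes degraded" 2026-04-13-001 SEV-1 > README.md` starts a new incident file that the rest of the template can be appended to.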


## Contributing to Incident Documentation

1. **During Incident**: Focus on resolution, log commands and outputs in `log.md`
2. **After Resolution**: Create/update the `README.md` with full incident details
3. **Add Diagrams**: Include at least one Mermaid diagram to visualize the issue
4. **Peer Review**: Have another team member review before closing
5. **Update Templates**: Improve templates based on what was missing

## Directory Index

| Incident | Date | Severity | Status |
|----------|------|----------|--------|
| [2026-04-13-power-cut](./2026-04-13-power-cut/README.md) | 2026-04-13 | SEV-1 | In Progress |
