factory/ansible/arcodange/factory/docs/incidents/README.md
Gabriel Radureau 1ae28cb944 docs(longhorn): document 2026-04-13 power-cut recovery + add data-recovery tooling
Captures the post-mortem of the April 13 power-cut: incident timeline,
retrospective, and architecture/role diagrams. Adds an ADR explaining why
Longhorn cannot re-associate orphaned replica directories after a nuclear
reinstall (engine-id naming), plus block-device recovery runbooks and the
`playbooks/recover/longhorn_data.yml` automation that wires `merge-longhorn-layers.py`
to rebuild PVCs from raw `volume-head-*.img` chains.

Also extends the k3s_pvc backup to capture Longhorn `volumes`/`settings` CRDs
(needed for the fast-path restore) and rewrites the restore script with a
fallback dir + English messages.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 12:55:18 +02:00

Incident Documentation

This directory contains incident reports, postmortems, and recovery logs for the Arcodange Factory infrastructure.

Purpose

Document all infrastructure incidents to:

  • Track root causes and resolutions
  • Maintain a knowledge base for future troubleshooting
  • Improve system reliability through lessons learned
  • Provide clear guidance for on-call responders

Structure

Each incident is documented in its own directory under docs/incidents/ with the following naming convention:

docs/incidents/
├── YYYY-MM-DD-incident-name/
│   ├── README.md              # Incident summary and timeline
│   ├── status.md              # Real-time status updates (optional)
│   ├── log.md                 # Detailed recovery actions and logs
│   ├── root-cause.md          # Technical analysis (optional)
│   └── diagrams/              # Architecture/flow diagrams (optional)
│       └── *.mmd              # Mermaid diagrams
└── ...
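
A new incident directory following this layout could be scaffolded with a small helper. This is a minimal sketch: `new_incident` is a hypothetical function, not something shipped in the repository, and it assumes only `README.md`, `log.md`, and `diagrams/` should be created up front, leaving the optional files to the responder.

```shell
# Sketch only: new_incident is a hypothetical helper, not part of the repository.
new_incident() {
    # $1 = date (YYYY-MM-DD), $2 = short incident name,
    # $3 = base directory (defaults to docs/incidents)
    date_part=$1
    name=$2
    base=${3:-docs/incidents}
    dir="$base/$date_part-$name"

    # Refuse to overwrite an existing incident directory.
    if [ -e "$dir" ]; then
        echo "error: $dir already exists" >&2
        return 1
    fi

    mkdir -p "$dir/diagrams"
    # Create empty README.md and log.md; status.md and root-cause.md
    # are optional and left for the responder to add.
    : > "$dir/README.md"
    : > "$dir/log.md"
    echo "$dir"
}
```

For example, `new_incident 2026-04-13 power-cut` would create `docs/incidents/2026-04-13-power-cut/` and print its path.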

Incident Directory Contents

1. README.md (Required)

The primary incident document. Must include:

  • Incident ID: Unique identifier (e.g., 2026-04-13-001)
  • Title: Clear, descriptive title
  • Date/Time: Start and end timestamps
  • Status: Open / Investigating / Resolved / Monitoring
  • Severity: SEV-1 (Critical) / SEV-2 (High) / SEV-3 (Medium) / SEV-4 (Low)
  • Impact: Brief description of affected services
  • Summary: What happened
  • Timeline: Key events with timestamps
  • Root Cause: Technical analysis
  • Resolution: Steps taken to resolve
  • Action Items: Follow-up tasks
  • Lessons Learned: Key takeaways

Front matter template:

---
title: Incident Title
incident_id: YYYY-MM-DD-NNN
date: YYYY-MM-DD
time_start: HH:MM:SS UTC
time_end: HH:MM:SS UTC
status: Resolved
severity: SEV-2
tags:
  - kubernetes
  - longhorn
  - storage
---
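
The front matter can be checked mechanically before review. The sketch below is an illustration only: `check_front_matter` is a hypothetical helper, and it assumes `time_end` may be absent while an incident is still open, so that key is not checked.

```shell
# Sketch only: check_front_matter is a hypothetical helper.
check_front_matter() {
    # $1 = path to an incident README.md
    file=$1
    missing=0

    # Keep only the lines between the first two '---' delimiters.
    front=$(awk '$0 == "---" { n++; next } n == 1 { print } n >= 2 { exit }' "$file")

    # Assumption: time_end is optional while the incident is open.
    for key in title incident_id date time_start status severity tags; do
        if ! printf '%s\n' "$front" | grep -q "^$key:"; then
            echo "missing: $key"
            missing=1
        fi
    done
    return "$missing"
}
```

The function prints one `missing: <key>` line per absent key and returns a non-zero status if any key is missing, so it can gate a review step.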

2. log.md (Required)

Detailed technical log of all recovery actions. Must include:

  • Commands executed with timestamps
  • Command output (relevant portions)
  • Decision rationale for each action
  • Outcome of each action
  • Next steps identified

Format:

## [Time] Action Description

**Command:** `actual command run`

**Output:**

relevant output


**Decision:** Why this action was taken

**Outcome:** What happened

**Next:** What to do next
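
Entries in this format can be appended while a command is run, so the timestamp and output are captured without retyping. This is a rough sketch: `log_action` is a hypothetical helper, it records the exit status as the outcome, and it leaves the **Next:** field for the responder to write by hand.

```shell
# Sketch only: log_action is a hypothetical helper.
log_action() {
    # $1 = log file, $2 = action description,
    # $3 = command to run, $4 = decision rationale
    log_file=$1
    desc=$2
    cmd=$3
    decision=$4
    stamp=$(date -u +%H:%M:%S)

    # Run the command, capturing stdout and stderr together.
    output=$(sh -c "$cmd" 2>&1)
    status=$?

    # Append one entry in the log.md format.
    {
        printf '## [%s UTC] %s\n\n' "$stamp" "$desc"
        printf '**Command:** `%s`\n\n' "$cmd"
        printf '**Output:**\n\n%s\n\n' "$output"
        printf '**Decision:** %s\n\n' "$decision"
        printf '**Outcome:** exit status %s\n\n' "$status"
    } >> "$log_file"

    # Propagate the command's exit status to the caller.
    return "$status"
}
```

For example: `log_action log.md "List Longhorn pods" "kubectl get pods -n longhorn-system" "Confirm which managers are up"`.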

3. Mermaid Diagrams

Include at least one Mermaid diagram in each incident to visualize:

  • Architecture/flow before incident
  • Failure propagation
  • Recovery process
  • New architecture after fixes

Example theme usage:

%%{init: { 'theme': 'forest', 'themeVariables': { 'primaryColor': '#ffdfd3', 'edgeLabelBackground':'#fff' }}}%%

Available themes: default, base, forest, dark, neutral

Recommended diagrams:

  • incident-flow.mmd: Timeline/flow of the incident
  • architecture.mmd: Affected components architecture
  • recovery-flow.mmd: Recovery steps visualization
  • dependency-tree.mmd: Component dependencies showing failure path

Incident Severity Definitions

| Severity | Description | Response Time | Impact |
|----------|-------------|---------------|--------|
| SEV-1 | Critical system-wide outage | Immediate (24/7) | Multiple services down, potential data loss |
| SEV-2 | Major service degradation | < 1 hour | Single critical service down |
| SEV-3 | Partial service degradation | < 4 hours | Non-critical service affected |
| SEV-4 | Minor issue | Next business day | Cosmetic or non-impacting |

Available Ansible Playbooks for Recovery

This collection provides comprehensive infrastructure management via Ansible. Always use -i inventory/hosts.yml when running playbooks.

Master Playbooks (Run in order for full recovery)

| Playbook | Purpose | Targets |
|----------|---------|---------|
| playbooks/01_system.yml | System setup (hostnames, iSCSI, Docker, Longhorn, DNS) | raspberries |
| playbooks/02_setup.yml | Infrastructure setup (NFS backup, PostgreSQL, Gitea) | localhost, postgres, gitea |
| playbooks/03_cicd.yml | CI/CD pipeline (Gitea tokens, Docker Compose, ArgoCD) | localhost, gitea |
| playbooks/04_tools.yml | Tool deployment (Hashicorp Vault, Crowdsec) | tools group |
| playbooks/05_backup.yml | Backup configuration | localhost |
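
The ordered run can be sketched as a loop. `run_master_playbooks` below is a hypothetical wrapper, not part of the collection; it assumes a full recovery should stop at the first failing playbook, and by default it only prints the commands so the plan can be reviewed first.

```shell
# Sketch only: run_master_playbooks is a hypothetical wrapper.
run_master_playbooks() {
    # $1 = "plan" (default) prints the commands; "apply" executes them.
    mode=${1:-plan}
    for pb in 01_system 02_setup 03_cicd 04_tools 05_backup; do
        if [ "$mode" = "apply" ]; then
            # Assumption: stop at the first failure instead of
            # continuing with later stages.
            ansible-playbook -i inventory/hosts.yml "playbooks/$pb.yml" || return 1
        else
            echo "ansible-playbook -i inventory/hosts.yml playbooks/$pb.yml"
        fi
    done
}
```

Running `run_master_playbooks` with no argument prints the five commands in order; `run_master_playbooks apply` executes them.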

Component-Specific Playbooks

System

| Playbook | Purpose | Notes |
|----------|---------|-------|
| playbooks/system/rpi.yml | Raspberry Pi hostname setup | |
| playbooks/system/dns.yml | DNS/pi-hole configuration | |
| playbooks/system/ssl.yml | SSL certificate setup with step-ca | |
| playbooks/system/prepare_disks.yml | Disk partitioning and formatting | |
| playbooks/system/system_docker.yml | Docker installation with custom storage | Storage at /mnt/arcodange/docker |
| playbooks/system/k3s_config.yml | K3s configuration (Traefik, Longhorn HelmCharts) | Key for k3s |
| playbooks/system/system_k3s.yml | K3s cluster deployment | Uses k3s-ansible collection |
| playbooks/system/iscsi_longhorn.yml | iSCSI client for Longhorn | Prerequisite for Longhorn |
| playbooks/system/k3s_dns.yml | K3s DNS configuration | |
| playbooks/system/k3s_ssl.yml | K3s SSL/traefik certificates | |

Storage

| Playbook | Purpose | Notes |
|----------|---------|-------|
| playbooks/setup/backup_nfs.yml | Longhorn RWX NFS backup volume | Creates 50Gi PVC + recurring backups |
| playbooks/backup/k3s_pvc.yml | PVC backup scripts | Creates /opt/k3s_volumes/backup.sh and restore.sh |

Backup

| Playbook | Purpose | Notes |
|----------|---------|-------|
| playbooks/backup/backup.yml | Main backup orchestration | Calls postgres, gitea, k3s_pvc |
| playbooks/backup/postgres.yml | PostgreSQL database backup | Docker exec pg_dumpall |
| playbooks/backup/gitea.yml | Gitea backup | Uses gitea dump command |
| playbooks/backup/cron_report.yml | Mail utility for cron reports | |
| playbooks/backup/cron_report_mailutility.yml | MTA configuration | |

Inventory File

File: inventory/hosts.yml

Groups:

  • raspberries: pi1, pi2, pi3 (Raspberry Pi nodes)
  • local: localhost, pi1, pi2, pi3
  • postgres: pi2 (PostgreSQL host)
  • gitea: pi2 (Gitea host, inherits postgres)
  • pihole: pi1, pi3 (DNS hosts)
  • step_ca: pi1, pi2, pi3 (Certificate authority)
  • all: All above groups

Important: All playbooks MUST be run with -i inventory/hosts.yml flag:

ansible-playbook -i inventory/hosts.yml playbooks/01_system.yml

Handy Commands for Incident Response

# Check all pods
kubectl get pods -A

# Check Longhorn specifically
kubectl get pods -n longhorn-system
kubectl get volumes -n longhorn-system
kubectl get replicas -n longhorn-system

# Check storage
kubectl get pv -A
kubectl get pvc -A
kubectl get csidriver

# Check nodes
kubectl get nodes -o wide
kubectl describe node <nodename>

# Force Longhorn HelmChart reconcile (k3s-specific)
sudo touch /var/lib/rancher/k3s/server/manifests/longhorn-install.yaml

# Restart Longhorn
kubectl delete pods -n longhorn-system --all --force --grace-period=0

# Check Longhorn data on disk
ls /mnt/arcodange/longhorn/replicas/

# Check Docker storage
ls /mnt/arcodange/docker/overlay2/ | head

# Run ansible playbook (dry-run first)
ansible-playbook -i inventory/hosts.yml playbooks/01_system.yml --check --diff
ansible-playbook -i inventory/hosts.yml playbooks/01_system.yml --limit pi1

K3s-Specific Recovery Notes

Longhorn is installed via HelmChart manifest (k3s native):

  • File: /var/lib/rancher/k3s/server/manifests/longhorn-install.yaml
  • To trigger reconcile: touch the file (k3s watches for changes)
  • DO NOT use helm install directly - it may conflict with k3s HelmChart controller

Traefik is also installed via HelmChart manifest:

  • File: /var/lib/rancher/k3s/server/manifests/traefik-v3.yaml

Incident Templates

Quick Start Template

---
title: [Short Description]
incident_id: YYYY-MM-DD-NNN
date: $(date -u +%Y-%m-%d)
time_start: $(date -u +%H:%M:%S) UTC
status: Investigating
severity: SEV-2
tags:
  - tag1
  - tag2
---

## Summary

[1-2 sentences describing the issue]

## Impact

[What services/users are affected]

## Timeline

| Time | Event | Owner |
|------|-------|-------|
| HH:MM | Initial detection | @user |
| HH:MM | Investigation started | @user |
| HH:MM | Root cause identified | @user |
| HH:MM | Resolution applied | @user |
| HH:MM | Service restored | @user |

## Root Cause

[Technical analysis]

## Resolution

[Step-by-step what was done]

## Mermaid Diagram

%%{init: { 'theme': 'forest' }}%%
graph TD
    A[Component A] -->|depends on| B[Component B]
    B -->|failed due to| C[Component C]
    C -->|power cut| D[Root Cause]

Remember to always do this for labels:

  • have a space before a filepath
  • no parentheses '()'
  • use <br> instead of \n for new lines

## Action Items

  • Task 1
  • Task 2

## Lessons Learned

  • Lesson 1
  • Lesson 2

## Contributing to Incident Documentation

1. **During Incident**: Focus on resolution, log commands and outputs in `log.md`
2. **After Resolution**: Create or update the `README.md` with full incident details
3. **Add Diagrams**: Include at least one Mermaid diagram to visualize the issue
4. **Peer Review**: Have another team member review before closing
5. **Update Templates**: Improve templates based on what was missing

## Directory Index

| Incident | Date | Severity | Status |
|----------|------|----------|--------|
| [2026-04-13-power-cut](./2026-04-13-power-cut/README.md) | 2026-04-13 | SEV-1 | In Progress |
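
Index rows like the one above could be generated from each incident's front matter instead of being edited by hand. This is a sketch under assumptions: `front_matter_field` and `index_row` are hypothetical helpers, and each incident `README.md` is assumed to start with the front matter block described earlier, with `date`, `severity`, and `status` as single-line `key: value` entries.

```shell
# Sketch only: both functions are hypothetical helpers.
front_matter_field() {
    # $1 = README.md path, $2 = key; prints the value of a "key: value" line
    # found between the first two '---' delimiters.
    awk -v key="$2" '
        $0 == "---" { n++; next }
        n >= 2 { exit }
        n == 1 && index($0, key ": ") == 1 {
            print substr($0, length(key) + 3)
            exit
        }
    ' "$1"
}

index_row() {
    # $1 = incident directory, e.g. docs/incidents/2026-04-13-power-cut
    dir=${1%/}
    name=$(basename "$dir")
    readme="$dir/README.md"
    printf '| [%s](./%s/README.md) | %s | %s | %s |\n' \
        "$name" "$name" \
        "$(front_matter_field "$readme" date)" \
        "$(front_matter_field "$readme" severity)" \
        "$(front_matter_field "$readme" status)"
}
```

Looping `index_row` over `docs/incidents/*/` would produce the table body; the header rows would still be written once by hand.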