docs(vibe): add factory-provisioning guidebook (Ansible + OpenTofu)

Deep, code-grounded tree-docs guidebook under vibe/guidebooks/factory-provisioning/, explored from the actual playbooks/roles and tofu code:

- Hub: the two provisioning engines (operator-run Ansible vs CI-applied OpenTofu), a green-field bring-up flow, master index, maintenance rule.
- ansible/ sub-tree: ordered pages 01-system .. 06-recover, an inventory & variables concept page, and a Tier-1/Tier-2 roles reference (hashicorp_vault, step_ca, crowdsec, pihole, deploy_docker_compose + the gitea_* family and helpers).
- opentofu/ sub-tree: factory-iac (Cloudflare/OVH/GCP/Gitea/Vault edge + cloudflare_token module), postgres-iac (per-app DB/role/pgbouncer lookup), ci-apply-flow (Gitea OIDC-JWT -> Vault -> auto-approve apply).

Cross-linked bidirectionally with the lab-ecosystem guidebook and the safe-env ADR/PRD (the sandbox rehearses exactly these engines). 14 mermaid diagrams MCP-validated; zero dead links.

Authored by the Lab Cartographer cohort.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 21:11:51 +02:00
parent b886f06824
commit dbe32161dc
16 changed files with 1571 additions and 0 deletions
--- a/vibe/guidebooks/factory-provisioning/ansible/05-backup.md
+++ b/vibe/guidebooks/factory-provisioning/ansible/05-backup.md
@@ -0,0 +1,107 @@
+[vibe](../../../README.md) > [Guidebooks](../../README.md) > [Factory provisioning](../README.md) > [Ansible](README.md) > **05 · Backup**
+
+# 05 · Backup — daily cron dumps
+
+> [!NOTE]
+> **Status:** ✅ active · **Last Updated:** 2026-06-23
+> **Upstream:** [Ansible sub-hub](README.md) · [Factory provisioning hub](../README.md)
+> **Downstream:** [06 · Recover](06-recover.md) — how these dumps are replayed
+> **Related:** [Storage & recovery](../../lab-ecosystem/storage-and-recovery.md) · [04 · Tools](04-tools.md) · [ADR-0001 safe prod-like environment](../../../ADR/0001-safe-prod-like-environment.md)
+
+Stage 5 installs three independent **cron-driven backup jobs** that protect the platform's persistent state: the PostgreSQL database, the Gitea instance, and the K3s volume metadata (PV/PVC + Longhorn CRDs). The entry point [`playbooks/05_backup.yml`](../../../../ansible/arcodange/factory/playbooks/05_backup.yml) imports [`playbooks/backup/backup.yml`](../../../../ansible/arcodange/factory/playbooks/backup/backup.yml), which chains the three sub-playbooks and passes `backup_root_dir: /mnt/backups` to each.
+
+Every job follows the **same anatomy**: run a daily cron at **04:00**, write a date-stamped archive to `/mnt/backups/<kind>/`, prune anything older than **3 days**, and drop a matching `restore.sh` next to the backup script. `/mnt/backups` is a Longhorn RWX volume, so Longhorn itself snapshots, replicates, and ships these archives off-site — the cron jobs only produce the dumps.
+
+> [!NOTE]
+> All three sub-playbooks **install** scripts and cron entries; they do not run a backup themselves (beyond a one-shot `test backup_cmd` smoke check that pipes to `/dev/null`). The actual backups fire from cron. To read failures, SSH to the host and use `sudo su` → `mails` (see [`backup/README.md`](../../../../ansible/arcodange/factory/playbooks/backup/README.md)).
+
+---
+
+## The three jobs
+
+| Job | Sub-playbook | Host | Backup command | Artifact | Scripts dir |
+| --- | --- | --- | --- | --- | --- |
+| **Postgres** | [`backup/postgres.yml`](../../../../ansible/arcodange/factory/playbooks/backup/postgres.yml) | `postgres` | `docker exec <pg> pg_dumpall -U <user>` ∣ `gzip` | `backup_YYYYMMDD.sql.gz` | `…/docker_composes/postgres/scripts` |
+| **Gitea** | [`backup/gitea.yml`](../../../../ansible/arcodange/factory/playbooks/backup/gitea.yml) | `gitea` | `docker exec -u git <gitea> gitea dump --skip-log --skip-db --skip-package-data --type tar.gz` | `backup_YYYYMMDD.gitea.gz` | `…/docker_composes/gitea/scripts` |
+| **K3s PVC** | [`backup/k3s_pvc.yml`](../../../../ansible/arcodange/factory/playbooks/backup/k3s_pvc.yml) | `pi1` | `kubectl get pv,pvc` + `volumes.longhorn.io` + `settings.longhorn.io` (YAML) | `backup_YYYYMMDD.volumes` | `/opt/k3s_volumes` |
+
+All three share: `keep_days: 3`, cron `minute: 0 hour: 4 user: root`, and `backup_dir: /mnt/backups/<kind>`.
+
+```mermaid
+%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'14px'}}}%%
+flowchart TD
+  classDef cron fill:#5f4a1e,stroke:#d97706,color:#fffbeb;
+  classDef job fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb;
+  classDef store fill:#4c1d95,stroke:#7c3aed,color:#f5f3ff;
+  classDef ship fill:#14532d,stroke:#22c55e,color:#f0fdf4;
+
+  C["cron · daily 04:00 · user root"]:::cron
+  PG["postgres.yml<br/>pg_dumpall ∣ gzip"]:::job
+  GT["gitea.yml<br/>gitea dump tar.gz"]:::job
+  PV["k3s_pvc.yml<br/>PV · PVC · Longhorn CRDs"]:::job
+  D["/mnt/backups/{postgres,gitea,k3s_pvc}/<br/>keep 3 days"]:::store
+  L["Longhorn:<br/>snapshot · replicate · off-site"]:::ship
+
+  C --> PG --> D
+  C --> GT --> D
+  C --> PV --> D
+  D --> L
+```
+
+1. Each job has its own **root cron entry** on its target host, firing that job's `backup.sh` daily at **04:00** (the diagram collapses the three entries into one node).
+2. **postgres.yml** runs `pg_dumpall` through `gzip`, **gitea.yml** streams a `gitea dump` tarball, **k3s_pvc.yml** serialises the volume metadata.
+3. Each writes a date-stamped archive into `/mnt/backups/<kind>/` and prunes files older than 3 days (`find … -mtime +3 -delete`).
+4. Because `/mnt/backups` is a Longhorn RWX volume, Longhorn snapshots, replicates across nodes, and ships an off-site copy — no separate upload step in the cron.
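+
+The shared anatomy in the steps above can be sketched as one small shell function. This is an illustrative sketch only, not the generated `backup.sh`: the function name `run_backup`, its arguments, the `.dump` extension, and the write-to-temp-then-rename step are assumptions made for the example. Only the date-stamped file name and the `find … -mtime +3 -delete` prune come from this page.
+
+```shell
+#!/bin/sh
+# Hypothetical sketch of the shared backup anatomy; not the real backup.sh.
+# Usage: run_backup <backup_dir> <keep_days> <command> [args...]
+run_backup() {
+  backup_dir=$1
+  keep_days=$2
+  shift 2
+  mkdir -p "$backup_dir" || return 1
+  target="$backup_dir/backup_$(date +%Y%m%d).dump"
+  # Assumption: write to a temp name first so a failed dump
+  # never replaces a good archive from earlier the same day.
+  if "$@" > "$target.tmp"; then
+    mv "$target.tmp" "$target"
+  else
+    rm -f "$target.tmp"
+    return 1
+  fi
+  # Prune archives whose age, counted in whole 24-hour periods,
+  # is greater than keep_days (with keep_days=3 this is -mtime +3).
+  find "$backup_dir" -type f -name 'backup_*' -mtime +"$keep_days" -delete
+}
+```
+
+A call such as `run_backup /mnt/backups/postgres 3 some_dump_command` would mirror the Postgres job's shape. In this sketch a failed dump returns before the prune runs, so older archives are kept; whether the real scripts behave that way is not stated here.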
+
+---
+
+## Job details
+
+### Postgres — `postgres.yml`
+
+The backup command is built from the Postgres host's docker-compose facts (`container_name`, `POSTGRES_USER`). `pg_dumpall` captures **all databases plus globals (roles)** in one logical dump, gzipped. The generated `restore.sh` takes an optional `YYYYMMDD` argument (defaults to the latest dump), `docker cp`s it into the container, gunzips, and replays with `psql -f`. If the restore misbehaves, the script reminds you to wipe the data dir before replaying.
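+
+The "optional `YYYYMMDD`, else latest" selection described above might look like the sketch below. It is a hedged illustration, not the generated `restore.sh`: the function name `pick_dump` and its arguments are hypothetical, and it relies on the assumption that the fixed-width date stamp makes file names sort chronologically.
+
+```shell
+#!/bin/sh
+# Hypothetical sketch of dump selection; not the real restore.sh.
+# Usage: pick_dump <backup_dir> [YYYYMMDD]
+pick_dump() {
+  backup_dir=$1
+  day=${2:-}
+  dump=
+  if [ -n "$day" ]; then
+    # An explicit date selects exactly that day's archive.
+    dump="$backup_dir/backup_$day.sql.gz"
+  else
+    # Globs expand in sorted order, and date-stamped names sort
+    # chronologically, so the last match is the newest dump.
+    for f in "$backup_dir"/backup_*.sql.gz; do
+      [ -f "$f" ] && dump=$f
+    done
+  fi
+  if [ -z "$dump" ] || [ ! -f "$dump" ]; then
+    echo "no dump found in $backup_dir" >&2
+    return 1
+  fi
+  printf '%s\n' "$dump"
+}
+```
+
+The printed path would then feed the `docker cp`, gunzip, and `psql -f` steps described above.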
+
+### Gitea — `gitea.yml`
+
+The dump runs as the `git` user with `--skip-db` (Postgres is backed up separately by the Postgres job) and `--skip-package-data`, streamed to stdout (`-f -`) so it never lands on the container's own disk. The `restore.sh` unpacks the tarball back into `/data/gitea` (config/data) and `/data/git/repositories` (repos), fixes `git:git` ownership, and **regenerates hooks** (`gitea admin regenerate hooks`) — without that step the restored repos have stale hook paths.
+
+### K3s PVC — `k3s_pvc.yml`
+
+This job does **not** back up volume *data* (Longhorn handles the bytes). It backs up the **Kubernetes objects** needed to re-bind those volumes: all `pv` + `pvc`, the **`volumes.longhorn.io` CRDs**, and `settings.longhorn.io`, concatenated into one `.volumes` YAML (`---`-separated). It writes the dump to both `/mnt/backups/k3s_pvc/` *and* a copy alongside the script. The `restore.sh` looks first in a fallback dir (`/home/pi/arcodange/backups/k3s_pvc`) and then in the primary backup dir, picks the latest (or a dated) dump, and `kubectl apply`s it.
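+
+As a rough sketch of the concatenation described above, the function below emits the three object sets as one `---`-separated YAML stream. It is not the playbook's actual `backup_cmd`: the function name `dump_volume_metadata`, the `--all-namespaces` flag, and the `longhorn-system` namespace (Longhorn's default) are assumptions for illustration.
+
+```shell
+#!/bin/sh
+# Hypothetical sketch of the metadata dump; not the real backup_cmd.
+# Assumption: Longhorn is installed in the longhorn-system namespace.
+dump_volume_metadata() {
+  # PVs are cluster-scoped and PVCs are namespaced, so query all namespaces.
+  kubectl get pv,pvc --all-namespaces -o yaml || return 1
+  echo '---'
+  # The Longhorn Volume objects are what make fast recovery possible.
+  kubectl get volumes.longhorn.io -n longhorn-system -o yaml || return 1
+  echo '---'
+  kubectl get settings.longhorn.io -n longhorn-system -o yaml || return 1
+}
+```
+
+Redirecting the output, for example `dump_volume_metadata > backup_YYYYMMDD.volumes` with the real date substituted, gives a single file that `kubectl apply -f` can replay.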
+
+> [!IMPORTANT]
+> **Backing up the Longhorn `volumes.longhorn.io` CRDs is what enables *fast* recovery.** With the Volume CRDs in the backup, recovery is a single `kubectl apply` that re-associates the surviving on-disk replicas with their PVs (see [06 · Recover → `longhorn.yml`](06-recover.md)). **Without** the Volume CRDs, a Longhorn reinstall assigns **new engine IDs**, cannot adopt the orphaned replica directories, and you fall through to the slow **block-device data recovery** (`longhorn_data.yml`). The k3s_pvc backup_cmd carries an inline comment to this effect and points at the [Longhorn PVC recovery ADR](../../../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md). This is the prevention half of the [storage failure mode](../../lab-ecosystem/storage-and-recovery.md).
+
+---
+
+## Gotchas
+
+> [!WARNING]
+> - **3-day retention is tight.** If a problem goes unnoticed for more than 3 days (dumps of already-corrupted data, for example), every good local archive has already been pruned. The off-site Longhorn copy is the longer-horizon safety net; the local `/mnt/backups` files are short-lived.
+> - **The smoke test runs the real dump.** Each play has a `test backup_cmd` task that executes the backup command (output discarded) at provisioning time. If Postgres/Gitea/kubectl is unreachable when you run stage 5, provisioning fails fast — by design.
+> - **Cron runs as `root`, scripts live in app dirs.** The `backup.sh`/`restore.sh` are written into the app's docker-compose `scripts/` dir (or `/opt/k3s_volumes`); the cron job invokes them as root. Don't relocate the compose dirs without re-running stage 5.
+> - **Gitea restore needs the hook regeneration.** Skipping `gitea admin regenerate hooks` leaves repos with broken push hooks — the `restore.sh` already does it, so use the script rather than a manual untar.
+> - **Gitea's files and its database are backed up by *different* jobs.** Gitea dumps with `--skip-db`; its database rows come from the Postgres `pg_dumpall`. Restoring Gitea fully means restoring **both** archives.
+
+---
+
+## Where stage 5 sits
+
+```mermaid
+%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'14px'}}}%%
+flowchart LR
+  classDef done fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb;
+  classDef here fill:#4c1d95,stroke:#7c3aed,color:#f5f3ff;
+  classDef rec fill:#5f1e1e,stroke:#ef4444,color:#fef2f2;
+
+  s04["04 · Tools"]:::done
+  s05["05 · Backup<br/>Postgres · Gitea · K3s PVC"]:::here
+  rec["recover/*<br/>(on disaster)"]:::rec
+
+  s04 --> s05
+  s05 -. "feeds restore" .-> rec
+```
+
+1. **04 · Tools** stood up Vault, the secret store that stage 5's dumps help protect, and CrowdSec.
+2. **05 · Backup** (this page) is the last linear stage: it schedules the daily dumps.
+3. The artifacts here are the **input** to the on-demand [06 · Recover](06-recover.md) branch — the `.volumes` dump in particular gates whether recovery is fast (CRDs present) or slow (block-device).