factory/vibe/guidebooks/factory-provisioning/ansible/05-backup.md
Gabriel Radureau dbe32161dc docs(vibe): add factory-provisioning guidebook (Ansible + OpenTofu)
Deep, code-grounded tree-docs guidebook under vibe/guidebooks/factory-provisioning/,
explored from the actual playbooks/roles and tofu code:

- Hub: the two provisioning engines (operator-run Ansible vs CI-applied OpenTofu),
  a green-field bring-up flow, master index, maintenance rule.
- ansible/ sub-tree: ordered pages 01-system .. 06-recover, an inventory & variables
  concept page, and a Tier-1/Tier-2 roles reference (hashicorp_vault, step_ca,
  crowdsec, pihole, deploy_docker_compose + the gitea_* family and helpers).
- opentofu/ sub-tree: factory-iac (Cloudflare/OVH/GCP/Gitea/Vault edge +
  cloudflare_token module), postgres-iac (per-app DB/role/pgbouncer lookup),
  ci-apply-flow (Gitea OIDC-JWT -> Vault -> auto-approve apply).

Cross-linked bidirectionally with the lab-ecosystem guidebook and the safe-env
ADR/PRD (the sandbox rehearses exactly these engines). 14 mermaid diagrams
MCP-validated; zero dead links. Authored by the Lab Cartographer cohort.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 21:11:51 +02:00



05 · Backup — daily cron dumps

Note

Status: active · Last Updated: 2026-06-23
Upstream: Ansible sub-hub · Factory provisioning hub
Downstream: 06 · Recover (how these dumps are replayed)
Related: Storage & recovery · 04 · Tools · ADR-0001 safe prod-like environment

Stage 5 installs three independent cron-driven backup jobs that protect the platform's persistent state: the PostgreSQL database, the Gitea instance, and the K3s volume metadata (PV/PVC + Longhorn CRDs). The entry point playbooks/05_backup.yml imports playbooks/backup/backup.yml, which chains the three sub-playbooks, each passing backup_root_dir: /mnt/backups.

Every job follows the same anatomy: run a daily cron at 04:00, write a date-stamped archive to /mnt/backups/<kind>/, prune anything older than 3 days, and drop a matching restore.sh next to the backup script. /mnt/backups is a Longhorn RWX volume, so Longhorn itself snapshots, replicates, and ships these archives off-site — the cron jobs only produce the dumps.
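The shared anatomy can be sketched as a small shell helper. This is a minimal illustration under stated assumptions, not the generated backup.sh: the function name run_backup, its argument order, and the .partial temp-file convention are hypothetical; only the date-stamped name, the target directory layout, and the find-based pruning come from this page.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not the real backup.sh): run a dump command, store its
# stdout as a date-stamped archive, then prune old archives.
# Usage: run_backup <backup_dir> <suffix> <keep_days> <dump command...>
run_backup() {
  local backup_dir=$1 suffix=$2 keep_days=$3
  shift 3
  local archive
  archive="${backup_dir}/backup_$(date +%Y%m%d).${suffix}"

  mkdir -p "$backup_dir" || return 1

  # Write to a temporary name first so a failed dump never leaves a
  # truncated file under the final archive name.
  if ! "$@" > "${archive}.partial"; then
    rm -f "${archive}.partial"
    return 1
  fi
  mv "${archive}.partial" "$archive" || return 1

  # find's -mtime +N matches files whose age, truncated to whole days,
  # is greater than N.
  find "$backup_dir" -maxdepth 1 -type f -name 'backup_*' \
    -mtime "+${keep_days}" -delete

  printf '%s\n' "$archive"
}
```

A call such as `run_backup /mnt/backups/k3s_pvc volumes 3 some_dump_command` would mirror the K3s job's layout; `some_dump_command` is a placeholder for whatever produces the dump on stdout.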

Note

All three sub-playbooks install scripts and cron entries; they do not run a backup themselves (beyond a one-shot test backup_cmd smoke check that pipes to /dev/null). The actual backups fire from cron. To read failures, SSH to the host and use sudo sumails (see backup/README.md).


The three jobs

| Job | Sub-playbook | Host | Backup command | Artifact | Scripts dir |
|---|---|---|---|---|---|
| Postgres | backup/postgres.yml | postgres | docker exec <pg> pg_dumpall -U <user> \| gzip | backup_YYYYMMDD.sql.gz | …/docker_composes/postgres/scripts |
| Gitea | backup/gitea.yml | gitea | docker exec -u git <gitea> gitea dump --skip-log --skip-db --skip-package-data --type tar.gz | backup_YYYYMMDD.gitea.gz | …/docker_composes/gitea/scripts |
| K3s PVC | backup/k3s_pvc.yml | pi1 | kubectl get pv,pvc + volumes.longhorn.io + settings.longhorn.io (YAML) | backup_YYYYMMDD.volumes | /opt/k3s_volumes |

All three share: keep_days: 3, cron minute: 0 hour: 4 user: root, and backup_dir: /mnt/backups/<kind>.
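For readers less familiar with the Ansible cron parameters, minute: 0 and hour: 4 correspond to the crontab schedule `0 4 * * *`. The sketch below only renders that line for root's crontab; cron_line is a hypothetical helper, and the script path is an assumed example that combines the K3s scripts dir with the backup.sh name used on this page.

```shell
# Hypothetical helper: render a daily crontab line from minute/hour values.
# Usage: cron_line <minute> <hour> <script>
cron_line() {
  local minute=$1 hour=$2 script=$3
  # Day-of-month, month and day-of-week stay '*', meaning "every day".
  printf '%s %s * * * %s\n' "$minute" "$hour" "$script"
}

cron_line 0 4 /opt/k3s_volumes/backup.sh
# prints: 0 4 * * * /opt/k3s_volumes/backup.sh
```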

```mermaid
%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'14px'}}}%%
flowchart TD
  classDef cron fill:#5f4a1e,stroke:#d97706,color:#fffbeb;
  classDef job fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb;
  classDef store fill:#4c1d95,stroke:#7c3aed,color:#f5f3ff;
  classDef ship fill:#14532d,stroke:#22c55e,color:#f0fdf4;

  C["cron · daily 04:00 · user root"]:::cron
  PG["postgres.yml<br/>pg_dumpall | gzip"]:::job
  GT["gitea.yml<br/>gitea dump tar.gz"]:::job
  PV["k3s_pvc.yml<br/>PV · PVC · Longhorn CRDs"]:::job
  D["/mnt/backups/{postgres,gitea,k3s_pvc}/<br/>keep 3 days"]:::store
  L["Longhorn:<br/>snapshot · replicate · off-site"]:::ship

  C --> PG --> D
  C --> GT --> D
  C --> PV --> D
  D --> L
```

  1. A single daily 04:00 root cron triggers each job's backup.sh.
  2. postgres.yml runs pg_dumpall through gzip, gitea.yml streams a gitea dump tarball, k3s_pvc.yml serialises the volume metadata.
  3. Each writes a date-stamped archive into /mnt/backups/<kind>/ and prunes files older than 3 days (find … -mtime +3 -delete).
  4. Because /mnt/backups is a Longhorn RWX volume, Longhorn snapshots, replicates across nodes, and ships an off-site copy — no separate upload step in the cron.

Job details

Postgres — postgres.yml

The backup command is built from the Postgres host's docker-compose facts (container_name, POSTGRES_USER). pg_dumpall captures all databases plus globals (roles) in one logical dump, which is then gzipped. The generated restore.sh takes an optional YYYYMMDD argument (defaulting to the latest dump), copies the archive into the container with docker cp, decompresses it with gunzip, and replays it with psql -f. If the restore misbehaves, the script reminds you to wipe the data dir before replaying.
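The "optional date, else latest" selection described above might look like the following. This is a sketch, not the generated restore.sh: pick_dump is a hypothetical name, and it relies on the assumption that the backup_YYYYMMDD.sql.gz names sort chronologically.

```shell
# Hypothetical helper: resolve which Postgres dump to restore.
# Usage: pick_dump <backup_dir> [YYYYMMDD]
pick_dump() {
  local backup_dir=$1 stamp=${2:-}
  local candidate

  if [ -n "$stamp" ]; then
    candidate="${backup_dir}/backup_${stamp}.sql.gz"
  else
    # Date-stamped names sort chronologically, so the last entry is the newest.
    candidate=$(find "$backup_dir" -maxdepth 1 -type f -name 'backup_*.sql.gz' \
      | sort | tail -n 1)
  fi

  if [ -z "$candidate" ] || [ ! -f "$candidate" ]; then
    echo "no dump found in ${backup_dir}${stamp:+ for ${stamp}}" >&2
    return 1
  fi
  printf '%s\n' "$candidate"
}
```

The printed path would then feed the copy, decompress, and replay steps; failing early when no file matches avoids replaying an empty or missing dump.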

Gitea — gitea.yml

The dump runs as the git user with --skip-db (Postgres is backed up separately by the Postgres job) and --skip-package-data, streamed to stdout (-f -) so it never lands on the container's own disk. The restore.sh unpacks the tarball back into /data/gitea (config/data) and /data/git/repositories (repos), fixes git:git ownership, and regenerates hooks (gitea admin regenerate hooks) — without that step the restored repos have stale hook paths.
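The streaming dump can be sketched as below. The gitea dump flags are the ones listed on this page; the gitea_dump wrapper, its arguments, and the overridable DOCKER variable (there so the command can be dry-run without a Docker daemon) are illustrative assumptions.

```shell
# DOCKER is overridable, e.g. DOCKER=echo for a dry run (assumption for this sketch).
DOCKER=${DOCKER:-docker}

# Hypothetical wrapper: stream a Gitea dump to an archive outside the container.
# Usage: gitea_dump <container> <archive_path>
gitea_dump() {
  local container=$1 archive=$2
  # "-f -" sends the archive to stdout; the redirect stores it on the
  # calling side, so nothing is written inside the container.
  "$DOCKER" exec -u git "$container" \
    gitea dump --skip-log --skip-db --skip-package-data --type tar.gz -f - \
    > "$archive"
}
```

With `DOCKER=echo`, calling `gitea_dump <container> /tmp/dry-run.txt` writes the composed command line to the file instead of running it, which is a cheap way to review the flags.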

K3s PVC — k3s_pvc.yml

This job does not back up volume data (Longhorn handles the bytes). It backs up the Kubernetes objects needed to re-bind those volumes: all pv + pvc, the volumes.longhorn.io CRDs, and settings.longhorn.io, concatenated into one .volumes YAML file with --- as the document separator. It writes the dump to /mnt/backups/k3s_pvc/ and keeps a copy alongside the script. The restore.sh looks first in a fallback dir (/home/pi/arcodange/backups/k3s_pvc), then in the primary, picks the latest (or a dated) dump, and applies it with kubectl apply.
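A minimal sketch of the concatenation step follows. It is not the playbook's backup_cmd: dump_volumes is a hypothetical name, the --all-namespaces flag is an assumption about how the objects are listed, and KUBECTL is overridable only so the logic can be exercised without a cluster.

```shell
# KUBECTL is overridable for dry runs (assumption for this sketch).
KUBECTL=${KUBECTL:-kubectl}

# Hypothetical helper: write PV/PVC and Longhorn objects into one multi-document
# YAML file, then store it in two places.
# Usage: dump_volumes <primary_file> <copy_file>
dump_volumes() {
  local primary=$1 copy=$2 tmp resource first=1 status
  tmp=$(mktemp) || return 1

  for resource in pv,pvc volumes.longhorn.io settings.longhorn.io; do
    # Separate the YAML documents with the standard '---' marker.
    [ "$first" -eq 1 ] || echo '---' >> "$tmp"
    first=0
    if ! "$KUBECTL" get "$resource" --all-namespaces -o yaml >> "$tmp"; then
      rm -f "$tmp"   # never publish a partial dump
      return 1
    fi
  done

  cp "$tmp" "$primary" && cp "$tmp" "$copy"
  status=$?
  rm -f "$tmp"
  return "$status"
}
```

Building the file in a temporary location first means a failing kubectl call leaves the previous day's dump untouched in both destinations.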

Important

Backing up the Longhorn volumes.longhorn.io CRDs is what enables fast recovery. With the Volume CRDs in the backup, recovery is a single kubectl apply that re-associates the surviving on-disk replicas with their PVs (see 06 · Recover → longhorn.yml). Without the Volume CRDs, a Longhorn reinstall assigns new engine IDs, cannot adopt the orphaned replica directories, and you fall through to the slow block-device data recovery (longhorn_data.yml). The k3s_pvc backup_cmd carries an inline comment to this effect and points at the Longhorn PVC recovery ADR. This is the prevention half of the storage failure mode.
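Since the presence of the Volume CRDs decides between the fast and the slow path, it can be worth checking a dump before relying on it. The check below is a rough heuristic under stated assumptions: has_volume_crds is a hypothetical helper, and it assumes the dump is kubectl YAML output in which Longhorn volumes appear with an apiVersion under longhorn.io/ and a "kind: Volume" line.

```shell
# Hypothetical check: does a .volumes dump appear to contain Longhorn Volume objects?
# Usage: has_volume_crds <dump_file>
has_volume_crds() {
  local dump=$1
  # Missing or empty file: nothing to recover from.
  [ -s "$dump" ] || return 1
  # Require both a Longhorn apiVersion and an exact "kind: Volume" line;
  # "kind: PersistentVolume" does not match the second pattern.
  grep -q 'apiVersion: longhorn.io/' "$dump" &&
    grep -Eq '^[[:space:]-]*kind: Volume[[:space:]]*$' "$dump"
}
```

A text match like this does not validate the YAML or prove that every volume is present; it only flags the obvious case of a dump without any Longhorn Volume objects.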


Gotchas

Warning

  • 3-day retention is tight. A backup failure that goes unnoticed for more than 3 days can leave no usable local dump, because older archives are pruned. The off-site Longhorn copy is the longer-horizon safety net; the local /mnt/backups files are short-lived.
  • The smoke test runs the real dump. Each play has a test backup_cmd task that executes the backup command (output discarded) at provisioning time. If Postgres/Gitea/kubectl is unreachable when you run stage 5, provisioning fails fast — by design.
  • Cron runs as root, scripts live in app dirs. The backup.sh/restore.sh are written into the app's docker-compose scripts/ dir (or /opt/k3s_volumes); the cron job invokes them as root. Don't relocate the compose dirs without re-running stage 5.
  • Gitea restore needs the hook regeneration. Skipping gitea admin regenerate hooks leaves repos with broken push hooks — the restore.sh already does it, so use the script rather than a manual untar.
  • Postgres and Gitea DB are backed up by different jobs. Gitea dumps with --skip-db; its database rows come from the Postgres pg_dumpall. Restoring Gitea fully means restoring both archives.
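Given the tight retention, a simple freshness check can catch a silently failing job before the window closes. This is an optional sketch and not part of stage 5: check_fresh is a hypothetical helper, and the backup_* name pattern is taken from the artifact names on this page.

```shell
# Hypothetical check: succeed only if the directory holds a backup_* file
# modified less than <max_age_days> * 24 hours ago.
# Usage: check_fresh <backup_dir> <max_age_days>
check_fresh() {
  local backup_dir=$1 max_age_days=$2
  local hit
  # -mtime -N matches files whose age, truncated to whole days, is below N;
  # -quit stops at the first match.
  hit=$(find "$backup_dir" -maxdepth 1 -type f -name 'backup_*' \
    -mtime "-${max_age_days}" -print -quit 2>/dev/null)
  [ -n "$hit" ]
}
```

Run against each of /mnt/backups/postgres, /mnt/backups/gitea, and /mnt/backups/k3s_pvc with a max age of 1, a non-zero exit status would indicate that the last nightly run did not produce an archive.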

Where stage 5 sits

```mermaid
%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'14px'}}}%%
flowchart LR
  classDef done fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb;
  classDef here fill:#4c1d95,stroke:#7c3aed,color:#f5f3ff;
  classDef rec fill:#5f1e1e,stroke:#ef4444,color:#fef2f2;

  s04["04 · Tools"]:::done
  s05["05 · Backup<br/>Postgres · Gitea · K3s PVC"]:::here
  rec["recover/*<br/>(on disaster)"]:::rec

  s04 --> s05
  s05 -. "feeds restore" .-> rec
```

  1. 04 · Tools stood up Vault and CrowdSec — the secret store stage 5's dumps help protect.
  2. 05 · Backup (this page) is the last linear stage: it schedules the daily dumps.
  3. The artifacts here are the input to the on-demand 06 · Recover branch — the .volumes dump in particular gates whether recovery is fast (CRDs present) or slow (block-device).