Deep, code-grounded tree-docs guidebook under vibe/guidebooks/factory-provisioning/, explored from the actual playbooks/roles and tofu code: - Hub: the two provisioning engines (operator-run Ansible vs CI-applied OpenTofu), a green-field bring-up flow, master index, maintenance rule. - ansible/ sub-tree: ordered pages 01-system .. 06-recover, an inventory & variables concept page, and a Tier-1/Tier-2 roles reference (hashicorp_vault, step_ca, crowdsec, pihole, deploy_docker_compose + the gitea_* family and helpers). - opentofu/ sub-tree: factory-iac (Cloudflare/OVH/GCP/Gitea/Vault edge + cloudflare_token module), postgres-iac (per-app DB/role/pgbouncer lookup), ci-apply-flow (Gitea OIDC-JWT -> Vault -> auto-approve apply). Cross-linked bidirectionally with the lab-ecosystem guidebook and the safe-env ADR/PRD (the sandbox rehearses exactly these engines). 14 mermaid diagrams MCP-validated; zero dead links. Authored by the Lab Cartographer cohort. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
05 · Backup — daily cron dumps
Note
Status: ✅ active · Last Updated: 2026-06-23
Upstream: Ansible sub-hub · Factory provisioning hub
Downstream: 06 · Recover (how these dumps are replayed)
Related: Storage & recovery · 04 · Tools · ADR-0001 safe prod-like environment
Stage 5 installs three independent cron-driven backup jobs that protect the platform's persistent state: the PostgreSQL database, the Gitea instance, and the K3s volume metadata (PV/PVC + Longhorn CRDs). The entry point playbooks/05_backup.yml imports playbooks/backup/backup.yml, which chains the three sub-playbooks, each passing backup_root_dir: /mnt/backups.
Every job follows the same anatomy: run a daily cron at 04:00, write a date-stamped archive to /mnt/backups/<kind>/, prune anything older than 3 days, and drop a matching restore.sh next to the backup script. /mnt/backups is a Longhorn RWX volume, so Longhorn itself snapshots, replicates, and ships these archives off-site — the cron jobs only produce the dumps.
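The shared anatomy described above can be sketched as a single shell function. This is a minimal illustration, not the generated backup.sh: the function name `run_backup`, its argument order, and the write-to-temp-then-rename step are assumptions made for the example; only the `backup_YYYYMMDD.<suffix>` naming and the `find … -mtime … -delete` pruning come from this page.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the common backup anatomy (not the generated backup.sh).

# run_backup DIR SUFFIX KEEP_DAYS CMD...
# Runs CMD, stores its stdout as DIR/backup_YYYYMMDD.SUFFIX, prunes old files,
# and prints the path of the new archive.
run_backup() {
  local dir=$1 suffix=$2 keep_days=$3
  shift 3
  local stamp target
  stamp=$(date +%Y%m%d)
  target="$dir/backup_${stamp}.${suffix}"
  mkdir -p "$dir" || return 1
  # Assumption: write to a temp name first so a failed dump never leaves a
  # truncated file under the final archive name.
  if ! "$@" > "$target.tmp"; then
    rm -f "$target.tmp"
    return 1
  fi
  mv "$target.tmp" "$target" || return 1
  # Retention: delete archives whose mtime is past keep_days.
  find "$dir" -maxdepth 1 -type f -name 'backup_*' -mtime +"$keep_days" -delete
  printf '%s\n' "$target"
}
```

A pipeline such as the Postgres dump would need to be wrapped in a function or `sh -c '…'` before being passed as CMD, since `run_backup` executes a single command. Note that GNU `find -mtime +3` matches files older than three full 24-hour periods, so an archive can survive slightly longer than three calendar days.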
Note
All three sub-playbooks install scripts and cron entries; they do not run a backup themselves (beyond a one-shot test backup_cmd smoke check that pipes to /dev/null). The actual backups fire from cron. To read failures, SSH to the host and use sudo su → mails (see backup/README.md).
The three jobs
| Job | Sub-playbook | Host | Backup command | Artifact | Scripts dir |
|---|---|---|---|---|---|
| Postgres | backup/postgres.yml | postgres | docker exec <pg> pg_dumpall -U <user> ∣ gzip | backup_YYYYMMDD.sql.gz | …/docker_composes/postgres/scripts |
| Gitea | backup/gitea.yml | gitea | docker exec -u git <gitea> gitea dump --skip-log --skip-db --skip-package-data --type tar.gz | backup_YYYYMMDD.gitea.gz | …/docker_composes/gitea/scripts |
| K3s PVC | backup/k3s_pvc.yml | pi1 | kubectl get pv,pvc + volumes.longhorn.io + settings.longhorn.io (YAML) | backup_YYYYMMDD.volumes | /opt/k3s_volumes |
All three share: keep_days: 3, cron minute: 0 hour: 4 user: root, and backup_dir: /mnt/backups/<kind>.
%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'14px'}}}%%
flowchart TD
classDef cron fill:#5f4a1e,stroke:#d97706,color:#fffbeb;
classDef job fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb;
classDef store fill:#4c1d95,stroke:#7c3aed,color:#f5f3ff;
classDef ship fill:#14532d,stroke:#22c55e,color:#f0fdf4;
C["cron · daily 04:00 · user root"]:::cron
PG["postgres.yml<br/>pg_dumpall ∣ gzip"]:::job
GT["gitea.yml<br/>gitea dump tar.gz"]:::job
PV["k3s_pvc.yml<br/>PV · PVC · Longhorn CRDs"]:::job
D["/mnt/backups/{postgres,gitea,k3s_pvc}/<br/>keep 3 days"]:::store
L["Longhorn:<br/>snapshot · replicate · off-site"]:::ship
C --> PG --> D
C --> GT --> D
C --> PV --> D
D --> L
- A single daily 04:00 root cron triggers each job's backup.sh.
- postgres.yml runs pg_dumpall through gzip, gitea.yml streams a gitea dump tarball, k3s_pvc.yml serialises the volume metadata.
- Each writes a date-stamped archive into /mnt/backups/<kind>/ and prunes files older than 3 days (find … -mtime +3 -delete).
- Because /mnt/backups is a Longhorn RWX volume, Longhorn snapshots, replicates across nodes, and ships an off-site copy, so the cron needs no separate upload step.
Job details
Postgres — postgres.yml
The backup command is built from the Postgres host's docker-compose facts (container_name, POSTGRES_USER). pg_dumpall captures all databases plus globals (roles) in one logical dump, gzipped. The generated restore.sh takes an optional YYYYMMDD argument (defaulting to the latest dump), copies the archive into the container with docker cp, gunzips it, and replays it with psql -f. If the restore misbehaves, the script reminds you to wipe the data dir before replaying.
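The dump-selection step of that restore flow can be sketched as follows. This is an illustration under stated assumptions, not the generated restore.sh: the function names `pick_dump` and `restore_postgres`, the `/tmp/restore.sql*` paths inside the container, and the `-d postgres` target database are all hypothetical. The `docker cp`, `gunzip`, and `psql -f` steps themselves are the ones named above.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of "optional YYYYMMDD argument, defaults to the latest dump".

# pick_dump DIR SUFFIX [YYYYMMDD]
# Prints the dump for the given day, or the newest dump when no day is given.
pick_dump() {
  local dir=$1 suffix=$2 day=${3:-}
  local file
  if [ -n "$day" ]; then
    file="$dir/backup_${day}.${suffix}"
  else
    # YYYYMMDD stamps sort lexicographically, so the last name is the newest.
    file=$(find "$dir" -maxdepth 1 -type f -name "backup_*.${suffix}" | sort | tail -n 1)
  fi
  if [ -z "$file" ] || [ ! -f "$file" ]; then
    echo "no matching dump in $dir" >&2
    return 1
  fi
  printf '%s\n' "$file"
}

# restore_postgres CONTAINER USER DUMP
# Assumed container-side paths; requires docker and a running Postgres container.
restore_postgres() {
  local container=$1 user=$2 dump=$3
  docker cp "$dump" "$container:/tmp/restore.sql.gz" &&
    docker exec "$container" gunzip -f /tmp/restore.sql.gz &&
    docker exec "$container" psql -U "$user" -d postgres -f /tmp/restore.sql
}
```

Sorting by file name rather than by mtime keeps the choice stable even if an archive is copied or touched after it was written.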
Gitea — gitea.yml
The dump runs as the git user with --skip-db (Postgres is backed up separately by the Postgres job) and --skip-package-data, streamed to stdout (-f -) so it never lands on the container's own disk. The restore.sh unpacks the tarball back into /data/gitea (config/data) and /data/git/repositories (repos), fixes git:git ownership, and regenerates hooks (gitea admin regenerate hooks) — without that step the restored repos have stale hook paths.
K3s PVC — k3s_pvc.yml
This job does not back up volume data (Longhorn handles the bytes). It backs up the Kubernetes objects needed to re-bind those volumes: all pv + pvc, the volumes.longhorn.io CRDs, and settings.longhorn.io, concatenated into one .volumes YAML file (documents separated by ---). It writes the dump to both /mnt/backups/k3s_pvc/ and a copy alongside the script. The restore.sh checks a fallback dir (/home/pi/arcodange/backups/k3s_pvc) first, then the primary, picks the latest (or a dated) dump, and applies it with kubectl apply.
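One way to produce such a concatenated metadata dump is sketched below. The exact command lives in the k3s_pvc backup_cmd and is not reproduced on this page, so treat everything here as an assumption: the function name `dump_volume_metadata`, the `KUBECTL` override variable, the use of `--all-namespaces`, and the placement of the `---` separators are illustrative only.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: serialise PV/PVC and Longhorn objects into one YAML stream.

# dump_volume_metadata
# Writes one YAML document per resource group to stdout, each preceded by '---'.
# KUBECTL may point at a stub for dry runs; it defaults to the real kubectl.
dump_volume_metadata() {
  local kubectl=${KUBECTL:-kubectl}
  local resource
  for resource in pv,pvc volumes.longhorn.io settings.longhorn.io; do
    echo '---'
    # Abort on the first failing query so a partial dump is never mistaken
    # for a complete one.
    "$kubectl" get "$resource" --all-namespaces -o yaml || return 1
  done
}
```

Redirecting the output to a file named like `backup_YYYYMMDD.volumes` would give an artifact of the shape listed in the jobs table.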
Important
Backing up the Longhorn volumes.longhorn.io CRDs is what enables fast recovery. With the Volume CRDs in the backup, recovery is a single kubectl apply that re-associates the surviving on-disk replicas with their PVs (see 06 · Recover → longhorn.yml). Without the Volume CRDs, a Longhorn reinstall assigns new engine IDs, cannot adopt the orphaned replica directories, and you fall through to the slow block-device data recovery (longhorn_data.yml). The k3s_pvc backup_cmd carries an inline comment to this effect and points at the Longhorn PVC recovery ADR. This is the prevention half of the storage failure mode.
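Before a recovery, it can help to confirm which of the two paths a given dump supports. The sketch below is a text heuristic, not a YAML parser and not part of the generated scripts: the function names are hypothetical, and it assumes the dump was written by `kubectl get … -o yaml`, where each Longhorn volume appears with a `kind: Volume` line.

```shell
#!/usr/bin/env bash
# Hypothetical pre-recovery check on a .volumes dump.

# has_longhorn_volumes DUMP
# Succeeds when the dump contains at least one "kind: Volume" line. The
# end-of-line anchor keeps kinds such as VolumeAttachment from matching.
has_longhorn_volumes() {
  grep -Eq '^[[:space:]-]*kind:[[:space:]]*Volume[[:space:]]*$' "$1"
}

# recovery_path DUMP
# Prints the recover playbook this page associates with each case.
recovery_path() {
  if has_longhorn_volumes "$1"; then
    echo "longhorn.yml"        # fast path: Volume objects present
  else
    echo "longhorn_data.yml"   # slow path: block-device data recovery
  fi
}
```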
Gotchas
Warning
- 3-day retention is tight. A failure that goes unnoticed for 3 days loses all recoverable history. The off-site Longhorn copy is the longer-horizon safety net; the local /mnt/backups files are short-lived.
- The smoke test runs the real dump. Each play has a test backup_cmd task that executes the backup command (output discarded) at provisioning time. If Postgres/Gitea/kubectl is unreachable when you run stage 5, provisioning fails fast, by design.
- Cron runs as root, scripts live in app dirs. The backup.sh/restore.sh are written into the app's docker-compose scripts/ dir (or /opt/k3s_volumes); the cron job invokes them as root. Don't relocate the compose dirs without re-running stage 5.
- Gitea restore needs the hook regeneration. Skipping gitea admin regenerate hooks leaves repos with broken push hooks. The restore.sh already does it, so use the script rather than a manual untar.
- Postgres and Gitea DB are backed up by different jobs. Gitea dumps with --skip-db; its database rows come from the Postgres pg_dumpall. Restoring Gitea fully means restoring both archives.
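Given the tight retention window, a small freshness check can surface a silently failing job before the last good archive is pruned. This is a hypothetical helper, not something stage 5 installs: the function name and the 26-hour default (one daily run plus slack) are assumptions, and `-mmin` and `-quit` rely on GNU find.

```shell
#!/usr/bin/env bash
# Hypothetical freshness check for one backup directory.

# newest_dump_is_fresh DIR [MAX_AGE_MINUTES]
# Succeeds when DIR holds at least one backup_* file modified within the window.
newest_dump_is_fresh() {
  local dir=$1 max_age_min=${2:-1560}   # 1560 min = 26 h (assumed default)
  [ -n "$(find "$dir" -maxdepth 1 -type f -name 'backup_*' \
            -mmin -"$max_age_min" -print -quit)" ]
}
```

Called once per directory under /mnt/backups, a non-zero exit status indicates that the most recent cron run for that job produced nothing.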
Where stage 5 sits
%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'14px'}}}%%
flowchart LR
classDef done fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb;
classDef here fill:#4c1d95,stroke:#7c3aed,color:#f5f3ff;
classDef rec fill:#5f1e1e,stroke:#ef4444,color:#fef2f2;
s04["04 · Tools"]:::done
s05["05 · Backup<br/>Postgres · Gitea · K3s PVC"]:::here
rec["recover/*<br/>(on disaster)"]:::rec
s04 --> s05
s05 -. "feeds restore" .-> rec
- 04 · Tools stood up Vault and CrowdSec — the secret store stage 5's dumps help protect.
- 05 · Backup (this page) is the last linear stage: it schedules the daily dumps.
- The artifacts here are the input to the on-demand 06 · Recover branch; the .volumes dump in particular gates whether recovery is fast (CRDs present) or slow (block-device).