Dedicated tree-docs guidebook under vibe/guidebooks/erp/ for the lab's most data-critical app, cross-linked from the applications hub (bidirectional):

- README.md: Dolibarr 22.0.4 on Postgres; data-criticality; overview diagram; the Vault-unseal-before-scale recovery ordering (CAUTION).
- deployment.md: upstream image + custom entrypoint (MySQL->psql), the 50Gi Longhorn RWX documents PVC, Vault CRDs + the shared app_roles iac, init scripts (conf.php creds, table-ownership), ingress, CI.
- backup-and-recovery.md: the Ansible CronJob pg_dump (daily 04:00, 15-day retention) + restore Job (scale-0 -> restore -> scale-1); the cluster recovery ordering (Longhorn -> Vault unseal -> erp scale-up).
- operations.md: the read-only bin/arcodange CLI, static/company.json, Deno+Playwright tests, day-2 ops.

erp code via full gitea URLs; CLUSTER_RECOVERY.md by name; 2 mermaid diagrams MCP-validated; zero dead links.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Backup & recovery
Status: ✅ Active Last Updated: 2026-06-23 Upstream: ERP · Deployment Downstream: Operations Related: storage concept · factory recover playbooks · tools secrets-and-vso · factory postgres-iac
erp is the lab's single most data-critical application, so it carries its own backup/restore subsystem layered on top of the cluster's storage and secrets machinery. Two independent data stores have to survive an incident: the erp PostgreSQL database (captured by a pg_dump) and the uploaded documents on the Longhorn PVC (captured by Longhorn snapshots/backups, not the pg_dump). This page covers both, the daily backup CronJob, the restore Job, and the load-bearing recovery ordering that keeps erp from crash-looping during a cluster rebuild.
Backup mechanism
The recurring backup is an Ansible-deployed Kubernetes CronJob named dolibarr-backup in namespace erp, declared by ansible/arcodange/erp/playbooks/recurrentBackup.yml. Each scheduled tick spawns a one-shot postgres:16.3 Job that takes a logical dump of the erp database, gzips it, and lands the archive on the same Longhorn PVC that holds erp's documents.
The pipeline inside each run:
1. Detect version: `psql ... -c "SELECT value FROM llx_const WHERE name='MAIN_VERSION_LAST_UPGRADE';"` reads the live Dolibarr version straight from the database, so the archive name records exactly which schema it came from.
2. Dump + compress: `pg_dump -d erp --no-tablespaces --inserts | gzip > <archive>`. The `--inserts` flag emits row-by-row `INSERT` statements (portable, version-tolerant restores) and `--no-tablespaces` strips host-specific tablespace clauses.
3. Write to PVC: the archive lands at `/documents/admin/backup/pg_dump_erp_<version>_<timestamp>.sql.gz`, where the container mounts the `erp` PVC with `subPath: documents/admin/backup`.
4. Prune: `find /documents/admin/backup -name "pg_dump_erp_*.sql.gz" -type f -mtime +15 -delete` removes anything older than 15 days.
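The steps above can be sketched as a few shell functions. This is a minimal sketch under stated assumptions, not the contents of `recurrentBackup.yml`: the function names, the `BACKUP_DIR` variable, and the `%y%m%d%H%M` timestamp format (inferred from the example file name `pg_dump_erp_22.0.4_2606231819.sql.gz`) are all hypothetical. `run_backup` additionally assumes bash and the usual `PG*` connection variables in the environment.

```shell
# Illustrative sketch only; names and timestamp format are assumptions.
BACKUP_DIR="${BACKUP_DIR:-/documents/admin/backup}"

# Build the archive path from a Dolibarr version and a timestamp string.
archive_path() {
  printf '%s/pg_dump_erp_%s_%s.sql.gz\n' "$BACKUP_DIR" "$1" "$2"
}

# Delete archives matching the naming pattern with the page's find expression.
prune_archives() {
  find "$BACKUP_DIR" -name 'pg_dump_erp_*.sql.gz' -type f -mtime +15 -delete
}

# Full run: detect version, dump + compress, prune.
# Needs psql and pg_dump on PATH plus PGHOST/PGUSER/PGPASSWORD.
run_backup() {
  local version stamp out
  version="$(psql -d erp -At -c "SELECT value FROM llx_const WHERE name='MAIN_VERSION_LAST_UPGRADE';")"
  stamp="$(date +%y%m%d%H%M)"   # assumed timestamp format
  out="$(archive_path "$version" "$stamp")"
  # pipefail (bash) makes a failed pg_dump fail the whole pipeline
  ( set -o pipefail; pg_dump -d erp --no-tablespaces --inserts | gzip > "$out" ) || return 1
  prune_archives
}
```

Note that `find` counts age in whole 24-hour periods, so `-mtime +15` only matches files that are at least 16 full days old.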
DB credentials are supplied by the VSO-materialised vso-db-credentials secret (envFrom + PGPASSWORD from its password key), the same dynamic postgres/creds/erp secret the pod uses — see tools secrets-and-vso. The Job runs backoffLimit: 0 with restartPolicy: Never, so a failed run leaves an inspectable terminated pod rather than retrying blindly.
Schedule, retention & artifacts
| Property | Value | Source |
|---|---|---|
| Resource | CronJob `dolibarr-backup` (ns `erp`) | `recurrentBackup.yml` |
| Schedule | `0 4 * * *` (04:00 daily) | `spec.schedule` |
| Successful job history | `successfulJobsHistoryLimit: 3` | `spec` |
| Failed job history | `failedJobsHistoryLimit: 3` | `spec` |
| Retention | 15 days (`find -mtime +15 -delete`) | dump script |
| Dump image | `postgres:16.3` | `jobTemplate` container |
| Dump command | `pg_dump --no-tablespaces --inserts` (logical) | dump script |
| Compression | `gzip` (CronJob) / `tar -czf` (ad-hoc) | dump scripts |
| Archive path | `/documents/admin/backup/pg_dump_erp_<version>_<timestamp>.sql.gz` | mount + dump script |
| Mount | PVC `erp`, `subPath: documents/admin/backup` | `volumeMounts` |
| Failure policy | `backoffLimit: 0`, `restartPolicy: Never` | `jobTemplate` |
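Because the archive name encodes the Dolibarr version and a timestamp, both can be recovered from a file name with plain parameter expansion. A minimal sketch; the helper name `archive_fields` is hypothetical and it assumes the timestamp itself contains no underscore.

```shell
# Split "pg_dump_erp_<version>_<timestamp>.sql.gz" into "<version> <timestamp>".
archive_fields() {
  local name rest
  name="$(basename "$1" .sql.gz)"   # drop directory and extension
  rest="${name#pg_dump_erp_}"       # leaves "<version>_<timestamp>"
  # version = everything before the last "_", timestamp = everything after it
  printf '%s %s\n' "${rest%_*}" "${rest##*_}"
}
```

This is handy for checking which Dolibarr schema version an archive came from before restoring it.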
Ad-hoc & manual alternatives
Two escape hatches exist for an on-demand dump outside the 04:00 schedule:
| Tool | What it does | When to reach for it |
|---|---|---|
| `ansible/.../playbooks/backup.yml` | One-shot Ansible Job `dolibarr-backup` (`postgres:16.3`); fetches the ERP version by scraping `https://erp.arcodange.lab/`, dumps with the same `--no-tablespaces --inserts` flags, `tar -czf` into the PVC, and waits for completion | A single immediate dump driven from a control host that can reach the cluster API |
| `backup/create_backup.sh` | Pure kubectl shell: `kubectl run pg-dump-temp` (`postgres:16.3`) to `pg_dump -c` locally, then `kubectl cp` the archive into the running erp pod's `/var/www/documents/admin/backup/` | A laptop with kubectl access but no Ansible setup |
Warning
`pg_dump` and the server must match major versions. The lab's Postgres is 16.3, so every dump/restore container pins `postgres:16.3`. Dolibarr's built-in Tools → Database backup page (`/admin/tools/dolibarr_export.php`) historically shells out to the image's bundled `pg_dump` (e.g. 11.x), which aborts with `server version mismatch`. Use the CronJob, the Ansible playbooks, or `create_backup.sh`; never run the in-app export against a newer server.
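A pre-flight check for the version rule in this warning could look like the following. It is a sketch, not something the playbooks are known to ship: `pg_major` and `same_major` are hypothetical helper names, and the parsing assumes version strings shaped like `pg_dump (PostgreSQL) 16.3` or `16.3 (Debian ...)`.

```shell
# Extract the major version (first run of digits) from a version string.
pg_major() {
  printf '%s\n' "$1" | sed -E 's/^[^0-9]*([0-9]+).*/\1/'
}

# Succeed only when both version strings share the same major version.
same_major() {
  [ "$(pg_major "$1")" = "$(pg_major "$2")" ]
}
```

With connection variables set, a dump script could guard itself with `same_major "$(pg_dump --version)" "$(psql -At -c 'SHOW server_version;')" || exit 1`.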
What is — and is NOT — in the dump
The pg_dump captures only the relational database. Everything users upload lives on the Longhorn PVC and is protected by a completely separate mechanism. Conflating the two is the classic way to lose business records.
| Data | Where it lives | Protected by |
|---|---|---|
| Invoices, third parties, accounting rows, config rows (`llx_*` tables) | erp Postgres DB | `pg_dump` archives (this page) |
| Uploaded documents, generated PDFs, attachments | `/var/www/documents` on the Longhorn RWX PVC | Longhorn snapshots / backups |
| Custom modules / overrides | `/var/www/html/custom` on the same PVC | Longhorn snapshots / backups |
Important
A `pg_dump` alone does not make erp recoverable. A full recovery needs both the latest `pg_dump_erp_*.sql.gz` and the Longhorn-restored document volume. The backup archive itself sits on that same PVC (`/documents/admin/backup`), so it rides along with the Longhorn snapshot, but treat the database and the documents as two artifacts that must be restored together. See the storage concept for how the Longhorn volume is snapshotted and recovered.
Restore
The restore is the Ansible-driven Job dolibarr-restore from ansible/.../playbooks/restore.yml (postgres:16.3). It auto-discovers the most recent pg_dump_erp_*.sql.gz via ls -t ... | head -n1, or you can pin a specific archive with -e backup_file=.... It runs backoffLimit: 0 / restartPolicy: Never so a failed restore leaves a terminated pod you can inspect. Restoring into a live database corrupts state — scale erp to zero first.
Ordered procedure:
1. [AGENT] Confirm the cluster is healthy and inspect available archives (read-only).
2. [HUMAN] Scale the `erp` Deployment to 0 to stop all writes.
3. [HUMAN] Run the restore Job (latest archive, or a pinned `backup_file`); it extracts the archive with `tar -xzf` and loads it into the `erp` DB with `psql -f`.
4. [AGENT] Watch the Job to completion and read its logs.
5. [HUMAN] Scale the `erp` Deployment back to 1.
6. [AGENT] Validate erp is serving and the data is present.
```shell
# [AGENT] read-only: cluster health + list backup archives on the PVC
kubectl get deploy,pods -n erp
kubectl exec -n erp deploy/erp -- ls -t /var/www/documents/admin/backup/pg_dump_erp_*.sql.gz

# [HUMAN] prod-mutating: stop writes before restoring
kubectl scale deploy/erp -n erp --replicas=0
kubectl rollout status deploy/erp -n erp --watch=false # expect 0 replicas

# [HUMAN] prod-mutating: run the restore Job
# default = newest pg_dump_erp_*.sql.gz auto-discovered on the PVC
ansible-playbook ansible/arcodange/erp/playbooks/restore.yml
# or pin an explicit archive:
ansible-playbook ansible/arcodange/erp/playbooks/restore.yml \
  -e backup_file=/documents/admin/backup/pg_dump_erp_22.0.4_2606231819.sql.gz

# [AGENT] read-only: follow the restore Job and its logs
kubectl get job/dolibarr-restore -n erp -o wide
kubectl logs -n erp job/dolibarr-restore

# [HUMAN] prod-mutating: bring erp back up
kubectl scale deploy/erp -n erp --replicas=1
kubectl rollout status deploy/erp -n erp

# [AGENT] read-only: validate erp is serving after restore
kubectl get pods -n erp
kubectl exec -n erp deploy/erp -- curl -sf -o /dev/null -w '%{http_code}\n' http://localhost/
```
Warning
Always scale erp to 0 before restoring. The restore loads SQL straight into the live `erp` database; concurrent writes from a running Dolibarr pod produce a half-restored, inconsistent state. Scaling back to 1 only after the Job succeeds is part of the procedure, not an optional flourish.
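The archive selection the restore Job performs (a pinned `backup_file`, else the newest `pg_dump_erp_*.sql.gz`) can be sketched as below. The function name `pick_archive` and its argument order are hypothetical; only the `ls -t ... | head -n1` discovery comes from the description of `restore.yml` above.

```shell
# Print the archive to restore: the explicit override when given,
# otherwise the most recently modified pg_dump_erp_*.sql.gz in the directory.
pick_archive() {
  local dir="$1" override="${2:-}" newest
  if [ -n "$override" ]; then
    printf '%s\n' "$override"
    return 0
  fi
  newest="$(ls -t "$dir"/pg_dump_erp_*.sql.gz 2>/dev/null | head -n1)"
  if [ -z "$newest" ]; then
    echo "no pg_dump_erp_*.sql.gz archive found in $dir" >&2
    return 1
  fi
  printf '%s\n' "$newest"
}
```

Failing loudly when no archive exists is a deliberate choice in this sketch: an empty file name passed on to `psql -f` would otherwise produce a confusing error later in the Job.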
Recovery ordering (cluster rebuild)
Caution
Vault MUST be unsealed before erp is scaled up. The Dolibarr pod has no DB credentials of its own; it depends entirely on VSO materialising `vso-db-credentials` from `postgres/creds/erp` (`DOLI_DB_USER` / `DOLI_DB_PASSWORD`). If erp is scaled up while Vault is still sealed, VSO cannot reconcile the secret and the pod crash-loops with no database access. During a cluster rebuild the order is fixed:

1. Recover Longhorn volumes: bring the document PVC (and the `documents/admin/backup` archives riding on it) back online.
2. Unseal Vault: so VSO can issue erp's dynamic DB credentials and static config.
3. Scale erp to 1: only now does the pod come up with usable creds.
4. (Optional) restore data: if the DB needs rolling back to a `pg_dump`, scale to 0, run the restore Job, scale back to 1 (see Restore above).

This sequence is the storage→secrets→apps backbone described in the storage concept and executed by the factory recover playbooks; the cluster-wide ordering lives in the CLUSTER_RECOVERY.md runbook.
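The unseal-before-scale rule can be enforced with a small guard around the scale-up command. This is a sketch, not part of the recover playbooks: `vault_state` and `scale_erp_if_unsealed` are hypothetical names, and it assumes the operator's shell has a `vault` CLI pointed at the lab's Vault plus `kubectl` access. It relies on the documented exit codes of `vault status` (0 unsealed, 2 sealed, 1 error).

```shell
# Map the exit code of `vault status` to a state name.
vault_state() {
  case "$1" in
    0) echo unsealed ;;
    2) echo sealed ;;
    *) echo error ;;
  esac
}

# [HUMAN] prod-mutating: scale erp up only when Vault reports unsealed.
scale_erp_if_unsealed() {
  local rc=0 state
  vault status >/dev/null 2>&1 || rc=$?
  state="$(vault_state "$rc")"
  if [ "$state" != "unsealed" ]; then
    echo "refusing to scale erp: Vault is $state" >&2
    return 1
  fi
  kubectl scale deploy/erp -n erp --replicas=1
}
```

Treating every non-zero, non-2 exit code as `error` keeps the guard conservative: an unreachable Vault blocks the scale-up just as a sealed one does.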
The ownership fix
After activating a new Dolibarr module — or whenever the dynamic DB user rotates and creates tables under a fresh role — public schema tables can end up owned by a credential that no longer exists, breaking subsequent migrations and dumps. Two SQL helpers reassign ownership back to the stable erp_role:
| Script | Mechanism | Use |
|---|---|---|
| `backup/erp_role_as_table_owner.sql` | Loops every public table and `ALTER TABLE ... OWNER TO erp_role` | Force-set ownership table-by-table |
| `chart/scripts/update_ownership.sql` | Detects the current schema owner and `REASSIGN OWNED BY <owner> TO erp_role` only when it differs | The idempotent, chart-shipped fix |
The chart wires update_ownership.sql into a pg-fix-table-ownership CronJob; trigger it on demand after a module activation:
```shell
# [HUMAN] prod-mutating: reassign public-schema table ownership to erp_role
kubectl create job \
  --from=cronjob/pg-fix-table-ownership \
  pg-fix-table-ownership-manual-trigger-$(date +%Y%m%d%H%M%S) \
  -n kube-system
```
Run this before a backup if you suspect ownership drift, so the dump records the correct owner. More on day-to-day fix-ups and audits in Operations.
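To check for ownership drift before deciding whether the fix is needed, a read-only query against `pg_tables` is enough. A minimal sketch; the helper name `drifted_tables` is hypothetical, and it expects `table|owner` lines as produced by `psql -At`.

```shell
# Read "table|owner" lines on stdin and print the tables
# whose owner differs from the expected role (default: erp_role).
drifted_tables() {
  local expected="${1:-erp_role}"
  awk -F'|' -v want="$expected" '$2 != want { print $1 }'
}
```

With connection variables set, `psql -d erp -At -c "SELECT tablename, tableowner FROM pg_tables WHERE schemaname='public';" | drifted_tables` prints nothing when every public table is already owned by `erp_role`.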
Flow
```mermaid
%%{init: {'theme': 'base'}}%%
flowchart LR
    classDef sched fill:#2563eb,stroke:#1e40af,color:#fff
    classDef proc fill:#059669,stroke:#047857,color:#fff
    classDef store fill:#7c3aed,stroke:#6d28d9,color:#fff
    classDef db fill:#b45309,stroke:#92400e,color:#fff
    VSO["VSO secret<br>vso-db-credentials<br>(postgres/creds/erp)"]:::store
    CRON["CronJob dolibarr-backup<br>schedule 0 4 * * *"]:::sched
    DUMPJOB["pg_dump Job<br>postgres:16.3"]:::proc
    GZIP["gzip stream<br>--inserts --no-tablespaces"]:::proc
    PVC["Longhorn RWX PVC<br> /documents/admin/backup<br>pg_dump_erp_*.sql.gz"]:::store
    RESTOREJOB["Restore Job dolibarr-restore<br>postgres:16.3"]:::proc
    PSQL["psql -f dump.sql"]:::proc
    DB["erp Postgres DB<br>via pgbouncer.tools"]:::db
    CRON -- "spawns" --> DUMPJOB
    DUMPJOB -- "pg_dump" --> GZIP
    GZIP -- "writes archive" --> PVC
    PVC -- "ls -t latest .sql.gz" --> RESTOREJOB
    RESTOREJOB -- "tar -xzf then" --> PSQL
    PSQL -- "loads into" --> DB
    DUMPJOB -- "dumps from" --> DB
    VSO -. "DB creds" .-> DUMPJOB
    VSO -. "DB creds" .-> RESTOREJOB
```
1. The CronJob `dolibarr-backup` fires at `0 4 * * *` and spawns a `pg_dump` Job (`postgres:16.3`).
2. The Job dumps the live `erp` database (logical, `--inserts --no-tablespaces`), reading credentials from the VSO `vso-db-credentials` secret.
3. The dump streams through gzip and the resulting `pg_dump_erp_<version>_<timestamp>.sql.gz` is written to `/documents/admin/backup` on the Longhorn RWX PVC; archives older than 15 days are pruned.
4. On restore, the `dolibarr-restore` Job picks the newest `.sql.gz` (`ls -t | head -n1`, or a pinned `backup_file`) from the PVC, also using the VSO credentials.
5. The restore Job extracts the archive with `tar -xzf` and loads it back into the `erp` database with `psql -f` (with erp scaled to 0 first).
Gotchas
Warning
- 15-day retention only. The CronJob deletes any `pg_dump_erp_*.sql.gz` older than 15 days. If you need long-term or compliance copies, pull archives off-cluster before they age out; nothing here keeps a month-old dump.
- Version match is mandatory. `pg_dump` / `psql` major version must equal the server's (16.x). Every Job pins `postgres:16.3`; the in-app Dolibarr export against the newer server aborts with `server version mismatch`.
- Scale to 0 before restore. Restoring into a running erp produces an inconsistent database; scale the Deployment to 0, restore, then back to 1.
- Vault unseal precedes scale-up. erp's DB creds come from VSO; a sealed Vault means a crash-looping pod. Follow the recovery ordering on any rebuild.
- The admin/Postgres password lives in OpenTofu state. The per-app database and role are declared in IaC, so the authoritative credential material is held in the TF state; treat that state as a secret and recover it alongside Vault. See factory postgres-iac.
Cross-references
- Deployment: the chart, the document PVC, and the Vault CRDs (dynamic creds + static config) that this subsystem depends on.
- Operations: day-to-day operational tasks including the table-ownership fix-ups and liveness checks.
- storage concept: how the Longhorn document PVC (and its riding backup archives) is snapshotted and recovered.
- factory recover playbooks: the Ansible recovery steps that must run before erp is scaled back up.
- tools secrets-and-vso: the VSO runtime that materialises `vso-db-credentials`, feeding both the backup and restore Jobs.
- factory postgres-iac: the per-app `erp` PostgreSQL database + role, and the TF state that holds its admin password.