
Gabriel Radureau 7bf83e75ed docs(vibe): add erp/ guidebook (Dolibarr deployment + backup/recovery + ops)

Dedicated tree-docs guidebook under vibe/guidebooks/erp/ for the lab's most
data-critical app, cross-linked from the applications hub (bidirectional):

- README.md             : Dolibarr 22.0.4 on Postgres; data-criticality; overview
  diagram; the Vault-unseal-before-scale recovery ordering (CAUTION).
- deployment.md         : upstream image + custom entrypoint (MySQL->psql), the
  50Gi Longhorn RWX documents PVC, Vault CRDs + the shared app_roles iac, init
  scripts (conf.php creds, table-ownership), ingress, CI.
- backup-and-recovery.md: the Ansible CronJob pg_dump (daily 04:00, 15-day
  retention) + restore Job (scale-0 -> restore -> scale-1); the cluster recovery
  ordering (Longhorn -> Vault unseal -> erp scale-up).
- operations.md         : the read-only bin/arcodange CLI, static/company.json,
  Deno+Playwright tests, day-2 ops.

erp code via full gitea URLs; CLUSTER_RECOVERY.md by name; 2 mermaid diagrams
MCP-validated; zero dead links.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-23 22:12:11 +02:00



Backup & recovery

Status: ✅ Active | Last Updated: 2026-06-23 | Upstream: ERP · Deployment | Downstream: Operations | Related: storage concept · factory recover playbooks · tools secrets-and-vso · factory postgres-iac

erp is the lab's single most data-critical application, so it carries its own backup/restore subsystem layered on top of the cluster's storage and secrets machinery. Two independent data stores have to survive an incident: the erp PostgreSQL database (captured by a pg_dump) and the uploaded documents on the Longhorn PVC (captured by Longhorn snapshots/backups, not the pg_dump). This page covers both, the daily backup CronJob, the restore Job, and the load-bearing recovery ordering that keeps erp from crash-looping during a cluster rebuild.

Backup mechanism

The recurring backup is an Ansible-deployed Kubernetes CronJob named dolibarr-backup in namespace erp, declared by ansible/arcodange/erp/playbooks/recurrentBackup.yml. Each scheduled tick spawns a one-shot postgres:16.3 Job that takes a logical dump of the erp database, gzips it, and lands the archive on the same Longhorn PVC that holds erp's documents.

The pipeline inside each run:

1. Detect version: psql ... -c "SELECT value FROM llx_const WHERE name='MAIN_VERSION_LAST_UPGRADE';" reads the live Dolibarr version straight from the database, so the archive name records exactly which schema it came from.
2. Dump + compress: pg_dump -d erp --no-tablespaces --inserts | gzip > <archive>. The --inserts flag emits row-by-row INSERT statements (portable, version-tolerant restores) and --no-tablespaces strips host-specific tablespace clauses.
3. Write to PVC: the archive lands at /documents/admin/backup/pg_dump_erp_<version>_<timestamp>.sql.gz, where the container mounts the erp PVC with subPath: documents/admin/backup.
4. Prune: find /documents/admin/backup -name "pg_dump_erp_*.sql.gz" -type f -mtime +15 -delete removes anything older than 15 days.
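
The four steps above can be sketched as a small shell script. This is an illustration under stated assumptions, not the script embedded in recurrentBackup.yml: the function names, the BACKUP_DIR variable, the psql -At flags, and the yymmddHHMM timestamp format (inferred from the example archive name pg_dump_erp_22.0.4_2606231819.sql.gz) are all hypothetical, and connection settings are expected to come from the standard PG* environment variables.

```shell
# Sketch only: helper names and BACKUP_DIR are hypothetical.
BACKUP_DIR="${BACKUP_DIR:-/documents/admin/backup}"

# Compose the archive path from a Dolibarr version and a timestamp.
# The timestamp format (yymmddHHMM) is an assumption.
archive_path() {
  printf '%s/pg_dump_erp_%s_%s.sql.gz\n' \
    "$BACKUP_DIR" "$1" "${2:-$(date +%y%m%d%H%M)}"
}

# Step 4: drop archives past the retention window (default 15 days).
prune_archives() {
  find "$BACKUP_DIR" -name 'pg_dump_erp_*.sql.gz' -type f \
    -mtime +"${1:-15}" -delete
}

# Steps 1-4 in order. Relies on PGHOST/PGUSER/PGPASSWORD being set.
run_backup() {
  # Step 1: read the live Dolibarr version (-A unaligned, -t tuples only).
  version=$(psql -d erp -At -c \
    "SELECT value FROM llx_const WHERE name='MAIN_VERSION_LAST_UPGRADE';") || return 1
  [ -n "$version" ] || return 1
  # Steps 2-3: logical dump, compressed straight onto the PVC.
  # Without `set -o pipefail` the pipeline reports gzip's status only,
  # so a failing pg_dump would go unnoticed here.
  pg_dump -d erp --no-tablespaces --inserts | gzip > "$(archive_path "$version")" || return 1
  prune_archives 15
}
```

Calling run_backup inside a postgres:16.3 container that mounts the backup directory would reproduce one tick of the CronJob; archive_path and prune_archives can be exercised on their own against any scratch directory.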

DB credentials are supplied by the VSO-materialised vso-db-credentials secret (envFrom + PGPASSWORD from its password key), the same dynamic postgres/creds/erp secret the pod uses — see tools secrets-and-vso. The Job runs backoffLimit: 0 with restartPolicy: Never, so a failed run leaves an inspectable terminated pod rather than retrying blindly.
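
Because a failed run leaves its pod behind, the last backup can be inspected after the fact. The sketch below is a read-only helper with hypothetical function names; it assumes the usual Kubernetes conventions that CronJob-spawned Jobs are named dolibarr-backup-<suffix> and that their pods carry the job-name label.

```shell
# Name of the most recently created Job spawned by the dolibarr-backup CronJob.
latest_backup_job() {
  kubectl get jobs -n erp --sort-by=.metadata.creationTimestamp -o name \
    | grep '^job.batch/dolibarr-backup-' | tail -n 1
}

# Print the pod status and logs of that Job (read-only).
inspect_latest_backup() {
  job=$(latest_backup_job)
  [ -n "$job" ] || { echo "no dolibarr-backup Job found" >&2; return 1; }
  # Strip the "job.batch/" prefix to build the pod label selector.
  kubectl get pods -n erp -l "job-name=${job#job.batch/}" -o wide
  kubectl logs -n erp "$job"
}
```

Running inspect_latest_backup the morning after a failed 04:00 tick should show the terminated pod and the pg_dump error, as long as the Job is still within failedJobsHistoryLimit.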

Schedule, retention & artifacts

Property	Value	Source
Resource	CronJob `dolibarr-backup` (ns `erp`)	recurrentBackup.yml
Schedule	`0 4 * * *` (04:00 daily)	`spec.schedule`
Successful job history	`successfulJobsHistoryLimit: 3`	`spec`
Failed job history	`failedJobsHistoryLimit: 3`	`spec`
Retention	15 days (`find -mtime +15 -delete`)	dump script
Dump image	`postgres:16.3`	`jobTemplate` container
Dump command	`pg_dump --no-tablespaces --inserts` (logical)	dump script
Compression	`gzip` (CronJob) / `tar -czf` (ad-hoc)	dump scripts
Archive path	`/documents/admin/backup/pg_dump_erp_<version>_<timestamp>.sql.gz`	mount + dump script
Mount	PVC `erp`, `subPath: documents/admin/backup`	`volumeMounts`
Failure policy	`backoffLimit: 0`, `restartPolicy: Never`	`jobTemplate`
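
A daily schedule implies that a healthy backup directory always holds an archive less than roughly a day old. The following freshness check is a minimal sketch of that idea; the function name and the 26-hour default (one cycle plus slack) are assumptions, and -mmin is a GNU/BSD find extension rather than POSIX.

```shell
# Succeeds if at least one archive in directory $1 is newer than $2 hours.
# Default threshold: 26 hours (hypothetical: one daily cycle plus slack).
backup_is_fresh() {
  dir="$1"
  max_hours="${2:-26}"
  [ -n "$(find "$dir" -name 'pg_dump_erp_*.sql.gz' -type f \
        -mmin -"$((max_hours * 60))" | head -n 1)" ]
}
```

It has to run somewhere the PVC is mounted, for example inside the erp pod against /var/www/documents/admin/backup, assuming find is available in that image.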

Ad-hoc & manual alternatives

Two escape hatches exist for an on-demand dump outside the 04:00 schedule:

Tool	What it does	When to reach for it
`ansible/.../playbooks/backup.yml`	One-shot Ansible Job `dolibarr-backup` (`postgres:16.3`); fetches the ERP version by scraping `https://erp.arcodange.lab/`, dumps with the same `--no-tablespaces --inserts` flags, `tar -czf` into the PVC, and waits for completion	A single immediate dump driven from a control host that can reach the cluster API
`backup/create_backup.sh`	Pure `kubectl` shell: `kubectl run pg-dump-temp` (`postgres:16.3`) to `pg_dump -c` locally, then `kubectl cp` the archive into the running erp pod's `/var/www/documents/admin/backup/`	A laptop with `kubectl` access but no Ansible setup

Warning

pg_dump must not be older than the server: it refuses to dump from a server with a newer major version. The lab keeps the two identical. Its Postgres is 16.3, so every dump/restore container pins postgres:16.3. Dolibarr's built-in Tools → Database backup page (/admin/tools/dolibarr_export.php) historically shells out to the image's bundled pg_dump (e.g. 11.x), which aborts with server version mismatch. Use the CronJob, the Ansible playbooks, or create_backup.sh; never use the in-app export against a newer server.
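
The version rule can be checked before dumping. pg_dump refuses servers newer than itself, so the sketch below tests that the client's major version is at least the server's; with the lab's pinning the two are simply equal. The function names are hypothetical, and the parsing assumes PostgreSQL 10 or later, where the major version is the first number.

```shell
# Extract the major version from strings such as
# "pg_dump (PostgreSQL) 16.3" or "16.3 (Debian 16.3-1.pgdg120+1)".
pg_major() {
  printf '%s\n' "$1" | sed -E 's/^[^0-9]*([0-9]+).*/\1/'
}

# Succeeds when the client (arg 1) is not older than the server (arg 2).
dump_client_ok() {
  [ "$(pg_major "$1")" -ge "$(pg_major "$2")" ]
}
```

A typical call would be dump_client_ok "$(pg_dump --version)" "$(psql -At -c 'SHOW server_version')", run from the container that is about to take the dump.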

What is — and is NOT — in the dump

The pg_dump captures only the relational database. Everything users upload lives on the Longhorn PVC and is protected by a completely separate mechanism. Conflating the two is the classic way to lose business records.

Data	Where it lives	Protected by
Invoices, third parties, accounting rows, config rows (`llx_*` tables)	`erp` Postgres DB	`pg_dump` archives (this page)
Uploaded documents, generated PDFs, attachments	`/var/www/documents` on the Longhorn RWX PVC	Longhorn snapshots / backups
Custom modules / overrides	`/var/www/html/custom` on the same PVC	Longhorn snapshots / backups

Important

A pg_dump alone does not make erp recoverable. A full recovery needs both the latest pg_dump_erp_*.sql.gz and the Longhorn-restored document volume. The backup archive itself sits on that same PVC (/documents/admin/backup), so it rides along with the Longhorn snapshot — but treat the database and the documents as two artifacts that must be restored together. See the storage concept for how the Longhorn volume is snapshotted and recovered.

Restore

The restore is the Ansible-driven Job dolibarr-restore from ansible/.../playbooks/restore.yml (postgres:16.3). It auto-discovers the most recent pg_dump_erp_*.sql.gz via ls -t ... | head -n1, or you can pin a specific archive with -e backup_file=.... It runs backoffLimit: 0 / restartPolicy: Never so a failed restore leaves a terminated pod you can inspect. Restoring into a live database corrupts state — scale erp to zero first.
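
Before trusting the auto-discovered archive, it can be worth confirming which file ls -t would pick and that it is not truncated. This is a hedged sketch with hypothetical function names, not part of restore.yml; gzip -t only proves the compressed stream is intact, not that the SQL inside is complete.

```shell
# Newest archive in directory $1, mirroring the `ls -t ... | head -n1` discovery.
latest_archive() {
  ls -t "$1"/pg_dump_erp_*.sql.gz 2>/dev/null | head -n 1
}

# Succeeds only if the file is non-empty and passes gzip's integrity test.
archive_is_sane() {
  [ -s "$1" ] && gzip -t "$1" 2>/dev/null
}
```

Both helpers need a shell that has the backup directory mounted, for example archive_is_sane "$(latest_archive /documents/admin/backup)" inside a Job container.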

Ordered procedure:

1. [AGENT] Confirm the cluster is healthy and inspect available archives (read-only).
2. [HUMAN] Scale the erp Deployment to 0 to stop all writes.
3. [HUMAN] Run the restore Job (latest archive, or a pinned backup_file); it extracts the archive with tar -xzf and loads it into the erp DB with psql -f.
4. [AGENT] Watch the Job to completion and read its logs.
5. [HUMAN] Scale the erp Deployment back to 1.
6. [AGENT] Validate erp is serving and the data is present.

# [AGENT] read-only: cluster health + list backup archives on the PVC
kubectl get deploy,pods -n erp
# wrap in sh -c so the glob expands inside the pod, not in the local shell
kubectl exec -n erp deploy/erp -- sh -c 'ls -t /var/www/documents/admin/backup/pg_dump_erp_*.sql.gz'

# [HUMAN] prod-mutating: stop writes before restoring
kubectl scale deploy/erp -n erp --replicas=0
kubectl rollout status deploy/erp -n erp --watch=false   # expect 0 replicas

# [HUMAN] prod-mutating: run the restore Job
#   default = newest pg_dump_erp_*.sql.gz auto-discovered on the PVC
ansible-playbook ansible/arcodange/erp/playbooks/restore.yml
#   or pin an explicit archive:
ansible-playbook ansible/arcodange/erp/playbooks/restore.yml \
  -e backup_file=/documents/admin/backup/pg_dump_erp_22.0.4_2606231819.sql.gz

# [AGENT] read-only: follow the restore Job and its logs
kubectl get job/dolibarr-restore -n erp -o wide
kubectl logs -n erp job/dolibarr-restore

# [HUMAN] prod-mutating: bring erp back up
kubectl scale deploy/erp -n erp --replicas=1
kubectl rollout status deploy/erp -n erp

# [AGENT] read-only: validate erp is serving after restore
kubectl get pods -n erp
kubectl exec -n erp deploy/erp -- curl -sf -o /dev/null -w '%{http_code}\n' http://localhost/

Warning

Always scale erp to 0 before restoring. The restore loads SQL straight into the live erp database; concurrent writes from a running Dolibarr pod produce a half-restored, inconsistent state. Scaling back to 1 only after the Job succeeds is part of the procedure, not an optional flourish.
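
The scale-to-zero rule can be enforced mechanically by putting the playbook behind a guard. The wrapper below is a sketch with hypothetical function names; it only checks the Deployment's replica counts, so a pod that is still terminating would need an extra kubectl get pods look.

```shell
# Succeeds only when the erp Deployment requests 0 replicas and reports none.
# .status.replicas is omitted by the API when it is zero, hence the default.
erp_is_scaled_down() {
  spec=$(kubectl get deploy/erp -n erp -o jsonpath='{.spec.replicas}')
  live=$(kubectl get deploy/erp -n erp -o jsonpath='{.status.replicas}')
  [ "$spec" = "0" ] && [ "${live:-0}" = "0" ]
}

# Run the restore playbook only behind that guard; extra args pass through.
guarded_restore() {
  erp_is_scaled_down || {
    echo "erp is not scaled to 0; refusing to restore" >&2
    return 1
  }
  ansible-playbook ansible/arcodange/erp/playbooks/restore.yml "$@"
}
```

For example, guarded_restore -e backup_file=<archive> behaves like the pinned invocation above but stops early if erp is still running.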

Recovery ordering (cluster rebuild)

Caution

Vault MUST be unsealed before erp is scaled up. The Dolibarr pod has no DB credentials of its own — it depends entirely on VSO materialising vso-db-credentials from postgres/creds/erp (DOLI_DB_USER / DOLI_DB_PASSWORD). If erp is scaled up while Vault is still sealed, VSO cannot reconcile the secret and the pod crash-loops with no database access. During a cluster rebuild the order is fixed:

1. Recover Longhorn volumes: bring the document PVC (and the documents/admin/backup archives riding on it) back online.

2. Unseal Vault: so VSO can issue erp's dynamic DB credentials and static config.

3. Scale erp to 1: only now does the pod come up with usable creds.

4. (Optional) Restore data: if the DB needs rolling back to a pg_dump, scale to 0, run the restore Job, scale back to 1 (see Restore above).

This sequence is the storage→secrets→apps backbone described in the storage concept and executed by the factory recover playbooks; the cluster-wide ordering lives in the CLUSTER_RECOVERY.md runbook.
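
The Vault-before-erp rule can be expressed as a guard around the scale-up. This is a minimal sketch, assuming a vault CLI whose VAULT_ADDR points at the lab's Vault (how that is reached is not covered on this page) and hypothetical function names. It relies on vault status exiting 0 when unsealed, 2 when sealed, and 1 on error.

```shell
# Succeeds only when Vault answers and reports itself unsealed.
vault_is_unsealed() {
  vault status >/dev/null 2>&1
}

# Scale erp up only once Vault is unsealed and the VSO secret exists.
scale_up_erp() {
  vault_is_unsealed || {
    echo "Vault is sealed or unreachable; not scaling erp" >&2
    return 1
  }
  kubectl get secret vso-db-credentials -n erp >/dev/null 2>&1 || {
    echo "vso-db-credentials not materialised yet" >&2
    return 1
  }
  kubectl scale deploy/erp -n erp --replicas=1
}
```

The secret check is deliberately weak: its existence does not prove the credentials inside are still valid, so a pod that crash-loops after this guard passes still points at VSO reconciliation.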

The ownership fix

After activating a new Dolibarr module — or whenever the dynamic DB user rotates and creates tables under a fresh role — public schema tables can end up owned by a credential that no longer exists, breaking subsequent migrations and dumps. Two SQL helpers reassign ownership back to the stable erp_role:

Script	Mechanism	Use
`backup/erp_role_as_table_owner.sql`	Loops every `public` table and `ALTER TABLE ... OWNER TO erp_role`	Force-set ownership table-by-table
`chart/scripts/update_ownership.sql`	Detects the current schema owner and `REASSIGN OWNED BY <owner> TO erp_role` only when it differs	The idempotent, chart-shipped fix

The chart wires update_ownership.sql into a pg-fix-table-ownership CronJob; trigger it on demand after a module activation:

# [HUMAN] prod-mutating: reassign public-schema table ownership to erp_role
kubectl create job \
  --from=cronjob/pg-fix-table-ownership \
  pg-fix-table-ownership-manual-trigger-$(date +%Y%m%d%H%M%S) \
  -n kube-system

Run this before a backup if you suspect ownership drift, so the dump records the correct owner. More on day-to-day fix-ups and audits in Operations.
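
Ownership drift can be confirmed read-only before triggering the fix. The sketch below queries the standard pg_tables view; the function names are hypothetical and connection settings are assumed to come from the PG* environment variables.

```shell
# List public-schema tables whose owner is not erp_role (read-only).
# Prints one "<table> <owner>" pair per line; prints nothing when clean.
tables_with_wrong_owner() {
  psql -d erp -At -c \
    "SELECT tablename || ' ' || tableowner FROM pg_tables
     WHERE schemaname = 'public' AND tableowner <> 'erp_role';"
}

# Succeeds when every public table is already owned by erp_role.
ownership_is_clean() {
  [ -z "$(tables_with_wrong_owner)" ]
}
```

If ownership_is_clean fails, the listed owners show which rotated credential created the tables, and the manual Job trigger above is the next step.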

Flow

%%{init: {'theme': 'base'}}%%
flowchart LR
    classDef sched fill:#2563eb,stroke:#1e40af,color:#fff
    classDef proc fill:#059669,stroke:#047857,color:#fff
    classDef store fill:#7c3aed,stroke:#6d28d9,color:#fff
    classDef db fill:#b45309,stroke:#92400e,color:#fff

    VSO["VSO secret<br>vso-db-credentials<br>(postgres/creds/erp)"]:::store
    CRON["CronJob dolibarr-backup<br>schedule 0 4 * * *"]:::sched
    DUMPJOB["pg_dump Job<br>postgres:16.3"]:::proc
    GZIP["gzip stream<br>--inserts --no-tablespaces"]:::proc
    PVC["Longhorn RWX PVC<br> /documents/admin/backup<br>pg_dump_erp_*.sql.gz"]:::store
    RESTOREJOB["Restore Job dolibarr-restore<br>postgres:16.3"]:::proc
    PSQL["psql -f dump.sql"]:::proc
    DB["erp Postgres DB<br>via pgbouncer.tools"]:::db

    CRON -- "spawns" --> DUMPJOB
    DUMPJOB -- "pg_dump" --> GZIP
    GZIP -- "writes archive" --> PVC
    PVC -- "ls -t latest .sql.gz" --> RESTOREJOB
    RESTOREJOB -- "tar -xzf then" --> PSQL
    PSQL -- "loads into" --> DB
    DUMPJOB -- "dumps from" --> DB
    VSO -. "DB creds" .-> DUMPJOB
    VSO -. "DB creds" .-> RESTOREJOB

1. The CronJob dolibarr-backup fires at 0 4 * * * and spawns a pg_dump Job (postgres:16.3).
2. The Job dumps the live erp database (logical, --inserts --no-tablespaces), reading credentials from the VSO vso-db-credentials secret.
3. The dump streams through gzip and the resulting pg_dump_erp_<version>_<timestamp>.sql.gz is written to /documents/admin/backup on the Longhorn RWX PVC; archives older than 15 days are pruned.
4. On restore, the dolibarr-restore Job picks the newest .sql.gz (ls -t | head -n1, or a pinned backup_file) from the PVC, also using the VSO credentials.
5. The restore Job extracts the archive with tar -xzf and loads it back into the erp database with psql -f (with erp scaled to 0 first).

Gotchas

Warning

15-day retention only. The CronJob deletes any pg_dump_erp_*.sql.gz older than 15 days. If you need long-term or compliance copies, pull archives off-cluster before they age out — nothing here keeps a month-old dump.

Version match is mandatory. pg_dump must be at least the server's major version (16.x), and the lab pins an exact match: every Job uses postgres:16.3. The in-app Dolibarr export bundles an older pg_dump and aborts against the newer server with server version mismatch.

Scale to 0 before restore. Restoring into a running erp produces an inconsistent database; scale the Deployment to 0, restore, then back to 1.

Vault unseal precedes scale-up. erp's DB creds come from VSO; a sealed Vault means a crash-looping pod. Follow the recovery ordering on any rebuild.

The admin/Postgres password lives in OpenTofu state. The per-app database and role are declared in IaC, so the authoritative credential material is held in the TF state — treat that state as a secret and recover it alongside Vault. See factory postgres-iac.
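
The 15-day retention gotcha above hinges on how find counts days: -mtime discards the fractional part of a file's age, so +15 only matches files at least 16 full days old. The demonstration below is a sketch that assumes GNU touch (for the relative -d date) and uses made-up archive names.

```shell
# Create two archives of known age, apply the CronJob's prune expression,
# and print whatever survives.
surviving_after_prune() {
  dir=$(mktemp -d)
  touch -d '372 hours ago' "$dir/pg_dump_erp_22.0.4_15d12h.sql.gz"  # 15.5 days old
  touch -d '396 hours ago' "$dir/pg_dump_erp_22.0.4_16d12h.sql.gz"  # 16.5 days old
  find "$dir" -name 'pg_dump_erp_*.sql.gz' -type f -mtime +15 -delete
  ls "$dir"
  rm -rf "$dir"
}
```

Calling surviving_after_prune prints only pg_dump_erp_22.0.4_15d12h.sql.gz: the 15.5-day-old archive is kept and the 16.5-day-old one is deleted, so an archive can live up to just under 16 days before it is pruned.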

Cross-references

Deployment — the chart, the document PVC, and the Vault CRDs (dynamic creds + static config) that this subsystem depends on.
Operations — day-to-day operational tasks including the table-ownership fix-ups and liveness checks.
storage concept — how the Longhorn document PVC (and its riding backup archives) is snapshotted and recovered.
factory recover playbooks — the Ansible recovery steps that must run before erp is scaled back up.
tools secrets-and-vso — the VSO runtime that materialises vso-db-credentials, feeding both the backup and restore Jobs.
factory postgres-iac — the per-app erp PostgreSQL database + role, and the TF state that holds its admin password.
