[vibe](../../README.md) > [Guidebooks](../README.md) > [ERP](README.md) > **Backup & recovery**

# Backup & recovery

> **Status:** ✅ Active
> **Last Updated:** 2026-06-23
> **Upstream:** [ERP](README.md) · [Deployment](deployment.md)
> **Downstream:** [Operations](operations.md)
> **Related:** [storage concept](../lab-ecosystem/storage-and-recovery.md) · [factory recover playbooks](../factory-provisioning/ansible/06-recover.md) · [tools secrets-and-vso](../tools/secrets-and-vso.md) · [factory postgres-iac](../factory-provisioning/opentofu/postgres-iac.md)

`erp` is the lab's **single most data-critical application**, so it carries its own backup/restore subsystem layered on top of the cluster's storage and secrets machinery. Two independent data stores have to survive an incident: the **`erp` PostgreSQL database** (captured by a `pg_dump`) and the **uploaded documents** on the Longhorn PVC (captured by Longhorn snapshots/backups, *not* the `pg_dump`). This page covers both, the daily backup CronJob, the restore Job, and the load-bearing recovery ordering that keeps erp from crash-looping during a cluster rebuild.

## Backup mechanism

The recurring backup is an Ansible-deployed Kubernetes **CronJob** named `dolibarr-backup` in namespace `erp`, declared by [`ansible/arcodange/erp/playbooks/recurrentBackup.yml`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/ansible/arcodange/erp/playbooks/recurrentBackup.yml). Each scheduled tick spawns a one-shot `postgres:16.3` Job that takes a **logical** dump of the `erp` database, gzips it, and lands the archive on the same Longhorn PVC that holds erp's documents.

The pipeline inside each run:

1. **Detect version**: `psql ... -c "SELECT value FROM llx_const WHERE name='MAIN_VERSION_LAST_UPGRADE';"` reads the live Dolibarr version straight from the database, so the archive name records exactly which schema it came from.
2. **Dump + compress**: `pg_dump -d erp --no-tablespaces --inserts | gzip > `.
The `--inserts` flag emits row-by-row `INSERT` statements (portable, version-tolerant restores) and `--no-tablespaces` strips host-specific tablespace clauses.
3. **Write to PVC**: the archive lands at `/documents/admin/backup/pg_dump_erp__.sql.gz`, where the container mounts the `erp` PVC with `subPath: documents/admin/backup`.
4. **Prune**: `find /documents/admin/backup -name "pg_dump_erp_*.sql.gz" -type f -mtime +15 -delete` removes anything older than 15 days.

DB credentials are supplied by the **VSO-materialised `vso-db-credentials` secret** (`envFrom` + `PGPASSWORD` from its `password` key), the same dynamic `postgres/creds/erp` secret the pod uses; see [tools secrets-and-vso](../tools/secrets-and-vso.md). The Job runs `backoffLimit: 0` with `restartPolicy: Never`, so a failed run leaves an inspectable terminated pod rather than retrying blindly.

### Schedule, retention & artifacts

| Property | Value | Source |
|---|---|---|
| Resource | CronJob `dolibarr-backup` (ns `erp`) | [recurrentBackup.yml](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/ansible/arcodange/erp/playbooks/recurrentBackup.yml) |
| Schedule | `0 4 * * *` (04:00 daily) | `spec.schedule` |
| Successful job history | `successfulJobsHistoryLimit: 3` | `spec` |
| Failed job history | `failedJobsHistoryLimit: 3` | `spec` |
| Retention | 15 days (`find -mtime +15 -delete`) | dump script |
| Dump image | `postgres:16.3` | `jobTemplate` container |
| Dump command | `pg_dump --no-tablespaces --inserts` (logical) | dump script |
| Compression | `gzip` (CronJob) / `tar -czf` (ad-hoc) | dump scripts |
| Archive path | `/documents/admin/backup/pg_dump_erp__.sql.gz` | mount + dump script |
| Mount | PVC `erp`, `subPath: documents/admin/backup` | `volumeMounts` |
| Failure policy | `backoffLimit: 0`, `restartPolicy: Never` | `jobTemplate` |

### Ad-hoc & manual alternatives

Two escape hatches exist for an on-demand dump outside the 04:00 schedule:

| Tool | What it does | When to reach for it |
|---|---|---|
| [`ansible/.../playbooks/backup.yml`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/ansible/arcodange/erp/playbooks/backup.yml) | One-shot Ansible **Job** `dolibarr-backup` (`postgres:16.3`); fetches the ERP version by scraping `https://erp.arcodange.lab/`, dumps with the same `--no-tablespaces --inserts` flags, `tar -czf` into the PVC, and waits for completion | A single immediate dump driven from a control host that can reach the cluster API |
| [`backup/create_backup.sh`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/backup/create_backup.sh) | Pure `kubectl` shell: `kubectl run pg-dump-temp` (`postgres:16.3`) to `pg_dump -c` locally, then `kubectl cp` the archive into the running erp pod's `/var/www/documents/admin/backup/` | A laptop with `kubectl` access but no Ansible setup |

> [!WARNING]
> **`pg_dump` and the server must match major versions.** The lab's Postgres is **16.3**, so every dump/restore container pins `postgres:16.3`. Dolibarr's built-in *Tools → Database backup* page (`/admin/tools/dolibarr_export.php`) historically shells out to the image's bundled `pg_dump` (e.g. 11.x), which aborts with `server version mismatch`. Use the CronJob, the Ansible playbooks, or `create_backup.sh`; never use the in-app export against a newer server.

## What is, and is NOT, in the dump

The `pg_dump` captures **only the relational database**. Everything users *upload* lives on the Longhorn PVC and is protected by a completely separate mechanism. Conflating the two is the classic way to lose business records.
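To see for yourself that an archive holds SQL rows and nothing else, a quick content check can help. The following is a minimal sketch, assuming a gzip-compressed plain-SQL archive of the kind the CronJob's `pg_dump --inserts | gzip` pipeline produces, and assuming `INSERT INTO <table> ...` statements start at the beginning of a line. The function name `inspect_dump` is hypothetical and not part of the erp repository.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in the erp repo): summarise a gzip-compressed SQL dump.
# Assumption: the archive is plain SQL piped through gzip, with one
# "INSERT INTO <table> ..." statement per row, as pg_dump --inserts emits.
inspect_dump() {
  local archive="$1"

  # gzip -t checks the compressed stream's integrity without extracting it.
  gzip -t "$archive" || { echo "corrupt or non-gzip archive: $archive" >&2; return 1; }

  # Count row-level INSERT statements (grep -c exits 1 on zero matches, hence || true).
  local inserts
  inserts=$(gzip -dc "$archive" | grep -c '^INSERT INTO ' || true)
  echo "inserts=$inserts"

  # Print the distinct table names that receive rows, one per line.
  gzip -dc "$archive" | sed -n 's/^INSERT INTO \([^ ]*\) .*/\1/p' | sort -u
}
```

Running `inspect_dump` against a copy of an archive prints the row count and the tables that carry data; no uploaded document or custom module ever appears in that listing, which is why the table below assigns those to Longhorn.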
| Data | Where it lives | Protected by |
|---|---|---|
| Invoices, third parties, accounting rows, config rows (`llx_*` tables) | `erp` Postgres DB | `pg_dump` archives (this page) |
| Uploaded documents, generated PDFs, attachments | `/var/www/documents` on the Longhorn RWX PVC | **Longhorn** snapshots / backups |
| Custom modules / overrides | `/var/www/html/custom` on the same PVC | **Longhorn** snapshots / backups |

> [!IMPORTANT]
> A `pg_dump` alone does **not** make erp recoverable. A full recovery needs *both* the latest `pg_dump_erp_*.sql.gz` **and** the Longhorn-restored document volume. The backup archive itself sits on that same PVC (`/documents/admin/backup`), so it rides along with the Longhorn snapshot, but treat the database and the documents as two artifacts that must be restored together. See the [storage concept](../lab-ecosystem/storage-and-recovery.md) for how the Longhorn volume is snapshotted and recovered.

## Restore

The restore is the Ansible-driven Job `dolibarr-restore` from [`ansible/.../playbooks/restore.yml`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/ansible/arcodange/erp/playbooks/restore.yml) (`postgres:16.3`). It **auto-discovers the most recent** `pg_dump_erp_*.sql.gz` via `ls -t ... | head -n1`, or you can pin a specific archive with `-e backup_file=...`. It runs `backoffLimit: 0` / `restartPolicy: Never` so a failed restore leaves a terminated pod you can inspect.

**Restoring into a live database corrupts state**, so scale erp to zero first. Ordered procedure:

1. **[AGENT]** Confirm the cluster is healthy and inspect available archives (read-only).
2. **[HUMAN]** Scale the `erp` Deployment to **0** to stop all writes.
3. **[HUMAN]** Run the restore Job (latest archive, or a pinned `backup_file`); it `tar -xzf`s the archive and `psql -f`s it into the `erp` DB.
4. **[AGENT]** Watch the Job to completion and read its logs.
5. **[HUMAN]** Scale the `erp` Deployment back to **1**.
6. **[AGENT]** Validate erp is serving and the data is present.

```bash
# [AGENT] read-only: cluster health + list backup archives on the PVC
kubectl get deploy,pods -n erp
# sh -c makes the glob expand inside the pod rather than in the local shell
kubectl exec -n erp deploy/erp -- sh -c 'ls -t /var/www/documents/admin/backup/pg_dump_erp_*.sql.gz'
```

```bash
# [HUMAN] prod-mutating: stop writes before restoring
kubectl scale deploy/erp -n erp --replicas=0
kubectl rollout status deploy/erp -n erp --watch=false   # expect 0 replicas
```

```bash
# [HUMAN] prod-mutating: run the restore Job
# default = newest pg_dump_erp_*.sql.gz auto-discovered on the PVC
ansible-playbook ansible/arcodange/erp/playbooks/restore.yml

# or pin an explicit archive:
ansible-playbook ansible/arcodange/erp/playbooks/restore.yml \
  -e backup_file=/documents/admin/backup/pg_dump_erp_22.0.4_2606231819.sql.gz
```

```bash
# [AGENT] read-only: follow the restore Job and its logs
kubectl get job/dolibarr-restore -n erp -o wide
kubectl logs -n erp job/dolibarr-restore
```

```bash
# [HUMAN] prod-mutating: bring erp back up
kubectl scale deploy/erp -n erp --replicas=1
kubectl rollout status deploy/erp -n erp
```

```bash
# [AGENT] read-only: validate erp is serving after restore
kubectl get pods -n erp
kubectl exec -n erp deploy/erp -- curl -sf -o /dev/null -w '%{http_code}\n' http://localhost/
```

> [!WARNING]
> **Always scale erp to 0 before restoring.** The restore loads SQL straight into the live `erp` database; concurrent writes from a running Dolibarr pod produce a half-restored, inconsistent state. Scaling back to 1 only after the Job succeeds is part of the procedure, not an optional flourish.

## Recovery ordering (cluster rebuild)

> [!CAUTION]
> **Vault MUST be unsealed before erp is scaled up.** The Dolibarr pod has no DB credentials of its own; it depends entirely on VSO materialising `vso-db-credentials` from `postgres/creds/erp` (`DOLI_DB_USER` / `DOLI_DB_PASSWORD`).
> If erp is scaled up while Vault is still sealed, VSO cannot reconcile the secret and the pod crash-loops with no database access. During a cluster rebuild the order is fixed:
>
> 1. **Recover Longhorn volumes**: bring the document PVC (and the `documents/admin/backup` archives riding on it) back online.
> 2. **Unseal Vault**: so VSO can issue erp's dynamic DB credentials and static config.
> 3. **Scale erp to 1**: only now does the pod come up with usable creds.
> 4. **(Optional) restore data**: if the DB needs rolling back to a `pg_dump`, scale to 0, run the restore Job, scale back to 1 (see [Restore](#restore) above).
>
> This sequence is the storage→secrets→apps backbone described in the [storage concept](../lab-ecosystem/storage-and-recovery.md) and executed by the [factory recover playbooks](../factory-provisioning/ansible/06-recover.md); the cluster-wide ordering lives in the CLUSTER_RECOVERY.md runbook.

## The ownership fix

After activating a new Dolibarr module, or whenever the dynamic DB user rotates and creates tables under a fresh role, `public` schema tables can end up owned by a credential that no longer exists, breaking subsequent migrations and dumps. Two SQL helpers reassign ownership back to the stable **`erp_role`**:

| Script | Mechanism | Use |
|---|---|---|
| [`backup/erp_role_as_table_owner.sql`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/backup/erp_role_as_table_owner.sql) | Loops every `public` table and `ALTER TABLE ... OWNER TO erp_role` | Force-set ownership table-by-table |
| [`chart/scripts/update_ownership.sql`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/scripts/update_ownership.sql) | Detects the current schema owner and `REASSIGN OWNED BY TO erp_role` only when it differs | The idempotent, chart-shipped fix |

The chart wires `update_ownership.sql` into a `pg-fix-table-ownership` CronJob; trigger it on demand after a module activation:

```bash
# [HUMAN] prod-mutating: reassign public-schema table ownership to erp_role
kubectl create job \
  --from=cronjob/pg-fix-table-ownership \
  pg-fix-table-ownership-manual-trigger-$(date +%Y%m%d%H%M%S) \
  -n kube-system
```

Run this **before** a backup if you suspect ownership drift, so the dump records the correct owner. More on day-to-day fix-ups and audits in [Operations](operations.md).

## Flow

```mermaid
%%{init: {'theme': 'base'}}%%
flowchart LR
    classDef sched fill:#2563eb,stroke:#1e40af,color:#fff
    classDef proc fill:#059669,stroke:#047857,color:#fff
    classDef store fill:#7c3aed,stroke:#6d28d9,color:#fff
    classDef db fill:#b45309,stroke:#92400e,color:#fff

    VSO["VSO secret<br/>vso-db-credentials<br/>(postgres/creds/erp)"]:::store
    CRON["CronJob dolibarr-backup<br/>schedule 0 4 * * *"]:::sched
    DUMPJOB["pg_dump Job<br/>postgres:16.3"]:::proc
    GZIP["gzip stream<br/>--inserts --no-tablespaces"]:::proc
    PVC["Longhorn RWX PVC<br/>/documents/admin/backup<br/>pg_dump_erp_*.sql.gz"]:::store
    RESTOREJOB["Restore Job dolibarr-restore<br/>postgres:16.3"]:::proc
    PSQL["psql -f dump.sql"]:::proc
    DB["erp Postgres DB<br/>via pgbouncer.tools"]:::db

    CRON -- "spawns" --> DUMPJOB
    DUMPJOB -- "pg_dump" --> GZIP
    GZIP -- "writes archive" --> PVC
    PVC -- "ls -t latest .sql.gz" --> RESTOREJOB
    RESTOREJOB -- "tar -xzf then" --> PSQL
    PSQL -- "loads into" --> DB
    DUMPJOB -- "dumps from" --> DB
    VSO -. "DB creds" .-> DUMPJOB
    VSO -. "DB creds" .-> RESTOREJOB
```

1. The **CronJob `dolibarr-backup`** fires at `0 4 * * *` and **spawns** a `pg_dump` Job (`postgres:16.3`).
2. The Job **dumps** the live `erp` database (logical, `--inserts --no-tablespaces`), reading credentials from the **VSO `vso-db-credentials`** secret.
3. The dump streams through **gzip** and the resulting `pg_dump_erp__.sql.gz` is **written** to `/documents/admin/backup` on the **Longhorn RWX PVC**; archives older than 15 days are pruned.
4. On restore, the **`dolibarr-restore` Job** picks the **newest** `.sql.gz` (`ls -t | head -n1`, or a pinned `backup_file`) from the PVC, also using the **VSO** credentials.
5. The restore Job **`tar -xzf`s** the archive and **`psql -f`s** it back **into** the `erp` database (with erp scaled to 0 first).

## Gotchas

> [!WARNING]
> - **15-day retention only.** The CronJob deletes any `pg_dump_erp_*.sql.gz` older than 15 days. If you need long-term or compliance copies, pull archives **off-cluster** before they age out; nothing here keeps a month-old dump.
> - **Version match is mandatory.** `pg_dump`/`psql` major version must equal the server's (16.x). Every Job pins `postgres:16.3`; the in-app Dolibarr export against the newer server aborts with `server version mismatch`.
> - **Scale to 0 before restore.** Restoring into a running erp produces an inconsistent database; scale the Deployment to 0, restore, then back to 1.
> - **Vault unseal precedes scale-up.** erp's DB creds come from VSO; a sealed Vault means a crash-looping pod. Follow the [recovery ordering](#recovery-ordering-cluster-rebuild) on any rebuild.
> - **The admin/Postgres password lives in OpenTofu state.** The per-app database and role are declared in IaC, so the authoritative credential material is held in the **TF state**; treat that state as a secret and recover it alongside Vault. See [factory postgres-iac](../factory-provisioning/opentofu/postgres-iac.md).

## Cross-references

- [Deployment](deployment.md): the chart, the document PVC, and the Vault CRDs (dynamic creds + static config) that this subsystem depends on.
- [Operations](operations.md): day-to-day operational tasks including the table-ownership fix-ups and liveness checks.
- [storage concept](../lab-ecosystem/storage-and-recovery.md): how the Longhorn document PVC (and its riding backup archives) is snapshotted and recovered.
- [factory recover playbooks](../factory-provisioning/ansible/06-recover.md): the Ansible recovery steps that must run before erp is scaled back up.
- [tools secrets-and-vso](../tools/secrets-and-vso.md): the VSO runtime that materialises `vso-db-credentials`, feeding both the backup and restore Jobs.
- [factory postgres-iac](../factory-provisioning/opentofu/postgres-iac.md): the per-app `erp` PostgreSQL database + role, and the TF state that holds its admin password.
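As a companion to the 15-day retention gotcha, it can help to list the archives that are approaching the prune cutoff so they can be copied off-cluster first. The following is a minimal sketch, assuming GNU `find` and a shell running wherever the backup directory is mounted. The function name `expiring_archives` and the 12-day default threshold are hypothetical choices for illustration, not something the erp repository ships.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in the erp repo): list dump archives nearing the
# CronJob's 15-day prune cutoff so they can be copied elsewhere beforehand.
# Usage: expiring_archives DIR [MIN_AGE_DAYS]   (MIN_AGE_DAYS defaults to 12)
expiring_archives() {
  local dir="$1" min_age_days="${2:-12}"

  # -mtime +N matches files modified more than N whole 24-hour periods ago
  # (find rounds the age down), the same test the prune step uses with +15.
  find "$dir" -name 'pg_dump_erp_*.sql.gz' -type f -mtime +"$min_age_days" | sort
}
```

For example, `expiring_archives /documents/admin/backup` prints the paths of archives that are at least 13 days old, which leaves a short window to copy them to long-term storage before the nightly prune removes them.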