# Backup & recovery

> **Status:** ✅ Active
> **Last Updated:** 2026-06-23
> **Upstream:** [ERP](README.md) · [Deployment](deployment.md)
> **Downstream:** [Operations](operations.md)
> **Related:** [storage concept](../lab-ecosystem/storage-and-recovery.md) · [factory recover playbooks](../factory-provisioning/ansible/06-recover.md) · [tools secrets-and-vso](../tools/secrets-and-vso.md) · [factory postgres-iac](../factory-provisioning/opentofu/postgres-iac.md)

`erp` is the lab's **single most data-critical application**, so it carries its own backup/restore subsystem layered on top of the cluster's storage and secrets machinery. Two independent data stores have to survive an incident: the **`erp` PostgreSQL database** (captured by a `pg_dump`) and the **uploaded documents** on the Longhorn PVC (captured by Longhorn snapshots/backups, *not* the `pg_dump`). This page covers both, the daily backup CronJob, the restore Job, and the load-bearing recovery ordering that keeps erp from crash-looping during a cluster rebuild.

## Backup mechanism

The recurring backup is an Ansible-deployed Kubernetes **CronJob** named `dolibarr-backup` in namespace `erp`, declared by [`ansible/arcodange/erp/playbooks/recurrentBackup.yml`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/ansible/arcodange/erp/playbooks/recurrentBackup.yml). Each scheduled tick spawns a one-shot `postgres:16.3` Job that takes a **logical** dump of the `erp` database, gzips it, and lands the archive on the same Longhorn PVC that holds erp's documents.

The pipeline inside each run:

1. **Detect version** — `psql ... -c "SELECT value FROM llx_const WHERE name='MAIN_VERSION_LAST_UPGRADE';"` reads the live Dolibarr version straight from the database, so the archive name records exactly which schema it came from.
2. **Dump + compress** — `pg_dump -d erp --no-tablespaces --inserts | gzip > <archive>`. The `--inserts` flag emits row-by-row `INSERT` statements (portable, version-tolerant restores) and `--no-tablespaces` strips host-specific tablespace clauses.
3. **Write to PVC** — the archive lands at `/documents/admin/backup/pg_dump_erp_<version>_<timestamp>.sql.gz`, where the container mounts the `erp` PVC with `subPath: documents/admin/backup`.
4. **Prune** — `find /documents/admin/backup -name "pg_dump_erp_*.sql.gz" -type f -mtime +15 -delete` removes anything older than 15 days.
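
Steps 2 to 4 can be sketched as a small script. This is a minimal illustration, not the contents of `recurrentBackup.yml`: the function names are hypothetical, and the `YYMMDDHHMM` timestamp format is an assumption inferred from the example archive name used in the restore section.

```bash
# Sketch only: helper names are hypothetical, not taken from recurrentBackup.yml.
BACKUP_DIR="${BACKUP_DIR:-/documents/admin/backup}"

# Build the archive path from the detected Dolibarr version and a timestamp.
# The YYMMDDHHMM default is an assumption about the real naming scheme.
archive_path() {
  local version="$1" stamp="${2:-$(date +%y%m%d%H%M)}"
  printf '%s/pg_dump_erp_%s_%s.sql.gz\n' "$BACKUP_DIR" "$version" "$stamp"
}

# Delete archives older than the given number of days (15 on this page).
prune_archives() {
  local days="${1:-15}"
  find "$BACKUP_DIR" -name 'pg_dump_erp_*.sql.gz' -type f -mtime +"$days" -delete
}

# Dump, compress, then prune. Connection settings come from the PG* environment.
run_backup() {
  set -o pipefail   # bash: fail the pipeline when pg_dump fails, not only gzip
  local version="$1" target
  target="$(archive_path "$version")"
  pg_dump -d erp --no-tablespaces --inserts | gzip > "$target"
  prune_archives 15
}
```

Keeping the name builder and the prune step as separate functions makes each one easy to dry-run against a scratch directory by overriding `BACKUP_DIR`.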

DB credentials are supplied by the **VSO-materialised `vso-db-credentials` secret** (`envFrom`, plus `PGPASSWORD` read from its `password` key). This is the same dynamic `postgres/creds/erp` credential the erp pod itself uses; see [tools secrets-and-vso](../tools/secrets-and-vso.md). The Job runs with `backoffLimit: 0` and `restartPolicy: Never`, so a failed run leaves an inspectable terminated pod rather than retrying blindly.

### Schedule, retention & artifacts

| Property | Value | Source |
|---|---|---|
| Resource | CronJob `dolibarr-backup` (ns `erp`) | [recurrentBackup.yml](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/ansible/arcodange/erp/playbooks/recurrentBackup.yml) |
| Schedule | `0 4 * * *` (04:00 daily) | `spec.schedule` |
| Successful job history | `successfulJobsHistoryLimit: 3` | `spec` |
| Failed job history | `failedJobsHistoryLimit: 3` | `spec` |
| Retention | 15 days (`find -mtime +15 -delete`) | dump script |
| Dump image | `postgres:16.3` | `jobTemplate` container |
| Dump command | `pg_dump --no-tablespaces --inserts` (logical) | dump script |
| Compression | `gzip` (CronJob) / `tar -czf` (ad-hoc) | dump scripts |
| Archive path | `/documents/admin/backup/pg_dump_erp_<version>_<timestamp>.sql.gz` | mount + dump script |
| Mount | PVC `erp`, `subPath: documents/admin/backup` | `volumeMounts` |
| Failure policy | `backoffLimit: 0`, `restartPolicy: Never` | `jobTemplate` |
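
To confirm that the live object still matches this table, a read-only check along these lines can help. It is a sketch: the function names are hypothetical, and it relies only on `kubectl get ... -o jsonpath`.

```bash
# [AGENT] read-only: read one field of the backup CronJob.
# $1 is a jsonpath expression, for example '{.spec.schedule}'.
cronjob_field() {
  kubectl get cronjob dolibarr-backup -n erp -o "jsonpath=$1"
}

# Succeed only when the deployed schedule is the documented 04:00 daily tick.
backup_schedule_ok() {
  [ "$(cronjob_field '{.spec.schedule}')" = "0 4 * * *" ]
}

# Usage:
#   backup_schedule_ok || echo "schedule drifted from the guidebook" >&2
#   cronjob_field '{.status.lastScheduleTime}'
```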

### Ad-hoc & manual alternatives

Two escape hatches exist for an on-demand dump outside the 04:00 schedule:

| Tool | What it does | When to reach for it |
|---|---|---|
| [`ansible/.../playbooks/backup.yml`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/ansible/arcodange/erp/playbooks/backup.yml) | One-shot Ansible **Job** `dolibarr-backup` (`postgres:16.3`); fetches the ERP version by scraping `https://erp.arcodange.lab/`, dumps with the same `--no-tablespaces --inserts` flags, `tar -czf` into the PVC, and waits for completion | A single immediate dump driven from a control host that can reach the cluster API |
| [`backup/create_backup.sh`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/backup/create_backup.sh) | Pure `kubectl` shell: `kubectl run pg-dump-temp` (`postgres:16.3`) to `pg_dump -c` locally, then `kubectl cp` the archive into the running erp pod's `/var/www/documents/admin/backup/` | A laptop with `kubectl` access but no Ansible setup |

> [!WARNING]
> **`pg_dump` and the server must match major versions.** `pg_dump` refuses to dump from a server newer than itself, so the lab keeps client and server in lockstep: Postgres is **16.3** and every dump/restore container pins `postgres:16.3`. Dolibarr's built-in *Tools → Database backup* page (`/admin/tools/dolibarr_export.php`) historically shells out to the image's bundled `pg_dump` (e.g. 11.x), which aborts with `server version mismatch`. Use the CronJob, the Ansible playbooks, or `create_backup.sh`; never use the in-app export against a newer server.
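
The version rule can be checked mechanically before a dump. The helpers below are a hypothetical sketch, not part of the erp repository; they assume the usual `pg_dump (PostgreSQL) 16.3` output format and PostgreSQL 10 or later, where the major version is the first number.

```bash
# Extract the major version from strings such as
# "pg_dump (PostgreSQL) 16.3" or "16.3 (Debian 16.3-1.pgdg120+1)".
pg_major() {
  printf '%s\n' "$1" | grep -oE '[0-9]+(\.[0-9]+)*' | head -n1 | cut -d. -f1
}

# Succeed only when the client and server majors are equal.
same_major() {
  [ "$(pg_major "$1")" = "$(pg_major "$2")" ]
}

# Usage inside a dump container (PG* environment assumed to be set):
#   same_major "$(pg_dump --version)" "$(psql -d erp -Atc 'SHOW server_version')" \
#     || echo "client/server major mismatch" >&2
```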

## What is — and is NOT — in the dump

The `pg_dump` captures **only the relational database**. Everything users *upload* lives on the Longhorn PVC and is protected by a completely separate mechanism. Conflating the two is the classic way to lose business records.

| Data | Where it lives | Protected by |
|---|---|---|
| Invoices, third parties, accounting rows, config rows (`llx_*` tables) | `erp` Postgres DB | `pg_dump` archives (this page) |
| Uploaded documents, generated PDFs, attachments | `/var/www/documents` on the Longhorn RWX PVC | **Longhorn** snapshots / backups |
| Custom modules / overrides | `/var/www/html/custom` on the same PVC | **Longhorn** snapshots / backups |

> [!IMPORTANT]
> A `pg_dump` alone does **not** make erp recoverable. A full recovery needs *both* the latest `pg_dump_erp_*.sql.gz` **and** the Longhorn-restored document volume. The backup archive itself sits on that same PVC (`/documents/admin/backup`), so it rides along with the Longhorn snapshot — but treat the database and the documents as two artifacts that must be restored together. See the [storage concept](../lab-ecosystem/storage-and-recovery.md) for how the Longhorn volume is snapshotted and recovered.
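
Before relying on a backup set, it can be worth confirming that both artifacts are actually present. The sketch below uses hypothetical function names and assumes it runs somewhere the documents tree is mounted; pass the root without a trailing slash.

```bash
# Succeed when the archive decompresses cleanly.
# gzip -t validates the gzip stream, so it works for .sql.gz and tar.gz alike.
archive_intact() {
  gzip -t "$1" 2>/dev/null
}

# Succeed when the documents tree holds at least one regular file
# outside admin/backup, i.e. something other than the dumps themselves.
documents_present() {
  local root="$1"
  [ -n "$(find "$root" -path "$root/admin/backup" -prune -o -type f -print | head -n1)" ]
}

# Usage:
#   archive_intact /documents/admin/backup/<archive> && documents_present /documents
```

A clean `gzip -t` only shows the archive is not truncated; it says nothing about whether the SQL inside restores.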

## Restore

The restore is the Ansible-driven Job `dolibarr-restore` from [`ansible/.../playbooks/restore.yml`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/ansible/arcodange/erp/playbooks/restore.yml) (`postgres:16.3`). It **auto-discovers the most recent** `pg_dump_erp_*.sql.gz` via `ls -t ... | head -n1`, or you can pin a specific archive with `-e backup_file=...`. It runs `backoffLimit: 0` / `restartPolicy: Never` so a failed restore leaves a terminated pod you can inspect. **Restoring into a live database corrupts state** — scale erp to zero first.
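
The selection rule can be sketched as follows. This is an illustration of the `ls -t ... | head -n1` behaviour with an optional pin, not code copied from `restore.yml`; the function name is hypothetical.

```bash
BACKUP_DIR="${BACKUP_DIR:-/documents/admin/backup}"

# Print the archive to restore: the pinned file if one is given,
# otherwise the newest pg_dump_erp_*.sql.gz. Fails when nothing matches.
select_archive() {
  local pinned="${1:-}" newest
  if [ -n "$pinned" ]; then
    printf '%s\n' "$pinned"
    return 0
  fi
  # ls -t sorts by modification time, newest first; head keeps the first entry.
  newest="$(ls -t "$BACKUP_DIR"/pg_dump_erp_*.sql.gz 2>/dev/null | head -n1)"
  [ -n "$newest" ] && printf '%s\n' "$newest"
}
```

Note that `ls -t` orders by file modification time, not by the timestamp embedded in the name, so an archive copied in by hand can sort ahead of a genuinely newer dump. Pinning `backup_file` avoids that ambiguity.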

Ordered procedure:

1. **[AGENT]** Confirm the cluster is healthy and inspect available archives (read-only).
2. **[HUMAN]** Scale the `erp` Deployment to **0** to stop all writes.
3. **[HUMAN]** Run the restore Job (latest archive, or a pinned `backup_file`); it `tar -xzf`s the archive and `psql -f`s it into the `erp` DB.
4. **[AGENT]** Watch the Job to completion and read its logs.
5. **[HUMAN]** Scale the `erp` Deployment back to **1**.
6. **[AGENT]** Validate erp is serving and the data is present.

```bash
# [AGENT] read-only: cluster health + list backup archives on the PVC
kubectl get deploy,pods -n erp
# sh -c makes the glob expand inside the container rather than on the local host
kubectl exec -n erp deploy/erp -- sh -c 'ls -t /var/www/documents/admin/backup/pg_dump_erp_*.sql.gz'
```

```bash
# [HUMAN] prod-mutating: stop writes before restoring
kubectl scale deploy/erp -n erp --replicas=0
kubectl get deploy/erp -n erp   # expect READY 0/0 once the pod has terminated
```

```bash
# [HUMAN] prod-mutating: run the restore Job
#   default = newest pg_dump_erp_*.sql.gz auto-discovered on the PVC
ansible-playbook ansible/arcodange/erp/playbooks/restore.yml
#   or pin an explicit archive:
ansible-playbook ansible/arcodange/erp/playbooks/restore.yml \
  -e backup_file=/documents/admin/backup/pg_dump_erp_22.0.4_2606231819.sql.gz
```

```bash
# [AGENT] read-only: follow the restore Job and its logs
kubectl get job/dolibarr-restore -n erp -o wide
kubectl logs -n erp job/dolibarr-restore
```

```bash
# [HUMAN] prod-mutating: bring erp back up
kubectl scale deploy/erp -n erp --replicas=1
kubectl rollout status deploy/erp -n erp
```

```bash
# [AGENT] read-only: validate erp is serving after restore
kubectl get pods -n erp
kubectl exec -n erp deploy/erp -- curl -sf -o /dev/null -w '%{http_code}\n' http://localhost/
```

> [!WARNING]
> **Always scale erp to 0 before restoring.** The restore loads SQL straight into the live `erp` database; concurrent writes from a running Dolibarr pod produce a half-restored, inconsistent state. Scaling back to 1 only after the Job succeeds is part of the procedure, not an optional flourish.
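
A wrapper can refuse to start the restore while erp is still running. The guard below is a hypothetical sketch rather than something shipped with the playbooks; it only reads the Deployment's replica counts.

```bash
# [AGENT] read-only: succeed only when the erp Deployment is fully scaled down.
erp_is_stopped() {
  local desired running
  desired="$(kubectl get deploy erp -n erp -o jsonpath='{.spec.replicas}')" || return 1
  running="$(kubectl get deploy erp -n erp -o jsonpath='{.status.replicas}')" || return 1
  # .status.replicas is omitted by the API when no pods exist, hence the default.
  [ "$desired" = "0" ] && [ "${running:-0}" = "0" ]
}

# Usage ([HUMAN] prod-mutating once the guard passes):
#   erp_is_stopped && ansible-playbook ansible/arcodange/erp/playbooks/restore.yml
```

A failed `kubectl` call counts as "not stopped", so an unreachable API server blocks the restore instead of letting it through.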

## Recovery ordering (cluster rebuild)

> [!CAUTION]
> **Vault MUST be unsealed before erp is scaled up.** The Dolibarr pod has no DB credentials of its own — it depends entirely on VSO materialising `vso-db-credentials` from `postgres/creds/erp` (`DOLI_DB_USER` / `DOLI_DB_PASSWORD`). If erp is scaled up while Vault is still sealed, VSO cannot reconcile the secret and the pod crash-loops with no database access. During a cluster rebuild the order is fixed:
>
> 1. **Recover Longhorn volumes** — bring the document PVC (and the `documents/admin/backup` archives riding on it) back online.
> 2. **Unseal Vault** — so VSO can issue erp's dynamic DB credentials and static config.
> 3. **Scale erp to 1** — only now does the pod come up with usable creds.
> 4. **(Optional) restore data** — if the DB needs rolling back to a `pg_dump`, scale to 0, run the restore Job, scale back to 1 (see [Restore](#restore) above).
>
> This sequence is the storage→secrets→apps backbone described in the [storage concept](../lab-ecosystem/storage-and-recovery.md) and executed by the [factory recover playbooks](../factory-provisioning/ansible/06-recover.md); the cluster-wide ordering lives in the CLUSTER_RECOVERY.md runbook.
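
One way to enforce step 2 before step 3 is to wait until VSO has materialised the secret before scaling up. The sketch below assumes the `password` key described earlier on this page; the function names and polling defaults are hypothetical.

```bash
# [AGENT] read-only: succeed when vso-db-credentials carries a non-empty password key.
db_secret_ready() {
  local value
  value="$(kubectl get secret vso-db-credentials -n erp \
    -o jsonpath='{.data.password}' 2>/dev/null)" || return 1
  [ -n "$value" ]
}

# Poll until the secret is ready or the attempts run out.
# $1 = number of attempts (default 30), $2 = seconds between attempts (default 10).
wait_for_db_secret() {
  local attempts="${1:-30}" delay="${2:-10}" i=0
  while [ "$i" -lt "$attempts" ]; do
    db_secret_ready && return 0
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Usage ([HUMAN] prod-mutating after the wait succeeds):
#   wait_for_db_secret && kubectl scale deploy/erp -n erp --replicas=1
```

The presence of the key is a necessary signal, not proof that the credential is still valid; a secret left over from before the rebuild would also pass this check.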

## The ownership fix

After activating a new Dolibarr module — or whenever the dynamic DB user rotates and creates tables under a fresh role — `public` schema tables can end up owned by a credential that no longer exists, breaking subsequent migrations and dumps. Two SQL helpers reassign ownership back to the stable **`erp_role`**:

| Script | Mechanism | Use |
|---|---|---|
| [`backup/erp_role_as_table_owner.sql`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/backup/erp_role_as_table_owner.sql) | Loops every `public` table and `ALTER TABLE ... OWNER TO erp_role` | Force-set ownership table-by-table |
| [`chart/scripts/update_ownership.sql`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/scripts/update_ownership.sql) | Detects the current schema owner and `REASSIGN OWNED BY <owner> TO erp_role` only when it differs | The idempotent, chart-shipped fix |

The chart wires `update_ownership.sql` into a `pg-fix-table-ownership` CronJob; trigger it on demand after a module activation:

```bash
# [HUMAN] prod-mutating: reassign public-schema table ownership to erp_role
kubectl create job \
  --from=cronjob/pg-fix-table-ownership \
  pg-fix-table-ownership-manual-trigger-$(date +%Y%m%d%H%M%S) \
  -n kube-system
```

Run this **before** a backup if you suspect ownership drift, so the dump records the correct owner. More on day-to-day fix-ups and audits in [Operations](operations.md).
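
To find out whether drift exists before triggering the fix, a read-only query against the `pg_tables` catalog view is usually enough. The helpers below are a hypothetical sketch and assume `psql` can reach the `erp` database through the `PG*` environment.

```bash
# SQL listing public-schema tables whose owner is not erp_role.
ownership_drift_sql() {
  cat <<'SQL'
SELECT tablename, tableowner
FROM pg_tables
WHERE schemaname = 'public' AND tableowner <> 'erp_role'
ORDER BY tablename;
SQL
}

# [AGENT] read-only: print drifted tables; succeeds when there are none.
no_ownership_drift() {
  local rows
  rows="$(ownership_drift_sql | psql -d erp -At -f -)" || return 1
  [ -z "$rows" ] || { printf '%s\n' "$rows"; return 1; }
}
```

Each output line has the form `table|owner`, which makes it easy to see which rotated credential created the stray tables.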

## Flow

```mermaid
%%{init: {'theme': 'base'}}%%
flowchart LR
    classDef sched fill:#2563eb,stroke:#1e40af,color:#fff
    classDef proc fill:#059669,stroke:#047857,color:#fff
    classDef store fill:#7c3aed,stroke:#6d28d9,color:#fff
    classDef db fill:#b45309,stroke:#92400e,color:#fff

    VSO["VSO secret<br>vso-db-credentials<br>(postgres/creds/erp)"]:::store
    CRON["CronJob dolibarr-backup<br>schedule 0 4 * * *"]:::sched
    DUMPJOB["pg_dump Job<br>postgres:16.3"]:::proc
    GZIP["gzip stream<br>--inserts --no-tablespaces"]:::proc
    PVC["Longhorn RWX PVC<br> /documents/admin/backup<br>pg_dump_erp_*.sql.gz"]:::store
    RESTOREJOB["Restore Job dolibarr-restore<br>postgres:16.3"]:::proc
    PSQL["psql -f dump.sql"]:::proc
    DB["erp Postgres DB<br>via pgbouncer.tools"]:::db

    CRON -- "spawns" --> DUMPJOB
    DUMPJOB -- "pg_dump" --> GZIP
    GZIP -- "writes archive" --> PVC
    PVC -- "ls -t latest .sql.gz" --> RESTOREJOB
    RESTOREJOB -- "tar -xzf then" --> PSQL
    PSQL -- "loads into" --> DB
    DUMPJOB -- "dumps from" --> DB
    VSO -. "DB creds" .-> DUMPJOB
    VSO -. "DB creds" .-> RESTOREJOB
```

1. The **CronJob `dolibarr-backup`** fires at `0 4 * * *` and **spawns** a `pg_dump` Job (`postgres:16.3`).
2. The Job **dumps** the live `erp` database (logical, `--inserts --no-tablespaces`) — reading credentials from the **VSO `vso-db-credentials`** secret.
3. The dump streams through **gzip** and the resulting `pg_dump_erp_<version>_<timestamp>.sql.gz` is **written** to `/documents/admin/backup` on the **Longhorn RWX PVC**; archives older than 15 days are pruned.
4. On restore, the **`dolibarr-restore` Job** picks the **newest** `.sql.gz` (`ls -t | head -n1`, or a pinned `backup_file`) from the PVC — also using the **VSO** credentials.
5. The restore Job **`tar -xzf`s** the archive and **`psql -f`s** it back **into** the `erp` database (with erp scaled to 0 first).

## Gotchas

> [!WARNING]
> - **15-day retention only.** The CronJob deletes any `pg_dump_erp_*.sql.gz` older than 15 days. If you need long-term or compliance copies, pull archives **off-cluster** before they age out — nothing here keeps a month-old dump.
> - **Version match is mandatory.** `pg_dump`/`psql` major version must equal the server's (16.x). Every Job pins `postgres:16.3`; the in-app Dolibarr export against the newer server aborts with `server version mismatch`.
> - **Scale to 0 before restore.** Restoring into a running erp produces an inconsistent database; scale the Deployment to 0, restore, then back to 1.
> - **Vault unseal precedes scale-up.** erp's DB creds come from VSO; a sealed Vault means a crash-looping pod. Follow the [recovery ordering](#recovery-ordering-cluster-rebuild) on any rebuild.
> - **The admin/Postgres password lives in OpenTofu state.** The per-app database and role are declared in IaC, so the authoritative credential material is held in the **TF state** — treat that state as a secret and recover it alongside Vault. See [factory postgres-iac](../factory-provisioning/opentofu/postgres-iac.md).
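
For the retention gotcha, an off-cluster copy can be as simple as streaming the newest archive out of the running erp pod. This is a hypothetical sketch, not an existing script in the repository; it assumes the erp pod is up and mounts the archives under `/var/www/documents/admin/backup`.

```bash
# [AGENT] read-only on the cluster: copy the newest archive into a local directory.
# $1 = destination directory (default: current directory).
pull_latest_archive() {
  local dest="${1:-.}" remote
  remote="$(kubectl exec -n erp deploy/erp -- \
    sh -c 'ls -t /var/www/documents/admin/backup/pg_dump_erp_*.sql.gz | head -n1')" || return 1
  [ -n "$remote" ] || return 1
  # Stream the file through exec; this avoids relying on tar inside the container.
  kubectl exec -n erp deploy/erp -- cat "$remote" > "$dest/$(basename "$remote")"
}

# Usage:
#   pull_latest_archive /srv/offsite/erp
```

Running something like this on a schedule from outside the cluster keeps copies beyond the 15-day window; where those copies live is left to the operator.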

## Cross-references

- [Deployment](deployment.md) — the chart, the document PVC, and the Vault CRDs (dynamic creds + static config) that this subsystem depends on.
- [Operations](operations.md) — day-to-day operational tasks including the table-ownership fix-ups and liveness checks.
- [storage concept](../lab-ecosystem/storage-and-recovery.md) — how the Longhorn document PVC (and its riding backup archives) is snapshotted and recovered.
- [factory recover playbooks](../factory-provisioning/ansible/06-recover.md) — the Ansible recovery steps that must run before erp is scaled back up.
- [tools secrets-and-vso](../tools/secrets-and-vso.md) — the VSO runtime that materialises `vso-db-credentials`, feeding both the backup and restore Jobs.
- [factory postgres-iac](../factory-provisioning/opentofu/postgres-iac.md) — the per-app `erp` PostgreSQL database + role, and the TF state that holds its admin password.