docs(vibe): add erp/ guidebook (Dolibarr deployment + backup/recovery + ops)

Dedicated tree-docs guidebook under vibe/guidebooks/erp/ for the lab's most data-critical app, cross-linked from the applications hub (bidirectional):

- README.md: Dolibarr 22.0.4 on Postgres; data-criticality; overview diagram; the Vault-unseal-before-scale recovery ordering (CAUTION).
- deployment.md: upstream image + custom entrypoint (MySQL->psql), the 50Gi Longhorn RWX documents PVC, Vault CRDs + the shared app_roles iac, init scripts (conf.php creds, table-ownership), ingress, CI.
- backup-and-recovery.md: the Ansible CronJob pg_dump (daily 04:00, 15-day retention) + restore Job (scale-0 -> restore -> scale-1); the cluster recovery ordering (Longhorn -> Vault unseal -> erp scale-up).
- operations.md: the read-only bin/arcodange CLI, static/company.json, Deno+Playwright tests, day-2 ops.

erp code via full gitea URLs; CLUSTER_RECOVERY.md by name; 2 mermaid diagrams MCP-validated; zero dead links.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
207
vibe/guidebooks/erp/backup-and-recovery.md
Normal file
@@ -0,0 +1,207 @@
[vibe](../../README.md) > [Guidebooks](../README.md) > [ERP](README.md) > **Backup & recovery**

# Backup & recovery

> **Status:** ✅ Active
> **Last Updated:** 2026-06-23
> **Upstream:** [ERP](README.md) · [Deployment](deployment.md)
> **Downstream:** [Operations](operations.md)
> **Related:** [storage concept](../lab-ecosystem/storage-and-recovery.md) · [factory recover playbooks](../factory-provisioning/ansible/06-recover.md) · [tools secrets-and-vso](../tools/secrets-and-vso.md) · [factory postgres-iac](../factory-provisioning/opentofu/postgres-iac.md)

`erp` is the lab's **single most data-critical application**, so it carries its own backup/restore subsystem layered on top of the cluster's storage and secrets machinery. Two independent data stores have to survive an incident: the **`erp` PostgreSQL database** (captured by a `pg_dump`) and the **uploaded documents** on the Longhorn PVC (captured by Longhorn snapshots/backups, *not* the `pg_dump`). This page covers both, the daily backup CronJob, the restore Job, and the load-bearing recovery ordering that keeps erp from crash-looping during a cluster rebuild.

## Backup mechanism

The recurring backup is an Ansible-deployed Kubernetes **CronJob** named `dolibarr-backup` in namespace `erp`, declared by [`ansible/arcodange/erp/playbooks/recurrentBackup.yml`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/ansible/arcodange/erp/playbooks/recurrentBackup.yml). Each scheduled tick spawns a one-shot `postgres:16.3` Job that takes a **logical** dump of the `erp` database, gzips it, and lands the archive on the same Longhorn PVC that holds erp's documents.

The pipeline inside each run:

1. **Detect version** — `psql ... -c "SELECT value FROM llx_const WHERE name='MAIN_VERSION_LAST_UPGRADE';"` reads the live Dolibarr version straight from the database, so the archive name records exactly which schema it came from.
2. **Dump + compress** — `pg_dump -d erp --no-tablespaces --inserts | gzip > <archive>`. The `--inserts` flag emits row-by-row `INSERT` statements (portable, version-tolerant restores) and `--no-tablespaces` strips host-specific tablespace clauses.
3. **Write to PVC** — the archive lands at `/documents/admin/backup/pg_dump_erp_<version>_<timestamp>.sql.gz`, where the container mounts the `erp` PVC with `subPath: documents/admin/backup`.
4. **Prune** — `find /documents/admin/backup -name "pg_dump_erp_*.sql.gz" -type f -mtime +15 -delete` removes anything older than 15 days.
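
The naming and pruning steps above can be tried locally without a cluster. The following is a minimal sketch, not the playbook's actual script: `archive_path` and `prune_archives` are hypothetical helper names, and the `yymmddHHMM` timestamp shape is only an assumption inferred from the example file name `pg_dump_erp_22.0.4_2606231819.sql.gz` used later on this page.

```bash
#!/usr/bin/env bash
# Hypothetical helpers mirroring the "write" and "prune" steps of the pipeline.

# Build the archive path from a directory, a Dolibarr version and a timestamp.
archive_path() {
  local dir="$1" version="$2" stamp="$3"
  printf '%s/pg_dump_erp_%s_%s.sql.gz\n' "$dir" "$version" "$stamp"
}

# Delete archives past the retention window (default: 15 days).
# find ignores fractional days, so "-mtime +15" only matches files
# that are at least 16 full days old.
prune_archives() {
  local dir="$1" days="${2:-15}"
  find "$dir" -name 'pg_dump_erp_*.sql.gz' -type f -mtime "+${days}" -delete
}
```

For example, `prune_archives /documents/admin/backup` would apply the same 15-day window as the CronJob, while `prune_archives ./scratch 3` lets you rehearse the behaviour on a throwaway directory.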

DB credentials are supplied by the **VSO-materialised `vso-db-credentials` secret** (`envFrom` + `PGPASSWORD` from its `password` key), the same dynamic `postgres/creds/erp` secret the pod uses — see [tools secrets-and-vso](../tools/secrets-and-vso.md). The Job runs `backoffLimit: 0` with `restartPolicy: Never`, so a failed run leaves an inspectable terminated pod rather than retrying blindly.

### Schedule, retention & artifacts

| Property | Value | Source |
|---|---|---|
| Resource | CronJob `dolibarr-backup` (ns `erp`) | [recurrentBackup.yml](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/ansible/arcodange/erp/playbooks/recurrentBackup.yml) |
| Schedule | `0 4 * * *` (04:00 daily) | `spec.schedule` |
| Successful job history | `successfulJobsHistoryLimit: 3` | `spec` |
| Failed job history | `failedJobsHistoryLimit: 3` | `spec` |
| Retention | 15 days (`find -mtime +15 -delete`) | dump script |
| Dump image | `postgres:16.3` | `jobTemplate` container |
| Dump command | `pg_dump --no-tablespaces --inserts` (logical) | dump script |
| Compression | `gzip` (CronJob) / `tar -czf` (ad-hoc) | dump scripts |
| Archive path | `/documents/admin/backup/pg_dump_erp_<version>_<timestamp>.sql.gz` | mount + dump script |
| Mount | PVC `erp`, `subPath: documents/admin/backup` | `volumeMounts` |
| Failure policy | `backoffLimit: 0`, `restartPolicy: Never` | `jobTemplate` |

### Ad-hoc & manual alternatives

Two escape hatches exist for an on-demand dump outside the 04:00 schedule:

| Tool | What it does | When to reach for it |
|---|---|---|
| [`ansible/.../playbooks/backup.yml`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/ansible/arcodange/erp/playbooks/backup.yml) | One-shot Ansible **Job** `dolibarr-backup` (`postgres:16.3`); fetches the ERP version by scraping `https://erp.arcodange.lab/`, dumps with the same `--no-tablespaces --inserts` flags, `tar -czf` into the PVC, and waits for completion | A single immediate dump driven from a control host that can reach the cluster API |
| [`backup/create_backup.sh`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/backup/create_backup.sh) | Pure `kubectl` shell: `kubectl run pg-dump-temp` (`postgres:16.3`) to `pg_dump -c` locally, then `kubectl cp` the archive into the running erp pod's `/var/www/documents/admin/backup/` | A laptop with `kubectl` access but no Ansible setup |

> [!WARNING]
> **Keep `pg_dump` on the server's major version.** `pg_dump` refuses to dump a server newer than itself. The lab's Postgres is **16.3**, so every dump/restore container pins `postgres:16.3`. Dolibarr's built-in *Tools → Database backup* page (`/admin/tools/dolibarr_export.php`) historically shells out to the image's bundled `pg_dump` (e.g. 11.x), which aborts with `server version mismatch`. Use the CronJob, the Ansible playbooks, or `create_backup.sh`; never run the in-app export against a newer server.
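
The version rule in the warning above can be checked before a dump is attempted. This is a minimal sketch under stated assumptions: `pg_major` and `client_can_dump` are hypothetical helper names, and the inputs are assumed to look like the output of `pg_dump --version` and of `SHOW server_version;`.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight check; neither function exists in the erp repository.

# Extract the major version from strings such as "16.3" or
# "pg_dump (PostgreSQL) 16.3": take the first dotted number, keep its first field.
pg_major() {
  printf '%s\n' "$1" | grep -Eo '[0-9]+(\.[0-9]+)*' | head -n1 | cut -d. -f1
}

# Succeed only when the client is at least as new as the server,
# which is the condition pg_dump itself enforces.
client_can_dump() {
  local client server
  client=$(pg_major "$1")
  server=$(pg_major "$2")
  [ "$client" -ge "$server" ]
}
```

A call such as `client_can_dump "$(pg_dump --version)" "16.3"` returns non-zero for an 11.x client, which is the situation the in-app export runs into.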

## What is — and is NOT — in the dump

The `pg_dump` captures **only the relational database**. Everything users *upload* lives on the Longhorn PVC and is protected by a completely separate mechanism. Conflating the two is the classic way to lose business records.

| Data | Where it lives | Protected by |
|---|---|---|
| Invoices, third parties, accounting rows, config rows (`llx_*` tables) | `erp` Postgres DB | `pg_dump` archives (this page) |
| Uploaded documents, generated PDFs, attachments | `/var/www/documents` on the Longhorn RWX PVC | **Longhorn** snapshots / backups |
| Custom modules / overrides | `/var/www/html/custom` on the same PVC | **Longhorn** snapshots / backups |

> [!IMPORTANT]
> A `pg_dump` alone does **not** make erp recoverable. A full recovery needs *both* the latest `pg_dump_erp_*.sql.gz` **and** the Longhorn-restored document volume. The backup archive itself sits on that same PVC (`/documents/admin/backup`), so it rides along with the Longhorn snapshot — but treat the database and the documents as two artifacts that must be restored together. See the [storage concept](../lab-ecosystem/storage-and-recovery.md) for how the Longhorn volume is snapshotted and recovered.

## Restore

The restore is the Ansible-driven Job `dolibarr-restore` from [`ansible/.../playbooks/restore.yml`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/ansible/arcodange/erp/playbooks/restore.yml) (`postgres:16.3`). It **auto-discovers the most recent** `pg_dump_erp_*.sql.gz` via `ls -t ... | head -n1`, or you can pin a specific archive with `-e backup_file=...`. It runs `backoffLimit: 0` / `restartPolicy: Never` so a failed restore leaves a terminated pod you can inspect. **Restoring into a live database corrupts state** — scale erp to zero first.
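
The "newest archive unless one is pinned" rule described above is easy to reproduce in isolation. The sketch below is an illustration rather than the playbook's code: `pick_archive` is a hypothetical name, and its second argument plays the role of the `backup_file` variable.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the archive a restore would use.
#   $1 = backup directory, $2 = optional pinned archive path
pick_archive() {
  local dir="$1" pinned="${2:-}" newest
  if [ -n "$pinned" ]; then
    # A pinned archive wins, but it has to exist.
    [ -f "$pinned" ] || { echo "pinned archive not found: $pinned" >&2; return 1; }
    printf '%s\n' "$pinned"
    return 0
  fi
  # ls -t sorts by modification time, newest first.
  newest=$(ls -t "$dir"/pg_dump_erp_*.sql.gz 2>/dev/null | head -n1)
  [ -n "$newest" ] || { echo "no archive found in $dir" >&2; return 1; }
  printf '%s\n' "$newest"
}
```

Because `ls -t` orders by modification time and not by the timestamp embedded in the file name, an older dump that was copied or touched recently can sort first; pinning the archive explicitly avoids that ambiguity.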

Ordered procedure:

1. **[AGENT]** Confirm the cluster is healthy and inspect available archives (read-only).
2. **[HUMAN]** Scale the `erp` Deployment to **0** to stop all writes.
3. **[HUMAN]** Run the restore Job (latest archive, or a pinned `backup_file`); it `tar -xzf`s the archive and `psql -f`s it into the `erp` DB.
4. **[AGENT]** Watch the Job to completion and read its logs.
5. **[HUMAN]** Scale the `erp` Deployment back to **1**.
6. **[AGENT]** Validate erp is serving and the data is present.

```bash
# [AGENT] read-only: cluster health + list backup archives on the PVC
kubectl get deploy,pods -n erp
kubectl exec -n erp deploy/erp -- ls -t /var/www/documents/admin/backup/pg_dump_erp_*.sql.gz
```

```bash
# [HUMAN] prod-mutating: stop writes before restoring
kubectl scale deploy/erp -n erp --replicas=0
kubectl rollout status deploy/erp -n erp --watch=false # expect 0 replicas
```

```bash
# [HUMAN] prod-mutating: run the restore Job
# default = newest pg_dump_erp_*.sql.gz auto-discovered on the PVC
ansible-playbook ansible/arcodange/erp/playbooks/restore.yml
# or pin an explicit archive:
ansible-playbook ansible/arcodange/erp/playbooks/restore.yml \
  -e backup_file=/documents/admin/backup/pg_dump_erp_22.0.4_2606231819.sql.gz
```

```bash
# [AGENT] read-only: follow the restore Job and its logs
kubectl get job/dolibarr-restore -n erp -o wide
kubectl logs -n erp job/dolibarr-restore
```

```bash
# [HUMAN] prod-mutating: bring erp back up
kubectl scale deploy/erp -n erp --replicas=1
kubectl rollout status deploy/erp -n erp
```

```bash
# [AGENT] read-only: validate erp is serving after restore
kubectl get pods -n erp
kubectl exec -n erp deploy/erp -- curl -sf -o /dev/null -w '%{http_code}\n' http://localhost/
```

> [!WARNING]
> **Always scale erp to 0 before restoring.** The restore loads SQL straight into the live `erp` database; concurrent writes from a running Dolibarr pod produce a half-restored, inconsistent state. Scaling back to 1 only after the Job succeeds is part of the procedure, not an optional flourish.
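
The scale-to-zero precondition can be turned into an explicit guard in front of the restore command. This is a sketch only: `require_scaled_down` is a hypothetical helper, and the replica count is passed in as an argument so the check can be exercised without a cluster.

```bash
#!/usr/bin/env bash
# Hypothetical guard: succeed only when the replica count passed in is exactly 0.
require_scaled_down() {
  local replicas="${1:-}"
  if [ "$replicas" = "0" ]; then
    return 0
  fi
  echo "refusing to restore: erp has '${replicas}' replicas, expected 0" >&2
  return 1
}

# Possible use against the cluster (read-only query, then the guard):
#   replicas=$(kubectl get deploy/erp -n erp -o jsonpath='{.spec.replicas}')
#   require_scaled_down "$replicas" \
#     && ansible-playbook ansible/arcodange/erp/playbooks/restore.yml
```

Note that `.spec.replicas` being 0 only states the desired count; pods may still be terminating for a short while, so `kubectl get pods -n erp` remains a useful second look before the restore starts.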

## Recovery ordering (cluster rebuild)

> [!CAUTION]
> **Vault MUST be unsealed before erp is scaled up.** The Dolibarr pod has no DB credentials of its own — it depends entirely on VSO materialising `vso-db-credentials` from `postgres/creds/erp` (`DOLI_DB_USER` / `DOLI_DB_PASSWORD`). If erp is scaled up while Vault is still sealed, VSO cannot reconcile the secret and the pod crash-loops with no database access. During a cluster rebuild the order is fixed:
>
> 1. **Recover Longhorn volumes** — bring the document PVC (and the `documents/admin/backup` archives riding on it) back online.
> 2. **Unseal Vault** — so VSO can issue erp's dynamic DB credentials and static config.
> 3. **Scale erp to 1** — only now does the pod come up with usable creds.
> 4. **(Optional) restore data** — if the DB needs rolling back to a `pg_dump`, scale to 0, run the restore Job, scale back to 1 (see [Restore](#restore) above).
>
> This sequence is the storage→secrets→apps backbone described in the [storage concept](../lab-ecosystem/storage-and-recovery.md) and executed by the [factory recover playbooks](../factory-provisioning/ansible/06-recover.md); the cluster-wide ordering lives in the CLUSTER_RECOVERY.md runbook.
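
The "unseal before scale-up" rule above lends itself to a small gate. The sketch below assumes the seal state is read from `vault status -format=json`, whose output carries a boolean `sealed` field; `vault_unsealed` and `gate_scale_up` are hypothetical names, and the Vault namespace and pod in the usage comment are placeholders, not values taken from this guidebook.

```bash
#!/usr/bin/env bash
# Hypothetical gate for step 3 of the recovery ordering.

# Read `vault status -format=json` output on stdin;
# succeed only if it reports "sealed": false.
vault_unsealed() {
  grep -Eq '"sealed"[[:space:]]*:[[:space:]]*false'
}

# Report whether it is safe to scale erp up, based on the status JSON on stdin.
gate_scale_up() {
  if vault_unsealed; then
    echo "vault unsealed: safe to scale erp to 1"
  else
    echo "vault sealed or unreachable: do not scale erp yet" >&2
    return 1
  fi
}

# Possible use (placeholders in angle brackets are assumptions):
#   kubectl exec -n <vault-namespace> <vault-pod> -- vault status -format=json \
#     | gate_scale_up && kubectl scale deploy/erp -n erp --replicas=1
```

An empty or unparsable status is treated the same as a sealed Vault, so the gate fails closed when Vault cannot be reached.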

## The ownership fix

After activating a new Dolibarr module — or whenever the dynamic DB user rotates and creates tables under a fresh role — `public` schema tables can end up owned by a credential that no longer exists, breaking subsequent migrations and dumps. Two SQL helpers reassign ownership back to the stable **`erp_role`**:

| Script | Mechanism | Use |
|---|---|---|
| [`backup/erp_role_as_table_owner.sql`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/backup/erp_role_as_table_owner.sql) | Loops every `public` table and `ALTER TABLE ... OWNER TO erp_role` | Force-set ownership table-by-table |
| [`chart/scripts/update_ownership.sql`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/scripts/update_ownership.sql) | Detects the current schema owner and `REASSIGN OWNED BY <owner> TO erp_role` only when it differs | The idempotent, chart-shipped fix |

The chart wires `update_ownership.sql` into a `pg-fix-table-ownership` CronJob; trigger it on demand after a module activation:

```bash
# [HUMAN] prod-mutating: reassign public-schema table ownership to erp_role
kubectl create job \
  --from=cronjob/pg-fix-table-ownership \
  pg-fix-table-ownership-manual-trigger-$(date +%Y%m%d%H%M%S) \
  -n kube-system
```

Run this **before** a backup if you suspect ownership drift, so the dump records the correct owner. More on day-to-day fix-ups and audits in [Operations](operations.md).

## Flow

```mermaid
%%{init: {'theme': 'base'}}%%
flowchart LR
    classDef sched fill:#2563eb,stroke:#1e40af,color:#fff
    classDef proc fill:#059669,stroke:#047857,color:#fff
    classDef store fill:#7c3aed,stroke:#6d28d9,color:#fff
    classDef db fill:#b45309,stroke:#92400e,color:#fff

    VSO["VSO secret<br>vso-db-credentials<br>(postgres/creds/erp)"]:::store
    CRON["CronJob dolibarr-backup<br>schedule 0 4 * * *"]:::sched
    DUMPJOB["pg_dump Job<br>postgres:16.3"]:::proc
    GZIP["gzip stream<br>--inserts --no-tablespaces"]:::proc
    PVC["Longhorn RWX PVC<br> /documents/admin/backup<br>pg_dump_erp_*.sql.gz"]:::store
    RESTOREJOB["Restore Job dolibarr-restore<br>postgres:16.3"]:::proc
    PSQL["psql -f dump.sql"]:::proc
    DB["erp Postgres DB<br>via pgbouncer.tools"]:::db

    CRON -- "spawns" --> DUMPJOB
    DUMPJOB -- "pg_dump" --> GZIP
    GZIP -- "writes archive" --> PVC
    PVC -- "ls -t latest .sql.gz" --> RESTOREJOB
    RESTOREJOB -- "tar -xzf then" --> PSQL
    PSQL -- "loads into" --> DB
    DUMPJOB -- "dumps from" --> DB
    VSO -. "DB creds" .-> DUMPJOB
    VSO -. "DB creds" .-> RESTOREJOB
```

1. The **CronJob `dolibarr-backup`** fires at `0 4 * * *` and **spawns** a `pg_dump` Job (`postgres:16.3`).
2. The Job **dumps** the live `erp` database (logical, `--inserts --no-tablespaces`) — reading credentials from the **VSO `vso-db-credentials`** secret.
3. The dump streams through **gzip** and the resulting `pg_dump_erp_<version>_<timestamp>.sql.gz` is **written** to `/documents/admin/backup` on the **Longhorn RWX PVC**; archives older than 15 days are pruned.
4. On restore, the **`dolibarr-restore` Job** picks the **newest** `.sql.gz` (`ls -t | head -n1`, or a pinned `backup_file`) from the PVC — also using the **VSO** credentials.
5. The restore Job **`tar -xzf`s** the archive and **`psql -f`s** it back **into** the `erp` database (with erp scaled to 0 first).

## Gotchas

> [!WARNING]
> - **15-day retention only.** The CronJob deletes any `pg_dump_erp_*.sql.gz` older than 15 days. If you need long-term or compliance copies, pull archives **off-cluster** before they age out — nothing here keeps a month-old dump.
> - **Version match is mandatory.** Pin `pg_dump`/`psql` to the server's major version (16.x); `pg_dump` refuses to dump a server newer than itself. Every Job pins `postgres:16.3`; the in-app Dolibarr export, which uses an older bundled `pg_dump`, aborts with `server version mismatch`.
> - **Scale to 0 before restore.** Restoring into a running erp produces an inconsistent database; scale the Deployment to 0, restore, then back to 1.
> - **Vault unseal precedes scale-up.** erp's DB creds come from VSO; a sealed Vault means a crash-looping pod. Follow the [recovery ordering](#recovery-ordering-cluster-rebuild) on any rebuild.
> - **The admin/Postgres password lives in OpenTofu state.** The per-app database and role are declared in IaC, so the authoritative credential material is held in the **TF state** — treat that state as a secret and recover it alongside Vault. See [factory postgres-iac](../factory-provisioning/opentofu/postgres-iac.md).
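
For the off-cluster copies mentioned in the first gotcha, `kubectl cp` is one option. The sketch below only prints the command so it can be reviewed before running; `offcluster_copy_cmd` is a hypothetical helper, the pod name is whatever `kubectl get pods -n erp` reports, and the destination directory is an assumption.

```bash
#!/usr/bin/env bash
# Hypothetical dry-run helper: print the kubectl cp command that would copy
# one archive out of the erp pod's backup directory.
#   $1 = pod name, $2 = archive file name, $3 = local destination directory
offcluster_copy_cmd() {
  local pod="$1" archive="$2" dest="$3"
  printf 'kubectl cp erp/%s:/var/www/documents/admin/backup/%s %s/%s\n' \
    "$pod" "$archive" "$dest" "$archive"
}
```

Printing instead of executing keeps the helper read-only; `kubectl cp` also relies on `tar` being present inside the container, which is worth confirming on the erp image before depending on it.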

## Cross-references

- [Deployment](deployment.md) — the chart, the document PVC, and the Vault CRDs (dynamic creds + static config) that this subsystem depends on.
- [Operations](operations.md) — day-to-day operational tasks including the table-ownership fix-ups and liveness checks.
- [storage concept](../lab-ecosystem/storage-and-recovery.md) — how the Longhorn document PVC (and its riding backup archives) is snapshotted and recovered.
- [factory recover playbooks](../factory-provisioning/ansible/06-recover.md) — the Ansible recovery steps that must run before erp is scaled back up.
- [tools secrets-and-vso](../tools/secrets-and-vso.md) — the VSO runtime that materialises `vso-db-credentials`, feeding both the backup and restore Jobs.
- [factory postgres-iac](../factory-provisioning/opentofu/postgres-iac.md) — the per-app `erp` PostgreSQL database + role, and the TF state that holds its admin password.