docs(vibe): add erp/ guidebook (Dolibarr deployment + backup/recovery + ops)

Dedicated tree-docs guidebook under vibe/guidebooks/erp/ for the lab's most
data-critical app, cross-linked from the applications hub (bidirectional):

- README.md             : Dolibarr 22.0.4 on Postgres; data-criticality; overview
  diagram; the Vault-unseal-before-scale recovery ordering (CAUTION).
- deployment.md         : upstream image + custom entrypoint (MySQL->psql), the
  50Gi Longhorn RWX documents PVC, Vault CRDs + the shared app_roles iac, init
  scripts (conf.php creds, table-ownership), ingress, CI.
- backup-and-recovery.md: the Ansible CronJob pg_dump (daily 04:00, 15-day
  retention) + restore Job (scale-0 -> restore -> scale-1); the cluster recovery
  ordering (Longhorn -> Vault unseal -> erp scale-up).
- operations.md         : the read-only bin/arcodange CLI, static/company.json,
  Deno+Playwright tests, day-2 ops.

erp code via full gitea URLs; CLUSTER_RECOVERY.md by name; 2 mermaid diagrams
MCP-validated; zero dead links.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 22:12:11 +02:00
parent 4823394e0e
commit 7bf83e75ed
6 changed files with 659 additions and 2 deletions


@@ -0,0 +1,207 @@
[vibe](../../README.md) > [Guidebooks](../README.md) > [ERP](README.md) > **Backup & recovery**
# Backup & recovery
> **Status:** ✅ Active
> **Last Updated:** 2026-06-23
> **Upstream:** [ERP](README.md) · [Deployment](deployment.md)
> **Downstream:** [Operations](operations.md)
> **Related:** [storage concept](../lab-ecosystem/storage-and-recovery.md) · [factory recover playbooks](../factory-provisioning/ansible/06-recover.md) · [tools secrets-and-vso](../tools/secrets-and-vso.md) · [factory postgres-iac](../factory-provisioning/opentofu/postgres-iac.md)
`erp` is the lab's **single most data-critical application**, so it carries its own backup/restore subsystem layered on top of the cluster's storage and secrets machinery. Two independent data stores have to survive an incident: the **`erp` PostgreSQL database** (captured by a `pg_dump`) and the **uploaded documents** on the Longhorn PVC (captured by Longhorn snapshots/backups, *not* the `pg_dump`). This page covers both, the daily backup CronJob, the restore Job, and the load-bearing recovery ordering that keeps erp from crash-looping during a cluster rebuild.
## Backup mechanism
The recurring backup is an Ansible-deployed Kubernetes **CronJob** named `dolibarr-backup` in namespace `erp`, declared by [`ansible/arcodange/erp/playbooks/recurrentBackup.yml`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/ansible/arcodange/erp/playbooks/recurrentBackup.yml). Each scheduled tick spawns a one-shot `postgres:16.3` Job that takes a **logical** dump of the `erp` database, gzips it, and lands the archive on the same Longhorn PVC that holds erp's documents.
The pipeline inside each run:
1. **Detect version**: `psql ... -c "SELECT value FROM llx_const WHERE name='MAIN_VERSION_LAST_UPGRADE';"` reads the live Dolibarr version straight from the database, so the archive name records exactly which schema it came from.
2. **Dump + compress**: `pg_dump -d erp --no-tablespaces --inserts | gzip > <archive>`. The `--inserts` flag emits row-by-row `INSERT` statements (portable, version-tolerant restores) and `--no-tablespaces` strips host-specific tablespace clauses.
3. **Write to PVC**: the archive lands at `/documents/admin/backup/pg_dump_erp_<version>_<timestamp>.sql.gz`, where the container mounts the `erp` PVC with `subPath: documents/admin/backup`.
4. **Prune**: `find /documents/admin/backup -name "pg_dump_erp_*.sql.gz" -type f -mtime +15 -delete` removes archives past the 15-day retention window. `-mtime +15` counts whole 24-hour periods, so a file is deleted once it is at least 16 days old.
DB credentials are supplied by the **VSO-materialised `vso-db-credentials` secret** (`envFrom` + `PGPASSWORD` from its `password` key), the same dynamic `postgres/creds/erp` secret the pod uses — see [tools secrets-and-vso](../tools/secrets-and-vso.md). The Job runs `backoffLimit: 0` with `restartPolicy: Never`, so a failed run leaves an inspectable terminated pod rather than retrying blindly.
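
As a rough illustration of the naming and pruning steps above, the sketch below builds an archive name and applies the same `find` expression to a directory. The helper names `archive_name` and `prune_old_archives` are hypothetical and are not taken from `recurrentBackup.yml`; the timestamp is passed in verbatim because its exact format is not stated here.

```bash
#!/usr/bin/env bash
# Sketch only: hypothetical helpers, not the contents of recurrentBackup.yml.

# Build an archive file name from a Dolibarr version and a caller-supplied timestamp.
archive_name() {
  local version="$1" timestamp="$2"
  printf 'pg_dump_erp_%s_%s.sql.gz\n' "$version" "$timestamp"
}

# Delete matching archives whose age, in whole 24-hour periods, is greater than 15.
# Files that do not match the pg_dump_erp_*.sql.gz pattern are left alone.
prune_old_archives() {
  local dir="$1"
  find "$dir" -name 'pg_dump_erp_*.sql.gz' -type f -mtime +15 -delete
}
```

For example, `archive_name 22.0.4 2606231819` prints `pg_dump_erp_22.0.4_2606231819.sql.gz`, and `prune_old_archives /documents/admin/backup` would apply the retention rule to the mounted backup directory.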
### Schedule, retention & artifacts
| Property | Value | Source |
|---|---|---|
| Resource | CronJob `dolibarr-backup` (ns `erp`) | [recurrentBackup.yml](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/ansible/arcodange/erp/playbooks/recurrentBackup.yml) |
| Schedule | `0 4 * * *` (04:00 daily) | `spec.schedule` |
| Successful job history | `successfulJobsHistoryLimit: 3` | `spec` |
| Failed job history | `failedJobsHistoryLimit: 3` | `spec` |
| Retention | 15 days (`find -mtime +15 -delete`) | dump script |
| Dump image | `postgres:16.3` | `jobTemplate` container |
| Dump command | `pg_dump --no-tablespaces --inserts` (logical) | dump script |
| Compression | `gzip` (CronJob) / `tar -czf` (ad-hoc) | dump scripts |
| Archive path | `/documents/admin/backup/pg_dump_erp_<version>_<timestamp>.sql.gz` | mount + dump script |
| Mount | PVC `erp`, `subPath: documents/admin/backup` | `volumeMounts` |
| Failure policy | `backoffLimit: 0`, `restartPolicy: Never` | `jobTemplate` |
### Ad-hoc & manual alternatives
Two escape hatches exist for an on-demand dump outside the 04:00 schedule:
| Tool | What it does | When to reach for it |
|---|---|---|
| [`ansible/.../playbooks/backup.yml`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/ansible/arcodange/erp/playbooks/backup.yml) | One-shot Ansible **Job** `dolibarr-backup` (`postgres:16.3`); fetches the ERP version by scraping `https://erp.arcodange.lab/`, dumps with the same `--no-tablespaces --inserts` flags, `tar -czf` into the PVC, and waits for completion | A single immediate dump driven from a control host that can reach the cluster API |
| [`backup/create_backup.sh`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/backup/create_backup.sh) | Pure `kubectl` shell: `kubectl run pg-dump-temp` (`postgres:16.3`) to `pg_dump -c` locally, then `kubectl cp` the archive into the running erp pod's `/var/www/documents/admin/backup/` | A laptop with `kubectl` access but no Ansible setup |
> [!WARNING]
> **Keep `pg_dump` and the server on the same major version.** The lab's Postgres is **16.3**, so every dump/restore container pins `postgres:16.3`. Dolibarr's built-in *Tools → Database backup* page (`/admin/tools/dolibarr_export.php`) historically shells out to the image's bundled `pg_dump` (e.g. 11.x); `pg_dump` refuses to dump a server newer than itself and aborts with `server version mismatch`. Use the CronJob, the Ansible playbooks, or `create_backup.sh`, and never the in-app export against a newer server.
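
A pre-flight check for the version rule above could look like the following sketch. The helpers `pg_major` and `same_major` are hypothetical names introduced for illustration; they only parse version strings and do not contact a server themselves.

```bash
#!/usr/bin/env bash
# Sketch only: hypothetical helpers for comparing PostgreSQL major versions.

# Print the major version found in strings such as
# "pg_dump (PostgreSQL) 16.3" or "16.3 (Debian 16.3-1.pgdg120+1)".
pg_major() {
  printf '%s\n' "$1" | grep -oE '[0-9]+(\.[0-9]+)*' | head -n1 | cut -d. -f1
}

# Return 0 when both version strings carry the same, non-empty major version.
same_major() {
  local client server
  client="$(pg_major "$1")"
  server="$(pg_major "$2")"
  [ -n "$client" ] && [ "$client" = "$server" ]
}
```

One way to use it, assuming the `PG*` connection variables are already set, is `same_major "$(pg_dump --version)" "$(psql -At -c 'SHOW server_version')" || echo "major version mismatch" >&2`.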
## What is — and is NOT — in the dump
The `pg_dump` captures **only the relational database**. Everything users *upload* lives on the Longhorn PVC and is protected by a completely separate mechanism. Conflating the two is the classic way to lose business records.
| Data | Where it lives | Protected by |
|---|---|---|
| Invoices, third parties, accounting rows, config rows (`llx_*` tables) | `erp` Postgres DB | `pg_dump` archives (this page) |
| Uploaded documents, generated PDFs, attachments | `/var/www/documents` on the Longhorn RWX PVC | **Longhorn** snapshots / backups |
| Custom modules / overrides | `/var/www/html/custom` on the same PVC | **Longhorn** snapshots / backups |
> [!IMPORTANT]
> A `pg_dump` alone does **not** make erp recoverable. A full recovery needs *both* the latest `pg_dump_erp_*.sql.gz` **and** the Longhorn-restored document volume. The backup archive itself sits on that same PVC (`/documents/admin/backup`), so it rides along with the Longhorn snapshot — but treat the database and the documents as two artifacts that must be restored together. See the [storage concept](../lab-ecosystem/storage-and-recovery.md) for how the Longhorn volume is snapshotted and recovered.
## Restore
The restore is the Ansible-driven Job `dolibarr-restore` from [`ansible/.../playbooks/restore.yml`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/ansible/arcodange/erp/playbooks/restore.yml) (`postgres:16.3`). It **auto-discovers the most recent** `pg_dump_erp_*.sql.gz` via `ls -t ... | head -n1`, or you can pin a specific archive with `-e backup_file=...`. It runs `backoffLimit: 0` / `restartPolicy: Never` so a failed restore leaves a terminated pod you can inspect. **Restoring into a live database corrupts state** — scale erp to zero first.
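
The archive selection described above (newest by default, explicit override when pinned) can be sketched as a small shell function. `pick_archive` is a hypothetical name; the real logic lives in `restore.yml` and may differ in detail.

```bash
#!/usr/bin/env bash
# Sketch only: hypothetical helper mirroring "ls -t ... | head -n1" with an override.

# Print the archive to restore: the pinned path if one is given,
# otherwise the most recently modified pg_dump_erp_*.sql.gz in the directory.
pick_archive() {
  local dir="$1" override="${2:-}" newest
  if [ -n "$override" ]; then
    printf '%s\n' "$override"
    return 0
  fi
  # ls -t sorts newest first; head keeps only the first entry.
  newest="$(ls -t "$dir"/pg_dump_erp_*.sql.gz 2>/dev/null | head -n1)"
  if [ -z "$newest" ]; then
    echo "no pg_dump_erp_*.sql.gz archive found in $dir" >&2
    return 1
  fi
  printf '%s\n' "$newest"
}
```

Failing loudly when no archive exists is a design choice of this sketch: it keeps an empty backup directory from turning into a restore of nothing.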
Ordered procedure:
1. **[AGENT]** Confirm the cluster is healthy and inspect available archives (read-only).
2. **[HUMAN]** Scale the `erp` Deployment to **0** to stop all writes.
3. **[HUMAN]** Run the restore Job (latest archive, or a pinned `backup_file`); it `tar -xzf`s the archive and `psql -f`s it into the `erp` DB.
4. **[AGENT]** Watch the Job to completion and read its logs.
5. **[HUMAN]** Scale the `erp` Deployment back to **1**.
6. **[AGENT]** Validate erp is serving and the data is present.
```bash
# [AGENT] read-only: cluster health + list backup archives on the PVC
kubectl get deploy,pods -n erp
kubectl exec -n erp deploy/erp -- ls -t /var/www/documents/admin/backup/pg_dump_erp_*.sql.gz
```
```bash
# [HUMAN] prod-mutating: stop writes before restoring
kubectl scale deploy/erp -n erp --replicas=0
kubectl rollout status deploy/erp -n erp --watch=false # expect 0 replicas
```
```bash
# [HUMAN] prod-mutating: run the restore Job
# default = newest pg_dump_erp_*.sql.gz auto-discovered on the PVC
ansible-playbook ansible/arcodange/erp/playbooks/restore.yml
# or pin an explicit archive:
ansible-playbook ansible/arcodange/erp/playbooks/restore.yml \
  -e backup_file=/documents/admin/backup/pg_dump_erp_22.0.4_2606231819.sql.gz
```
```bash
# [AGENT] read-only: follow the restore Job and its logs
kubectl get job/dolibarr-restore -n erp -o wide
kubectl logs -n erp job/dolibarr-restore
```
```bash
# [HUMAN] prod-mutating: bring erp back up
kubectl scale deploy/erp -n erp --replicas=1
kubectl rollout status deploy/erp -n erp
```
```bash
# [AGENT] read-only: validate erp is serving after restore
kubectl get pods -n erp
kubectl exec -n erp deploy/erp -- curl -sf -o /dev/null -w '%{http_code}\n' http://localhost/
```
> [!WARNING]
> **Always scale erp to 0 before restoring.** The restore loads SQL straight into the live `erp` database; concurrent writes from a running Dolibarr pod produce a half-restored, inconsistent state. Scaling back to 1 only after the Job succeeds is part of the procedure, not an optional flourish.
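
To make the scale-to-zero rule harder to skip, a restore wrapper could refuse to proceed while erp still has replicas. The function name `erp_is_scaled_down` is hypothetical; the sketch assumes the Deployment is called `erp` in namespace `erp`, as in the commands above.

```bash
#!/usr/bin/env bash
# Sketch only: [AGENT] read-only guard, hypothetical helper name.

# Return 0 only when the erp Deployment requests zero replicas and reports none.
erp_is_scaled_down() {
  local spec status
  spec="$(kubectl get deploy erp -n erp -o jsonpath='{.spec.replicas}')" || return 1
  status="$(kubectl get deploy erp -n erp -o jsonpath='{.status.replicas}')" || return 1
  # .status.replicas may be omitted when no pods remain, so empty counts as 0.
  [ "$spec" = "0" ] && [ "${status:-0}" = "0" ]
}
```

A wrapper might then run `erp_is_scaled_down || { echo "erp still has replicas; aborting restore" >&2; exit 1; }` before invoking the restore playbook.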
## Recovery ordering (cluster rebuild)
> [!CAUTION]
> **Vault MUST be unsealed before erp is scaled up.** The Dolibarr pod has no DB credentials of its own — it depends entirely on VSO materialising `vso-db-credentials` from `postgres/creds/erp` (`DOLI_DB_USER` / `DOLI_DB_PASSWORD`). If erp is scaled up while Vault is still sealed, VSO cannot reconcile the secret and the pod crash-loops with no database access. During a cluster rebuild the order is fixed:
>
> 1. **Recover Longhorn volumes** — bring the document PVC (and the `documents/admin/backup` archives riding on it) back online.
> 2. **Unseal Vault** — so VSO can issue erp's dynamic DB credentials and static config.
> 3. **Scale erp to 1** — only now does the pod come up with usable creds.
> 4. **(Optional) restore data** — if the DB needs rolling back to a `pg_dump`, scale to 0, run the restore Job, scale back to 1 (see [Restore](#restore) above).
>
> This sequence is the storage→secrets→apps backbone described in the [storage concept](../lab-ecosystem/storage-and-recovery.md) and executed by the [factory recover playbooks](../factory-provisioning/ansible/06-recover.md); the cluster-wide ordering lives in the CLUSTER_RECOVERY.md runbook.
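
The unseal-before-scale rule can be expressed as a guard around the scale-up command. This is a sketch under assumptions: it expects a `vault` CLI on the operator's host with `VAULT_ADDR` pointing at the lab's Vault, and the function names are hypothetical. It relies on `vault status` exiting 0 when unsealed, 2 when sealed, and 1 on error.

```bash
#!/usr/bin/env bash
# Sketch only: [HUMAN] prod-mutating when the guard passes; hypothetical helper names.

# Return 0 only when Vault answers and reports itself unsealed.
vault_is_unsealed() {
  vault status >/dev/null 2>&1
}

# Scale erp to 1 only if Vault is unsealed; otherwise refuse and change nothing.
scale_erp_up_guarded() {
  if ! vault_is_unsealed; then
    echo "Vault is sealed or unreachable; not scaling erp" >&2
    return 1
  fi
  kubectl scale deploy/erp -n erp --replicas=1
}
```

Treating "unreachable" the same as "sealed" is deliberate here: in both cases VSO cannot materialise `vso-db-credentials`, so scaling up would only produce a crash-looping pod.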
## The ownership fix
After activating a new Dolibarr module — or whenever the dynamic DB user rotates and creates tables under a fresh role — `public` schema tables can end up owned by a credential that no longer exists, breaking subsequent migrations and dumps. Two SQL helpers reassign ownership back to the stable **`erp_role`**:
| Script | Mechanism | Use |
|---|---|---|
| [`backup/erp_role_as_table_owner.sql`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/backup/erp_role_as_table_owner.sql) | Loops every `public` table and `ALTER TABLE ... OWNER TO erp_role` | Force-set ownership table-by-table |
| [`chart/scripts/update_ownership.sql`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/scripts/update_ownership.sql) | Detects the current schema owner and `REASSIGN OWNED BY <owner> TO erp_role` only when it differs | The idempotent, chart-shipped fix |
The chart wires `update_ownership.sql` into a `pg-fix-table-ownership` CronJob; trigger it on demand after a module activation:
```bash
# [HUMAN] prod-mutating: reassign public-schema table ownership to erp_role
kubectl create job \
  --from=cronjob/pg-fix-table-ownership \
  pg-fix-table-ownership-manual-trigger-$(date +%Y%m%d%H%M%S) \
  -n kube-system
```
Run this **before** a backup if you suspect ownership drift, so the dump records the correct owner. More on day-to-day fix-ups and audits in [Operations](operations.md).
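
To decide whether the fix is needed at all, ownership drift can be listed read-only from the standard `pg_tables` view. The sketch below assumes a `psql` client whose connection settings (`PGHOST`, `PGUSER`, `PGPASSWORD`) are already exported; the function names are hypothetical and not part of the chart.

```bash
#!/usr/bin/env bash
# Sketch only: [AGENT] read-only audit, hypothetical helper names.

# Print one "table|owner" line per public-schema table not owned by erp_role.
list_ownership_drift() {
  psql -At -d erp -c \
    "SELECT tablename, tableowner FROM pg_tables
     WHERE schemaname = 'public' AND tableowner <> 'erp_role'
     ORDER BY tablename;"
}

# Return 0 when no drift is found, 1 when drift exists, 2 when the query fails.
ownership_is_clean() {
  local drift
  drift="$(list_ownership_drift)" || return 2
  [ -z "$drift" ]
}
```

An empty result from `list_ownership_drift` suggests the CronJob trigger above can be skipped before the next dump.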
## Flow
```mermaid
%%{init: {'theme': 'base'}}%%
flowchart LR
classDef sched fill:#2563eb,stroke:#1e40af,color:#fff
classDef proc fill:#059669,stroke:#047857,color:#fff
classDef store fill:#7c3aed,stroke:#6d28d9,color:#fff
classDef db fill:#b45309,stroke:#92400e,color:#fff
VSO["VSO secret<br>vso-db-credentials<br>(postgres/creds/erp)"]:::store
CRON["CronJob dolibarr-backup<br>schedule 0 4 * * *"]:::sched
DUMPJOB["pg_dump Job<br>postgres:16.3"]:::proc
GZIP["gzip stream<br>--inserts --no-tablespaces"]:::proc
PVC["Longhorn RWX PVC<br> /documents/admin/backup<br>pg_dump_erp_*.sql.gz"]:::store
RESTOREJOB["Restore Job dolibarr-restore<br>postgres:16.3"]:::proc
PSQL["psql -f dump.sql"]:::proc
DB["erp Postgres DB<br>via pgbouncer.tools"]:::db
CRON -- "spawns" --> DUMPJOB
DUMPJOB -- "pg_dump" --> GZIP
GZIP -- "writes archive" --> PVC
PVC -- "ls -t latest .sql.gz" --> RESTOREJOB
RESTOREJOB -- "tar -xzf then" --> PSQL
PSQL -- "loads into" --> DB
DUMPJOB -- "dumps from" --> DB
VSO -. "DB creds" .-> DUMPJOB
VSO -. "DB creds" .-> RESTOREJOB
```
1. The **CronJob `dolibarr-backup`** fires at `0 4 * * *` and **spawns** a `pg_dump` Job (`postgres:16.3`).
2. The Job **dumps** the live `erp` database (logical, `--inserts --no-tablespaces`) — reading credentials from the **VSO `vso-db-credentials`** secret.
3. The dump streams through **gzip** and the resulting `pg_dump_erp_<version>_<timestamp>.sql.gz` is **written** to `/documents/admin/backup` on the **Longhorn RWX PVC**; archives older than 15 days are pruned.
4. On restore, the **`dolibarr-restore` Job** picks the **newest** `.sql.gz` (`ls -t | head -n1`, or a pinned `backup_file`) from the PVC — also using the **VSO** credentials.
5. The restore Job **`tar -xzf`s** the archive and **`psql -f`s** it back **into** the `erp` database (with erp scaled to 0 first).
## Gotchas
> [!WARNING]
> - **15-day retention only.** The CronJob deletes any `pg_dump_erp_*.sql.gz` older than 15 days. If you need long-term or compliance copies, pull archives **off-cluster** before they age out — nothing here keeps a month-old dump.
> - **Version match is mandatory.** `pg_dump`/`psql` major version must equal the server's (16.x). Every Job pins `postgres:16.3`; the in-app Dolibarr export against the newer server aborts with `server version mismatch`.
> - **Scale to 0 before restore.** Restoring into a running erp produces an inconsistent database; scale the Deployment to 0, restore, then back to 1.
> - **Vault unseal precedes scale-up.** erp's DB creds come from VSO; a sealed Vault means a crash-looping pod. Follow the [recovery ordering](#recovery-ordering-cluster-rebuild) on any rebuild.
> - **The admin/Postgres password lives in OpenTofu state.** The per-app database and role are declared in IaC, so the authoritative credential material is held in the **TF state** — treat that state as a secret and recover it alongside Vault. See [factory postgres-iac](../factory-provisioning/opentofu/postgres-iac.md).
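
For the off-cluster copies mentioned in the retention gotcha, one possible approach is to stream the newest archive out of the running erp pod and verify it locally. This is a sketch, not an existing tool in the repo: `copy_newest_archive` is a hypothetical name, and it assumes the erp pod is running and has `sh`, `ls`, `head`, and `cat` available.

```bash
#!/usr/bin/env bash
# Sketch only: [AGENT] read-only against the cluster; writes to a local directory.

# Copy the newest pg_dump archive from the erp pod into a local directory
# and check that the copy is a readable gzip stream.
copy_newest_archive() {
  local dest="$1" newest out
  newest="$(kubectl exec -n erp deploy/erp -- \
    sh -c 'ls -t /var/www/documents/admin/backup/pg_dump_erp_*.sql.gz | head -n1')" || return 1
  [ -n "$newest" ] || return 1
  out="$dest/$(basename "$newest")"
  kubectl exec -n erp deploy/erp -- cat "$newest" > "$out" || return 1
  # gzip -t tests integrity without writing the decompressed data anywhere.
  gzip -t "$out"
}
```

Running `copy_newest_archive /path/to/offsite/dir` on a schedule shorter than 15 days would keep a copy of every archive before the CronJob prunes it.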
## Cross-references
- [Deployment](deployment.md) — the chart, the document PVC, and the Vault CRDs (dynamic creds + static config) that this subsystem depends on.
- [Operations](operations.md) — day-to-day operational tasks including the table-ownership fix-ups and liveness checks.
- [storage concept](../lab-ecosystem/storage-and-recovery.md) — how the Longhorn document PVC (and its riding backup archives) is snapshotted and recovered.
- [factory recover playbooks](../factory-provisioning/ansible/06-recover.md) — the Ansible recovery steps that must run before erp is scaled back up.
- [tools secrets-and-vso](../tools/secrets-and-vso.md) — the VSO runtime that materialises `vso-db-credentials`, feeding both the backup and restore Jobs.
- [factory postgres-iac](../factory-provisioning/opentofu/postgres-iac.md) — the per-app `erp` PostgreSQL database + role, and the TF state that holds its admin password.