docs(vibe): add erp/ guidebook (Dolibarr deployment + backup/recovery + ops)

Dedicated tree-docs guidebook under vibe/guidebooks/erp/ for the lab's most data-critical app, cross-linked from the applications hub (bidirectional):

- README.md: Dolibarr 22.0.4 on Postgres; data-criticality; overview diagram; the Vault-unseal-before-scale recovery ordering (CAUTION).
- deployment.md: upstream image + custom entrypoint (MySQL->psql), the 50Gi Longhorn RWX documents PVC, Vault CRDs + the shared app_roles iac, init scripts (conf.php creds, table-ownership), ingress, CI.
- backup-and-recovery.md: the Ansible CronJob pg_dump (daily 04:00, 15-day retention) + restore Job (scale-0 -> restore -> scale-1); the cluster recovery ordering (Longhorn -> Vault unseal -> erp scale-up).
- operations.md: the read-only bin/arcodange CLI, static/company.json, Deno+Playwright tests, day-2 ops.

erp code via full gitea URLs; CLUSTER_RECOVERY.md by name; 2 mermaid diagrams MCP-validated; zero dead links.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 22:12:11 +02:00
parent 4823394e0e
commit 7bf83e75ed
6 changed files with 659 additions and 2 deletions
--- a/vibe/guidebooks/erp/backup-and-recovery.md
+++ b/vibe/guidebooks/erp/backup-and-recovery.md
@@ -0,0 +1,207 @@
+[vibe](../../README.md) > [Guidebooks](../README.md) > [ERP](README.md) > **Backup & recovery**
+
+# Backup & recovery
+
+> **Status:** ✅ Active
+> **Last Updated:** 2026-06-23
+> **Upstream:** [ERP](README.md) · [Deployment](deployment.md)
+> **Downstream:** [Operations](operations.md)
+> **Related:** [storage concept](../lab-ecosystem/storage-and-recovery.md) · [factory recover playbooks](../factory-provisioning/ansible/06-recover.md) · [tools secrets-and-vso](../tools/secrets-and-vso.md) · [factory postgres-iac](../factory-provisioning/opentofu/postgres-iac.md)
+
+`erp` is the lab's **single most data-critical application**, so it carries its own backup/restore subsystem layered on top of the cluster's storage and secrets machinery. Two independent data stores have to survive an incident: the **`erp` PostgreSQL database** (captured by a `pg_dump`) and the **uploaded documents** on the Longhorn PVC (captured by Longhorn snapshots/backups, *not* the `pg_dump`). This page covers both, the daily backup CronJob, the restore Job, and the load-bearing recovery ordering that keeps erp from crash-looping during a cluster rebuild.
+
+## Backup mechanism
+
+The recurring backup is an Ansible-deployed Kubernetes **CronJob** named `dolibarr-backup` in namespace `erp`, declared by [`ansible/arcodange/erp/playbooks/recurrentBackup.yml`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/ansible/arcodange/erp/playbooks/recurrentBackup.yml). Each scheduled tick spawns a one-shot `postgres:16.3` Job that takes a **logical** dump of the `erp` database, gzips it, and lands the archive on the same Longhorn PVC that holds erp's documents.
+
+The pipeline inside each run:
+
+1. **Detect version** — `psql ... -c "SELECT value FROM llx_const WHERE name='MAIN_VERSION_LAST_UPGRADE';"` reads the live Dolibarr version straight from the database, so the archive name records exactly which schema it came from.
+2. **Dump + compress** — `pg_dump -d erp --no-tablespaces --inserts | gzip > <archive>`. The `--inserts` flag emits row-by-row `INSERT` statements (portable, version-tolerant restores) and `--no-tablespaces` strips host-specific tablespace clauses.
+3. **Write to PVC** — the archive lands at `/documents/admin/backup/pg_dump_erp_<version>_<timestamp>.sql.gz`, where the container mounts the `erp` PVC with `subPath: documents/admin/backup`.
+4. **Prune** — `find /documents/admin/backup -name "pg_dump_erp_*.sql.gz" -type f -mtime +15 -delete` removes anything older than 15 days.
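+
+The naming and pruning steps can be tried safely in a scratch directory. This is a minimal sketch, not the playbook's script: `archive_name` and `prune_old` are hypothetical helper names, the `YYMMDDHHMM` timestamp layout is inferred from the example archive name used in the restore section, and `touch -d` assumes GNU coreutils.
+
+```bash
+#!/usr/bin/env bash
+set -euo pipefail
+
+# Hypothetical helper: build an archive file name from a Dolibarr version.
+# The timestamp layout (YYMMDDHHMM) is an assumption, not read from the playbook.
+archive_name() {
+  printf 'pg_dump_erp_%s_%s.sql.gz' "$1" "$(date +%y%m%d%H%M)"
+}
+
+# Hypothetical helper: same find expression as the prune step, directory as a parameter.
+prune_old() {
+  find "$1" -name 'pg_dump_erp_*.sql.gz' -type f -mtime +15 -delete
+}
+
+# Demo in a throwaway directory: one fresh archive, one aged 20 days.
+demo_dir="$(mktemp -d)"
+fresh="$demo_dir/$(archive_name 22.0.4)"
+stale="$demo_dir/pg_dump_erp_22.0.4_0000000000.sql.gz"
+touch "$fresh"
+touch -d '20 days ago' "$stale"
+
+prune_old "$demo_dir"
+ls "$demo_dir"   # only the fresh archive remains
+```
+
+`find` counts age in whole 24-hour periods and discards the fraction, so `-mtime +15` only matches files that are at least 16 full days old.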
+
+DB credentials are supplied by the **VSO-materialised `vso-db-credentials` secret** (`envFrom` + `PGPASSWORD` from its `password` key), which carries the same dynamic `postgres/creds/erp` credentials the erp pod uses; see [tools secrets-and-vso](../tools/secrets-and-vso.md). The Job runs with `backoffLimit: 0` and `restartPolicy: Never`, so a failed run leaves an inspectable terminated pod rather than retrying blindly.
+
+### Schedule, retention & artifacts
+
+| Property | Value | Source |
+|---|---|---|
+| Resource | CronJob `dolibarr-backup` (ns `erp`) | [recurrentBackup.yml](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/ansible/arcodange/erp/playbooks/recurrentBackup.yml) |
+| Schedule | `0 4 * * *` (04:00 daily) | `spec.schedule` |
+| Successful job history | `successfulJobsHistoryLimit: 3` | `spec` |
+| Failed job history | `failedJobsHistoryLimit: 3` | `spec` |
+| Retention | 15 days (`find -mtime +15 -delete`) | dump script |
+| Dump image | `postgres:16.3` | `jobTemplate` container |
+| Dump command | `pg_dump --no-tablespaces --inserts` (logical) | dump script |
+| Compression | `gzip` (CronJob) / `tar -czf` (ad-hoc) | dump scripts |
+| Archive path | `/documents/admin/backup/pg_dump_erp_<version>_<timestamp>.sql.gz` | mount + dump script |
+| Mount | PVC `erp`, `subPath: documents/admin/backup` | `volumeMounts` |
+| Failure policy | `backoffLimit: 0`, `restartPolicy: Never` | `jobTemplate` |
+
+### Ad-hoc & manual alternatives
+
+Two escape hatches exist for an on-demand dump outside the 04:00 schedule:
+
+| Tool | What it does | When to reach for it |
+|---|---|---|
+| [`ansible/.../playbooks/backup.yml`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/ansible/arcodange/erp/playbooks/backup.yml) | One-shot Ansible **Job** `dolibarr-backup` (`postgres:16.3`); fetches the ERP version by scraping `https://erp.arcodange.lab/`, dumps with the same `--no-tablespaces --inserts` flags, `tar -czf` into the PVC, and waits for completion | A single immediate dump driven from a control host that can reach the cluster API |
+| [`backup/create_backup.sh`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/backup/create_backup.sh) | Pure `kubectl` shell: `kubectl run pg-dump-temp` (`postgres:16.3`) to `pg_dump -c` locally, then `kubectl cp` the archive into the running erp pod's `/var/www/documents/admin/backup/` | A laptop with `kubectl` access but no Ansible setup |
+
+> [!WARNING]
+> **`pg_dump` must not be older than the server.** `pg_dump` refuses to dump from a server newer than its own major version, so the lab pins its tooling to the server's exact release: Postgres is **16.3** and every dump/restore container uses `postgres:16.3`. Dolibarr's built-in *Tools → Database backup* page (`/admin/tools/dolibarr_export.php`) historically shells out to the image's bundled `pg_dump` (e.g. 11.x), which aborts with `server version mismatch`. Use the CronJob, the Ansible playbooks, or `create_backup.sh`; never use the in-app export against a newer server.
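+
+A pre-flight comparison of client and server major versions catches a mismatch before a dump is attempted. The sketch below is illustrative only: `pg_major` and `can_dump` are hypothetical helper names, and the version strings are passed in as arguments rather than queried from a live server.
+
+```bash
+#!/usr/bin/env bash
+
+# Hypothetical helper: extract the major version from strings such as
+# "pg_dump (PostgreSQL) 16.3" or a server_version value like "16.3".
+pg_major() {
+  printf '%s\n' "$1" | sed -E 's/^[^0-9]*([0-9]+).*/\1/'
+}
+
+# Hypothetical helper: succeed when the client is at least as new as the server.
+# pg_dump refuses to dump from a server newer than its own major version.
+can_dump() {
+  local client server
+  client="$(pg_major "$1")"
+  server="$(pg_major "$2")"
+  [ "$client" -ge "$server" ]
+}
+
+# Example inputs: a pinned 16.3 client, and an 11.x client like the one bundled in the app image.
+can_dump "pg_dump (PostgreSQL) 16.3" "16.3" && echo "16.3 client: ok"
+can_dump "pg_dump (PostgreSQL) 11.22" "16.3" || echo "11.x client: server version mismatch expected"
+```
+
+In a real check the first argument would come from `pg_dump --version` inside the dump container and the second from `SHOW server_version;` on the database.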
+
+## What is — and is NOT — in the dump
+
+The `pg_dump` captures **only the relational database**. Everything users *upload* lives on the Longhorn PVC and is protected by a completely separate mechanism. Conflating the two is the classic way to lose business records.
+
+| Data | Where it lives | Protected by |
+|---|---|---|
+| Invoices, third parties, accounting rows, config rows (`llx_*` tables) | `erp` Postgres DB | `pg_dump` archives (this page) |
+| Uploaded documents, generated PDFs, attachments | `/var/www/documents` on the Longhorn RWX PVC | **Longhorn** snapshots / backups |
+| Custom modules / overrides | `/var/www/html/custom` on the same PVC | **Longhorn** snapshots / backups |
+
+> [!IMPORTANT]
+> A `pg_dump` alone does **not** make erp recoverable. A full recovery needs *both* the latest `pg_dump_erp_*.sql.gz` **and** the Longhorn-restored document volume. The backup archive itself sits on that same PVC (`/documents/admin/backup`), so it rides along with the Longhorn snapshot — but treat the database and the documents as two artifacts that must be restored together. See the [storage concept](../lab-ecosystem/storage-and-recovery.md) for how the Longhorn volume is snapshotted and recovered.
+
+## Restore
+
+The restore is the Ansible-driven Job `dolibarr-restore` from [`ansible/.../playbooks/restore.yml`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/ansible/arcodange/erp/playbooks/restore.yml) (`postgres:16.3`). It **auto-discovers the most recent** `pg_dump_erp_*.sql.gz` via `ls -t ... | head -n1`, or you can pin a specific archive with `-e backup_file=...`. It runs `backoffLimit: 0` / `restartPolicy: Never` so a failed restore leaves a terminated pod you can inspect. **Restoring into a live database corrupts state** — scale erp to zero first.
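+
+The auto-discovery relies on modification time, not on the version or timestamp embedded in the file name. A small sketch of that behaviour in a scratch directory, assuming GNU `touch -d`; the `latest_archive` helper name and the sample file names are made up for the demo:
+
+```bash
+#!/usr/bin/env bash
+set -eu
+
+# Hypothetical helper: newest archive by mtime, the same `ls -t | head -n1` idea as the restore Job.
+latest_archive() {
+  ls -t "$1"/pg_dump_erp_*.sql.gz 2>/dev/null | head -n1
+}
+
+demo_dir="$(mktemp -d)"
+touch -d '3 days ago' "$demo_dir/pg_dump_erp_22.0.4_a.sql.gz"
+touch -d '1 day ago'  "$demo_dir/pg_dump_erp_22.0.4_b.sql.gz"
+touch -d '2 days ago' "$demo_dir/pg_dump_erp_22.0.4_c.sql.gz"
+
+latest_archive "$demo_dir"   # prints the path ending in _b.sql.gz
+```
+
+Anything that changes an archive's mtime (a `touch`, or a copy that does not preserve timestamps) changes which file is selected, which is one reason to pin `backup_file` when a specific point in time matters.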
+
+Ordered procedure:
+
+1. **[AGENT]** Confirm the cluster is healthy and inspect available archives (read-only).
+2. **[HUMAN]** Scale the `erp` Deployment to **0** to stop all writes.
+3. **[HUMAN]** Run the restore Job (latest archive, or a pinned `backup_file`); it `tar -xzf`s the archive and `psql -f`s it into the `erp` DB.
+4. **[AGENT]** Watch the Job to completion and read its logs.
+5. **[HUMAN]** Scale the `erp` Deployment back to **1**.
+6. **[AGENT]** Validate erp is serving and the data is present.
+
+```bash
+# [AGENT] read-only: cluster health + list backup archives on the PVC
+kubectl get deploy,pods -n erp
+kubectl exec -n erp deploy/erp -- ls -t /var/www/documents/admin/backup/pg_dump_erp_*.sql.gz
+```
+
+```bash
+# [HUMAN] prod-mutating: stop writes before restoring
+kubectl scale deploy/erp -n erp --replicas=0
+kubectl rollout status deploy/erp -n erp --watch=false   # expect 0 replicas
+```
+
+```bash
+# [HUMAN] prod-mutating: run the restore Job
+#   default = newest pg_dump_erp_*.sql.gz auto-discovered on the PVC
+ansible-playbook ansible/arcodange/erp/playbooks/restore.yml
+#   or pin an explicit archive:
+ansible-playbook ansible/arcodange/erp/playbooks/restore.yml \
+  -e backup_file=/documents/admin/backup/pg_dump_erp_22.0.4_2606231819.sql.gz
+```
+
+```bash
+# [AGENT] read-only: follow the restore Job and its logs
+kubectl get job/dolibarr-restore -n erp -o wide
+kubectl logs -n erp job/dolibarr-restore
+```
+
+```bash
+# [HUMAN] prod-mutating: bring erp back up
+kubectl scale deploy/erp -n erp --replicas=1
+kubectl rollout status deploy/erp -n erp
+```
+
+```bash
+# [AGENT] read-only: validate erp is serving after restore
+kubectl get pods -n erp
+kubectl exec -n erp deploy/erp -- curl -sf -o /dev/null -w '%{http_code}\n' http://localhost/
+```
+
+> [!WARNING]
+> **Always scale erp to 0 before restoring.** The restore loads SQL straight into the live `erp` database; concurrent writes from a running Dolibarr pod produce a half-restored, inconsistent state. Scaling back to 1 only after the Job succeeds is part of the procedure, not an optional flourish.
+
+## Recovery ordering (cluster rebuild)
+
+> [!CAUTION]
+> **Vault MUST be unsealed before erp is scaled up.** The Dolibarr pod has no DB credentials of its own — it depends entirely on VSO materialising `vso-db-credentials` from `postgres/creds/erp` (`DOLI_DB_USER` / `DOLI_DB_PASSWORD`). If erp is scaled up while Vault is still sealed, VSO cannot reconcile the secret and the pod crash-loops with no database access. During a cluster rebuild the order is fixed:
+>
+> 1. **Recover Longhorn volumes** — bring the document PVC (and the `documents/admin/backup` archives riding on it) back online.
+> 2. **Unseal Vault** — so VSO can issue erp's dynamic DB credentials and static config.
+> 3. **Scale erp to 1** — only now does the pod come up with usable creds.
+> 4. **(Optional) restore data** — if the DB needs rolling back to a `pg_dump`, scale to 0, run the restore Job, scale back to 1 (see [Restore](#restore) above).
+>
+> This sequence is the storage→secrets→apps backbone described in the [storage concept](../lab-ecosystem/storage-and-recovery.md) and executed by the [factory recover playbooks](../factory-provisioning/ansible/06-recover.md); the cluster-wide ordering lives in the CLUSTER_RECOVERY.md runbook.
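+
+The ordering can be enforced with a small guard in front of the scale-up. This is a sketch under stated assumptions, not part of the recover playbooks: the `VAULT_NS` and `VAULT_POD` defaults are placeholders for wherever Vault actually runs, and `vault_unsealed` / `safe_scale_up` are hypothetical helper names. It relies on `vault status` exiting with code 0 when Vault is unsealed and 2 when it is sealed.
+
+```bash
+#!/usr/bin/env bash
+
+# Placeholders: adjust to the real Vault namespace and pod name.
+VAULT_NS="${VAULT_NS:-tools}"
+VAULT_POD="${VAULT_POD:-vault-0}"
+
+# Hypothetical helper: succeeds only when `vault status` reports an unsealed Vault (exit code 0).
+vault_unsealed() {
+  kubectl exec -n "$VAULT_NS" "$VAULT_POD" -- vault status >/dev/null 2>&1
+}
+
+# [HUMAN] prod-mutating: scale erp up only after the Vault check passes.
+safe_scale_up() {
+  if ! vault_unsealed; then
+    echo "Vault is sealed or unreachable; not scaling erp" >&2
+    return 1
+  fi
+  kubectl scale deploy/erp -n erp --replicas=1
+}
+```
+
+Calling `safe_scale_up` while Vault is sealed prints the refusal and leaves the Deployment untouched.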
+
+## The ownership fix
+
+After activating a new Dolibarr module — or whenever the dynamic DB user rotates and creates tables under a fresh role — `public` schema tables can end up owned by a credential that no longer exists, breaking subsequent migrations and dumps. Two SQL helpers reassign ownership back to the stable **`erp_role`**:
+
+| Script | Mechanism | Use |
+|---|---|---|
+| [`backup/erp_role_as_table_owner.sql`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/backup/erp_role_as_table_owner.sql) | Loops every `public` table and `ALTER TABLE ... OWNER TO erp_role` | Force-set ownership table-by-table |
+| [`chart/scripts/update_ownership.sql`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/scripts/update_ownership.sql) | Detects the current schema owner and `REASSIGN OWNED BY <owner> TO erp_role` only when it differs | The idempotent, chart-shipped fix |
+
+The chart wires `update_ownership.sql` into a `pg-fix-table-ownership` CronJob; trigger it on demand after a module activation:
+
+```bash
+# [HUMAN] prod-mutating: reassign public-schema table ownership to erp_role
+kubectl create job \
+  --from=cronjob/pg-fix-table-ownership \
+  pg-fix-table-ownership-manual-trigger-$(date +%Y%m%d%H%M%S) \
+  -n kube-system
+```
+
+Run this **before** a backup if you suspect ownership drift, so the dump records the correct owner. More on day-to-day fix-ups and audits in [Operations](operations.md).
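+
+A read-only query against `pg_tables` shows whether drift exists before deciding to run the fix. A minimal sketch: the `owner_summary` and `ownership_clean` helper names are hypothetical, and it assumes `psql` can already reach the `erp` database through the usual `PG*` environment variables.
+
+```bash
+#!/usr/bin/env bash
+
+# [AGENT] read-only: count public-schema tables per owner.
+# Hypothetical helper; connection settings are assumed to come from PGHOST, PGUSER, PGPASSWORD.
+owner_summary() {
+  psql -d erp -At -F ' ' -c \
+    "SELECT tableowner, count(*) FROM pg_tables WHERE schemaname = 'public' GROUP BY tableowner ORDER BY tableowner;"
+}
+
+# Hypothetical helper: succeeds only when every public table is owned by erp_role.
+ownership_clean() {
+  [ "$(owner_summary | awk '$1 != "erp_role"' | wc -l)" -eq 0 ]
+}
+```
+
+Any output line from `owner_summary` whose first field is not `erp_role` points at tables that the ownership fix would reassign.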
+
+## Flow
+
+```mermaid
+%%{init: {'theme': 'base'}}%%
+flowchart LR
+    classDef sched fill:#2563eb,stroke:#1e40af,color:#fff
+    classDef proc fill:#059669,stroke:#047857,color:#fff
+    classDef store fill:#7c3aed,stroke:#6d28d9,color:#fff
+    classDef db fill:#b45309,stroke:#92400e,color:#fff
+
+    VSO["VSO secret<br>vso-db-credentials<br>(postgres/creds/erp)"]:::store
+    CRON["CronJob dolibarr-backup<br>schedule 0 4 * * *"]:::sched
+    DUMPJOB["pg_dump Job<br>postgres:16.3<br>--inserts --no-tablespaces"]:::proc
+    GZIP["gzip stream"]:::proc
+    PVC["Longhorn RWX PVC<br> /documents/admin/backup<br>pg_dump_erp_*.sql.gz"]:::store
+    RESTOREJOB["Restore Job dolibarr-restore<br>postgres:16.3"]:::proc
+    PSQL["psql -f dump.sql"]:::proc
+    DB["erp Postgres DB<br>via pgbouncer.tools"]:::db
+
+    CRON -- "spawns" --> DUMPJOB
+    DUMPJOB -- "pg_dump" --> GZIP
+    GZIP -- "writes archive" --> PVC
+    PVC -- "ls -t latest .sql.gz" --> RESTOREJOB
+    RESTOREJOB -- "tar -xzf then" --> PSQL
+    PSQL -- "loads into" --> DB
+    DUMPJOB -- "dumps from" --> DB
+    VSO -. "DB creds" .-> DUMPJOB
+    VSO -. "DB creds" .-> RESTOREJOB
+```
+
+1. The **CronJob `dolibarr-backup`** fires at `0 4 * * *` and **spawns** a `pg_dump` Job (`postgres:16.3`).
+2. The Job **dumps** the live `erp` database (logical, `--inserts --no-tablespaces`) — reading credentials from the **VSO `vso-db-credentials`** secret.
+3. The dump streams through **gzip** and the resulting `pg_dump_erp_<version>_<timestamp>.sql.gz` is **written** to `/documents/admin/backup` on the **Longhorn RWX PVC**; archives older than 15 days are pruned.
+4. On restore, the **`dolibarr-restore` Job** picks the **newest** `.sql.gz` (`ls -t | head -n1`, or a pinned `backup_file`) from the PVC — also using the **VSO** credentials.
+5. The restore Job **`tar -xzf`s** the archive and **`psql -f`s** it back **into** the `erp` database (with erp scaled to 0 first).
+
+## Gotchas
+
+> [!WARNING]
+> - **15-day retention only.** The CronJob deletes any `pg_dump_erp_*.sql.gz` older than 15 days. If you need long-term or compliance copies, pull archives **off-cluster** before they age out — nothing here keeps a month-old dump.
+> - **Keep client and server versions matched.** `pg_dump` cannot dump from a server newer than its own major version, so every Job pins `postgres:16.3` to match the 16.x server; the in-app Dolibarr export, which uses an older bundled `pg_dump`, aborts with `server version mismatch`.
+> - **Scale to 0 before restore.** Restoring into a running erp produces an inconsistent database; scale the Deployment to 0, restore, then back to 1.
+> - **Vault unseal precedes scale-up.** erp's DB creds come from VSO; a sealed Vault means a crash-looping pod. Follow the [recovery ordering](#recovery-ordering-cluster-rebuild) on any rebuild.
+> - **The admin/Postgres password lives in OpenTofu state.** The per-app database and role are declared in IaC, so the authoritative credential material is held in the **TF state** — treat that state as a secret and recover it alongside Vault. See [factory postgres-iac](../factory-provisioning/opentofu/postgres-iac.md).
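+
+For the off-cluster copy mentioned in the first gotcha, one option is to stream an archive through `kubectl exec` and verify the gzip container locally. A sketch only: `pull_archive` is a hypothetical helper, it assumes the erp pod is running and that `cat` exists in the image, and the destination directory is whatever the caller passes in.
+
+```bash
+#!/usr/bin/env bash
+
+# [AGENT] read-only on the cluster: stream one archive to a local directory and check it.
+# Usage: pull_archive <archive-file-name> <local-dir>
+pull_archive() {
+  local name="$1" dest="$2"
+  mkdir -p "$dest"
+  kubectl exec -n erp deploy/erp -- \
+    cat "/var/www/documents/admin/backup/$name" > "$dest/$name" || return 1
+  # gzip -t validates the compressed stream without unpacking it.
+  gzip -t "$dest/$name"
+}
+```
+
+`gzip -t` only proves the file is an intact gzip stream; it says nothing about whether the SQL inside restores cleanly.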
+
+## Cross-references
+
+- [Deployment](deployment.md) — the chart, the document PVC, and the Vault CRDs (dynamic creds + static config) that this subsystem depends on.
+- [Operations](operations.md) — day-to-day operational tasks including the table-ownership fix-ups and liveness checks.
+- [storage concept](../lab-ecosystem/storage-and-recovery.md) — how the Longhorn document PVC (and its riding backup archives) is snapshotted and recovered.
+- [factory recover playbooks](../factory-provisioning/ansible/06-recover.md) — the Ansible recovery steps that must run before erp is scaled back up.
+- [tools secrets-and-vso](../tools/secrets-and-vso.md) — the VSO runtime that materialises `vso-db-credentials`, feeding both the backup and restore Jobs.
+- [factory postgres-iac](../factory-provisioning/opentofu/postgres-iac.md) — the per-app `erp` PostgreSQL database + role, and the TF state that holds its admin password.