feat(ops): dedicated Dolibarr backup (DB + documents → offsite GCS, 10y retention)

The accounting data + issued documents are legally retained 10 years and warrant a backup dedicated to Dolibarr.

An audit found the generic Longhorn external backup NEVER covered the erp volume (its Longhorn volume sits in the orphaned `default` recurring-job group; the only job has groups=[] → serves nothing; lastBackupAt=never). So /var/www/documents (invoice PDFs, supplier pieces, contracts, ECM) had zero offsite copy, only in-cluster replicas.

ops/backup/dolibarr-backup.sh (orchestrator) + ops/backup/backup-job.sh (in-container logic, env-driven, single source of truth):

- pg_dump -Fc of the DB + tar of the documents PVC (RWX, read-only mount) -> s3://arcodange-backup/erp/<env>/{db,docs}/<ts>, then tiered prune (daily 30d / monthly 12m / yearly 10y).
- prod is READ-only (dump+tar read; writes go only to the backup bucket); the DB is read with the env's own dynamic creds; the GCS HMAC secret is copied transiently (base64, deleted on exit) and never printed; the whole script ships base64.
- fixes the aws-cli v2.23+ default-checksum incompatibility with GCS/S3-compat (SignatureDoesNotMatch) via AWS_*_CHECKSUM_*=when_required.

Proven live: sandbox end-to-end (dump+tar+upload+prune, verified in GCS, cleaned up) and retention logic unit-tested (1100 daily -> 46 kept). The FIRST real prod backup was taken (erp/prod/db 1.2 MB + erp/prod/docs 12.5 MB), closing the gap now.

Automation (recurring CronJob in the chart + a dedicated erp Vault policy for its own S3 creds) is the documented next step; the orchestrator works today on demand.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
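
Note on the tiered prune: the selection step can be sketched as below. This is a minimal illustration, not the code in backup-job.sh. It assumes backups are identified by a bare `YYYYMMDD` date (the real `<ts>` format is not shown here), that the tiers are measured in days (30 / 365 / 3650), and that the oldest backup of each month or year is the one kept. The function name `retention_keep` is hypothetical.

```sh
#!/bin/sh
# Hypothetical sketch of a tiered retention filter (assumed policy, see note above).
# stdin : backup dates, one YYYYMMDD per line, in any order
# $1    : reference date ("today") as YYYYMMDD
# stdout: the dates to KEEP, ascending; everything else would be pruned
retention_keep() {
  today="$1"
  sort -u | awk -v today="$today" '
    # Day count for a Gregorian date, used only to compute differences.
    function daynum(d,    y, m, day) {
      y = substr(d, 1, 4) + 0; m = substr(d, 5, 2) + 0; day = substr(d, 7, 2) + 0
      if (m <= 2) { y--; m += 12 }          # treat Jan/Feb as months 13/14 of the previous year
      return 365 * y + int(y / 4) - int(y / 100) + int(y / 400) \
             + int((153 * (m - 3) + 2) / 5) + day
    }
    {
      age   = daynum(today) - daynum($1)
      month = substr($1, 1, 6)
      year  = substr($1, 1, 4)

      # Daily tier: keep every backup younger than 30 days (future dates too).
      if (age < 30) { print; next }

      # Monthly tier: first (oldest) backup seen per calendar month, up to 365 days.
      if (age < 365) {
        if (!(month in seen_month)) { seen_month[month] = 1; print }
        next
      }

      # Yearly tier: first (oldest) backup seen per calendar year, up to 3650 days.
      if (age < 3650 && !(year in seen_year)) { seen_year[year] = 1; print }
    }'
}
```

A caller would feed it the dates extracted from the object keys and delete every object whose date is not in the output, for example `printf '%s\n' 20260510 20260520 20260629 | retention_keep 20260630`. Keeping the selection as a pure filter, separate from the delete step, makes it possible to unit-test the policy without touching the bucket.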
2026-06-30 15:32:36 +02:00
parent d27b5bfd45
commit 8ec8fde67e
3 changed files with 303 additions and 0 deletions
--- a/ops/backup/README.md
+++ b/ops/backup/README.md
@@ -0,0 +1,76 @@
+# Dolibarr dedicated backup
+
+A backup strategy **dedicated to Dolibarr**, because the accounting data and the
+issued documents are critical and legally retained **10 years** — they warrant more
+than the generic platform backup.
+
+## Why this exists (the gap it closes)
+
+On 2026-06-30 an audit of the Longhorn external backup found that **the erp documents
+volume had never been backed up offsite** (`lastBackupAt = never`): its Longhorn
+volume is enrolled only in the `default` recurring-job group, but the single backup
+job (`thrice-a-month-backup`) has `groups=[]`, so it serves *no* group — the erp
+volume (and erp-sandbox) fell through the cracks. Only in-cluster Longhorn replicas
+protected `/var/www/documents` (issued invoice PDFs, supplier pieces, contracts, ECM),
+and replicas do not survive a cluster loss / corruption / power-cut.
+
+This tool backs up **both halves** of Dolibarr state to the existing object store
+(`s3://arcodange-backup`, GCS via the S3-compatible API), under `erp/<env>/`:
+
+| half | how | key |
+|---|---|---|
+| Postgres DB | `pg_dump -Fc` (restorable) | `erp/<env>/db/<ts>.dump` |
+| documents PVC | `tar -czf` of `/var/www/documents` (RWX, mounted read-only) | `erp/<env>/docs/<ts>.tar.gz` |
+
+then prunes to a **tiered retention**: daily for 30 days, monthly for 12 months,
+yearly for ~10 years.
+
+## Safety (mirrors `ops/sandbox/sandbox-lifecycle.sh`)
+
+- **prod is read-only**: `pg_dump` and `tar` only read; the only writes go to the
+  backup bucket, never to prod. The DB is read with the env's *own* dynamic creds
+  (`vso-db-credentials`); prod and sandbox never cross.
+- **S3 creds are never exposed**: the GCS HMAC secret is copied into a *transient*
+  secret in the app namespace (values stay base64), deleted on exit. The whole
+  in-container script is shipped base64 — no secret is ever printed.
+
+## Usage
+
+```sh
+# one-shot backup + prune (run from anywhere; needs kubectl on the lab cluster)
+ops/backup/dolibarr-backup.sh backup --env prod
+ops/backup/dolibarr-backup.sh backup --env sandbox
+
+# what's in the store
+ops/backup/dolibarr-backup.sh list --env prod
+```
+
+`backup-job.sh` is the in-container logic (env-driven: `BUCKET`, `PREFIX`, `DB`,
+`PGHOST` + the mounted DB/S3 creds). It is the single source of truth, also
+intended for the scheduled CronJob (see "Automation" below).
+
+**Status:** the first real prod backup was taken 2026-06-30
+(`erp/prod/db/…` 1.2 MB, `erp/prod/docs/…` 12.5 MB). Proven end-to-end live on the
+sandbox (dump + tar + GCS upload + retention prune).
+
+## Restore (manual, for now)
+
+```sh
+# DB:    aws s3 cp s3://arcodange-backup/erp/<env>/db/<ts>.dump - | pg_restore -h <host> -U <user> -d <db> --clean
+# docs:  aws s3 cp s3://arcodange-backup/erp/<env>/docs/<ts>.tar.gz - | tar -C /var/www/documents -xzf -
+```
+The sandbox iso-prod refresh (`ops/sandbox/sandbox-lifecycle.sh`) is the natural
+restore-drill bench. A `restore` subcommand is planned next.
+
+## Automation (next step — gated on creds)
+
+The recurring form is a k8s **CronJob** (ArgoCD-managed, in the chart) running the
+same `backup-job.sh` daily. It needs its **own** S3 creds rather than borrowing the
+Longhorn secret cross-namespace: a `VaultStaticSecret` in the erp namespace reading
+the GCS backup creds, which requires the `erp` Vault role to be granted read on that
+path (a `tools` change). Until that lands, run the orchestrator above on demand /
+from a host cron — it works today by borrowing the Longhorn creds transiently.
+
+> The generic Longhorn gap (the orphaned `default` group) should be fixed too, as a
+> platform concern — but this dedicated, offsite, 10-year-retention backup is the
+> one that matches Dolibarr's legal criticality.