feat(backup): skip-if-unchanged + scheduled CronJob in the chart

Builds on the dedicated backup (erp#31).

Skip-if-unchanged: each half (DB / documents) carries a content fingerprint at erp/<env>/.fp-{db,docs} and is dumped+uploaded only if it differs from the last run, so a quiet ERP day re-uploads nothing. Fingerprint = durable BUSINESS content only:
- DB = count+max(tms) over tms tables EXCEPT volatile churn (llx_const, llx_user, session/cron);
- docs EXCLUDE */temp/* (Dolibarr stats cache), from both the fingerprint and the tar.
Proven live: 1st run uploads both, immediate 2nd run skips both (uploaded=0).

Automation: the in-container logic moves to chart/files/backup-job.sh (single source of truth, read by the orchestrator AND the chart). New chart/templates/backup-cronjob.yaml renders a daily CronJob + ConfigMap + VaultStaticSecret, gated by backup.enabled (default false). Helm-verified: off by default (0 CronJobs), on renders correctly, env-aware (PREFIX erp/prod vs erp/sandbox), script embedded.

Activation (documented):
1. store GCS HMAC creds at kvv2/<backup.vaultS3Path> (default erp/backup);
2. grant the erp `auth` Vault role read on it (tools change);
3. set backup.enabled=true.
Until then the orchestrator runs on demand.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
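
For readers skimming the message, the skip-if-unchanged decision can be sketched roughly as below. This is an illustration only, not the contents of chart/files/backup-job.sh: the function names `upload_if_changed` and `docs_fingerprint` are hypothetical, `STATE_DIR` is a local directory standing in for the `erp/<env>/` object-store prefix, and `cksum` is an assumed choice of hash.

```sh
#!/bin/sh
# Hypothetical sketch: a local directory stands in for the erp/<env>/ prefix.
STATE_DIR=${STATE_DIR:-$(mktemp -d)}

# docs_fingerprint DIR
# Hash every file under DIR except anything below a temp/ directory,
# so regenerated cache files never change the fingerprint.
docs_fingerprint() {
  ( cd "$1" && find . -type f ! -path '*/temp/*' -exec cksum {} + \
      | LC_ALL=C sort | cksum )
}

# upload_if_changed NAME FINGERPRINT
# Prints "skipped" when FINGERPRINT matches the stored .fp-NAME,
# otherwise records the new fingerprint and prints "uploaded".
upload_if_changed() {
  name=$1
  new_fp=$2
  fp_file="$STATE_DIR/.fp-$name"
  old_fp=""
  if [ -f "$fp_file" ]; then
    old_fp=$(cat "$fp_file")
  fi
  if [ "$new_fp" = "$old_fp" ]; then
    echo "skipped"
    return 0
  fi
  # A real job would dump + upload here, and store the fingerprint
  # only after the upload succeeded.
  printf '%s\n' "$new_fp" > "$fp_file"
  echo "uploaded"
}
```

Calling `upload_if_changed docs "$(docs_fingerprint /path/to/documents)"` twice in a row without touching the tree should report an upload the first time and a skip the second, which mirrors the "1st run uploads both, 2nd run skips both" behaviour described above.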
2026-06-30 15:53:13 +02:00
parent e69717c2d9
commit a3f0586c77
6 changed files with 202 additions and 67 deletions
--- a/ops/backup/README.md
+++ b/ops/backup/README.md
@@ -25,6 +25,13 @@ This tool backs up **both halves** of Dolibarr state to the existing object stor
 then prunes to a **tiered retention**: daily for 30 days, monthly for 12 months,
 yearly for ~10 years.

+**Skip-if-unchanged:** each half carries a content fingerprint at `erp/<env>/.fp-{db,docs}`
+and is dumped+uploaded only if it **differs** from the last run — so a quiet ERP day
+re-uploads nothing. The fingerprint is over **durable business content only**: the DB
+side is `count + max(tms)` over every `tms` table *except* volatile ones (`llx_const`,
+`llx_user`, sessions/cron), and the documents side excludes `*/temp/*` (Dolibarr's
+constantly-regenerated stats cache) — from both the fingerprint *and* the tar.
+
 ## Safety (mirrors `ops/sandbox/sandbox-lifecycle.sh`)

 - **prod is read-only**: `pg_dump` and `tar` only read; the only writes go to the
@@ -45,9 +52,9 @@ ops/backup/dolibarr-backup.sh backup --env sandbox
 ops/backup/dolibarr-backup.sh list --env prod
 ```

-`backup-job.sh` is the in-container logic (env-driven: `BUCKET PREFIX DB PGHOST` +
-the mounted DB/S3 creds) — the single source of truth, also intended for the
-scheduled CronJob (see "Automation" below).
+`chart/files/backup-job.sh` is the in-container logic (env-driven: `BUCKET PREFIX
+DB PGHOST` + the mounted DB/S3 creds) — the single source of truth shared by this
+orchestrator and the scheduled CronJob (see "Automation" below).

 **Status:** the first real prod backup was taken 2026-06-30
 (`erp/prod/db/…` 1.2 MB, `erp/prod/docs/…` 12.5 MB). Proven end-to-end live on the
@@ -62,14 +69,22 @@ sandbox (dump + tar + GCS upload + retention prune).
 The sandbox iso-prod refresh (`ops/sandbox/sandbox-lifecycle.sh`) is the natural
 restore-drill bench. A `restore` subcommand is wired next.

-## Automation (next step — gated on creds)
+## Automation — the CronJob (gated on creds)

-The recurring form is a k8s **CronJob** (ArgoCD-managed, in the chart) running the
-same `backup-job.sh` daily. It needs its **own** S3 creds rather than borrowing the
-Longhorn secret cross-namespace: a `VaultStaticSecret` in the erp namespace reading
-the GCS backup creds, which requires the `erp` Vault role to be granted read on that
-path (a `tools` change). Until that lands, run the orchestrator above on demand /
-from a host cron — it works today by borrowing the Longhorn creds transiently.
+The recurring form ships in the chart (`chart/templates/backup-cronjob.yaml`,
+`backup.enabled=false` by default): a daily **CronJob** (ConfigMap-mounted
+`backup-job.sh`) with its **own** S3 creds via a `VaultStaticSecret` — no
+cross-namespace borrowing of the Longhorn secret. To activate:
+
+1. store the GCS HMAC creds (`AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` /
+   `AWS_ENDPOINTS`, same shape as `longhorn-gcs-backup-credentials`) at
+   `kvv2/<backup.vaultS3Path>` (default `erp/backup`);
+2. grant the erp `auth` Vault role read on that path (a `tools` change) if its
+   policy doesn't already cover it;
+3. set `backup.enabled: true` (+ tune `schedule`).
+
+Until then, run the orchestrator above on demand / from a host cron — it works
+today by borrowing the Longhorn creds transiently.

 > The generic Longhorn gap (the orphaned `default` group) should be fixed too, as a
 > platform concern — but this dedicated, offsite, 10-year-retention backup is the