
Gabriel Radureau a3f0586c77 feat(backup): skip-if-unchanged + scheduled CronJob in the chart

Builds on the dedicated backup (erp#31).

Skip-if-unchanged: each half (DB / documents) carries a content fingerprint at
erp/<env>/.fp-{db,docs} and is dumped+uploaded only if it differs from the last
run — a quiet ERP day re-uploads nothing. Fingerprint = durable BUSINESS content
only: DB = count+max(tms) over tms tables EXCEPT volatile churn (llx_const,
llx_user, session/cron); docs EXCLUDE */temp/* (Dolibarr stats cache) — from both
the fingerprint and the tar. Proven live: 1st run uploads both, immediate 2nd run
skips both (uploaded=0).

Automation: the in-container logic moves to chart/files/backup-job.sh (single
source of truth, read by the orchestrator AND the chart). New
chart/templates/backup-cronjob.yaml renders a daily CronJob + ConfigMap +
VaultStaticSecret, gated by backup.enabled (default false). Helm-verified: off by
default (0 CronJobs), on renders correctly, env-aware (PREFIX erp/prod vs
erp/sandbox), script embedded.

Activation (documented): store GCS HMAC creds at kvv2/<backup.vaultS3Path>
(default erp/backup), grant the erp `auth` Vault role read on it (tools change),
set backup.enabled=true. Until then the orchestrator runs on demand.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-06-30 15:53:13 +02:00

4.4 KiB


Dolibarr dedicated backup

A backup strategy dedicated to Dolibarr: the accounting data and the issued documents are critical and must legally be retained for 10 years, so they warrant more than the generic platform backup.

Why this exists (the gap it closes)

On 2026-06-30, an audit of the Longhorn external backup found that the erp documents volume had never been backed up offsite (lastBackupAt = never). Its Longhorn volume is enrolled only in the default recurring-job group, but the single backup job (thrice-a-month-backup) has groups=[], so it serves no group, and the erp volume (along with erp-sandbox) fell through the cracks. Only in-cluster Longhorn replicas protected /var/www/documents (issued invoice PDFs, supplier documents, contracts, ECM), and replicas alone do not survive cluster loss, corruption, or a power cut.

This tool backs up both halves of Dolibarr state to the existing object store (s3://arcodange-backup, GCS via the S3-compatible API), under erp/<env>/:

| half | how | key |
|---|---|---|
| Postgres DB | `pg_dump -Fc` (restorable) | `erp/<env>/db/<ts>.dump` |
| documents PVC | `tar -czf` of `/var/www/documents` (RWX, mounted read-only) | `erp/<env>/docs/<ts>.tar.gz` |

then prunes to a tiered retention: daily for 30 days, monthly for 12 months, yearly for ~10 years.
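To make the tiers concrete, here is a minimal sketch of a retention filter. It is not the logic in `backup-job.sh`: the tier boundaries (30 days, 12 months, about 10 years) come from the sentence above, while the choice to keep the oldest backup of each month or year, the `YYYY-MM-DD` input format, and the function name `select_keepers` are assumptions made for illustration. It needs bash 4+ and GNU `date`.

```shell
#!/usr/bin/env bash
# Sketch only. Assumptions: dates arrive as YYYY-MM-DD, oldest first, and the
# monthly/yearly tiers keep the OLDEST backup seen in each month/year.
set -euo pipefail

# Reads one backup date per line on stdin and prints the dates to KEEP
# as of "$1" (today, YYYY-MM-DD). Anything not printed would be pruned.
select_keepers() {
  local today="$1" now_s ts ts_s age_days month year
  local -A seen_month=() seen_year=()
  now_s=$(date -u -d "$today" +%s)          # GNU date
  while read -r ts; do
    [ -n "$ts" ] || continue
    ts_s=$(date -u -d "$ts" +%s)
    age_days=$(( (now_s - ts_s) / 86400 ))
    month=${ts:0:7}
    year=${ts:0:4}
    if (( age_days <= 30 )); then
      echo "$ts"                            # daily tier: keep everything
    elif (( age_days <= 365 )); then
      if [ -z "${seen_month[$month]:-}" ]; then
        seen_month[$month]=1                # monthly tier: one per month
        echo "$ts"
      fi
    elif (( age_days <= 3650 )); then
      if [ -z "${seen_year[$year]:-}" ]; then
        seen_year[$year]=1                  # yearly tier: one per year
        echo "$ts"
      fi
    fi                                      # older than ~10 years: dropped
  done
}
```

Fed with the dates parsed from a listing of `erp/<env>/db/`, the output is the keep-list; every key not on it is a prune candidate.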

Skip-if-unchanged: each half carries a content fingerprint at erp/<env>/.fp-{db,docs} and is dumped+uploaded only if it differs from the last run — so a quiet ERP day re-uploads nothing. The fingerprint is over durable business content only: the DB side is count + max(tms) over every tms table except volatile ones (llx_const, llx_user, sessions/cron), and the documents side excludes */temp/* (Dolibarr's constantly-regenerated stats cache) — from both the fingerprint and the tar.
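The skip decision can be sketched as follows. This is an illustration under stated assumptions, not the script's actual code: the documents fingerprint here hashes path, size and mtime of every file outside `*/temp/*` (the real script may hash something else), the stored fingerprint is modelled as a local file rather than the object at `erp/<env>/.fp-docs`, and the table names passed to `db_fingerprint_sql` as well as the `*session*` / `*cron*` patterns are guesses at what "sessions/cron" covers. GNU `find` is assumed.

```shell
#!/usr/bin/env bash
# Sketch only; function names and the exact fingerprint recipe are hypothetical.
set -euo pipefail

# Fingerprint of a documents tree, ignoring anything under a temp/ directory.
docs_fingerprint() {
  local root="$1"
  ( cd "$root" &&
    find . -type f -not -path '*/temp/*' -printf '%p\t%s\t%T@\n' \
      | LC_ALL=C sort | sha256sum | cut -d' ' -f1 )
}

# Succeeds (exit 0) when an upload is needed: no stored fingerprint yet,
# or the current one differs from it.
fingerprint_changed() {
  local current="$1" stored_file="$2"
  [ ! -f "$stored_file" ] || [ "$current" != "$(cat "$stored_file")" ]
}

# Builds a "count + max(tms)" query over the table names read from stdin,
# skipping the volatile ones. The query output would then be hashed.
db_fingerprint_sql() {
  local t sep=""
  while read -r t; do
    case "$t" in
      ''|llx_const|llx_user|*session*|*cron*) continue ;;
    esac
    printf "%sSELECT '%s' AS tbl, count(*) AS n, max(tms) AS last_tms FROM %s" \
      "$sep" "$t" "$t"
    sep=$'\nUNION ALL\n'
  done
  printf '\nORDER BY tbl;\n'
}
```

A run would compute the fingerprint, call `fingerprint_changed`, and write the new fingerprint back only after a successful upload, so that a failed upload is retried on the next run.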

Safety (mirrors `ops/sandbox/sandbox-lifecycle.sh`)

prod is read-only: pg_dump and tar only read; the only writes go to the backup bucket, never to prod. The DB is read with the env's own dynamic creds (vso-db-credentials); prod and sandbox never cross.
S3 creds are never exposed: the GCS HMAC secret is copied into a transient secret in the app namespace (values stay base64), deleted on exit. The whole in-container script is shipped base64 — no secret is ever printed.
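The "deleted on exit" guarantee is typically obtained with an `EXIT` trap. The sketch below shows only that cleanup pattern; it does not reproduce how the orchestrator creates the transient secret, and the wrapper name `with_transient_secret` and its arguments are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: the secret is assumed to have been created just before this call.
set -euo pipefail

# Runs "$@" in a subshell and deletes the transient secret when that subshell
# exits, whether the command succeeded or failed. The command's exit status
# is passed through to the caller.
with_transient_secret() {
  local ns="$1" name="$2"
  shift 2
  (
    trap 'kubectl -n "$ns" delete secret "$name" --ignore-not-found >/dev/null' EXIT
    "$@"
  )
}
```

A trap cannot run if the process is killed with SIGKILL, so a leftover secret remains possible in that case and is worth checking for after an aborted run.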

Usage

# one-shot backup + prune (run from anywhere; needs kubectl on the lab cluster)
ops/backup/dolibarr-backup.sh backup --env prod
ops/backup/dolibarr-backup.sh backup --env sandbox

# what's in the store
ops/backup/dolibarr-backup.sh list --env prod

chart/files/backup-job.sh is the in-container logic (env-driven: BUCKET PREFIX DB PGHOST + the mounted DB/S3 creds) — the single source of truth shared by this orchestrator and the scheduled CronJob (see "Automation" below).
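Because the script is driven entirely by environment variables, a guard at the top that fails fast on a missing one is cheap insurance. The following is a sketch of such a guard, not an excerpt from `backup-job.sh`; the helper name `require_env` is hypothetical, and only the four variable names listed above are taken from the README.

```shell
#!/usr/bin/env bash
# Sketch only: reports every missing variable, then returns non-zero.
set -euo pipefail

require_env() {
  local name missing=0
  for name in "$@"; do
    # ${!name:-} is bash indirect expansion: the value of the variable
    # whose name is stored in $name, or empty if it is unset.
    if [ -z "${!name:-}" ]; then
      echo "missing required env var: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Typical call at the top of an env-driven script:
#   require_env BUCKET PREFIX DB PGHOST
```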

Status: the first real prod backup was taken 2026-06-30 (erp/prod/db/… 1.2 MB, erp/prod/docs/… 12.5 MB). Proven end-to-end live on the sandbox (dump + tar + GCS upload + retention prune).

Restore (manual, for now)

# DB:    aws s3 cp s3://arcodange-backup/erp/<env>/db/<ts>.dump - | pg_restore -h <host> -U <user> -d <db> --clean
# docs:  aws s3 cp s3://arcodange-backup/erp/<env>/docs/<ts>.tar.gz - | tar -C /var/www/documents -xzf -

The sandbox iso-prod refresh (ops/sandbox/sandbox-lifecycle.sh) is the natural restore-drill bench. A restore subcommand is the next piece to be wired in.
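In the meantime, the two manual commands above can be wrapped in small helpers for a restore drill. This is a sketch under assumptions: the function names are hypothetical, the endpoint defaults to the public GCS S3-compatible endpoint (`https://storage.googleapis.com`), and the archive is assumed to contain paths relative to the documents root, as the manual `tar -C /var/www/documents` command implies.

```shell
#!/usr/bin/env bash
# Sketch only: wraps the README's manual restore commands; expects the aws CLI
# to already have the HMAC credentials in its environment.
set -euo pipefail

S3_ENDPOINT="${S3_ENDPOINT:-https://storage.googleapis.com}"
BUCKET="${BUCKET:-arcodange-backup}"

# restore_docs <env> <ts> <target-dir>
restore_docs() {
  local env="$1" ts="$2" target="$3"
  aws --endpoint-url "$S3_ENDPOINT" s3 cp \
      "s3://$BUCKET/erp/$env/docs/$ts.tar.gz" - \
    | tar -C "$target" -xzf -
}

# restore_db <env> <ts> <pg_restore connection args...>
# e.g. restore_db sandbox <ts> -h <host> -U <user> -d <db>
restore_db() {
  local env="$1" ts="$2"
  shift 2
  aws --endpoint-url "$S3_ENDPOINT" s3 cp \
      "s3://$BUCKET/erp/$env/db/$ts.dump" - \
    | pg_restore --clean "$@"
}
```

Pointing `restore_docs` at a scratch directory first, rather than at the live documents volume, lets the archive be inspected before anything is overwritten.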

Automation — the CronJob (gated on creds)

The recurring form ships in the chart (chart/templates/backup-cronjob.yaml, backup.enabled=false by default): a daily CronJob (ConfigMap-mounted backup-job.sh) with its own S3 creds via a VaultStaticSecret — no cross-namespace borrowing of the Longhorn secret. To activate:

store the GCS HMAC creds (AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY / AWS_ENDPOINTS, same shape as longhorn-gcs-backup-credentials) at kvv2/<backup.vaultS3Path> (default erp/backup);
grant the erp auth Vault role read on that path (a tools change) if its policy doesn't already cover it;
set backup.enabled: true (+ tune schedule).

Until then, run the orchestrator above on demand / from a host cron — it works today by borrowing the Longhorn creds transiently.
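The activation steps could look roughly like this from an operator's shell. Treat it as a sketch: the release name, namespace and chart path are placeholders, the KV mount is assumed to be literally named `kvv2`, and the credentials are read from a local JSON file so they never appear on the command line. Granting the Vault role read access is the tools change and is not shown.

```shell
#!/usr/bin/env bash
# Sketch only: RELEASE, NAMESPACE and CHART_DIR are hypothetical defaults.
set -euo pipefail

RELEASE="${RELEASE:-erp}"
NAMESPACE="${NAMESPACE:-erp}"
CHART_DIR="${CHART_DIR:-./chart}"

# Step 1: store the HMAC creds. creds.json holds AWS_ACCESS_KEY_ID,
# AWS_SECRET_ACCESS_KEY and AWS_ENDPOINTS as a flat JSON object.
store_backup_creds() {
  local creds_json="$1" vault_path="${2:-erp/backup}"
  vault kv put -mount=kvv2 "$vault_path" @"$creds_json"
}

# Sanity check: how many CronJobs does the chart render with these flags?
# "|| true" absorbs grep's exit code 1 on zero matches (it also masks a
# helm failure, so check helm separately if the count looks wrong).
count_cronjobs() {
  helm template "$RELEASE" "$CHART_DIR" "$@" | grep -c '^kind: CronJob' || true
}

# Step 3: flip the gate on the live release, keeping its other values.
enable_backup() {
  helm upgrade "$RELEASE" "$CHART_DIR" --namespace "$NAMESPACE" \
    --reuse-values --set backup.enabled=true
}
```

Running `count_cronjobs` with no flags and then with `--set backup.enabled=true` is a quick way to confirm the gate behaves as described before touching the release.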

The generic Longhorn gap (the orphaned default group) should be fixed too, as a platform concern — but this dedicated, offsite, 10-year-retention backup is the one that matches Dolibarr's legal criticality.
