Gabriel Radureau 8ec8fde67e feat(ops): dedicated Dolibarr backup (DB + documents → offsite GCS, 10y retention)
The accounting data + issued documents are legally retained 10 years and warrant a
backup dedicated to Dolibarr. An audit found the generic Longhorn external backup
NEVER covered the erp volume (its Longhorn volume sits in the orphaned `default`
recurring-job group; the only job has groups=[] → serves nothing; lastBackupAt=never).
So /var/www/documents (invoice PDFs, supplier pieces, contracts, ECM) had zero
offsite copy — only in-cluster replicas.

ops/backup/dolibarr-backup.sh (orchestrator) + ops/backup/backup-job.sh (in-container
logic, env-driven, single source of truth):
- pg_dump -Fc of the DB + tar of the documents PVC (RWX, read-only mount) ->
  s3://arcodange-backup/erp/<env>/{db,docs}/<ts>, then tiered prune (daily 30d /
  monthly 12m / yearly 10y).
- prod is READ-only (dump+tar read; writes go only to the backup bucket); the DB is
  read with the env's own dynamic creds; the GCS HMAC secret is copied transiently
  (base64, deleted on exit) and never printed; the whole script ships base64.
- fixes the aws-cli v2.23+ default-checksum incompatibility with GCS/S3-compat
  (SignatureDoesNotMatch) via AWS_*_CHECKSUM_*=when_required.

Proven live: sandbox end-to-end (dump+tar+upload+prune, verified in GCS, cleaned up)
and retention logic unit-tested (1100 daily -> 46 kept). The FIRST real prod backup
was taken (erp/prod/db 1.2 MB + erp/prod/docs 12.5 MB) — closing the gap now.

Automation (recurring CronJob in the chart + a dedicated erp Vault policy for its
own S3 creds) is the documented next step; the orchestrator works today on demand.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-30 15:32:36 +02:00

Dolibarr dedicated backup

This is a backup strategy dedicated to Dolibarr. The accounting data and the issued documents are critical and must legally be retained for 10 years, so they warrant more than the generic platform backup.

Why this exists (the gap it closes)

On 2026-06-30 an audit of the Longhorn external backup found that the erp documents volume had never been backed up offsite (lastBackupAt = never). Its Longhorn volume is enrolled only in the default recurring-job group, but the single backup job (thrice-a-month-backup) has groups=[], so it serves no group, and the erp volume (and erp-sandbox) fell through the cracks. Only in-cluster Longhorn replicas protected /var/www/documents (issued invoice PDFs, supplier documents, contracts, ECM), and replicas do not survive a cluster loss, corruption, or a power cut.

This tool backs up both halves of Dolibarr state to the existing object store (s3://arcodange-backup, GCS via the S3-compatible API), under erp/<env>/:

| half | how | key |
|---|---|---|
| Postgres DB | pg_dump -Fc (restorable) | erp/<env>/db/<ts>.dump |
| documents PVC | tar -czf of /var/www/documents (RWX, mounted read-only) | erp/<env>/docs/<ts>.tar.gz |

then prunes to a tiered retention: daily for 30 days, monthly for 12 months, yearly for ~10 years.
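
The tiered retention above can be illustrated with a small selection function. This is a minimal sketch, not the logic in backup-job.sh: the function name `retention_keep`, the timestamp format (`YYYYMMDDTHHMMSSZ`), the choice to keep the newest backup of each month and year, and the day thresholds (30 / 365 / 3653) are all assumptions made for the example.

```shell
# Sketch: print the backup timestamps a tiered policy would keep.
# stdin: one timestamp per line, assumed to look like 20260630T133236Z
# $1:    "today" as YYYYMMDD (passed in so the function stays deterministic)
retention_keep() {
  sort -r | awk -v today="$1" '
    # Day number of a Gregorian date, integer arithmetic only.
    function daynum(y, m, d) {
      if (m <= 2) { y--; m += 12 }
      return 365*y + int(y/4) - int(y/100) + int(y/400) + int((153*(m-3)+2)/5) + d
    }
    BEGIN { now = daynum(substr(today,1,4)+0, substr(today,5,2)+0, substr(today,7,2)+0) }
    {
      age = now - daynum(substr($0,1,4)+0, substr($0,5,2)+0, substr($0,7,2)+0)
      month = substr($0,1,6); year = substr($0,1,4)
      keep = (age <= 30)                                   # daily tier: keep everything recent
      if (age <= 365  && !(month in seen_month)) keep = 1  # monthly tier: newest backup of each month
      if (age <= 3653 && !(year in seen_year))   keep = 1  # yearly tier: newest backup of each year
      seen_month[month] = 1; seen_year[year] = 1
      if (keep) print
    }'
}
```

Anything the function does not print is a prune candidate. Input is sorted newest first, so the first timestamp seen in a given month or year is the one retained for that tier.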

Safety (mirrors ops/sandbox/sandbox-lifecycle.sh)

  • prod is read-only: pg_dump and tar only read; the only writes go to the backup bucket, never to prod. The DB is read with the env's own dynamic creds (vso-db-credentials); prod and sandbox never cross.
  • S3 creds are never exposed: the GCS HMAC secret is copied into a transient secret in the app namespace (values stay base64), deleted on exit. The whole in-container script is shipped base64 — no secret is ever printed.
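
The "deleted on exit" guarantee can be sketched with an EXIT trap. This is an illustration under assumptions, not the orchestrator's actual code: the function name `run_with_transient_secret` and its arguments are hypothetical, and the secret manifest is assumed to arrive on stdin with its `data:` values still base64-encoded.

```shell
# Sketch: create a transient secret, run a command, and always delete the secret.
# $1 = namespace, $2 = transient secret name, remaining args = command to run.
# stdin = a Secret manifest whose values are already base64 (never decoded here).
run_with_transient_secret() {
  ns="$1"; secret="$2"; shift 2
  (
    # The EXIT trap runs when the subshell ends, whether the command succeeded or failed.
    trap 'kubectl -n "$ns" delete secret "$secret" --ignore-not-found >/dev/null 2>&1' EXIT
    # Output is discarded so that nothing about the secret reaches the terminal or logs.
    kubectl -n "$ns" apply -f - >/dev/null && "$@"
  )
}
```

Running the work inside a subshell keeps the trap scoped to this one operation, and the subshell's exit status is still the status of the wrapped command.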

Usage

# one-shot backup + prune (run from anywhere; needs kubectl on the lab cluster)
ops/backup/dolibarr-backup.sh backup --env prod
ops/backup/dolibarr-backup.sh backup --env sandbox

# what's in the store
ops/backup/dolibarr-backup.sh list --env prod

backup-job.sh is the in-container logic (env-driven: BUCKET PREFIX DB PGHOST + the mounted DB/S3 creds) — the single source of truth, also intended for the scheduled CronJob (see "Automation" below).
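
An env-driven script of this kind typically fails fast on missing variables before touching anything. The sketch below shows that pattern together with the object keys from the table above. It assumes that PREFIX holds the `erp/<env>` part of the key and that `<ts>` is a UTC timestamp like `20260630T133236Z`; the helper name `backup_keys` is hypothetical. The two exported variables are the aws-cli settings that make checksums opt-in, which is what the commit message's `AWS_*_CHECKSUM_*=when_required` refers to.

```shell
# aws-cli v2.23+ sends integrity checksums by default; some S3-compatible
# stores reject them, so compute/validate them only when an operation requires it.
export AWS_REQUEST_CHECKSUM_CALCULATION=when_required
export AWS_RESPONSE_CHECKSUM_VALIDATION=when_required

# Sketch: validate the environment and print the two destination keys.
# $1 (optional) = timestamp; defaults to the current UTC time.
backup_keys() {
  : "${BUCKET:?BUCKET is required}"
  : "${PREFIX:?PREFIX is required (assumed form: erp/<env>)}"
  ts="${1:-$(date -u +%Y%m%dT%H%M%SZ)}"   # timestamp format is an assumption
  printf 's3://%s/%s/db/%s.dump\n'     "$BUCKET" "$PREFIX" "$ts"
  printf 's3://%s/%s/docs/%s.tar.gz\n' "$BUCKET" "$PREFIX" "$ts"
}
```

For example, `BUCKET=arcodange-backup PREFIX=erp/sandbox backup_keys 20260630T133236Z` prints the db and docs keys for that run, and the function aborts with a message if BUCKET or PREFIX is unset.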

Status: the first real prod backup was taken 2026-06-30 (erp/prod/db/… 1.2 MB, erp/prod/docs/… 12.5 MB). Proven end-to-end live on the sandbox (dump + tar + GCS upload + retention prune).

Restore (manual, for now)

# DB:    aws s3 cp s3://arcodange-backup/erp/<env>/db/<ts>.dump - | pg_restore -h <host> -U <user> -d <db> --clean
# docs:  aws s3 cp s3://arcodange-backup/erp/<env>/docs/<ts>.tar.gz - | tar -C /var/www/documents -xzf -

The sandbox iso-prod refresh (ops/sandbox/sandbox-lifecycle.sh) is the natural restore-drill bench. A restore subcommand is the next piece to be wired in.
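
Until a restore subcommand exists, choosing the `<ts>` to restore is a manual step. The helper below is a hedged sketch for picking the most recent object from an `aws s3 ls` listing. The name `latest_backup` is hypothetical, and it assumes object names begin with a timestamp that sorts lexicographically (for example `20260630T133236Z.dump`).

```shell
# Sketch: print the newest object name with a given suffix.
# stdin = output of `aws s3 ls s3://<bucket>/<prefix>/` (date, time, size, name)
# $1    = suffix to match, such as .dump or .tar.gz
latest_backup() {
  awk -v suffix="$1" '
    # Object lines have exactly four fields; "PRE <dir>/" lines have two and are skipped.
    NF == 4 && substr($4, length($4) - length(suffix) + 1) == suffix { print $4 }
  ' | sort | tail -n 1
}
```

The printed name can then be substituted for `<ts>.dump` or `<ts>.tar.gz` in the restore commands above.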

Automation (next step — gated on creds)

The recurring form is a k8s CronJob (ArgoCD-managed, in the chart) running the same backup-job.sh daily. It needs its own S3 creds rather than borrowing the Longhorn secret cross-namespace: a VaultStaticSecret in the erp namespace reading the GCS backup creds, which requires the erp Vault role to be granted read on that path (a tools change). Until that lands, run the orchestrator above on demand / from a host cron — it works today by borrowing the Longhorn creds transiently.

The generic Longhorn gap (the orphaned default group) should be fixed too, as a platform concern — but this dedicated, offsite, 10-year-retention backup is the one that matches Dolibarr's legal criticality.