Merge pull request 'feat(ops): dedicated Dolibarr backup (DB + documents → offsite GCS, 10y retention)' (#31) from claude/dolibarr-backup-strategy into main

2026-06-30 15:33:29 +02:00
parent d27b5bfd45 8ec8fde67e
commit e69717c2d9
3 changed files with 303 additions and 0 deletions
--- a/ops/backup/README.md
+++ b/ops/backup/README.md
@@ -0,0 +1,76 @@
+# Dolibarr dedicated backup
+
+A backup strategy **dedicated to Dolibarr**, because the accounting data and the
+issued documents are critical and legally retained **10 years** — they warrant more
+than the generic platform backup.
+
+## Why this exists (the gap it closes)
+
+On 2026-06-30 an audit of the Longhorn external backup found that **the erp documents
+volume had never been backed up offsite** (`lastBackupAt = never`): its Longhorn
+volume is enrolled only in the `default` recurring-job group, but the single backup
+job (`thrice-a-month-backup`) has `groups=[]`, so it serves *no* group, and the erp
+volume (and erp-sandbox) fell through the cracks. Only in-cluster Longhorn replicas
+protected `/var/www/documents` (issued invoice PDFs, supplier documents, contracts, ECM),
+and those replicas do not survive a cluster loss, corruption, or a power cut.
+
+This tool backs up **both halves** of Dolibarr state to the existing object store
+(`s3://arcodange-backup`, GCS via the S3-compatible API), under `erp/<env>/`:
+
+| half | how | key |
+|---|---|---|
+| Postgres DB | `pg_dump -Fc` (restorable) | `erp/<env>/db/<ts>.dump` |
+| documents PVC | `tar -czf` of `/var/www/documents` (RWX, mounted read-only) | `erp/<env>/docs/<ts>.tar.gz` |
+
+then prunes to a **tiered retention**: daily for 30 days, monthly for 12 months,
+yearly for ~10 years.
+
+## Safety (mirrors `ops/sandbox/sandbox-lifecycle.sh`)
+
+- **prod is read-only**: `pg_dump` and `tar` only read; the only writes go to the
+  backup bucket, never to prod. The DB is read with the env's *own* dynamic creds
+  (`vso-db-credentials`); prod and sandbox never cross.
+- **S3 creds are never exposed**: the GCS HMAC secret is copied into a *transient*
+  secret in the app namespace (values stay base64), deleted on exit. The whole
+  in-container script is shipped base64 — no secret is ever printed.
+
+## Usage
+
+```sh
+# one-shot backup + prune (run from anywhere; needs kubectl on the lab cluster)
+ops/backup/dolibarr-backup.sh backup --env prod
+ops/backup/dolibarr-backup.sh backup --env sandbox
+
+# what's in the store
+ops/backup/dolibarr-backup.sh list --env prod
+```
+
+`backup-job.sh` is the in-container logic (env-driven: `BUCKET PREFIX DB PGHOST` +
+the mounted DB/S3 creds) — the single source of truth, also intended for the
+scheduled CronJob (see "Automation" below).
+
+**Status:** the first real prod backup was taken 2026-06-30
+(`erp/prod/db/…` 1.2 MB, `erp/prod/docs/…` 12.5 MB). Proven end-to-end live on the
+sandbox (dump + tar + GCS upload + retention prune).
+
+## Restore (manual, for now)
+
+```sh
+# DB:    aws --endpoint-url "$AWS_ENDPOINTS" s3 cp s3://arcodange-backup/erp/<env>/db/<ts>.dump - | pg_restore -h <host> -U <user> -d <db> --clean
+# docs:  aws --endpoint-url "$AWS_ENDPOINTS" s3 cp s3://arcodange-backup/erp/<env>/docs/<ts>.tar.gz - | tar -C /var/www/documents -xzf -
+```
+The sandbox iso-prod refresh (`ops/sandbox/sandbox-lifecycle.sh`) is the natural
+restore-drill bench. The `restore` subcommand is a stub for now and will be wired next.
+
+## Automation (next step — gated on creds)
+
+The recurring form is a k8s **CronJob** (ArgoCD-managed, in the chart) running the
+same `backup-job.sh` daily. It needs its **own** S3 creds rather than borrowing the
+Longhorn secret cross-namespace: a `VaultStaticSecret` in the erp namespace reading
+the GCS backup creds, which requires the `erp` Vault role to be granted read on that
+path (a `tools` change). Until that lands, run the orchestrator above on demand /
+from a host cron — it works today by borrowing the Longhorn creds transiently.
+
+> The generic Longhorn gap (the orphaned `default` group) should be fixed too, as a
+> platform concern — but this dedicated, offsite, 10-year-retention backup is the
+> one that matches Dolibarr's legal criticality.
--- a/ops/backup/backup-job.sh
+++ b/ops/backup/backup-job.sh
@@ -0,0 +1,56 @@
+#!/bin/sh
+# In-container backup logic for Dolibarr — the single source of truth shared by the
+# manual orchestrator (ops/backup/dolibarr-backup.sh) and the scheduled CronJob
+# (chart/templates/backup-cronjob.yaml). Driven entirely by environment:
+#   BUCKET PREFIX DB PGHOST       (config)
+#   PGUSER PGPASSWORD             (DB creds, from vso-db-credentials)
+#   AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_ENDPOINTS   (S3 creds)
+# It dumps the DB (pg_dump -Fc) + tars the documents mounted at /docs, pushes both
+# to s3://$BUCKET/$PREFIX/{db,docs}/, then prunes to a tiered retention.
+set -eu
+apk add --no-cache aws-cli tar gzip >/dev/null 2>&1 || { echo "ABORT apk add"; exit 1; }
+: "${BUCKET:?}"; : "${PREFIX:?}"; : "${DB:?}"; : "${PGHOST:?}"
+export AWS_DEFAULT_REGION="${AWS_DEFAULT_REGION:-us-east-1}"
+# GCS / S3-compatible stores reject aws-cli v2.23+ default integrity checksums
+# ("SignatureDoesNotMatch / Invalid argument"), so only compute/validate them when required.
+export AWS_REQUEST_CHECKSUM_CALCULATION=when_required
+export AWS_RESPONSE_CHECKSUM_VALIDATION=when_required
+S3() { aws --endpoint-url "$AWS_ENDPOINTS" s3 "$@"; }
+
+TS=$(date -u +%Y-%m-%dT%H-%M-%SZ)
+echo "timestamp=$TS  db=$DB  -> s3://$BUCKET/$PREFIX"
+pg_dump -h "$PGHOST" -U "$PGUSER" -d "$DB" -Fc -f /tmp/db.dump
+echo "db.dump $(wc -c < /tmp/db.dump) bytes"
+tar -C /docs -czf /tmp/docs.tar.gz . 2>/dev/null || { echo "ABORT tar"; exit 1; }
+echo "docs.tar.gz $(wc -c < /tmp/docs.tar.gz) bytes"
+S3 cp /tmp/db.dump     "s3://$BUCKET/$PREFIX/db/$TS.dump"
+S3 cp /tmp/docs.tar.gz "s3://$BUCKET/$PREFIX/docs/$TS.tar.gz"
+echo "uploaded to s3://$BUCKET/$PREFIX/{db,docs}/$TS.*"
+
+# tiered retention: daily 30d / monthly 12m (latest per month) / yearly ~10y
+cat > /tmp/prune.py <<'PY'
+import sys, datetime
+keys=[k.strip() for k in open(sys.argv[1]) if k.strip()]
+now=datetime.datetime.strptime(sys.argv[2][:10], "%Y-%m-%d").date()
+def d(k):
+    try: return datetime.datetime.strptime(k[:10], "%Y-%m-%d").date()
+    except Exception: return None
+dated=sorted([(d(k),k) for k in keys if d(k)], key=lambda x:x[0])
+keep=set(); bymonth={}; byyear={}
+for dt,k in dated:
+    age=(now-dt).days
+    if age <= 30: keep.add(k)
+    elif age <= 365: bymonth[(dt.year,dt.month)]=k
+    elif age <= 3660: byyear[dt.year]=k
+keep |= set(bymonth.values()) | set(byyear.values())
+for dt,k in dated:
+    if k not in keep: print(k)
+PY
+for SUB in db docs; do
+  S3 ls "s3://$BUCKET/$PREFIX/$SUB/" | awk '{print $4}' > /tmp/keys.$SUB || true
+  python3 /tmp/prune.py "/tmp/keys.$SUB" "$TS" > /tmp/del.$SUB || true
+  while read -r DK; do
+    [ -n "$DK" ] && S3 rm "s3://$BUCKET/$PREFIX/$SUB/$DK" && echo "pruned $SUB/$DK"
+  done < /tmp/del.$SUB
+done
+echo "DONE."
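
Note on the retention pass in `backup-job.sh` above: the embedded `prune.py` is compact, so here is a readable restatement of the same tiering with a worked example. This is a sketch, not the shipped code. The function name `keys_to_prune` and the sample keys are hypothetical; the thresholds (30, 365 and 3660 days) and the "latest key per month / per year wins" rule are taken from the script.

```python
import datetime


def keys_to_prune(keys, today):
    """Return the keys the tiered policy would delete, oldest first.

    Keys are expected to start with YYYY-MM-DD. Keys that do not parse
    are ignored, so they are never proposed for deletion.
    """
    dated = []
    for key in keys:
        try:
            day = datetime.datetime.strptime(key[:10], "%Y-%m-%d").date()
        except ValueError:
            continue
        dated.append((day, key))
    dated.sort()  # oldest first, so later keys overwrite earlier ones below

    keep, by_month, by_year = set(), {}, {}
    for day, key in dated:
        age = (today - day).days
        if age <= 30:
            keep.add(key)                            # daily tier: keep everything
        elif age <= 365:
            by_month[(day.year, day.month)] = key    # monthly tier: latest per month
        elif age <= 3660:
            by_year[day.year] = key                  # yearly tier: latest per year
        # older than 3660 days: lands in no tier, so it is pruned
    keep |= set(by_month.values()) | set(by_year.values())
    return [key for _, key in dated if key not in keep]


sample_keys = [
    "2026-06-29T02-00-00Z.dump",  # 1 day old, daily tier
    "2026-05-31T02-00-00Z.dump",  # exactly 30 days old, still daily tier
    "2026-05-30T02-00-00Z.dump",  # latest May key in the monthly tier
    "2026-05-10T02-00-00Z.dump",  # superseded by the 05-30 key
    "2025-03-01T02-00-00Z.dump",  # latest 2025 key in the yearly tier
    "2025-01-15T02-00-00Z.dump",  # superseded by the 03-01 key
    "2015-01-01T02-00-00Z.dump",  # older than 3660 days
    "notes.txt",                  # not a dated key, ignored
]
print(keys_to_prune(sample_keys, datetime.date(2026, 6, 30)))
```

With these inputs only the 2015 key, the 2025-01-15 key and the 2026-05-10 key are proposed for deletion. One consequence worth knowing: a key leaves the monthly tier once it is more than 365 days old, at which point only the latest key of its calendar year survives.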
--- a/ops/backup/dolibarr-backup.sh
+++ b/ops/backup/dolibarr-backup.sh
@@ -0,0 +1,171 @@
+#!/usr/bin/env bash
+#
+# dolibarr-backup.sh — dedicated, offsite backup for the Arcodange Dolibarr ERP.
+#
+# Critical-data-aware (10-year accounting retention) and INDEPENDENT of the generic
+# Longhorn platform backup — which today does NOT cover the erp volume (its volume
+# sits in the orphaned `default` recurring-job group, lastBackupAt=never). Backs up
+# BOTH halves of Dolibarr state to the existing object store (s3://arcodange-backup
+# on GCS), under erp/<env>/:
+#   - the Postgres DB   (pg_dump -Fc, restorable)          -> erp/<env>/db/<ts>.dump
+#   - the documents PVC (/var/www/documents, RWX, ro)      -> erp/<env>/docs/<ts>.tar.gz
+# then prunes to a tiered retention: daily 30d, monthly 12m, yearly 10y.
+#
+# Safety, mirroring ops/sandbox/sandbox-lifecycle.sh:
+#   - the DB is read with the app's OWN dynamic creds (vso-db-credentials), scoped
+#     to its env; prod and sandbox never cross.
+#   - S3 creds are a TRANSIENT copy of the Longhorn GCS secret (deleted on exit);
+#     no secret value is ever printed.
+#   - the whole in-container script is shipped base64 (no nested-heredoc/quoting).
+#
+# Usage:
+#   dolibarr-backup.sh backup  [--env prod|sandbox]            # one-shot backup + prune
+#   dolibarr-backup.sh list    [--env prod|sandbox]            # what's in the store
+#   dolibarr-backup.sh restore --db   <key> --env <e> --yes    # restore DB (DESTRUCTIVE; stub, not wired yet)
+#   dolibarr-backup.sh restore --docs <key> --env <e> --yes    # restore documents (stub, not wired yet)
+#
+set -euo pipefail
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+
+PG_IMAGE="postgres:16-alpine"
+PGHOST="192.168.1.202"                 # direct Postgres (NOT pgbouncer)
+BUCKET="${ARCO_BACKUP_BUCKET:-arcodange-backup}"
+S3_SRC_NS="longhorn-system"            # where the GCS HMAC creds live today
+S3_SRC_SECRET="longhorn-gcs-backup-credentials"
+TMP_S3_SECRET="dolibarr-backup-s3-temp"
+
+log() { printf '\033[1;36m==>\033[0m %s\n' "$*"; }
+die() { printf '\033[1;31mABORT:\033[0m %s\n' "$*" >&2; exit 1; }
+
+CMD="${1:-}"; shift || true
+ENV="prod"; KEY=""; KIND=""; YES=0
+while [[ $# -gt 0 ]]; do
+  case "$1" in
+    --env) ENV="${2:?}"; shift 2 ;;
+    --db) KIND="db"; KEY="${2:?}"; shift 2 ;;
+    --docs) KIND="docs"; KEY="${2:?}"; shift 2 ;;
+    --yes) YES=1; shift ;;
+    *) die "unknown arg '$1'" ;;
+  esac
+done
+
+case "$ENV" in
+  prod)    NS="erp";         DB="erp" ;;
+  sandbox) NS="erp-sandbox"; DB="erp-sandbox" ;;
+  *) die "--env must be prod|sandbox" ;;
+esac
+PVC="$NS"
+PREFIX="${ARCO_BACKUP_PREFIX:-erp/${ENV}}"
+
+# in-container preamble: install tools, export region, define S3()
+read -r -d '' PREAMBLE <<'SH' || true
+set -eu
+apk add --no-cache aws-cli tar gzip >/dev/null 2>&1 || { echo "ABORT apk add"; exit 1; }
+export AWS_DEFAULT_REGION="${AWS_DEFAULT_REGION:-us-east-1}"
+# GCS / S3-compatible stores reject aws-cli v2.23+ default integrity checksums
+# ("SignatureDoesNotMatch / Invalid argument"); only compute/validate them when required.
+export AWS_REQUEST_CHECKSUM_CALCULATION=when_required
+export AWS_RESPONSE_CHECKSUM_VALIDATION=when_required
+aws --version 2>&1 | head -1
+S3() { aws --endpoint-url "$AWS_ENDPOINTS" s3 "$@"; }
+SH
+
+copy_s3_secret() {
+  command -v python3 >/dev/null || die "python3 required to copy the S3 secret without exposing it"
+  kubectl get secret "$S3_SRC_SECRET" -n "$S3_SRC_NS" -o json \
+    | python3 -c "import json,sys; d=json.load(sys.stdin); d['metadata']={'name':'$TMP_S3_SECRET','namespace':'$NS'}; d.pop('status',None); d['data']={k:d['data'][k] for k in ('AWS_ACCESS_KEY_ID','AWS_SECRET_ACCESS_KEY','AWS_ENDPOINTS')}; print(json.dumps(d))" \
+    | kubectl apply -f - >/dev/null
+}
+cleanup_secret() { kubectl delete secret "$TMP_S3_SECRET" -n "$NS" --ignore-not-found >/dev/null 2>&1 || true; }
+
+# b64-encode an in-container script (host vars already substituted by the caller)
+b64() { printf '%s' "$1" | base64 | tr -d '\n'; }
+
+run_backup() {
+  trap cleanup_secret EXIT
+  log "Copying GCS creds into a transient secret in $NS (values stay base64)"
+  copy_s3_secret
+  log "Backup ${ENV}: DB=$DB  PVC=$PVC  ->  s3://$BUCKET/$PREFIX/{db,docs}/"
+  local B64; B64="$(b64 "$(cat "${SCRIPT_DIR}/backup-job.sh")")"
+  kubectl delete job dolibarr-backup -n "$NS" --ignore-not-found >/dev/null 2>&1 || true
+  kubectl apply -f - >/dev/null <<EOF
+apiVersion: batch/v1
+kind: Job
+metadata: { name: dolibarr-backup, namespace: $NS }
+spec:
+  backoffLimit: 0
+  ttlSecondsAfterFinished: 600
+  template:
+    spec:
+      restartPolicy: Never
+      volumes:
+        - name: docs
+          persistentVolumeClaim: { claimName: $PVC, readOnly: true }
+      containers:
+        - name: backup
+          image: $PG_IMAGE
+          envFrom:
+            - secretRef: { name: $TMP_S3_SECRET }
+          env:
+            - { name: BUCKET, value: "$BUCKET" }
+            - { name: PREFIX, value: "$PREFIX" }
+            - { name: DB,     value: "$DB" }
+            - { name: PGHOST, value: "$PGHOST" }
+            - { name: PGUSER,     valueFrom: { secretKeyRef: { name: vso-db-credentials, key: username } } }
+            - { name: PGPASSWORD, valueFrom: { secretKeyRef: { name: vso-db-credentials, key: password } } }
+          volumeMounts:
+            - { name: docs, mountPath: /docs, readOnly: true }
+          command: ["/bin/sh","-c"]
+          args: ["echo $B64 | base64 -d | sh"]
+EOF
+  kubectl wait --for=condition=complete job/dolibarr-backup -n "$NS" --timeout=300s >/dev/null 2>&1 \
+    || die "backup Job did not complete — kubectl logs -n $NS job/dolibarr-backup"
+  kubectl logs -n "$NS" job/dolibarr-backup | sed 's/^/    /'
+  kubectl delete job dolibarr-backup -n "$NS" --ignore-not-found >/dev/null 2>&1 || true
+  cleanup_secret; trap - EXIT
+  log "Backup complete."
+}
+
+run_list() {
+  trap cleanup_secret EXIT; copy_s3_secret
+  local SCRIPT
+  SCRIPT="$(cat <<EOF
+$PREAMBLE
+echo "db/:";   S3 ls "s3://$BUCKET/$PREFIX/db/"   || echo "  (empty)"
+echo "docs/:"; S3 ls "s3://$BUCKET/$PREFIX/docs/" || echo "  (empty)"
+EOF
+)"
+  kubectl delete job dolibarr-backup-list -n "$NS" --ignore-not-found >/dev/null 2>&1 || true
+  kubectl apply -f - >/dev/null <<EOF
+apiVersion: batch/v1
+kind: Job
+metadata: { name: dolibarr-backup-list, namespace: $NS }
+spec:
+  backoffLimit: 0
+  ttlSecondsAfterFinished: 300
+  template:
+    spec:
+      restartPolicy: Never
+      containers:
+        - name: list
+          image: $PG_IMAGE
+          envFrom: [ { secretRef: { name: $TMP_S3_SECRET } } ]
+          command: ["/bin/sh","-c"]
+          args: ["echo $(b64 "$SCRIPT") | base64 -d | sh"]
+EOF
+  kubectl wait --for=condition=complete job/dolibarr-backup-list -n "$NS" --timeout=180s >/dev/null 2>&1 || true
+  kubectl logs -n "$NS" job/dolibarr-backup-list 2>/dev/null | sed 's/^/    /'
+  kubectl delete job dolibarr-backup-list -n "$NS" --ignore-not-found >/dev/null 2>&1 || true
+  cleanup_secret; trap - EXIT
+}
+
+case "$CMD" in
+  backup) run_backup ;;
+  list)   run_list ;;
+  restore)
+    [[ -n "$KEY" && -n "$KIND" ]] || die "restore needs --db <key> or --docs <key>"
+    [[ "$YES" == "1" ]] || die "restore is DESTRUCTIVE on '$ENV' — re-run with --yes"
+    die "restore: wired in the chart Job (next iteration) — key=$KEY kind=$KIND env=$ENV"
+    ;;
+  *) echo "usage: $0 {backup|list|restore} [--env prod|sandbox] [--db|--docs <key>] [--yes]" >&2; exit 2 ;;
+esac
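
Note on `copy_s3_secret` above: the inline `python3 -c` filter is dense, so the transformation it applies can be sketched as a standalone function. This is an illustration under assumptions, not the shipped code. The name `retarget_secret` and the sample manifest are hypothetical; the three key names, the target secret name and the fact that values are passed through without decoding come from the script.

```python
import json

WANTED_KEYS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_ENDPOINTS")


def retarget_secret(secret, name, namespace, wanted=WANTED_KEYS):
    """Return a copy of a Secret manifest renamed into another namespace.

    Values are copied as-is, so they stay base64-encoded and are never
    decoded or printed. Source metadata (uid, resourceVersion, ...) is
    dropped so the result can be applied as a fresh object.
    """
    out = dict(secret)
    out["metadata"] = {"name": name, "namespace": namespace}
    out.pop("status", None)
    out["data"] = {key: secret["data"][key] for key in wanted}
    return out


# Fabricated stand-in for `kubectl get secret ... -o json` output.
source = {
    "apiVersion": "v1",
    "kind": "Secret",
    "type": "Opaque",
    "metadata": {
        "name": "longhorn-gcs-backup-credentials",
        "namespace": "longhorn-system",
        "uid": "hypothetical-uid",
        "resourceVersion": "12345",
    },
    "data": {
        "AWS_ACCESS_KEY_ID": "ZmFrZQ==",
        "AWS_SECRET_ACCESS_KEY": "ZmFrZQ==",
        "AWS_ENDPOINTS": "ZmFrZQ==",
        "UNRELATED_KEY": "ZmFrZQ==",
    },
}

copy = retarget_secret(source, "dolibarr-backup-s3-temp", "erp")
# Print only the structure, never the values.
print(json.dumps({"metadata": copy["metadata"], "keys": sorted(copy["data"])}))
```

Filtering to an explicit key list means any extra keys in the source secret are not carried into the app namespace, and a missing key raises `KeyError` instead of producing a partial secret.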