From 8ec8fde67e1be49f842536a8edfadbb02a561a5b Mon Sep 17 00:00:00 2001
From: Gabriel Radureau
Date: Tue, 30 Jun 2026 15:32:36 +0200
Subject: [PATCH] =?UTF-8?q?feat(ops):=20dedicated=20Dolibarr=20backup=20(D?=
 =?UTF-8?q?B=20+=20documents=20=E2=86=92=20offsite=20GCS,=2010y=20retentio?=
 =?UTF-8?q?n)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The accounting data + issued documents are legally retained 10 years and
warrant a backup dedicated to Dolibarr. An audit found the generic Longhorn
external backup NEVER covered the erp volume (its Longhorn volume sits in the
orphaned `default` recurring-job group; the only job has groups=[] → serves
nothing; lastBackupAt=never). So /var/www/documents (invoice PDFs, supplier
pieces, contracts, ECM) had zero offsite copy — only in-cluster replicas.

ops/backup/dolibarr-backup.sh (orchestrator) + ops/backup/backup-job.sh
(in-container logic, env-driven, single source of truth):
- pg_dump -Fc of the DB + tar of the documents PVC (RWX, read-only mount)
  -> s3://arcodange-backup/erp//{db,docs}/, then tiered prune
  (daily 30d / monthly 12m / yearly 10y).
- prod is READ-only (dump+tar read; writes go only to the backup bucket);
  the DB is read with the env's own dynamic creds; the GCS HMAC secret is
  copied transiently (base64, deleted on exit) and never printed; the whole
  script ships base64.
- fixes the aws-cli v2.23+ default-checksum incompatibility with
  GCS/S3-compat (SignatureDoesNotMatch) via AWS_*_CHECKSUM_*=when_required.

Proven live: sandbox end-to-end (dump+tar+upload+prune, verified in GCS,
cleaned up) and retention logic unit-tested (1100 daily -> 46 kept). The FIRST
real prod backup was taken (erp/prod/db 1.2 MB + erp/prod/docs 12.5 MB) —
closing the gap now.

Automation (recurring CronJob in the chart + a dedicated erp Vault policy for
its own S3 creds) is the documented next step; the orchestrator works today on
demand.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 ops/backup/README.md          |  76 +++++++++++++++
 ops/backup/backup-job.sh      |  56 +++++++++++
 ops/backup/dolibarr-backup.sh | 171 ++++++++++++++++++++++++++++++++++
 3 files changed, 303 insertions(+)
 create mode 100644 ops/backup/README.md
 create mode 100755 ops/backup/backup-job.sh
 create mode 100755 ops/backup/dolibarr-backup.sh

diff --git a/ops/backup/README.md b/ops/backup/README.md
new file mode 100644
index 0000000..6f55ac4
--- /dev/null
+++ b/ops/backup/README.md
@@ -0,0 +1,76 @@
+# Dolibarr dedicated backup
+
+A backup strategy **dedicated to Dolibarr**, because the accounting data and the
+issued documents are critical and legally retained **10 years** — they warrant more
+than the generic platform backup.
+
+## Why this exists (the gap it closes)
+
+On 2026-06-30 an audit of the Longhorn external backup found that **the erp documents
+volume had never been backed up offsite** (`lastBackupAt = never`): its Longhorn
+volume is enrolled only in the `default` recurring-job group, but the single backup
+job (`thrice-a-month-backup`) has `groups=[]`, so it serves *no* group — the erp
+volume (and erp-sandbox) fell through the crack. Only in-cluster Longhorn replicas
+protected `/var/www/documents` (issued invoice PDFs, supplier pieces, contracts, ECM)
+— which does not survive a cluster loss / corruption / power-cut.
+
+This tool backs up **both halves** of Dolibarr state to the existing object store
+(`s3://arcodange-backup`, GCS via the S3-compatible API), under `erp//`:
+
+| half | how | key |
+|---|---|---|
+| Postgres DB | `pg_dump -Fc` (restorable) | `erp//db/.dump` |
+| documents PVC | `tar -czf` of `/var/www/documents` (RWX, mounted read-only) | `erp//docs/.tar.gz` |
+
+then prunes to a **tiered retention**: daily for 30 days, monthly for 12 months,
+yearly for ~10 years.
+
+## Safety (mirrors `ops/sandbox/sandbox-lifecycle.sh`)
+
+- **prod is read-only**: `pg_dump` and `tar` only read; the only writes go to the
+  backup bucket, never to prod. The DB is read with the env's *own* dynamic creds
+  (`vso-db-credentials`); prod and sandbox never cross.
+- **S3 creds are never exposed**: the GCS HMAC secret is copied into a *transient*
+  secret in the app namespace (values stay base64), deleted on exit. The whole
+  in-container script is shipped base64 — no secret is ever printed.
+
+## Usage
+
+```sh
+# one-shot backup + prune (run from anywhere; needs kubectl on the lab cluster)
+ops/backup/dolibarr-backup.sh backup --env prod
+ops/backup/dolibarr-backup.sh backup --env sandbox
+
+# what's in the store
+ops/backup/dolibarr-backup.sh list --env prod
+```
+
+`backup-job.sh` is the in-container logic (env-driven: `BUCKET PREFIX DB PGHOST` +
+the mounted DB/S3 creds) — the single source of truth, also intended for the
+scheduled CronJob (see "Automation" below).
+
+**Status:** the first real prod backup was taken 2026-06-30
+(`erp/prod/db/…` 1.2 MB, `erp/prod/docs/…` 12.5 MB). Proven end-to-end live on the
+sandbox (dump + tar + GCS upload + retention prune).
+
+## Restore (manual, for now)
+
+```sh
+# DB: aws s3 cp s3://arcodange-backup/erp//db/.dump - | pg_restore -h -U -d --clean
+# docs: aws s3 cp s3://arcodange-backup/erp//docs/.tar.gz - | tar -C /var/www/documents -xzf -
+```
+The sandbox iso-prod refresh (`ops/sandbox/sandbox-lifecycle.sh`) is the natural
+restore-drill bench. A `restore` subcommand is wired next.
+
+## Automation (next step — gated on creds)
+
+The recurring form is a k8s **CronJob** (ArgoCD-managed, in the chart) running the
+same `backup-job.sh` daily. It needs its **own** S3 creds rather than borrowing the
+Longhorn secret cross-namespace: a `VaultStaticSecret` in the erp namespace reading
+the GCS backup creds, which requires the `erp` Vault role to be granted read on that
+path (a `tools` change).
Until that lands, run the orchestrator above on demand /
+from a host cron — it works today by borrowing the Longhorn creds transiently.
+
+> The generic Longhorn gap (the orphaned `default` group) should be fixed too, as a
+> platform concern — but this dedicated, offsite, 10-year-retention backup is the
+> one that matches Dolibarr's legal criticality.
diff --git a/ops/backup/backup-job.sh b/ops/backup/backup-job.sh
new file mode 100755
index 0000000..2055910
--- /dev/null
+++ b/ops/backup/backup-job.sh
@@ -0,0 +1,56 @@
+#!/bin/sh
+# In-container backup logic for Dolibarr — the single source of truth shared by the
+# manual orchestrator (ops/backup/dolibarr-backup.sh) and the scheduled CronJob
+# (chart/templates/backup-cronjob.yaml). Driven entirely by environment:
+#   BUCKET PREFIX DB PGHOST                                (config)
+#   PGUSER PGPASSWORD                                      (DB creds, from vso-db-credentials)
+#   AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_ENDPOINTS  (S3 creds)
+# It dumps the DB (pg_dump -Fc) + tars the documents mounted at /docs, pushes both
+# to s3://$BUCKET/$PREFIX/{db,docs}/, then prunes to a tiered retention.
+set -eu
+apk add --no-cache aws-cli tar gzip >/dev/null 2>&1 || { echo "ABORT apk add"; exit 1; }
+: "${BUCKET:?}"; : "${PREFIX:?}"; : "${DB:?}"; : "${PGHOST:?}"
+export AWS_DEFAULT_REGION="${AWS_DEFAULT_REGION:-us-east-1}"
+# GCS / S3-compatible stores reject aws-cli v2.23+ default integrity checksums
+# ("SignatureDoesNotMatch / Invalid argument") — only sign/validate when required.
+export AWS_REQUEST_CHECKSUM_CALCULATION=when_required
+export AWS_RESPONSE_CHECKSUM_VALIDATION=when_required
+S3() { aws --endpoint-url "$AWS_ENDPOINTS" s3 "$@"; }
+
+TS=$(date -u +%Y-%m-%dT%H-%M-%SZ)
+echo "timestamp=$TS db=$DB -> s3://$BUCKET/$PREFIX"
+pg_dump -h "$PGHOST" -U "$PGUSER" -d "$DB" -Fc -f /tmp/db.dump
+echo "db.dump $(wc -c < /tmp/db.dump) bytes"
+tar -C /docs -czf /tmp/docs.tar.gz .
2>/dev/null
+echo "docs.tar.gz $(wc -c < /tmp/docs.tar.gz) bytes"
+S3 cp /tmp/db.dump "s3://$BUCKET/$PREFIX/db/$TS.dump"
+S3 cp /tmp/docs.tar.gz "s3://$BUCKET/$PREFIX/docs/$TS.tar.gz"
+echo "uploaded to s3://$BUCKET/$PREFIX/{db,docs}/$TS.*"
+
+# tiered retention: daily 30d / monthly 12m (latest per month) / yearly ~10y
+cat > /tmp/prune.py <<'PY'
+import sys, datetime
+keys=[k.strip() for k in open(sys.argv[1]) if k.strip()]
+now=datetime.datetime.strptime(sys.argv[2][:10], "%Y-%m-%d").date()
+def d(k):
+    try: return datetime.datetime.strptime(k[:10], "%Y-%m-%d").date()
+    except Exception: return None
+dated=sorted([(d(k),k) for k in keys if d(k)], key=lambda x:x[0])
+keep=set(); bymonth={}; byyear={}
+for dt,k in dated:
+    age=(now-dt).days
+    if age <= 30: keep.add(k)
+    elif age <= 365: bymonth[(dt.year,dt.month)]=k
+    elif age <= 3660: byyear[dt.year]=k
+keep |= set(bymonth.values()) | set(byyear.values())
+for dt,k in dated:
+    if k not in keep: print(k)
+PY
+for SUB in db docs; do
+  S3 ls "s3://$BUCKET/$PREFIX/$SUB/" | awk '{print $4}' > /tmp/keys.$SUB || true
+  python3 /tmp/prune.py "/tmp/keys.$SUB" "$TS" > /tmp/del.$SUB || true
+  while read -r DK; do
+    [ -n "$DK" ] && S3 rm "s3://$BUCKET/$PREFIX/$SUB/$DK" && echo "pruned $SUB/$DK"
+  done < /tmp/del.$SUB
+done
+echo "DONE."
diff --git a/ops/backup/dolibarr-backup.sh b/ops/backup/dolibarr-backup.sh
new file mode 100755
index 0000000..2712079
--- /dev/null
+++ b/ops/backup/dolibarr-backup.sh
@@ -0,0 +1,171 @@
+#!/usr/bin/env bash
+#
+# dolibarr-backup.sh — dedicated, offsite backup for the Arcodange Dolibarr ERP.
+#
+# Critical-data-aware (10-year accounting retention) and INDEPENDENT of the generic
+# Longhorn platform backup — which today does NOT cover the erp volume (its volume
+# sits in the orphaned `default` recurring-job group, lastBackupAt=never).
Backs up
+# BOTH halves of Dolibarr state to the existing object store (s3://arcodange-backup
+# on GCS), under erp//:
+#   - the Postgres DB (pg_dump -Fc, restorable)        -> erp//db/.dump
+#   - the documents PVC (/var/www/documents, RWX, ro)  -> erp//docs/.tar.gz
+# then prunes to a tiered retention: daily 30d, monthly 12m, yearly 10y.
+#
+# Safety, mirroring ops/sandbox/sandbox-lifecycle.sh:
+#   - the DB is read with the app's OWN dynamic creds (vso-db-credentials), scoped
+#     to its env; prod and sandbox never cross.
+#   - S3 creds are a TRANSIENT copy of the Longhorn GCS secret (deleted on exit);
+#     no secret value is ever printed.
+#   - the whole in-container script is shipped base64 (no nested-heredoc/quoting).
+#
+# Usage:
+#   dolibarr-backup.sh backup [--env prod|sandbox]    # one-shot backup + prune
+#   dolibarr-backup.sh list [--env prod|sandbox]      # what's in the store
+#   dolibarr-backup.sh restore --db --env --yes       # restore DB (DESTRUCTIVE)
+#   dolibarr-backup.sh restore --docs --env --yes     # restore documents
+#
+set -euo pipefail
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+
+PG_IMAGE="postgres:16-alpine"
+PGHOST="192.168.1.202"       # direct Postgres (NOT pgbouncer)
+BUCKET="${ARCO_BACKUP_BUCKET:-arcodange-backup}"
+S3_SRC_NS="longhorn-system"  # where the GCS HMAC creds live today
+S3_SRC_SECRET="longhorn-gcs-backup-credentials"
+TMP_S3_SECRET="dolibarr-backup-s3-temp"
+
+log() { printf '\033[1;36m==>\033[0m %s\n' "$*"; }
+die() { printf '\033[1;31mABORT:\033[0m %s\n' "$*" >&2; exit 1; }
+
+CMD="${1:-}"; shift || true
+ENV="prod"; KEY=""; KIND=""; YES=0
+while [[ $# -gt 0 ]]; do
+  case "$1" in
+    --env) ENV="${2:?}"; shift 2 ;;
+    --db) KIND="db"; KEY="${2:?}"; shift 2 ;;
+    --docs) KIND="docs"; KEY="${2:?}"; shift 2 ;;
+    --yes) YES=1; shift ;;
+    *) die "unknown arg '$1'" ;;
+  esac
+done
+
+case "$ENV" in
+  prod) NS="erp"; DB="erp" ;;
+  sandbox) NS="erp-sandbox"; DB="erp-sandbox" ;;
+  *) die "--env must be prod|sandbox" ;;
+esac
+PVC="$NS"
+PREFIX="${ARCO_BACKUP_PREFIX:-erp/${ENV}}"
+
+# in-container preamble: install tools, export region, define S3()
+read -r -d '' PREAMBLE <<'SH' || true
+set -eu
+apk add --no-cache aws-cli tar gzip >/dev/null 2>&1 || { echo "ABORT apk add"; exit 1; }
+export AWS_DEFAULT_REGION="${AWS_DEFAULT_REGION:-us-east-1}"
+# GCS / S3-compatible stores reject aws-cli v2.23+ default integrity checksums
+# ("SignatureDoesNotMatch / Invalid argument"); only sign/validate when required.
+export AWS_REQUEST_CHECKSUM_CALCULATION=when_required
+export AWS_RESPONSE_CHECKSUM_VALIDATION=when_required
+aws --version 2>&1 | head -1
+S3() { aws --endpoint-url "$AWS_ENDPOINTS" s3 "$@"; }
+SH
+
+copy_s3_secret() {
+  command -v python3 >/dev/null || die "python3 required to copy the S3 secret without exposing it"
+  kubectl get secret "$S3_SRC_SECRET" -n "$S3_SRC_NS" -o json \
+    | python3 -c "import json,sys; d=json.load(sys.stdin); d['metadata']={'name':'$TMP_S3_SECRET','namespace':'$NS'}; d.pop('status',None); d['data']={k:d['data'][k] for k in ('AWS_ACCESS_KEY_ID','AWS_SECRET_ACCESS_KEY','AWS_ENDPOINTS')}; print(json.dumps(d))" \
+    | kubectl apply -f - >/dev/null
+}
+cleanup_secret() { kubectl delete secret "$TMP_S3_SECRET" -n "$NS" --ignore-not-found >/dev/null 2>&1 || true; }
+
+# b64-encode an in-container script (host vars already substituted by the caller)
+b64() { printf '%s' "$1" | base64 | tr -d '\n'; }
+
+run_backup() {
+  trap cleanup_secret EXIT
+  log "Copying GCS creds into a transient secret in $NS (values stay base64)"
+  copy_s3_secret
+  log "Backup ${ENV}: DB=$DB PVC=$PVC -> s3://$BUCKET/$PREFIX/{db,docs}/"
+  local B64; B64="$(b64 "$(cat "${SCRIPT_DIR}/backup-job.sh")")"
+  kubectl delete job dolibarr-backup -n "$NS" --ignore-not-found >/dev/null 2>&1 || true
+  kubectl apply -f - >/dev/null </dev/null 2>&1 \
+    || die "backup Job did not complete — kubectl logs -n $NS job/dolibarr-backup"
+  kubectl logs -n "$NS" job/dolibarr-backup | sed 's/^/ /'
+  kubectl delete job
dolibarr-backup -n "$NS" --ignore-not-found >/dev/null 2>&1 || true
+  cleanup_secret; trap - EXIT
+  log "Backup complete."
+}
+
+run_list() {
+  trap cleanup_secret EXIT; copy_s3_secret
+  local SCRIPT
+  SCRIPT="$(cat </dev/null 2>&1 || true
+  kubectl apply -f - >/dev/null </dev/null 2>&1 || true
+  kubectl logs -n "$NS" job/dolibarr-backup-list 2>/dev/null | sed 's/^/ /'
+  kubectl delete job dolibarr-backup-list -n "$NS" --ignore-not-found >/dev/null 2>&1 || true
+  cleanup_secret; trap - EXIT
+}
+
+case "$CMD" in
+  backup) run_backup ;;
+  list) run_list ;;
+  restore)
+    [[ -n "$KEY" && -n "$KIND" ]] || die "restore needs --db or --docs "
+    [[ "$YES" == "1" ]] || die "restore is DESTRUCTIVE on '$ENV' — re-run with --yes"
+    die "restore: wired in the chart Job (next iteration) — key=$KEY kind=$KIND env=$ENV"
+    ;;
+  *) echo "usage: $0 {backup|list|restore} [--env prod|sandbox] [--db|--docs ] [--yes]" >&2; exit 2 ;;
+esac
-- 
2.49.1
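
Review note: the tiered retention embedded in `backup-job.sh` (the `/tmp/prune.py` heredoc) can be exercised without a cluster or a bucket. What follows is a minimal sketch, not the shipped script. It restates the same three tiers as a pure function: every key from the last 30 days, the latest key per month up to 365 days, the latest key per year up to 3660 days. The names `parse_day` and `keys_to_prune` are hypothetical, and the key format is assumed to be `<YYYY-MM-DD>T…` as produced by `TS` in the patch.

```python
import datetime


def parse_day(key):
    """Return the date encoded in the first 10 characters of a key, or None."""
    try:
        return datetime.datetime.strptime(key[:10], "%Y-%m-%d").date()
    except ValueError:
        return None


def keys_to_prune(keys, now):
    """Return the keys the tiered retention would delete, oldest first.

    Assumed tiers (same thresholds as the patch): keep everything up to 30 days
    old, the latest key per calendar month up to 365 days old, and the latest
    key per calendar year up to 3660 days old. Undated keys are never pruned.
    """
    dated = []
    for key in keys:
        day = parse_day(key)
        if day is not None:
            dated.append((day, key))
    dated.sort()  # ascending, so later keys overwrite earlier ones per bucket

    keep = set()
    by_month = {}
    by_year = {}
    for day, key in dated:
        age = (now - day).days
        if age <= 30:
            keep.add(key)
        elif age <= 365:
            by_month[(day.year, day.month)] = key
        elif age <= 3660:
            by_year[day.year] = key
    keep |= set(by_month.values()) | set(by_year.values())
    return [key for _, key in dated if key not in keep]


if __name__ == "__main__":
    # 1100 consecutive daily keys ending on the date of the first prod backup.
    today = datetime.date(2026, 6, 30)
    sample = [
        (today - datetime.timedelta(days=i)).strftime("%Y-%m-%d") + "T02-00-00Z.dump"
        for i in range(1100)
    ]
    pruned = keys_to_prune(sample, today)
    print(f"{len(sample) - len(pruned)} kept, {len(pruned)} pruned")
```

For this sample the function keeps 46 keys (31 daily, 12 monthly, 3 yearly), which is consistent with the "1100 daily -> 46 kept" figure quoted in the commit message. The exact count can shift with the reference date, since the monthly and yearly tiers follow calendar boundaries.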