Merge pull request 'feat(ops): dedicated Dolibarr backup (DB + documents → offsite GCS, 10y retention)' (#31) from claude/dolibarr-backup-strategy into main
This commit was merged in pull request #31.
76
ops/backup/README.md
Normal file
@@ -0,0 +1,76 @@
# Dolibarr dedicated backup

A backup strategy **dedicated to Dolibarr**: the accounting data and the issued
documents are critical and must legally be retained for **10 years**, so they warrant
more than the generic platform backup.

## Why this exists (the gap it closes)

On 2026-06-30 an audit of the Longhorn external backup found that **the erp documents
volume had never been backed up offsite** (`lastBackupAt = never`). Its Longhorn
volume is enrolled only in the `default` recurring-job group, but the single backup
job (`thrice-a-month-backup`) has `groups=[]`, so it serves *no* group, and the erp
volume (and erp-sandbox) fell through the cracks. Only in-cluster Longhorn replicas
protected `/var/www/documents` (issued invoice PDFs, supplier documents, contracts, ECM),
and replicas do not survive a cluster loss, corruption, or a power cut.

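To make the failure mode concrete, here is a minimal sketch of the matching rule as the audit describes it: a recurring job serves a volume only when the two share a group. The function name `serves` is hypothetical, and this models only the behaviour reported above, not Longhorn's full selection logic.

```python
def serves(job_groups, volume_groups):
    """Return True when a recurring job and a volume share at least one group."""
    return bool(set(job_groups) & set(volume_groups))


# Values taken from the audit described above.
backup_job_groups = []            # thrice-a-month-backup has groups=[]
erp_volume_groups = ["default"]   # the erp volume is enrolled only in `default`

print(serves(backup_job_groups, erp_volume_groups))  # False: the volume is never backed up
print(serves(["default"], erp_volume_groups))        # True: adding the group would close the gap
```
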
This tool backs up **both halves** of Dolibarr state to the existing object store
(`s3://arcodange-backup`, GCS via the S3-compatible API), under `erp/<env>/`:

| half | how | key |
|---|---|---|
| Postgres DB | `pg_dump -Fc` (restorable) | `erp/<env>/db/<ts>.dump` |
| documents PVC | `tar -czf` of `/var/www/documents` (RWX, mounted read-only) | `erp/<env>/docs/<ts>.tar.gz` |

It then prunes to a **tiered retention**: every daily backup for 30 days, the latest
backup of each month for 12 months, and the latest backup of each year for ~10 years.

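The retention tiers can be exercised in isolation. The sketch below mirrors the thresholds used by the pruning step in `backup-job.sh` (30, 365 and 3660 days) as a pure function; the name `prunable` and the sample keys are illustrative, not part of the tool.

```python
import datetime


def prunable(keys, today):
    """Return the keys the tiered retention would delete, oldest first.

    Keys are expected to start with a YYYY-MM-DD date (as in <ts>.dump).
    Keys that do not parse are ignored, so they are never deleted.
    """
    dated = []
    for key in keys:
        try:
            day = datetime.datetime.strptime(key[:10], "%Y-%m-%d").date()
        except ValueError:
            continue
        dated.append((day, key))
    dated.sort(key=lambda item: item[0])

    keep, by_month, by_year = set(), {}, {}
    for day, key in dated:
        age = (today - day).days
        if age <= 30:
            keep.add(key)                          # daily tier: keep everything
        elif age <= 365:
            by_month[(day.year, day.month)] = key  # monthly tier: latest per month wins
        elif age <= 3660:
            by_year[day.year] = key                # yearly tier: latest per year wins
        # anything older than 3660 days falls in no tier and is pruned
    keep |= set(by_month.values()) | set(by_year.values())
    return [key for _, key in dated if key not in keep]


today = datetime.date(2026, 6, 30)
keys = [
    "2015-01-01T02-00-00Z.dump",  # older than 3660 days
    "2024-03-10T02-00-00Z.dump",  # yearly tier, superseded by 2024-11-20
    "2024-11-20T02-00-00Z.dump",
    "2026-01-05T02-00-00Z.dump",  # monthly tier, superseded by 2026-01-25
    "2026-01-25T02-00-00Z.dump",
    "2026-06-29T02-00-00Z.dump",  # daily tier
]
print(prunable(keys, today))
# ['2015-01-01T02-00-00Z.dump', '2024-03-10T02-00-00Z.dump', '2026-01-05T02-00-00Z.dump']
```

Because the list is sorted by date before the tiers are filled, a later backup in the same month or year overwrites the earlier one, which is what makes "latest per period" the survivor.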
## Safety (mirrors `ops/sandbox/sandbox-lifecycle.sh`)

- **prod is read-only**: `pg_dump` and `tar` only read; the only writes go to the
  backup bucket, never to prod. The DB is read with the env's *own* dynamic creds
  (`vso-db-credentials`); prod and sandbox never cross.
- **S3 creds are never exposed**: the GCS HMAC secret is copied into a *transient*
  secret in the app namespace (values stay base64) and deleted on exit. The whole
  in-container script is shipped base64 (which avoids nested-heredoc quoting), and
  no secret value is ever printed.

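The "values stay base64" point can be illustrated with a small sketch of the manifest rewrite that the orchestrator performs inline with `python3 -c`. The function name `transient_copy` and the dummy secret values below are assumptions for illustration only; the key names and secret names are the ones used by the script.

```python
def transient_copy(secret, name, namespace, keys):
    """Build a copy of a Secret manifest under a new name and namespace.

    Only the listed data keys are carried over, and their values are passed
    through untouched (still base64), so nothing is decoded or printed.
    """
    copy = dict(secret)
    copy.pop("status", None)
    copy["metadata"] = {"name": name, "namespace": namespace}
    copy["data"] = {key: secret["data"][key] for key in keys}
    return copy


# Dummy manifest: the data values are placeholders, not real credentials.
source = {
    "apiVersion": "v1",
    "kind": "Secret",
    "type": "Opaque",
    "metadata": {"name": "longhorn-gcs-backup-credentials",
                 "namespace": "longhorn-system", "uid": "placeholder"},
    "data": {"AWS_ACCESS_KEY_ID": "ZHVtbXk=",
             "AWS_SECRET_ACCESS_KEY": "ZHVtbXk=",
             "AWS_ENDPOINTS": "ZHVtbXk=",
             "UNRELATED_KEY": "ZHVtbXk="},
}

copy = transient_copy(
    source, "dolibarr-backup-s3-temp", "erp",
    ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_ENDPOINTS"),
)
print(sorted(copy["data"]))  # only the three S3 keys are carried over
print(copy["metadata"])      # source-only metadata such as uid is dropped
```

Replacing `metadata` wholesale, rather than editing it, is what drops the source object's `uid`, `resourceVersion` and similar fields so the copy can be applied as a new object.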
## Usage

```sh
# one-shot backup + prune (run from anywhere; needs kubectl access to the lab cluster)
ops/backup/dolibarr-backup.sh backup --env prod
ops/backup/dolibarr-backup.sh backup --env sandbox

# what's in the store
ops/backup/dolibarr-backup.sh list --env prod
```

`backup-job.sh` is the in-container logic (env-driven: `BUCKET PREFIX DB PGHOST` plus
the mounted DB/S3 creds). It is the single source of truth, also intended for the
scheduled CronJob (see "Automation" below).

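As a rough illustration of that env-driven contract, the sketch below checks the four required variables and derives the two destination keys using the same timestamp layout as `backup-job.sh` (`%Y-%m-%dT%H-%M-%SZ`). The function name `destinations` is hypothetical, and the caller is assumed to pass a UTC time.

```python
import datetime

REQUIRED = ("BUCKET", "PREFIX", "DB", "PGHOST")


def destinations(env, now):
    """Validate the required variables and return the upload targets.

    `env` is a mapping of variable names to values; `now` is assumed to be UTC.
    An unset or empty variable is rejected, like `: "${VAR:?}"` in the script.
    """
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise KeyError("missing required variables: " + ", ".join(missing))
    ts = now.strftime("%Y-%m-%dT%H-%M-%SZ")
    base = "s3://{}/{}".format(env["BUCKET"], env["PREFIX"])
    return {
        "db": "{}/db/{}.dump".format(base, ts),
        "docs": "{}/docs/{}.tar.gz".format(base, ts),
    }


env = {
    "BUCKET": "arcodange-backup",
    "PREFIX": "erp/sandbox",
    "DB": "erp-sandbox",
    "PGHOST": "192.168.1.202",
}
targets = destinations(env, datetime.datetime(2026, 6, 30, 12, 0, 0))
print(targets["db"])    # s3://arcodange-backup/erp/sandbox/db/2026-06-30T12-00-00Z.dump
print(targets["docs"])  # s3://arcodange-backup/erp/sandbox/docs/2026-06-30T12-00-00Z.tar.gz
```

The timestamp starts with `YYYY-MM-DD`, which is the only part the retention pruning parses.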
**Status:** the first real prod backup was taken on 2026-06-30
(`erp/prod/db/…` 1.2 MB, `erp/prod/docs/…` 12.5 MB). The full flow (dump + tar +
GCS upload + retention prune) was proven end-to-end, live, on the sandbox.

## Restore (manual, for now)

```sh
# DB:   aws --endpoint-url "$AWS_ENDPOINTS" s3 cp s3://arcodange-backup/erp/<env>/db/<ts>.dump - | pg_restore -h <host> -U <user> -d <db> --clean
# docs: aws --endpoint-url "$AWS_ENDPOINTS" s3 cp s3://arcodange-backup/erp/<env>/docs/<ts>.tar.gz - | tar -C /var/www/documents -xzf -
```

The sandbox iso-prod refresh (`ops/sandbox/sandbox-lifecycle.sh`) is the natural
restore-drill bench. A `restore` subcommand is planned next; today it only validates
its arguments and then aborts.

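Until the subcommand exists, a small helper can at least assemble the two manual pipelines consistently. This is a sketch only: the function name `restore_pipeline`, the example endpoint URL and the example database target are assumptions, and the helper merely builds the command string without running anything.

```python
import shlex

BUCKET = "arcodange-backup"


def restore_pipeline(kind, env, key, endpoint, pg_target=None):
    """Return the shell pipeline that would restore one backup object.

    kind      -- "db" or "docs"
    pg_target -- (host, user, database) tuple, required when kind == "db"
    """
    source = "s3://{}/erp/{}/{}/{}".format(BUCKET, env, kind, key)
    fetch = ["aws", "--endpoint-url", endpoint, "s3", "cp", source, "-"]
    if kind == "db":
        if pg_target is None:
            raise ValueError("pg_target is required for a db restore")
        host, user, database = pg_target
        apply = ["pg_restore", "-h", host, "-U", user, "-d", database, "--clean"]
    elif kind == "docs":
        apply = ["tar", "-C", "/var/www/documents", "-xzf", "-"]
    else:
        raise ValueError("kind must be 'db' or 'docs'")
    return shlex.join(fetch) + " | " + shlex.join(apply)


# Hypothetical endpoint and timestamp, for illustration only.
print(restore_pipeline(
    "docs", "sandbox", "2026-06-30T12-00-00Z.tar.gz",
    "https://storage.googleapis.com",
))
```

Printing the pipeline instead of executing it keeps a destructive step behind an explicit copy-and-run by the operator, in the same spirit as the `--yes` guard on `restore`.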
## Automation (next step, gated on creds)

The recurring form is a k8s **CronJob** (ArgoCD-managed, in the chart) running the
same `backup-job.sh` daily. It needs its **own** S3 creds rather than borrowing the
Longhorn secret cross-namespace: a `VaultStaticSecret` in the erp namespace reading
the GCS backup creds, which requires the `erp` Vault role to be granted read on that
path (a `tools` change). Until that lands, run the orchestrator above on demand or
from a host cron; it works today by borrowing the Longhorn creds transiently.

> The generic Longhorn gap (the orphaned `default` group) should be fixed too, as a
> platform concern, but this dedicated, offsite, 10-year-retention backup is the
> one that matches Dolibarr's legal criticality.
56
ops/backup/backup-job.sh
Executable file
@@ -0,0 +1,56 @@
#!/bin/sh
# In-container backup logic for Dolibarr: the single source of truth shared by the
# manual orchestrator (ops/backup/dolibarr-backup.sh) and the scheduled CronJob
# (chart/templates/backup-cronjob.yaml). Driven entirely by environment:
#   BUCKET PREFIX DB PGHOST                                (config)
#   PGUSER PGPASSWORD                                      (DB creds, from vso-db-credentials)
#   AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_ENDPOINTS  (S3 creds)
# It dumps the DB (pg_dump -Fc) + tars the documents mounted at /docs, pushes both
# to s3://$BUCKET/$PREFIX/{db,docs}/, then prunes to a tiered retention.
set -eu
apk add --no-cache aws-cli tar gzip >/dev/null 2>&1 || { echo "ABORT apk add"; exit 1; }
: "${BUCKET:?}"; : "${PREFIX:?}"; : "${DB:?}"; : "${PGHOST:?}"
export AWS_DEFAULT_REGION="${AWS_DEFAULT_REGION:-us-east-1}"
# GCS / S3-compatible stores reject aws-cli v2.23+ default integrity checksums
# ("SignatureDoesNotMatch / Invalid argument"); only sign/validate when required.
export AWS_REQUEST_CHECKSUM_CALCULATION=when_required
export AWS_RESPONSE_CHECKSUM_VALIDATION=when_required
S3() { aws --endpoint-url "$AWS_ENDPOINTS" s3 "$@"; }

TS=$(date -u +%Y-%m-%dT%H-%M-%SZ)
echo "timestamp=$TS db=$DB -> s3://$BUCKET/$PREFIX"
pg_dump -h "$PGHOST" -U "$PGUSER" -d "$DB" -Fc -f /tmp/db.dump
echo "db.dump $(wc -c < /tmp/db.dump) bytes"
tar -C /docs -czf /tmp/docs.tar.gz . 2>/dev/null
echo "docs.tar.gz $(wc -c < /tmp/docs.tar.gz) bytes"
S3 cp /tmp/db.dump "s3://$BUCKET/$PREFIX/db/$TS.dump"
S3 cp /tmp/docs.tar.gz "s3://$BUCKET/$PREFIX/docs/$TS.tar.gz"
echo "uploaded to s3://$BUCKET/$PREFIX/{db,docs}/$TS.*"

# tiered retention: daily 30d / monthly 12m (latest per month) / yearly ~10y
cat > /tmp/prune.py <<'PY'
import sys, datetime
keys=[k.strip() for k in open(sys.argv[1]) if k.strip()]
now=datetime.datetime.strptime(sys.argv[2][:10], "%Y-%m-%d").date()
def d(k):
    try: return datetime.datetime.strptime(k[:10], "%Y-%m-%d").date()
    except Exception: return None
dated=sorted([(d(k),k) for k in keys if d(k)], key=lambda x:x[0])
keep=set(); bymonth={}; byyear={}
for dt,k in dated:
    age=(now-dt).days
    if age <= 30: keep.add(k)
    elif age <= 365: bymonth[(dt.year,dt.month)]=k
    elif age <= 3660: byyear[dt.year]=k
keep |= set(bymonth.values()) | set(byyear.values())
for dt,k in dated:
    if k not in keep: print(k)
PY
for SUB in db docs; do
  S3 ls "s3://$BUCKET/$PREFIX/$SUB/" | awk '{print $4}' > /tmp/keys.$SUB || true
  python3 /tmp/prune.py "/tmp/keys.$SUB" "$TS" > /tmp/del.$SUB || true
  while read -r DK; do
    [ -n "$DK" ] && S3 rm "s3://$BUCKET/$PREFIX/$SUB/$DK" && echo "pruned $SUB/$DK"
  done < /tmp/del.$SUB
done
echo "DONE."
171
ops/backup/dolibarr-backup.sh
Executable file
@@ -0,0 +1,171 @@
#!/usr/bin/env bash
#
# dolibarr-backup.sh: dedicated, offsite backup for the Arcodange Dolibarr ERP.
#
# Critical-data-aware (10-year accounting retention) and INDEPENDENT of the generic
# Longhorn platform backup, which today does NOT cover the erp volume (its volume
# sits in the orphaned `default` recurring-job group, lastBackupAt=never). Backs up
# BOTH halves of Dolibarr state to the existing object store (s3://arcodange-backup
# on GCS), under erp/<env>/:
#   - the Postgres DB (pg_dump -Fc, restorable)        -> erp/<env>/db/<ts>.dump
#   - the documents PVC (/var/www/documents, RWX, ro)  -> erp/<env>/docs/<ts>.tar.gz
# then prunes to a tiered retention: daily 30d, monthly 12m, yearly 10y.
#
# Safety, mirroring ops/sandbox/sandbox-lifecycle.sh:
#   - the DB is read with the app's OWN dynamic creds (vso-db-credentials), scoped
#     to its env; prod and sandbox never cross.
#   - S3 creds are a TRANSIENT copy of the Longhorn GCS secret (deleted on exit);
#     no secret value is ever printed.
#   - the whole in-container script is shipped base64 (no nested-heredoc/quoting).
#
# Usage:
#   dolibarr-backup.sh backup  [--env prod|sandbox]          # one-shot backup + prune
#   dolibarr-backup.sh list    [--env prod|sandbox]          # what's in the store
#   dolibarr-backup.sh restore --db   <key> --env <e> --yes  # restore DB (DESTRUCTIVE)
#   dolibarr-backup.sh restore --docs <key> --env <e> --yes  # restore documents
# (restore is not implemented yet: it validates its arguments, then aborts.)
#
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

PG_IMAGE="postgres:16-alpine"
PGHOST="192.168.1.202"        # direct Postgres (NOT pgbouncer)
BUCKET="${ARCO_BACKUP_BUCKET:-arcodange-backup}"
S3_SRC_NS="longhorn-system"   # where the GCS HMAC creds live today
S3_SRC_SECRET="longhorn-gcs-backup-credentials"
TMP_S3_SECRET="dolibarr-backup-s3-temp"

log() { printf '\033[1;36m==>\033[0m %s\n' "$*"; }
die() { printf '\033[1;31mABORT:\033[0m %s\n' "$*" >&2; exit 1; }

CMD="${1:-}"; shift || true
ENV="prod"; KEY=""; KIND=""; YES=0
while [[ $# -gt 0 ]]; do
  case "$1" in
    --env)  ENV="${2:?}"; shift 2 ;;
    --db)   KIND="db"; KEY="${2:?}"; shift 2 ;;
    --docs) KIND="docs"; KEY="${2:?}"; shift 2 ;;
    --yes)  YES=1; shift ;;
    *) die "unknown arg '$1'" ;;
  esac
done

case "$ENV" in
  prod)    NS="erp"; DB="erp" ;;
  sandbox) NS="erp-sandbox"; DB="erp-sandbox" ;;
  *) die "--env must be prod|sandbox" ;;
esac
PVC="$NS"
PREFIX="${ARCO_BACKUP_PREFIX:-erp/${ENV}}"

# in-container preamble: install tools, export region, define S3()
read -r -d '' PREAMBLE <<'SH' || true
set -eu
apk add --no-cache aws-cli tar gzip >/dev/null 2>&1 || { echo "ABORT apk add"; exit 1; }
export AWS_DEFAULT_REGION="${AWS_DEFAULT_REGION:-us-east-1}"
# GCS / S3-compatible stores reject aws-cli v2.23+ default integrity checksums
# ("SignatureDoesNotMatch / Invalid argument"); only sign/validate when required.
export AWS_REQUEST_CHECKSUM_CALCULATION=when_required
export AWS_RESPONSE_CHECKSUM_VALIDATION=when_required
aws --version 2>&1 | head -1
S3() { aws --endpoint-url "$AWS_ENDPOINTS" s3 "$@"; }
SH

copy_s3_secret() {
  command -v python3 >/dev/null || die "python3 required to copy the S3 secret without exposing it"
  kubectl get secret "$S3_SRC_SECRET" -n "$S3_SRC_NS" -o json \
    | python3 -c "import json,sys; d=json.load(sys.stdin); d['metadata']={'name':'$TMP_S3_SECRET','namespace':'$NS'}; d.pop('status',None); d['data']={k:d['data'][k] for k in ('AWS_ACCESS_KEY_ID','AWS_SECRET_ACCESS_KEY','AWS_ENDPOINTS')}; print(json.dumps(d))" \
    | kubectl apply -f - >/dev/null
}
cleanup_secret() { kubectl delete secret "$TMP_S3_SECRET" -n "$NS" --ignore-not-found >/dev/null 2>&1 || true; }

# b64-encode an in-container script (host vars already substituted by the caller)
b64() { printf '%s' "$1" | base64 | tr -d '\n'; }

run_backup() {
  trap cleanup_secret EXIT
  log "Copying GCS creds into a transient secret in $NS (values stay base64)"
  copy_s3_secret
  log "Backup ${ENV}: DB=$DB PVC=$PVC -> s3://$BUCKET/$PREFIX/{db,docs}/"
  local B64; B64="$(b64 "$(cat "${SCRIPT_DIR}/backup-job.sh")")"
  kubectl delete job dolibarr-backup -n "$NS" --ignore-not-found >/dev/null 2>&1 || true
  kubectl apply -f - >/dev/null <<EOF
apiVersion: batch/v1
kind: Job
metadata: { name: dolibarr-backup, namespace: $NS }
spec:
  backoffLimit: 0
  ttlSecondsAfterFinished: 600
  template:
    spec:
      restartPolicy: Never
      volumes:
        - name: docs
          persistentVolumeClaim: { claimName: $PVC, readOnly: true }
      containers:
        - name: backup
          image: $PG_IMAGE
          envFrom:
            - secretRef: { name: $TMP_S3_SECRET }
          env:
            - { name: BUCKET, value: "$BUCKET" }
            - { name: PREFIX, value: "$PREFIX" }
            - { name: DB, value: "$DB" }
            - { name: PGHOST, value: "$PGHOST" }
            - { name: PGUSER, valueFrom: { secretKeyRef: { name: vso-db-credentials, key: username } } }
            - { name: PGPASSWORD, valueFrom: { secretKeyRef: { name: vso-db-credentials, key: password } } }
          volumeMounts:
            - { name: docs, mountPath: /docs, readOnly: true }
          command: ["/bin/sh","-c"]
          args: ["echo $B64 | base64 -d | sh"]
EOF
  kubectl wait --for=condition=complete job/dolibarr-backup -n "$NS" --timeout=300s >/dev/null 2>&1 \
    || die "backup Job did not complete; see: kubectl logs -n $NS job/dolibarr-backup"
  kubectl logs -n "$NS" job/dolibarr-backup | sed 's/^/ /'
  kubectl delete job dolibarr-backup -n "$NS" --ignore-not-found >/dev/null 2>&1 || true
  cleanup_secret; trap - EXIT
  log "Backup complete."
}

run_list() {
  trap cleanup_secret EXIT; copy_s3_secret
  local SCRIPT
  SCRIPT="$(cat <<EOF
$PREAMBLE
echo "db/:"; S3 ls "s3://$BUCKET/$PREFIX/db/" || echo " (empty)"
echo "docs/:"; S3 ls "s3://$BUCKET/$PREFIX/docs/" || echo " (empty)"
EOF
)"
  kubectl delete job dolibarr-backup-list -n "$NS" --ignore-not-found >/dev/null 2>&1 || true
  kubectl apply -f - >/dev/null <<EOF
apiVersion: batch/v1
kind: Job
metadata: { name: dolibarr-backup-list, namespace: $NS }
spec:
  backoffLimit: 0
  ttlSecondsAfterFinished: 300
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: list
          image: $PG_IMAGE
          envFrom: [ { secretRef: { name: $TMP_S3_SECRET } } ]
          command: ["/bin/sh","-c"]
          args: ["echo $(b64 "$SCRIPT") | base64 -d | sh"]
EOF
  kubectl wait --for=condition=complete job/dolibarr-backup-list -n "$NS" --timeout=180s >/dev/null 2>&1 || true
  kubectl logs -n "$NS" job/dolibarr-backup-list 2>/dev/null | sed 's/^/ /'
  kubectl delete job dolibarr-backup-list -n "$NS" --ignore-not-found >/dev/null 2>&1 || true
  cleanup_secret; trap - EXIT
}

case "$CMD" in
  backup) run_backup ;;
  list)   run_list ;;
  restore)
    [[ -n "$KEY" && -n "$KIND" ]] || die "restore needs --db <key> or --docs <key>"
    [[ "$YES" == "1" ]] || die "restore is DESTRUCTIVE on '$ENV'; re-run with --yes"
    die "restore: wired in the chart Job (next iteration); key=$KEY kind=$KIND env=$ENV"
    ;;
  *) echo "usage: $0 {backup|list|restore} [--env prod|sandbox] [--db|--docs <key>] [--yes]" >&2; exit 2 ;;
esac