erp/ops/backup/README.md

# Dolibarr dedicated backup

A backup strategy **dedicated to Dolibarr**: the accounting data and the issued
documents are critical and must legally be retained for **10 years**, so they warrant
more than the generic platform backup.

## Why this exists (the gap it closes)

On 2026-06-30 an audit of the Longhorn external backup found that **the erp documents
volume had never been backed up offsite** (`lastBackupAt = never`). Its Longhorn
volume is enrolled only in the `default` recurring-job group, but the single backup
job (`thrice-a-month-backup`) has `groups=[]` and therefore serves *no* group, so the
erp volume (and erp-sandbox) fell through the cracks. Only in-cluster Longhorn replicas
protected `/var/www/documents` (issued invoice PDFs, supplier documents, contracts, ECM),
and replicas do not survive a cluster loss, corruption, or a power cut.

This tool backs up **both halves** of Dolibarr state to the existing object store
(`s3://arcodange-backup`, GCS via the S3-compatible API), under `erp/<env>/`:

| half | how | key |
|---|---|---|
| Postgres DB | `pg_dump -Fc` (restorable) | `erp/<env>/db/<ts>.dump` |
| documents PVC | `tar -czf` of `/var/www/documents` (RWX, mounted read-only) | `erp/<env>/docs/<ts>.tar.gz` |

It then prunes the store to a **tiered retention**: daily backups for 30 days, monthly
for 12 months, yearly for ~10 years.
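
The tiering can be pictured as a filter over backup timestamps. The sketch below is illustrative only and is not the pruning code in `backup-job.sh`. It assumes timestamps that start with `YYYYMMDD` (for example `20260630T020000Z`), approximates 12 months as 365 days and 10 years as 3650 days, and keeps the newest backup of each month or year. `retention_keep` is a hypothetical name.

```sh
# Hypothetical sketch of a tiered-retention filter (not the real backup-job.sh).
# stdin: one backup timestamp per line, assumed to start with YYYYMMDD.
# $1:    "today" as YYYYMMDD.
# stdout: the timestamps to KEEP, newest first.
retention_keep() {
  sort -r | awk -v today="$1" '
    # Julian day number of a Gregorian date, used only to subtract dates.
    function jdn(y, m, d,    a, yy, mm) {
      a = int((14 - m) / 12); yy = y + 4800 - a; mm = m + 12 * a - 3
      return d + int((153 * mm + 2) / 5) + 365 * yy \
               + int(yy / 4) - int(yy / 100) + int(yy / 400) - 32045
    }
    function day_of(s) {
      return jdn(substr(s, 1, 4) + 0, substr(s, 5, 2) + 0, substr(s, 7, 2) + 0)
    }
    BEGIN { now = day_of(today) }
    {
      age = now - day_of($0)
      month = substr($0, 1, 6); year = substr($0, 1, 4)
      if (age <= 30) {
        print                                   # daily tier: keep everything
      } else if (age <= 365) {
        if (!(month in seen_m)) { seen_m[month] = 1; print }   # newest per month
      } else if (age <= 3650) {
        if (!(year in seen_y)) { seen_y[year] = 1; print }     # newest per year
      }
      # older than ~10 years: not printed, so a candidate for deletion
    }'
}
```

Any timestamp that is read but not printed would be a prune candidate. Keeping the newest rather than the oldest entry per period is a design choice of this sketch, not something the tool is known to do.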

## Safety (mirrors `ops/sandbox/sandbox-lifecycle.sh`)

- **prod is read-only**: `pg_dump` and `tar` only read; the only writes go to the
  backup bucket, never to prod. The DB is read with the env's *own* dynamic creds
  (`vso-db-credentials`); prod and sandbox never cross.
- **S3 creds are never exposed**: the GCS HMAC secret is copied into a *transient*
  secret in the app namespace (values stay base64-encoded) and deleted on exit. The
  whole in-container script is shipped base64-encoded, and no secret is ever printed.
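
The transient-secret pattern can be sketched as follows. This is an assumption about the shape of the approach, not the orchestrator's actual code: `with_transient_secret` and all namespace and secret names passed to it are hypothetical, and the sketch assumes `kubectl` and `jq` are available. Only the base64 `data` map is copied, so no value is decoded or echoed.

```sh
# Hypothetical sketch: run a command while a transient copy of a secret exists,
# and always delete the copy afterwards. Not the real orchestrator code.
# Usage: with_transient_secret SRC_NS SRC_NAME DST_NS TMP_NAME command [args...]
with_transient_secret() {
  src_ns="$1"; src_name="$2"; dst_ns="$3"; tmp_name="$4"; shift 4
  (
    # The subshell's EXIT trap fires whether the command succeeds or fails.
    trap 'kubectl delete secret "$tmp_name" -n "$dst_ns" --ignore-not-found >/dev/null' EXIT

    # Copy only type + base64 data under a new name; nothing is decoded or printed.
    kubectl get secret "$src_name" -n "$src_ns" -o json \
      | jq --arg n "$tmp_name" \
          '{apiVersion: "v1", kind: "Secret", type: .type, metadata: {name: $n}, data: .data}' \
      | kubectl apply -n "$dst_ns" -f - >/dev/null || exit 1

    "$@"
  )
}
```

Running the body in a subshell keeps the trap local, so the caller's own traps are left alone and the command's exit status is passed through.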

## Usage

```sh
# one-shot backup + prune (run from anywhere; needs kubectl access to the lab cluster)
ops/backup/dolibarr-backup.sh backup --env prod
ops/backup/dolibarr-backup.sh backup --env sandbox

# what's in the store
ops/backup/dolibarr-backup.sh list --env prod
```

`backup-job.sh` is the in-container logic. It is env-driven (`BUCKET`, `PREFIX`, `DB`,
`PGHOST`, plus the mounted DB/S3 creds) and is the single source of truth, also
intended for the scheduled CronJob (see "Automation" below).
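
As a rough picture of what an env-driven job of this kind does, here is a minimal sketch. It is not the contents of `backup-job.sh`. It assumes `PREFIX` holds `erp/<env>`, that DB credentials reach `pg_dump` through the standard libpq variables (`PGUSER`, `PGPASSWORD`), that the `aws` CLI is already configured for the object store, and that timestamps look like `20260630T020000Z`. The function names are hypothetical.

```sh
# Hypothetical sketch of an env-driven backup step (not the real backup-job.sh).
# Expects BUCKET, PREFIX, DB and PGHOST in the environment.

# Object keys following the erp/<env>/{db,docs}/<ts> layout.
db_key()   { printf 's3://%s/%s/db/%s.dump'     "$BUCKET" "$PREFIX" "$1"; }
docs_key() { printf 's3://%s/%s/docs/%s.tar.gz' "$BUCKET" "$PREFIX" "$1"; }

run_backup() {
  ts=$(date -u +%Y%m%dT%H%M%SZ)   # assumed timestamp format

  # Both halves are streamed straight to the store; nothing is staged on disk.
  pg_dump -Fc -h "$PGHOST" "$DB" | aws s3 cp - "$(db_key "$ts")" || return 1
  tar -C /var/www/documents -czf - . | aws s3 cp - "$(docs_key "$ts")" || return 1
}
```

Note that a plain pipeline only reports the status of its last command, so a real job would also need to detect a failing `pg_dump` or `tar` upstream of the upload.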

**Status:** the first real prod backup was taken on 2026-06-30
(`erp/prod/db/…` 1.2 MB, `erp/prod/docs/…` 12.5 MB). The full path (dump + tar + GCS
upload + retention prune) has been proven end-to-end, live, on the sandbox.

## Restore (manual, for now)

```sh
# DB:    aws s3 cp s3://arcodange-backup/erp/<env>/db/<ts>.dump - | pg_restore -h <host> -U <user> -d <db> --clean
# docs:  aws s3 cp s3://arcodange-backup/erp/<env>/docs/<ts>.tar.gz - | tar -C /var/www/documents -xzf -
```
The sandbox iso-prod (prod-identical) refresh (`ops/sandbox/sandbox-lifecycle.sh`) is
the natural bench for restore drills. A `restore` subcommand is the next thing to be
wired in.
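
For a manual restore, the `<ts>` placeholder first has to be resolved to a real object. A minimal sketch, assuming the `aws` CLI is configured for the store and that timestamps sort lexicographically (as `YYYYMMDD...` names do). `latest_backup` is a hypothetical helper, not part of the tool.

```sh
# Hypothetical helper: print the name of the newest object under an S3 prefix.
# $1 = prefix ending in "/", e.g. s3://arcodange-backup/erp/sandbox/db/
latest_backup() {
  # `aws s3 ls` prints "date time size name" per object; "PRE dir/" lines have
  # fewer fields and are skipped. Lexicographic sort == chronological here.
  aws s3 ls "$1" | awk 'NF >= 4 { print $4 }' | sort | tail -n 1
}
```

The printed name can then be substituted into the `aws s3 cp` commands above, ideally against the sandbox first.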

## Automation (next step — gated on creds)

The recurring form is a k8s **CronJob** (ArgoCD-managed, in the chart) running the
same `backup-job.sh` daily. It needs its **own** S3 creds rather than borrowing the
Longhorn secret cross-namespace: a `VaultStaticSecret` in the erp namespace reading
the GCS backup creds, which requires the `erp` Vault role to be granted read on that
path (a `tools` change). Until that lands, run the orchestrator above
(`dolibarr-backup.sh`) on demand or from a host cron; it works today by transiently
borrowing the Longhorn creds.

> The generic Longhorn gap (the orphaned `default` group) should be fixed too, as a
> platform concern — but this dedicated, offsite, 10-year-retention backup is the
> one that matches Dolibarr's legal criticality.