Accounting data and issued documents must legally be retained for 10 years, which
warrants a backup dedicated to Dolibarr. An audit found that the generic Longhorn
external backup NEVER covered the erp volume: its Longhorn volume sits in the orphaned
`default` recurring-job group, the only job has groups=[] and therefore serves no
volume, and lastBackupAt=never. As a result, /var/www/documents (invoice PDFs, supplier
documents, contracts, ECM) had zero offsite copy, only in-cluster replicas.
ops/backup/dolibarr-backup.sh (orchestrator) + ops/backup/backup-job.sh (in-container
logic, env-driven, single source of truth):
- pg_dump -Fc of the DB + tar of the documents PVC (RWX, read-only mount) ->
s3://arcodange-backup/erp/<env>/{db,docs}/<ts>, then tiered prune (daily 30d /
monthly 12m / yearly 10y).
- prod is only ever READ (pg_dump + tar); the only writes go to the backup bucket. The
DB is read with the env's own dynamic creds; the GCS HMAC secret is copied transiently
(base64, deleted on exit) and never printed; the whole script is shipped base64-encoded.
- fixes the aws-cli v2.23+ default-checksum incompatibility with GCS/S3-compat
(SignatureDoesNotMatch) by setting AWS_REQUEST_CHECKSUM_CALCULATION=when_required and
AWS_RESPONSE_CHECKSUM_VALIDATION=when_required.
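For reference, the checksum workaround boils down to two environment variables that
must be set before any aws call. A minimal sketch; the commented upload line, its
endpoint and the $dump/$env/$ts variables are illustrative assumptions, not copied from
backup-job.sh:

```shell
# Only compute/validate the newer default checksums when the service requires
# them, which avoids SignatureDoesNotMatch on S3-compatible endpoints.
export AWS_REQUEST_CHECKSUM_CALCULATION=when_required
export AWS_RESPONSE_CHECKSUM_VALIDATION=when_required

# Illustrative upload through the GCS S3-compatible endpoint (assumed form):
#   aws s3 cp "$dump" "s3://arcodange-backup/erp/$env/db/$ts" \
#     --endpoint-url https://storage.googleapis.com
```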
Proven live: sandbox end-to-end run (dump + tar + upload + prune, verified in GCS, then
cleaned up), and the retention logic is unit-tested (1100 daily backups -> 46 kept).
The FIRST real prod backup has been taken (erp/prod/db 1.2 MB + erp/prod/docs 12.5 MB),
which closes the gap as of now.
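To illustrate what a tiered prune decides, here is a rough sketch of a
daily/monthly/yearly selection. It is NOT the logic in backup-job.sh: the helper name
select_keep, the "earliest backup of each month/year" rule and the 365/3653-day cutoffs
are assumptions made for the example, and it relies on GNU date for the -d option.

```shell
# Hypothetical helper: reads backup dates (YYYY-MM-DD, oldest first) on stdin
# and prints the ones a daily/monthly/yearly policy would keep.
select_keep() {
  now=$(date -u -d "$1" +%s)          # $1 = reference day, e.g. today
  prev_month='' prev_year=''
  while IFS= read -r d; do
    [ -n "$d" ] || continue
    ts=$(date -u -d "$d" +%s)
    age=$(( (now - ts) / 86400 ))     # age in whole days
    month=${d%-*}                     # YYYY-MM
    year=${d%%-*}                     # YYYY
    keep=no
    # Daily tier: everything from the last 30 days.
    if [ "$age" -le 30 ]; then keep=yes; fi
    # Monthly tier: earliest backup of each month, for about 12 months.
    if [ "$month" != "$prev_month" ] && [ "$age" -le 365 ]; then keep=yes; fi
    # Yearly tier: earliest backup of each year, for about 10 years.
    if [ "$year" != "$prev_year" ] && [ "$age" -le 3653 ]; then keep=yes; fi
    prev_month=$month prev_year=$year
    if [ "$keep" = yes ]; then printf '%s\n' "$d"; fi
  done
  return 0
}
```

Possible usage: `sort timestamps.txt | select_keep "$(date -u +%F)"`; anything not
printed would be a prune candidate.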
Automation (recurring CronJob in the chart + a dedicated erp Vault policy for its
own S3 creds) is the documented next step; the orchestrator works today on demand.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
refresh-from-prod was structurally broken and silently no-op'd the restore:
1. pg_restore lacked -U, so the postgres image connected as its OS user `root`
and auth-failed. The failure was swallowed by `|| echo "ignorable warnings"`,
so the script reported success while the DROP OWNED had already emptied the DB.
E2's original seed was a manual process, so this path had never really run.
Fix: pass `-h $PGHOST -U $SB_PGUSER`, and stop trusting pg_restore's exit code (it
returns non-zero on the harmless "schema public already exists" notice). Instead,
verify by counting restored llx_* tables and FAIL the Job if there are fewer than 250.
2. erp-sandbox is ArgoCD-managed with self-heal ON, which reverts the
`kubectl scale --replicas=0` within seconds — so the seed ran with Dolibarr
still connected. Fix: pause self-heal for the duration, re-arm it after; app
restore + self-heal restoration + secret cleanup are guarded by an EXIT trap so
an interrupt can't strand the sandbox at replicas=0 / self-heal off.
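The EXIT-trap guard described in item 2 can be sketched as follows. This is a minimal
illustration, not the script's code: every function name is a placeholder, and the
echo bodies stand in for the real kubectl calls (scaling the Deployment, toggling the
Argo CD Application's selfHeal flag, deleting the transient secret).

```shell
# Placeholder steps (hypothetical names); real versions would call kubectl.
pause_self_heal()         { echo "pause self-heal"; }
scale_down()              { echo "scale to 0"; }
restore_app()             { echo "restore replicas"; }
rearm_self_heal()         { echo "re-arm self-heal"; }
delete_transient_secret() { echo "delete transient secret"; }

run_guarded() {
  # Subshell so the traps are scoped to this one operation.
  (
    cleanup() {
      restore_app
      rearm_self_heal
      delete_transient_secret
    }
    trap cleanup EXIT            # runs on success, failure or interrupt
    trap 'exit 130' INT TERM     # turn signals into an exit so EXIT fires
    pause_self_heal && scale_down && "$@"   # "$@" = the seed/restore step
  )
}
```

Calling `run_guarded false` simulates a failing restore: the cleanup steps still run
and the non-zero status is passed back to the caller.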
Validated end-to-end on the live sandbox: 295 llx tables, company=Arcodange,
owner=erp_sandbox_role, self-heal re-armed, pod 1/1. README documents the self-heal
pause and the consequence of the sandbox becoming an exact copy of prod
(ai_agent_sandbox is wiped and must be re-provisioned).
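A sketch of the "count tables instead of trusting the exit code" check. The helper
names, the query against pg_tables and the use of $PGDATABASE are assumptions for
illustration; only the llx_ prefix, the 250 threshold and the `-h $PGHOST -U $SB_PGUSER`
flags come from this change.

```shell
# Hypothetical helper: fails unless at least $2 (default 250) tables were found.
check_restored_tables() {
  count=$1
  min=${2:-250}
  if [ "$count" -lt "$min" ]; then
    echo "restore check FAILED: $count llx_ tables (< $min)" >&2
    return 1
  fi
  echo "restore check OK: $count llx_ tables"
}

# Assumed query: count tables in schema public whose name starts with llx_.
count_llx_tables() {
  psql -h "$PGHOST" -U "$SB_PGUSER" -d "$PGDATABASE" -tA -c \
    "SELECT count(*) FROM pg_tables WHERE schemaname = 'public' AND tablename ~ '^llx_'"
}
```

Possible usage at the end of the Job: `check_restored_tables "$(count_llx_tables)" || exit 1`.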
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Productionizes the sandbox state-lifecycle mechanisms validated live against
erp-sandbox. `ops/sandbox/sandbox-lifecycle.sh`:
- refresh-from-prod: read-only pg_dump of prod erp (default_transaction_read_only)
-> DROP OWNED BY erp_sandbox_role CASCADE -> pg_restore into erp-sandbox, using
the sandbox's own membership creds (no DROP/CREATE DATABASE, no CREATEDB, no
superuser). Dumps the full public schema (so app helper functions + triggers
come over) and filters the provisioner-owned pgbouncer user_lookup function
from the restore TOC. Scales the pod to 0 for exclusive access; copies prod
creds into a transient secret that is deleted on exit.
- sync-documents: tar-pipe the documents/mycompany tree (company logo + uploads)
prod -> sandbox, since uploaded files live on the PVC, not the DB.
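Filtering an entry out of the restore TOC relies on pg_restore's -l (list) and -L
(use-list) options. A minimal sketch, assuming the TOC line for the function looks like
"<id>; <oids> FUNCTION <schema> user_lookup(...) <owner>"; the helper name filter_toc
and the file names are hypothetical:

```shell
# Hypothetical helper: drop the user_lookup function entry from a TOC listing
# read on stdin (the text format produced by `pg_restore -l`).
filter_toc() {
  grep -v -E ' FUNCTION [^ ]+ user_lookup\('
}

# Assumed usage, with dump.pgc being a custom-format dump:
#   pg_restore -l dump.pgc | filter_toc > restore.toc
#   pg_restore -L restore.toc -h "$PGHOST" -U "$SB_PGUSER" -d "$PGDATABASE" dump.pgc
```

Only the function entry itself is matched here; related ACL or COMMENT entries, if
any, would need their own pattern.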
Prod integrity is structural: prod is read-only during dump; the restore can only
write erp-sandbox (erp_sandbox_role owns only the sandbox DB and cannot drop prod
erp/erp_role); the platform's only prod-capable superuser stays behind the
human-gated postgres.yaml CI and is never used here.
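One way to make the prod session read-only is to pass default_transaction_read_only
through PGOPTIONS, which libpq forwards to the server at connect time. A sketch under
assumptions: the function name, the -Fc/--schema=public flags and the PG_DUMP override
(a hook for dry runs) are illustrative and may differ from sandbox-lifecycle.sh.

```shell
# Hypothetical wrapper: dump with every transaction defaulting to read-only.
# $1 = output file; host, user and database come from the usual PG* variables.
dump_prod_readonly() {
  PGOPTIONS='-c default_transaction_read_only=on' \
    ${PG_DUMP:-pg_dump} -Fc --schema=public -f "$1"
}
```

With this setting, any statement that tries to write on the prod connection is rejected
by the server unless the session explicitly overrides the default.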
README documents the integrity guarantee, the encryption + PVC fidelity caveats,
the BDD reset loop, and the hardening backlog (dedicated read-only dump role,
golden-cache PVC).
Refs ADR-0003 (factory#19). Chart owner-role fix = erp#13.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>