factory/vibe/investigations/INV-001-prod-blast-radius-couplings.md
Gabriel Radureau 7647a68cdc docs(vibe): bootstrap vibe/ knowledge tree + ecosystem AGENTS.md
Add a root AGENTS.md (ecosystem map of factory/tools/cms + agent operating
rules + the persona cohort & workflow) and a new vibe/ knowledge base for LLM
agents, modeled on tree-docs conventions and the factory house style.

vibe/ folders (each with a README hub + contribution rules):
- ADR/      optimized MADR-lite; canonical home going forward (doc/adr stays historical)
- PRD/      one subfolder per PRD, mandatory STATUS.md, QA strategy for big ones
- investigations/  single INV-NNN-slug.md, or stub + folder w/ notebooks
- guidebooks/      tree-docs maps; lab-ecosystem guidebook of factory+tools+cms
- runbooks/        [AGENT]/[HUMAN] step procedures (EN; doc/runbooks stays FR)
- shareouts/       dated FR handouts (decks/mp4)

Seed content (first ADR + PRD): a safe, production-like environment to rehearse
risky changes and recovery without touching real prod — local-only sandbox
(k3d + arm64 VMs) with a hard prod/sandbox isolation boundary. Includes
INV-001 (prod blast-radius couplings), the ecosystem guidebook, and a FR shareout.

Conventions enforced: no-tombstone rule, breadcrumb spine, bidirectional
cross-links, theme:base mermaid (MCP-validated) + ordered-list-after-diagram.
Built with a Workflow + persona cohort; 24 files, zero dead links.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 11:52:37 +02:00


vibe > Investigations > INV-001 · Prod blast-radius couplings

INV-001: Prod blast-radius couplings

Status: Complete
Date: 2026-06-23
Priority: 🔴 P1
Related: ADR-0001 · Safe prod-like environment · PRD · Isolation boundary

Note

Origin. This investigation was spun out of the safe-environment ADR and PRD design work, to enumerate exactly which prod couplings a sandbox must isolate before any sandbox run can be trusted not to mutate live production.

Objectives

What this investigation set out to answer:

  • Enumerate every place where the GitOps repos hardcode a live prod endpoint, credential path, state location, or auto-reconciling controller.
  • For each, cite the concrete file and value so the isolation control can be written against a known target.
  • State the blast radius of an accidental sandbox-targets-prod mistake per coupling.
  • Name the sandbox control that severs each coupling (cross-referenced to the PRD isolation boundary).
  • Classify severity so Phase 0 guardrails know what must land first.

Executive summary

The lab is administered from a single MacBook holding kubeconfig, the Vault unseal key, and every cloud admin token; the repos point at live prod by default, so any sandbox seeded from those same repos will hit prod unless explicitly fenced off.

The worst couplings are:

  • The PostgreSQL superuser provider hardwired to 192.168.1.202 (a wrong apply can DROP/ALTER the live ERP, CMS and other business DBs).
  • The Vault unseal key at a fixed path ~/.arcodange/cluster-keys.json (a botched sandbox init could overwrite the one key that unseals prod).
  • ArgoCD app-of-apps with prune: true + selfHeal: true pointed at targetRevision: HEAD of the live Gitea (an auto-reconcile can delete live resources fleet-wide).

Secondary but real: the Ansible inventory targeting 192.168.1.201-203, the single GCS state bucket arcodange-tf shared across all stacks, and the Cloudflare/OVH/Zoho tokens that control public arcodange.fr DNS and email.

Each coupling has a clean sandbox control (separate inventory + prod-IP abort guard, separate Vault + unseal path, separate state prefix family, plan-only DNS). None of these controls exist yet; Phase 0 must build them before the first sandbox run.

Findings

Each finding leads with a one-line Brief, then the concrete evidence (file + value), then the bold Finding with a severity tag.

1. PostgreSQL superuser provider hardwired to the live host

Brief. The Postgres OpenTofu stack connects as superuser straight to the production database host with no environment switch.

Evidence — postgres/iac/providers.tf:

provider "postgresql" {
  host      = "192.168.1.202"
  username  = var.POSTGRES_USERNAME
  password  = var.POSTGRES_PASSWORD
  sslmode   = "disable"
  superuser = true
}

The host is a literal — there is no var.pg_host, no workspace gate, no profile. This is the same PG instance (the docker-compose on pi2, outside k3s) that backs the Dolibarr ERP, the CMS, and every app DB created per the <app> convention. A superuser session here can DROP DATABASE, ALTER ROLE, or revoke logins on any of them. Running tofu apply from a sandbox shell that still has the prod state and creds wired would act on live data.

Finding: Highest-risk coupling. A single wrong apply against 192.168.1.202 as superuser can cause irreversible ERP/business data loss. The sandbox PG must be the docker-compose on the sandbox "pi2-equivalent", and a guard must refuse to apply when host == 192.168.1.202 and workspace != prod. 🔴
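The guard condition in this finding can be written down as a small predicate. The following is a minimal sketch only: the function names are hypothetical, and how the host and workspace values reach the guard (for example a wrapper that reads them before calling tofu) is an assumption, not something the repos define today.

```python
PROD_PG_HOST = "192.168.1.202"  # literal taken from postgres/iac/providers.tf


def pg_apply_allowed(host: str, workspace: str) -> bool:
    """Return False only for the dangerous combination named in the finding:
    the target is the prod PG host while the workspace is not prod."""
    return not (host.strip() == PROD_PG_HOST and workspace != "prod")


def guard_pg_apply(host: str, workspace: str) -> None:
    """Hypothetical pre-apply hook: abort before any superuser session opens."""
    if not pg_apply_allowed(host, workspace):
        raise SystemExit(
            f"refusing apply: host {host} is prod but workspace is {workspace!r}"
        )
```

The check is deliberately narrow: it encodes exactly host == 192.168.1.202 and workspace != prod, so it does not block a legitimate prod apply.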

2. Vault address + unseal key at a fixed local path

Brief. Every IaC stack authenticates to the one prod Vault, and the unseal key sits at a single hardcoded path that a sandbox init could clobber.

Evidence — both iac/providers.tf and postgres/iac/providers.tf declare:

provider "vault" {
  address = "https://vault.arcodange.lab"
  auth_login_jwt {
    mount = "gitea_jwt"
    role  = "gitea_cicd"
  }
}

The prod unseal key (1 key, threshold 1) lives at ~/.arcodange/cluster-keys.json — the single secret that brings prod Vault back after a seal (see the lab-root recovery runbook, named below). Vault is the single source of truth for all secrets, so a policy/auth/mount change against vault.arcodange.lab, or an init/operator step that writes to the default key path, has fleet-wide reach: VSO across every namespace re-reads from this Vault.

Finding: Two failure modes. (a) Sandbox IaC pointed at vault.arcodange.lab can rewrite prod policies/auth and lock out VSO → fleet-wide secret outage. (b) A botched sandbox vault operator init writing to ~/.arcodange/cluster-keys.json would overwrite the prod unseal key, making prod unrecoverable after the next seal. Sandbox needs a separate Vault and the unseal-key path overridden to ~/.arcodange/sandbox/cluster-keys.json. 🔴
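Failure mode (b) comes down to which file path a sandbox init is allowed to touch. As a sketch under stated assumptions (the env names "prod" and "sandbox" and the helper name are hypothetical; only the two paths come from this investigation), the path selection could be centralised so that a sandbox run can never resolve to the prod key file:

```python
import os
from pathlib import Path
from typing import Optional

# Both paths are quoted from this investigation.
PROD_KEY_PATH = Path(os.path.expanduser("~/.arcodange/cluster-keys.json"))
SANDBOX_KEY_PATH = Path(os.path.expanduser("~/.arcodange/sandbox/cluster-keys.json"))


def unseal_key_path(env: str, override: Optional[str] = None) -> Path:
    """Resolve the unseal-key file for an environment.

    A sandbox run is refused if it would resolve to the prod key file,
    even when the path is passed explicitly as an override.
    """
    if env not in ("prod", "sandbox"):
        raise ValueError(f"unknown environment: {env!r}")
    default = PROD_KEY_PATH if env == "prod" else SANDBOX_KEY_PATH
    path = Path(os.path.expanduser(override)) if override else default
    if env == "sandbox" and os.path.abspath(path) == os.path.abspath(PROD_KEY_PATH):
        raise PermissionError("sandbox must not use the prod unseal-key path")
    return path
```

This covers only the key-file side of the finding; pointing sandbox IaC at a separate Vault address is a separate control.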

3. ArgoCD app-of-apps: live repoURL + HEAD + prune/selfHeal

Brief. The app-of-apps template renders Applications that auto-reconcile against HEAD of the live Gitea, with pruning and self-heal on by default.

Evidence — argocd/templates/apps.yaml loops values.gitea_applications into Application CRDs:

source:
  repoURL: https://gitea.arcodange.lab/{{ $org }}/{{ $app_name }}
  targetRevision: HEAD
  path: chart
syncPolicy:
  automated:
    prune: true
    selfHeal: true

argocd/values.yaml lists the live apps (url-shortener, tools, webapp, telegram-gateway, erp, cms, dance-lessons-coach); several add ArgoCD Image Updater annotations that chase :latest digests. prune: true means a resource that disappears from git is deleted from the cluster; selfHeal: true means manual changes are reverted. targetRevision: HEAD means the live cluster follows whatever lands on the default branch.

Note

ArgoCD itself is not currently deployed in-cluster (it is commented out in 03_cicd), so this controller is latent today. The coupling matters the moment ArgoCD is enabled — and sandbox work to enable/iterate on ArgoCD is exactly when it would bite.

Finding: When ArgoCD is live, a sandbox change to the app-of-apps (or an Image Updater misfire) reconciled against the prod repoURL/HEAD can prune live resources fleet-wide. Sandbox needs its own ArgoCD pointed at a sandbox branch/Gitea so it only syncs sandbox refs. 🟠
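One way to check "only syncs sandbox refs" before anything is applied is to inspect the rendered Application manifests. The sketch below assumes the manifests are already parsed into dictionaries (parsing YAML is out of scope here) and assumes a hypothetical convention where sandbox branches start with "sandbox"; neither exists in the repos today.

```python
from urllib.parse import urlparse

PROD_GITEA_HOST = "gitea.arcodange.lab"  # host from argocd/templates/apps.yaml
SANDBOX_REF_PREFIX = "sandbox"           # hypothetical branch naming convention


def points_at_prod(app: dict) -> bool:
    """True when an Application sources from the live Gitea on a non-sandbox ref."""
    source = app.get("spec", {}).get("source", {})
    host = urlparse(source.get("repoURL", "")).hostname
    # A missing targetRevision is treated as HEAD, the conservative reading.
    revision = source.get("targetRevision", "HEAD")
    return host == PROD_GITEA_HOST and not revision.startswith(SANDBOX_REF_PREFIX)


def prod_pointing_apps(apps: list) -> list:
    """Names of the Applications that a sandbox ArgoCD must not sync."""
    return [
        app.get("metadata", {}).get("name", "<unnamed>")
        for app in apps
        if points_at_prod(app)
    ]
```

An empty result is the condition a sandbox ArgoCD would need before automated sync with prune is switched on.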

4. Ansible inventory targets the live Pi IPs

Brief. The only inventory targets the three production Raspberry Pis directly; a stray playbook run hits prod hardware.

Evidence — ansible/arcodange/factory/inventory/hosts.yml:

raspberries:
  hosts:
    pi1: { preferred_ip: 192.168.1.201, ... }
    pi2: { preferred_ip: 192.168.1.202, ... }
    pi3: { preferred_ip: 192.168.1.203, ... }
postgres:
  hosts: { pi2: }

The numbered playbooks (01_system through 05_backup) and the recover/ plays operate on these hosts; 01_system-class roles can wipe disks, re-init k3s, or disturb Longhorn replicas. There is no sandbox inventory and no guard: ansible-playbook defaults straight at prod.

Finding: A misdirected playbook against 192.168.1.201-203 can wipe disks / reset k3s / corrupt Longhorn on prod. Sandbox needs a separate inventory/sandbox/hosts.yml (VM hosts only) plus a pre-task that aborts if any target IP is in 192.168.1.201-203 unless i_mean_prod=true. 🔴
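The decision logic of the prod-IP abort guard is small enough to sketch independently of where it ends up running (Ansible pre-task, wrapper script, or both). The function names here are hypothetical; the IP range and the i_mean_prod override come from the finding.

```python
import ipaddress

# The three live Pi addresses from ansible/arcodange/factory/inventory/hosts.yml.
PROD_IPS = {ipaddress.ip_address(f"192.168.1.{n}") for n in (201, 202, 203)}


def prod_targets(target_ips):
    """Return the target addresses that belong to the prod Pi range."""
    return sorted(ip for ip in target_ips if ipaddress.ip_address(ip) in PROD_IPS)


def check_targets(target_ips, i_mean_prod=False):
    """Abort when any target is a prod Pi, unless the operator opted in."""
    hits = prod_targets(target_ips)
    if hits and not i_mean_prod:
        raise SystemExit(f"aborting: prod targets {hits} without i_mean_prod=true")
    return hits
```

Comparing parsed addresses rather than raw strings means a target written with stray whitespace or in an unusual form either matches correctly or fails loudly with a ValueError instead of slipping past the fence.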

5. Single GCS state bucket shared by every stack

Brief. All OpenTofu stacks store state in one bucket, separated only by prefix; a wrong backend config writes prod state.

Evidence — iac/backend.tf uses bucket = "arcodange-tf", prefix = "factory/main"; postgres/iac/backend.tf uses the same bucket with prefix = "factory/postgres". Sibling stacks (tools, cms) follow the same bucket-with-prefix pattern. State is the authoritative map of real resources — a sandbox run that inherits the prod backend will read prod state, plan against prod resources, and on apply mutate them.

Tip

Note the name collision risk: iac/cloudflare.tf also creates a Cloudflare R2 bucket literally named arcodange-tf. The GCS state bucket and the R2 object bucket share a name but are different stores; do not conflate them when scoping sandbox state.

Finding: Without isolation, sandbox tofu reads/writes prod state in arcodange-tf. Sandbox needs a sandbox prefix family (sandbox/factory/main, sandbox/factory/postgres, …) via a backend-config override, or a separate bucket arcodange-tf-sandbox. 🟠
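For the prefix-family option, the mapping from a prod prefix to its sandbox counterpart is mechanical. The sketch below derives the sandbox prefix and the matching init arguments; the helper names are hypothetical, and it relies on the standard -backend-config and -reconfigure options of tofu init to override the prefix declared in backend.tf.

```python
def sandbox_prefix(prod_prefix: str) -> str:
    """Map a prod state prefix such as factory/main to sandbox/factory/main."""
    prefix = prod_prefix.strip("/")
    if prefix.startswith("sandbox/"):
        return prefix  # already a sandbox prefix; keep the mapping idempotent
    return f"sandbox/{prefix}"


def backend_init_args(prod_prefix: str) -> list:
    """Arguments for a hypothetical wrapper that runs `tofu init` in a sandbox."""
    return [
        "init",
        "-reconfigure",
        f"-backend-config=prefix={sandbox_prefix(prod_prefix)}",
    ]
```

For example, backend_init_args("factory/postgres") yields an init call whose last argument is -backend-config=prefix=sandbox/factory/postgres. This only relocates state within arcodange-tf; the separate-bucket option would override bucket instead.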

6. Cloudflare account / OVH arcodange.fr / Zoho — live public DNS & email

Brief. IaC holds tokens that manage the public arcodange.fr zone, the OVH registrar nameservers, and the Zoho mail records; a wrong record silently breaks company email.

Evidence — iac/providers.tf declares provider "cloudflare" {} (token via CLOUDFLARE_API_TOKEN) and provider "ovh" { endpoint = "ovh-eu" }. iac/cloudflare.tf resolves the account by arcodange@gmail.com and a cf_arcodange_cms_token granting zone:DNS Write, Cloudflare Tunnel Write, Turnstile Sites Write, etc. iac/ovh.tf grants domain:apiovh:nameServer/edit on urn:v1:eu:resource:domain:arcodange.fr. The Zoho mail wiring (MX/SPF/DKIM/DMARC/BIMI + aliases) lives in the sibling cms repo's zoho/ and the arcodange.fr zone is managed at Cloudflare. A bad MX/SPF/DKIM record breaks arcodange.fr mail silently, for days.

Finding: These are public, customer-facing, and slow-to-detect. The blast radius is broken company email and public site/tunnel exposure. Sandbox must run these modules plan-only against a throwaway zone/subdomain with a separate token; the real arcodange.fr token must never be exported into a sandbox shell. Real public DNS/ACME end-to-end is out of scope. 🟠
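A plan-only run can still be checked for records that would land in the real zone. As an illustration only (the function names are hypothetical, and the list of planned record names is assumed to be extracted from the plan by some other step), a zone-membership test looks like this:

```python
PROD_ZONE = "arcodange.fr"  # the public zone named in this finding


def in_zone(name: str, zone: str = PROD_ZONE) -> bool:
    """True when a DNS name is the zone apex or any name beneath it."""
    n = name.rstrip(".").lower()
    z = zone.rstrip(".").lower()
    return n == z or n.endswith("." + z)


def prod_zone_records(names, zone: str = PROD_ZONE):
    """Planned record names that would touch the prod zone."""
    return [name for name in names if in_zone(name, zone)]
```

Matching on the label boundary (the leading dot) avoids flagging an unrelated domain that merely ends with the same characters.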

7. Longhorn backup target points at the prod backup bucket

Brief. Longhorn's backup target is a fixed S3 bucket; a sandbox restore drill could overwrite prod backups.

Evidence — argocd/templates/longhorn_backup_target.yaml sets:

"backup-target": "s3://arcodange-backup@us-east-1/"
"backup-target-credential-secret": "longhorn-gcs-backup-credentials"

The credentials are injected by VSO from Vault path kvv2 longhorn/gcs-backup. Recovery drills are a core sandbox use-case (re-run recover/longhorn*.yml, validate engine-ID re-association per the Longhorn PVC recovery ADR) — but a drill that writes to arcodange-backup could clobber the real restore points.

Finding: Sandbox restore drills against s3://arcodange-backup risk overwriting prod backups — the worst kind of failure during a recovery rehearsal. Sandbox backup target must be a separate bucket/prefix. 🟠
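Before a drill starts, the configured backup target can be parsed and compared against the prod bucket. This sketch assumes the s3://bucket@region/path form shown in the evidence above; the function names are hypothetical, and the check is strict (any use of the prod bucket is rejected), so the same-bucket, separate-prefix option would need a relaxed rule.

```python
import re

PROD_BACKUP_BUCKET = "arcodange-backup"  # from longhorn_backup_target.yaml

# s3://<bucket>@<region>/<optional path>
_TARGET_RE = re.compile(r"^s3://(?P<bucket>[^@/]+)@(?P<region>[^/]+)/(?P<path>.*)$")


def parse_backup_target(target: str):
    """Split a backup-target URL into (bucket, region, path)."""
    match = _TARGET_RE.match(target.strip())
    if match is None:
        raise ValueError(f"unrecognised backup target: {target!r}")
    return match.group("bucket"), match.group("region"), match.group("path")


def is_prod_backup_target(target: str) -> bool:
    """True when the target writes into the prod backup bucket."""
    bucket, _region, _path = parse_backup_target(target)
    return bucket == PROD_BACKUP_BUCKET
```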

8. Gitea base_url + the restricted CI module-reader user

Brief. The Gitea provider and a created CI user both bind to the live Gitea; sandbox IaC can mutate prod repos, secrets, and users.

Evidence — provider "gitea" { base_url = "https://gitea.arcodange.lab" } in iac/providers.tf. iac/gitea_tofu_ci_user.tf creates the tofu_module_reader user, an SSH key, and stores it in Vault kvv1/gitea/tofu_module_reader; iac/cloudflare.tf and iac/ovh.tf push repo actions secrets (CLOUDFLARE_API_TOKEN, OVH_CLIENT_ID, …) onto the live cms repo. A sandbox apply against this provider rewrites prod repo secrets and CI users.

Finding: Sandbox IaC pointed at gitea.arcodange.lab can mutate prod repo CI secrets and users, indirectly poisoning prod CI. Sandbox needs its own Gitea (sandbox cluster or org arcodange-sandbox); ArgoCD app-of-apps then points at sandbox refs (see finding 3). 🟠

Summary table

| Coupling | Where (file · value) | Blast radius | Sandbox control |
|---|---|---|---|
| PG superuser provider | postgres/iac/providers.tf · host = 192.168.1.202, superuser = true | Drop/alter live ERP + app DBs → irreversible data loss | Sandbox PG (docker-compose on sandbox pi2-eq); guard refuses apply if host == 192.168.1.202 && workspace != prod |
| Vault address + unseal key | iac/providers.tf / postgres/iac/providers.tf · vault.arcodange.lab · key ~/.arcodange/cluster-keys.json | Lock out VSO fleet-wide; overwrite prod unseal key → prod unrecoverable | Separate sandbox Vault; unseal path → ~/.arcodange/sandbox/cluster-keys.json |
| ArgoCD app-of-apps | argocd/templates/apps.yaml · repoURL gitea.arcodange.lab/..., HEAD, prune+selfHeal | Auto-prune live resources fleet-wide (latent until ArgoCD deployed) | Sandbox ArgoCD → sandbox branch/Gitea; sync only sandbox refs |
| Ansible inventory | ansible/.../inventory/hosts.yml · 192.168.1.201-203 | Wipe disks / reset k3s / corrupt Longhorn on prod Pis | inventory/sandbox/hosts.yml (VMs only) + prod-IP abort guard unless i_mean_prod=true |
| GCS state bucket | iac/backend.tf / postgres/iac/backend.tf · bucket = arcodange-tf | Read/write prod state → plan & apply mutate prod resources | sandbox/... prefix family or arcodange-tf-sandbox bucket via backend override |
| Cloudflare / OVH / Zoho | iac/providers.tf, iac/cloudflare.tf, iac/ovh.tf (+ cms zoho/) · arcodange.fr, account arcodange@gmail.com | Break public DNS / company email silently for days | Plan-only against throwaway zone + separate token; never export the real arcodange.fr token into sandbox |
| Longhorn backup target | argocd/templates/longhorn_backup_target.yaml · s3://arcodange-backup@us-east-1/ | Restore drill overwrites prod backups | Separate sandbox backup bucket/prefix |
| Gitea base_url + CI user | iac/providers.tf · gitea.arcodange.lab · iac/gitea_tofu_ci_user.tf | Rewrite prod repo CI secrets/users → poison prod CI | Sandbox Gitea (cluster or org arcodange-sandbox) |
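The "Where" column can double as a denylist for a pre-flight sweep of any file or rendered config a sandbox run is about to use. The following is a rough sketch, not an exhaustive scanner: the function name is hypothetical and the literal list is copied from the findings above. The arcodange-tf bucket name is left out on purpose, because it is a prefix of the proposed arcodange-tf-sandbox name and a plain substring match would flag the sandbox bucket too.

```python
# Prod-only literals quoted in the findings of this investigation.
PROD_LITERALS = (
    "192.168.1.201",
    "192.168.1.202",
    "192.168.1.203",
    "vault.arcodange.lab",
    "gitea.arcodange.lab",
    "~/.arcodange/cluster-keys.json",
    "s3://arcodange-backup@",
)


def find_prod_literals(text: str):
    """Return (line number, literal) pairs for every prod literal found in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for literal in PROD_LITERALS:
            if literal in line:
                hits.append((lineno, literal))
    return hits
```

Note that the sandbox key path ~/.arcodange/sandbox/cluster-keys.json does not contain the prod path as a substring, so it passes the sweep.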

Classification

| # | Coupling | Severity |
|---|---|---|
| 1 | PG superuser provider → 192.168.1.202 | 🔴 Critical: confirmed irreversible data-loss path |
| 2 | Vault address + unseal key path | 🔴 Critical: prod-secret outage and unrecoverable-unseal path |
| 4 | Ansible inventory → live Pi IPs | 🔴 Critical: disk-wipe / cluster-reset on prod hardware |
| 3 | ArgoCD app-of-apps prune/selfHeal | 🟠 Significant: fleet-wide prune; latent until ArgoCD is deployed |
| 5 | Shared GCS state bucket | 🟠 Significant: prod state mutation if backend not overridden |
| 6 | Cloudflare / OVH / Zoho public DNS & email | 🟠 Significant: public, customer-facing, slow to detect |
| 7 | Longhorn backup target | 🟠 Significant: prod backup overwrite during drills |
| 8 | Gitea base_url + CI user | 🟠 Significant: prod CI-secret/user mutation |

| Emoji | Meaning |
|---|---|
| 🔴 | Critical: data-loss or breaking risk confirmed |
| 🟠 | Significant: degraded but working, or a real coupling to watch |
| 🟡 | Minor: worth noting, low impact |
| 🟢 | Healthy: verified safe / no issue found |
| 🔵 | Informational: context, no action implied |

Open questions

  • Guard enforcement layer. Should the prod-IP abort live as an Ansible pre-task only, or also as a wrapper script around tofu/ansible-playbook so the same fence covers both tools uniformly? (Phase 0 decision.)
  • Vault path discipline. Beyond the unseal-key path, are there other tools (backup scripts, recovery runbook steps) that read a hardcoded ~/.arcodange/cluster-keys.json? A grep sweep of the lab-root scripts is needed so the sandbox override is complete, not partial.
  • State backend override ergonomics. Prefix family vs. a separate arcodange-tf-sandbox bucket — the bucket option is harder to misconfigure (no shared blast radius at all) but adds a provisioning step. Decide in the PRD isolation boundary.
  • ArgoCD readiness. Since ArgoCD is currently latent (commented in 03_cicd), confirm whether enabling it should happen first in the sandbox (so its prune behaviour is rehearsed before it ever reaches prod).
  • Throwaway DNS zone choice. Which subdomain/zone and token scope are acceptable for plan-only Cloudflare/OVH tests without touching arcodange.fr?
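To make the first open question concrete, here is what the wrapper-script option could look like in outline. This is a sketch to inform the Phase 0 decision, not a proposal for the final design: run_fenced and the Check type are hypothetical names, and each check is assumed to be a zero-argument callable that returns an error message or None.

```python
import subprocess
import sys
from typing import Callable, Optional, Sequence

# A check returns an error message when the fence is violated, else None.
Check = Callable[[], Optional[str]]


def run_fenced(cmd: Sequence[str], checks: Sequence[Check]) -> int:
    """Run every check first; execute cmd only if all of them pass.

    Returns 2 without running cmd when any check fails, otherwise the
    exit code of cmd itself.
    """
    failures = [msg for msg in (check() for check in checks) if msg]
    if failures:
        for msg in failures:
            print(f"fence: {msg}", file=sys.stderr)
        return 2
    return subprocess.run(list(cmd)).returncode
```

The same function could front both tofu and ansible-playbook with different check lists, which is the uniformity argument for the wrapper; the cost is that anyone invoking the tools directly bypasses it, which is the argument for also keeping an Ansible pre-task.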

References