# INV-001: Prod blast-radius couplings

> **Status**: Complete
> **Date**: 2026-06-23
> **Priority**: 🔴 P1
> **Related**: [ADR-0001 · Safe prod-like environment](../ADR/0001-safe-prod-like-environment.md) · [PRD · Isolation boundary](../PRD/safe-prod-like-environment/isolation-boundary.md)

> [!NOTE]
> **Origin.** This investigation was spun out of the [safe-environment ADR](../ADR/0001-safe-prod-like-environment.md) and [PRD](../PRD/safe-prod-like-environment/README.md) design work, to enumerate exactly which prod couplings a sandbox must isolate before any sandbox run can be trusted not to mutate live production.

## Objectives

What this investigation set out to answer:

- [x] Enumerate every place where the GitOps repos hardcode a **live prod endpoint, credential path, state location, or auto-reconciling controller**.
- [x] For each, cite the **concrete file and value** so the isolation control can be written against a known target.
- [x] State the **blast radius** of an accidental sandbox-targets-prod mistake per coupling.
- [x] Name the **sandbox control** that severs each coupling (cross-referenced to the PRD isolation boundary).
- [x] Classify severity so Phase 0 guardrails know what *must* land first.

## Executive summary

The lab is administered from a single MacBook holding kubeconfig, the Vault unseal key, and every cloud admin token; the repos point at live prod by default, so any sandbox seeded from those same repos will hit prod unless explicitly fenced off.
The worst couplings are the **PostgreSQL superuser provider hardwired to `192.168.1.202`** (a wrong apply can `DROP`/`ALTER` the live ERP, CMS and other business DBs), the **Vault unseal key at a fixed path `~/.arcodange/cluster-keys.json`** (a botched sandbox init could overwrite the one key that unseals prod), and **ArgoCD app-of-apps with `prune: true` + `selfHeal: true`** pointed at `targetRevision: HEAD` of the live Gitea (an auto-reconcile can delete live resources fleet-wide). Secondary but real: the Ansible inventory targeting `192.168.1.201-203`, the single GCS state bucket `arcodange-tf` shared across all stacks, and the Cloudflare/OVH/Zoho tokens that control public `arcodange.fr` DNS and email. Each coupling has a clean sandbox control (separate inventory + prod-IP abort guard, separate Vault + unseal path, separate state prefix family, plan-only DNS). **None of these controls exist yet; Phase 0 must build them before the first sandbox run.**

## Findings

Each finding leads with a one-line **Brief**, then the concrete evidence (file + value), then the bold **Finding** with a severity tag.

### 1. PostgreSQL superuser provider hardwired to the live host

**Brief.** The Postgres OpenTofu stack connects as **superuser** straight to the production database host with no environment switch.

Evidence in [`postgres/iac/providers.tf`](../../postgres/iac/providers.tf):

```hcl
provider "postgresql" {
  host      = "192.168.1.202"
  username  = var.POSTGRES_USERNAME
  password  = var.POSTGRES_PASSWORD
  sslmode   = "disable"
  superuser = true
}
```

The host is a literal: there is no `var.pg_host`, no workspace gate, no profile. This is the same PG instance (the docker-compose on pi2, outside k3s) that backs the Dolibarr ERP, the CMS, and every app DB created per the [new-web-app join-key convention](../../doc/runbooks/new-web-app/conventions.md). A superuser session here can `DROP DATABASE`, `ALTER ROLE`, or revoke logins on any of them.
Running `tofu apply` from a sandbox shell that still has the prod state and creds wired would act on live data.

**Finding:** Highest-risk coupling. A single wrong `apply` against `192.168.1.202` as superuser can cause **irreversible ERP/business data loss**. The sandbox PG must be the docker-compose on the sandbox "pi2-equivalent", and a guard must **refuse to apply when `host == 192.168.1.202` and `workspace != prod`**. 🔴

### 2. Vault address + unseal key at a fixed local path

**Brief.** Every IaC stack authenticates to the one prod Vault, and the unseal key sits at a single hardcoded path that a sandbox init could clobber.

Evidence: both [`iac/providers.tf`](../../iac/providers.tf) and [`postgres/iac/providers.tf`](../../postgres/iac/providers.tf) declare:

```hcl
provider "vault" {
  address = "https://vault.arcodange.lab"
  auth_login_jwt {
    mount = "gitea_jwt"
    role  = "gitea_cicd"
  }
}
```

The prod unseal key (1 key, threshold 1) lives at `~/.arcodange/cluster-keys.json`. It is the single secret that brings prod Vault back after a seal (see the lab-root recovery runbook listed under References). Vault is the **single source of truth** for all secrets, so a policy/auth/mount change against `vault.arcodange.lab`, or an init/operator step that writes to the default key path, has fleet-wide reach: VSO across every namespace re-reads from this Vault.

**Finding:** Two failure modes. (a) Sandbox IaC pointed at `vault.arcodange.lab` can rewrite prod policies/auth and lock out VSO → fleet-wide secret outage. (b) A botched sandbox `vault operator init` writing to `~/.arcodange/cluster-keys.json` would **overwrite the prod unseal key**, making prod unrecoverable after the next seal. Sandbox needs a **separate Vault** and the unseal-key path overridden to `~/.arcodange/sandbox/cluster-keys.json`. 🔴

### 3. ArgoCD app-of-apps: live repoURL + HEAD + prune/selfHeal

**Brief.** The app-of-apps template renders Applications that auto-reconcile against `HEAD` of the live Gitea, with pruning and self-heal on by default.

Evidence: [`argocd/templates/apps.yaml`](../../argocd/templates/apps.yaml) loops `values.gitea_applications` into Application resources:

```yaml
source:
  repoURL: https://gitea.arcodange.lab/{{ $org }}/{{ $app_name }}
  targetRevision: HEAD
  path: chart
syncPolicy:
  automated:
    prune: true
    selfHeal: true
```

[`argocd/values.yaml`](../../argocd/values.yaml) lists the live apps (`url-shortener`, `tools`, `webapp`, `telegram-gateway`, `erp`, `cms`, `dance-lessons-coach`); several add ArgoCD Image Updater annotations that chase `:latest` digests. `prune: true` means a resource that disappears from git is deleted from the cluster; `selfHeal: true` means manual changes are reverted. `targetRevision: HEAD` means the live cluster follows whatever lands on the default branch.

> [!NOTE]
> ArgoCD itself is not currently deployed in-cluster (it is commented out in `03_cicd`), so this controller is **latent** today. The coupling matters the moment ArgoCD is enabled, and sandbox work to enable/iterate on ArgoCD is exactly when it would bite.

**Finding:** When ArgoCD is live, a sandbox change to the app-of-apps (or an Image Updater misfire) reconciled against the prod `repoURL`/`HEAD` can **prune live resources fleet-wide**. Sandbox needs its own ArgoCD pointed at a **sandbox branch/Gitea** so it only syncs sandbox refs. 🟠

### 4. Ansible inventory targets the live Pi IPs

**Brief.** The only inventory targets the three production Raspberry Pis directly; a stray playbook run hits prod hardware.

Evidence in [`ansible/arcodange/factory/inventory/hosts.yml`](../../ansible/arcodange/factory/inventory/hosts.yml):

```yaml
raspberries:
  hosts:
    pi1: { preferred_ip: 192.168.1.201, ... }
    pi2: { preferred_ip: 192.168.1.202, ... }
    pi3: { preferred_ip: 192.168.1.203, ... }
postgres:
  hosts: { pi2: }
```

The numbered playbooks (`01_system` … `05_backup`) and the `recover/` plays operate on these hosts; `01_system`-class roles can wipe disks, re-init k3s, or disturb Longhorn replicas. There is **no sandbox inventory and no guard**: `ansible-playbook` targets prod by default.

**Finding:** A misdirected playbook against `192.168.1.201-203` can **wipe disks / reset k3s / corrupt Longhorn** on prod. Sandbox needs a separate `inventory/sandbox/hosts.yml` (VM hosts only) **plus a pre-task that aborts if any target IP is in `192.168.1.201-203` unless `i_mean_prod=true`**. 🔴

### 5. Single GCS state bucket shared by every stack

**Brief.** All OpenTofu stacks store state in one bucket, separated only by `prefix`; a wrong backend config writes prod state.

Evidence: [`iac/backend.tf`](../../iac/backend.tf) uses `bucket = "arcodange-tf"`, `prefix = "factory/main"`; [`postgres/iac/backend.tf`](../../postgres/iac/backend.tf) uses the same bucket with `prefix = "factory/postgres"`. Sibling stacks (`tools`, `cms`) follow the same bucket-with-prefix pattern. State is the authoritative map of real resources: a sandbox run that inherits the prod backend will read prod state, plan against prod resources, and on `apply` mutate them.

> [!TIP]
> Note the name collision risk: [`iac/cloudflare.tf`](../../iac/cloudflare.tf) also creates a Cloudflare **R2** bucket literally named `arcodange-tf`. The GCS state bucket and the R2 object bucket share a name but are different stores; do not conflate them when scoping sandbox state.

**Finding:** Without isolation, sandbox `tofu` reads/writes **prod state** in `arcodange-tf`. Sandbox needs a **sandbox prefix family** (`sandbox/factory/main`, `sandbox/factory/postgres`, …) via a backend-config override, or a separate bucket `arcodange-tf-sandbox`. 🟠

### 6. Cloudflare account / OVH arcodange.fr / Zoho: live public DNS & email

**Brief.** IaC holds tokens that manage the public `arcodange.fr` zone, the OVH registrar nameservers, and the Zoho mail records; a wrong record silently breaks company email.

Evidence: [`iac/providers.tf`](../../iac/providers.tf) declares `provider "cloudflare" {}` (token via `CLOUDFLARE_API_TOKEN`) and `provider "ovh" { endpoint = "ovh-eu" }`. [`iac/cloudflare.tf`](../../iac/cloudflare.tf) resolves the account by `arcodange@gmail.com` and defines a `cf_arcodange_cms_token` granting `zone:DNS Write`, `Cloudflare Tunnel Write`, `Turnstile Sites Write`, etc. [`iac/ovh.tf`](../../iac/ovh.tf) grants `domain:apiovh:nameServer/edit` on `urn:v1:eu:resource:domain:arcodange.fr`. The Zoho mail wiring (MX/SPF/DKIM/DMARC/BIMI + aliases) lives in the sibling `cms` repo under `zoho/`, and the `arcodange.fr` zone is managed at Cloudflare. A bad MX/SPF/DKIM record breaks `arcodange.fr` mail **silently, for days**.

**Finding:** These are **public, customer-facing, and slow-to-detect**. The blast radius is broken company email and public site/tunnel exposure. Sandbox must run these modules **plan-only against a throwaway zone/subdomain with a separate token**; the real `arcodange.fr` token must **never** be exported into a sandbox shell. Real public DNS/ACME end-to-end is out of scope. 🟠

### 7. Longhorn backup target points at the prod backup bucket

**Brief.** Longhorn's backup target is a fixed S3 bucket; a sandbox restore drill could overwrite prod backups.

Evidence: [`argocd/templates/longhorn_backup_target.yaml`](../../argocd/templates/longhorn_backup_target.yaml) sets:

```yaml
"backup-target": "s3://arcodange-backup@us-east-1/"
"backup-target-credential-secret": "longhorn-gcs-backup-credentials"
```

The credentials are injected by VSO from Vault path `kvv2 longhorn/gcs-backup`.
Recovery drills are a core sandbox use-case (re-run `recover/longhorn*.yml`, validate engine-ID re-association per the [Longhorn PVC recovery ADR](../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md)), but a drill that writes to `arcodange-backup` could clobber the real restore points.

**Finding:** Sandbox restore drills against `s3://arcodange-backup` risk **overwriting prod backups**, the worst kind of failure during a recovery rehearsal. Sandbox backup target must be a **separate bucket/prefix**. 🟠

### 8. Gitea base_url + the restricted CI module-reader user

**Brief.** The Gitea provider and a created CI user both bind to the live Gitea; sandbox IaC can mutate prod repos, secrets, and users.

Evidence: `provider "gitea" { base_url = "https://gitea.arcodange.lab" }` in [`iac/providers.tf`](../../iac/providers.tf). [`iac/gitea_tofu_ci_user.tf`](../../iac/gitea_tofu_ci_user.tf) creates the `tofu_module_reader` user, an SSH key, and stores it in Vault `kvv1/gitea/tofu_module_reader`; [`iac/cloudflare.tf`](../../iac/cloudflare.tf) and [`iac/ovh.tf`](../../iac/ovh.tf) push repo actions secrets (`CLOUDFLARE_API_TOKEN`, `OVH_CLIENT_ID`, …) onto the live `cms` repo. A sandbox apply against this provider rewrites prod repo secrets and CI users.

**Finding:** Sandbox IaC pointed at `gitea.arcodange.lab` can **mutate prod repo CI secrets and users**, indirectly poisoning prod CI. Sandbox needs its own Gitea (sandbox cluster or org `arcodange-sandbox`); ArgoCD app-of-apps then points at sandbox refs (see finding 3).
🟠

## Summary table

| Coupling | Where (file · value) | Blast radius | Sandbox control |
| --- | --- | --- | --- |
| PG superuser provider | `postgres/iac/providers.tf` · `host = 192.168.1.202`, `superuser = true` | Drop/alter live ERP + app DBs → irreversible data loss | Sandbox PG (docker-compose on sandbox pi2-eq); guard refuses apply if `host == 192.168.1.202 && workspace != prod` |
| Vault address + unseal key | `iac/providers.tf` / `postgres/iac/providers.tf` · `vault.arcodange.lab` · key `~/.arcodange/cluster-keys.json` | Lock out VSO fleet-wide; overwrite prod unseal key → prod unrecoverable | Separate sandbox Vault; unseal path → `~/.arcodange/sandbox/cluster-keys.json` |
| ArgoCD app-of-apps | `argocd/templates/apps.yaml` · `repoURL gitea.arcodange.lab/...`, `HEAD`, `prune+selfHeal` | Auto-prune live resources fleet-wide (latent until ArgoCD deployed) | Sandbox ArgoCD → sandbox branch/Gitea; sync only sandbox refs |
| Ansible inventory | `ansible/.../inventory/hosts.yml` · `192.168.1.201-203` | Wipe disks / reset k3s / corrupt Longhorn on prod Pis | `inventory/sandbox/hosts.yml` (VMs only) + prod-IP abort guard unless `i_mean_prod=true` |
| GCS state bucket | `iac/backend.tf` / `postgres/iac/backend.tf` · `bucket = arcodange-tf` | Read/write prod state → plan & apply mutate prod resources | `sandbox/...` prefix family or `arcodange-tf-sandbox` bucket via backend override |
| Cloudflare / OVH / Zoho | `iac/providers.tf`, `iac/cloudflare.tf`, `iac/ovh.tf` (+ cms `zoho/`) · `arcodange.fr`, account `arcodange@gmail.com` | Break public DNS / company email silently for days | Plan-only against throwaway zone + separate token; never export the real `arcodange.fr` token into sandbox |
| Longhorn backup target | `argocd/templates/longhorn_backup_target.yaml` · `s3://arcodange-backup@us-east-1/` | Restore drill overwrites prod backups | Separate sandbox backup bucket/prefix |
| Gitea base_url + CI user | `iac/providers.tf` · `gitea.arcodange.lab` · `iac/gitea_tofu_ci_user.tf` | Rewrite prod repo CI secrets/users → poison prod CI | Sandbox Gitea (cluster or org `arcodange-sandbox`) |

## Classification

| # | Coupling | Severity |
| --- | --- | --- |
| 1 | PG superuser provider → `192.168.1.202` | 🔴 Critical: confirmed irreversible data-loss path |
| 2 | Vault address + unseal key path | 🔴 Critical: prod-secret outage and unrecoverable-unseal path |
| 4 | Ansible inventory → live Pi IPs | 🔴 Critical: disk-wipe / cluster-reset on prod hardware |
| 3 | ArgoCD app-of-apps prune/selfHeal | 🟠 Significant: fleet-wide prune; latent until ArgoCD is deployed |
| 5 | Shared GCS state bucket | 🟠 Significant: prod state mutation if backend not overridden |
| 6 | Cloudflare / OVH / Zoho public DNS & email | 🟠 Significant: public, customer-facing, slow to detect |
| 7 | Longhorn backup target | 🟠 Significant: prod backup overwrite during drills |
| 8 | Gitea base_url + CI user | 🟠 Significant: prod CI-secret/user mutation |

| Emoji | Meaning |
| --- | --- |
| 🔴 | Critical: data-loss or breaking risk confirmed |
| 🟠 | Significant: degraded but working, or a real coupling to watch |
| 🟡 | Minor: worth noting, low impact |
| 🟢 | Healthy: verified safe / no issue found |
| 🔵 | Informational: context, no action implied |

## Open questions

- **Guard enforcement layer.** Should the prod-IP abort live as an Ansible pre-task only, or also as a wrapper script around `tofu`/`ansible-playbook` so the same fence covers both tools uniformly? (Phase 0 decision.)
- **Vault path discipline.** Beyond the unseal-key path, are there other tools (backup scripts, recovery runbook steps) that read a hardcoded `~/.arcodange/cluster-keys.json`? A grep sweep of the lab-root scripts is needed so the sandbox override is complete, not partial.
- **State backend override ergonomics.** Prefix family vs.
  a separate `arcodange-tf-sandbox` bucket: the bucket option is harder to misconfigure (no shared blast radius at all) but adds a provisioning step. Decide in the PRD isolation boundary.
- **ArgoCD readiness.** Since ArgoCD is currently latent (commented in `03_cicd`), confirm whether enabling it should happen *first* in the sandbox (so its prune behaviour is rehearsed before it ever reaches prod).
- **Throwaway DNS zone choice.** Which subdomain/zone and token scope are acceptable for plan-only Cloudflare/OVH tests without touching `arcodange.fr`?

## References

- [ADR-0001 · Safe prod-like environment](../ADR/0001-safe-prod-like-environment.md): the decision this investigation supports.
- [PRD · Safe prod-like environment (hub)](../PRD/safe-prod-like-environment/README.md) and its [Isolation boundary leaf](../PRD/safe-prod-like-environment/isolation-boundary.md): where each coupling's control is specified in full.
- [Lab ecosystem guidebook](../guidebooks/lab-ecosystem/README.md): background on the prod topology and the join key.
- [New-web-app conventions (join key)](../../doc/runbooks/new-web-app/conventions.md): why one identifier keys repo, DB, Vault, namespace, and DNS together.
- [Longhorn PVC recovery ADR](../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md): engine-ID re-association relevant to the backup-target and restore-drill coupling.
- `CLUSTER_RECOVERY.md` (at the lab root, outside this repo): the tested power-cut recovery runbook; the unseal-key path and Vault-seal recovery referenced in findings 2 and 7 come from it.
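
The prod-IP abort guard named in finding 4 could start life as a small Ansible play that runs before anything else. The following is a minimal sketch under stated assumptions, not the control Phase 0 will necessarily ship: the file name `guard_prod_ips.yml` and the `prod_ips` and `target_ip` variables are hypothetical, `preferred_ip` and `i_mean_prod` are taken from finding 4, and the fallback to `ansible_host` is a guess at how sandbox VM hosts would be addressed.

```yaml
# guard_prod_ips.yml (hypothetical file name)
- name: Abort when a production Pi would be targeted without explicit opt-in
  hosts: all
  gather_facts: false
  # One failing host stops the run for every host, so a mixed inventory
  # cannot proceed on its non-prod members.
  any_errors_fatal: true
  vars:
    # Assumption: the three live Pi addresses cited in finding 4.
    prod_ips:
      - "192.168.1.201"
      - "192.168.1.202"
      - "192.168.1.203"
    # preferred_ip is the inventory variable shown in hosts.yml;
    # the fallbacks are assumptions for hosts that do not define it.
    target_ip: "{{ preferred_ip | default(ansible_host | default(inventory_hostname)) }}"
  tasks:
    - name: Refuse prod IPs unless i_mean_prod=true was passed
      ansible.builtin.assert:
        that:
          - (target_ip not in prod_ips) or (i_mean_prod | default(false) | bool)
        fail_msg: "{{ inventory_hostname }} resolves to prod IP {{ target_ip }}; pass -e i_mean_prod=true to proceed."
        quiet: true
```

One way to wire it in would be an `ansible.builtin.import_playbook: guard_prod_ips.yml` entry at the top of each numbered playbook. This covers Ansible only; whether `tofu` needs the same fence through a shared wrapper is the open question on the guard enforcement layer.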