docs(vibe): bootstrap vibe/ knowledge tree + ecosystem AGENTS.md

Add a root AGENTS.md (ecosystem map of factory/tools/cms + agent operating
rules + the persona cohort & workflow) and a new vibe/ knowledge base for LLM
agents, modeled on tree-docs conventions and the factory house style.

vibe/ folders (each with a README hub + contribution rules):
- ADR/      optimized MADR-lite; canonical home going forward (doc/adr stays historical)
- PRD/      one subfolder per PRD, mandatory STATUS.md, QA strategy for big ones
- investigations/  single INV-NNN-slug.md, or stub + folder w/ notebooks
- guidebooks/      tree-docs maps; lab-ecosystem guidebook of factory+tools+cms
- runbooks/        [AGENT]/[HUMAN] step procedures (EN; doc/runbooks stays FR)
- shareouts/       dated FR handouts (decks/mp4)

Seed content (first ADR + PRD): a safe, production-like environment to rehearse
risky changes and recovery without touching real prod — local-only sandbox
(k3d + arm64 VMs) with a hard prod/sandbox isolation boundary. Includes
INV-001 (prod blast-radius couplings), the ecosystem guidebook, and a FR shareout.

Conventions enforced: no-tombstone rule, breadcrumb spine, bidirectional
cross-links, theme:base mermaid (MCP-validated) + ordered-list-after-diagram.
Built with a Workflow + persona cohort; 24 files, zero dead links.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 11:52:37 +02:00
parent 827af6b392
commit 7647a68cdc
25 changed files with 1878 additions and 0 deletions


@@ -0,0 +1,206 @@
[vibe](../README.md) > [Investigations](README.md) > **INV-001 · Prod blast-radius couplings**
# INV-001: Prod blast-radius couplings
> **Status**: Complete
> **Date**: 2026-06-23
> **Priority**: 🔴 P1
> **Related**: [ADR-0001 · Safe prod-like environment](../ADR/0001-safe-prod-like-environment.md) · [PRD · Isolation boundary](../PRD/safe-prod-like-environment/isolation-boundary.md)
> [!NOTE]
> **Origin.** This investigation was spun out of the [safe-environment ADR](../ADR/0001-safe-prod-like-environment.md) and [PRD](../PRD/safe-prod-like-environment/README.md) design work, to enumerate exactly which prod couplings a sandbox must isolate before any sandbox run can be trusted not to mutate live production.
## Objectives
What this investigation set out to answer:
- [x] Enumerate every place where the GitOps repos hardcode a **live prod endpoint, credential path, state location, or auto-reconciling controller**.
- [x] For each, cite the **concrete file and value** so the isolation control can be written against a known target.
- [x] State the **blast radius** of an accidental sandbox-targets-prod mistake per coupling.
- [x] Name the **sandbox control** that severs each coupling (cross-referenced to the PRD isolation boundary).
- [x] Classify severity so Phase 0 guardrails know what *must* land first.
## Executive summary
The lab is administered from a single MacBook holding kubeconfig, the Vault unseal key, and every cloud admin token; the repos point at live prod by default, so any sandbox seeded from those same repos will hit prod unless explicitly fenced off. The worst couplings are the **PostgreSQL superuser provider hardwired to `192.168.1.202`** (a wrong apply can `DROP`/`ALTER` the live ERP, CMS and other business DBs), the **Vault unseal key at a fixed path `~/.arcodange/cluster-keys.json`** (a botched sandbox init could overwrite the one key that unseals prod), and **ArgoCD app-of-apps with `prune: true` + `selfHeal: true`** pointed at `targetRevision: HEAD` of the live Gitea (an auto-reconcile can delete live resources fleet-wide). Secondary but real: the Ansible inventory targeting `192.168.1.201-203`, the single GCS state bucket `arcodange-tf` shared across all stacks, and the Cloudflare/OVH/Zoho tokens that control public `arcodange.fr` DNS and email. Each coupling has a clean sandbox control (separate inventory + prod-IP abort guard, separate Vault + unseal path, separate state prefix family, plan-only DNS). **None of these controls exist yet — Phase 0 must build them before the first sandbox run.**
## Findings
Each finding leads with a one-line **Brief**, then the concrete evidence (file + value), then the bold **Finding** with a severity tag.
### 1. PostgreSQL superuser provider hardwired to the live host
**Brief.** The Postgres OpenTofu stack connects as **superuser** straight to the production database host with no environment switch.
Evidence — [`postgres/iac/providers.tf`](../../postgres/iac/providers.tf):
```hcl
provider "postgresql" {
host = "192.168.1.202"
username = var.POSTGRES_USERNAME
password = var.POSTGRES_PASSWORD
sslmode = "disable"
superuser = true
}
```
The host is a literal — there is no `var.pg_host`, no workspace gate, no profile. This is the same PG instance (the docker-compose on pi2, outside k3s) that backs the Dolibarr ERP, the CMS, and every app DB created per the [`<app>` convention](../../doc/runbooks/new-web-app/conventions.md). A superuser session here can `DROP DATABASE`, `ALTER ROLE`, or revoke logins on any of them. Running `tofu apply` from a sandbox shell that still has the prod state and creds wired would act on live data.
**Finding:** Highest-risk coupling. A single wrong `apply` against `192.168.1.202` as superuser can cause **irreversible ERP/business data loss**. The sandbox PG must be the docker-compose on the sandbox "pi2-equivalent", and a guard must **refuse to apply when `host == 192.168.1.202` and `workspace != prod`**. 🔴
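
The guard described in this finding could be sketched as a small pre-apply check. This is a minimal sketch under assumptions, not the lab's implementation: the prod host value comes from the evidence above, while the function name, the `PgGuardError` exception, and passing the workspace as a plain string are hypothetical choices made for illustration.

```python
PROD_PG_HOST = "192.168.1.202"  # live host cited in postgres/iac/providers.tf


class PgGuardError(RuntimeError):
    """Raised when a non-prod workspace targets the prod PostgreSQL host."""


def check_pg_target(host: str, workspace: str) -> None:
    """Refuse to proceed when host is the prod PG and workspace is not prod.

    Hypothetical helper: how `host` and `workspace` are obtained (provider
    config, selected workspace, environment) is left to the caller.
    """
    if host.strip() == PROD_PG_HOST and workspace.strip() != "prod":
        raise PgGuardError(
            f"refusing to apply: host {host} is prod but workspace is {workspace!r}"
        )
```

A wrapper would call `check_pg_target(...)` before invoking `tofu apply` and let the exception stop the run.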
### 2. Vault address + unseal key at a fixed local path
**Brief.** Every IaC stack authenticates to the one prod Vault, and the unseal key sits at a single hardcoded path that a sandbox init could clobber.
Evidence — both [`iac/providers.tf`](../../iac/providers.tf) and [`postgres/iac/providers.tf`](../../postgres/iac/providers.tf) declare:
```hcl
provider "vault" {
address = "https://vault.arcodange.lab"
auth_login_jwt {
mount = "gitea_jwt"
role = "gitea_cicd"
}
}
```
The prod unseal key (1 key, threshold 1) lives at `~/.arcodange/cluster-keys.json` — the single secret that brings prod Vault back after a seal (see the lab-root recovery runbook, named below). Vault is the **single source of truth** for all secrets, so a policy/auth/mount change against `vault.arcodange.lab`, or an init/operator step that writes to the default key path, has fleet-wide reach: VSO across every namespace re-reads from this Vault.
**Finding:** Two failure modes. (a) Sandbox IaC pointed at `vault.arcodange.lab` can rewrite prod policies/auth and lock out VSO → fleet-wide secret outage. (b) A botched sandbox `vault operator init` writing to `~/.arcodange/cluster-keys.json` would **overwrite the prod unseal key**, making prod unrecoverable after the next seal. Sandbox needs a **separate Vault** and the unseal-key path overridden to `~/.arcodange/sandbox/cluster-keys.json`. 🔴
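
Failure mode (b) can be illustrated with a key-file writer that picks the path per environment and never overwrites an existing file. This is a sketch only: the two file locations mirror the paths named in this finding, but the helper names and the `base` parameter (the `~/.arcodange` directory, injectable so the code can be exercised in a temp directory) are assumptions.

```python
import os
from pathlib import Path

DEFAULT_BASE = Path.home() / ".arcodange"  # directory holding the key files


def unseal_key_path(env: str, base: Path = DEFAULT_BASE) -> Path:
    """Return the unseal-key file for `env` ('prod' or 'sandbox')."""
    if env == "prod":
        return base / "cluster-keys.json"
    if env == "sandbox":
        return base / "sandbox" / "cluster-keys.json"
    raise ValueError(f"unknown environment: {env!r}")


def write_unseal_keys(env: str, payload: str, base: Path = DEFAULT_BASE) -> Path:
    """Store init output for `env`, failing if a key file already exists."""
    target = unseal_key_path(env, base)
    target.parent.mkdir(parents=True, exist_ok=True)
    # Mode "x" raises FileExistsError instead of truncating an existing file.
    with open(target, "x", encoding="utf-8") as handle:
        handle.write(payload)
    os.chmod(target, 0o600)  # owner read/write only
    return target
```

Exclusive creation means that even a sandbox init mistakenly run with `env="prod"` cannot replace an existing prod key; it fails loudly instead.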
### 3. ArgoCD app-of-apps: live repoURL + HEAD + prune/selfHeal
**Brief.** The app-of-apps template renders Applications that auto-reconcile against `HEAD` of the live Gitea, with pruning and self-heal on by default.
Evidence — [`argocd/templates/apps.yaml`](../../argocd/templates/apps.yaml) loops `values.gitea_applications` into Application CRDs:
```yaml
source:
repoURL: https://gitea.arcodange.lab/{{ $org }}/{{ $app_name }}
targetRevision: HEAD
path: chart
syncPolicy:
automated:
prune: true
selfHeal: true
```
[`argocd/values.yaml`](../../argocd/values.yaml) lists the live apps (`url-shortener`, `tools`, `webapp`, `telegram-gateway`, `erp`, `cms`, `dance-lessons-coach`); several add ArgoCD Image Updater annotations that chase `:latest` digests. `prune: true` means a resource that disappears from git is deleted from the cluster; `selfHeal: true` means manual changes are reverted. `targetRevision: HEAD` means the live cluster follows whatever lands on the default branch.
> [!NOTE]
> ArgoCD itself is not currently deployed in-cluster (it is commented out in `03_cicd`), so this controller is **latent** today. The coupling matters the moment ArgoCD is enabled — and sandbox work to enable/iterate on ArgoCD is exactly when it would bite.
**Finding:** When ArgoCD is live, a sandbox change to the app-of-apps (or an Image Updater misfire) reconciled against the prod `repoURL`/`HEAD` can **prune live resources fleet-wide**. Sandbox needs its own ArgoCD pointed at a **sandbox branch/Gitea** so it only syncs sandbox refs. 🟠
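
One way to check that a rendered Application only syncs sandbox refs is to inspect its `source` before it is applied. The sketch below is an illustration under assumptions: the `sandbox/` branch prefix is a hypothetical naming scheme, and the function takes the Application `spec` as an already-parsed dictionary.

```python
from urllib.parse import urlparse

PROD_GITEA_HOST = "gitea.arcodange.lab"  # from argocd/templates/apps.yaml
SANDBOX_REF_PREFIX = "sandbox/"          # hypothetical branch naming scheme


def is_sandbox_safe(spec: dict) -> bool:
    """Return True when an Application spec cannot reconcile prod refs."""
    source = spec.get("source", {})
    host = urlparse(source.get("repoURL", "")).hostname
    if host is None:
        return False  # no usable repoURL: treat as unsafe rather than guess
    # An omitted targetRevision is treated as HEAD, matching ArgoCD's default.
    revision = source.get("targetRevision") or "HEAD"
    if host != PROD_GITEA_HOST:
        return True  # a separate Gitea is out of prod's reach
    return revision.startswith(SANDBOX_REF_PREFIX)
```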
### 4. Ansible inventory targets the live Pi IPs
**Brief.** The only inventory targets the three production Raspberry Pis directly; a stray playbook run hits prod hardware.
Evidence — [`ansible/arcodange/factory/inventory/hosts.yml`](../../ansible/arcodange/factory/inventory/hosts.yml):
```yaml
raspberries:
hosts:
pi1: { preferred_ip: 192.168.1.201, ... }
pi2: { preferred_ip: 192.168.1.202, ... }
pi3: { preferred_ip: 192.168.1.203, ... }
postgres:
hosts: { pi2: }
```
The numbered playbooks (`01_system` through `05_backup`) and the `recover/` plays operate on these hosts; `01_system`-class roles can wipe disks, re-init k3s, or disturb Longhorn replicas. There is **no sandbox inventory and no guard**: `ansible-playbook` targets prod by default.
**Finding:** A misdirected playbook against `192.168.1.201-203` can **wipe disks / reset k3s / corrupt Longhorn** on prod. Sandbox needs a separate `inventory/sandbox/hosts.yml` (VM hosts only) **plus a pre-task that aborts if any target IP is in `192.168.1.201-203` unless `i_mean_prod=true`**. 🔴
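
The prod-IP abort logic can be sketched independently of where it ends up living (Ansible pre-task or wrapper script). The range and the `i_mean_prod` flag come from this finding; the function names and the `ProdTargetError` exception are hypothetical, and the sketch assumes IPv4 targets only.

```python
import ipaddress

PROD_FIRST = ipaddress.ip_address("192.168.1.201")
PROD_LAST = ipaddress.ip_address("192.168.1.203")


class ProdTargetError(RuntimeError):
    """Raised when a run would touch a prod Pi without explicit consent."""


def prod_targets(ips: list[str]) -> list[str]:
    """Return the subset of `ips` that falls inside the prod Pi range."""
    return [
        ip for ip in ips
        if PROD_FIRST <= ipaddress.ip_address(ip) <= PROD_LAST
    ]


def abort_if_prod(ips: list[str], i_mean_prod: bool = False) -> None:
    """Abort unless no target is prod or the operator passed i_mean_prod."""
    hits = prod_targets(ips)
    if hits and not i_mean_prod:
        raise ProdTargetError(f"prod targets in inventory: {', '.join(hits)}")
```

Comparing parsed addresses rather than strings avoids false negatives from formatting differences such as surrounding text or leading zeros being handled inconsistently.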
### 5. Single GCS state bucket shared by every stack
**Brief.** All OpenTofu stacks store state in one bucket, separated only by `prefix`; a wrong backend config writes prod state.
Evidence — [`iac/backend.tf`](../../iac/backend.tf) uses `bucket = "arcodange-tf"`, `prefix = "factory/main"`; [`postgres/iac/backend.tf`](../../postgres/iac/backend.tf) uses the same bucket with `prefix = "factory/postgres"`. Sibling stacks (`tools`, `cms`) follow the same bucket-with-prefix pattern. State is the authoritative map of real resources — a sandbox run that inherits the prod backend will read prod state, plan against prod resources, and on `apply` mutate them.
> [!TIP]
> Note the name collision risk: [`iac/cloudflare.tf`](../../iac/cloudflare.tf) also creates a Cloudflare **R2** bucket literally named `arcodange-tf`. The GCS state bucket and the R2 object bucket share a name but are different stores; do not conflate them when scoping sandbox state.
**Finding:** Without isolation, sandbox `tofu` reads/writes **prod state** in `arcodange-tf`. Sandbox needs a **sandbox prefix family** (`sandbox/factory/main`, `sandbox/factory/postgres`, …) via a backend-config override, or a separate bucket `arcodange-tf-sandbox`. 🟠
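
The prefix-family option amounts to a deterministic mapping from each prod prefix to its sandbox twin, passed to `tofu init` through `-backend-config`. The sketch below assumes that mapping is a plain `sandbox/` prepend, as in the examples above; the helper names are hypothetical.

```python
def sandbox_prefix(prod_prefix: str) -> str:
    """Map a prod state prefix to its sandbox counterpart (idempotent)."""
    cleaned = prod_prefix.strip("/")
    if not cleaned:
        raise ValueError("empty state prefix")
    if cleaned == "sandbox" or cleaned.startswith("sandbox/"):
        return cleaned  # already a sandbox prefix
    return f"sandbox/{cleaned}"


def backend_override_args(prod_prefix: str) -> list[str]:
    """Build the init argument that overrides the backend prefix."""
    return [f"-backend-config=prefix={sandbox_prefix(prod_prefix)}"]
```

For example, `backend_override_args("factory/postgres")` yields an argument that points the Postgres stack at `sandbox/factory/postgres` in the same bucket.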
### 6. Cloudflare account / OVH arcodange.fr / Zoho — live public DNS & email
**Brief.** IaC holds tokens that manage the public `arcodange.fr` zone, the OVH registrar nameservers, and the Zoho mail records; a wrong record silently breaks company email.
Evidence — [`iac/providers.tf`](../../iac/providers.tf) declares `provider "cloudflare" {}` (token via `CLOUDFLARE_API_TOKEN`) and `provider "ovh" { endpoint = "ovh-eu" }`. [`iac/cloudflare.tf`](../../iac/cloudflare.tf) resolves the account by `arcodange@gmail.com` and defines a `cf_arcodange_cms_token` that grants `zone:DNS Write`, `Cloudflare Tunnel Write`, `Turnstile Sites Write`, etc. [`iac/ovh.tf`](../../iac/ovh.tf) grants `domain:apiovh:nameServer/edit` on `urn:v1:eu:resource:domain:arcodange.fr`. The Zoho mail wiring (MX/SPF/DKIM/DMARC/BIMI + aliases) lives in the sibling `cms` repo's `zoho/`, and the `arcodange.fr` zone is managed at Cloudflare. A bad MX/SPF/DKIM record breaks `arcodange.fr` mail **silently, for days**.
**Finding:** These are **public, customer-facing, and slow-to-detect**. The blast radius is broken company email and public site/tunnel exposure. Sandbox must run these modules **plan-only against a throwaway zone/subdomain with a separate token**; the real `arcodange.fr` token must **never** be exported into a sandbox shell. Real public DNS/ACME end-to-end is out of scope. 🟠
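
"Plan-only" can be enforced mechanically by filtering the `tofu` subcommand before it runs. The following is a sketch under assumptions: the allow-list is one possible choice of subcommands that do not change managed resources, and the function and exception names are hypothetical.

```python
# Subcommands assumed acceptable for a plan-only stack; none applies changes.
PLAN_ONLY_SUBCOMMANDS = frozenset({"init", "validate", "plan", "show"})


class PlanOnlyError(RuntimeError):
    """Raised when a mutating or unknown subcommand is requested."""


def check_plan_only(argv: list[str]) -> str:
    """Return the subcommand in `argv` if allowed, else raise PlanOnlyError.

    `argv` is the argument list after the binary name; leading global
    options such as -chdir=DIR are skipped when locating the subcommand.
    """
    subcommand = next((arg for arg in argv if not arg.startswith("-")), None)
    if subcommand not in PLAN_ONLY_SUBCOMMANDS:
        raise PlanOnlyError(f"subcommand {subcommand!r} is not plan-only")
    return subcommand
```

This covers only the command side; keeping the real `arcodange.fr` token out of the sandbox shell remains a separate control.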
### 7. Longhorn backup target points at the prod backup bucket
**Brief.** Longhorn's backup target is a fixed S3 bucket; a sandbox restore drill could overwrite prod backups.
Evidence — [`argocd/templates/longhorn_backup_target.yaml`](../../argocd/templates/longhorn_backup_target.yaml) sets:
```yaml
"backup-target": "s3://arcodange-backup@us-east-1/"
"backup-target-credential-secret": "longhorn-gcs-backup-credentials"
```
The credentials are injected by VSO from Vault path `kvv2 longhorn/gcs-backup`. Recovery drills are a core sandbox use-case (re-run `recover/longhorn*.yml`, validate engine-ID re-association per the [Longhorn PVC recovery ADR](../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md)) — but a drill that writes to `arcodange-backup` could clobber the real restore points.
**Finding:** Sandbox restore drills against `s3://arcodange-backup` risk **overwriting prod backups** — the worst kind of failure during a recovery rehearsal. Sandbox backup target must be a **separate bucket/prefix**. 🟠
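
A drill can verify its backup target before writing anything. The sketch below parses the `s3://<bucket>@<region>/<path>` form shown above and flags the prod bucket; the `sandbox/` path prefix and the helper names are assumptions, since the finding leaves the bucket-versus-prefix choice open.

```python
import re

PROD_BACKUP_BUCKET = "arcodange-backup"  # from longhorn_backup_target.yaml
_TARGET = re.compile(r"^s3://(?P<bucket>[^@/]+)@(?P<region>[^/]+)/(?P<path>.*)$")


def parse_backup_target(target: str) -> tuple[str, str, str]:
    """Split an s3://bucket@region/path target into its three parts."""
    match = _TARGET.match(target)
    if match is None:
        raise ValueError(f"unrecognised backup target: {target!r}")
    return match["bucket"], match["region"], match["path"]


def is_prod_backup_target(target: str, sandbox_path: str = "sandbox/") -> bool:
    """True when `target` would write into the prod backup location."""
    bucket, _region, path = parse_backup_target(target)
    return bucket == PROD_BACKUP_BUCKET and not path.startswith(sandbox_path)
```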
### 8. Gitea base_url + the restricted CI module-reader user
**Brief.** The Gitea provider and a created CI user both bind to the live Gitea; sandbox IaC can mutate prod repos, secrets, and users.
Evidence — `provider "gitea" { base_url = "https://gitea.arcodange.lab" }` in [`iac/providers.tf`](../../iac/providers.tf). [`iac/gitea_tofu_ci_user.tf`](../../iac/gitea_tofu_ci_user.tf) creates the `tofu_module_reader` user, an SSH key, and stores it in Vault `kvv1/gitea/tofu_module_reader`; [`iac/cloudflare.tf`](../../iac/cloudflare.tf) and [`iac/ovh.tf`](../../iac/ovh.tf) push repo actions secrets (`CLOUDFLARE_API_TOKEN`, `OVH_CLIENT_ID`, …) onto the live `cms` repo. A sandbox apply against this provider rewrites prod repo secrets and CI users.
**Finding:** Sandbox IaC pointed at `gitea.arcodange.lab` can **mutate prod repo CI secrets and users**, indirectly poisoning prod CI. Sandbox needs its own Gitea (sandbox cluster or org `arcodange-sandbox`); ArgoCD app-of-apps then points at sandbox refs (see finding 3). 🟠
## Summary table
| Coupling | Where (file · value) | Blast radius | Sandbox control |
| --- | --- | --- | --- |
| PG superuser provider | `postgres/iac/providers.tf` · `host = 192.168.1.202`, `superuser = true` | Drop/alter live ERP + app DBs → irreversible data loss | Sandbox PG (docker-compose on sandbox pi2-eq); guard refuses apply if `host == 192.168.1.202 && workspace != prod` |
| Vault address + unseal key | `iac/providers.tf` / `postgres/iac/providers.tf` · `vault.arcodange.lab` · key `~/.arcodange/cluster-keys.json` | Lock out VSO fleet-wide; overwrite prod unseal key → prod unrecoverable | Separate sandbox Vault; unseal path → `~/.arcodange/sandbox/cluster-keys.json` |
| ArgoCD app-of-apps | `argocd/templates/apps.yaml` · `repoURL gitea.arcodange.lab/...`, `HEAD`, `prune+selfHeal` | Auto-prune live resources fleet-wide (latent until ArgoCD deployed) | Sandbox ArgoCD → sandbox branch/Gitea; sync only sandbox refs |
| Ansible inventory | `ansible/.../inventory/hosts.yml` · `192.168.1.201-203` | Wipe disks / reset k3s / corrupt Longhorn on prod Pis | `inventory/sandbox/hosts.yml` (VMs only) + prod-IP abort guard unless `i_mean_prod=true` |
| GCS state bucket | `iac/backend.tf` / `postgres/iac/backend.tf` · `bucket = arcodange-tf` | Read/write prod state → plan & apply mutate prod resources | `sandbox/...` prefix family or `arcodange-tf-sandbox` bucket via backend override |
| Cloudflare / OVH / Zoho | `iac/providers.tf`, `iac/cloudflare.tf`, `iac/ovh.tf` (+ cms `zoho/`) · `arcodange.fr`, account `arcodange@gmail.com` | Break public DNS / company email silently for days | Plan-only against throwaway zone + separate token; never export the real `arcodange.fr` token into sandbox |
| Longhorn backup target | `argocd/templates/longhorn_backup_target.yaml` · `s3://arcodange-backup@us-east-1/` | Restore drill overwrites prod backups | Separate sandbox backup bucket/prefix |
| Gitea base_url + CI user | `iac/providers.tf` · `gitea.arcodange.lab` · `iac/gitea_tofu_ci_user.tf` | Rewrite prod repo CI secrets/users → poison prod CI | Sandbox Gitea (cluster or org `arcodange-sandbox`) |
## Classification
| # | Coupling | Severity |
| --- | --- | --- |
| 1 | PG superuser provider → `192.168.1.202` | 🔴 Critical — confirmed irreversible data-loss path |
| 2 | Vault address + unseal key path | 🔴 Critical — prod-secret outage and unrecoverable-unseal path |
| 4 | Ansible inventory → live Pi IPs | 🔴 Critical — disk-wipe / cluster-reset on prod hardware |
| 3 | ArgoCD app-of-apps prune/selfHeal | 🟠 Significant — fleet-wide prune; latent until ArgoCD is deployed |
| 5 | Shared GCS state bucket | 🟠 Significant — prod state mutation if backend not overridden |
| 6 | Cloudflare / OVH / Zoho public DNS & email | 🟠 Significant — public, customer-facing, slow to detect |
| 7 | Longhorn backup target | 🟠 Significant — prod backup overwrite during drills |
| 8 | Gitea base_url + CI user | 🟠 Significant — prod CI-secret/user mutation |

Severity legend:

| Emoji | Meaning |
| --- | --- |
| 🔴 | Critical — data-loss or breaking risk confirmed |
| 🟠 | Significant — degraded but working, or a real coupling to watch |
| 🟡 | Minor — worth noting, low impact |
| 🟢 | Healthy — verified safe / no issue found |
| 🔵 | Informational — context, no action implied |
## Open questions
- **Guard enforcement layer.** Should the prod-IP abort live as an Ansible pre-task only, or also as a wrapper script around `tofu`/`ansible-playbook` so the same fence covers both tools uniformly? (Phase 0 decision.)
- **Vault path discipline.** Beyond the unseal-key path, are there other tools (backup scripts, recovery runbook steps) that read a hardcoded `~/.arcodange/cluster-keys.json`? A grep sweep of the lab-root scripts is needed so the sandbox override is complete, not partial.
- **State backend override ergonomics.** Prefix family vs. a separate `arcodange-tf-sandbox` bucket — the bucket option is harder to misconfigure (no shared blast radius at all) but adds a provisioning step. Decide in the PRD isolation boundary.
- **ArgoCD readiness.** Since ArgoCD is currently latent (commented out in `03_cicd`), should it be enabled *first* in the sandbox, so that its prune behaviour is rehearsed before it ever reaches prod?
- **Throwaway DNS zone choice.** Which subdomain/zone and token scope are acceptable for plan-only Cloudflare/OVH tests without touching `arcodange.fr`?
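
The grep sweep raised under "Vault path discipline" could start from a scanner that reports every hardcoded prod marker in a file's text. This is a sketch, not an exhaustive audit: the marker list is taken from the findings above, and the function name and the boundary rule (a marker does not count when it is part of a longer name such as `arcodange-tf-sandbox`) are assumptions.

```python
import re

# Prod identifiers cited in the findings of this investigation.
PROD_MARKERS = (
    "192.168.1.201",
    "192.168.1.202",
    "192.168.1.203",
    "vault.arcodange.lab",
    "gitea.arcodange.lab",
    "~/.arcodange/cluster-keys.json",
    "arcodange-tf",
    "arcodange-backup",
)

# Reject matches embedded in a longer token (word chars, dots, hyphens).
_PATTERNS = {
    marker: re.compile(r"(?<![\w.-])" + re.escape(marker) + r"(?![\w-])")
    for marker in PROD_MARKERS
}


def find_prod_markers(text: str) -> list[tuple[int, str]]:
    """Return (line_number, marker) pairs for each prod marker in `text`."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for marker, pattern in _PATTERNS.items():
            if pattern.search(line):
                hits.append((number, marker))
    return hits
```

Running this over the lab-root scripts and the IaC directories would give a first list of places where a sandbox override is still missing.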
## References
- [ADR-0001 · Safe prod-like environment](../ADR/0001-safe-prod-like-environment.md) — the decision this investigation supports.
- [PRD · Safe prod-like environment (hub)](../PRD/safe-prod-like-environment/README.md) and its [Isolation boundary leaf](../PRD/safe-prod-like-environment/isolation-boundary.md) — where each coupling's control is specified in full.
- [Lab ecosystem guidebook](../guidebooks/lab-ecosystem/README.md) — background on the prod topology and the `<app>` join key.
- [New-web-app conventions (`<app>` join key)](../../doc/runbooks/new-web-app/conventions.md) — why one identifier keys repo, DB, Vault, namespace, and DNS together.
- [Longhorn PVC recovery ADR](../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md) — engine-ID re-association relevant to the backup-target and restore-drill coupling.
- `CLUSTER_RECOVERY.md` (at the lab root, outside this repo) — the tested power-cut recovery runbook; the unseal-key path and Vault-seal recovery referenced in findings 2 and 7 come from it.