From 7647a68cdc1e0277cd4572af0e0d8ead89df8f2c Mon Sep 17 00:00:00 2001 From: Gabriel Radureau Date: Tue, 23 Jun 2026 11:52:37 +0200 Subject: [PATCH] docs(vibe): bootstrap vibe/ knowledge tree + ecosystem AGENTS.md MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Add a root AGENTS.md (ecosystem map of factory/tools/cms + agent operating rules + the persona cohort & workflow) and a new vibe/ knowledge base for LLM agents, modeled on tree-docs conventions and the factory house style. vibe/ folders (each with a README hub + contribution rules): - ADR/ optimized MADR-lite; canonical home going forward (doc/adr stays historical) - PRD/ one subfolder per PRD, mandatory STATUS.md, QA strategy for big ones - investigations/ single INV-NNN-slug.md, or stub + folder w/ notebooks - guidebooks/ tree-docs maps; lab-ecosystem guidebook of factory+tools+cms - runbooks/ [AGENT]/[HUMAN] step procedures (EN; doc/runbooks stays FR) - shareouts/ dated FR handouts (decks/mp4) Seed content (first ADR + PRD): a safe, production-like environment to rehearse risky changes and recovery without touching real prod — local-only sandbox (k3d + arm64 VMs) with a hard prod/sandbox isolation boundary. Includes INV-001 (prod blast-radius couplings), the ecosystem guidebook, and a FR shareout. Conventions enforced: no-tombstone rule, breadcrumb spine, bidirectional cross-links, theme:base mermaid (MCP-validated) + ordered-list-after-diagram. Built with a Workflow + persona cohort; 24 files, zero dead links. 
Co-Authored-By: Claude Opus 4.8 --- AGENTS.md | 131 +++++++++++ vibe/ADR/0001-safe-prod-like-environment.md | 84 +++++++ vibe/ADR/README.md | 45 ++++ vibe/ADR/_template.md | 41 ++++ vibe/PRD/README.md | 34 +++ vibe/PRD/safe-prod-like-environment/README.md | 69 ++++++ vibe/PRD/safe-prod-like-environment/STATUS.md | 52 +++++ .../isolation-boundary.md | 31 +++ .../safe-prod-like-environment/qa-strategy.md | 49 +++++ vibe/README.md | 50 +++++ vibe/guidebooks/README.md | 47 ++++ vibe/guidebooks/lab-ecosystem/01-factory.md | 122 +++++++++++ vibe/guidebooks/lab-ecosystem/02-tools.md | 76 +++++++ vibe/guidebooks/lab-ecosystem/03-cms.md | 82 +++++++ vibe/guidebooks/lab-ecosystem/README.md | 116 ++++++++++ .../lab-ecosystem/naming-conventions.md | 96 ++++++++ .../lab-ecosystem/secrets-and-vault.md | 110 ++++++++++ .../lab-ecosystem/storage-and-recovery.md | 76 +++++++ .../INV-001-prod-blast-radius-couplings.md | 206 ++++++++++++++++++ vibe/investigations/README.md | 45 ++++ vibe/investigations/_template.md | 66 ++++++ vibe/runbooks/README.md | 60 +++++ vibe/runbooks/_template.md | 80 +++++++ .../2026-06-23-vibe-and-safe-env/README.md | 54 +++++ vibe/shareouts/README.md | 56 +++++ 25 files changed, 1878 insertions(+) create mode 100644 AGENTS.md create mode 100644 vibe/ADR/0001-safe-prod-like-environment.md create mode 100644 vibe/ADR/README.md create mode 100644 vibe/ADR/_template.md create mode 100644 vibe/PRD/README.md create mode 100644 vibe/PRD/safe-prod-like-environment/README.md create mode 100644 vibe/PRD/safe-prod-like-environment/STATUS.md create mode 100644 vibe/PRD/safe-prod-like-environment/isolation-boundary.md create mode 100644 vibe/PRD/safe-prod-like-environment/qa-strategy.md create mode 100644 vibe/README.md create mode 100644 vibe/guidebooks/README.md create mode 100644 vibe/guidebooks/lab-ecosystem/01-factory.md create mode 100644 vibe/guidebooks/lab-ecosystem/02-tools.md create mode 100644 vibe/guidebooks/lab-ecosystem/03-cms.md create mode 100644 
vibe/guidebooks/lab-ecosystem/README.md create mode 100644 vibe/guidebooks/lab-ecosystem/naming-conventions.md create mode 100644 vibe/guidebooks/lab-ecosystem/secrets-and-vault.md create mode 100644 vibe/guidebooks/lab-ecosystem/storage-and-recovery.md create mode 100644 vibe/investigations/INV-001-prod-blast-radius-couplings.md create mode 100644 vibe/investigations/README.md create mode 100644 vibe/investigations/_template.md create mode 100644 vibe/runbooks/README.md create mode 100644 vibe/runbooks/_template.md create mode 100644 vibe/shareouts/2026-06-23-vibe-and-safe-env/README.md create mode 100644 vibe/shareouts/README.md diff --git a/AGENTS.md b/AGENTS.md new file mode 100644 index 0000000..1077e18 --- /dev/null +++ b/AGENTS.md @@ -0,0 +1,131 @@ +# Arcodange Lab — Agent Guide & Operating Rules + +The Arcodange lab is a self-hosted home/company platform running on three Raspberry Pi (pi1/pi2/pi3) behind a home Livebox, driven from a MacBook Pro M4 control node. **factory** is the cornerstone admin repo: it provisions the cluster (k3s), defines what gets deployed (ArgoCD app-of-apps), manages cloud/forge state (OpenTofu), and provisions databases (PostgreSQL). Two sibling repos carry workloads: **tools** (platform services — Vault, Prometheus, Grafana, CrowdSec, poolers) and **cms** (the public Nuxt site `arcodange.fr`). Everything is deployed into k3s namespaces via ArgoCD, every secret comes from Vault, and public traffic enters through a Cloudflared Zero-Trust tunnel into the internal Traefik. + +```mermaid +%%{init: {'theme':'base'}}%% +flowchart TB + subgraph control["Control node (MacBook Pro M4)"] + ansible["Ansible"]:::proc + tofu["OpenTofu"]:::proc + end + + factory["factory repo
<br/>orchestrator: ansible + argocd + iac + postgres + doc"]:::src + tools["tools repo<br/>Vault, Prometheus, Grafana, CrowdSec, poolers"]:::src + cms["cms repo
Nuxt site arcodange.fr"]:::src + + subgraph cluster["k3s cluster (pi1 server, pi2/pi3 agents)"] + argocd["ArgoCD app-of-apps"]:::proc + vault["Vault + VSO"]:::store + nsapps["namespaces: tools / cms / webapp / erp / ..."]:::proc + traefik["internal Traefik"]:::proc + end + + cflared["Cloudflared Zero-Trust tunnel"]:::proc + public["public: *.arcodange.fr"]:::store + + ansible -- "provision k3s + base" --> cluster + tofu -- "state in GCS" --> factory + factory -- "defines Applications" --> argocd + argocd -- "deploy charts" --> tools + argocd -- "deploy charts" --> cms + tools --> nsapps + cms --> nsapps + vault -- "inject secrets" --> nsapps + nsapps --> traefik + cflared -- "ingress" --> traefik + public --> cflared + + classDef src fill:#2563eb,stroke:#1e40af,color:#fff + classDef proc fill:#059669,stroke:#047857,color:#fff + classDef store fill:#7c3aed,stroke:#6d28d9,color:#fff +``` + +1. The **control node** (MacBook Pro M4) runs Ansible and OpenTofu — the two hands that build everything. +2. **Ansible** provisions the **k3s cluster** (pi1 server, pi2/pi3 agents) and its base layer. +3. **OpenTofu** manages forge/cloud state, persisted in **GCS** (`gs://arcodange-tf`), serving the **factory** repo's intent. +4. **factory** defines the **ArgoCD app-of-apps**, which is the single deployment authority. +5. **ArgoCD** deploys the Helm charts of **tools** and **cms** into per-app k3s **namespaces**. +6. **Vault** (with the Vault Secrets Operator) injects secrets into those namespace pods — no secrets live in git. +7. Workload pods route through the **internal Traefik**. +8. The **Cloudflared Zero-Trust tunnel** is the only public ingress, forwarding `*.arcodange.fr` traffic into internal Traefik. 
+ +--- + +## Repos at a glance + +| Repo | Purpose | Key dirs | How deployed | +|---|---|---|---| +| **factory** | Cornerstone admin repo: provisions the cluster, defines deployments, owns infra state and DBs, holds canonical docs | [ansible/](ansible/) · [argocd/](argocd/) · [iac/](iac/) · [postgres/](postgres/) · [doc/](doc/) | Ansible (cluster + base) + OpenTofu (forge/cloud/PG state); ArgoCD reads its app-of-apps | +| **tools** | Platform services in the `tools` namespace | [hashicorp-vault](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/hashicorp-vault), [prometheus](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/prometheus), [grafana](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/grafana), [crowdsec](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/crowdsec), [pgbouncer](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/pgbouncer) | Helm/Kustomize charts deployed via ArgoCD | +| **cms** | Public Nuxt static site `arcodange.fr` + its zone/email IaC | [chart](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/chart), [cloudflare](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/cloudflare), [zoho](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/zoho) | Helm chart via ArgoCD; Cloudflare/Zoho via OpenTofu | + +Self-hosted Gitea is at `gitea.arcodange.lab` (org `arcodange-org`). **Pick the forge tool from the remote**: these repos live on Gitea, so use the `mcp__gitea__*` MCP tools for PRs/issues/releases — `gh` will silently fail. + +## The `` join key + +One kebab-case identifier — `` — is reused **identically** across the Gitea repo, the PG database + `_role`, Vault (`postgres/creds/`, k8s auth role ``, policies `` / `-ops`, `gitea_cicd_`), the k8s namespace + ServiceAccount, the ArgoCD Application, the GCS state prefix `/main`, and DNS (`.arcodange.lab` / `.arcodange.fr`). 
Bricks wire together **by name convention, not explicit config**, so a single typo breaks the chain silently. Source of truth: [doc/runbooks/new-web-app/conventions.md](doc/runbooks/new-web-app/conventions.md); concept page (going forward): [vibe/guidebooks/lab-ecosystem/naming-conventions.md](vibe/guidebooks/lab-ecosystem/naming-conventions.md). + +## Where knowledge lives + +Start at the knowledge-base front door: [vibe/README.md](vibe/README.md). The six `vibe/` folders: + +- [vibe/ADR/](vibe/ADR/README.md) — architecture decision records (the *why*); canonical home going forward. +- [vibe/PRD/](vibe/PRD/README.md) — product/project requirement docs, each with a mandatory `STATUS.md`. +- [vibe/investigations/](vibe/investigations/README.md) — numbered investigations (`INV-NNN-slug`), with notebooks when data-heavy. +- [vibe/guidebooks/](vibe/guidebooks/README.md) — tree-docs that map the lab's components (the *how it fits together*). +- [vibe/runbooks/](vibe/runbooks/README.md) — step-by-step operational procedures with `[AGENT]` / `[HUMAN]` markers. +- [vibe/shareouts/](vibe/shareouts/README.md) — handouts and presentations (FRENCH; the one exception to the English rule). + +Historical infra docs still live under [doc/](doc/) (ADRs, the new-web-app runbook) — see also `CLUSTER_RECOVERY.md` (at the lab root, **outside** this repo) for tested power-cut recovery. + +## Operating rules for agents + +### No-tombstone rule (FOREMOST) +Write every file as **currently true**. NEVER leave "Correction (date): …", "previously X, now Y", changelogs, or "updated to …" notes — git history is the audit trail. The **only** allowed exception is a forward-looking `> [!CAUTION]` about a live operational risk. + +### Mermaid preferences +Begin each block with an init directive selecting `theme base` (or `forest`). Define a `classDef` palette legible on both light and dark backgrounds (dark fills + light text), e.g. `classDef src fill:#2563eb,stroke:#1e40af,color:#fff`. Use HTML `
` for line breaks (never `\n`). Put a leading space before any label starting with a slash, and escape angle brackets inside labels. **Validate every diagram with the Mermaid MCP** before committing, and **immediately after each diagram add a numbered ordered list restating the same flow in words**. + +### Tree-docs (guidebooks & big PRDs) +Guidebooks and large PRDs are written as navigable trees: every file's **first line is a breadcrumb** (ancestors are relative links, current page is bold-unlinked, separator ` > `); **every folder has a `README.md` index hub** (a table of its children — link + one-line summary + status — sorted by importance/sequence, not alphabetically); **cross-references are bidirectional** (if A links B, B links A); use numbered file prefixes only for ordered narratives; **stamp `Last Updated: 2026-06-23` at each tree root**. + +### Optimized ADR format (MADR-lite) +Sections: **Context / Decision / Consequences / Alternatives / QA & validation / References**. Once an ADR is **Accepted, the body is immutable** — only the status field mutates (Proposed → Accepted → Superseded). The canonical home going forward is [vibe/ADR/](vibe/ADR/README.md); [doc/adr/](doc/adr/README.md) stays as the historical record. + +### Investigations +Prefer a **single `INV-NNN-slug.md`** when the finding fits in one file. When data-heavy, write a short stub `.md` beside a same-named folder containing notebooks (`.ipynb` + paired `.py`), a `_data/` dir, and a plain-language `notebook_simple.md` (visuals anyone can read). + +### PRD convention +**One subfolder per PRD.** A `STATUS.md` is **mandatory** and must be updated whenever something ships. Big PRDs use tree-docs and **must detail a QA strategy**. The flow is Problem → … → QA strategy → STATUS.md. + +### PR crosslinking (bidirectional) +**Every PR body references the ADR/PRD it advances.** In return, the ADR's **References** section and the PRD's **STATUS.md** link back to that PR. 
Links must be bidirectional — never one-way. + +### Guidebook maintenance +Altering a component that is documented in `guidebooks/` **requires updating that guidebook page in the same change**. A code/infra change that leaves its guidebook stale is incomplete. + +### Language policy +**English** for everything in `vibe/` and for `AGENTS.md`/`CLAUDE.md` (this tree is for LLM agents). The single exception: **shareouts handouts are FRENCH**. + +## The cohort + workflow + +These personas are spawned via the Agent tool or a Workflow; document and reuse them across sessions: + +| Persona | Role | +|---|---| +| **Lab Cartographer** | Explores the three repos and maps them into guidebooks (tree-docs). Read-mostly; never edits infra. | +| **ADR Scribe** | Writes optimized MADR-lite ADRs; enforces immutability (status-only mutation) and PR crosslinks. | +| **PRD Architect** | Writes PRDs (Problem → … → QA strategy → STATUS.md); uses tree-docs for big ones. | +| **Runbook Engineer** | Writes step-by-step runbooks with `[AGENT]` (read-only, safe) and `[HUMAN]` (prod-mutating, needs approval) markers. | +| **Investigator** | Writes investigations: a single `INV-NNN-slug.md` when possible, else a stub `.md` beside a same-named folder with `.ipynb` + paired `.py`, a `_data/` dir, and `notebook_simple.md`. | +| **Diagram Smith** | Authors and validates every mermaid diagram with the Mermaid MCP validator; enforces the mermaid preferences + the ordered-list-after-diagram rule. | +| **Continuity Warden** | The adversarial reviewer: checks the no-tombstone rule, breadcrumb depth, bidirectional links, dead links, naming, STATUS/PR crosslinks, and the Last Updated stamp. | + +**Recommended workflow for substantial `vibe/` contributions:** + +1. **Scaffold** — folders + README hubs + templates. +2. **Author** — personas in parallel, each on distinct files. +3. **Validate** — Diagram Smith runs the Mermaid MCP on every diagram. +4. 
**Review** — Continuity Warden runs the adversarial checklist. +5. **Assemble** — wire cross-refs, run the dead-link self-test, stamp `Last Updated`. diff --git a/vibe/ADR/0001-safe-prod-like-environment.md b/vibe/ADR/0001-safe-prod-like-environment.md new file mode 100644 index 0000000..1d587db --- /dev/null +++ b/vibe/ADR/0001-safe-prod-like-environment.md @@ -0,0 +1,84 @@ +[vibe](../README.md) > [ADR](README.md) > **0001 · Safe, production-like environment** + +# ADR-0001: Safe, production-like environment for the lab + +> **Status**: Accepted +> **Date**: 2026-06-23 +> **Deciders**: @arcodange + +## Context + +The Arcodange lab doubles as company production. The same three Raspberry Pis and one MacBook control node run the public CMS (`arcodange.fr`), Zoho-backed business email, the Dolibarr ERP holding accounting and business records, the url-shortener, the telegram-gateway, and the dance-lessons-coach. A single operator administers all of it from one laptop holding the kubeconfig, the Vault root token, and every cloud admin token. + +There is no separation between *where I experiment* and *where the business runs*. Every risky change is tested directly in production. The 2026-04-13 power-cut proved recovery is manual, multi-step, and only ever validated by a real incident — never rehearsed. + +The danger is concentrated in a handful of change-classes. Each one can cause silent, fleet-wide, or data-losing damage when applied to the live environment: + +| Change-class | Blast radius if wrong | +| --- | --- | +| Ansible playbook edits | Can wipe disks, reset k3s, or corrupt Longhorn across the fleet. | +| Vault policy / auth / mount changes | Lock out the Vault Secrets Operator → fleet-wide secret outage; a botched init could overwrite the single unseal key. | +| Postgres migrations / role changes | The superuser provider on `192.168.1.202` can drop or alter live databases → ERP data loss. 
| +| ArgoCD sync / app-of-apps changes | `prune` + `selfHeal` auto-prunes live resources fleet-wide. | +| Cloudflare / DNS / email changes | A wrong MX/SPF/DKIM silently breaks `arcodange.fr` mail for days. | +| Longhorn / storage ops | Volume recreation orphans replicas via new engine IDs. | +| Recovery drills | The runbook is only validated by real incidents, never rehearsed. | +| Cert / PKI re-init | Rotates the internal CA, invalidating every issued `*.arcodange.lab` cert. | + +A change-management process is not enough: the operator needs a place to *make the mistake first*, where the mistake cannot reach production. + +## Decision + +We will build a **local-only safe environment on the MacBook control node**, seeded from the *same* GitOps repos via a dedicated sandbox inventory, with two modes: + +- **(a) k3d single-node "fast inner loop"** (~60s bring-up) for app, Vault, and ArgoCD iteration. +- **(b) Three arm64 VMs** (multipass or Vagrant on the M4) reproducing the three-node topology — Postgres + Gitea as docker-compose *outside* k3s on the "pi2-equivalent" VM, Longhorn across the three VM disks — for Ansible, Longhorn, and recovery work. + +The load-bearing requirement is the **isolation boundary**: the sandbox must be *unable* to mutate real production even on a wrong command. Each production coupling maps to a concrete guardrail — a separate sandbox inventory with a prod-IP abort guard, sandbox GCS state prefixes, a separate sandbox Vault with its own unseal-key path, a sandbox Postgres host check, and plan-only DNS against a throwaway zone. The real `arcodange.fr` Cloudflare/Zoho tokens are never exported into a sandbox shell. Because the `` convention keys everything *within* a cluster/Vault/DB, the sandbox reuses identical `` names with no collision — the boundary is the cluster + Vault + state + DNS zone, not the names, so runbooks read identically in both environments. 
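One of the guardrails named in this Decision, that the real Cloudflare/Zoho tokens are never exported into a sandbox shell, could be checked mechanically before bring-up. The following is a minimal Python sketch under stated assumptions: the variable names `CLOUDFLARE_API_TOKEN` and `ZOHO_API_TOKEN` and the helper `leaked_tokens` are hypothetical placeholders, not names confirmed by this ADR.

```python
# Hypothetical names: this ADR does not say which environment variables
# carry the prod credentials, so the operator would supply the real list.
FORBIDDEN_IN_SANDBOX = ("CLOUDFLARE_API_TOKEN", "ZOHO_API_TOKEN")


def leaked_tokens(env, forbidden=FORBIDDEN_IN_SANDBOX):
    """Return the forbidden variable names that are set and non-empty in env.

    env is any mapping of variable name to value (for example os.environ).
    An empty list means the shell looks safe for sandbox work.
    """
    return sorted(name for name in forbidden if env.get(name))
```

A bring-up script might call `leaked_tokens(os.environ)` and abort when the result is non-empty. This only catches tokens passed through the environment. Credentials stored in files would need a separate check.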
+ +See the [isolation-boundary leaf](../PRD/safe-prod-like-environment/isolation-boundary.md) for the full coupling→control mapping, and the [PRD](../PRD/safe-prod-like-environment/README.md) for the complete product view. + +## Consequences + +- **+** $0 inner loop that runs on the existing control node — no new hardware. +- **+** Rehearses every dangerous change-class except real public DNS/email. +- **+** The 2026-04-13 recovery sequence becomes a repeatable drill instead of a once-per-incident gamble. +- **+** Identical `` names mean runbooks are environment-agnostic. +- **−** x86/ARM nuance must be handled (use arm64 VMs/images on the M4). +- **−** New guardrail and parity-manifest maintenance burden. +- **−** Single-laptop resource limits — k3d for speed, VMs only when multi-node fidelity is actually needed. +- **→** Real public DNS/ACME and physical-ARM always-on testing remain unsolved by design; revisit only if recurring game-days demand them. + +## Alternatives considered + +| Option | Fidelity | Isolation | $ / effort | Verdict | +| --- | --- | --- | --- | --- | +| 1 · Ephemeral local cluster (k3d/kind) | Medium (single-node) | Full (separate cluster) | $0 / low | ✅ **Chosen** as the fast mode. | +| 2 · Three arm64 VMs reproducing the topology | High (3-node, PG+Gitea outside k3s, Longhorn) | Full (separate VMs) | $0 / medium | ✅ **Chosen** for fidelity. | +| 3 · Sandbox namespace on the real cluster | High | None — shared Vault/PG/Longhorn/ArgoCD | $0 / low | ❌ **Rejected**: shared blast radius fails the core isolation requirement. | +| 4 · Dedicated physical node (4th Pi / mini-PC) | High (real ARM, always-on) | Full | $$ hardware / medium | ⛔ **Out of scope**: hardware cost; revisit only for recurring always-on ARM game-days. | +| 5 · Disposable cloud k3s for real public DNS/ACME | High infra, but arch drift | Full | $ recurring / medium | ⛔ **Out of scope**: cost + ARM drift; its only unique value is real DNS/email, which we explicitly do not test. 
| + +## QA & validation + +- **Parity manifest** — k3s `v1.34.3+k3s1`, same Longhorn/Vault/VSO, same app-of-apps list, PG + Gitea outside k3s, three nodes. Any drift from this manifest is a failure. +- **Provisioning-parity test** — run the new-web-app runbook for a throwaway `` "canary" and assert the convention chain resolves end to end: Gitea repo → PG db + role → Vault creds + policies → ArgoCD Healthy/Synced → VSO injects → pod Running. +- **Idempotence gate** — `ansible-playbook --check --diff` reports `changed=0` on the converged sandbox *before* any change is promoted to prod. +- **`tofu plan` diff gate** — plan against sandbox state; for DNS, assert it touches *only* the throwaway zone. +- **Chaos drills** (mapped to CLUSTER_RECOVERY.md sections): node-kill; Vault-seal (unseal via the *sandbox* key; VSO re-auths); Longhorn volume corruption (run `recover/longhorn*.yml`, validate engine-ID re-association per the [Longhorn PVC recovery ADR](../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md)); DB drop/restore; full power-cut simulation (execute CLUSTER_RECOVERY.md top-to-bottom against the sandbox to green); ArgoCD bad-sync (observe prune/selfHeal, author a rollback runbook section); cert/PKI re-issue. +- **Monthly game-day** — the operator follows *only* the runbook; any improvised step becomes a runbook PR. This is how CLUSTER_RECOVERY.md gets validated against the sandbox instead of waiting for the next real incident. +- **Promotion gate** — no infra/Vault/storage/DNS change reaches prod until it has been applied to the sandbox *and* survived the matching drill. Each drill records time-to-recover and which step failed or was improvised. + +See the [QA strategy leaf](../PRD/safe-prod-like-environment/qa-strategy.md) for the detailed drill table and evidence trail. + +## References + +- [PRD · Safe, production-like environment](../PRD/safe-prod-like-environment/README.md) — full product view, requirements, and phased rollout. 
+- [INV-001 · Prod blast-radius couplings](../investigations/INV-001-prod-blast-radius-couplings.md) — the investigation that mapped every prod coupling. +- [Guidebook · Lab ecosystem](../guidebooks/lab-ecosystem/README.md) — the end-to-end map of the prod topology this sandbox mirrors. +- [Guidebook · Storage and recovery](../guidebooks/lab-ecosystem/storage-and-recovery.md) — how Longhorn and the recovery sequence work today. +- [doc/adr index](../../doc/adr/README.md) — foundational infrastructure ADRs (read-only history). +- [Longhorn PVC recovery ADR](../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md) — engine-ID re-association, exercised by the Longhorn chaos drill. +- [new-web-app conventions](../../doc/runbooks/new-web-app/conventions.md) — the `` convention reused identically across sandbox and prod. +- CLUSTER_RECOVERY.md — the tested power-cut recovery sequence, lives at the lab root (outside this repo); the chaos drills rehearse it section by section. +- PRs: to be backfilled on PR open. diff --git a/vibe/ADR/README.md b/vibe/ADR/README.md new file mode 100644 index 0000000..e2378ec --- /dev/null +++ b/vibe/ADR/README.md @@ -0,0 +1,45 @@ +[vibe](../README.md) > **ADR** + +# Architecture Decision Records + +> **Status**: 🟢 Active +> **Last Updated**: 2026-06-23 +> **Related**: [vibe/PRD](../PRD/README.md) · [vibe/Investigations](../investigations/README.md) +> **Historical**: [doc/adr](../../doc/adr/README.md) (foundational infra) · [ansible/.../docs/adr](../../ansible/arcodange/factory/docs/adr/) (dated infra ADRs) + +`vibe/ADR/` is the **canonical home for Architecture Decision Records going forward**. The format is MADR-lite: one short, self-contained Markdown file per decision, focused on the *why* rather than the *how*. Use the [`_template.md`](_template.md) skeleton to start a new one. + +## Where ADRs live + +There are three ADR locations in this repo. 
Only the first accepts new records; the other two are read-only history kept for context. + +| Location | Role | Accepts new ADRs? | +| --- | --- | --- | +| `vibe/ADR/` (this folder) | Canonical, MADR-lite, going forward | ✅ Yes | +| [`doc/adr/`](../../doc/adr/README.md) | Foundational infrastructure ADRs (DNS, k3s, CI/CD, Vault, telegram-gateway auth) | ❌ Historical | +| [`ansible/arcodange/factory/docs/adr/`](../../ansible/arcodange/factory/docs/adr/) | Dated infra ADRs (network, CI/CD, Longhorn PVC recovery, internal DNS) | ❌ Historical | + +When a new decision *supersedes* one of the historical records, write the new ADR here, set the old one's status note to `Superseded by ADR-NNNN`, and cross-link both ways. + +## Rules + +- **One file per decision**, named `NNNN-kebab-title.md` (zero-padded sequence, e.g. `0001-safe-prod-like-environment.md`). +- **The body is immutable once `Accepted`.** A decision is a historical fact: do not rewrite the Context/Decision/Consequences after acceptance. The *only* mutation allowed on an accepted ADR is its **status** (e.g. flipping `Accepted` → `Superseded`). +- **Statuses**: `Proposed` (under discussion) → `Accepted` (decided, body frozen) → `Superseded` (replaced; points to the successor ADR). A `Proposed` ADR may still be edited freely. +- **No-tombstone rule.** Each file reads as currently true. Never leave "previously X, now Y", changelog lines, or "updated to ..." notes inside an ADR — git history is the audit trail. A superseded ADR keeps its original frozen body; the supersession is recorded only in its status line and the successor's References. +- **PR cross-link both ways.** The ADR References section links the PR that introduced it; the PR description links back to the ADR. Keep links bidirectional. + +## Index + +| # | Title | Status | Date | +| --- | --- | --- | --- | +| [0001](0001-safe-prod-like-environment.md) | Safe, production-like environment | 🟢 Accepted | 2026-06-23 | + +## Rules to contribute + +1. 
Copy [`_template.md`](_template.md) to `NNNN-kebab-title.md` using the next free sequence number and delete the top HTML-comment note. +2. Fill in the blockquote (Status/Date/Deciders), then Context, Decision, Consequences, Alternatives considered, QA & validation, References. +3. Open the ADR with status `Proposed`. Flip it to `Accepted` once the decision is settled — and from that point treat the body as frozen. +4. Add a row to the Index table above (newest at the bottom to preserve chronological numbering). +5. In the PR that lands the ADR, link to the ADR file; in the ADR's References, link back to the PR. Bidirectional links are mandatory. +6. If this ADR supersedes a historical one in `doc/adr/` or the Ansible ADR folder, update the old record's status note and cross-reference both directions. diff --git a/vibe/ADR/_template.md b/vibe/ADR/_template.md new file mode 100644 index 0000000..20c56fd --- /dev/null +++ b/vibe/ADR/_template.md @@ -0,0 +1,41 @@ +[vibe](../README.md) > [ADR](README.md) > **_template** + + + +# ADR-NNNN: Title + +> **Status**: Proposed | Accepted | Superseded by ADR-NNNN +> **Date**: YYYY-MM-DD +> **Deciders**: name(s) + +## Context + +What forces are at play? Describe the problem, the constraints, and the situation that makes a decision necessary. State facts, not opinions. Keep it short enough that a future reader understands *why* a decision was needed without prior context. + +## Decision + +The decision, stated in the active voice: "We will ...". One clear choice. If the decision has sub-parts, use a short bulleted list. + +## Consequences + +What becomes easier or harder as a result of this decision? + +- **+** A positive outcome / something now enabled. +- **−** A trade-off / cost / new constraint accepted. +- **→** A future follow-up this implies (work deferred, a door left open, a re-evaluation trigger). + +## Alternatives considered + +| Option | Why not | +| --- | --- | +| Alternative A | Reason it was rejected. 
| +| Alternative B | Reason it was rejected. | + +## QA & validation + +How was (or will be) this decision validated? Tests, smoke checks, manual verification, rollback plan, or the criteria that would tell us the decision was wrong. + +## References + +- Link to the PR that introduces this ADR (and ensure the PR links back here). +- Related ADRs, PRDs, investigations, or external docs (descriptive link text, never "here"/"this"). diff --git a/vibe/PRD/README.md b/vibe/PRD/README.md new file mode 100644 index 0000000..7189eb5 --- /dev/null +++ b/vibe/PRD/README.md @@ -0,0 +1,34 @@ +[vibe](../README.md) > **PRD** + +# Product Requirement Documents + +> **Status**: 🟢 Active +> **Last Updated**: 2026-06-23 +> **Related**: [vibe/ADR](../ADR/README.md) · [vibe/Investigations](../investigations/README.md) + +`vibe/PRD/` holds the Product Requirement Documents that drive larger pieces of work in the lab. A PRD captures *what* we want and *why it matters*; the matching ADRs capture *how we decided to build it*, and investigations capture *what we learned* along the way. + +## Convention + +- **One subfolder per PRD**, kebab-case (e.g. `safe-prod-like-environment/`). +- Each subfolder **MUST** contain: + - `README.md` — the PRD hub: problem, goals/non-goals, requirements, success criteria, and a QA strategy. + - `STATUS.md` — the implementation tracker. **Update it whenever something ships** (a PR merges, a brick lands, a milestone closes). It is the living view of "where are we" against the PRD. +- A **big PRD uses tree-docs**: the `README.md` stays a hub and detail lives in leaf pages (each with its own breadcrumb and bidirectional cross-links). A tree-sized PRD **MUST** detail an explicit **QA strategy** — how the delivered work will be verified, and what "done and safe" means. +- **PRs cross-link to the PRD**, and the PRD's `STATUS.md` **cross-links back** to the PRs/ADRs/investigations that realised each part. Links are bidirectional. 
+- **No-tombstone rule** applies: the PRD reads as currently true. Progress lives in `STATUS.md` (which *is* a tracker and may legitimately list shipped items), not as "previously / now" edits scattered through the hub. + +## Index + +| PRD | Hub | Status | +| --- | --- | --- | +| Safe, production-like environment | [safe-prod-like-environment/README.md](safe-prod-like-environment/README.md) | 🟡 In design | + +## Rules to contribute + +1. Create a kebab-case subfolder named for the PRD. +2. Add `README.md` (the hub) and `STATUS.md` (the tracker). Both carry a breadcrumb first line and the leaf header blockquote (Status / Last Updated / Related). +3. In the hub, state the problem, goals and non-goals, requirements, success criteria, and the QA strategy. If the PRD is large, split detail into leaf pages and keep the README as a navigable hub. +4. Keep `STATUS.md` current: every time a piece ships, record it there and link the PR/ADR that delivered it. +5. Add a row to the Index table above. +6. Ensure every PR that implements part of the PRD links to the PRD, and that `STATUS.md` links back. Bidirectional links are mandatory. diff --git a/vibe/PRD/safe-prod-like-environment/README.md b/vibe/PRD/safe-prod-like-environment/README.md new file mode 100644 index 0000000..a821950 --- /dev/null +++ b/vibe/PRD/safe-prod-like-environment/README.md @@ -0,0 +1,69 @@ +[vibe](../../README.md) > [PRD](../README.md) > **Safe, production-like environment** + +# Safe, production-like environment + +> **Status:** In design +> **Last Updated:** 2026-06-23 +> **Design record:** [ADR 0001 — Safe, production-like environment](../../ADR/0001-safe-prod-like-environment.md) +> **Adjacent:** [INV-001 — prod blast-radius couplings](../../investigations/INV-001-prod-blast-radius-couplings.md) +> **Map:** [Lab ecosystem guidebook](../../guidebooks/lab-ecosystem/README.md) + +## Problem + +The lab doubles as company production. 
The same three Raspberry Pis that host experiments also run the public CMS ([arcodange.fr](https://gitea.arcodange.lab/arcodange-org/cms)), Zoho-backed email, the Dolibarr ERP (accounting and business records), the url-shortener, the telegram-gateway, and the dance-lessons-coach. A single operator administers all of it from one MacBook that holds the kubeconfig, the Vault root, and the cloud admin tokens. + +There is no separation between "where I experiment" and "where the business runs": every risky change is currently tested directly in prod. The 2026-04-13 power-cut proved that recovery is manual, non-trivial, and validated only by real incidents. A wrong Ansible play, Vault policy, Postgres migration, ArgoCD sync, or DNS record can silently break the business for days. + +## Users & personas + +A **single operator wearing two hats**: + +- **The inner-loop developer** — iterating on an app, a Helm chart, a Vault policy, or an ArgoCD app. Wants a fast feedback cycle (seconds, not a fleet round-trip) and zero fear of touching the wrong thing. +- **The game-day / recovery operator** — rehearsing dangerous change-classes and disaster recovery (node-kill, Vault-seal, Longhorn corruption, DB drop, full power-cut). Wants high fidelity to the real topology so a drill actually predicts prod behaviour, and a runbook that gets validated against the sandbox instead of waiting for the next outage. + +Both hats belong to the same person on the same laptop. The environment must serve the fast loop and the faithful loop without ever letting either reach into prod. + +## Goals & non-goals + +**Goals** + +- Rehearse **every dangerous change-class** (Ansible, Vault, Postgres, ArgoCD, Cloudflare/DNS, Longhorn, recovery drills, PKI) with **zero prod blast radius**. +- Validate `CLUSTER_RECOVERY.md` against the **sandbox**, not against prod — turn the recovery runbook from incident-validated into drill-validated. 
+- A **fast inner loop** for app / Vault / ArgoCD iteration, seeded from the same GitOps repos as prod. +- Run on the **existing control node** (the MacBook) at **$0** marginal cost. + +**Non-goals** + +- No real `arcodange.fr` DNS or email tests. DNS/email modules run **plan-only against a throwaway zone**; real public DNS/ACME end-to-end is out of scope. +- No **physical-node tier** (4th Pi / mini-PC) — explicitly rejected, not deferred. +- No **cloud tier** for disposable real-DNS clusters — explicitly rejected, not deferred. +- Not a **performance-benchmark** environment — fidelity is about topology and the convention chain, not throughput numbers on laptop-class hardware. + +## Requirements + +Functional: + +- **One-command bring-up**, seeded from the same GitOps repos as prod. +- **Sandbox inventory + guards** — a separate `inventory/sandbox/hosts.yml` plus a pre-task guard that aborts on any prod IP (`192.168.1.201-203`) unless `i_mean_prod=true`. See [Isolation boundary](isolation-boundary.md). +- **Parity manifest** — same k3s `v1.34.3+k3s1`, same Longhorn/Vault/VSO, same app-of-apps list, Postgres + Gitea outside k3s, three nodes. Drift = fail. +- **Two modes** seeded from the same repos via the sandbox inventory: + - **(a) k3d single-node fast inner loop** (~60s bring-up) for app / Vault / ArgoCD iteration. + - **(b) 3 arm64 VMs** (multipass or Vagrant on the M4) reproducing the 3-node topology — Postgres + Gitea as docker-compose outside k3s on the pi2-equivalent VM, Longhorn across the three VM disks — for Ansible / Longhorn / recovery work. + +The **isolation boundary** (the table mapping each prod coupling to its sandbox control) and the **`` naming note** are detailed in [isolation-boundary.md](isolation-boundary.md). 
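
As an illustration of the prod-IP abort guard listed in the requirements above, here is a minimal Python sketch of the decision logic only. The real guard is described as an Ansible pre-task, so this is not its implementation: the names `guard_targets` and `ProdTargetError` are hypothetical, as is passing `i_mean_prod` as a boolean. Only the address range `192.168.1.201-203` and the `i_mean_prod=true` override come from this PRD.

```python
import ipaddress

# Prod Pi addresses named in this PRD (192.168.1.201-203).
PROD_IPS = {ipaddress.ip_address(f"192.168.1.{n}") for n in range(201, 204)}


class ProdTargetError(RuntimeError):
    """Raised when a run targets a prod IP without the explicit override."""


def guard_targets(target_ips, i_mean_prod=False):
    """Return the targets unchanged, or abort if any of them is a prod IP.

    Hypothetical helper: it mirrors the intended guard's rule, not its code.
    """
    hits = sorted(
        str(ip) for ip in map(ipaddress.ip_address, target_ips) if ip in PROD_IPS
    )
    if hits and not i_mean_prod:
        raise ProdTargetError(
            f"refusing to run against prod hosts: {', '.join(hits)}"
        )
    return list(target_ips)
```

Under these assumptions, a target list containing `192.168.1.202` aborts unless the override is set, while a list of sandbox-only addresses passes through untouched.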
+ +## QA strategy + +Fidelity gates (parity manifest, a `canary` provisioning-parity test that drives the new-web-app runbook end to end, an `ansible-playbook --check --diff` changed=0 idempotence gate, and a `tofu plan` diff gate) plus chaos drills mapped section-by-section to `CLUSTER_RECOVERY.md`, a recurring monthly game-day, and a promotion gate: no infra / Vault / storage / DNS change reaches prod until it has been applied to the sandbox **and** survived the matching drill. Full detail in [qa-strategy.md](qa-strategy.md). + +## Implementation status + +Rollout is phased (Phase 0 guardrails → Phase 1 k3d → Phase 2 3-VM → Phase 3 game-day; Phase 4 out of scope). Live tracker: [STATUS.md](STATUS.md). + +## Leaves + +| Page | Summary | Status | +| --- | --- | --- | +| [Isolation boundary](isolation-boundary.md) | Prod-coupling → sandbox-control table; the `` naming note; the token caution. | 🟡 In design | +| [QA strategy](qa-strategy.md) | Fidelity gates, chaos-drill table, game-day cadence, promotion gate, evidence trail. | 🟡 In design | +| [STATUS](STATUS.md) | Phase tracker (all not-started) and PR log. | ⬜ Not started | diff --git a/vibe/PRD/safe-prod-like-environment/STATUS.md b/vibe/PRD/safe-prod-like-environment/STATUS.md new file mode 100644 index 0000000..d345be6 --- /dev/null +++ b/vibe/PRD/safe-prod-like-environment/STATUS.md @@ -0,0 +1,52 @@ +[vibe](../../README.md) > [PRD](../README.md) > [Safe, production-like environment](README.md) > **STATUS** + +# STATUS — Safe, production-like environment + +> **Last Updated:** 2026-06-23 + +Legend: ⬜ not started · 🟡 in progress · ✅ done + +> [!IMPORTANT] +> This file MUST be updated whenever something ships. Every PR that advances a phase crosslinks back here (and the matching checkbox flips), and the [PRs](#prs) table gets a row. 
+ +## Phase 0 — Isolation guardrails + +*Must land before any sandbox run.* + +- [ ] ⬜ Sandbox inventory `inventory/sandbox/hosts.yml` (VM/cloud hosts only) +- [ ] ⬜ Prod-IP abort guard (aborts on `192.168.1.201-203` unless `i_mean_prod=true`) +- [ ] ⬜ Sandbox GCS state prefixes (`sandbox/...`) or `gs://arcodange-tf-sandbox` +- [ ] ⬜ Sandbox Vault unseal-key path (`~/.arcodange/sandbox/cluster-keys.json`) +- [ ] ⬜ Sandbox env profile / plan-only DNS against a throwaway zone + +## Phase 1 — Tier-1 k3d fast mode + +- [ ] ⬜ One-command bring-up seeded from GitOps +- [ ] ⬜ Parity manifest v1 +- [ ] ⬜ Canary provisioning-parity test +- [ ] ⬜ `changed=0` idempotence gate documented + +## Phase 2 — Tier-1 3-VM cluster + +- [ ] ⬜ Three arm64 VMs (multipass / Vagrant on the M4) +- [ ] ⬜ Same `system_k3s`; Postgres + Gitea outside k3s on the pi2-equivalent VM +- [ ] ⬜ Longhorn across the three VM disks +- [ ] ⬜ Chaos drills: node-kill / Vault-seal / DB-drop +- [ ] ⬜ First full `CLUSTER_RECOVERY` dry-run against the sandbox + +## Phase 3 — Game-day operationalization + +- [ ] ⬜ Monthly cadence + promotion gate in the PR checklist +- [ ] ⬜ Longhorn engine-ID drill +- [ ] ⬜ ArgoCD bad-sync rollback runbook +- [ ] ⬜ Evidence trail for ≥1 cycle + +## Phase 4 — out of scope + +Not planned: dedicated physical node (4th Pi / mini-PC) and disposable cloud k3s for real public DNS/ACME. See [ADR 0001](../../ADR/0001-safe-prod-like-environment.md) for the rejected-alternatives rationale. 
+ +## PRs + +| PR | Scope | Phase | Merged | +| --- | --- | --- | --- | +| _pending_ | Bootstrap the PRD tree (this `vibe/` set) — backfilled on open | — | ⬜ | diff --git a/vibe/PRD/safe-prod-like-environment/isolation-boundary.md b/vibe/PRD/safe-prod-like-environment/isolation-boundary.md new file mode 100644 index 0000000..c56b67a --- /dev/null +++ b/vibe/PRD/safe-prod-like-environment/isolation-boundary.md @@ -0,0 +1,31 @@ +[vibe](../../README.md) > [PRD](../README.md) > [Safe, production-like environment](README.md) > **Isolation boundary** + +# Isolation boundary + +> **Status:** In design +> **Last Updated:** 2026-06-23 +> **Upstream:** [Safe, production-like environment](README.md) +> **Related:** [ADR 0001](../../ADR/0001-safe-prod-like-environment.md) · [INV-001 — prod blast-radius couplings](../../investigations/INV-001-prod-blast-radius-couplings.md) + +The isolation boundary is the load-bearing part of this PRD: the sandbox must be **unable to mutate real prod even on a wrong command**. Every prod coupling that a sandbox run could touch is mapped below to a concrete control. The boundary is the **cluster + Vault + state + DNS zone** — not the names (see the naming note). + +## Prod couplings → sandbox controls + +| Prod coupling | What it can break in prod | Sandbox control | +| --- | --- | --- | +| Ansible inventory `hosts.yml` → `192.168.1.201-203` | Wipe disks, reset k3s, corrupt Longhorn on the live Pis. | Separate `inventory/sandbox/hosts.yml` (VM/cloud hosts only) **plus** a pre-task guard that **aborts** if any target IP is in `192.168.1.201-203` unless `i_mean_prod=true` is set explicitly. | +| OpenTofu state in `gs://arcodange-tf` (prefixes) | A sandbox apply rewrites live state and re-plans prod resources. | A sandbox prefix family (`sandbox/factory/main`, `sandbox/tools/...`, `sandbox/factory/postgres`) via a backend-config override, **or** a separate bucket `gs://arcodange-tf-sandbox`. Sandbox runs never touch prod state. 
| +| Gitea provider `base_url` `gitea.arcodange.lab` + ArgoCD `repoURL` / `targetRevision` | Sandbox commits/pushes into the prod forge; ArgoCD syncs sandbox refs onto the prod cluster. | Sandbox Gitea on the sandbox cluster (or org `arcodange-sandbox`); the sandbox app-of-apps points at a **sandbox branch** so the sandbox cluster syncs only sandbox refs. | +| Vault provider `address` `vault.arcodange.lab` + unseal key `~/.arcodange/cluster-keys.json` | Sandbox writes clobber prod policies/auth/mounts; a botched init overwrites the prod unseal key. | A **separate sandbox Vault**; override the unseal-key path to `~/.arcodange/sandbox/cluster-keys.json` so prod's key can never be overwritten. | +| PostgreSQL provider `host` `192.168.1.202` (superuser) | Drop or alter live DBs — including ERP business records. | Sandbox PG is the docker-compose on the sandbox pi2-equivalent; a guard **refuses apply** if `host == 192.168.1.202` and `workspace != prod`. | +| Cloudflare account / OVH `arcodange.fr` / Zoho live mail | A wrong MX/SPF/DKIM silently breaks `arcodange.fr` mail for days. | DNS/email modules run **plan-only** against a throwaway zone/subdomain with a separate token. The real `arcodange.fr` token is **never** exported into a sandbox shell. Real public DNS/ACME is out of scope. | +| Longhorn backup bucket | A restore drill overwrites prod backups. | Sandbox backup target is a **separate bucket/prefix** so restore drills cannot overwrite prod backups. | + +## The `` naming note + +The `` key threads one kebab-case identifier through the Gitea repo, the PG db + role, the Vault paths/policies, the k8s namespace + SA, the ArgoCD Application, the GCS state prefix, and DNS — see [conventions](../../../doc/runbooks/new-web-app/conventions.md). + +Because `` keys everything **within** a cluster / Vault / DB / zone, the sandbox can reuse **identical `` names with no collision**. The isolation boundary is the cluster + Vault + state + DNS zone, not the names. 
This is deliberate: runbooks read **identically** in both environments, so a drill exercises the exact same convention chain an operator runs in prod. + +> [!CAUTION] +> The real `arcodange.fr` Cloudflare token must **never** be exported into a sandbox shell. DNS/email work in the sandbox is plan-only against a throwaway zone with its own separate token. Exporting the prod token into a sandbox session would defeat the entire isolation boundary — a single `tofu apply` could rewrite live public DNS or mail records. diff --git a/vibe/PRD/safe-prod-like-environment/qa-strategy.md b/vibe/PRD/safe-prod-like-environment/qa-strategy.md new file mode 100644 index 0000000..db5c444 --- /dev/null +++ b/vibe/PRD/safe-prod-like-environment/qa-strategy.md @@ -0,0 +1,49 @@ +[vibe](../../README.md) > [PRD](../README.md) > [Safe, production-like environment](README.md) > **QA strategy** + +# QA strategy + +> **Status:** In design +> **Last Updated:** 2026-06-23 +> **Upstream:** [Safe, production-like environment](README.md) +> **Related:** [ADR 0001](../../ADR/0001-safe-prod-like-environment.md) · [Isolation boundary](isolation-boundary.md) · [INV-001](../../investigations/INV-001-prod-blast-radius-couplings.md) + +The sandbox is only useful if a green run there reliably predicts a green run in prod. That requires two things: the sandbox must stay **faithful** to prod (fidelity gates), and dangerous change-classes must be **rehearsed** before they ship (chaos drills + promotion gate). + +## Fidelity gates + +- **Parity manifest** — the sandbox must match prod on: k3s `v1.34.3+k3s1`, the same Longhorn / Vault / VSO versions, the same app-of-apps list, Postgres + Gitea running **outside** k3s, and three nodes. Any drift from this manifest is a failed gate. 
+- **Provisioning-parity canary test** — run the [new-web-app runbook](../../../doc/runbooks/new-web-app/conventions.md) for a throwaway `` named `canary` and assert the convention chain resolves end to end: Gitea repo → PG db + role → Vault creds + policies → ArgoCD `Healthy`/`Synced` → VSO injects → pod `Running`. One typo anywhere in the chain fails this test. +- **Idempotence gate (changed=0)** — `ansible-playbook --check --diff` must report `changed=0` on the converged sandbox **before** the same change is promoted to prod. A non-zero diff means the play is not idempotent and is not ready. +- **`tofu plan` diff gate** — `tofu plan` runs against **sandbox** state; for DNS it must assert the plan touches **only** the throwaway zone. A plan that proposes to touch anything else fails the gate. + +## Chaos drills + +Each drill maps to a section of `CLUSTER_RECOVERY.md`. The drill is "passed" only when the acceptance condition is met by following the runbook — any improvised step becomes a runbook PR. + +| Drill | Action | Acceptance | Recovery section | +| --- | --- | --- | --- | +| Node-kill | Stop one sandbox VM (agent, then server). | Workloads reschedule; cluster returns to `Ready`/`Healthy`. | Node-loss / reschedule section. | +| Vault-seal | Seal the **sandbox** Vault. | Unseal via the **sandbox** key (`~/.arcodange/sandbox/cluster-keys.json`); VSO re-authenticates and resumes secret injection. | Vault unseal + VSO re-auth section. | +| Longhorn volume corruption | Corrupt/recreate a sandbox volume. | Run `recover/longhorn*.yml`; validate engine-ID re-association per the [Longhorn PVC recovery ADR](../../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md). | Longhorn restore section. | +| DB drop/restore | Drop a sandbox DB. | Restore from the sandbox backup bucket; app reconnects, data intact. | Postgres restore section. | +| Full power-cut simulation | Cold-stop all three sandbox VMs. 
| Execute `CLUSTER_RECOVERY.md` top-to-bottom against the sandbox to green (Longhorn restore → Vault unseal → VSO re-auth → ERP scaled up last). | Whole runbook. | +| ArgoCD bad-sync | Push a deliberately broken sandbox ref. | Observe `prune`/`selfHeal`; author/validate a rollback runbook section. | ArgoCD rollback (to be authored). | +| Cert/PKI re-issue | Re-init the sandbox Step-CA. | Internal `*.arcodange.lab` (sandbox) certs re-issue and chains validate. | PKI re-issue section. | + +1. **Node-kill** stops a sandbox VM and confirms workloads reschedule and the cluster returns to a healthy state. +2. **Vault-seal** seals the sandbox Vault, then unseals it with the **sandbox** key and confirms VSO re-authenticates and resumes injecting secrets. +3. **Longhorn corruption** corrupts a sandbox volume, runs the `recover/longhorn*.yml` playbooks, and validates engine-ID re-association against the Longhorn PVC recovery ADR. +4. **DB drop/restore** drops a sandbox database and restores it from the sandbox backup bucket, confirming the app reconnects with data intact. +5. **Full power-cut** cold-stops all three sandbox VMs and runs `CLUSTER_RECOVERY.md` end to end to green, with the ERP scaled up last. +6. **ArgoCD bad-sync** pushes a broken sandbox ref, observes `prune`/`selfHeal` behaviour, and produces a rollback runbook section. +7. **Cert/PKI re-issue** re-initialises the sandbox Step-CA and confirms internal certificates re-issue and validate. + +## Recurring game-day & promotion gate + +A **recurring monthly game-day** where the operator follows **only** the runbook. Any improvised step becomes a runbook PR — this is how `CLUSTER_RECOVERY.md` gets validated against the sandbox instead of waiting for the next real incident. + +**Promotion gate:** no infra / Vault / storage / DNS change reaches prod until it has been applied to the sandbox **and** survived the matching drill. This gate belongs in the PR checklist and crosslinks to [STATUS.md](STATUS.md). 
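
The promotion gate above can be read as a simple predicate over drill outcomes. The sketch below is illustrative only: `DrillRecord`, its fields, and `may_promote` are hypothetical names rather than a defined schema, and the mapping from a change to its matching drill is assumed to be supplied by the operator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrillRecord:
    """One drill outcome; field names are illustrative, not a defined schema."""
    drill: str    # e.g. "vault-seal"
    passed: bool  # acceptance condition met by following the runbook


def may_promote(applied_to_sandbox, matching_drill, records):
    """True only if the change ran in the sandbox and its matching drill passed."""
    if not applied_to_sandbox:
        return False
    return any(r.drill == matching_drill and r.passed for r in records)
```

A change with no recorded drill, or with only a failed one, stays blocked. That matches the rule that both conditions must hold before anything reaches prod.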
+ +## Evidence trail + +Each drill records **time-to-recover** and **which step failed or was improvised**. The evidence trail accumulates over at least one game-day cycle and is the proof that a change-class is rehearsed and that the recovery runbook is current. Failed/improvised steps feed directly back into runbook PRs. diff --git a/vibe/README.md b/vibe/README.md new file mode 100644 index 0000000..e33365c --- /dev/null +++ b/vibe/README.md @@ -0,0 +1,50 @@ +# vibe/ — Arcodange Knowledge Base + +You-are-here: the **root** of the `vibe/` knowledge tree — the front door for every doc agents write and read. +Up: [factory](../README.md) / [AGENTS.md](../AGENTS.md) + +> **Status:** Active +> **Last Updated:** 2026-06-23 + +## What is `vibe/`? + +`vibe/` is the knowledge base dedicated to **LLM agents** working on the Arcodange lab. It collects the *why* (ADRs), the *what/when* (PRDs), the *what-we-found* (investigations), the *how-it-fits-together* (guidebooks), the *how-to-do-it* (runbooks), and the *what-we-told-humans* (shareouts). Everything here is written in **English** — the single exception is **shareouts handouts, which are FRENCH**. Operating rules (no-tombstone, mermaid prefs, tree-docs, ADR/PRD/investigation conventions, PR crosslinking, language policy) are defined authoritatively in [AGENTS.md](../AGENTS.md); this page summarizes them and points there. + +## Folder map + +| Folder | When to use it | Status | +|---|---|---| +| [ADR](ADR/README.md) | Recording an architecture **decision** (MADR-lite; body immutable once Accepted). Canonical home going forward. | ⬜ | +| [PRD](PRD/README.md) | Specifying a **product/project**: Problem → … → QA strategy → `STATUS.md` (mandatory, kept current). | ⬜ | +| [investigations](investigations/README.md) | Capturing a **finding/analysis** — single `INV-NNN-slug.md`, or stub + notebooks when data-heavy. 
| ⬜ | +| [guidebooks](guidebooks/README.md) | Mapping a **component or the ecosystem** as navigable tree-docs (the lab cartography). | ⬜ | +| [runbooks](runbooks/README.md) | Documenting an **operational procedure** step-by-step with `[AGENT]` / `[HUMAN]` markers. | ⬜ | +| [shareouts](shareouts/README.md) | Producing **handouts/presentations** for humans (FRENCH). | ⬜ | + +Status legend: ✅ done · 🟡 beta · 🔴 critical · ⚠️ known issue · ❌ disabled · ⬜ not started. + +## Conventions at a glance + +- **No-tombstone rule (foremost)** — write each file as currently true; never leave "previously X, now Y", changelogs, or "updated to …" notes. Git history is the audit trail. Only exception: a forward-looking `> [!CAUTION]` about a live risk. +- **Breadcrumb spine** — every non-root file starts with a breadcrumb: ancestors as relative links, current page bold-unlinked, separator ` > `. This root has no breadcrumb (it uses the you-are-here + up-link above instead). +- **README hub per folder** — each folder's `README.md` is an index table of its children (link + one-line summary + status), sorted by importance/sequence. +- **Bidirectional links** — if A references B as related, B references A. Use descriptive link text (never "here"/"this"). +- **Mermaid prefs** — `theme base`/`forest` init directive; legible `classDef` palette (dark fills + light text); `<br/>` not `\n`; leading space before slash-labels; validate with the Mermaid MCP; **a numbered ordered list restating the flow after every diagram**. +- **GitHub alert legend** — `[!NOTE]` info/forward-looking · `[!TIP]` aside · `[!IMPORTANT]` inherent constraint · `[!WARNING]` degraded-but-working · `[!CAUTION]` data-loss/breaking. +- **Status emoji legend** — ✅ done · 🟡 beta · 🔴 critical · ⚠️ known issue · ❌ disabled · ⬜ not started. +- **Language policy** — English throughout `vibe/`; FRENCH only for shareouts handouts. + +Authority for all of the above: [AGENTS.md](../AGENTS.md). + +## Maintenance policy + +- **Adding a page** → also add its row to the parent folder's `README.md` index table. +- **Keep links bidirectional** → when you link A→B, add B→A. +- **Stamp `Last Updated:`** at each tree root (this file and every guidebook/big-PRD root) after any structural change. +- **Never tombstone** → edit content in place; let git carry the history. +- **Guidebook coupling** → changing a documented component means updating its guidebook page in the same change. +- **PR crosslinks** → every PR references the ADR/PRD it advances; that ADR's References and the PRD's `STATUS.md` link back. + +## Cohort + workflow (recap) + +Docs here are produced by a cohort of persona subagents — Lab Cartographer, ADR Scribe, PRD Architect, Runbook Engineer, Investigator, Diagram Smith, Continuity Warden — spawned via the Agent tool or a Workflow. The recommended pipeline for substantial contributions is **Scaffold → Author → Validate → Review → Assemble**. Full descriptions and responsibilities live in [AGENTS.md](../AGENTS.md). 
diff --git a/vibe/guidebooks/README.md b/vibe/guidebooks/README.md new file mode 100644 index 0000000..1c11363 --- /dev/null +++ b/vibe/guidebooks/README.md @@ -0,0 +1,47 @@ +[vibe](../README.md) > **Guidebooks** + +# Guidebooks + +> **Status:** Active +> **Last Updated:** 2026-06-23 +> **Related:** [vibe runbooks](../runbooks/README.md) · [vibe shareouts](../shareouts/README.md) · canonical docs under [doc/](../../doc/README.md) + +## What a guidebook is + +A **guidebook** is a *tree-doc reference map* of the lab: a navigable set of linked Markdown pages (a root index, per-folder README hubs, and leaf pages wired with breadcrumbs and bidirectional cross-references) whose job is to **describe how the system is actually wired right now** — components, the conventions that join them, and the data/control flows between them. + +Guidebooks are descriptive maps, not procedures. They answer *"how does this fit together?"* For *"how do I execute X step by step?"* see the [runbooks](../runbooks/README.md). For *"why was it built this way?"* see the architecture decision records under [doc/adr](../../doc/adr/README.md). + +```mermaid +%%{init: {'theme': 'base'}}%% +flowchart LR + classDef src fill:#2563eb,stroke:#1e40af,color:#fff + classDef proc fill:#059669,stroke:#047857,color:#fff + classDef store fill:#7c3aed,stroke:#6d28d9,color:#fff + SYS["Lab system<br/>(factory + tools + cms)"]:::src --> GB["Guidebook<br/>(tree-doc reference map)"]:::proc --> READER["Reader<br/>(human or agent)<br/>understands the wiring"]:::store +``` + +1. The lab system spans three repos — `factory`, `tools`, and `cms` — joined by the `` naming convention. +2. A guidebook surveys that system and renders it as a tree-doc reference map: indexed folders, breadcrumb-linked leaves, Mermaid flow diagrams. +3. A reader (a human onboarding, or an agent planning a change) consumes the guidebook to understand how the pieces wire together before touching anything. + +## Key maintenance rule + +> [!IMPORTANT] +> **If a component documented in a guidebook is altered, the guidebook page describing it MUST be updated in the same change.** A reference map that drifts from reality is worse than no map — it sends readers (and agents) confidently down dead paths. Treat the guidebook edit as part of the diff, not a follow-up: the PR that changes the component is the PR that updates its guidebook page. + +## Index + +| Guidebook | What it maps | Status | +|---|---|---| +| [Lab ecosystem](lab-ecosystem/README.md) | End-to-end map of `factory` + `tools` + `cms`: repos, the `` join key, secrets via Vault, CI/CD, ArgoCD, and the data/control flows that connect them | ✅ Active | + +## Rules to contribute + +1. **Use the `tree-docs` skill.** Guidebooks are tree-docs: author and grow them with the skill so breadcrumbs, hubs, and cross-links stay consistent. +2. **Breadcrumb spine on every file.** The first line of each page is its breadcrumb trail: ancestors are relative links, the current page is the bold-unlinked last item, separator is ` > ` (space-gt-space). +3. **README hub per subfolder.** Every folder carries a `README.md` index hub: a table of its children (link + one-line summary + status), sorted by importance/sequence, never alphabetically. +4. **Bidirectional links.** When page A references page B as related, page B references A back. Use descriptive link text — never "here" or "this". +5. 
**Mermaid preferences.** Begin each diagram with a `%%{init: {'theme': 'base'}}%%` directive, define a `classDef` palette legible on both light and dark backgrounds (dark fills, light text), use HTML `<br/>` for line breaks, and follow every diagram immediately with a numbered ordered list restating the same flow in words. +6. **Status legend.** ✅ done · 🟡 beta · 🔴 critical · ⚠️ known issue · ❌ disabled · ⬜ not started. +7. **Honour the maintenance rule above** — update the relevant guidebook page in the same change that alters the component it documents. diff --git a/vibe/guidebooks/lab-ecosystem/01-factory.md b/vibe/guidebooks/lab-ecosystem/01-factory.md new file mode 100644 index 0000000..79e54cd --- /dev/null +++ b/vibe/guidebooks/lab-ecosystem/01-factory.md @@ -0,0 +1,122 @@ +[vibe](../../README.md) > [Guidebooks](../README.md) > [Lab ecosystem](README.md) > **01 · factory** + +# 01 · factory + +> **Status:** ✅ Active +> **Last Updated:** 2026-06-23 +> **Downstream:** [02 · tools](02-tools.md) · [03 · cms](03-cms.md) +> **Related:** [naming-conventions.md](naming-conventions.md) · [secrets-and-vault.md](secrets-and-vault.md) · [storage-and-recovery.md](storage-and-recovery.md) + +`factory` is the **cornerstone admin repo**: it provisions the hosts and the cluster, declares what gets deployed, and owns the platform-level cloud/Gitea/Vault/Postgres state that every app leans on. It has four pillars — **Ansible** (imperative host & cluster setup), **ArgoCD** (declarative app-of-apps), **`iac/`** (OpenTofu for the cloud/Gitea/Vault edge), and **`postgres/iac/`** (per-app PostgreSQL provisioning). The repos `tools` and `cms` are deployed *by* factory's ArgoCD and are mapped in [02 · tools](02-tools.md) and [03 · cms](03-cms.md). + +## Pillar 1 — Ansible ([`ansible/`](../../../ansible/)) + +The collection lives at `ansible/arcodange/factory/`. The inventory groups the three Pis and pins the service placement; numbered playbooks run an ordered narrative from bare OS to backups; `recover/` holds the disaster-recovery playbooks. 
+ +### Inventory (`inventory/hosts.yml`) + +| Group | Hosts | Purpose | +|---|---|---| +| `raspberries` | `pi1`, `pi2`, `pi3` (`192.168.1.201-203`) | All three Pis; `ansible_user: pi` | +| `postgres` | `pi2` | The PostgreSQL host (docker-compose, outside k3s) | +| `gitea` | children of `postgres` (→ `pi2`) | Gitea co-located with PG on `pi2` | +| `pihole` | `pi1`, `pi3` | Internal DNS resolvers | +| `step_ca` | `pi1`, `pi2`, `pi3` | Step-CA PKI for `*.arcodange.lab` (primary `pi1`, replicas `pi2`/`pi3`) | +| `local` | `localhost` + the Pis | Control-node-local tasks | + +### Numbered playbooks (`playbooks/`) + +| Playbook | Imports / does | Notes | +|---|---|---| +| `01_system` | `system/system.yml` → rpi base, DNS, SSL, prepare disks, Docker, iSCSI, **k3s install** (`--docker --disable traefik`), CoreDNS, cert-issuer, Longhorn/Traefik config | k3s `v1.34.3+k3s1` via upstream `k3s-ansible`; pi1 server, pi2/pi3 agents | +| `02_setup` | `setup/setup.yml` → PostgreSQL + Gitea docker-compose; optional backup-NFS share | Stands up the two out-of-cluster source-of-truth services on `pi2` | +| `03_cicd` | Gitea **act-runner** docker-compose on `pi1`/`pi3` (`raspberries:&local:!gitea`), plus the ArgoCD/Image-Updater install | See the ArgoCD caveat below | +| `04_tools` | `tools/tools.yml` → `hashicorp_vault.yml`, `crowdsec.yml` | Platform tooling that bootstraps the cluster's Vault + CrowdSec | +| `05_backup` | `backup/backup.yml` → `postgres.yml`, `gitea.yml`, `k3s_pvc.yml` to `/mnt/backups` | Scheduled PG/Gitea/PVC backups; cron-report wiring present | + +### Recovery playbooks (`playbooks/recover/`) + +| Playbook | When to use | +|---|---| +| `longhorn.yml` | Recover Longhorn after a power cut when **Volume CRDs still exist** (CSI driver registration loss) | +| `longhorn_data.yml` | Recover app data from **raw replica `.img` files** when Volume CRDs are gone (block-device level) | + +The tested power-cut recovery sequence (Longhorn restore → Vault unseal → VSO re-auth → 
ERP scaled up last) is documented in `CLUSTER_RECOVERY.md` at the lab root (outside this repo) and summarized in [storage-and-recovery.md](storage-and-recovery.md). Background on PVC recovery is in the [Longhorn PVC recovery ADR](../../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md). + +### Key roles + +`deploy_docker_compose` (renders compose stacks), `gitea_repo` / `gitea_token` / `gitea_secret` / `gitea_sync` (Gitea repo/token/secret/mirror management), `traefik_certs`, `playwright`, plus sub-roles `step_ca`, `hashicorp_vault`, `crowdsec`, `pihole`. + +## Pillar 2 — ArgoCD app-of-apps ([`argocd/`](../../../argocd/)) + +A Helm chart whose `templates/apps.yaml` loops over `values.gitea_applications` and emits one `Application` CRD per app. Each Application derives everything from the app name: `repoURL = https://gitea.arcodange.lab//`, `path = chart`, `namespace = ` (`CreateNamespace=true`), with `syncPolicy.automated` `prune: true` + `selfHeal: true` by default. + +| App | Org override | Image Updater | +|---|---|---| +| `url-shortener` | — | — | +| `tools` | — | explicit `prune`+`selfHeal` | +| `webapp` | — | ✅ digest strategy | +| `telegram-gateway` | `arcodange` | ✅ digest strategy | +| `erp` | — | — | +| `cms` | — | ✅ digest strategy | +| `dance-lessons-coach` | `arcodange` | ✅ digest strategy | + +> [!NOTE] +> The chart also templates a `longhorn_backup_target` and the ArgoCD Image Updater config (`argocd.arcodange.lab`). **ArgoCD itself is not currently deployed in-cluster** — its install is commented out in `03_cicd`. This page documents the intended steady state; treat ArgoCD as "designed, not live" until that step is enabled. + +## Pillar 3 — OpenTofu ([`iac/`](../../../iac/)) + +Manages the cloud/Gitea/Vault edge. State lives in **GCS** (`backend "gcs"`, bucket `arcodange-tf`, prefix `factory/main`). Tofu authenticates to Vault via **Gitea OIDC JWT** (mount `gitea_jwt`, role `gitea_cicd`). 
+ +| Provider | Used for | +|---|---| +| `go-gitea/gitea` (`0.6.0`) | Repos, users, action secrets (e.g. the restricted `tofu_module_reader` CI user, CMS secrets) | +| `vault` (`4.4.0`) | KV secrets + policies + k8s auth roles (e.g. Longhorn GCS-backup creds & policy) | +| `google` (`7.0.1`) | GCS backup bucket + service account + HMAC key for Longhorn | +| `cloudflare/cloudflare` (`~> 5`) | R2 bucket, API tokens, CMS edge wiring (detailed in [03 · cms](03-cms.md)) | +| `ovh/ovh` (`2.8.0`) | OAuth2 client + IAM policy for the `arcodange.fr` domain (registrar = OVH) | + +`modules/cloudflare_token` is a reusable scoped-token factory. The whole module reuses the `` name as the GCS state prefix (`/main`) — see [naming-conventions.md](naming-conventions.md). + +## Pillar 4 — per-app PostgreSQL ([`postgres/iac/`](../../../postgres/)) + +OpenTofu using the `cyrilgdn/postgresql` provider against PG on `192.168.1.202` (state prefix `factory/postgres`). It iterates over a `var.applications` set and, **per app**, creates: + +| Resource | Name pattern | Purpose | +|---|---|---| +| Database | `` | The app's database (`template0`, owned by the role) | +| Owner role (non-login) | `_role` | Database owner; granted to dynamic users by Vault | +| Editor role (login) | `credentials_editor` | Shared admin role that can grant the per-app roles | +| `user_lookup()` function | per-`` db | `SECURITY DEFINER` lookup for **pgbouncer** auth (granted to `pgbouncer_auth`, revoked from `public`) | + +Current `applications` set: `webapp`, `erp`, `crowdsec`, `plausible`, `dance-lessons-coach`. Vault's PostgreSQL secrets engine then issues **dynamic** credentials on top of these roles — see [secrets-and-vault.md](secrets-and-vault.md). The pooler (`pgbouncer`) that consumes `user_lookup()` lives in the `tools` namespace — see [02 · tools](02-tools.md). 
+
+## Provisioning order
+
+```mermaid
+%%{init: {'theme': 'base'}}%%
+flowchart LR
+    classDef proc fill:#059669,stroke:#047857,color:#fff
+    classDef store fill:#7c3aed,stroke:#6d28d9,color:#fff
+    S1["01_system<br/>OS + k3s + Longhorn"]:::proc --> S2["02_setup<br/>PG + Gitea (pi2)"]:::proc --> S3["03_cicd<br/>runners + ArgoCD"]:::proc --> S4["04_tools<br/>Vault + CrowdSec"]:::proc --> S5["05_backup<br/>PG/Gitea/PVC"]:::proc
+    IAC["iac/ + postgres/iac<br/>(OpenTofu state in GCS)"]:::store -. "declares cloud/Gitea/Vault/PG" .- S2
+```
+
+1. **`01_system`** lays the OS, disks, Docker, and k3s with Longhorn + Traefik onto the three Pis.
+2. **`02_setup`** stands up PostgreSQL and Gitea as docker-compose on `pi2` — the out-of-cluster source-of-truth services.
+3. **`03_cicd`** registers the Gitea act-runners (and is where ArgoCD would install, currently commented out).
+4. **`04_tools`** bootstraps the cluster's Vault and CrowdSec.
+5. **`05_backup`** schedules PostgreSQL, Gitea, and k3s-PVC backups to `/mnt/backups`.
+6. In parallel, **OpenTofu** (`iac/` and `postgres/iac/`) declares the cloud, Gitea, Vault, and PostgreSQL objects, keeping state in GCS.
+
+## Cross-references
+
+- [Lab ecosystem hub](README.md) — the whole-lab map this page sits under.
+- [02 · tools](02-tools.md) — what ArgoCD deploys into the `tools` namespace (incl. pgbouncer that consumes the PG `user_lookup()`).
+- [03 · cms](03-cms.md) — the CMS edge that `iac/cloudflare.tf` and `iac/ovh.tf` wire up.
+- [naming-conventions.md](naming-conventions.md) — the `<app>` join key these pillars share.
+- [secrets-and-vault.md](secrets-and-vault.md) — Gitea OIDC JWT for Tofu/CI and dynamic PG creds.
+- [storage-and-recovery.md](storage-and-recovery.md) — Longhorn + GCS backup + power-cut recovery.
+- [new-web-app runbook](../../../doc/runbooks/new-web-app/README.md) · [conventions](../../../doc/runbooks/new-web-app/conventions.md) — the step-by-step procedure these pillars support.
+- [doc/adr](../../../doc/adr/README.md) — the canonical infrastructure ADRs.
+- [Longhorn PVC recovery ADR](../../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md) — recovery background.
diff --git a/vibe/guidebooks/lab-ecosystem/02-tools.md b/vibe/guidebooks/lab-ecosystem/02-tools.md
new file mode 100644
index 0000000..116b7ef
--- /dev/null
+++ b/vibe/guidebooks/lab-ecosystem/02-tools.md
@@ -0,0 +1,76 @@
+[vibe](../../README.md) > [Guidebooks](../README.md) > [Lab ecosystem](README.md) > **02 · tools**
+
+# 02 · tools
+
+> **Status:** ✅ Active
+> **Last Updated:** 2026-06-23
+> **Upstream:** [01 · factory](01-factory.md)
+> **Related:** [secrets-and-vault.md](secrets-and-vault.md) · [storage-and-recovery.md](storage-and-recovery.md)
+
+The [`tools` repo](https://gitea.arcodange.lab/arcodange-org/tools) is deployed by factory's ArgoCD into the **`tools` namespace**. It is the platform layer that every app namespace depends on: secrets (Vault + VSO), observability (Prometheus + Grafana), edge security (CrowdSec), database pooling (pgbouncer / pgcat), caching (Redis/KeyDB), and analytics (Plausible + ClickHouse). Each component ships its own Helm chart or Kustomize overlay, and most carry an `iac/` directory of OpenTofu that declares the Vault config (roles, policies, dynamic-secret backends) that wires the component to secrets — see [secrets-and-vault.md](secrets-and-vault.md).
+
+## Components in the `tools` namespace
+
+| Component | What it does | How declared | How it gets secrets |
+|---|---|---|---|
+| **Vault** | Secrets engine: KV v1 + v2, transit, PostgreSQL **dynamic creds**; auth backends `kubernetes` + Gitea **OIDC/JWT** | Helm chart + `iac/` (Vault config of itself + apps) | Is the source of truth; unsealed at boot (1 key, threshold 1) |
+| **VSO** (Vault Secrets Operator) | Injects Vault secrets into pods via `VaultAuth` + `VaultDynamicSecret` CRDs | Helm chart | Authenticates to Vault via **Kubernetes auth** (per-`<app>` role) |
+| **Prometheus** | Metrics scraping + storage | Helm (community subchart) | — (scrape configs) |
+| **Grafana** | Dashboards at `grafana.arcodange.lab`; datasources Prometheus + ClickHouse | Helm | Admin/datasource creds via VSO from Vault |
+| **CrowdSec** | Behavioural detection + **Traefik bouncer** for the public edge | Helm + `iac/` | **Dynamic secrets** from Vault (VSO) |
+| **pgbouncer** | Connection pooler to the **external** PostgreSQL on `pi2` | Helm | Auth via the per-app `user_lookup()` function (see [01 · factory](01-factory.md)); creds via VSO |
+| **pgcat** | Alternative pooler (optional, **not the default**) | Helm | VSO-injected creds when enabled |
+| **Redis / KeyDB** | In-memory cache; **KeyDB** master/replica (Redis-compatible) | Helm | VSO-injected auth when set |
+| **Plausible** | Privacy-friendly web analytics | **Kustomize** | VSO-injected creds; backed by ClickHouse |
+| **ClickHouse** | OLAP column store backing Plausible | **Kustomize** | VSO-injected creds |
+| **`tool`** | A Helm **library chart** — shared templates/helpers reused by the other charts (not itself deployable) | Helm library chart | n/a |
+
+## How tools fit together
+
+```mermaid
+%%{init: {'theme': 'base'}}%%
+flowchart TB
+    classDef store fill:#7c3aed,stroke:#6d28d9,color:#fff
+    classDef proc fill:#059669,stroke:#047857,color:#fff
+    classDef edge fill:#d97706,stroke:#b45309,color:#fff
+
+    VAULT[("Vault<br/>single source of truth")]:::store
+    VSO["VSO<br/>VaultAuth / VaultDynamicSecret"]:::proc
+    PG[("External PostgreSQL<br/>pi2 · 192.168.1.202")]:::store
+    PGB["pgbouncer<br/>pooler"]:::proc
+    APPS["app pods<br/>(webapp, erp, …)"]:::proc
+    PROM["Prometheus"]:::proc
+    GRAF["Grafana<br/>grafana.arcodange.lab"]:::proc
+    CH[("ClickHouse")]:::store
+    PLA["Plausible"]:::proc
+    CS["CrowdSec + Traefik bouncer"]:::edge
+
+    VAULT --> VSO
+    VSO -- "inject secrets" --> APPS
+    VSO -- "inject secrets" --> PGB
+    VSO -- "dynamic secret" --> CS
+    APPS --> PGB --> PG
+    PROM --> GRAF
+    CH --> GRAF
+    PLA --> CH
+```
+
+1. **Vault** holds every secret; **VSO** is the operator that delivers them into pods.
+2. VSO **injects** static and dynamic secrets into the app pods, into **pgbouncer**, and supplies **CrowdSec** its dynamic secret.
+3. App pods connect through **pgbouncer**, which pools connections to the **external PostgreSQL** on `pi2` (using the per-app `user_lookup()` function defined in factory's `postgres/iac/`).
+4. **Prometheus** scrapes metrics and **ClickHouse** stores analytics; both are wired as **Grafana** datasources.
+5. **Plausible** writes its analytics into **ClickHouse**.
+6. **CrowdSec** runs as a Traefik bouncer on the public edge, fed dynamic secrets from Vault — the same edge that fronts the CMS in [03 · cms](03-cms.md).
+
+## Where to look
+
+- Repo: [arcodange-org/tools](https://gitea.arcodange.lab/arcodange-org/tools) — each component is a top-level chart/overlay with its own `iac/`.
+- Vault config patterns: [hashicorp-vault/iac/modules](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/hashicorp-vault/iac) (e.g. `app_roles`, `app_policy`) — referenced by the [naming convention](../../../doc/runbooks/new-web-app/conventions.md).
+
+## Cross-references
+
+- [Lab ecosystem hub](README.md) — the whole-lab map.
+- [01 · factory](01-factory.md) — the ArgoCD that deploys this namespace, and the `postgres/iac/` roles + `user_lookup()` that pgbouncer consumes.
+- [03 · cms](03-cms.md) — the public edge protected by **CrowdSec** (Turnstile → CrowdSec wiring).
+- [secrets-and-vault.md](secrets-and-vault.md) — full Vault detail: KV/transit/dynamic engines, Gitea OIDC JWT, VSO injection.
+- [storage-and-recovery.md](storage-and-recovery.md) — Longhorn PVCs these stateful tools mount, and the Vault-unseal step in recovery.
diff --git a/vibe/guidebooks/lab-ecosystem/03-cms.md b/vibe/guidebooks/lab-ecosystem/03-cms.md
new file mode 100644
index 0000000..2bcdaad
--- /dev/null
+++ b/vibe/guidebooks/lab-ecosystem/03-cms.md
@@ -0,0 +1,82 @@
+[vibe](../../README.md) > [Guidebooks](../README.md) > [Lab ecosystem](README.md) > **03 · cms**
+
+# 03 · cms
+
+> **Status:** ✅ Active
+> **Last Updated:** 2026-06-23
+> **Upstream:** [01 · factory](01-factory.md)
+> **Related:** [02 · tools](02-tools.md) · [secrets-and-vault.md](secrets-and-vault.md)
+
+The [`cms` repo](https://gitea.arcodange.lab/arcodange-org/cms) is the **public-facing site** of the lab: a Nuxt static site served at **`arcodange.fr`**, plus the OpenTofu that owns its Cloudflare edge and its Zoho email. It is the one app whose primary audience is the open Internet, so it ties together the public-DNS, tunnel, CAPTCHA, and email plumbing.
+
+## The Nuxt site
+
+| Aspect | Detail |
+|---|---|
+| App | Static **Nuxt** site |
+| Chart | [`chart/`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/chart) — Helm chart, deployed as ArgoCD app **`cms`** into the `cms` namespace |
+| Image | Built in CI to the Gitea registry; ArgoCD **Image Updater** tracks `gitea.arcodange.lab/arcodange-org/cms:latest` with the **digest** strategy (see [01 · factory](01-factory.md)) |
+| Hostname | `arcodange.fr` (public) |
+
+## Cloudflare edge ([`cloudflare/`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/cloudflare))
+
+OpenTofu (state in cloud object storage) manages the `arcodange.fr` zone. The domain is **registered at OVH** (factory's [`iac/ovh.tf`](../../../iac/ovh.tf) grants the CMS an OVH OAuth2 client to edit nameservers) but its **DNS is delegated to Cloudflare**. The Cloudflare API token + account ID are pushed into the CMS Gitea repo as action secrets and mirrored into Vault by factory's [`iac/cloudflare.tf`](../../../iac/cloudflare.tf).
+
+| Cloudflare object | Purpose |
+|---|---|
+| Zone `arcodange.fr` | Public DNS for the site + email records |
+| Cloudflare **Pages** option | Static-hosting alternative for the Nuxt build |
+| **Cloudflared** Zero-Trust tunnel | Exposes **internal Traefik** to the Internet without opening home-LAN ports |
+| **Turnstile** CAPTCHA | Bot challenge on forms; wired to **CrowdSec** for decisioning |
+
+The Cloudflared tunnel token and Turnstile secret are stored in **Vault** (see [secrets-and-vault.md](secrets-and-vault.md)); the Turnstile → CrowdSec link is the public-edge guard documented in [02 · tools](02-tools.md).
+
+## Zoho email ([`zoho/`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/zoho))
+
+Sets up email for `arcodange.fr`: org/account lookup via the Zoho API + shell scripts, the full DNS authentication record set, and the public aliases.
+
+| DNS record | Role |
+|---|---|
+| CNAME (verify) | Domain ownership verification |
+| **MX** | Mail routing to Zoho |
+| **SPF** | Authorized senders |
+| **DKIM** | Outbound signing |
+| **DMARC** | Alignment + reporting policy |
+| **BIMI** | Brand logo in inboxes |
+
+Seven aliases are provisioned: **bonjour, contact, analytics, books, abonnements, helloworld, bureaux**.
+
+## Public request + email path
+
+```mermaid
+%%{init: {'theme': 'base'}}%%
+flowchart LR
+    classDef edge fill:#d97706,stroke:#b45309,color:#fff
+    classDef proc fill:#059669,stroke:#047857,color:#fff
+    classDef store fill:#7c3aed,stroke:#6d28d9,color:#fff
+
+    USER(["Visitor"]):::edge
+    CF["Cloudflare<br/>DNS + Turnstile"]:::edge
+    TUN["Cloudflared tunnel"]:::edge
+    TRAEFIK["internal Traefik"]:::proc
+    CS["CrowdSec bouncer"]:::proc
+    CMS["cms pod (Nuxt)<br/>arcodange.fr"]:::proc
+    MAIL(["Sender"]):::edge
+    ZOHO["Zoho<br/>MX / SPF / DKIM / DMARC / BIMI"]:::store
+
+    USER --> CF -- "Turnstile challenge" --> TUN --> TRAEFIK --> CS --> CMS
+    MAIL -- "MX lookup arcodange.fr" --> ZOHO
+```
+
+1. A **visitor** resolves `arcodange.fr` through **Cloudflare** DNS; form submissions hit a **Turnstile** challenge.
+2. Traffic enters the home LAN through the **Cloudflared** Zero-Trust tunnel — no home-LAN ports are opened.
+3. The tunnel lands on **internal Traefik**, which routes through the **CrowdSec** bouncer (fed Turnstile/decision signals) to the **`cms`** Nuxt pod.
+4. Separately, **email** to `arcodange.fr` follows the **MX** record to **Zoho**, with **SPF/DKIM/DMARC/BIMI** authenticating and presenting the mail; the seven aliases land there.
+
+## Cross-references
+
+- [Lab ecosystem hub](README.md) — the whole-lab map.
+- [01 · factory](01-factory.md) — the ArgoCD app `cms`, and `iac/cloudflare.tf` / `iac/ovh.tf` that grant the CMS its Cloudflare token and OVH nameserver-edit rights.
+- [02 · tools](02-tools.md) — **CrowdSec** (the Traefik bouncer the Turnstile challenge feeds).
+- [secrets-and-vault.md](secrets-and-vault.md) — the Cloudflared tunnel token and Turnstile/Cloudflare secrets stored in Vault.
+- Repo: [arcodange-org/cms](https://gitea.arcodange.lab/arcodange-org/cms).
diff --git a/vibe/guidebooks/lab-ecosystem/README.md b/vibe/guidebooks/lab-ecosystem/README.md
new file mode 100644
index 0000000..f932bec
--- /dev/null
+++ b/vibe/guidebooks/lab-ecosystem/README.md
@@ -0,0 +1,116 @@
+[vibe](../../README.md) > [Guidebooks](../README.md) > **Lab ecosystem**
+
+# Lab ecosystem
+
+> **Status:** ✅ Active
+> **Last Updated:** 2026-06-23
+> **Related:** [ADR-0001 · safe prod-like environment](../../ADR/0001-safe-prod-like-environment.md) · [PRD · safe prod-like environment](../../PRD/safe-prod-like-environment/README.md) · [INV-001 · prod blast-radius couplings](../../investigations/INV-001-prod-blast-radius-couplings.md)
+
+## What this is
+
+This guidebook is the **end-to-end map of the Arcodange home lab** — how the three repos (`factory`, `tools`, `cms`), the three Raspberry Pis, and the cloud edge wire together into one running system. It is a *descriptive reference map*, not a procedure: it answers *"how does this fit together right now?"*. For *"how do I add a new app step by step?"* see the [new-web-app runbook](../../../doc/runbooks/new-web-app/README.md); for *"why was it built this way?"* see the [factory ADRs](../../../doc/adr/README.md).
+
+The lab is run from **one control node** — a MacBook Pro M4 — driving everything via Ansible (imperative host setup) and OpenTofu (declarative cloud/Gitea/Vault/Postgres state). The three Pis (`pi1`/`pi2`/`pi3` = `192.168.1.201-203`) sit behind a home Livebox. `pi1` is the k3s server; `pi2`/`pi3` are agents. Gitea + PostgreSQL run as Docker Compose **outside** k3s on `pi2`'s disk; everything else runs **inside** k3s on Longhorn distributed block storage. The public edge is a Cloudflared Zero-Trust tunnel into the internal Traefik, with Cloudflare DNS and Zoho email fronting `arcodange.fr`.
+
+## The whole lab, end to end
+
+```mermaid
+%%{init: {'theme': 'base'}}%%
+flowchart TB
+    classDef ctrl fill:#2563eb,stroke:#1e40af,color:#fff
+    classDef host fill:#0891b2,stroke:#0e7490,color:#fff
+    classDef proc fill:#059669,stroke:#047857,color:#fff
+    classDef store fill:#7c3aed,stroke:#6d28d9,color:#fff
+    classDef edge fill:#d97706,stroke:#b45309,color:#fff
+    classDef dead fill:#6b7280,stroke:#4b5563,color:#fff
+
+    MAC["Control node (MacBook Pro M4)<br/>Ansible + OpenTofu"]:::ctrl
+
+    subgraph LAN["Home LAN (Livebox) — 192.168.1.0/24"]
+        subgraph PI2["pi2 · 192.168.1.202 (docker-compose, outside k3s)"]
+            GITEA["Gitea<br/>arcodange-org/*"]:::host
+            PG[("PostgreSQL")]:::store
+        end
+        subgraph K3S["k3s cluster — pi1 server, pi2/pi3 agents"]
+            ARGO["ArgoCD app-of-apps<br/>/argocd"]:::proc
+            LH[("Longhorn<br/>block storage")]:::store
+            VAULT["Vault + VSO<br/>secrets"]:::store
+            TRAEFIK["Traefik<br/>ingress"]:::proc
+            TOOLS["tools namespace<br/>(Vault, Grafana, CrowdSec, …)"]:::host
+            APPS["app namespaces<br/>(webapp, erp, cms, …)"]:::host
+        end
+        OLLAMA["pi3 · ollama"]:::host
+    end
+
+    subgraph CLOUD["Cloud edge"]
+        CF["Cloudflare DNS<br/>+ Cloudflared tunnel"]:::edge
+        ZOHO["Zoho<br/>email (arcodange.fr)"]:::edge
+        GCS[("GCS gs://arcodange-tf<br/>OpenTofu state + Longhorn backup")]:::store
+    end
+
+    INTERNET(["Internet"]):::edge
+
+    MAC -- "Ansible: provision hosts, k3s, docker-compose" --> PI2
+    MAC -- "Ansible: k3s, Longhorn, Traefik" --> K3S
+    MAC -- "OpenTofu: Gitea/Vault/PG/Cloudflare/OVH state" --> GITEA
+    MAC -- "OpenTofu state" --> GCS
+
+    GITEA -- "repoURL chart/" --> ARGO
+    ARGO -- "Application CRDs (prune+selfHeal)" --> TOOLS
+    ARGO -- "Application CRDs (prune+selfHeal)" --> APPS
+    VAULT -- "VSO injects secrets into pods" --> TOOLS
+    VAULT -- "VSO injects secrets into pods" --> APPS
+    APPS -- "dynamic creds" --> PG
+    LH -. "PVCs" .- TOOLS
+    LH -. "PVCs" .- APPS
+    LH -- "backup target" --> GCS
+
+    INTERNET --> CF -- "tunnel" --> TRAEFIK --> APPS
+    INTERNET --> ZOHO
+```
+
+1. The **control node** (MacBook) provisions the three Pis with Ansible (OS, disks, Docker, k3s, Longhorn, Traefik) and manages all SaaS/Gitea/Vault/Postgres state with OpenTofu.
+2. On **pi2**, Gitea and PostgreSQL run as Docker Compose *outside* k3s, on the local disk — they are the source-of-truth services the cluster depends on.
+3. OpenTofu keeps its **state in GCS** (`gs://arcodange-tf`), and Longhorn pushes volume **backups** to the same GCS project.
+4. **Gitea** hosts every app repo; each repo's `chart/` directory is the deployable Helm chart.
+5. **ArgoCD's app-of-apps** turns each Gitea repo into an `Application` CRD (automated `prune` + `selfHeal`) that deploys into the `tools` namespace and the per-app namespaces.
+6. **Vault** is the single source of truth for secrets; the **Vault Secrets Operator (VSO)** injects them into pods via Kubernetes auth, and apps draw dynamic PostgreSQL credentials from Vault against `pi2`.
+7. **Longhorn** provides the PVCs the in-cluster workloads mount, and backs up to GCS.
+8. The **public edge** routes Internet traffic through Cloudflare DNS and a Cloudflared Zero-Trust **tunnel** into the internal **Traefik**, which fronts the app namespaces; **Zoho** handles `arcodange.fr` email.
+
+> [!NOTE]
+> The ArgoCD Helm chart under [`argocd/`](../../../argocd/) is defined and templated, but **ArgoCD itself is not currently deployed in-cluster** (its install step is commented out in the `03_cicd` provisioning). The app-of-apps wiring documented here is the intended steady state; see [01 · factory](01-factory.md) for the caveat.
+
+## Deploy / secrets / DNS flows
+
+- **Deploy flow.** Push to a Gitea repo → CI builds an image into the Gitea registry → ArgoCD (via the app-of-apps and, for some apps, the Image Updater) syncs the `chart/` directory into the matching namespace with `prune` + `selfHeal`. The whole chain keys off one `<app>` identifier — see [naming-conventions.md](naming-conventions.md).
+- **Secrets flow.** Vault is the **single source of truth** (no sops/age). CI authenticates to Vault via **Gitea OIDC JWT** (role `gitea_cicd_<app>`); pods receive secrets at runtime via **VSO** (Kubernetes auth + `VaultDynamicSecret` CRDs). Detail in [secrets-and-vault.md](secrets-and-vault.md).
+- **DNS / edge flow.** Internal names resolve under `*.arcodange.lab` (Pi-hole + Step-CA-issued TLS). Public traffic for `arcodange.fr` enters through Cloudflare and a Cloudflared tunnel to internal Traefik; public TLS is Let's Encrypt via Traefik's DNS-challenge (DuckDNS). Email runs through Zoho. Edge detail in [03 · cms](03-cms.md).
+
+## Master index
+
+| Page | What it maps | Status |
+|---|---|---|
+| [01 · factory](01-factory.md) | The cornerstone admin repo: Ansible host/cluster provisioning, ArgoCD app-of-apps, OpenTofu (`iac/`), and per-app PostgreSQL (`postgres/iac/`) | ✅ Active |
+| [02 · tools](02-tools.md) | The `tools` namespace: Vault, VSO, Prometheus, Grafana, CrowdSec, poolers, Redis/KeyDB, Plausible + ClickHouse, the `tool` library chart | ✅ Active |
+| [03 · cms](03-cms.md) | The public-facing site: Nuxt static site, Cloudflare zone + tunnel + Turnstile, Zoho email (MX/SPF/DKIM/DMARC/BIMI + aliases) | ✅ Active |
+| [naming-conventions.md](naming-conventions.md) | The `<app>` join key — one kebab-case name reused identically across Gitea, PG, Vault, k8s, ArgoCD, GCS, DNS | ✅ Active |
+| [secrets-and-vault.md](secrets-and-vault.md) | How Vault is the single source of truth: Gitea OIDC JWT for CI, VSO injection for pods, dynamic PostgreSQL creds | ✅ Active |
+| [storage-and-recovery.md](storage-and-recovery.md) | Longhorn block storage, GCS backup target, and the tested power-cut recovery sequence | ✅ Active |
+
+## Status legend
+
+✅ done · 🟡 beta · 🔴 critical · ⚠️ known issue · ❌ disabled · ⬜ not started.
+
+## Maintenance rule
+
+> [!IMPORTANT]
+> **If you alter a component documented here, update its page in the same change.** A reference map that drifts from reality sends readers (and agents) confidently down dead paths. The PR that changes the component is the PR that updates its guidebook page — treat the doc edit as part of the diff, not a follow-up.
+
+## Cross-references
+
+- [ADR-0001 · safe prod-like environment](../../ADR/0001-safe-prod-like-environment.md) — the decision this map supports.
+- [PRD · safe prod-like environment](../../PRD/safe-prod-like-environment/README.md) — the product framing of an isolated, prod-like sandbox.
+- [INV-001 · prod blast-radius couplings](../../investigations/INV-001-prod-blast-radius-couplings.md) — the couplings (the `<app>` join key, shared Vault/PG/Longhorn) that make blast radius real.
+- [doc/adr](../../../doc/adr/README.md) — the canonical infrastructure ADRs (French).
+- [new-web-app conventions](../../../doc/runbooks/new-web-app/conventions.md) — the authoritative source for the `<app>` naming convention.
diff --git a/vibe/guidebooks/lab-ecosystem/naming-conventions.md b/vibe/guidebooks/lab-ecosystem/naming-conventions.md
new file mode 100644
index 0000000..544cb78
--- /dev/null
+++ b/vibe/guidebooks/lab-ecosystem/naming-conventions.md
@@ -0,0 +1,96 @@
+[vibe](../../README.md) > [Guidebooks](../README.md) > [Lab ecosystem](README.md) > **Naming conventions (the `<app>` join key)**
+
+# Naming conventions — the `<app>` join key
+
+> **Status**: 🟢 Active
+> **Last Updated**: 2026-06-23
+> **Related**: [Lab ecosystem](README.md) · [Factory brick](01-factory.md) · [Secrets & Vault](secrets-and-vault.md) · [PRD — isolation boundary](../../PRD/safe-prod-like-environment/isolation-boundary.md)
+> **Upstream (source of truth)**: [doc/runbooks/new-web-app/conventions.md](../../../doc/runbooks/new-web-app/conventions.md) (French, authoritative)
+
+## TL;DR
+
+Every application on the platform is pinned to **one** kebab-case identifier — `<app>` (e.g. `erp`, `webapp`, `url-shortener`, `dance-lessons-coach`). That single string is reused **verbatim**, apart from a few fixed prefixes and suffixes, as the name of the app's Gitea repo, its PostgreSQL database and role, its Vault roles and policies, its Kubernetes namespace and ServiceAccount, its ArgoCD Application, its OpenTofu state prefix, and its DNS records. The bricks of the stack do not point at each other through explicit configuration; they **wire together by guessing each other's names from `<app>`**. Pick the name once, get it right, and the whole chain self-assembles. One typo anywhere, and the chain breaks silently.
+
+## What `<app>` is
+
+`<app>` is a **lowercase, kebab-case** slug. It is the join key of the entire platform — the one value that lets a dozen otherwise-independent systems agree on which resources belong to the same application without ever exchanging a config pointer. The canonical, authoritative definition (in French) lives in the runbook: [doc/runbooks/new-web-app/conventions.md](../../../doc/runbooks/new-web-app/conventions.md). This page is the English concept summary inside the ecosystem guidebook.
+
+## The mapping — one name, every system
+
+The table below shows how each system derives its identifier from `<app>`, with the `erp` application as the worked example.
+
+| System | Identifier derived from `<app>` | Example (`erp`) |
+| --- | --- | --- |
+| Gitea repository | `arcodange-org/<app>` | `arcodange-org/erp` |
+| PostgreSQL database | `<app>` | `erp` |
+| PostgreSQL owner role (non-login) | `<app>_role` | `erp_role` |
+| Vault dynamic DB role | `postgres/creds/<app>` | `postgres/creds/erp` |
+| Vault Kubernetes auth role | `<app>` | `erp` |
+| Vault runtime policy (pod) | `<app>` | `erp` |
+| Vault CI/ops policy | `<app>-ops` | `erp-ops` |
+| Vault CI JWT role (Gitea OIDC) | `gitea_cicd_<app>` | `gitea_cicd_erp` |
+| Vault KV config path | `kvv2/<app>/config` | `kvv2/erp/config` |
+| Kubernetes namespace | `<app>` | `erp` |
+| Kubernetes ServiceAccount | `<app>` | `erp` |
+| ArgoCD Application | `<app>` | `erp` |
+| OpenTofu state prefix (GCS) | `<app>/main` | `erp/main` |
+| Internal DNS | `<app>.arcodange.lab` | `erp.arcodange.lab` |
+| Public DNS | `<app>.arcodange.fr` | `erp.arcodange.fr` |
+
+> [!NOTE]
+> The `_role` suffix (PG owner role) and the `-ops` suffix (Vault CI policy/identity group) are the only two *systematic* transformations of `<app>`. Everything else uses the bare slug. Note the suffix style differs: PostgreSQL uses an underscore (`erp_role`) because hyphens are awkward in SQL identifiers, whereas Vault and Kubernetes use a hyphen (`erp-ops`).
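
To make the mapping concrete, the sketch below derives every identifier from a single `<app>` slug in Python. It is only an illustration: the function name `derive_names`, the dictionary keys, and the `org` default are assumptions made for this example and do not exist in the factory repo. Only the name patterns themselves are taken from the mapping table.

```python
def derive_names(app: str, org: str = "arcodange-org") -> dict[str, str]:
    """Derive per-system identifiers from one <app> slug.

    Hypothetical helper: the keys are illustrative labels, the values
    follow the name patterns listed in the mapping table.
    """
    return {
        "gitea_repo": f"{org}/{app}",
        "pg_database": app,
        "pg_owner_role": f"{app}_role",          # underscore suffix (SQL-friendly)
        "vault_db_role": f"postgres/creds/{app}",
        "vault_k8s_role": app,
        "vault_runtime_policy": app,
        "vault_ops_policy": f"{app}-ops",        # hyphen suffix (Vault style)
        "vault_ci_jwt_role": f"gitea_cicd_{app}",
        "vault_kv_config": f"kvv2/{app}/config",
        "k8s_namespace": app,
        "k8s_service_account": app,
        "argocd_application": app,
        "tofu_state_prefix": f"{app}/main",
        "internal_dns": f"{app}.arcodange.lab",
        "public_dns": f"{app}.arcodange.fr",
    }


if __name__ == "__main__":
    for system, name in derive_names("erp").items():
        print(f"{system:24} {name}")
```

Calling `derive_names("erp")` reproduces the example column of the table. The point of the sketch is that the slug is the only input: no brick needs to be told another brick's name.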
+
+## Why uniformity is structuring
+
+The platform is a set of loosely-coupled bricks (Gitea, Postgres, Vault, k3s/ArgoCD, OpenTofu, DNS). They were deliberately built **not** to hold explicit references to one another. Instead, each brick reconstructs the names it needs from `<app>` at the moment it runs:
+
+```mermaid
+%%{init: {'theme':'base'}}%%
+flowchart LR
+    APP["<app><br/>(one kebab-case slug)"]:::src
+
+    APP --> GIT["Gitea repo<br/>arcodange-org/<app>"]:::brick
+    APP --> PG["PostgreSQL<br/>db <app> · role <app>_role"]:::brick
+    APP --> VAULT["Vault<br/>postgres/creds/<app><br/>policy <app> · gitea_cicd_<app>"]:::brick
+    APP --> K8S["Kubernetes<br/>namespace + SA <app>"]:::brick
+    APP --> ARGO["ArgoCD<br/>Application <app>"]:::brick
+    APP --> GCS["OpenTofu state<br/><app>/main"]:::brick
+    APP --> DNS["DNS<br/><app>.arcodange.lab / .fr"]:::brick
+
+    VAULT -.->|"GRANT <app>_role<br/>assumes PG role name"| PG
+    K8S -.->|"VaultDynamicSecret reads<br/>postgres/creds/<app>"| VAULT
+    ARGO -.->|"repoURL=.../<app><br/>namespace=<app>"| GIT
+
+    classDef src fill:#2563eb,stroke:#1e40af,color:#fff
+    classDef brick fill:#059669,stroke:#047857,color:#fff
+```
+
+1. The chosen slug `<app>` is the single input.
+2. From it, each brick names its own resource: Gitea names the repo `arcodange-org/<app>`; Postgres names the database `<app>` and its owner role `<app>_role`; Vault names the dynamic-creds role `postgres/creds/<app>`, the runtime policy `<app>`, and the CI JWT role `gitea_cicd_<app>`; Kubernetes names the namespace and ServiceAccount `<app>`; ArgoCD names the Application `<app>`; OpenTofu writes state under `<app>/main`; DNS publishes `<app>.arcodange.lab` and `<app>.arcodange.fr`.
+3. The dashed arrows are the cross-brick assumptions that make it work: the Vault `app_roles` module issues a dynamic PG user with `GRANT <app>_role TO …`, **assuming** the Postgres owner role is named exactly `<app>_role`; the chart's `VaultDynamicSecret` reads `postgres/creds/<app>`, **assuming** the Vault role is named exactly `<app>`; the ArgoCD Application derives `repoURL=.../<app>` and `namespace=<app>` from the slug alone, **assuming** the Gitea repo and the namespace match.
+4. None of these links is configured by hand. They hold purely because every brick was given the same `<app>` to reconstruct from.
+
+## The failure mode of a typo
+
+Because the wiring is by name and not by explicit reference, **nothing validates the join key end-to-end**. A single divergence — `my_app` vs `my-app`, a stray capital (`MyApp`), an accidental plural (`erps`) — does not raise an error at creation time. The mismatched brick simply builds a resource under a name no one else looks for:
+
+- A Postgres owner role created as `erp-role` (hyphen) instead of `erp_role` → Vault's `GRANT erp_role` fails or grants nothing → the pod gets a DB user with no privileges.
+- A Gitea repo named `erp-app` instead of `erp` → ArgoCD's derived `repoURL=.../erp` 404s → the Application never syncs.
+- A namespace typo → the `VaultDynamicSecret` and ServiceAccount land in the wrong place → silent auth failure at pod start.
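
Since nothing checks the join key end to end, one inexpensive safeguard would be a pre-flight check on the slug before any brick is provisioned. The Python sketch below is an assumption-laden illustration, not an existing tool in the lab: the name `is_valid_slug` is hypothetical, and the exact pattern (lowercase letters and digits, single hyphens between segments) is one reasonable reading of "lowercase kebab-case" rather than a rule stated in the conventions.

```python
import re

# Assumed reading of "lowercase kebab-case": one or more segments of
# lowercase letters/digits, joined by single hyphens.
SLUG_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def is_valid_slug(name: str) -> bool:
    """Return True when `name` is well-formed under the assumed slug pattern."""
    return SLUG_RE.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("erp", "dance-lessons-coach", "my_app", "MyApp", "erp-"):
        verdict = "ok" if is_valid_slug(candidate) else "rejected"
        print(f"{candidate:22} {verdict}")
```

A check like this catches separator and case variants such as `my_app` or `MyApp`. It cannot catch a well-formed but wrong name such as `erps`, which is why reusing the slug character for character remains the real safeguard.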
+
+The symptom is always the same: a brick that *looks* provisioned but never connects, with no single component to blame. This is why the slug must be **short, stable, and correct from the first step** — there is no safety net downstream.
+
+✅ Choose a short, stable, lowercase kebab-case name up front and reuse it character-for-character.
+❌ Never introduce variants (case, separators, plurals); nothing will warn you.
+
+## Why this makes a sandbox safe
+
+The `<app>` convention is also the reason a **production-like sandbox can reuse the exact same names** without colliding with production. Because every brick derives its resource names from `<app>` and from nothing else, an entire parallel universe of the platform — its own Vault, its own Postgres instance, its own k3s namespace scope — can host an `erp` named identically to the production `erp`, provided the two universes never share a backing store. Identity comes from the *environment boundary*, not from the name; the name is free to repeat. This is what lets QA and recovery drills run against `erp`, `webapp`, etc. with realistic identifiers instead of mangled `erp-staging`-style aliases that would themselves break the name-wiring. See the PRD's [isolation boundary](../../PRD/safe-prod-like-environment/isolation-boundary.md) for how that environment fence is drawn.
+
+## See also
+
+- [doc/runbooks/new-web-app/conventions.md](../../../doc/runbooks/new-web-app/conventions.md) — the authoritative French source, with per-step references into the 8-step "new web app" runbook.
+- [Secrets & Vault](secrets-and-vault.md) — how `gitea_cicd_<app>` and the `<app>` / `<app>-ops` policies fit the auth model.
+- [Factory brick](01-factory.md) — where the ArgoCD app-of-apps, the Postgres OpenTofu, and the IaC live.
+- [PRD — isolation boundary](../../PRD/safe-prod-like-environment/isolation-boundary.md) — why identical names are safe across environments.
+- [ADR 0001 — Safe, production-like environment](../../ADR/0001-safe-prod-like-environment.md).
diff --git a/vibe/guidebooks/lab-ecosystem/secrets-and-vault.md b/vibe/guidebooks/lab-ecosystem/secrets-and-vault.md
new file mode 100644
index 0000000..1c71e74
--- /dev/null
+++ b/vibe/guidebooks/lab-ecosystem/secrets-and-vault.md
@@ -0,0 +1,110 @@
+[vibe](../../README.md) > [Guidebooks](../README.md) > [Lab ecosystem](README.md) > **Secrets & Vault**
+
+# Secrets & Vault
+
+> **Status**: 🟢 Active
+> **Last Updated**: 2026-06-23
+> **Related**: [Lab ecosystem](README.md) · [Tools brick](02-tools.md) · [Storage & recovery](storage-and-recovery.md) · [Naming conventions](naming-conventions.md)
+> **Decision**: [ADR 0001 — Safe, production-like environment](../../ADR/0001-safe-prod-like-environment.md)
+
+## TL;DR
+
+**HashiCorp Vault is the single source of truth for every secret in the lab.** There is no sops, no age, no secret files in git — if a credential exists, Vault either stores it or mints it on demand. Two parties consume secrets, and each authenticates a different way: **pods** use the Kubernetes auth backend (via the Vault Secrets Operator), and **CI / OpenTofu** use Gitea OIDC JWT (one role `gitea_cicd_<app>` per app). Vault holds static config in KV, encryption keys in transit, and issues **short-lived, dynamic** PostgreSQL credentials so no long-lived DB password is ever written down. The trade-off: Vault is sealed on every restart and must be **manually unsealed** (1 key, threshold 1) before anything that needs a secret can come back.
+
+## Why Vault, and only Vault
+
+The lab made a deliberate choice: **one** secret store, accessed over the network, rather than encrypted secret files scattered through the repos. The consequences are structuring:
+
+- **No secret material in git.** Charts and OpenTofu reference Vault *paths*, never values. A leaked repo leaks no credentials.
+- **One revocation point.** Rotating or revoking a credential happens in Vault; consumers pick up the change on their next read or lease renewal. +- **Dynamic over static.** Where a backend supports it (Postgres), Vault issues a fresh, time-boxed credential per consumer instead of a shared static password. + +Vault itself runs as the `hashicorp-vault` chart in the **tools** namespace. Its full configuration — engines, auth backends, policies, the per-app role/policy modules — lives in the tools repo; see the [Tools brick](02-tools.md) for the deployment context. + +## What Vault mounts + +| Mount | Type | Purpose | +| --- | --- | --- | +| `kvv2/` | KV v2 (versioned) | Application static config, e.g. `kvv2//config`. Versioned so a bad write can be rolled back. | +| KV v1 | KV v1 (unversioned) | Flat secrets that don't need history. | +| `transit/` | Transit | Encryption-as-a-service: encrypt/decrypt and sign without exposing the key. | +| `postgres/` | Database (dynamic) | Issues **short-lived** PostgreSQL credentials on demand: `postgres/creds/` hands out a fresh login user, granted `_role`, with a lease that expires. | + +The `` slug threads through every one of these paths — `kvv2//config`, `postgres/creds/` — exactly as described in [Naming conventions](naming-conventions.md). + +## The two auth backends + +Vault doesn't trust callers by static token. Each class of consumer proves its identity through a backend matched to where it runs: + +- **Kubernetes auth** — for **pods**. The Vault Secrets Operator (VSO) and workloads present their Kubernetes ServiceAccount token; Vault validates it against the cluster's API and maps the SA to the Vault role ``, which carries the runtime policy ``. +- **Gitea OIDC / JWT auth** — for **CI and OpenTofu**. A Gitea Actions workflow obtains an OIDC token; Vault validates it and maps it to the JWT role `gitea_cicd_`, which carries the CI/ops policy `-ops`. 
This is how `tofu apply` in CI reads and writes the secrets it manages without any pre-shared Vault token. + +The split matters: pods get only what they need at runtime (the `<app>` policy), while CI gets the broader provisioning rights (`<app>-ops`) needed to *create* the very secrets the pods will later read. + +## How VSO delivers secrets to pods + +Inside the cluster, the **Vault Secrets Operator** is the bridge between Vault and Kubernetes. It watches two kinds of custom resource: + +- **`VaultAuth`** declares *how* to authenticate to Vault (the Kubernetes auth mount + the `<app>` role). +- **`VaultDynamicSecret`** (and `VaultStaticSecret`) declares *what* to fetch (e.g. `postgres/creds/<app>`) and which Kubernetes Secret to materialise it into. For dynamic secrets, VSO also **renews the lease** and rotates the Secret before it expires. + +The pod then mounts the resulting Kubernetes Secret as it would any other: it never speaks to Vault directly, and never sees a static DB password. + +## The secret flow, end to end + +```mermaid +%%{init: {'theme':'base'}}%% +flowchart LR + subgraph CI["CI / Provisioning path"] + GHA["Gitea Actions<br/>workflow"]:::src + TOFU["OpenTofu<br/>tofu apply"]:::proc + end + + subgraph RT["Runtime path (in-cluster)"] + VSO["Vault Secrets<br/>Operator (VSO)"]:::proc + POD["App pod<br/>(ServiceAccount <app>)"]:::proc + end + + VAULT["Vault<br/>KV v1/v2 · transit · postgres dynamic"]:::store + + GHA -->|"OIDC JWT<br/>role gitea_cicd_<app>"| VAULT + VAULT -->|"policy <app>-ops<br/>read/write secrets"| TOFU + TOFU -->|"writes config to<br/>kvv2/<app>/config"| VAULT + + VSO -->|"k8s auth<br/>role <app> (SA token)"| VAULT + VAULT -->|"dynamic creds<br/>postgres/creds/<app>"| VSO + VSO -->|"materialises +<br/>renews K8s Secret"| POD + + classDef src fill:#2563eb,stroke:#1e40af,color:#fff + classDef proc fill:#059669,stroke:#047857,color:#fff + classDef store fill:#7c3aed,stroke:#6d28d9,color:#fff +``` + +1. **CI path:** a Gitea Actions workflow requests an OIDC JWT and presents it to Vault under the role `gitea_cicd_<app>`. Vault validates the token and grants the `<app>-ops` policy. +2. With that policy, OpenTofu (`tofu apply`, running in CI) reads the secrets it needs and writes the app's static config back to `kvv2/<app>/config`. No pre-shared Vault token is ever stored: the trust is established per-run via OIDC. +3. **Runtime path:** in the cluster, the Vault Secrets Operator authenticates with the Kubernetes auth backend, presenting the app's ServiceAccount token mapped to the Vault role `<app>`. +4. Vault issues a **short-lived, dynamic** PostgreSQL credential from `postgres/creds/<app>` back to VSO. +5. VSO materialises that credential into a Kubernetes Secret in the app's namespace, then **renews the lease** and rotates the Secret before it expires. +6. The app pod mounts the Kubernetes Secret like any other: it never talks to Vault, and never holds a long-lived database password. + +## The unseal model + +Vault encrypts its storage with a master key that is **never persisted in usable form**. On every start (a fresh deploy, a pod reschedule, or a full cluster recovery) Vault comes up **sealed** and refuses every request until it is unsealed. + +- **Shamir config:** 1 unseal key, threshold 1 (a single-operator lab, so no key-splitting ceremony). +- **Where the key lives:** on the control node (the MacBook), at `~/.arcodange/cluster-keys.json`. It is *not* in git, *not* in Kubernetes, *not* in Vault. +- **Operational consequence:** **nothing that needs a secret recovers until a human unseals Vault.** This is the chokepoint baked into the recovery order: VSO cannot re-auth, dynamic DB creds cannot be issued, and dependent apps cannot start, until the unseal happens.
See [Storage & recovery](storage-and-recovery.md) for where unseal sits in the tested startup sequence. + +> [!CAUTION] +> If `~/.arcodange/cluster-keys.json` is lost, Vault's data is **unrecoverable**: there is no second copy of the unseal key and no key-recovery path. Treat that file as the most critical secret in the lab. + +## Sandbox implications + +A production-like sandbox does **not** share the production Vault. It runs its **own** Vault instance with its **own** unseal key and its **own** policies, so that exercising secret flows, rotating credentials, or testing a broken unseal cannot touch production secrets. Because the `<app>` join key is environment-relative (see [Naming conventions](naming-conventions.md)), the sandbox can keep identical role and policy names (`gitea_cicd_<app>`, `<app>`, `<app>-ops`) while remaining fully isolated. The rationale for that separate-Vault, separate-unseal posture is recorded in [ADR 0001: Safe, production-like environment](../../ADR/0001-safe-prod-like-environment.md). + +## See also + +- [Tools brick](02-tools.md): where the `hashicorp-vault` chart, VSO, and the per-app Vault IaC modules are deployed. +- [Storage & recovery](storage-and-recovery.md): Vault unseal as a step in the tested power-cut recovery order. +- [Naming conventions](naming-conventions.md): how `gitea_cicd_<app>`, `<app>`, and `<app>-ops` derive from the join key. +- [ADR 0001: Safe, production-like environment](../../ADR/0001-safe-prod-like-environment.md): the sandbox's separate-Vault decision.
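
The two VSO resources described under "How VSO delivers secrets to pods" could look roughly like the sketch below. It is illustrative only: the app slug `myapp`, the namespace, the Kubernetes auth mount name `kubernetes`, and the destination Secret name are assumptions rather than values taken from the tools repo. Only the `postgres/` mount and the `creds/` path shape come from this guidebook.

```yaml
# Hypothetical sketch: `myapp`, the namespace, the `kubernetes` auth mount
# and the Secret name are placeholders, not values from the lab repos.
apiVersion: secrets.hashicorp.com/v1beta1
kind: VaultAuth
metadata:
  name: myapp-vault-auth
  namespace: myapp
spec:
  method: kubernetes            # authenticate with a ServiceAccount token
  mount: kubernetes             # assumed name of the Kubernetes auth mount
  kubernetes:
    role: myapp                 # Vault role named after the app slug
    serviceAccount: myapp       # ServiceAccount whose token VSO presents
---
apiVersion: secrets.hashicorp.com/v1beta1
kind: VaultDynamicSecret
metadata:
  name: myapp-postgres-creds
  namespace: myapp
spec:
  vaultAuthRef: myapp-vault-auth    # how to authenticate (the VaultAuth above)
  mount: postgres                   # database secrets engine mount
  path: creds/myapp                 # resolves to postgres/creds/myapp
  destination:
    name: myapp-postgres-creds      # Kubernetes Secret that VSO creates and rotates
    create: true
```

Under these assumptions the pod would reference `myapp-postgres-creds` through `envFrom` or a volume mount. For Vault's database engine the generated Secret normally exposes `username` and `password` keys, but the exact key names are worth checking against the deployed VSO version.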
diff --git a/vibe/guidebooks/lab-ecosystem/storage-and-recovery.md b/vibe/guidebooks/lab-ecosystem/storage-and-recovery.md new file mode 100644 index 0000000..0b7a6d1 --- /dev/null +++ b/vibe/guidebooks/lab-ecosystem/storage-and-recovery.md @@ -0,0 +1,76 @@ +[vibe](../../README.md) > [Guidebooks](../README.md) > [Lab ecosystem](README.md) > **Storage & recovery** + +# Storage & recovery + +> **Status**: 🟢 Active +> **Last Updated**: 2026-06-23 +> **Related**: [Lab ecosystem](README.md) · [Secrets & Vault](secrets-and-vault.md) · [Factory brick](01-factory.md) · [PRD — QA strategy](../../PRD/safe-prod-like-environment/qa-strategy.md) +> **Decision**: [ADR 0001 — Safe, production-like environment](../../ADR/0001-safe-prod-like-environment.md) +> **Upstream (incident ADR)**: [Longhorn PVC recovery](../../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md) + +## TL;DR + +The lab keeps state in **two** places, on purpose. **Longhorn** provides distributed block storage *inside* k3s for everything cluster-native (app PVCs, Traefik's `acme.json`, the backup volume itself). **PostgreSQL and Gitea** deliberately persist on **pi2's local disk, outside k3s**, as plain docker-compose — they are the platform's own foundations and must not depend on the cluster they help run. This split survives a full power cut, but Longhorn has one sharp edge: when its CRDs are wiped and recreated, it assigns **new engine IDs** and cannot automatically re-associate the surviving on-disk replica files with the new volumes. The 2026-04-13 power-cut taught us a **fixed startup order** — Longhorn first, then Vault unseal, then VSO re-auth, with **ERP scaled up last** — that brings the cluster back deterministically. That order is now rehearsed as a drill. 
+ +## Two storage tiers, on purpose + +| Tier | Backing | What lives there | Why here | +| --- | --- | --- | --- | +| **In-cluster** | **Longhorn** (distributed block storage inside k3s, replicated across pi1/pi2/pi3) | App PVCs, Traefik certificates (`acme.json`), the cluster backup volume (`backups-rwx`) | Cluster-native workloads get replicated, snapshot-able volumes that follow the pod. | +| **Outside the cluster** | **docker-compose on pi2's local disk** | PostgreSQL + Gitea | These are *foundations*: Gitea serves the GitOps source and Postgres backs the apps. They must survive — and start — **without** k3s being healthy, so they cannot live inside it. | + +This separation is the reason the platform can bootstrap itself: Gitea and Postgres come up on pi2 independently, and only then does the cluster (which pulls its config from Gitea) have something to sync against. See the [Factory brick](01-factory.md) for how the Ansible playbooks and the ArgoCD app-of-apps consume those foundations. + +## The Longhorn engine-ID re-association failure mode + +Longhorn stores each replica's data on a node in a directory named **`-`**. The raw `volume-head-*.img` files are durable — they survive a power cut on the disk. The danger is in the *metadata*, not the data: + +1. A power cut drops the Longhorn CSI driver. +2. Recovering the stuck pods forces a delete of Longhorn's CRDs (Volume / Engine / Replica) — a webhook circular dependency makes a clean shutdown impossible. +3. Reinstalling Longhorn recreates the Volume CRDs, but with **new engine IDs**. +4. Longhorn creates **new, empty** replica directories under the new engine IDs and **does not adopt** the old, data-bearing directories. + +The result: the real data sits in an orphaned `…-/` directory while Longhorn happily serves an empty `…-/`. 
Worse, a naive directory rename can backfire: Longhorn reconciliation may find a `Dirty: true` orphan alongside a clean empty replica and **silently rebuild from the empty one, destroying the data**. The proven safe path is the automated **block-device injection** (Method D): create a fresh volume, attach it in maintenance mode, and `rsync` the recovered, layer-merged image into the live device, never renaming the orphaned directories. The full method comparison, the `playbooks/recover/longhorn_data.yml` automation, and the prevention work (the backup playbook now captures Longhorn Volume CRDs for fast `kubectl apply` restore) are documented in the [Longhorn PVC recovery ADR](../../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md). + +> [!CAUTION] +> Do **not** recover a Longhorn volume by renaming the orphaned replica directory to the new engine ID. Reconciliation can pick the empty replica as the rebuild source and overwrite your data. Use the block-device injection playbook instead. + +## The tested 2026-04-13 power-cut recovery + +The cluster was recovered end to end after the April 13, 2026 power cut, and the sequence was distilled into a deterministic startup order. The order is not arbitrary: each step is a **dependency gate** for the next: + +```mermaid +%%{init: {'theme':'base'}}%% +flowchart TD + PC["Power cut<br/>(cluster down, disks intact)"]:::dead + + PC --> LH["1 · Restore Longhorn<br/>volumes (block-device<br/>injection if engine IDs changed)"]:::store + LH --> VU["2 · Unseal Vault<br/>(1 key, threshold 1,<br/>key on the Mac)"]:::proc + VU --> VSO["3 · VSO re-auth<br/>(k8s auth → fresh<br/>dynamic creds)"]:::proc + VSO --> ERP["4 · Scale up ERP<br/>last (depends on DB +<br/>injected secrets)"]:::src + + classDef dead fill:#6b7280,stroke:#4b5563,color:#fff + classDef store fill:#7c3aed,stroke:#6d28d9,color:#fff + classDef proc fill:#059669,stroke:#047857,color:#fff + classDef src fill:#2563eb,stroke:#1e40af,color:#fff +``` + +1. **Restore Longhorn first.** Persistent volumes must be back and attachable before any stateful workload starts. If the engine IDs changed (the failure mode above), recover the data with the block-device injection playbook before proceeding. Nothing that mounts a PVC can come up until this is done. +2. **Unseal Vault.** Vault restarts **sealed** and serves nothing until a human unseals it with the single key from `~/.arcodange/cluster-keys.json` (threshold 1). This is the secret-flow chokepoint; see [Secrets & Vault](secrets-and-vault.md). No secret consumer recovers before this step. +3. **VSO re-authenticates.** Once Vault is unsealed, the Vault Secrets Operator re-auths over the Kubernetes auth backend and obtains fresh dynamic credentials (notably new `postgres/creds/<app>` leases) that workloads need. Until VSO has re-populated the Kubernetes Secrets, apps would start with stale or missing credentials. +4. **Scale up ERP last.** ERP is the most dependency-heavy app: it needs both the database (on pi2) reachable and its Vault-injected secrets present. Bringing it up only after steps 1–3 are confirmed avoids a crash-loop against a half-recovered platform. + +The sole source for this drill's facts (Longhorn restore, Vault unseal, VSO re-auth, ERP scaled up last, plus the 1-key/threshold-1 unseal detail) is CLUSTER_RECOVERY.md, kept at the lab root, outside this repo. + +## Why this is rehearsed in the sandbox + +A recovery procedure that has only been run once, under the stress of a real outage, is a liability.
The production-like sandbox exists partly so this exact sequence can be **rehearsed deliberately** — kill the cluster, lose the engine IDs on a test volume, force a sealed Vault, and walk the four-step order back to green — without risking production data or a live ERP. That makes the drill a routine QA exercise rather than a one-shot incident memory. The QA approach for these drills is laid out in the PRD's [QA strategy](../../PRD/safe-prod-like-environment/qa-strategy.md), and the overall decision to maintain such an environment is [ADR 0001](../../ADR/0001-safe-prod-like-environment.md). + +## See also + +- [Longhorn PVC recovery ADR](../../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md) — engine-ID failure mode, the five recovery methods, and the block-device injection automation. +- [Secrets & Vault](secrets-and-vault.md) — the unseal model and why it gates step 2 of the recovery order. +- [Factory brick](01-factory.md) — the Ansible recover/ playbooks, the ArgoCD app-of-apps, and the Postgres-on-pi2 foundation. +- [PRD — QA strategy](../../PRD/safe-prod-like-environment/qa-strategy.md) — how recovery drills become routine QA. +- [ADR 0001 — Safe, production-like environment](../../ADR/0001-safe-prod-like-environment.md). +- CLUSTER_RECOVERY.md — the tested power-cut recovery record (lab root, outside this repo). 
diff --git a/vibe/investigations/INV-001-prod-blast-radius-couplings.md b/vibe/investigations/INV-001-prod-blast-radius-couplings.md new file mode 100644 index 0000000..b6d3520 --- /dev/null +++ b/vibe/investigations/INV-001-prod-blast-radius-couplings.md @@ -0,0 +1,206 @@ +[vibe](../README.md) > [Investigations](README.md) > **INV-001 · Prod blast-radius couplings** + +# INV-001: Prod blast-radius couplings + +> **Status**: Complete +> **Date**: 2026-06-23 +> **Priority**: 🔴 P1 +> **Related**: [ADR-0001 · Safe prod-like environment](../ADR/0001-safe-prod-like-environment.md) · [PRD · Isolation boundary](../PRD/safe-prod-like-environment/isolation-boundary.md) + +> [!NOTE] +> **Origin.** This investigation was spun out of the [safe-environment ADR](../ADR/0001-safe-prod-like-environment.md) and [PRD](../PRD/safe-prod-like-environment/README.md) design work, to enumerate exactly which prod couplings a sandbox must isolate before any sandbox run can be trusted not to mutate live production. + +## Objectives + +What this investigation set out to answer: + +- [x] Enumerate every place where the GitOps repos hardcode a **live prod endpoint, credential path, state location, or auto-reconciling controller**. +- [x] For each, cite the **concrete file and value** so the isolation control can be written against a known target. +- [x] State the **blast radius** of an accidental sandbox-targets-prod mistake per coupling. +- [x] Name the **sandbox control** that severs each coupling (cross-referenced to the PRD isolation boundary). +- [x] Classify severity so Phase 0 guardrails know what *must* land first. + +## Executive summary + +The lab is administered from a single MacBook holding kubeconfig, the Vault unseal key, and every cloud admin token; the repos point at live prod by default, so any sandbox seeded from those same repos will hit prod unless explicitly fenced off. 
The worst couplings are the **PostgreSQL superuser provider hardwired to `192.168.1.202`** (a wrong apply can `DROP`/`ALTER` the live ERP, CMS and other business DBs), the **Vault unseal key at a fixed path `~/.arcodange/cluster-keys.json`** (a botched sandbox init could overwrite the one key that unseals prod), and the **Ansible inventory targeting `192.168.1.201-203`** (a misdirected playbook can wipe disks or reset k3s on the prod Pis). Significant but one step down: the **ArgoCD app-of-apps with `prune: true` + `selfHeal: true`** pointed at `targetRevision: HEAD` of the live Gitea (an auto-reconcile can delete live resources fleet-wide, though the controller is latent until ArgoCD is deployed), the single GCS state bucket `arcodange-tf` shared across all stacks, and the Cloudflare/OVH/Zoho tokens that control public `arcodange.fr` DNS and email. Each coupling has a clean sandbox control (separate inventory + prod-IP abort guard, separate Vault + unseal path, separate state prefix family, plan-only DNS). **None of these controls exist yet, so Phase 0 must build them before the first sandbox run.** + +## Findings + +Each finding leads with a one-line **Brief**, then the concrete evidence (file + value), then the bold **Finding** with a severity tag. + +### 1. PostgreSQL superuser provider hardwired to the live host + +**Brief.** The Postgres OpenTofu stack connects as **superuser** straight to the production database host with no environment switch. + +Evidence, from [`postgres/iac/providers.tf`](../../postgres/iac/providers.tf): + +```hcl +provider "postgresql" { + host = "192.168.1.202" + username = var.POSTGRES_USERNAME + password = var.POSTGRES_PASSWORD + sslmode = "disable" + superuser = true +} +``` + +The host is a literal: there is no `var.pg_host`, no workspace gate, no profile. This is the same PG instance (the docker-compose on pi2, outside k3s) that backs the Dolibarr ERP, the CMS, and every app DB created per the [`<app>` convention](../../doc/runbooks/new-web-app/conventions.md). A superuser session here can `DROP DATABASE`, `ALTER ROLE`, or revoke logins on any of them.
Running `tofu apply` from a sandbox shell that still has the prod state and creds wired would act on live data. + +**Finding:** Highest-risk coupling. A single wrong `apply` against `192.168.1.202` as superuser can cause **irreversible ERP/business data loss**. The sandbox PG must be the docker-compose on the sandbox "pi2-equivalent", and a guard must **refuse to apply when `host == 192.168.1.202` and `workspace != prod`**. 🔴 + +### 2. Vault address + unseal key at a fixed local path + +**Brief.** Every IaC stack authenticates to the one prod Vault, and the unseal key sits at a single hardcoded path that a sandbox init could clobber. + +Evidence — both [`iac/providers.tf`](../../iac/providers.tf) and [`postgres/iac/providers.tf`](../../postgres/iac/providers.tf) declare: + +```hcl +provider "vault" { + address = "https://vault.arcodange.lab" + auth_login_jwt { + mount = "gitea_jwt" + role = "gitea_cicd" + } +} +``` + +The prod unseal key (1 key, threshold 1) lives at `~/.arcodange/cluster-keys.json` — the single secret that brings prod Vault back after a seal (see the lab-root recovery runbook, named below). Vault is the **single source of truth** for all secrets, so a policy/auth/mount change against `vault.arcodange.lab`, or an init/operator step that writes to the default key path, has fleet-wide reach: VSO across every namespace re-reads from this Vault. + +**Finding:** Two failure modes. (a) Sandbox IaC pointed at `vault.arcodange.lab` can rewrite prod policies/auth and lock out VSO → fleet-wide secret outage. (b) A botched sandbox `vault operator init` writing to `~/.arcodange/cluster-keys.json` would **overwrite the prod unseal key**, making prod unrecoverable after the next seal. Sandbox needs a **separate Vault** and the unseal-key path overridden to `~/.arcodange/sandbox/cluster-keys.json`. 🔴 + +### 3. 
ArgoCD app-of-apps: live repoURL + HEAD + prune/selfHeal + +**Brief.** The app-of-apps template renders Applications that auto-reconcile against `HEAD` of the live Gitea, with pruning and self-heal on by default. + +Evidence — [`argocd/templates/apps.yaml`](../../argocd/templates/apps.yaml) loops `values.gitea_applications` into Application CRDs: + +```yaml +source: + repoURL: https://gitea.arcodange.lab/{{ $org }}/{{ $app_name }} + targetRevision: HEAD + path: chart +syncPolicy: + automated: + prune: true + selfHeal: true +``` + +[`argocd/values.yaml`](../../argocd/values.yaml) lists the live apps (`url-shortener`, `tools`, `webapp`, `telegram-gateway`, `erp`, `cms`, `dance-lessons-coach`); several add ArgoCD Image Updater annotations that chase `:latest` digests. `prune: true` means a resource that disappears from git is deleted from the cluster; `selfHeal: true` means manual changes are reverted. `targetRevision: HEAD` means the live cluster follows whatever lands on the default branch. + +> [!NOTE] +> ArgoCD itself is not currently deployed in-cluster (it is commented out in `03_cicd`), so this controller is **latent** today. The coupling matters the moment ArgoCD is enabled — and sandbox work to enable/iterate on ArgoCD is exactly when it would bite. + +**Finding:** When ArgoCD is live, a sandbox change to the app-of-apps (or an Image Updater misfire) reconciled against the prod `repoURL`/`HEAD` can **prune live resources fleet-wide**. Sandbox needs its own ArgoCD pointed at a **sandbox branch/Gitea** so it only syncs sandbox refs. 🟠 + +### 4. Ansible inventory targets the live Pi IPs + +**Brief.** The only inventory targets the three production Raspberry Pis directly; a stray playbook run hits prod hardware. + +Evidence — [`ansible/arcodange/factory/inventory/hosts.yml`](../../ansible/arcodange/factory/inventory/hosts.yml): + +```yaml +raspberries: + hosts: + pi1: { preferred_ip: 192.168.1.201, ... } + pi2: { preferred_ip: 192.168.1.202, ... 
} + pi3: { preferred_ip: 192.168.1.203, ... } +postgres: + hosts: { pi2: } +``` + +The numbered playbooks (`01_system` … `05_backup`) and the `recover/` plays operate on these hosts; `01_system`-class roles can wipe disks, re-init k3s, or disturb Longhorn replicas. There is **no sandbox inventory and no guard** — `ansible-playbook` defaults straight at prod. + +**Finding:** A misdirected playbook against `192.168.1.201-203` can **wipe disks / reset k3s / corrupt Longhorn** on prod. Sandbox needs a separate `inventory/sandbox/hosts.yml` (VM hosts only) **plus a pre-task that aborts if any target IP is in `192.168.1.201-203` unless `i_mean_prod=true`**. 🔴 + +### 5. Single GCS state bucket shared by every stack + +**Brief.** All OpenTofu stacks store state in one bucket, separated only by `prefix`; a wrong backend config writes prod state. + +Evidence — [`iac/backend.tf`](../../iac/backend.tf) uses `bucket = "arcodange-tf"`, `prefix = "factory/main"`; [`postgres/iac/backend.tf`](../../postgres/iac/backend.tf) uses the same bucket with `prefix = "factory/postgres"`. Sibling stacks (`tools`, `cms`) follow the same bucket-with-prefix pattern. State is the authoritative map of real resources — a sandbox run that inherits the prod backend will read prod state, plan against prod resources, and on `apply` mutate them. + +> [!TIP] +> Note the name collision risk: [`iac/cloudflare.tf`](../../iac/cloudflare.tf) also creates a Cloudflare **R2** bucket literally named `arcodange-tf`. The GCS state bucket and the R2 object bucket share a name but are different stores; do not conflate them when scoping sandbox state. + +**Finding:** Without isolation, sandbox `tofu` reads/writes **prod state** in `arcodange-tf`. Sandbox needs a **sandbox prefix family** (`sandbox/factory/main`, `sandbox/factory/postgres`, …) via a backend-config override, or a separate bucket `arcodange-tf-sandbox`. 🟠 + +### 6. 
Cloudflare account / OVH arcodange.fr / Zoho — live public DNS & email + +**Brief.** IaC holds tokens that manage the public `arcodange.fr` zone, the OVH registrar nameservers, and the Zoho mail records; a wrong record silently breaks company email. + +Evidence — [`iac/providers.tf`](../../iac/providers.tf) declares `provider "cloudflare" {}` (token via `CLOUDFLARE_API_TOKEN`) and `provider "ovh" { endpoint = "ovh-eu" }`. [`iac/cloudflare.tf`](../../iac/cloudflare.tf) resolves the account by `arcodange@gmail.com` and a `cf_arcodange_cms_token` granting `zone:DNS Write`, `Cloudflare Tunnel Write`, `Turnstile Sites Write`, etc. [`iac/ovh.tf`](../../iac/ovh.tf) grants `domain:apiovh:nameServer/edit` on `urn:v1:eu:resource:domain:arcodange.fr`. The Zoho mail wiring (MX/SPF/DKIM/DMARC/BIMI + aliases) lives in the sibling `cms` repo's `zoho/` and the `arcodange.fr` zone is managed at Cloudflare. A bad MX/SPF/DKIM record breaks `arcodange.fr` mail **silently, for days**. + +**Finding:** These are **public, customer-facing, and slow-to-detect**. The blast radius is broken company email and public site/tunnel exposure. Sandbox must run these modules **plan-only against a throwaway zone/subdomain with a separate token**; the real `arcodange.fr` token must **never** be exported into a sandbox shell. Real public DNS/ACME end-to-end is out of scope. 🟠 + +### 7. Longhorn backup target points at the prod backup bucket + +**Brief.** Longhorn's backup target is a fixed S3 bucket; a sandbox restore drill could overwrite prod backups. + +Evidence — [`argocd/templates/longhorn_backup_target.yaml`](../../argocd/templates/longhorn_backup_target.yaml) sets: + +```yaml +"backup-target": "s3://arcodange-backup@us-east-1/" +"backup-target-credential-secret": "longhorn-gcs-backup-credentials" +``` + +The credentials are injected by VSO from Vault path `kvv2 longhorn/gcs-backup`. 
Recovery drills are a core sandbox use-case (re-run `recover/longhorn*.yml`, validate engine-ID re-association per the [Longhorn PVC recovery ADR](../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md)) — but a drill that writes to `arcodange-backup` could clobber the real restore points. + +**Finding:** Sandbox restore drills against `s3://arcodange-backup` risk **overwriting prod backups** — the worst kind of failure during a recovery rehearsal. Sandbox backup target must be a **separate bucket/prefix**. 🟠 + +### 8. Gitea base_url + the restricted CI module-reader user + +**Brief.** The Gitea provider and a created CI user both bind to the live Gitea; sandbox IaC can mutate prod repos, secrets, and users. + +Evidence — `provider "gitea" { base_url = "https://gitea.arcodange.lab" }` in [`iac/providers.tf`](../../iac/providers.tf). [`iac/gitea_tofu_ci_user.tf`](../../iac/gitea_tofu_ci_user.tf) creates the `tofu_module_reader` user, an SSH key, and stores it in Vault `kvv1/gitea/tofu_module_reader`; [`iac/cloudflare.tf`](../../iac/cloudflare.tf) and [`iac/ovh.tf`](../../iac/ovh.tf) push repo actions secrets (`CLOUDFLARE_API_TOKEN`, `OVH_CLIENT_ID`, …) onto the live `cms` repo. A sandbox apply against this provider rewrites prod repo secrets and CI users. + +**Finding:** Sandbox IaC pointed at `gitea.arcodange.lab` can **mutate prod repo CI secrets and users**, indirectly poisoning prod CI. Sandbox needs its own Gitea (sandbox cluster or org `arcodange-sandbox`); ArgoCD app-of-apps then points at sandbox refs (see finding 3). 
🟠 + +## Summary table + +| Coupling | Where (file · value) | Blast radius | Sandbox control | +| --- | --- | --- | --- | +| PG superuser provider | `postgres/iac/providers.tf` · `host = 192.168.1.202`, `superuser = true` | Drop/alter live ERP + app DBs → irreversible data loss | Sandbox PG (docker-compose on sandbox pi2-eq); guard refuses apply if `host == 192.168.1.202 && workspace != prod` | +| Vault address + unseal key | `iac/providers.tf` / `postgres/iac/providers.tf` · `vault.arcodange.lab` · key `~/.arcodange/cluster-keys.json` | Lock out VSO fleet-wide; overwrite prod unseal key → prod unrecoverable | Separate sandbox Vault; unseal path → `~/.arcodange/sandbox/cluster-keys.json` | +| ArgoCD app-of-apps | `argocd/templates/apps.yaml` · `repoURL gitea.arcodange.lab/...`, `HEAD`, `prune+selfHeal` | Auto-prune live resources fleet-wide (latent until ArgoCD deployed) | Sandbox ArgoCD → sandbox branch/Gitea; sync only sandbox refs | +| Ansible inventory | `ansible/.../inventory/hosts.yml` · `192.168.1.201-203` | Wipe disks / reset k3s / corrupt Longhorn on prod Pis | `inventory/sandbox/hosts.yml` (VMs only) + prod-IP abort guard unless `i_mean_prod=true` | +| GCS state bucket | `iac/backend.tf` / `postgres/iac/backend.tf` · `bucket = arcodange-tf` | Read/write prod state → plan & apply mutate prod resources | `sandbox/...` prefix family or `arcodange-tf-sandbox` bucket via backend override | +| Cloudflare / OVH / Zoho | `iac/providers.tf`, `iac/cloudflare.tf`, `iac/ovh.tf` (+ cms `zoho/`) · `arcodange.fr`, account `arcodange@gmail.com` | Break public DNS / company email silently for days | Plan-only against throwaway zone + separate token; never export the real `arcodange.fr` token into sandbox | +| Longhorn backup target | `argocd/templates/longhorn_backup_target.yaml` · `s3://arcodange-backup@us-east-1/` | Restore drill overwrites prod backups | Separate sandbox backup bucket/prefix | +| Gitea base_url + CI user | `iac/providers.tf` · `gitea.arcodange.lab` · 
`iac/gitea_tofu_ci_user.tf` | Rewrite prod repo CI secrets/users → poison prod CI | Sandbox Gitea (cluster or org `arcodange-sandbox`) | + +## Classification + +| # | Coupling | Severity | +| --- | --- | --- | +| 1 | PG superuser provider → `192.168.1.202` | 🔴 Critical — confirmed irreversible data-loss path | +| 2 | Vault address + unseal key path | 🔴 Critical — prod-secret outage and unrecoverable-unseal path | +| 4 | Ansible inventory → live Pi IPs | 🔴 Critical — disk-wipe / cluster-reset on prod hardware | +| 3 | ArgoCD app-of-apps prune/selfHeal | 🟠 Significant — fleet-wide prune; latent until ArgoCD is deployed | +| 5 | Shared GCS state bucket | 🟠 Significant — prod state mutation if backend not overridden | +| 6 | Cloudflare / OVH / Zoho public DNS & email | 🟠 Significant — public, customer-facing, slow to detect | +| 7 | Longhorn backup target | 🟠 Significant — prod backup overwrite during drills | +| 8 | Gitea base_url + CI user | 🟠 Significant — prod CI-secret/user mutation | + +| Emoji | Meaning | +| --- | --- | +| 🔴 | Critical — data-loss or breaking risk confirmed | +| 🟠 | Significant — degraded but working, or a real coupling to watch | +| 🟡 | Minor — worth noting, low impact | +| 🟢 | Healthy — verified safe / no issue found | +| 🔵 | Informational — context, no action implied | + +## Open questions + +- **Guard enforcement layer.** Should the prod-IP abort live as an Ansible pre-task only, or also as a wrapper script around `tofu`/`ansible-playbook` so the same fence covers both tools uniformly? (Phase 0 decision.) +- **Vault path discipline.** Beyond the unseal-key path, are there other tools (backup scripts, recovery runbook steps) that read a hardcoded `~/.arcodange/cluster-keys.json`? A grep sweep of the lab-root scripts is needed so the sandbox override is complete, not partial. +- **State backend override ergonomics.** Prefix family vs. 
a separate `arcodange-tf-sandbox` bucket: the bucket option is harder to misconfigure (no shared blast radius at all) but adds a provisioning step. Decide in the PRD isolation boundary. +- **ArgoCD readiness.** Since ArgoCD is currently latent (commented out in `03_cicd`), confirm whether enabling it should happen *first* in the sandbox (so its prune behaviour is rehearsed before it ever reaches prod). +- **Throwaway DNS zone choice.** Which subdomain/zone and token scope are acceptable for plan-only Cloudflare/OVH tests without touching `arcodange.fr`? + +## References + +- [ADR-0001 · Safe prod-like environment](../ADR/0001-safe-prod-like-environment.md): the decision this investigation supports. +- [PRD · Safe prod-like environment (hub)](../PRD/safe-prod-like-environment/README.md) and its [Isolation boundary leaf](../PRD/safe-prod-like-environment/isolation-boundary.md): where each coupling's control is specified in full. +- [Lab ecosystem guidebook](../guidebooks/lab-ecosystem/README.md): background on the prod topology and the `<app>` join key. +- [New-web-app conventions (`<app>` join key)](../../doc/runbooks/new-web-app/conventions.md): why one identifier keys repo, DB, Vault, namespace, and DNS together. +- [Longhorn PVC recovery ADR](../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md): engine-ID re-association relevant to the backup-target and restore-drill coupling. +- `CLUSTER_RECOVERY.md` (at the lab root, outside this repo): the tested power-cut recovery runbook; the unseal-key path and Vault-seal recovery referenced in findings 2 and 7 come from it.
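
Finding 4 calls for a pre-task that aborts when a target IP falls in `192.168.1.201-203` unless `i_mean_prod=true`. A minimal sketch of such a guard follows, assuming it runs as the first play of every playbook and that hosts expose the `preferred_ip` variable seen in `hosts.yml`. The play name, the `prod_ips` variable, and the fallback to `ansible_host` are hypothetical, and nothing like this exists in the repo yet.

```yaml
# Hypothetical guard play; `prod_ips` and the play layout are illustrative.
- name: Guard against accidental runs on the production Pis
  hosts: all
  gather_facts: false
  any_errors_fatal: true        # one offending host stops the run for every host
  vars:
    prod_ips:
      - "192.168.1.201"
      - "192.168.1.202"
      - "192.168.1.203"
  pre_tasks:
    - name: Abort on a prod IP unless i_mean_prod=true was passed
      ansible.builtin.assert:
        that:
          - >-
            (preferred_ip | default(ansible_host | default(inventory_hostname)))
            not in prod_ips
            or (i_mean_prod | default(false) | bool)
        fail_msg: >-
          {{ inventory_hostname }} resolves to a production IP.
          Re-run with -e i_mean_prod=true only if production is the intended target.
```

With this in place, a sandbox run that accidentally loads the prod inventory would fail before any role executes, while a deliberate production run passes `-e i_mean_prod=true`. The same check could also live in a wrapper script so that `tofu` and `ansible-playbook` share one fence, which is the question left open under "Guard enforcement layer".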
diff --git a/vibe/investigations/README.md b/vibe/investigations/README.md new file mode 100644 index 0000000..0387c27 --- /dev/null +++ b/vibe/investigations/README.md @@ -0,0 +1,45 @@ +[vibe](../README.md) > **Investigations** + +# Investigations + +> **Status**: 🟢 Active +> **Last Updated**: 2026-06-23 +> **Related**: [vibe/ADR](../ADR/README.md) · [vibe/PRD](../PRD/README.md) + +`vibe/investigations/` collects focused inquiries: a question is asked, evidence is gathered, and findings are recorded. Investigations feed ADRs (a decision often rests on an investigation) and PRDs (scoping rests on what we learned). Start from [`_template.md`](_template.md). + +## Convention + +- **Prefer a single numbered file** named `INV-NNN-slug.md`. Most investigations need nothing more. +- **When notebooks or data are involved**, keep `INV-NNN-slug.md` as a short **stub** (an origin note + a link) sitting *beside* a same-named folder `INV-NNN-slug/` that holds: + - notebooks — each `.ipynb` paired with an exported `.py` (so diffs and review work on plain text), + - a `_data/` directory for inputs/outputs, + - a `notebook_simple.md` — a plain-language walkthrough (visuals + explanations anyone can follow, no code required). +- **Diagonal-reading style.** Each finding section is written so a skimmer gets the point fast: lead with a one-line **Brief**, then the **evidence**, then a bold **Finding**. A reader can scan the Briefs and Findings alone and still understand the conclusion. +- **No-tombstone rule** applies: write findings as currently true. Corrections are made in place; git history is the audit trail. 
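The stub + folder layout described in the convention above could be checked mechanically. The sketch below is one possible shape, not an existing repo script: `check_investigation_folder` is a hypothetical helper, and it assumes notebooks sit at the top level of the `INV-NNN-slug/` folder.

```bash
#!/usr/bin/env bash
# Sketch: check one investigation folder against the stub + folder convention.
# check_investigation_folder is a hypothetical helper, not part of the repo.
check_investigation_folder() {
  local dir="${1%/}" status=0 nb
  # The stub INV-NNN-slug.md sits beside the same-named folder.
  [ -f "${dir}.md" ] || { echo "missing stub: ${dir}.md" >&2; status=1; }
  [ -d "${dir}/_data" ] || { echo "missing ${dir}/_data/" >&2; status=1; }
  [ -f "${dir}/notebook_simple.md" ] || { echo "missing ${dir}/notebook_simple.md" >&2; status=1; }
  # Every notebook needs its exported .py next to it (assumed: top level only).
  for nb in "${dir}"/*.ipynb; do
    [ -e "$nb" ] || continue   # the glob matched nothing
    [ -f "${nb%.ipynb}.py" ] || { echo "unpaired notebook: $nb" >&2; status=1; }
  done
  return "$status"
}
```

Called as `check_investigation_folder vibe/investigations/INV-NNN-slug`, it returns non-zero and lists each missing piece on stderr, which makes it easy to wire into a pre-commit hook if that is ever wanted.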
+ +## Classification legend + +Use these on findings and in the Classification table to signal severity / nature: + +| Emoji | Meaning | +| --- | --- | +| 🔴 | Critical — data-loss or breaking risk confirmed | +| 🟠 | Significant — degraded but working, or a real coupling to watch | +| 🟡 | Minor — worth noting, low impact | +| 🟢 | Healthy — verified safe / no issue found | +| 🔵 | Informational — context, no action implied | + +## Index + +| # | Title | Status | Date | +| --- | --- | --- | --- | +| [INV-001](INV-001-prod-blast-radius-couplings.md) | Prod blast-radius couplings | ✅ Complete | 2026-06-23 | + +## Rules to contribute + +1. Copy [`_template.md`](_template.md) to `INV-NNN-slug.md` using the next free sequence number and delete the top HTML-comment note. +2. Fill in the blockquote (Status / Date / Priority / Related), Objectives, Executive summary, Findings (Brief → evidence → Finding each), Summary table, Classification, Open questions, References. +3. If the investigation needs notebooks or data, convert the file into a stub and create the matching `INV-NNN-slug/` folder with paired `.ipynb`/`.py`, `_data/`, and `notebook_simple.md`. +4. Add a row to the Index table above. +5. Cross-link any ADR or PRD this investigation informs (and have them link back). Bidirectional links are mandatory. diff --git a/vibe/investigations/_template.md b/vibe/investigations/_template.md new file mode 100644 index 0000000..df6bfb1 --- /dev/null +++ b/vibe/investigations/_template.md @@ -0,0 +1,66 @@ +[vibe](../README.md) > [Investigations](README.md) > **_template** + + + +# INV-NNN: Title + +> **Status**: In progress | Complete | Blocked +> **Date**: YYYY-MM-DD +> **Priority**: 🔴 High | 🟠 Medium | 🟡 Low +> **Related**: ADR-NNNN · PRD · upstream/downstream links + +## Objectives + +What this investigation set out to answer. Use checkboxes so progress is visible at a glance. 
+ +- [ ] Question or goal 1 +- [ ] Question or goal 2 + +## Executive summary + +Two or three sentences a busy reader can absorb without scrolling: what was investigated, what was found, and what (if anything) to do about it. + +## Findings + +Write each finding for diagonal reading: lead with the Brief, then evidence, then the bold Finding. + +### 1. Finding title + +**Brief.** One line stating what this section is about. + +Evidence — the data, logs, code paths, commands, or reasoning that support the conclusion. + +**Finding:** the bold takeaway. Tag with a classification emoji (🔴 / 🟠 / 🟡 / 🟢 / 🔵). + +### 2. Finding title + +**Brief.** ... + +Evidence ... + +**Finding:** ... + +## Summary table + +| # | Finding | Classification | Action | +| --- | --- | --- | --- | +| 1 | ... | 🟠 | ... | +| 2 | ... | 🟢 | ... | + +## Classification + +| Emoji | Meaning | +| --- | --- | +| 🔴 | Critical — data-loss or breaking risk confirmed | +| 🟠 | Significant — degraded but working, or a real coupling to watch | +| 🟡 | Minor — worth noting, low impact | +| 🟢 | Healthy — verified safe / no issue found | +| 🔵 | Informational — context, no action implied | + +## Open questions + +- Anything left unresolved, deferred, or needing a follow-up investigation. + +## References + +- Commands, files, PRs, ADRs, PRDs, or external docs consulted (descriptive link text, never "here"/"this"). 
diff --git a/vibe/runbooks/README.md b/vibe/runbooks/README.md new file mode 100644 index 0000000..3286606 --- /dev/null +++ b/vibe/runbooks/README.md @@ -0,0 +1,60 @@ +[vibe](../README.md) > **Runbooks** + +# Runbooks + +> **Status:** Active (conventions + template only — first concrete runbook lands with PRD Phase 1) +> **Last Updated:** 2026-06-23 +> **Related:** [vibe guidebooks](../guidebooks/README.md) · [vibe shareouts](../shareouts/README.md) · [FRENCH human runbooks under doc/runbooks](../../doc/runbooks/README.md) + +## What lives here + +`vibe/runbooks/` holds **agent-oriented operational runbooks, written in English** (this tree is for LLM agents). Each runbook is an ordered procedure where every step is tagged with an actor marker: + +- **`[AGENT]`** — read-only or otherwise safe steps an agent may execute autonomously (inspecting state, dry-runs, generating files, running tests in a sandbox). +- **`[HUMAN]`** — production-mutating steps that require **explicit human approval** before they run (anything that writes to live infrastructure, deletes data, or changes the trunk). + +The marker is load-bearing: it tells an agent reading the runbook exactly where its autonomy ends and where it must stop and hand control back to a human. + +```mermaid +%%{init: {'theme': 'base'}}%% +flowchart LR + classDef agent fill:#059669,stroke:#047857,color:#fff + classDef human fill:#dc2626,stroke:#b91c1c,color:#fff + classDef gate fill:#7c3aed,stroke:#6d28d9,color:#fff + A["[AGENT] safe steps
(inspect, dry-run, generate)"]:::agent --> G{"approval<br/>gate"}:::gate --> H["[HUMAN] prod-mutating steps
(explicit approval required)"]:::human +``` + +1. An agent executes the `[AGENT]`-tagged steps on its own — these only read state or act inside a sandbox. +2. When the procedure reaches a prod-mutating step, the agent stops at an approval gate. +3. A human reviews and approves; only then do the `[HUMAN]`-tagged steps run against live infrastructure. + +## Not the same as `doc/runbooks` + +> [!IMPORTANT] +> There are **two** runbook collections in this lab, and they serve different readers — do not merge them. +> +> | Collection | Reader | Language | Step markers | +> |---|---|---|---| +> | **`vibe/runbooks/`** (this folder) | LLM agents | English | `[AGENT]` / `[HUMAN]` | +> | **[`doc/runbooks/`](../../doc/runbooks/README.md)** | Human operators | French | prose procedures | +> +> The canonical, human-facing operator procedures (e.g. [Nouvelle application web](../../doc/runbooks/new-web-app/README.md)) live in French under `doc/runbooks/`. This folder is the agent-facing mirror: same operational reality, written so an autonomous agent can execute the safe parts and gate the dangerous ones. + +## Index + +| Runbook | Summary | Status | +|---|---|---| +| [_template](_template.md) | Skeleton for new agent-oriented runbooks (`[AGENT]`/`[HUMAN]` markers, copy-paste commands, verification + rollback) | ✅ Active | + +> [!NOTE] +> The first **concrete** runbook — a local sandbox game-day for the safe prod-like environment — ships with **PRD Phase 1** ([safe-prod-like-environment PRD](../PRD/safe-prod-like-environment/README.md)). Until then this folder holds the conventions and the template only. + +## Rules to contribute + +1. **Start from [`_template.md`](_template.md).** Copy it, rename to `kebab-case.md`, fill every section, then add a row to the index table above. +2. **Tag every procedure step** `[AGENT]` or `[HUMAN]`. When in doubt, tag it `[HUMAN]` — over-gating is safe, under-gating is not. +3. 
**Use the `tree-docs` skill** and keep the breadcrumb spine: first line is the breadcrumb trail, ancestors as relative links, current page bold-unlinked, separator ` > `. +4. **README hub stays current** — every new runbook gets an index row here with a one-line summary and status. +5. **Bidirectional links.** If a runbook references a guidebook, ADR, or the French operator runbook, link back from there too. Use descriptive link text. +6. **Commands are copy-paste ready** — put them in fenced ```bash blocks, with the `[HUMAN]`/`[AGENT]` marker on the step that owns them. +7. **Status legend.** ✅ done · 🟡 beta · 🔴 critical · ⚠️ known issue · ❌ disabled · ⬜ not started. diff --git a/vibe/runbooks/_template.md b/vibe/runbooks/_template.md new file mode 100644 index 0000000..d959e28 --- /dev/null +++ b/vibe/runbooks/_template.md @@ -0,0 +1,80 @@ + + +[vibe](../README.md) > [Runbooks](README.md) > **_template** + +# + +> **Status:** ⬜ Not started +> **Audience:** LLM agents (English). For the human-operator equivalent see the French [doc/runbooks](../../doc/runbooks/README.md). +> **Last Updated:** 2026-06-23 + +## TL;DR + +> [!TIP] +> + +## Scope + +` or environment in play.> + +## Preconditions + + + +- [ ] Working in a worktree under `.claude/worktrees//` (never the trunk). +- [ ] Access to confirmed. +- [ ] . + +## Procedure + + + +1. **[AGENT]** + + ```bash + # read-only example + kubectl --context get pods -n + ``` + +2. **[AGENT]** + + ```bash + # safe generation / sandbox example + tofu -chdir= plan + ``` + +3. **[HUMAN]** + + ```bash + # prod-mutating example — only after approval + tofu -chdir= apply + ``` + +4. 
**[HUMAN]** + +## Verification + + + +```bash +# verification example +kubectl --context get application -n argocd -o jsonpath='{.status.sync.status}' +# expected: Synced +``` + +## Rollback + + + +## References + +- +- +- diff --git a/vibe/shareouts/2026-06-23-vibe-and-safe-env/README.md b/vibe/shareouts/2026-06-23-vibe-and-safe-env/README.md new file mode 100644 index 0000000..06134eb --- /dev/null +++ b/vibe/shareouts/2026-06-23-vibe-and-safe-env/README.md @@ -0,0 +1,54 @@ +[vibe](../../README.md) > [Shareouts](../README.md) > **2026-06-23 · Vibe & environnement sûr** + +# Vibe & environnement sûr — ce qu'on a posé le 2026-06-23 + +> **Date :** 2026-06-23 · **Statut :** Actif · **Audience :** humains du lab Arcodange (lecteurs non spécialistes bienvenus) + +> [!TIP] +> **TL;DR** +> - On a créé `vibe/`, la **base de connaissances pour agents IA**, et un `AGENTS.md` qui sert de carte de l'écosystème et de règles du jeu. +> - On a écrit le **premier ADR** (la décision) et la **première PRD** (la spécification) d'un **environnement sûr de type production, en LOCAL uniquement**. +> - On a ajouté un **guidebook de l'écosystème** : la carte « qui parle à qui » entre les dépôts `factory`, `tools` et `cms`. + +## Ce qui a été fait + +- **Le dossier `vibe/`** — un tronc de documentation pensé pour les agents IA qui travaillent sur le lab. Il range le *pourquoi* (ADR), le *quoi/quand* (PRD), le *ce-qu'on-a-trouvé* (investigations), le *comment-ça-s'emboîte* (guidebooks), le *comment-faire* (runbooks) et le *ce-qu'on-a-dit-aux-humains* (shareouts, comme cette page). +- **Un `AGENTS.md`** — la carte de l'écosystème et le règlement : conventions de nommage, gestion des secrets, style de documentation, règles de branches et de PR. Un agent qui débarque le lit en premier. +- **Le premier ADR** — [« Safe, production-like environment »](../../ADR/0001-safe-prod-like-environment.md) : la décision actée, figée comme un fait historique. 
+- **La première PRD** — [« Safe, production-like environment »](../../PRD/safe-prod-like-environment/README.md) : la spécification détaillée (problème, objectifs, périmètre, critères de succès, stratégie de tests). +- **Un guidebook de l'écosystème** — [Lab ecosystem](../../guidebooks/lab-ecosystem/README.md) : la carte de bout en bout de `factory` + `tools` + `cms`, et de la convention `` qui les relie. + +## Pourquoi ça compte + +Aujourd'hui, tester une modification revient trop souvent à la tester **directement sur ce qui tourne pour de vrai**. Or « pour de vrai » ici, ce sont des services qui comptent : + +- le **mail `arcodange.fr`** (Zoho : MX, SPF, DKIM, DMARC, alias) — une fausse manip et des courriels se perdent ; +- le **CMS** (le site public `arcodange.fr`) ; +- l'**ERP** et les autres applications du lab. + +Sans filet, chaque essai est un pari sur la production. L'idée posée ce jour-là : se donner un **bac à sable fidèle à la prod** où l'on peut casser, recommencer et valider **sans jamais toucher** aux services réels. + +## Décision clé + +Un **environnement local et reproductible**, isolé de la production : + +- un cluster **k3d** local, alimenté par **3 VMs arm64** qui rejouent la topologie des 3 Raspberry Pi ; +- une **frontière d'isolation stricte** : l'environnement de test ne peut ni lire ni écrire dans la prod (pas d'accès au mail réel, au DNS public, aux bases de production) ; +- un **périmètre volontairement limité au logiciel** : les niveaux **matériel** (les Pi physiques) et **cloud** (Cloudflare, OVH, GCS) restent **hors périmètre** pour cette première itération. + +Autrement dit : on reproduit la *pile applicative*, pas la *quincaillerie* ni les *comptes cloud*. C'est le bon compromis pour tester vite et sans risque. 
+ +## Liens + +- ADR — [Safe, production-like environment](../../ADR/0001-safe-prod-like-environment.md) +- PRD — [Safe, production-like environment](../../PRD/safe-prod-like-environment/README.md) +- Guidebook — [Lab ecosystem](../../guidebooks/lab-ecosystem/README.md) + +## Pour aller plus loin + +Ce dossier peut accueillir, plus tard, des supports complémentaires : + +- un **deck** de présentation des slides ; +- un **mp4** narré — capture `ffmpeg`/Playwright, ou un explainer produit avec le skill `documentary-video`. + +Il suffira de les déposer ici, à côté de ce `README.md`, et d'ajouter une ligne dans l'[index des shareouts](../README.md). diff --git a/vibe/shareouts/README.md b/vibe/shareouts/README.md new file mode 100644 index 0000000..1570ee4 --- /dev/null +++ b/vibe/shareouts/README.md @@ -0,0 +1,56 @@ +[vibe](../README.md) > **Shareouts** + +# Shareouts + +> **Status:** Active +> **Last Updated:** 2026-06-23 +> **Related:** [vibe guidebooks](../guidebooks/README.md) · [vibe runbooks](../runbooks/README.md) + +## What a shareout is + +A **shareout** captures something substantial that was **done or seriously considered** — the outcome of an ADR, a PRD, or an investigation — and packages it as a **human-facing handout** you can hand to a person to explain what happened and why it matters. + +Where guidebooks are the standing reference map and runbooks are the procedures, a shareout is a moment-in-time communication artifact: "here is the thing we built / decided / explored, told for humans." + +```mermaid +%%{init: {'theme': 'base'}}%% +flowchart LR + classDef src fill:#2563eb,stroke:#1e40af,color:#fff + classDef proc fill:#059669,stroke:#047857,color:#fff + classDef store fill:#7c3aed,stroke:#6d28d9,color:#fff + WORK["Substantial work
(ADR / PRD / investigation)"]:::src --> SO["Shareout<br/>(dated handout folder)"]:::proc --> HUM["Humans
(read .md, deck, watch .mp4)"]:::store +``` + +1. Some substantial work lands — an ADR is decided, a PRD phase ships, an investigation concludes. +2. It is packaged into a dated shareout folder holding the human-facing handouts. +3. Humans consume the handout: a Markdown summary, a slide deck, and optionally a narrated video. + +## Convention: one dated subfolder per shareout + +> [!IMPORTANT] +> Each shareout lives in **its own dated subfolder** named `YYYY-MM-DD-slug/`. The folder holds the human-facing handouts for that shareout: `.md` summaries, decks, and an optional `.mp4`. + +The handouts inside a shareout folder can include: + +- **`.md` summaries** — the written walkthrough (the folder's `README.md` is the front door). +- **Decks** — slides for presenting the work. +- **Optional `.mp4`** — `ffmpeg` compilations of screenshots, [Playwright](../../ansible/arcodange/factory/README.md) captures, or a narrated explainer produced with the `documentary-video` skill. + +> [!NOTE] +> **Handouts are written in FRENCH.** The audience is humans, and the lab's human-facing audience is French-speaking. (This is the deliberate exception to the otherwise-English `vibe/` tree, which exists for LLM agents.) + +## Index + +| Shareout | What it covers | Status | +|---|---|---| +| [2026-06-23 · Vibe & safe-env](2026-06-23-vibe-and-safe-env/README.md) | Mise en place du tronc `vibe/` (guidebooks, runbooks, shareouts) et le PRD de l'environnement « safe prod-like » | ✅ Active | + +## Rules to contribute + +1. **One dated folder per shareout** — `YYYY-MM-DD-slug/`, kebab-case slug, ISO date. Never dump loose files at this level. +2. **Folder `README.md` is the front door** — it carries the breadcrumb, a TL;DR, and links to every handout in the folder (deck, video, sub-summaries). +3. **Handouts in French** — the audience is human. (Reminder: the rest of `vibe/` is English for agents.) +4. 
**Add an index row here** for every new shareout, with a one-line French summary and status, newest at the top. +5. **Use the `tree-docs` skill** for the breadcrumb spine and the per-folder README hub; use the `documentary-video` skill when a shareout warrants a narrated `.mp4`. +6. **Bidirectional links** — if a shareout points to an ADR/PRD/guidebook, link back from there. Descriptive link text only. +7. **Status legend.** ✅ done · 🟡 beta · 🔴 critical · ⚠️ known issue · ❌ disabled · ⬜ not started.
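
Rule 1's folder naming can be checked with a small shell helper. The following is a minimal sketch, assuming a kebab-case slug means lowercase letters and digits separated by single hyphens; `is_valid_shareout_dir` is a hypothetical name, and the check covers only the shape of the date, not whether it is a real calendar day.

```bash
#!/usr/bin/env bash
# Sketch: does a folder name follow YYYY-MM-DD-slug with a kebab-case slug?
# is_valid_shareout_dir is a hypothetical helper, not part of the repo.
is_valid_shareout_dir() {
  local name re
  name="$(basename "$1")"
  # Four-digit year, two-digit month and day, then a lowercase kebab-case slug.
  re='^[0-9]{4}-[0-9]{2}-[0-9]{2}-[a-z0-9]+(-[a-z0-9]+)*$'
  [[ "$name" =~ $re ]]
}
```

For example, `is_valid_shareout_dir vibe/shareouts/2026-06-23-vibe-and-safe-env/` succeeds, while a loose name such as `notes` or a slug with uppercase letters fails.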