From 7647a68cdc1e0277cd4572af0e0d8ead89df8f2c Mon Sep 17 00:00:00 2001 From: Gabriel Radureau Date: Tue, 23 Jun 2026 11:52:37 +0200 Subject: [PATCH 1/9] docs(vibe): bootstrap vibe/ knowledge tree + ecosystem AGENTS.md MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Add a root AGENTS.md (ecosystem map of factory/tools/cms + agent operating rules + the persona cohort & workflow) and a new vibe/ knowledge base for LLM agents, modeled on tree-docs conventions and the factory house style. vibe/ folders (each with a README hub + contribution rules): - ADR/ optimized MADR-lite; canonical home going forward (doc/adr stays historical) - PRD/ one subfolder per PRD, mandatory STATUS.md, QA strategy for big ones - investigations/ single INV-NNN-slug.md, or stub + folder w/ notebooks - guidebooks/ tree-docs maps; lab-ecosystem guidebook of factory+tools+cms - runbooks/ [AGENT]/[HUMAN] step procedures (EN; doc/runbooks stays FR) - shareouts/ dated FR handouts (decks/mp4) Seed content (first ADR + PRD): a safe, production-like environment to rehearse risky changes and recovery without touching real prod — local-only sandbox (k3d + arm64 VMs) with a hard prod/sandbox isolation boundary. Includes INV-001 (prod blast-radius couplings), the ecosystem guidebook, and a FR shareout. Conventions enforced: no-tombstone rule, breadcrumb spine, bidirectional cross-links, theme:base mermaid (MCP-validated) + ordered-list-after-diagram. Built with a Workflow + persona cohort; 24 files, zero dead links. 
Co-Authored-By: Claude Opus 4.8 --- AGENTS.md | 131 +++++++++++ vibe/ADR/0001-safe-prod-like-environment.md | 84 +++++++ vibe/ADR/README.md | 45 ++++ vibe/ADR/_template.md | 41 ++++ vibe/PRD/README.md | 34 +++ vibe/PRD/safe-prod-like-environment/README.md | 69 ++++++ vibe/PRD/safe-prod-like-environment/STATUS.md | 52 +++++ .../isolation-boundary.md | 31 +++ .../safe-prod-like-environment/qa-strategy.md | 49 +++++ vibe/README.md | 50 +++++ vibe/guidebooks/README.md | 47 ++++ vibe/guidebooks/lab-ecosystem/01-factory.md | 122 +++++++++++ vibe/guidebooks/lab-ecosystem/02-tools.md | 76 +++++++ vibe/guidebooks/lab-ecosystem/03-cms.md | 82 +++++++ vibe/guidebooks/lab-ecosystem/README.md | 116 ++++++++++ .../lab-ecosystem/naming-conventions.md | 96 ++++++++ .../lab-ecosystem/secrets-and-vault.md | 110 ++++++++++ .../lab-ecosystem/storage-and-recovery.md | 76 +++++++ .../INV-001-prod-blast-radius-couplings.md | 206 ++++++++++++++++++ vibe/investigations/README.md | 45 ++++ vibe/investigations/_template.md | 66 ++++++ vibe/runbooks/README.md | 60 +++++ vibe/runbooks/_template.md | 80 +++++++ .../2026-06-23-vibe-and-safe-env/README.md | 54 +++++ vibe/shareouts/README.md | 56 +++++ 25 files changed, 1878 insertions(+) create mode 100644 AGENTS.md create mode 100644 vibe/ADR/0001-safe-prod-like-environment.md create mode 100644 vibe/ADR/README.md create mode 100644 vibe/ADR/_template.md create mode 100644 vibe/PRD/README.md create mode 100644 vibe/PRD/safe-prod-like-environment/README.md create mode 100644 vibe/PRD/safe-prod-like-environment/STATUS.md create mode 100644 vibe/PRD/safe-prod-like-environment/isolation-boundary.md create mode 100644 vibe/PRD/safe-prod-like-environment/qa-strategy.md create mode 100644 vibe/README.md create mode 100644 vibe/guidebooks/README.md create mode 100644 vibe/guidebooks/lab-ecosystem/01-factory.md create mode 100644 vibe/guidebooks/lab-ecosystem/02-tools.md create mode 100644 vibe/guidebooks/lab-ecosystem/03-cms.md create mode 100644 
vibe/guidebooks/lab-ecosystem/README.md create mode 100644 vibe/guidebooks/lab-ecosystem/naming-conventions.md create mode 100644 vibe/guidebooks/lab-ecosystem/secrets-and-vault.md create mode 100644 vibe/guidebooks/lab-ecosystem/storage-and-recovery.md create mode 100644 vibe/investigations/INV-001-prod-blast-radius-couplings.md create mode 100644 vibe/investigations/README.md create mode 100644 vibe/investigations/_template.md create mode 100644 vibe/runbooks/README.md create mode 100644 vibe/runbooks/_template.md create mode 100644 vibe/shareouts/2026-06-23-vibe-and-safe-env/README.md create mode 100644 vibe/shareouts/README.md diff --git a/AGENTS.md b/AGENTS.md new file mode 100644 index 0000000..1077e18 --- /dev/null +++ b/AGENTS.md @@ -0,0 +1,131 @@ +# Arcodange Lab — Agent Guide & Operating Rules + +The Arcodange lab is a self-hosted home/company platform running on three Raspberry Pi (pi1/pi2/pi3) behind a home Livebox, driven from a MacBook Pro M4 control node. **factory** is the cornerstone admin repo: it provisions the cluster (k3s), defines what gets deployed (ArgoCD app-of-apps), manages cloud/forge state (OpenTofu), and provisions databases (PostgreSQL). Two sibling repos carry workloads: **tools** (platform services — Vault, Prometheus, Grafana, CrowdSec, poolers) and **cms** (the public Nuxt site `arcodange.fr`). Everything is deployed into k3s namespaces via ArgoCD, every secret comes from Vault, and public traffic enters through a Cloudflared Zero-Trust tunnel into the internal Traefik. + +```mermaid +%%{init: {'theme':'base'}}%% +flowchart TB + subgraph control["Control node (MacBook Pro M4)"] + ansible["Ansible"]:::proc + tofu["OpenTofu"]:::proc + end + + factory["factory repo<br/>orchestrator: ansible + argocd + iac + postgres + doc"]:::src + tools["tools repo<br/>Vault, Prometheus, Grafana, CrowdSec, poolers"]:::src + cms["cms repo<br/>Nuxt site arcodange.fr"]:::src + + subgraph cluster["k3s cluster (pi1 server, pi2/pi3 agents)"] + argocd["ArgoCD app-of-apps"]:::proc + vault["Vault + VSO"]:::store + nsapps["namespaces: tools / cms / webapp / erp / ..."]:::proc + traefik["internal Traefik"]:::proc + end + + cflared["Cloudflared Zero-Trust tunnel"]:::proc + public["public: *.arcodange.fr"]:::store + + ansible -- "provision k3s + base" --> cluster + tofu -- "state in GCS" --> factory + factory -- "defines Applications" --> argocd + argocd -- "deploy charts" --> tools + argocd -- "deploy charts" --> cms + tools --> nsapps + cms --> nsapps + vault -- "inject secrets" --> nsapps + nsapps --> traefik + cflared -- "ingress" --> traefik + public --> cflared + + classDef src fill:#2563eb,stroke:#1e40af,color:#fff + classDef proc fill:#059669,stroke:#047857,color:#fff + classDef store fill:#7c3aed,stroke:#6d28d9,color:#fff +``` + +1. The **control node** (MacBook Pro M4) runs Ansible and OpenTofu — the two hands that build everything. +2. **Ansible** provisions the **k3s cluster** (pi1 server, pi2/pi3 agents) and its base layer. +3. **OpenTofu** manages forge/cloud state, persisted in **GCS** (`gs://arcodange-tf`), serving the **factory** repo's intent. +4. **factory** defines the **ArgoCD app-of-apps**, which is the single deployment authority. +5. **ArgoCD** deploys the Helm charts of **tools** and **cms** into per-app k3s **namespaces**. +6. **Vault** (with the Vault Secrets Operator) injects secrets into those namespace pods — no secrets live in git. +7. Workload pods route through the **internal Traefik**. +8. The **Cloudflared Zero-Trust tunnel** is the only public ingress, forwarding `*.arcodange.fr` traffic into internal Traefik. 
+ +--- + +## Repos at a glance + +| Repo | Purpose | Key dirs | How deployed | +|---|---|---|---| +| **factory** | Cornerstone admin repo: provisions the cluster, defines deployments, owns infra state and DBs, holds canonical docs | [ansible/](ansible/) · [argocd/](argocd/) · [iac/](iac/) · [postgres/](postgres/) · [doc/](doc/) | Ansible (cluster + base) + OpenTofu (forge/cloud/PG state); ArgoCD reads its app-of-apps | +| **tools** | Platform services in the `tools` namespace | [hashicorp-vault](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/hashicorp-vault), [prometheus](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/prometheus), [grafana](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/grafana), [crowdsec](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/crowdsec), [pgbouncer](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/pgbouncer) | Helm/Kustomize charts deployed via ArgoCD | +| **cms** | Public Nuxt static site `arcodange.fr` + its zone/email IaC | [chart](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/chart), [cloudflare](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/cloudflare), [zoho](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/zoho) | Helm chart via ArgoCD; Cloudflare/Zoho via OpenTofu | + +Self-hosted Gitea is at `gitea.arcodange.lab` (org `arcodange-org`). **Pick the forge tool from the remote**: these repos live on Gitea, so use the `mcp__gitea__*` MCP tools for PRs/issues/releases — `gh` will silently fail. + +## The `` join key + +One kebab-case identifier — `` — is reused **identically** across the Gitea repo, the PG database + `_role`, Vault (`postgres/creds/`, k8s auth role ``, policies `` / `-ops`, `gitea_cicd_`), the k8s namespace + ServiceAccount, the ArgoCD Application, the GCS state prefix `/main`, and DNS (`.arcodange.lab` / `.arcodange.fr`). 
Bricks wire together **by name convention, not explicit config**, so a single typo breaks the chain silently. Source of truth: [doc/runbooks/new-web-app/conventions.md](doc/runbooks/new-web-app/conventions.md); concept page (going forward): [vibe/guidebooks/lab-ecosystem/naming-conventions.md](vibe/guidebooks/lab-ecosystem/naming-conventions.md). + +## Where knowledge lives + +Start at the knowledge-base front door: [vibe/README.md](vibe/README.md). The six `vibe/` folders: + +- [vibe/ADR/](vibe/ADR/README.md) — architecture decision records (the *why*); canonical home going forward. +- [vibe/PRD/](vibe/PRD/README.md) — product/project requirement docs, each with a mandatory `STATUS.md`. +- [vibe/investigations/](vibe/investigations/README.md) — numbered investigations (`INV-NNN-slug`), with notebooks when data-heavy. +- [vibe/guidebooks/](vibe/guidebooks/README.md) — tree-docs that map the lab's components (the *how it fits together*). +- [vibe/runbooks/](vibe/runbooks/README.md) — step-by-step operational procedures with `[AGENT]` / `[HUMAN]` markers. +- [vibe/shareouts/](vibe/shareouts/README.md) — handouts and presentations (FRENCH; the one exception to the English rule). + +Historical infra docs still live under [doc/](doc/) (ADRs, the new-web-app runbook) — see also `CLUSTER_RECOVERY.md` (at the lab root, **outside** this repo) for tested power-cut recovery. + +## Operating rules for agents + +### No-tombstone rule (FOREMOST) +Write every file as **currently true**. NEVER leave "Correction (date): …", "previously X, now Y", changelogs, or "updated to …" notes — git history is the audit trail. The **only** allowed exception is a forward-looking `> [!CAUTION]` about a live operational risk. + +### Mermaid preferences +Begin each block with an init directive selecting `theme base` (or `forest`). Define a `classDef` palette legible on both light and dark backgrounds (dark fills + light text), e.g. `classDef src fill:#2563eb,stroke:#1e40af,color:#fff`. Use HTML `<br/>` for line breaks (never `\n`). Put a leading space before any label starting with a slash, and escape angle brackets inside labels. **Validate every diagram with the Mermaid MCP** before committing, and **immediately after each diagram add a numbered ordered list restating the same flow in words**. + +### Tree-docs (guidebooks & big PRDs) +Guidebooks and large PRDs are written as navigable trees: every file's **first line is a breadcrumb** (ancestors are relative links, current page is bold-unlinked, separator ` > `); **every folder has a `README.md` index hub** (a table of its children — link + one-line summary + status — sorted by importance/sequence, not alphabetically); **cross-references are bidirectional** (if A links B, B links A); use numbered file prefixes only for ordered narratives; **stamp `Last Updated: 2026-06-23` at each tree root**. + +### Optimized ADR format (MADR-lite) +Sections: **Context / Decision / Consequences / Alternatives / QA & validation / References**. Once an ADR is **Accepted, the body is immutable** — only the status field mutates (Proposed → Accepted → Superseded). The canonical home going forward is [vibe/ADR/](vibe/ADR/README.md); [doc/adr/](doc/adr/README.md) stays as the historical record. + +### Investigations +Prefer a **single `INV-NNN-slug.md`** when the finding fits in one file. When data-heavy, write a short stub `.md` beside a same-named folder containing notebooks (`.ipynb` + paired `.py`), a `_data/` dir, and a plain-language `notebook_simple.md` (visuals anyone can read). + +### PRD convention +**One subfolder per PRD.** A `STATUS.md` is **mandatory** and must be updated whenever something ships. Big PRDs use tree-docs and **must detail a QA strategy**. The flow is Problem → … → QA strategy → STATUS.md. + +### PR crosslinking (bidirectional) +**Every PR body references the ADR/PRD it advances.** In return, the ADR's **References** section and the PRD's **STATUS.md** link back to that PR. 
Links must be bidirectional — never one-way. + +### Guidebook maintenance +Altering a component that is documented in `guidebooks/` **requires updating that guidebook page in the same change**. A code/infra change that leaves its guidebook stale is incomplete. + +### Language policy +**English** for everything in `vibe/` and for `AGENTS.md`/`CLAUDE.md` (this tree is for LLM agents). The single exception: **shareouts handouts are FRENCH**. + +## The cohort + workflow + +These personas are spawned via the Agent tool or a Workflow; document and reuse them across sessions: + +| Persona | Role | +|---|---| +| **Lab Cartographer** | Explores the three repos and maps them into guidebooks (tree-docs). Read-mostly; never edits infra. | +| **ADR Scribe** | Writes optimized MADR-lite ADRs; enforces immutability (status-only mutation) and PR crosslinks. | +| **PRD Architect** | Writes PRDs (Problem → … → QA strategy → STATUS.md); uses tree-docs for big ones. | +| **Runbook Engineer** | Writes step-by-step runbooks with `[AGENT]` (read-only, safe) and `[HUMAN]` (prod-mutating, needs approval) markers. | +| **Investigator** | Writes investigations: a single `INV-NNN-slug.md` when possible, else a stub `.md` beside a same-named folder with `.ipynb` + paired `.py`, a `_data/` dir, and `notebook_simple.md`. | +| **Diagram Smith** | Authors and validates every mermaid diagram with the Mermaid MCP validator; enforces the mermaid preferences + the ordered-list-after-diagram rule. | +| **Continuity Warden** | The adversarial reviewer: checks the no-tombstone rule, breadcrumb depth, bidirectional links, dead links, naming, STATUS/PR crosslinks, and the Last Updated stamp. | + +**Recommended workflow for substantial `vibe/` contributions:** + +1. **Scaffold** — folders + README hubs + templates. +2. **Author** — personas in parallel, each on distinct files. +3. **Validate** — Diagram Smith runs the Mermaid MCP on every diagram. +4. 
**Review** — Continuity Warden runs the adversarial checklist. +5. **Assemble** — wire cross-refs, run the dead-link self-test, stamp `Last Updated`. diff --git a/vibe/ADR/0001-safe-prod-like-environment.md b/vibe/ADR/0001-safe-prod-like-environment.md new file mode 100644 index 0000000..1d587db --- /dev/null +++ b/vibe/ADR/0001-safe-prod-like-environment.md @@ -0,0 +1,84 @@ +[vibe](../README.md) > [ADR](README.md) > **0001 · Safe, production-like environment** + +# ADR-0001: Safe, production-like environment for the lab + +> **Status**: Accepted +> **Date**: 2026-06-23 +> **Deciders**: @arcodange + +## Context + +The Arcodange lab doubles as company production. The same three Raspberry Pis and one MacBook control node run the public CMS (`arcodange.fr`), Zoho-backed business email, the Dolibarr ERP holding accounting and business records, the url-shortener, the telegram-gateway, and the dance-lessons-coach. A single operator administers all of it from one laptop holding the kubeconfig, the Vault root token, and every cloud admin token. + +There is no separation between *where I experiment* and *where the business runs*. Every risky change is tested directly in production. The 2026-04-13 power-cut proved recovery is manual, multi-step, and only ever validated by a real incident — never rehearsed. + +The danger is concentrated in a handful of change-classes. Each one can cause silent, fleet-wide, or data-losing damage when applied to the live environment: + +| Change-class | Blast radius if wrong | +| --- | --- | +| Ansible playbook edits | Can wipe disks, reset k3s, or corrupt Longhorn across the fleet. | +| Vault policy / auth / mount changes | Lock out the Vault Secrets Operator → fleet-wide secret outage; a botched init could overwrite the single unseal key. | +| Postgres migrations / role changes | The superuser provider on `192.168.1.202` can drop or alter live databases → ERP data loss. 
| +| ArgoCD sync / app-of-apps changes | `prune` + `selfHeal` auto-prunes live resources fleet-wide. | +| Cloudflare / DNS / email changes | A wrong MX/SPF/DKIM silently breaks `arcodange.fr` mail for days. | +| Longhorn / storage ops | Volume recreation orphans replicas via new engine IDs. | +| Recovery drills | The runbook is only validated by real incidents, never rehearsed. | +| Cert / PKI re-init | Rotates the internal CA, invalidating every issued `*.arcodange.lab` cert. | + +A change-management process is not enough: the operator needs a place to *make the mistake first*, where the mistake cannot reach production. + +## Decision + +We will build a **local-only safe environment on the MacBook control node**, seeded from the *same* GitOps repos via a dedicated sandbox inventory, with two modes: + +- **(a) k3d single-node "fast inner loop"** (~60s bring-up) for app, Vault, and ArgoCD iteration. +- **(b) Three arm64 VMs** (multipass or Vagrant on the M4) reproducing the three-node topology — Postgres + Gitea as docker-compose *outside* k3s on the "pi2-equivalent" VM, Longhorn across the three VM disks — for Ansible, Longhorn, and recovery work. + +The load-bearing requirement is the **isolation boundary**: the sandbox must be *unable* to mutate real production even on a wrong command. Each production coupling maps to a concrete guardrail — a separate sandbox inventory with a prod-IP abort guard, sandbox GCS state prefixes, a separate sandbox Vault with its own unseal-key path, a sandbox Postgres host check, and plan-only DNS against a throwaway zone. The real `arcodange.fr` Cloudflare/Zoho tokens are never exported into a sandbox shell. Because the `` convention keys everything *within* a cluster/Vault/DB, the sandbox reuses identical `` names with no collision — the boundary is the cluster + Vault + state + DNS zone, not the names, so runbooks read identically in both environments. 
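The prod-IP abort guard named in the Decision could look roughly like the following. This is a minimal Python sketch, not the actual Ansible pre-task: it assumes the prod range `192.168.1.201-203` and the `i_mean_prod` override described in the PRD requirements, and the function names `is_prod_ip` and `assert_sandbox_only` are hypothetical.

```python
import ipaddress

# Assumption: prod nodes occupy 192.168.1.201-203, per the PRD requirements.
PROD_FIRST = ipaddress.ip_address("192.168.1.201")
PROD_LAST = ipaddress.ip_address("192.168.1.203")


def is_prod_ip(host: str) -> bool:
    """Return True when host is an IPv4 literal inside the prod range."""
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        return False  # hostnames are not resolved in this sketch
    return addr.version == 4 and PROD_FIRST <= addr <= PROD_LAST


def assert_sandbox_only(hosts, i_mean_prod=False):
    """Abort if a prod IP is targeted without the explicit override.

    Returns the sorted list of prod IPs found (empty for a clean sandbox run).
    """
    offenders = sorted(h for h in hosts if is_prod_ip(h))
    if offenders and not i_mean_prod:
        raise SystemExit(f"refusing to run: prod IPs in inventory: {offenders}")
    return offenders
```

The guard fails closed: a prod address anywhere in the host list stops the run unless the operator opts in explicitly. Hostnames that resolve to prod addresses are not caught here, so a real guard would also need to resolve names or check the inventory source.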
+ +See the [isolation-boundary leaf](../PRD/safe-prod-like-environment/isolation-boundary.md) for the full coupling→control mapping, and the [PRD](../PRD/safe-prod-like-environment/README.md) for the complete product view. + +## Consequences + +- **+** $0 inner loop that runs on the existing control node — no new hardware. +- **+** Rehearses every dangerous change-class except real public DNS/email. +- **+** The 2026-04-13 recovery sequence becomes a repeatable drill instead of a once-per-incident gamble. +- **+** Identical `` names mean runbooks are environment-agnostic. +- **−** x86/ARM nuance must be handled (use arm64 VMs/images on the M4). +- **−** New guardrail and parity-manifest maintenance burden. +- **−** Single-laptop resource limits — k3d for speed, VMs only when multi-node fidelity is actually needed. +- **→** Real public DNS/ACME and physical-ARM always-on testing remain unsolved by design; revisit only if recurring game-days demand them. + +## Alternatives considered + +| Option | Fidelity | Isolation | $ / effort | Verdict | +| --- | --- | --- | --- | --- | +| 1 · Ephemeral local cluster (k3d/kind) | Medium (single-node) | Full (separate cluster) | $0 / low | ✅ **Chosen** as the fast mode. | +| 2 · Three arm64 VMs reproducing the topology | High (3-node, PG+Gitea outside k3s, Longhorn) | Full (separate VMs) | $0 / medium | ✅ **Chosen** for fidelity. | +| 3 · Sandbox namespace on the real cluster | High | None — shared Vault/PG/Longhorn/ArgoCD | $0 / low | ❌ **Rejected**: shared blast radius fails the core isolation requirement. | +| 4 · Dedicated physical node (4th Pi / mini-PC) | High (real ARM, always-on) | Full | $$ hardware / medium | ⛔ **Out of scope**: hardware cost; revisit only for recurring always-on ARM game-days. | +| 5 · Disposable cloud k3s for real public DNS/ACME | High infra, but arch drift | Full | $ recurring / medium | ⛔ **Out of scope**: cost + ARM drift; its only unique value is real DNS/email, which we explicitly do not test. 
| + +## QA & validation + +- **Parity manifest** — k3s `v1.34.3+k3s1`, same Longhorn/Vault/VSO, same app-of-apps list, PG + Gitea outside k3s, three nodes. Any drift from this manifest is a failure. +- **Provisioning-parity test** — run the new-web-app runbook for a throwaway `` "canary" and assert the convention chain resolves end to end: Gitea repo → PG db + role → Vault creds + policies → ArgoCD Healthy/Synced → VSO injects → pod Running. +- **Idempotence gate** — `ansible-playbook --check --diff` reports `changed=0` on the converged sandbox *before* any change is promoted to prod. +- **`tofu plan` diff gate** — plan against sandbox state; for DNS, assert it touches *only* the throwaway zone. +- **Chaos drills** (mapped to CLUSTER_RECOVERY.md sections): node-kill; Vault-seal (unseal via the *sandbox* key; VSO re-auths); Longhorn volume corruption (run `recover/longhorn*.yml`, validate engine-ID re-association per the [Longhorn PVC recovery ADR](../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md)); DB drop/restore; full power-cut simulation (execute CLUSTER_RECOVERY.md top-to-bottom against the sandbox to green); ArgoCD bad-sync (observe prune/selfHeal, author a rollback runbook section); cert/PKI re-issue. +- **Monthly game-day** — the operator follows *only* the runbook; any improvised step becomes a runbook PR. This is how CLUSTER_RECOVERY.md gets validated against the sandbox instead of waiting for the next real incident. +- **Promotion gate** — no infra/Vault/storage/DNS change reaches prod until it has been applied to the sandbox *and* survived the matching drill. Each drill records time-to-recover and which step failed or was improvised. + +See the [QA strategy leaf](../PRD/safe-prod-like-environment/qa-strategy.md) for the detailed drill table and evidence trail. + +## References + +- [PRD · Safe, production-like environment](../PRD/safe-prod-like-environment/README.md) — full product view, requirements, and phased rollout. 
+- [INV-001 · Prod blast-radius couplings](../investigations/INV-001-prod-blast-radius-couplings.md) — the investigation that mapped every prod coupling. +- [Guidebook · Lab ecosystem](../guidebooks/lab-ecosystem/README.md) — the end-to-end map of the prod topology this sandbox mirrors. +- [Guidebook · Storage and recovery](../guidebooks/lab-ecosystem/storage-and-recovery.md) — how Longhorn and the recovery sequence work today. +- [doc/adr index](../../doc/adr/README.md) — foundational infrastructure ADRs (read-only history). +- [Longhorn PVC recovery ADR](../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md) — engine-ID re-association, exercised by the Longhorn chaos drill. +- [new-web-app conventions](../../doc/runbooks/new-web-app/conventions.md) — the `` convention reused identically across sandbox and prod. +- CLUSTER_RECOVERY.md — the tested power-cut recovery sequence, lives at the lab root (outside this repo); the chaos drills rehearse it section by section. +- PRs: to be backfilled on PR open. diff --git a/vibe/ADR/README.md b/vibe/ADR/README.md new file mode 100644 index 0000000..e2378ec --- /dev/null +++ b/vibe/ADR/README.md @@ -0,0 +1,45 @@ +[vibe](../README.md) > **ADR** + +# Architecture Decision Records + +> **Status**: 🟢 Active +> **Last Updated**: 2026-06-23 +> **Related**: [vibe/PRD](../PRD/README.md) · [vibe/Investigations](../investigations/README.md) +> **Historical**: [doc/adr](../../doc/adr/README.md) (foundational infra) · [ansible/.../docs/adr](../../ansible/arcodange/factory/docs/adr/) (dated infra ADRs) + +`vibe/ADR/` is the **canonical home for Architecture Decision Records going forward**. The format is MADR-lite: one short, self-contained Markdown file per decision, focused on the *why* rather than the *how*. Use the [`_template.md`](_template.md) skeleton to start a new one. + +## Where ADRs live + +There are three ADR locations in this repo. 
Only the first accepts new records; the other two are read-only history kept for context. + +| Location | Role | Accepts new ADRs? | +| --- | --- | --- | +| `vibe/ADR/` (this folder) | Canonical, MADR-lite, going forward | ✅ Yes | +| [`doc/adr/`](../../doc/adr/README.md) | Foundational infrastructure ADRs (DNS, k3s, CI/CD, Vault, telegram-gateway auth) | ❌ Historical | +| [`ansible/arcodange/factory/docs/adr/`](../../ansible/arcodange/factory/docs/adr/) | Dated infra ADRs (network, CI/CD, Longhorn PVC recovery, internal DNS) | ❌ Historical | + +When a new decision *supersedes* one of the historical records, write the new ADR here, set the old one's status note to `Superseded by ADR-NNNN`, and cross-link both ways. + +## Rules + +- **One file per decision**, named `NNNN-kebab-title.md` (zero-padded sequence, e.g. `0001-safe-prod-like-environment.md`). +- **The body is immutable once `Accepted`.** A decision is a historical fact: do not rewrite the Context/Decision/Consequences after acceptance. The *only* mutation allowed on an accepted ADR is its **status** (e.g. flipping `Accepted` → `Superseded`). +- **Statuses**: `Proposed` (under discussion) → `Accepted` (decided, body frozen) → `Superseded` (replaced; points to the successor ADR). A `Proposed` ADR may still be edited freely. +- **No-tombstone rule.** Each file reads as currently true. Never leave "previously X, now Y", changelog lines, or "updated to ..." notes inside an ADR — git history is the audit trail. A superseded ADR keeps its original frozen body; the supersession is recorded only in its status line and the successor's References. +- **PR cross-link both ways.** The ADR References section links the PR that introduced it; the PR description links back to the ADR. Keep links bidirectional. + +## Index + +| # | Title | Status | Date | +| --- | --- | --- | --- | +| [0001](0001-safe-prod-like-environment.md) | Safe, production-like environment | 🟢 Accepted | 2026-06-23 | + +## Rules to contribute + +1. 
Copy [`_template.md`](_template.md) to `NNNN-kebab-title.md` using the next free sequence number and delete the top HTML-comment note. +2. Fill in the blockquote (Status/Date/Deciders), then Context, Decision, Consequences, Alternatives considered, QA & validation, References. +3. Open the ADR with status `Proposed`. Flip it to `Accepted` once the decision is settled — and from that point treat the body as frozen. +4. Add a row to the Index table above (newest at the bottom to preserve chronological numbering). +5. In the PR that lands the ADR, link to the ADR file; in the ADR's References, link back to the PR. Bidirectional links are mandatory. +6. If this ADR supersedes a historical one in `doc/adr/` or the Ansible ADR folder, update the old record's status note and cross-reference both directions. diff --git a/vibe/ADR/_template.md b/vibe/ADR/_template.md new file mode 100644 index 0000000..20c56fd --- /dev/null +++ b/vibe/ADR/_template.md @@ -0,0 +1,41 @@ +[vibe](../README.md) > [ADR](README.md) > **_template** + + + +# ADR-NNNN: Title + +> **Status**: Proposed | Accepted | Superseded by ADR-NNNN +> **Date**: YYYY-MM-DD +> **Deciders**: name(s) + +## Context + +What forces are at play? Describe the problem, the constraints, and the situation that makes a decision necessary. State facts, not opinions. Keep it short enough that a future reader understands *why* a decision was needed without prior context. + +## Decision + +The decision, stated in the active voice: "We will ...". One clear choice. If the decision has sub-parts, use a short bulleted list. + +## Consequences + +What becomes easier or harder as a result of this decision? + +- **+** A positive outcome / something now enabled. +- **−** A trade-off / cost / new constraint accepted. +- **→** A future follow-up this implies (work deferred, a door left open, a re-evaluation trigger). + +## Alternatives considered + +| Option | Why not | +| --- | --- | +| Alternative A | Reason it was rejected. 
| +| Alternative B | Reason it was rejected. | + +## QA & validation + +How was (or will be) this decision validated? Tests, smoke checks, manual verification, rollback plan, or the criteria that would tell us the decision was wrong. + +## References + +- Link to the PR that introduces this ADR (and ensure the PR links back here). +- Related ADRs, PRDs, investigations, or external docs (descriptive link text, never "here"/"this"). diff --git a/vibe/PRD/README.md b/vibe/PRD/README.md new file mode 100644 index 0000000..7189eb5 --- /dev/null +++ b/vibe/PRD/README.md @@ -0,0 +1,34 @@ +[vibe](../README.md) > **PRD** + +# Product Requirement Documents + +> **Status**: 🟢 Active +> **Last Updated**: 2026-06-23 +> **Related**: [vibe/ADR](../ADR/README.md) · [vibe/Investigations](../investigations/README.md) + +`vibe/PRD/` holds the Product Requirement Documents that drive larger pieces of work in the lab. A PRD captures *what* we want and *why it matters*; the matching ADRs capture *how we decided to build it*, and investigations capture *what we learned* along the way. + +## Convention + +- **One subfolder per PRD**, kebab-case (e.g. `safe-prod-like-environment/`). +- Each subfolder **MUST** contain: + - `README.md` — the PRD hub: problem, goals/non-goals, requirements, success criteria, and a QA strategy. + - `STATUS.md` — the implementation tracker. **Update it whenever something ships** (a PR merges, a brick lands, a milestone closes). It is the living view of "where are we" against the PRD. +- A **big PRD uses tree-docs**: the `README.md` stays a hub and detail lives in leaf pages (each with its own breadcrumb and bidirectional cross-links). A tree-sized PRD **MUST** detail an explicit **QA strategy** — how the delivered work will be verified, and what "done and safe" means. +- **PRs cross-link to the PRD**, and the PRD's `STATUS.md` **cross-links back** to the PRs/ADRs/investigations that realised each part. Links are bidirectional. 
+- **No-tombstone rule** applies: the PRD reads as currently true. Progress lives in `STATUS.md` (which *is* a tracker and may legitimately list shipped items), not as "previously / now" edits scattered through the hub. + +## Index + +| PRD | Hub | Status | +| --- | --- | --- | +| Safe, production-like environment | [safe-prod-like-environment/README.md](safe-prod-like-environment/README.md) | 🟡 In design | + +## Rules to contribute + +1. Create a kebab-case subfolder named for the PRD. +2. Add `README.md` (the hub) and `STATUS.md` (the tracker). Both carry a breadcrumb first line and the leaf header blockquote (Status / Last Updated / Related). +3. In the hub, state the problem, goals and non-goals, requirements, success criteria, and the QA strategy. If the PRD is large, split detail into leaf pages and keep the README as a navigable hub. +4. Keep `STATUS.md` current: every time a piece ships, record it there and link the PR/ADR that delivered it. +5. Add a row to the Index table above. +6. Ensure every PR that implements part of the PRD links to the PRD, and that `STATUS.md` links back. Bidirectional links are mandatory. diff --git a/vibe/PRD/safe-prod-like-environment/README.md b/vibe/PRD/safe-prod-like-environment/README.md new file mode 100644 index 0000000..a821950 --- /dev/null +++ b/vibe/PRD/safe-prod-like-environment/README.md @@ -0,0 +1,69 @@ +[vibe](../../README.md) > [PRD](../README.md) > **Safe, production-like environment** + +# Safe, production-like environment + +> **Status:** In design +> **Last Updated:** 2026-06-23 +> **Design record:** [ADR 0001 — Safe, production-like environment](../../ADR/0001-safe-prod-like-environment.md) +> **Adjacent:** [INV-001 — prod blast-radius couplings](../../investigations/INV-001-prod-blast-radius-couplings.md) +> **Map:** [Lab ecosystem guidebook](../../guidebooks/lab-ecosystem/README.md) + +## Problem + +The lab doubles as company production. 
The same three Raspberry Pis that host experiments also run the public CMS ([arcodange.fr](https://gitea.arcodange.lab/arcodange-org/cms)), Zoho-backed email, the Dolibarr ERP (accounting and business records), the url-shortener, the telegram-gateway, and the dance-lessons-coach. A single operator administers all of it from one MacBook that holds the kubeconfig, the Vault root, and the cloud admin tokens. + +There is no separation between "where I experiment" and "where the business runs": every risky change is currently tested directly in prod. The 2026-04-13 power-cut proved that recovery is manual, non-trivial, and validated only by real incidents. A wrong Ansible play, Vault policy, Postgres migration, ArgoCD sync, or DNS record can silently break the business for days. + +## Users & personas + +A **single operator wearing two hats**: + +- **The inner-loop developer** — iterating on an app, a Helm chart, a Vault policy, or an ArgoCD app. Wants a fast feedback cycle (seconds, not a fleet round-trip) and zero fear of touching the wrong thing. +- **The game-day / recovery operator** — rehearsing dangerous change-classes and disaster recovery (node-kill, Vault-seal, Longhorn corruption, DB drop, full power-cut). Wants high fidelity to the real topology so a drill actually predicts prod behaviour, and a runbook that gets validated against the sandbox instead of waiting for the next outage. + +Both hats belong to the same person on the same laptop. The environment must serve the fast loop and the faithful loop without ever letting either reach into prod. + +## Goals & non-goals + +**Goals** + +- Rehearse **every dangerous change-class** (Ansible, Vault, Postgres, ArgoCD, Cloudflare/DNS, Longhorn, recovery drills, PKI) with **zero prod blast radius**. +- Validate `CLUSTER_RECOVERY.md` against the **sandbox**, not against prod — turn the recovery runbook from incident-validated into drill-validated. 
+- A **fast inner loop** for app / Vault / ArgoCD iteration, seeded from the same GitOps repos as prod. +- Run on the **existing control node** (the MacBook) at **$0** marginal cost. + +**Non-goals** + +- No real `arcodange.fr` DNS or email tests. DNS/email modules run **plan-only against a throwaway zone**; real public DNS/ACME end-to-end is out of scope. +- No **physical-node tier** (4th Pi / mini-PC) — explicitly rejected, not deferred. +- No **cloud tier** for disposable real-DNS clusters — explicitly rejected, not deferred. +- Not a **performance-benchmark** environment — fidelity is about topology and the convention chain, not throughput numbers on laptop-class hardware. + +## Requirements + +Functional: + +- **One-command bring-up**, seeded from the same GitOps repos as prod. +- **Sandbox inventory + guards** — a separate `inventory/sandbox/hosts.yml` plus a pre-task guard that aborts on any prod IP (`192.168.1.201-203`) unless `i_mean_prod=true`. See [Isolation boundary](isolation-boundary.md). +- **Parity manifest** — same k3s `v1.34.3+k3s1`, same Longhorn/Vault/VSO, same app-of-apps list, Postgres + Gitea outside k3s, three nodes. Drift = fail. +- **Two modes** seeded from the same repos via the sandbox inventory: + - **(a) k3d single-node fast inner loop** (~60s bring-up) for app / Vault / ArgoCD iteration. + - **(b) 3 arm64 VMs** (multipass or Vagrant on the M4) reproducing the 3-node topology — Postgres + Gitea as docker-compose outside k3s on the pi2-equivalent VM, Longhorn across the three VM disks — for Ansible / Longhorn / recovery work. + +The **isolation boundary** (the table mapping each prod coupling to its sandbox control) and the **`` naming note** are detailed in [isolation-boundary.md](isolation-boundary.md). 
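
The prod-IP guard named in the requirements can be illustrated outside Ansible. The sketch below only assumes how such a check might behave; it is written in Python rather than as the actual pre-task. `guard_targets`, `ProdTargetError`, and the example sandbox addresses are hypothetical names. The `192.168.1.201-203` range and the `i_mean_prod` override come from the requirement itself.

```python
import ipaddress

# Prod Pi addresses named in the requirement (192.168.1.201-203).
PROD_IPS = frozenset(
    ipaddress.ip_address(f"192.168.1.{last}") for last in range(201, 204)
)


class ProdTargetError(RuntimeError):
    """Raised when a run would touch a prod address without explicit consent."""


def guard_targets(target_ips, i_mean_prod=False):
    """Return the prod addresses found in target_ips.

    Aborts (raises) when any prod address is present and the caller has not
    set i_mean_prod explicitly. Hypothetical helper, not the real pre-task.
    """
    hits = sorted(
        ip for ip in map(ipaddress.ip_address, target_ips) if ip in PROD_IPS
    )
    if hits and not i_mean_prod:
        found = ", ".join(str(ip) for ip in hits)
        raise ProdTargetError(f"refusing to run against prod address(es): {found}")
    return [str(ip) for ip in hits]
```

Calling `guard_targets(["192.168.1.202"])` raises, while passing `i_mean_prod=True` returns the matched addresses so the caller can log the explicit override. A sandbox-only list (for example the made-up `10.0.2.15`) returns an empty list.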
+ +## QA strategy + +Fidelity gates (parity manifest, a `canary` provisioning-parity test that drives the new-web-app runbook end to end, an `ansible-playbook --check --diff` changed=0 idempotence gate, and a `tofu plan` diff gate) plus chaos drills mapped section-by-section to `CLUSTER_RECOVERY.md`, a recurring monthly game-day, and a promotion gate: no infra / Vault / storage / DNS change reaches prod until it has been applied to the sandbox **and** survived the matching drill. Full detail in [qa-strategy.md](qa-strategy.md). + +## Implementation status + +Rollout is phased (Phase 0 guardrails → Phase 1 k3d → Phase 2 3-VM → Phase 3 game-day; Phase 4 out of scope). Live tracker: [STATUS.md](STATUS.md). + +## Leaves + +| Page | Summary | Status | +| --- | --- | --- | +| [Isolation boundary](isolation-boundary.md) | Prod-coupling → sandbox-control table; the `` naming note; the token caution. | 🟡 In design | +| [QA strategy](qa-strategy.md) | Fidelity gates, chaos-drill table, game-day cadence, promotion gate, evidence trail. | 🟡 In design | +| [STATUS](STATUS.md) | Phase tracker (all not-started) and PR log. | ⬜ Not started | diff --git a/vibe/PRD/safe-prod-like-environment/STATUS.md b/vibe/PRD/safe-prod-like-environment/STATUS.md new file mode 100644 index 0000000..d345be6 --- /dev/null +++ b/vibe/PRD/safe-prod-like-environment/STATUS.md @@ -0,0 +1,52 @@ +[vibe](../../README.md) > [PRD](../README.md) > [Safe, production-like environment](README.md) > **STATUS** + +# STATUS — Safe, production-like environment + +> **Last Updated:** 2026-06-23 + +Legend: ⬜ not started · 🟡 in progress · ✅ done + +> [!IMPORTANT] +> This file MUST be updated whenever something ships. Every PR that advances a phase crosslinks back here (and the matching checkbox flips), and the [PRs](#prs) table gets a row. 
+ +## Phase 0 — Isolation guardrails + +*Must land before any sandbox run.* + +- [ ] ⬜ Sandbox inventory `inventory/sandbox/hosts.yml` (VM/cloud hosts only) +- [ ] ⬜ Prod-IP abort guard (aborts on `192.168.1.201-203` unless `i_mean_prod=true`) +- [ ] ⬜ Sandbox GCS state prefixes (`sandbox/...`) or `gs://arcodange-tf-sandbox` +- [ ] ⬜ Sandbox Vault unseal-key path (`~/.arcodange/sandbox/cluster-keys.json`) +- [ ] ⬜ Sandbox env profile / plan-only DNS against a throwaway zone + +## Phase 1 — Tier-1 k3d fast mode + +- [ ] ⬜ One-command bring-up seeded from GitOps +- [ ] ⬜ Parity manifest v1 +- [ ] ⬜ Canary provisioning-parity test +- [ ] ⬜ `changed=0` idempotence gate documented + +## Phase 2 — Tier-1 3-VM cluster + +- [ ] ⬜ Three arm64 VMs (multipass / Vagrant on the M4) +- [ ] ⬜ Same `system_k3s`; Postgres + Gitea outside k3s on the pi2-equivalent VM +- [ ] ⬜ Longhorn across the three VM disks +- [ ] ⬜ Chaos drills: node-kill / Vault-seal / DB-drop +- [ ] ⬜ First full `CLUSTER_RECOVERY` dry-run against the sandbox + +## Phase 3 — Game-day operationalization + +- [ ] ⬜ Monthly cadence + promotion gate in the PR checklist +- [ ] ⬜ Longhorn engine-ID drill +- [ ] ⬜ ArgoCD bad-sync rollback runbook +- [ ] ⬜ Evidence trail for ≥1 cycle + +## Phase 4 — out of scope + +Not planned: dedicated physical node (4th Pi / mini-PC) and disposable cloud k3s for real public DNS/ACME. See [ADR 0001](../../ADR/0001-safe-prod-like-environment.md) for the rejected-alternatives rationale. 
+ +## PRs + +| PR | Scope | Phase | Merged | +| --- | --- | --- | --- | +| _pending_ | Bootstrap the PRD tree (this `vibe/` set) — backfilled on open | — | ⬜ | diff --git a/vibe/PRD/safe-prod-like-environment/isolation-boundary.md b/vibe/PRD/safe-prod-like-environment/isolation-boundary.md new file mode 100644 index 0000000..c56b67a --- /dev/null +++ b/vibe/PRD/safe-prod-like-environment/isolation-boundary.md @@ -0,0 +1,31 @@ +[vibe](../../README.md) > [PRD](../README.md) > [Safe, production-like environment](README.md) > **Isolation boundary** + +# Isolation boundary + +> **Status:** In design +> **Last Updated:** 2026-06-23 +> **Upstream:** [Safe, production-like environment](README.md) +> **Related:** [ADR 0001](../../ADR/0001-safe-prod-like-environment.md) · [INV-001 — prod blast-radius couplings](../../investigations/INV-001-prod-blast-radius-couplings.md) + +The isolation boundary is the load-bearing part of this PRD: the sandbox must be **unable to mutate real prod even on a wrong command**. Every prod coupling that a sandbox run could touch is mapped below to a concrete control. The boundary is the **cluster + Vault + state + DNS zone** — not the names (see the naming note). + +## Prod couplings → sandbox controls + +| Prod coupling | What it can break in prod | Sandbox control | +| --- | --- | --- | +| Ansible inventory `hosts.yml` → `192.168.1.201-203` | Wipe disks, reset k3s, corrupt Longhorn on the live Pis. | Separate `inventory/sandbox/hosts.yml` (VM/cloud hosts only) **plus** a pre-task guard that **aborts** if any target IP is in `192.168.1.201-203` unless `i_mean_prod=true` is set explicitly. | +| OpenTofu state in `gs://arcodange-tf` (prefixes) | A sandbox apply rewrites live state and re-plans prod resources. | A sandbox prefix family (`sandbox/factory/main`, `sandbox/tools/...`, `sandbox/factory/postgres`) via a backend-config override, **or** a separate bucket `gs://arcodange-tf-sandbox`. Sandbox runs never touch prod state. 
| +| Gitea provider `base_url` `gitea.arcodange.lab` + ArgoCD `repoURL` / `targetRevision` | Sandbox commits/pushes into the prod forge; ArgoCD syncs sandbox refs onto the prod cluster. | Sandbox Gitea on the sandbox cluster (or org `arcodange-sandbox`); the sandbox app-of-apps points at a **sandbox branch** so the sandbox cluster syncs only sandbox refs. | +| Vault provider `address` `vault.arcodange.lab` + unseal key `~/.arcodange/cluster-keys.json` | Sandbox writes clobber prod policies/auth/mounts; a botched init overwrites the prod unseal key. | A **separate sandbox Vault**; override the unseal-key path to `~/.arcodange/sandbox/cluster-keys.json` so prod's key can never be overwritten. | +| PostgreSQL provider `host` `192.168.1.202` (superuser) | Drop or alter live DBs — including ERP business records. | Sandbox PG is the docker-compose on the sandbox pi2-equivalent; a guard **refuses apply** if `host == 192.168.1.202` and `workspace != prod`. | +| Cloudflare account / OVH `arcodange.fr` / Zoho live mail | A wrong MX/SPF/DKIM silently breaks `arcodange.fr` mail for days. | DNS/email modules run **plan-only** against a throwaway zone/subdomain with a separate token. The real `arcodange.fr` token is **never** exported into a sandbox shell. Real public DNS/ACME is out of scope. | +| Longhorn backup bucket | A restore drill overwrites prod backups. | Sandbox backup target is a **separate bucket/prefix** so restore drills cannot overwrite prod backups. | + +## The `` naming note + +The `` key threads one kebab-case identifier through the Gitea repo, the PG db + role, the Vault paths/policies, the k8s namespace + SA, the ArgoCD Application, the GCS state prefix, and DNS — see [conventions](../../../doc/runbooks/new-web-app/conventions.md). + +Because `` keys everything **within** a cluster / Vault / DB / zone, the sandbox can reuse **identical `` names with no collision**. The isolation boundary is the cluster + Vault + state + DNS zone, not the names. 
This is deliberate: runbooks read **identically** in both environments, so a drill exercises the exact same convention chain an operator runs in prod. + +> [!CAUTION] +> The real `arcodange.fr` Cloudflare token must **never** be exported into a sandbox shell. DNS/email work in the sandbox is plan-only against a throwaway zone with its own separate token. Exporting the prod token into a sandbox session would defeat the entire isolation boundary — a single `tofu apply` could rewrite live public DNS or mail records. diff --git a/vibe/PRD/safe-prod-like-environment/qa-strategy.md b/vibe/PRD/safe-prod-like-environment/qa-strategy.md new file mode 100644 index 0000000..db5c444 --- /dev/null +++ b/vibe/PRD/safe-prod-like-environment/qa-strategy.md @@ -0,0 +1,49 @@ +[vibe](../../README.md) > [PRD](../README.md) > [Safe, production-like environment](README.md) > **QA strategy** + +# QA strategy + +> **Status:** In design +> **Last Updated:** 2026-06-23 +> **Upstream:** [Safe, production-like environment](README.md) +> **Related:** [ADR 0001](../../ADR/0001-safe-prod-like-environment.md) · [Isolation boundary](isolation-boundary.md) · [INV-001](../../investigations/INV-001-prod-blast-radius-couplings.md) + +The sandbox is only useful if a green run there reliably predicts a green run in prod. That requires two things: the sandbox must stay **faithful** to prod (fidelity gates), and dangerous change-classes must be **rehearsed** before they ship (chaos drills + promotion gate). + +## Fidelity gates + +- **Parity manifest** — the sandbox must match prod on: k3s `v1.34.3+k3s1`, the same Longhorn / Vault / VSO versions, the same app-of-apps list, Postgres + Gitea running **outside** k3s, and three nodes. Any drift from this manifest is a failed gate. 
+- **Provisioning-parity canary test** — run the [new-web-app runbook](../../../doc/runbooks/new-web-app/conventions.md) for a throwaway `` named `canary` and assert the convention chain resolves end to end: Gitea repo → PG db + role → Vault creds + policies → ArgoCD `Healthy`/`Synced` → VSO injects → pod `Running`. One typo anywhere in the chain fails this test. +- **Idempotence gate (changed=0)** — `ansible-playbook --check --diff` must report `changed=0` on the converged sandbox **before** the same change is promoted to prod. A non-zero diff means the play is not idempotent and is not ready. +- **`tofu plan` diff gate** — `tofu plan` runs against **sandbox** state; for DNS it must assert the plan touches **only** the throwaway zone. A plan that proposes to touch anything else fails the gate. + +## Chaos drills + +Each drill maps to a section of `CLUSTER_RECOVERY.md`. The drill is "passed" only when the acceptance condition is met by following the runbook — any improvised step becomes a runbook PR. + +| Drill | Action | Acceptance | Recovery section | +| --- | --- | --- | --- | +| Node-kill | Stop one sandbox VM (agent, then server). | Workloads reschedule; cluster returns to `Ready`/`Healthy`. | Node-loss / reschedule section. | +| Vault-seal | Seal the **sandbox** Vault. | Unseal via the **sandbox** key (`~/.arcodange/sandbox/cluster-keys.json`); VSO re-authenticates and resumes secret injection. | Vault unseal + VSO re-auth section. | +| Longhorn volume corruption | Corrupt/recreate a sandbox volume. | Run `recover/longhorn*.yml`; validate engine-ID re-association per the [Longhorn PVC recovery ADR](../../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md). | Longhorn restore section. | +| DB drop/restore | Drop a sandbox DB. | Restore from the sandbox backup bucket; app reconnects, data intact. | Postgres restore section. | +| Full power-cut simulation | Cold-stop all three sandbox VMs. 
| Execute `CLUSTER_RECOVERY.md` top-to-bottom against the sandbox to green (Longhorn restore → Vault unseal → VSO re-auth → ERP scaled up last). | Whole runbook. | +| ArgoCD bad-sync | Push a deliberately broken sandbox ref. | Observe `prune`/`selfHeal`; author/validate a rollback runbook section. | ArgoCD rollback (to be authored). | +| Cert/PKI re-issue | Re-init the sandbox Step-CA. | Internal `*.arcodange.lab` (sandbox) certs re-issue and chains validate. | PKI re-issue section. | + +1. **Node-kill** stops a sandbox VM and confirms workloads reschedule and the cluster returns to a healthy state. +2. **Vault-seal** seals the sandbox Vault, then unseals it with the **sandbox** key and confirms VSO re-authenticates and resumes injecting secrets. +3. **Longhorn corruption** corrupts a sandbox volume, runs the `recover/longhorn*.yml` playbooks, and validates engine-ID re-association against the Longhorn PVC recovery ADR. +4. **DB drop/restore** drops a sandbox database and restores it from the sandbox backup bucket, confirming the app reconnects with data intact. +5. **Full power-cut** cold-stops all three sandbox VMs and runs `CLUSTER_RECOVERY.md` end to end to green, with the ERP scaled up last. +6. **ArgoCD bad-sync** pushes a broken sandbox ref, observes `prune`/`selfHeal` behaviour, and produces a rollback runbook section. +7. **Cert/PKI re-issue** re-initialises the sandbox Step-CA and confirms internal certificates re-issue and validate. + +## Recurring game-day & promotion gate + +A **recurring monthly game-day** where the operator follows **only** the runbook. Any improvised step becomes a runbook PR — this is how `CLUSTER_RECOVERY.md` gets validated against the sandbox instead of waiting for the next real incident. + +**Promotion gate:** no infra / Vault / storage / DNS change reaches prod until it has been applied to the sandbox **and** survived the matching drill. This gate belongs in the PR checklist and crosslinks to [STATUS.md](STATUS.md). 
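
The promotion gate above reduces to a two-condition check per change. The following is a minimal sketch under stated assumptions, not the lab's actual checklist tooling: `ChangeRecord`, `promotion_blockers`, and the drill identifiers passed in are hypothetical, and the operator is assumed to name the matching drill for each change.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeRecord:
    """One proposed infra / Vault / storage / DNS change (hypothetical shape)."""

    change_class: str
    applied_to_sandbox: bool
    matching_drill: str  # drill the operator considers the match for this change


def promotion_blockers(change, passed_drills):
    """Return the reasons a change may not reach prod; an empty list means go."""
    blockers = []
    if not change.applied_to_sandbox:
        blockers.append(f"{change.change_class}: not yet applied to the sandbox")
    if change.matching_drill not in passed_drills:
        blockers.append(
            f"{change.change_class}: drill '{change.matching_drill}' has not passed"
        )
    return blockers
```

Returning the list of blockers, rather than a bare boolean, lets a PR checklist print exactly which half of the gate is still open.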
+ +## Evidence trail + +Each drill records **time-to-recover** and **which step failed or was improvised**. The evidence trail accumulates over at least one game-day cycle and is the proof that a change-class is rehearsed and that the recovery runbook is current. Failed/improvised steps feed directly back into runbook PRs. diff --git a/vibe/README.md b/vibe/README.md new file mode 100644 index 0000000..e33365c --- /dev/null +++ b/vibe/README.md @@ -0,0 +1,50 @@ +# vibe/ — Arcodange Knowledge Base + +You-are-here: the **root** of the `vibe/` knowledge tree — the front door for every doc agents write and read. +Up: [factory](../README.md) / [AGENTS.md](../AGENTS.md) + +> **Status:** Active +> **Last Updated:** 2026-06-23 + +## What is `vibe/`? + +`vibe/` is the knowledge base dedicated to **LLM agents** working on the Arcodange lab. It collects the *why* (ADRs), the *what/when* (PRDs), the *what-we-found* (investigations), the *how-it-fits-together* (guidebooks), the *how-to-do-it* (runbooks), and the *what-we-told-humans* (shareouts). Everything here is written in **English** — the single exception is **shareouts handouts, which are FRENCH**. Operating rules (no-tombstone, mermaid prefs, tree-docs, ADR/PRD/investigation conventions, PR crosslinking, language policy) are defined authoritatively in [AGENTS.md](../AGENTS.md); this page summarizes them and points there. + +## Folder map + +| Folder | When to use it | Status | +|---|---|---| +| [ADR](ADR/README.md) | Recording an architecture **decision** (MADR-lite; body immutable once Accepted). Canonical home going forward. | ⬜ | +| [PRD](PRD/README.md) | Specifying a **product/project**: Problem → … → QA strategy → `STATUS.md` (mandatory, kept current). | ⬜ | +| [investigations](investigations/README.md) | Capturing a **finding/analysis** — single `INV-NNN-slug.md`, or stub + notebooks when data-heavy. 
| ⬜ | +| [guidebooks](guidebooks/README.md) | Mapping a **component or the ecosystem** as navigable tree-docs (the lab cartography). | ⬜ | +| [runbooks](runbooks/README.md) | Documenting an **operational procedure** step-by-step with `[AGENT]` / `[HUMAN]` markers. | ⬜ | +| [shareouts](shareouts/README.md) | Producing **handouts/presentations** for humans (FRENCH). | ⬜ | + +Status legend: ✅ done · 🟡 beta · 🔴 critical · ⚠️ known issue · ❌ disabled · ⬜ not started. + +## Conventions at a glance + +- **No-tombstone rule (foremost)** — write each file as currently true; never leave "previously X, now Y", changelogs, or "updated to …" notes. Git history is the audit trail. Only exception: a forward-looking `> [!CAUTION]` about a live risk. +- **Breadcrumb spine** — every non-root file starts with a breadcrumb: ancestors as relative links, current page bold-unlinked, separator ` > `. This root has no breadcrumb (it uses the you-are-here + up-link above instead). +- **README hub per folder** — each folder's `README.md` is an index table of its children (link + one-line summary + status), sorted by importance/sequence. +- **Bidirectional links** — if A references B as related, B references A. Use descriptive link text (never "here"/"this"). +- **Mermaid prefs** — `theme base`/`forest` init directive; legible `classDef` palette (dark fills + light text); `<br/>
` not `\n`; leading space before slash-labels; validate with the Mermaid MCP; **a numbered ordered list restating the flow after every diagram**. +- **GitHub alert legend** — `[!NOTE]` info/forward-looking · `[!TIP]` aside · `[!IMPORTANT]` inherent constraint · `[!WARNING]` degraded-but-working · `[!CAUTION]` data-loss/breaking. +- **Status emoji legend** — ✅ done · 🟡 beta · 🔴 critical · ⚠️ known issue · ❌ disabled · ⬜ not started. +- **Language policy** — English throughout `vibe/`; FRENCH only for shareouts handouts. + +Authority for all of the above: [AGENTS.md](../AGENTS.md). + +## Maintenance policy + +- **Adding a page** → also add its row to the parent folder's `README.md` index table. +- **Keep links bidirectional** → when you link A→B, add B→A. +- **Stamp `Last Updated:`** at each tree root (this file and every guidebook/big-PRD root) after any structural change. +- **Never tombstone** → edit content in place; let git carry the history. +- **Guidebook coupling** → changing a documented component means updating its guidebook page in the same change. +- **PR crosslinks** → every PR references the ADR/PRD it advances; that ADR's References and the PRD's `STATUS.md` link back. + +## Cohort + workflow (recap) + +Docs here are produced by a cohort of persona subagents — Lab Cartographer, ADR Scribe, PRD Architect, Runbook Engineer, Investigator, Diagram Smith, Continuity Warden — spawned via the Agent tool or a Workflow. The recommended pipeline for substantial contributions is **Scaffold → Author → Validate → Review → Assemble**. Full descriptions and responsibilities live in [AGENTS.md](../AGENTS.md). 
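
As one example of what the Validate step could check mechanically, the breadcrumb-spine convention from "Conventions at a glance" lends itself to a small test. This is a sketch under assumptions, not an existing tool in the repo: `is_breadcrumb` and both regular expressions are hypothetical, and it assumes page titles never contain the ` > ` separator themselves.

```python
import re

# An ancestor: a Markdown link whose target is relative (no leading "/" and no URL scheme).
ANCESTOR = re.compile(r"\[[^\]]+\]\((?!/|[a-z]+://)[^)]+\)")
# The current page: bold text that is not wrapped in a link.
CURRENT = re.compile(r"\*\*[^*]+\*\*")


def is_breadcrumb(first_line):
    """Check a non-root page's first line against the breadcrumb convention."""
    *ancestors, current = first_line.strip().split(" > ")
    return (
        bool(ancestors)  # a non-root page has at least one ancestor
        and all(ANCESTOR.fullmatch(part) for part in ancestors)
        and CURRENT.fullmatch(current) is not None
    )
```

For instance, `is_breadcrumb("[vibe](../README.md) > **PRD**")` holds, while a first line that is a heading, or one whose last item is still a link, does not.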
diff --git a/vibe/guidebooks/README.md b/vibe/guidebooks/README.md new file mode 100644 index 0000000..1c11363 --- /dev/null +++ b/vibe/guidebooks/README.md @@ -0,0 +1,47 @@ +[vibe](../README.md) > **Guidebooks** + +# Guidebooks + +> **Status:** Active +> **Last Updated:** 2026-06-23 +> **Related:** [vibe runbooks](../runbooks/README.md) · [vibe shareouts](../shareouts/README.md) · canonical docs under [doc/](../../doc/README.md) + +## What a guidebook is + +A **guidebook** is a *tree-doc reference map* of the lab: a navigable set of linked Markdown pages (a root index, per-folder README hubs, and leaf pages wired with breadcrumbs and bidirectional cross-references) whose job is to **describe how the system is actually wired right now** — components, the conventions that join them, and the data/control flows between them. + +Guidebooks are descriptive maps, not procedures. They answer *"how does this fit together?"* For *"how do I execute X step by step?"* see the [runbooks](../runbooks/README.md). For *"why was it built this way?"* see the architecture decision records under [doc/adr](../../doc/adr/README.md). + +```mermaid +%%{init: {'theme': 'base'}}%% +flowchart LR + classDef src fill:#2563eb,stroke:#1e40af,color:#fff + classDef proc fill:#059669,stroke:#047857,color:#fff + classDef store fill:#7c3aed,stroke:#6d28d9,color:#fff + SYS["Lab system
<br/>(factory + tools + cms)"]:::src --> GB["Guidebook<br/>(tree-doc reference map)"]:::proc --> READER["Reader<br/>(human or agent)<br/>
understands the wiring"]:::store +``` + +1. The lab system spans three repos — `factory`, `tools`, and `cms` — joined by the `` naming convention. +2. A guidebook surveys that system and renders it as a tree-doc reference map: indexed folders, breadcrumb-linked leaves, Mermaid flow diagrams. +3. A reader (a human onboarding, or an agent planning a change) consumes the guidebook to understand how the pieces wire together before touching anything. + +## Key maintenance rule + +> [!IMPORTANT] +> **If a component documented in a guidebook is altered, the guidebook page describing it MUST be updated in the same change.** A reference map that drifts from reality is worse than no map — it sends readers (and agents) confidently down dead paths. Treat the guidebook edit as part of the diff, not a follow-up: the PR that changes the component is the PR that updates its guidebook page. + +## Index + +| Guidebook | What it maps | Status | +|---|---|---| +| [Lab ecosystem](lab-ecosystem/README.md) | End-to-end map of `factory` + `tools` + `cms`: repos, the `` join key, secrets via Vault, CI/CD, ArgoCD, and the data/control flows that connect them | ✅ Active | + +## Rules to contribute + +1. **Use the `tree-docs` skill.** Guidebooks are tree-docs: author and grow them with the skill so breadcrumbs, hubs, and cross-links stay consistent. +2. **Breadcrumb spine on every file.** The first line of each page is its breadcrumb trail: ancestors are relative links, the current page is the bold-unlinked last item, separator is ` > ` (space-gt-space). +3. **README hub per subfolder.** Every folder carries a `README.md` index hub: a table of its children (link + one-line summary + status), sorted by importance/sequence, never alphabetically. +4. **Bidirectional links.** When page A references page B as related, page B references A back. Use descriptive link text — never "here" or "this". +5. 
**Mermaid preferences.** Begin each diagram with a `%%{init: {'theme': 'base'}}%%` directive, define a `classDef` palette legible on both light and dark backgrounds (dark fills, light text), use HTML `<br/>
` for line breaks, and follow every diagram immediately with a numbered ordered list restating the same flow in words. +6. **Status legend.** ✅ done · 🟡 beta · 🔴 critical · ⚠️ known issue · ❌ disabled · ⬜ not started. +7. **Honour the maintenance rule above** — update the relevant guidebook page in the same change that alters the component it documents. diff --git a/vibe/guidebooks/lab-ecosystem/01-factory.md b/vibe/guidebooks/lab-ecosystem/01-factory.md new file mode 100644 index 0000000..79e54cd --- /dev/null +++ b/vibe/guidebooks/lab-ecosystem/01-factory.md @@ -0,0 +1,122 @@ +[vibe](../../README.md) > [Guidebooks](../README.md) > [Lab ecosystem](README.md) > **01 · factory** + +# 01 · factory + +> **Status:** ✅ Active +> **Last Updated:** 2026-06-23 +> **Downstream:** [02 · tools](02-tools.md) · [03 · cms](03-cms.md) +> **Related:** [naming-conventions.md](naming-conventions.md) · [secrets-and-vault.md](secrets-and-vault.md) · [storage-and-recovery.md](storage-and-recovery.md) + +`factory` is the **cornerstone admin repo**: it provisions the hosts and the cluster, declares what gets deployed, and owns the platform-level cloud/Gitea/Vault/Postgres state that every app leans on. It has four pillars — **Ansible** (imperative host & cluster setup), **ArgoCD** (declarative app-of-apps), **`iac/`** (OpenTofu for the cloud/Gitea/Vault edge), and **`postgres/iac/`** (per-app PostgreSQL provisioning). The repos `tools` and `cms` are deployed *by* factory's ArgoCD and are mapped in [02 · tools](02-tools.md) and [03 · cms](03-cms.md). + +## Pillar 1 — Ansible ([`ansible/`](../../../ansible/)) + +The collection lives at `ansible/arcodange/factory/`. The inventory groups the three Pis and pins the service placement; numbered playbooks run an ordered narrative from bare OS to backups; `recover/` holds the disaster-recovery playbooks. 
+ +### Inventory (`inventory/hosts.yml`) + +| Group | Hosts | Purpose | +|---|---|---| +| `raspberries` | `pi1`, `pi2`, `pi3` (`192.168.1.201-203`) | All three Pis; `ansible_user: pi` | +| `postgres` | `pi2` | The PostgreSQL host (docker-compose, outside k3s) | +| `gitea` | children of `postgres` (→ `pi2`) | Gitea co-located with PG on `pi2` | +| `pihole` | `pi1`, `pi3` | Internal DNS resolvers | +| `step_ca` | `pi1`, `pi2`, `pi3` | Step-CA PKI for `*.arcodange.lab` (primary `pi1`, replicas `pi2`/`pi3`) | +| `local` | `localhost` + the Pis | Control-node-local tasks | + +### Numbered playbooks (`playbooks/`) + +| Playbook | Imports / does | Notes | +|---|---|---| +| `01_system` | `system/system.yml` → rpi base, DNS, SSL, prepare disks, Docker, iSCSI, **k3s install** (`--docker --disable traefik`), CoreDNS, cert-issuer, Longhorn/Traefik config | k3s `v1.34.3+k3s1` via upstream `k3s-ansible`; pi1 server, pi2/pi3 agents | +| `02_setup` | `setup/setup.yml` → PostgreSQL + Gitea docker-compose; optional backup-NFS share | Stands up the two out-of-cluster source-of-truth services on `pi2` | +| `03_cicd` | Gitea **act-runner** docker-compose on `pi1`/`pi3` (`raspberries:&local:!gitea`), plus the ArgoCD/Image-Updater install | See the ArgoCD caveat below | +| `04_tools` | `tools/tools.yml` → `hashicorp_vault.yml`, `crowdsec.yml` | Platform tooling that bootstraps the cluster's Vault + CrowdSec | +| `05_backup` | `backup/backup.yml` → `postgres.yml`, `gitea.yml`, `k3s_pvc.yml` to `/mnt/backups` | Scheduled PG/Gitea/PVC backups; cron-report wiring present | + +### Recovery playbooks (`playbooks/recover/`) + +| Playbook | When to use | +|---|---| +| `longhorn.yml` | Recover Longhorn after a power cut when **Volume CRDs still exist** (CSI driver registration loss) | +| `longhorn_data.yml` | Recover app data from **raw replica `.img` files** when Volume CRDs are gone (block-device level) | + +The tested power-cut recovery sequence (Longhorn restore → Vault unseal → VSO re-auth → 
ERP scaled up last) is documented in `CLUSTER_RECOVERY.md` at the lab root (outside this repo) and summarized in [storage-and-recovery.md](storage-and-recovery.md). Background on PVC recovery is in the [Longhorn PVC recovery ADR](../../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md). + +### Key roles + +`deploy_docker_compose` (renders compose stacks), `gitea_repo` / `gitea_token` / `gitea_secret` / `gitea_sync` (Gitea repo/token/secret/mirror management), `traefik_certs`, `playwright`, plus sub-roles `step_ca`, `hashicorp_vault`, `crowdsec`, `pihole`. + +## Pillar 2 — ArgoCD app-of-apps ([`argocd/`](../../../argocd/)) + +A Helm chart whose `templates/apps.yaml` loops over `values.gitea_applications` and emits one `Application` CRD per app. Each Application derives everything from the app name: `repoURL = https://gitea.arcodange.lab//`, `path = chart`, `namespace = ` (`CreateNamespace=true`), with `syncPolicy.automated` `prune: true` + `selfHeal: true` by default. + +| App | Org override | Image Updater | +|---|---|---| +| `url-shortener` | — | — | +| `tools` | — | explicit `prune`+`selfHeal` | +| `webapp` | — | ✅ digest strategy | +| `telegram-gateway` | `arcodange` | ✅ digest strategy | +| `erp` | — | — | +| `cms` | — | ✅ digest strategy | +| `dance-lessons-coach` | `arcodange` | ✅ digest strategy | + +> [!NOTE] +> The chart also templates a `longhorn_backup_target` and the ArgoCD Image Updater config (`argocd.arcodange.lab`). **ArgoCD itself is not currently deployed in-cluster** — its install is commented out in `03_cicd`. This page documents the intended steady state; treat ArgoCD as "designed, not live" until that step is enabled. + +## Pillar 3 — OpenTofu ([`iac/`](../../../iac/)) + +Manages the cloud/Gitea/Vault edge. State lives in **GCS** (`backend "gcs"`, bucket `arcodange-tf`, prefix `factory/main`). Tofu authenticates to Vault via **Gitea OIDC JWT** (mount `gitea_jwt`, role `gitea_cicd`). 
+ +| Provider | Used for | +|---|---| +| `go-gitea/gitea` (`0.6.0`) | Repos, users, action secrets (e.g. the restricted `tofu_module_reader` CI user, CMS secrets) | +| `vault` (`4.4.0`) | KV secrets + policies + k8s auth roles (e.g. Longhorn GCS-backup creds & policy) | +| `google` (`7.0.1`) | GCS backup bucket + service account + HMAC key for Longhorn | +| `cloudflare/cloudflare` (`~> 5`) | R2 bucket, API tokens, CMS edge wiring (detailed in [03 · cms](03-cms.md)) | +| `ovh/ovh` (`2.8.0`) | OAuth2 client + IAM policy for the `arcodange.fr` domain (registrar = OVH) | + +`modules/cloudflare_token` is a reusable scoped-token factory. The whole module reuses the `` name as the GCS state prefix (`/main`) — see [naming-conventions.md](naming-conventions.md). + +## Pillar 4 — per-app PostgreSQL ([`postgres/iac/`](../../../postgres/)) + +OpenTofu using the `cyrilgdn/postgresql` provider against PG on `192.168.1.202` (state prefix `factory/postgres`). It iterates over a `var.applications` set and, **per app**, creates: + +| Resource | Name pattern | Purpose | +|---|---|---| +| Database | `` | The app's database (`template0`, owned by the role) | +| Owner role (non-login) | `_role` | Database owner; granted to dynamic users by Vault | +| Editor role (login) | `credentials_editor` | Shared admin role that can grant the per-app roles | +| `user_lookup()` function | per-`` db | `SECURITY DEFINER` lookup for **pgbouncer** auth (granted to `pgbouncer_auth`, revoked from `public`) | + +Current `applications` set: `webapp`, `erp`, `crowdsec`, `plausible`, `dance-lessons-coach`. Vault's PostgreSQL secrets engine then issues **dynamic** credentials on top of these roles — see [secrets-and-vault.md](secrets-and-vault.md). The pooler (`pgbouncer`) that consumes `user_lookup()` lives in the `tools` namespace — see [02 · tools](02-tools.md). 
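
The per-app iteration in Pillar 4 can be pictured as a pure name-derivation step. The sketch below is an assumption-laden illustration in Python, not the OpenTofu code: `planned_postgres_objects` is a hypothetical helper, and it assumes the database is named exactly after the app and the owner role appends `_role` to that name. Only the `_role` suffix, `credentials_editor`, `user_lookup()`, and the application list are taken from the text above.

```python
# Shared login role from the table above; one instance, not one per app.
SHARED_EDITOR_ROLE = "credentials_editor"


def planned_postgres_objects(applications):
    """Map each app to the object names the per-app loop would create.

    Assumption: database == app name, owner role == "<app name>_role".
    """
    plan = {}
    for app in sorted(applications):
        plan[app] = {
            "database": app,
            "owner_role": f"{app}_role",
            "lookup_function": "user_lookup",  # created inside each app database
        }
    return plan


# The current `applications` set listed in this section.
CURRENT_APPLICATIONS = {"webapp", "erp", "crowdsec", "plausible", "dance-lessons-coach"}
```

Under these assumptions, `planned_postgres_objects(CURRENT_APPLICATIONS)["erp"]` names the database `erp` and its owner role `erp_role`. The actual patterns should be confirmed against `postgres/iac/` before relying on them.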
+ +## Provisioning order + +```mermaid +%%{init: {'theme': 'base'}}%% +flowchart LR + classDef proc fill:#059669,stroke:#047857,color:#fff + classDef store fill:#7c3aed,stroke:#6d28d9,color:#fff + S1["01_system
OS + k3s + Longhorn"]:::proc --> S2["02_setup
PG + Gitea (pi2)"]:::proc --> S3["03_cicd
runners + ArgoCD"]:::proc --> S4["04_tools
Vault + CrowdSec"]:::proc --> S5["05_backup
PG/Gitea/PVC"]:::proc + IAC["iac/ + postgres/iac
(OpenTofu state in GCS)"]:::store -. "declares cloud/Gitea/Vault/PG" .- S2 +``` + +1. **`01_system`** lays the OS, disks, Docker, and k3s with Longhorn + Traefik onto the three Pis. +2. **`02_setup`** stands up PostgreSQL and Gitea as docker-compose on `pi2` — the out-of-cluster source-of-truth services. +3. **`03_cicd`** registers the Gitea act-runners (and is where ArgoCD would install, currently commented out). +4. **`04_tools`** bootstraps the cluster's Vault and CrowdSec. +5. **`05_backup`** schedules PostgreSQL, Gitea, and k3s-PVC backups to `/mnt/backups`. +6. In parallel, **OpenTofu** (`iac/` and `postgres/iac/`) declares the cloud, Gitea, Vault, and PostgreSQL objects, keeping state in GCS. + +## Cross-references + +- [Lab ecosystem hub](README.md) — the whole-lab map this page sits under. +- [02 · tools](02-tools.md) — what ArgoCD deploys into the `tools` namespace (incl. pgbouncer that consumes the PG `user_lookup()`). +- [03 · cms](03-cms.md) — the CMS edge that `iac/cloudflare.tf` and `iac/ovh.tf` wire up. +- [naming-conventions.md](naming-conventions.md) — the `` join key these pillars share. +- [secrets-and-vault.md](secrets-and-vault.md) — Gitea OIDC JWT for Tofu/CI and dynamic PG creds. +- [storage-and-recovery.md](storage-and-recovery.md) — Longhorn + GCS backup + power-cut recovery. +- [new-web-app runbook](../../../doc/runbooks/new-web-app/README.md) · [conventions](../../../doc/runbooks/new-web-app/conventions.md) — the step-by-step procedure these pillars support. +- [doc/adr](../../../doc/adr/README.md) — the canonical infrastructure ADRs. +- [Longhorn PVC recovery ADR](../../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md) — recovery background. 
diff --git a/vibe/guidebooks/lab-ecosystem/02-tools.md b/vibe/guidebooks/lab-ecosystem/02-tools.md new file mode 100644 index 0000000..116b7ef --- /dev/null +++ b/vibe/guidebooks/lab-ecosystem/02-tools.md @@ -0,0 +1,76 @@ +[vibe](../../README.md) > [Guidebooks](../README.md) > [Lab ecosystem](README.md) > **02 · tools** + +# 02 · tools + +> **Status:** ✅ Active +> **Last Updated:** 2026-06-23 +> **Upstream:** [01 · factory](01-factory.md) +> **Related:** [secrets-and-vault.md](secrets-and-vault.md) · [storage-and-recovery.md](storage-and-recovery.md) + +The [`tools` repo](https://gitea.arcodange.lab/arcodange-org/tools) is deployed by factory's ArgoCD into the **`tools` namespace**. It is the platform layer that every app namespace depends on: secrets (Vault + VSO), observability (Prometheus + Grafana), edge security (CrowdSec), database pooling (pgbouncer / pgcat), caching (Redis/KeyDB), and analytics (Plausible + ClickHouse). Each component ships its own Helm chart or Kustomize overlay, and most carry an `iac/` directory of OpenTofu that declares the Vault config (roles, policies, dynamic-secret backends) that wires the component to secrets — see [secrets-and-vault.md](secrets-and-vault.md). 
+ +## Components in the `tools` namespace + +| Component | What it does | How declared | How it gets secrets | +|---|---|---|---| +| **Vault** | Secrets engine: KV v1 + v2, transit, PostgreSQL **dynamic creds**; auth backends `kubernetes` + Gitea **OIDC/JWT** | Helm chart + `iac/` (Vault config of itself + apps) | Is the source of truth; unsealed at boot (1 key, threshold 1) | +| **VSO** (Vault Secrets Operator) | Injects Vault secrets into pods via `VaultAuth` + `VaultDynamicSecret` CRDs | Helm chart | Authenticates to Vault via **Kubernetes auth** (per-`` role) | +| **Prometheus** | Metrics scraping + storage | Helm (community subchart) | — (scrape configs) | +| **Grafana** | Dashboards at `grafana.arcodange.lab`; datasources Prometheus + ClickHouse | Helm | Admin/datasource creds via VSO from Vault | +| **CrowdSec** | Behavioural detection + **Traefik bouncer** for the public edge | Helm + `iac/` | **Dynamic secrets** from Vault (VSO) | +| **pgbouncer** | Connection pooler to the **external** PostgreSQL on `pi2` | Helm | Auth via the per-app `user_lookup()` function (see [01 · factory](01-factory.md)); creds via VSO | +| **pgcat** | Alternative pooler (optional, **not the default**) | Helm | VSO-injected creds when enabled | +| **Redis / KeyDB** | In-memory cache; **KeyDB** master/replica (Redis-compatible) | Helm | VSO-injected auth when set | +| **Plausible** | Privacy-friendly web analytics | **Kustomize** | VSO-injected creds; backed by ClickHouse | +| **ClickHouse** | OLAP column store backing Plausible | **Kustomize** | VSO-injected creds | +| **`tool`** | A Helm **library chart** — shared templates/helpers reused by the other charts (not itself deployable) | Helm library chart | n/a | + +## How tools fit together + +```mermaid +%%{init: {'theme': 'base'}}%% +flowchart TB + classDef store fill:#7c3aed,stroke:#6d28d9,color:#fff + classDef proc fill:#059669,stroke:#047857,color:#fff + classDef edge fill:#d97706,stroke:#b45309,color:#fff + + VAULT[("Vault
single source of truth")]:::store + VSO["VSO
VaultAuth / VaultDynamicSecret"]:::proc + PG[("External PostgreSQL
pi2 · 192.168.1.202")]:::store + PGB["pgbouncer
pooler"]:::proc + APPS["app pods
(webapp, erp, …)"]:::proc + PROM["Prometheus"]:::proc + GRAF["Grafana
grafana.arcodange.lab"]:::proc + CH[("ClickHouse")]:::store + PLA["Plausible"]:::proc + CS["CrowdSec + Traefik bouncer"]:::edge + + VAULT --> VSO + VSO -- "inject secrets" --> APPS + VSO -- "inject secrets" --> PGB + VSO -- "dynamic secret" --> CS + APPS --> PGB --> PG + PROM --> GRAF + CH --> GRAF + PLA --> CH +``` + +1. **Vault** holds every secret; **VSO** is the operator that delivers them into pods. +2. VSO **injects** static and dynamic secrets into the app pods, into **pgbouncer**, and supplies **CrowdSec** its dynamic secret. +3. App pods connect through **pgbouncer**, which pools connections to the **external PostgreSQL** on `pi2` (using the per-app `user_lookup()` function defined in factory's `postgres/iac/`). +4. **Prometheus** scrapes metrics and **ClickHouse** stores analytics; both are wired as **Grafana** datasources. +5. **Plausible** writes its analytics into **ClickHouse**. +6. **CrowdSec** runs as a Traefik bouncer on the public edge, fed dynamic secrets from Vault — the same edge that fronts the CMS in [03 · cms](03-cms.md). + +## Where to look + +- Repo: [arcodange-org/tools](https://gitea.arcodange.lab/arcodange-org/tools) — each component is a top-level chart/overlay with its own `iac/`. +- Vault config patterns: [hashicorp-vault/iac/modules](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/hashicorp-vault/iac) (e.g. `app_roles`, `app_policy`) — referenced by the [naming convention](../../../doc/runbooks/new-web-app/conventions.md). + +## Cross-references + +- [Lab ecosystem hub](README.md) — the whole-lab map. +- [01 · factory](01-factory.md) — the ArgoCD that deploys this namespace, and the `postgres/iac/` roles + `user_lookup()` that pgbouncer consumes. +- [03 · cms](03-cms.md) — the public edge protected by **CrowdSec** (Turnstile → CrowdSec wiring). +- [secrets-and-vault.md](secrets-and-vault.md) — full Vault detail: KV/transit/dynamic engines, Gitea OIDC JWT, VSO injection. 
+- [storage-and-recovery.md](storage-and-recovery.md) — Longhorn PVCs these stateful tools mount, and the Vault-unseal step in recovery. diff --git a/vibe/guidebooks/lab-ecosystem/03-cms.md b/vibe/guidebooks/lab-ecosystem/03-cms.md new file mode 100644 index 0000000..2bcdaad --- /dev/null +++ b/vibe/guidebooks/lab-ecosystem/03-cms.md @@ -0,0 +1,82 @@ +[vibe](../../README.md) > [Guidebooks](../README.md) > [Lab ecosystem](README.md) > **03 · cms** + +# 03 · cms + +> **Status:** ✅ Active +> **Last Updated:** 2026-06-23 +> **Upstream:** [01 · factory](01-factory.md) +> **Related:** [02 · tools](02-tools.md) · [secrets-and-vault.md](secrets-and-vault.md) + +The [`cms` repo](https://gitea.arcodange.lab/arcodange-org/cms) is the **public-facing site** of the lab: a Nuxt static site served at **`arcodange.fr`**, plus the OpenTofu that owns its Cloudflare edge and its Zoho email. It is the one app whose primary audience is the open Internet, so it ties together the public-DNS, tunnel, CAPTCHA, and email plumbing. + +## The Nuxt site + +| Aspect | Detail | +|---|---| +| App | Static **Nuxt** site | +| Chart | [`chart/`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/chart) — Helm chart, deployed as ArgoCD app **`cms`** into the `cms` namespace | +| Image | Built in CI to the Gitea registry; ArgoCD **Image Updater** tracks `gitea.arcodange.lab/arcodange-org/cms:latest` with the **digest** strategy (see [01 · factory](01-factory.md)) | +| Hostname | `arcodange.fr` (public) | + +## Cloudflare edge ([`cloudflare/`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/cloudflare)) + +OpenTofu (state in cloud object storage) manages the `arcodange.fr` zone. The domain is **registered at OVH** (factory's [`iac/ovh.tf`](../../../iac/ovh.tf) grants the CMS an OVH OAuth2 client to edit nameservers) but its **DNS is delegated to Cloudflare**. 
The Cloudflare API token + account ID are pushed into the CMS Gitea repo as action secrets and mirrored into Vault by factory's [`iac/cloudflare.tf`](../../../iac/cloudflare.tf). + +| Cloudflare object | Purpose | +|---|---| +| Zone `arcodange.fr` | Public DNS for the site + email records | +| Cloudflare **Pages** option | Static-hosting alternative for the Nuxt build | +| **Cloudflared** Zero-Trust tunnel | Exposes **internal Traefik** to the Internet without opening home-LAN ports | +| **Turnstile** CAPTCHA | Bot challenge on forms; wired to **CrowdSec** for decisioning | + +The Cloudflared tunnel token and Turnstile secret are stored in **Vault** (see [secrets-and-vault.md](secrets-and-vault.md)); the Turnstile → CrowdSec link is the public-edge guard documented in [02 · tools](02-tools.md). + +## Zoho email ([`zoho/`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/zoho)) + +Sets up email for `arcodange.fr`: org/account lookup via the Zoho API + shell scripts, the full DNS authentication record set, and the public aliases. + +| DNS record | Role | +|---|---| +| CNAME (verify) | Domain ownership verification | +| **MX** | Mail routing to Zoho | +| **SPF** | Authorized senders | +| **DKIM** | Outbound signing | +| **DMARC** | Alignment + reporting policy | +| **BIMI** | Brand logo in inboxes | + +Seven aliases are provisioned: **bonjour, contact, analytics, books, abonnements, helloworld, bureaux**. + +## Public request + email path + +```mermaid +%%{init: {'theme': 'base'}}%% +flowchart LR + classDef edge fill:#d97706,stroke:#b45309,color:#fff + classDef proc fill:#059669,stroke:#047857,color:#fff + classDef store fill:#7c3aed,stroke:#6d28d9,color:#fff + + USER(["Visitor"]):::edge + CF["Cloudflare
DNS + Turnstile"]:::edge + TUN["Cloudflared tunnel"]:::edge + TRAEFIK["internal Traefik"]:::proc + CS["CrowdSec bouncer"]:::proc + CMS["cms pod (Nuxt)
arcodange.fr"]:::proc + MAIL(["Sender"]):::edge + ZOHO["Zoho
MX / SPF / DKIM / DMARC / BIMI"]:::store + + USER --> CF -- "Turnstile challenge" --> TUN --> TRAEFIK --> CS --> CMS + MAIL -- "MX lookup arcodange.fr" --> ZOHO +``` + +1. A **visitor** resolves `arcodange.fr` through **Cloudflare** DNS; form submissions hit a **Turnstile** challenge. +2. Traffic enters the home LAN through the **Cloudflared** Zero-Trust tunnel — no home-LAN ports are opened. +3. The tunnel lands on **internal Traefik**, which routes through the **CrowdSec** bouncer (fed Turnstile/decision signals) to the **`cms`** Nuxt pod. +4. Separately, **email** to `arcodange.fr` follows the **MX** record to **Zoho**, with **SPF/DKIM/DMARC/BIMI** authenticating and presenting the mail; the seven aliases land there. + +## Cross-references + +- [Lab ecosystem hub](README.md) — the whole-lab map. +- [01 · factory](01-factory.md) — the ArgoCD app `cms`, and `iac/cloudflare.tf` / `iac/ovh.tf` that grant the CMS its Cloudflare token and OVH nameserver-edit rights. +- [02 · tools](02-tools.md) — **CrowdSec** (the Traefik bouncer the Turnstile challenge feeds). +- [secrets-and-vault.md](secrets-and-vault.md) — the Cloudflared tunnel token and Turnstile/Cloudflare secrets stored in Vault. +- Repo: [arcodange-org/cms](https://gitea.arcodange.lab/arcodange-org/cms). 
diff --git a/vibe/guidebooks/lab-ecosystem/README.md b/vibe/guidebooks/lab-ecosystem/README.md new file mode 100644 index 0000000..f932bec --- /dev/null +++ b/vibe/guidebooks/lab-ecosystem/README.md @@ -0,0 +1,116 @@ +[vibe](../../README.md) > [Guidebooks](../README.md) > **Lab ecosystem** + +# Lab ecosystem + +> **Status:** ✅ Active +> **Last Updated:** 2026-06-23 +> **Related:** [ADR-0001 · safe prod-like environment](../../ADR/0001-safe-prod-like-environment.md) · [PRD · safe prod-like environment](../../PRD/safe-prod-like-environment/README.md) · [INV-001 · prod blast-radius couplings](../../investigations/INV-001-prod-blast-radius-couplings.md) + +## What this is + +This guidebook is the **end-to-end map of the Arcodange home lab** — how the three repos (`factory`, `tools`, `cms`), the three Raspberry Pis, and the cloud edge wire together into one running system. It is a *descriptive reference map*, not a procedure: it answers *"how does this fit together right now?"*. For *"how do I add a new app step by step?"* see the [new-web-app runbook](../../../doc/runbooks/new-web-app/README.md); for *"why was it built this way?"* see the [factory ADRs](../../../doc/adr/README.md). + +The lab is run from **one control node** — a MacBook Pro M4 — driving everything via Ansible (imperative host setup) and OpenTofu (declarative cloud/Gitea/Vault/Postgres state). The three Pis (`pi1`/`pi2`/`pi3` = `192.168.1.201-203`) sit behind a home Livebox. `pi1` is the k3s server; `pi2`/`pi3` are agents. Gitea + PostgreSQL run as Docker Compose **outside** k3s on `pi2`'s disk; everything else runs **inside** k3s on Longhorn distributed block storage. The public edge is a Cloudflared Zero-Trust tunnel into the internal Traefik, with Cloudflare DNS and Zoho email fronting `arcodange.fr`. 
+ +## The whole lab, end to end + +```mermaid +%%{init: {'theme': 'base'}}%% +flowchart TB + classDef ctrl fill:#2563eb,stroke:#1e40af,color:#fff + classDef host fill:#0891b2,stroke:#0e7490,color:#fff + classDef proc fill:#059669,stroke:#047857,color:#fff + classDef store fill:#7c3aed,stroke:#6d28d9,color:#fff + classDef edge fill:#d97706,stroke:#b45309,color:#fff + classDef dead fill:#6b7280,stroke:#4b5563,color:#fff + + MAC["Control node (MacBook Pro M4)
Ansible + OpenTofu"]:::ctrl + + subgraph LAN["Home LAN (Livebox) — 192.168.1.0/24"] + subgraph PI2["pi2 · 192.168.1.202 (docker-compose, outside k3s)"] + GITEA["Gitea
arcodange-org/*"]:::host + PG[("PostgreSQL")]:::store + end + subgraph K3S["k3s cluster — pi1 server, pi2/pi3 agents"] + ARGO["ArgoCD app-of-apps
/argocd"]:::proc + LH[("Longhorn
block storage")]:::store + VAULT["Vault + VSO
secrets"]:::store + TRAEFIK["Traefik
ingress"]:::proc + TOOLS["tools namespace
(Vault, Grafana, CrowdSec, …)"]:::host + APPS["app namespaces
(webapp, erp, cms, …)"]:::host + end + OLLAMA["pi3 · ollama"]:::host + end + + subgraph CLOUD["Cloud edge"] + CF["Cloudflare DNS
+ Cloudflared tunnel"]:::edge + ZOHO["Zoho
email (arcodange.fr)"]:::edge + GCS[("GCS gs://arcodange-tf
OpenTofu state + Longhorn backup")]:::store + end + + INTERNET(["Internet"]):::edge + + MAC -- "Ansible: provision hosts, k3s, docker-compose" --> PI2 + MAC -- "Ansible: k3s, Longhorn, Traefik" --> K3S + MAC -- "OpenTofu: Gitea/Vault/PG/Cloudflare/OVH state" --> GITEA + MAC -- "OpenTofu state" --> GCS + + GITEA -- "repoURL chart/" --> ARGO + ARGO -- "Application CRDs (prune+selfHeal)" --> TOOLS + ARGO -- "Application CRDs (prune+selfHeal)" --> APPS + VAULT -- "VSO injects secrets into pods" --> TOOLS + VAULT -- "VSO injects secrets into pods" --> APPS + APPS -- "dynamic creds" --> PG + LH -. "PVCs" .- TOOLS + LH -. "PVCs" .- APPS + LH -- "backup target" --> GCS + + INTERNET --> CF -- "tunnel" --> TRAEFIK --> APPS + INTERNET --> ZOHO +``` + +1. The **control node** (MacBook) provisions the three Pis with Ansible (OS, disks, Docker, k3s, Longhorn, Traefik) and manages all SaaS/Gitea/Vault/Postgres state with OpenTofu. +2. On **pi2**, Gitea and PostgreSQL run as Docker Compose *outside* k3s, on the local disk — they are the source-of-truth services the cluster depends on. +3. OpenTofu keeps its **state in GCS** (`gs://arcodange-tf`), and Longhorn pushes volume **backups** to the same GCS project. +4. **Gitea** hosts every app repo; each repo's `chart/` directory is the deployable Helm chart. +5. **ArgoCD's app-of-apps** turns each Gitea repo into an `Application` CRD (automated `prune` + `selfHeal`) that deploys into the `tools` namespace and the per-app namespaces. +6. **Vault** is the single source of truth for secrets; the **Vault Secrets Operator (VSO)** injects them into pods via Kubernetes auth, and apps draw dynamic PostgreSQL credentials from Vault against `pi2`. +7. **Longhorn** provides the PVCs the in-cluster workloads mount, and backs up to GCS. +8. The **public edge** routes Internet traffic through Cloudflare DNS and a Cloudflared Zero-Trust **tunnel** into the internal **Traefik**, which fronts the app namespaces; **Zoho** handles `arcodange.fr` email. 
+ +> [!NOTE] +> The ArgoCD Helm chart under [`argocd/`](../../../argocd/) is defined and templated, but **ArgoCD itself is not currently deployed in-cluster** (its install step is commented out in the `03_cicd` provisioning). The app-of-apps wiring documented here is the intended steady state; see [01 · factory](01-factory.md) for the caveat. + +## Deploy / secrets / DNS flows + +- **Deploy flow.** Push to a Gitea repo → CI builds an image into the Gitea registry → ArgoCD (via the app-of-apps and, for some apps, the Image Updater) syncs the `chart/` directory into the matching namespace with `prune` + `selfHeal`. The whole chain keys off one `<app>` identifier — see [naming-conventions.md](naming-conventions.md). +- **Secrets flow.** Vault is the **single source of truth** (no sops/age). CI authenticates to Vault via **Gitea OIDC JWT** (role `gitea_cicd_<app>`); pods receive secrets at runtime via **VSO** (Kubernetes auth + `VaultDynamicSecret` CRDs). Detail in [secrets-and-vault.md](secrets-and-vault.md). +- **DNS / edge flow.** Internal names resolve under `*.arcodange.lab` (Pi-hole + Step-CA-issued TLS). Public traffic for `arcodange.fr` enters through Cloudflare and a Cloudflared tunnel to internal Traefik; public TLS is Let's Encrypt via Traefik's DNS-challenge (DuckDNS). Email runs through Zoho. Edge detail in [03 · cms](03-cms.md). 
+ +## Master index + +| Page | What it maps | Status | +|---|---|---| +| [01 · factory](01-factory.md) | The cornerstone admin repo: Ansible host/cluster provisioning, ArgoCD app-of-apps, OpenTofu (`iac/`), and per-app PostgreSQL (`postgres/iac/`) | ✅ Active | +| [02 · tools](02-tools.md) | The `tools` namespace: Vault, VSO, Prometheus, Grafana, CrowdSec, poolers, Redis/KeyDB, Plausible + ClickHouse, the `tool` library chart | ✅ Active | +| [03 · cms](03-cms.md) | The public-facing site: Nuxt static site, Cloudflare zone + tunnel + Turnstile, Zoho email (MX/SPF/DKIM/DMARC/BIMI + aliases) | ✅ Active | +| [naming-conventions.md](naming-conventions.md) | The `<app>` join key — one kebab-case name reused identically across Gitea, PG, Vault, k8s, ArgoCD, GCS, DNS | ✅ Active | +| [secrets-and-vault.md](secrets-and-vault.md) | How Vault is the single source of truth: Gitea OIDC JWT for CI, VSO injection for pods, dynamic PostgreSQL creds | ✅ Active | +| [storage-and-recovery.md](storage-and-recovery.md) | Longhorn block storage, GCS backup target, and the tested power-cut recovery sequence | ✅ Active | + +## Status legend + +✅ done · 🟡 beta · 🔴 critical · ⚠️ known issue · ❌ disabled · ⬜ not started. + +## Maintenance rule + +> [!IMPORTANT] +> **If you alter a component documented here, update its page in the same change.** A reference map that drifts from reality sends readers (and agents) confidently down dead paths. The PR that changes the component is the PR that updates its guidebook page — treat the doc edit as part of the diff, not a follow-up. + +## Cross-references + +- [ADR-0001 · safe prod-like environment](../../ADR/0001-safe-prod-like-environment.md) — the decision this map supports. +- [PRD · safe prod-like environment](../../PRD/safe-prod-like-environment/README.md) — the product framing of an isolated, prod-like sandbox. 
+- [INV-001 · prod blast-radius couplings](../../investigations/INV-001-prod-blast-radius-couplings.md) — the couplings (the `<app>` join key, shared Vault/PG/Longhorn) that make blast radius real. +- [doc/adr](../../../doc/adr/README.md) — the canonical infrastructure ADRs (French). +- [new-web-app conventions](../../../doc/runbooks/new-web-app/conventions.md) — the authoritative source for the `<app>` naming convention. diff --git a/vibe/guidebooks/lab-ecosystem/naming-conventions.md b/vibe/guidebooks/lab-ecosystem/naming-conventions.md new file mode 100644 index 0000000..544cb78 --- /dev/null +++ b/vibe/guidebooks/lab-ecosystem/naming-conventions.md @@ -0,0 +1,96 @@ +[vibe](../../README.md) > [Guidebooks](../README.md) > [Lab ecosystem](README.md) > **Naming conventions (the `<app>` join key)** + +# Naming conventions — the `<app>` join key + +> **Status**: 🟢 Active +> **Last Updated**: 2026-06-23 +> **Related**: [Lab ecosystem](README.md) · [Factory brick](01-factory.md) · [Secrets & Vault](secrets-and-vault.md) · [PRD — isolation boundary](../../PRD/safe-prod-like-environment/isolation-boundary.md) +> **Upstream (source of truth)**: [doc/runbooks/new-web-app/conventions.md](../../../doc/runbooks/new-web-app/conventions.md) (French, authoritative) + +## TL;DR + +Every application on the platform is pinned to **one** kebab-case identifier — `<app>` (e.g. `erp`, `webapp`, `url-shortener`, `dance-lessons-coach`). That single string is reused **verbatim**, with no transformation, as the name of the app's Gitea repo, its PostgreSQL database and role, its Vault roles and policies, its Kubernetes namespace and ServiceAccount, its ArgoCD Application, its OpenTofu state prefix, and its DNS records. The bricks of the stack do not point at each other through explicit configuration; they **wire together by guessing each other's names from `<app>`**. Pick the name once, get it right, and the whole chain self-assembles. One typo anywhere, and the chain breaks silently. 
+ +## What `<app>` is + +`<app>` is a **lowercase, kebab-case** slug. It is the join key of the entire platform — the one value that lets a dozen otherwise-independent systems agree on which resources belong to the same application without ever exchanging a config pointer. The canonical, authoritative definition (in French) lives in the runbook: [doc/runbooks/new-web-app/conventions.md](../../../doc/runbooks/new-web-app/conventions.md). This page is the English concept summary inside the ecosystem guidebook. + +## The mapping — one name, every system + +The table below shows how each system derives its identifier from `<app>`, with the `erp` application as the worked example. + +| System | Identifier derived from `<app>` | Example (`erp`) | +| --- | --- | --- | +| Gitea repository | `arcodange-org/<app>` | `arcodange-org/erp` | +| PostgreSQL database | `<app>` | `erp` | +| PostgreSQL owner role (non-login) | `<app>_role` | `erp_role` | +| Vault dynamic DB role | `postgres/creds/<app>` | `postgres/creds/erp` | +| Vault Kubernetes auth role | `<app>` | `erp` | +| Vault runtime policy (pod) | `<app>` | `erp` | +| Vault CI/ops policy | `<app>-ops` | `erp-ops` | +| Vault CI JWT role (Gitea OIDC) | `gitea_cicd_<app>` | `gitea_cicd_erp` | +| Vault KV config path | `kvv2/<app>/config` | `kvv2/erp/config` | +| Kubernetes namespace | `<app>` | `erp` | +| Kubernetes ServiceAccount | `<app>` | `erp` | +| ArgoCD Application | `<app>` | `erp` | +| OpenTofu state prefix (GCS) | `<app>/main` | `erp/main` | +| Internal DNS | `<app>.arcodange.lab` | `erp.arcodange.lab` | +| Public DNS | `<app>.arcodange.fr` | `erp.arcodange.fr` | + +> [!NOTE] +> The `_role` suffix (PG owner role) and the `-ops` suffix (Vault CI policy/identity group) are the only two *systematic* transformations of `<app>`. Everything else uses the bare slug. Note the suffix style differs: PostgreSQL uses an underscore (`erp_role`) because hyphens are awkward in SQL identifiers, whereas Vault and Kubernetes use a hyphen (`erp-ops`). 
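The mapping table can be restated as one pure function of the slug. The Python sketch below is illustrative only: `derive_identifiers` and its key names are hypothetical, and it ignores the per-app Gitea org override that some ArgoCD applications use. Each value follows the corresponding table row.

```python
def derive_identifiers(app: str) -> dict:
    """Derive every per-system identifier from the single <app> slug.

    Hypothetical helper: key names are illustrative, values follow the table.
    """
    return {
        "gitea_repo": f"arcodange-org/{app}",
        "pg_database": app,
        "pg_owner_role": f"{app}_role",           # underscore suffix (SQL identifier)
        "vault_db_role": f"postgres/creds/{app}",
        "vault_k8s_auth_role": app,
        "vault_runtime_policy": app,
        "vault_ops_policy": f"{app}-ops",         # hyphen suffix (Vault/Kubernetes style)
        "vault_ci_jwt_role": f"gitea_cicd_{app}",
        "vault_kv_config_path": f"kvv2/{app}/config",
        "k8s_namespace": app,
        "k8s_service_account": app,
        "argocd_application": app,
        "tofu_state_prefix": f"{app}/main",
        "internal_dns": f"{app}.arcodange.lab",
        "public_dns": f"{app}.arcodange.fr",
    }


if __name__ == "__main__":
    for system, name in derive_identifiers("erp").items():
        print(f"{system:24} {name}")
```

Because the function takes no input other than the slug, two bricks that call it with the same string always agree, and two that call it with different strings never do. That is the whole coupling model in miniature.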
+ +## Why uniformity is structuring + +The platform is a set of loosely-coupled bricks (Gitea, Postgres, Vault, k3s/ArgoCD, OpenTofu, DNS). They were deliberately built **not** to hold explicit references to one another. Instead, each brick reconstructs the names it needs from `` at the moment it runs: + +```mermaid +%%{init: {'theme':'base'}}%% +flowchart LR + APP["<app>
(one kebab-case slug)"]:::src + + APP --> GIT["Gitea repo
arcodange-org/<app>"]:::brick + APP --> PG["PostgreSQL
db <app> · role <app>_role"]:::brick + APP --> VAULT["Vault
postgres/creds/<app>
policy <app> · gitea_cicd_<app>"]:::brick + APP --> K8S["Kubernetes
namespace + SA <app>"]:::brick + APP --> ARGO["ArgoCD
Application <app>"]:::brick + APP --> GCS["OpenTofu state
<app>/main"]:::brick + APP --> DNS["DNS
<app>.arcodange.lab / .fr"]:::brick + + VAULT -.->|"GRANT <app>_role
assumes PG role name"| PG + K8S -.->|"VaultDynamicSecret reads
postgres/creds/<app>"| VAULT + ARGO -.->|"repoURL=.../<app>
namespace=<app>"| GIT + + classDef src fill:#2563eb,stroke:#1e40af,color:#fff + classDef brick fill:#059669,stroke:#047857,color:#fff +``` + +1. The chosen slug `<app>` is the single input. +2. From it, each brick names its own resource: Gitea names the repo `arcodange-org/<app>`; Postgres names the database `<app>` and its owner role `<app>_role`; Vault names the dynamic-creds role `postgres/creds/<app>`, the runtime policy `<app>`, and the CI JWT role `gitea_cicd_<app>`; Kubernetes names the namespace and ServiceAccount `<app>`; ArgoCD names the Application `<app>`; OpenTofu writes state under `<app>/main`; DNS publishes `<app>.arcodange.lab` and `<app>.arcodange.fr`. +3. The dashed arrows are the cross-brick assumptions that make it work: the Vault `app_roles` module issues a dynamic PG user with `GRANT <app>_role TO …`, **assuming** the Postgres owner role is named exactly `<app>_role`; the chart's `VaultDynamicSecret` reads `postgres/creds/<app>`, **assuming** the Vault role is named exactly `<app>`; the ArgoCD Application derives `repoURL=.../<app>` and `namespace=<app>` from the slug alone, **assuming** the Gitea repo and the namespace match. +4. None of these links is configured by hand. They hold purely because every brick was given the same `<app>` to reconstruct from. + +## The failure mode of a typo + +Because the wiring is by name and not by explicit reference, **nothing validates the join key end-to-end**. A single divergence — `my_app` vs `my-app`, a stray capital (`MyApp`), an accidental plural (`erps`) — does not raise an error at creation time. The mismatched brick simply builds a resource under a name no one else looks for: + +- A Postgres owner role created as `erp-role` (hyphen) instead of `erp_role` → Vault's `GRANT erp_role` fails or grants nothing → the pod gets a DB user with no privileges. +- A Gitea repo named `erp-app` instead of `erp` → ArgoCD's derived `repoURL=.../erp` 404s → the Application never syncs. +- A namespace typo → the `VaultDynamicSecret` and ServiceAccount land in the wrong place → silent auth failure at pod start. 
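Since nothing in the stack checks the slug, a small pre-flight check is one way to catch the separator and capitalisation variants listed above before anything is created. The Python sketch below is an assumption, not part of the lab's tooling: the regular expression is one plausible reading of "lowercase kebab-case", and both function names are hypothetical. It cannot detect a well-formed but wrong name such as an accidental plural.

```python
import re

# Assumed definition of a lowercase kebab-case slug; not taken from the runbook.
SLUG_PATTERN = re.compile(r"^[a-z][a-z0-9]*(-[a-z0-9]+)*$")


def is_valid_slug(app: str) -> bool:
    """True when the candidate is lowercase, hyphen-separated, with no stray separators."""
    return SLUG_PATTERN.fullmatch(app) is not None


def owner_role_matches(app: str, role_name: str) -> bool:
    """True when a Postgres owner role carries exactly the name Vault will GRANT."""
    return role_name == f"{app}_role"


if __name__ == "__main__":
    for candidate in ["erp", "my_app", "MyApp", "dance-lessons-coach"]:
        print(candidate, is_valid_slug(candidate))
    # The hyphenated role from the first failure example does not match.
    print(owner_role_matches("erp", "erp-role"))
```

A check like this only narrows the failure surface: it rejects malformed slugs and mismatched derived names, but choosing the right slug in the first place remains a human decision.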
+ +The symptom is always the same: a brick that *looks* provisioned but never connects, with no single component to blame. This is why the slug must be **short, stable, and correct from the first step** — there is no safety net downstream. + +✅ Choose a short, stable, lowercase kebab-case name up front and reuse it character-for-character. +❌ Never introduce variants (case, separators, plurals); nothing will warn you. + +## Why this makes a sandbox safe + +The `<app>` convention is also the reason a **production-like sandbox can reuse the exact same names** without colliding with production. Because every brick derives its resource names from `<app>` and from nothing else, an entire parallel universe of the platform — its own Vault, its own Postgres instance, its own k3s namespace scope — can host an `erp` named identically to the production `erp`, provided the two universes never share a backing store. Identity comes from the *environment boundary*, not from the name; the name is free to repeat. This is what lets QA and recovery drills run against `erp`, `webapp`, etc. with realistic identifiers instead of mangled `erp-staging`-style aliases that would themselves break the name-wiring. See the PRD's [isolation boundary](../../PRD/safe-prod-like-environment/isolation-boundary.md) for how that environment fence is drawn. + +## See also + +- [doc/runbooks/new-web-app/conventions.md](../../../doc/runbooks/new-web-app/conventions.md) — the authoritative French source, with per-step references into the 8-step "new web app" runbook. +- [Secrets & Vault](secrets-and-vault.md) — how `gitea_cicd_<app>` and the `<app>` / `<app>-ops` policies fit the auth model. +- [Factory brick](01-factory.md) — where the ArgoCD app-of-apps, the Postgres OpenTofu, and the IaC live. +- [PRD — isolation boundary](../../PRD/safe-prod-like-environment/isolation-boundary.md) — why identical names are safe across environments. 
+- [ADR 0001 — Safe, production-like environment](../../ADR/0001-safe-prod-like-environment.md). diff --git a/vibe/guidebooks/lab-ecosystem/secrets-and-vault.md b/vibe/guidebooks/lab-ecosystem/secrets-and-vault.md new file mode 100644 index 0000000..1c71e74 --- /dev/null +++ b/vibe/guidebooks/lab-ecosystem/secrets-and-vault.md @@ -0,0 +1,110 @@ +[vibe](../../README.md) > [Guidebooks](../README.md) > [Lab ecosystem](README.md) > **Secrets & Vault** + +# Secrets & Vault + +> **Status**: 🟢 Active +> **Last Updated**: 2026-06-23 +> **Related**: [Lab ecosystem](README.md) · [Tools brick](02-tools.md) · [Storage & recovery](storage-and-recovery.md) · [Naming conventions](naming-conventions.md) +> **Decision**: [ADR 0001 — Safe, production-like environment](../../ADR/0001-safe-prod-like-environment.md) + +## TL;DR + +**HashiCorp Vault is the single source of truth for every secret in the lab.** There is no sops, no age, no secret files in git — if a credential exists, Vault either stores it or mints it on demand. Two parties consume secrets, and each authenticates a different way: **pods** use the Kubernetes auth backend (via the Vault Secrets Operator), and **CI / OpenTofu** use Gitea OIDC JWT (one role `gitea_cicd_` per app). Vault holds static config in KV, encryption keys in transit, and issues **short-lived, dynamic** PostgreSQL credentials so no long-lived DB password is ever written down. The trade-off: Vault is sealed on every restart and must be **manually unsealed** (1 key, threshold 1) before anything that needs a secret can come back. + +## Why Vault, and only Vault + +The lab made a deliberate choice: **one** secret store, accessed over the network, rather than encrypted secret files scattered through the repos. The consequences are structuring: + +- **No secret material in git.** Charts and OpenTofu reference Vault *paths*, never values. A leaked repo leaks no credentials. 
+- **One revocation point.** Rotating or revoking a credential happens in Vault; consumers pick up the change on their next read or lease renewal. +- **Dynamic over static.** Where a backend supports it (Postgres), Vault issues a fresh, time-boxed credential per consumer instead of a shared static password. + +Vault itself runs as the `hashicorp-vault` chart in the **tools** namespace. Its full configuration — engines, auth backends, policies, the per-app role/policy modules — lives in the tools repo; see the [Tools brick](02-tools.md) for the deployment context. + +## What Vault mounts + +| Mount | Type | Purpose | +| --- | --- | --- | +| `kvv2/` | KV v2 (versioned) | Application static config, e.g. `kvv2/<app>/config`. Versioned so a bad write can be rolled back. | +| KV v1 | KV v1 (unversioned) | Flat secrets that don't need history. | +| `transit/` | Transit | Encryption-as-a-service: encrypt/decrypt and sign without exposing the key. | +| `postgres/` | Database (dynamic) | Issues **short-lived** PostgreSQL credentials on demand: `postgres/creds/<app>` hands out a fresh login user, granted `<app>_role`, with a lease that expires. | + +The `<app>` slug threads through every one of these paths — `kvv2/<app>/config`, `postgres/creds/<app>` — exactly as described in [Naming conventions](naming-conventions.md). + +## The two auth backends + +Vault doesn't trust callers by static token. Each class of consumer proves its identity through a backend matched to where it runs: + +- **Kubernetes auth** — for **pods**. The Vault Secrets Operator (VSO) and workloads present their Kubernetes ServiceAccount token; Vault validates it against the cluster's API and maps the SA to the Vault role `<app>`, which carries the runtime policy `<app>`. +- **Gitea OIDC / JWT auth** — for **CI and OpenTofu**. A Gitea Actions workflow obtains an OIDC token; Vault validates it and maps it to the JWT role `gitea_cicd_<app>`, which carries the CI/ops policy `<app>-ops`. 
This is how `tofu apply` in CI reads and writes the secrets it manages without any pre-shared Vault token.
+
+The split matters: pods get only what they need at runtime (the `<app>` policy), while CI gets the broader provisioning rights (`<app>-ops`) needed to *create* the very secrets the pods will later read.
+
+## How VSO delivers secrets to pods
+
+Inside the cluster, the **Vault Secrets Operator** is the bridge between Vault and Kubernetes. It watches two kinds of CRD:
+
+- **`VaultAuth`** — declares *how* to authenticate to Vault (the Kubernetes auth mount + the `<app>` role).
+- **`VaultDynamicSecret`** (and `VaultStaticSecret`) — declares *what* to fetch (e.g. `postgres/creds/<app>`) and which Kubernetes Secret to materialise it into. For dynamic secrets, VSO also **renews the lease** and rotates the Secret before it expires.
+
+The pod then mounts the resulting Kubernetes Secret as it would any other — it never speaks to Vault directly, and never sees a static DB password.
+
+## The secret flow, end to end
+
+```mermaid
+%%{init: {'theme':'base'}}%%
+flowchart LR
+    subgraph CI["CI / Provisioning path"]
+        GHA["Gitea Actions<br/>workflow"]:::src
+        TOFU["OpenTofu<br/>tofu apply"]:::proc
+    end
+
+    subgraph RT["Runtime path (in-cluster)"]
+        VSO["Vault Secrets<br/>Operator (VSO)"]:::proc
+        POD["App pod<br/>(ServiceAccount <app>)"]:::proc
+    end
+
+    VAULT["Vault<br/>KV v1/v2 · transit · postgres dynamic"]:::store
+
+    GHA -->|"OIDC JWT<br/>role gitea_cicd_<app>"| VAULT
+    VAULT -->|"policy <app>-ops<br/>read/write secrets"| TOFU
+    TOFU -->|"writes config to<br/>kvv2/<app>/config"| VAULT
+
+    VSO -->|"k8s auth<br/>role <app> (SA token)"| VAULT
+    VAULT -->|"dynamic creds<br/>postgres/creds/<app>"| VSO
+    VSO -->|"materialises +<br/>renews K8s Secret"| POD
+
+    classDef src fill:#2563eb,stroke:#1e40af,color:#fff
+    classDef proc fill:#059669,stroke:#047857,color:#fff
+    classDef store fill:#7c3aed,stroke:#6d28d9,color:#fff
+```
+
+1. **CI path:** a Gitea Actions workflow requests an OIDC JWT and presents it to Vault under the role `gitea_cicd_<app>`. Vault validates the token and grants the `<app>-ops` policy.
+2. With that policy, OpenTofu (`tofu apply`, running in CI) reads the secrets it needs and writes the app's static config back to `kvv2/<app>/config`. No pre-shared Vault token is ever stored — the trust is established per-run via OIDC.
+3. **Runtime path:** in the cluster, the Vault Secrets Operator authenticates with the Kubernetes auth backend, presenting the app's ServiceAccount token mapped to the Vault role `<app>`.
+4. Vault issues a **short-lived, dynamic** PostgreSQL credential from `postgres/creds/<app>` back to VSO.
+5. VSO materialises that credential into a Kubernetes Secret in the app's namespace, then **renews the lease** and rotates the Secret before it expires.
+6. The app pod mounts the Kubernetes Secret like any other — it never talks to Vault, and never holds a long-lived database password.
+
+## The unseal model
+
+Vault encrypts its storage with a master key that is **never persisted in usable form**. On every start — a fresh deploy, a pod reschedule, or a full cluster recovery — Vault comes up **sealed** and refuses every request until it is unsealed.
+
+- **Shamir config:** 1 unseal key, threshold 1 (a single-operator lab, so no key-splitting ceremony).
+- **Where the key lives:** on the control node (the MacBook), at `~/.arcodange/cluster-keys.json`. It is *not* in git, *not* in Kubernetes, *not* in Vault.
+- **Operational consequence:** **nothing that needs a secret recovers until a human unseals Vault.** This is the chokepoint baked into the recovery order — VSO cannot re-auth, dynamic DB creds cannot be issued, and dependent apps cannot start, until the unseal happens.
See [Storage & recovery](storage-and-recovery.md) for where unseal sits in the tested startup sequence.
+
+> [!CAUTION]
+> If `~/.arcodange/cluster-keys.json` is lost, Vault's data is **unrecoverable** — there is no second copy of the unseal key and no key-recovery path. Treat that file as the most critical secret in the lab.
+
+## Sandbox implications
+
+A production-like sandbox does **not** share the production Vault. It runs its **own** Vault instance with its **own** unseal key and its **own** policies, so that exercising secret flows, rotating credentials, or testing a broken unseal cannot touch production secrets. Because the `<app>` join key is environment-relative (see [Naming conventions](naming-conventions.md)), the sandbox can keep identical role and policy names — `gitea_cicd_<app>`, `<app>`, `<app>-ops` — while remaining fully isolated. The rationale for that separate-Vault, separate-unseal posture is recorded in [ADR 0001 — Safe, production-like environment](../../ADR/0001-safe-prod-like-environment.md).
+
+## See also
+
+- [Tools brick](02-tools.md) — where the `hashicorp-vault` chart, VSO, and the per-app Vault IaC modules are deployed.
+- [Storage & recovery](storage-and-recovery.md) — Vault unseal as a step in the tested power-cut recovery order.
+- [Naming conventions](naming-conventions.md) — how `gitea_cicd_<app>`, `<app>`, and `<app>-ops` derive from the join key.
+- [ADR 0001 — Safe, production-like environment](../../ADR/0001-safe-prod-like-environment.md) — the sandbox's separate-Vault decision.
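As an illustration of the naming scheme used in this guidebook, the following Python sketch derives the per-app Vault identifiers from a single app slug. The patterns (`gitea_cicd_<app>`, `<app>-ops`, `kvv2/<app>/config`, `postgres/creds/<app>`) come from the flow diagram above; the `VaultNames` class and the `vault_names` function are hypothetical helpers invented for this example and are not part of the lab's tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VaultNames:
    """Vault identifiers derived from one app slug (illustrative only)."""

    ci_role: str         # JWT role presented by Gitea Actions / OpenTofu
    ci_policy: str       # broader provisioning policy granted to CI
    runtime_role: str    # Kubernetes auth role used by VSO and pods
    runtime_policy: str  # narrower policy granted at runtime
    kv_config_path: str  # static config location in KV v2
    db_creds_path: str   # dynamic PostgreSQL credentials endpoint


def vault_names(app: str) -> VaultNames:
    """Derive the per-app Vault names from the app slug.

    Hypothetical helper: the string patterns mirror the guidebook,
    the validation rule below is an assumption.
    """
    if not app or "/" in app:
        raise ValueError(f"invalid app slug: {app!r}")
    return VaultNames(
        ci_role=f"gitea_cicd_{app}",
        ci_policy=f"{app}-ops",
        runtime_role=app,
        runtime_policy=app,
        kv_config_path=f"kvv2/{app}/config",
        db_creds_path=f"postgres/creds/{app}",
    )
```

For example, `vault_names("erp").ci_role` evaluates to `gitea_cicd_erp`. Because every name depends only on the slug, a sandbox Vault can reuse the same names without sharing anything with production.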
diff --git a/vibe/guidebooks/lab-ecosystem/storage-and-recovery.md b/vibe/guidebooks/lab-ecosystem/storage-and-recovery.md new file mode 100644 index 0000000..0b7a6d1 --- /dev/null +++ b/vibe/guidebooks/lab-ecosystem/storage-and-recovery.md @@ -0,0 +1,76 @@ +[vibe](../../README.md) > [Guidebooks](../README.md) > [Lab ecosystem](README.md) > **Storage & recovery** + +# Storage & recovery + +> **Status**: 🟢 Active +> **Last Updated**: 2026-06-23 +> **Related**: [Lab ecosystem](README.md) · [Secrets & Vault](secrets-and-vault.md) · [Factory brick](01-factory.md) · [PRD — QA strategy](../../PRD/safe-prod-like-environment/qa-strategy.md) +> **Decision**: [ADR 0001 — Safe, production-like environment](../../ADR/0001-safe-prod-like-environment.md) +> **Upstream (incident ADR)**: [Longhorn PVC recovery](../../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md) + +## TL;DR + +The lab keeps state in **two** places, on purpose. **Longhorn** provides distributed block storage *inside* k3s for everything cluster-native (app PVCs, Traefik's `acme.json`, the backup volume itself). **PostgreSQL and Gitea** deliberately persist on **pi2's local disk, outside k3s**, as plain docker-compose — they are the platform's own foundations and must not depend on the cluster they help run. This split survives a full power cut, but Longhorn has one sharp edge: when its CRDs are wiped and recreated, it assigns **new engine IDs** and cannot automatically re-associate the surviving on-disk replica files with the new volumes. The 2026-04-13 power-cut taught us a **fixed startup order** — Longhorn first, then Vault unseal, then VSO re-auth, with **ERP scaled up last** — that brings the cluster back deterministically. That order is now rehearsed as a drill. 
+ +## Two storage tiers, on purpose + +| Tier | Backing | What lives there | Why here | +| --- | --- | --- | --- | +| **In-cluster** | **Longhorn** (distributed block storage inside k3s, replicated across pi1/pi2/pi3) | App PVCs, Traefik certificates (`acme.json`), the cluster backup volume (`backups-rwx`) | Cluster-native workloads get replicated, snapshot-able volumes that follow the pod. | +| **Outside the cluster** | **docker-compose on pi2's local disk** | PostgreSQL + Gitea | These are *foundations*: Gitea serves the GitOps source and Postgres backs the apps. They must survive — and start — **without** k3s being healthy, so they cannot live inside it. | + +This separation is the reason the platform can bootstrap itself: Gitea and Postgres come up on pi2 independently, and only then does the cluster (which pulls its config from Gitea) have something to sync against. See the [Factory brick](01-factory.md) for how the Ansible playbooks and the ArgoCD app-of-apps consume those foundations. + +## The Longhorn engine-ID re-association failure mode + +Longhorn stores each replica's data on a node in a directory named **`-`**. The raw `volume-head-*.img` files are durable — they survive a power cut on the disk. The danger is in the *metadata*, not the data: + +1. A power cut drops the Longhorn CSI driver. +2. Recovering the stuck pods forces a delete of Longhorn's CRDs (Volume / Engine / Replica) — a webhook circular dependency makes a clean shutdown impossible. +3. Reinstalling Longhorn recreates the Volume CRDs, but with **new engine IDs**. +4. Longhorn creates **new, empty** replica directories under the new engine IDs and **does not adopt** the old, data-bearing directories. + +The result: the real data sits in an orphaned `…-/` directory while Longhorn happily serves an empty `…-/`. 
Worse, a naive directory rename can backfire — Longhorn reconciliation may find a `Dirty: true` orphan alongside a clean empty replica and **silently rebuild from the empty one, destroying the data**. The proven safe path is the automated **block-device injection** (Method D): create a fresh volume, attach it in maintenance mode, and `rsync` the recovered, layer-merged image into the live device — never renaming the orphaned directories. The full method comparison, the `playbooks/recover/longhorn_data.yml` automation, and the prevention work (the backup playbook now captures Longhorn Volume CRDs for fast `kubectl apply` restore) are documented in the [Longhorn PVC recovery ADR](../../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md).
+
+> [!CAUTION]
+> Do **not** recover a Longhorn volume by renaming the orphaned replica directory to the new engine ID. Reconciliation can pick the empty replica as the rebuild source and overwrite your data. Use the block-device injection playbook instead.
+
+## The tested 2026-04-13 power-cut recovery
+
+The April 13, 2026 power cut was recovered end to end and the sequence was distilled into a deterministic startup order. The order is not arbitrary — each step is a **dependency gate** for the next:
+
+```mermaid
+%%{init: {'theme':'base'}}%%
+flowchart TD
+    PC["Power cut<br/>(cluster down, disks intact)"]:::dead
+
+    PC --> LH["1 · Restore Longhorn<br/>volumes (block-device<br/>injection if engine IDs changed)"]:::store
+    LH --> VU["2 · Unseal Vault<br/>(1 key, threshold 1,<br/>key on the Mac)"]:::proc
+    VU --> VSO["3 · VSO re-auth<br/>(k8s auth → fresh<br/>dynamic creds)"]:::proc
+    VSO --> ERP["4 · Scale up ERP<br/>last (depends on DB +<br/>injected secrets)"]:::src
+
+    classDef dead fill:#6b7280,stroke:#4b5563,color:#fff
+    classDef store fill:#7c3aed,stroke:#6d28d9,color:#fff
+    classDef proc fill:#059669,stroke:#047857,color:#fff
+    classDef src fill:#2563eb,stroke:#1e40af,color:#fff
+```
+
+1. **Restore Longhorn first.** Persistent volumes must be back and attachable before any stateful workload starts. If the engine IDs changed (the failure mode above), recover the data with the block-device injection playbook before proceeding. Nothing that mounts a PVC can come up until this is done.
+2. **Unseal Vault.** Vault restarts **sealed** and serves nothing until a human unseals it with the single key from `~/.arcodange/cluster-keys.json` (threshold 1). This is the secret-flow chokepoint — see [Secrets & Vault](secrets-and-vault.md). No secret consumer recovers before this step.
+3. **VSO re-authenticates.** Once Vault is unsealed, the Vault Secrets Operator re-auths over the Kubernetes auth backend and re-issues the dynamic credentials (notably fresh `postgres/creds/<app>` leases) that workloads need. Until VSO has re-populated the Kubernetes Secrets, apps would start with stale or missing credentials.
+4. **Scale up ERP last.** ERP is the most dependency-heavy app — it needs both the database (on pi2) reachable and its Vault-injected secrets present. Bringing it up only after steps 1–3 are confirmed avoids a crash-loop against a half-recovered platform.
+
+The record backing this drill — Longhorn restore, Vault unseal, VSO re-auth, ERP scaled up last, plus the 1-key/threshold-1 unseal detail — is CLUSTER_RECOVERY.md (kept at the lab root, outside this repo).
+
+## Why this is rehearsed in the sandbox
+
+A recovery procedure that has only been run once, under the stress of a real outage, is a liability.
The production-like sandbox exists partly so this exact sequence can be **rehearsed deliberately** — kill the cluster, lose the engine IDs on a test volume, force a sealed Vault, and walk the four-step order back to green — without risking production data or a live ERP. That makes the drill a routine QA exercise rather than a one-shot incident memory. The QA approach for these drills is laid out in the PRD's [QA strategy](../../PRD/safe-prod-like-environment/qa-strategy.md), and the overall decision to maintain such an environment is [ADR 0001](../../ADR/0001-safe-prod-like-environment.md). + +## See also + +- [Longhorn PVC recovery ADR](../../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md) — engine-ID failure mode, the five recovery methods, and the block-device injection automation. +- [Secrets & Vault](secrets-and-vault.md) — the unseal model and why it gates step 2 of the recovery order. +- [Factory brick](01-factory.md) — the Ansible recover/ playbooks, the ArgoCD app-of-apps, and the Postgres-on-pi2 foundation. +- [PRD — QA strategy](../../PRD/safe-prod-like-environment/qa-strategy.md) — how recovery drills become routine QA. +- [ADR 0001 — Safe, production-like environment](../../ADR/0001-safe-prod-like-environment.md). +- CLUSTER_RECOVERY.md — the tested power-cut recovery record (lab root, outside this repo). 
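The four-step startup order in this guidebook behaves like a chain of dependency gates. A minimal Python sketch of that gating logic follows; the step identifiers and the `next_step` function are assumptions made for illustration, and the real drill is driven by playbooks and manual steps rather than by this code.

```python
from typing import Iterable, Optional

# Fixed order taken from the guidebook; the identifiers are hypothetical.
RECOVERY_ORDER = (
    "longhorn_restored",    # 1. volumes back and attachable
    "vault_unsealed",       # 2. a human unseals Vault
    "vso_reauthenticated",  # 3. VSO re-issues dynamic credentials
    "erp_scaled_up",        # 4. ERP comes up last
)


def next_step(done: Iterable[str]) -> Optional[str]:
    """Return the next step the fixed order allows, or None when finished.

    Raises ValueError if an unknown step is reported, or if a later step
    is marked done while an earlier one is not (a skipped gate).
    """
    completed = set(done)
    unknown = completed - set(RECOVERY_ORDER)
    if unknown:
        raise ValueError(f"unknown steps: {sorted(unknown)}")
    for index, step in enumerate(RECOVERY_ORDER):
        if step not in completed:
            skipped_ahead = [s for s in RECOVERY_ORDER[index + 1:] if s in completed]
            if skipped_ahead:
                raise ValueError(
                    f"{skipped_ahead} marked done before {step!r}"
                )
            return step
    return None
```

A rehearsal script could call `next_step` after each confirmation and refuse to continue when it raises, which makes an out-of-order recovery visible immediately instead of surfacing later as a crash-looping ERP.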
diff --git a/vibe/investigations/INV-001-prod-blast-radius-couplings.md b/vibe/investigations/INV-001-prod-blast-radius-couplings.md new file mode 100644 index 0000000..b6d3520 --- /dev/null +++ b/vibe/investigations/INV-001-prod-blast-radius-couplings.md @@ -0,0 +1,206 @@ +[vibe](../README.md) > [Investigations](README.md) > **INV-001 · Prod blast-radius couplings** + +# INV-001: Prod blast-radius couplings + +> **Status**: Complete +> **Date**: 2026-06-23 +> **Priority**: 🔴 P1 +> **Related**: [ADR-0001 · Safe prod-like environment](../ADR/0001-safe-prod-like-environment.md) · [PRD · Isolation boundary](../PRD/safe-prod-like-environment/isolation-boundary.md) + +> [!NOTE] +> **Origin.** This investigation was spun out of the [safe-environment ADR](../ADR/0001-safe-prod-like-environment.md) and [PRD](../PRD/safe-prod-like-environment/README.md) design work, to enumerate exactly which prod couplings a sandbox must isolate before any sandbox run can be trusted not to mutate live production. + +## Objectives + +What this investigation set out to answer: + +- [x] Enumerate every place where the GitOps repos hardcode a **live prod endpoint, credential path, state location, or auto-reconciling controller**. +- [x] For each, cite the **concrete file and value** so the isolation control can be written against a known target. +- [x] State the **blast radius** of an accidental sandbox-targets-prod mistake per coupling. +- [x] Name the **sandbox control** that severs each coupling (cross-referenced to the PRD isolation boundary). +- [x] Classify severity so Phase 0 guardrails know what *must* land first. + +## Executive summary + +The lab is administered from a single MacBook holding kubeconfig, the Vault unseal key, and every cloud admin token; the repos point at live prod by default, so any sandbox seeded from those same repos will hit prod unless explicitly fenced off. 
The worst couplings are the **PostgreSQL superuser provider hardwired to `192.168.1.202`** (a wrong apply can `DROP`/`ALTER` the live ERP, CMS and other business DBs), the **Vault unseal key at a fixed path `~/.arcodange/cluster-keys.json`** (a botched sandbox init could overwrite the one key that unseals prod), and **ArgoCD app-of-apps with `prune: true` + `selfHeal: true`** pointed at `targetRevision: HEAD` of the live Gitea (an auto-reconcile can delete live resources fleet-wide). Secondary but real: the Ansible inventory targeting `192.168.1.201-203`, the single GCS state bucket `arcodange-tf` shared across all stacks, and the Cloudflare/OVH/Zoho tokens that control public `arcodange.fr` DNS and email. Each coupling has a clean sandbox control (separate inventory + prod-IP abort guard, separate Vault + unseal path, separate state prefix family, plan-only DNS). **None of these controls exist yet — Phase 0 must build them before the first sandbox run.** + +## Findings + +Each finding leads with a one-line **Brief**, then the concrete evidence (file + value), then the bold **Finding** with a severity tag. + +### 1. PostgreSQL superuser provider hardwired to the live host + +**Brief.** The Postgres OpenTofu stack connects as **superuser** straight to the production database host with no environment switch. + +Evidence — [`postgres/iac/providers.tf`](../../postgres/iac/providers.tf): + +```hcl +provider "postgresql" { + host = "192.168.1.202" + username = var.POSTGRES_USERNAME + password = var.POSTGRES_PASSWORD + sslmode = "disable" + superuser = true +} +``` + +The host is a literal — there is no `var.pg_host`, no workspace gate, no profile. This is the same PG instance (the docker-compose on pi2, outside k3s) that backs the Dolibarr ERP, the CMS, and every app DB created per the [`` convention](../../doc/runbooks/new-web-app/conventions.md). A superuser session here can `DROP DATABASE`, `ALTER ROLE`, or revoke logins on any of them. 
Running `tofu apply` from a sandbox shell that still has the prod state and creds wired would act on live data. + +**Finding:** Highest-risk coupling. A single wrong `apply` against `192.168.1.202` as superuser can cause **irreversible ERP/business data loss**. The sandbox PG must be the docker-compose on the sandbox "pi2-equivalent", and a guard must **refuse to apply when `host == 192.168.1.202` and `workspace != prod`**. 🔴 + +### 2. Vault address + unseal key at a fixed local path + +**Brief.** Every IaC stack authenticates to the one prod Vault, and the unseal key sits at a single hardcoded path that a sandbox init could clobber. + +Evidence — both [`iac/providers.tf`](../../iac/providers.tf) and [`postgres/iac/providers.tf`](../../postgres/iac/providers.tf) declare: + +```hcl +provider "vault" { + address = "https://vault.arcodange.lab" + auth_login_jwt { + mount = "gitea_jwt" + role = "gitea_cicd" + } +} +``` + +The prod unseal key (1 key, threshold 1) lives at `~/.arcodange/cluster-keys.json` — the single secret that brings prod Vault back after a seal (see the lab-root recovery runbook, named below). Vault is the **single source of truth** for all secrets, so a policy/auth/mount change against `vault.arcodange.lab`, or an init/operator step that writes to the default key path, has fleet-wide reach: VSO across every namespace re-reads from this Vault. + +**Finding:** Two failure modes. (a) Sandbox IaC pointed at `vault.arcodange.lab` can rewrite prod policies/auth and lock out VSO → fleet-wide secret outage. (b) A botched sandbox `vault operator init` writing to `~/.arcodange/cluster-keys.json` would **overwrite the prod unseal key**, making prod unrecoverable after the next seal. Sandbox needs a **separate Vault** and the unseal-key path overridden to `~/.arcodange/sandbox/cluster-keys.json`. 🔴 + +### 3. 
ArgoCD app-of-apps: live repoURL + HEAD + prune/selfHeal + +**Brief.** The app-of-apps template renders Applications that auto-reconcile against `HEAD` of the live Gitea, with pruning and self-heal on by default. + +Evidence — [`argocd/templates/apps.yaml`](../../argocd/templates/apps.yaml) loops `values.gitea_applications` into Application CRDs: + +```yaml +source: + repoURL: https://gitea.arcodange.lab/{{ $org }}/{{ $app_name }} + targetRevision: HEAD + path: chart +syncPolicy: + automated: + prune: true + selfHeal: true +``` + +[`argocd/values.yaml`](../../argocd/values.yaml) lists the live apps (`url-shortener`, `tools`, `webapp`, `telegram-gateway`, `erp`, `cms`, `dance-lessons-coach`); several add ArgoCD Image Updater annotations that chase `:latest` digests. `prune: true` means a resource that disappears from git is deleted from the cluster; `selfHeal: true` means manual changes are reverted. `targetRevision: HEAD` means the live cluster follows whatever lands on the default branch. + +> [!NOTE] +> ArgoCD itself is not currently deployed in-cluster (it is commented out in `03_cicd`), so this controller is **latent** today. The coupling matters the moment ArgoCD is enabled — and sandbox work to enable/iterate on ArgoCD is exactly when it would bite. + +**Finding:** When ArgoCD is live, a sandbox change to the app-of-apps (or an Image Updater misfire) reconciled against the prod `repoURL`/`HEAD` can **prune live resources fleet-wide**. Sandbox needs its own ArgoCD pointed at a **sandbox branch/Gitea** so it only syncs sandbox refs. 🟠 + +### 4. Ansible inventory targets the live Pi IPs + +**Brief.** The only inventory targets the three production Raspberry Pis directly; a stray playbook run hits prod hardware. + +Evidence — [`ansible/arcodange/factory/inventory/hosts.yml`](../../ansible/arcodange/factory/inventory/hosts.yml): + +```yaml +raspberries: + hosts: + pi1: { preferred_ip: 192.168.1.201, ... } + pi2: { preferred_ip: 192.168.1.202, ... 
} + pi3: { preferred_ip: 192.168.1.203, ... } +postgres: + hosts: { pi2: } +``` + +The numbered playbooks (`01_system` … `05_backup`) and the `recover/` plays operate on these hosts; `01_system`-class roles can wipe disks, re-init k3s, or disturb Longhorn replicas. There is **no sandbox inventory and no guard** — `ansible-playbook` defaults straight at prod. + +**Finding:** A misdirected playbook against `192.168.1.201-203` can **wipe disks / reset k3s / corrupt Longhorn** on prod. Sandbox needs a separate `inventory/sandbox/hosts.yml` (VM hosts only) **plus a pre-task that aborts if any target IP is in `192.168.1.201-203` unless `i_mean_prod=true`**. 🔴 + +### 5. Single GCS state bucket shared by every stack + +**Brief.** All OpenTofu stacks store state in one bucket, separated only by `prefix`; a wrong backend config writes prod state. + +Evidence — [`iac/backend.tf`](../../iac/backend.tf) uses `bucket = "arcodange-tf"`, `prefix = "factory/main"`; [`postgres/iac/backend.tf`](../../postgres/iac/backend.tf) uses the same bucket with `prefix = "factory/postgres"`. Sibling stacks (`tools`, `cms`) follow the same bucket-with-prefix pattern. State is the authoritative map of real resources — a sandbox run that inherits the prod backend will read prod state, plan against prod resources, and on `apply` mutate them. + +> [!TIP] +> Note the name collision risk: [`iac/cloudflare.tf`](../../iac/cloudflare.tf) also creates a Cloudflare **R2** bucket literally named `arcodange-tf`. The GCS state bucket and the R2 object bucket share a name but are different stores; do not conflate them when scoping sandbox state. + +**Finding:** Without isolation, sandbox `tofu` reads/writes **prod state** in `arcodange-tf`. Sandbox needs a **sandbox prefix family** (`sandbox/factory/main`, `sandbox/factory/postgres`, …) via a backend-config override, or a separate bucket `arcodange-tf-sandbox`. 🟠 + +### 6. 
Cloudflare account / OVH arcodange.fr / Zoho — live public DNS & email + +**Brief.** IaC holds tokens that manage the public `arcodange.fr` zone, the OVH registrar nameservers, and the Zoho mail records; a wrong record silently breaks company email. + +Evidence — [`iac/providers.tf`](../../iac/providers.tf) declares `provider "cloudflare" {}` (token via `CLOUDFLARE_API_TOKEN`) and `provider "ovh" { endpoint = "ovh-eu" }`. [`iac/cloudflare.tf`](../../iac/cloudflare.tf) resolves the account by `arcodange@gmail.com` and a `cf_arcodange_cms_token` granting `zone:DNS Write`, `Cloudflare Tunnel Write`, `Turnstile Sites Write`, etc. [`iac/ovh.tf`](../../iac/ovh.tf) grants `domain:apiovh:nameServer/edit` on `urn:v1:eu:resource:domain:arcodange.fr`. The Zoho mail wiring (MX/SPF/DKIM/DMARC/BIMI + aliases) lives in the sibling `cms` repo's `zoho/` and the `arcodange.fr` zone is managed at Cloudflare. A bad MX/SPF/DKIM record breaks `arcodange.fr` mail **silently, for days**. + +**Finding:** These are **public, customer-facing, and slow-to-detect**. The blast radius is broken company email and public site/tunnel exposure. Sandbox must run these modules **plan-only against a throwaway zone/subdomain with a separate token**; the real `arcodange.fr` token must **never** be exported into a sandbox shell. Real public DNS/ACME end-to-end is out of scope. 🟠 + +### 7. Longhorn backup target points at the prod backup bucket + +**Brief.** Longhorn's backup target is a fixed S3 bucket; a sandbox restore drill could overwrite prod backups. + +Evidence — [`argocd/templates/longhorn_backup_target.yaml`](../../argocd/templates/longhorn_backup_target.yaml) sets: + +```yaml +"backup-target": "s3://arcodange-backup@us-east-1/" +"backup-target-credential-secret": "longhorn-gcs-backup-credentials" +``` + +The credentials are injected by VSO from Vault path `kvv2 longhorn/gcs-backup`. 
Recovery drills are a core sandbox use-case (re-run `recover/longhorn*.yml`, validate engine-ID re-association per the [Longhorn PVC recovery ADR](../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md)) — but a drill that writes to `arcodange-backup` could clobber the real restore points. + +**Finding:** Sandbox restore drills against `s3://arcodange-backup` risk **overwriting prod backups** — the worst kind of failure during a recovery rehearsal. Sandbox backup target must be a **separate bucket/prefix**. 🟠 + +### 8. Gitea base_url + the restricted CI module-reader user + +**Brief.** The Gitea provider and a created CI user both bind to the live Gitea; sandbox IaC can mutate prod repos, secrets, and users. + +Evidence — `provider "gitea" { base_url = "https://gitea.arcodange.lab" }` in [`iac/providers.tf`](../../iac/providers.tf). [`iac/gitea_tofu_ci_user.tf`](../../iac/gitea_tofu_ci_user.tf) creates the `tofu_module_reader` user, an SSH key, and stores it in Vault `kvv1/gitea/tofu_module_reader`; [`iac/cloudflare.tf`](../../iac/cloudflare.tf) and [`iac/ovh.tf`](../../iac/ovh.tf) push repo actions secrets (`CLOUDFLARE_API_TOKEN`, `OVH_CLIENT_ID`, …) onto the live `cms` repo. A sandbox apply against this provider rewrites prod repo secrets and CI users. + +**Finding:** Sandbox IaC pointed at `gitea.arcodange.lab` can **mutate prod repo CI secrets and users**, indirectly poisoning prod CI. Sandbox needs its own Gitea (sandbox cluster or org `arcodange-sandbox`); ArgoCD app-of-apps then points at sandbox refs (see finding 3). 
🟠
+
+## Summary table
+
+| Coupling | Where (file · value) | Blast radius | Sandbox control |
+| --- | --- | --- | --- |
+| PG superuser provider | `postgres/iac/providers.tf` · `host = 192.168.1.202`, `superuser = true` | Drop/alter live ERP + app DBs → irreversible data loss | Sandbox PG (docker-compose on sandbox pi2-eq); guard refuses apply if `host == 192.168.1.202 && workspace != prod` |
+| Vault address + unseal key | `iac/providers.tf` / `postgres/iac/providers.tf` · `vault.arcodange.lab` · key `~/.arcodange/cluster-keys.json` | Lock out VSO fleet-wide; overwrite prod unseal key → prod unrecoverable | Separate sandbox Vault; unseal path → `~/.arcodange/sandbox/cluster-keys.json` |
+| ArgoCD app-of-apps | `argocd/templates/apps.yaml` · `repoURL gitea.arcodange.lab/...`, `HEAD`, `prune+selfHeal` | Auto-prune live resources fleet-wide (latent until ArgoCD deployed) | Sandbox ArgoCD → sandbox branch/Gitea; sync only sandbox refs |
+| Ansible inventory | `ansible/.../inventory/hosts.yml` · `192.168.1.201-203` | Wipe disks / reset k3s / corrupt Longhorn on prod Pis | `inventory/sandbox/hosts.yml` (VMs only) + prod-IP abort guard unless `i_mean_prod=true` |
+| GCS state bucket | `iac/backend.tf` / `postgres/iac/backend.tf` · `bucket = arcodange-tf` | Read/write prod state → plan & apply mutate prod resources | `sandbox/...` prefix family or `arcodange-tf-sandbox` bucket via backend override |
+| Cloudflare / OVH / Zoho | `iac/providers.tf`, `iac/cloudflare.tf`, `iac/ovh.tf` (+ cms `zoho/`) · `arcodange.fr`, account `arcodange@gmail.com` | Break public DNS / company email silently for days | Plan-only against throwaway zone + separate token; never export the real `arcodange.fr` token into sandbox |
+| Longhorn backup target | `argocd/templates/longhorn_backup_target.yaml` · `s3://arcodange-backup@us-east-1/` | Restore drill overwrites prod backups | Separate sandbox backup bucket/prefix |
+| Gitea base_url + CI user | `iac/providers.tf` · `gitea.arcodange.lab` · `iac/gitea_tofu_ci_user.tf` | Rewrite prod repo CI secrets/users → poison prod CI | Sandbox Gitea (cluster or org `arcodange-sandbox`) |
+
+## Classification
+
+| # | Coupling | Severity |
+| --- | --- | --- |
+| 1 | PG superuser provider → `192.168.1.202` | 🔴 Critical — confirmed irreversible data-loss path |
+| 2 | Vault address + unseal key path | 🔴 Critical — prod-secret outage and unrecoverable-unseal path |
+| 4 | Ansible inventory → live Pi IPs | 🔴 Critical — disk-wipe / cluster-reset on prod hardware |
+| 3 | ArgoCD app-of-apps prune/selfHeal | 🟠 Significant — fleet-wide prune; latent until ArgoCD is deployed |
+| 5 | Shared GCS state bucket | 🟠 Significant — prod state mutation if backend not overridden |
+| 6 | Cloudflare / OVH / Zoho public DNS & email | 🟠 Significant — public, customer-facing, slow to detect |
+| 7 | Longhorn backup target | 🟠 Significant — prod backup overwrite during drills |
+| 8 | Gitea base_url + CI user | 🟠 Significant — prod CI-secret/user mutation |
+
+| Emoji | Meaning |
+| --- | --- |
+| 🔴 | Critical — data-loss or breaking risk confirmed |
+| 🟠 | Significant — degraded but working, or a real coupling to watch |
+| 🟡 | Minor — worth noting, low impact |
+| 🟢 | Healthy — verified safe / no issue found |
+| 🔵 | Informational — context, no action implied |
+
+## Open questions
+
+- **Guard enforcement layer.** Should the prod-IP abort live as an Ansible pre-task only, or also as a wrapper script around `tofu`/`ansible-playbook` so the same fence covers both tools uniformly? (Phase 0 decision.)
+- **Vault path discipline.** Beyond the unseal-key path, are there other tools (backup scripts, recovery runbook steps) that read a hardcoded `~/.arcodange/cluster-keys.json`? A grep sweep of the lab-root scripts is needed so the sandbox override is complete, not partial.
+- **State backend override ergonomics.** Prefix family vs.
a separate `arcodange-tf-sandbox` bucket — the bucket option is harder to misconfigure (no shared blast radius at all) but adds a provisioning step. Decide in the PRD isolation boundary. +- **ArgoCD readiness.** Since ArgoCD is currently latent (commented in `03_cicd`), confirm whether enabling it should happen *first* in the sandbox (so its prune behaviour is rehearsed before it ever reaches prod). +- **Throwaway DNS zone choice.** Which subdomain/zone and token scope are acceptable for plan-only Cloudflare/OVH tests without touching `arcodange.fr`? + +## References + +- [ADR-0001 · Safe prod-like environment](../ADR/0001-safe-prod-like-environment.md) — the decision this investigation supports. +- [PRD · Safe prod-like environment (hub)](../PRD/safe-prod-like-environment/README.md) and its [Isolation boundary leaf](../PRD/safe-prod-like-environment/isolation-boundary.md) — where each coupling's control is specified in full. +- [Lab ecosystem guidebook](../guidebooks/lab-ecosystem/README.md) — background on the prod topology and the `` join key. +- [New-web-app conventions (`` join key)](../../doc/runbooks/new-web-app/conventions.md) — why one identifier keys repo, DB, Vault, namespace, and DNS together. +- [Longhorn PVC recovery ADR](../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md) — engine-ID re-association relevant to the backup-target and restore-drill coupling. +- `CLUSTER_RECOVERY.md` (at the lab root, outside this repo) — the tested power-cut recovery runbook; the unseal-key path and Vault-seal recovery referenced in findings 2 and 7 come from it. 
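Findings 1 and 4 each name a guard that does not exist yet. The Python sketch below shows one possible shape for both checks, for example inside the wrapper script raised under Open questions. The function names, the `ProdTargetError` class, and the way `workspace` is passed in are all hypothetical; only the IP addresses and the `i_mean_prod` override come from the investigation.

```python
import ipaddress
from typing import Iterable, List

# Production addresses cited in findings 1 and 4.
PROD_PI_IPS = {ipaddress.ip_address(f"192.168.1.{n}") for n in (201, 202, 203)}
PROD_PG_HOST = "192.168.1.202"


class ProdTargetError(RuntimeError):
    """Raised when a run would touch production without explicit intent."""


def check_inventory_targets(target_ips: Iterable[str], i_mean_prod: bool = False) -> List[str]:
    """Abort if any target is a prod Pi IP, unless i_mean_prod is set.

    Returns the list of prod IPs found, so a caller can log them when the
    override is used deliberately.
    """
    hits = sorted(
        str(ip)
        for ip in (ipaddress.ip_address(raw) for raw in target_ips)
        if ip in PROD_PI_IPS
    )
    if hits and not i_mean_prod:
        raise ProdTargetError(f"refusing to run against prod hosts: {hits}")
    return hits


def check_pg_apply(host: str, workspace: str) -> None:
    """Refuse an apply that targets the prod PG host from a non-prod workspace."""
    if host == PROD_PG_HOST and workspace != "prod":
        raise ProdTargetError(
            f"host {host} is production but workspace is {workspace!r}"
        )
```

Keeping both checks in one module is a design choice that speaks to the "guard enforcement layer" question: a single fence could then be called from an Ansible pre-task and from a `tofu` wrapper alike, though whether that is the right layer remains a Phase 0 decision.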
diff --git a/vibe/investigations/README.md b/vibe/investigations/README.md new file mode 100644 index 0000000..0387c27 --- /dev/null +++ b/vibe/investigations/README.md @@ -0,0 +1,45 @@ +[vibe](../README.md) > **Investigations** + +# Investigations + +> **Status**: 🟢 Active +> **Last Updated**: 2026-06-23 +> **Related**: [vibe/ADR](../ADR/README.md) · [vibe/PRD](../PRD/README.md) + +`vibe/investigations/` collects focused inquiries: a question is asked, evidence is gathered, and findings are recorded. Investigations feed ADRs (a decision often rests on an investigation) and PRDs (scoping rests on what we learned). Start from [`_template.md`](_template.md). + +## Convention + +- **Prefer a single numbered file** named `INV-NNN-slug.md`. Most investigations need nothing more. +- **When notebooks or data are involved**, keep `INV-NNN-slug.md` as a short **stub** (an origin note + a link) sitting *beside* a same-named folder `INV-NNN-slug/` that holds: + - notebooks — each `.ipynb` paired with an exported `.py` (so diffs and review work on plain text), + - a `_data/` directory for inputs/outputs, + - a `notebook_simple.md` — a plain-language walkthrough (visuals + explanations anyone can follow, no code required). +- **Diagonal-reading style.** Each finding section is written so a skimmer gets the point fast: lead with a one-line **Brief**, then the **evidence**, then a bold **Finding**. A reader can scan the Briefs and Findings alone and still understand the conclusion. +- **No-tombstone rule** applies: write findings as currently true. Corrections are made in place; git history is the audit trail. 
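As a rough illustration of the naming convention above, the following sketch picks the next free `INV-NNN` number and lays out the optional notebook folder beside its stub. The helper names `next_inv_id` and `scaffold_inv_folder` are hypothetical and not part of the repo; the sketch assumes the three-digit `INV-NNN-slug.md` pattern described in the list.

```bash
#!/usr/bin/env bash
# Hypothetical helpers illustrating the INV-NNN-slug layout; not repo tooling.

# next_inv_id DIR -> prints the next free id, e.g. INV-002.
next_inv_id() {
  local dir="$1" max=0 f n
  for f in "$dir"/INV-[0-9][0-9][0-9]-*.md; do
    [ -e "$f" ] || continue   # the glob matched nothing
    n="${f##*/INV-}"          # drop the directory part and the "INV-" prefix
    n="${n%%-*}"              # keep only the NNN digits
    n=$((10#$n))              # force base 10 so 008 is not parsed as octal
    if [ "$n" -gt "$max" ]; then max="$n"; fi
  done
  printf 'INV-%03d\n' "$((max + 1))"
}

# scaffold_inv_folder DIR ID SLUG -> creates the stub plus the same-named folder.
scaffold_inv_folder() {
  local base="$1/$2-$3"
  mkdir -p "$base/_data"                        # inputs/outputs directory
  touch "$base/notebook_simple.md" "$base.md"   # walkthrough + stub (never truncated)
}
```

The glob deliberately ignores `README.md` and `_template.md`, so only numbered investigations count toward the sequence. Pairing each `.ipynb` with its exported `.py` is left to the notebook tooling and is not covered here.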
+ +## Classification legend + +Use these on findings and in the Classification table to signal severity / nature: + +| Emoji | Meaning | +| --- | --- | +| 🔴 | Critical — data-loss or breaking risk confirmed | +| 🟠 | Significant — degraded but working, or a real coupling to watch | +| 🟡 | Minor — worth noting, low impact | +| 🟢 | Healthy — verified safe / no issue found | +| 🔵 | Informational — context, no action implied | + +## Index + +| # | Title | Status | Date | +| --- | --- | --- | --- | +| [INV-001](INV-001-prod-blast-radius-couplings.md) | Prod blast-radius couplings | ✅ Complete | 2026-06-23 | + +## Rules to contribute + +1. Copy [`_template.md`](_template.md) to `INV-NNN-slug.md` using the next free sequence number and delete the top HTML-comment note. +2. Fill in the blockquote (Status / Date / Priority / Related), Objectives, Executive summary, Findings (Brief → evidence → Finding each), Summary table, Classification, Open questions, References. +3. If the investigation needs notebooks or data, convert the file into a stub and create the matching `INV-NNN-slug/` folder with paired `.ipynb`/`.py`, `_data/`, and `notebook_simple.md`. +4. Add a row to the Index table above. +5. Cross-link any ADR or PRD this investigation informs (and have them link back). Bidirectional links are mandatory. diff --git a/vibe/investigations/_template.md b/vibe/investigations/_template.md new file mode 100644 index 0000000..df6bfb1 --- /dev/null +++ b/vibe/investigations/_template.md @@ -0,0 +1,66 @@ +[vibe](../README.md) > [Investigations](README.md) > **_template** + + + +# INV-NNN: Title + +> **Status**: In progress | Complete | Blocked +> **Date**: YYYY-MM-DD +> **Priority**: 🔴 High | 🟠 Medium | 🟡 Low +> **Related**: ADR-NNNN · PRD · upstream/downstream links + +## Objectives + +What this investigation set out to answer. Use checkboxes so progress is visible at a glance. 
+ +- [ ] Question or goal 1 +- [ ] Question or goal 2 + +## Executive summary + +Two or three sentences a busy reader can absorb without scrolling: what was investigated, what was found, and what (if anything) to do about it. + +## Findings + +Write each finding for diagonal reading: lead with the Brief, then evidence, then the bold Finding. + +### 1. Finding title + +**Brief.** One line stating what this section is about. + +Evidence — the data, logs, code paths, commands, or reasoning that support the conclusion. + +**Finding:** the bold takeaway. Tag with a classification emoji (🔴 / 🟠 / 🟡 / 🟢 / 🔵). + +### 2. Finding title + +**Brief.** ... + +Evidence ... + +**Finding:** ... + +## Summary table + +| # | Finding | Classification | Action | +| --- | --- | --- | --- | +| 1 | ... | 🟠 | ... | +| 2 | ... | 🟢 | ... | + +## Classification + +| Emoji | Meaning | +| --- | --- | +| 🔴 | Critical — data-loss or breaking risk confirmed | +| 🟠 | Significant — degraded but working, or a real coupling to watch | +| 🟡 | Minor — worth noting, low impact | +| 🟢 | Healthy — verified safe / no issue found | +| 🔵 | Informational — context, no action implied | + +## Open questions + +- Anything left unresolved, deferred, or needing a follow-up investigation. + +## References + +- Commands, files, PRs, ADRs, PRDs, or external docs consulted (descriptive link text, never "here"/"this"). 
diff --git a/vibe/runbooks/README.md b/vibe/runbooks/README.md new file mode 100644 index 0000000..3286606 --- /dev/null +++ b/vibe/runbooks/README.md @@ -0,0 +1,60 @@ +[vibe](../README.md) > **Runbooks** + +# Runbooks + +> **Status:** Active (conventions + template only — first concrete runbook lands with PRD Phase 1) +> **Last Updated:** 2026-06-23 +> **Related:** [vibe guidebooks](../guidebooks/README.md) · [vibe shareouts](../shareouts/README.md) · [FRENCH human runbooks under doc/runbooks](../../doc/runbooks/README.md) + +## What lives here + +`vibe/runbooks/` holds **agent-oriented operational runbooks, written in English** (this tree is for LLM agents). Each runbook is an ordered procedure where every step is tagged with an actor marker: + +- **`[AGENT]`** — read-only or otherwise safe steps an agent may execute autonomously (inspecting state, dry-runs, generating files, running tests in a sandbox). +- **`[HUMAN]`** — production-mutating steps that require **explicit human approval** before they run (anything that writes to live infrastructure, deletes data, or changes the trunk). + +The marker is load-bearing: it tells an agent reading the runbook exactly where its autonomy ends and where it must stop and hand control back to a human. + +```mermaid +%%{init: {'theme': 'base'}}%% +flowchart LR + classDef agent fill:#059669,stroke:#047857,color:#fff + classDef human fill:#dc2626,stroke:#b91c1c,color:#fff + classDef gate fill:#7c3aed,stroke:#6d28d9,color:#fff + A["[AGENT] safe steps<br/>(inspect, dry-run, generate)"]:::agent --> G{"approval<br/>gate"}:::gate --> H["[HUMAN] prod-mutating steps<br/>(explicit approval required)"]:::human +``` + +1. An agent executes the `[AGENT]`-tagged steps on its own — these only read state or act inside a sandbox. +2. When the procedure reaches a prod-mutating step, the agent stops at an approval gate. +3. A human reviews and approves; only then do the `[HUMAN]`-tagged steps run against live infrastructure. + +## Not the same as `doc/runbooks` + +> [!IMPORTANT] +> There are **two** runbook collections in this lab, and they serve different readers — do not merge them. +> +> | Collection | Reader | Language | Step markers | +> |---|---|---|---| +> | **`vibe/runbooks/`** (this folder) | LLM agents | English | `[AGENT]` / `[HUMAN]` | +> | **[`doc/runbooks/`](../../doc/runbooks/README.md)** | Human operators | French | prose procedures | +> +> The canonical, human-facing operator procedures (e.g. [Nouvelle application web](../../doc/runbooks/new-web-app/README.md)) live in French under `doc/runbooks/`. This folder is the agent-facing mirror: same operational reality, written so an autonomous agent can execute the safe parts and gate the dangerous ones. + +## Index + +| Runbook | Summary | Status | +|---|---|---| +| [_template](_template.md) | Skeleton for new agent-oriented runbooks (`[AGENT]`/`[HUMAN]` markers, copy-paste commands, verification + rollback) | ✅ Active | + +> [!NOTE] +> The first **concrete** runbook — a local sandbox game-day for the safe prod-like environment — ships with **PRD Phase 1** ([safe-prod-like-environment PRD](../PRD/safe-prod-like-environment/README.md)). Until then this folder holds the conventions and the template only. + +## Rules to contribute + +1. **Start from [`_template.md`](_template.md).** Copy it, rename to `kebab-case.md`, fill every section, then add a row to the index table above. +2. **Tag every procedure step** `[AGENT]` or `[HUMAN]`. When in doubt, tag it `[HUMAN]` — over-gating is safe, under-gating is not. +3. 
**Use the `tree-docs` skill** and keep the breadcrumb spine: first line is the breadcrumb trail, ancestors as relative links, current page bold-unlinked, separator ` > `. +4. **README hub stays current** — every new runbook gets an index row here with a one-line summary and status. +5. **Bidirectional links.** If a runbook references a guidebook, ADR, or the French operator runbook, link back from there too. Use descriptive link text. +6. **Commands are copy-paste ready** — put them in fenced ```bash blocks, with the `[HUMAN]`/`[AGENT]` marker on the step that owns them. +7. **Status legend.** ✅ done · 🟡 beta · 🔴 critical · ⚠️ known issue · ❌ disabled · ⬜ not started. diff --git a/vibe/runbooks/_template.md b/vibe/runbooks/_template.md new file mode 100644 index 0000000..d959e28 --- /dev/null +++ b/vibe/runbooks/_template.md @@ -0,0 +1,80 @@ + + +[vibe](../README.md) > [Runbooks](README.md) > **_template** + +# + +> **Status:** ⬜ Not started +> **Audience:** LLM agents (English). For the human-operator equivalent see the French [doc/runbooks](../../doc/runbooks/README.md). +> **Last Updated:** 2026-06-23 + +## TL;DR + +> [!TIP] +> + +## Scope + +` or environment in play.> + +## Preconditions + + + +- [ ] Working in a worktree under `.claude/worktrees//` (never the trunk). +- [ ] Access to confirmed. +- [ ] . + +## Procedure + + + +1. **[AGENT]** + + ```bash + # read-only example + kubectl --context get pods -n + ``` + +2. **[AGENT]** + + ```bash + # safe generation / sandbox example + tofu -chdir= plan + ``` + +3. **[HUMAN]** + + ```bash + # prod-mutating example — only after approval + tofu -chdir= apply + ``` + +4. 
**[HUMAN]** + +## Verification + + + +```bash +# verification example +kubectl --context get application -n argocd -o jsonpath='{.status.sync.status}' +# expected: Synced +``` + +## Rollback + + + +## References + +- +- +- diff --git a/vibe/shareouts/2026-06-23-vibe-and-safe-env/README.md b/vibe/shareouts/2026-06-23-vibe-and-safe-env/README.md new file mode 100644 index 0000000..06134eb --- /dev/null +++ b/vibe/shareouts/2026-06-23-vibe-and-safe-env/README.md @@ -0,0 +1,54 @@ +[vibe](../../README.md) > [Shareouts](../README.md) > **2026-06-23 · Vibe & environnement sûr** + +# Vibe & environnement sûr — ce qu'on a posé le 2026-06-23 + +> **Date :** 2026-06-23 · **Statut :** Actif · **Audience :** humains du lab Arcodange (lecteurs non spécialistes bienvenus) + +> [!TIP] +> **TL;DR** +> - On a créé `vibe/`, la **base de connaissances pour agents IA**, et un `AGENTS.md` qui sert de carte de l'écosystème et de règles du jeu. +> - On a écrit le **premier ADR** (la décision) et la **première PRD** (la spécification) d'un **environnement sûr de type production, en LOCAL uniquement**. +> - On a ajouté un **guidebook de l'écosystème** : la carte « qui parle à qui » entre les dépôts `factory`, `tools` et `cms`. + +## Ce qui a été fait + +- **Le dossier `vibe/`** — un tronc de documentation pensé pour les agents IA qui travaillent sur le lab. Il range le *pourquoi* (ADR), le *quoi/quand* (PRD), le *ce-qu'on-a-trouvé* (investigations), le *comment-ça-s'emboîte* (guidebooks), le *comment-faire* (runbooks) et le *ce-qu'on-a-dit-aux-humains* (shareouts, comme cette page). +- **Un `AGENTS.md`** — la carte de l'écosystème et le règlement : conventions de nommage, gestion des secrets, style de documentation, règles de branches et de PR. Un agent qui débarque le lit en premier. +- **Le premier ADR** — [« Safe, production-like environment »](../../ADR/0001-safe-prod-like-environment.md) : la décision actée, figée comme un fait historique. 
+- **La première PRD** — [« Safe, production-like environment »](../../PRD/safe-prod-like-environment/README.md) : la spécification détaillée (problème, objectifs, périmètre, critères de succès, stratégie de tests). +- **Un guidebook de l'écosystème** — [Lab ecosystem](../../guidebooks/lab-ecosystem/README.md) : la carte de bout en bout de `factory` + `tools` + `cms`, et de la convention `` qui les relie. + +## Pourquoi ça compte + +Aujourd'hui, tester une modification revient trop souvent à la tester **directement sur ce qui tourne pour de vrai**. Or « pour de vrai » ici, ce sont des services qui comptent : + +- le **mail `arcodange.fr`** (Zoho : MX, SPF, DKIM, DMARC, alias) — une fausse manip et des courriels se perdent ; +- le **CMS** (le site public `arcodange.fr`) ; +- l'**ERP** et les autres applications du lab. + +Sans filet, chaque essai est un pari sur la production. L'idée posée ce jour-là : se donner un **bac à sable fidèle à la prod** où l'on peut casser, recommencer et valider **sans jamais toucher** aux services réels. + +## Décision clé + +Un **environnement local et reproductible**, isolé de la production : + +- un cluster **k3d** local, alimenté par **3 VMs arm64** qui rejouent la topologie des 3 Raspberry Pi ; +- une **frontière d'isolation stricte** : l'environnement de test ne peut ni lire ni écrire dans la prod (pas d'accès au mail réel, au DNS public, aux bases de production) ; +- un **périmètre volontairement limité au logiciel** : les niveaux **matériel** (les Pi physiques) et **cloud** (Cloudflare, OVH, GCS) restent **hors périmètre** pour cette première itération. + +Autrement dit : on reproduit la *pile applicative*, pas la *quincaillerie* ni les *comptes cloud*. C'est le bon compromis pour tester vite et sans risque. 
+ +## Liens + +- ADR — [Safe, production-like environment](../../ADR/0001-safe-prod-like-environment.md) +- PRD — [Safe, production-like environment](../../PRD/safe-prod-like-environment/README.md) +- Guidebook — [Lab ecosystem](../../guidebooks/lab-ecosystem/README.md) + +## Pour aller plus loin + +Ce dossier peut accueillir, plus tard, des supports complémentaires : + +- un **deck** de présentation des slides ; +- un **mp4** narré — capture `ffmpeg`/Playwright, ou un explainer produit avec le skill `documentary-video`. + +Il suffira de les déposer ici, à côté de ce `README.md`, et d'ajouter une ligne dans l'[index des shareouts](../README.md). diff --git a/vibe/shareouts/README.md b/vibe/shareouts/README.md new file mode 100644 index 0000000..1570ee4 --- /dev/null +++ b/vibe/shareouts/README.md @@ -0,0 +1,56 @@ +[vibe](../README.md) > **Shareouts** + +# Shareouts + +> **Status:** Active +> **Last Updated:** 2026-06-23 +> **Related:** [vibe guidebooks](../guidebooks/README.md) · [vibe runbooks](../runbooks/README.md) + +## What a shareout is + +A **shareout** captures something substantial that was **done or seriously considered** — the outcome of an ADR, a PRD, or an investigation — and packages it as a **human-facing handout** you can hand to a person to explain what happened and why it matters. + +Where guidebooks are the standing reference map and runbooks are the procedures, a shareout is a moment-in-time communication artifact: "here is the thing we built / decided / explored, told for humans." + +```mermaid +%%{init: {'theme': 'base'}}%% +flowchart LR + classDef src fill:#2563eb,stroke:#1e40af,color:#fff + classDef proc fill:#059669,stroke:#047857,color:#fff + classDef store fill:#7c3aed,stroke:#6d28d9,color:#fff + WORK["Substantial work<br/>(ADR / PRD / investigation)"]:::src --> SO["Shareout<br/>(dated handout folder)"]:::proc --> HUM["Humans<br/>(read .md, deck, watch .mp4)"]:::store +``` + +1. Some substantial work lands — an ADR is decided, a PRD phase ships, an investigation concludes. +2. It is packaged into a dated shareout folder holding the human-facing handouts. +3. Humans consume the handout: a Markdown summary, a slide deck, and optionally a narrated video. + +## Convention: one dated subfolder per shareout + +> [!IMPORTANT] +> Each shareout lives in **its own dated subfolder** named `YYYY-MM-DD-slug/`. The folder holds the human-facing handouts for that shareout: `.md` summaries, decks, and an optional `.mp4`. + +The handouts inside a shareout folder can include: + +- **`.md` summaries** — the written walkthrough (the folder's `README.md` is the front door). +- **Decks** — slides for presenting the work. +- **Optional `.mp4`** — `ffmpeg` compilations of screenshots, [Playwright](../../ansible/arcodange/factory/README.md) captures, or a narrated explainer produced with the `documentary-video` skill. + +> [!NOTE] +> **Handouts are written in FRENCH.** The audience is humans, and the lab's human-facing audience is French-speaking. (This is the deliberate exception to the otherwise-English `vibe/` tree, which exists for LLM agents.) + +## Index + +| Shareout | What it covers | Status | +|---|---|---| +| [2026-06-23 · Vibe & safe-env](2026-06-23-vibe-and-safe-env/README.md) | Mise en place du tronc `vibe/` (guidebooks, runbooks, shareouts) et le PRD de l'environnement « safe prod-like » | ✅ Active | + +## Rules to contribute + +1. **One dated folder per shareout** — `YYYY-MM-DD-slug/`, kebab-case slug, ISO date. Never dump loose files at this level. +2. **Folder `README.md` is the front door** — it carries the breadcrumb, a TL;DR, and links to every handout in the folder (deck, video, sub-summaries). +3. **Handouts in French** — the audience is human. (Reminder: the rest of `vibe/` is English for agents.) +4. 
**Add an index row here** for every new shareout, with a one-line French summary and status, newest at the top. +5. **Use the `tree-docs` skill** for the breadcrumb spine and the per-folder README hub; use the `documentary-video` skill when a shareout warrants a narrated `.mp4`. +6. **Bidirectional links** — if a shareout points to an ADR/PRD/guidebook, link back from there. Descriptive link text only. +7. **Status legend.** ✅ done · 🟡 beta · 🔴 critical · ⚠️ known issue · ❌ disabled · ⬜ not started. -- 2.49.1 From b886f068242fdd25c8f46e0f725e6d6c801f087c Mon Sep 17 00:00:00 2001 From: Gabriel Radureau Date: Tue, 23 Jun 2026 11:53:39 +0200 Subject: [PATCH 2/9] docs(vibe): backfill PR #10 crosslink into ADR-0001 + PRD STATUS Co-Authored-By: Claude Opus 4.8 --- vibe/ADR/0001-safe-prod-like-environment.md | 2 +- vibe/PRD/safe-prod-like-environment/STATUS.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/vibe/ADR/0001-safe-prod-like-environment.md b/vibe/ADR/0001-safe-prod-like-environment.md index 1d587db..b7c5f05 100644 --- a/vibe/ADR/0001-safe-prod-like-environment.md +++ b/vibe/ADR/0001-safe-prod-like-environment.md @@ -81,4 +81,4 @@ See the [QA strategy leaf](../PRD/safe-prod-like-environment/qa-strategy.md) for - [Longhorn PVC recovery ADR](../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md) — engine-ID re-association, exercised by the Longhorn chaos drill. - [new-web-app conventions](../../doc/runbooks/new-web-app/conventions.md) — the `` convention reused identically across sandbox and prod. - CLUSTER_RECOVERY.md — the tested power-cut recovery sequence, lives at the lab root (outside this repo); the chaos drills rehearse it section by section. -- PRs: to be backfilled on PR open. +- PRs: [#10 — bootstrap vibe/ tree + ecosystem AGENTS.md](https://gitea.arcodange.lab/arcodange-org/factory/pulls/10). 
diff --git a/vibe/PRD/safe-prod-like-environment/STATUS.md b/vibe/PRD/safe-prod-like-environment/STATUS.md index d345be6..411b87e 100644 --- a/vibe/PRD/safe-prod-like-environment/STATUS.md +++ b/vibe/PRD/safe-prod-like-environment/STATUS.md @@ -49,4 +49,4 @@ Not planned: dedicated physical node (4th Pi / mini-PC) and disposable cloud k3s | PR | Scope | Phase | Merged | | --- | --- | --- | --- | -| _pending_ | Bootstrap the PRD tree (this `vibe/` set) — backfilled on open | — | ⬜ | +| [#10](https://gitea.arcodange.lab/arcodange-org/factory/pulls/10) | Bootstrap the `vibe/` tree + ecosystem `AGENTS.md` (PRD scaffold, not a phase deliverable) | — | 🟡 open | -- 2.49.1 From dbe32161dcb400cd3e5e899f2e8399fa69100757 Mon Sep 17 00:00:00 2001 From: Gabriel Radureau Date: Tue, 23 Jun 2026 21:11:51 +0200 Subject: [PATCH 3/9] docs(vibe): add factory-provisioning guidebook (Ansible + OpenTofu) Deep, code-grounded tree-docs guidebook under vibe/guidebooks/factory-provisioning/, explored from the actual playbooks/roles and tofu code: - Hub: the two provisioning engines (operator-run Ansible vs CI-applied OpenTofu), a green-field bring-up flow, master index, maintenance rule. - ansible/ sub-tree: ordered pages 01-system .. 06-recover, an inventory & variables concept page, and a Tier-1/Tier-2 roles reference (hashicorp_vault, step_ca, crowdsec, pihole, deploy_docker_compose + the gitea_* family and helpers). - opentofu/ sub-tree: factory-iac (Cloudflare/OVH/GCP/Gitea/Vault edge + cloudflare_token module), postgres-iac (per-app DB/role/pgbouncer lookup), ci-apply-flow (Gitea OIDC-JWT -> Vault -> auto-approve apply). Cross-linked bidirectionally with the lab-ecosystem guidebook and the safe-env ADR/PRD (the sandbox rehearses exactly these engines). 14 mermaid diagrams MCP-validated; zero dead links. Authored by the Lab Cartographer cohort. 
Co-Authored-By: Claude Opus 4.8 --- vibe/guidebooks/README.md | 1 + .../guidebooks/factory-provisioning/README.md | 88 +++++++++ .../factory-provisioning/ansible/01-system.md | 94 +++++++++ .../factory-provisioning/ansible/02-setup.md | 82 ++++++++ .../factory-provisioning/ansible/03-cicd.md | 34 ++++ .../factory-provisioning/ansible/04-tools.md | 125 ++++++++++++ .../factory-provisioning/ansible/05-backup.md | 107 ++++++++++ .../ansible/06-recover.md | 149 ++++++++++++++ .../factory-provisioning/ansible/README.md | 120 +++++++++++ .../factory-provisioning/ansible/inventory.md | 111 +++++++++++ .../factory-provisioning/ansible/roles.md | 186 ++++++++++++++++++ .../factory-provisioning/opentofu/README.md | 95 +++++++++ .../opentofu/ci-apply-flow.md | 114 +++++++++++ .../opentofu/factory-iac.md | 148 ++++++++++++++ .../opentofu/postgres-iac.md | 116 +++++++++++ vibe/guidebooks/lab-ecosystem/01-factory.md | 1 + 16 files changed, 1571 insertions(+) create mode 100644 vibe/guidebooks/factory-provisioning/README.md create mode 100644 vibe/guidebooks/factory-provisioning/ansible/01-system.md create mode 100644 vibe/guidebooks/factory-provisioning/ansible/02-setup.md create mode 100644 vibe/guidebooks/factory-provisioning/ansible/03-cicd.md create mode 100644 vibe/guidebooks/factory-provisioning/ansible/04-tools.md create mode 100644 vibe/guidebooks/factory-provisioning/ansible/05-backup.md create mode 100644 vibe/guidebooks/factory-provisioning/ansible/06-recover.md create mode 100644 vibe/guidebooks/factory-provisioning/ansible/README.md create mode 100644 vibe/guidebooks/factory-provisioning/ansible/inventory.md create mode 100644 vibe/guidebooks/factory-provisioning/ansible/roles.md create mode 100644 vibe/guidebooks/factory-provisioning/opentofu/README.md create mode 100644 vibe/guidebooks/factory-provisioning/opentofu/ci-apply-flow.md create mode 100644 vibe/guidebooks/factory-provisioning/opentofu/factory-iac.md create mode 100644 
vibe/guidebooks/factory-provisioning/opentofu/postgres-iac.md diff --git a/vibe/guidebooks/README.md b/vibe/guidebooks/README.md index 1c11363..fd2f104 100644 --- a/vibe/guidebooks/README.md +++ b/vibe/guidebooks/README.md @@ -35,6 +35,7 @@ flowchart LR | Guidebook | What it maps | Status | |---|---|---| | [Lab ecosystem](lab-ecosystem/README.md) | End-to-end map of `factory` + `tools` + `cms`: repos, the `` join key, secrets via Vault, CI/CD, ArgoCD, and the data/control flows that connect them | ✅ Active | +| [Factory provisioning](factory-provisioning/README.md) | Deep dive into how factory provisions everything: Ansible playbooks + roles and OpenTofu | ✅ Active | ## Rules to contribute diff --git a/vibe/guidebooks/factory-provisioning/README.md b/vibe/guidebooks/factory-provisioning/README.md new file mode 100644 index 0000000..79a6eae --- /dev/null +++ b/vibe/guidebooks/factory-provisioning/README.md @@ -0,0 +1,88 @@ +[vibe](../../README.md) > [Guidebooks](../README.md) > **Factory provisioning** + +# Factory provisioning + +> **Status:** ✅ Active +> **Last Updated:** 2026-06-23 +> **Upstream:** [Lab ecosystem guidebook](../lab-ecosystem/README.md) · [01 · factory](../lab-ecosystem/01-factory.md) +> **Related:** [safe-prod-like-environment ADR](../../ADR/0001-safe-prod-like-environment.md) · [safe-prod-like-environment PRD](../../PRD/safe-prod-like-environment/README.md) + +This guidebook is the deep dive into **how the `factory` repo turns three Raspberry Pis + a handful of cloud accounts into the running lab.** Where the [lab-ecosystem](../lab-ecosystem/README.md) map shows *which* components exist and how they join, this guidebook drills into the two provisioning **engines** that build and maintain them: the Ansible collection that the operator runs from the Mac, and the OpenTofu modules that Gitea CI applies. 
Every page below describes the engine *as it is wired right now* — playbook imports, role responsibilities, inventory placement, provider versions, state backends, and the CI flow that ties Tofu to Vault. + +## Two engines, two trigger models + +The factory splits provisioning along a hard line: **imperative, operator-driven host/cluster build** (Ansible) versus **declarative, CI-driven forge/cloud/database state** (OpenTofu). They never overlap on the same resource, and they run at different moments. + +| Engine | Trigger | Runs from | Owns | Lives at | +|---|---|---|---|---| +| **Ansible** | One-shot, operator-run on demand | The Mac (control node) | The cluster + base layer + stateful services: k3s, Longhorn, Pi-hole, step-ca, PostgreSQL, Gitea, Vault, CrowdSec — plus the disaster-recovery playbooks | [`ansible/`](../../../ansible/) → [sub-hub](ansible/README.md) | +| **OpenTofu** | CI-applied on Gitea (path-filtered `push`/`pull_request` + `workflow_dispatch`) | Gitea act-runners | Forge/cloud edge state (Cloudflare, OVH, GCP, Gitea, Vault) and **per-app PostgreSQL databases** | [`iac/`](../../../iac/) + [`postgres/`](../../../postgres/) → [sub-hub](opentofu/README.md) | + +> [!NOTE] +> Ansible is **imperative and human-gated** because it touches bare hosts and one-time bootstrap (disk prep, k3s install, Vault init). OpenTofu is **declarative and machine-gated** because its targets are reconcilable API objects (a DNS record, a bucket, a database) whose desired state belongs in version control and converges on every merge. + +## How a green-field lab comes up + +```mermaid +%%{init: {'theme': 'base'}}%% +flowchart LR + classDef op fill:#1e3a8a,stroke:#1e40af,color:#fff + classDef eng fill:#059669,stroke:#047857,color:#fff + classDef host fill:#7c3aed,stroke:#6d28d9,color:#fff + classDef store fill:#b45309,stroke:#92400e,color:#fff + + OP["Operator<br/>at the Mac"]:::op -->|"runs playbooks 01→05"| ANS["Ansible collection<br/>arcodange.factory"]:::eng + ANS -->|"OS · k3s · Longhorn · base layer"| PIS["3× Raspberry Pi<br/>pi1 / pi2 / pi3"]:::host + PIS -->|"hosts Gitea + act-runners"| CI["Gitea CI<br/>act-runners"]:::store + CI -->|"path-filtered apply"| TOFU["OpenTofu<br/>iac/ + postgres/iac/"]:::eng + TOFU -->|"forge · cloud · PG state"| EDGE["Cloudflare · OVH · GCP<br/>Gitea · Vault · PostgreSQL"]:::store + TOFU -. "state in GCS gs://arcodange-tf" .- EDGE +``` + +1. The **operator**, working from the **Mac control node**, runs the numbered Ansible playbooks `01_system` → `05_backup` in order. +2. **Ansible** lays the OS, k3s (`v1.34.3+k3s1`), Longhorn, and the base layer (Pi-hole, step-ca, Vault, CrowdSec) plus the stateful out-of-cluster services (PostgreSQL + Gitea) onto the **three Raspberry Pis** (`pi1`/`pi2`/`pi3`). +3. Once `pi2` is hosting **Gitea** and `pi1`/`pi3` are running the **act-runners** (registered by `03_cicd`), the forge can run CI. +4. A push or merge to `factory` that touches `iac/**` or `postgres/**` triggers the corresponding **Gitea CI** workflow on those runners. +5. The CI job authenticates to Vault via Gitea OIDC JWT and runs **OpenTofu**, which reconciles the **forge/cloud/database edge** — Cloudflare, OVH, GCP, Gitea action-secrets, Vault KV/policies, and the per-app PostgreSQL objects. +6. All OpenTofu state is kept in **GCS** under `gs://arcodange-tf` (prefix `factory/main` for the cloud edge, `factory/postgres` for the databases), so each CI run reads and writes the authoritative state remotely. 
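Steps 4 and 6 above imply a fixed mapping from a changed path to the state prefix its workflow uses. The sketch below only mirrors that prose: the function names `state_prefix_for_path` and `state_url_for_path` are hypothetical, the real path filters live in the Gitea workflow files, and the printed `gs://` string is shorthand for "bucket plus prefix" rather than the exact object name the backend writes.

```bash
#!/usr/bin/env bash
# Hypothetical illustration of the path -> state-prefix mapping described above.
STATE_BUCKET="arcodange-tf"   # bucket named in step 6

# state_prefix_for_path PATH -> prints the state prefix for a changed path,
# or returns 1 when the path triggers no OpenTofu workflow.
state_prefix_for_path() {
  case "$1" in
    iac/*)      echo "factory/main" ;;      # cloud/forge edge
    postgres/*) echo "factory/postgres" ;;  # per-app databases
    *)          return 1 ;;                 # e.g. ansible/ is operator-run, not CI-applied
  esac
}

# state_url_for_path PATH -> prints bucket + prefix as a gs:// string.
state_url_for_path() {
  local prefix
  prefix="$(state_prefix_for_path "$1")" || return 1
  echo "gs://${STATE_BUCKET}/${prefix}"
}
```

Seen this way, the sandbox concern from the safe-prod-like-environment work is easy to state: a rehearsal run must resolve to a different bucket or prefix family than the two values returned here.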
+ +## Master index + +| Sub-hub | What it maps | Status | +|---|---|---| +| [Ansible](ansible/README.md) | The `arcodange.factory` collection: numbered playbooks `01`–`06`, the inventory + group_vars, and the reusable roles that build hosts, the cluster, and the stateful services | ✅ Active | +| [OpenTofu](opentofu/README.md) | The CI-applied IaC: the cloud/forge edge (`iac/`), the per-app PostgreSQL provisioning (`postgres/iac/`), and the Gitea-OIDC → Vault apply flow | ✅ Active | + +### All pages + +- **Ansible** + - [System (`01`)](ansible/01-system.md) — OS, DNS, SSL, disks, Docker, iSCSI, k3s, CoreDNS, cert-issuer, Longhorn/Traefik config + - [Setup (`02`)](ansible/02-setup.md) — PostgreSQL + Gitea docker-compose on `pi2` (and the optional backup-NFS share) + - [CI/CD (`03`)](ansible/03-cicd.md) — Gitea act-runner registration on `pi1`/`pi3` and the ArgoCD/Image-Updater install + - [Tools (`04`)](ansible/04-tools.md) — Vault + CrowdSec bootstrap into the cluster + - [Backup (`05`)](ansible/05-backup.md) — scheduled PostgreSQL / Gitea / k3s-PVC backups to `/mnt/backups` + - [Recover (`06`)](ansible/06-recover.md) — the Longhorn disaster-recovery playbooks (`recover/`) + - [Inventory & variables](ansible/inventory.md) — `hosts.yml` groups and the `group_vars` tree + - [Roles reference](ansible/roles.md) — `deploy_docker_compose`, the `gitea_*` family, `traefik_certs`, `playwright`, and the service sub-roles +- **OpenTofu** + - [factory iac](opentofu/factory-iac.md) — `iac/`: Cloudflare/OVH/GCP/Gitea/Vault edge + the `cloudflare_token` module + - [postgres iac](opentofu/postgres-iac.md) — `postgres/iac/`: per-app databases, roles, and the pgbouncer `user_lookup()` function + - [CI apply flow](opentofu/ci-apply-flow.md) — the Gitea workflows, OIDC-JWT → Vault auth, and the GCS state backend + +## Maintenance rule + +> [!IMPORTANT] +> **Alter a documented component → update its page in the same change.** If you change a playbook, a role, an inventory entry, a 
provider version, a Tofu resource, or the CI flow, the matching page in this guidebook MUST be edited in the same PR. A provisioning map that drifts from the code sends operators (and agents) down dead paths during a rebuild or a recovery — exactly when the map matters most. + +## Why this guidebook earns its keep + +The safe-prod-like-environment work rehearses **exactly these playbooks and Tofu modules** in a throwaway sandbox before they touch the real lab: the sandbox stands up the same `01`–`05` narrative and runs the same `iac/` + `postgres/iac/` apply, so the rehearsal only holds if this guidebook tracks the engines faithfully. See the [safe-prod-like-environment ADR](../../ADR/0001-safe-prod-like-environment.md) for the decision and the [PRD](../../PRD/safe-prod-like-environment/README.md) (with its [QA strategy](../../PRD/safe-prod-like-environment/qa-strategy.md)) for what the sandbox must reproduce. + +## Cross-references + +- [Lab ecosystem guidebook](../lab-ecosystem/README.md) — the higher-altitude whole-lab map; this guidebook is its provisioning deep dive. +- [01 · factory](../lab-ecosystem/01-factory.md) — the four-pillar summary of the `factory` repo that this guidebook expands. +- [secrets-and-vault.md](../lab-ecosystem/secrets-and-vault.md) — Gitea OIDC JWT for Tofu/CI and the dynamic PostgreSQL credentials these engines set up. +- [storage-and-recovery.md](../lab-ecosystem/storage-and-recovery.md) — Longhorn + GCS backup + the power-cut recovery the `06 · recover` playbooks serve. +- [naming-conventions.md](../lab-ecosystem/naming-conventions.md) — the `` join key shared by the OpenTofu state prefixes and per-app PostgreSQL objects. +- [safe-prod-like-environment ADR](../../ADR/0001-safe-prod-like-environment.md) · [PRD](../../PRD/safe-prod-like-environment/README.md) — the sandbox that rehearses these engines before they touch the real lab. 
diff --git a/vibe/guidebooks/factory-provisioning/ansible/01-system.md b/vibe/guidebooks/factory-provisioning/ansible/01-system.md new file mode 100644 index 0000000..8e4b03f --- /dev/null +++ b/vibe/guidebooks/factory-provisioning/ansible/01-system.md @@ -0,0 +1,94 @@ +[vibe](../../../README.md) > [Guidebooks](../../README.md) > [Factory provisioning](../README.md) > [Ansible](README.md) > **01 · System** + +# 01 · System — base OS, Docker, K3s, Longhorn, DNS, SSL + +> [!NOTE] +> **Status:** ✅ active · **Last Updated:** 2026-06-23 +> **Upstream:** [Ansible sub-hub](README.md) · [Factory provisioning hub](../README.md) +> **Downstream:** [02 · Setup](02-setup.md) · [03 · CI/CD](03-cicd.md) +> **Related:** [Storage & recovery](../../lab-ecosystem/storage-and-recovery.md) · [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) · [Naming conventions](../../lab-ecosystem/naming-conventions.md) · [ADR-0001 safe prod-like environment](../../../ADR/0001-safe-prod-like-environment.md) + +## What it does + +`01 · System` takes three bare Raspberry Pis (`pi1`, `pi2`, `pi3`) and turns them into a configured K3s cluster. The wrapper [`playbooks/01_system.yml`](../../../../ansible/arcodange/factory/playbooks/01_system.yml) does nothing but `import_playbook` the stage orchestrator [`playbooks/system/system.yml`](../../../../ansible/arcodange/factory/playbooks/system/system.yml), which in turn imports ten sub-playbooks **in strict order**. Each sub-play layers one capability: hostname/DNS hygiene, Pi-hole HA DNS, the step-ca PKI, the external backup disk, Docker, the iSCSI/dm-crypt prerequisites for Longhorn, K3s itself, CoreDNS forwarding, the cert-manager issuer, and finally the cluster config (Longhorn + Traefik). + +All host-facing plays target `raspberries:&local` — the intersection of the `raspberries` group and the `local` group, which resolves to `pi1`/`pi2`/`pi3` (see [Inventory & variables](inventory.md)). 
The K3s server/agent split is decided at runtime: the **first host (alphabetically) becomes the server**, the rest become agents. + +## Ordered steps + +| # | Sub-playbook | Purpose | Key vars / versions | +| --- | --- | --- | --- | +| 1 | [`system/rpi.yml`](../../../../ansible/arcodange/factory/playbooks/system/rpi.yml) | Set each node's hostname to its `inventory_hostname`. On Pi-hole nodes (`pi1`/`pi3`) add `dnsmasq` to the `dip` group, then **stop & disable `dnsmasq`** to free port 53 for `pihole-FTL`. | `tags: never` (opt-in only) | +| 2 | [`dns/dns.yml`](../../../../ansible/arcodange/factory/playbooks/dns/dns.yml) → [`dns/pihole.yml`](../../../../ansible/arcodange/factory/playbooks/dns/pihole.yml) | Install & configure **Pi-hole HA DNS** via the `pihole` role. Adds custom records mapping `.arcodange.lab` and `.arcodange.duckdns.org` to `pi1`. | `pihole_custom_dns` → `pi1.preferred_ip` | +| 3 | [`ssl/ssl.yml`](../../../../ansible/arcodange/factory/playbooks/ssl/ssl.yml) → [`ssl/step-ca.yml`](../../../../ansible/arcodange/factory/playbooks/ssl/step-ca.yml) | Install **step-ca** (the `step_ca` role) on all three Pis; fetch the root CA from `pi1`; build a **Gitea runner image that trusts the CA** (`runner-images:ubuntu-latest-ca`) and push it to the registry. | `step_ca_primary: pi1`, root at `/home/step/.step/certs/root_ca.crt` | +| 4 | [`system/prepare_disks.yml`](../../../../ansible/arcodange/factory/playbooks/system/prepare_disks.yml) | Auto-detect the largest external (non-`mmcblk0`) USB partition, format it **ext4 with label `arcodange_500`**, mount at `/mnt/arcodange`, and persist in `fstab`. Skips format if the label already exists. 
**`pause` confirm before any format.** | `mount_point: /mnt/arcodange`, `disk_label: arcodange_500` | +| 5 | [`system/system_docker.yml`](../../../../ansible/arcodange/factory/playbooks/system/system_docker.yml) | Install Docker via `geerlingguy.docker`; write `daemon.json` with **json-file logging** (`max-size 10m`, `max-file 5`) and **`data-root: /mnt/arcodange/docker`** (only when the external disk is mounted). | `tags: never`; `storage-driver: overlay2` | +| 6 | [`system/iscsi_longhorn.yml`](../../../../ansible/arcodange/factory/playbooks/system/iscsi_longhorn.yml) | Install `open-iscsi` (+ enable `iscsid`) and `cryptsetup`, and load the **`dm_crypt`** kernel module (persisted in `/etc/modules`) — Longhorn's encrypted-volume prerequisites. Creates `/mnt/arcodange/longhorn`. | module `dm_crypt` | +| 7 | [`system/system_k3s.yml`](../../../../ansible/arcodange/factory/playbooks/system/system_k3s.yml) | Build the K3s inventory dynamically (first sorted host → `server`, rest → `agent`), install the `k3s-ansible` content, run `k3s.orchestration.site`, then **fetch the kubeconfig** to `~/.kube/config` (rewriting `127.0.0.1` → server IP). | **k3s `v1.34.3+k3s1`**; server args `--docker --disable traefik` | +| 8 | [`system/k3s_dns.yml`](../../../../ansible/arcodange/factory/playbooks/system/k3s_dns.yml) | Create the **`coredns-custom`** ConfigMap so cluster DNS forwards `arcodange.lab:53` to the Pi-hole IPs; also patch the main CoreDNS Corefile to forward to the same HA Pi-holes. | `pihole_ips` (extracted from hostvars) | +| 9 | [`system/k3s_ssl.yml`](../../../../ansible/arcodange/factory/playbooks/system/k3s_ssl.yml) | Deploy **cert-manager** + **step-issuer** as k3s static HelmCharts; create the `StepClusterIssuer` `step-ca` wired to the JWK provisioner and root CA. 
| cert-manager `v1.19.2`, step-issuer `1.9.11`, `caUrl: https://ssl-ca.arcodange.lab:8443`, **ARM64 `kube-rbac-proxy` override** | +| 10 | [`system/k3s_config.yml`](../../../../ansible/arcodange/factory/playbooks/system/k3s_config.yml) | Deploy **Longhorn** + **Traefik** as HelmCharts; issue the wildcard cert, set the default `TLSStore`, wire Gitea, the IP-allow-list middleware, and the CrowdSec bouncer plugin; then **delete the old Traefik** to force a redeploy. | Longhorn `v1.9.1`, Traefik `v37.4.0` (see detail below) | + +## How the stages fit together + +```mermaid +%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'13px'}}}%% +flowchart TD + classDef host fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb; + classDef cluster fill:#1e4032,stroke:#22c55e,color:#f0fdf4; + classDef danger fill:#5f1e1e,stroke:#ef4444,color:#fef2f2; + + rpi["1 · rpi.yml
hostname + dnsmasq off"]:::host + dns["2 · pihole
HA DNS"]:::host + ssl["3 · step-ca
root CA + CA-trusting runner image"]:::host + disk["4 · prepare_disks.yml
ext4 arcodange_500 -> /mnt/arcodange"]:::danger + docker["5 · system_docker.yml
data-root on external disk"]:::host + iscsi["6 · iscsi_longhorn.yml
open-iscsi + dm_crypt"]:::host + k3s["7 · system_k3s.yml
k3s v1.34.3 (--disable traefik)"]:::cluster + cdns["8 · k3s_dns.yml
coredns-custom -> Pi-hole"]:::cluster + cmgr["9 · k3s_ssl.yml
cert-manager + step-issuer"]:::cluster + cfg["10 · k3s_config.yml
Longhorn + Traefik + redeploy"]:::cluster + + rpi --> dns --> ssl --> disk --> docker --> iscsi --> k3s --> cdns --> cmgr --> cfg +``` + +1. **`rpi.yml`** fixes the hostname and, on Pi-hole nodes, stops `dnsmasq` so `pihole-FTL` can own port 53. +2. **Pi-hole** comes up as the HA DNS authority for `arcodange.lab`. +3. **step-ca** is installed; its root CA is fetched and baked into a Gitea runner image so CI can trust internal TLS. +4. **`prepare_disks.yml`** formats and mounts the external USB disk at `/mnt/arcodange` (with a confirmation pause). +5. **Docker** installs with its data-root pointed at that disk and capped logging. +6. **iSCSI + dm_crypt** prerequisites land so Longhorn can attach (and encrypt) volumes. +7. **K3s** installs with the first host as server, Docker as the container runtime, and Traefik disabled. +8. **CoreDNS** is reconfigured to forward `arcodange.lab` to the Pi-holes. +9. **cert-manager + step-issuer** wire the in-cluster issuer to step-ca. +10. **`k3s_config.yml`** deploys Longhorn and a fully-customized Traefik, then deletes the old Traefik so the helm-controller redeploys with the new config. + +## `k3s_config.yml` — Longhorn & Traefik detail + +| Resource | Value | Notes | +| --- | --- | --- | +| Longhorn HelmChart | `v1.9.1` | `defaultSettings.defaultDataPath: /mnt/arcodange/longhorn` — volumes live on the external disk. | +| Traefik HelmChart | `v37.4.0` | Deployed as a k3s static manifest (`traefik-v3.yaml`) with an inline `traefik-configmap`. | +| Wildcard cert | `wildcard-arcodange-lab` | `Certificate` for `arcodange.lab` + `*.arcodange.lab`, issued by the `step-issuer` `StepClusterIssuer`. | +| `TLSStore` `default` | `defaultCertificate: wildcard-arcodange-lab` | Makes the wildcard cert the cluster-wide default. | +| Gitea exposure | `gitea-external` `ExternalName` Service → `pi2` port 3000 | Gitea runs **outside** K3s as Docker Compose on `pi2`; Traefik routes `gitea.arcodange.lab` to it. 
| +| `localIp` middleware | `ipAllowList` | Restricts dashboard/Gitea routers to LAN + pod CIDR + the detected public IP. | +| CrowdSec bouncer | plugin `v1.3.3` | Traefik experimental plugin `crowdsec-bouncer-traefik-plugin` (config completed in [04 · Tools](04-tools.md)). | +| DuckDNS token | `traefik-duckdns-token` Secret → `DUCKDNS_TOKEN` | Consumed by the `letsencrypt` ACME DNS-challenge resolver via `envFrom`. | + +## Gotchas + +> [!CAUTION] +> **Step 4 formats a disk — data loss is real.** `prepare_disks.yml` picks the **largest non-system partition** and runs `mkfs.ext4 -F` on it when the `arcodange_500` label is absent. The `run_once` `pause` prompt ("tapez 'oui' pour continuer") is the only guard, and a wrong USB stick plugged into the wrong Pi will be wiped. Confirm `target_device` in the debug output before answering. If a candidate already carries the label, the format is skipped and the disk is only (re)mounted. + +> [!WARNING] +> **K3s ships with `--disable traefik`.** The bundled Traefik is intentionally turned off in step 7 so step 10 can deploy its own fully-customized `v37.4.0`. If you re-enable the bundled Traefik or run `k3s_config.yml` out of order, two Traefiks will fight over the ingress ports. + +> [!WARNING] +> **ARM64 needs the `kube-rbac-proxy` image override.** step-issuer's default `gcr.io/kubebuilder/kube-rbac-proxy:v0.15.0` is AMD64-only and **crash-loops on `pi3` (ARM64)**. `k3s_ssl.yml` overrides it to `quay.io/brancz/kube-rbac-proxy:v0.15.0`. Do not remove this override. + +> [!WARNING] +> **Traefik is force-redeployed.** The last play of `k3s_config.yml` deletes the `traefik` Deployment **and** the `helm-install-traefik` Job so the k3s helm-controller re-runs the install against the new manifest. Expect a brief ingress outage during this window; the play then waits for the new Deployment to come back before finishing. 
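The server/agent split and the kubeconfig rewrite from step 7 can be pictured with a small sketch. This is an illustration only, not the playbook's code: the function names are hypothetical, and the sample address `192.0.2.10` and port are placeholders rather than real lab values.

```python
from typing import Dict, List


def split_k3s_roles(hosts: List[str]) -> Dict[str, List[str]]:
    """First host in sorted order becomes the server; the rest become agents."""
    ordered = sorted(hosts)
    return {"server": ordered[:1], "agent": ordered[1:]}


def rewrite_kubeconfig(kubeconfig_text: str, server_ip: str) -> str:
    """Point a fetched kubeconfig at the server instead of the loopback address."""
    return kubeconfig_text.replace("127.0.0.1", server_ip)


# Host names come from the guidebook; input order is deliberately shuffled.
roles = split_k3s_roles(["pi3", "pi1", "pi2"])

# Hypothetical kubeconfig fragment; 192.0.2.10 is a documentation-only address.
sample = "server: https://127.0.0.1:6443"
rewritten = rewrite_kubeconfig(sample, "192.0.2.10")

print(roles)
print(rewritten)
```

Because the choice depends only on sort order, renaming or adding a host that sorts before `pi1` would move the server role, which is worth keeping in mind before changing the inventory.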
+ +> [!NOTE] +> **`tags: never` plays are opt-in.** `rpi.yml` and `system_docker.yml` carry `tags: never`, so they are skipped unless you explicitly pass their tag (e.g. `--tags rpi` / `--tags ...`) or `--tags all`. The K3s/Longhorn/Traefik plays run on a normal invocation. diff --git a/vibe/guidebooks/factory-provisioning/ansible/02-setup.md b/vibe/guidebooks/factory-provisioning/ansible/02-setup.md new file mode 100644 index 0000000..383ef58 --- /dev/null +++ b/vibe/guidebooks/factory-provisioning/ansible/02-setup.md @@ -0,0 +1,82 @@ +[vibe](../../../README.md) > [Guidebooks](../../README.md) > [Factory provisioning](../README.md) > [Ansible](README.md) > **02 · Setup** + +# 02 · Setup — Postgres, Gitea, NFS backup target + +> [!NOTE] +> **Status:** ✅ active · **Last Updated:** 2026-06-23 +> **Upstream:** [Ansible sub-hub](README.md) · [01 · System](01-system.md) +> **Downstream:** [03 · CI/CD](03-cicd.md) +> **Related:** [Inventory & variables](inventory.md) · [Roles reference](roles.md) · [Storage & recovery](../../lab-ecosystem/storage-and-recovery.md) · [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) + +## What it does + +`02 · Setup` deploys the **stateful services the rest of the platform leans on**: a PostgreSQL server and a Gitea instance — both running as **Docker Compose stacks on `pi2`, outside K3s** — plus the in-cluster NFS backup target. The wrapper [`playbooks/02_setup.yml`](../../../../ansible/arcodange/factory/playbooks/02_setup.yml) imports [`playbooks/setup/setup.yml`](../../../../ansible/arcodange/factory/playbooks/setup/setup.yml), which pings the Pis, then imports three sub-playbooks: `backup_nfs.yml` (tagged `never`), `postgres.yml`, and `gitea.yml`. + +> [!IMPORTANT] +> **Postgres and Gitea do not run in Kubernetes.** They are Docker Compose stacks on `pi2` (the sole member of the `postgres` group, which `gitea` inherits as a child — see [Inventory & variables](inventory.md)). 
K3s only references them: Traefik exposes Gitea via an `ExternalName` Service, and the `pg-fix-table-ownership` CronJob reaches Postgres over the LAN. This keeps the two services available even when the cluster is being rebuilt. + +## Ordered steps + +| # | Sub-playbook | Purpose | Key vars / versions | +| --- | --- | --- | --- | +| 1 | [`setup/backup_nfs.yml`](../../../../ansible/arcodange/factory/playbooks/setup/backup_nfs.yml) | Provision the shared backup volume: a **Longhorn RWX PVC `backups-rwx` (50Gi)**, a Longhorn `RecurringJob`, a `busybox` deploy to spawn the share-manager, then mount the resulting NFS share at `/mnt/backups` on every Pi. | `tags: never`; `backup_size: 50Gi`, RecurringJob `thrice-a-month-backup` (`cron 0 5 */2 * *`, retain 2) | +| 2 | [`setup/postgres.yml`](../../../../ansible/arcodange/factory/playbooks/setup/postgres.yml) | Deploy the Postgres Compose stack (`deploy_docker_compose` + `deploy_postgresql` role), create the `gitea` DB/user, create the **pgbouncer auth_user + `user_lookup()` functions** in both `postgres` and `gitea` DBs, publish the K8s Secret `postgres-admin-credentials`, and install the **`pg-fix-table-ownership` CronJob**. | **Postgres `16.3-alpine`**; container `postgres`; CronJob daily `0 3 * * *` | +| 3 | [`setup/gitea.yml`](../../../../ansible/arcodange/factory/playbooks/setup/gitea.yml) | Deploy the Gitea Compose stack (`deploy_docker_compose` + `deploy_gitea` role), create admin `arcodange`, mint an API token via `gitea_token`, upload the avatar, register the SSH key, create org `arcodange-org`, then **delete the temp token**. 
| **Gitea `1.25.5`**; base URL `http://pi2:3000` | + +## NFS backup target — how the share is born + +```mermaid +%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'13px'}}}%% +flowchart TD + classDef cluster fill:#1e4032,stroke:#22c55e,color:#f0fdf4; + classDef host fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb; + + pvc["RWX PVC backups-rwx (50Gi)
longhorn-system"]:::cluster + rj["RecurringJob thrice-a-month-backup
cron 0 5 */2 * *"]:::cluster + dep["busybox Deployment rwx-nfs
mounts the PVC"]:::cluster + sm["Longhorn share-manager
(spawned by the mount)"]:::cluster + svc["Service nfs-backups-rwx
ClusterIP :2049"]:::cluster + mount["mount /mnt/backups on pi1/pi2/pi3
NFS vers=4.1"]:::host + + pvc --> rj + pvc --> dep --> sm --> svc --> mount +``` + +1. A **ReadWriteMany Longhorn PVC** (`backups-rwx`, 50Gi) is created in `longhorn-system`. +2. A **`RecurringJob`** is attached to the volume so Longhorn snapshots/backs it up on the `0 5 */2 * *` schedule. +3. A **`busybox` Deployment (`rwx-nfs`)** mounts the PVC — the act of mounting an RWX volume makes Longhorn spawn an **NFS share-manager** pod. +4. A stable **ClusterIP Service** (`nfs-backups-rwx`, port 2049) is created (or reused) to front the share-manager. +5. Each Pi installs `nfs-common` and **mounts the share at `/mnt/backups`** (`vers=4.1`, `nofail`, `x-systemd.automount`), persisted in `fstab`. + +## Postgres — what gets created + +| Artifact | Where | Purpose | +| --- | --- | --- | +| Compose stack `arcodange_factory` | `pi2` Docker | Runs `postgres:16.3-alpine`, container `postgres`, port `5432`, data under `/home/pi/arcodange/docker_composes/postgres/data`. | +| `gitea` DB + user | inside Postgres | Created by the `deploy_postgresql` role from `applications_databases.gitea` (`gitea_database`). | +| pgbouncer `auth_user` (`pgbouncer_auth`) | `postgres` + `gitea` DBs | Login role used by the [pgbouncer pooler](../../lab-ecosystem/02-tools.md) for SCRAM lookups. | +| `user_lookup(text)` function | `postgres` + `gitea` DBs | `SECURITY DEFINER` function over `pg_shadow`; `EXECUTE` granted only to `pgbouncer_auth`. | +| K8s Secret `postgres-admin-credentials` | `kube-system` | Base64 admin user/password so the in-cluster CronJob can authenticate. | +| CronJob `pg-fix-table-ownership` | `kube-system` | Runs `postgres:16.3` daily at **03:00**; discovers `%_role` roles, derives each DB by stripping `_role`, and re-`ALTER TABLE ... OWNER TO` every public table — repairing ownership after a restore. | + +## Gitea — bootstrap sequence + +1. 
**Compose deploy** via `deploy_docker_compose`, then the `deploy_gitea` role wires Gitea to the Postgres DB (host/db/user/password pulled from the compose env). +2. **Admin user** `arcodange` (`arcodange@gmail.com`) is created with `--random-password --admin` if absent. +3. **API token** is minted by the `gitea_token` role and used for the next HTTP calls. +4. **Avatar** upload, **SSH public key** registration (idempotent), and **org `arcodange-org`** (full name "Arcodange") creation + avatar. +5. **Cleanup** — a `post_tasks` invocation of `gitea_token` with `gitea_token_delete: true` removes the temporary token. + +## Gotchas + +> [!WARNING] +> **The NFS play is `never`-tagged and order-sensitive.** `backup_nfs.yml` only runs when explicitly tagged, and several of its tasks (`Créer PVC RWX`, `Lancer un Deployment pour déclencher NFS`, `Attendre que le pod rwx-nfs soit Running`) are themselves `tags: never`. The RWX volume must already exist for the busybox deploy to spawn the share-manager; running the mount step before the share-manager is `Running` will hang on the `until` retry loop. + +> [!WARNING] +> **Postgres lives on `pi2` outside K3s.** Treat it as a single-host service: there is no Postgres pod to `kubectl get`. The cluster only sees the `postgres-admin-credentials` Secret and the `pg-fix-table-ownership` CronJob, both of which reach the DB over the LAN at `pi2:5432`. A `pi2` outage takes Postgres (and Gitea) down regardless of cluster health. + +> [!CAUTION] +> **`pg-fix-table-ownership` exists because restores break ownership.** After a Longhorn/data recovery, tables can come back owned by the wrong role and apps lose write access. The daily CronJob silently re-owns every `public` table to the `_role` matching each `%_role` PostgreSQL role. If you add a database whose owning role does **not** follow the `_role` naming convention, this job will not fix it — see [Naming conventions](../../lab-ecosystem/naming-conventions.md). 
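The naming dependency described in the caution above can be made concrete with a minimal sketch. It assumes the job matches role names by a literal `_role` suffix and strips it to get the database name; the helper names and the sample roles and tables (`webapp_role`, `erp_role`, `orders`) are hypothetical, and the real CronJob's SQL may differ.

```python
from typing import Dict, List

ROLE_SUFFIX = "_role"


def databases_for_roles(role_names: List[str]) -> Dict[str, str]:
    """Map each '<name>_role' role to the database '<name>'; skip everything else."""
    mapping = {}
    for role in role_names:
        if role.endswith(ROLE_SUFFIX) and len(role) > len(ROLE_SUFFIX):
            mapping[role] = role[: -len(ROLE_SUFFIX)]
    return mapping


def ownership_statements(role: str, tables: List[str]) -> List[str]:
    """Build one re-own statement per table in the public schema."""
    return [f'ALTER TABLE public."{table}" OWNER TO "{role}";' for table in tables]


# Hypothetical role list: two follow the convention, two do not.
found = databases_for_roles(["webapp_role", "erp_role", "pgbouncer_auth", "legacy_owner"])
statements = ownership_statements("webapp_role", ["orders", "users"])

print(found)
print("\n".join(statements))
```

In this sketch `legacy_owner` is never visited, which mirrors the failure mode above: a database whose owning role breaks the convention keeps whatever ownership the restore left behind.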
+ +> [!NOTE] +> **The admin password is random and printed once.** Gitea's admin is created with `--random-password`; capture it from the play output (or reset it via `docker exec`) — it is not stored in the inventory. The bootstrap API token is deliberately deleted at the end, so re-running the play re-mints a fresh one. diff --git a/vibe/guidebooks/factory-provisioning/ansible/03-cicd.md b/vibe/guidebooks/factory-provisioning/ansible/03-cicd.md new file mode 100644 index 0000000..51b6aa7 --- /dev/null +++ b/vibe/guidebooks/factory-provisioning/ansible/03-cicd.md @@ -0,0 +1,34 @@ +[vibe](../../../README.md) > [Guidebooks](../../README.md) > [Factory provisioning](../README.md) > [Ansible](README.md) > **03 · CI/CD** + +# 03 · CI/CD — Gitea Actions runners + +> [!NOTE] +> **Status:** ✅ active · **Last Updated:** 2026-06-23 +> **Upstream:** [Ansible sub-hub](README.md) · [02 · Setup](02-setup.md) +> **Downstream:** [04 · Tools](04-tools.md) +> **Related:** [Lab ecosystem · 01 factory (ArgoCD caveat)](../../lab-ecosystem/01-factory.md) · [Roles reference](roles.md) · [Inventory & variables](inventory.md) + +## What it does + +`03 · CI/CD` registers and deploys the **Gitea Actions runner (`act_runner`)** on every Pi that is *not* the Gitea host, so CI jobs have executors. The whole stage is one playbook, [`playbooks/03_cicd.yml`](../../../../ansible/arcodange/factory/playbooks/03_cicd.yml) — there is no stage subdirectory. + +It targets `raspberries:&local:!gitea`, i.e. the raspberries that are local **minus** the `gitea` group. Since `gitea` resolves to `pi2`, the runner lands on **`pi1` and `pi3`** (see [Inventory & variables](inventory.md)). + +## Steps + +| # | Task / role | Purpose | Key detail | +| --- | --- | --- | --- | +| 1 | role `arcodange.factory.gitea_token` | Mint a `gitea_api_token` for later API use. | Reused across the collection (see [Roles reference](roles.md)). 
| +| 2 | `gitea actions generate-runner-token` (delegated to the Gitea host) | Fetch a **runner registration token** by `docker exec`-ing into the `gitea` container. | `delegate_to: groups.gitea[0]` | +| 3 | role `arcodange.factory.deploy_docker_compose` | Render the `act_runner` Compose stack with the registration token, instance URL, runner name, and labels. | image `gitea/act_runner:latest`; labels point at `runner-images:ubuntu-latest-ca` | +| 4 | `community.docker.docker_compose_v2` (down→up loop) | Apply the stack: a `loop: [absent, present]` recreates the runner so token/label changes take effect. | cache dirs under `/mnt/arcodange/gitea-runner-*` | + +The runner registers with `GITEA_INSTANCE_URL: http://:3000`, names itself `arcodange_global_runner_`, and advertises the **`ubuntu-latest` / `ubuntu-latest-ca`** labels — both mapped to the CA-trusting image built back in [01 · System](01-system.md). It mounts the Docker socket and the host CA store (`/etc/ssl/certs`, `/usr/local/share/ca-certificates`) so jobs trust internal TLS, and runs with `insecure: true` against the Gitea TLS endpoint. + +## Gotchas + +> [!WARNING] +> **ArgoCD is present in design but not deployed.** The factory pipeline intends `03_cicd` to also bring up ArgoCD (the app-of-apps), but **that step is commented out / not currently deployed in-cluster** — this stage only deploys the Gitea runners. Treat ArgoCD as "designed, not live" until the install is enabled. See the [ArgoCD caveat in lab-ecosystem · 01 factory](../../lab-ecosystem/01-factory.md). + +> [!WARNING] +> **The registration token is single-use and host-delegated.** Step 2 generates a fresh token every run via the Gitea container, so the runner re-registers on each apply. If the Gitea host (`pi2`) is down, token generation fails and no runner can register. 
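The host pattern `raspberries:&local:!gitea` used by this stage reads as set algebra: intersect, then subtract. The sketch below models it with plain Python sets. The `raspberries` and `gitea` members follow this page; the exact membership of `local` is an assumption made for illustration (the extra `localhost` entry is hypothetical).

```python
# Group membership: raspberries and gitea follow the guidebook text;
# the contents of `local` are assumed for illustration only.
raspberries = {"pi1", "pi2", "pi3"}
local = {"pi1", "pi2", "pi3", "localhost"}
gitea = {"pi2"}

# `:&local` keeps hosts present in both groups; `:!gitea` then removes the Gitea host.
runner_hosts = sorted((raspberries & local) - gitea)

print(runner_hosts)
```

Moving Gitea to another Pi would change the `gitea` set and therefore which hosts receive a runner, without touching the pattern itself.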
diff --git a/vibe/guidebooks/factory-provisioning/ansible/04-tools.md b/vibe/guidebooks/factory-provisioning/ansible/04-tools.md new file mode 100644 index 0000000..aed81b3 --- /dev/null +++ b/vibe/guidebooks/factory-provisioning/ansible/04-tools.md @@ -0,0 +1,125 @@ +[vibe](../../../README.md) > [Guidebooks](../../README.md) > [Factory provisioning](../README.md) > [Ansible](README.md) > **04 · Tools** + +# 04 · Tools — Vault + CrowdSec + +> [!NOTE] +> **Status:** ✅ active · **Last Updated:** 2026-06-23 +> **Upstream:** [Ansible sub-hub](README.md) · [Factory provisioning hub](../README.md) +> **Downstream:** [Roles reference](roles.md) — deep mechanics of the `hashicorp_vault` and `crowdsec` roles +> **Related:** [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) · [05 · Backup](05-backup.md) · [03 · CI/CD](03-cicd.md) · [ADR-0001 safe prod-like environment](../../../ADR/0001-safe-prod-like-environment.md) + +Stage 4 installs the **operational tooling layer** on top of a running cluster: HashiCorp **Vault** (the lab's single secret store) and **CrowdSec** (the WAF/IPS that fronts Traefik). The entry point [`playbooks/04_tools.yml`](../../../../ansible/arcodange/factory/playbooks/04_tools.yml) is a one-line wrapper that imports [`playbooks/tools/tools.yml`](../../../../ansible/arcodange/factory/playbooks/tools/tools.yml), which in turn chains two sub-playbooks — `hashicorp_vault.yml` then `crowdsec.yml`. Both run against `localhost` (they drive the cluster through `kubectl` / `kubernetes.core`, not over SSH to the Pis). + +> [!IMPORTANT] +> Vault is the chokepoint of the whole secret model. This page covers **what the playbook orchestrates**; the byte-level role internals (init, unseal, root-token minting, the OpenTofu OIDC backend) live in the [Roles reference](roles.md). Read [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) first for the conceptual model — the two auth backends, the unseal posture, and why there is no secret material in git. 
+ +--- + +## What stage 4 deploys + +| Sub-playbook | File | Builds | Role invoked | +| --- | --- | --- | --- | +| Vault | [`tools/hashicorp_vault.yml`](../../../../ansible/arcodange/factory/playbooks/tools/hashicorp_vault.yml) | Initialises + unseals Vault, wires the Gitea OIDC/JWT auth backends via OpenTofu, publishes the `vault_oauth__sh_b64` Gitea Action secret | `hashicorp_vault` | +| CrowdSec | [`tools/crowdsec.yml`](../../../../ansible/arcodange/factory/playbooks/tools/crowdsec.yml) | A `VaultAuth` + `VaultStaticSecret` for the Turnstile captcha keys, a fresh bouncer API key, and the Traefik `crowdsec` middleware | `crowdsec` | + +--- + +## Step 1 — `hashicorp_vault.yml` + +### The credential prompt + +The play opens with a single `vars_prompt` for the **Gitea admin password** (`gitea_admin_password`, marked `unsafe: true` because the password may contain shell-hostile characters like `{`). This is the only interactive input the stage needs — everything else is derived or minted on the fly. + +### Orchestration flow + +```mermaid +%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'14px'}}}%% +flowchart TD + classDef prompt fill:#5f4a1e,stroke:#d97706,color:#fffbeb; + classDef mint fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb; + classDef vault fill:#4c1d95,stroke:#7c3aed,color:#f5f3ff; + classDef revoke fill:#5f1e1e,stroke:#ef4444,color:#fef2f2; + + P["vars_prompt:
gitea_admin_password"]:::prompt + T["Mint temp GITEA_ADMIN_TOKEN
(role gitea_token, replace=true)"]:::mint + R["Run hashicorp_vault role:
init · unseal · OIDC backend · gitea secret"]:::vault + D["post_tasks:
delete GITEA_ADMIN_TOKEN"]:::revoke + + P --> T --> R --> D +``` + +1. **Mint a temporary token.** The `arcodange.factory.gitea_token` role generates a `GITEA_ADMIN_TOKEN` with scopes `write:admin,write:organization,write:repository,write:user` (and `gitea_token_replace: true`, so any stale token of the same name is rotated). It is stashed in the fact `vault_GITEA_ADMIN_TOKEN`. +2. **Run the `hashicorp_vault` role.** Invoked with three derived vars: the Postgres admin credentials (read straight out of the Postgres host's docker-compose `environment` via `hostvars[groups.postgres[0]]`), the `gitea_admin_token` (= the temp token), and the prompted `gitea_admin_password`. The role does the heavy lifting — see below. +3. **Revoke the temporary token.** A `post_tasks` block re-invokes `gitea_token` with `gitea_token_delete: true`, so the admin token never outlives the run. + +### What the `hashicorp_vault` role does + +The role's [`tasks/main.yml`](../../../../ansible/arcodange/factory/playbooks/tools/roles/hashicorp_vault/tasks/main.yml) runs a fixed sequence; the OIDC backend setup is wrapped in a `block`/`always` so the freshly minted **root token is always revoked**, even on failure: + +| Phase | Task file | What happens | +| --- | --- | --- | +| **Init** | `init.yml` | First-time only. Checks `vault operator init -status`; if uninitialised, runs `vault operator init` with **1 key share / threshold 1** and writes the keys to `~/.arcodange/cluster-keys.json` (mode `600`). Idempotent on re-run. | +| **Unseal** | `unseal.yml` | Reads `cluster-keys.json` and runs `vault operator unseal` on every server pod. Required on **every reboot** — Vault always restarts sealed. | +| **Root token** | `new_root_token.yml` | Mints a one-shot root token via the `generate-root` OTP/nonce dance (using the unseal key), needed to authenticate the OpenTofu apply. 
| +| **OIDC backend** | `gitea_oidc_auth.yml` | Drives a Playwright script to register/read the Gitea OAuth app, then runs **OpenTofu in a throwaway Docker volume** to provision the `gitea` (OIDC) + `gitea_jwt` (JWT) auth backends, the admin identity, and the `kvv1` static secrets. Finally writes the `vault_oauth__sh_b64` script to Gitea Actions secrets. | +| **Revoke** | `revoke_token.yml` (in `always`) | Revokes the root token unconditionally. | + +> [!IMPORTANT] +> The OpenTofu apply runs the [`hashicorp_vault.tf`](../../../../ansible/arcodange/factory/playbooks/tools/roles/hashicorp_vault/files/hashicorp_vault.tf) inside an ephemeral Docker volume (`docker volume create` → `tofu init` + `tofu apply` → `docker volume rm`), with the state in a GCS backend (`gs://arcodange-tf`, prefix `tools/hashicorp_vault/gitea_oidc`). The CA is mounted read-only via `VAULT_CACERT`. The destroy step is commented out by design — this provisions, it does not tear down. + +### The `vault_oauth__sh_b64` Gitea secret + +The last act of the role renders [`oidc_jwt_token.sh.j2`](../../../../ansible/arcodange/factory/playbooks/tools/roles/hashicorp_vault/templates/oidc_jwt_token.sh.j2) (an OIDC authorization-code → access-token helper for CI), base64-encodes it, and publishes it as the **org-level** Gitea Action secret `vault_oauth__sh_b64`. Because Gitea Action secrets are scoped per owner, the role then **re-publishes the identical secret to each user-owned namespace** listed in `gitea_secret_propagation_users` — repos under a personal account cannot read org-level secrets. This is what lets a Gitea Actions workflow obtain the OIDC JWT that authenticates to Vault under the `gitea_cicd_` role (the CI half of the [secret model](../../lab-ecosystem/secrets-and-vault.md)). + +> [!CAUTION] +> The role has an **off-by-default** `vault_oidc_force_reset` flag. 
When set, it runs `vault auth disable gitea` **and** `gitea_jwt` before re-applying — which **wipes every `gitea_cicd_` per-app JWT role** created by the tools-repo IaC. Leave it `false` unless you are deliberately rebuilding the OIDC backend from scratch (e.g. `bound_issuer` config drift). + +--- + +## Step 2 — `crowdsec.yml` + +The CrowdSec sub-playbook is a thin wrapper that runs the `crowdsec` role to bolt a CrowdSec-bouncer middleware onto Traefik. The role's [`tasks/main.yml`](../../../../ansible/arcodange/factory/playbooks/tools/roles/crowdsec/tasks/main.yml) wires three things together. + +| Step | What it creates | Detail | +| --- | --- | --- | +| **Turnstile secret** | `ServiceAccount` + `VaultAuth` + `VaultStaticSecret` in `kube-system` | Authenticates via the Kubernetes auth backend (role `factory_crowdsec_conf`) and pulls the Cloudflare Turnstile keys from `kvv2` path `cms/factory/turnstile` into a K8s Secret (`refreshAfter: 30s`). | +| **Bouncer key** | A CrowdSec LAPI bouncer named `traefik-plugin` | Runs `cscli bouncers add traefik-plugin` inside the LAPI pod; on collision it deletes and re-adds, so the run is repeatable. | +| **Traefik middleware** | A `traefik.io/v1alpha1` `Middleware` named `crowdsec` | Stream mode, captcha provider `turnstile` (site/secret keys from the Turnstile secret), Redis cache, trusted-IP allow-lists. | + +After applying the middleware the role **cleans up `Failed` CrowdSec pods** and **bounces Traefik** (scale to 0 → back to 1, inside a `block`/`rescue`/`always` that guarantees Traefik returns to 1 replica no matter what) so the new middleware config is loaded. + +> [!NOTE] +> The Turnstile keys come from the **CMS-managed** Vault path `cms/factory/turnstile` — they are provisioned outside this stage. CrowdSec only *reads* them here. See [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) for how `VaultStaticSecret` materialises a Vault path into a Kubernetes Secret. 
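The `block`/`rescue`/`always` guard around the Traefik bounce behaves much like `try`/`except`/`finally`. The sketch below is an analogy under that assumption, not the role's implementation: `FakeDeployment` and `bounce` are hypothetical stand-ins and no Kubernetes API is called.

```python
from typing import List


class FakeDeployment:
    """In-memory stand-in for the Traefik Deployment (hypothetical)."""

    def __init__(self, fail_on_scale_down: bool = False) -> None:
        self.replicas = 1
        self.fail_on_scale_down = fail_on_scale_down

    def scale(self, replicas: int) -> None:
        if replicas == 0 and self.fail_on_scale_down:
            raise RuntimeError("scale-down failed")
        self.replicas = replicas


def bounce(deployment: FakeDeployment) -> List[str]:
    """Scale to 0 and back to 1; the restore step runs whatever happens."""
    log: List[str] = []
    try:  # block
        deployment.scale(0)
        log.append("scaled to 0")
    except RuntimeError as exc:  # rescue
        log.append(f"rescued: {exc}")
    finally:  # always
        deployment.scale(1)
        log.append("restored to 1")
    return log


healthy = FakeDeployment()
broken = FakeDeployment(fail_on_scale_down=True)
healthy_log = bounce(healthy)
broken_log = bounce(broken)

print(healthy_log)
print(broken_log)
```

Both runs finish with one replica, which is the property the role relies on: a failed scale-down still leaves ingress running.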
+ +--- + +## Gotchas + +> [!WARNING] +> - **Vault must be unsealed before anything secret-dependent recovers.** Stage 4's unseal step reads `~/.arcodange/cluster-keys.json`; if that file is missing, init/unseal cannot proceed and the OpenTofu apply (which needs a live Vault) fails. The same file gates step 2 of the [power-cut recovery order](../../lab-ecosystem/storage-and-recovery.md). +> - **Docker is required on the control node.** The OIDC backend provisioning shells out to `docker run … opentofu` and `docker volume`. The Playwright step also runs containerised. A control node without Docker will fail this stage. +> - **`gitea_admin_password` is `unsafe`.** Do not strip the `unsafe: true` flag from the prompt — passwords with `{`/`}` are mangled by Jinja templating otherwise. +> - **Re-running is safe by default.** Init and unseal are idempotent; the temp admin token and root token are both revoked on the way out. Only `vault_oidc_force_reset` makes a re-run destructive. +> - **CrowdSec bounces Traefik.** The middleware step briefly scales Traefik to 0 — expect a short ingress blip during stage 4. The `always` block restores it to 1 even if the scale-down errors. + +--- + +## Where stage 4 sits + +```mermaid +%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'14px'}}}%% +flowchart LR + classDef done fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb; + classDef here fill:#4c1d95,stroke:#7c3aed,color:#f5f3ff; + classDef next fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb; + + s03["03 · CI/CD"]:::done + s04["04 · Tools
Vault · CrowdSec"]:::here + s05["05 · Backup"]:::next + + s03 --> s04 --> s05 +``` + +1. **03 · CI/CD** registered the `act_runner` executors — a prerequisite, since the `vault_oauth__sh_b64` secret published here is consumed by those CI runners. +2. **04 · Tools** (this page) stands up Vault and CrowdSec. +3. **05 · Backup** is next — it schedules the cron dumps that protect the state the cluster now holds. diff --git a/vibe/guidebooks/factory-provisioning/ansible/05-backup.md b/vibe/guidebooks/factory-provisioning/ansible/05-backup.md new file mode 100644 index 0000000..6bf7df0 --- /dev/null +++ b/vibe/guidebooks/factory-provisioning/ansible/05-backup.md @@ -0,0 +1,107 @@ +[vibe](../../../README.md) > [Guidebooks](../../README.md) > [Factory provisioning](../README.md) > [Ansible](README.md) > **05 · Backup** + +# 05 · Backup — daily cron dumps + +> [!NOTE] +> **Status:** ✅ active · **Last Updated:** 2026-06-23 +> **Upstream:** [Ansible sub-hub](README.md) · [Factory provisioning hub](../README.md) +> **Downstream:** [06 · Recover](06-recover.md) — how these dumps are replayed +> **Related:** [Storage & recovery](../../lab-ecosystem/storage-and-recovery.md) · [04 · Tools](04-tools.md) · [ADR-0001 safe prod-like environment](../../../ADR/0001-safe-prod-like-environment.md) + +Stage 5 installs three independent **cron-driven backup jobs** that protect the platform's persistent state: the PostgreSQL database, the Gitea instance, and the K3s volume metadata (PV/PVC + Longhorn CRDs). The entry point [`playbooks/05_backup.yml`](../../../../ansible/arcodange/factory/playbooks/05_backup.yml) imports [`playbooks/backup/backup.yml`](../../../../ansible/arcodange/factory/playbooks/backup/backup.yml), which chains the three sub-playbooks, each passing `backup_root_dir: /mnt/backups`. 
+ +Every job follows the **same anatomy**: run a daily cron at **04:00**, write a date-stamped archive to `/mnt/backups//`, prune anything older than **3 days**, and drop a matching `restore.sh` next to the backup script. `/mnt/backups` is a Longhorn RWX volume, so Longhorn itself snapshots, replicates, and ships these archives off-site — the cron jobs only produce the dumps. + +> [!NOTE] +> All three sub-playbooks **install** scripts and cron entries; they do not run a backup themselves (beyond a one-shot `test backup_cmd` smoke check that pipes to `/dev/null`). The actual backups fire from cron. To read failures, SSH to the host and use `sudo su` → `mails` (see [`backup/README.md`](../../../../ansible/arcodange/factory/playbooks/backup/README.md)). + +--- + +## The three jobs + +| Job | Sub-playbook | Host | Backup command | Artifact | Scripts dir | +| --- | --- | --- | --- | --- | --- | +| **Postgres** | [`backup/postgres.yml`](../../../../ansible/arcodange/factory/playbooks/backup/postgres.yml) | `postgres` | `docker exec pg_dumpall -U ` ∣ `gzip` | `backup_YYYYMMDD.sql.gz` | `…/docker_composes/postgres/scripts` | +| **Gitea** | [`backup/gitea.yml`](../../../../ansible/arcodange/factory/playbooks/backup/gitea.yml) | `gitea` | `docker exec -u git gitea dump --skip-log --skip-db --skip-package-data --type tar.gz` | `backup_YYYYMMDD.gitea.gz` | `…/docker_composes/gitea/scripts` | +| **K3s PVC** | [`backup/k3s_pvc.yml`](../../../../ansible/arcodange/factory/playbooks/backup/k3s_pvc.yml) | `pi1` | `kubectl get pv,pvc` + `volumes.longhorn.io` + `settings.longhorn.io` (YAML) | `backup_YYYYMMDD.volumes` | `/opt/k3s_volumes` | + +All three share: `keep_days: 3`, cron `minute: 0 hour: 4 user: root`, and `backup_dir: /mnt/backups/`. 
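
The shared anatomy (date-stamped archive, then prune) can be sketched as one small shell function. This is an illustration under stated assumptions, not the generated `backup.sh`: the `run_backup` name, its argument order, and the write-to-a-temporary-name step are choices made for the example, and only the `backup_YYYYMMDD` naming, the 3-day retention, and the `find … -mtime +3 -delete` pruning come from this page.

```sh
#!/bin/sh
# Sketch of the shared backup shape; not the script the playbooks install.
# run_backup and its arguments are hypothetical names for illustration.
KEEP_DAYS=3

# run_backup <backup_dir> <extension> <command...>
# Stores the stdout of <command...> as backup_YYYYMMDD.<extension>,
# then prunes older archives from the same directory.
run_backup() {
    backup_dir=$1
    extension=$2
    shift 2
    mkdir -p "$backup_dir" || return 1
    target="$backup_dir/backup_$(date +%Y%m%d).$extension"
    # Assumption: write to a temporary name first so a failed dump
    # never replaces an existing good archive.
    if "$@" > "$target.partial"; then
        mv "$target.partial" "$target"
    else
        rm -f "$target.partial"
        return 1
    fi
    # Same pruning expression the page quotes for the real jobs.
    find "$backup_dir" -type f -name 'backup_*' -mtime +"$KEEP_DAYS" -delete
}
```

Any command that writes its dump to stdout fits the last argument, which is why a streamed dump (as the Gitea job uses) and a piped one (as the Postgres job uses) can share the same wrapper.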
+ +```mermaid +%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'14px'}}}%% +flowchart TD + classDef cron fill:#5f4a1e,stroke:#d97706,color:#fffbeb; + classDef job fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb; + classDef store fill:#4c1d95,stroke:#7c3aed,color:#f5f3ff; + classDef ship fill:#14532d,stroke:#22c55e,color:#f0fdf4; + + C["cron · daily 04:00 · user root"]:::cron + PG["postgres.yml
pg_dumpall ∣ gzip"]:::job + GT["gitea.yml
gitea dump tar.gz"]:::job + PV["k3s_pvc.yml
PV · PVC · Longhorn CRDs"]:::job + D["/mnt/backups/{postgres,gitea,k3s_pvc}/
keep 3 days"]:::store + L["Longhorn:
snapshot · replicate · off-site"]:::ship + + C --> PG --> D + C --> GT --> D + C --> PV --> D + D --> L +``` + +1. A single daily **04:00 root cron** triggers each job's `backup.sh`. +2. **postgres.yml** runs `pg_dumpall` through `gzip`, **gitea.yml** streams a `gitea dump` tarball, **k3s_pvc.yml** serialises the volume metadata. +3. Each writes a date-stamped archive into `/mnt/backups//` and prunes files older than 3 days (`find … -mtime +3 -delete`). +4. Because `/mnt/backups` is a Longhorn RWX volume, Longhorn snapshots, replicates across nodes, and ships an off-site copy — no separate upload step in the cron. + +--- + +## Job details + +### Postgres — `postgres.yml` + +The backup command is built from the Postgres host's docker-compose facts (`container_name`, `POSTGRES_USER`). `pg_dumpall` captures **all databases plus globals (roles)** in one logical dump, gzipped. The generated `restore.sh` takes an optional `YYYYMMDD` argument (defaults to the latest dump), `docker cp`s it into the container, gunzips, and replays with `psql -f`. If the restore misbehaves, the script reminds you to wipe the data dir before replaying. + +### Gitea — `gitea.yml` + +The dump runs as the `git` user with `--skip-db` (Postgres is backed up separately by the Postgres job) and `--skip-package-data`, streamed to stdout (`-f -`) so it never lands on the container's own disk. The `restore.sh` unpacks the tarball back into `/data/gitea` (config/data) and `/data/git/repositories` (repos), fixes `git:git` ownership, and **regenerates hooks** (`gitea admin regenerate hooks`) — without that step the restored repos have stale hook paths. + +### K3s PVC — `k3s_pvc.yml` + +This job does **not** back up volume *data* (Longhorn handles the bytes). It backs up the **Kubernetes objects** needed to re-bind those volumes: all `pv` + `pvc`, the **`volumes.longhorn.io` CRDs**, and `settings.longhorn.io`, concatenated into one `.volumes` YAML (`---`-separated). 
It writes the dump to both `/mnt/backups/k3s_pvc/` *and* a copy alongside the script. The `restore.sh` prefers a fallback dir (`/home/pi/arcodange/backups/k3s_pvc`) then the primary, picks the latest (or a dated) dump, and `kubectl apply`s it. + +> [!IMPORTANT] +> **Backing up the Longhorn `volumes.longhorn.io` CRDs is what enables *fast* recovery.** With the Volume CRDs in the backup, recovery is a single `kubectl apply` that re-associates the surviving on-disk replicas with their PVs (see [06 · Recover → `longhorn.yml`](06-recover.md)). **Without** the Volume CRDs, a Longhorn reinstall assigns **new engine IDs**, cannot adopt the orphaned replica directories, and you fall through to the slow **block-device data recovery** (`longhorn_data.yml`). The k3s_pvc backup_cmd carries an inline comment to this effect and points at the [Longhorn PVC recovery ADR](../../../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md). This is the prevention half of the [storage failure mode](../../lab-ecosystem/storage-and-recovery.md). + +--- + +## Gotchas + +> [!WARNING] +> - **3-day retention is tight.** A failure that goes unnoticed for 3 days loses all recoverable history. The off-site Longhorn copy is the longer-horizon safety net — the local `/mnt/backups` files are short-lived. +> - **The smoke test runs the real dump.** Each play has a `test backup_cmd` task that executes the backup command (output discarded) at provisioning time. If Postgres/Gitea/kubectl is unreachable when you run stage 5, provisioning fails fast — by design. +> - **Cron runs as `root`, scripts live in app dirs.** The `backup.sh`/`restore.sh` are written into the app's docker-compose `scripts/` dir (or `/opt/k3s_volumes`); the cron job invokes them as root. Don't relocate the compose dirs without re-running stage 5. 
+> - **Gitea restore needs the hook regeneration.** Skipping `gitea admin regenerate hooks` leaves repos with broken push hooks — the `restore.sh` already does it, so use the script rather than a manual untar. +> - **Postgres and Gitea DB are backed up by *different* jobs.** Gitea dumps with `--skip-db`; its database rows come from the Postgres `pg_dumpall`. Restoring Gitea fully means restoring **both** archives. + +--- + +## Where stage 5 sits + +```mermaid +%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'14px'}}}%% +flowchart LR + classDef done fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb; + classDef here fill:#4c1d95,stroke:#7c3aed,color:#f5f3ff; + classDef rec fill:#5f1e1e,stroke:#ef4444,color:#fef2f2; + + s04["04 · Tools"]:::done + s05["05 · Backup
Postgres · Gitea · K3s PVC"]:::here + rec["recover/*
(on disaster)"]:::rec + + s04 --> s05 + s05 -. "feeds restore" .-> rec +``` + +1. **04 · Tools** stood up Vault and CrowdSec — the secret store stage 5's dumps help protect. +2. **05 · Backup** (this page) is the last linear stage: it schedules the daily dumps. +3. The artifacts here are the **input** to the on-demand [06 · Recover](06-recover.md) branch — the `.volumes` dump in particular gates whether recovery is fast (CRDs present) or slow (block-device). diff --git a/vibe/guidebooks/factory-provisioning/ansible/06-recover.md b/vibe/guidebooks/factory-provisioning/ansible/06-recover.md new file mode 100644 index 0000000..a57b45d --- /dev/null +++ b/vibe/guidebooks/factory-provisioning/ansible/06-recover.md @@ -0,0 +1,149 @@ +[vibe](../../../README.md) > [Guidebooks](../../README.md) > [Factory provisioning](../README.md) > [Ansible](README.md) > **06 · Recover** + +# 06 · Recover — Longhorn disaster recovery + +> [!NOTE] +> **Status:** 🟡 beta · **Last Updated:** 2026-06-23 +> **Upstream:** [Ansible sub-hub](README.md) · [Factory provisioning hub](../README.md) +> **Downstream:** [05 · Backup](05-backup.md) — the dumps these playbooks consume +> **Related:** [Storage & recovery](../../lab-ecosystem/storage-and-recovery.md) · [PRD — QA strategy](../../../PRD/safe-prod-like-environment/qa-strategy.md) · [Longhorn PVC recovery ADR](../../../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md) + +The `recover/` playbooks are **not** part of the linear `01..05` pipeline — they are an **on-demand disaster-recovery branch**, invoked only after a power cut or data loss. There are two, and which one you run depends on a single question: **do the Longhorn Volume CRDs still exist?** + +> [!IMPORTANT] +> **Decision — pick the right playbook before you start:** +> - **Volume CRDs still present** (e.g. 
they were captured by the [05 · Backup k3s_pvc dump](05-backup.md), or never wiped) → run [`recover/longhorn.yml`](../../../../ansible/arcodange/factory/playbooks/recover/longhorn.yml). Fast: it re-applies the CRDs and the surviving on-disk replicas are re-adopted. +> - **Volume CRDs are GONE** (a nuclear Longhorn reinstall assigned new engine IDs) but the raw replica `.img` files survive on disk → run [`recover/longhorn_data.yml`](../../../../ansible/arcodange/factory/playbooks/recover/longhorn_data.yml). Slow: it merges replica layers at the block-device level and injects the data into a fresh volume. + +```mermaid +%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'14px'}}}%% +flowchart TD + classDef q fill:#5f4a1e,stroke:#d97706,color:#fffbeb; + classDef fast fill:#14532d,stroke:#22c55e,color:#f0fdf4; + classDef slow fill:#5f1e1e,stroke:#ef4444,color:#fef2f2; + classDef dead fill:#6b7280,stroke:#4b5563,color:#fff; + + Q{"Do the Longhorn
Volume CRDs
still exist?"}:::q + F["longhorn.yml
CSI/CRD recovery (fast)"]:::fast + S{"Raw replica
.img files
survive?"}:::q + D["longhorn_data.yml
block-device recovery (slow)"]:::slow + X["Data unrecoverable
(replicas zeroed)"]:::dead + + Q -- "yes" --> F + Q -- "no" --> S + S -- "yes" --> D + S -- "no" --> X +``` + +1. **CRDs present?** Yes → `longhorn.yml` re-applies the Volume CRDs and the on-disk replicas re-attach. This is the fast path. +2. **CRDs gone?** Then ask whether the raw replica `.img` files survived on disk. +3. **Replicas survive?** Yes → `longhorn_data.yml` reconstructs the filesystem at the block level and injects it into a new volume. +4. **Replicas zeroed** by Longhorn reconciliation → the data is unrecoverable; there is no playbook for this. + +> [!NOTE] +> This branch sits at step 1 of the broader tested startup order — **Longhorn first, then Vault unseal, then VSO re-auth, ERP scaled up last**. The full order, the engine-ID failure mode, and the once-real-once-rehearsed history are in [Storage & recovery](../../lab-ecosystem/storage-and-recovery.md). The single tested-recovery record (1-key/threshold-1 unseal, the four-step order) lives in CLUSTER_RECOVERY.md, kept at the lab root outside this repo. + +--- + +## `longhorn.yml` — CSI/CRD recovery (CRDs present) + +Runs against `raspberries:&local` as root. It diagnoses how broken Longhorn is and applies the **least invasive** fix that works, escalating only if needed. Most logic runs `run_once` on `pi1`, delegating cluster reads to `localhost`. + +| Phase | What it does | +| --- | --- | +| **0 · Pre-flight** | Verifies the data dir `/mnt/arcodange/longhorn` exists on `pi1` (fails hard if missing) and that at least one `backup_*.volumes` dump exists in the primary or fallback backup dir. | +| **1 · Diagnosis** | Checks the `longhorn-system` namespace, the `driver.longhorn.io` **CSIDriver** registration, and the `longhorn-manager` pods, then sets `recovery_phase` = `soft` (CSI driver gone), `hard` (managers unhealthy), or `none`. | +| **2 · Soft** | Touches `longhorn-install.yaml` to make k3s reconcile the HelmChart, waits, and checks that the pods are recreated. 
| +| **3 · Hard** | Force-deletes the `longhorn-driver-deployer` pods so the HelmChart recreates them. | +| **4 · Nuclear** | Full reinstall: delete the HelmChart, strip finalizers off all Longhorn CRs / PVCs / the namespace, delete + redeploy the `longhorn-install` HelmChart manifest (`v1.9.1`, `defaultDataPath` preserved), wait for pods. | +| **5 · Restore** | Waits for managers to be ready, then `kubectl apply`s the latest `backup_*.volumes` dump (PV/PVC + Longhorn CRDs) and any `longhorn_metadata_*.yaml`. | +| **6 · Verify** | Polls until the CSIDriver is registered, ≥3 managers are Running, the CSI socket exists, and the replica data dir is present; prints a summary. | + +> [!IMPORTANT] +> Phase 5 is exactly where the [05 · Backup k3s_pvc dump](05-backup.md) pays off: re-applying the captured **Volume CRDs** lets Longhorn re-adopt the surviving replica directories instead of forcing the block-device path. The playbook is **idempotent** — it re-diagnoses and escalates only as far as needed, so re-running after a partial recovery is safe. + +--- + +## `longhorn_data.yml` — block-device data recovery (CRDs gone) + +This is the fallback when a nuclear reinstall has destroyed the Volume CRDs and assigned new engine IDs, leaving the real data in **orphaned** replica directories. It bypasses Kubernetes objects entirely and reconstructs the filesystem at the block level. It is **driven by a vars file** — `vars/recovery_volumes.yml`, one entry per volume — and the format is documented in [`longhorn_data_vars.example.yml`](../../../../ansible/arcodange/factory/playbooks/recover/longhorn_data_vars.example.yml). 
+ +```sh +ansible-playbook -i inventory/hosts.yml \ + playbooks/recover/longhorn_data.yml \ + -e @vars/recovery_volumes.yml +``` + +```mermaid +%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'14px'}}}%% +flowchart TD + classDef pre fill:#5f4a1e,stroke:#d97706,color:#fffbeb; + classDef merge fill:#4c1d95,stroke:#7c3aed,color:#f5f3ff; + classDef k8s fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb; + classDef done fill:#14532d,stroke:#22c55e,color:#f0fdf4; + + P0["Pre-flight + Phase 0:
auto-discover largest replica dir (>16K)"]:::pre + P1["Phase 1: back up untouched replica dir
(safe copy before any op)"]:::merge + P2["Phase 2: merge-longhorn-layers.py
→ single .img · test-mount RO"]:::merge + P3["Phase 3: create Volume CRD
(scale down workload, clear stuck PVCs)"]:::k8s + P5["Phase 5: attach via maintenance ticket
→ /dev/longhorn/<pv>"]:::k8s + P6["Phase 6: mkfs + rsync merged image
into live block device"]:::merge + P8["Phase 8: recreate PV (Retain) + PVC
pinned by volumeName"]:::k8s + P9["Phase 9: scale workload up · verify"]:::done + + P0 --> P1 --> P2 --> P3 --> P5 --> P6 --> P8 --> P9 +``` + +1. **Pre-flight + Phase 0.** Fail fast if no volumes are defined, the merge tool is missing, or Longhorn managers aren't Running. Then **auto-discover** the best replica source for each volume — the **largest dir >16 MiB** across `pi1/pi2/pi3`, skipping any replica still `Rebuilding`. `source_node`/`source_dir` in the vars file override this. +2. **Phase 1.** `cp -a` the untouched replica dir to a backup location *before* touching anything, and verify it contains `volume.meta`. +3. **Phase 2.** Run `merge-longhorn-layers.py` to collapse the snapshot + head `.img` layers into one image, then test-mount it read-only to confirm the filesystem is sound. +4. **Phase 3.** Scale the workload to 0 and clear any stuck `Terminating` PV/PVCs *before* creating a fresh Longhorn `Volume` CRD (order matters — StatefulSet controllers re-provision empty PVCs otherwise). +5. **Phase 5.** Attach the volume via a Longhorn `VolumeAttachment` **maintenance ticket** so `/dev/longhorn/` appears on the source node, with the frontend enabled. +6. **Phase 6.** `mkfs.ext4` the live block device if unformatted, then `rsync` the merged recovery image into it (`--ignore-errors`; rsync rc=23 partial-transfer is treated as success for power-cut partitions). +7. **Phase 8.** Detach the recovery ticket, recreate the PV (`Retain`, no `claimRef`) and a PVC pinned by `volumeName`, and wait for Bound. +8. **Phase 9.** Scale the workload back up, wait for ready replicas, and run the optional per-volume `verify_cmd` inside the pod. + +> [!CAUTION] +> The `merge-longhorn-layers.py` tool is invoked **per replica dir via `dmsetup`** to stack the copy-on-write layers correctly. 
Never recover by simply renaming the orphaned replica directory to the new engine ID — Longhorn reconciliation can pick the *empty* new replica as the rebuild source and **overwrite your data**. The block-device injection is the only proven-safe path. The full method comparison is in the [Longhorn PVC recovery ADR](../../../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md). + +> [!NOTE] +> **Tested against the 2026-04-13 power cut.** This block-device path was proven end to end by recovering the **url-shortener's SQLite database** after that power cut forced a nuclear Longhorn reinstall (verified `2026-04-14` with `sqlite3 … 'SELECT COUNT(*) FROM urls;'`). That scenario is the worked example in [`longhorn_data_vars.example.yml`](../../../../ansible/arcodange/factory/playbooks/recover/longhorn_data_vars.example.yml). + +--- + +## Gotchas + +> [!WARNING] +> - **Run `longhorn.yml` first if there is any chance the CRDs survived.** It is fast and idempotent; falling straight to `longhorn_data.yml` is unnecessary block-level work when a `kubectl apply` would have sufficed. +> - **`longhorn_data.yml` needs a healthy Longhorn control plane.** Its pre-flight aborts unless ≥1 `longhorn-manager` is Running — it recovers *data into* a working Longhorn, it does not bring Longhorn back. Use `longhorn.yml` for that. +> - **Process volumes one at a time first.** The example vars file recommends validating a single volume before batching — a misidentified `source_dir` can pin the PVC to the wrong (empty) replica. +> - **`python3` is required on every node.** Phase 0's replica scan and the merge tool both need it on `pi1/pi2/pi3`. +> - **The merge tool path is repo-relative.** `longhorn_data.yml` resolves `merge-longhorn-layers.py` from `docs/incidents/2026-04-13-power-cut/tools/` and `scp`s it to the source node — run the playbook from inside the collection so that path resolves. 
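
The CRDs-present-vs-gone decision at the top of this page can be written down as a pure function, which makes the branch order explicit. This is a sketch, not part of the playbooks: `choose_recovery` is a hypothetical helper, it runs nothing against the cluster, and the two counts are inputs the operator gathers beforehand.

```sh
#!/bin/sh
# Sketch of the page's decision tree; choose_recovery is a hypothetical helper.

# choose_recovery <volume_crd_count> <replica_img_count>
# Prints the playbook to run, or "unrecoverable" (and returns 1)
# when neither Volume CRDs nor replica images remain.
choose_recovery() {
    crd_count=$1
    img_count=$2
    if [ "$crd_count" -gt 0 ]; then
        # CRDs present: the fast CSI/CRD path.
        echo "playbooks/recover/longhorn.yml"
    elif [ "$img_count" -gt 0 ]; then
        # CRDs gone but raw replica images survive: the slow block-device path.
        echo "playbooks/recover/longhorn_data.yml"
    else
        echo "unrecoverable"
        return 1
    fi
}
```

The first count could come from something like `kubectl get volumes.longhorn.io -n longhorn-system --no-headers | wc -l`. How the surviving `.img` files are counted depends on where the replica directories live on each node, so that input is left to the operator here.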
+ +--- + +## Why this is rehearsed + +A recovery procedure run once under outage stress is a liability. These two playbooks — and the CRDs-present-vs-gone decision — are **rehearsed deliberately in the production-like sandbox**: kill the cluster, lose the engine IDs on a test volume, and walk both recovery paths back to green without risking production data. That turns the drill into routine QA rather than one-shot incident memory. See the PRD's [QA strategy](../../../PRD/safe-prod-like-environment/qa-strategy.md) for how recovery drills become a regular exercise, and [Storage & recovery](../../lab-ecosystem/storage-and-recovery.md) for the full startup order these drills validate. + +--- + +## Where this branch sits + +```mermaid +%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'14px'}}}%% +flowchart LR + classDef done fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb; + classDef here fill:#5f1e1e,stroke:#ef4444,color:#fef2f2; + + s05["05 · Backup
(produces .volumes dump)"]:::done + rec["recover/*
longhorn.yml · longhorn_data.yml"]:::here + s01["01 · System
(rejoin pipeline)"]:::done + + s05 -. "on disaster" .-> rec + rec -. "once recovered" .-> s01 +``` + +1. **05 · Backup** produced the `.volumes` dump that `longhorn.yml`'s restore phase replays. +2. **recover/** (this page) is invoked only on disaster — pick `longhorn.yml` (CRDs present) or `longhorn_data.yml` (CRDs gone). +3. Once volumes are healthy, the cluster **re-enters the normal pipeline** at [01 · System](01-system.md), and you re-run a fresh [05 · Backup](05-backup.md) once everything is green. diff --git a/vibe/guidebooks/factory-provisioning/ansible/README.md b/vibe/guidebooks/factory-provisioning/ansible/README.md new file mode 100644 index 0000000..f5dd516 --- /dev/null +++ b/vibe/guidebooks/factory-provisioning/ansible/README.md @@ -0,0 +1,120 @@ +[vibe](../../../README.md) > [Guidebooks](../../README.md) > [Factory provisioning](../README.md) > **Ansible** + +# Ansible — factory provisioning + +> [!NOTE] +> **Status:** ✅ active · **Last Updated:** 2026-06-23 +> **Upstream:** [Factory provisioning hub](../README.md) · [Lab ecosystem · 01 factory](../../lab-ecosystem/01-factory.md) +> **Downstream:** [01 · System](01-system.md) · [02 · Setup](02-setup.md) · [03 · CI/CD](03-cicd.md) · [04 · Tools](04-tools.md) · [05 · Backup](05-backup.md) · [06 · Recover](06-recover.md) · [Inventory & variables](inventory.md) · [Roles reference](roles.md) +> **Related:** [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) · [Storage & recovery](../../lab-ecosystem/storage-and-recovery.md) · [Naming conventions](../../lab-ecosystem/naming-conventions.md) · [ADR-0001 safe prod-like environment](../../../ADR/0001-safe-prod-like-environment.md) + +Ansible is the **imperative half** of the factory: it takes three bare Raspberry Pis (`pi1`, `pi2`, `pi3`) and turns them into a running K3s cluster with Docker, Longhorn storage, Gitea CI runners, CrowdSec, and Vault. 
OpenTofu (the declarative half) then provisions everything that lives *outside* the cluster — see the [OpenTofu sub-hub](../opentofu/README.md). + +--- + +## Collection layout + +Everything ships as a single Ansible **collection** committed under [`ansible/arcodange/factory/`](../../../../ansible/arcodange/factory). The collection root, not the repo root, is what `ansible-galaxy collection install` and the FQCN references (`arcodange.factory.`) resolve against. + +| File | Path | What it declares | +| --- | --- | --- | +| `galaxy.yml` | [`ansible/arcodange/factory/galaxy.yml`](../../../../ansible/arcodange/factory/galaxy.yml) | Collection identity: **namespace `arcodange`**, **name `factory`**, **version `1.0.0`**. Together they form the FQCN prefix `arcodange.factory.*` used by every role and playbook import. | +| `requirements.yml` | [`ansible/requirements.yml`](../../../../ansible/requirements.yml) | External dependencies pulled at install time (see table below). | +| `ansible.cfg` | [`ansible/arcodange/factory/ansible.cfg`](../../../../ansible/arcodange/factory/ansible.cfg) | `collections_path = ~/.ansible/collections` and `scp_if_ssh = True` for the SSH connection plugin. | +| `inventory/` | [`ansible/arcodange/factory/inventory/`](../../../../ansible/arcodange/factory/inventory) | `hosts.yml` + `group_vars/`. Detailed in [Inventory & variables](inventory.md). | +| `playbooks/` | [`ansible/arcodange/factory/playbooks/`](../../../../ansible/arcodange/factory/playbooks) | The numbered pipeline `01..05` plus the `recover/` branch. | +| `roles/` | [`ansible/arcodange/factory/roles/`](../../../../ansible/arcodange/factory/roles) | Seven reusable roles. Detailed in [Roles reference](roles.md). | + +### External dependencies (`requirements.yml`) + +| Dependency | Type | Why it is needed | +| --- | --- | --- | +| `geerlingguy.docker` | role | Installs and configures the Docker engine on each Pi. 
| +| `ansible.posix` | collection | POSIX primitives (mounts, sysctl, `synchronize`). | +| `community.crypto` | collection | Certificate/key generation for the step-ca PKI and Traefik. | +| `community.docker` | collection | Manages containers and Compose stacks (Gitea, act_runner). | +| `community.general` | collection | Broad utility modules used across the pipeline. | +| `kubernetes.core` | collection | `k8s` / `helm` modules used by every K3s-facing task. Needs the `kubernetes` Python lib at runtime. | +| `k3s-ansible` (`git+https://github.com/k3s-io/k3s-ansible.git`) | git role/collection | Upstream playbooks that install and cluster K3s itself. | + +> [!TIP] +> The runtime Python libraries (`kubernetes`, `jmespath`, `dnspython`) that `kubernetes.core` and friends import are declared in the **repo-root `pyproject.toml`**, not in `requirements.yml`. `uv sync` installs them; `ansible-galaxy` installs the Galaxy/git content. Both steps are required. + +--- + +## Invocation pattern + +The control node runs Ansible from a `uv`-managed venv. The `localhost` inventory entry sets `ansible_python_interpreter: "{{ ansible_playbook_python }}"`, so `uv run` is enough to put Ansible on the venv's Python — no hardcoded interpreter path. Full recipe lives in [`ansible/README.md`](../../../../ansible/README.md). + +1. **Sync the venv** — installs `ansible-core` plus the runtime Python deps: + ```sh + uv sync + ``` +2. **Install collection dependencies** — pulls the Galaxy + git content from `requirements.yml`: + ```sh + uv run ansible-galaxy collection install -r ansible/requirements.yml + ``` +3. **Run a stage** — point `-i` at the inventory directory and pass one numbered playbook: + ```sh + uv run ansible-playbook \ + -i ansible/arcodange/factory/inventory \ + ansible/arcodange/factory/playbooks/.yml + ``` + +### The vault password (`ANSIBLE_VAULT_PASSWORD_FILE`) + +Encrypted vars are decrypted with a password that is **sourced from the cluster, not stored on disk**. 
`ANSIBLE_VAULT_PASSWORD_FILE` points at a tiny executable script that reads the K8s secret `arcodange-ansible-vault` from the `kube-system` namespace: + +```sh +kubectl get secret -n kube-system arcodange-ansible-vault \ + --template='{{index .data.pass | base64decode}}' +``` + +> [!IMPORTANT] +> The same `arcodange-ansible-vault` secret in `kube-system` is consumed by the Gitea CI runners (needed for the Gitea mailer). Create it once with `kubectl create secret generic arcodange-ansible-vault --from-literal="pass=" -n kube-system`. See [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) for how this fits the broader secret model. + +--- + +## The provisioning pipeline + +The numbered playbooks are meant to be run **in order** on a fresh cluster — each is a thin wrapper that `import_playbook`s a stage directory (e.g. `01_system.yml` → `system/system.yml`). The `recover/` playbooks are **not** part of the linear sequence; they are an on-demand branch used only during disaster recovery. + +```mermaid +%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'14px'}}}%% +flowchart LR + classDef stage fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb; + classDef recover fill:#5f1e1e,stroke:#ef4444,color:#fef2f2; + + s01["01 · System
Docker · K3s · Longhorn · DNS · SSL"]:::stage + s02["02 · Setup
Gitea · Postgres · NFS backup"]:::stage + s03["03 · CI/CD
act_runner registration"]:::stage + s04["04 · Tools
CrowdSec · Vault"]:::stage + s05["05 · Backup
cron reports · PVC/db dumps"]:::stage + rec["recover/*
Longhorn + data restore"]:::recover + + s01 --> s02 --> s03 --> s04 --> s05 + s05 -. "on disaster" .-> rec + rec -. "rejoin pipeline" .-> s01 +``` + +1. **`01 · System`** — base OS hardening on each Pi, then Docker, Longhorn disk prep + iSCSI, K3s install, CoreDNS, the step-ca cert issuer, and final K3s config (kubeconfig, Longhorn, Traefik). +2. **`02 · Setup`** — deploys the cluster-resident services: Gitea, PostgreSQL (on `pi2`), and the NFS backup target. +3. **`03 · CI/CD`** — fetches a Gitea runner-registration token and rolls out the `act_runner` Docker Compose stack on every non-Gitea Pi so CI jobs have executors. +4. **`04 · Tools`** — installs the operational tooling layer: CrowdSec (WAF/IPS) and HashiCorp Vault. +5. **`05 · Backup`** — schedules the cron-driven backup + email-report jobs and the Gitea / Postgres / K3s-PVC dump routines. +6. **`recover/*` (on demand)** — invoked only after data loss to rebuild Longhorn and replay volume data; once recovered, the cluster re-enters the normal pipeline at `01 · System`. 
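
Running the linear stages in order, and stopping at the first failure, can be sketched as a short loop. This is an illustration rather than a script shipped in the repo. Only `01_system.yml` and `05_backup.yml` are named on these pages; the other three file names are assumed to follow the same `NN_name.yml` pattern, and `run_pipeline`, `RUNNER`, `INVENTORY`, and `PLAYBOOK_DIR` are hypothetical names.

```sh
#!/bin/sh
# Sketch only. Stage file names other than 01_system.yml and 05_backup.yml
# are assumptions; check the playbooks/ directory before relying on them.
INVENTORY="${INVENTORY:-ansible/arcodange/factory/inventory}"
PLAYBOOK_DIR="${PLAYBOOK_DIR:-ansible/arcodange/factory/playbooks}"
# RUNNER is overridable so the loop can be dry-run, for example RUNNER=echo.
RUNNER="${RUNNER:-uv run ansible-playbook}"

# run_pipeline: run the five linear stages in order.
# recover/* is deliberately absent: it is an on-demand branch, not a stage.
run_pipeline() {
    for stage in 01_system 02_setup 03_cicd 04_tools 05_backup; do
        echo "==> $stage"
        # Stop at the first failing stage: later stages build on earlier ones.
        $RUNNER -i "$INVENTORY" "$PLAYBOOK_DIR/$stage.yml" || return 1
    done
}
```

Because the default inventory path is the live one, calling `run_pipeline` with `RUNNER=echo` first only prints the five commands and is a safer way to confirm what would run.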
+ +--- + +## Index + +| # | Page | Covers | State | +| --- | --- | --- | --- | +| 01 | [System](01-system.md) | RPi hardening, Docker, K3s, Longhorn/iSCSI, CoreDNS, step-ca SSL | ✅ | +| 02 | [Setup](02-setup.md) | Gitea, PostgreSQL, NFS backup target | ✅ | +| 03 | [CI/CD](03-cicd.md) | Gitea `act_runner` registration & Compose deploy | ✅ | +| 04 | [Tools](04-tools.md) | CrowdSec, HashiCorp Vault | ✅ | +| 05 | [Backup](05-backup.md) | Cron report jobs, Gitea/Postgres/PVC dumps | ✅ | +| 06 | [Recover](06-recover.md) | Longhorn + data restore (on-demand DR branch) | 🟡 | +| — | [Inventory & variables](inventory.md) | `hosts.yml` groups, `group_vars/` layering, host→service mapping | ✅ | +| — | [Roles reference](roles.md) | The seven `arcodange.factory.*` roles | ✅ | diff --git a/vibe/guidebooks/factory-provisioning/ansible/inventory.md b/vibe/guidebooks/factory-provisioning/ansible/inventory.md new file mode 100644 index 0000000..d997498 --- /dev/null +++ b/vibe/guidebooks/factory-provisioning/ansible/inventory.md @@ -0,0 +1,111 @@ +[vibe](../../../README.md) > [Guidebooks](../../README.md) > [Factory provisioning](../README.md) > [Ansible](README.md) > **Inventory & variables** + +# Inventory & variables + +> [!NOTE] +> **Status:** ✅ active · **Last Updated:** 2026-06-23 +> **Upstream:** [Ansible sub-hub](README.md) · [Lab ecosystem · 01 factory](../../lab-ecosystem/01-factory.md) +> **Downstream:** [Roles reference](roles.md) +> **Related:** [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) · [Storage & recovery](../../lab-ecosystem/storage-and-recovery.md) · [Naming conventions](../../lab-ecosystem/naming-conventions.md) · [ADR-0001 safe prod-like environment](../../../ADR/0001-safe-prod-like-environment.md) · [PRD · isolation boundary](../../../PRD/safe-prod-like-environment/isolation-boundary.md) + +The inventory is the single source of truth for **which machines exist** and **which service each machine runs**. 
It is a directory inventory — [`inventory/hosts.yml`](../../../../ansible/arcodange/factory/inventory/hosts.yml) plus a layered [`group_vars/`](../../../../ansible/arcodange/factory/inventory/group_vars) tree — passed to every playbook with `-i ansible/arcodange/factory/inventory`. + +> [!IMPORTANT] +> This inventory describes **live production**. The three IPs `192.168.1.201-203` are the real Pis that run the public CMS, the Dolibarr ERP, and business email. A playbook pointed at this inventory mutates prod. The safe-environment work treats this file as the prod blast-radius and requires a **separate sandbox inventory + a prod-IP guard** before any sandbox apply — see the [ADR-0001](../../../ADR/0001-safe-prod-like-environment.md) and the first row of the [PRD isolation boundary](../../../PRD/safe-prod-like-environment/isolation-boundary.md). + +--- + +## Hosts + +Defined in [`inventory/hosts.yml`](../../../../ansible/arcodange/factory/inventory/hosts.yml). Three physical Pis are each reachable two ways — over the LAN (the canonical path) and through an internet port-forward managed at the firewall — plus the control node as `localhost`. + +| Host | `ansible_host` | `preferred_ip` | Port | Reach | +| --- | --- | --- | --- | --- | +| `pi1` | `pi1.home` | `192.168.1.201` | 22 | LAN | +| `pi2` | `pi2.home` | `192.168.1.202` | 22 | LAN | +| `pi3` | `pi3.home` | `192.168.1.203` | 22 | LAN | +| `internetPi1` | `rg-evry.changeip.co` | — | `51022` | WAN port-forward → `pi1` | +| `internetPi2` | `rg-evry.changeip.co` | — | `52022` | WAN port-forward → `pi2` | +| `internetPi3` | `rg-evry.changeip.co` | — | `53022` | WAN port-forward → `pi3` | +| `localhost` | (local connection) | — | — | control node | + +> [!NOTE] +> The `internetPiN` entries share one DNS name (`rg-evry.changeip.co`) and differ only by SSH port (`5N022`). 
The hosts file documents the choice of `changeip.co` over `arcodange.duckdns.org`: changeip is **managed directly with the firewall** rather than depending on a DuckDNS registry update, so the forward is stable. `preferred_ip` is a custom hostvar (not a connection variable) — roles read it to build DNS records, the Gitea SSH domain, and the Pi-hole local-DNS table. + +--- + +## Groups + +Groups map machines to roles. The membership is small and deliberate; read the table as "this service runs on these hosts". + +| Group | Members | Defined as | What it is for | +| --- | --- | --- | --- | +| `raspberries` | `pi1`, `pi2`, `pi3` + `internetPi1-3` | explicit hosts | Every Pi, LAN and WAN handles. Carries the shared `ansible_user: pi`. | +| `local` | `localhost`, `pi1`, `pi2`, `pi3` | explicit hosts | The control-node-facing group; `localhost` runs `kubectl`/`tofu`/`docker` tasks that talk to the cluster. | +| `postgres` | `pi2` | explicit host | The single PostgreSQL node. `pi2` is the database host. | +| `gitea` | `pi2` (via `children: postgres`) | child of `postgres` | Gitea co-locates with its database, so the group simply inherits `postgres`. `groups.gitea[0]` resolves to `pi2` everywhere. | +| `pihole` | `pi1`, `pi3` | explicit hosts | The HA DNS pair (Pi-hole + Gravity Sync). | +| `step_ca` | `pi1`, `pi2`, `pi3` | explicit hosts | Every Pi runs a step-ca node (primary `pi1`, standbys `pi2`/`pi3`). | +| `all` | everything (`children: raspberries`) | implicit + child | Ansible's universal group; `group_vars/all/` applies to all hosts. | + +> [!TIP] +> Because `gitea` is a **child of `postgres`** and `postgres` has exactly one host, every reference to `groups.gitea[0]` (the Gitea container, the API base URL `http://{{ groups.gitea[0] }}:3000`, the SSH domain) points at `pi2`. Move Postgres and Gitea follows automatically. 
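
The WAN handles in the Hosts table follow a fixed `5N022` port pattern. As a small illustration (the helper name `wan_ssh_port` is hypothetical and does not exist in the repository), the port for a given Pi number could be derived like this:

```sh
# Hypothetical helper: derive the WAN SSH port for Pi number N (1-3)
# from the 5N022 pattern documented in the hosts file.
wan_ssh_port() {
  case "$1" in
    1|2|3) echo "5${1}022" ;;
    *) echo "unknown Pi number: $1" >&2; return 1 ;;
  esac
}

# Example: print the ssh command for the WAN handle of pi2
# (user `pi` and the DNS name come from the inventory described above).
echo "ssh -p $(wan_ssh_port 2) pi@rg-evry.changeip.co"
```

Rejecting anything outside 1-3 keeps the sketch aligned with the three forwards listed in the table rather than guessing at ports that may not be open.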
+ +--- + +## Connection variables + +| Variable | Where set | Value / effect | +| --- | --- | --- | +| `ansible_user` | `raspberries.vars` | `pi` — the SSH login on every Pi. | +| `ansible_ssh_extra_args` | per-host (`pi1`/`pi2`/`pi3`) | `-o StrictHostKeyChecking=no` — Pis get reimaged, so host-key churn is expected; the check is disabled rather than forcing `known_hosts` edits. | +| `ansible_port` | `internetPiN` | `51022` / `52022` / `53022` — the firewall's per-Pi SSH forwards. | +| `ansible_connection` | `localhost` | `local` — run on the control node, no SSH. | +| `ansible_python_interpreter` | `localhost` | `"{{ ansible_playbook_python }}"` — uses the `uv`-managed venv's Python, no hardcoded path. | + +The control-node tooling chain (`scp_if_ssh = True`) is set in [`ansible.cfg`](../../../../ansible/arcodange/factory/ansible.cfg); the `collections_path` lives there too. + +--- + +## `group_vars/` layering + +Variables are split by group so each service owns its own file. The path `group_vars/<group>/<file>.yml` is auto-loaded for every host in `<group>`. + +| File | Scope | Declares | +| --- | --- | --- | +| [`all/common.yml`](../../../../ansible/arcodange/factory/inventory/group_vars/all/common.yml) | all hosts | `user_home` — the control user's `$HOME`, looked up from the environment. | +| [`all/ssh.yml`](../../../../ansible/arcodange/factory/inventory/group_vars/all/ssh.yml) | all hosts | SSH-public-key discovery: `first_found` over `id_ed25519_arcodange.pub` → `id_ed25519.pub` → `id_rsa.pub`, then splits the file into `ssh_public_key`, `ssh_key_title`, `ssh_key_algorithm`. Roles push this key to authorized hosts. | +| [`all/gitea.yml`](../../../../ansible/arcodange/factory/inventory/group_vars/all/gitea.yml) | all hosts | `gitea_secret_propagation_users: [arcodange]` — user namespaces that must also receive org-level Gitea Action secrets (see the [`gitea_secret`](roles.md) role).
| +| [`gitea/gitea.yml`](../../../../ansible/arcodange/factory/inventory/group_vars/gitea/gitea.yml) | `gitea` | `gitea_version: 1.25.5`, the `gitea_database` triple, and the full Gitea Docker Compose: Postgres backend (`postgres:5432`), the `smtps`/orange.fr mailer, SSH on `2222:22`, `ROOT_URL https://gitea.arcodange.lab/`, registration disabled. SSH domain is built from `hostvars[groups.gitea[0]].preferred_ip`. | +| [`gitea/gitea_vault.yml`](../../../../ansible/arcodange/factory/inventory/group_vars/gitea/gitea_vault.yml) | `gitea` | **VAULTED.** The `gitea_vault.*` map — `GITEA__mailer__PASSWD` (consumed by the compose above) plus the `github_api_token` / `gitlab_api_token` read by the mirror roles. | +| [`postgres/postgres.yml`](../../../../ansible/arcodange/factory/inventory/group_vars/postgres/postgres.yml) | `postgres` | The Postgres Docker Compose — `postgres:16.3-alpine`, `5432:5432`, data under `/home/pi/arcodange/docker_composes/postgres/data` — plus the `pgbouncer` auth-user block. | +| [`step_ca/step_ca.yml`](../../../../ansible/arcodange/factory/inventory/group_vars/step_ca/step_ca.yml) | `step_ca` | `step_ca_primary: pi1`, `step_ca_fqdn: ssl-ca.arcodange.lab`, the `step` user/home/dir, and `step_ca_listen_address: ":8443"`. | +| [`step_ca/step_ca_vault.yml`](../../../../ansible/arcodange/factory/inventory/group_vars/step_ca/step_ca_vault.yml) | `step_ca` | **VAULTED.** `vault_step_ca_password` (the CA root password) and `vault_step_ca_jwk_password` (the cert-manager JWK provisioner password). | + +> [!NOTE] +> Encrypted files are conventionally suffixed `_vault.yml`. They are normal `group_vars` files whose **contents** are `ansible-vault`-encrypted; non-vault siblings hold the plaintext structure that references the vaulted keys (e.g. `gitea/gitea.yml` interpolates `gitea_vault.GITEA__mailer__PASSWD`). + +--- + +## The vault model + +Two distinct mechanisms share the word "vault" here — keep them apart: + +1. 
**`ansible-vault`** encrypts the `*_vault.yml` files at rest in git (AES256). Decryption happens transparently at playbook runtime. +2. **The vault password itself is never on disk.** `ANSIBLE_VAULT_PASSWORD_FILE` points at a tiny executable that fetches the password from the K8s secret `arcodange-ansible-vault` in the `kube-system` namespace: + +```sh +kubectl get secret -n kube-system arcodange-ansible-vault \ + --template='{{index .data.pass | base64decode}}' +``` + +So decrypting any `*_vault.yml` requires `kubectl` access to the live cluster — the cluster *is* the key custodian. The setup recipe (and the `kubectl create secret` to seed it) lives in [`ansible/README.md`](../../../../ansible/README.md); how this fits the broader secret hierarchy is in [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md). + +> [!CAUTION] +> This is **not** HashiCorp Vault. HashiCorp Vault (`vault.arcodange.lab`) is a separate, cluster-resident service installed by the [`hashicorp_vault`](roles.md) role in the `04 · Tools` stage. The `arcodange-ansible-vault` K8s secret only holds the `ansible-vault` password and is also read by the Gitea CI runners for the mailer. + +--- + +## Why this page matters for safe-prod + +The variables above bind Ansible directly to live infrastructure: the host IPs, the prod Vault address, the prod Postgres superuser, and the prod Gitea forge. The safe-environment design maps each of these to a sandbox control — a parallel `inventory/sandbox/hosts.yml` with VM/cloud hosts, a pre-task guard that aborts on any `192.168.1.201-203` target unless `i_mean_prod=true`, and per-service overrides — detailed in the [PRD isolation boundary](../../../PRD/safe-prod-like-environment/isolation-boundary.md). Until that lands, **assume every run is a prod run**. 
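
The prod-IP guard mentioned in the last section could look roughly like the shell sketch below. It is only an illustration: the function name `guard_target` and the way the override is passed are assumptions, and the real guard is expected to be an Ansible pre-task as the PRD describes.

```sh
# Hypothetical sketch of the prod-IP guard; not part of the repository.
# guard_target IP [I_MEAN_PROD]
# Returns 0 when the target may be touched, 1 when the run must abort.
guard_target() {
  local ip="$1" i_mean_prod="${2:-false}"
  case "$ip" in
    192.168.1.201|192.168.1.202|192.168.1.203)
      # One of the three production Pis: only proceed on explicit opt-in.
      if [ "$i_mean_prod" = "true" ]; then
        return 0
      fi
      echo "refusing prod target $ip (set i_mean_prod=true to override)" >&2
      return 1
      ;;
    *)
      # Anything else is treated as a sandbox target in this sketch.
      return 0
      ;;
  esac
}
```

A caller would run something like `guard_target "$target_ip" "$i_mean_prod" || exit 1` before applying. Because the inventory addresses the Pis by name (`pi1.home`) as well as by IP, a real guard would probably also need to check `preferred_ip` or the resolved address; that part is left out here.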
diff --git a/vibe/guidebooks/factory-provisioning/ansible/roles.md b/vibe/guidebooks/factory-provisioning/ansible/roles.md new file mode 100644 index 0000000..b334e83 --- /dev/null +++ b/vibe/guidebooks/factory-provisioning/ansible/roles.md @@ -0,0 +1,186 @@ +[vibe](../../../README.md) > [Guidebooks](../../README.md) > [Factory provisioning](../README.md) > [Ansible](README.md) > **Roles reference** + +# Roles reference + +> [!NOTE] +> **Status:** ✅ active · **Last Updated:** 2026-06-23 +> **Upstream:** [Ansible sub-hub](README.md) · [Lab ecosystem · 01 factory](../../lab-ecosystem/01-factory.md) +> **Downstream:** [Inventory & variables](inventory.md) +> **Related:** [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) · [Storage & recovery](../../lab-ecosystem/storage-and-recovery.md) · [Naming conventions](../../lab-ecosystem/naming-conventions.md) · [ADR-0001 safe prod-like environment](../../../ADR/0001-safe-prod-like-environment.md) + +Roles live in two places, by reuse scope: + +- **Shared roles** — reusable across stages — live in [`ansible/arcodange/factory/roles/`](../../../../ansible/arcodange/factory/roles) and are referenced by FQCN `arcodange.factory.<role>`. +- **Nested roles** — owned by one playbook stage — live under [`playbooks/<stage>/roles/`](../../../../ansible/arcodange/factory/playbooks) and are auto-discovered by that stage's playbook. + +This page is split by **altitude**. Tier 1 covers the heavyweight platform-service roles (one subsection each); Tier 2 is a single table of the smaller building-block roles. + +--- + +## Tier 1 — platform-service roles + +### `hashicorp_vault` + +[`playbooks/tools/roles/hashicorp_vault`](../../../../ansible/arcodange/factory/playbooks/tools/roles/hashicorp_vault) · runs on `localhost` in the `04 · Tools` stage. It initializes and unseals the cluster Vault and wires Gitea as an OIDC provider so CI jobs can authenticate to Vault.
+ +The [`tasks/main.yml`](../../../../ansible/arcodange/factory/playbooks/tools/roles/hashicorp_vault/tasks/main.yml) flow is: + +1. **Init** ([`init.yml`](../../../../ansible/arcodange/factory/playbooks/tools/roles/hashicorp_vault/tasks/init.yml)) — first run only. Lists the Vault server pods in the `tools` namespace, checks `vault operator init -status`, and if uninitialized runs `vault operator init` with **`key-shares=1`, `key-threshold=1`** (defaults from [`defaults/main.yml`](../../../../ansible/arcodange/factory/playbooks/tools/roles/hashicorp_vault/defaults/main.yml)). The JSON output — unseal keys + initial root token — is written to `~/.arcodange/cluster-keys.json` (dir `0700`, file `0600`). +2. **Unseal** ([`unseal.yml`](../../../../ansible/arcodange/factory/playbooks/tools/roles/hashicorp_vault/tasks/unseal.yml)) — required after every reboot. Reads the keys file and runs `vault operator unseal` for each server, then revokes the *initial* root token (idempotent — tolerates an already-revoked token). +3. **Generate a fresh root token** ([`new_root_token.yml`](../../../../ansible/arcodange/factory/playbooks/tools/roles/hashicorp_vault/tasks/new_root_token.yml)) — runs the `generate-root` OTP/nonce dance using the unseal keys to mint a short-lived `vault_root_token`. +4. 
**Set up Gitea OIDC** ([`gitea_oidc_auth.yml`](../../../../ansible/arcodange/factory/playbooks/tools/roles/hashicorp_vault/tasks/gitea_oidc_auth.yml)) — drives Gitea through the bundled [`playwright_setupGiteaApp.js`](../../../../ansible/arcodange/factory/playbooks/tools/roles/hashicorp_vault/files/playwright_setupGiteaApp.js) (via the [`playwright`](#tier-2--building-block-roles) role) to create an OAuth2 app, then applies the bundled OpenTofu [`hashicorp_vault.tf`](../../../../ansible/arcodange/factory/playbooks/tools/roles/hashicorp_vault/files/hashicorp_vault.tf) inside a disposable `ghcr.io/opentofu/opentofu` container (state on a throwaway docker volume) to provision the Vault JWT/OIDC backend. Finally it renders [`oidc_jwt_token.sh.j2`](../../../../ansible/arcodange/factory/playbooks/tools/roles/hashicorp_vault/templates/oidc_jwt_token.sh.j2) into the Gitea Actions secret **`vault_oauth__sh_b64`** (base64) at **org** scope, then propagates the same secret to each user in `gitea_secret_propagation_users` (Action secrets are per-owner, so user-owned repos can't read org secrets). +5. **Revoke the temp root token** — the `always` block of `main.yml` revokes `vault_root_token` no matter how step 4 ended, so no long-lived root token survives the run. + +| Var | Default | Meaning | +| --- | --- | --- | +| `vault_unseal_keys_path` | `~/.arcodange/cluster-keys.json` | Where unseal keys + root token are stored. | +| `vault_unseal_keys_shares` / `_key_threshold` | `1` / `1` | Single-key seal (lab posture; `threshold <= shares`). | +| `vault_address` | `https://vault.arcodange.lab` | The cluster Vault endpoint. | +| `gitea_admin_user` / `gitea_admin_password` | `arcodange@gmail.com` / (prompted) | Credentials Playwright uses to create the OAuth app. | +| `vault_oidc_force_reset` | `false` | When `true`, `vault auth disable gitea` + `gitea_jwt` before re-applying. 
| + +> [!CAUTION] +> `vault_oidc_force_reset=true` is **destructive**: it disables and wipes **all** `gitea_cicd_*` per-app JWT roles created by the bundled tofu, every run. Default is off. Likewise, losing `~/.arcodange/cluster-keys.json` means the Vault can never be unsealed again — that file is the single point of failure for the whole secret plane (see [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md)). + +### `step_ca` + +[`playbooks/ssl/roles/step_ca`](../../../../ansible/arcodange/factory/playbooks/ssl/roles/step_ca) · runs on the `step_ca` group (all three Pis) in the `01 · System` stage via [`ssl/step-ca.yml`](../../../../ansible/arcodange/factory/playbooks/ssl/step-ca.yml). It is the lab's internal ACME/CA for `*.arcodange.lab` certificates, run **active/standby**: primary `pi1`, replicas `pi2`/`pi3`. The [`tasks/main.yml`](../../../../ansible/arcodange/factory/playbooks/ssl/roles/step_ca/tasks/main.yml) imports five task files in order: + +1. **install** — install the `step` / `step-ca` binaries. +2. **init** ([`init.yml`](../../../../ansible/arcodange/factory/playbooks/ssl/roles/step_ca/tasks/init.yml)) — primary only. `step ca init` (non-interactive, password file) with `creates:` guard so it is idempotent. The CA name is `Arcodange Lab CA`, DNS `ssl-ca.arcodange.lab`, listen `:8443`. +3. **sync** ([`sync.yml`](../../../../ansible/arcodange/factory/playbooks/ssl/roles/step_ca/tasks/sync.yml)) — replicates the CA from primary to standbys. It takes a **lockfile** on the primary (`.sync.lock`), computes a deterministic `tar | sha256sum` **checksum** of `~/.step`, compares it to the last checksum cached on the controller, and only `rsync`s (pull → controller → push to standbys) when the checksum changed. This is how the standbys hold an identical CA without a shared filesystem. +4. **systemd** — install/enable the `step-ca` unit (the `restart step-ca` handler fires on cert/config change). +5. 
**provisioners** ([`provisioners.yml`](../../../../ansible/arcodange/factory/playbooks/ssl/roles/step_ca/tasks/provisioners.yml)) — primary only. Ensures a **JWK provisioner named `cert-manager`** exists: lists provisioners, generates the JWK keypair (`creates:` guard) under `~/.step/provisioners/`, and `step ca provisioner add`s it. This is what lets in-cluster cert-manager request certs from the CA. + +| Var | Default | Meaning | +| --- | --- | --- | +| `step_ca_primary` | `pi1` | The writable CA node; standbys sync from it. | +| `step_ca_fqdn` | `ssl-ca.arcodange.lab` | CA DNS name; URL is `https://{fqdn}:8443`. | +| `step_ca_provisioner_name` / `_type` | `cert-manager` / `JWK` | The cert-manager provisioner. | +| `step_ca_force_reinit` | `false` | When `true`, stops the service and **wipes `~/.step`** before re-init. | + +| Secret | Source | +| --- | --- | +| `vault_step_ca_password` | CA root password — from vaulted [`step_ca/step_ca_vault.yml`](../../../../ansible/arcodange/factory/inventory/group_vars/step_ca/step_ca_vault.yml). | +| `vault_step_ca_jwk_password` | cert-manager JWK provisioner password — same vaulted file. | + +> [!CAUTION] +> `step_ca_force_reinit=true` **wipes the entire CA** (`~/.step`) on the primary and re-issues a new root — every previously issued `*.arcodange.lab` cert immediately becomes untrusted until clients reload the new root. Use only for a deliberate PKI rebuild. + +### `crowdsec` + +[`playbooks/tools/roles/crowdsec`](../../../../ansible/arcodange/factory/playbooks/tools/roles/crowdsec) · runs on `localhost` in the `04 · Tools` stage. It wires CrowdSec's decisions into Traefik as a bouncer middleware with a Turnstile CAPTCHA. The [`tasks/main.yml`](../../../../ansible/arcodange/factory/playbooks/tools/roles/crowdsec/tasks/main.yml) flow: + +1. 
**Vault → K8s secret plumbing** — creates a `ServiceAccount` (`factory-ansible-tool-crowdsec-traefik-plugin`), a `VaultAuth` (kubernetes auth, role `factory_crowdsec_conf`), and a `VaultStaticSecret` that reads **`kvv2/cms/factory/turnstile`** into a K8s secret (`refreshAfter: 30s`). The Turnstile sitekey/secret come from there. +2. **Bouncer key** — finds the CrowdSec LAPI pod in `tools` and runs `cscli bouncers add traefik-plugin` (deletes + re-adds on conflict) to obtain the bouncer API key. +3. **CAPTCHA HTML** — `inject_captcha_html.yml` pushes `captcha.html` into the Traefik PVC; this task is **tagged `never`** (opt-in only) so the default run skips it. +4. **Traefik Middleware** — applies a `traefik.io/v1alpha1` `Middleware` named **`crowdsec-bouncer`** (`crowdsec` in `kube-system`) configured with the bouncer key, stream mode, Turnstile (`captchaProvider: turnstile` + site/secret keys), and a **Redis cache at `redis.tools:6379`**. +5. **Restart Traefik** — scales the Traefik Deployment to 0 then back to 1 (with a `rescue`/`always` guard guaranteeing it scales back up) to load the new middleware. + +| Var | Default | Meaning | +| --- | --- | --- | +| `traefik_pvc_name` | `traefik` | The PVC the (tagged-`never`) captcha.html inject targets. | + +| Secret | Source | +| --- | --- | +| Turnstile sitekey + secret | Vault `kvv2/cms/factory/turnstile`, surfaced via `VaultStaticSecret`. | +| Bouncer API key | Minted at runtime by `cscli bouncers add`. | + +### `pihole` + +[`playbooks/dns/roles/pihole`](../../../../ansible/arcodange/factory/playbooks/dns/roles/pihole) · runs on the `pihole` group (`pi1`, `pi3`) in the `01 · System` stage. It configures **HA DNS**: two Pi-hole nodes kept in sync. The [`tasks/main.yml`](../../../../ansible/arcodange/factory/playbooks/dns/roles/pihole/tasks/main.yml) includes three task files: + +1. 
**`ha_pihole_setup.yml`** — **waits for a manual Pi-hole install** (it prints the `curl … | sudo bash` command and `wait_for`s `/etc/pihole/pihole-FTL.db` for up to 10 minutes; Pi-hole itself is not installed by Ansible). It then patches [`pihole.toml`](../../../../ansible/arcodange/factory/playbooks/dns/roles/pihole/tasks/ha_pihole_setup.yml) (listen port, `listeningMode = "ALL"`, enable `/etc/dnsmasq.d`) and writes three dnsmasq drop-ins: `10-custom-rules.conf` (wildcard `address=/fqdn/ip` from `pihole_custom_dns`), `20-rpis.conf` (`.home` → `preferred_ip` for every Pi), and `99-upstream.conf` (explicit upstream from `pihole_upstream_dns`). +2. **`gravity_setup.yml`** — sets up **Gravity Sync** between the two nodes: a `pihole_gravity` system user with a freshly **rotated ed25519 keypair** each run, cross-authorized `authorized_keys`, full **sudo** (`/etc/sudoers.d/gravity-sync`), the installer, and a generated `gravity-sync.conf` (each node points `REMOTE_HOST` at the other), then runs the sync. +3. **`client_setup.yml`** — points DNS clients at the Pi-hole pair by editing `/etc/resolv.conf` (insert nameservers after `search`) and the active NetworkManager connections via `nmcli` (per-interface `ipv4.dns` + `dns-priority`, eth0 50 / wlan0 100). + +| Var | Default | Meaning | +| --- | --- | --- | +| `pihole_primary` | `pi1` | First node; the other is derived as the secondary. | +| `pihole_ports` | `8081o,443os,…` | Web-interface listen ports. | +| `pihole_custom_dns` | `{}` | FQDN→IP wildcard records (validated as IPv4). | +| `pihole_upstream_dns` | `[8.8.8.8, 1.1.1.1, 8.8.4.4]` | Explicit upstreams (avoids DHCP-provided DNS). | + +> [!WARNING] +> This role is **not fully idempotent**: it depends on a human running the Pi-hole installer first, it **rotates the gravity SSH key on every run**, and it grants the `pihole_gravity` user passwordless **sudo ALL**. Treat reruns as state-changing, not no-ops. 
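
To make the `10-custom-rules.conf` drop-in from step 1 concrete, here is a minimal shell sketch of rendering `address=/fqdn/ip` lines from FQDN/IP pairs. The function name `render_custom_rules`, the `fqdn=ip` input format, and the sample records are assumptions for illustration; the role itself renders the file from `pihole_custom_dns` in Ansible.

```sh
# Hypothetical sketch: turn "fqdn=ip" pairs into dnsmasq address= lines,
# skipping any value that is not shaped like a dotted-quad IPv4 address.
render_custom_rules() {
  local pair fqdn ip
  for pair in "$@"; do
    fqdn="${pair%%=*}"   # text before the first '='
    ip="${pair#*=}"      # text after the first '='
    if printf '%s\n' "$ip" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'; then
      printf 'address=/%s/%s\n' "$fqdn" "$ip"
    else
      echo "skipping $fqdn: '$ip' is not an IPv4 address" >&2
    fi
  done
}

# Sample records only (documentation address range, not lab values).
render_custom_rules "example.lab=192.0.2.10" "broken.lab=not-an-ip"
```

In dnsmasq, an `address=/domain/ip` line answers for the domain and everything beneath it, which is why the role can describe these records as wildcards.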
+ +### `deploy_docker_compose` + +[`roles/deploy_docker_compose`](../../../../ansible/arcodange/factory/roles/deploy_docker_compose) · shared. This is the **generic compose mechanism** every app deploy builds on. The caller passes a `dockercompose_content` dict; the [`tasks/main.yml`](../../../../ansible/arcodange/factory/roles/deploy_docker_compose/tasks/main.yml): + +1. Derives `app_name` from `dockercompose_content.name` and creates `////` plus `data/` and `scripts/`. +2. Writes the compose file with `to_nice_yaml` and **validates it** with `validate: 'docker compose -f %s config'` — a bad compose fails the task before anything is written live. +3. Writes a small wrapper script `scripts/docker-compose` that runs `docker compose -f "$@"`, so the app can be driven without remembering the path. + +| Var | Default | Meaning | +| --- | --- | --- | +| `app_name` | `(dockercompose_content.name)` | App directory name. | +| `app_owner` / `app_group` | `pi` / `docker` | File ownership. | +| `root_path` | `/home/pi/arcodange` | Base path; `partition` (`docker_composes`) nests under it. | + +--- + +## Tier 2 — building-block roles + +Smaller roles, mostly Gitea/forge plumbing and one-shot helpers. Shared roles live in [`roles/`](../../../../ansible/arcodange/factory/roles); `deploy_gitea`/`deploy_postgresql` are nested under [`playbooks/setup/roles/`](../../../../ansible/arcodange/factory/playbooks/setup/roles). + +| Role | Purpose | Key vars / notes | Secrets | +| --- | --- | --- | --- | +| [`gitea_repo`](../../../../ansible/arcodange/factory/roles/gitea_repo) | Ensure a repo exists across Gitea + GitHub + GitLab and add **8h push mirrors** (`sync_on_commit: true`) to GitHub/GitLab. | Creates missing repos on each forge; mirror URLs + namespace IDs in [`vars/main.yml`](../../../../ansible/arcodange/factory/roles/gitea_repo/vars/main.yml). | `github_api_token`, `gitlab_api_token` (from `gitea_vault`). 
| +| [`gitea_token`](../../../../ansible/arcodange/factory/roles/gitea_token) | Generate / replace / delete a Gitea access token via `docker exec … gitea admin user generate-access-token`. | Stores the raw token in the fact named by `gitea_token_fact_name`; `gitea_token_replace` / `gitea_token_delete` toggles; scopes default to `write:admin,organization,package,repository,user`. | The minted token itself (a fact, not persisted). | +| [`gitea_secret`](../../../../ansible/arcodange/factory/roles/gitea_secret) | `PUT` a Gitea **Actions secret** at user or org scope. | `gitea_secret_name` / `_value`; `gitea_owner_type` (`user`\|`org`) selects the API path. | `gitea_api_token` (Authorization). | +| [`gitea_sync`](../../../../ansible/arcodange/factory/roles/gitea_sync) | List repos on all **three forges**, diff them, and call `gitea_repo` for the repos missing somewhere. | Computes `repos_incomplete = all − common`; loops `gitea_repo` over the gaps. | GitHub/GitLab/Gitea API tokens. | +| [`traefik_certs`](../../../../ansible/arcodange/factory/roles/traefik_certs) | Extract the live **`*.arcodange.lab`** cert from Traefik's `acme.json`. | `kubectl exec` into Traefik → `jq` the LetsEncrypt wildcard cert → `traefik_cert_pem` fact; no-op if already set. | — (reads in-cluster acme.json). | +| [`playwright`](../../../../ansible/arcodange/factory/roles/playwright) | Run a Playwright browser-automation script in Docker. | Builds `playwright:` (default `1.47.0`) from `files/`, runs the script with `playwright_env` injected as `-e`; default script `loginGitea.js`. Used by `hashicorp_vault` for the OIDC app setup. | Script-specific env (e.g. Gitea admin creds). | +| [`deploy_gitea`](../../../../ansible/arcodange/factory/playbooks/setup/roles/deploy_gitea) | Deploy Gitea: template [`app.ini.j2`](../../../../ansible/arcodange/factory/playbooks/setup/roles/deploy_gitea/tasks/main.yml), `docker compose up`, then **health-check `:3000`** until ready. 
| Compose source is `/home/pi/arcodange/docker_composes/gitea`; admin user `arcodange`. | (consumes the vaulted Gitea compose env). | +| [`deploy_postgresql`](../../../../ansible/arcodange/factory/playbooks/setup/roles/deploy_postgresql) | Deploy Postgres via compose, then per-app **create DB + user** ([`create_db_and_user.yml`](../../../../ansible/arcodange/factory/playbooks/setup/roles/deploy_postgresql/tasks/create_db_and_user.yml)). | Waits on `pg_isready`, loops `applications_databases` (`{app: {db_name, db_user, db_password}}`). | Per-app DB passwords from `applications_databases`. | + +--- + +## Role dependency view + +How the roles relate: shared building blocks feed the `setup`-stage app deploys, and a few platform-service roles include shared roles directly. + +```mermaid +%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'14px'}}}%% +flowchart TD + classDef shared fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb; + classDef setup fill:#1e4620,stroke:#22c55e,color:#f9fafb; + classDef platform fill:#4a2c1e,stroke:#f59e0b,color:#f9fafb; + + dc["deploy_docker_compose<br/>generic compose writer"]:::shared + pw["playwright<br/>browser automation"]:::shared + gt["gitea_token<br/>mint access token"]:::shared + gs["gitea_secret<br/>PUT Actions secret"]:::shared + gr["gitea_repo<br/>mirror to GitHub/GitLab"]:::shared + gsync["gitea_sync<br/>diff 3 forges"]:::shared + tc["traefik_certs<br/>extract lab cert"]:::shared + + dpg["deploy_postgresql"]:::setup + dgi["deploy_gitea"]:::setup + + hv["hashicorp_vault"]:::platform + sca["step_ca"]:::platform + cs["crowdsec"]:::platform + ph["pihole"]:::platform + + gsync --> gr + hv --> pw + hv --> gs + dc -. "used by app deploys" .-> dpg + dc -. "used by app deploys" .-> dgi +``` + +1. **`gitea_sync` → `gitea_repo`** — the sync role include-loops `gitea_repo` for each repo missing from one of the three forges. +2. **`hashicorp_vault` → `playwright`** — Vault's OIDC setup drives Gitea through Playwright to create the OAuth app. +3. **`hashicorp_vault` → `gitea_secret`** — the rendered `vault_oauth__sh_b64` is published as a Gitea Actions secret at org and user scope. +4. **`deploy_docker_compose` → `deploy_postgresql` / `deploy_gitea`** — the generic compose writer is the substrate the `setup`-stage app deploys lean on. +5. **`step_ca`, `crowdsec`, `pihole`** stand alone — they configure their own services (PKI, WAF, DNS) without including other roles. + +--- + +## See also + +- [Inventory & variables](inventory.md) — the groups (`gitea`, `postgres`, `step_ca`, `pihole`) these roles target, and the vaulted `group_vars` they read. +- [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) — where `hashicorp_vault`'s OIDC tokens and the `kvv2/cms/factory/turnstile` path fit the broader secret model. +- [Storage & recovery](../../lab-ecosystem/storage-and-recovery.md) — how the compose `data/` dirs and the step-ca state relate to backup and disaster recovery.
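
The checksum-gated sync used by `step_ca` (the `sync` step in its section above) can be sketched in shell as follows. This is an illustration under assumptions: the function names and the cache-file argument are hypothetical, and the digest is built with `find` + `sha256sum` as a stand-in for the role's `tar | sha256sum` pipeline.

```sh
# dir_digest DIR: one digest over the relative paths and contents of all files.
dir_digest() {
  (
    cd "$1" || exit 1
    # Sort the file list so the digest does not depend on directory order.
    find . -type f -print0 | sort -z | xargs -0 -r sha256sum | sha256sum | cut -d' ' -f1
  )
}

# sync_if_changed DIR CACHE_FILE: compare the current digest with the cached
# one, print "changed" or "unchanged", and record the new digest on change.
sync_if_changed() {
  local dir="$1" cache="$2" current previous=""
  current="$(dir_digest "$dir")" || return 2
  [ -f "$cache" ] && previous="$(cat "$cache")"
  if [ "$current" = "$previous" ]; then
    echo "unchanged"
    return 0
  fi
  # A real run would rsync here (pull from the primary, push to the standbys).
  printf '%s\n' "$current" > "$cache"
  echo "changed"
}
```

Calling `sync_if_changed ~/.step /tmp/step.sha256` twice in a row would report `changed` and then `unchanged`, which is the property that lets repeated playbook runs skip the copy when the CA state has not moved.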
diff --git a/vibe/guidebooks/factory-provisioning/opentofu/README.md b/vibe/guidebooks/factory-provisioning/opentofu/README.md new file mode 100644 index 0000000..9362cd6 --- /dev/null +++ b/vibe/guidebooks/factory-provisioning/opentofu/README.md @@ -0,0 +1,95 @@ +[vibe](../../../README.md) > [Guidebooks](../../README.md) > [Factory provisioning](../README.md) > **OpenTofu** + +# OpenTofu — factory provisioning + +> [!NOTE] +> **Status:** ✅ active · **Last Updated:** 2026-06-23 +> **Upstream:** [Factory provisioning hub](../README.md) · [Lab ecosystem · 01 factory](../../lab-ecosystem/01-factory.md) +> **Downstream:** [factory iac](factory-iac.md) · [postgres iac](postgres-iac.md) · [CI apply flow](ci-apply-flow.md) +> **Related:** [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) · [Storage & recovery](../../lab-ecosystem/storage-and-recovery.md) · [Naming conventions](../../lab-ecosystem/naming-conventions.md) · [ADR-0001 safe prod-like environment](../../../ADR/0001-safe-prod-like-environment.md) + +OpenTofu is the **declarative half** of the factory: it provisions everything that lives *outside* the K3s cluster — Gitea repos & CI users, Vault policies, Cloudflare DNS, OVH domains, a GCS backup bucket, and the in-cluster PostgreSQL roles/databases. The imperative half (the cluster itself) is built by [Ansible](../ansible/README.md). + +OpenTofu is pinned to **`1.8.2`** in CI (`OPENTOFU_VERSION`). + +--- + +## Two independent state roots + +There are **two separate Terraform/OpenTofu roots**, each with its own `backend.tf`, its own GCS state prefix, its own provider set, and its own CI workflow. They never share state and can be applied independently. 
+ +| Root | Code path | State backend (GCS) | Triggered by | +| --- | --- | --- | --- | +| **factory iac** | [`iac/`](../../../../iac) | `gs://arcodange-tf/factory/main` | changes under `iac/**` → [`.gitea/workflows/iac.yaml`](../../../../.gitea/workflows/iac.yaml) | +| **postgres iac** | [`postgres/iac/`](../../../../postgres/iac) | `gs://arcodange-tf/factory/postgres` | changes under `postgres/**` → [`.gitea/workflows/postgres.yaml`](../../../../.gitea/workflows/postgres.yaml) | + +> [!NOTE] +> Both roots share the same GCS **bucket** (`arcodange-tf`) but live under **distinct prefixes** (`factory/main` vs `factory/postgres`), so their state objects never collide. + +--- + +## Providers + +| Provider | Version | Endpoint / scope | Auth | +| --- | --- | --- | --- | +| `go-gitea/gitea` | `0.6.0` | `https://gitea.arcodange.lab` | `GITEA_TOKEN` env var | +| `vault` | `4.4.0` | `https://vault.arcodange.lab` | JWT login — mount `gitea_jwt`, role `gitea_cicd` | +| `google` | `7.0.1` | project `arcodange`, region `US-EAST1` | `GOOGLE_CREDENTIALS` (factory) / `GOOGLE_BACKEND_CREDENTIALS` (postgres backend) | +| `cloudflare/cloudflare` | `~> 5` | DNS / IAM | `CLOUDFLARE_API_TOKEN` env var | +| `ovh/ovh` | `2.8.0` | endpoint `ovh-eu` | `OVH_APPLICATION_KEY` / `OVH_APPLICATION_SECRET` / `OVH_CONSUMER_KEY` | +| `cyrilgdn/postgresql` | `1.24.0` | `192.168.1.202` (pi2), `superuser` | `POSTGRES_USERNAME` / `POSTGRES_PASSWORD` (TF vars) | + +The first five providers belong to the **factory iac** root ([`iac/providers.tf`](../../../../iac/providers.tf)); the **postgres iac** root ([`postgres/iac/providers.tf`](../../../../postgres/iac/providers.tf)) declares only `postgresql` + `vault`. Both roots configure the `vault` provider identically (JWT, mount `gitea_jwt`, role `gitea_cicd`). + +--- + +## The Vault-JWT auth model + +Neither root carries long-lived Vault credentials. Instead CI mints a short-lived Gitea OIDC token and exchanges it for Vault access: + +1. 
A first job decodes the base64 secret **`vault_oauth__sh_b64`** and runs it (`base64 -d | bash`), producing a **Gitea OIDC JWT** as a job output (`gitea_vault_jwt`). +2. That JWT is exported into the apply job as **`TERRAFORM_VAULT_AUTH_JWT`**. +3. The `vault` provider's `auth_login_jwt` block consumes it against mount `gitea_jwt` / role `gitea_cicd`, yielding a scoped Vault token used to read the per-provider secrets (Google creds, Gitea token, Cloudflare token, OVH app keys, Postgres creds). + +See [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) for the full Vault policy/mount design and [CI apply flow](ci-apply-flow.md) for the job-by-job walkthrough. + +--- + +## CI apply flow + +Both workflows share the same two-job shape: authenticate, then apply. The trigger paths differ (`iac/**` vs `postgres/**`) but the structure is identical. + +```mermaid +%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'14px'}}}%% +flowchart TD + classDef trigger fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb; + classDef job fill:#1e4620,stroke:#22c55e,color:#f0fdf4; + classDef danger fill:#5f1e1e,stroke:#ef4444,color:#fef2f2; + + push["push / PR touching<br/>iac/** or postgres/**"]:::trigger + auth["job: gitea_vault_auth<br/>decode vault_oauth__sh_b64<br/>mint Gitea OIDC JWT"]:::job + tofu["job: tofu<br/>read Vault secrets via JWT<br/>set provider env vars"]:::job + apply["dflook/terraform-apply@v1<br/>auto_approve: true"]:::danger + + push --> auth + auth -- "gitea_vault_jwt output" --> tofu + tofu --> apply +``` + +1. A **push or PR** that touches files under `iac/**` (factory) or `postgres/**` (postgres) starts the matching workflow; `workflow_dispatch` allows a manual run. +2. The **`gitea_vault_auth`** job decodes `vault_oauth__sh_b64` and emits the Gitea OIDC JWT as `gitea_vault_jwt`. +3. The **`tofu`** job (`needs: gitea_vault_auth`) sets `TERRAFORM_VAULT_AUTH_JWT` from that output, reads the provider secrets out of Vault, and prepares the homelab CA cert (`VAULT_CACERT`). +4. The job runs **`dflook/terraform-apply@v1`** against the root's `path` (`iac` or `postgres/iac`) with **`auto_approve: true`**. + +> [!CAUTION] +> **Applies are auto-approve.** There is no manual plan-review gate — once a change to `iac/**` or `postgres/**` lands on `main`, CI applies it to the real Gitea, Vault, Cloudflare, OVH, GCS, and PostgreSQL targets without further confirmation. Treat every merge as a production change and review the diff *before* merging, not after. This trade-off is recorded in [ADR-0001 · safe prod-like environment](../../../ADR/0001-safe-prod-like-environment.md). 
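Because a merge applies immediately, it can help to check which state root a change set will touch before merging. The following is a minimal sketch, not part of the repository: it assumes the simplified `iac/**` and `postgres/**` triggers summarized on this page (the workflows themselves match narrower `.tf` / `.tfvars` globs, so this over-approximates), and `roots_touched` is a hypothetical helper name.

```python
from __future__ import annotations

from typing import Iterable

# Path prefixes as summarized on this page (assumption: plain prefix match,
# which is broader than the real *.tf / *.tfvars workflow globs).
ROOT_PREFIXES = {
    "factory iac": "iac/",
    "postgres iac": "postgres/",
}


def roots_touched(changed_paths: Iterable[str]) -> set[str]:
    """Return the state roots whose workflow a change set could start."""
    touched: set[str] = set()
    for path in changed_paths:
        for root, prefix in ROOT_PREFIXES.items():
            if path.startswith(prefix):
                touched.add(root)
    return touched


if __name__ == "__main__":
    # Example change set: one factory file and one documentation file.
    print(roots_touched(["iac/cloudflare.tf", "vibe/README.md"]))
```

An empty result suggests that neither auto-approve workflow should start for that change set; anything else deserves a plan-level review before merging.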
+ +--- + +## Index + +| Page | Covers | State | +| --- | --- | --- | +| [factory iac](factory-iac.md) | `iac/` root — Gitea, Vault, Google/GCS backup, Cloudflare, OVH | ✅ | +| [postgres iac](postgres-iac.md) | `postgres/iac/` root — PostgreSQL roles & databases on pi2 | ✅ | +| [CI apply flow](ci-apply-flow.md) | Both Gitea workflows, the Vault-JWT exchange, auto-approve apply | ✅ | diff --git a/vibe/guidebooks/factory-provisioning/opentofu/ci-apply-flow.md b/vibe/guidebooks/factory-provisioning/opentofu/ci-apply-flow.md new file mode 100644 index 0000000..b521895 --- /dev/null +++ b/vibe/guidebooks/factory-provisioning/opentofu/ci-apply-flow.md @@ -0,0 +1,114 @@ +[vibe](../../../README.md) > [Guidebooks](../../README.md) > [Factory provisioning](../README.md) > [OpenTofu](README.md) > **CI apply flow** + +# CI apply flow + +> [!NOTE] +> **Status:** ✅ active · **Last Updated:** 2026-06-23 +> **Upstream:** [`.gitea/workflows/iac.yaml`](../../../../.gitea/workflows/iac.yaml), [`.gitea/workflows/postgres.yaml`](../../../../.gitea/workflows/postgres.yaml) +> **Downstream:** [factory iac](factory-iac.md), [postgres iac](postgres-iac.md) +> **Related:** [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) · [ADR-0001 · Safe prod-like environment](../../../ADR/0001-safe-prod-like-environment.md) · [QA strategy](../../../PRD/safe-prod-like-environment/qa-strategy.md) + +Two Gitea Actions workflows turn every commit that touches the OpenTofu code into a live `apply`. `IAC` ([`.gitea/workflows/iac.yaml`](../../../../.gitea/workflows/iac.yaml)) drives the factory infrastructure under [`iac/`](../../../../iac/); `Postgres` ([`.gitea/workflows/postgres.yaml`](../../../../.gitea/workflows/postgres.yaml)) drives the database stack under [`postgres/iac/`](../../../../postgres/). They share the same two-job shape: a short OIDC-auth job feeds a Vault JWT to a `tofu` job that reads secrets and runs `terraform apply`. 
+ +> [!CAUTION] +> **`auto_approve: true` means every merge to `main` applies immediately — there is no plan-gate.** The `dflook/terraform-apply@v1` step skips the interactive approval, so any change that lands on `main` (or any matched `push`) rewrites real cloud and homelab state without a human reviewing the plan. Mitigations are entirely upstream of CI: (1) **mandatory code review** on the PR before merge, and (2) **least-privilege Vault policies** on the `gitea_cicd` role so a runaway apply can only touch the resources its token is scoped to. See [ADR-0001](../../../ADR/0001-safe-prod-like-environment.md): the sandbox lane runs the *same* tofu but **plan-only** against a `sandbox/` state prefix and a throwaway DNS zone, so contributors can validate changes without an auto-apply. + +## Triggers + +Both workflows fire on the same three events; only the watched path globs differ. + +| Event | `IAC` (factory) | `Postgres` | +| --- | --- | --- | +| `push` | `iac/*.tf`, `iac/*.tfvars`, `iac/**/*.tf`, `iac/**/*.tfvars` | `postgres/**/*.tf`, `postgres/**/*.tfvars` | +| `pull_request` | same globs (YAML anchor `*tofuPaths`) | same globs (YAML anchor `*postgresTofuPaths`) | +| `workflow_dispatch` | manual, no inputs | manual, no inputs | + +> [!IMPORTANT] +> `concurrency` is keyed on `${{ github.ref }}-${{ github.workflow }}` with `cancel-in-progress: true`, so a newer push to the same branch cancels an in-flight run. A `pull_request` event triggers the workflow — but the `apply` still runs, so the safety contract is "review **before** merge", not "CI only plans on PRs". + +## Job 1 — `gitea_vault_auth` + +Mints a Gitea OIDC token that Vault will trust. 
The whole job is one step: + +```bash +echo -n "${{ secrets.vault_oauth__sh_b64 }}" | base64 -d | bash +``` + +| Field | Value | +| --- | --- | +| Runner | `ubuntu-latest` | +| Secret consumed | `vault_oauth__sh_b64` — a base64-encoded shell script | +| Step id | `gitea_vault_jwt` | +| Output | `gitea_vault_jwt` ← `steps.gitea_vault_jwt.outputs.id_token` | + +The decoded script asks Gitea for an OIDC `id_token` and emits it as a step output. The `tofu` job declares `needs: [gitea_vault_auth]` so it receives `needs.gitea_vault_auth.outputs.gitea_vault_jwt`. + +## Job 2 — `tofu` + +| Field | `IAC` | `Postgres` | +| --- | --- | --- | +| Job name | `Tofu` | `Tofu - Postgres` | +| `needs` | `gitea_vault_auth` | `gitea_vault_auth` | +| `OPENTOFU_VERSION` | `1.8.2` | `1.8.2` | +| `TERRAFORM_VAULT_AUTH_JWT` | `needs.gitea_vault_auth.outputs.gitea_vault_jwt` | same | +| `VAULT_CACERT` | `${{ github.workspace }}/homelab.pem` | same | +| Apply path | `iac` | `postgres/iac` | + +Step order inside the job: + +1. **read vault secret** — the shared `*vault_step` anchor (see below). +2. **`actions/checkout@v4`** — pull the repo into the workspace. +3. **prepare vault self signed cert** — `echo -n "${{ secrets.HOMELAB_CA_CERT }}" | base64 -d > $VAULT_CACERT`, writing the homelab CA to `homelab.pem` so the runner trusts `https://vault.arcodange.lab`. +4. **terraform apply** — `dflook/terraform-apply@v1` with the path above and `auto_approve: true`. + +### Vault secret reads (`*vault_step`) + +The `read vault secret` step uses [`arcodange-org/vault-action`](https://gitea.arcodange.lab/arcodange-org/vault-action), authenticating with `method: jwt`, `path: gitea_jwt`, `role: gitea_cicd`, `url: https://vault.arcodange.lab`, `caCertificate: ${{ secrets.HOMELAB_CA_CERT }}`, and `jwtGiteaOIDC` set to the auth job's output. 
The secrets it exports into the job env differ per workflow: + +| Workflow | Vault path | Selector | Exported as | +| --- | --- | --- | --- | +| `IAC` | `kvv1/google/credentials` | `credentials` | `GOOGLE_CREDENTIALS` | +| `IAC` | `kvv1/admin/gitea` | `token` | `GITEA_TOKEN` | +| `IAC` | `kvv1/admin/cloudflare` | `iam_token` | `CLOUDFLARE_API_TOKEN` | +| `IAC` | `kvv1/admin/ovh/app` | `*` (all keys) | `OVH_*` | +| `Postgres` | `kvv1/google/credentials` | `credentials` | `GOOGLE_BACKEND_CREDENTIALS` | +| `Postgres` | `kvv1/postgres/credentials` | `*` (all keys) | `TF_VAR_postgres_*` | + +`GOOGLE_CREDENTIALS` / `GOOGLE_BACKEND_CREDENTIALS` authenticate the GCS state backend; the `TF_VAR_postgres_*` fan-out feeds the Postgres module's input variables directly. See [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) for how the `gitea_cicd` role and KV v1 mounts are provisioned. + +## End-to-end flow + +```mermaid +%%{init: {'theme': 'base'}}%% +flowchart TD + push["push / PR / workflow_dispatch<br/>on iac/** or postgres/** .tf .tfvars"] --> auth["job: gitea_vault_auth<br/>base64 -d | bash -> Gitea OIDC id_token"] + auth -->|"gitea_vault_jwt output"| tofu["job: tofu<br/>OPENTOFU_VERSION 1.8.2"] + tofu --> readvault["read vault secret<br/>vault-action jwt role gitea_cicd"] + readvault -->|"GOOGLE_CREDENTIALS, TF_VAR_postgres_*, ..."| init["tofu init<br/>GCS backend, state prefix"] + init --> apply["dflook/terraform-apply@v1<br/>auto_approve: true"] + apply --> state["state updated in GCS<br/>real cloud + homelab mutated"] + + classDef trigger fill:#1f3a5f,stroke:#7fb0ff,color:#eaf2ff; + classDef job fill:#3a2f5f,stroke:#b39dff,color:#f3eeff; + classDef secret fill:#5f3a2f,stroke:#ffb38a,color:#fff1e8; + classDef danger fill:#5f1f2f,stroke:#ff8a9d,color:#ffe8ec; + class push trigger; + class auth,tofu,init job; + class readvault secret; + class apply,state danger; +``` + +1. A **push**, **pull_request**, or **workflow_dispatch** event matching the `iac/**` or `postgres/**` path globs starts the workflow. +2. Job **`gitea_vault_auth`** runs `base64 -d | bash` on the `vault_oauth__sh_b64` secret to obtain a Gitea OIDC `id_token`, published as the `gitea_vault_jwt` output. +3. Job **`tofu`** (gated by `needs: gitea_vault_auth`) starts on `ubuntu-latest` with `OPENTOFU_VERSION 1.8.2` and `TERRAFORM_VAULT_AUTH_JWT` set to that output. +4. The **read vault secret** step exchanges the JWT (role `gitea_cicd`, path `gitea_jwt`) for the workflow's secrets and exports them as env vars (`GOOGLE_CREDENTIALS` / `GOOGLE_BACKEND_CREDENTIALS`, `GITEA_TOKEN`, `CLOUDFLARE_API_TOKEN`, `OVH_*`, or `TF_VAR_postgres_*`). +5. **`tofu init`** configures the GCS backend, binding the working dir to its state prefix using the Google credentials just read. +6. **`dflook/terraform-apply@v1`** runs against `iac` (or `postgres/iac`) with `auto_approve: true` — no plan-gate. +7. The **state** in GCS is updated and the real cloud + homelab resources are mutated to match the committed code. + +## Related pages + +- [factory iac](factory-iac.md) — what the `iac/` stack provisions (the `IAC` workflow's target). +- [postgres iac](postgres-iac.md) — the `postgres/iac/` database stack (the `Postgres` workflow's target). +- [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) — the `gitea_cicd` role, OIDC trust, and KV mounts behind every secret read here. 
+- [ADR-0001 · Safe prod-like environment](../../../ADR/0001-safe-prod-like-environment.md) — the sandbox lane runs the same tofu plan-only against a `sandbox/` state prefix and a throwaway zone. diff --git a/vibe/guidebooks/factory-provisioning/opentofu/factory-iac.md b/vibe/guidebooks/factory-provisioning/opentofu/factory-iac.md new file mode 100644 index 0000000..ba73c6a --- /dev/null +++ b/vibe/guidebooks/factory-provisioning/opentofu/factory-iac.md @@ -0,0 +1,148 @@ +[vibe](../../../README.md) > [Guidebooks](../../README.md) > [Factory provisioning](../README.md) > [OpenTofu](README.md) > **factory iac** + +# factory iac — the `iac/` state root + +> [!NOTE] +> **Status:** ✅ active · **Last Updated:** 2026-06-23 +> **Code:** [`iac/`](../../../../iac) · **State backend:** `gs://arcodange-tf/factory/main` ([`iac/backend.tf`](../../../../iac/backend.tf)) +> **Upstream:** [OpenTofu hub](README.md) · [Factory provisioning hub](../README.md) · [Lab ecosystem · 01 factory](../../lab-ecosystem/01-factory.md) +> **Related:** [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) · [Storage & recovery](../../lab-ecosystem/storage-and-recovery.md) · [Naming conventions](../../lab-ecosystem/naming-conventions.md) · [CI apply flow](ci-apply-flow.md) · [postgres iac](postgres-iac.md) · [ADR-0001 safe prod-like environment](../../../ADR/0001-safe-prod-like-environment.md) + +The `iac/` root provisions everything that lives **outside** the K3s cluster: the Cloudflare R2 bucket that holds the *cms* repo's OpenTofu state (this root's own state lives in GCS), the per-service Cloudflare and OVH credentials consumed by the [cms](https://gitea.arcodange.lab/arcodange-org/cms) repo, a restricted Gitea CI user for reading private module repos, and the GCS bucket that backs up Longhorn volumes. Every provisioned credential is written to a Vault path (the durable source of truth; see [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md)); the credentials consumed by the cms workflows are **also** written to Gitea Actions secrets, where the consuming workflow expects them. 
+ +This root's state lives at `gs://arcodange-tf/factory/main` and is applied by [`.gitea/workflows/iac.yaml`](../../../../.gitea/workflows/iac.yaml) on any change under `iac/**` — see [CI apply flow](ci-apply-flow.md) for the job-by-job walkthrough. + +--- + +## Providers + +Declared in [`iac/providers.tf`](../../../../iac/providers.tf). + +| Provider | Source | Version | Endpoint / scope | Auth | +| --- | --- | --- | --- | --- | +| `gitea` | `go-gitea/gitea` | `0.6.0` | `https://gitea.arcodange.lab` | `GITEA_TOKEN` env var | +| `vault` | `vault` | `4.4.0` | `https://vault.arcodange.lab` | JWT login — mount `gitea_jwt`, role `gitea_cicd` | +| `google` | `google` | `7.0.1` | project `arcodange`, region `US-EAST1` | `GOOGLE_CREDENTIALS` env var | +| `cloudflare` | `cloudflare/cloudflare` | `~> 5` | DNS / Pages / R2 / IAM | `CLOUDFLARE_API_TOKEN` env var | +| `ovh` | `ovh/ovh` | `2.8.0` | endpoint `ovh-eu` | `OVH_APPLICATION_KEY` / `OVH_APPLICATION_SECRET` / `OVH_CONSUMER_KEY` | + +> [!NOTE] +> The Cloudflare account ID is **not** hard-coded — it is resolved at plan time from `data.cloudflare_account.arcodange` filtered on the account name `arcodange@gmail.com` ([`iac/cloudflare.tf`](../../../../iac/cloudflare.tf)) and exposed as `local.cloudflare_account_id`. + +--- + +## Cloudflare — R2 backend bucket & service tokens + +Defined in [`iac/cloudflare.tf`](../../../../iac/cloudflare.tf). Two tokens are minted through the [`modules/cloudflare_token`](#the-cloudflare_token-module) mechanism: one scoped to the R2 state bucket, one broad token handed to the cms repo. 
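For the R2-scoped token, S3-compatible clients need an access key pair rather than the raw API token. The `cloudflare_token` module section on this page describes its `r2_credentials` output as mapping the access key ID to the token's id and the secret to the SHA-256 of the token value. The following is a minimal sketch of that derivation; the function name and the sample values are made up for illustration and are not taken from the module.

```python
import hashlib


def r2_s3_credentials(token_id: str, token_value: str) -> dict[str, str]:
    """Derive S3-style credentials from a Cloudflare API token.

    Mirrors the mapping described for the module's r2_credentials output:
    access_key_id = token id, secret_access_key = sha256(token value), hex-encoded.
    """
    secret = hashlib.sha256(token_value.encode("utf-8")).hexdigest()
    return {"access_key_id": token_id, "secret_access_key": secret}


if __name__ == "__main__":
    # Placeholder values only; never print real credentials.
    creds = r2_s3_credentials("example-token-id", "example-token-value")
    print(creds["access_key_id"], len(creds["secret_access_key"]))
```

The secret is always a 64-character hex digest, which is a quick sanity check when a stored value in Vault looks wrong.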
+ +| Resource | Type | Identity / scope | Secret destination | +| --- | --- | --- | --- | +| `cloudflare_r2_bucket.arcodange_tf` | R2 bucket | name `arcodange-tf`, jurisdiction `eu` | — (holds the *cms* repo's own OpenTofu state) | +| `module.cf_r2_arcodange_tf_token` | module → `cloudflare_account_token` | account: `Workers R2 Storage Read`, `Account Settings Read`; bucket: `Workers R2 Storage Bucket Item Write` | `vault_kv_secret.cf_r2_arcodange_tf` → `kvv1/cloudflare/r2/arcodange-tf` (S3 access key, secret, `https://<account_id>.eu.r2.cloudflarestorage.com` endpoint) | +| `vault_policy.cf_r2_arcodange_tf` | Vault policy | name `factory__cf_r2_arcodange_tf` | read on `kvv1/cloudflare/r2/arcodange-tf` **and** `kvv1/zoho/self_client` (the Zoho mail client is created manually) | +| `module.cf_arcodange_cms_token` | module → `cloudflare_account_token` | account-scope: `Pages Write`, `Account DNS Settings Write`, `Account Settings Read`, `Zone Write`, `Zone Settings Write`, `DNS Write`, `Cloudflare Tunnel Write`, `Turnstile Sites Write` | Gitea secrets `CLOUDFLARE_API_TOKEN` + `CLOUDFLARE_ACCOUNT_ID` on the `cms` repo; Vault `kvv1/cloudflare/cms/cf_arcodange_cms_token` | + +The `cms` repo (`data.gitea_repo.cms`, owner `arcodange-org`) receives the broad token because it manages the public site end to end: Cloudflare Pages deploys, DNS records, zone settings, the Tunnel, and Turnstile. + +> [!CAUTION] +> Both tokens are minted with **`expires_on = null`** — they never expire. A leaked `cf_arcodange_cms_token` grants standing DNS/Pages/Tunnel/Turnstile write on the whole account until manually revoked. There is no automatic rotation; rotation means tainting the module's `cloudflare_account_token` and re-applying. + +--- + +## OVH — OAuth2 client for the cms domain + +Defined in [`iac/ovh.tf`](../../../../iac/ovh.tf). A `CLIENT_CREDENTIALS` OAuth2 client lets the cms workflow edit DNS nameservers for `arcodange.fr`, constrained by an IAM policy. 
+ +| Resource | Type | Scope | +| --- | --- | --- | +| `ovh_me_api_oauth2_client.cms` | OAuth2 client | name `cms repo`, flow `CLIENT_CREDENTIALS` — "arcodange.fr management" | +| `ovh_iam_policy.cms` | IAM policy | name `cms_manager`; identity = the OAuth2 client; resources = account URN + `urn:v1:eu:resource:domain:arcodange.fr`; allow = a handful of `me/*` reads, all domain **READ** reference-actions (computed via `data.ovh_iam_reference_actions.domain`), plus `domain:apiovh:nameServer/edit` | +| `gitea_repository_actions_secret.ovh_cms_client_id` | Gitea secret | `OVH_CLIENT_ID` on the `cms` repo | +| `gitea_repository_actions_secret.ovh_cms_client_secret` | Gitea secret | `OVH_CLIENT_SECRET` on the `cms` repo | +| `vault_kv_secret.ovh_cms_token` | Vault secret | `kvv1/ovh/cms/app` — `client_id`, `client_secret`, `urn` | + +> [!NOTE] +> The write surface is deliberately narrow: the policy grants **only** `nameServer/edit` for writes; everything else is read-only. This lets the cms pipeline point `arcodange.fr` at Cloudflare nameservers without exposing the broader OVH account. + +--- + +## Gitea — restricted CI module-reader user + +Defined in [`iac/gitea_tofu_ci_user.tf`](../../../../iac/gitea_tofu_ci_user.tf). A locked-down Gitea account whose SSH key lets CI clone private Terraform module repos without exposing a privileged token. 
+ +| Resource | Type | Notes | +| --- | --- | --- | +| `random_password.tofu` | password | length 32 — the user's login password | +| `gitea_user.tofu` | Gitea user | username `tofu_module_reader`, email `tofu-module-reader@arcodange.fake`, `restricted = true`, `visibility = private`, `prohibit_login = false` | +| `tls_private_key.tofu` | keypair | algorithm **ED25519** | +| `gitea_public_key.tofu` | SSH key | public half attached to `tofu_module_reader` | +| `vault_kv_secret.gitea_admin_token` | Vault secret | `kvv1/gitea/tofu_module_reader` — `ssh_private_key` + `ssh_public_key` | + +> [!NOTE] +> Despite the Terraform resource name `gitea_admin_token`, the stored payload is the **SSH keypair**, not an admin token. The user is `restricted`, so it can only read repos it is explicitly granted access to. + +--- + +## Google / GCS — Longhorn backup target + +Defined in [`iac/gcs_backup.tf`](../../../../iac/gcs_backup.tf). A GCS bucket plus an HMAC key wired into Vault so the in-cluster Longhorn controller can pull S3-compatible backup credentials. See [Storage & recovery](../../lab-ecosystem/storage-and-recovery.md) for how this fits the cluster-recovery story. 
+ +| Resource | Type | Value | +| --- | --- | --- | +| `google_storage_bucket.longhorn_backup` | GCS bucket | name `arcodange-backup`, location `NAM4` (dual-region), `force_destroy = true`, `public_access_prevention = enforced` | +| `google_service_account.longhorn_backup` | service account | account_id `longhorn-backup` | +| `google_storage_bucket_iam_member.longhorn_backup` | IAM binding | `roles/storage.admin` on the bucket, member = the SA | +| `google_storage_hmac_key.longhorn_backup` | HMAC key | S3-compatible access_id + secret for that SA | +| `vault_kv_secret_v2.longhorn_gcs_backup` | Vault **KVv2** secret | mount `kvv2`, name `longhorn/gcs-backup`, `cas = 1`, `delete_all_versions = true` — `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_ENDPOINTS = https://storage.googleapis.com` | +| `vault_policy.longhorn_gcs_backup` | Vault policy | name `longhorn-gcs-backup` — read on `kvv2/data/longhorn/gcs-backup` | +| `vault_kubernetes_auth_backend_role.longhorn` | Vault k8s auth role | role `longhorn`, bound SA `longhorn-vault-secret-reader` in namespace `longhorn-system`, audience `vault`, policy `longhorn-gcs-backup` | + +The bound service-account name `longhorn-vault-secret-reader` must match the `VaultAuth` manifest in-cluster — that's the handshake that lets Longhorn read the HMAC creds at runtime. + +> [!WARNING] +> The HMAC key is an **S3-compatible** credential and is weaker than a native GCS service-account key: it is a long-lived static secret with no key rotation built into this config, and `roles/storage.admin` grants full read/write/delete on the backup bucket. Combined with `force_destroy = true`, a state operation that destroys `arcodange-backup` will delete every Longhorn backup without prompting. Treat this bucket as critical and irreplaceable infrastructure. + +--- + +## The `cloudflare_token` module + +Source: [`iac/modules/cloudflare_token/`](../../../../iac/modules/cloudflare_token). 
This local module turns **human-readable permission names** into a working Cloudflare account token, so callers never hard-code permission-group UUIDs. + +How it works ([`main.tf`](../../../../iac/modules/cloudflare_token/main.tf)): + +1. It reads **all** available permission groups via `data.cloudflare_account_api_token_permission_groups_list`, then builds `local.permission_map`: `"<scope>:<name>" => id` (e.g. `"account:Pages Write" => <uuid>`), keyed by the last dotted segment of the group's scope. +2. Caller-supplied names (`var.permissions.account` / `var.permissions.bucket`) are looked up against that map; any name with no match lands in `local.missing_permissions` and trips a **`precondition`** that fails the apply with a clear "Permissions introuvables" error (French for "permissions not found"). +3. Policies are assembled dynamically — an `account` policy targeting `com.cloudflare.api.account.<account_id>` and, if `var.bucket` is set, a `bucket` policy targeting `com.cloudflare.edge.r2.bucket.<account_id>_<jurisdiction>_<bucket>`. +4. The `cloudflare_account_token.token` resource sets `expires_on = null` and **ignores** drift on `expires_on` and `policies` (the upstream permission IDs are unstable). Instead, a `null_resource.cloudflare_account_token_replace` hashes the **sorted permission names** into its triggers, and `replace_triggered_by` forces a fresh token whenever the *names* change — surviving id churn while still rotating on a real permission change. +5. Outputs ([`outputs.tf`](../../../../iac/modules/cloudflare_token/outputs.tf)): `token` (sensitive), `token_id`, `token_sha256`, and — when `var.bucket` is set — `r2_credentials` mapping `access_key_id = token.id` and `secret_access_key = sha256(token.value)` for S3-compatible R2 access. + +--- + +## Vault layout: mixed KVv1 / KVv2 + +This root writes to **both** KV engines, which is easy to trip over. 
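One way to see the difference: the ACL policy path for a KVv2 secret carries a `data/` segment after the mount, while a KVv1 policy path is the literal mount plus secret name. The following is a minimal sketch using paths that appear on this page; `policy_path` is a hypothetical helper, not something defined in the repository.

```python
def policy_path(mount: str, secret: str, kv_version: int) -> str:
    """Build the path a Vault ACL policy must reference for a KV secret."""
    if kv_version == 1:
        # KVv1: the policy targets the literal path.
        return f"{mount}/{secret}"
    if kv_version == 2:
        # KVv2: reads go through the data/ sub-path of the mount.
        return f"{mount}/data/{secret}"
    raise ValueError(f"unsupported KV engine version: {kv_version}")


if __name__ == "__main__":
    print(policy_path("kvv1", "ovh/cms/app", 1))
    print(policy_path("kvv2", "longhorn/gcs-backup", 2))
```

Copying a policy between engines without adjusting this segment is the failure mode to watch for: the policy applies cleanly but grants nothing.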
+ +| Path | Engine | Written by | +| --- | --- | --- | +| `kvv1/cloudflare/r2/arcodange-tf` | KVv1 (`vault_kv_secret`) | R2 backend token | +| `kvv1/cloudflare/cms/cf_arcodange_cms_token` | KVv1 | cms Cloudflare token | +| `kvv1/ovh/cms/app` | KVv1 | OVH OAuth2 client | +| `kvv1/gitea/tofu_module_reader` | KVv1 | CI user SSH key | +| `kvv2/longhorn/gcs-backup` | KVv2 (`vault_kv_secret_v2`) | Longhorn GCS HMAC | + +> [!WARNING] +> Most secrets here use the **KVv1** engine (`vault_kv_secret`), but the Longhorn backup secret uses **KVv2** (`vault_kv_secret_v2`). The policy paths differ accordingly — KVv2 reads target `kvv2/data/longhorn/gcs-backup` (note the `/data/` segment), whereas KVv1 policies read the literal path. Mixing the two engines means a policy copied from one secret to another will silently grant nothing. See [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) for the engine-level design. + +--- + +## Outputs + +The root exposes a single top-level `output "token"` (sensitive) = the cms Cloudflare token ([`iac/cloudflare.tf`](../../../../iac/cloudflare.tf)). Everything else is delivered side-effect-style into Gitea secrets and Vault paths rather than as Terraform outputs. + +--- + +## See also + +- [CI apply flow](ci-apply-flow.md) — how `iac/**` changes reach `gs://arcodange-tf/factory/main` via the Vault-JWT exchange and auto-approve apply. +- [postgres iac](postgres-iac.md) — the sibling root that provisions in-cluster PostgreSQL. +- [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) · [Storage & recovery](../../lab-ecosystem/storage-and-recovery.md) · [Naming conventions](../../lab-ecosystem/naming-conventions.md). 
diff --git a/vibe/guidebooks/factory-provisioning/opentofu/postgres-iac.md b/vibe/guidebooks/factory-provisioning/opentofu/postgres-iac.md new file mode 100644 index 0000000..6935824 --- /dev/null +++ b/vibe/guidebooks/factory-provisioning/opentofu/postgres-iac.md @@ -0,0 +1,116 @@ +[vibe](../../../README.md) > [Guidebooks](../../README.md) > [Factory provisioning](../README.md) > [OpenTofu](README.md) > **postgres iac** + +# postgres iac — the `postgres/iac/` state root + +> [!NOTE] +> **Status:** ✅ active · **Last Updated:** 2026-06-23 +> **Code:** [`postgres/iac/`](../../../../postgres/iac) · **State backend:** `gs://arcodange-tf/factory/postgres` ([`postgres/iac/backend.tf`](../../../../postgres/iac/backend.tf)) +> **Upstream:** [OpenTofu hub](README.md) · [Factory provisioning hub](../README.md) · [Lab ecosystem · 01 factory](../../lab-ecosystem/01-factory.md) +> **Related:** [Naming conventions](../../lab-ecosystem/naming-conventions.md) · [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) · [CI apply flow](ci-apply-flow.md) · [factory iac](factory-iac.md) · [ADR-0001 safe prod-like environment](../../../ADR/0001-safe-prod-like-environment.md) + +The `postgres/iac/` root provisions **PostgreSQL roles, databases, and the pgbouncer auth function** on the live cluster database — one strand of the per-application `<app>` join key described in [Naming conventions](../../lab-ecosystem/naming-conventions.md). For each application it creates a non-login owner role, an `<app>` database owned by that role, and a `user_lookup()` function that lets PgBouncer authenticate against `pg_shadow`. A single `credentials_editor` login role (whose password is stored in Vault) is granted admin over every per-app role so that downstream tooling can mint application credentials without superuser rights. 
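The per-application fan-out follows a fixed naming pattern: each application name yields an owner role with a `_role` suffix, a database of the same name, and a `user_lookup` function inside that database. The following is a minimal sketch of that mapping; the helper name is hypothetical and the application names are the ones listed further down this page.

```python
# Application names as listed in terraform.tfvars on this page.
APPLICATIONS = ["webapp", "erp", "crowdsec", "plausible", "dance-lessons-coach"]


def per_app_objects(app: str) -> dict[str, str]:
    """Return the PostgreSQL object names derived from one application name."""
    return {
        "role": f"{app}_role",             # non-login role that owns the database
        "database": app,                   # database named after the application
        "function": f"{app}.user_lookup",  # PgBouncer lookup function in that database
    }


if __name__ == "__main__":
    for name in APPLICATIONS:
        print(per_app_objects(name))
```

Removing a name from the list removes every derived object with it, which is why a rename is effectively a drop plus a create.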
+ +This root's state lives at `gs://arcodange-tf/factory/postgres` and is applied by [`.gitea/workflows/postgres.yaml`](../../../../.gitea/workflows/postgres.yaml) on any change under `postgres/**` — see [CI apply flow](ci-apply-flow.md). + +> [!CAUTION] +> This root runs as a **PostgreSQL superuser** ([`postgres/iac/providers.tf`](../../../../postgres/iac/providers.tf): `superuser = true`) pinned to the live database at **`192.168.1.202`** (pi2) **through PgBouncer**, with `sslmode = disable`. The provider can therefore **drop or alter live application databases** — an errant `terraform destroy` or a renamed `applications` entry will delete real data. And because the only route to Postgres is via PgBouncer on that host, **if PgBouncer is down OpenTofu cannot connect and no apply can run.** Treat every `postgres/**` merge as a production database change ([ADR-0001](../../../ADR/0001-safe-prod-like-environment.md)). + +--- + +## Providers + +Declared in [`postgres/iac/providers.tf`](../../../../postgres/iac/providers.tf). + +| Provider | Source | Version | Connection | Auth | +| --- | --- | --- | --- | --- | +| `postgresql` | `cyrilgdn/postgresql` | `1.24.0` | host `192.168.1.202` (pi2), via PgBouncer, `sslmode = disable`, `superuser = true` | `var.POSTGRES_USERNAME` / `var.POSTGRES_PASSWORD` (TF vars from `TF_VAR_POSTGRES_*`, sourced from Vault in CI) | +| `vault` | `vault` | `4.4.0` | `https://vault.arcodange.lab` | JWT login — mount `gitea_jwt`, role `gitea_cicd` | + +The two `POSTGRES_*` variables are declared `sensitive` in the same file; CI populates them from Vault as `TF_VAR_POSTGRES_USERNAME` / `TF_VAR_POSTGRES_PASSWORD` (see [CI apply flow](ci-apply-flow.md)). + +--- + +## The application set + +Everything in this root fans out over one variable. 
`var.applications` is a `set(string)` ([`variables.tf`](../../../../postgres/iac/variables.tf)) whose members are listed in [`terraform.tfvars`](../../../../postgres/iac/terraform.tfvars): + +| `applications` member | +| --- | +| `webapp` | +| `erp` | +| `crowdsec` | +| `plausible` | +| `dance-lessons-coach` | + +Adding an app to that list creates a full role + database + lookup-function bundle on the next apply; **removing** one would `DROP` the live database (see the caution above). + +--- + +## The `credentials_editor` role + +Defined in [`postgres/iac/main.tf`](../../../../postgres/iac/main.tf). A single login role, granted admin over every per-app role, whose credentials downstream tooling uses to provision application logins. + +| Resource | Type | Detail | +| --- | --- | --- | +| `random_password.credentials_editor` | password | length 24, `override_special = "-:!+<>"` | +| `postgresql_role.credentials_editor` | role | `login = true`, `create_role = true`; `lifecycle { ignore_changes = [roles] }` so its grant membership isn't reverted | +| `vault_kv_secret.postgres_admin_credentials` | Vault **KVv1** secret | `kvv1/postgres/credentials_editor/credentials` — `username` + `password` | + +--- + +## Per-application resources + +For each member of `var.applications`, `main.tf` creates the following (all `for_each` over the set): + +| Resource | Type | What it creates | +| --- | --- | --- | +| `postgresql_role.app_role["<app>"]` | role | non-login role `<app>_role` (`login = false`) — owns the database | +| `postgresql_grant_role.credentials_editor_app_role["<app>"]` | grant | `credentials_editor` → `<app>_role` **WITH ADMIN OPTION** | +| `postgresql_database.app_db["<app>"]` | database | database `<app>`, owner `<app>_role`, `template = template0`, `alter_object_ownership = true` | +| `postgresql_function.pgbouncer_user_lookup["<app>"]` | function | `user_lookup(i_username text)` in db `<app>` — see below | +| `postgresql_grant.pgbouncer_user_lookup_public_revoke["<app>"]` | grant | revoke (empty `privileges`) of `user_lookup` from role `public` in schema `public` | +| `postgresql_grant.pgbouncer_user_lookup["<app>"]` | grant | `EXECUTE` on `user_lookup` to role `pgbouncer_auth`; `depends_on` the public-revoke (the two grants can't run in parallel) | + +So `webapp` yields role `webapp_role`, database `webapp`, function `webapp.user_lookup`, and the matching grants; likewise for `erp`, `crowdsec`, `plausible`, and `dance-lessons-coach`. + +### The pgbouncer `user_lookup()` function + +`postgresql_function.pgbouncer_user_lookup` defines a `plpgsql` function with **`security_definer = true`** and `parallel = "SAFE"`. It takes `i_username` (IN, text) and returns a record of `uname` + `phash`: + +```sql +BEGIN + SELECT usename, passwd FROM pg_catalog.pg_shadow + WHERE usename = i_username INTO uname, phash; + RETURN; +END; +``` + +PgBouncer's `auth_query` calls this to fetch the stored password hash. Because reading `pg_shadow` is privileged, the function is `SECURITY DEFINER` (runs as its owner). Access is locked down in two steps: first **revoke** the default `public` execute grant, then **grant** `EXECUTE` only to the `pgbouncer_auth` role — the `pgbouncer_auth` role itself is expected to already exist on the server (it is not created by this root). + +> [!NOTE] +> The two grants are ordered with an explicit `depends_on`: `postgresql_grant.pgbouncer_user_lookup` waits for `postgresql_grant.pgbouncer_user_lookup_public_revoke` because the provider can't apply both grants on the same object concurrently. + +--- + +## Vault layout + +This root writes a single KVv1 secret. + +| Path | Engine | Contents | +| --- | --- | --- | +| `kvv1/postgres/credentials_editor/credentials` | KVv1 (`vault_kv_secret`) | `username`, `password` of the `credentials_editor` login role | + +--- + +## No outputs + +There is **no `outputs.tf`** in this root. 
Nothing is exported as a Terraform output — the `credentials_editor` credentials are delivered into Vault, and the per-app roles/databases/functions are side effects on the live server. Consumers read the credentials from `kvv1/postgres/credentials_editor/credentials`, not from state outputs. + +--- + +## See also + +- [Naming conventions](../../lab-ecosystem/naming-conventions.md) — the `` databases here are one strand of the per-application `` join key (alongside namespaces, Vault paths, and repos). +- [CI apply flow](ci-apply-flow.md) — how `postgres/**` changes reach `gs://arcodange-tf/factory/postgres` and where `TF_VAR_POSTGRES_*` come from. +- [factory iac](factory-iac.md) — the sibling root for everything outside the cluster. +- [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md). diff --git a/vibe/guidebooks/lab-ecosystem/01-factory.md b/vibe/guidebooks/lab-ecosystem/01-factory.md index 79e54cd..a8cc875 100644 --- a/vibe/guidebooks/lab-ecosystem/01-factory.md +++ b/vibe/guidebooks/lab-ecosystem/01-factory.md @@ -5,6 +5,7 @@ > **Status:** ✅ Active > **Last Updated:** 2026-06-23 > **Downstream:** [02 · tools](02-tools.md) · [03 · cms](03-cms.md) +> **Deeper dive:** [Factory provisioning guidebook](../factory-provisioning/README.md) — page-by-page walkthrough of the Ansible playbooks/roles and OpenTofu modules summarized here > **Related:** [naming-conventions.md](naming-conventions.md) · [secrets-and-vault.md](secrets-and-vault.md) · [storage-and-recovery.md](storage-and-recovery.md) `factory` is the **cornerstone admin repo**: it provisions the hosts and the cluster, declares what gets deployed, and owns the platform-level cloud/Gitea/Vault/Postgres state that every app leans on. It has four pillars — **Ansible** (imperative host & cluster setup), **ArgoCD** (declarative app-of-apps), **`iac/`** (OpenTofu for the cloud/Gitea/Vault edge), and **`postgres/iac/`** (per-app PostgreSQL provisioning). 
The repos `tools` and `cms` are deployed *by* factory's ArgoCD and are mapped in [02 · tools](02-tools.md) and [03 · cms](03-cms.md). -- 2.49.1 From 548dacfc4415610f8cb748254074ac0dad3897f6 Mon Sep 17 00:00:00 2001 From: Gabriel Radureau Date: Tue, 23 Jun 2026 21:41:15 +0200 Subject: [PATCH 4/9] docs(vibe): add tools/ and cms/ guidebooks Two code-grounded tree-docs guidebooks under vibe/guidebooks/, drilling into the lab-ecosystem 02-tools and 03-cms pages (bidirectional): - tools/ : hub + components.md (Vault+VSO, Prometheus, Grafana, CrowdSec, pgbouncer, Redis/KeyDB, Plausible, ClickHouse; pgcat/tool as Tier-2) + secrets-and-vso.md (Vault engines/auth, the app_roles/app_policy modules = the join-key machinery, VSO CRDs, secret-paths inventory). - cms/ : hub + site.md (Nuxt + dual Pages/k3s deploy) + cloudflare.md (zone via OVH->CF, Pages, cloudflared tunnel, Turnstile, R2 state) + zoho-email.md (OAuth, MX/SPF/DKIM/DMARC/BIMI, the 7 aliases). Sibling-repo code linked via full gitea URLs; vibe-internal links bidirectional. Reconciled the cloudflared tunnel token path to kvv2 cms/cloudflared (the chart VaultStaticSecret is kv-v2; the kvv1 tofu reference is a commented-out stub). 6 mermaid diagrams MCP-validated; zero dead links. Lab Cartographer cohort. 
Co-Authored-By: Claude Opus 4.8 --- vibe/guidebooks/README.md | 2 + vibe/guidebooks/cms/README.md | 75 +++++++ vibe/guidebooks/cms/cloudflare.md | 182 +++++++++++++++++ vibe/guidebooks/cms/site.md | 165 +++++++++++++++ vibe/guidebooks/cms/zoho-email.md | 116 +++++++++++ vibe/guidebooks/lab-ecosystem/02-tools.md | 2 + vibe/guidebooks/lab-ecosystem/03-cms.md | 2 + vibe/guidebooks/tools/README.md | 114 +++++++++++ vibe/guidebooks/tools/components.md | 218 ++++++++++++++++++++ vibe/guidebooks/tools/secrets-and-vso.md | 234 ++++++++++++++++++++++ 10 files changed, 1110 insertions(+) create mode 100644 vibe/guidebooks/cms/README.md create mode 100644 vibe/guidebooks/cms/cloudflare.md create mode 100644 vibe/guidebooks/cms/site.md create mode 100644 vibe/guidebooks/cms/zoho-email.md create mode 100644 vibe/guidebooks/tools/README.md create mode 100644 vibe/guidebooks/tools/components.md create mode 100644 vibe/guidebooks/tools/secrets-and-vso.md diff --git a/vibe/guidebooks/README.md b/vibe/guidebooks/README.md index fd2f104..5f62593 100644 --- a/vibe/guidebooks/README.md +++ b/vibe/guidebooks/README.md @@ -36,6 +36,8 @@ flowchart LR |---|---|---| | [Lab ecosystem](lab-ecosystem/README.md) | End-to-end map of `factory` + `tools` + `cms`: repos, the `` join key, secrets via Vault, CI/CD, ArgoCD, and the data/control flows that connect them | ✅ Active | | [Factory provisioning](factory-provisioning/README.md) | Deep dive into how factory provisions everything: Ansible playbooks + roles and OpenTofu | ✅ Active | +| [Tools](tools/README.md) | Deep dive into the lab platform services in the `tools` namespace (Vault+VSO, Prometheus, Grafana, CrowdSec, poolers, Redis, Plausible, ClickHouse) | ✅ Active | +| [CMS](cms/README.md) | Deep dive into the public Nuxt site arcodange.fr + its Cloudflare DNS/tunnel/Turnstile and Zoho email IaC | ✅ Active | ## Rules to contribute diff --git a/vibe/guidebooks/cms/README.md b/vibe/guidebooks/cms/README.md new file mode 100644 index 
0000000..b9a1046 --- /dev/null +++ b/vibe/guidebooks/cms/README.md @@ -0,0 +1,75 @@ +[vibe](../../README.md) > [Guidebooks](../README.md) > **CMS** + +# CMS + +> **Status:** ✅ Active +> **Last Updated:** 2026-06-23 +> **Upstream:** [lab-ecosystem 03 · cms](../lab-ecosystem/03-cms.md) +> **Downstream:** [Site (Nuxt)](site.md) · [Cloudflare](cloudflare.md) · [Zoho email](zoho-email.md) +> **Related:** [tools CrowdSec](../tools/components.md) · [secrets-and-vault concept](../lab-ecosystem/secrets-and-vault.md) · [tofu CI flow](../factory-provisioning/opentofu/ci-apply-flow.md) · [safe-env ADR](../../ADR/0001-safe-prod-like-environment.md) + +This guidebook maps the [`cms` repo](https://gitea.arcodange.lab/arcodange-org/cms) — the one app in the lab whose primary audience is the open Internet. It serves the public site **arcodange.fr** and owns the OpenTofu that wires its Cloudflare edge, its Cloudflared tunnel into the cluster, its Turnstile CAPTCHA, and its Zoho email. + +## Two faces of one repo + +The `cms` repo holds two distinct concerns that share a domain but live in different directories. + +| Face | Where | What it is | +|---|---|---| +| **The SITE** | repo root ([`app/`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/app), [`content/`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/content), [`chart/`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/chart)) | A **Nuxt 4** application (Nuxt Content + Nuxt Studio) built to static output and deployed **two ways**: to **Cloudflare Pages** (public `arcodange.fr` / `www`) and into **k3s** via a Helm chart (ArgoCD app **`cms`**) reachable through the Cloudflared tunnel (e.g. 
`cms-rec.arcodange.fr`, `www.arcodange.lab`) | +| **The IaC** | [`cloudflare/`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/cloudflare) | **OpenTofu** managing the `arcodange.fr` zone (registered at OVH, DNS delegated to Cloudflare), Cloudflare **Pages**, the **Cloudflared** Zero-Trust tunnel into internal Traefik, a **Turnstile** CAPTCHA feeding CrowdSec, and **Zoho** email | + +The site is *what visitors see*; the IaC is *how they reach it and how mail flows*. Both deploy from the same Gitea repo through Gitea Actions. + +## Public request + email flow + +```mermaid +%%{init: {'theme': 'base'}}%% +flowchart LR + classDef edge fill:#d97706,stroke:#b45309,color:#fff + classDef proc fill:#059669,stroke:#047857,color:#fff + classDef store fill:#7c3aed,stroke:#6d28d9,color:#fff + + USER(["Visitor"]):::edge + CFDNS["Cloudflare DNS<br/>arcodange.fr zone"]:::edge
+ PAGES["Cloudflare Pages<br/>(static Nuxt build)"]:::proc
+ TUN["Cloudflared tunnel"]:::edge + TRAEFIK["internal Traefik"]:::proc
+ CS["CrowdSec bouncer<br/>(Turnstile-backed)"]:::proc
+ CMS["cms pod (Nuxt)<br/>cms-rec.arcodange.fr"]:::proc
+ MAIL(["Sender"]):::edge + ZOHO["Zoho<br/>MX / SPF / DKIM / DMARC / BIMI"]:::store + + USER --> CFDNS + CFDNS -- "arcodange.fr / www" --> PAGES + CFDNS -- "*.arcodange.fr" --> TUN + TUN --> TRAEFIK --> CS --> CMS + MAIL -- "MX lookup arcodange.fr" --> ZOHO +``` + +1. A **visitor** resolves a hostname under `arcodange.fr` through **Cloudflare DNS** (the zone OpenTofu manages). +2. The apex and `www` records (proxied CNAMEs) land on **Cloudflare Pages**, which serves the static Nuxt build directly from the edge. +3. Wildcard `*.arcodange.fr` hostnames route through the **Cloudflared** Zero-Trust tunnel — no home-LAN ports are opened — onto **internal Traefik**, which passes the request through the **CrowdSec** bouncer (its CAPTCHA challenge backed by Turnstile) to the in-cluster **`cms`** Nuxt pod (e.g. `cms-rec.arcodange.fr`). +4. Separately, **email** to `arcodange.fr` follows the **MX** record to **Zoho**, with **SPF/DKIM/DMARC/BIMI** authenticating and presenting the mail. + +## Index + +| Page | What it maps | Status | +|---|---|---| +| [Site (Nuxt)](site.md) | The Nuxt 4 app: Nuxt Content + Studio, static build, the dual deploy to Cloudflare Pages and to k3s via the Helm chart / ArgoCD app `cms` | ✅ Active | +| [Cloudflare](cloudflare.md) | The `cloudflare/` OpenTofu: zone (OVH-registered, CF-delegated), Pages, the Cloudflared tunnel into Traefik, and the Turnstile CAPTCHA for CrowdSec | ✅ Active | +| [Zoho email](zoho-email.md) | Zoho mail IaC: domain verification, MX/SPF/DKIM/DMARC/BIMI records, and the public aliases | ✅ Active | + +## Maintenance rule + +> [!IMPORTANT] +> **If any component documented in this guidebook is altered, update the page describing it in the same change.** A reference map that drifts from the real `cms` repo sends readers and agents down dead paths. The PR that changes a component is the PR that updates its CMS guidebook page. + +## Cross-references + +- [lab-ecosystem 03 · cms](../lab-ecosystem/03-cms.md) — the whole-lab view of where `cms` sits among `factory` + `tools`. 
+- [tools CrowdSec](../tools/components.md) — the Traefik bouncer the Turnstile challenge feeds for public-edge decisioning. +- [secrets-and-vault concept](../lab-ecosystem/secrets-and-vault.md) — where the Cloudflared tunnel token, Turnstile secret, and Cloudflare/Zoho/OVH credentials live in Vault. +- [tofu CI flow](../factory-provisioning/opentofu/ci-apply-flow.md) — the OpenTofu apply pipeline pattern the `cloudflare/` IaC follows in Gitea Actions. +- [safe-env ADR](../../ADR/0001-safe-prod-like-environment.md) — why public-facing surfaces like this one are isolated from a safe prod-like environment. +- Repo: [arcodange-org/cms](https://gitea.arcodange.lab/arcodange-org/cms). diff --git a/vibe/guidebooks/cms/cloudflare.md b/vibe/guidebooks/cms/cloudflare.md new file mode 100644 index 0000000..082e8a6 --- /dev/null +++ b/vibe/guidebooks/cms/cloudflare.md @@ -0,0 +1,182 @@ +[vibe](../../README.md) > [Guidebooks](../README.md) > [CMS](README.md) > **Cloudflare** + +# Cloudflare + +> **Status:** ✅ Active +> **Last Updated:** 2026-06-23 +> **Upstream:** [CMS](README.md) · [lab-ecosystem 03 · cms](../lab-ecosystem/03-cms.md) +> **Downstream:** [tools CrowdSec](../tools/components.md) (consumes the Turnstile widget) +> **Related:** [Zoho email](zoho-email.md) · [tofu CI flow](../factory-provisioning/opentofu/ci-apply-flow.md) · [secrets-and-vault concept](../lab-ecosystem/secrets-and-vault.md) · [naming conventions](../lab-ecosystem/naming-conventions.md) · [safe-env ADR](../../ADR/0001-safe-prod-like-environment.md) + +This page maps [`cms/cloudflare/`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/cloudflare) — the OpenTofu root that owns the **`arcodange.fr`** edge. 
One `tofu apply` manages the domain's OVH registration (imported into state, not created here), **delegates its DNS to Cloudflare**, publishes the public site on **Cloudflare Pages**, opens a **Cloudflared** Zero-Trust tunnel into the in-cluster Traefik, mints the **Turnstile** CAPTCHA the [tools CrowdSec bouncer](../tools/components.md) challenges with, and (via a sibling module) wires **Zoho** mail. The Nuxt site itself is not built here — see [Site (Nuxt)](site.md). + +## Providers + +Declared in [`providers.tf`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/cloudflare/providers.tf). Versions are pinned in [`.terraform.lock.hcl`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/cloudflare/.terraform.lock.hcl). + +| Provider | Source | Version | Auth | Purpose | +|---|---|---|---|---| +| `cloudflare` | `cloudflare/cloudflare` | `~> 5` | `CLOUDFLARE_API_TOKEN` env | Zone, Pages, DNS records, Zero-Trust tunnel, Turnstile, zone settings | +| `ovh` | `ovh/ovh` | `~> 2.8` | `OVH_*` env (`ovh-eu` endpoint) | Domain registration + nameserver delegation | +| `vault` | `vault` | `5.5.0` | `auth_login_jwt` (mount `gitea_jwt`, role `gitea_cicd_cms`) at `https://vault.arcodange.lab` | Persists the Turnstile secret/sitekey; reads tunnel token | + +> [!NOTE] +> The Vault provider authenticates with a **Gitea-issued OIDC JWT** (`TERRAFORM_VAULT_AUTH_JWT`), the same OIDC→Vault pattern the [tofu CI flow](../factory-provisioning/opentofu/ci-apply-flow.md) documents lab-wide. + +## State backend — S3 on Cloudflare R2 + +[`backend.tf`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/cloudflare/backend.tf) keeps state in an **S3-compatible bucket on Cloudflare R2**, not AWS. The `skip_*` flags and `use_path_style` are what let the AWS S3 backend talk to R2. 
+ +| Setting | Value | +|---|---| +| `bucket` | `arcodange-tf` | +| `key` | `cms/terraform.tfstate` | +| `region` | `auto` | +| `endpoints.s3` | `var.CLOUDFLARE_S3_ENDPOINT` (R2 S3 API URL) | +| `access_key` / `secret_key` | `var.CLOUDFLARE_S3_ACCESS_KEY` / `var.CLOUDFLARE_S3_SECRET_ACCESS_KEY` | +| Flags | `skip_credentials_validation`, `skip_metadata_api_check`, `skip_region_validation`, `skip_requesting_account_id`, `skip_s3_checksum`, `use_path_style` | + +> [!WARNING] +> The R2 backend credentials are **Terraform variables**, so they must be present in the environment *before* `tofu init` can read state. CI injects them from Vault path `kvv1/cloudflare/r2/arcodange-tf` (mapped to `TF_VAR_CLOUDFLARE_*` — see [CI](#ci--cloudflareyaml) below). Without those creds nothing — not even a read-only plan — can run. + +## Resource graph + +```mermaid +%%{init: {'theme': 'base'}}%% +flowchart TD + classDef ovh fill:#1e3a8a,stroke:#1e40af,color:#fff + classDef cf fill:#d97706,stroke:#b45309,color:#fff + classDef mod fill:#059669,stroke:#047857,color:#fff + classDef vault fill:#7c3aed,stroke:#6d28d9,color:#fff + + OVHDOM["ovh_domain_name<br/>arcodange.fr"]:::ovh
+ OVHNS["ovh_domain_name_servers<br/>delegate NS"]:::ovh
+ ZONE["cloudflare_zone<br/>arcodange.fr"]:::cf
+ PAGES["cloudflare_pages_project<br/>arcodange-cms (branch main)"]:::cf
+ PDOM["cloudflare_pages_domain<br/>arcodange.fr + www"]:::cf
+ DNS["cloudflare_dns_record<br/>@ + www CNAME (proxied)"]:::cf
+ TUN["module.cf_tunnel<br/>Zero-Trust tunnel 'lab'"]:::mod
+ CAP["module.cf_captcha_for_crowdsec<br/>Turnstile widget"]:::mod
+ ZOHO["module.zoho<br/>mail records"]:::mod
+ VBACK["module.vault_backend<br/>cms app role (cloudflared)"]:::vault
+ + OVHDOM --> ZONE + ZONE -- "name_servers" --> OVHNS + ZONE --> PAGES --> PDOM + PAGES -- "subdomain target" --> DNS + ZONE --> DNS + ZONE --> TUN + ZONE --> ZOHO + OVHDOM --> CAP +``` + +1. **`ovh_domain_name "arcodange.fr"`** anchors the registration (imported into state, not created by OpenTofu). +2. **`cloudflare_zone`** creates the Cloudflare zone for that domain under the `arcodange@gmail.com` account. +3. **`ovh_domain_name_servers`** writes Cloudflare's assigned nameservers back at OVH, **delegating DNS to Cloudflare**. +4. **`cloudflare_pages_project "arcodange-cms"`** (production branch `main`) plus two **`cloudflare_pages_domain`** resources attach `arcodange.fr` and `www.arcodange.fr` to Pages. +5. **`cloudflare_dns_record`** publishes apex (`@`) and `www` as **proxied CNAMEs** pointing at the Pages project's `.pages.dev` subdomain. +6. The three **modules** (`cf_tunnel`, `cf_captcha_for_crowdsec`, `zoho`) and `vault_backend` hang off the same zone/domain/account. 
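The ordering described in the numbered steps above can be sketched as a small dependency graph. This is only an illustration under stated assumptions: the edges are the ones drawn in the diagram, the node names are shorthand labels rather than real resource addresses, and OpenTofu derives its own graph from resource references, not from anything like this.

```python
from graphlib import TopologicalSorter

# Edges copied from the resource-graph diagram: each key depends on the
# resources in its set. Names are illustrative labels, not real addresses.
DEPENDS_ON = {
    "cloudflare_zone": {"ovh_domain_name"},
    "ovh_domain_name_servers": {"cloudflare_zone"},
    "cloudflare_pages_project": {"cloudflare_zone"},
    "cloudflare_pages_domain": {"cloudflare_pages_project"},
    "cloudflare_dns_record": {"cloudflare_zone", "cloudflare_pages_project"},
    "module.cf_tunnel": {"cloudflare_zone"},
    "module.zoho": {"cloudflare_zone"},
    "module.cf_captcha_for_crowdsec": {"ovh_domain_name"},
}


def apply_order(graph):
    """Return one valid creation order: dependencies before dependents."""
    return list(TopologicalSorter(graph).static_order())


order = apply_order(DEPENDS_ON)
for step, name in enumerate(order, start=1):
    print(f"{step}. {name}")
```

Several orders are valid; the only fixed points are that the OVH domain comes first and that nameserver delegation cannot precede the Cloudflare zone, since it needs the zone's assigned nameservers.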
+ +### DNS & zone resources + +| Resource | Name | Detail | +|---|---|---| +| `ovh_domain_name.arcodange_fr` | `arcodange.fr` | Registration; `# was terraform imported into state` | +| `cloudflare_zone.arcodange_fr` | `arcodange.fr` | Zone under account resolved from `arcodange@gmail.com` | +| `ovh_domain_name_servers.arcodange_fr` | — | Delegates NS to `cloudflare_zone…name_servers` (or `original_name_servers` when rolling back) | +| `terraform_data.arcodange_fr_initial_conf` | — | Snapshot of OVH's pre-Cloudflare config, kept for rollback inspection (`ignore_changes`) | +| `cloudflare_pages_project.arcodange_fr` | `arcodange-cms` | `production_branch = "main"` | +| `cloudflare_pages_domain.arcodange_fr` | `arcodange.fr` | Custom domain on Pages | +| `cloudflare_pages_domain.www_arcodange_fr` | `www.arcodange.fr` | Custom domain on Pages | +| `cloudflare_dns_record.root_cname` | `@` | CNAME → Pages `subdomain`, `proxied = true`, `ttl = 1` | +| `cloudflare_dns_record.www_cname` | `www` | CNAME → Pages `subdomain`, `proxied = true`, `ttl = 1` | + +All wiring lives in [`iac.tf`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/cloudflare/iac.tf). The account ID is resolved at plan time via `data.cloudflare_account` filtered on the `arcodange@gmail.com` account name. + +## Module: `cloudflared_tunnel` + +[`modules/cloudflared_tunnel/`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/cloudflare/modules/cloudflared_tunnel). A **Zero-Trust Cloudflared tunnel** that lets public hostnames reach in-cluster services **without opening any home-LAN port**: the `cloudflared` connector originates the connection from inside the cluster outward, toward Cloudflare's edge. Instantiated as `module.cf_tunnel` with `tunnel_name = "lab"`. 
+ +| Resource | Role | +|---|---| +| `cloudflare_zero_trust_tunnel_cloudflared.tunnel` | The tunnel named **`lab`** under the account | +| `cloudflare_zero_trust_tunnel_cloudflared_config.tunnel_config` | Ingress rules from `hostname_mappings`, terminating in a catch-all `http_status:404` | +| `data.cloudflare_zone.arcodange` | Looks up the zone (created by the root module) | +| `cloudflare_zone_setting.setting` | Sets **`always_use_https = on`** | +| `cloudflare_dns_record.dns` | One **proxied CNAME** per mapping → `.cfargotunnel.com` | + +The single ingress mapping passed from the root is: + +| Hostname | Service | +|---|---| +| `*.arcodange.fr` | `http://traefik.kube-system.svc.cluster.local:80` | + +So every wildcard subdomain under `arcodange.fr` lands on the cluster's **internal Traefik** (`origin_request.no_tls_verify = true`), which then routes to the right in-cluster app (e.g. the `cms` Nuxt pod, Grafana, etc.). Pairs with the apex/`www` Pages records above, which are *not* tunneled. + +> [!CAUTION] +> **The tunnel token is created by hand and rotation is not automated.** Cloudflare only issues a connector token from the web console, so it is **manually stored in Vault** under the KV-v2 mount `kvv2` at path `cms/cloudflared` (the in-repo `vault_kv_secret` resource is commented out for exactly this reason). The cluster-side `cloudflared` Deployment reads it via a `VaultStaticSecret` (Vault Secrets Operator), role `cms`, refreshed hourly. If the token is rotated in the console, the Vault entry must be updated **manually** — nothing in this IaC will do it. `module.vault_backend` provisions the `cms` Vault app role (service account `cloudflared`) that grants that read; see [secrets-and-vault](../lab-ecosystem/secrets-and-vault.md). + +## Module: `cloudflared_captcha_for_crowdsec` + +[`modules/cloudflared_captcha_for_crowdsec/`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/cloudflare/modules/cloudflared_captcha_for_crowdsec). 
Mints a **Cloudflare Turnstile widget** and stores its keys in Vault for the [tools CrowdSec bouncer](../tools/components.md) to serve as a CAPTCHA challenge on remediated requests. + +| Resource | Detail | +|---|---| +| `cloudflare_turnstile_widget.turnstile` | `name = "crowdsec captcha"`, `mode = "invisible"`, `clearance_level = "interactive"`, `region = "world"`; `bot_fight_mode`/`ephemeral_id`/`offlabel` all `false` | +| `vault_kv_secret_v2.turnstile` | Writes `{ sitekey, secret }` to KV-v2 (`cas = 1`) | + +Instantiated as `module.cf_captcha_for_crowdsec` with `domain_names = [arcodange.fr, arcodange.lab, arcodange.duckdns.org]` and `vault_path = "cms/factory/turnstile"`. + +| What | Where | +|---|---| +| **Turnstile mode** | Invisible widget, interactive clearance — challenges only when CrowdSec flags a request | +| **Vault destination** | `kvv2/cms/factory/turnstile` → keys `sitekey` + `secret` | +| **Consumer** | The [CrowdSec Traefik bouncer in `tools`](../tools/components.md) reads sitekey + secret to render and verify the challenge | + +This is the one knot that ties the **`cms`** edge to the **`tools`** security stack: `cms` produces the Turnstile keys; `tools` consumes them. + +## Sibling module: Zoho mail + +`module.zoho` (source `./zoho`) lives in **this same OpenTofu root** and writes mail records into the same `cloudflare_zone`. It is documented separately on [Zoho email](zoho-email.md) — note that a `cms/cloudflare` apply touches mail DNS too, so plan output there is expected. + +## CI — `cloudflare.yaml` + +[`.gitea/workflows/cloudflare.yaml`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/.gitea/workflows/cloudflare.yaml). Manual-only (`workflow_dispatch`), same Gitea-OIDC→Vault→`tofu apply` shape as the [tofu CI flow concept](../factory-provisioning/opentofu/ci-apply-flow.md). + +1. **`gitea_vault_auth`** — mints a Gitea OIDC id-token (decodes `vault_oauth__sh_b64` and runs it), exported as `gitea_vault_jwt`. +2. 
**`tofu`** — depends on the auth job; a shared `*vault_step` reads all secrets from Vault (role `gitea_cicd_cms`, mount `gitea_jwt`), prepares the homelab CA cert, then runs **`dflook/terraform-apply@v1`** on `path: cloudflare/` with **`auto_approve: true`** at **OpenTofu `1.8.2`**. + +### Vault secrets read by the workflow + +| Vault path | Mapped to | Used for | +|---|---|---| +| `kvv1/cloudflare/cms/cf_arcodange_cms_token` (`token`) | `CLOUDFLARE_API_TOKEN` | Cloudflare provider auth | +| `kvv1/cloudflare/r2/arcodange-tf` (`*`) | `TF_VAR_CLOUDFLARE_*` | R2/S3 state backend creds + endpoint | +| `kvv1/gitea/tofu_module_reader` (`ssh_private_key`) | `TERRAFORM_SSH_KEY` | SSH key to clone the `tools` git module (`vault_backend`) | +| `kvv1/ovh/cms/app` (`*`) | `OVH_*` | OVH provider auth | +| `kvv1/zoho/self_client` (`*`) | `ZOHO_*` **and** `TF_VAR_ZOHO_*` | Zoho API auth for `module.zoho` | + +> [!CAUTION] +> **`auto_approve: true` applies without a human gate.** Any dispatch of this workflow on any ref runs `tofu apply` straight against the live `arcodange.fr` edge and Vault. There is no plan-review step; review happens in the PR before merge, not in the apply. Treat a dispatch as a production change. + +## Gotchas + +> [!CAUTION] +> **Cloudflared tunnel token — manual, unrotated.** Created in the Cloudflare console and hand-placed in Vault under `kvv2` at path `cms/cloudflared`. No IaC rotates it. (Repeated here because it is the most common surprise.) + +> [!WARNING] +> **OVH → Cloudflare nameserver delegation is the live cutover.** `ovh_domain_name_servers` points OVH at Cloudflare's nameservers. The `use_ovh_initial_name_servers` variable (default `false`) is meant to flip delegation back to OVH's `original_name_servers`, but that **rollback path is untested** — `terraform_data.arcodange_fr_initial_conf` only *snapshots* the pre-Cloudflare config for inspection. Do not assume a clean revert. 
+ +> [!WARNING] +> **R2-backed state creds gate everything.** State lives on Cloudflare R2 and the access/secret keys are `TF_VAR_` inputs (from `kvv1/cloudflare/r2/arcodange-tf`). If those creds are missing or rotated out from under the workflow, even `tofu init` fails — there is no fallback backend. + +## Cross-references + +- [CMS](README.md) — the guidebook hub; the public-request + email flow diagram. +- [Site (Nuxt)](site.md) — the Nuxt app served by the Pages project and the in-cluster pod this tunnel fronts. +- [Zoho email](zoho-email.md) — `module.zoho` lives in this same OpenTofu root. +- [tools CrowdSec](../tools/components.md) — consumer of the Turnstile widget minted here. +- [tofu CI flow concept](../factory-provisioning/opentofu/ci-apply-flow.md) — the shared Gitea-OIDC→Vault→apply pattern. +- [secrets-and-vault concept](../lab-ecosystem/secrets-and-vault.md) — where the tunnel token, Turnstile keys, and provider creds live. +- [safe-env ADR](../../ADR/0001-safe-prod-like-environment.md) — why this Internet-facing surface is isolated from the safe prod-like environment. +- Code: [`cms/cloudflare/`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/cloudflare). 
diff --git a/vibe/guidebooks/cms/site.md b/vibe/guidebooks/cms/site.md new file mode 100644 index 0000000..5e20f40 --- /dev/null +++ b/vibe/guidebooks/cms/site.md @@ -0,0 +1,165 @@ +[vibe](../../README.md) > [Guidebooks](../README.md) > [CMS](README.md) > **Site (Nuxt)** + +# Site (Nuxt) + +> **Status:** ✅ Active +> **Last Updated:** 2026-06-23 +> **Upstream:** [CMS](README.md) · [lab-ecosystem 03 · cms](../lab-ecosystem/03-cms.md) +> **Downstream:** [Cloudflare](cloudflare.md) +> **Related:** [Zoho email](zoho-email.md) · [tools CrowdSec](../tools/components.md) · [secrets-and-vault concept](../lab-ecosystem/secrets-and-vault.md) + +The public site face of the [`cms` repo](https://gitea.arcodange.lab/arcodange-org/cms): a **Nuxt 4** application built to **static HTML** and shipped two ways from one image — to **Cloudflare Pages** (the live public `arcodange.fr`) and into **k3s** via a Helm chart behind the Cloudflared tunnel. This page maps the Nuxt app, its Docker build, the Helm chart, and the Gitea Actions that drive both deploys. + +## The Nuxt 4 application + +Configured in [`nuxt.config.ts`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/nuxt.config.ts). It runs `ssr: true` for dev but is shipped via **`nuxt generate`** — a full static prerender — so production is plain HTML served by a static file server, no Node runtime. 
+ +| Concern | Setting | Notes | +|---|---|---| +| Rendering | `ssr: true`, shipped via `nuxt generate` | Static prerender to `.output/public`; Nitro `prerender.autoSubfolderIndex: false` | +| Site identity | `site.url: https://arcodange.fr`, `site.name: Arcodange`, `trailingSlash: true` | Drives canonical URLs, sitemap, robots via `@nuxtjs/seo` | +| Content | `@nuxt/content` collections | Markdown under [`content/`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/content); mermaid highlight enabled | +| Editing | **Nuxt Studio** at route **`/admin`** | `nuxt-studio` module; repo `arcodange-org/cms`, commits to `main` | +| Sitemap / robots | `@nuxtjs/sitemap` (`zeroRuntime: true`), `@nuxtjs/seo` | No runtime sitemap server — fully prerendered | +| Analytics | `@nuxtjs/plausible` | `apiHost: https://analytics.arcodange.fr`, `hashMode: true`, outbound tracking on, `localhost` ignored | +| i18n | `@nuxtjs/i18n` | Single locale **`fr`** (default `fr`); `htmlAttrs.lang: fr` | +| Images | `@nuxt/image` | `webp`/`jpeg`, quality 80 | +| Fonts | `@nuxt/fonts` | Local **Noto Emoji** preloaded | +| UI | `@nuxt/ui` | Plus `@nuxt/scripts`, `@nuxtjs/device`, `nuxt-booster`, `@compodium/nuxt` | + +### Content collections + +Declared in [`content.config.ts`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/content.config.ts). Every collection is wrapped with `asSeoCollection()` (from `@nuxtjs/seo`) and sourced from a folder of Markdown under [`content/`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/content). 
+ +| Collection | Source glob | Type | Schema extras | +|---|---|---|---| +| `parcours` | `parcours/*.md` | `page` | — | +| `site` | `site/*.md` | `page` | — | +| `tech` | `tech/*.md` | `page` | `date` (required), `image` (media), `featured` (default `false`) | +| `experiences` | `experiences/*.md` | `page` | `date`, `enddate`, `icon` (default `i-lucide-rocket`), `image`, `secondaryImage`, `descriptionHTML` | + +A content build transformer `~~/content/transformers/description-md` runs at build time, and Markdown highlighting registers the `mermaid` language. + +## Docker build: one image, two static trees + +[`Dockerfile`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/Dockerfile) is a multi-stage build that produces **two** static outputs from the same source and packs them into a static web server image. + +```mermaid +%%{init: {'theme': 'base'}}%% +flowchart LR + classDef base fill:#7c3aed,stroke:#6d28d9,color:#fff + classDef build fill:#059669,stroke:#047857,color:#fff + classDef out fill:#d97706,stroke:#b45309,color:#fff + + DEPS["cms-deps:TAG<br/>(Dockerfile.deps base)"]:::base
+ BUILD["build stage<br/>npm ci"]:::build
+ PROD["nuxt generate<br/>→ /app/prod"]:::out
+ STG["NUXT_SITE_ENV=staging<br/>nuxt generate → /app/.output/public"]:::out
+ SWS["static-web-server:2<br/>serves /public"]:::build
+ + DEPS --> BUILD + BUILD --> PROD + BUILD --> STG + PROD --> SWS + STG --> SWS +``` + +1. The **build stage** starts `FROM gitea.arcodange.lab/arcodange-org/cms-deps:${BASE_IMAGE_TAG}` — a prebuilt base ([`Dockerfile.deps`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/Dockerfile.deps), `node:24-slim` + `python3`/`make`/`g++`/`sqlite3`/`libvips` for `better-sqlite3`/`libvips`) — copies the source and runs `npm ci`. +2. The **prod** build: `npx nuxt generate`, then the output is moved to **`/app/prod`**. +3. The **staging** build: `NUXT_SITE_ENV="staging" npx nuxt generate`, leaving its output at **`/app/.output/public`**. +4. The **server stage** is `FROM joseluisq/static-web-server:2`; it copies the staging tree to **`/public`** and the prod tree to **`/prod`**, plus [`webserver.config.toml`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/webserver.config.toml) as `/sws.toml`, and serves on port 80. + +> [!NOTE] +> **`/public` is staging, `/prod` is production.** The static-web-server serves `root = "./public"` (the **staging** build) by default — that is what the in-cluster k3s deploy exposes (e.g. `cms-rec.arcodange.fr`). The **prod** build at `/prod` is the tree extracted and pushed to Cloudflare Pages by the `arcodange_fr` workflow. One image therefore carries both faces. + +The final image is pushed to **`gitea.arcodange.lab/arcodange-org/cms`** (tags `latest` and the branch ref). + +## Helm chart + +[`chart/`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/chart) deploys the in-cluster face. The pod is just the static-web-server image above, fronted by Traefik with a CrowdSec middleware and reached either over the lab ingress (`www.arcodange.lab`) or through a sidecar Cloudflared tunnel (`cms-rec.arcodange.fr`). 
+ +| Key | Value | Source | +|---|---|---| +| Chart name / version | `arcodange-cms` / `0.1.0`, `appVersion: latest` | [`Chart.yaml`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/chart/Chart.yaml) | +| Image | `gitea.arcodange.lab/arcodange-org/cms:latest`, `pullPolicy: Always` | [`values.yaml`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/chart/values.yaml) | +| Replicas | `1` (autoscaling disabled) | `replicaCount: 1`, `autoscaling.enabled: false` | +| Service | `ClusterIP`, port **80** (named `http`) | `service.port: 80` | +| Probes | liveness + readiness `httpGet /` on `http` | — | +| ServiceAccount | created, name **`cms`**, automount on | `serviceAccount.name: cms` | +| Lab ingress | `www.arcodange.lab`, path `/` Prefix | Traefik `websecure`, TLS via `letsencrypt` resolver (`arcodange.lab` + SAN `www.arcodange.lab`) | +| Edge middleware | `kube-system-crowdsec@kubernetescrd` | applied on both ingresses | +| Tunnel ingress | `cms-rec.arcodange.fr`, Traefik `web` entrypoint | `ingress.cloudflared.host` | +| Cloudflared sidecar | enabled, `Deployment`, `1` replica, image `cloudflare/cloudflared:latest` | `cloudflared.*` | +| Tunnel token | Vault KV-v2 `kvv2` path `cms/cloudflared`, role `cms`, refresh `1h` | `cloudflared.vault.*` | + +### Chart templates + +The chart renders these objects (in [`chart/templates/`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/chart/templates)): + +| Template | Renders | +|---|---| +| [`deployment.yaml`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/chart/templates/deployment.yaml) | the `cms` static-web-server pod, port `http`/80 | +| [`service.yaml`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/chart/templates/service.yaml) | ClusterIP service on 80 | +| [`ingress.yaml`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/chart/templates/ingress.yaml) | lab Traefik ingress for `www.arcodange.lab` + CrowdSec middleware | +| 
[`ingress_cloudflared.yaml`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/chart/templates/ingress_cloudflared.yaml) | `-cloudflared` ingress for `cms-rec.arcodange.fr` (web entrypoint) | +| [`cloudflared_tunnel.yaml`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/chart/templates/cloudflared_tunnel.yaml) | `cloudflared` SA, `VaultAuth`, `VaultStaticSecret`, and the cloudflared `Deployment`/`DaemonSet` | +| [`serviceaccount.yaml`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/chart/templates/serviceaccount.yaml) | the `cms` ServiceAccount | +| [`ingress_gitea.yaml`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/chart/templates/ingress_gitea.yaml), [`hpa.yaml`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/chart/templates/hpa.yaml) | optional Gitea ingress; HPA (disabled) | + +### Cloudflared tunnel template + +The cloudflared sidecar pulls its tunnel token from Vault through the [VSO](../tools/components.md) operator, never from a static manifest: + +1. A `ServiceAccount` **`cloudflared`** is created with a `VaultAuth` (Kubernetes auth, mount `kubernetes`, role from `cloudflared.vault.role` = `cms`, audience `vault`). +2. A **`VaultStaticSecret`** named `cloudflared-tunnel-token` reads **KV-v2** mount **`kvv2`** at path **`cms/cloudflared`** (refresh `1h`) and materialises a `cloudflared-tunnel-token` Secret. +3. The cloudflared `Deployment` (1 replica, pinned to a `control-plane` node via affinity) runs `cloudflared tunnel --no-autoupdate run --token $(TUNNEL_TOKEN) --no-tls-verify`, with `TUNNEL_TOKEN` injected from that Secret's `token` key. + +This connects Cloudflare's edge to internal Traefik so `cms-rec.arcodange.fr` reaches the in-cluster `cms` service without opening any home-LAN port — the cluster side of the tunnel whose Cloudflare side lives in the [Cloudflare IaC](cloudflare.md). 
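
As a rough illustration of steps 2 and 3 above, the sketch below writes the `VaultStaticSecret` and the cloudflared command line as plain Python data. The mount, path, refresh interval, Secret name, and command flags come from this page. The `apiVersion`, the exact CRD field names, and the `vaultAuthRef` value are assumptions based on the Vault Secrets Operator CRDs, not a copy of `cloudflared_tunnel.yaml`.

```python
import json


def cloudflared_static_secret(auth_ref="cloudflared"):
    """Sketch of the VaultStaticSecret from step 2 (field names are assumed)."""
    return {
        "apiVersion": "secrets.hashicorp.com/v1beta1",  # assumed VSO API version
        "kind": "VaultStaticSecret",
        "metadata": {"name": "cloudflared-tunnel-token"},
        "spec": {
            "vaultAuthRef": auth_ref,  # hypothetical name of the VaultAuth from step 1
            "type": "kv-v2",
            "mount": "kvv2",
            "path": "cms/cloudflared",
            "refreshAfter": "1h",
            # The operator materialises a Kubernetes Secret with this name.
            "destination": {"name": "cloudflared-tunnel-token", "create": True},
        },
    }


def tunnel_command(token_env="TUNNEL_TOKEN"):
    """Container args from step 3; $(VAR) is expanded by Kubernetes at start-up."""
    return [
        "cloudflared", "tunnel", "--no-autoupdate", "run",
        "--token", "$(" + token_env + ")", "--no-tls-verify",
    ]


if __name__ == "__main__":
    print(json.dumps(cloudflared_static_secret(), indent=2))
    print(" ".join(tunnel_command()))
```

The point of the shape is that the token never appears in the chart: the manifest only names where it lives in Vault, and the Deployment reads the resulting Secret's `token` key.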
+ +## CI: building and deploying + +Three Gitea Actions workflows under [`.gitea/workflows/`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/.gitea/workflows) cover the site. (A fourth, `cloudflare.yaml`, drives the OpenTofu — see [Cloudflare](cloudflare.md).) + +| Workflow | Triggers | What it does | +|---|---|---| +| [`docker-dependencies.yaml`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/.gitea/workflows/docker-dependencies.yaml) | `workflow_dispatch`; push to `main` touching `package.json`, `package-lock.json`, `Dockerfile.deps` | Builds the **deps** base image, pushes `gitea.arcodange.lab/-deps:{latest,YYYYMMDD-SHA8}`, then creates+pushes a **git tag `deps-YYYYMMDD-SHA8`** (with retry, up to 30 attempts) | +| [`docker-content.yaml`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/.gitea/workflows/docker-content.yaml) | `workflow_dispatch`; push to `main` touching `nuxt.config.ts`, `app/**`, `content.config.ts`, `content/**`, `public/**`, `package*.json`, `Dockerfile` | Finds the latest `deps-*` git tag, strips `deps-` to get `BASE_TAG`, builds the **full image** with `--build-arg BASE_IMAGE_TAG=$BASE_TAG`, pushes `gitea.arcodange.lab/:{latest,}` | +| [`arcodange_fr.yaml`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/.gitea/workflows/arcodange_fr.yaml) | `workflow_dispatch` (input `image_tag`, default `main`) | Pulls `cms:`, `docker create` + `docker cp` to extract **`/prod`** to `./public`, writes a minimal `wrangler.toml`, then **`wrangler pages deploy`** to project `arcodange-cms`, branch `main` | + +> [!IMPORTANT] +> **The deps tag is the contract between the two Docker workflows.** `docker-dependencies` publishes both the `-deps` image and a matching **git tag** `deps-YYYYMMDD-SHA8`; `docker-content` discovers that tag (`git tag --list "deps-*" | sort -V | tail -n1`) to pin its `BASE_IMAGE_TAG`. 
Touch `package.json`/lockfile/`Dockerfile.deps` and the deps build must land first, or the content build pins a stale base. + +### From image to Cloudflare Pages + +```mermaid +%%{init: {'theme': 'base'}}%% +flowchart LR + classDef ci fill:#059669,stroke:#047857,color:#fff + classDef reg fill:#7c3aed,stroke:#6d28d9,color:#fff + classDef edge fill:#d97706,stroke:#b45309,color:#fff + + DEP["docker-dependencies<br/>→ -deps image + git tag deps-*"]:::ci + CON["docker-content<br/>pins BASE_IMAGE_TAG"]:::ci + REG["registry<br/>gitea.arcodange.lab/arcodange-org/cms"]:::reg + FR["arcodange_fr<br/>extract /prod"]:::ci + PAGES["Cloudflare Pages<br/>project arcodange-cms"]:::edge + K3S["k3s Helm chart<br/>serves /public (staging)"]:::edge + + DEP --> CON --> REG + REG --> FR --> PAGES + REG --> K3S +``` + +1. **`docker-dependencies`** publishes the `-deps` base image and a `deps-YYYYMMDD-SHA8` git tag whenever dependencies change. +2. **`docker-content`** resolves that tag, builds the full dual-tree image, and pushes it to **`gitea.arcodange.lab/arcodange-org/cms`**. +3. **`arcodange_fr`** (manual) pulls that image, extracts the **`/prod`** tree, and deploys it to **Cloudflare Pages** project `arcodange-cms` on branch `main` — this is the live public `arcodange.fr`. +4. In parallel, the k3s **Helm chart** runs the same image and serves the **`/public`** (staging) tree behind Traefik + CrowdSec and the Cloudflared tunnel (`cms-rec.arcodange.fr`, `www.arcodange.lab`). + +## Cross-references + +- [CMS](README.md) — the guidebook hub: the two faces of the repo and the public request/email flow. +- [Cloudflare](cloudflare.md) — the Cloudflare side of the tunnel, the Pages project, and the zone the deploys publish into. +- [Zoho email](zoho-email.md) — mail for the same `arcodange.fr` domain. +- [tools CrowdSec](../tools/components.md) — the Traefik bouncer middleware fronting both chart ingresses. +- [secrets-and-vault concept](../lab-ecosystem/secrets-and-vault.md) — where the Cloudflared tunnel token (`kvv2` `cms/cloudflared`) and registry/CF credentials live. +- Repo: [arcodange-org/cms](https://gitea.arcodange.lab/arcodange-org/cms).
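
The deps-tag contract described in the CI section of this page (`git tag --list "deps-*" | sort -V | tail -n1`, then stripping the `deps-` prefix) can be sketched in Python as follows. This is a hedged illustration rather than the workflow's own code: `natural_key`, `latest_deps_tag`, and `base_image_tag` are hypothetical names, the sample tags are made up, and the natural-sort key only approximates GNU `sort -V`.

```python
import re


def natural_key(tag):
    """Split a tag into digit and non-digit runs so numbers compare as integers."""
    return [
        (0, int(part)) if part.isdigit() else (1, part)
        for part in re.findall(r"\d+|\D+", tag)
    ]


def latest_deps_tag(tags):
    """Return the highest deps-* tag, roughly what `sort -V | tail -n1` selects."""
    deps = [t for t in tags if t.startswith("deps-")]
    if not deps:
        raise ValueError("no deps-* tag found; the deps build has not landed yet")
    return max(deps, key=natural_key)


def base_image_tag(tags):
    """Strip the deps- prefix to get the value passed as BASE_IMAGE_TAG."""
    return latest_deps_tag(tags)[len("deps-"):]


if __name__ == "__main__":
    # Made-up tags in the deps-YYYYMMDD-SHA8 layout.
    sample = ["v0.1.0", "deps-20260101-aaaaaaaa", "deps-20260623-7647a68c"]
    print(base_image_tag(sample))
```

Because the date component is fixed-width, the newest build wins as long as builds land on different days. If the deps build has not produced a newer tag yet, the content build silently resolves the previous one, which is the stale-base hazard the contract warns about.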
diff --git a/vibe/guidebooks/cms/zoho-email.md b/vibe/guidebooks/cms/zoho-email.md new file mode 100644 index 0000000..8d57549 --- /dev/null +++ b/vibe/guidebooks/cms/zoho-email.md @@ -0,0 +1,116 @@ +[vibe](../../README.md) > [Guidebooks](../README.md) > [CMS](README.md) > **Zoho email** + +# Zoho email + +> **Status:** ✅ Active +> **Last Updated:** 2026-06-23 +> **Upstream:** [CMS](README.md) · [Cloudflare](cloudflare.md) +> **Downstream:** [secrets-and-vault concept](../lab-ecosystem/secrets-and-vault.md) +> **Related:** [lab-ecosystem 03 · cms](../lab-ecosystem/03-cms.md) · [tofu CI flow](../factory-provisioning/opentofu/ci-apply-flow.md) · [safe-env ADR](../../ADR/0001-safe-prod-like-environment.md) · [safe-env PRD](../../PRD/safe-prod-like-environment/README.md) + +Email for **arcodange.fr** is hosted at **Zoho Mail (EU region)** and provisioned *entirely from OpenTofu*. There is no Zoho web-console click-ops in the steady state: the same `tofu apply` that owns the Cloudflare zone also drives the Zoho REST API to read the organization, publish the DNS records mail delivery depends on, and create one mailbox alias + one Inbox sub-folder per address. The module lives under [`cms/cloudflare/zoho/`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/cloudflare/zoho), a sub-module of the [Cloudflare](cloudflare.md) tofu root. + +> [!CAUTION] +> **DNS/email changes here are high-stakes and slow to fail.** A wrong MX, SPF, DKIM, or DMARC record silently degrades or breaks `arcodange.fr` deliverability for **days** — resolvers cache the records for their TTL, reputation decays, and there is no synchronous error to catch in CI. DMARC is published as **`p=reject`**, so a broken SPF/DKIM alignment means conforming receivers *drop* legitimate mail outright rather than quarantine it.
This is a prime motivation for the **safe environment**: changes to this module must be validated **plan-only against a throwaway/clone zone**, never iterated directly against the live `arcodange.fr` zone. See the [safe-env ADR](../../ADR/0001-safe-prod-like-environment.md) and the [safe-env PRD](../../PRD/safe-prod-like-environment/README.md). + +## How the module is wired + +The Cloudflare root ([`cloudflare/iac.tf`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/cloudflare/iac.tf)) instantiates `module "zoho"`, passing it the live zone and domain plus the OAuth client credentials: + +| Input | Source | Purpose | +|---|---|---| +| `domain_name` | `ovh_domain_name.arcodange_fr.domain_name` | the domain to manage (`arcodange.fr`) | +| `dns_zone_id` | `cloudflare_zone.arcodange_fr.id` | Cloudflare zone the DNS records land in | +| `zoho_client_id` | `var.ZOHO_CLIENT_ID` (Vault `kvv1/zoho/self_client`) | OAuth2 self-client id | +| `zoho_client_secret` | `var.ZOHO_CLIENT_SECRET` (Vault `kvv1/zoho/self_client`) | OAuth2 self-client secret | + +In CI the secrets are injected by the `vault-action` step in [`.gitea/workflows/cloudflare.yaml`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/.gitea/workflows/cloudflare.yaml), which maps the whole `kvv1/zoho/self_client` KV-v1 secret into **both** the shell env (`ZOHO_*`, consumed by the helper scripts) and the tofu vars (`TF_VAR_ZOHO_*`, consumed by `config.tf`): + +``` +kvv1/zoho/self_client * | ZOHO_ ; +kvv1/zoho/self_client * | TF_VAR_ZOHO_ ; +``` + +## OAuth2: client-credentials flow + +Zoho is a self-client (machine-to-machine) integration on the **EU** datacenter — every host is `*.zoho.eu` / `accounts.zoho.eu`. Authentication uses the OAuth2 **`client_credentials`** grant; there is no interactive user consent in the running flow (a commented device-code flow remains in [`.env`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/cloudflare/.env) as historical bootstrap). 
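
A minimal sketch of the `client_credentials` exchange, using the token endpoint and scope names documented on this page. It only builds the request and the resulting header and sends nothing. Passing the client id and secret as form fields is an assumption about how the request is shaped, and `build_token_request` and `auth_headers` are hypothetical helper names, not functions from `config.tf` or the helper scripts.

```python
from urllib.parse import parse_qs, urlencode

TOKEN_ENDPOINT = "https://accounts.zoho.eu/oauth/v2/token"

# Scope names as listed on this page.
SCOPES = [
    "ZohoMail.partner.organization.READ",
    "ZohoMail.organization.accounts.READ",
    "ZohoMail.organization.accounts.UPDATE",
    "ZohoMail.organization.domains.READ",
    "ZohoMail.folders.ALL",
]


def build_token_request(client_id, client_secret, scopes=SCOPES):
    """Return (url, form_body) for a client_credentials token request."""
    form = {
        "grant_type": "client_credentials",
        "client_id": client_id,          # assumed to travel as a form field
        "client_secret": client_secret,  # assumed to travel as a form field
        "scope": ",".join(scopes),       # comma-joined scope list
    }
    return TOKEN_ENDPOINT, urlencode(form)


def auth_headers(access_token):
    """Fold the bearer into the header reused by every subsequent read."""
    return {"Authorization": "Zoho-oauthtoken " + access_token}


if __name__ == "__main__":
    url, body = build_token_request("example-id", "example-secret")
    print(url)
    print(parse_qs(body)["scope"][0])
```

Keeping the request builder separate from the HTTP call makes it possible to check the scope list and grant type in a plan-only or offline setting, without ever contacting the live Zoho account.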
+ +The token is minted in [`config.tf`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/cloudflare/zoho/config.tf) via a `data "http"` POST to `https://accounts.zoho.eu/oauth/v2/token` with `grant_type=client_credentials` and the comma-joined scope list. The bearer is then folded into an `Authorization: Zoho-oauthtoken ` header (`local.auth_headers`) reused by every subsequent read. + +| Scope | Access | Why it is needed | +|---|---|---| +| `ZohoMail.partner.organization.READ` | READ (org) | resolve the org **ZOID** | +| `ZohoMail.organization.accounts.READ` | READ (accounts) | find the super-admin **account id / zuid** | +| `ZohoMail.organization.accounts.UPDATE` | UPDATE (accounts) | add / remove email aliases | +| `ZohoMail.organization.domains.READ` | READ (domains) | fetch the domain verification code + DKIM public key | +| `ZohoMail.folders.ALL` | ALL (folders) | list and create per-alias Inbox sub-folders | + +Lookup chain (each step feeds the next): + +1. `GET https://mail.zoho.eu/api/organization` → `local.org`, from which `zoid` builds `local.api_prefix = https://mail.zoho.eu/api/organization/`. +2. `GET {api_prefix}/domains/{domain_name}` ([`dns.tf`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/cloudflare/zoho/dns.tf)) → `local.domain`, exposing `CNAMEVerificationCode` and `dkimDetailList[0].publicKey`. +3. `GET {api_prefix}/accounts` ([`email_aliases.tf`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/cloudflare/zoho/email_aliases.tf)) → the single `iamUserRole == "super_admin"` account, giving its `accountId` and `zuid`. + +## DNS records published on the Cloudflare zone + +[`modules/zoho_mail_dns`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/cloudflare/zoho/modules/zoho_mail_dns) materialises every `cloudflare_dns_record` Zoho mail needs onto the live zone. 
The DKIM key and verification code are read live from the Zoho domain API (step 2 above) and passed in as module inputs, so the records always track what Zoho actually expects. All records use **TTL 3600** and apply to the apex (`@`) unless noted. + +| Name | Type | Value | Purpose | +|---|---|---|---| +| `@` | TXT | `"zoho-verification=.zmverify.zoho.eu"` | proves domain ownership to Zoho | +| `@` | MX | `mx.zoho.eu` (priority **10**) | primary inbound mail exchanger | +| `@` | MX | `mx2.zoho.eu` (priority **20**) | secondary mail exchanger | +| `@` | MX | `mx3.zoho.eu` (priority **50**) | tertiary mail exchanger | +| `@` | TXT | `"v=spf1 include:zohomail.eu ~all"` | SPF: authorise Zoho to send for the domain | +| `zmail._domainkey` | TXT | `""` (from `dkimDetailList[0].publicKey`) | DKIM public key for outbound signing | +| `_dmarc` | TXT | `"v=DMARC1; p=reject; rua=mailto:arcodange@gmail.com; ruf=mailto:arcodange@gmail.com; sp=reject; adkim=r; aspf=r; pct=100"` | DMARC policy: **reject** non-aligned mail, 100% coverage, aggregate+forensic reports to `arcodange@gmail.com` | +| `default._bimi` | TXT | `"v=BIMI1; l=https://arcodange.fr/.well-known/logo.svg; avp=brand;"` | BIMI: display the brand logo beside authenticated mail (created only when `bimi_logo_url != null`) | + +> [!WARNING] +> The DMARC policy is the strictest tier: `p=reject` **and** `sp=reject` (subdomains) with relaxed alignment (`adkim=r`, `aspf=r`) and `pct=100`. There is no `quarantine` grace band — any message that fails both SPF *and* DKIM alignment is rejected by conforming receivers. Validate SPF/DKIM correctness in the safe environment before touching the live `_dmarc` or apex records. + +## Email aliases + +Seven addresses are defined as a single map in [`email_aliases.tf`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/cloudflare/zoho/email_aliases.tf) (`local.email_aliases`). 
Each is provisioned **twice** against the super-admin mailbox: as an **email alias** on the account, and as a matching **Inbox sub-folder** so mail to that address can be filtered into its own folder. + +| Alias (`@arcodange.fr`) | Display name | Purpose | +|---|---|---| +| `bonjour` | `Service Bonjour` | commercial / sales | +| `bureaux` | `Bureaux Arcodange` | official bodies (URSSAF, administration) | +| `contact` | `Premier Contact` | website contact form | +| `helloworld` | `✅ Arcodange 🏹💻🪽` | social networks, newsletter | +| `analytics` | `Analytics 📊🔍` | social networks, newsletter | +| `books` | `Accounting 📒🧮` | accounting / bookkeeping | +| `abonnements` | `Abonnements 📱🤖` | subscriptions (phone, AI, services) | + +Provisioning is *imperative-inside-declarative*: each alias is a `terraform_data` resource whose `triggers_replace` watches whether the alias/folder is already present, and whose `local-exec` provisioners shell out to [`zoho_api_call.sh`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/cloudflare/zoho/zoho_api_call.sh) on **create** and **destroy**: + +1. **Alias create** — `PUT {api_prefix}/accounts/{zuid}` with `mode=addEmailAlias`, scope `ZohoMail.organization.accounts.UPDATE`; fails fast if the response contains `OPERATION_NOT_PERMITTED`. +2. **Alias destroy** — same endpoint with `mode=deleteEmailAlias` (the bare local-part, split off the `alias:display` key). +3. **Folder create** — `POST /api/accounts/{accountId}/folders` with `parentFolderId` = the resolved **Inbox** folder id, scope `ZohoMail.folders.ALL`. +4. **Folder destroy** — looks the folder id up by name, `DELETE`s it, then also sweeps the corresponding `/Trash/` (or `/Trash/Inbox_`) folder Zoho leaves behind. + +> [!NOTE] +> `terraform_data` + `local-exec` is used because aliases and folders are Zoho-side mutations with no first-class Terraform provider. The `triggers_replace = { missing = !contains(...) 
}` guard makes the apply idempotent: the provisioner only re-runs when the alias/folder is genuinely absent, so a clean plan is a no-op rather than a re-create. + +## Helper scripts + +Both scripts live beside the tofu and are invoked from `local-exec`. They share the OAuth client env vars (`ZOHO_CLIENT_ID`, `ZOHO_CLIENT_SECRET`, `ZOHO_TOKEN_ENDPOINT`) injected from Vault. + +| Script | Role | +|---|---| +| [`zoho_api_call.sh`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/cloudflare/zoho/zoho_api_call.sh) | Thin HTTP wrapper. Parses `--endpoint`, `-x=`, `--scope`, `--data_json` / `--data_url`, and `--fail_if_str_in_resp`; sources `zoho_gen_token.sh`, attaches the bearer header, `curl`s the call, fails if a sentinel string (e.g. `OPERATION_NOT_PERMITTED`) appears, and emits compact JSON via `jq`. | +| [`zoho_gen_token.sh`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/cloudflare/zoho/zoho_gen_token.sh) | OAuth token cache. `gen_zoho_token ` returns a cached token from `/tmp/zoho_oauth_tokens.cache` when fresh, otherwise mints a new `client_credentials` token and stores it. | + +`zoho_gen_token.sh` is **lock-based and TTL-bounded**: + +- A mutex is taken by `mkdir /tmp/zoho_oauth_tokens.lock` (atomic dir creation), with up to 10 one-second retries, so concurrent `local-exec` provisioners don't corrupt the cache. The lock is released on every function exit via `trap`. +- Tokens are keyed by scope in `/tmp/zoho_oauth_tokens.cache` (file mode `600`). A token is reused only while younger than **3600 s (~1 h)**; `cleanup_cache` prunes expired entries on each call. +- The wrapper runs `cleanup_cache` before each request and re-traps it on `INT TERM EXIT`, so stale tokens never leak past their TTL. + +## Cross-references + +- **Parent tofu / zone & Pages:** [Cloudflare](cloudflare.md) — owns `cloudflare_zone.arcodange_fr` that this module writes records into, and the `vault-action` CI step that supplies the credentials. 
+- **Where these secrets come from:** [secrets-and-vault concept](../lab-ecosystem/secrets-and-vault.md) (`kvv1/zoho/self_client`). +- **How apply runs:** [tofu CI flow](../factory-provisioning/opentofu/ci-apply-flow.md). +- **Why a safe environment exists:** [safe-env ADR](../../ADR/0001-safe-prod-like-environment.md) · [safe-env PRD](../../PRD/safe-prod-like-environment/README.md). diff --git a/vibe/guidebooks/lab-ecosystem/02-tools.md b/vibe/guidebooks/lab-ecosystem/02-tools.md index 116b7ef..874f342 100644 --- a/vibe/guidebooks/lab-ecosystem/02-tools.md +++ b/vibe/guidebooks/lab-ecosystem/02-tools.md @@ -5,6 +5,7 @@ > **Status:** ✅ Active > **Last Updated:** 2026-06-23 > **Upstream:** [01 · factory](01-factory.md) +> **Deeper dive:** [Tools guidebook](../tools/README.md) — deploy model, component inventory, and per-component internals > **Related:** [secrets-and-vault.md](secrets-and-vault.md) · [storage-and-recovery.md](storage-and-recovery.md) The [`tools` repo](https://gitea.arcodange.lab/arcodange-org/tools) is deployed by factory's ArgoCD into the **`tools` namespace**. It is the platform layer that every app namespace depends on: secrets (Vault + VSO), observability (Prometheus + Grafana), edge security (CrowdSec), database pooling (pgbouncer / pgcat), caching (Redis/KeyDB), and analytics (Plausible + ClickHouse). Each component ships its own Helm chart or Kustomize overlay, and most carry an `iac/` directory of OpenTofu that declares the Vault config (roles, policies, dynamic-secret backends) that wires the component to secrets — see [secrets-and-vault.md](secrets-and-vault.md). @@ -69,6 +70,7 @@ flowchart TB ## Cross-references +- [Tools guidebook](../tools/README.md) — the deeper dive: deploy model (one ArgoCD app → meta-chart → per-component Applications), full component inventory, and per-component internals. - [Lab ecosystem hub](README.md) — the whole-lab map. 
- [01 · factory](01-factory.md) — the ArgoCD that deploys this namespace, and the `postgres/iac/` roles + `user_lookup()` that pgbouncer consumes. - [03 · cms](03-cms.md) — the public edge protected by **CrowdSec** (Turnstile → CrowdSec wiring). diff --git a/vibe/guidebooks/lab-ecosystem/03-cms.md b/vibe/guidebooks/lab-ecosystem/03-cms.md index 2bcdaad..930d04a 100644 --- a/vibe/guidebooks/lab-ecosystem/03-cms.md +++ b/vibe/guidebooks/lab-ecosystem/03-cms.md @@ -6,6 +6,7 @@ > **Last Updated:** 2026-06-23 > **Upstream:** [01 · factory](01-factory.md) > **Related:** [02 · tools](02-tools.md) · [secrets-and-vault.md](secrets-and-vault.md) +> **Deeper dive:** [CMS guidebook](../cms/README.md) The [`cms` repo](https://gitea.arcodange.lab/arcodange-org/cms) is the **public-facing site** of the lab: a Nuxt static site served at **`arcodange.fr`**, plus the OpenTofu that owns its Cloudflare edge and its Zoho email. It is the one app whose primary audience is the open Internet, so it ties together the public-DNS, tunnel, CAPTCHA, and email plumbing. @@ -75,6 +76,7 @@ flowchart LR ## Cross-references +- [CMS guidebook](../cms/README.md) — the deeper-dive map of the `cms` repo: the Nuxt site, the Cloudflare edge, and Zoho email. - [Lab ecosystem hub](README.md) — the whole-lab map. - [01 · factory](01-factory.md) — the ArgoCD app `cms`, and `iac/cloudflare.tf` / `iac/ovh.tf` that grant the CMS its Cloudflare token and OVH nameserver-edit rights. - [02 · tools](02-tools.md) — **CrowdSec** (the Traefik bouncer the Turnstile challenge feeds). 
diff --git a/vibe/guidebooks/tools/README.md b/vibe/guidebooks/tools/README.md new file mode 100644 index 0000000..6ae0afc --- /dev/null +++ b/vibe/guidebooks/tools/README.md @@ -0,0 +1,114 @@ +[vibe](../../README.md) > [Guidebooks](../README.md) > **Tools** + +# Tools + +> **Status:** ✅ Active +> **Last Updated:** 2026-06-23 +> **Upstream:** [Guidebooks index](../README.md) · [lab-ecosystem 02 · tools](../lab-ecosystem/02-tools.md) +> **Downstream:** [Components](components.md) · [Secrets & VSO](secrets-and-vso.md) +> **Related:** [secrets-and-vault concept](../lab-ecosystem/secrets-and-vault.md) · [tofu CI apply flow](../factory-provisioning/opentofu/ci-apply-flow.md) · [safe-env ADR](../../ADR/0001-safe-prod-like-environment.md) + +The [`tools` repo](https://gitea.arcodange.lab/arcodange-org/tools) is the lab's **platform layer**: the cluster-wide services every app namespace leans on — secrets (Vault + VSO), observability (Prometheus + Grafana), edge security (CrowdSec), database pooling (pgbouncer), caching (Redis/KeyDB), and analytics (Plausible + ClickHouse). Everything in this repo lands in the single **`tools` namespace**. + +This hub explains the **deploy model** — how one factory-owned ArgoCD Application fans out into one Application per component — and gives a **component inventory**. For per-component internals see [Components](components.md); for how secrets reach the pods see [Secrets & VSO](secrets-and-vso.md). + +## Deploy model + +The whole repo is wired into the cluster through a single **meta-chart** that factory's ArgoCD points at: + +1. Factory's ArgoCD declares **one** Application named `tools` whose source is this repo's [`chart/`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/chart) meta-chart. +2. 
That meta-chart renders two kinds of object from [`chart/values.yaml`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/chart/values.yaml): + - an **AppProject** named `tools` ([`chart/templates/project.yaml`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/chart/templates/project.yaml)) that pins every child Application to `sourceRepos: tools` and `destinations: tools` namespace only; + - one ArgoCD **Application per component** ([`chart/templates/apps.yaml`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/chart/templates/apps.yaml) — a `range` over `.Values.tools`), each pointing `path:` at the matching **top-level directory** of the repo (`path: pgbouncer`, `path: grafana`, …). +3. Each child Application targets `namespace: tools`, with `automated` sync (`prune: true`, `selfHeal: true`) and `CreateNamespace=true`. +4. A component directory is **either** a Helm chart (`Chart.yaml` whose `dependencies:` pull the upstream chart + the `tool` library) **or** a Kustomize overlay (`kustomization.yaml` using a `helmCharts:` inflation generator). +5. [`tool/`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/tool) is a Helm **library chart** (`type: library`): it ships shared templates/helpers consumed by the component charts via `dependencies:` and is **not deployable** on its own. + +> [!NOTE] +> A component is deployed **only if it appears as a key under `tools:` in [`chart/values.yaml`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/chart/values.yaml)**. `pgcat` is present in the repo but commented out there, so no Application is rendered for it. 
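
The fan-out described in the deploy model can be sketched as a plain Python function that mirrors the `range` over `.Values.tools`. It is an illustration under assumptions, not the content of `apps.yaml`: `render_applications` is a hypothetical name, the `destination.server` value is the conventional in-cluster address rather than something this page states, and the sample values are trimmed.

```python
REPO_URL = "https://gitea.arcodange.lab/arcodange-org/tools"


def render_applications(values):
    """Render one ArgoCD Application per key under `tools:` in the values."""
    apps = []
    for name in (values.get("tools") or {}):
        apps.append({
            "apiVersion": "argoproj.io/v1alpha1",
            "kind": "Application",
            "metadata": {"name": name},
            "spec": {
                "project": "tools",  # the AppProject rendered by project.yaml
                "source": {"repoURL": REPO_URL, "path": name},  # top-level dir
                "destination": {
                    "server": "https://kubernetes.default.svc",  # assumed
                    "namespace": "tools",
                },
                "syncPolicy": {
                    "automated": {"prune": True, "selfHeal": True},
                    "syncOptions": ["CreateNamespace=true"],
                },
            },
        })
    return apps


if __name__ == "__main__":
    # pgcat is commented out in chart/values.yaml, so it is simply absent here.
    values = {"tools": {"pgbouncer": {}, "grafana": {}}}
    for app in render_applications(values):
        print(app["metadata"]["name"], "->", app["spec"]["source"]["path"])
```

The loop makes the rule in the note concrete: a directory that exists in the repo but has no key under `tools:` never produces an Application, so nothing is deployed for it.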
+ +## Component inventory + +| Component | How declared (chart + version OR Kustomize) | Ingress host | Persistence | Purpose | +|---|---|---|---|---| +| **hashicorp-vault** | Helm — `hashicorp/vault` `0.28.1` (+ `tool` lib) | `vault.arcodange.lab` (Traefik, Let's Encrypt) | `storage "file"` at `/vault/data` + audit storage (PVC) | Secrets engine: KV, transit, PostgreSQL dynamic creds; auth `kubernetes` + Gitea OIDC/JWT | +| **vault-secrets-operator (VSO)** | Helm — `hashicorp/vault-secrets-operator` `0.9.0`, a dependency of the `hashicorp-vault` chart | — | — | Injects Vault secrets into pods via `VaultAuth` / `VaultDynamicSecret` CRDs; client-cache `direct-encrypted` via transit | +| **prometheus** | Helm — `prometheus-community/prometheus` `28.13.0` (app `v3.10.0`) | none (in-cluster) | `persistentVolume` enabled, `8Gi` | Metrics scraping + TSDB storage | +| **grafana** | Helm — `grafana/grafana` `10.3.0` (+ `tool` lib) | `grafana.arcodange.lab` (Traefik, Let's Encrypt) | `persistence.enabled: false` (ephemeral; dashboards provisioned) | Dashboards; datasources Prometheus + ClickHouse | +| **crowdsec** | Helm — `crowdsecurity/crowdsec` `0.20.1` (+ `tool` lib) | none (Traefik bouncer + AppSec on the edge) | LAPI state in external PostgreSQL (via pgbouncer) | Behavioural detection; agent parses Traefik logs, AppSec virtual-patching | +| **pgbouncer** | Helm — `icoretech/pgbouncer` `2.3.1` (+ `tool` lib) | none (cluster service `pgbouncer.tools`) | stateless (config only) | Connection pooler to the **external** PostgreSQL on `pi2` (`192.168.1.202`), pinned via `kubernetes.io/hostname: pi2` | +| **redis / KeyDB** | Helm — `pascaliske/redis` `2.1.0` (+ `tool` lib) | none (cluster service) | PVC `create: true`, `1Gi` at `/data` | In-memory cache; KeyDB master + replica, Redis-compatible | +| **plausible** | **Kustomize** — inflates `pascaliske/plausible` `2.0.0` | `analytics.arcodange.lab` (Traefik `IngressRoute`, Let's Encrypt) | stateless app; data lives in 
ClickHouse | Privacy-friendly web analytics; `DB_HOST: pgbouncer.tools` | +| **clickhouse** | **Kustomize** — inflates `pascaliske/clickhouse` `0.4.0` + local `databases` chart | none (cluster service) | PVC `16Gi` (StatefulSet) | OLAP column store backing Plausible | +| **pgcat** *(disabled)* | Helm — `improwised/pgcat` `0.1.0` — **commented out** in `chart/values.yaml` | — | — | Alternative pooler; not rendered (too constraining: must list every db/user, md5-only auth) | +| **tool** *(library)* | Helm **library chart** (`type: library`), not deployable | — | — | Shared templates/helpers consumed by the component charts | + +## How tools fit together + +```mermaid +%%{init: {'theme': 'base'}}%% +flowchart TB + classDef ext fill:#7c3aed,stroke:#6d28d9,color:#fff + classDef proc fill:#059669,stroke:#047857,color:#fff + classDef edge fill:#d97706,stroke:#b45309,color:#fff + classDef meta fill:#2563eb,stroke:#1e40af,color:#fff + + ARGOCD["factory ArgoCD<br/>Application: tools"]:::meta + META["tools meta-chart<br/>chart/ (apps.yaml + project.yaml)"]:::meta + PROJ["AppProject: tools"]:::meta + + subgraph NS["tools namespace"] + VAULT[("hashicorp-vault<br/>+ VSO")]:::ext + PROM["prometheus"]:::proc + GRAF["grafana"]:::proc + CS["crowdsec<br/>Traefik bouncer + AppSec"]:::edge + PGB["pgbouncer"]:::proc + REDIS[("redis / KeyDB")]:::ext + PLA["plausible"]:::proc + CH[("clickhouse")]:::ext + PODS["app + tool pods"]:::proc + end + + PG[("external PostgreSQL<br/>pi2 · 192.168.1.202")]:::ext + TRAEFIK["Traefik ingress<br/>vault / grafana / analytics .arcodange.lab"]:::edge + + ARGOCD --> META + META --> PROJ + META -- "one Application per component" --> NS + VAULT -- "inject secrets (VSO)" --> PODS + PGB -- "pools to" --> PG + PLA -- "writes analytics" --> CH + PROM --> GRAF + CH --> GRAF + TRAEFIK --> VAULT + TRAEFIK --> GRAF + TRAEFIK --> PLA + CS -- "fronts the edge" --> TRAEFIK +``` + +1. **Factory's ArgoCD** owns a single Application named `tools` pointed at this repo's `chart/` meta-chart. +2. The **meta-chart** renders the `tools` **AppProject** (which scopes every child to the `tools` repo + `tools` namespace) and **one Application per component** listed under `tools:` in `chart/values.yaml`. +3. Every child Application deploys into the **`tools` namespace** — Vault+VSO, Prometheus, Grafana, CrowdSec, pgbouncer, Redis/KeyDB, Plausible, ClickHouse. +4. **Vault + VSO** inject secrets into app and tool pods via the `VaultAuth` / `VaultDynamicSecret` CRDs. +5. **pgbouncer** pools connections out to the **external PostgreSQL** on `pi2` (`192.168.1.202`), the same database CrowdSec's LAPI and Plausible use through it. +6. **Plausible** writes analytics into **ClickHouse**; both **Prometheus** and **ClickHouse** are wired as **Grafana** datasources. +7. **Traefik** publishes `vault.arcodange.lab`, `grafana.arcodange.lab`, and `analytics.arcodange.lab` over Let's Encrypt, with **CrowdSec** running as the bouncer/AppSec layer fronting that edge.
+ +## Pages in this guidebook + +| Page | What it covers | Status | +|---|---|---| +| [Components](components.md) | Per-component internals: chart values, ingress, persistence, how each gets its secrets | ✅ Active | +| [Secrets & VSO](secrets-and-vso.md) | How Vault + the Vault Secrets Operator deliver static and dynamic secrets into `tools` pods | ✅ Active | + +## Maintenance rule + +> [!IMPORTANT] +> **If a component in the `tools` repo changes, update this guidebook in the same change.** Adding or removing a key under `tools:` in `chart/values.yaml`, bumping an upstream chart version, switching a component between Helm and Kustomize, or changing an ingress host or persistence size all alter the inventory above — keep the table and the diagram in sync as part of the same PR. A reference map that drifts from reality sends readers (and agents) confidently down dead paths. + +## Cross-references + +- [lab-ecosystem 02 · tools](../lab-ecosystem/02-tools.md) — the parent whole-lab view of this namespace. +- [secrets-and-vault concept](../lab-ecosystem/secrets-and-vault.md) — the lab-wide Vault model these services depend on. +- [tofu CI apply flow](../factory-provisioning/opentofu/ci-apply-flow.md) — how each component's `iac/` (Vault config) is applied. +- [safe-env ADR](../../ADR/0001-safe-prod-like-environment.md) — why a safe, prod-like environment shapes how these platform services are run. 
diff --git a/vibe/guidebooks/tools/components.md b/vibe/guidebooks/tools/components.md new file mode 100644 index 0000000..020b226 --- /dev/null +++ b/vibe/guidebooks/tools/components.md @@ -0,0 +1,218 @@ +[vibe](../../README.md) > [Guidebooks](../README.md) > [Tools](README.md) > **Components** + +# Components + +> **Status:** ✅ Active +> **Last Updated:** 2026-06-23 +> **Upstream:** [Tools hub](README.md) · [lab-ecosystem 02 · tools](../lab-ecosystem/02-tools.md) +> **Downstream:** [Secrets & VSO](secrets-and-vso.md) +> **Related:** [storage & recovery concept](../lab-ecosystem/storage-and-recovery.md) · [secrets-and-vault concept](../lab-ecosystem/secrets-and-vault.md) · [naming conventions](../lab-ecosystem/naming-conventions.md) + +This is the **per-component reference** for the `tools` platform layer: pinned chart/app versions, the values that actually matter (replicas, storage, ports, auth), and the cross-service wiring. Every component lands in the single **`tools` namespace**. For the deploy model (how one ArgoCD Application fans out into one per component) see the [Tools hub](README.md); for how Vault secrets reach the pods see [Secrets & VSO](secrets-and-vso.md). + +Components split into two **tiers**: + +- **Tier 1** — the load-bearing services, each with its own subsection and value tables below. +- **Tier 2** — supporting / inactive pieces, summarised in a single table. + +Severity legend (GitHub alerts): `[!NOTE]` informational · `[!TIP]` good-to-know · `[!WARNING]` operational hazard · `[!CAUTION]` live risk. + +--- + +## Tier 1 — load-bearing services + +### hashicorp-vault + +[`hashicorp-vault/`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/hashicorp-vault) — the lab's secrets brain. The chart bundles **three** dependencies: the upstream `vault` server, the `vault-secrets-operator` (VSO) that injects secrets into pods, and the shared `tool` library chart. 
+ +| Key | Value | +|---|---| +| Chart deps | `vault` `0.28.1`, `vault-secrets-operator` `0.9.0`, `tool` `0.1.0` | +| Mode | `standalone` (single instance, **not** HA / raft) | +| Storage | `storage "file"` at `/vault/data` + audit storage enabled | +| Listener | TLS **off** (`tls_disable = 1`) on `[::]:8200` — terminated at the edge | +| Ingress | `vault.arcodange.lab` (Traefik `websecure`, Let's Encrypt, `localIp@file` middleware) | +| UI | enabled (`ui = true`) | +| Log level | `trace` | + +**Mounts (secret engines) exposed:** + +| Mount | Type | Purpose | +|---|---|---| +| `kvv1` | KV v1 | Static secrets (legacy / v1 layout) | +| `kvv2` | KV v2 | Versioned static secrets (primary store) | +| `transit` | transit | Encryption-as-a-service; backs VSO client-cache (`vso-client-cache` key) | +| `postgres` | database | Dynamic PostgreSQL credentials (connection via `pgbouncer.tools:5432`) | + +**Auth methods enabled:** + +| Method | Used by | +|---|---| +| `kubernetes` | In-cluster workloads (VSO, app ServiceAccounts) authenticate by SA token | +| `gitea_jwt` | Gitea Actions / OIDC-JWT pipelines authenticate from CI | + +> [!NOTE] +> The full secret-engine layout, VSO `VaultAuth` / `VaultConnection` / `VaultDynamicSecret` wiring, and the `kvv2/data/...` path conventions are documented in [Secrets & VSO](secrets-and-vso.md) — this page only inventories what the chart stands up. + +The VSO sub-chart ships a `defaultVaultConnection` pointing at `http://hashicorp-vault.tools.svc.cluster.local:8200` and a client cache with `persistenceModel: direct-encrypted`, encrypted through the `transit` mount. + +### prometheus + +[`prometheus/`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/prometheus) — metrics collection and TSDB, via the `kube-prometheus`-style community chart. 
+ +| Key | Value | +|---|---| +| Chart deps | `prometheus` `28.13.0` (app `v3.10.0`), `tool` `0.1.0` | +| Server replicas | `1` (Deployment, `strategy: Recreate`) | +| Server storage | `persistentVolume` enabled, **8Gi** at `/data` (`ReadWriteOnce`) | +| Retention | `15d` | +| Alertmanager | enabled, persistence **2Gi** (`ReadWriteOnce`) | +| node-exporter | enabled (DaemonSet, `prometheus-node-exporter` sub-chart) | +| kube-state-metrics | enabled | +| pushgateway | enabled (`prometheus.io/probe: pushgateway`) | +| Scrape / eval interval | `1m` (scrape timeout `10s`) | +| Ingress | none — **internal only** | + +**Scrape targets** (default `scrapeConfigs`, all enabled): the Prometheus server itself, the Kubernetes API servers, nodes + kubelet cadvisor, plus **annotation-based** service-endpoint and pod discovery (`prometheus.io/scrape`, `prometheus.io/port`, `prometheus.io/path`, `prometheus.io/scheme`), with `*-slow` (5m) variants for cheaper targets. + +### grafana + +[`grafana/`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/grafana) — dashboards over Prometheus and ClickHouse. 
+ +| Key | Value | +|---|---| +| Chart deps | `grafana` `10.3.0` (app `latest`), `tool` `0.1.0` | +| Replicas | `1` (Deployment, `RollingUpdate`) | +| Persistence | **disabled** — ephemeral; dashboards/datasources are provisioned at boot | +| Ingress | `grafana.arcodange.lab` (Traefik `websecure`, Let's Encrypt, `localIp@file` middleware) | +| Plugin | `grafana-clickhouse-datasource` | +| Resources | requests `100m` / `128Mi`, limits `100m` / `512Mi` | +| Timezone | `Europe/Paris` | + +**Datasources (provisioned):** + +| Name | Type | Target | Default | +|---|---|---|---| +| Prometheus | `prometheus` | `http://prometheus-server.tools.svc.cluster.local` | ✅ yes | +| clickhouse | `grafana-clickhouse-datasource` | `clickhouse.tools.svc.cluster.local:9000` (native, `tlsSkipVerify`) | no | + +> [!WARNING] +> The Grafana **admin password is static and committed** in [`grafana/values.yaml`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/grafana/values.yaml) (`adminUser: admin`). The provisioned ClickHouse datasource password is committed there too (`secureJsonData.password`). Treat these as lab-only credentials; do not reuse them outside the homelab. + +### crowdsec + +[`crowdsec/`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/crowdsec) — behavioural edge security that feeds a Traefik blocklist. 
+ +| Key | Value | +|---|---| +| Chart deps | `crowdsec` `0.20.1`, `tool` `0.1.0` | +| LAPI | Deployment (`RollingUpdate`, `maxUnavailable: 0`) — the local API + decision store | +| Agent | DaemonSet pinned to control-plane nodes (`node-role.kubernetes.io/control-plane`) | +| Log source | parses **Traefik** pod logs in `kube-system` (`podName: traefik-*`, `program: traefik`) | +| Collections | `crowdsecurity/traefik`, `crowdsecurity/http-cve` (+ AppSec rules below) | +| AppSec (WAF) | **enabled** — `crowdsecurity/appsec-default` on `0.0.0.0:7422`; collections `appsec-virtual-patching` + `appsec-generic-rules` | +| Database | external PostgreSQL `crowdsec` via **pgbouncer** (`host: pgbouncer.tools:5432`, `type: postgresql`) | +| DB credentials | dynamic, from secret `crowdsec-db-credentials` (`DB_USER` / `DB_PASSWORD`, sourced via VSO) | +| Console | enrolled as instance `homelab` | + +The decisions CrowdSec produces are surfaced as a **Traefik middleware blocklist applied at the edge**, so malicious IPs are dropped before they reach app namespaces. `server_reset_query: DEALLOCATE ALL` on pgbouncer (below) exists specifically to keep CrowdSec's prepared statements happy through the pooler. The CAPTCHA challenge CrowdSec serves on remediated requests is a **Cloudflare Turnstile widget minted by the `cms` repo** — see the [CMS Cloudflare page](../cms/cloudflare.md), which produces the sitekey/secret this bouncer consumes from Vault. + +### pgbouncer + +[`pgbouncer/`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/pgbouncer) — the connection pooler in front of the **external** PostgreSQL. 
+ +| Key | Value | +|---|---| +| Chart deps | `pgbouncer` `2.3.1` (`icoretech/pgbouncer`), `tool` `0.1.0` | +| Scheduling | `nodeSelector: kubernetes.io/hostname: pi2` (co-located with PostgreSQL) | +| Upstream DB | external PostgreSQL at `192.168.1.202:5432` (the `pi2` host), wildcard database `"*"` | +| Auth type | `scram-sha-256` | +| `auth_query` | `SELECT uname, phash FROM user_lookup($1)` | +| `server_reset_query` | `DEALLOCATE ALL` (clears prepared statements — fixes CrowdSec re-use) | +| `server_idle_timeout` | `7200` (2h) | +| `ignore_startup_parameters` | `extra_float_digits` (unsupported JDBC arg) | +| Exporter | disabled | +| Service | `pgbouncer.tools:5432` (cluster-internal) | + +> [!NOTE] +> pgbouncer is the single front door to the lab's PostgreSQL: CrowdSec, Plausible, and Vault's `postgres` dynamic-secret backend all connect through `pgbouncer.tools:5432`, never to `192.168.1.202` directly. + +### redis (KeyDB) + +[`redis/`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/redis) — the in-memory cache / session store. The chart targets **KeyDB** (EqAlpha, Redis-compatible), tuned for the 2× Raspberry Pi 5 nodes. + +| Key | Value | +|---|---| +| Chart deps | `redis` `2.1.0` (`pascaliske/redis`), `tool` `0.1.0` | +| Workload | **StatefulSet** (master at index 0, replica running `replicaof` the master) | +| Storage | PVC `create: true`, **1Gi** at `/data` (`ReadWriteOnce`) | +| Tuning | `server-threads 4` (ARM-tuned for the Pi 5 cores) | +| Port | `6379` (`ClusterIP`) | +| Security | `runAsUser/Group/fsGroup: 999`, non-root | +| Timezone | `Europe/Paris` | + +> [!NOTE] +> Access the instance for inspection with `kubectl port-forward -n tools svc/redis 6379:6379` and Redis Insights (per the [chart README](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/redis/README.md)). + +### plausible + +[`plausible/`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/plausible) — privacy-friendly web analytics. 
Deployed via a **Kustomize** overlay that inflates the upstream Helm chart (not a `Chart.yaml` dependency like the Tier-1 charts above). + +| Key | Value | +|---|---| +| Declared via | Kustomize `helmCharts:` inflation generator | +| Chart / version | `plausible` `2.0.0` (`pascaliske/plausible`), image `ghcr.io/plausible/community-edition` | +| Replicas | `1` (Deployment) | +| Ingress | `analytics.arcodange.lab` (Traefik IngressRoute, Let's Encrypt, `localIp@file` middleware) | +| App DB | PostgreSQL via **pgbouncer** — an **init container** assembles `DATABASE_URL` from VSO dynamic creds | +| Event store | **ClickHouse** (see below) | +| GeoIP | MaxMind **GeoLite2** (`GeoLite2-Country` + `GeoLite2-City`), license key from secret `plausible-geoip` | +| Secrets | `SECRET_KEY_BASE` / `TOTP_VAULT_KEY` from existing secret `plausible-config` (VSO-fed) | + +Plausible writes analytics events to ClickHouse and stores app/account state in PostgreSQL — two distinct backends, both reached through lab-internal services. + +### clickhouse + +[`clickhouse/`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/clickhouse) — the OLAP column store behind Plausible. Also a **Kustomize** overlay inflating the upstream chart, plus a `databases` sub-chart that runs an init job. 
+ +| Key | Value | +|---|---| +| Declared via | Kustomize `helmCharts:` inflation generator (`chartHome: charts`) | +| Chart / version | `clickhouse` `0.4.0` (`pascaliske/clickhouse`), image `clickhouse/clickhouse-server` | +| Workload | **StatefulSet**, `replicas: 1` | +| Storage | PVC **16Gi** at `/var/lib/clickhouse` (`ReadWriteOnce`) | +| Ports | `8123` (HTTP), `9000` (native protocol) | +| Custom user | `arcodange` (full network access, `access_management: 1`) via `custom-users.xml` | +| Security | `runAsUser/Group/fsGroup: 101`, non-root | +| Timezone | `Europe/Paris` | + +> [!WARNING] +> The ClickHouse `arcodange` user password is **static and committed** in [`clickhouse/clickhouseValues.yaml`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/clickhouse/clickhouseValues.yaml) (`custom-users.xml`). The same value appears in Grafana's provisioned datasource — keep the two in sync if you rotate it. + +> [!CAUTION] +> ClickHouse carries a `nodeAffinity` that **excludes `pi2`** (`kubernetes.io/hostname NotIn [pi2]`). `pi2` hosts PostgreSQL and pgbouncer; ClickHouse is deliberately kept off it to avoid I/O contention on that node. A cluster where `pi2` is the only schedulable node will leave ClickHouse `Pending`. + +--- + +## Tier 2 — supporting & inactive + +| Component | Status | Notes | +|---|---|---| +| [`pgcat/`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/pgcat) | ❌ disabled | Alternative Postgres pooler (`pgcat` chart `0.1.0`). Not in service — its sole pool has empty `username`/`password`/`database` placeholders, and it is **not** keyed under `tools:` in [`chart/values.yaml`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/chart/values.yaml), so ArgoCD renders no Application for it. [pgbouncer](#pgbouncer) is the active pooler. 
| +| [`tool/`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/tool) | ✅ active (library) | Helm **library chart** (`type: library`, version `0.1.0`) consumed by **every** component chart via `dependencies:`. Ships shared templates/helpers; **not deployable** on its own. | + +--- + +## Gotchas + +> [!WARNING] +> **No high availability.** Every Tier-1 service runs a **single replica** — Vault (`standalone`), Prometheus (`replicaCount: 1`), Grafana (`replicas: 1`), ClickHouse and Redis/KeyDB StatefulSets (`replicas: 1`), Plausible and the CrowdSec LAPI (single Deployment). Any node drain or pod restart is a brief outage for that service, not a failover. + +> [!WARNING] +> **Static, committed passwords.** Grafana admin (+ its ClickHouse datasource), the ClickHouse `arcodange` user, and the pgbouncer admin/auth users all carry plaintext credentials in their `values.yaml`. They are lab-only; rotate before any exposure and never copy them to a real environment. + +> [!CAUTION] +> **ClickHouse must avoid `pi2`.** The `NotIn [pi2]` `nodeAffinity` keeps it off the PostgreSQL/pgbouncer host. If `pi2` is the only schedulable node, ClickHouse (and therefore Plausible analytics) stays `Pending`. See the [storage & recovery concept](../lab-ecosystem/storage-and-recovery.md) for how PVC-backed services map onto specific nodes. + +> [!CAUTION] +> **Vault is single-instance and starts sealed.** After **any** restart (pod reschedule, node reboot, chart upgrade) Vault comes up **sealed** with no automatic unseal configured — every VSO injection and dynamic-secret lease blocks until an operator unseals it. This is the first thing to check when secrets stop flowing across the cluster; [Secrets & VSO](secrets-and-vso.md) describes the failure mode, and the unseal procedure and key custody live in the [storage & recovery concept](../lab-ecosystem/storage-and-recovery.md).
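A first diagnostic for the sealed-Vault gotcha above can be sketched in Python. This is a minimal sketch rather than the lab's own tooling: it assumes the status payload is JSON carrying a boolean `sealed` field, as printed by `vault status -format=json`, and the helper name `is_sealed` and the sample payload are hypothetical.

```python
import json


def is_sealed(status_json: str) -> bool:
    """Return True when a Vault status payload reports the server as sealed."""
    status = json.loads(status_json)
    # Fail closed: a payload without the field is treated as sealed
    # rather than assumed healthy.
    return bool(status.get("sealed", True))


# Hypothetical payload, trimmed to a few fields for illustration.
sample = '{"initialized": true, "sealed": true, "t": 3, "n": 5, "progress": 0}'

if is_sealed(sample):
    print("Vault is sealed: VSO reads and dynamic-secret leases will block")
else:
    print("Vault is unsealed")
```

One way to obtain the payload, assuming `kubectl` access to the `tools` namespace, is to run `vault status -format=json` inside the Vault pod (the pod name depends on the chart release and is not stated here). `vault status` exits non-zero while Vault is sealed, so capture its standard output instead of relying on the exit code.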
diff --git a/vibe/guidebooks/tools/secrets-and-vso.md b/vibe/guidebooks/tools/secrets-and-vso.md new file mode 100644 index 0000000..66e3fd3 --- /dev/null +++ b/vibe/guidebooks/tools/secrets-and-vso.md @@ -0,0 +1,234 @@ +[vibe](../../README.md) > [Guidebooks](../README.md) > [Tools](README.md) > **Secrets & VSO** + +# Tools — Secrets & VSO + +> **Status:** ✅ Active +> **Last Updated:** 2026-06-23 +> **Upstream:** [Tools](README.md) · [Components](components.md) +> **Downstream:** consumed by every `tools`-namespace pod and by every app's CI/CD +> **Related:** [secrets-and-vault concept](../lab-ecosystem/secrets-and-vault.md) · [naming-conventions concept](../lab-ecosystem/naming-conventions.md) · [storage-and-recovery](../lab-ecosystem/storage-and-recovery.md) · [tofu CI apply flow](../factory-provisioning/opentofu/ci-apply-flow.md) · [postgres IaC](../factory-provisioning/opentofu/postgres-iac.md) · [safe-env ADR](../../ADR/0001-safe-prod-like-environment.md) + +This page maps how secrets live in **HashiCorp Vault** (engines, auth backends) and how they reach **Kubernetes pods** via the **Vault Secrets Operator (VSO)**. The keystone is the **`app_policy` + `app_roles` module pair**: the machinery that turns a single `<app>` name into a matched set of Vault policies, roles, and CI identities — the same `<app>` join key documented in the [naming-conventions concept](../lab-ecosystem/naming-conventions.md). + +Vault itself runs as a component in the `tools` namespace; see the [Components](components.md) page for its deploy shape.
The admin/bootstrap layer (the `kvv1` engine, the `gitea_jwt` auth backend, the base `gitea_cicd` role, the Kubernetes auth backend mount) is created **by factory's Ansible-managed Vault Terraform** in [`hashicorp_vault.tf`](https://gitea.arcodange.lab/arcodange-org/factory/src/branch/main/ansible/arcodange/factory/playbooks/tools/roles/hashicorp_vault/files/hashicorp_vault.tf); everything in this page that is *per-app* is created by the IaC under [`hashicorp-vault/iac`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/hashicorp-vault/iac). + +> [!CAUTION] +> Vault runs **standalone** with `file` storage (not raft) and starts **sealed** after any restart or node reboot. Until it is unsealed, every VSO read fails and no app can fetch DB creds or config — pods that depend on a `VaultDynamicSecret` will not start. Unseal procedure and key custody live in [storage-and-recovery](../lab-ecosystem/storage-and-recovery.md). + +--- + +## 1) Vault engines & auth backends + +All engines below are mounted by [`hashicorp-vault/iac/main.tf`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/hashicorp-vault/iac/main.tf) except `kvv1`, which is bootstrapped by factory's Ansible Vault Terraform.
+ +| Mount | Type | Holds | Defined in | +|---|---|---|---| +| `kvv1` | KV **v1** | Admin / cloud secrets: `kvv1/google/credentials`, `kvv1/gitea/*`, `kvv1/cloudflare/*`, `kvv1/ovh/*`, `kvv1/postgres/credentials`, `kvv1/admin/*` | factory [`hashicorp_vault.tf`](https://gitea.arcodange.lab/arcodange-org/factory/src/branch/main/ansible/arcodange/factory/playbooks/tools/roles/hashicorp_vault/files/hashicorp_vault.tf) | +| `kvv2` | KV **v2** (versioned) | Per-app config secrets under `kvv2/<app>/*` | [`main.tf`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/hashicorp-vault/iac/main.tf) | +| `transit` | transit | The **VSO client-cache encryption key** `vso-client-cache` — lets VSO persist its client cache encrypted so it survives an operator restart without re-auth storms | [`main.tf`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/hashicorp-vault/iac/main.tf) | +| `postgres` | database | **Dynamic** Postgres creds at `postgres/creds/<app>`; connects to the DB through `pgbouncer.tools:5432` using the `credentials_editor` root account | [`main.tf`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/hashicorp-vault/iac/main.tf) | + +The `postgres` connection is configured with `allowed_roles = ["*"]` and a root-rotation statement (`ALTER USER … WITH PASSWORD`); the editor username/password come from the sensitive `POSTGRES_CREDENTIALS_EDITOR_*` variables. + +### Auth backends + +| Backend | Mount | Who uses it | Role(s) | +|---|---|---|---| +| `kubernetes` | `kubernetes` | VSO controller + every app pod's ServiceAccount | `vault-secret-operator` (VSO itself), `<app>` (one per app), `factory_crowdsec_conf` | +| `gitea_jwt` | `gitea_jwt` | CI/OpenTofu jobs running in Gitea Actions | `gitea_cicd` (base, factory-bootstrapped) + per-app `gitea_cicd_<app>` | + +- **`kubernetes`** auth ([`main.tf`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/hashicorp-vault/iac/main.tf)) is configured against `https://kubernetes.default.svc:443`.
The VSO role `vault-secret-operator` binds SA `hashicorp-vault-vault-secrets-operator-controller-manager` in ns `tools`, `audience = vault`, and carries the `edit-vso-client-cache` policy (encrypt/decrypt on `transit/.../vso-client-cache`). +- **`gitea_jwt`** is the OIDC/JWT backend for CI. Its backend, `default_role = gitea_cicd`, and the base `gitea_cicd` role are created by factory's Vault bootstrap; the Vault provider in each IaC project logs in via `auth_login_jwt { mount = "gitea_jwt", role = "gitea_cicd[_<app>]" }` using the `TERRAFORM_VAULT_AUTH_JWT` env var. See the [tofu CI apply flow](../factory-provisioning/opentofu/ci-apply-flow.md) for how the token is minted in the pipeline. + +### Terraform state + +Each IaC project keeps its state in the **`arcodange-tf` GCS bucket** under a distinct prefix: + +| Project | GCS prefix | +|---|---| +| Vault admin/app machinery | `tools/hashicorp_vault/main` | +| Plausible | `tools/plausible/main` | +| CrowdSec | `tools/crowdsec/main` | + +--- + +## 2) The `app_policy` + `app_roles` modules — the `<app>` join-key machinery + +> [!IMPORTANT] +> These two modules are the heart of the secrets layer. Given a single `<app>` name they emit a **matched, name-derived** set of Vault objects so that an app's runtime, its CI, and its database identity all line up on the same key. This is the Vault half of the lab-wide [naming convention](../lab-ecosystem/naming-conventions.md): the same `<app>` string also names the Kubernetes namespace, the ServiceAccount, the Postgres `<app>_role`, and the Gitea repo. + +The two modules live on **opposite sides of the trust boundary**: + +- [`modules/app_policy`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/hashicorp-vault/iac/modules/app_policy) is declared **once, centrally**, in the Vault admin project ([`main.tf`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/hashicorp-vault/iac/main.tf), `for_each` over `var.applications`).
It creates the **policies and the CI identity** — the privileged bits — so the app's own repo never holds them. +- [`modules/app_roles`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/hashicorp-vault/iac/modules/app_roles) is declared **by the subordinate app project** (pulled over SSH as a Git module), running under the `<app>`-ops policy. It creates the **roles** the app needs. + +### `app_roles` — runtime roles (declared by the app repo) + +For `<app>`, [`app_roles/main.tf`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/hashicorp-vault/iac/modules/app_roles/main.tf) creates: + +| Resource | Path | Key settings | +|---|---|---| +| Kubernetes auth role | `auth/kubernetes/role/<app>` | `bound_service_account_names = [<app>] + extras`, `bound_service_account_namespaces = [<app>] + extras`, `token_ttl = 3600` (1h), `token_policies = [default, <app>]`, `audience = vault` | +| Postgres dynamic role | `postgres/roles/<app>` | `db_name = postgres`; creation SQL: `CREATE ROLE "{{name}}" WITH LOGIN PASSWORD … VALID UNTIL …` then `GRANT <app>_role TO "{{name}}"`; revocation: `REASSIGN OWNED BY "{{name}}" TO <app>_role` then `REVOKE ALL ON DATABASE FROM "{{name}}"` | + +> [!IMPORTANT] +> The Postgres dynamic role's creation SQL does `GRANT <app>_role TO {{name}}` and its revocation does `REASSIGN OWNED BY {{name}} TO <app>_role`. **The non-login `<app>_role` must already exist in Postgres** — it is created by factory's [postgres IaC](../factory-provisioning/opentofu/postgres-iac.md) (`postgresql_role.app_role["<app>"]`, owner of the `<app>` database). If that role is missing, every ephemeral-user creation/revocation fails. This is the ordering dependency between the two repos: **factory postgres/iac before tools app_roles**. + +> [!NOTE] +> The Kubernetes auth role binds **both** SA names **and** namespaces — the check is an **AND**. A token presenting SA `<app>` from the wrong namespace (or any other SA from ns `<app>`) is rejected.
The default binding is SA `<app>` in ns `<app>`; the `service_account_names` / `service_account_namespaces` inputs widen it (e.g. CrowdSec/Plausible run in ns `tools`, not a namespace named after the app). + +The Postgres role can be skipped with `disable_database = true`; the DB name defaults to `<app>` but can be overridden via `database`. + +### `app_policy` — policies + CI identity (declared centrally) + +For `<app>`, [`app_policy/main.tf`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/hashicorp-vault/iac/modules/app_policy/main.tf) creates: + +| Resource | Name | Grants | +|---|---|---| +| **App policy** | `<app>` | `read,list` on `kvv2/data/<app>/*`; `read` on `postgres/creds/*` — what the runtime pod can do | +| **Ops policy** | `<app>-ops` | The CI bundle (below) | +| **JWT role** | `gitea_cicd_<app>` (mount `gitea_jwt`) | `token_policies = [default] + <app>'s ops_policies`, `bound_audiences = [gitea_app_id]`, `user_claim = email`, `role_type = jwt` | +| **Identity group** | `<app>-ops` | Internal group carrying the `<app>-ops` policy, so Vault users mapped to their Gitea entity inherit ops rights | + +The **`<app>-ops` policy** is the privilege set a CI job needs to *manage* the app's own corner of Vault and the clouds: + +- `create/update` on `auth/token/create`; `read` on `sys/mounts/auth/*` (so the Vault provider works); +- full CRUD on `postgres/roles/*` and on `auth/kubernetes/role/*` (so `app_roles` can apply) — the k8s-role rule is **parameter-constrained**: it may only set `bound_service_account_names`/`bound_service_account_namespaces` to the whitelisted `[<app>] + extras` lists and `token_policies` to `["default","<app>"]`, preventing a CI job from minting a role with broader bindings; +- full CRUD on the app's KV-v2 data, delete/undelete/destroy, and `metadata` (`kvv2/data|delete|undelete|destroy|metadata/<app>/*`); +- `read` on `kvv1/google/credentials` (the GCS backend SA), `kvv1/gitea/tofu_module_reader` (the bot SSH key that lets CI pull the `app_roles` Git module); +- CRUD on `kvv1/cloudflare/*`
and `kvv1/ovh/*` (cloud DNS/edge secrets scoped to the app). + +> [!NOTE] +> The policy document is post-processed with two `replace()` calls. The Vault provider serializes the whitelisted list parameters as a JSON-encoded string (`"["webapp"]"`); the replaces strip the outer quotes so Vault receives a real list. If you change those `allowed_parameter` blocks, keep the replaces in sync. + +### Apps wired in `terraform.tfvars` + +[`terraform.tfvars`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/hashicorp-vault/iac/terraform.tfvars) declares the `applications` set the central `app_policy` `for_each` walks: + +| `<app>` | Extra SA | Extra ns | Extra ops policy | Notes | +|---|---|---|---|---| +| `webapp` | — | — | — | defaults: SA `webapp` / ns `webapp` | +| `erp` | — | — | — | defaults | +| `cms` | `cloudflared` | — | `factory__cf_r2_arcodange_tf` | extra SA for the Cloudflare tunnel; extra ops policy for the CF R2 Terraform-state bucket | +| `crowdsec` | — | `tools` | — | runs in ns `tools` | +| `plausible` | — | `tools` | — | runs in ns `tools` | + +> [!NOTE] +> `terraform.tfvars` uses the key `ops_policies` for the CMS extra policy while `variables.tf` declares the optional attribute as `policies`; the central `main.tf` passes `each.value.policies` into the module's `ops_policies` input. Read these together when adding a new app so the extra-policy list actually lands on the JWT role. + +--- + +## 3) VSO CRDs — how a secret becomes a Kubernetes Secret + +The [Vault Secrets Operator](https://developer.hashicorp.com/vault/docs/platform/k8s/vso) watches three custom resources and writes plain Kubernetes `Secret` objects that pods consume normally (env / volume). The app repo ships the CRDs; the operator does the Vault round-trips.
+ +| CRD | What it does | Refresh / rotation | +|---|---|---| +| `VaultAuth` | Picks the auth method (`kubernetes`), the `mount`, the Vault `role` (= `<app>`), and the pod **ServiceAccount** (= `<app>`) used to log in; references a `VaultConnection` (here the in-cluster `default` → `http://hashicorp-vault.tools.svc.cluster.local:8200`) | n/a — used by the other two CRDs via `vaultAuthRef` | +| `VaultStaticSecret` | Reads a **KV-v2** path → writes a k8s `Secret` | `refreshAfter` (the lab uses `30s`) | +| `VaultDynamicSecret` | Reads `postgres/creds/<app>` (a **dynamic** lease) → writes a k8s `Secret`; `rolloutRestartTargets` lists Deployments to restart when creds rotate | follows the Vault lease TTL (1h); VSO renews/re-issues and restarts the targets | + +### Worked example — Plausible (`tools` namespace) + +Files under [`plausible/resources`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/plausible/resources): + +1. **`VaultAuth` `plausible`** ([`vaultauth.yaml`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/plausible/resources/vaultauth.yaml)) — `method: kubernetes`, `role: plausible`, `serviceAccount: plausible`, `audiences: [vault]`. This is the Vault role `app_roles` created in [`plausible/iac/main.tf`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/plausible/iac/main.tf). +2. **`VaultStaticSecret` `plausible`** ([`vaultsecret.yaml`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/plausible/resources/vaultsecret.yaml)) — `kvv2` path `plausible/config` → Secret `plausible-config` (`refreshAfter: 30s`). The config payload holds **`SECRET_KEY_BASE`** and **`TOTP_VAULT_KEY`**, both **generated by Terraform** (`random_password`, base64-encoded) and written to `kvv2/plausible/config` via `vault_kv_secret_v2` in the plausible IaC. +3.
**`VaultStaticSecret` `plausible-geoip`** ([`geoipsecret.yaml`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/plausible/resources/geoipsecret.yaml)) — `kvv2` path `plausible/geoip` → Secret `plausible-geoip` exposing **`LICENSE_KEY`** (the MaxMind GeoIP license, an admin-seeded value, fed to the `geoipupdate` sidecar via env `GEOIPUPDATE_LICENSE_KEY`). +4. **`VaultDynamicSecret` `plausible-db-credentials`** ([`vaultdynamicsecret.yaml`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/plausible/resources/vaultdynamicsecret.yaml)) — `postgres/creds/plausible` → Secret `plausible-db-credentials`; `rolloutRestartTargets` restarts Deployment `plausible`. An **init container** ([`add-initcontainer.yaml`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/plausible/add-initcontainer.yaml)) reads `username`/`password` from that Secret and writes `DATABASE_URL` (`postgres://${DB_USER}:${DB_PASS}@${DB_HOST}:${DB_PORT}/${DB_NAME}`) into a shared `generated-secrets` volume the app reads. + +### Worked example — CrowdSec (`tools` namespace) + +Templates under [`crowdsec/templates`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/crowdsec/templates): + +1. **`VaultAuth` `crowdsec`** ([`vaultauth.yaml`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/crowdsec/templates/vaultauth.yaml)) — `role: crowdsec`, `serviceAccount: crowdsec`. +2. **`VaultDynamicSecret` `crowdsec-db-credentials`** ([`vaultdynamicsecret.yaml`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/crowdsec/templates/vaultdynamicsecret.yaml)) — `postgres/creds/crowdsec` → Secret `crowdsec-db-credentials`; `rolloutRestartTargets` restarts Deployment **`crowdsec-lapi`** (the Local API that owns the DB connection). 
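Both worked examples above follow the same name derivation: the Vault role, the dynamic-credentials path, and the resulting Secret name all come from the app name. The Python sketch below only illustrates that pattern. The helper `derived_names` and its dictionary keys are hypothetical; the path and name patterns are the ones this page documents, and per-app extras (additional ServiceAccounts, namespaces, or ops policies) are deliberately left out.

```python
def derived_names(app: str) -> dict[str, str]:
    """Map one app name to the Vault and Kubernetes names this page describes."""
    return {
        # Runtime side: created by the app_roles module.
        "k8s_auth_role": f"auth/kubernetes/role/{app}",
        "postgres_role": f"postgres/roles/{app}",
        # Read by VaultDynamicSecret, written to a k8s Secret.
        "db_creds_path": f"postgres/creds/{app}",
        "db_creds_secret": f"{app}-db-credentials",
        # Central side: created by the app_policy module.
        "app_policy": app,
        "ops_policy": f"{app}-ops",
        "jwt_role": f"gitea_cicd_{app}",
        "kv_data_glob": f"kvv2/data/{app}/*",
    }


for name, value in derived_names("crowdsec").items():
    print(f"{name}: {value}")
```

Calling the helper with `crowdsec` or `plausible` yields the role, creds path, and Secret names that appear in the two worked examples.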
+ +### `factory_auth.tf` — the Ansible CrowdSec/Traefik plugin reader + +Separately from the per-app machinery, [`factory_auth.tf`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/hashicorp-vault/iac/factory_auth.tf) wires a Kubernetes auth role **`factory_crowdsec_conf`** for SA **`factory-ansible-tool-crowdsec-traefik-plugin`** in ns **`kube-system`** (`token_ttl = 3600`). It carries policy `factory_crowdsec_conf`, which grants `read,list` on **`kvv2/data/cms/factory/*`**. This is how the Ansible-deployed CrowdSec/Traefik bouncer plugin reads the **Turnstile** configuration that the [`cms` repo](https://gitea.arcodange.lab/arcodange-org/cms) writes into `kvv2/cms/factory/*` — a cross-repo handoff entirely through Vault, with no shared file. The producer side (the Turnstile widget and the `vault_kv_secret_v2` write) is documented on the [CMS Cloudflare page](../cms/cloudflare.md). + +--- + +## 4) Secret-paths inventory + +| Path | Engine | Holds | Producer | Consumer | +|---|---|---|---|---| +| `kvv2/<app>/config` | KV v2 | App runtime config | app CI (KV CRUD via `<app>-ops`) | `VaultStaticSecret` → pod | +| `kvv2/plausible/config` | KV v2 | `SECRET_KEY_BASE`, `TOTP_VAULT_KEY` | Plausible IaC (`random_password` → `vault_kv_secret_v2`) | `VaultStaticSecret plausible` → `plausible-config` | +| `kvv2/plausible/geoip` | KV v2 | `LICENSE_KEY` (MaxMind) | admin-seeded | `VaultStaticSecret plausible-geoip` → `geoipupdate` sidecar | +| `kvv2/cms/factory/turnstile` | KV v2 | Cloudflare Turnstile config | `cms` repo IaC | `factory_crowdsec_conf` k8s role → Ansible CrowdSec/Traefik plugin | +| `postgres/creds/<app>` | database | Ephemeral DB user (`username`/`password`, 1h lease) | Vault on demand (role `<app>`, `GRANT <app>_role`) | `VaultDynamicSecret` → pod (e.g.
`plausible-db-credentials`, `crowdsec-db-credentials`) | +| `transit/.../vso-client-cache` | transit | VSO client-cache encryption key | Vault admin IaC | VSO controller (encrypt/decrypt its cache) | +| `kvv1/cloudflare/*` | KV v1 | Cloudflare DNS/edge secrets | admin | app CI (`-ops` CRUD) | +| `kvv1/ovh/*` | KV v1 | OVH secrets | admin | app CI (`-ops` CRUD) | +| `kvv1/gitea/tofu_module_reader` | KV v1 | Bot SSH key to pull the `app_roles` Git module | admin | app CI (`-ops` read) | +| `kvv1/google/credentials` | KV v1 | GCS Terraform-backend SA key | admin | every IaC CI job (read) | + +--- + +## 5) Secrets flow + +```mermaid +%%{init: {'theme': 'base'}}%% +flowchart TB + classDef eng fill:#7c3aed,stroke:#5b21b6,color:#ffffff + classDef auth fill:#b45309,stroke:#92400e,color:#ffffff + classDef crd fill:#059669,stroke:#047857,color:#ffffff + classDef k8s fill:#2563eb,stroke:#1e40af,color:#ffffff + classDef ci fill:#be123c,stroke:#9f1239,color:#ffffff + + subgraph VAULT["Vault (tools ns)"] + KV2["kvv2 engine
kvv2/<app>/*"]:::eng + PG["postgres engine
postgres/creds/<app>"]:::eng + TR["transit
vso-client-cache"]:::eng + KKUB["kubernetes auth
role <app> (SA AND ns)"]:::auth + KJWT["gitea_jwt auth
gitea_cicd_<app>"]:::auth + end + + subgraph RUNTIME["Runtime path"] + VA["VaultAuth
role <app>, SA <app>"]:::crd + VSS["VaultStaticSecret
kvv2/<app>/config"]:::crd + VDS["VaultDynamicSecret
postgres/creds/<app>"]:::crd + SEC["k8s Secret
<app>-config / -db-credentials"]:::k8s + POD["App pod
(SA <app>)"]:::k8s + end + + subgraph CICD["CI path"] + GHA["Gitea Actions
OpenTofu job"]:::ci + TOFU["apply app_roles
(under <app>-ops)"]:::ci + end + + KKUB --> VA + VA --> VSS + VA --> VDS + KV2 --> VSS + PG --> VDS + VSS --> SEC + VDS -- "rolloutRestart on rotation" --> SEC + SEC --> POD + TR -. "encrypts client cache" .-> VA + + GHA -- "JWT login" --> KJWT + KJWT --> TOFU + TOFU -- "creates" --> KKUB + TOFU -- "creates" --> PG +``` + +1. **Vault** mounts the engines (`kvv2`, `postgres`, `transit`) and the two auth backends (`kubernetes`, `gitea_jwt`), all in the `tools` namespace. +2. A pod's `VaultAuth` logs in through the **`kubernetes`** backend with SA `<app>` against role `<app>`; the role accepts the login only when **both** the SA name **and** its namespace match (AND). +3. `VaultStaticSecret` reads `kvv2/<app>/config` and `VaultDynamicSecret` reads `postgres/creds/<app>` using that auth; VSO writes the values into ordinary k8s `Secret` objects. +4. The pod consumes the Secret (env or volume); on a dynamic-cred **rotation** VSO restarts the `rolloutRestartTargets` Deployment so it picks up the new credentials. +5. The **`transit`** key `vso-client-cache` encrypts VSO's client cache so an operator restart doesn't trigger a re-auth storm. +6. On the CI side, a **Gitea Actions** OpenTofu job logs into the **`gitea_jwt`** backend as `gitea_cicd_<app>` (audience = the Gitea OAuth app id, identity from the `email` claim). +7. Running under the `<app>-ops` policy, that job **applies the `app_roles` module**, creating/updating the Kubernetes auth role and the Postgres dynamic role for `<app>`, which closes the loop so the runtime path in steps 2-4 works. + +--- + +## Gotchas + +- **Vault must be unsealed after every restart.** Sealed Vault → all VSO reads fail → dynamic-secret consumers won't start. See [storage-and-recovery](../lab-ecosystem/storage-and-recovery.md). +- **The Kubernetes auth role binds SA *and* namespace (AND).** The wrong namespace, or a different SA in the right namespace, is rejected. Apps in ns `tools` (CrowdSec, Plausible) widen the binding via `service_account_namespaces`.
+- **The Postgres dynamic role depends on `_role` existing.** `GRANT _role TO {{name}}` (create) and `REASSIGN OWNED BY {{name}} TO _role` (revoke) both fail if factory's [postgres IaC](../factory-provisioning/opentofu/postgres-iac.md) hasn't created the `_role` non-login role first. Order: **factory postgres/iac → tools app_roles**. +- **The `ops_policies` vs `policies` key mismatch** in `terraform.tfvars` / `variables.tf` (see §2) — read both when adding an app's extra ops policy. +- **The sandbox uses a separate Vault.** Per the [safe-env ADR](../../ADR/0001-safe-prod-like-environment.md), the prod-like sandbox stands up its own Vault instance; none of the paths or roles above are shared with it. Don't assume a secret seeded in prod exists in the sandbox. -- 2.49.1 From 4823394e0e37b022ed9a262dc92ad986e95dbd12 Mon Sep 17 00:00:00 2001 From: Gabriel Radureau Date: Tue, 23 Jun 2026 21:58:36 +0200 Subject: [PATCH 5/9] docs(vibe): add applications/ guidebook (webapp + url-shortener) Tree-docs guidebook under vibe/guidebooks/applications/ documenting the common app pattern and two contrasting archetypes, drilling into lab-ecosystem/01-factory (bidirectional): - README.md : the shared app pattern (repo = Dockerfile + chart + optional iac + CI; ArgoCD app-of-apps; the join key; .fr vs .lab ingress conventions) + a two-archetype comparison. - webapp.md : canonical Go + Postgres exemplar (chart, VaultAuth/Static/Dynamic CRDs, inline iac vs the shared app_roles module, CI); notes the current nuance that the live pod still uses the static pgbouncer_auth DATABASE_URL. - url-shortener.md : Rust + SQLite-on-Longhorn-RWO counterpart (single replica, no iac/no Vault, CI mirrors the upstream image); the power-cut recovery story. erp is referenced in prose only (its own guidebook lands next). Sibling-repo code via full gitea URLs; 2 mermaid diagrams MCP-validated; zero dead links. 
Co-Authored-By: Claude Opus 4.8 --- vibe/guidebooks/README.md | 1 + vibe/guidebooks/applications/README.md | 118 ++++++++++++ vibe/guidebooks/applications/url-shortener.md | 172 ++++++++++++++++++ vibe/guidebooks/applications/webapp.md | 155 ++++++++++++++++ vibe/guidebooks/lab-ecosystem/01-factory.md | 4 + 5 files changed, 450 insertions(+) create mode 100644 vibe/guidebooks/applications/README.md create mode 100644 vibe/guidebooks/applications/url-shortener.md create mode 100644 vibe/guidebooks/applications/webapp.md diff --git a/vibe/guidebooks/README.md b/vibe/guidebooks/README.md index 5f62593..55e1ff2 100644 --- a/vibe/guidebooks/README.md +++ b/vibe/guidebooks/README.md @@ -38,6 +38,7 @@ flowchart LR | [Factory provisioning](factory-provisioning/README.md) | Deep dive into how factory provisions everything: Ansible playbooks + roles and OpenTofu | ✅ Active | | [Tools](tools/README.md) | Deep dive into the lab platform services in the `tools` namespace (Vault+VSO, Prometheus, Grafana, CrowdSec, poolers, Redis, Plausible, ClickHouse) | ✅ Active | | [CMS](cms/README.md) | Deep dive into the public Nuxt site arcodange.fr + its Cloudflare DNS/tunnel/Turnstile and Zoho email IaC | ✅ Active | +| [Applications](applications/README.md) | The deployed apps and the common pattern they share — webapp (Go + Postgres) and url-shortener (Rust + SQLite); erp has its own guidebook | ✅ Active | ## Rules to contribute diff --git a/vibe/guidebooks/applications/README.md b/vibe/guidebooks/applications/README.md new file mode 100644 index 0000000..29c4ffc --- /dev/null +++ b/vibe/guidebooks/applications/README.md @@ -0,0 +1,118 @@ +[vibe](../../README.md) > [Guidebooks](../README.md) > **Applications** + +# Applications + +> **Status:** ✅ Active +> **Last Updated:** 2026-06-23 +> **Upstream:** [Lab ecosystem hub](../lab-ecosystem/README.md) · [01 · factory](../lab-ecosystem/01-factory.md) +> **Downstream:** [webapp](webapp.md) · [url-shortener](url-shortener.md) +> **Related:** 
[naming-conventions](../lab-ecosystem/naming-conventions.md) · [tools secrets-and-vso](../tools/secrets-and-vso.md) · [factory postgres-iac](../factory-provisioning/opentofu/postgres-iac.md) · [safe-prod-like-environment ADR](../../ADR/0001-safe-prod-like-environment.md) + +This guidebook maps the **deployed applications** (the workloads ArgoCD runs in their own `<app>` namespace) and, more importantly, the **single repeatable pattern** every one of them follows. Once you know the pattern, every app reads as a variation on the same skeleton: a Gitea repo whose contents (Dockerfile + Helm chart + optional Vault IaC + CI) and whose `<app>` name fully determine how it builds, deploys, gets its secrets, and is reached from the network. + +Two apps are presented in depth as the canonical archetypes: [webapp](webapp.md) (Go + external Postgres) and [url-shortener](url-shortener.md) (Rust + embedded SQLite). Other apps in the cluster (`erp`, `dance-lessons-coach`, `telegram-gateway`, `plausible`) instantiate the same pattern; `erp` has its own guidebook (forthcoming) and is not linked here yet. + +## The common app pattern + +Every application is a self-contained Gitea repo under the `arcodange-org` (or `arcodange`) org that carries the same four ingredients. The `<app>` name (the repo name) is the join key that threads through all of them (see the [naming-conventions concept](../lab-ecosystem/naming-conventions.md)). + +| Ingredient | Path in the app repo | What it is | Required?
| +|---|---|---|---| +| **Dockerfile** | `Dockerfile` | Multi-stage build producing the runtime image, pushed to the Gitea container registry as `gitea.arcodange.lab/<org>/<app>` | ✅ always | +| **Helm chart** | `chart/` | `Chart.yaml` + `values.yaml` + `templates/` (deployment, service, ingress, serviceaccount, hpa, config, NOTES, optional PVC, optional Vault CRDs); this is the unit ArgoCD syncs | ✅ always | +| **Vault IaC** | `iac/` | OpenTofu that declares the app's Vault objects: a Postgres dynamic-secret role keyed on `<app>` + a Kubernetes auth role bound to the `<app>` ServiceAccount. The canonical form pulls the [`app_roles` module](../tools/secrets-and-vso.md) from `tools`; the privileged `app_policy` half is declared centrally so the app repo never holds it | 🟡 only apps needing Postgres / Vault KV | +| **CI workflows** | `.gitea/workflows/` | A `dockerimage` job that builds + pushes the image on every `main` push, and (when `iac/` exists) a `vault` job that runs `tofu apply` against Vault, gated to changes under `iac/*.tf` | ✅ image build · 🟡 vault apply | + +### How a chart becomes a running app + +Factory's [ArgoCD app-of-apps](../lab-ecosystem/01-factory.md) emits **one `Application` custom resource per app**, and every field is derived mechanically from the `<app>` name: + +| Application field | Value | Source | +|---|---|---| +| `repoURL` | `https://gitea.arcodange.lab/<org>/<app>` | `<app>` + optional org override | +| `path` | `chart` | fixed convention | +| `namespace` | `<app>` (`CreateNamespace=true`) | `<app>` | +| `syncPolicy` | `automated` with `prune: true` + `selfHeal: true` | app-of-apps default | + +The same `<app>` name is also the Postgres database/role name, the Vault role name, the KV path prefix, and the ServiceAccount name: one string keys the whole stack. See [naming-conventions](../lab-ecosystem/naming-conventions.md).
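
As a rough illustration of how mechanical that derivation is, the sketch below builds the four table fields from an app name. It is not factory's actual generator: the function name `application_for`, the `metadata.name` choice, and the default org are assumptions made for illustration, and fields a real `Application` needs but this page does not state (`project`, `destination.server`, `targetRevision`) are deliberately left out.

```python
import json


def application_for(app: str, org: str = "arcodange-org") -> dict:
    """Derive the Application fields listed in the table from the app name.

    Hypothetical helper: only the values documented in the table are filled in.
    """
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        "metadata": {"name": app},  # assumption: the resource is named after the app
        "spec": {
            "source": {
                # repoURL = Gitea host + org (overridable) + app name
                "repoURL": f"https://gitea.arcodange.lab/{org}/{app}",
                # path is a fixed convention
                "path": "chart",
            },
            # the namespace is the app name itself
            "destination": {"namespace": app},
            "syncPolicy": {
                "automated": {"prune": True, "selfHeal": True},
                "syncOptions": ["CreateNamespace=true"],
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(application_for("webapp"), indent=2))
```

Calling `application_for("webapp")` gives a `repoURL` of `https://gitea.arcodange.lab/arcodange-org/webapp` and a `webapp` destination namespace; passing `org="arcodange"` models the optional org override.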
+ +### Ingress convention — `.fr` public vs `.lab` internal + +Every app that serves HTTP exposes itself through two Traefik ingresses with a fixed split by domain suffix: + +| Ingress | Domain | Traefik entrypoint | Middlewares | TLS / cert | Reached via | +|---|---|---|---|---|---| +| **Public** | `.arcodange.fr` | `web` | `kube-system-crowdsec@kubernetescrd` (CrowdSec bouncer) | terminated at the edge | the **Cloudflared tunnel** — the public web entrypoint | +| **Internal** | `.arcodange.lab` | `websecure` | `localIp@file` (LAN-only allow-list) | a cert from **either** the Traefik `letsencrypt` resolver **or** cert-manager's `step-issuer` (`StepClusterIssuer`) | the LAN directly | + +> [!NOTE] +> The two archetypes differ only in cert mechanism, not in the convention: webapp's internal ingress carries the `letsencrypt` certresolver annotations, while url-shortener's internal ingress requests its cert from cert-manager's `step-issuer`. Both still ride `websecure` + `localIp@file`; both still expose a `.fr` twin behind the CrowdSec middleware. + +## Two archetypes compared + +The deployed apps fall into two shapes. Pick the matching archetype's page when adding or modifying an app. 
+ +| Aspect | [webapp](webapp.md) | [url-shortener](url-shortener.md) | +|---|---|---| +| Language / build | **Go** (golang:1.23 → alpine runtime) | **Rust** (cargo-chef → `scratch` runtime) | +| State | **External Postgres**, reached through the `tools` **pgbouncer** pooler with credentials delivered by **VSO** | **Embedded SQLite** on a `/data` file | +| Persistence | none in-cluster (DB lives on `pi2`) | a **Longhorn RWO PVC** (`storageClassName: longhorn`, `helm.sh/resource-policy: keep`) mounted at `/data` | +| Replicas | **scalable** (stateless pods; HPA-ready) | **single** — RWO volume cannot be shared across pods | +| `iac/` + Vault | **yes** — declares a Postgres dynamic-secret role + a k8s auth role; pod consumes **dynamic, rotating** DB creds via `VaultAuth` + `VaultDynamicSecret` + `VaultStaticSecret` CRDs | **none** — no Vault objects, no DB role | +| Recovery | restore from the **PostgreSQL backup** (factory `05_backup` → `/mnt/backups`) | **Longhorn block-device recovery** of the PVC (raw replica `.img` files) — see [ansible recover](../factory-provisioning/ansible/06-recover.md) | + +The choice is essentially *"shared/scalable state that survives a single node"* (Postgres, webapp shape) versus *"self-contained single-writer state co-located with the pod"* (SQLite-on-Longhorn, url-shortener shape). The trade-off and why both are kept prod-like is recorded in the [safe-prod-like-environment ADR](../../ADR/0001-safe-prod-like-environment.md). + +## Generic app lifecycle + +```mermaid +%%{init: {'theme': 'base'}}%% +flowchart LR + classDef src fill:#2563eb,stroke:#1e40af,color:#fff + classDef proc fill:#059669,stroke:#047857,color:#fff + classDef store fill:#7c3aed,stroke:#6d28d9,color:#fff + classDef net fill:#b45309,stroke:#92400e,color:#fff + + REPO["app repo
Dockerfile + chart/ + iac/ + .gitea/workflows"]:::src + IMG["image pushed
gitea registry <org>/<app>"]:::store + VAULT["tofu apply
Vault role + Postgres role"]:::store + ARGO["ArgoCD
deploys chart (ns <app>)"]:::proc + POD["pod
Postgres via pgbouncer + VSO
OR SQLite on Longhorn PVC"]:::proc + TR["Traefik ingress
.fr public / .lab internal"]:::net + + REPO -- "dockerimage CI" --> IMG + REPO -- "vault CI (iac apps only)" --> VAULT + IMG --> ARGO + VAULT -. "creds for DB apps" .- POD + ARGO --> POD + POD --> TR +``` + +1. The **app repo** holds the four ingredients: a Dockerfile, a `chart/`, an optional `iac/`, and `.gitea/workflows`. +2. On a push to `main`, the **dockerimage** workflow builds the image and pushes it to the Gitea container registry as `<org>/<app>`. +3. For apps with an `iac/`, the **vault** workflow runs `tofu apply` to declare the app's Postgres dynamic-secret role and Kubernetes auth role in Vault. +4. **ArgoCD** (factory's app-of-apps) syncs the chart into the `<app>` namespace, deriving `repoURL`/`path`/`namespace` from the `<app>` name. +5. The **pod** comes up; a Postgres-backed app receives rotating DB credentials through pgbouncer + VSO, while a SQLite-backed app mounts its Longhorn PVC at `/data`. +6. **Traefik** publishes the pod through two ingresses: the `.fr` public route (CrowdSec middleware, via the Cloudflared tunnel) and the `.lab` internal route (`websecure` + `localIp` + a letsencrypt/step-issuer cert). + +## Index + +| Page | Archetype | Status | +|---|---|---| +| [webapp](webapp.md) | Canonical **Go + external Postgres** exemplar: `iac/` + Vault dynamic creds, scalable stateless pods | ✅ Active | +| [url-shortener](url-shortener.md) | **Rust + embedded SQLite** counterpart: single replica on a Longhorn RWO PVC, no Vault | ✅ Active | + +`erp` and the other apps (`dance-lessons-coach`, `telegram-gateway`, `plausible`) follow the same pattern; `erp` will be cross-linked here once its dedicated guidebook ships. + +## Maintenance rule + +> [!IMPORTANT] +> **When an app's repo changes shape, its page here changes in the same PR.** If you alter the chart structure, the ingress convention, the Vault wiring, the persistence model, or the CI workflows of a deployed app, update this hub and the relevant archetype page in the same change.
A reference map that drifts from the real `chart/` and `iac/` sends agents confidently down dead paths. + +## Cross-references + +- [01 · factory](../lab-ecosystem/01-factory.md): the ArgoCD app-of-apps that emits one `Application` per app on this page. +- [tools secrets-and-vso](../tools/secrets-and-vso.md): the `app_policy` + `app_roles` module pair that turns `<app>` into Vault policies, roles, and CI identities; the VSO runtime path the Postgres archetype rides. +- [factory postgres-iac](../factory-provisioning/opentofu/postgres-iac.md): the per-app PostgreSQL database + `_role` the webapp archetype depends on. +- [naming-conventions](../lab-ecosystem/naming-conventions.md): the `<app>` join key that threads through repo, image, namespace, DB, Vault role, and ServiceAccount. +- [safe-prod-like-environment ADR](../../ADR/0001-safe-prod-like-environment.md): why the lab keeps apps deployed prod-like and the state/recovery trade-offs behind the two archetypes. diff --git a/vibe/guidebooks/applications/url-shortener.md b/vibe/guidebooks/applications/url-shortener.md new file mode 100644 index 0000000..d351aa8 --- /dev/null +++ b/vibe/guidebooks/applications/url-shortener.md @@ -0,0 +1,172 @@ +[vibe](../../README.md) > [Guidebooks](../README.md) > [Applications](README.md) > **url-shortener** + +# url-shortener (Chhoto URL) + +> **Status:** ✅ Active · **Last Updated:** 2026-06-23 +> **Upstream:** [Applications index](README.md) · [lab-ecosystem hub](../lab-ecosystem/README.md) +> **Downstream:** [storage-and-recovery](../lab-ecosystem/storage-and-recovery.md) · [Ansible recover playbooks](../factory-provisioning/ansible/06-recover.md) +> **Related:** [webapp](webapp.md) (the stateless counterpart) · [naming-conventions](../lab-ecosystem/naming-conventions.md) · [secrets-and-vault](../lab-ecosystem/secrets-and-vault.md) + +`url-shortener` is the lab's **stateful** application, the deliberate mirror image of [webapp](webapp.md).
Where webapp is a horizontally-scalable, Postgres-backed, Vault-credentialed reference app, `url-shortener` is a single-pod, SQLite-on-a-disk service that proves the lab can run a genuinely stateful workload on Longhorn block storage and recover it after a crash. + +The application itself is **Chhoto URL**, a tiny Rust/[actix-web](https://gitea.arcodange.lab/arcodange-org/url-shortener/src/branch/main/actix/Cargo.toml) URL shortener, mirrored into the lab from the upstream project [`sintan1729/chhoto-url`](https://github.com/SinTan1729/chhoto-url). The lab does not fork the source — it mirrors the published Docker image (see [§5 CI](#5-ci--mirror-not-build)) — but keeps a copy of the source and a custom Helm chart in [the `url-shortener` repo](https://gitea.arcodange.lab/arcodange-org/url-shortener/src/branch/main) so the lab owns its packaging. + +> [!NOTE] +> A second public web app, the **erp** application, exists in the ecosystem but does not yet have a guidebook page; it is mentioned here in prose only and will be cross-linked once its guidebook ships. + +--- + +## 1. The app & image + +Chhoto URL is a single self-contained binary: an `actix-web` HTTP server with a bundled SQLite engine and a plain static frontend. There is no separate database process and no application runtime to install. 
+ +| Aspect | Detail | Source | +| --- | --- | --- | +| Language / framework | Rust, [`actix-web`](https://gitea.arcodange.lab/arcodange-org/url-shortener/src/branch/main/actix/src/main.rs) `4.5.x` | [`actix/Cargo.toml`](https://gitea.arcodange.lab/arcodange-org/url-shortener/src/branch/main/actix/Cargo.toml) | +| Database driver | [`rusqlite`](https://gitea.arcodange.lab/arcodange-org/url-shortener/src/branch/main/actix/src/database.rs) with the `bundled` feature — SQLite is compiled **into** the binary | `Cargo.toml` | +| Sessions | `actix-session` cookie store; the session key is regenerated at boot, so a restart invalidates all logins | [`actix/src/main.rs`](https://gitea.arcodange.lab/arcodange-org/url-shortener/src/branch/main/actix/src/main.rs) | +| Frontend | Plain HTML/CSS/JS served by `actix-files` from [`resources/`](https://gitea.arcodange.lab/arcodange-org/url-shortener/src/branch/main/resources) (`index.html`, `static/styles.css`, `static/script.js`, `static/404.html`) — no build step, no SPA framework | `resources/` | +| Listen port | `4567` (overridable via the `port` env var) | `actix/src/main.rs` | +| Slug generation | Adjective-name `Pair` or `UID` styles; the lab forces `UID` (see [§3](#3-the-chart)) | `actix/src/utils.rs` | + +### Image build + +| Image | Base / approach | Result | File | +| --- | --- | --- | --- | +| `Dockerfile` | `cargo-chef` dependency caching → musl static build (`x86_64-unknown-linux-musl`) → `FROM scratch`, copying only the binary + `resources/` | A single-arch, ~6 MB image with **no** OS, shell, or libc | [`Dockerfile`](https://gitea.arcodange.lab/arcodange-org/url-shortener/src/branch/main/Dockerfile) | +| `Dockerfile.multiarch` | Per-arch `FROM scratch` stages selected by `TARGETARCH`, copying pre-built musl binaries | `amd64` / `arm64` / `armv7` images from a local cross-compile | [`Dockerfile.multiarch`](https://gitea.arcodange.lab/arcodange-org/url-shortener/src/branch/main/Dockerfile.multiarch) | +| `compose.yaml` 
| Pulls the upstream `sintan1729/chhoto-url:latest`, binds `4567`, mounts a `db` volume at the SQLite path | Local-dev only — not the deployed topology | [`compose.yaml`](https://gitea.arcodange.lab/arcodange-org/url-shortener/src/branch/main/compose.yaml) | + +The `FROM scratch` design is what makes the storage story so stark: the container has nothing **except** the SQLite file on its mounted volume. All durable state is one file. + +--- + +## 2. Storage — the key contrast + +This is the section that distinguishes `url-shortener` from every other lab app. The entire database is a single SQLite file, `/data/urls.sqlite`, living on a Longhorn PersistentVolumeClaim. + +| PVC field | Value | Why it matters | +| --- | --- | --- | +| `accessModes` | `ReadWriteOnce` (RWO) | Only **one** node can mount the volume at a time | +| `resources.requests.storage` | `128Mi` | A URL table is tiny; this is generous | +| `storageClassName` | `longhorn` | Replicated block storage; the source of durability | +| `annotations` | `helm.sh/resource-policy: keep` | The PVC (and its data) **survives `helm uninstall`** | +| Mount | `/data` → SQLite at `/data/urls.sqlite` | Set by the ConfigMap `db_url` ([§3](#3-the-chart)) | + +Source: [`chart/templates/pvc.yaml`](https://gitea.arcodange.lab/arcodange-org/url-shortener/src/branch/main/chart/templates/pvc.yaml) and the volume mount in [`chart/templates/deployment.yaml`](https://gitea.arcodange.lab/arcodange-org/url-shortener/src/branch/main/chart/templates/deployment.yaml). + +Because the volume is **RWO**, `replicaCount` is hard-pinned to `1`. A second pod cannot mount the same volume, so there is **no HA and no rolling update** — the next pod can only start after the previous one has released the disk. This is the exact data shape that was reconstructed during the **2026-04-13 power-cut** incident via Longhorn block-device recovery (see [§6](#6-recovery)). + +--- + +## 3. 
The chart + +`url-shortener` ships its own Helm chart at [`chart/`](https://gitea.arcodange.lab/arcodange-org/url-shortener/src/branch/main/chart). It is deployed by ArgoCD like every other lab app, following the [naming conventions](../lab-ecosystem/naming-conventions.md). + +| Concern | Setting | Source | +| --- | --- | --- | +| Image | `gitea.arcodange.lab/arcodange-org/url-shortener`, tag defaults to `.Chart.AppVersion` (`6.5.3`) | [`values.yaml`](https://gitea.arcodange.lab/arcodange-org/url-shortener/src/branch/main/chart/values.yaml), [`Chart.yaml`](https://gitea.arcodange.lab/arcodange-org/url-shortener/src/branch/main/chart/Chart.yaml) | +| Service | `ClusterIP`, port `4567` | [`service.yaml`](https://gitea.arcodange.lab/arcodange-org/url-shortener/src/branch/main/chart/templates/service.yaml) | +| Internal ingress | Host `url.arcodange.lab` on `websecure`; TLS via cert-manager `StepClusterIssuer` (`step-issuer`); `localIp@file` middleware (LAN-only) | [`values.yaml`](https://gitea.arcodange.lab/arcodange-org/url-shortener/src/branch/main/chart/values.yaml) | +| Public ingress | Host derived from the internal host by the `.lab → .fr` substitution in `_helpers.tpl`; `PathRegexp` matcher (`/[^/]+`) so it intercepts shortlink redirects; **no `localIp`** (it carries the `crowdsec` middleware instead, since the public path must be reachable) | [`public-ingress.yaml`](https://gitea.arcodange.lab/arcodange-org/url-shortener/src/branch/main/chart/templates/public-ingress.yaml), [`_helpers.tpl`](https://gitea.arcodange.lab/arcodange-org/url-shortener/src/branch/main/chart/templates/_helpers.tpl) | +| ConfigMap | `db_url=/data/urls.sqlite`, `site_url` = the `.fr` FQDN, `slug_style=UID`, `slug_length=4` | [`config.yaml`](https://gitea.arcodange.lab/arcodange-org/url-shortener/src/branch/main/chart/templates/config.yaml) | +| Probes | Liveness + readiness HTTP `GET /` on the `http` port | 
[`values.yaml`](https://gitea.arcodange.lab/arcodange-org/url-shortener/src/branch/main/chart/values.yaml) | +| Autoscaling | **Disabled.** `maxReplicas > 1` would fail under RWO (a second pod cannot mount the volume) | [`hpa.yaml`](https://gitea.arcodange.lab/arcodange-org/url-shortener/src/branch/main/chart/templates/hpa.yaml), [`values.yaml`](https://gitea.arcodange.lab/arcodange-org/url-shortener/src/branch/main/chart/values.yaml) | + +> [!NOTE] +> The `_helpers.tpl` `url-shortener.fqdn` template builds the public `site_url` by taking the first internal ingress host and running `replace ".lab" ".fr"` — so the same value drives the internal `url.arcodange.lab` and the public `.fr` redirect domain. This is the same `.lab ↔ .fr` split documented in [naming-conventions](../lab-ecosystem/naming-conventions.md). + +--- + +## 4. No iac/, no Vault — a deliberate deviation + +`url-shortener` has **no `iac/` directory and no Vault CRDs**, and this is intentional, not an oversight. + +| Convention (see [webapp](webapp.md)) | Why url-shortener skips it | +| --- | --- | +| `iac/` OpenTofu module declaring a Postgres role | There is no Postgres. SQLite is embedded; there is no database server to provision. | +| Vault `app_roles` + VSO-synced dynamic credentials | There are no credentials to issue — the app talks to a local file, not a networked database. See [secrets-and-vault](../lab-ecosystem/secrets-and-vault.md) for the pattern url-shortener opts out of. | + +For a SQLite-on-a-file app, **the Helm chart *is* the IaC**: the PVC, the ConfigMap, and the ingress fully describe the deployable surface. There is no second provisioning tier. Compare with [webapp](webapp.md), where the Postgres role and Vault role are first-class infrastructure objects managed outside the chart. + +--- + +## 5. 
CI — mirror, not build + +The single workflow, [`.gitea/workflows/dockerimage.yaml`](https://gitea.arcodange.lab/arcodange-org/url-shortener/src/branch/main/.gitea/workflows/dockerimage.yaml), does **not** build the local `actix/` source. It mirrors the upstream image. + +| Step | Action | +| --- | --- | +| 1. Discover version | `wget` the upstream `Cargo.toml` from `SinTan1729/chhoto-url` on GitHub and parse `version` | +| 2. Login | Authenticate to `gitea.arcodange.lab` using the `PACKAGES_TOKEN` secret | +| 3. Pull | `docker pull sintan1729/chhoto-url:` from Docker Hub (`latest` and the discovered version) | +| 4. Retag | `docker tag` to `gitea.arcodange.lab/:` | +| 5. Push | `docker push` both `latest` and the version tag to the Gitea registry | + +So the in-repo `actix/` source and the `Dockerfile`/`Dockerfile.multiarch` exist for **reference and local cross-compilation** (`Makefile` targets `build-release` / `docker-release` use `cross` for `amd64` / `arm64` / `armv7`), but the cluster runs the **mirrored upstream image**. The multi-arch build is a manual, local-developer flow — it is not run in CI. + +--- + +## 6. Recovery + +`url-shortener`'s SQLite-on-RWO is the canonical example that the lab's recovery tooling targets. When a node dies mid-write or the cluster loses power, the durable artifact to reconstruct is exactly this kind of Longhorn block volume. + +- The concept and policy live in [storage-and-recovery](../lab-ecosystem/storage-and-recovery.md). +- The mechanics live in the [Ansible recover playbooks](../factory-provisioning/ansible/06-recover.md): the `longhorn_data.yml` block-device recovery flow reconstructs precisely this single-file SQLite volume, which is what was recovered in the **2026-04-13** power-cut. + +> [!WARNING] +> Single replica + RWO means **downtime on any pod move**: a node drain, an upgrade, or a reschedule cannot overlap the old and new pods — the new one waits for the disk to detach. 
There is **no application-level redundancy**. The only durability is **Longhorn volume replication plus backups**; if the volume is lost and unbacked, the URL database is gone. Treat backup health as the single point of failure for this app. + +--- + +## 7. Deviations from convention (vs webapp) + +A side-by-side of where `url-shortener` (stateful) departs from [webapp](webapp.md) (the stateless reference): + +| Dimension | webapp | url-shortener | +| --- | --- | --- | +| Database | PostgreSQL (external server) | SQLite (embedded, single file) | +| Vault / secrets | `app_roles` + VSO-synced dynamic creds | **None** — no networked credentials | +| `iac/` directory | Yes (Postgres role, Vault role) | **None** — the Helm chart is the IaC | +| Replicas | Scalable (HPA-eligible) | **1, hard-pinned** (RWO forbids more) | +| Rolling update / HA | Yes | **No** — single pod, downtime on move | +| CI | Builds the app source | **Mirrors** the upstream image | +| Recovery shape | Postgres backup/restore | Longhorn block-device recovery | + +--- + +## 8. Deploy + storage path + +```mermaid +%%{init: {'theme':'base'}}%% +flowchart LR + upstream["Docker Hub
sintan1729/chhoto-url"] + ci["Gitea Actions
dockerimage.yaml (mirror)"] + reg["Gitea registry
gitea.arcodange.lab/
arcodange-org/url-shortener"] + argo["ArgoCD + Helm chart"] + pod["Pod (replicaCount 1)
actix on :4567"] + pvc["Longhorn PVC (RWO, 128Mi)
keep policy"] + db["/data/urls.sqlite"] + + upstream -- "pull + retag" --> ci + ci -- "push latest + version" --> reg + reg -- "image ref" --> argo + argo -- "deploy" --> pod + pod -- "mounts (RWO)" --> pvc + pvc -- "holds" --> db + + classDef box fill:#1f2933,stroke:#7b8794,color:#f5f7fa; + class upstream,ci,reg,argo,pod,pvc,db box; +``` + +1. **Upstream** — the canonical `sintan1729/chhoto-url` image is published to Docker Hub by the original maintainer. +2. **CI mirror** — the lab's `dockerimage.yaml` workflow pulls that image, retags it, and pushes it to the Gitea registry (it does not build from `actix/`). +3. **Gitea registry** — `gitea.arcodange.lab/arcodange-org/url-shortener` holds both `latest` and the version tag. +4. **ArgoCD + Helm** — the chart references the registry image (tag defaults to `appVersion`) and renders the Deployment, Service, ingresses, ConfigMap, and PVC. +5. **Pod** — a single `actix` pod listens on `4567`; HPA and rolling updates are off. +6. **Longhorn PVC** — the pod mounts the RWO volume at `/data`; only one pod can hold it. +7. **SQLite file** — all durable state is the single `/data/urls.sqlite` file, which is what [Longhorn block-device recovery](../factory-provisioning/ansible/06-recover.md) reconstructs. + +--- + +See also: [webapp](webapp.md) (the stateless, Postgres-backed contrast) · [Applications index](README.md) · [naming-conventions](../lab-ecosystem/naming-conventions.md) · [storage-and-recovery](../lab-ecosystem/storage-and-recovery.md). 
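
The mirror step (step 2 in the path above) can be sketched as a small command planner. This is a hedged sketch, not the contents of `dockerimage.yaml`: the helper names `parse_cargo_version` and `mirror_commands` are hypothetical, the exact command order is an assumption, and nothing here invokes Docker. It only lists the `pull` / `tag` / `push` lines the workflow is described as performing for `latest` and the discovered version.

```python
import re

# Image names as given on this page.
UPSTREAM_IMAGE = "sintan1729/chhoto-url"
LAB_IMAGE = "gitea.arcodange.lab/arcodange-org/url-shortener"


def parse_cargo_version(cargo_toml: str) -> str:
    """Return the `version` declared in the [package] table of a Cargo.toml."""
    in_package = False
    for line in cargo_toml.splitlines():
        stripped = line.strip()
        if stripped.startswith("["):
            # Track which table we are in so dependency versions are ignored.
            in_package = stripped == "[package]"
            continue
        if in_package:
            match = re.match(r'version\s*=\s*"([^"]+)"', stripped)
            if match:
                return match.group(1)
    raise ValueError("no [package] version found")


def mirror_commands(version: str) -> list[str]:
    """Plan the pull/retag/push commands for `latest` and the given version."""
    commands = []
    for tag in ("latest", version):
        commands.append(f"docker pull {UPSTREAM_IMAGE}:{tag}")
        commands.append(f"docker tag {UPSTREAM_IMAGE}:{tag} {LAB_IMAGE}:{tag}")
        commands.append(f"docker push {LAB_IMAGE}:{tag}")
    return commands


if __name__ == "__main__":
    # Sample manifest for illustration only; the real one is fetched from upstream.
    sample = '[package]\nname = "chhoto-url"\nversion = "6.5.3"\n'
    for command in mirror_commands(parse_cargo_version(sample)):
        print(command)
```

Keeping version discovery separate from command planning makes the parsing testable without registry access, which matters because the chart's image tag defaults to the same version string.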
diff --git a/vibe/guidebooks/applications/webapp.md b/vibe/guidebooks/applications/webapp.md new file mode 100644 index 0000000..2cb6900 --- /dev/null +++ b/vibe/guidebooks/applications/webapp.md @@ -0,0 +1,155 @@ +[vibe](../../README.md) > [Guidebooks](../README.md) > [Applications](README.md) > **webapp** + +# webapp + +> **Status:** ✅ Active +> **Last Updated:** 2026-06-23 +> **Upstream:** [Applications](README.md) · [01 · factory](../lab-ecosystem/01-factory.md) · [tools secrets-and-vso](../tools/secrets-and-vso.md) +> **Downstream:** the template every simple Postgres-backed app (`erp`, `dance-lessons-coach`) is cloned from +> **Related:** [url-shortener](url-shortener.md) · [naming-conventions](../lab-ecosystem/naming-conventions.md) · [secrets-and-vault concept](../lab-ecosystem/secrets-and-vault.md) · [factory postgres-iac](../factory-provisioning/opentofu/postgres-iac.md) · [tofu CI apply flow](../factory-provisioning/opentofu/ci-apply-flow.md) · [safe-prod-like-environment ADR](../../ADR/0001-safe-prod-like-environment.md) + +`webapp` is the **canonical simple-app exemplar** — a deliberately small Go diagnostic app whose whole job is to exercise the lab's plumbing so every other Postgres-backed app can be cloned from its shape. It ships the four ingredients of the [common app pattern](README.md) (Dockerfile, `chart/`, `iac/`, `.gitea/workflows`) in their most legible form, with no business logic to read around. When you add a new simple app, this repo is the skeleton you copy. 
+ +What it actually does at runtime is a handful of probes: + +| Endpoint | Purpose | +|---|---| +| `GET /` | Serves an HTML form that posts a number to `/query` | +| `GET /query?param=N` | Runs the parameterized query `SELECT 42 + $1` against Postgres and renders the result — the end-to-end "is the DB reachable and answering" check | +| `GET /liveness` | Always-`200 OK` liveness probe (no DB touch) | +| `GET /readiness` | Calls `db.Ping()`; returns `503 NOT READY` if Postgres is unreachable — so the pod only takes traffic once the DB is live | +| `GET /display-info` | Dumps the request's cookies, client IP, and headers — used to confirm the real client IP survives the ingress path | +| `GET /oauth-callback`, `/retrieve`, `/test-oauth-callback` | OAuth device-flow test endpoints (see below) — a workaround for Gitea lacking the OIDC device grant | + +> [!NOTE] +> The OAuth endpoints exist because Gitea's OIDC provider only advertises `authorization_code` + `refresh_token`, not the device grant. `webapp` stands in as the redirect target: `/oauth-callback` stores the `code` keyed by the client-chosen `state` in an in-memory cache (5-minute TTL), and a CLI client then polls `/retrieve?state=…` to exchange its `state` for the `code`. `/retrieve` is IP-gated — it only answers callers in the LAN CIDRs (`192.168.0.0/16`, the IPv6 prefix, the k3s `10.42.0.0/16`) or an explicit `OAUTH_DEVICE_CODE_ALLOWED_IPS` allow-list — which is exactly why the pod must see the **real client IP** (see the `nodeSelector` note in [the chart](#2-the-chart)). + +--- + +## 1) The app & image + +A single-file Go program with two third-party dependencies, built into a tiny runtime image. 
+ +| Aspect | Value | +|---|---| +| Source | [`main.go`](https://gitea.arcodange.lab/arcodange-org/webapp/src/branch/main/main.go) — one file, `package main` | +| Module | [`go.mod`](https://gitea.arcodange.lab/arcodange-org/webapp/src/branch/main/go.mod) · `gitea.arcodange.lab/arcodange-org/webapp` · Go 1.23 | +| Postgres driver | `github.com/lib/pq` v1.10.9 — registered as the `postgres` driver; the connection string comes from `DATABASE_URL` | +| Cache | `github.com/patrickmn/go-cache` v2.1.0 — the in-memory `state → code` store for the OAuth callback (5 min default expiry, 10 min cleanup) | +| Listen port | `:8080`, plain HTTP (`net/http` default mux) — TLS is terminated upstream at Traefik | +| Query | `SELECT 42 + $1` — parameterized (`$1` bound, not interpolated) so the diagnostic endpoint is not an injection vector | +| Dockerfile | [`Dockerfile`](https://gitea.arcodange.lab/arcodange-org/webapp/src/branch/main/Dockerfile) — multistage: `golang:1.23-alpine` builder (`go build -o app .`) → `alpine:latest` runtime with `ca-certificates`, `EXPOSE 8080`, `CMD ["./app"]` | +| Image | `gitea.arcodange.lab/arcodange-org/webapp` — pushed to the Gitea container registry by CI (tags `latest` + the git ref name) | + +The runtime image carries no config of its own: everything (`DATABASE_URL`, `OAUTH_ALLOWED_HOSTS`, the optional `OAUTH_DEVICE_CODE_ALLOWED_IPS`) arrives as environment variables from the chart's ConfigMap. + +--- + +## 2) The chart + +The Helm chart at [`chart/`](https://gitea.arcodange.lab/arcodange-org/webapp/src/branch/main/chart) ([`Chart.yaml`](https://gitea.arcodange.lab/arcodange-org/webapp/src/branch/main/chart/Chart.yaml): `name: webapp`, `appVersion: "latest"`) is the unit ArgoCD syncs into the `webapp` namespace. It is the boilerplate `helm create` scaffold plus the Vault CRDs and a hardcoded second ingress. 
+ +| Chart object | Template | Shape | +|---|---|---| +| **Deployment** | [`deployment.yaml`](https://gitea.arcodange.lab/arcodange-org/webapp/src/branch/main/chart/templates/deployment.yaml) | `replicaCount: 1`; `revisionHistoryLimit: 3`; one container on `containerPort: 8080`; `envFrom` the ConfigMap; liveness + readiness probes | +| **Node pinning** | `nodeSelector` in [`values.yaml`](https://gitea.arcodange.lab/arcodange-org/webapp/src/branch/main/chart/values.yaml) | `kubernetes.io/hostname: pi1` — pinned to the **network entrypoint node** so traffic avoids NAT and the pod sees the **real client IP** (load-bearing for the IP-gated `/retrieve` and for `/display-info`) | +| **ConfigMap** | [`config.yaml`](https://gitea.arcodange.lab/arcodange-org/webapp/src/branch/main/chart/templates/config.yaml) | `OAUTH_ALLOWED_HOSTS: webapp.arcodange.lab,webapp.arcodange.fr`; `DATABASE_URL: postgres://pgbouncer_auth:pgbouncer_auth@pgbouncer.tools/postgres?sslmode=disable` | +| **Public ingress** | [`ingress.yaml`](https://gitea.arcodange.lab/arcodange-org/webapp/src/branch/main/chart/templates/ingress.yaml) (values-driven) | host `webapp.arcodange.fr`, Traefik `web` entrypoint (HTTP), middleware `kube-system-crowdsec@kubernetescrd` (the CrowdSec bouncer) | +| **Internal ingress** | [`localIngress.yaml`](https://gitea.arcodange.lab/arcodange-org/webapp/src/branch/main/chart/templates/localIngress.yaml) (hardcoded manifest) | `Ingress/webapp-local`, host `webapp.arcodange.lab`, Traefik `websecure` entrypoint, `letsencrypt` certresolver, middleware `localIp@file` (LAN-only) | +| **Service** | [`service.yaml`](https://gitea.arcodange.lab/arcodange-org/webapp/src/branch/main/chart/templates/service.yaml) | `ClusterIP` on port 8080 → `http` target | +| **ServiceAccount** | [`serviceaccount.yaml`](https://gitea.arcodange.lab/arcodange-org/webapp/src/branch/main/chart/templates/serviceaccount.yaml) | created as `webapp`, token auto-mounted — the identity VSO uses to authenticate to 
Vault | +| **Probes** | `livenessProbe` / `readinessProbe` in [`values.yaml`](https://gitea.arcodange.lab/arcodange-org/webapp/src/branch/main/chart/values.yaml) | liveness → `/liveness` (cheap), readiness → `/readiness` (**pings the DB**), both on the `http` port | +| **HPA** | [`hpa.yaml`](https://gitea.arcodange.lab/arcodange-org/webapp/src/branch/main/chart/templates/hpa.yaml) | gated on `autoscaling.enabled`, which is `false` — **HPA disabled**; the single replica is fixed | + +> [!NOTE] +> The internal `.lab` ingress is shipped as a **hardcoded `localIngress.yaml`** (a plain manifest, not the templated `ingress.yaml`), and the equivalent block in `values.yaml` is left commented out. That is why webapp has two ingress *templates* but the values file only configures the `.fr` public one. The split — `web`/CrowdSec public vs `websecure`/`localIp`/letsencrypt internal — is the lab-wide [ingress convention](README.md#ingress-convention--fr-public-vs-lab-internal). + +--- + +## 3) Vault CRDs in the chart + +The chart ships the **full Vault Secrets Operator (VSO) wiring** as three CRDs. Together they let the pod authenticate to Vault as `webapp` and pull both static config and dynamic DB credentials. 
+ +| CRD | Template | What it declares | +|---|---|---| +| **VaultAuth** | [`vaultauth.yaml`](https://gitea.arcodange.lab/arcodange-org/webapp/src/branch/main/chart/templates/vaultauth.yaml) | `kubernetes` auth method on mount `kubernetes`, **role `webapp`**, ServiceAccount `webapp`, audience `vault` — the login other CRDs reference via `vaultAuthRef: auth` | +| **VaultStaticSecret** | [`vaultsecret.yaml`](https://gitea.arcodange.lab/arcodange-org/webapp/src/branch/main/chart/templates/vaultsecret.yaml) | `kv-v2` on mount `kvv2`, path **`webapp/config`** → k8s Secret **`secretkv`** (created by VSO), `refreshAfter: 30s` | +| **VaultDynamicSecret** | [`vaultdynamicsecret.yaml`](https://gitea.arcodange.lab/arcodange-org/webapp/src/branch/main/chart/templates/vaultdynamicsecret.yaml) | mount `postgres`, path **`creds/webapp`** → k8s Secret **`vso-db-credentials`** (created by VSO), with a `rolloutRestartTargets` entry on the `webapp` **Deployment** so the pod restarts when creds rotate | + +The runtime path these ride — VSO reading Vault on the pod's behalf and materializing k8s Secrets — is documented in [tools secrets-and-vso](../tools/secrets-and-vso.md). + +--- + +## 4) The key nuance — wiring shipped, live pod still on static creds + +> [!NOTE] +> **webapp provisions the complete dynamic-DB-credentials path but does not yet consume it.** The chart's `VaultDynamicSecret` (path `postgres/creds/webapp` → Secret `vso-db-credentials`, with a `rolloutRestart` on the Deployment) and the matching `iac/` role together stand up the **entire** per-app dynamic-credentials machinery end to end. But the **running Deployment takes `DATABASE_URL` from the ConfigMap**, which points at the **shared static `pgbouncer_auth` user** (`postgres://pgbouncer_auth:pgbouncer_auth@pgbouncer.tools/postgres?sslmode=disable`), and it does **not** mount the `vso-db-credentials` Secret. 
So `webapp` demonstrates the dynamic-creds wiring in full as a reference, while its live pod runs on the shared static account. Switching the live pod to rotating per-app credentials is simply a matter of consuming the `vso-db-credentials` Secret (e.g. project its fields into `DATABASE_URL` instead of the ConfigMap value). This is the current state, by design as an exemplar — not a misconfiguration. + +This is the one place to read `webapp` carefully: as a **template** it shows you every CRD and IaC resource a dynamic-creds app needs; as a **deployed workload** it is still on the shared pooler user. When you clone it for a real app, the last step is to wire the pod to `vso-db-credentials`. + +--- + +## 5) iac/ — Vault objects declared inline + +The OpenTofu under [`iac/`](https://gitea.arcodange.lab/arcodange-org/webapp/src/branch/main/iac) declares webapp's Vault objects. Unlike `erp` and `dance-lessons-coach`, which call the shared **`app_roles` module** from `tools` (see [tools secrets-and-vso](../tools/secrets-and-vso.md)), webapp declares them **inline** in `main.tf` — which is exactly why it reads as the legible reference: every resource is visible in one file rather than hidden behind a module call. 
+ +| File | Contents | +|---|---| +| [`providers.tf`](https://gitea.arcodange.lab/arcodange-org/webapp/src/branch/main/iac/providers.tf) | `vault` provider v4.4.0 at `https://vault.arcodange.lab`; authenticates via `auth_login_jwt { mount = "gitea_jwt", role = "gitea_cicd_webapp" }` (token from `TERRAFORM_VAULT_AUTH_JWT`) | +| [`backend.tf`](https://gitea.arcodange.lab/arcodange-org/webapp/src/branch/main/iac/backend.tf) | GCS backend, bucket `arcodange-tf`, prefix **`webapp/main`** | +| [`main.tf`](https://gitea.arcodange.lab/arcodange-org/webapp/src/branch/main/iac/main.tf) | three inline resources (below) | + +The three resources in `main.tf`: + +| Resource | Vault object | Detail | +|---|---|---| +| `vault_database_secret_backend_role.role` | `postgres/creds/webapp` | creation SQL `CREATE ROLE "{{name}}" WITH LOGIN PASSWORD … VALID UNTIL '{{expiration}}'` then **`GRANT webapp_role TO "{{name}}"`**; revocation `REVOKE ALL ON DATABASE webapp FROM …` | +| `vault_kubernetes_auth_backend_role.role` | k8s auth role `webapp` | bound to SA `webapp` + namespace `webapp`, `audience = vault`, `token_policies = ["default", "webapp"]`, `token_ttl = 3600` | +| `vault_kv_secret_v2.webapp_config` | `kvv2/webapp/config` | the KV-v2 config secret VSO reads into the `secretkv` k8s Secret | + +> [!IMPORTANT] +> The `GRANT webapp_role TO …` statement depends on the **`webapp_role`** Postgres group role being created first by factory's [postgres-iac](../factory-provisioning/opentofu/postgres-iac.md). webapp's IaC mints *short-lived login roles* that inherit the privileges of that pre-existing `webapp_role`; if `webapp_role` does not exist, the dynamic-credential creation fails at grant time. + +--- + +## 6) CI workflows + +Two Gitea Actions workflows under [`.gitea/workflows/`](https://gitea.arcodange.lab/arcodange-org/webapp/src/branch/main/.gitea/workflows), each gated to the part of the repo it owns. 
+ +| Workflow | File | Trigger | What it does | +|---|---|---|---| +| **Hashicorp Vault** | [`vault.yaml`](https://gitea.arcodange.lab/arcodange-org/webapp/src/branch/main/.gitea/workflows/vault.yaml) | push / PR touching `iac/*.tf` (+ manual) | a `gitea_vault_auth` job mints the Gitea OIDC token, then a `tofu` job runs **`terraform apply`** on `iac/` with **OpenTofu 1.8.2**; reads `kvv1/google/credentials` for the GCS backend; `VAULT_CACERT` is built from the `HOMELAB_CA_CERT` secret | +| **Docker Build** | [`dockerimage.yaml`](https://gitea.arcodange.lab/arcodange-org/webapp/src/branch/main/.gitea/workflows/dockerimage.yaml) | push to `main` (ignoring `README.md`, `chart/**`) + manual | logs into the Gitea registry with `PACKAGES_TOKEN`, `docker build`, pushes **`latest`** and the **git ref-name** tag to `gitea.arcodange.lab/` | + +The `vault.yaml` flow — Gitea OIDC → Vault JWT login → `tofu apply` with GCS state — is the lab-standard CI apply path described in [tofu CI apply flow](../factory-provisioning/opentofu/ci-apply-flow.md). The image is then rolled out cluster-side: ArgoCD Image Updater (factory side) watches the registry on a **digest** strategy and bumps the running Deployment when the `latest` digest changes. 
+ +--- + +## 7) `` convention mapping for webapp + +The single string `webapp` is the join key threading through every layer of the stack (the lab-wide [naming convention](../lab-ecosystem/naming-conventions.md)): + +| Layer | Value for `webapp` | +|---|---| +| Gitea repo | `arcodange-org/webapp` | +| Container image | `gitea.arcodange.lab/arcodange-org/webapp` (tags `latest` + ref-name) | +| PostgreSQL | database `webapp`, group role **`webapp_role`** (from factory [postgres-iac](../factory-provisioning/opentofu/postgres-iac.md)) | +| Vault — dynamic DB | `postgres/creds/webapp` | +| Vault — KV config | `kvv2/webapp/config` | +| Vault — k8s auth role | `webapp` (policies `default`, `webapp`; SA + ns `webapp`) | +| Vault — CI JWT role | `gitea_cicd_webapp` (mount `gitea_jwt`) | +| Terraform state | GCS `arcodange-tf` prefix `webapp/main` | +| Kubernetes | namespace `webapp`, ServiceAccount `webapp` | +| ArgoCD | `Application` `webapp` (chart synced into ns `webapp`) | +| Ingress hosts | public `webapp.arcodange.fr` · internal `webapp.arcodange.lab` | + +--- + +## Cross-references + +- [url-shortener](url-shortener.md) — the **stateful contrast**: Rust + embedded SQLite on a Longhorn PVC, single replica, **no `iac/` and no Vault objects** at all. webapp (shared/scalable Postgres) vs url-shortener (self-contained single-writer SQLite) are the two archetypes of the [Applications hub](README.md). +- [tools secrets-and-vso](../tools/secrets-and-vso.md) — the VSO runtime path the Vault CRDs ride, and the shared `app_roles` module that `erp`/`dance-lessons-coach` use *instead* of webapp's inline declarations. +- [factory postgres-iac](../factory-provisioning/opentofu/postgres-iac.md) — creates the `webapp` database and `webapp_role` that webapp's `GRANT` depends on. +- [tofu CI apply flow](../factory-provisioning/opentofu/ci-apply-flow.md) — the Gitea OIDC → Vault JWT → `tofu apply` pipeline behind `vault.yaml`. 
+- [naming-conventions](../lab-ecosystem/naming-conventions.md) — the `` join key tabulated in section 7. +- [safe-prod-like-environment ADR](../../ADR/0001-safe-prod-like-environment.md) — why a throwaway diagnostic app is still deployed prod-like, complete with dynamic-creds wiring. diff --git a/vibe/guidebooks/lab-ecosystem/01-factory.md b/vibe/guidebooks/lab-ecosystem/01-factory.md index a8cc875..b28b5b3 100644 --- a/vibe/guidebooks/lab-ecosystem/01-factory.md +++ b/vibe/guidebooks/lab-ecosystem/01-factory.md @@ -52,6 +52,9 @@ The tested power-cut recovery sequence (Longhorn restore → Vault unseal → VS A Helm chart whose `templates/apps.yaml` loops over `values.gitea_applications` and emits one `Application` CRD per app. Each Application derives everything from the app name: `repoURL = https://gitea.arcodange.lab//`, `path = chart`, `namespace = ` (`CreateNamespace=true`), with `syncPolicy.automated` `prune: true` + `selfHeal: true` by default. +> [!TIP] +> **Deeper dive:** the [Applications guidebook](../applications/README.md) maps what these `Application` CRDs deploy — the common app-repo pattern (Dockerfile + `chart/` + optional `iac/` + CI) every app in the list below shares, and the two archetypes (Go + Postgres vs Rust + SQLite). + | App | Org override | Image Updater | |---|---|---| | `url-shortener` | — | — | @@ -113,6 +116,7 @@ flowchart LR ## Cross-references - [Lab ecosystem hub](README.md) — the whole-lab map this page sits under. +- [Applications guidebook](../applications/README.md) — the apps ArgoCD's app-of-apps deploys: the common app-repo pattern and the Go+Postgres / Rust+SQLite archetypes. - [02 · tools](02-tools.md) — what ArgoCD deploys into the `tools` namespace (incl. pgbouncer that consumes the PG `user_lookup()`). - [03 · cms](03-cms.md) — the CMS edge that `iac/cloudflare.tf` and `iac/ovh.tf` wire up. - [naming-conventions.md](naming-conventions.md) — the `` join key these pillars share. 
-- 2.49.1 From 7bf83e75ed75ccc2241d8945df6e27b8061bda57 Mon Sep 17 00:00:00 2001 From: Gabriel Radureau Date: Tue, 23 Jun 2026 22:12:11 +0200 Subject: [PATCH 6/9] docs(vibe): add erp/ guidebook (Dolibarr deployment + backup/recovery + ops) Dedicated tree-docs guidebook under vibe/guidebooks/erp/ for the lab's most data-critical app, cross-linked from the applications hub (bidirectional): - README.md : Dolibarr 22.0.4 on Postgres; data-criticality; overview diagram; the Vault-unseal-before-scale recovery ordering (CAUTION). - deployment.md : upstream image + custom entrypoint (MySQL->psql), the 50Gi Longhorn RWX documents PVC, Vault CRDs + the shared app_roles iac, init scripts (conf.php creds, table-ownership), ingress, CI. - backup-and-recovery.md: the Ansible CronJob pg_dump (daily 04:00, 15-day retention) + restore Job (scale-0 -> restore -> scale-1); the cluster recovery ordering (Longhorn -> Vault unseal -> erp scale-up). - operations.md : the read-only bin/arcodange CLI, static/company.json, Deno+Playwright tests, day-2 ops. erp code via full gitea URLs; CLUSTER_RECOVERY.md by name; 2 mermaid diagrams MCP-validated; zero dead links. 
Co-Authored-By: Claude Opus 4.8 --- vibe/guidebooks/README.md | 1 + vibe/guidebooks/applications/README.md | 4 +- vibe/guidebooks/erp/README.md | 87 +++++++++ vibe/guidebooks/erp/backup-and-recovery.md | 207 +++++++++++++++++++++ vibe/guidebooks/erp/deployment.md | 189 +++++++++++++++++++ vibe/guidebooks/erp/operations.md | 173 +++++++++++++++++ 6 files changed, 659 insertions(+), 2 deletions(-) create mode 100644 vibe/guidebooks/erp/README.md create mode 100644 vibe/guidebooks/erp/backup-and-recovery.md create mode 100644 vibe/guidebooks/erp/deployment.md create mode 100644 vibe/guidebooks/erp/operations.md diff --git a/vibe/guidebooks/README.md b/vibe/guidebooks/README.md index 55e1ff2..0df40d0 100644 --- a/vibe/guidebooks/README.md +++ b/vibe/guidebooks/README.md @@ -39,6 +39,7 @@ flowchart LR | [Tools](tools/README.md) | Deep dive into the lab platform services in the `tools` namespace (Vault+VSO, Prometheus, Grafana, CrowdSec, poolers, Redis, Plausible, ClickHouse) | ✅ Active | | [CMS](cms/README.md) | Deep dive into the public Nuxt site arcodange.fr + its Cloudflare DNS/tunnel/Turnstile and Zoho email IaC | ✅ Active | | [Applications](applications/README.md) | The deployed apps and the common pattern they share — webapp (Go + Postgres) and url-shortener (Rust + SQLite); erp has its own guidebook | ✅ Active | +| [ERP](erp/README.md) | The lab’s Dolibarr ERP — its deployment on Postgres, its document storage + backup/restore, and the read-only ops CLI (the most data-critical app) | ✅ Active | ## Rules to contribute diff --git a/vibe/guidebooks/applications/README.md b/vibe/guidebooks/applications/README.md index 29c4ffc..96913dd 100644 --- a/vibe/guidebooks/applications/README.md +++ b/vibe/guidebooks/applications/README.md @@ -10,7 +10,7 @@ This guidebook maps the **deployed applications** — the workloads ArgoCD runs in their own `` namespace — and, more importantly, the **single repeatable pattern** every one of them follows. 
Once you know the pattern, every app reads as a variation on the same skeleton: a Gitea repo whose contents (Dockerfile + Helm chart + optional Vault IaC + CI) and whose `` name fully determine how it builds, deploys, gets its secrets, and is reached from the network. -Two apps are presented in depth as the canonical archetypes: [webapp](webapp.md) (Go + external Postgres) and [url-shortener](url-shortener.md) (Rust + embedded SQLite). Other apps in the cluster — `erp`, `dance-lessons-coach`, `telegram-gateway`, `plausible` — instantiate the same pattern; `erp` has its own guidebook (forthcoming) and is not linked here yet. +Two apps are presented in depth as the canonical archetypes: [webapp](webapp.md) (Go + external Postgres) and [url-shortener](url-shortener.md) (Rust + embedded SQLite). Other apps in the cluster — `erp`, `dance-lessons-coach`, `telegram-gateway`, `plausible` — instantiate the same pattern; `erp` has its own [ERP guidebook](../erp/README.md) because it carries far more moving parts than the two archetypes. ## The common app pattern @@ -102,7 +102,7 @@ flowchart LR | [webapp](webapp.md) | Canonical **Go + external Postgres** exemplar — `iac/` + Vault dynamic creds, scalable stateless pods | ✅ Active | | [url-shortener](url-shortener.md) | **Rust + embedded SQLite** counterpart — single replica on a Longhorn RWO PVC, no Vault | ✅ Active | -`erp` and the other apps (`dance-lessons-coach`, `telegram-gateway`, `plausible`) follow the same pattern; `erp` will be cross-linked here once its dedicated guidebook ships. +`erp` and the other apps (`dance-lessons-coach`, `telegram-gateway`, `plausible`) follow the same pattern; `erp` is documented in depth in its own [ERP guidebook](../erp/README.md). 
## Maintenance rule diff --git a/vibe/guidebooks/erp/README.md b/vibe/guidebooks/erp/README.md new file mode 100644 index 0000000..ba8e1a8 --- /dev/null +++ b/vibe/guidebooks/erp/README.md @@ -0,0 +1,87 @@ +[vibe](../../README.md) > [Guidebooks](../README.md) > **ERP** + +# ERP + +> **Status:** ✅ Active +> **Last Updated:** 2026-06-23 +> **Upstream:** [Applications hub](../applications/README.md) · [01 · factory](../lab-ecosystem/01-factory.md) +> **Downstream:** [Deployment](deployment.md) · [Backup & recovery](backup-and-recovery.md) · [Operations](operations.md) +> **Related:** [tools secrets-and-vso](../tools/secrets-and-vso.md) · [factory postgres-iac](../factory-provisioning/opentofu/postgres-iac.md) · [storage concept](../lab-ecosystem/storage-and-recovery.md) · [factory recover playbooks](../factory-provisioning/ansible/06-recover.md) · [safe-prod-like-environment ADR](../../ADR/0001-safe-prod-like-environment.md) + +This guidebook maps **erp** — the lab's [Dolibarr **22.0.4**](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/Chart.yaml) accounting/business ERP and its **single most data-critical application**. It is a PHP/Apache workload built from the upstream `dolibarr/dolibarr` image, served internally at `erp.arcodange.lab` (Traefik `websecure` + `localIp@file` + a `letsencrypt`-resolver cert). Everything a reader needs to deploy it, keep its data safe, and operate it lives in the three child pages below; this page is the orientation map. + +## What makes erp special + +`erp` is the complex sibling of the [webapp / url-shortener archetypes](../applications/README.md). 
It carries the **same four-ingredient app pattern** (Dockerfile-less reuse of an upstream image, a `chart/`, an `iac/`, `.gitea/workflows`) but layers several things on top that the archetypes do not: + +| Trait | erp specifics | Why it matters | +|---|---|---| +| **Upstream image** | `dolibarr/dolibarr:22.0.4` — not a repo-built image | No custom Dockerfile; the chart adapts the upstream container at runtime | +| **Postgres, not MySQL** | Dolibarr classically assumes MySQL; erp runs on **PostgreSQL** (`DOLI_DB_TYPE: pgsql`) | A [custom entrypoint](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/scripts/custom_entrypoint.sh) rewrites the upstream `docker-run.sh` `mysql` invocation into `psql` before launch | +| **DB path** | pod → `pgbouncer.tools:5432` → the `erp` Postgres database on `pi2` | Shares the [tools pgbouncer pooler](../tools/secrets-and-vso.md) like the webapp archetype | +| **Vault wiring** | **dynamic** rotating DB creds (`postgres/creds/erp`) + **static** KV config (`kvv2` `erp/config`) via the shared [`app_roles` module](../tools/secrets-and-vso.md) | The pod cannot start without VSO-injected `DOLI_DB_USER` / `DOLI_DB_PASSWORD` | +| **Document persistence** | a **50Gi Longhorn RWX PVC** (`storageClassName: longhorn`, `accessModes: ReadWriteMany`, `helm.sh/resource-policy: keep`) mounting `/var/www/documents`, `/var/www/html/custom`, and `/var/backups` | Uploaded invoices/PDFs/attachments are real business records — losing them is the worst case | +| **Backup + ops** | its own [backup/restore subsystem](backup-and-recovery.md) plus a **read-only ops CLI** (`bin/arcodange`) | Data-criticality demands both an escape hatch for restores and a safe way to inspect live state | + +## Overview — how erp is wired + +```mermaid +%%{init: {'theme': 'base'}}%% +flowchart LR + classDef src fill:#2563eb,stroke:#1e40af,color:#fff + classDef proc fill:#059669,stroke:#047857,color:#fff + classDef store fill:#7c3aed,stroke:#6d28d9,color:#fff + 
 classDef net fill:#b45309,stroke:#92400e,color:#fff + + CI["factory / erp CI<br/>tofu apply (iac/)"]:::src + VAULT["Vault<br/>postgres/creds/erp (dynamic)<br/>kvv2 erp/config (static)"]:::store + ARGO["ArgoCD<br/>syncs chart/ (ns erp)"]:::proc + POD["Dolibarr pod<br/>dolibarr/dolibarr:22.0.4<br/>custom entrypoint → psql"]:::proc + VSO["VSO<br/>VaultAuth + VaultDynamicSecret<br/>+ VaultStaticSecret"]:::proc + PGB["pgbouncer.tools:5432"]:::net + PG["Postgres erp db<br/>(pi2)"]:::store + PVC["50Gi Longhorn RWX PVC<br/>/var/www/documents"]:::store + BK["backup CronJob / runner<br/>pg_dump → documents/admin/backup"]:::proc + + CI --> VAULT + ARGO --> POD + VAULT -. "creds + config" .-> VSO + VSO -- "vso-db-credentials + secretkv" --> POD + POD --> PGB --> PG + PVC -- "mounts /var/www/documents" --- POD + BK -- "dumps DB + writes to" --> PVC +``` + +1. **factory / erp CI** runs `tofu apply` over `iac/` to declare erp's Vault objects — a Postgres dynamic-secret role and a Kubernetes auth role — through the shared [`app_roles` module](../tools/secrets-and-vso.md), and seeds the static `kvv2` `erp/config` KV (admin login, instance UUID). +2. **ArgoCD** (factory's [app-of-apps](../lab-ecosystem/01-factory.md)) syncs the `chart/` into the `erp` namespace. +3. The **Dolibarr pod** comes up from `dolibarr/dolibarr:22.0.4`; its [custom entrypoint](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/scripts/custom_entrypoint.sh) rewrites the upstream `docker-run.sh` so SQL runs through `psql` instead of `mysql`. +4. **VSO** authenticates to **Vault** (the `auth` `VaultAuth` CRD), materialising `vso-db-credentials` (dynamic, rotating DB user/password from `postgres/creds/erp`) and `secretkv` (static config from `kvv2` `erp/config`); both are injected into the pod, and a credential rotation triggers a rollout restart. +5. The pod connects to the **`erp` Postgres database** through the [tools `pgbouncer.tools:5432` pooler](../tools/secrets-and-vso.md). +6. A **50Gi Longhorn RWX PVC** mounts `/var/www/documents` (plus `/var/www/html/custom` and `/var/backups`), holding every uploaded document and generated PDF. +7. The **backup subsystem** dumps the `erp` database with a version-matched `pg_dump` and lands the archive under `documents/admin/backup` on that same PVC — see [Backup & recovery](backup-and-recovery.md). 
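Step 4 above implies a simple pre-flight check before the pod is expected to start: both VSO-materialised Secrets should exist in the `erp` namespace. The following is a minimal shell sketch under stated assumptions, not something taken from the erp repo. The function name `erp_secrets_ready` and the `KUBECTL` override are hypothetical and exist only so the check can be pointed at a stub; the Secret names and namespace come from the list above.

```shell
# Hypothetical helper (not part of the erp repo): succeeds only when both
# Secrets that VSO materialises for erp exist in the given namespace.
# KUBECTL is an assumed override so the check can be run against a stub.
erp_secrets_ready() {
  ns="${1:-erp}"
  for secret in vso-db-credentials secretkv; do
    # `kubectl get secret NAME -n NS` exits non-zero when the Secret is absent.
    ${KUBECTL:-kubectl} get secret "$secret" -n "$ns" >/dev/null 2>&1 || {
      echo "missing secret: $secret (ns $ns)" >&2
      return 1
    }
  done
  echo "erp secrets present in ns $ns"
}
```

A possible use is `erp_secrets_ready && echo "safe to start erp"`. Existence is only a necessary condition: the check does not prove that the dynamic credentials inside `vso-db-credentials` are still valid in Postgres.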
+ +> [!CAUTION] +> **Recovery ordering: Vault MUST be unsealed before erp is scaled up.** The Dolibarr pod has no usable DB credentials of its own — it depends entirely on VSO materialising `vso-db-credentials` from `postgres/creds/erp`. If erp is scaled up while Vault is still sealed, the pod crash-loops with no database access. During a cluster rebuild, unseal Vault first, confirm VSO has reconciled the erp secrets, and only then scale erp. The full sequence (Longhorn restore → Vault unseal → erp scale-up) is covered by [Backup & recovery](backup-and-recovery.md), the [storage concept](../lab-ecosystem/storage-and-recovery.md), the [factory recover playbooks](../factory-provisioning/ansible/06-recover.md), and the cluster-wide CLUSTER_RECOVERY.md runbook. + +## Index + +| Page | What it covers | Status | +|---|---|---| +| [Deployment](deployment.md) | The chart, the upstream image + custom entrypoint, the Postgres-over-pgbouncer wiring, the Vault CRDs (dynamic creds + static config), and the ingress | ✅ Active | +| [Backup & recovery](backup-and-recovery.md) | The document PVC, the `pg_dump`-based backup subsystem, restore procedure, and where erp sits in cluster-recovery ordering | ✅ Active | +| [Operations](operations.md) | The read-only `bin/arcodange` ops CLI and day-to-day operational tasks (table-ownership fix-ups, liveness checks, audits) | ✅ Active | + +## Maintenance rule + +> [!IMPORTANT] +> **When the erp repo changes shape, these pages change in the same PR.** If you alter the chart structure, the custom entrypoint, the Vault wiring, the document PVC, the backup subsystem, or the ops CLI, update this hub and the relevant child page in the same change. A reference map that drifts from the real `chart/`, `iac/`, and `backup/` sends agents confidently down dead paths, and that risk is highest for the lab's most data-critical app. 
+ +## Cross-references + +- [Applications hub](../applications/README.md) — the common four-ingredient app pattern; erp is its complex sibling, beside the webapp and url-shortener archetypes. +- [01 · factory](../lab-ecosystem/01-factory.md) — the ArgoCD app-of-apps that emits erp's `Application` CRD. +- [tools secrets-and-vso](../tools/secrets-and-vso.md) — the `app_roles` module + VSO runtime that delivers erp's dynamic DB creds and static config, and the pgbouncer pooler the pod connects through. +- [factory postgres-iac](../factory-provisioning/opentofu/postgres-iac.md) — the per-app `erp` PostgreSQL database + role erp runs on. +- [storage concept](../lab-ecosystem/storage-and-recovery.md) — how the 50Gi Longhorn RWX document PVC is provisioned and recovered. +- [factory recover playbooks](../factory-provisioning/ansible/06-recover.md) — the Ansible recovery steps that must precede scaling erp back up. +- [safe-prod-like-environment ADR](../../ADR/0001-safe-prod-like-environment.md) — why the lab keeps erp deployed prod-like and the data-criticality trade-offs behind it. 
diff --git a/vibe/guidebooks/erp/backup-and-recovery.md b/vibe/guidebooks/erp/backup-and-recovery.md new file mode 100644 index 0000000..e4528f5 --- /dev/null +++ b/vibe/guidebooks/erp/backup-and-recovery.md @@ -0,0 +1,207 @@ +[vibe](../../README.md) > [Guidebooks](../README.md) > [ERP](README.md) > **Backup & recovery** + +# Backup & recovery + +> **Status:** ✅ Active +> **Last Updated:** 2026-06-23 +> **Upstream:** [ERP](README.md) · [Deployment](deployment.md) +> **Downstream:** [Operations](operations.md) +> **Related:** [storage concept](../lab-ecosystem/storage-and-recovery.md) · [factory recover playbooks](../factory-provisioning/ansible/06-recover.md) · [tools secrets-and-vso](../tools/secrets-and-vso.md) · [factory postgres-iac](../factory-provisioning/opentofu/postgres-iac.md) + +`erp` is the lab's **single most data-critical application**, so it carries its own backup/restore subsystem layered on top of the cluster's storage and secrets machinery. Two independent data stores have to survive an incident: the **`erp` PostgreSQL database** (captured by a `pg_dump`) and the **uploaded documents** on the Longhorn PVC (captured by Longhorn snapshots/backups, *not* the `pg_dump`). This page covers both, the daily backup CronJob, the restore Job, and the load-bearing recovery ordering that keeps erp from crash-looping during a cluster rebuild. + +## Backup mechanism + +The recurring backup is an Ansible-deployed Kubernetes **CronJob** named `dolibarr-backup` in namespace `erp`, declared by [`ansible/arcodange/erp/playbooks/recurrentBackup.yml`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/ansible/arcodange/erp/playbooks/recurrentBackup.yml). Each scheduled tick spawns a one-shot `postgres:16.3` Job that takes a **logical** dump of the `erp` database, gzips it, and lands the archive on the same Longhorn PVC that holds erp's documents. + +The pipeline inside each run: + +1. **Detect version** — `psql ... 
-c "SELECT value FROM llx_const WHERE name='MAIN_VERSION_LAST_UPGRADE';"` reads the live Dolibarr version straight from the database, so the archive name records exactly which schema it came from. +2. **Dump + compress** — `pg_dump -d erp --no-tablespaces --inserts | gzip > `. The `--inserts` flag emits row-by-row `INSERT` statements (portable, version-tolerant restores) and `--no-tablespaces` strips host-specific tablespace clauses. +3. **Write to PVC** — the archive lands at `/documents/admin/backup/pg_dump_erp__.sql.gz`, where the container mounts the `erp` PVC with `subPath: documents/admin/backup`. +4. **Prune** — `find /documents/admin/backup -name "pg_dump_erp_*.sql.gz" -type f -mtime +15 -delete` removes anything older than 15 days. + +DB credentials are supplied by the **VSO-materialised `vso-db-credentials` secret** (`envFrom` + `PGPASSWORD` from its `password` key), the same dynamic `postgres/creds/erp` secret the pod uses — see [tools secrets-and-vso](../tools/secrets-and-vso.md). The Job runs `backoffLimit: 0` with `restartPolicy: Never`, so a failed run leaves an inspectable terminated pod rather than retrying blindly. 
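Steps 2 to 4 of the pipeline above can be sketched as two small shell functions. This is an illustration under stated assumptions, not the script embedded in `recurrentBackup.yml`. The function names and the `DUMP_CMD` override are hypothetical, and the archive path is passed in by the caller rather than derived here. One deliberate variation: the raw dump is written to disk before compression, so a failed dump is detected instead of being masked by the exit status of `gzip` at the end of a pipe.

```shell
# Hypothetical helpers sketching steps 2-4 of a backup run. The names and the
# DUMP_CMD override are assumptions, not taken from recurrentBackup.yml.

# dump_to_archive ARCHIVE: write a gzip-compressed dump to ARCHIVE, which must
# end in ".gz". The raw dump is written first so that a failing dump command
# never leaves a valid-looking archive behind.
dump_to_archive() {
  archive="$1"
  raw="${archive%.gz}"
  # Word-splitting of DUMP_CMD is intentional: it holds a command plus flags.
  ${DUMP_CMD:-pg_dump -d erp --no-tablespaces --inserts} > "$raw" || {
    rm -f "$raw"
    return 1
  }
  gzip -f "$raw"   # replaces "$raw" with "$raw.gz", i.e. the requested path
}

# prune_old_dumps DIR [DAYS]: delete dump archives matched by -mtime +DAYS
# (default 15), using the same find expression the page quotes.
prune_old_dumps() {
  dir="$1"
  days="${2:-15}"
  find "$dir" -name 'pg_dump_erp_*.sql.gz' -type f -mtime "+$days" -delete
}
```

Because the `-name` pattern is anchored on `pg_dump_erp_*.sql.gz`, anything else in the backup directory is left alone. Setting `DUMP_CMD` to a harmless command such as `echo hello` allows the flow to be dry-run without database access.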
+ +### Schedule, retention & artifacts + +| Property | Value | Source | +|---|---|---| +| Resource | CronJob `dolibarr-backup` (ns `erp`) | [recurrentBackup.yml](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/ansible/arcodange/erp/playbooks/recurrentBackup.yml) | +| Schedule | `0 4 * * *` (04:00 daily) | `spec.schedule` | +| Successful job history | `successfulJobsHistoryLimit: 3` | `spec` | +| Failed job history | `failedJobsHistoryLimit: 3` | `spec` | +| Retention | 15 days (`find -mtime +15 -delete`) | dump script | +| Dump image | `postgres:16.3` | `jobTemplate` container | +| Dump command | `pg_dump --no-tablespaces --inserts` (logical) | dump script | +| Compression | `gzip` (CronJob) / `tar -czf` (ad-hoc) | dump scripts | +| Archive path | `/documents/admin/backup/pg_dump_erp__.sql.gz` | mount + dump script | +| Mount | PVC `erp`, `subPath: documents/admin/backup` | `volumeMounts` | +| Failure policy | `backoffLimit: 0`, `restartPolicy: Never` | `jobTemplate` | + +### Ad-hoc & manual alternatives + +Two escape hatches exist for an on-demand dump outside the 04:00 schedule: + +| Tool | What it does | When to reach for it | +|---|---|---| +| [`ansible/.../playbooks/backup.yml`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/ansible/arcodange/erp/playbooks/backup.yml) | One-shot Ansible **Job** `dolibarr-backup` (`postgres:16.3`); fetches the ERP version by scraping `https://erp.arcodange.lab/`, dumps with the same `--no-tablespaces --inserts` flags, `tar -czf` into the PVC, and waits for completion | A single immediate dump driven from a control host that can reach the cluster API | +| [`backup/create_backup.sh`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/backup/create_backup.sh) | Pure `kubectl` shell: `kubectl run pg-dump-temp` (`postgres:16.3`) to `pg_dump -c` locally, then `kubectl cp` the archive into the running erp pod's `/var/www/documents/admin/backup/` | A laptop with `kubectl` access but no Ansible 
setup | + +> [!WARNING] +> **`pg_dump` and the server must match major versions.** The lab's Postgres is **16.3**, so every dump/restore container pins `postgres:16.3`. Dolibarr's built-in *Tools → Database backup* page (`/admin/tools/dolibarr_export.php`) historically shells out to the image's bundled `pg_dump` (e.g. 11.x), which aborts with `server version mismatch`. Use the CronJob, the Ansible playbooks, or `create_backup.sh` — never the in-app export against a newer server. + +## What is — and is NOT — in the dump + +The `pg_dump` captures **only the relational database**. Everything users *upload* lives on the Longhorn PVC and is protected by a completely separate mechanism. Conflating the two is the classic way to lose business records. + +| Data | Where it lives | Protected by | +|---|---|---| +| Invoices, third parties, accounting rows, config rows (`llx_*` tables) | `erp` Postgres DB | `pg_dump` archives (this page) | +| Uploaded documents, generated PDFs, attachments | `/var/www/documents` on the Longhorn RWX PVC | **Longhorn** snapshots / backups | +| Custom modules / overrides | `/var/www/html/custom` on the same PVC | **Longhorn** snapshots / backups | + +> [!IMPORTANT] +> A `pg_dump` alone does **not** make erp recoverable. A full recovery needs *both* the latest `pg_dump_erp_*.sql.gz` **and** the Longhorn-restored document volume. The backup archive itself sits on that same PVC (`/documents/admin/backup`), so it rides along with the Longhorn snapshot — but treat the database and the documents as two artifacts that must be restored together. See the [storage concept](../lab-ecosystem/storage-and-recovery.md) for how the Longhorn volume is snapshotted and recovered. + +## Restore + +The restore is the Ansible-driven Job `dolibarr-restore` from [`ansible/.../playbooks/restore.yml`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/ansible/arcodange/erp/playbooks/restore.yml) (`postgres:16.3`). 
It **auto-discovers the most recent** `pg_dump_erp_*.sql.gz` via `ls -t ... | head -n1`, or you can pin a specific archive with `-e backup_file=...`. It runs `backoffLimit: 0` / `restartPolicy: Never` so a failed restore leaves a terminated pod you can inspect. **Restoring into a live database corrupts state** — scale erp to zero first. + +Ordered procedure: + +1. **[AGENT]** Confirm the cluster is healthy and inspect available archives (read-only). +2. **[HUMAN]** Scale the `erp` Deployment to **0** to stop all writes. +3. **[HUMAN]** Run the restore Job (latest archive, or a pinned `backup_file`); it `tar -xzf`s the archive and `psql -f`s it into the `erp` DB. +4. **[AGENT]** Watch the Job to completion and read its logs. +5. **[HUMAN]** Scale the `erp` Deployment back to **1**. +6. **[AGENT]** Validate erp is serving and the data is present. + +```bash +# [AGENT] read-only: cluster health + list backup archives on the PVC +kubectl get deploy,pods -n erp +kubectl exec -n erp deploy/erp -- ls -t /var/www/documents/admin/backup/pg_dump_erp_*.sql.gz +``` + +```bash +# [HUMAN] prod-mutating: stop writes before restoring +kubectl scale deploy/erp -n erp --replicas=0 +kubectl rollout status deploy/erp -n erp --watch=false # expect 0 replicas +``` + +```bash +# [HUMAN] prod-mutating: run the restore Job +# default = newest pg_dump_erp_*.sql.gz auto-discovered on the PVC +ansible-playbook ansible/arcodange/erp/playbooks/restore.yml +# or pin an explicit archive: +ansible-playbook ansible/arcodange/erp/playbooks/restore.yml \ + -e backup_file=/documents/admin/backup/pg_dump_erp_22.0.4_2606231819.sql.gz +``` + +```bash +# [AGENT] read-only: follow the restore Job and its logs +kubectl get job/dolibarr-restore -n erp -o wide +kubectl logs -n erp job/dolibarr-restore +``` + +```bash +# [HUMAN] prod-mutating: bring erp back up +kubectl scale deploy/erp -n erp --replicas=1 +kubectl rollout status deploy/erp -n erp +``` + +```bash +# [AGENT] read-only: validate erp is serving 
after restore +kubectl get pods -n erp +kubectl exec -n erp deploy/erp -- curl -sf -o /dev/null -w '%{http_code}\n' http://localhost/ +``` + +> [!WARNING] +> **Always scale erp to 0 before restoring.** The restore loads SQL straight into the live `erp` database; concurrent writes from a running Dolibarr pod produce a half-restored, inconsistent state. Scaling back to 1 only after the Job succeeds is part of the procedure, not an optional flourish. + +## Recovery ordering (cluster rebuild) + +> [!CAUTION] +> **Vault MUST be unsealed before erp is scaled up.** The Dolibarr pod has no DB credentials of its own — it depends entirely on VSO materialising `vso-db-credentials` from `postgres/creds/erp` (`DOLI_DB_USER` / `DOLI_DB_PASSWORD`). If erp is scaled up while Vault is still sealed, VSO cannot reconcile the secret and the pod crash-loops with no database access. During a cluster rebuild the order is fixed: +> +> 1. **Recover Longhorn volumes** — bring the document PVC (and the `documents/admin/backup` archives riding on it) back online. +> 2. **Unseal Vault** — so VSO can issue erp's dynamic DB credentials and static config. +> 3. **Scale erp to 1** — only now does the pod come up with usable creds. +> 4. **(Optional) restore data** — if the DB needs rolling back to a `pg_dump`, scale to 0, run the restore Job, scale back to 1 (see [Restore](#restore) above). +> +> This sequence is the storage→secrets→apps backbone described in the [storage concept](../lab-ecosystem/storage-and-recovery.md) and executed by the [factory recover playbooks](../factory-provisioning/ansible/06-recover.md); the cluster-wide ordering lives in the CLUSTER_RECOVERY.md runbook. + +## The ownership fix + +After activating a new Dolibarr module — or whenever the dynamic DB user rotates and creates tables under a fresh role — `public` schema tables can end up owned by a credential that no longer exists, breaking subsequent migrations and dumps. 
Two SQL helpers reassign ownership back to the stable **`erp_role`**: + +| Script | Mechanism | Use | +|---|---|---| +| [`backup/erp_role_as_table_owner.sql`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/backup/erp_role_as_table_owner.sql) | Loops every `public` table and `ALTER TABLE ... OWNER TO erp_role` | Force-set ownership table-by-table | +| [`chart/scripts/update_ownership.sql`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/scripts/update_ownership.sql) | Detects the current schema owner and `REASSIGN OWNED BY TO erp_role` only when it differs | The idempotent, chart-shipped fix | + +The chart wires `update_ownership.sql` into a `pg-fix-table-ownership` CronJob; trigger it on demand after a module activation: + +```bash +# [HUMAN] prod-mutating: reassign public-schema table ownership to erp_role +kubectl create job \ + --from=cronjob/pg-fix-table-ownership \ + pg-fix-table-ownership-manual-trigger-$(date +%Y%m%d%H%M%S) \ + -n kube-system +``` + +Run this **before** a backup if you suspect ownership drift, so the dump records the correct owner. More on day-to-day fix-ups and audits in [Operations](operations.md). + +## Flow + +```mermaid +%%{init: {'theme': 'base'}}%% +flowchart LR + classDef sched fill:#2563eb,stroke:#1e40af,color:#fff + classDef proc fill:#059669,stroke:#047857,color:#fff + classDef store fill:#7c3aed,stroke:#6d28d9,color:#fff + classDef db fill:#b45309,stroke:#92400e,color:#fff + + VSO["VSO secret
<br/>vso-db-credentials<br/>(postgres/creds/erp)"]:::store + CRON["CronJob dolibarr-backup<br/>schedule 0 4 * * *"]:::sched + DUMPJOB["pg_dump Job<br/>postgres:16.3"]:::proc + GZIP["gzip stream<br/>--inserts --no-tablespaces"]:::proc + PVC["Longhorn RWX PVC<br/>/documents/admin/backup<br/>pg_dump_erp_*.sql.gz"]:::store + RESTOREJOB["Restore Job dolibarr-restore<br/>postgres:16.3"]:::proc + PSQL["psql -f dump.sql"]:::proc + DB["erp Postgres DB<br/>
via pgbouncer.tools"]:::db + + CRON -- "spawns" --> DUMPJOB + DUMPJOB -- "pg_dump" --> GZIP + GZIP -- "writes archive" --> PVC + PVC -- "ls -t latest .sql.gz" --> RESTOREJOB + RESTOREJOB -- "tar -xzf then" --> PSQL + PSQL -- "loads into" --> DB + DUMPJOB -- "dumps from" --> DB + VSO -. "DB creds" .-> DUMPJOB + VSO -. "DB creds" .-> RESTOREJOB +``` + +1. The **CronJob `dolibarr-backup`** fires at `0 4 * * *` and **spawns** a `pg_dump` Job (`postgres:16.3`). +2. The Job **dumps** the live `erp` database (logical, `--inserts --no-tablespaces`) — reading credentials from the **VSO `vso-db-credentials`** secret. +3. The dump streams through **gzip** and the resulting `pg_dump_erp__.sql.gz` is **written** to `/documents/admin/backup` on the **Longhorn RWX PVC**; archives older than 15 days are pruned. +4. On restore, the **`dolibarr-restore` Job** picks the **newest** `.sql.gz` (`ls -t | head -n1`, or a pinned `backup_file`) from the PVC — also using the **VSO** credentials. +5. The restore Job **`tar -xzf`s** the archive and **`psql -f`s** it back **into** the `erp` database (with erp scaled to 0 first). + +## Gotchas + +> [!WARNING] +> - **15-day retention only.** The CronJob deletes any `pg_dump_erp_*.sql.gz` older than 15 days. If you need long-term or compliance copies, pull archives **off-cluster** before they age out — nothing here keeps a month-old dump. +> - **Version match is mandatory.** `pg_dump`/`psql` major version must equal the server's (16.x). Every Job pins `postgres:16.3`; the in-app Dolibarr export against the newer server aborts with `server version mismatch`. +> - **Scale to 0 before restore.** Restoring into a running erp produces an inconsistent database; scale the Deployment to 0, restore, then back to 1. +> - **Vault unseal precedes scale-up.** erp's DB creds come from VSO; a sealed Vault means a crash-looping pod. Follow the [recovery ordering](#recovery-ordering-cluster-rebuild) on any rebuild. 
+> - **The admin/Postgres password lives in OpenTofu state.** The per-app database and role are declared in IaC, so the authoritative credential material is held in the **TF state** — treat that state as a secret and recover it alongside Vault. See [factory postgres-iac](../factory-provisioning/opentofu/postgres-iac.md). + +## Cross-references + +- [Deployment](deployment.md) — the chart, the document PVC, and the Vault CRDs (dynamic creds + static config) that this subsystem depends on. +- [Operations](operations.md) — day-to-day operational tasks including the table-ownership fix-ups and liveness checks. +- [storage concept](../lab-ecosystem/storage-and-recovery.md) — how the Longhorn document PVC (and its riding backup archives) is snapshotted and recovered. +- [factory recover playbooks](../factory-provisioning/ansible/06-recover.md) — the Ansible recovery steps that must run before erp is scaled back up. +- [tools secrets-and-vso](../tools/secrets-and-vso.md) — the VSO runtime that materialises `vso-db-credentials`, feeding both the backup and restore Jobs. +- [factory postgres-iac](../factory-provisioning/opentofu/postgres-iac.md) — the per-app `erp` PostgreSQL database + role, and the TF state that holds its admin password. 
diff --git a/vibe/guidebooks/erp/deployment.md b/vibe/guidebooks/erp/deployment.md new file mode 100644 index 0000000..4c7ad07 --- /dev/null +++ b/vibe/guidebooks/erp/deployment.md @@ -0,0 +1,189 @@ +[vibe](../../README.md) > [Guidebooks](../README.md) > [ERP](README.md) > **Deployment** + +# Deployment + +> **Status:** ✅ Active +> **Last Updated:** 2026-06-23 +> **Upstream:** [ERP hub](README.md) · [Applications hub](../applications/README.md) · [01 · factory](../lab-ecosystem/01-factory.md) +> **Downstream:** [Backup & recovery](backup-and-recovery.md) · [Operations](operations.md) +> **Related:** [tools secrets-and-vso](../tools/secrets-and-vso.md) · [factory postgres-iac](../factory-provisioning/opentofu/postgres-iac.md) · [factory ci-apply-flow](../factory-provisioning/opentofu/ci-apply-flow.md) · [naming-conventions](../lab-ecosystem/naming-conventions.md) · [webapp](../applications/webapp.md) + +This page maps how **erp** is deployed: the chart that wraps the **upstream Dolibarr image**, the runtime trick that makes a MySQL-assuming application speak **PostgreSQL**, the **50Gi document PVC** that holds every business record, the Vault CRDs that feed it credentials, and the OpenTofu + CI that declare its Vault objects. It is the most data-critical app in the lab; the `iac/` runs through the same `tofu apply` pipeline as every other app — see [factory ci-apply-flow](../factory-provisioning/opentofu/ci-apply-flow.md). + +## 1 · App & image + +erp is **Dolibarr** pulled straight from the upstream `dolibarr/dolibarr` Docker Hub image — there is **no repo-built image** and no `Dockerfile`. The chart adapts the upstream container at runtime instead of forking it. 
+ +| Field | Value | Source | +|---|---|---| +| Application | Dolibarr ERP/CRM (PHP / Apache) | [chart/Chart.yaml](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/Chart.yaml) | +| Version | **22.0.4** (chart `appVersion: "22.0.4"`) | [chart/Chart.yaml](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/Chart.yaml) | +| Image | `dolibarr/dolibarr:22.0.4` — upstream, `pullPolicy: IfNotPresent` | [chart/values.yaml](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/values.yaml) | +| Image tag | `image.tag` empty → defaults to chart `appVersion` (`{{ .Values.image.tag \| default .Chart.AppVersion }}`) | [chart/templates/deployment.yaml](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/templates/deployment.yaml) | +| Served at | `https://erp.arcodange.lab` (internal only) | [chart/templates/config.yaml](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/templates/config.yaml) | +| Container command | `["/bin/bash", "/usr/local/bin/custom-entrypoint.sh", "apache2-foreground"]` | [chart/templates/deployment.yaml](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/templates/deployment.yaml) | + +> [!NOTE] +> Because erp consumes an upstream image, there is **no `docker-build-and-push` workflow** in `.gitea/workflows/` — unlike [webapp](../applications/webapp.md), which builds and pushes its own image. erp's only workflow is the OpenTofu/Vault one (see [§8](#8--ci--vaultyaml)). + +## 2 · Postgres, not MySQL + +Dolibarr classically assumes MySQL, but erp runs on **PostgreSQL**. Two pieces make that work, both at startup, both inside the [custom entrypoint](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/scripts/custom_entrypoint.sh) which wraps the upstream `docker-run.sh`: + +1. **MySQL → psql rewrite.** When `DOLI_DB_TYPE == "pgsql"`, the entrypoint `sed`s the upstream `/usr/local/bin/docker-run.sh` in place, replacing its `mysql -u ... 
< ${file}` SQL invocation with `PGPASSWORD=... psql -U ... -h ... -p ... -d ... < ${file}`. +2. **Apache `ServerName`.** It strips the scheme from `DOLI_URL_ROOT` and sets the Apache `ServerName` in `000-default.conf` (and appends to `apache2.conf`) so the vhost matches `erp.arcodange.lab`. +3. It then `exec`s the original `docker-run.sh "$@"` (i.e. `apache2-foreground`). + +The non-secret database wiring lives in the `erp-config` ConfigMap, injected via `envFrom`: + +| Env var | Value | Meaning | +|---|---|---| +| `DOLI_DB_TYPE` | `pgsql` | Selects PostgreSQL — triggers the entrypoint rewrite | +| `DOLI_DB_HOST` | `pgbouncer.tools` | Connects through the [tools pgbouncer pooler](../tools/secrets-and-vso.md) | +| `DOLI_DB_HOST_PORT` | `5432` | Pooler port | +| `DOLI_DB_NAME` | `erp` | The per-app database (provisioned by [factory postgres-iac](../factory-provisioning/opentofu/postgres-iac.md)) | +| `DOLI_URL_ROOT` | `https://erp.arcodange.lab` | Drives the Apache `ServerName` | +| `DOLI_ENABLE_MODULES` | `Societe,Facture` | Third-parties + invoicing modules | +| `DOLI_COMPANY_NAME` | `Arcodange` | Seeded company name | +| `DOLI_COMPANY_COUNTRYCODE` | `FR` | Seeded country | +| `PHP_INI_DATE_TIMEZONE` | `Europe/Paris` | PHP timezone | +| `DOLI_AUTH` | `dolibarr` | Native Dolibarr auth | +| `DOLI_CRON` | `0` | In-container cron disabled | + +`DOLI_DB_USER` / `DOLI_DB_PASSWORD` are **not** in the ConfigMap — they come from Vault (see [§5](#5--vault-crds)). + +> [!WARNING] +> The psql rewrite is a **textual `sed` against an upstream file**. If a future Dolibarr image changes the exact `mysql ... < ${file}` line in `docker-run.sh`, the substitution silently stops matching and SQL imports fall back to `mysql` (which is absent) — startup SQL then fails. Re-verify the entrypoint pattern whenever the `appVersion` is bumped. + +## 3 · Persistence — the document PVC + +A single PVC named `erp` holds every business record. It is the most important object in the chart. 
+ +| Field | Value | Source | +|---|---|---| +| Name | `erp` (`erp.fullname`) | [chart/templates/pvc.yaml](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/templates/pvc.yaml) | +| Access mode | `ReadWriteMany` (RWX) | [chart/templates/pvc.yaml](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/templates/pvc.yaml) | +| Size | `50Gi` | [chart/templates/pvc.yaml](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/templates/pvc.yaml) | +| StorageClass | `longhorn` | [chart/templates/pvc.yaml](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/templates/pvc.yaml) | +| Retention | annotation `helm.sh/resource-policy: keep` — survives a `helm uninstall` | [chart/templates/pvc.yaml](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/templates/pvc.yaml) | + +The Deployment mounts the **same PVC** at three paths via `subPath`: + +| Mount path | subPath | Holds | +|---|---|---| +| `/var/www/documents` | `documents` | **Invoices, attachments, generated PDFs — the critical business data** | +| `/var/www/html/custom` | `custom` | Custom/installed Dolibarr modules | +| `/var/backups` | `backups` | In-pod backup landing area | + +> [!CAUTION] +> **Losing this PVC loses all business documents.** `/var/www/documents` contains the only copy of uploaded invoices, attachments, and generated PDFs — these are real accounting records, not regenerable cache. The `helm.sh/resource-policy: keep` annotation protects it from a chart uninstall, but it does **not** protect against a Longhorn-volume loss or a node failure. Treat the PVC as primary data and rely on [Backup & recovery](backup-and-recovery.md) for off-volume copies. 
+ +## 4 · Chart shape + +| Aspect | Value | Source | +|---|---|---| +| `replicaCount` | **1** (single replica) | [chart/values.yaml](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/values.yaml) | +| Autoscaling | **disabled** (`autoscaling.enabled: false`; no HPA rendered) | [chart/values.yaml](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/values.yaml) | +| Service | `ClusterIP`, port `80` → `targetPort: http` | [chart/templates/service.yaml](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/templates/service.yaml) | +| Ingress host | `erp.arcodange.lab`, path `/` (`Prefix`) | [chart/templates/ingress.yaml](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/templates/ingress.yaml) | +| Ingress entrypoint | Traefik `websecure` + `router.tls: "true"` | [chart/values.yaml](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/values.yaml) | +| TLS cert | `certresolver: letsencrypt`, domain `arcodange.lab` / SAN `erp.arcodange.lab` | [chart/values.yaml](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/values.yaml) | +| Middleware | `localIp@file` — **internal only**, no public `.fr` host | [chart/values.yaml](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/values.yaml) | +| revisionHistoryLimit | `5` | [chart/templates/deployment.yaml](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/templates/deployment.yaml) | + +> [!WARNING] +> **Single replica on an RWX PVC.** `replicaCount: 1` with autoscaling off means erp has **no redundancy** — a node or pod failure is a full outage until rescheduled. A credential rotation or config change triggers a rollout that briefly takes the only pod down. This is deliberate for a stateful, low-traffic internal app, but do not raise the replica count without first confirming Dolibarr tolerates concurrent writes to the shared `documents` volume. 
+ +The Deployment carries `configmap-hash` / `configmap2-hash` / `configmap3-hash` annotations (sha256 of the three ConfigMaps) so a change to config or the init scripts forces a pod roll. + +## 5 · Vault CRDs + +erp cannot start without VSO-injected credentials. Three CRDs (from the chart) wire it to Vault — see [tools secrets-and-vso](../tools/secrets-and-vso.md) for the VSO runtime. + +| CRD | Name | What it does | +|---|---|---| +| `VaultAuth` | `auth` | Kubernetes auth — `mount: kubernetes`, `role: erp`, ServiceAccount `erp`, audience `vault`. Every other CRD references it via `vaultAuthRef: auth`. | +| `VaultStaticSecret` | `vault-kv-app` | `type: kv-v2`, `mount: kvv2`, `path: erp/config` → k8s Secret **`secretkv`**, `refreshAfter: 24h`. Injected via `envFrom` `secretRef`. Holds `DOLI_ADMIN_LOGIN`, `DOLI_ADMIN_PASSWORD`, `DOLI_INSTANCE_UNIQUE_ID`. | +| `VaultDynamicSecret` | `vso-db` | `mount: postgres`, `path: creds/erp` → k8s Secret **`vso-db-credentials`** (rotating DB user/password). `rolloutRestartTargets` the erp Deployment so a rotation rolls the pod. `DOLI_DB_USER` / `DOLI_DB_PASSWORD` are wired into the pod via `secretKeyRef`. | + +Credential delivery in the Deployment: + +- `envFrom: secretRef: secretkv` — static admin config + instance UUID. +- `env: DOLI_DB_USER` / `DOLI_DB_PASSWORD` ← `secretKeyRef` on `vso-db-credentials` (`username` / `password`). + +| Sources | [chart/templates/vaultauth.yaml](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/templates/vaultauth.yaml) · [vaultsecret.yaml](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/templates/vaultsecret.yaml) · [vaultdynamicsecret.yaml](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/templates/vaultdynamicsecret.yaml) | +|---|---| + +## 6 · Init scripts (mounted from ConfigMaps) + +Three scripts ship in `chart/scripts/` and are mounted into the pod via ConfigMaps. 
The entrypoint runs at container start; the `before-starting.d/` scripts run before Apache. + +| Script | Mounted at | Role | +|---|---|---| +| [custom_entrypoint.sh](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/scripts/custom_entrypoint.sh) | `/usr/local/bin/custom-entrypoint.sh` (ConfigMap `dolibarr-custom-entrypoint-script`) | Wraps `docker-run.sh`: MySQL→psql `sed` rewrite + Apache `ServerName` from `DOLI_URL_ROOT` (see [§2](#2--postgres-not-mysql)) | +| [update_conf_db_credentials.sh](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/scripts/update_conf_db_credentials.sh) | `/var/www/scripts/before-starting.d/` (ConfigMap `dolibarr-before-start-scripts`) | `sed`s the Vault-injected `DOLI_DB_USER` / `DOLI_DB_PASSWORD` into Dolibarr's `conf.php` at startup, so the running app uses the freshly rotated creds | +| [update_ownership.sql](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/scripts/update_ownership.sql) | `/var/www/scripts/before-starting.d/update_table_ownership.sql` | `REASSIGN OWNED BY` the current `public`-schema owner → `erp_role`. Run if you hit read-only-filesystem / permission errors after a credential change | + +> [!CAUTION] +> **The ownership SQL must run after the DB role behind the dynamic creds changes.** Because `postgres/creds/erp` mints a **new** Postgres user on each rotation, freshly created tables can end up owned by a transient user. If the in-pod `update_table_ownership.sql` cannot write its temp file (`Read-only file system`), it is skipped and Dolibarr eventually loses query rights once Vault rotates creds. The fix is to run that SQL by hand against the `erp` database — see [Operations](operations.md). The script reassigns ownership to the stable **`erp_role`** created by [factory postgres-iac](../factory-provisioning/opentofu/postgres-iac.md). 
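The "reassign only when the owner differs" guard described above can be sketched in shell. This is an illustration under stated assumptions, not the contents of `update_ownership.sql`: the `reassign_stmt` function is hypothetical, and it assumes the current owner name has already been looked up by other means. The transient role is double-quoted because generated usernames may contain hyphens, which PostgreSQL only accepts in quoted identifiers.

```bash
# Hypothetical helper: prints a REASSIGN statement only when it is needed.
reassign_stmt() {
  local current_owner="$1"            # owner found on the public schema objects
  local target_role="${2:-erp_role}"  # stable role; erp_role per the guidebook
  if [ "$current_owner" = "$target_role" ]; then
    return 0  # already owned by the stable role: emit nothing
  fi
  # Assumes the owner name contains no double quote (no escaping is done here).
  printf 'REASSIGN OWNED BY "%s" TO %s;\n' "$current_owner" "$target_role"
}
```

A [HUMAN] operator could pipe the output into `psql` against the `erp` database (connection settings assumed to come from the environment); when the owner is already `erp_role` the function prints nothing, so re-running it is harmless.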
+ +## 7 · iac/ — Vault objects via the shared module + +erp's `iac/` declares only its Vault footprint; the Postgres database and `erp_role` themselves come from factory ([postgres-iac](../factory-provisioning/opentofu/postgres-iac.md)). + +| Element | Value | Source | +|---|---|---| +| Shared module | `app_roles` from `arcodange-org/tools` (`hashicorp-vault/iac/modules/app_roles`, `ref=main`), `name = "erp"` | [iac/main.tf](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/iac/main.tf) | +| What the module provisions | the `postgres/creds/erp` dynamic role + a Kubernetes auth `role erp` + the `kvv2` path prefix | [tools secrets-and-vso](../tools/secrets-and-vso.md) | +| Admin password | `random_password.admin_initial_password` (length 32) → `DOLI_ADMIN_PASSWORD` | [iac/main.tf](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/iac/main.tf) | +| Instance ID | `random_uuid.dolibarr_id` with `lifecycle { prevent_destroy = true }` → `DOLI_INSTANCE_UNIQUE_ID` (encryption salt + module licensing) | [iac/main.tf](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/iac/main.tf) | +| KV secret | `vault_kv_secret_v2` at `config` (i.e. 
`erp/config`), data = `DOLI_ADMIN_LOGIN` + `DOLI_ADMIN_PASSWORD` + `DOLI_INSTANCE_UNIQUE_ID` | [iac/main.tf](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/iac/main.tf) | +| Backend | GCS bucket `arcodange-tf`, prefix `erp/main` | [iac/backend.tf](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/iac/backend.tf) | +| Vault provider | `address = https://vault.arcodange.lab`, `auth_login_jwt` `mount = gitea_jwt`, `role = gitea_cicd_erp`, provider `vault` `4.4.0` | [iac/providers.tf](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/iac/providers.tf) | + +The Postgres dynamic role created here `GRANT`s **`erp_role`**, the stable role created by factory ([postgres-iac](../factory-provisioning/opentofu/postgres-iac.md)), so every rotated DB user inherits the right schema privileges. The `erp/config` KV secret written here is the same one the chart's `VaultStaticSecret` reads back as `secretkv`, closing the loop between `iac/` (writes config) and `chart/` (consumes it). + +> [!WARNING] +> **The OpenTofu state holds the plaintext admin password.** `random_password.admin_initial_password` is stored unencrypted in the GCS state at `arcodange-tf/erp/main`. Anyone with read access to that state bucket can read `DOLI_ADMIN_PASSWORD`. Treat the `erp/main` state prefix as a secret; do not copy it locally unprotected. The `random_uuid` instance ID is similarly in state but is guarded by `prevent_destroy` because losing it breaks decryption of stored data and invalidates purchased modules. 
+ +## 8 · CI — vault.yaml + +| Element | Value | Source | +|---|---|---| +| Workflow | `Hashicorp Vault` (`.gitea/workflows/vault.yaml`) | [.gitea/workflows/vault.yaml](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/.gitea/workflows/vault.yaml) | +| Triggers | `workflow_dispatch`, plus `push` / `pull_request` on `iac/*.tf` | [.gitea/workflows/vault.yaml](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/.gitea/workflows/vault.yaml) | +| Job 1 | `gitea_vault_auth` — mints a Gitea OIDC JWT for Vault | [.gitea/workflows/vault.yaml](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/.gitea/workflows/vault.yaml) | +| Job 2 | `tofu` — `dflook/terraform-apply` over `iac/`, `auto_approve: true`, **OpenTofu `1.8.2`** | [.gitea/workflows/vault.yaml](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/.gitea/workflows/vault.yaml) | +| Secrets | `TERRAFORM_SSH_KEY` (SSH key to clone the `app_roles` module from `tools`) + `HOMELAB_CA_CERT` (Vault self-signed CA) + `GOOGLE_BACKEND_CREDENTIALS` (GCS state) | [.gitea/workflows/vault.yaml](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/.gitea/workflows/vault.yaml) | + +This `tofu apply` follows the lab-wide pattern documented in [factory ci-apply-flow](../factory-provisioning/opentofu/ci-apply-flow.md). There is **no application-image build step** — the chart is delivered by ArgoCD ([01 · factory](../lab-ecosystem/01-factory.md)), and the image is upstream. + +## 9 · `` convention mapping + +erp follows the lab's per-app naming convention — see [naming-conventions](../lab-ecosystem/naming-conventions.md). 
With ` = erp`: + +| `` slot | erp value | +|---|---| +| Repo | `arcodange-org/erp` | +| K8s namespace | `erp` | +| Internal host | `erp.arcodange.lab` | +| ServiceAccount | `erp` | +| Vault Kubernetes auth role | `erp` | +| Vault KV path | `kvv2` `erp/config` → Secret `secretkv` | +| Vault dynamic DB path | `postgres/creds/erp` → Secret `vso-db-credentials` | +| Postgres database | `erp` | +| Postgres stable role | `erp_role` | +| OpenTofu state prefix | GCS `arcodange-tf/erp/main` | +| Gitea CI Vault role | `gitea_cicd_erp` | +| Document PVC | `erp` (50Gi Longhorn RWX) | + +## Cross-references + +- [ERP hub](README.md) — the orientation map for the whole guidebook. +- [Backup & recovery](backup-and-recovery.md) — protecting the 50Gi document PVC and the `erp` database; cluster-recovery ordering (unseal Vault before scaling erp up). +- [Operations](operations.md) — day-to-day operational tasks, including running the table-ownership SQL by hand. +- [tools secrets-and-vso](../tools/secrets-and-vso.md) — the `app_roles` module, the VSO runtime that materialises `secretkv` + `vso-db-credentials`, and the `pgbouncer.tools` pooler. +- [factory postgres-iac](../factory-provisioning/opentofu/postgres-iac.md) — provisions the `erp` database and the stable `erp_role` the dynamic creds inherit. +- [factory ci-apply-flow](../factory-provisioning/opentofu/ci-apply-flow.md) — the shared `tofu apply` CI pattern erp's `vault.yaml` follows. +- [naming-conventions](../lab-ecosystem/naming-conventions.md) — the `` slots filled in [§9](#9--app-convention-mapping). +- [webapp](../applications/webapp.md) — the archetype that *does* build its own image; erp differs by reusing the upstream Dolibarr image. 
diff --git a/vibe/guidebooks/erp/operations.md b/vibe/guidebooks/erp/operations.md new file mode 100644 index 0000000..9e9b54b --- /dev/null +++ b/vibe/guidebooks/erp/operations.md @@ -0,0 +1,173 @@ +[vibe](../../README.md) > [Guidebooks](../README.md) > [ERP](README.md) > **Operations** + +# ERP Operations + +> **Status:** ✅ Active +> **Last Updated:** 2026-06-23 +> **Upstream:** [ERP hub](README.md) · [Deployment](deployment.md) +> **Downstream:** [Backup & recovery](backup-and-recovery.md) +> **Related:** [Applications hub](../applications/README.md) · [Web app](../applications/webapp.md) · [Secrets & VSO](../tools/secrets-and-vso.md) · [Postgres IaC](../factory-provisioning/opentofu/postgres-iac.md) + +This page covers day-2 operation of the Arcodange Dolibarr ERP: the read-only operations CLI, the static identity assets, the Playwright bootstrap test suite, and the recurring scaling / module-activation / storage chores. For how the workload is deployed onto the cluster, see [Deployment](deployment.md). For backups and disaster recovery, see [Backup & recovery](backup-and-recovery.md). + +--- + +## 1. The read-only ops CLI — `bin/arcodange` + +[`bin/arcodange`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/bin/arcodange) is a Bash dispatcher that gives a human-friendly entry point to safe, **strictly read-only** Dolibarr operations against `erp.arcodange.lab`. Every subcommand `exec`s a script under `.claude/skills//scripts/`; the dispatcher itself only locates the project root (via `git rev-parse --show-toplevel`, falling back to walking up from the script) and routes arguments. + +> [!IMPORTANT] +> The CLI authenticates with credentials read from [`.claude/skills/dolibarr/.env`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/bin/arcodange) — a **gitignored** file expected at **mode `600`**. The underlying API key belongs to the `ai_agent` service account, which has **no write permissions**: the CLI cannot mutate Dolibarr. 
Corrections always go through the Dolibarr web UI. + +### Command map + +| Command | Subcommand | What it does | +| --- | --- | --- | +| `ping` | — | `GET /status` — liveness probe + reports the running Dolibarr version | +| `whoami` | — | `GET /users/info` — confirms auth is the `ai_agent` service account | +| `invoice` | `list [--since YYYY-MM-DD]` | Table of KissMetrics customer invoices with payment state | +| `invoice` | `audit ` | JSON facts + PDF mandatory-mention audit for one invoice | +| `payments` | `state [--since YYYY-MM-DD]` | Per-invoice TTC vs payments reconciliation | +| `payments` | `timeline [--year\|--since\|--until]` | Payment timeline with cumulative balance | +| `payments` | `by-month [--year\|--all-clients]` | Monthly cash-receipt aggregation | +| `tva` | `summary [--year\|--since\|--until]` | CA3-ready monthly TVA summary (collectée − déductible) | +| `tva` | `collect` / `collect-detail` | TVA collectée by month × rate (CA3 A1/A4/E2) + per-line audit | +| `tva` | `deductible` / `deductible-detail` | TVA déductible by month × rate (CA3 19/20/17+24) + per-line audit | +| `thirdparty` | `audit ` | Country-aware completeness audit for one thirdparty | +| `thirdparty` | `audit-all [--clients-only\|--suppliers-only]` | Audit every visible thirdparty | +| `templates` | `list [--max-id N]` / `inspect ` | Enumerate / health-check recurring invoice templates | +| `bank` | `probe` / `balance` / `match` / `qonto-transactions` / `wise-transactions` / `curl` | Qonto + Wise bank data and Dolibarr reconciliation | +| `email` | `list` / `inspect ` / `curl` | Supplier-invoice ingestion from the Zoho mailbox | +| `snapshot` | `--out FILE` (or `--print-only`) | Bundle the full read-only state into one JSON dump (with `content_hash`) | +| `curl` | `` | Raw read-only call through `dol-curl.sh` (e.g. 
`arcodange curl /invoices/12`) | +| `help` | `[command]` | Full command tree, or per-command help | + +### Health checks first + +| Check | Command | Expected outcome | +| --- | --- | --- | +| Is Dolibarr up? | `bin/arcodange ping` | HTTP `200` + Dolibarr version string | +| Is auth wired? | `bin/arcodange whoami` | The `ai_agent` user record | +| Full state dump | `bin/arcodange snapshot --out /tmp/erp.json` | One JSON file with a `content_hash` | + +> [!TIP] +> `snapshot` is the fastest way to capture a point-in-time, read-only view of the ERP (invoices, payments, TVA, thirdparties, templates) for offline diffing or for attaching to an incident. It does not touch the database — it only reads. + +The CLI's per-domain credentials beyond Dolibarr (Qonto/Wise for `bank`, Zoho OAuth for `email`) also live in the same gitignored `.env`. The skills' `SKILL.md` files remain the source of business-logic documentation; the CLI is just the ergonomic front door. + +--- + +## 2. Static identity assets — `static/` + +[`static/`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/static) holds the company's legal identity and branding, consumed by the Playwright bootstrap suite when it configures a fresh Dolibarr install. 
+ +| Path | Purpose | +| --- | --- | +| [`static/config/company.json`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/static/config/company.json) | Legal identity used for Dolibarr company setup and display | +| [`static/img/logo512.png`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/static/img/logo512.png) | Company logo (referenced from `company.json` as `$IMG/logo512.png`) | +| [`static/img/loginBackground.jpeg`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/static/img/loginBackground.jpeg) | Login-page background image | + +`company.json` carries two blocks — `info` (postal/contact identity) and `ID` (legal identity): + +| Field | Value | +| --- | --- | +| Raison sociale | Arcodange | +| Forme juridique | SAS (Société par actions simplifiée) | +| Adresse | 73 Boulevard de l'Yerres, 91000 Évry-Courcouronnes, France (FR) | +| Site / email | arcodange.fr · gabrielradureau@arcodange.fr | +| SIREN / SIRET | (legal registration identifiers) | +| NAF / APE | 62.02A | +| N° TVA | (intra-community VAT number) | +| Capital | (share capital) | +| RCS | R.C.S. Évry | +| Mois début d'exercice | Juillet | +| Logo | `$IMG/logo512.png` | + +> [!NOTE] +> The `$IMG` token in `company.json` resolves to `static/img/` via the test harness's `IMG_FOLDER` (see §3). The same image folder feeds the optional login-page background and logo upload during display setup. + +--- + +## 3. Bootstrap test suite — `test/` + +[`test/`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/test) is a **Deno + Playwright** UI suite that drives a real browser through the Dolibarr first-install and admin configuration flows. It is not a unit-test runner — it is the scripted bootstrap that stands up a fresh Dolibarr instance and applies the company identity. 
+ +| File | Role | +| --- | --- | +| [`test/main.ts`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/test/main.ts) | Entry point: launches Chromium (`fr-FR` locale), wires `globalCtx`, runs install + admin setup steps | +| [`test/deno.json`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/test/deno.json) | Imports `npm:playwright` and `jsr:@std/dotenv/load`; `checkJs: true` | +| [`test/.env.example`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/test/.env.example) | Template for `DOLIBARR_ADDRESS`, DB password, admin login, `ROOT_FOLDER` | +| [`test/scripts/admin/`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/test/scripts/admin) | `initialSetup.ts`, `companySetup.ts`, `displaySetup.ts`, `moduleSetup.ts` | + +### Run the suite + +1. Install Deno and the Playwright browsers: + - `curl -fsSL https://deno.land/install.sh | sh` + - `deno run --allow-all npm:playwright install` +2. Populate `test/.env` from the live cluster secrets (DB password from the `vso-db-credentials` secret; admin password from `secretkv`). See [Secrets & VSO](../tools/secrets-and-vso.md) for how those secrets land in the namespace. +3. Run: `deno run --allow-all main.ts`. + +### Lock the installer after install + +> [!CAUTION] +> Dolibarr's `install/` wizard stays reachable until an `install.lock` exists. After a successful first install, **always** create the lock — an unlocked installer is a live takeover risk on a production-like instance. + +The post-install step touches the lock file inside the pod and chowns it to `www-data`: + +```sh +kubectl -n erp exec $(kubectl get pod -n erp -l app.kubernetes.io/name=erp -o name) -- \ + /bin/bash -c '/usr/bin/touch /var/www/html/install.lock && /bin/chown www-data:www-data /var/www/html/install.lock' +``` + +`initialSetup.isUpgradeLocked()` checks for the same locked state before deciding whether to (re)run the installer, so the lock is both a safety gate and the suite's idempotency signal. 
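
A read-only check that the lock is actually in place can be sketched as follows. The helper name `check_install_lock` and the `LOCK_PATH` override are hypothetical; the function takes the command prefix used to reach the container as its arguments, so the same code works through `kubectl exec` or directly inside the pod. The default path is the lock file touched by the post-install step.

```sh
#!/bin/sh
# check_install_lock [PREFIX...]: run `test -f` on the installer lock through
# PREFIX (for example a `kubectl exec ... --` prefix). With no PREFIX the test
# runs locally. Prints "locked" (exit 0) or "UNLOCKED" (exit 1).
# Hypothetical helper; LOCK_PATH is an illustrative override.
check_install_lock() {
  lock="${LOCK_PATH:-/var/www/html/install.lock}"
  if "$@" test -f "$lock"; then
    echo "locked"
  else
    echo "UNLOCKED"
    return 1
  fi
}

# Usage against the cluster (same pod lookup as the command above):
#   pod=$(kubectl get pod -n erp -l app.kubernetes.io/name=erp -o name)
#   check_install_lock kubectl -n erp exec "$pod" --
```

Because it only runs `test -f`, the check is safe to repeat after every install or upgrade.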
+ +--- + +## 4. Day-2 operations + +### Scaling — manual only + +| Setting | Value | Source | +| --- | --- | --- | +| `replicaCount` | `1` | [`chart/values.yaml`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/values.yaml) | +| `autoscaling.enabled` | `false` | [`chart/values.yaml`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/values.yaml) | + +Dolibarr runs as a **single replica with no HorizontalPodAutoscaler**. The instance is backed by a `ReadWriteOnce` filesystem PVC, so scaling out is not a supported topology — scaling is a deliberate, manual `replicaCount` change in the chart values, applied through the normal [deployment](deployment.md) path. Treat the workload as a single-writer system. + +### After activating a new Dolibarr module — fix table ownership + +Activating a Dolibarr module creates new SQL tables, and Dolibarr's migration runner creates them under whatever role the live VSO-rotated credentials happen to map to. If that role is not `erp_role`, a subsequent credential rotation by Vault can leave the new tables unreadable. + +> [!IMPORTANT] +> After enabling any new module (e.g. via `moduleSetup.configureModule`), run the table-ownership reassignment so the new objects are owned by `erp_role`. The `REASSIGN OWNED BY … TO erp_role` logic lives in [`chart/scripts/update_ownership.sql`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/scripts/update_ownership.sql) and is also mounted into the pod entrypoint as `update_table_ownership.sql`. 
+ +Apply it inside the pod: + +```sh +kubectl exec -n erp $(kubectl get pod -n erp -l app.kubernetes.io/name=erp -o name) -c erp -- \ + sh -c 'PGPASSWORD=${DOLI_DB_PASSWORD} psql -U ${DOLI_DB_USER} -h ${DOLI_DB_HOST} \ + -p ${DOLI_DB_HOST_PORT} ${DOLI_DB_NAME} -f /var/www/scripts/before-starting.d/update_table_ownership.sql' +``` + +If the pod logged `Read-only file system` for the `update_table_ownership.sql` step at startup (the entrypoint cannot write its temp file), the reassignment never ran — run the command above by hand. If the live DB user has already lost rights, run the same SQL with the **admin Postgres credentials** instead. The role model is described in [Postgres IaC](../factory-provisioning/opentofu/postgres-iac.md). + +### Watch PVC usage so backups do not fill the volume + +ERP data and (where co-located) backup artifacts share storage on a single PVC. If on-volume backup snapshots accumulate, they can exhaust the volume and take Dolibarr down. + +| Watch | Why | +| --- | --- | +| PVC used vs. capacity | A full volume crashes Dolibarr (no room for sessions/temp/migrations) | +| Backup artifact growth | Old dumps left on the volume eat the same space data needs | + +> [!WARNING] +> Monitor PVC usage and prune/offload old backup artifacts before they fill the **50Gi** volume. Backup retention, artifact layout, and the off-volume target are documented in [Backup & recovery](backup-and-recovery.md) — keep on-volume copies short-lived. See [Storage & recovery](../lab-ecosystem/storage-and-recovery.md) for the durable-copy model. + +--- + +## See also + +- [ERP hub](README.md) — overview and entry point for the ERP guidebook. +- [Deployment](deployment.md) — how the workload, chart, and credentials reach the cluster. +- [Backup & recovery](backup-and-recovery.md) — backup artifacts, retention, restore drills. +- [Secrets & VSO](../tools/secrets-and-vso.md) — how `vso-db-credentials` / `secretkv` land in the namespace. 
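
The PVC watch described in §4 boils down to a percentage-and-threshold check. A minimal sketch, assuming an 80% warning threshold (an arbitrary choice, not a value from the chart) and hypothetical helper names `pvc_usage_pct` and `pvc_check`. The inputs are the used and total sizes in the same unit, for example the KiB columns of `df -Pk` run inside the pod against the volume's mount path.

```sh
#!/bin/sh
# pvc_usage_pct USED TOTAL: integer percentage of the volume in use.
pvc_usage_pct() {
  [ "$2" -gt 0 ] || { echo "total must be > 0" >&2; return 2; }
  echo $(( $1 * 100 / $2 ))
}

# pvc_check USED TOTAL [THRESHOLD]: print "OK n%" or "WARN n%"; exit 1 on WARN.
# THRESHOLD defaults to 80, an assumed value for illustration only.
pvc_check() {
  pct=$(pvc_usage_pct "$1" "$2") || return 2
  if [ "$pct" -ge "${3:-80}" ]; then
    echo "WARN ${pct}%"
    return 1
  fi
  echo "OK ${pct}%"
}

# Usage sketch (MOUNT is a placeholder for the PVC mount path in the pod):
#   set -- $(df -Pk "$MOUNT" | awk 'NR==2 {print $3, $2}')
#   pvc_check "$1" "$2"
```

The non-zero exit on `WARN` lets the same helper drive a cron job or an alerting hook without parsing its output.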
-- 2.49.1 From 2d76eb45c1b816e10f42998ef9a9836c2762f257 Mon Sep 17 00:00:00 2001 From: Gabriel Radureau Date: Tue, 23 Jun 2026 22:22:09 +0200 Subject: [PATCH 7/9] docs(vibe): add new-tool and new-app runbooks (grounded in real PRs) Two agent-oriented runbooks under vibe/runbooks/ with [AGENT]/[HUMAN] step markers, grounded in real diffs: - new-tool.md : add a platform component to the tools repo so ArgoCD deploys it into the tools namespace (wrapper Chart.yaml + the tool library + a row in chart/values.yaml; optional iac/ for secrets). Mirrors the prometheus/crowdsec additions. - new-app.md : stand up a brand-new application across THREE repos (app + factory + tools) with the strict ordering dependency and the TERRAFORM_SSH_KEY pitfall. Phase-by-phase mapped to the dance-lessons-coach onboarding PRs (#89/#97/#98/#99/#100), factory #1/#2, tools #1; the FR doc/runbooks/new-web-app is linked as the detailed companion. 2 mermaid diagrams MCP-validated; zero dead links across the vibe tree. Co-Authored-By: Claude Opus 4.8 --- vibe/runbooks/README.md | 2 + vibe/runbooks/new-app.md | 269 ++++++++++++++++++++++++++++++++++++ vibe/runbooks/new-tool.md | 279 ++++++++++++++++++++++++++++++++++++++ 3 files changed, 550 insertions(+) create mode 100644 vibe/runbooks/new-app.md create mode 100644 vibe/runbooks/new-tool.md diff --git a/vibe/runbooks/README.md b/vibe/runbooks/README.md index 3286606..1237ba4 100644 --- a/vibe/runbooks/README.md +++ b/vibe/runbooks/README.md @@ -45,6 +45,8 @@ flowchart LR | Runbook | Summary | Status | |---|---|---| | [_template](_template.md) | Skeleton for new agent-oriented runbooks (`[AGENT]`/`[HUMAN]` markers, copy-paste commands, verification + rollback) | ✅ Active | +| [Set up a new tool](new-tool.md) | Add a platform component to the `tools` repo so ArgoCD deploys it | ✅ | +| [Set up a new app](new-app.md) | Stand up a brand-new application — its own repo, chart, CI/CD with IaC, and database access, across the app + factory + tools repos | 
✅ |
 
 > [!NOTE]
 > The first **concrete** runbook — a local sandbox game-day for the safe prod-like environment — ships with **PRD Phase 1** ([safe-prod-like-environment PRD](../PRD/safe-prod-like-environment/README.md)). Until then this folder holds the conventions and the template only.
diff --git a/vibe/runbooks/new-app.md b/vibe/runbooks/new-app.md
new file mode 100644
index 0000000..c6ef0a0
--- /dev/null
+++ b/vibe/runbooks/new-app.md
@@ -0,0 +1,269 @@
+[vibe](../README.md) > [Runbooks](README.md) > **Set up a new app**
+
+# Set up a new app
+
+> **Status:** ✅ Active
+> **Audience:** platform operator + agents (English). For the detailed human-facing procedure see the French [new-web-app runbook](../../doc/runbooks/new-web-app/README.md).
+> **Last Updated:** 2026-06-23
+
+## TL;DR
+
+> [!TIP]
+> Standing up a brand-new application touches **three repos** — the app's own Gitea repo, [`factory`](../../argocd/values.yaml), and [`tools`](../guidebooks/tools/secrets-and-vso.md) — with a **strict ordering dependency**. An agent may write every file and open every PR (`[AGENT]`), but each **merge/apply is `[HUMAN]`-gated**. The single rule that everything else hangs on: the **factory** Postgres DB+role and the **tools** Vault JWT role MUST be applied **before** the app's own `iac/` runs. Ship the app in **degraded mode first** (no DB/Vault), wire the platform sides, then turn on dynamic credentials last. The detailed companion is the French [new-web-app runbook](../../doc/runbooks/new-web-app/README.md); this page is its agent-oriented English mirror.
+
+> [!CAUTION]
+> **Ordering is load-bearing — do not reorder the phases.**
+> - The app's own `iac/` (Phase 6) calls the shared `app_roles` module, which issues `GRANT <app>_role TO …` on every dynamic credential and authenticates to Vault as `gitea_cicd_<app>`. So **both** of these must already exist:
+>   - the Postgres role `<app>_role` + database `<app>` → created by the **factory** side (Phase 4).
+>   - the Vault JWT role `gitea_cicd_<app>` + policies `<app>` / `<app>-ops` → created by the **tools** side (Phase 5).
+> - The app's `vault.yaml` CI needs the **`TERRAFORM_SSH_KEY`** Actions secret (the `tofu_module_reader` SSH key from Vault) or `terraform init` cannot clone the `app_roles` module over `git::ssh://`. This is the canonical pitfall — it sank the first `iac/` push and was fixed in [dance-lessons-coach PR #100](https://gitea.arcodange.lab/arcodange/dance-lessons-coach/pulls/100).
+> Apply Phases 4 and 5 **before** merging Phase 6.
+
+## Scope
+
+This runbook covers standing up a **brand-new application** end-to-end: its own Gitea repo, a Helm `chart/`, CI/CD with IaC (`iac/` + `.gitea/workflows/`), and database access — all deployed by factory **ArgoCD** into a dedicated namespace. Systems touched: **Gitea** (repo + Actions + container registry), **Postgres** (DB + owner role via factory), **Vault** (JWT CI role, policies, dynamic DB creds via tools + app), **k3s** (namespace, pod, SA), **ArgoCD** (Application sync + image-updater), and **Traefik** (ingress).
+
+It does **not** cover: writing the application code itself, the one-time platform foundations (Vault mounts, the Vault→Postgres connection, the `gitea_cicd` bootstrap JWT role, the `tofu_module_reader` bot, org-level Actions secrets — all already in place), or adding a non-application platform component (see [Set up a new tool](new-tool.md)).
+
+The reference onboarding is **`dance-lessons-coach`** (verified from its merged PRs), with **[webapp](../guidebooks/applications/webapp.md)** as the canonical app to clone.
+
+## Preconditions
+
+- [ ] Working in a worktree under `.claude/worktrees//` (never the trunk).
+- [ ] You can create a Gitea repo under `arcodange-org` (default) or `arcodange` (for some apps).
+- [ ] Local clones of `factory` and `tools` are available and on synced `main`.
+- [ ] The `<app>` name is chosen — **kebab-case, lowercase**. This is the **universal join key**: the same string is reused verbatim across Gitea, Postgres, Vault, Kubernetes, ArgoCD, GCS, and DNS. One typo silently breaks the chain. See [naming-conventions](../guidebooks/lab-ecosystem/naming-conventions.md) and the FR [conventions](../../doc/runbooks/new-web-app/conventions.md).
+- [ ] The platform foundations exist (Vault mounts `kvv2`/`postgres`/`transit` + auth `kubernetes`, the Vault→Postgres connection via `credentials_editor`, the bootstrap `gitea_cicd` role, the `tofu_module_reader` SSH bot, and org Actions secrets `HOMELAB_CA_CERT` / `vault_oauth__sh_b64` / `PACKAGES_TOKEN`).
+
+## The three-repo onboarding (ordering)
+
+```mermaid
+%%{init: {'theme':'base','themeVariables':{'fontSize':'14px'}}}%%
+flowchart TB
+    classDef app fill:#2563eb,stroke:#1e40af,color:#ffffff
+    classDef plat fill:#059669,stroke:#047857,color:#ffffff
+    classDef tools fill:#7c3aed,stroke:#6d28d9,color:#ffffff
+    classDef run fill:#b45309,stroke:#92400e,color:#ffffff
+
+    P1["Phase 1-3 · APP repo<br/>chart/ degraded + Vault-ready (gated) + TLS<br/>(serves, no DB/Vault yet)"]:::app
+    P4["Phase 4 · FACTORY repo<br/>argocd/values.yaml + postgres/iac<br/>→ DB &lt;app&gt; + role &lt;app&gt;_role"]:::plat
+    P5["Phase 5 · TOOLS repo<br/>hashicorp-vault/iac<br/>→ gitea_cicd_&lt;app&gt; + policies"]:::tools
+    P6["Phase 6 · APP repo<br/>iac/ (app_roles module) + vault.yaml<br/>+ TERRAFORM_SSH_KEY secret"]:::app
+    P7["Phase 7-8 · APP repo<br/>vault.enabled=true + dockerimage.yaml<br/>→ dynamic creds on, image rollout"]:::run
+
+    P1 --> P4
+    P1 --> P5
+    P4 --> P6
+    P5 --> P6
+    P6 --> P7
+```
+
+1. **Phases 1-3 (app repo):** ship the chart in degraded mode, make it Vault-ready behind a default-off gate, and set the right ingress — none of this needs the platform sides yet.
+2. **Phase 4 (factory) and Phase 5 (tools)** are independent of each other but **both** must be applied before Phase 6.
+3. **Phase 6 (app repo)** applies the app's own `iac/`, which depends on the role/JWT created in 4 and 5, and needs the `TERRAFORM_SSH_KEY` secret.
+4. **Phases 7-8 (app repo)** flip `vault.enabled=true` for live dynamic DB creds, then add the image-build CI so ArgoCD's image-updater rolls out releases.
+
+## Procedure
+
+### Phase 0 — Choose the name and create the repo
+
+1. **[HUMAN]** Fix `<app>` (kebab-case) and the Gitea org. Default org is **`arcodange-org`**; some apps live under **`arcodange`** (e.g. `dance-lessons-coach`, `telegram-gateway`). Create the empty repo under the chosen org. Inheriting org-level Actions secrets is why the org choice matters.
+
+### Phase 1 — App in degraded mode
+
+Mirrors [dance-lessons-coach PR #89](https://gitea.arcodange.lab/arcodange/dance-lessons-coach/pulls/89). Clone the [webapp](../guidebooks/applications/webapp.md) pattern.
+
+2. **[AGENT]** Add a `Dockerfile` and a Helm `chart/` (`deployment`, `service`, `ingress`, `serviceaccount`, `configmap`, `_helpers.tpl`, `NOTES.txt`) with **no DB/Vault wiring**. Set:
+   - ingress host `<app>.arcodange.lab` (internal) and/or `<app>.arcodange.fr` (public) — TLS details land in Phase 3;
+   - a `nodeSelector` of `kubernetes.io/hostname: pi1` (network entrypoint, preserves the user IP, avoids NAT);
+   - `/healthz` (or the app's real path, e.g. `dance-lessons-coach` uses `/api/healthz`) for **both** liveness and readiness probes;
+   - leave any DB host empty so the pod serves in degraded mode.
+
+   ```bash
+   # [AGENT] lint + render before opening the PR — safe, no cluster contact
+   helm lint chart/
+   helm template chart/ --set image.repository=test --set image.tag=v1
+   ```
+
+3. **[HUMAN]** Open and merge the PR. Verify the app serves in degraded mode (binary + health endpoint reachable once ArgoCD picks it up in Phase 4+).
+
+### Phase 2 — Make the chart Vault-ready (gated, default off)
+
+Mirrors [dance-lessons-coach PR #97](https://gitea.arcodange.lab/arcodange/dance-lessons-coach/pulls/97).
+
+4. **[AGENT]** Add `VaultAuth`, `VaultStaticSecret`, and `VaultDynamicSecret` templates, each **gated behind `.Values.vault.enabled`** (default `false`) so a plain `helm install` keeps working. The reference `values.yaml` exposes:
+
+   ```yaml
+   # chart/values.yaml — gate + the three Vault join keys (all derived from <app>)
+   vault:
+     enabled: false
+     role: <app>                 # k8s auth backend role (matches iac/main.tf)
+     kvv2Path: <app>/config      # KVv2 secret path
+     postgresPath: creds/<app>   # postgres dynamic creds path
+   ```
+
+   The `VaultAuth` targets the k8s role `<app>` with the app's ServiceAccount and audience `vault`; the `VaultDynamicSecret` reads `postgres/creds/<app>` into a `db-credentials` Secret and `rolloutRestartTargets` the Deployment.
+
+5. **[HUMAN]** Open and merge the PR. The chart is now Vault-ready without activating any Vault dependency.
+
+### Phase 3 — Ingress / TLS
+
+Mirrors [dance-lessons-coach PR #98](https://gitea.arcodange.lab/arcodange/dance-lessons-coach/pulls/98). Pick by host suffix:
+
+6. **[AGENT]** For a **`.lab`** host: `traefik.../router.entrypoints: websecure` + `router.tls: "true"` + `router.tls.certresolver: letsencrypt` (with `router.tls.domains.0.main: arcodange.lab` and `…sans: .arcodange.lab`) + `router.middlewares: localIp@file`. For a **`.fr`** host: `router.entrypoints: web` + `router.middlewares: kube-system-crowdsec@kubernetescrd`. (Convention: `.lab` = internal, websecure + localIp + letsencrypt; `.fr` = public, web + crowdsec.)
+
+7. **[HUMAN]** Merge the PR.
+
+### Phase 4 — FACTORY side (DB + role, ArgoCD enrollment)
+
+Mirrors [factory PR #1](https://gitea.arcodange.lab/arcodange-org/factory/pulls/1) (ArgoCD) and [factory PR #2](https://gitea.arcodange.lab/arcodange-org/factory/pulls/2) (Postgres). See [postgres-iac](../guidebooks/factory-provisioning/opentofu/postgres-iac.md), [ci-apply-flow](../guidebooks/factory-provisioning/opentofu/ci-apply-flow.md).
+
+8. **[AGENT]** Enroll `<app>` in [`argocd/values.yaml`](../../argocd/values.yaml) under `gitea_applications`. The [apps template](../../argocd/templates/apps.yaml) defaults the org to `arcodange-org` (`{{- $org := default "arcodange-org" $app_attr.org -}}`), so add `org: arcodange` **only** if the app is not under `arcodange-org`. Add image-updater annotations for digest-based rollout:
+
+   ```yaml
+   # argocd/values.yaml — under gitea_applications
+   <app>:
+     org: arcodange   # ← ONLY if not arcodange-org
+     annotations:
+       argocd-image-updater.argoproj.io/image-list: <app>=gitea.arcodange.lab/<org>/<app>:latest
+       argocd-image-updater.argoproj.io/<app>.update-strategy: digest
+   ```
+
+9. **[AGENT]** Add `"<app>"` to the `applications` list in [`postgres/iac/terraform.tfvars`](../../postgres/iac/terraform.tfvars). This creates the `<app>` database, the non-login owner role `<app>_role`, and the pgbouncer `user_lookup()` function.
+
+   ```hcl
+   # postgres/iac/terraform.tfvars
+   applications = [
+     "webapp",
+     "erp",
+     "crowdsec",
+     "plausible",
+     "dance-lessons-coach",
+     "<app>",   # ← add
+   ]
+   ```
+
+10. **[HUMAN]** Merge both PRs. Factory CI (`postgres.yaml`) applies — the DB + role now exist. ArgoCD creates the Application and deploys the degraded chart into namespace `<app>`.
+
+### Phase 5 — TOOLS side (Vault JWT role + policies)
+
+Mirrors [tools PR #1](https://gitea.arcodange.lab/arcodange-org/tools/pulls/1). See [tools secrets-and-vso](../guidebooks/tools/secrets-and-vso.md), [tools components](../guidebooks/tools/components.md).
+
+11. **[AGENT]** Add `{ name = "<app>" }` to the `applications` list in [`tools/hashicorp-vault/iac/terraform.tfvars`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/hashicorp-vault/iac/terraform.tfvars). Via the `app_policy` / `app_roles` modules this creates the `gitea_cicd_<app>` JWT role, the `<app>` (runtime) and `<app>-ops` (CI) policies, the `<app>-ops` identity group, and the k8s auth role.
+
+    ```hcl
+    # tools/hashicorp-vault/iac/terraform.tfvars
+    applications = [
+      { name = "webapp" },
+      { name = "erp" },
+      { name = "<app>" },   # ← add
+      # optional fields when needed:
+      # { name = "<app>", ops_policies = ["…"], service_account_names = ["…"], service_account_namespaces = ["tools"] }
+    ]
+    ```
+
+12. **[HUMAN]** Merge the PR. Tools CI (`vault.yaml`) applies — `gitea_cicd_<app>` and the policies now exist.
+
+### Phase 6 — App IaC + Vault workflow
+
+Mirrors [dance-lessons-coach PR #99](https://gitea.arcodange.lab/arcodange/dance-lessons-coach/pulls/99) and the [#100 fix](https://gitea.arcodange.lab/arcodange/dance-lessons-coach/pulls/100). See [05-app-terraform](../../doc/runbooks/new-web-app/05-app-terraform.md) for the module contract.
+
+> [!CAUTION]
+> **Phases 4 and 5 must already be applied** before merging this phase, or the first `tofu apply` fails (no `<app>_role` to GRANT, or Vault auth fails on the missing `gitea_cicd_<app>` role).
+
+13. **[AGENT]** Add the app's `iac/`:
+    - `providers.tf` — Vault provider with `auth_login_jwt { mount = "gitea_jwt", role = "gitea_cicd_<app>" }`.
+    - `backend.tf` — GCS backend `bucket = "arcodange-tf"`, `prefix = "<app>/main"`.
+    - `main.tf` — call the shared module (the exact source string used by every app):
+
+    ```hcl
+    module "app_roles" {
+      source = "git::ssh://git@192.168.1.202:2222/arcodange-org/tools.git//hashicorp-vault/iac/modules/app_roles?depth=1&ref=main"
+      name   = "<app>"
+    }
+    ```
+
+    This provisions `postgres/creds/<app>` (dynamic DB role inheriting `<app>_role`) and the k8s auth role `<app>`. Add any app-specific `kvv2/<app>/config` secrets alongside.
+
+14. **[AGENT]** Add `.gitea/workflows/vault.yaml` that authenticates via Gitea OIDC and runs `tofu apply iac/`. The `vault-action` step's `role:` and `providers.tf`'s `role` **must both** be `gitea_cicd_<app>` (the copy-paste trap — `erp` still carries a stale `gitea_cicd_webapp`). The secrets block must read the SSH key:
+
+    ```yaml
+    # .gitea/workflows/vault.yaml — vault-action secrets block
+    secrets: |
+      kvv1/google/credentials credentials | GOOGLE_BACKEND_CREDENTIALS ;
+      kvv1/gitea/tofu_module_reader ssh_private_key | TERRAFORM_SSH_KEY ;
+    ```
+
+15. **[HUMAN]** Add the **`TERRAFORM_SSH_KEY`** secret (the `tofu_module_reader` SSH key, read from Vault at `kvv1/gitea/tofu_module_reader`) to the app repo's **Actions secrets**. Without it, `terraform init` cannot clone the `app_roles` module over `git::ssh://` — the canonical pitfall fixed in [PR #100](https://gitea.arcodange.lab/arcodange/dance-lessons-coach/pulls/100).
+
+16. **[HUMAN]** Merge the PR. The app's `vault.yaml` runs `tofu apply` — `postgres/creds/<app>` and the k8s role `<app>` now exist.
+
+### Phase 7 — Turn on dynamic DB credentials
+
+17. **[AGENT]** Set `vault.enabled=true` in `chart/values.yaml` (and point the app's DB env at `pgbouncer.tools:5432`). On next ArgoCD sync, VSO authenticates with the k8s role `<app>`, fetches dynamic Postgres creds from `postgres/creds/<app>` into the `db-credentials` Secret, and the pod reaches the DB through **pgbouncer.tools** with a short-lived user that inherits `<app>_role`. See [webapp](../guidebooks/applications/webapp.md) and [erp](../guidebooks/erp/README.md) for the consumption pattern.
+
+18. **[HUMAN]** Merge the PR.
+
+### Phase 8 — Image CI + deploy
+
+19. **[AGENT]** Add `.gitea/workflows/dockerimage.yaml` that builds the image and pushes it to the Gitea registry (`gitea.arcodange.lab/<org>/<app>:latest` + branch tag), logging in with `PACKAGES_TOKEN`. No deploy step is needed — the ArgoCD image-updater annotations from Phase 4 watch `latest` (digest strategy) and roll it out. Skip this phase entirely for apps that run a public upstream image (e.g. `erp`/Dolibarr).
+
+20. **[HUMAN]** Merge the PR.
+
+## Verification
+
+The convention chain must resolve end-to-end (this is the same parity check the [safe-env ADR/PRD](../ADR/0001-safe-prod-like-environment.md) rehearses in the sandbox). All checks below are **[AGENT]** read-only:
+
+```bash
+# [AGENT] Gitea repo exists under the chosen org
+git ls-remote https://gitea.arcodange.lab/<org>/<app> &>/dev/null && echo "repo OK"
+
+# [AGENT] Postgres DB + owner role exist (run from a host with psql access to the engine)
+psql -h 192.168.1.202 -U credentials_editor -tAc \
+  "SELECT datname FROM pg_database WHERE datname='<app>';"
+psql -h 192.168.1.202 -U credentials_editor -tAc \
+  "SELECT rolname FROM pg_roles WHERE rolname='<app>_role';"
+
+# [AGENT] Vault: dynamic role, policies, and CI JWT role exist
+vault read postgres/roles/<app>
+vault policy read <app>
+vault policy read <app>-ops
+vault read auth/gitea_jwt/role/gitea_cicd_<app>
+
+# [AGENT] ArgoCD Application is Synced + Healthy
+kubectl --context <ctx> -n argocd get application <app> \
+  -o jsonpath='{.status.sync.status}/{.status.health.status}'
+# expected: Synced/Healthy
+
+# [AGENT] VSO created the db-credentials Secret + pod is Running + ingress resolves
+kubectl --context <ctx> -n <app> get secret db-credentials
+kubectl --context <ctx> -n <app> get pods
+curl -fsS https://<app>.arcodange.lab/healthz   # or the app's real health path
+```
+
+Expected: repo present; PG `<app>` DB + `<app>_role` exist; Vault `postgres/creds/<app>` + policies `<app>`/`<app>-ops` + `gitea_cicd_<app>` exist; ArgoCD Application `Synced/Healthy`; the `db-credentials` Secret was created by VSO; the pod is `Running`; the ingress resolves.
+
+## Rollback
+
+Revert the per-repo PRs **in reverse order**: app → tools → factory. Mark each undo step `[AGENT]` or `[HUMAN]`, as in the procedure.
+
+1. **[HUMAN]** App repo: revert Phase 8 → 7 → 6 PRs. Reverting the Phase 6 `iac/` removes `postgres/creds/<app>` and the k8s role on the next CI run; setting `vault.enabled=false` returns the chart to degraded mode.
+2. **[HUMAN]** Tools repo: remove the `{ name = "<app>" }` entry; tools CI prunes `gitea_cicd_<app>` + policies.
+3. **[HUMAN]** Factory repo: remove the `<app>` entry from `argocd/values.yaml` — ArgoCD **prunes the Application** (and its namespace) — and remove `"<app>"` from `postgres/iac/terraform.tfvars` to drop the DB + role.
+4. **[HUMAN]** For a full cluster-level recovery (power-cut, lost unseal key) consult `CLUSTER_RECOVERY.md`.
+
+> [!WARNING]
+> Removing the Postgres entry **drops the database** `<app>` and its data. Back up first if the app already holds state.
+
+## References
+
+- French human-operator procedure: [new-web-app runbook](../../doc/runbooks/new-web-app/README.md) + [conventions](../../doc/runbooks/new-web-app/conventions.md) (the universal `<app>` join key).
+- Exemplars: [webapp](../guidebooks/applications/webapp.md) (in-house image + DB) and [erp](../guidebooks/erp/README.md) (public image + DB).
+- Platform mechanics: [tools secrets-and-vso](../guidebooks/tools/secrets-and-vso.md), [tools components](../guidebooks/tools/components.md), [postgres-iac](../guidebooks/factory-provisioning/opentofu/postgres-iac.md), [ci-apply-flow](../guidebooks/factory-provisioning/opentofu/ci-apply-flow.md), [naming-conventions](../guidebooks/lab-ecosystem/naming-conventions.md), [secrets-and-vault](../guidebooks/lab-ecosystem/secrets-and-vault.md).
+- Companion runbook: [Set up a new tool](new-tool.md).
+- Parity rehearsal: [safe-prod-like-environment ADR/PRD](../ADR/0001-safe-prod-like-environment.md).
+- Factory files: [argocd/values.yaml](../../argocd/values.yaml), [argocd/templates/apps.yaml](../../argocd/templates/apps.yaml), [postgres/iac/terraform.tfvars](../../postgres/iac/terraform.tfvars).
+- Reference PRs (verified, all merged):
+  - app `dance-lessons-coach`: [#89 degraded](https://gitea.arcodange.lab/arcodange/dance-lessons-coach/pulls/89) · [#97 Vault-ready gate](https://gitea.arcodange.lab/arcodange/dance-lessons-coach/pulls/97) · [#98 TLS ingress](https://gitea.arcodange.lab/arcodange/dance-lessons-coach/pulls/98) · [#99 iac + workflow](https://gitea.arcodange.lab/arcodange/dance-lessons-coach/pulls/99) · [#100 TERRAFORM_SSH_KEY fix](https://gitea.arcodange.lab/arcodange/dance-lessons-coach/pulls/100)
+  - factory: [#1 ArgoCD enroll + org override](https://gitea.arcodange.lab/arcodange-org/factory/pulls/1) · [#2 Postgres DB + role](https://gitea.arcodange.lab/arcodange-org/factory/pulls/2)
+  - tools: [#1 Vault JWT role + policy](https://gitea.arcodange.lab/arcodange-org/tools/pulls/1)
diff --git a/vibe/runbooks/new-tool.md b/vibe/runbooks/new-tool.md
new file mode 100644
index 0000000..414ff6f
--- /dev/null
+++ b/vibe/runbooks/new-tool.md
@@ -0,0 +1,279 @@
+[vibe](../README.md) > [Runbooks](README.md) > **Set up a new tool**
+
+# Set up a new tool
+
+> **Status:** ✅ Active
+> **Audience:** platform operator + agents (English). For the application-onboarding equivalent see [Set up a new app](new-app.md).
+> **Last Updated:** 2026-06-23
+
+## TL;DR
+
+> [!TIP]
+> Adding a platform component means dropping a small **wrapper chart** into the `tools` repo and registering it in the app-of-apps. An agent can do the bulk of it: scaffold `tools//` (a wrapper `Chart.yaml` that depends on the upstream chart + the local `tool` library chart, the two `helm-chart*.yaml` templates, and a `values.yaml`), add one key under `tools:` in [`tools/chart/values.yaml`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/chart/values.yaml), and lint it locally. The **human approval gate** sits at two places: (1) any Vault/database wiring under `tools//iac/` and (2) opening + merging the PR — ArgoCD auto-syncs the new Application the moment it lands on `main`.
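The scaffold summarised in the TL;DR can be sketched as a small shell helper. This is a minimal sketch under stated assumptions, not something shipped in the `tools` repo: the function name `scaffold_tool`, the `demo-tool` name, and the throwaway directory are all hypothetical, and the generated `Chart.yaml` deliberately omits the `dependencies:` block, which still has to be filled in by hand as the Procedure describes.

```shell
#!/bin/sh
set -eu

# Hypothetical helper (an assumption, not part of the tools repo):
# create the wrapper-chart skeleton for one tool under <root>/<tool>/.
scaffold_tool() {
  local root
  local tool
  root="$1"
  tool="$2"

  # The runbook asks for a kebab-case tool name.
  if ! printf '%s\n' "$tool" | grep -Eq '^[a-z0-9]+(-[a-z0-9]+)*$'; then
    echo "tool name must be kebab-case: $tool" >&2
    return 1
  fi

  mkdir -p "$root/$tool/templates"

  # Chart.yaml without dependencies: the `tool` library chart and the pinned
  # upstream chart still have to be added by hand.
  printf 'apiVersion: v2\nname: %s\ntype: application\nversion: 0.1.0\n' "$tool" \
    > "$root/$tool/Chart.yaml"

  # The two one-line templates that delegate to the `tool` library chart.
  for tpl in helm-chart helm-chart-config; do
    printf '{{- if eq .Values.tool.kind "HelmChart" -}}\n{{- include "tool.%s.tpl" . -}}\n{{- end -}}\n' "$tpl" \
      > "$root/$tool/templates/$tpl.yaml"
  done

  # values.yaml starts in the default SubChart mode; upstream values go on top.
  printf "tool:\n  kind: 'SubChart'\n" > "$root/$tool/values.yaml"
}

# Demo in a throwaway directory so no real clone is touched.
demo_root="$(mktemp -d)"
scaffold_tool "$demo_root" demo-tool
find "$demo_root" -type f | sort
```

Running it lists the four generated files. A real tool still needs its `dependencies:` block, its upstream values, and its key under `tools:` before `helm lint` tells you anything useful.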
+
+## Scope
+
+This runbook covers adding a **new platform component** (monitoring, cache, security engine, connection pooler, analytics, …) to the [`tools` repo](https://gitea.arcodange.lab/arcodange-org/tools) so the factory ArgoCD `tools` project renders an Application for it and deploys it into the **`tools` namespace**.
+
+Systems touched: Gitea (`tools` repo), ArgoCD (the `tools` AppProject), k3s (the helm-controller that materialises each `HelmChart` CR), and — only for secret-backed tools — Vault + the Vault Secrets Operator (VSO).
+
+This runbook does **not** cover standing up a brand-new business application (its own repo, chart, CI/CD, database). That is the [Set up a new app](new-app.md) runbook. It also does not cover the underlying app-of-apps wiring of the `tools` project itself — read the [tools guidebook](../guidebooks/tools/README.md) for how that works.
+
+## Preconditions
+
+- [ ] Working in a worktree under `.claude/worktrees//` of a `tools` repo clone (never the trunk).
+- [ ] The tool deploys into the **`tools` namespace** (the `tools` AppProject only permits that destination).
+- [ ] You know the **upstream Helm chart** (chart name + repo URL) and a **pinned version**, OR you have decided this tool needs **Kustomize + helm inflation** (charts that require post-render patching, like `clickhouse`/`plausible`).
+- [ ] `helm` (with the upstream repo reachable) and, for the Kustomize path, `kustomize` available locally for the lint step.
+- [ ] If the tool needs secrets or a database: confidence with the Vault `app_roles` module pattern and the `tofu-apply` CI flow — see the [tools secrets & VSO page](../guidebooks/tools/secrets-and-vso.md) and the [tofu CI apply flow](../guidebooks/factory-provisioning/opentofu/ci-apply-flow.md).
+
+## Procedure
+
+1. **[HUMAN]** Choose the tool name `` (kebab-case) and the deployment shape.
+
+   Decide between the two supported shapes:
+   - **Wrapper chart (default).** A thin Helm chart that depends on the upstream chart at a pinned version and lets the local `tool` library chart emit a k3s `HelmChart` custom resource. Used by [`prometheus`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/prometheus) and [`crowdsec`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/crowdsec).
+   - **Kustomize + helm inflation.** For charts that need post-render JSON6902 patches or extra `resources/`. Used by [`clickhouse`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/clickhouse) and [`plausible`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/plausible).
+
+   Pin the upstream chart **version** now — it goes verbatim into the next step.
+
+2. **[AGENT]** Scaffold `tools//` (wrapper-chart shape).
+
+   Create four files. The `Chart.yaml` declares **two** dependencies — the local `tool` library chart (served from the Gitea Helm package registry) and the upstream chart pinned to your chosen version:
+
+   ```yaml
+   # tools//Chart.yaml
+   apiVersion: v2
+   name:
+   description: A Helm chart for Kubernetes
+
+   dependencies:
+     - name: tool
+       version: 0.1.0
+       repository: https://gitea.arcodange.lab/api/packages/arcodange-org/helm
+     - name:
+       version:
+       repository: https://
+   type: application
+   version: 0.1.0
+   ```
+
+   The two template files are one-liners that delegate to the `tool` library (they only render when `tool.kind` is `HelmChart`; under `SubChart` they are inert and the upstream chart is pulled as a normal dependency):
+
+   ```yaml
+   # tools//templates/helm-chart.yaml
+   {{- if eq .Values.tool.kind "HelmChart" -}}
+   {{- include "tool.helm-chart.tpl" . -}}
+   {{- end -}}
+   ```
+
+   ```yaml
+   # tools//templates/helm-chart-config.yaml
+   {{- if eq .Values.tool.kind "HelmChart" -}}
+   {{- include "tool.helm-chart-config.tpl" . -}}
+   {{- end -}}
+   ```
+
+   The `values.yaml` carries the upstream values under a YAML anchor and re-references it from the `tool:` block. Web-facing tools set an ingress host `.arcodange.lab`; stateful tools set persistence with the longhorn storage class and resource requests/limits. The shape, taken from [`prometheus/values.yaml`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/prometheus/values.yaml):
+
+   ```yaml
+   # tools//values.yaml
+   : &_config
+     # ── upstream values go here ──
+     # web-facing tools: expose an ingress host
+     ingress:
+       enabled: true
+       hosts:
+         - .arcodange.lab
+     # stateful tools: pin storage class + size
+     persistence:
+       enabled: true
+       storageClass: longhorn
+       size: 8Gi
+     resources:
+       requests:
+         cpu: 100m
+         memory: 256Mi
+       limits:
+         cpu: 500m
+         memory: 512Mi
+
+   tool:
+     # kind 'SubChart': pull the upstream chart as a dependency and pass it the values below.
+     # kind 'HelmChart': let the tool library emit a k3s HelmChart CR instead.
+     kind: 'SubChart'
+     repo: https://
+     chart:
+     version:
+     values: *_config
+   ```
+
+   > [!NOTE]
+   > Under `tool.kind: 'HelmChart'` the local [`tool` library chart](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/tool) emits a `helm.cattle.io/v1` `HelmChart` CR (and an optional `HelmChartConfig`) pinned to `namespace: tools` / `targetNamespace: tools`, and the k3s helm-controller installs the upstream chart. Under `'SubChart'` (the default that prometheus and crowdsec use) the upstream chart is just a Helm dependency rendered in-line. Pick `SubChart` unless you specifically need the helm-controller to own the release.
+
+   For the **Kustomize shape** instead, skip the wrapper `Chart.yaml`/templates and create a `kustomization.yaml` that inflates the upstream chart plus any `resources/`, mirroring [`plausible/kustomization.yaml`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/plausible/kustomization.yaml):
+
+   ```yaml
+   # tools//kustomization.yaml
+   apiVersion: kustomize.config.k8s.io/v1beta1
+   kind: Kustomization
+   namespace: tools
+
+   helmCharts:
+     - name:
+       repo: https://
+       version:
+       releaseName:
+       valuesFile: Values.yaml
+       namespace: tools
+
+   resources:
+     - resources/ingressroute.yaml
+   # patches: / patchesJson6902: ← post-render tweaks, see plausible for a worked example
+   ```
+
+3. **[AGENT]** Register the tool in the app-of-apps.
+
+   Add a single key for `` under `tools:` in [`tools/chart/values.yaml`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/chart/values.yaml):
+
+   ```yaml
+   # tools/chart/values.yaml
+   tools:
+     pgbouncer: {}
+     hashicorp-vault: {}
+     crowdsec: {}
+     # …existing entries…
+     : {}
+   ```
+
+   The `chart/templates/apps.yaml` template ranges over `.Values.tools` and renders one ArgoCD `Application` per key, with `path: ` and `destination.namespace: tools` under the `tools` AppProject. The key **must match the directory name** you created in step 2. See the [tools guidebook](../guidebooks/tools/README.md) for how the app-of-apps meta-chart drives this.
+
+4. **[HUMAN]** If the tool needs **secrets** or a **database**, wire Vault + VSO and a tofu-apply workflow.
+
+   This step mutates Vault (creates roles/secrets) and so is gated. Use [`crowdsec`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/crowdsec) (dynamic Postgres role) and [`plausible`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/plausible) (kvv2 static secrets) as the worked examples, and read the [tools secrets & VSO page](../guidebooks/tools/secrets-and-vso.md).
+
+   a. Add `tools//iac/` — OpenTofu that configures Vault. For a dynamic Postgres role, reuse the shared `app_roles` module exactly as crowdsec does:
+
+      ```hcl
+      # tools//iac/main.tf
+      module "app_roles" {
+        source = "git::ssh://git@192.168.1.202:2222/arcodange-org/tools.git//hashicorp-vault/iac/modules/app_roles?depth=1&ref=main"
+        name = ""
+        service_account_namespaces = ["tools"]
+      }
+      # for kvv2 static config, add vault_kv_secret_v2 resources (see plausible/iac/main.tf)
+      ```
+
+      Pair it with a `backend.tf` (GCS state at `prefix = "tools//main"`) and a `providers.tf` whose `auth_login_jwt` role is `gitea_cicd_` — both copied from crowdsec.
+
+   b. Add the VSO CRDs to the chart templates so VSO mints a k8s Secret the workload consumes. A `serviceaccount.yaml`, a `VaultAuth` bound to a Vault `kubernetes` role named ``, and a `VaultDynamicSecret` (or `VaultStaticSecret` for kvv2) pointing at the Vault path:
+
+      ```yaml
+      # tools//templates/vaultauth.yaml
+      apiVersion: secrets.hashicorp.com/v1beta1
+      kind: VaultAuth
+      metadata:
+        name:
+        namespace: {{ .Release.Namespace }}
+      spec:
+        vaultConnectionRef: default
+        method: kubernetes
+        mount: kubernetes
+        kubernetes:
+          role:
+          serviceAccount:
+          audiences:
+            - vault
+      ```
+
+      ```yaml
+      # tools//templates/vaultdynamicsecret.yaml
+      apiVersion: secrets.hashicorp.com/v1beta1
+      kind: VaultDynamicSecret
+      metadata:
+        name: -db-credentials
+        namespace: {{ .Release.Namespace }}
+      spec:
+        mount: postgres
+        path: creds/
+        destination:
+          create: true
+          name: -db-credentials
+        rolloutRestartTargets:
+          - kind: Deployment
+            name:
+        vaultAuthRef:
+      ```
+
+      Then reference the VSO-created secret from the workload (env `valueFrom.secretKeyRef`), as crowdsec's `values.yaml` does for `DB_USER`/`DB_PASSWORD`. For the Kustomize shape, add these CRDs as files under `resources/` and list them in `kustomization.yaml` instead of `templates/`.
+
+   c. Add a `.gitea/workflows/.yaml` that tofu-applies `/iac` on changes, mirroring [`crowdsec.yaml`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/.gitea/workflows/crowdsec.yaml): a path filter on `'/**/*.tf'`, a Gitea→Vault JWT auth job, and a `dflook/terraform-apply` step with `path: /iac`. See the [tofu CI apply flow](../guidebooks/factory-provisioning/opentofu/ci-apply-flow.md) for what that pipeline does end to end.
+
+5. **[AGENT]** Lint and render locally before opening the PR.
+
+   For the wrapper-chart shape:
+
+   ```bash
+   helm dependency update tools/
+   helm lint tools/
+   helm template tools/ | head -n 60
+   # render the app-of-apps Application for :
+   helm template tools-apps tools/chart | grep -A12 "name: "
+   ```
+
+   For the Kustomize shape:
+
+   ```bash
+   kustomize build --enable-helm tools/ | head -n 60
+   ```
+
+6. **[HUMAN]** Open a PR on the `tools` repo, get it reviewed, and merge.
+
+   ```bash
+   git checkout -b arcodange/
+   git add tools/ tools/chart/values.yaml
+   git commit -m "declare "
+   git push -u origin arcodange/
+   ```
+
+   > [!IMPORTANT]
+   > The `tools` repo is on **Gitea**, not GitHub — open the PR with the `mcp__gitea__*` tools (load `select:mcp__gitea__pull_request_write` via `ToolSearch`), not `gh`. Once the PR merges to `main`, ArgoCD detects the new key in `chart/values.yaml`, renders the `` Application, and syncs it automatically.
+
+## Verification
+
+All read-only — an agent can run these after the PR merges and ArgoCD has reconciled.
+
+```bash
+# 1. The ArgoCD Application for is Synced + Healthy
+kubectl --context -n argocd get application \
+  -o jsonpath='{.status.sync.status}/{.status.health.status}{"\n"}'
+# expected: Synced/Healthy
+
+# 2. The pod is Running in the tools namespace
+kubectl --context -n tools get pods -l app.kubernetes.io/name=
+# expected: -… 1/1 Running
+
+# 3. Web-facing tools: the ingress is admitted and the host resolves
+kubectl --context -n tools get ingress | grep
+curl -sI https://.arcodange.lab | head -n1 # expected: HTTP/2 200 (or app login redirect)
+
+# 4. Secret-backed tools: VSO created the k8s Secret
+kubectl --context -n tools get secret -db-credentials
+# expected: the Secret exists with the keys the workload mounts
+```
+
+## Rollback
+
+- **[HUMAN]** Revert the `tools/chart/values.yaml` entry (remove the `:` key). On the next sync ArgoCD **prunes** the `` Application — `prune: true` is set in `apps.yaml` — which removes the deployed workload from the `tools` namespace.
+- **[HUMAN]** In a follow-up PR, delete the `tools//` directory to remove the wrapper chart / Kustomize source.
+- **[HUMAN]** For secret-backed tools, the Vault role/secret created by `tools//iac/` is **not** removed by ArgoCD. Destroy it explicitly (`tofu -chdir=tools//iac destroy`) or remove the IaC and let the workflow reconcile, and drop the `.gitea/workflows/.yaml` file.
+- For a full cluster-level recovery (power cut, lost quorum) follow CLUSTER_RECOVERY.md.
+
+## References
+
+- [Tools guidebook](../guidebooks/tools/README.md) — how the app-of-apps meta-chart turns each `tools:` key into an ArgoCD Application.
+- [Tools components](../guidebooks/tools/components.md) — the catalogue of platform components and what each provides.
+- [Tools secrets & VSO](../guidebooks/tools/secrets-and-vso.md) — the Vault `app_roles` + VaultAuth/VaultDynamicSecret pattern used in step 4.
+- [Tofu CI apply flow](../guidebooks/factory-provisioning/opentofu/ci-apply-flow.md) — what the `/iac` tofu-apply workflow does end to end.
+- Real examples in the `tools` repo: [`prometheus`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/prometheus) and [`crowdsec`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/crowdsec) (wrapper-chart shape), the shared [`tool` library chart](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/tool), and [`clickhouse`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/clickhouse)/[`plausible`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/plausible) (Kustomize shape).
+- [Set up a new app](new-app.md) — the sibling runbook for onboarding a business application (not a platform component).
-- 
2.49.1


From 1824a1885dca7b7722634d4ba2d83a8996971029 Mon Sep 17 00:00:00 2001
From: Gabriel Radureau
Date: Tue, 23 Jun 2026 23:42:24 +0200
Subject: [PATCH 8/9] docs(vibe): add maintenance rule to the ansible + opentofu sub-hubs

The two factory-provisioning sub-hubs were the only guidebook index
pages without the "alter a documented component -> update its page in
the same PR" reminder that every sibling hub carries. Add a scoped
maintenance rule to each, pointing back to the factory-provisioning
maintenance rule and the guidebooks' Rules to contribute, so no folder
hub silently drifts.

Co-Authored-By: Claude Opus 4.8
---
 vibe/guidebooks/factory-provisioning/ansible/README.md  | 7 +++++++
 vibe/guidebooks/factory-provisioning/opentofu/README.md | 7 +++++++
 2 files changed, 14 insertions(+)

diff --git a/vibe/guidebooks/factory-provisioning/ansible/README.md b/vibe/guidebooks/factory-provisioning/ansible/README.md
index f5dd516..9ec705c 100644
--- a/vibe/guidebooks/factory-provisioning/ansible/README.md
+++ b/vibe/guidebooks/factory-provisioning/ansible/README.md
@@ -118,3 +118,10 @@ flowchart LR
 | 06 | [Recover](06-recover.md) | Longhorn + data restore (on-demand DR branch) | 🟡 |
 | — | [Inventory & variables](inventory.md) | `hosts.yml` groups, `group_vars/` layering, host→service mapping | ✅ |
 | — | [Roles reference](roles.md) | The seven `arcodange.factory.*` roles | ✅ |
+
+---
+
+## Maintenance rule
+
+> [!IMPORTANT]
+> **Alter a playbook, role, inventory entry, or `group_vars` → update the matching page here in the same change.** Adding a stage, renaming a role, bumping the K3s version or a `requirements.yml` dependency, or moving a host between groups all change what the pages above describe — edit the page in the PR that changes the code, never as a follow-up. This is the [factory-provisioning maintenance rule](../README.md#maintenance-rule) applied to the Ansible half; the guidebooks' full [Rules to contribute](../../README.md#rules-to-contribute) also apply.
diff --git a/vibe/guidebooks/factory-provisioning/opentofu/README.md b/vibe/guidebooks/factory-provisioning/opentofu/README.md
index 9362cd6..b7c08f7 100644
--- a/vibe/guidebooks/factory-provisioning/opentofu/README.md
+++ b/vibe/guidebooks/factory-provisioning/opentofu/README.md
@@ -93,3 +93,10 @@ flowchart TD
 | [factory iac](factory-iac.md) | `iac/` root — Gitea, Vault, Google/GCS backup, Cloudflare, OVH | ✅ |
 | [postgres iac](postgres-iac.md) | `postgres/iac/` root — PostgreSQL roles & databases on pi2 | ✅ |
 | [CI apply flow](ci-apply-flow.md) | Both Gitea workflows, the Vault-JWT exchange, auto-approve apply | ✅ |
+
+---
+
+## Maintenance rule
+
+> [!IMPORTANT]
+> **Alter a `.tf` resource, a provider version, a state backend, or a CI workflow → update the matching page here in the same change.** Adding a resource to `iac/`, changing the `postgres/iac/` application list, bumping a provider pin, or editing `iac.yaml`/`postgres.yaml` all change what the pages above describe — edit the page in the PR that changes the code, never as a follow-up. This is the [factory-provisioning maintenance rule](../README.md#maintenance-rule) applied to the OpenTofu half; the guidebooks' full [Rules to contribute](../../README.md#rules-to-contribute) also apply.
-- 
2.49.1


From 053b04337a72fb1eac8997e21e5455171ba53cbc Mon Sep 17 00:00:00 2001
From: Gabriel Radureau
Date: Wed, 24 Jun 2026 10:55:43 +0200
Subject: [PATCH 9/9] chore: gitignore .claude/worktrees

Per-session Claude Code checkouts live under .claude/worktrees// on the
trunk; keep them out of git so the main checkout stays clean.

Co-Authored-By: Claude Opus 4.8
---
 .gitignore | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/.gitignore b/.gitignore
index def556c..6aa36f9 100644
--- a/.gitignore
+++ b/.gitignore
@@ -2,4 +2,7 @@
 .terraform.*
 .DS_Store
 node_modules/
-.venv/
\ No newline at end of file
+.venv/
+
+# Claude Code worktrees (per-session checkouts under .claude/worktrees//)
+.claude/worktrees/
\ No newline at end of file
-- 
2.49.1