docs(vibe): add factory-provisioning guidebook (Ansible + OpenTofu)

Deep, code-grounded tree-docs guidebook under vibe/guidebooks/factory-provisioning/, explored from the actual playbooks/roles and tofu code:

- Hub: the two provisioning engines (operator-run Ansible vs CI-applied OpenTofu), a green-field bring-up flow, master index, maintenance rule.
- ansible/ sub-tree: ordered pages 01-system .. 06-recover, an inventory & variables concept page, and a Tier-1/Tier-2 roles reference (hashicorp_vault, step_ca, crowdsec, pihole, deploy_docker_compose + the gitea_* family and helpers).
- opentofu/ sub-tree: factory-iac (Cloudflare/OVH/GCP/Gitea/Vault edge + cloudflare_token module), postgres-iac (per-app DB/role/pgbouncer lookup), ci-apply-flow (Gitea OIDC-JWT -> Vault -> auto-approve apply).

Cross-linked bidirectionally with the lab-ecosystem guidebook and the safe-env ADR/PRD (the sandbox rehearses exactly these engines).

14 mermaid diagrams MCP-validated; zero dead links.

Authored by the Lab Cartographer cohort.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 21:11:51 +02:00
parent b886f06824
commit dbe32161dc
16 changed files with 1571 additions and 0 deletions
--- a/vibe/guidebooks/factory-provisioning/ansible/04-tools.md
+++ b/vibe/guidebooks/factory-provisioning/ansible/04-tools.md
@@ -0,0 +1,125 @@
+[vibe](../../../README.md) > [Guidebooks](../../README.md) > [Factory provisioning](../README.md) > [Ansible](README.md) > **04 · Tools**
+
+# 04 · Tools — Vault + CrowdSec
+
+> [!NOTE]
+> **Status:** ✅ active · **Last Updated:** 2026-06-23
+> **Upstream:** [Ansible sub-hub](README.md) · [Factory provisioning hub](../README.md)
+> **Downstream:** [Roles reference](roles.md) — deep mechanics of the `hashicorp_vault` and `crowdsec` roles
+> **Related:** [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) · [05 · Backup](05-backup.md) · [03 · CI/CD](03-cicd.md) · [ADR-0001 safe prod-like environment](../../../ADR/0001-safe-prod-like-environment.md)
+
+Stage 4 installs the **operational tooling layer** on top of a running cluster: HashiCorp **Vault** (the lab's single secret store) and **CrowdSec** (the WAF/IPS that fronts Traefik). The entry point [`playbooks/04_tools.yml`](../../../../ansible/arcodange/factory/playbooks/04_tools.yml) is a one-line wrapper that imports [`playbooks/tools/tools.yml`](../../../../ansible/arcodange/factory/playbooks/tools/tools.yml), which in turn chains two sub-playbooks — `hashicorp_vault.yml` then `crowdsec.yml`. Both run against `localhost` (they drive the cluster through `kubectl` / `kubernetes.core`, not over SSH to the Pis).
+
+> [!IMPORTANT]
+> Vault is the chokepoint of the whole secret model. This page covers **what the playbook orchestrates**; the byte-level role internals (init, unseal, root-token minting, the OpenTofu OIDC backend) live in the [Roles reference](roles.md). Read [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) first for the conceptual model — the two auth backends, the unseal posture, and why there is no secret material in git.
+
+---
+
+## What stage 4 deploys
+
+| Sub-playbook | File | Builds | Role invoked |
+| --- | --- | --- | --- |
+| Vault | [`tools/hashicorp_vault.yml`](../../../../ansible/arcodange/factory/playbooks/tools/hashicorp_vault.yml) | Initialises + unseals Vault, wires the Gitea OIDC/JWT auth backends via OpenTofu, publishes the `vault_oauth__sh_b64` Gitea Action secret | `hashicorp_vault` |
+| CrowdSec | [`tools/crowdsec.yml`](../../../../ansible/arcodange/factory/playbooks/tools/crowdsec.yml) | A `VaultAuth` + `VaultStaticSecret` for the Turnstile captcha keys, a fresh bouncer API key, and the Traefik `crowdsec` middleware | `crowdsec` |
+
+---
+
+## Step 1 — `hashicorp_vault.yml`
+
+### The credential prompt
+
+The play opens with a single `vars_prompt` for the **Gitea admin password** (`gitea_admin_password`, marked `unsafe: true` because the password may contain template-hostile characters such as `{`, which Ansible would otherwise try to interpret as Jinja syntax). This is the only interactive input the stage needs; everything else is derived or minted on the fly.
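+
+A minimal sketch of what such a prompt could look like. Only the variable name `gitea_admin_password` and `unsafe: true` come from the playbook described above; the play name, the `prompt` wording and `private: true` are illustrative assumptions:
+
+```yaml
+# Hypothetical excerpt, not a copy of hashicorp_vault.yml
+- name: Provision Vault (sketch)
+  hosts: localhost
+  vars_prompt:
+    - name: gitea_admin_password
+      prompt: "Gitea admin password"   # assumed wording
+      private: true                    # assumed: do not echo the input
+      unsafe: true                     # keep characters like { and % literal
+```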
+
+### Orchestration flow
+
+```mermaid
+%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'14px'}}}%%
+flowchart TD
+  classDef prompt fill:#5f4a1e,stroke:#d97706,color:#fffbeb;
+  classDef mint fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb;
+  classDef vault fill:#4c1d95,stroke:#7c3aed,color:#f5f3ff;
+  classDef revoke fill:#5f1e1e,stroke:#ef4444,color:#fef2f2;
+
+  P["vars_prompt:<br/>gitea_admin_password"]:::prompt
+  T["Mint temp GITEA_ADMIN_TOKEN<br/>(role gitea_token, replace=true)"]:::mint
+  R["Run hashicorp_vault role:<br/>init · unseal · OIDC backend · gitea secret"]:::vault
+  D["post_tasks:<br/>delete GITEA_ADMIN_TOKEN"]:::revoke
+
+  P --> T --> R --> D
+```
+
+1. **Mint a temporary token.** The `arcodange.factory.gitea_token` role generates a `GITEA_ADMIN_TOKEN` with scopes `write:admin,write:organization,write:repository,write:user` (and `gitea_token_replace: true`, so any stale token of the same name is rotated). It is stashed in the fact `vault_GITEA_ADMIN_TOKEN`.
+2. **Run the `hashicorp_vault` role.** Invoked with three inputs: the Postgres admin credentials (read straight out of the Postgres host's docker-compose `environment` via `hostvars[groups.postgres[0]]`), the `gitea_admin_token` (the temporary token from step 1), and the prompted `gitea_admin_password`. The role does the heavy lifting, described in the next section.
+3. **Revoke the temporary token.** A `post_tasks` block re-invokes `gitea_token` with `gitea_token_delete: true`, so the admin token never outlives the run.
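+
+The three steps above can be pictured as a `pre_tasks` / `roles` / `post_tasks` play. This is a sketch under assumptions, not the actual playbook: the role names, `gitea_token_replace`, `gitea_token_delete` and the scopes come from the description above, while `gitea_token_name` and `gitea_token_scopes` are hypothetical variable names, and the Postgres credential lookup is left out:
+
+```yaml
+# Hypothetical outline of the mint -> run -> revoke pattern
+- name: Vault (sketch)
+  hosts: localhost
+  pre_tasks:
+    - name: Mint a temporary admin token
+      ansible.builtin.include_role:
+        name: arcodange.factory.gitea_token
+      vars:
+        gitea_token_name: GITEA_ADMIN_TOKEN   # assumed variable name
+        gitea_token_scopes: "write:admin,write:organization,write:repository,write:user"   # assumed variable name
+        gitea_token_replace: true             # rotate a stale token of the same name
+  roles:
+    - role: hashicorp_vault
+      vars:
+        # The minted token is read back from the fact named in step 1.
+        gitea_admin_token: "{{ vault_GITEA_ADMIN_TOKEN }}"
+        # gitea_admin_password is already in scope from vars_prompt.
+  post_tasks:
+    - name: Revoke the temporary admin token
+      ansible.builtin.include_role:
+        name: arcodange.factory.gitea_token
+      vars:
+        gitea_token_name: GITEA_ADMIN_TOKEN   # assumed variable name
+        gitea_token_delete: true
+```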
+
+### What the `hashicorp_vault` role does
+
+The role's [`tasks/main.yml`](../../../../ansible/arcodange/factory/playbooks/tools/roles/hashicorp_vault/tasks/main.yml) runs a fixed sequence; the OIDC backend setup is wrapped in a `block`/`always` so the freshly minted **root token is always revoked**, even on failure:
+
+| Phase | Task file | What happens |
+| --- | --- | --- |
+| **Init** | `init.yml` | First-time only. Checks `vault operator init -status`; if uninitialised, runs `vault operator init` with **1 key share / threshold 1** and writes the keys to `~/.arcodange/cluster-keys.json` (mode `600`). Idempotent on re-run. |
+| **Unseal** | `unseal.yml` | Reads `cluster-keys.json` and runs `vault operator unseal` on every server pod. Required on **every reboot** — Vault always restarts sealed. |
+| **Root token** | `new_root_token.yml` | Mints a one-shot root token via the `generate-root` OTP/nonce dance (using the unseal key), needed to authenticate the OpenTofu apply. |
+| **OIDC backend** | `gitea_oidc_auth.yml` | Drives a Playwright script to register/read the Gitea OAuth app, then runs **OpenTofu in a throwaway Docker volume** to provision the `gitea` (OIDC) + `gitea_jwt` (JWT) auth backends, the admin identity, and the `kvv1` static secrets. Finally writes the `vault_oauth__sh_b64` script to Gitea Actions secrets. |
+| **Revoke** | `revoke_token.yml` (in `always`) | Revokes the root token unconditionally. |
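+
+As a rough illustration of the `block`/`always` shape, the sequence in the table above could be laid out as follows. The task file names are taken from the table; the exact structure and task names are assumed:
+
+```yaml
+# Hypothetical skeleton of the role's tasks/main.yml
+- name: Initialise Vault on first run
+  ansible.builtin.include_tasks: init.yml
+
+- name: Unseal every server pod
+  ansible.builtin.include_tasks: unseal.yml
+
+- name: Mint a one-shot root token
+  ansible.builtin.include_tasks: new_root_token.yml
+
+- name: Provision the auth backends, then revoke the root token
+  block:
+    - name: Configure the Gitea OIDC and JWT backends
+      ansible.builtin.include_tasks: gitea_oidc_auth.yml
+  always:
+    # Runs whether the block succeeded or failed.
+    - name: Revoke the root token
+      ansible.builtin.include_tasks: revoke_token.yml
+```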
+
+> [!IMPORTANT]
+> The role applies [`hashicorp_vault.tf`](../../../../ansible/arcodange/factory/playbooks/tools/roles/hashicorp_vault/files/hashicorp_vault.tf) with OpenTofu from an ephemeral Docker volume (`docker volume create` → `tofu init` + `tofu apply` → `docker volume rm`), with the state in a GCS backend (`gs://arcodange-tf`, prefix `tools/hashicorp_vault/gitea_oidc`). The CA certificate is mounted read-only and referenced through `VAULT_CACERT`. The destroy step is commented out by design: this stage provisions, it does not tear down.
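+
+The volume lifecycle might be expressed as Ansible tasks along these lines. Everything here is an assumption made for illustration: the volume name, the image reference, the `/work` mount point, and the use of `always` for cleanup. Copying the `.tf` file, the cloud credentials and the CA into the container is omitted:
+
+```yaml
+# Hypothetical sketch of create -> init -> apply -> remove
+- name: Apply the Vault configuration in a throwaway volume
+  vars:
+    tofu_volume: vault-oidc-tofu            # assumed name
+    tofu_image: ghcr.io/opentofu/opentofu   # assumed image reference
+  block:
+    - name: Create the scratch volume
+      ansible.builtin.command:
+        argv: [docker, volume, create, "{{ tofu_volume }}"]
+
+    - name: Initialise the backend
+      ansible.builtin.command:
+        argv: [docker, run, --rm, -v, "{{ tofu_volume }}:/work", -w, /work,
+               "{{ tofu_image }}", init, -input=false]
+
+    - name: Apply without an interactive prompt
+      ansible.builtin.command:
+        argv: [docker, run, --rm, -v, "{{ tofu_volume }}:/work", -w, /work,
+               "{{ tofu_image }}", apply, -auto-approve, -input=false]
+  always:
+    - name: Remove the scratch volume
+      ansible.builtin.command:
+        argv: [docker, volume, rm, "{{ tofu_volume }}"]
+```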
+
+### The `vault_oauth__sh_b64` Gitea secret
+
+The last act of the role renders [`oidc_jwt_token.sh.j2`](../../../../ansible/arcodange/factory/playbooks/tools/roles/hashicorp_vault/templates/oidc_jwt_token.sh.j2) (an OIDC authorization-code → access-token helper for CI), base64-encodes it, and publishes it as the **org-level** Gitea Action secret `vault_oauth__sh_b64`. Because Gitea Action secrets are scoped per owner, the role then **re-publishes the identical secret to each user-owned namespace** listed in `gitea_secret_propagation_users` — repos under a personal account cannot read org-level secrets. This is what lets a Gitea Actions workflow obtain the OIDC JWT that authenticates to Vault under the `gitea_cicd_<app>` role (the CI half of the [secret model](../../lab-ecosystem/secrets-and-vault.md)).
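+
+A hedged sketch of the render, encode and publish sequence for the org-level secret, assuming the role talks to the Gitea REST API through `ansible.builtin.uri`. The variables `gitea_base_url` and `gitea_org` are hypothetical, and the per-user propagation loop is not shown because its mechanism is not described here:
+
+```yaml
+# Hypothetical tasks; only the template and secret names come from the role.
+- name: Render and base64-encode the helper script
+  ansible.builtin.set_fact:
+    vault_oauth_sh_b64: "{{ lookup('ansible.builtin.template', 'oidc_jwt_token.sh.j2') | b64encode }}"
+
+- name: Publish the script as an org-level Actions secret
+  ansible.builtin.uri:
+    url: "{{ gitea_base_url }}/api/v1/orgs/{{ gitea_org }}/actions/secrets/vault_oauth__sh_b64"
+    method: PUT
+    headers:
+      Authorization: "token {{ gitea_admin_token }}"
+    body_format: json
+    body:
+      data: "{{ vault_oauth_sh_b64 }}"
+    status_code: [201, 204]   # assumed: created or updated
+```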
+
+> [!CAUTION]
+> The role has an **off-by-default** `vault_oidc_force_reset` flag. When set, it runs `vault auth disable` on both the `gitea` and `gitea_jwt` backends before re-applying, which **wipes every `gitea_cicd_<app>` per-app JWT role** created by the tools-repo IaC. Leave it `false` unless you are deliberately rebuilding the OIDC backend from scratch (e.g. after `bound_issuer` config drift).
+
+---
+
+## Step 2 — `crowdsec.yml`
+
+The CrowdSec sub-playbook is a thin wrapper that runs the `crowdsec` role to bolt a CrowdSec-bouncer middleware onto Traefik. The role's [`tasks/main.yml`](../../../../ansible/arcodange/factory/playbooks/tools/roles/crowdsec/tasks/main.yml) wires three things together.
+
+| Step | What it creates | Detail |
+| --- | --- | --- |
+| **Turnstile secret** | `ServiceAccount` + `VaultAuth` + `VaultStaticSecret` in `kube-system` | Authenticates via the Kubernetes auth backend (role `factory_crowdsec_conf`) and pulls the Cloudflare Turnstile keys from `kvv2` path `cms/factory/turnstile` into a K8s Secret (`refreshAfter: 30s`). |
+| **Bouncer key** | A CrowdSec LAPI bouncer named `traefik-plugin` | Runs `cscli bouncers add traefik-plugin` inside the LAPI pod; on collision it deletes and re-adds, so the run is repeatable. |
+| **Traefik middleware** | A `traefik.io/v1alpha1` `Middleware` named `crowdsec` | Stream mode, captcha provider `turnstile` (site/secret keys from the Turnstile secret), Redis cache, trusted-IP allow-lists. |
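+
+For orientation, the Turnstile secret row could correspond to manifests roughly like the ones below. This assumes the Vault Secrets Operator `v1beta1` CRDs; the namespace, Vault role, mount, path and refresh interval come from the table, while every resource name and the `kubernetes` auth mount are hypothetical:
+
+```yaml
+# Hypothetical manifests, not copied from the crowdsec role
+apiVersion: v1
+kind: ServiceAccount
+metadata:
+  name: crowdsec-turnstile        # assumed name
+  namespace: kube-system
+---
+apiVersion: secrets.hashicorp.com/v1beta1
+kind: VaultAuth
+metadata:
+  name: crowdsec-turnstile        # assumed name
+  namespace: kube-system
+spec:
+  method: kubernetes
+  mount: kubernetes               # assumed auth mount
+  kubernetes:
+    role: factory_crowdsec_conf
+    serviceAccount: crowdsec-turnstile
+---
+apiVersion: secrets.hashicorp.com/v1beta1
+kind: VaultStaticSecret
+metadata:
+  name: crowdsec-turnstile        # assumed name
+  namespace: kube-system
+spec:
+  vaultAuthRef: crowdsec-turnstile
+  type: kv-v2
+  mount: kvv2
+  path: cms/factory/turnstile
+  refreshAfter: 30s
+  destination:
+    name: crowdsec-turnstile      # assumed name of the resulting K8s Secret
+    create: true
+```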
+
+After applying the middleware the role **cleans up `Failed` CrowdSec pods** and **bounces Traefik** (scale to 0 → back to 1, inside a `block`/`rescue`/`always` that guarantees Traefik returns to 1 replica no matter what) so the new middleware config is loaded.
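+
+One way to sketch the guarded restart with `kubernetes.core.k8s_scale` is shown below. The Deployment name `traefik`, the `kube-system` namespace and the timeout are assumptions; the role's real tasks may differ:
+
+```yaml
+# Hypothetical sketch of the scale 0 -> 1 bounce
+- name: Bounce Traefik so it reloads the middleware
+  block:
+    - name: Scale Traefik down
+      kubernetes.core.k8s_scale:
+        api_version: apps/v1
+        kind: Deployment
+        name: traefik            # assumed
+        namespace: kube-system   # assumed
+        replicas: 0
+        wait_timeout: 60
+  rescue:
+    - name: Report the failed scale-down
+      ansible.builtin.debug:
+        msg: "Scale-down failed; restoring Traefik anyway"
+  always:
+    # Runs on success and on failure, so ingress always comes back.
+    - name: Scale Traefik back up
+      kubernetes.core.k8s_scale:
+        api_version: apps/v1
+        kind: Deployment
+        name: traefik            # assumed
+        namespace: kube-system   # assumed
+        replicas: 1
+        wait_timeout: 60
+```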
+
+> [!NOTE]
+> The Turnstile keys come from the **CMS-managed** Vault path `cms/factory/turnstile` — they are provisioned outside this stage. CrowdSec only *reads* them here. See [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) for how `VaultStaticSecret` materialises a Vault path into a Kubernetes Secret.
+
+---
+
+## Gotchas
+
+> [!WARNING]
+> - **Vault must be unsealed before anything secret-dependent recovers.** Stage 4's unseal step reads `~/.arcodange/cluster-keys.json`. On first-time init the file is created; on an already-initialised Vault a missing file means unseal cannot proceed, and the OpenTofu apply (which needs a live Vault) fails. The same file gates step 2 of the [power-cut recovery order](../../lab-ecosystem/storage-and-recovery.md).
+> - **Docker is required on the control node.** The OIDC backend provisioning shells out to `docker run … opentofu` and `docker volume`. The Playwright step also runs containerised. A control node without Docker will fail this stage.
+> - **`gitea_admin_password` is `unsafe`.** Do not strip the `unsafe: true` flag from the prompt — passwords with `{`/`}` are mangled by Jinja templating otherwise.
+> - **Re-running is safe by default.** Init and unseal are idempotent; the temp admin token and root token are both revoked on the way out. Only `vault_oidc_force_reset` makes a re-run destructive.
+> - **CrowdSec bounces Traefik.** The middleware step briefly scales Traefik to 0 — expect a short ingress blip during stage 4. The `always` block restores it to 1 even if the scale-down errors.
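+
+The first two gotchas lend themselves to a quick pre-flight check before launching the stage. The play below is a hypothetical helper, not part of the repository; it only warns about the key file, since a missing file is expected on a first-time init:
+
+```yaml
+# Hypothetical pre-flight play for the control node
+- name: Stage 4 pre-flight (sketch)
+  hosts: localhost
+  gather_facts: false
+  tasks:
+    - name: Check that the Docker CLI can reach a daemon
+      ansible.builtin.command: docker version
+      changed_when: false
+
+    - name: Look for the unseal key file
+      ansible.builtin.stat:
+        path: "{{ '~/.arcodange/cluster-keys.json' | expanduser }}"
+      register: cluster_keys
+
+    - name: Warn when the key file is absent
+      ansible.builtin.debug:
+        msg: "cluster-keys.json not found: expected before the first init, blocking for an existing Vault"
+      when: not cluster_keys.stat.exists
+```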
+
+---
+
+## Where stage 4 sits
+
+```mermaid
+%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'14px'}}}%%
+flowchart LR
+  classDef done fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb;
+  classDef here fill:#4c1d95,stroke:#7c3aed,color:#f5f3ff;
+  classDef next fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb;
+
+  s03["03 · CI/CD"]:::done
+  s04["04 · Tools<br/>Vault · CrowdSec"]:::here
+  s05["05 · Backup"]:::next
+
+  s03 --> s04 --> s05
+```
+
+1. **03 · CI/CD** registered the `act_runner` executors — a prerequisite, since the `vault_oauth__sh_b64` secret published here is consumed by those CI runners.
+2. **04 · Tools** (this page) stands up Vault and CrowdSec.
+3. **05 · Backup** is next — it schedules the cron dumps that protect the state the cluster now holds.