docs(vibe): add factory-provisioning guidebook (Ansible + OpenTofu)

Deep, code-grounded tree-docs guidebook under vibe/guidebooks/factory-provisioning/, explored from the actual playbooks/roles and tofu code:

- Hub: the two provisioning engines (operator-run Ansible vs CI-applied OpenTofu), a green-field bring-up flow, master index, maintenance rule.
- ansible/ sub-tree: ordered pages 01-system .. 06-recover, an inventory & variables concept page, and a Tier-1/Tier-2 roles reference (hashicorp_vault, step_ca, crowdsec, pihole, deploy_docker_compose + the gitea_* family and helpers).
- opentofu/ sub-tree: factory-iac (Cloudflare/OVH/GCP/Gitea/Vault edge + cloudflare_token module), postgres-iac (per-app DB/role/pgbouncer lookup), ci-apply-flow (Gitea OIDC-JWT -> Vault -> auto-approve apply).

Cross-linked bidirectionally with the lab-ecosystem guidebook and the safe-env ADR/PRD (the sandbox rehearses exactly these engines). 14 mermaid diagrams MCP-validated; zero dead links.

Authored by the Lab Cartographer cohort.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 21:11:51 +02:00
parent b886f06824
commit dbe32161dc
16 changed files with 1571 additions and 0 deletions
--- a/vibe/guidebooks/factory-provisioning/ansible/README.md
+++ b/vibe/guidebooks/factory-provisioning/ansible/README.md
@@ -0,0 +1,120 @@
+[vibe](../../../README.md) > [Guidebooks](../../README.md) > [Factory provisioning](../README.md) > **Ansible**
+
+# Ansible — factory provisioning
+
+> [!NOTE]
+> **Status:** ✅ active · **Last Updated:** 2026-06-23
+> **Upstream:** [Factory provisioning hub](../README.md) · [Lab ecosystem · 01 factory](../../lab-ecosystem/01-factory.md)
+> **Downstream:** [01 · System](01-system.md) · [02 · Setup](02-setup.md) · [03 · CI/CD](03-cicd.md) · [04 · Tools](04-tools.md) · [05 · Backup](05-backup.md) · [06 · Recover](06-recover.md) · [Inventory & variables](inventory.md) · [Roles reference](roles.md)
+> **Related:** [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) · [Storage & recovery](../../lab-ecosystem/storage-and-recovery.md) · [Naming conventions](../../lab-ecosystem/naming-conventions.md) · [ADR-0001 safe prod-like environment](../../../ADR/0001-safe-prod-like-environment.md)
+
+Ansible is the **imperative half** of the factory: it takes three bare Raspberry Pis (`pi1`, `pi2`, `pi3`) and turns them into a running K3s cluster with Docker, Longhorn storage, Gitea CI runners, CrowdSec, and Vault. OpenTofu (the declarative half) then provisions everything that lives *outside* the cluster — see the [OpenTofu sub-hub](../opentofu/README.md).
+
+---
+
+## Collection layout
+
+Everything ships as a single Ansible **collection** committed under [`ansible/arcodange/factory/`](../../../../ansible/arcodange/factory). The collection root, not the repo root, is what `ansible-galaxy collection install` and the FQCN references (`arcodange.factory.<role>`) resolve against.
+
+| File | Path | What it declares |
+| --- | --- | --- |
+| `galaxy.yml` | [`ansible/arcodange/factory/galaxy.yml`](../../../../ansible/arcodange/factory/galaxy.yml) | Collection identity: **namespace `arcodange`**, **name `factory`**, **version `1.0.0`**. The namespace and name together form the FQCN prefix `arcodange.factory.*` used by every role and playbook import. |
+| `requirements.yml` | [`ansible/requirements.yml`](../../../../ansible/requirements.yml) | External dependencies pulled at install time (see table below). |
+| `ansible.cfg` | [`ansible/arcodange/factory/ansible.cfg`](../../../../ansible/arcodange/factory/ansible.cfg) | `collections_path = ~/.ansible/collections` and `scp_if_ssh = True` for the SSH connection plugin. |
+| `inventory/` | [`ansible/arcodange/factory/inventory/`](../../../../ansible/arcodange/factory/inventory) | `hosts.yml` + `group_vars/`. Detailed in [Inventory & variables](inventory.md). |
+| `playbooks/` | [`ansible/arcodange/factory/playbooks/`](../../../../ansible/arcodange/factory/playbooks) | The numbered pipeline `01..05` plus the `recover/` branch. |
+| `roles/` | [`ansible/arcodange/factory/roles/`](../../../../ansible/arcodange/factory/roles) | Seven reusable roles. Detailed in [Roles reference](roles.md). |
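+
+As a minimal sketch, the identity fields described above would appear in `galaxy.yml` roughly as follows. Only `namespace`, `name`, and `version` come from this page; the real file carries further metadata (authors, readme, and so on) that is omitted here.
+
+```yaml
+# galaxy.yml (excerpt, other metadata omitted)
+namespace: arcodange
+name: factory
+version: 1.0.0
+```
+
+A play can then reference a role by its FQCN. The play below is illustrative only: the host pattern is an assumption, and `step_ca` stands in for any role listed in the [Roles reference](roles.md).
+
+```yaml
+- name: Illustrative play using a collection role by FQCN
+  hosts: all   # assumption: substitute the real inventory group
+  roles:
+    - role: arcodange.factory.step_ca
+```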
+
+### External dependencies (`requirements.yml`)
+
+| Dependency | Type | Why it is needed |
+| --- | --- | --- |
+| `geerlingguy.docker` | role | Installs and configures the Docker engine on each Pi. |
+| `ansible.posix` | collection | POSIX primitives (mounts, sysctl, `synchronize`). |
+| `community.crypto` | collection | Certificate/key generation for the step-ca PKI and Traefik. |
+| `community.docker` | collection | Manages containers and Compose stacks (Gitea, act_runner). |
+| `community.general` | collection | Broad utility modules used across the pipeline. |
+| `kubernetes.core` | collection | `k8s` / `helm` modules used by every K3s-facing task. Needs the `kubernetes` Python lib at runtime. |
+| `k3s-ansible` (`git+https://github.com/k3s-io/k3s-ansible.git`) | git role/collection | Upstream playbooks that install and cluster K3s itself. |
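+
+A sketch of how these entries could be laid out in `requirements.yml`, assuming the standard `roles:` / `collections:` split. The placement of the `k3s-ansible` git entry under `collections:` and the absence of version pins are assumptions; the committed file is authoritative.
+
+```yaml
+roles:
+  - name: geerlingguy.docker
+
+collections:
+  - name: ansible.posix
+  - name: community.crypto
+  - name: community.docker
+  - name: community.general
+  - name: kubernetes.core
+  # Assumption: installed as a git-sourced collection.
+  - name: git+https://github.com/k3s-io/k3s-ansible.git
+    type: git
+```
+
+With this layout, `ansible-galaxy collection install -r` reads only the `collections:` key, while `ansible-galaxy install -r` processes both keys in one pass.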
+
+> [!TIP]
+> The runtime Python libraries (`kubernetes`, `jmespath`, `dnspython`) that `kubernetes.core` and friends import are declared in the **repo-root `pyproject.toml`**, not in `requirements.yml`. `uv sync` installs them; `ansible-galaxy` installs the Galaxy/git content. Both steps are required.
+
+---
+
+## Invocation pattern
+
+The control node runs Ansible from a `uv`-managed venv. The `localhost` inventory entry sets `ansible_python_interpreter: "{{ ansible_playbook_python }}"`, so tasks that run on `localhost` use the same venv Python that `uv run` launches Ansible with, and no interpreter path is hardcoded. The full recipe lives in [`ansible/README.md`](../../../../ansible/README.md).
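+
+A minimal sketch of the `localhost` entry described above. Only the `ansible_python_interpreter` line is taken from this page; the surrounding group structure and `ansible_connection: local` are assumptions about how `hosts.yml` is laid out.
+
+```yaml
+all:
+  hosts:
+    localhost:
+      ansible_connection: local   # assumption
+      # Reuse the interpreter that is running ansible-playbook (the uv venv).
+      ansible_python_interpreter: "{{ ansible_playbook_python }}"
+```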
+
+1. **Sync the venv** — installs `ansible-core` plus the runtime Python deps:
+   ```sh
+   uv sync
+   ```
+2. **Install collection dependencies** — pulls the Galaxy + git content from `requirements.yml`:
+   ```sh
+   uv run ansible-galaxy collection install -r ansible/requirements.yml
+   ```
+3. **Run a stage** — point `-i` at the inventory directory and pass one numbered playbook:
+   ```sh
+   uv run ansible-playbook \
+     -i ansible/arcodange/factory/inventory \
+     ansible/arcodange/factory/playbooks/<NN_name>.yml
+   ```
+
+### The vault password (`ANSIBLE_VAULT_PASSWORD_FILE`)
+
+Encrypted vars are decrypted with a password that is **sourced from the cluster, not stored on disk**. `ANSIBLE_VAULT_PASSWORD_FILE` points at a tiny executable script that reads the K8s secret `arcodange-ansible-vault` from the `kube-system` namespace:
+
+```sh
+kubectl get secret -n kube-system arcodange-ansible-vault \
+  --template='{{index .data.pass | base64decode}}'
+```
+
+> [!IMPORTANT]
+> The same `arcodange-ansible-vault` secret in `kube-system` is consumed by the Gitea CI runners (needed for the Gitea mailer). Create it once with `kubectl create secret generic arcodange-ansible-vault --from-literal="pass=<ansible_vault_password>" -n kube-system`. See [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) for how this fits the broader secret model.
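+
+For reference, the `kubectl create secret generic` command above produces a Secret equivalent to the manifest sketched here. The placeholder is kept as-is; a manifest holding the real password should not be committed.
+
+```yaml
+apiVersion: v1
+kind: Secret
+metadata:
+  name: arcodange-ansible-vault
+  namespace: kube-system
+type: Opaque
+stringData:
+  # Placeholder; Kubernetes stores it base64-encoded under .data.pass
+  pass: "<ansible_vault_password>"
+```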
+
+---
+
+## The provisioning pipeline
+
+The numbered playbooks are meant to be run **in order** on a fresh cluster. Each is a thin wrapper that `import_playbook`s the entry playbook of its stage directory (e.g. `01_system.yml` → `system/system.yml`). The `recover/` playbooks are **not** part of the linear sequence; they are an on-demand branch used only during disaster recovery.
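+
+The wrapper pattern can be sketched as follows, using the one mapping named above. The play name is illustrative, and the other numbered playbooks are assumed to follow the same shape.
+
+```yaml
+# playbooks/01_system.yml
+- name: Stage 01, system
+  ansible.builtin.import_playbook: system/system.yml
+```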
+
+```mermaid
+%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'14px'}}}%%
+flowchart LR
+  classDef stage fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb;
+  classDef recover fill:#5f1e1e,stroke:#ef4444,color:#fef2f2;
+
+  s01["01 · System<br/>Docker · K3s · Longhorn · DNS · SSL"]:::stage
+  s02["02 · Setup<br/>Gitea · Postgres · NFS backup"]:::stage
+  s03["03 · CI/CD<br/>act_runner registration"]:::stage
+  s04["04 · Tools<br/>CrowdSec · Vault"]:::stage
+  s05["05 · Backup<br/>cron reports · PVC/db dumps"]:::stage
+  rec["recover/*<br/>Longhorn + data restore"]:::recover
+
+  s01 --> s02 --> s03 --> s04 --> s05
+  s05 -. "on disaster" .-> rec
+  rec -. "rejoin pipeline" .-> s01
+```
+
+1. **`01 · System`** — base OS hardening on each Pi, then Docker, Longhorn disk prep + iSCSI, K3s install, CoreDNS, the step-ca cert issuer, and final K3s config (kubeconfig, Longhorn, Traefik).
+2. **`02 · Setup`** — deploys the cluster-resident services: Gitea, PostgreSQL (on `pi2`), and the NFS backup target.
+3. **`03 · CI/CD`** — fetches a Gitea runner-registration token and rolls out the `act_runner` Docker Compose stack on every non-Gitea Pi so CI jobs have executors.
+4. **`04 · Tools`** — installs the operational tooling layer: CrowdSec (WAF/IPS) and HashiCorp Vault.
+5. **`05 · Backup`** — schedules the cron-driven backup + email-report jobs and the Gitea / Postgres / K3s-PVC dump routines.
+6. **`recover/*` (on demand)** — invoked only after data loss to rebuild Longhorn and replay volume data; once recovered, the cluster re-enters the normal pipeline at `01 · System`.
+
+---
+
+## Index
+
+| # | Page | Covers | State |
+| --- | --- | --- | --- |
+| 01 | [System](01-system.md) | RPi hardening, Docker, K3s, Longhorn/iSCSI, CoreDNS, step-ca SSL | ✅ |
+| 02 | [Setup](02-setup.md) | Gitea, PostgreSQL, NFS backup target | ✅ |
+| 03 | [CI/CD](03-cicd.md) | Gitea `act_runner` registration & Compose deploy | ✅ |
+| 04 | [Tools](04-tools.md) | CrowdSec, HashiCorp Vault | ✅ |
+| 05 | [Backup](05-backup.md) | Cron report jobs, Gitea/Postgres/PVC dumps | ✅ |
+| 06 | [Recover](06-recover.md) | Longhorn + data restore (on-demand DR branch) | 🟡 |
+| — | [Inventory & variables](inventory.md) | `hosts.yml` groups, `group_vars/` layering, host→service mapping | ✅ |
+| — | [Roles reference](roles.md) | The seven `arcodange.factory.*` roles | ✅ |