factory/vibe/guidebooks/factory-provisioning/README.md
Gabriel Radureau dbe32161dc docs(vibe): add factory-provisioning guidebook (Ansible + OpenTofu)
Deep, code-grounded tree-docs guidebook under vibe/guidebooks/factory-provisioning/,
explored from the actual playbooks/roles and tofu code:

- Hub: the two provisioning engines (operator-run Ansible vs CI-applied OpenTofu),
  a green-field bring-up flow, master index, maintenance rule.
- ansible/ sub-tree: ordered pages 01-system .. 06-recover, an inventory & variables
  concept page, and a Tier-1/Tier-2 roles reference (hashicorp_vault, step_ca,
  crowdsec, pihole, deploy_docker_compose + the gitea_* family and helpers).
- opentofu/ sub-tree: factory-iac (Cloudflare/OVH/GCP/Gitea/Vault edge +
  cloudflare_token module), postgres-iac (per-app DB/role/pgbouncer lookup),
  ci-apply-flow (Gitea OIDC-JWT -> Vault -> auto-approve apply).

Cross-linked bidirectionally with the lab-ecosystem guidebook and the safe-env
ADR/PRD (the sandbox rehearses exactly these engines). 14 mermaid diagrams
MCP-validated; zero dead links. Authored by the Lab Cartographer cohort.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 21:11:51 +02:00

[vibe](../../README.md) > [Guidebooks](../README.md) > **Factory provisioning**
# Factory provisioning
> **Status:** ✅ Active
> **Last Updated:** 2026-06-23
> **Upstream:** [Lab ecosystem guidebook](../lab-ecosystem/README.md) · [01 · factory](../lab-ecosystem/01-factory.md)
> **Related:** [safe-prod-like-environment ADR](../../ADR/0001-safe-prod-like-environment.md) · [safe-prod-like-environment PRD](../../PRD/safe-prod-like-environment/README.md)
This guidebook is the deep dive into **how the `factory` repo turns three Raspberry Pis + a handful of cloud accounts into the running lab.** Where the [lab-ecosystem](../lab-ecosystem/README.md) map shows *which* components exist and how they join, this guidebook drills into the two provisioning **engines** that build and maintain them: the Ansible collection that the operator runs from the Mac, and the OpenTofu modules that Gitea CI applies. Every page below describes the engine *as it is wired right now* — playbook imports, role responsibilities, inventory placement, provider versions, state backends, and the CI flow that ties Tofu to Vault.
## Two engines, two trigger models
The factory splits provisioning along a hard line: **imperative, operator-driven host/cluster build** (Ansible) versus **declarative, CI-driven forge/cloud/database state** (OpenTofu). They never overlap on the same resource, and they run at different moments.
| Engine | Trigger | Runs from | Owns | Lives at |
|---|---|---|---|---|
| **Ansible** | One-shot, operator-run on demand | The Mac (control node) | The cluster + base layer + stateful services: k3s, Longhorn, Pi-hole, step-ca, PostgreSQL, Gitea, Vault, CrowdSec — plus the disaster-recovery playbooks | [`ansible/`](../../../ansible/) → [sub-hub](ansible/README.md) |
| **OpenTofu** | CI-applied on Gitea (path-filtered `push`/`pull_request` + `workflow_dispatch`) | Gitea act-runners | Forge/cloud edge state (Cloudflare, OVH, GCP, Gitea, Vault) and **per-app PostgreSQL databases** | [`iac/`](../../../iac/) + [`postgres/`](../../../postgres/) → [sub-hub](opentofu/README.md) |
> [!NOTE]
> Ansible is **imperative and human-gated** because it touches bare hosts and one-time bootstrap (disk prep, k3s install, Vault init). OpenTofu is **declarative and machine-gated** because its targets are reconcilable API objects (a DNS record, a bucket, a database) whose desired state belongs in version control and converges on every merge.
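
To make "declarative" concrete, here is a minimal Python sketch of the reconcile idea: compare a desired state with the observed one and derive the actions that close the gap. It is an illustration only and not how OpenTofu is implemented; the `plan` function and the resource keys in the example are hypothetical.

```python
def plan(desired, actual):
    """Return the actions needed to move `actual` to `desired`.

    Both arguments map a resource key to its configuration.
    """
    return {
        "create": sorted(k for k in desired if k not in actual),
        "update": sorted(
            k for k in desired if k in actual and actual[k] != desired[k]
        ),
        "delete": sorted(k for k in actual if k not in desired),
    }


# Hypothetical resources, for illustration only.
desired = {"dns:www": "203.0.113.10", "db:app": "owner=app"}
actual = {"dns:www": "203.0.113.9", "bucket:old": "region=eu"}

first_plan = plan(desired, actual)
# Once the actual state matches the desired one, the plan is empty,
# which is what lets the same apply run safely on every merge.
second_plan = plan(desired, desired)
```

The imperative side has no such diff: a playbook step such as disk preparation or Vault init is an action to perform once, which is why it stays behind a human decision.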
## How a green-field lab comes up
```mermaid
%%{init: {'theme': 'base'}}%%
flowchart LR
classDef op fill:#1e3a8a,stroke:#1e40af,color:#fff
classDef eng fill:#059669,stroke:#047857,color:#fff
classDef host fill:#7c3aed,stroke:#6d28d9,color:#fff
classDef store fill:#b45309,stroke:#92400e,color:#fff
OP["Operator<br>at the Mac"]:::op -->|"runs playbooks 01→05"| ANS["Ansible collection<br>arcodange.factory"]:::eng
ANS -->|"OS · k3s · Longhorn · base layer"| PIS["3× Raspberry Pi<br>pi1 / pi2 / pi3"]:::host
PIS -->|"hosts Gitea + act-runners"| CI["Gitea CI<br>act-runners"]:::store
CI -->|"path-filtered apply"| TOFU["OpenTofu<br>iac/ + postgres/iac/"]:::eng
TOFU -->|"forge · cloud · PG state"| EDGE["Cloudflare · OVH · GCP<br>Gitea · Vault · PostgreSQL"]:::store
TOFU -. "state in GCS gs://arcodange-tf" .- EDGE
```
1. The **operator**, working from the **Mac control node**, runs the numbered Ansible playbooks `01_system` through `05_backup` in order.
2. **Ansible** lays the OS, k3s (`v1.34.3+k3s1`), Longhorn, and the base layer (Pi-hole, step-ca, Vault, CrowdSec) plus the stateful out-of-cluster services (PostgreSQL + Gitea) onto the **three Raspberry Pis** (`pi1`/`pi2`/`pi3`).
3. Once `pi2` is hosting **Gitea** and `pi1`/`pi3` are running the **act-runners** (registered by `03_cicd`), the forge can run CI.
4. A push or merge to `factory` that touches `iac/**` or `postgres/**` triggers the corresponding **Gitea CI** workflow on those runners.
5. The CI job authenticates to Vault via Gitea OIDC JWT and runs **OpenTofu**, which reconciles the **forge/cloud/database edge** — Cloudflare, OVH, GCP, Gitea action-secrets, Vault KV/policies, and the per-app PostgreSQL objects.
6. All OpenTofu state is kept in **GCS** under `gs://arcodange-tf` (prefix `factory/main` for the cloud edge, `factory/postgres` for the databases), so each CI run reads and writes the authoritative state remotely.
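
Steps 4 and 6 amount to a lookup from changed paths to a Tofu root and its state prefix. The Python sketch below models that lookup under stated assumptions: the real filtering lives in the Gitea workflow definitions, and the names `STATE_PREFIXES` and `prefixes_for` are hypothetical. The path globs, bucket name, and prefixes are the ones quoted in the steps above.

```python
from fnmatch import fnmatch

# Bucket and prefixes are quoted from the steps above; the table itself is
# a hypothetical illustration, not the real workflow configuration.
STATE_BUCKET = "arcodange-tf"
STATE_PREFIXES = {
    "iac/**": "factory/main",            # cloud/forge edge
    "postgres/**": "factory/postgres",   # per-app databases
}


def prefixes_for(changed_paths):
    """Return the sorted state prefixes whose Tofu root a change set touches."""
    hits = set()
    for path in changed_paths:
        for pattern, prefix in STATE_PREFIXES.items():
            # fnmatch's "*" also matches "/", so "iac/**" covers nested files.
            if fnmatch(path, pattern):
                hits.add(prefix)
    return sorted(hits)
```

A change set that only touches `ansible/` returns an empty list, which matches the split described earlier: Ansible changes never trigger a CI apply.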
## Master index
| Sub-hub | What it maps | Status |
|---|---|---|
| [Ansible](ansible/README.md) | The `arcodange.factory` collection: numbered playbooks `01` through `06`, the inventory + group_vars, and the reusable roles that build hosts, the cluster, and the stateful services | ✅ Active |
| [OpenTofu](opentofu/README.md) | The CI-applied IaC: the cloud/forge edge (`iac/`), the per-app PostgreSQL provisioning (`postgres/iac/`), and the Gitea-OIDC → Vault apply flow | ✅ Active |
### All pages
- **Ansible**
  - [System (`01`)](ansible/01-system.md) — OS, DNS, SSL, disks, Docker, iSCSI, k3s, CoreDNS, cert-issuer, Longhorn/Traefik config
  - [Setup (`02`)](ansible/02-setup.md) — PostgreSQL + Gitea docker-compose on `pi2` (and the optional backup-NFS share)
  - [CI/CD (`03`)](ansible/03-cicd.md) — Gitea act-runner registration on `pi1`/`pi3` and the ArgoCD/Image-Updater install
  - [Tools (`04`)](ansible/04-tools.md) — Vault + CrowdSec bootstrap into the cluster
  - [Backup (`05`)](ansible/05-backup.md) — scheduled PostgreSQL / Gitea / k3s-PVC backups to `/mnt/backups`
  - [Recover (`06`)](ansible/06-recover.md) — the Longhorn disaster-recovery playbooks (`recover/`)
  - [Inventory & variables](ansible/inventory.md) — `hosts.yml` groups and the `group_vars` tree
  - [Roles reference](ansible/roles.md) — `deploy_docker_compose`, the `gitea_*` family, `traefik_certs`, `playwright`, and the service sub-roles
- **OpenTofu**
  - [factory iac](opentofu/factory-iac.md) — `iac/`: Cloudflare/OVH/GCP/Gitea/Vault edge + the `cloudflare_token` module
  - [postgres iac](opentofu/postgres-iac.md) — `postgres/iac/`: per-app databases, roles, and the pgbouncer `user_lookup()` function
  - [CI apply flow](opentofu/ci-apply-flow.md) — the Gitea workflows, OIDC-JWT → Vault auth, and the GCS state backend
## Maintenance rule
> [!IMPORTANT]
> **Alter a documented component → update its page in the same change.** If you change a playbook, a role, an inventory entry, a provider version, a Tofu resource, or the CI flow, the matching page in this guidebook MUST be edited in the same PR. A provisioning map that drifts from the code sends operators (and agents) down dead paths during a rebuild or a recovery — exactly when the map matters most.
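
One way to enforce this rule mechanically is a check over the changed paths of a PR. The Python sketch below is a hypothetical helper, not something that exists in the repo: `DOCS_FOR` and `undocumented_roots` are illustrative names, and the coarse root-to-page mapping is an assumption derived from the page layout in the master index.

```python
# Hypothetical mapping from code roots to the guidebook pages documenting them.
GUIDEBOOK = "vibe/guidebooks/factory-provisioning"
DOCS_FOR = {
    "ansible/": f"{GUIDEBOOK}/ansible/",
    "iac/": f"{GUIDEBOOK}/opentofu/factory-iac.md",
    "postgres/": f"{GUIDEBOOK}/opentofu/postgres-iac.md",
}


def undocumented_roots(changed_paths):
    """Return code roots changed in a PR whose guidebook page was not changed."""
    missing = []
    for root, doc in DOCS_FOR.items():
        code_changed = any(p.startswith(root) for p in changed_paths)
        doc_changed = any(p.startswith(doc) for p in changed_paths)
        if code_changed and not doc_changed:
            missing.append(root)
    return missing
```

A non-empty result would mean the PR alters a documented component without touching its page. A real check would need a finer mapping (for example one playbook per page) and an escape hatch for changes that do not affect documented behavior.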
## Why this guidebook earns its keep
The safe-prod-like-environment work rehearses **exactly these playbooks and Tofu modules** in a throwaway sandbox before they touch the real lab: the sandbox stands up the same `01` through `05` narrative and runs the same `iac/` + `postgres/iac/` apply, so the rehearsal only holds if this guidebook tracks the engines faithfully. See the [safe-prod-like-environment ADR](../../ADR/0001-safe-prod-like-environment.md) for the decision and the [PRD](../../PRD/safe-prod-like-environment/README.md) (with its [QA strategy](../../PRD/safe-prod-like-environment/qa-strategy.md)) for what the sandbox must reproduce.
## Cross-references
- [Lab ecosystem guidebook](../lab-ecosystem/README.md) — the higher-altitude whole-lab map; this guidebook is its provisioning deep dive.
- [01 · factory](../lab-ecosystem/01-factory.md) — the four-pillar summary of the `factory` repo that this guidebook expands.
- [secrets-and-vault.md](../lab-ecosystem/secrets-and-vault.md) — Gitea OIDC JWT for Tofu/CI and the dynamic PostgreSQL credentials these engines set up.
- [storage-and-recovery.md](../lab-ecosystem/storage-and-recovery.md) — Longhorn + GCS backup + the power-cut recovery the `06 · recover` playbooks serve.
- [naming-conventions.md](../lab-ecosystem/naming-conventions.md) — the `<app>` join key shared by the OpenTofu state prefixes and per-app PostgreSQL objects.
- [safe-prod-like-environment ADR](../../ADR/0001-safe-prod-like-environment.md) · [PRD](../../PRD/safe-prod-like-environment/README.md) — the sandbox that rehearses these engines before they touch the real lab.