factory/vibe/guidebooks/lab-ecosystem/01-factory.md
Gabriel Radureau dbe32161dc docs(vibe): add factory-provisioning guidebook (Ansible + OpenTofu)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 21:11:51 +02:00


[vibe](../../README.md) > [Guidebooks](../README.md) > [Lab ecosystem](README.md) > **01 · factory**
# 01 · factory
> **Status:** ✅ Active
> **Last Updated:** 2026-06-23
> **Downstream:** [02 · tools](02-tools.md) · [03 · cms](03-cms.md)
> **Deeper dive:** [Factory provisioning guidebook](../factory-provisioning/README.md) — page-by-page walkthrough of the Ansible playbooks/roles and OpenTofu modules summarized here
> **Related:** [naming-conventions.md](naming-conventions.md) · [secrets-and-vault.md](secrets-and-vault.md) · [storage-and-recovery.md](storage-and-recovery.md)

`factory` is the **cornerstone admin repo**: it provisions the hosts and the cluster, declares what gets deployed, and owns the platform-level cloud/Gitea/Vault/Postgres state that every app depends on. It has four pillars: **Ansible** (imperative host & cluster setup), **ArgoCD** (declarative app-of-apps), **`iac/`** (OpenTofu for the cloud/Gitea/Vault edge), and **`postgres/iac/`** (per-app PostgreSQL provisioning). The repos `tools` and `cms` are deployed *by* factory's ArgoCD and are mapped in [02 · tools](02-tools.md) and [03 · cms](03-cms.md).
## Pillar 1 — Ansible ([`ansible/`](../../../ansible/))
The collection lives at `ansible/arcodange/factory/`. The inventory groups the three Pis and pins the service placement; numbered playbooks run an ordered narrative from bare OS to backups; `recover/` holds the disaster-recovery playbooks.
### Inventory (`inventory/hosts.yml`)
| Group | Hosts | Purpose |
|---|---|---|
| `raspberries` | `pi1`, `pi2`, `pi3` (`192.168.1.201-203`) | All three Pis; `ansible_user: pi` |
| `postgres` | `pi2` | The PostgreSQL host (docker-compose, outside k3s) |
| `gitea` | children of `postgres` (→ `pi2`) | Gitea co-located with PG on `pi2` |
| `pihole` | `pi1`, `pi3` | Internal DNS resolvers |
| `step_ca` | `pi1`, `pi2`, `pi3` | Step-CA PKI for `*.arcodange.lab` (primary `pi1`, replicas `pi2`/`pi3`) |
| `local` | `localhost` + the Pis | Control-node-local tasks |
### Numbered playbooks (`playbooks/`)
| Playbook | Imports / does | Notes |
|---|---|---|
| `01_system` | `system/system.yml` → rpi base, DNS, SSL, prepare disks, Docker, iSCSI, **k3s install** (`--docker --disable traefik`), CoreDNS, cert-issuer, Longhorn/Traefik config | k3s `v1.34.3+k3s1` via upstream `k3s-ansible`; pi1 server, pi2/pi3 agents |
| `02_setup` | `setup/setup.yml` → PostgreSQL + Gitea docker-compose; optional backup-NFS share | Stands up the two out-of-cluster source-of-truth services on `pi2` |
| `03_cicd` | Gitea **act-runner** docker-compose on `pi1`/`pi3` (`raspberries:&local:!gitea`), plus the ArgoCD/Image-Updater install | See the ArgoCD caveat below |
| `04_tools` | `tools/tools.yml` → `hashicorp_vault.yml`, `crowdsec.yml` | Platform tooling that bootstraps the cluster's Vault + CrowdSec |
| `05_backup` | `backup/backup.yml` → `postgres.yml`, `gitea.yml`, `k3s_pvc.yml` to `/mnt/backups` | Scheduled PG/Gitea/PVC backups; cron-report wiring present |
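
The host pattern `raspberries:&local:!gitea` used by `03_cicd` can be checked by hand with a small set model. This is a minimal sketch that only mirrors the inventory table above; it does not call Ansible, and the `resolve` helper is a hypothetical name introduced for illustration.

```python
# Inventory groups copied from the table above; a plain-set model, not Ansible itself.
PIS = {"pi1", "pi2", "pi3"}
GROUPS = {
    "raspberries": set(PIS),
    "postgres": {"pi2"},
    "gitea": {"pi2"},  # child group of postgres
    "pihole": {"pi1", "pi3"},
    "step_ca": set(PIS),
    "local": {"localhost"} | PIS,
}


def resolve(pattern: str) -> set[str]:
    """Resolve a colon-separated pattern: plain = union, '&' = intersect, '!' = exclude."""
    include: set[str] = set()
    must_match: list[set[str]] = []
    exclude: set[str] = set()
    for term in pattern.split(":"):
        if term.startswith("&"):
            must_match.append(GROUPS[term[1:]])
        elif term.startswith("!"):
            exclude |= GROUPS[term[1:]]
        else:
            include |= GROUPS[term]
    # Unions first, then intersections, then exclusions.
    for group in must_match:
        include &= group
    return include - exclude


print(sorted(resolve("raspberries:&local:!gitea")))  # ['pi1', 'pi3']
```

The result matches the `pi1`/`pi3` placement stated for the act-runners: `pi2` drops out because it hosts Gitea.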
### Recovery playbooks (`playbooks/recover/`)
| Playbook | When to use |
|---|---|
| `longhorn.yml` | Recover Longhorn after a power cut when **Volume CRDs still exist** (CSI driver registration loss) |
| `longhorn_data.yml` | Recover app data from **raw replica `.img` files** when Volume CRDs are gone (block-device level) |

The tested power-cut recovery sequence (Longhorn restore → Vault unseal → VSO re-auth → ERP scaled up last) is documented in `CLUSTER_RECOVERY.md` at the lab root (outside this repo) and summarized in [storage-and-recovery.md](storage-and-recovery.md). Background on PVC recovery is in the [Longhorn PVC recovery ADR](../../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md).
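
The choice between the two recovery playbooks reduces to one question: do the Longhorn Volume CRDs still exist? A minimal sketch of that decision follows. The function and constant names are hypothetical; only the playbook file names and the step order come from this page.

```python
# Recovery order as listed on this page; the authoritative version is CLUSTER_RECOVERY.md.
RECOVERY_SEQUENCE = [
    "Longhorn restore",
    "Vault unseal",
    "VSO re-auth",
    "ERP scaled up",
]


def pick_recovery_playbook(volume_crds_exist: bool) -> str:
    """Return the recovery playbook that fits the cluster state after a power cut."""
    if volume_crds_exist:
        # CRDs survived: the problem is CSI driver registration loss.
        return "playbooks/recover/longhorn.yml"
    # CRDs are gone: recover from the raw replica .img files instead.
    return "playbooks/recover/longhorn_data.yml"


print(pick_recovery_playbook(volume_crds_exist=False))
```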
### Key roles
`deploy_docker_compose` (renders compose stacks), `gitea_repo` / `gitea_token` / `gitea_secret` / `gitea_sync` (Gitea repo/token/secret/mirror management), `traefik_certs`, `playwright`, plus sub-roles `step_ca`, `hashicorp_vault`, `crowdsec`, `pihole`.
## Pillar 2 — ArgoCD app-of-apps ([`argocd/`](../../../argocd/))
A Helm chart whose `templates/apps.yaml` loops over `values.gitea_applications` and emits one `Application` CRD per app. Each Application derives everything from the app name: `repoURL = https://gitea.arcodange.lab/<org>/<app>`, `path = chart`, `namespace = <app>` (`CreateNamespace=true`), with `syncPolicy.automated` `prune: true` + `selfHeal: true` by default.
| App | Org override | Image Updater / notes |
|---|---|---|
| `url-shortener` | — | — |
| `tools` | — | explicit `prune`+`selfHeal` |
| `webapp` | — | ✅ digest strategy |
| `telegram-gateway` | `arcodange` | ✅ digest strategy |
| `erp` | — | — |
| `cms` | — | ✅ digest strategy |
| `dance-lessons-coach` | `arcodange` | ✅ digest strategy |
> [!NOTE]
> The chart also templates a `longhorn_backup_target` and the ArgoCD Image Updater config (`argocd.arcodange.lab`). **ArgoCD itself is not currently deployed in-cluster** — its install is commented out in `03_cicd`. This page documents the intended steady state; treat ArgoCD as "designed, not live" until that step is enabled.
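
To make the "everything derives from the app name" rule concrete, the sketch below builds the fields described above as a plain dictionary. It is an illustration, not the chart's template: the `application` helper is hypothetical, the `org` argument stands in for the `<org>` value this page leaves unspecified, and fields the page does not describe (such as `project` or the destination server) are omitted.

```python
def application(app: str, org: str, prune: bool = True, self_heal: bool = True) -> dict:
    """Derive an ArgoCD Application-shaped dict from the app name alone."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        "metadata": {"name": app},
        "spec": {
            "source": {
                "repoURL": f"https://gitea.arcodange.lab/{org}/{app}",
                "path": "chart",
            },
            "destination": {"namespace": app},
            "syncPolicy": {
                "automated": {"prune": prune, "selfHeal": self_heal},
                "syncOptions": ["CreateNamespace=true"],
            },
        },
    }


# telegram-gateway carries an explicit org override of `arcodange` in the table above.
manifest = application("telegram-gateway", org="arcodange")
print(manifest["spec"]["source"]["repoURL"])
```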
## Pillar 3 — OpenTofu ([`iac/`](../../../iac/))
Manages the cloud/Gitea/Vault edge. State lives in **GCS** (`backend "gcs"`, bucket `arcodange-tf`, prefix `factory/main`). Tofu authenticates to Vault via **Gitea OIDC JWT** (mount `gitea_jwt`, role `gitea_cicd`).
| Provider | Used for |
|---|---|
| `go-gitea/gitea` (`0.6.0`) | Repos, users, action secrets (e.g. the restricted `tofu_module_reader` CI user, CMS secrets) |
| `vault` (`4.4.0`) | KV secrets + policies + k8s auth roles (e.g. Longhorn GCS-backup creds & policy) |
| `google` (`7.0.1`) | GCS backup bucket + service account + HMAC key for Longhorn |
| `cloudflare/cloudflare` (`~> 5`) | R2 bucket, API tokens, CMS edge wiring (detailed in [03 · cms](03-cms.md)) |
| `ovh/ovh` (`2.8.0`) | OAuth2 client + IAM policy for the `arcodange.fr` domain (registrar = OVH) |

`modules/cloudflare_token` is a reusable factory for scoped Cloudflare API tokens. The root module follows the `<app>/main` state-prefix convention with `<app>` = `factory` (hence `factory/main`); see [naming-conventions.md](naming-conventions.md).
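
The Gitea OIDC JWT login mentioned above maps onto Vault's JWT auth login endpoint, `POST /v1/auth/<mount>/login`. The sketch below only builds that request for the `gitea_jwt` mount and `gitea_cicd` role and does not send it. The Vault address and the token placeholder are assumptions, and how the CI job actually obtains and passes the JWT is not shown on this page.

```python
import json
import urllib.request

VAULT_ADDR = "https://vault.example.invalid"  # hypothetical address, not from the source


def build_jwt_login(
    jwt: str, mount: str = "gitea_jwt", role: str = "gitea_cicd"
) -> urllib.request.Request:
    """Build (but do not send) a Vault JWT-auth login request."""
    body = json.dumps({"role": role, "jwt": jwt}).encode()
    return urllib.request.Request(
        f"{VAULT_ADDR}/v1/auth/{mount}/login",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_jwt_login("<oidc-jwt-from-gitea>")  # placeholder token
print(req.get_method(), req.full_url)
```

On success Vault returns a client token in the response's `auth` object, which the Vault provider can then use for the rest of the run.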
## Pillar 4 — per-app PostgreSQL ([`postgres/iac/`](../../../postgres/))
OpenTofu using the `cyrilgdn/postgresql` provider against PG on `192.168.1.202` (state prefix `factory/postgres`). It iterates over a `var.applications` set and, **per app**, creates:
| Resource | Name pattern | Purpose |
|---|---|---|
| Database | `<app>` | The app's database (`template0`, owned by the role) |
| Owner role (non-login) | `<app>_role` | Database owner; granted to dynamic users by Vault |
| Editor role (login) | `credentials_editor` | Shared admin role that can grant the per-app roles |
| `user_lookup()` function | per-`<app>` db | `SECURITY DEFINER` lookup for **pgbouncer** auth (granted to `pgbouncer_auth`, revoked from `public`) |

Current `applications` set: `webapp`, `erp`, `crowdsec`, `plausible`, `dance-lessons-coach`. Vault's PostgreSQL secrets engine then issues **dynamic** credentials on top of these roles (see [secrets-and-vault.md](secrets-and-vault.md)). The pooler (`pgbouncer`) that consumes `user_lookup()` lives in the `tools` namespace (see [02 · tools](02-tools.md)).
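
The naming patterns in the table can be expanded for the current `applications` set with a few lines of Python. This is a sketch of the naming rule only: it creates nothing in PostgreSQL, and the `AppObjects` and `plan` names are hypothetical.

```python
from dataclasses import dataclass

# Current application set as listed on this page.
APPLICATIONS = ["webapp", "erp", "crowdsec", "plausible", "dance-lessons-coach"]

SHARED_EDITOR_ROLE = "credentials_editor"  # shared across apps, not per-app


@dataclass(frozen=True)
class AppObjects:
    database: str         # "<app>"
    owner_role: str       # "<app>_role", non-login database owner
    lookup_function: str  # created inside the app's database


def plan(app: str) -> AppObjects:
    """Expand one app name into the per-app object names."""
    return AppObjects(
        database=app,
        owner_role=f"{app}_role",
        lookup_function="user_lookup",
    )


for app in APPLICATIONS:
    objects = plan(app)
    print(f"{objects.database}: owner={objects.owner_role}, fn={objects.lookup_function}()")
```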
## Provisioning order
```mermaid
%%{init: {'theme': 'base'}}%%
flowchart LR
classDef proc fill:#059669,stroke:#047857,color:#fff
classDef store fill:#7c3aed,stroke:#6d28d9,color:#fff
S1["01_system<br>OS + k3s + Longhorn"]:::proc --> S2["02_setup<br>PG + Gitea (pi2)"]:::proc --> S3["03_cicd<br>runners + ArgoCD"]:::proc --> S4["04_tools<br>Vault + CrowdSec"]:::proc --> S5["05_backup<br>PG/Gitea/PVC"]:::proc
IAC["iac/ + postgres/iac<br>(OpenTofu state in GCS)"]:::store -. "declares cloud/Gitea/Vault/PG" .- S2
```
1. **`01_system`** lays the OS, disks, Docker, and k3s with Longhorn + Traefik onto the three Pis.
2. **`02_setup`** stands up PostgreSQL and Gitea as docker-compose on `pi2` — the out-of-cluster source-of-truth services.
3. **`03_cicd`** registers the Gitea act-runners (and is where ArgoCD would install, currently commented out).
4. **`04_tools`** bootstraps the cluster's Vault and CrowdSec.
5. **`05_backup`** schedules PostgreSQL, Gitea, and k3s-PVC backups to `/mnt/backups`.
6. In parallel, **OpenTofu** (`iac/` and `postgres/iac/`) declares the cloud, Gitea, Vault, and PostgreSQL objects, keeping state in GCS.
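
Because the numbered playbooks form a strict sequence, resuming an interrupted bring-up only requires knowing the last step that completed. A minimal sketch, with a hypothetical `remaining` helper:

```python
from typing import Optional

# Ordered bring-up as listed above; the OpenTofu stacks run alongside and are not modeled here.
PLAYBOOKS = ["01_system", "02_setup", "03_cicd", "04_tools", "05_backup"]


def remaining(last_completed: Optional[str]) -> list[str]:
    """Return the playbooks still to run, in order, after the last completed one."""
    if last_completed is None:
        return list(PLAYBOOKS)
    return PLAYBOOKS[PLAYBOOKS.index(last_completed) + 1:]


print(remaining("02_setup"))  # ['03_cicd', '04_tools', '05_backup']
```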
## Cross-references
- [Lab ecosystem hub](README.md) — the whole-lab map this page sits under.
- [02 · tools](02-tools.md) — what ArgoCD deploys into the `tools` namespace (incl. pgbouncer that consumes the PG `user_lookup()`).
- [03 · cms](03-cms.md) — the CMS edge that `iac/cloudflare.tf` and `iac/ovh.tf` wire up.
- [naming-conventions.md](naming-conventions.md) — the `<app>` join key these pillars share.
- [secrets-and-vault.md](secrets-and-vault.md) — Gitea OIDC JWT for Tofu/CI and dynamic PG creds.
- [storage-and-recovery.md](storage-and-recovery.md) — Longhorn + GCS backup + power-cut recovery.
- [new-web-app runbook](../../../doc/runbooks/new-web-app/README.md) · [conventions](../../../doc/runbooks/new-web-app/conventions.md) — the step-by-step procedure these pillars support.
- [doc/adr](../../../doc/adr/README.md) — the canonical infrastructure ADRs.
- [Longhorn PVC recovery ADR](../../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md) — recovery background.