docs(vibe): add factory-provisioning guidebook (Ansible + OpenTofu)

Deep, code-grounded tree-docs guidebook under vibe/guidebooks/factory-provisioning/, explored from the actual playbooks/roles and tofu code:

- Hub: the two provisioning engines (operator-run Ansible vs CI-applied OpenTofu), a green-field bring-up flow, master index, maintenance rule.
- ansible/ sub-tree: ordered pages 01-system .. 06-recover, an inventory & variables concept page, and a Tier-1/Tier-2 roles reference (hashicorp_vault, step_ca, crowdsec, pihole, deploy_docker_compose + the gitea_* family and helpers).
- opentofu/ sub-tree: factory-iac (Cloudflare/OVH/GCP/Gitea/Vault edge + cloudflare_token module), postgres-iac (per-app DB/role/pgbouncer lookup), ci-apply-flow (Gitea OIDC-JWT -> Vault -> auto-approve apply).

Cross-linked bidirectionally with the lab-ecosystem guidebook and the safe-env ADR/PRD (the sandbox rehearses exactly these engines). 14 mermaid diagrams MCP-validated; zero dead links.

Authored by the Lab Cartographer cohort.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
88
vibe/guidebooks/factory-provisioning/README.md
Normal file
@@ -0,0 +1,88 @@
[vibe](../../README.md) > [Guidebooks](../README.md) > **Factory provisioning**

# Factory provisioning

> **Status:** ✅ Active
> **Last Updated:** 2026-06-23
> **Upstream:** [Lab ecosystem guidebook](../lab-ecosystem/README.md) · [01 · factory](../lab-ecosystem/01-factory.md)
> **Related:** [safe-prod-like-environment ADR](../../ADR/0001-safe-prod-like-environment.md) · [safe-prod-like-environment PRD](../../PRD/safe-prod-like-environment/README.md)

This guidebook is the deep dive into **how the `factory` repo turns three Raspberry Pis + a handful of cloud accounts into the running lab.** Where the [lab-ecosystem](../lab-ecosystem/README.md) map shows *which* components exist and how they join, this guidebook drills into the two provisioning **engines** that build and maintain them: the Ansible collection that the operator runs from the Mac, and the OpenTofu modules that Gitea CI applies. Every page below describes the engine *as it is wired right now* — playbook imports, role responsibilities, inventory placement, provider versions, state backends, and the CI flow that ties Tofu to Vault.

## Two engines, two trigger models

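The factory splits provisioning along a hard line: **imperative, operator-driven host/cluster build** (Ansible) versus **declarative, CI-driven forge/cloud/database state** (OpenTofu). The two engines never manage the same resource, even where both touch the same service: Ansible installs Gitea, Vault, and PostgreSQL, while OpenTofu manages objects inside them (Gitea action-secrets, Vault KV/policies, per-app databases). They also run at different moments.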

| Engine | Trigger | Runs from | Owns | Lives at |
|---|---|---|---|---|
| **Ansible** | One-shot, operator-run on demand | The Mac (control node) | The cluster + base layer + stateful services: k3s, Longhorn, Pi-hole, step-ca, PostgreSQL, Gitea, Vault, CrowdSec — plus the disaster-recovery playbooks | [`ansible/`](../../../ansible/) → [sub-hub](ansible/README.md) |
| **OpenTofu** | CI-applied on Gitea (path-filtered `push`/`pull_request` + `workflow_dispatch`) | Gitea act-runners | Forge/cloud edge state (Cloudflare, OVH, GCP, Gitea, Vault) and **per-app PostgreSQL databases** | [`iac/`](../../../iac/) + [`postgres/`](../../../postgres/) → [sub-hub](opentofu/README.md) |

> [!NOTE]
> Ansible is **imperative and human-gated** because it touches bare hosts and one-time bootstrap (disk prep, k3s install, Vault init). OpenTofu is **declarative and machine-gated** because its targets are reconcilable API objects (a DNS record, a bucket, a database) whose desired state belongs in version control and converges on every merge.

## How a green-field lab comes up

```mermaid
%%{init: {'theme': 'base'}}%%
flowchart LR
classDef op fill:#1e3a8a,stroke:#1e40af,color:#fff
classDef eng fill:#059669,stroke:#047857,color:#fff
classDef host fill:#7c3aed,stroke:#6d28d9,color:#fff
classDef store fill:#b45309,stroke:#92400e,color:#fff

OP["Operator<br>at the Mac"]:::op -->|"runs playbooks 01→05"| ANS["Ansible collection<br>arcodange.factory"]:::eng
ANS -->|"OS · k3s · Longhorn · base layer"| PIS["3× Raspberry Pi<br>pi1 / pi2 / pi3"]:::host
PIS -->|"hosts Gitea + act-runners"| CI["Gitea CI<br>act-runners"]:::store
CI -->|"path-filtered apply"| TOFU["OpenTofu<br>iac/ + postgres/iac/"]:::eng
TOFU -->|"forge · cloud · PG state"| EDGE["Cloudflare · OVH · GCP<br>Gitea · Vault · PostgreSQL"]:::store
TOFU -. "state in GCS gs://arcodange-tf" .- EDGE
```

1. The **operator**, working from the **Mac control node**, runs the numbered Ansible playbooks `01_system` → `05_backup` in order.
2. **Ansible** lays the OS, k3s (`v1.34.3+k3s1`), Longhorn, and the base layer (Pi-hole, step-ca, Vault, CrowdSec) plus the stateful out-of-cluster services (PostgreSQL + Gitea) onto the **three Raspberry Pis** (`pi1`/`pi2`/`pi3`).
3. Once `pi2` is hosting **Gitea** and `pi1`/`pi3` are running the **act-runners** (registered by `03_cicd`), the forge can run CI.
4. A push or merge to `factory` that touches `iac/**` or `postgres/**` triggers the corresponding **Gitea CI** workflow on those runners.
5. The CI job authenticates to Vault via Gitea OIDC JWT and runs **OpenTofu**, which reconciles the **forge/cloud/database edge** — Cloudflare, OVH, GCP, Gitea action-secrets, Vault KV/policies, and the per-app PostgreSQL objects.
6. All OpenTofu state is kept in **GCS** under `gs://arcodange-tf` (prefix `factory/main` for the cloud edge, `factory/postgres` for the databases), so each CI run reads and writes the authoritative state remotely.
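
Steps 4 and 6 can be read as a path-to-stack lookup. The sketch below is an illustrative model only, not the Gitea workflow itself: the `STACKS` table, the `stacks_for` helper, and the sample file names are hypothetical; only the path filters, the bucket, and the two state prefixes come from the list above.

```python
from fnmatch import fnmatch

BUCKET = "gs://arcodange-tf"

# Path filter -> state prefix, as described in steps 4 and 6.
# This dictionary is an assumption made for illustration, not a repo file.
STACKS = {
    "iac/**": "factory/main",
    "postgres/**": "factory/postgres",
}


def stacks_for(changed_paths):
    """Return the state locations of every stack touched by a change set."""
    hit = set()
    for path in changed_paths:
        for pattern, prefix in STACKS.items():
            # fnmatch's "*" also matches "/", so "iac/**" covers nested files.
            if fnmatch(path, pattern):
                hit.add(f"{BUCKET}/{prefix}")
    return sorted(hit)


# Hypothetical change set: one Tofu file and one unrelated file.
print(stacks_for(["iac/dns.tf", "README.md"]))
```

A change that touches neither filter yields an empty list, which mirrors a push that triggers no apply.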

## Master index

| Sub-hub | What it maps | Status |
|---|---|---|
| [Ansible](ansible/README.md) | The `arcodange.factory` collection: numbered playbooks `01`–`06`, the inventory + group_vars, and the reusable roles that build hosts, the cluster, and the stateful services | ✅ Active |
| [OpenTofu](opentofu/README.md) | The CI-applied IaC: the cloud/forge edge (`iac/`), the per-app PostgreSQL provisioning (`postgres/iac/`), and the Gitea-OIDC → Vault apply flow | ✅ Active |

### All pages

- **Ansible**
  - [System (`01`)](ansible/01-system.md) — OS, DNS, SSL, disks, Docker, iSCSI, k3s, CoreDNS, cert-issuer, Longhorn/Traefik config
  - [Setup (`02`)](ansible/02-setup.md) — PostgreSQL + Gitea docker-compose on `pi2` (and the optional backup-NFS share)
  - [CI/CD (`03`)](ansible/03-cicd.md) — Gitea act-runner registration on `pi1`/`pi3`; the ArgoCD/Image-Updater install belongs to this stage by design but is not currently enabled (see that page's Gotchas)
  - [Tools (`04`)](ansible/04-tools.md) — Vault + CrowdSec bootstrap into the cluster
  - [Backup (`05`)](ansible/05-backup.md) — scheduled PostgreSQL / Gitea / k3s-PVC backups to `/mnt/backups`
  - [Recover (`06`)](ansible/06-recover.md) — the Longhorn disaster-recovery playbooks (`recover/`)
  - [Inventory & variables](ansible/inventory.md) — `hosts.yml` groups and the `group_vars` tree
  - [Roles reference](ansible/roles.md) — `deploy_docker_compose`, the `gitea_*` family, `traefik_certs`, `playwright`, and the service sub-roles
- **OpenTofu**
  - [factory iac](opentofu/factory-iac.md) — `iac/`: Cloudflare/OVH/GCP/Gitea/Vault edge + the `cloudflare_token` module
  - [postgres iac](opentofu/postgres-iac.md) — `postgres/iac/`: per-app databases, roles, and the pgbouncer `user_lookup()` function
  - [CI apply flow](opentofu/ci-apply-flow.md) — the Gitea workflows, OIDC-JWT → Vault auth, and the GCS state backend

## Maintenance rule

> [!IMPORTANT]
> **Alter a documented component → update its page in the same change.** If you change a playbook, a role, an inventory entry, a provider version, a Tofu resource, or the CI flow, the matching page in this guidebook MUST be edited in the same PR. A provisioning map that drifts from the code sends operators (and agents) down dead paths during a rebuild or a recovery — exactly when the map matters most.

## Why this guidebook earns its keep

The safe-prod-like-environment work rehearses **exactly these playbooks and Tofu modules** in a throwaway sandbox before they touch the real lab: the sandbox stands up the same `01`–`05` narrative and runs the same `iac/` + `postgres/iac/` apply, so the rehearsal only holds if this guidebook tracks the engines faithfully. See the [safe-prod-like-environment ADR](../../ADR/0001-safe-prod-like-environment.md) for the decision and the [PRD](../../PRD/safe-prod-like-environment/README.md) (with its [QA strategy](../../PRD/safe-prod-like-environment/qa-strategy.md)) for what the sandbox must reproduce.

## Cross-references

- [Lab ecosystem guidebook](../lab-ecosystem/README.md) — the higher-altitude whole-lab map; this guidebook is its provisioning deep dive.
- [01 · factory](../lab-ecosystem/01-factory.md) — the four-pillar summary of the `factory` repo that this guidebook expands.
- [secrets-and-vault.md](../lab-ecosystem/secrets-and-vault.md) — Gitea OIDC JWT for Tofu/CI and the dynamic PostgreSQL credentials these engines set up.
- [storage-and-recovery.md](../lab-ecosystem/storage-and-recovery.md) — Longhorn + GCS backup + the power-cut recovery the `06 · recover` playbooks serve.
- [naming-conventions.md](../lab-ecosystem/naming-conventions.md) — the `<app>` join key shared by the OpenTofu state prefixes and per-app PostgreSQL objects.
- [safe-prod-like-environment ADR](../../ADR/0001-safe-prod-like-environment.md) · [PRD](../../PRD/safe-prod-like-environment/README.md) — the sandbox that rehearses these engines before they touch the real lab.
94
vibe/guidebooks/factory-provisioning/ansible/01-system.md
Normal file
@@ -0,0 +1,94 @@
[vibe](../../../README.md) > [Guidebooks](../../README.md) > [Factory provisioning](../README.md) > [Ansible](README.md) > **01 · System**

# 01 · System — base OS, Docker, K3s, Longhorn, DNS, SSL

> [!NOTE]
> **Status:** ✅ active · **Last Updated:** 2026-06-23
> **Upstream:** [Ansible sub-hub](README.md) · [Factory provisioning hub](../README.md)
> **Downstream:** [02 · Setup](02-setup.md) · [03 · CI/CD](03-cicd.md)
> **Related:** [Storage & recovery](../../lab-ecosystem/storage-and-recovery.md) · [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) · [Naming conventions](../../lab-ecosystem/naming-conventions.md) · [ADR-0001 safe prod-like environment](../../../ADR/0001-safe-prod-like-environment.md)

## What it does

`01 · System` takes three bare Raspberry Pis (`pi1`, `pi2`, `pi3`) and turns them into a configured K3s cluster. The wrapper [`playbooks/01_system.yml`](../../../../ansible/arcodange/factory/playbooks/01_system.yml) does nothing but `import_playbook` the stage orchestrator [`playbooks/system/system.yml`](../../../../ansible/arcodange/factory/playbooks/system/system.yml), which in turn imports ten sub-playbooks **in strict order**. Each sub-play layers one capability: hostname/DNS hygiene, Pi-hole HA DNS, the step-ca PKI, the external backup disk, Docker, the iSCSI/dm-crypt prerequisites for Longhorn, K3s itself, CoreDNS forwarding, the cert-manager issuer, and finally the cluster config (Longhorn + Traefik).

All host-facing plays target `raspberries:&local` — the intersection of the `raspberries` group and the `local` group, which resolves to `pi1`/`pi2`/`pi3` (see [Inventory & variables](inventory.md)). The K3s server/agent split is decided at runtime: the **first host (alphabetically) becomes the server**, the rest become agents.
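
The runtime server/agent split described above can be modeled in a few lines. This is a minimal sketch of the rule as stated (first host in sorted order becomes the server), not the playbook's actual task; `split_k3s_roles` is a hypothetical helper name.

```python
def split_k3s_roles(hosts):
    """First host in sorted order becomes the server; the rest are agents."""
    ordered = sorted(hosts)
    if not ordered:
        raise ValueError("at least one host is required")
    return {"server": ordered[:1], "agent": ordered[1:]}


# The three Pis named in this guidebook, deliberately given out of order.
roles = split_k3s_roles(["pi3", "pi1", "pi2"])
print(roles)  # {'server': ['pi1'], 'agent': ['pi2', 'pi3']}
```

One consequence worth keeping in mind: under this rule, renaming or adding a host that sorts before `pi1` would change which node is picked as the server.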

## Ordered steps

| # | Sub-playbook | Purpose | Key vars / versions |
| --- | --- | --- | --- |
| 1 | [`system/rpi.yml`](../../../../ansible/arcodange/factory/playbooks/system/rpi.yml) | Set each node's hostname to its `inventory_hostname`. On Pi-hole nodes (`pi1`/`pi3`) add `dnsmasq` to the `dip` group, then **stop & disable `dnsmasq`** to free port 53 for `pihole-FTL`. | `tags: never` (opt-in only) |
| 2 | [`dns/dns.yml`](../../../../ansible/arcodange/factory/playbooks/dns/dns.yml) → [`dns/pihole.yml`](../../../../ansible/arcodange/factory/playbooks/dns/pihole.yml) | Install & configure **Pi-hole HA DNS** via the `pihole` role. Adds custom records mapping `.arcodange.lab` and `.arcodange.duckdns.org` to `pi1`. | `pihole_custom_dns` → `pi1.preferred_ip` |
| 3 | [`ssl/ssl.yml`](../../../../ansible/arcodange/factory/playbooks/ssl/ssl.yml) → [`ssl/step-ca.yml`](../../../../ansible/arcodange/factory/playbooks/ssl/step-ca.yml) | Install **step-ca** (the `step_ca` role) on all three Pis; fetch the root CA from `pi1`; build a **Gitea runner image that trusts the CA** (`runner-images:ubuntu-latest-ca`) and push it to the registry. | `step_ca_primary: pi1`, root at `/home/step/.step/certs/root_ca.crt` |
| 4 | [`system/prepare_disks.yml`](../../../../ansible/arcodange/factory/playbooks/system/prepare_disks.yml) | Auto-detect the largest external (non-`mmcblk0`) USB partition, format it **ext4 with label `arcodange_500`**, mount at `/mnt/arcodange`, and persist in `fstab`. Skips format if the label already exists. **`pause` confirm before any format.** | `mount_point: /mnt/arcodange`, `disk_label: arcodange_500` |
| 5 | [`system/system_docker.yml`](../../../../ansible/arcodange/factory/playbooks/system/system_docker.yml) | Install Docker via `geerlingguy.docker`; write `daemon.json` with **json-file logging** (`max-size 10m`, `max-file 5`) and **`data-root: /mnt/arcodange/docker`** (only when the external disk is mounted). | `tags: never`; `storage-driver: overlay2` |
| 6 | [`system/iscsi_longhorn.yml`](../../../../ansible/arcodange/factory/playbooks/system/iscsi_longhorn.yml) | Install `open-iscsi` (+ enable `iscsid`) and `cryptsetup`, and load the **`dm_crypt`** kernel module (persisted in `/etc/modules`) — Longhorn's encrypted-volume prerequisites. Creates `/mnt/arcodange/longhorn`. | module `dm_crypt` |
| 7 | [`system/system_k3s.yml`](../../../../ansible/arcodange/factory/playbooks/system/system_k3s.yml) | Build the K3s inventory dynamically (first sorted host → `server`, rest → `agent`), install the `k3s-ansible` content, run `k3s.orchestration.site`, then **fetch the kubeconfig** to `~/.kube/config` (rewriting `127.0.0.1` → server IP). | **k3s `v1.34.3+k3s1`**; server args `--docker --disable traefik` |
| 8 | [`system/k3s_dns.yml`](../../../../ansible/arcodange/factory/playbooks/system/k3s_dns.yml) | Create the **`coredns-custom`** ConfigMap so cluster DNS forwards `arcodange.lab:53` to the Pi-hole IPs; also patch the main CoreDNS Corefile to forward to the same HA Pi-holes. | `pihole_ips` (extracted from hostvars) |
| 9 | [`system/k3s_ssl.yml`](../../../../ansible/arcodange/factory/playbooks/system/k3s_ssl.yml) | Deploy **cert-manager** + **step-issuer** as k3s static HelmCharts; create the `StepClusterIssuer` `step-ca` wired to the JWK provisioner and root CA. | cert-manager `v1.19.2`, step-issuer `1.9.11`, `caUrl: https://ssl-ca.arcodange.lab:8443`, **ARM64 `kube-rbac-proxy` override** |
| 10 | [`system/k3s_config.yml`](../../../../ansible/arcodange/factory/playbooks/system/k3s_config.yml) | Deploy **Longhorn** + **Traefik** as HelmCharts; issue the wildcard cert, set the default `TLSStore`, wire Gitea, the IP-allow-list middleware, and the CrowdSec bouncer plugin; then **delete the old Traefik** to force a redeploy. | Longhorn `v1.9.1`, Traefik `v37.4.0` (see detail below) |
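
The kubeconfig rewrite in step 7 amounts to a string substitution. A minimal sketch, assuming the fetched file carries K3s's default `https://127.0.0.1:6443` server entry; the helper name, the trimmed sample document, and the `192.0.2.10` address are hypothetical, and the playbook's own task may be implemented differently.

```python
def point_kubeconfig_at(kubeconfig_text: str, server_ip: str) -> str:
    """Replace the loopback API endpoint with the K3s server's address."""
    return kubeconfig_text.replace("https://127.0.0.1:", f"https://{server_ip}:")


# Trimmed-down stand-in for the fetched kubeconfig (illustrative only).
sample = (
    "clusters:\n"
    "- cluster:\n"
    "    server: https://127.0.0.1:6443\n"
    "  name: default\n"
)

print(point_kubeconfig_at(sample, "192.0.2.10"))
```

Without this rewrite, `kubectl` on the control node would try to reach an API server on its own loopback interface.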

## How the stages fit together

```mermaid
%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'13px'}}}%%
flowchart TD
classDef host fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb;
classDef cluster fill:#1e4032,stroke:#22c55e,color:#f0fdf4;
classDef danger fill:#5f1e1e,stroke:#ef4444,color:#fef2f2;

rpi["1 · rpi.yml<br>hostname + dnsmasq off"]:::host
dns["2 · pihole<br>HA DNS"]:::host
ssl["3 · step-ca<br>root CA + CA-trusting runner image"]:::host
disk["4 · prepare_disks.yml<br>ext4 arcodange_500 -> /mnt/arcodange"]:::danger
docker["5 · system_docker.yml<br>data-root on external disk"]:::host
iscsi["6 · iscsi_longhorn.yml<br>open-iscsi + dm_crypt"]:::host
k3s["7 · system_k3s.yml<br>k3s v1.34.3 (--disable traefik)"]:::cluster
cdns["8 · k3s_dns.yml<br>coredns-custom -> Pi-hole"]:::cluster
cmgr["9 · k3s_ssl.yml<br>cert-manager + step-issuer"]:::cluster
cfg["10 · k3s_config.yml<br>Longhorn + Traefik + redeploy"]:::cluster

rpi --> dns --> ssl --> disk --> docker --> iscsi --> k3s --> cdns --> cmgr --> cfg
```

1. **`rpi.yml`** fixes the hostname and, on Pi-hole nodes, stops `dnsmasq` so `pihole-FTL` can own port 53.
2. **Pi-hole** comes up as the HA DNS authority for `arcodange.lab`.
3. **step-ca** is installed; its root CA is fetched and baked into a Gitea runner image so CI can trust internal TLS.
4. **`prepare_disks.yml`** formats and mounts the external USB disk at `/mnt/arcodange` (with a confirmation pause).
5. **Docker** installs with its data-root pointed at that disk and capped logging.
6. **iSCSI + dm_crypt** prerequisites land so Longhorn can attach (and encrypt) volumes.
7. **K3s** installs with the first host as server, Docker as the container runtime, and Traefik disabled.
8. **CoreDNS** is reconfigured to forward `arcodange.lab` to the Pi-holes.
9. **cert-manager + step-issuer** wire the in-cluster issuer to step-ca.
10. **`k3s_config.yml`** deploys Longhorn and a fully-customized Traefik, then deletes the old Traefik so the helm-controller redeploys with the new config.

## `k3s_config.yml` — Longhorn & Traefik detail

| Resource | Value | Notes |
| --- | --- | --- |
| Longhorn HelmChart | `v1.9.1` | `defaultSettings.defaultDataPath: /mnt/arcodange/longhorn` — volumes live on the external disk. |
| Traefik HelmChart | `v37.4.0` | Deployed as a k3s static manifest (`traefik-v3.yaml`) with an inline `traefik-configmap`. |
| Wildcard cert | `wildcard-arcodange-lab` | `Certificate` for `arcodange.lab` + `*.arcodange.lab`, issued by the `step-issuer` `StepClusterIssuer`. |
| `TLSStore` `default` | `defaultCertificate: wildcard-arcodange-lab` | Makes the wildcard cert the cluster-wide default. |
| Gitea exposure | `gitea-external` `ExternalName` Service → `pi2` port 3000 | Gitea runs **outside** K3s as Docker Compose on `pi2`; Traefik routes `gitea.arcodange.lab` to it. |
| `localIp` middleware | `ipAllowList` | Restricts dashboard/Gitea routers to LAN + pod CIDR + the detected public IP. |
| CrowdSec bouncer | plugin `v1.3.3` | Traefik experimental plugin `crowdsec-bouncer-traefik-plugin` (config completed in [04 · Tools](04-tools.md)). |
| DuckDNS token | `traefik-duckdns-token` Secret → `DUCKDNS_TOKEN` | Consumed by the `letsencrypt` ACME DNS-challenge resolver via `envFrom`. |
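
The effect of the `localIp` `ipAllowList` middleware can be illustrated as a source-range check. This is a conceptual sketch, not Traefik's implementation, and every address below is an assumption: `192.168.1.0/24` stands in for the LAN, `10.42.0.0/16` is K3s's default pod CIDR (the lab's may differ), and `203.0.113.7/32` stands in for the detected public IP.

```python
from ipaddress import ip_address, ip_network

# Hypothetical ranges standing in for LAN + pod CIDR + detected public IP.
ALLOWED = ["192.168.1.0/24", "10.42.0.0/16", "203.0.113.7/32"]


def is_allowed(client_ip, allowed_ranges=ALLOWED):
    """Return True when the client address falls inside any allowed range."""
    addr = ip_address(client_ip)
    return any(addr in ip_network(cidr) for cidr in allowed_ranges)


print(is_allowed("10.42.3.9"))      # pod traffic -> True
print(is_allowed("198.51.100.4"))   # unknown external address -> False
```

Because the public IP is detected when the play runs, a change of ISP address would leave the allow-list stale until `k3s_config.yml` is re-run.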

## Gotchas

> [!CAUTION]
> **Step 4 formats a disk — data loss is real.** `prepare_disks.yml` picks the **largest non-system partition** and runs `mkfs.ext4 -F` on it when the `arcodange_500` label is absent. The `run_once` `pause` prompt ("tapez 'oui' pour continuer") is the only guard, and a wrong USB stick plugged into the wrong Pi will be wiped. Confirm `target_device` in the debug output before answering. If a candidate already carries the label, the format is skipped and the disk is only (re)mounted.
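
The selection rule in this caution can be sketched as a small decision function. It is a simplified model built on stated assumptions, not the playbook's tasks: the `Partition` type, the `plan_disk` helper, and the device names are hypothetical, and the precedence given to an already-labelled partition is an interpretation of "the format is skipped".

```python
from dataclasses import dataclass


@dataclass
class Partition:
    device: str
    size_bytes: int
    label: str = ""


def plan_disk(partitions, system_disk="mmcblk0", wanted_label="arcodange_500"):
    """Pick the target partition and say whether it would be formatted."""
    candidates = [p for p in partitions if not p.device.startswith(system_disk)]
    if not candidates:
        return None, "nothing-to-do"
    # Assumption: a partition that already carries the label is reused as-is.
    for part in candidates:
        if part.label == wanted_label:
            return part, "mount-only"
    # Otherwise the largest non-system partition is the one that gets wiped.
    target = max(candidates, key=lambda p: p.size_bytes)
    return target, "format-after-confirmation"


disks = [
    Partition("mmcblk0p2", 32_000_000_000),
    Partition("sda1", 500_000_000_000),
    Partition("sdb1", 16_000_000_000),
]
print(plan_disk(disks))
```

The model makes the risk concrete: with no labelled partition present, whichever external partition is largest becomes the format target, whatever it contains.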

> [!WARNING]
> **K3s is installed with `--disable traefik`.** The Traefik that K3s normally bundles is intentionally turned off in step 7 so step 10 can deploy its own fully-customized `v37.4.0`. If you re-enable the bundled Traefik or run `k3s_config.yml` out of order, two Traefiks will fight over the ingress ports.

> [!WARNING]
> **ARM64 needs the `kube-rbac-proxy` image override.** step-issuer's default `gcr.io/kubebuilder/kube-rbac-proxy:v0.15.0` is AMD64-only and **crash-loops on `pi3` (ARM64)**. `k3s_ssl.yml` overrides it to `quay.io/brancz/kube-rbac-proxy:v0.15.0`. Do not remove this override.

> [!WARNING]
> **Traefik is force-redeployed.** The last play of `k3s_config.yml` deletes the `traefik` Deployment **and** the `helm-install-traefik` Job so the k3s helm-controller re-runs the install against the new manifest. Expect a brief ingress outage during this window; the play then waits for the new Deployment to come back before finishing.

> [!NOTE]
> **`tags: never` plays are opt-in.** `rpi.yml` and `system_docker.yml` carry `tags: never`, so a normal invocation skips them. They run only when you explicitly request a tag the play carries (`never` itself, or any additional tag defined alongside it), for example `--tags never`. `--tags all` does not select them; use `--tags all,never` to run everything. The K3s/Longhorn/Traefik plays run on a normal invocation.
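
How Ansible treats the special `never` tag can be modeled as a small predicate. This is a deliberately simplified sketch of the tag-selection rule, not Ansible's code: it ignores `always`, `tagged`, `untagged`, and `--skip-tags`, and `play_runs` is a hypothetical helper name.

```python
def play_runs(play_tags, requested=None):
    """Simplified model: does a play with these tags run for a --tags value?"""
    play_tags = set(play_tags)
    requested = set(requested or ["all"])  # no --tags behaves like "all"
    explicit = requested - {"all"}
    if "never" in play_tags:
        # A never-tagged play runs only when one of its own tags is requested.
        return bool(play_tags & explicit)
    return "all" in requested or bool(play_tags & explicit)


print(play_runs(["never"]))                    # False: normal invocation
print(play_runs(["never"], ["all"]))           # False: "all" excludes never
print(play_runs(["never"], ["never"]))         # True: explicit opt-in
print(play_runs([], ["all", "never"]))         # True: untagged plays still run
```

The second line is the case that surprises people: asking for `all` is the same as asking for nothing, so opt-in plays stay skipped.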
82
vibe/guidebooks/factory-provisioning/ansible/02-setup.md
Normal file
@@ -0,0 +1,82 @@
[vibe](../../../README.md) > [Guidebooks](../../README.md) > [Factory provisioning](../README.md) > [Ansible](README.md) > **02 · Setup**

# 02 · Setup — Postgres, Gitea, NFS backup target

> [!NOTE]
> **Status:** ✅ active · **Last Updated:** 2026-06-23
> **Upstream:** [Ansible sub-hub](README.md) · [01 · System](01-system.md)
> **Downstream:** [03 · CI/CD](03-cicd.md)
> **Related:** [Inventory & variables](inventory.md) · [Roles reference](roles.md) · [Storage & recovery](../../lab-ecosystem/storage-and-recovery.md) · [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md)

## What it does

`02 · Setup` deploys the **stateful services the rest of the platform leans on**: a PostgreSQL server and a Gitea instance — both running as **Docker Compose stacks on `pi2`, outside K3s** — plus the in-cluster NFS backup target. The wrapper [`playbooks/02_setup.yml`](../../../../ansible/arcodange/factory/playbooks/02_setup.yml) imports [`playbooks/setup/setup.yml`](../../../../ansible/arcodange/factory/playbooks/setup/setup.yml), which pings the Pis, then imports three sub-playbooks: `backup_nfs.yml` (tagged `never`), `postgres.yml`, and `gitea.yml`.

> [!IMPORTANT]
> **Postgres and Gitea do not run in Kubernetes.** They are Docker Compose stacks on `pi2` (the sole member of the `postgres` group, which `gitea` inherits as a child — see [Inventory & variables](inventory.md)). K3s only references them: Traefik exposes Gitea via an `ExternalName` Service, and the `pg-fix-table-ownership` CronJob reaches Postgres over the LAN. This keeps the two services available even when the cluster is being rebuilt.

## Ordered steps

| # | Sub-playbook | Purpose | Key vars / versions |
| --- | --- | --- | --- |
| 1 | [`setup/backup_nfs.yml`](../../../../ansible/arcodange/factory/playbooks/setup/backup_nfs.yml) | Provision the shared backup volume: a **Longhorn RWX PVC `backups-rwx` (50Gi)**, a Longhorn `RecurringJob`, a `busybox` deploy to spawn the share-manager, then mount the resulting NFS share at `/mnt/backups` on every Pi. | `tags: never`; `backup_size: 50Gi`, RecurringJob `thrice-a-month-backup` (`cron 0 5 */2 * *`, which reads as 05:00 on every second day of the month despite the job's name; retain 2) |
| 2 | [`setup/postgres.yml`](../../../../ansible/arcodange/factory/playbooks/setup/postgres.yml) | Deploy the Postgres Compose stack (`deploy_docker_compose` + `deploy_postgresql` role), create the `gitea` DB/user, create the **pgbouncer auth_user + `user_lookup()` functions** in both `postgres` and `gitea` DBs, publish the K8s Secret `postgres-admin-credentials`, and install the **`pg-fix-table-ownership` CronJob**. | **Postgres `16.3-alpine`**; container `postgres`; CronJob daily `0 3 * * *` |
| 3 | [`setup/gitea.yml`](../../../../ansible/arcodange/factory/playbooks/setup/gitea.yml) | Deploy the Gitea Compose stack (`deploy_docker_compose` + `deploy_gitea` role), create admin `arcodange`, mint an API token via `gitea_token`, upload the avatar, register the SSH key, create org `arcodange-org`, then **delete the temp token**. | **Gitea `1.25.5`**; base URL `http://pi2:3000` |

## NFS backup target — how the share is born

```mermaid
%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'13px'}}}%%
flowchart TD
classDef cluster fill:#1e4032,stroke:#22c55e,color:#f0fdf4;
classDef host fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb;

pvc["RWX PVC backups-rwx (50Gi)<br>longhorn-system"]:::cluster
rj["RecurringJob thrice-a-month-backup<br>cron 0 5 */2 * *"]:::cluster
dep["busybox Deployment rwx-nfs<br>mounts the PVC"]:::cluster
sm["Longhorn share-manager<br>(spawned by the mount)"]:::cluster
svc["Service nfs-backups-rwx<br>ClusterIP :2049"]:::cluster
mount["mount /mnt/backups on pi1/pi2/pi3<br>NFS vers=4.1"]:::host

pvc --> rj
pvc --> dep --> sm --> svc --> mount
```

1. A **ReadWriteMany Longhorn PVC** (`backups-rwx`, 50Gi) is created in `longhorn-system`.
2. A **`RecurringJob`** is attached to the volume so Longhorn snapshots/backs it up on the `0 5 */2 * *` schedule.
3. A **`busybox` Deployment (`rwx-nfs`)** mounts the PVC — the act of mounting an RWX volume makes Longhorn spawn an **NFS share-manager** pod.
4. A stable **ClusterIP Service** (`nfs-backups-rwx`, port 2049) is created (or reused) to front the share-manager.
5. Each Pi installs `nfs-common` and **mounts the share at `/mnt/backups`** (`vers=4.1`, `nofail`, `x-systemd.automount`), persisted in `fstab`.
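
The `RecurringJob` schedule in step 2 is worth reading field by field. The sketch below expands the day-of-month field `*/2`, assuming Longhorn applies standard five-field cron semantics (minute, hour, day of month, month, day of week); `days_matching_step` is a hypothetical helper, not part of any playbook.

```python
import calendar


def days_matching_step(year, month, step=2):
    """Days of the month selected by '*/step' in the day-of-month cron field."""
    last_day = calendar.monthrange(year, month)[1]
    # A step over the 1..last_day range starts at day 1.
    return list(range(1, last_day + 1, step))


june = days_matching_step(2026, 6)
print(june[:5], "...", len(june), "runs")
```

Under that reading the job fires at 05:00 on odd-numbered days, about fifteen times a month, so the `thrice-a-month-backup` name understates how often it runs.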

## Postgres — what gets created

| Artifact | Where | Purpose |
| --- | --- | --- |
| Compose stack `arcodange_factory` | `pi2` Docker | Runs `postgres:16.3-alpine`, container `postgres`, port `5432`, data under `/home/pi/arcodange/docker_composes/postgres/data`. |
| `gitea` DB + user | inside Postgres | Created by the `deploy_postgresql` role from `applications_databases.gitea` (`gitea_database`). |
| pgbouncer `auth_user` (`pgbouncer_auth`) | `postgres` + `gitea` DBs | Login role used by the [pgbouncer pooler](../../lab-ecosystem/02-tools.md) for SCRAM lookups. |
| `user_lookup(text)` function | `postgres` + `gitea` DBs | `SECURITY DEFINER` function over `pg_shadow`; `EXECUTE` granted only to `pgbouncer_auth`. |
| K8s Secret `postgres-admin-credentials` | `kube-system` | Base64 admin user/password so the in-cluster CronJob can authenticate. |
| CronJob `pg-fix-table-ownership` | `kube-system` | Runs `postgres:16.3` daily at **03:00**; discovers `%_role` roles, derives each DB by stripping `_role`, and re-`ALTER TABLE ... OWNER TO` every public table — repairing ownership after a restore. |

## Gitea — bootstrap sequence

1. **Compose deploy** via `deploy_docker_compose`, then the `deploy_gitea` role wires Gitea to the Postgres DB (host/db/user/password pulled from the compose env).
2. **Admin user** `arcodange` (`arcodange@gmail.com`) is created with `--random-password --admin` if absent.
3. **API token** is minted by the `gitea_token` role and used for the next HTTP calls.
4. **Avatar** upload, **SSH public key** registration (idempotent), and **org `arcodange-org`** (full name "Arcodange") creation + avatar.
5. **Cleanup** — a `post_tasks` invocation of `gitea_token` with `gitea_token_delete: true` removes the temporary token.

## Gotchas

> [!WARNING]
> **The NFS play is `never`-tagged and order-sensitive.** `backup_nfs.yml` only runs when explicitly tagged, and several of its tasks (`Créer PVC RWX`, `Lancer un Deployment pour déclencher NFS`, `Attendre que le pod rwx-nfs soit Running`) are themselves `tags: never`. The RWX volume must already exist for the busybox deploy to spawn the share-manager; running the mount step before the share-manager is `Running` will hang on the `until` retry loop.

> [!WARNING]
> **Postgres lives on `pi2` outside K3s.** Treat it as a single-host service: there is no Postgres pod to `kubectl get`. The cluster only sees the `postgres-admin-credentials` Secret and the `pg-fix-table-ownership` CronJob, both of which reach the DB over the LAN at `pi2:5432`. A `pi2` outage takes Postgres (and Gitea) down regardless of cluster health.

> [!CAUTION]
> **`pg-fix-table-ownership` exists because restores break ownership.** After a Longhorn/data recovery, tables can come back owned by the wrong role and apps lose write access. The daily CronJob silently re-owns every `public` table to the `<db>_role` matching each `%_role` PostgreSQL role. If you add a database whose owning role does **not** follow the `<db>_role` naming convention, this job will not fix it — see [Naming conventions](../../lab-ecosystem/naming-conventions.md).
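
The naming dependency in this caution is easier to see as code. The sketch below models the rule as described (find `%_role` roles, strip the suffix to get the database, re-own its `public` tables); it is not the CronJob's script, and the helper name, role names, and table names are all hypothetical.

```python
SUFFIX = "_role"


def ownership_statements(roles, public_tables):
    """Build (database, SQL) pairs for every role that follows <db>_role."""
    statements = []
    for role in sorted(roles):
        if not role.endswith(SUFFIX):
            continue  # roles outside the convention are never repaired
        database = role[: -len(SUFFIX)]
        for table in public_tables.get(database, []):
            sql = f'ALTER TABLE public."{table}" OWNER TO "{role}";'
            statements.append((database, sql))
    return statements


# Hypothetical roles: one follows the convention, one does not.
roles = ["webapp_role", "legacy_owner"]
tables = {"webapp": ["users", "orders"], "legacy": ["items"]}

for database, sql in ownership_statements(roles, tables):
    print(database, "->", sql)
```

In this example the `legacy` database is skipped entirely, which is the silent failure mode the caution warns about.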

> [!NOTE]
> **The admin password is random and printed once.** Gitea's admin is created with `--random-password`; capture it from the play output (or reset it via `docker exec`) — it is not stored in the inventory. The bootstrap API token is deliberately deleted at the end, so re-running the play re-mints a fresh one.
34
vibe/guidebooks/factory-provisioning/ansible/03-cicd.md
Normal file
@@ -0,0 +1,34 @@
[vibe](../../../README.md) > [Guidebooks](../../README.md) > [Factory provisioning](../README.md) > [Ansible](README.md) > **03 · CI/CD**

# 03 · CI/CD — Gitea Actions runners

> [!NOTE]
> **Status:** ✅ active · **Last Updated:** 2026-06-23
> **Upstream:** [Ansible sub-hub](README.md) · [02 · Setup](02-setup.md)
> **Downstream:** [04 · Tools](04-tools.md)
> **Related:** [Lab ecosystem · 01 factory (ArgoCD caveat)](../../lab-ecosystem/01-factory.md) · [Roles reference](roles.md) · [Inventory & variables](inventory.md)

## What it does

`03 · CI/CD` registers and deploys the **Gitea Actions runner (`act_runner`)** on every Pi that is *not* the Gitea host, so CI jobs have executors. The whole stage is one playbook, [`playbooks/03_cicd.yml`](../../../../ansible/arcodange/factory/playbooks/03_cicd.yml) — there is no stage subdirectory.

It targets `raspberries:&local:!gitea`, i.e. the raspberries that are local **minus** the `gitea` group. Since `gitea` resolves to `pi2`, the runner lands on **`pi1` and `pi3`** (see [Inventory & variables](inventory.md)).
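
The host pattern above is plain set algebra: `:&` intersects and `:!` excludes. A minimal sketch with Python sets, assuming the group memberships shown (the guidebook only states that the intersection resolves to the three Pis and that `gitea` resolves to `pi2`; the exact contents of `raspberries` and `local` are illustrative).

```python
# Assumed group memberships, for illustration only.
GROUPS = {
    "raspberries": {"pi1", "pi2", "pi3"},
    "local": {"pi1", "pi2", "pi3"},
    "gitea": {"pi2"},
}


def resolve_runner_hosts(groups):
    """Model of the pattern raspberries:&local:!gitea."""
    return sorted((groups["raspberries"] & groups["local"]) - groups["gitea"])


print(resolve_runner_hosts(GROUPS))  # ['pi1', 'pi3']
```

If Gitea ever moved to another Pi, the runner placement would follow automatically, because the exclusion is expressed by group rather than by host name.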

## Steps

| # | Task / role | Purpose | Key detail |
| --- | --- | --- | --- |
| 1 | role `arcodange.factory.gitea_token` | Mint a `gitea_api_token` for later API use. | Reused across the collection (see [Roles reference](roles.md)). |
| 2 | `gitea actions generate-runner-token` (delegated to the Gitea host) | Fetch a **runner registration token** by `docker exec`-ing into the `gitea` container. | `delegate_to: groups.gitea[0]` |
| 3 | role `arcodange.factory.deploy_docker_compose` | Render the `act_runner` Compose stack with the registration token, instance URL, runner name, and labels. | image `gitea/act_runner:latest`; labels point at `runner-images:ubuntu-latest-ca` |
| 4 | `community.docker.docker_compose_v2` (down→up loop) | Apply the stack: a `loop: [absent, present]` recreates the runner so token/label changes take effect. | cache dirs under `/mnt/arcodange/gitea-runner-*` |

The runner registers with `GITEA_INSTANCE_URL: http://<gitea-host>:3000`, names itself `arcodange_global_runner_<host>`, and advertises the **`ubuntu-latest` / `ubuntu-latest-ca`** labels — both mapped to the CA-trusting image built back in [01 · System](01-system.md). It mounts the Docker socket and the host CA store (`/etc/ssl/certs`, `/usr/local/share/ca-certificates`) so jobs trust internal TLS, and runs with `insecure: true` against the Gitea TLS endpoint.

## Gotchas

> [!WARNING]
> **ArgoCD is present in design but not deployed.** The factory pipeline intends `03_cicd` to also bring up ArgoCD (the app-of-apps), but **that step is commented out / not currently deployed in-cluster** — this stage only deploys the Gitea runners. Treat ArgoCD as "designed, not live" until the install is enabled. See the [ArgoCD caveat in lab-ecosystem · 01 factory](../../lab-ecosystem/01-factory.md).

> [!WARNING]
> **The registration token is single-use and host-delegated.** Step 2 generates a fresh token every run via the Gitea container, so the runner re-registers on each apply. If the Gitea host (`pi2`) is down, token generation fails and no runner can register.
125
vibe/guidebooks/factory-provisioning/ansible/04-tools.md
Normal file
@@ -0,0 +1,125 @@
|
||||
[vibe](../../../README.md) > [Guidebooks](../../README.md) > [Factory provisioning](../README.md) > [Ansible](README.md) > **04 · Tools**
|
||||
|
||||
# 04 · Tools — Vault + CrowdSec
|
||||
|
||||
> [!NOTE]
|
||||
> **Status:** ✅ active · **Last Updated:** 2026-06-23
|
||||
> **Upstream:** [Ansible sub-hub](README.md) · [Factory provisioning hub](../README.md)
|
||||
> **Downstream:** [Roles reference](roles.md) — deep mechanics of the `hashicorp_vault` and `crowdsec` roles
|
||||
> **Related:** [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) · [05 · Backup](05-backup.md) · [03 · CI/CD](03-cicd.md) · [ADR-0001 safe prod-like environment](../../../ADR/0001-safe-prod-like-environment.md)
|
||||
|
||||
Stage 4 installs the **operational tooling layer** on top of a running cluster: HashiCorp **Vault** (the lab's single secret store) and **CrowdSec** (the WAF/IPS that fronts Traefik). The entry point [`playbooks/04_tools.yml`](../../../../ansible/arcodange/factory/playbooks/04_tools.yml) is a one-line wrapper that imports [`playbooks/tools/tools.yml`](../../../../ansible/arcodange/factory/playbooks/tools/tools.yml), which in turn chains two sub-playbooks — `hashicorp_vault.yml` then `crowdsec.yml`. Both run against `localhost` (they drive the cluster through `kubectl` / `kubernetes.core`, not over SSH to the Pis).
|
||||
|
||||
> [!IMPORTANT]
|
||||
> Vault is the chokepoint of the whole secret model. This page covers **what the playbook orchestrates**; the byte-level role internals (init, unseal, root-token minting, the OpenTofu OIDC backend) live in the [Roles reference](roles.md). Read [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) first for the conceptual model — the two auth backends, the unseal posture, and why there is no secret material in git.
|
||||
|
||||
---
|
||||
|
||||
## What stage 4 deploys
|
||||
|
||||
| Sub-playbook | File | Builds | Role invoked |
| --- | --- | --- | --- |
| Vault | [`tools/hashicorp_vault.yml`](../../../../ansible/arcodange/factory/playbooks/tools/hashicorp_vault.yml) | Initialises + unseals Vault, wires the Gitea OIDC/JWT auth backends via OpenTofu, publishes the `vault_oauth__sh_b64` Gitea Action secret | `hashicorp_vault` |
| CrowdSec | [`tools/crowdsec.yml`](../../../../ansible/arcodange/factory/playbooks/tools/crowdsec.yml) | A `VaultAuth` + `VaultStaticSecret` for the Turnstile captcha keys, a fresh bouncer API key, and the Traefik `crowdsec` middleware | `crowdsec` |
|
||||
|
||||
---
|
||||
|
||||
## Step 1 — `hashicorp_vault.yml`
|
||||
|
||||
### The credential prompt
|
||||
|
||||
The play opens with a single `vars_prompt` for the **Gitea admin password** (`gitea_admin_password`, marked `unsafe: true` so that Ansible does not template the value; a password containing `{` or `%` would otherwise be mangled by Jinja). This is the only interactive input the stage needs; everything else is derived or minted on the fly.
|
||||
|
||||
### Orchestration flow
|
||||
|
||||
```mermaid
%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'14px'}}}%%
flowchart TD
classDef prompt fill:#5f4a1e,stroke:#d97706,color:#fffbeb;
classDef mint fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb;
classDef vault fill:#4c1d95,stroke:#7c3aed,color:#f5f3ff;
classDef revoke fill:#5f1e1e,stroke:#ef4444,color:#fef2f2;

P["vars_prompt:<br/>gitea_admin_password"]:::prompt
T["Mint temp GITEA_ADMIN_TOKEN<br/>(role gitea_token, replace=true)"]:::mint
R["Run hashicorp_vault role:<br/>init · unseal · OIDC backend · gitea secret"]:::vault
D["post_tasks:<br/>delete GITEA_ADMIN_TOKEN"]:::revoke

P --> T --> R --> D
```
|
||||
|
||||
1. **Mint a temporary token.** The `arcodange.factory.gitea_token` role generates a `GITEA_ADMIN_TOKEN` with scopes `write:admin,write:organization,write:repository,write:user` (and `gitea_token_replace: true`, so any stale token of the same name is rotated). It is stashed in the fact `vault_GITEA_ADMIN_TOKEN`.
|
||||
2. **Run the `hashicorp_vault` role.** Invoked with three vars: the Postgres admin credentials (read straight out of the Postgres host's docker-compose `environment` via `hostvars[groups.postgres[0]]`), the `gitea_admin_token` (= the temp token), and the prompted `gitea_admin_password`. The role does the heavy lifting, as described below.
|
||||
3. **Revoke the temporary token.** A `post_tasks` block re-invokes `gitea_token` with `gitea_token_delete: true`, so the admin token never outlives the run.
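
The mint, run, revoke bracket can be sketched in shell. This is only an analogy: the playbook relies on Ansible `post_tasks`, whereas the sketch below uses a shell `trap` to show the intent that the token is removed when the run ends. `mint_token`, `run_role`, and `revoke_token` are hypothetical stand-ins that just print what would happen.

```sh
# Sketch only: stand-ins for the three steps; nothing here talks to Gitea.
mint_token()   { echo "mint GITEA_ADMIN_TOKEN"; }
run_role()     { echo "run hashicorp_vault role"; return "${1:-0}"; }
revoke_token() { echo "revoke GITEA_ADMIN_TOKEN"; }

# The function body is a subshell so the EXIT trap is scoped to one run.
run_stage() (
    trap revoke_token EXIT      # fires when the subshell exits, success or failure
    mint_token
    run_role "${1:-0}"          # optional argument simulates the role's exit status
)

run_stage 0
```

Calling `run_stage 1` simulates a failing role; the revoke line is still printed last, which is the property the stage is after.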
|
||||
|
||||
### What the `hashicorp_vault` role does
|
||||
|
||||
The role's [`tasks/main.yml`](../../../../ansible/arcodange/factory/playbooks/tools/roles/hashicorp_vault/tasks/main.yml) runs a fixed sequence; the OIDC backend setup is wrapped in a `block`/`always` so the freshly minted **root token is always revoked**, even on failure:
|
||||
|
||||
| Phase | Task file | What happens |
| --- | --- | --- |
| **Init** | `init.yml` | First-time only. Checks `vault operator init -status`; if uninitialised, runs `vault operator init` with **1 key share / threshold 1** and writes the keys to `~/.arcodange/cluster-keys.json` (mode `600`). Idempotent on re-run. |
| **Unseal** | `unseal.yml` | Reads `cluster-keys.json` and runs `vault operator unseal` on every server pod. Required on **every reboot** — Vault always restarts sealed. |
| **Root token** | `new_root_token.yml` | Mints a one-shot root token via the `generate-root` OTP/nonce dance (using the unseal key), needed to authenticate the OpenTofu apply. |
| **OIDC backend** | `gitea_oidc_auth.yml` | Drives a Playwright script to register/read the Gitea OAuth app, then runs **OpenTofu in a throwaway Docker volume** to provision the `gitea` (OIDC) + `gitea_jwt` (JWT) auth backends, the admin identity, and the `kvv1` static secrets. Finally writes the `vault_oauth__sh_b64` script to Gitea Actions secrets. |
| **Revoke** | `revoke_token.yml` (in `always`) | Revokes the root token unconditionally. |
|
||||
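
A rough shell sketch of the init phase, assuming the `vault` CLI can reach the server directly. The role itself runs these commands against the Vault server pods, which the sketch leaves out; `VAULT_BIN` and `init_if_needed` are hypothetical names introduced for illustration.

```sh
# Sketch only. VAULT_BIN is a hypothetical override hook so the function
# can be exercised without a real Vault; by default it calls the vault CLI.
VAULT_BIN=${VAULT_BIN:-vault}

# Usage: init_if_needed <keys_file>
init_if_needed() {
    keys_file=$1
    rc=0
    "$VAULT_BIN" operator init -status >/dev/null 2>&1 || rc=$?
    case $rc in
        0) echo "already initialised" ;;
        2) # exit status 2 from "init -status" means: reachable, not initialised
           ( umask 077    # a newly created keys file gets mode 600
             "$VAULT_BIN" operator init -key-shares=1 -key-threshold=1 \
                 -format=json > "$keys_file" )
           echo "initialised" ;;
        *) echo "vault unreachable or erroring" >&2
           return 1 ;;
    esac
}
```

A call such as `init_if_needed ~/.arcodange/cluster-keys.json` would then be safe to repeat: once Vault reports itself initialised, the keys file is left untouched. Unsealing is a separate `vault operator unseal <key>` per server.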
|
||||
> [!IMPORTANT]
|
||||
> The OpenTofu apply runs [`hashicorp_vault.tf`](../../../../ansible/arcodange/factory/playbooks/tools/roles/hashicorp_vault/files/hashicorp_vault.tf) inside an ephemeral Docker volume (`docker volume create` → `tofu init` + `tofu apply` → `docker volume rm`), with the state in a GCS backend (`gs://arcodange-tf`, prefix `tools/hashicorp_vault/gitea_oidc`). The CA certificate is mounted read-only and referenced through `VAULT_CACERT`. The destroy step is commented out by design: this stage provisions, it does not tear down.
|
||||
|
||||
### The `vault_oauth__sh_b64` Gitea secret
|
||||
|
||||
The last act of the role renders [`oidc_jwt_token.sh.j2`](../../../../ansible/arcodange/factory/playbooks/tools/roles/hashicorp_vault/templates/oidc_jwt_token.sh.j2) (an OIDC authorization-code → access-token helper for CI), base64-encodes it, and publishes it as the **org-level** Gitea Action secret `vault_oauth__sh_b64`. Because Gitea Action secrets are scoped per owner, the role then **re-publishes the identical secret to each user-owned namespace** listed in `gitea_secret_propagation_users` — repos under a personal account cannot read org-level secrets. This is what lets a Gitea Actions workflow obtain the OIDC JWT that authenticates to Vault under the `gitea_cicd_<app>` role (the CI half of the [secret model](../../lab-ecosystem/secrets-and-vault.md)).
|
||||
|
||||
> [!CAUTION]
|
||||
> The role has an **off-by-default** `vault_oidc_force_reset` flag. When set, it disables both the `gitea` and `gitea_jwt` auth backends (`vault auth disable`) before re-applying, which **wipes every `gitea_cicd_<app>` per-app JWT role** created by the tools-repo IaC. Leave it `false` unless you are deliberately rebuilding the OIDC backend from scratch (e.g. `bound_issuer` config drift).
|
||||
|
||||
---
|
||||
|
||||
## Step 2 — `crowdsec.yml`
|
||||
|
||||
The CrowdSec sub-playbook is a thin wrapper that runs the `crowdsec` role to bolt a CrowdSec-bouncer middleware onto Traefik. The role's [`tasks/main.yml`](../../../../ansible/arcodange/factory/playbooks/tools/roles/crowdsec/tasks/main.yml) wires three things together.
|
||||
|
||||
| Step | What it creates | Detail |
| --- | --- | --- |
| **Turnstile secret** | `ServiceAccount` + `VaultAuth` + `VaultStaticSecret` in `kube-system` | Authenticates via the Kubernetes auth backend (role `factory_crowdsec_conf`) and pulls the Cloudflare Turnstile keys from `kvv2` path `cms/factory/turnstile` into a K8s Secret (`refreshAfter: 30s`). |
| **Bouncer key** | A CrowdSec LAPI bouncer named `traefik-plugin` | Runs `cscli bouncers add traefik-plugin` inside the LAPI pod; on collision it deletes and re-adds, so the run is repeatable. |
| **Traefik middleware** | A `traefik.io/v1alpha1` `Middleware` named `crowdsec` | Stream mode, captcha provider `turnstile` (site/secret keys from the Turnstile secret), Redis cache, trusted-IP allow-lists. |
|
||||
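
The "add, and on collision delete then re-add" behaviour of the bouncer step can be sketched as below. This is a simplified stand-in, not the role's task: it calls `cscli` directly instead of going through the LAPI pod, and `CSCLI` and `register_bouncer` are hypothetical names.

```sh
# Sketch only. CSCLI is a hypothetical override hook; it defaults to cscli.
CSCLI=${CSCLI:-cscli}

# Usage: register_bouncer <name>
# Prints whatever "cscli bouncers add" prints (which includes the new API key).
register_bouncer() {
    name=$1
    if ! "$CSCLI" bouncers add "$name" 2>/dev/null; then
        # Assume the failure is a name collision: drop the old entry, then retry
        "$CSCLI" bouncers delete "$name"
        "$CSCLI" bouncers add "$name"
    fi
}
```

Because a stale registration is replaced instead of reused, every run ends with a fresh key, which matches the "fresh bouncer API key" the stage advertises.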
|
||||
After applying the middleware the role **cleans up `Failed` CrowdSec pods** and **bounces Traefik** (scale to 0 → back to 1, inside a `block`/`rescue`/`always` that guarantees Traefik returns to 1 replica no matter what) so the new middleware config is loaded.
|
||||
|
||||
> [!NOTE]
|
||||
> The Turnstile keys come from the **CMS-managed** Vault path `cms/factory/turnstile` — they are provisioned outside this stage. CrowdSec only *reads* them here. See [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) for how `VaultStaticSecret` materialises a Vault path into a Kubernetes Secret.
|
||||
|
||||
---
|
||||
|
||||
## Gotchas
|
||||
|
||||
> [!WARNING]
|
||||
> - **Vault must be unsealed before anything secret-dependent recovers.** Stage 4's unseal step reads `~/.arcodange/cluster-keys.json`; if that file is missing, init/unseal cannot proceed and the OpenTofu apply (which needs a live Vault) fails. The same file gates step 2 of the [power-cut recovery order](../../lab-ecosystem/storage-and-recovery.md).
|
||||
> - **Docker is required on the control node.** The OIDC backend provisioning shells out to `docker run … opentofu` and `docker volume`. The Playwright step also runs containerised. A control node without Docker will fail this stage.
|
||||
> - **`gitea_admin_password` is `unsafe`.** Do not strip the `unsafe: true` flag from the prompt — passwords with `{`/`}` are mangled by Jinja templating otherwise.
|
||||
> - **Re-running is safe by default.** Init and unseal are idempotent; the temp admin token and root token are both revoked on the way out. Only `vault_oidc_force_reset` makes a re-run destructive.
|
||||
> - **CrowdSec bounces Traefik.** The middleware step briefly scales Traefik to 0 — expect a short ingress blip during stage 4. The `always` block restores it to 1 even if the scale-down errors.
|
||||
|
||||
---
|
||||
|
||||
## Where stage 4 sits
|
||||
|
||||
```mermaid
%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'14px'}}}%%
flowchart LR
classDef done fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb;
classDef here fill:#4c1d95,stroke:#7c3aed,color:#f5f3ff;
classDef next fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb;

s03["03 · CI/CD"]:::done
s04["04 · Tools<br/>Vault · CrowdSec"]:::here
s05["05 · Backup"]:::next

s03 --> s04 --> s05
```
|
||||
|
||||
1. **03 · CI/CD** registered the `act_runner` executors — a prerequisite, since the `vault_oauth__sh_b64` secret published here is consumed by those CI runners.
|
||||
2. **04 · Tools** (this page) stands up Vault and CrowdSec.
|
||||
3. **05 · Backup** is next — it schedules the cron dumps that protect the state the cluster now holds.
|
||||
107
vibe/guidebooks/factory-provisioning/ansible/05-backup.md
Normal file
@@ -0,0 +1,107 @@
|
||||
[vibe](../../../README.md) > [Guidebooks](../../README.md) > [Factory provisioning](../README.md) > [Ansible](README.md) > **05 · Backup**
|
||||
|
||||
# 05 · Backup — daily cron dumps
|
||||
|
||||
> [!NOTE]
|
||||
> **Status:** ✅ active · **Last Updated:** 2026-06-23
|
||||
> **Upstream:** [Ansible sub-hub](README.md) · [Factory provisioning hub](../README.md)
|
||||
> **Downstream:** [06 · Recover](06-recover.md) — how these dumps are replayed
|
||||
> **Related:** [Storage & recovery](../../lab-ecosystem/storage-and-recovery.md) · [04 · Tools](04-tools.md) · [ADR-0001 safe prod-like environment](../../../ADR/0001-safe-prod-like-environment.md)
|
||||
|
||||
Stage 5 installs three independent **cron-driven backup jobs** that protect the platform's persistent state: the PostgreSQL database, the Gitea instance, and the K3s volume metadata (PV/PVC + Longhorn CRDs). The entry point [`playbooks/05_backup.yml`](../../../../ansible/arcodange/factory/playbooks/05_backup.yml) imports [`playbooks/backup/backup.yml`](../../../../ansible/arcodange/factory/playbooks/backup/backup.yml), which chains the three sub-playbooks, each passing `backup_root_dir: /mnt/backups`.
|
||||
|
||||
Every job follows the **same anatomy**: run a daily cron at **04:00**, write a date-stamped archive to `/mnt/backups/<kind>/`, prune anything older than **3 days**, and drop a matching `restore.sh` next to the backup script. `/mnt/backups` is a Longhorn RWX volume, so Longhorn itself snapshots, replicates, and ships these archives off-site — the cron jobs only produce the dumps.
|
||||
|
||||
> [!NOTE]
|
||||
> All three sub-playbooks **install** scripts and cron entries; they do not run a backup themselves (beyond a one-shot `test backup_cmd` smoke check that pipes to `/dev/null`). The actual backups fire from cron. To read failures, SSH to the host and use `sudo su` → `mails` (see [`backup/README.md`](../../../../ansible/arcodange/factory/playbooks/backup/README.md)).
|
||||
|
||||
---
|
||||
|
||||
## The three jobs
|
||||
|
||||
| Job | Sub-playbook | Host | Backup command | Artifact | Scripts dir |
| --- | --- | --- | --- | --- | --- |
| **Postgres** | [`backup/postgres.yml`](../../../../ansible/arcodange/factory/playbooks/backup/postgres.yml) | `postgres` | `docker exec <pg> pg_dumpall -U <user>` ∣ `gzip` | `backup_YYYYMMDD.sql.gz` | `…/docker_composes/postgres/scripts` |
| **Gitea** | [`backup/gitea.yml`](../../../../ansible/arcodange/factory/playbooks/backup/gitea.yml) | `gitea` | `docker exec -u git <gitea> gitea dump --skip-log --skip-db --skip-package-data --type tar.gz` | `backup_YYYYMMDD.gitea.gz` | `…/docker_composes/gitea/scripts` |
| **K3s PVC** | [`backup/k3s_pvc.yml`](../../../../ansible/arcodange/factory/playbooks/backup/k3s_pvc.yml) | `pi1` | `kubectl get pv,pvc` + `volumes.longhorn.io` + `settings.longhorn.io` (YAML) | `backup_YYYYMMDD.volumes` | `/opt/k3s_volumes` |
|
||||
|
||||
All three share: `keep_days: 3`, cron `minute: 0 hour: 4 user: root`, and `backup_dir: /mnt/backups/<kind>`.
|
||||
|
||||
```mermaid
%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'14px'}}}%%
flowchart TD
classDef cron fill:#5f4a1e,stroke:#d97706,color:#fffbeb;
classDef job fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb;
classDef store fill:#4c1d95,stroke:#7c3aed,color:#f5f3ff;
classDef ship fill:#14532d,stroke:#22c55e,color:#f0fdf4;

C["cron · daily 04:00 · user root"]:::cron
PG["postgres.yml<br/>pg_dumpall ∣ gzip"]:::job
GT["gitea.yml<br/>gitea dump tar.gz"]:::job
PV["k3s_pvc.yml<br/>PV · PVC · Longhorn CRDs"]:::job
D["/mnt/backups/{postgres,gitea,k3s_pvc}/<br/>keep 3 days"]:::store
L["Longhorn:<br/>snapshot · replicate · off-site"]:::ship

C --> PG --> D
C --> GT --> D
C --> PV --> D
D --> L
```
|
||||
|
||||
1. Each job's host runs its own daily **04:00 root cron** entry, which triggers that job's `backup.sh`.
|
||||
2. **postgres.yml** runs `pg_dumpall` through `gzip`, **gitea.yml** streams a `gitea dump` tarball, **k3s_pvc.yml** serialises the volume metadata.
|
||||
3. Each writes a date-stamped archive into `/mnt/backups/<kind>/` and prunes files older than 3 days (`find … -mtime +3 -delete`).
|
||||
4. Because `/mnt/backups` is a Longhorn RWX volume, Longhorn snapshots, replicates across nodes, and ships an off-site copy — no separate upload step in the cron.
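
The shared anatomy can be sketched as one small shell function. This is a simplified stand-in for the generated `backup.sh`, not a copy of it: `backup_and_prune` is a hypothetical name, the `.sql.gz` suffix mirrors the Postgres job only, and the prune uses the same `find … -mtime` idea quoted above.

```sh
# Sketch: write today's date-stamped archive, then prune old ones.
# Usage: backup_and_prune <backup_dir> <keep_days> <dump command...>
backup_and_prune() {
    backup_dir=$1 keep_days=$2
    shift 2
    mkdir -p "$backup_dir"
    # Run the dump command and compress its stdout into backup_YYYYMMDD.sql.gz
    "$@" | gzip > "$backup_dir/backup_$(date +%Y%m%d).sql.gz"
    # Delete archives whose age in whole days exceeds keep_days
    find "$backup_dir" -type f -name 'backup_*' -mtime +"$keep_days" -delete
}

# Demo in a scratch directory, with "echo" standing in for pg_dumpall.
demo_dir=$(mktemp -d)
backup_and_prune "$demo_dir" 3 echo "-- fake dump"
ls "$demo_dir"
```

A real call would pass the dump command from the table, for example `backup_and_prune /mnt/backups/postgres 3 docker exec <pg> pg_dumpall -U <user>`, and the 04:00 schedule corresponds to `0 4 * * *` in crontab syntax. Note that `-mtime +3` only matches files at least four full days old, because `find` truncates the age to whole days.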
|
||||
|
||||
---
|
||||
|
||||
## Job details
|
||||
|
||||
### Postgres — `postgres.yml`
|
||||
|
||||
The backup command is built from the Postgres host's docker-compose facts (`container_name`, `POSTGRES_USER`). `pg_dumpall` captures **all databases plus globals (roles)** in one logical dump, gzipped. The generated `restore.sh` takes an optional `YYYYMMDD` argument (defaults to the latest dump), `docker cp`s it into the container, gunzips, and replays with `psql -f`. If the restore misbehaves, the script reminds you to wipe the data dir before replaying.
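
The "optional `YYYYMMDD`, otherwise latest" selection could look like the sketch below. `pick_dump` is a hypothetical helper written for illustration; it assumes the `backup_YYYYMMDD.sql.gz` naming from the table above, so the shell's lexical glob order is also chronological.

```sh
# Sketch: print the dump to restore, or fail if none matches.
# Usage: pick_dump <backup_dir> [YYYYMMDD]
pick_dump() {
    dir=$1 stamp=${2:-}
    chosen=
    if [ -n "$stamp" ]; then
        chosen="$dir/backup_$stamp.sql.gz"
    else
        # Globs expand in sorted order, so the last existing match is the newest
        for f in "$dir"/backup_*.sql.gz; do
            if [ -f "$f" ]; then chosen=$f; fi
        done
    fi
    if [ -z "$chosen" ] || [ ! -f "$chosen" ]; then
        echo "no matching dump in $dir" >&2
        return 1
    fi
    printf '%s\n' "$chosen"
}
```

With this shape, `pick_dump /mnt/backups/postgres` yields the newest archive and `pick_dump /mnt/backups/postgres 20260101` a specific day, failing loudly when that day has no dump.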
|
||||
|
||||
### Gitea — `gitea.yml`
|
||||
|
||||
The dump runs as the `git` user with `--skip-db` (Postgres is backed up separately by the Postgres job) and `--skip-package-data`, streamed to stdout (`-f -`) so it never lands on the container's own disk. The `restore.sh` unpacks the tarball back into `/data/gitea` (config/data) and `/data/git/repositories` (repos), fixes `git:git` ownership, and **regenerates hooks** (`gitea admin regenerate hooks`) — without that step the restored repos have stale hook paths.
|
||||
|
||||
### K3s PVC — `k3s_pvc.yml`
|
||||
|
||||
This job does **not** back up volume *data* (Longhorn handles the bytes). It backs up the **Kubernetes objects** needed to re-bind those volumes: all `pv` + `pvc`, the **`volumes.longhorn.io` CRDs**, and `settings.longhorn.io`, concatenated into one `.volumes` YAML (`---`-separated). It writes the dump to `/mnt/backups/k3s_pvc/` *and* keeps a copy alongside the script. The `restore.sh` looks in a fallback dir (`/home/pi/arcodange/backups/k3s_pvc`) first and then in the primary dir, picks the latest (or a dated) dump, and `kubectl apply`s it.
|
||||
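
What the dump collects can be approximated with three `kubectl get` calls. This is a sketch under assumptions, not the role's actual `backup_cmd`: `KUBECTL` and `dump_volume_metadata` are hypothetical names, and the `longhorn-system` namespace is assumed to be where the Longhorn objects live.

```sh
# Sketch only. KUBECTL is a hypothetical override hook; it defaults to kubectl.
KUBECTL=${KUBECTL:-kubectl}

# Print one multi-document YAML stream: PV/PVC, Longhorn volumes, Longhorn settings.
dump_volume_metadata() {
    "$KUBECTL" get pv,pvc --all-namespaces -o yaml
    echo '---'
    "$KUBECTL" get volumes.longhorn.io -n longhorn-system -o yaml
    echo '---'
    "$KUBECTL" get settings.longhorn.io -n longhorn-system -o yaml
}
```

Redirecting the output, for example `dump_volume_metadata > "backup_$(date +%Y%m%d).volumes"`, gives a single file that `kubectl apply -f` can replay later.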
|
||||
> [!IMPORTANT]
|
||||
> **Backing up the Longhorn `volumes.longhorn.io` CRDs is what enables *fast* recovery.** With the Volume CRDs in the backup, recovery is a single `kubectl apply` that re-associates the surviving on-disk replicas with their PVs (see [06 · Recover → `longhorn.yml`](06-recover.md)). **Without** the Volume CRDs, a Longhorn reinstall assigns **new engine IDs**, cannot adopt the orphaned replica directories, and you fall through to the slow **block-device data recovery** (`longhorn_data.yml`). The k3s_pvc backup_cmd carries an inline comment to this effect and points at the [Longhorn PVC recovery ADR](../../../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md). This is the prevention half of the [storage failure mode](../../lab-ecosystem/storage-and-recovery.md).
|
||||
|
||||
---
|
||||
|
||||
## Gotchas
|
||||
|
||||
> [!WARNING]
|
||||
> - **3-day retention is tight.** A failure that goes unnoticed for 3 days loses all recoverable history. The off-site Longhorn copy is the longer-horizon safety net — the local `/mnt/backups` files are short-lived.
|
||||
> - **The smoke test runs the real dump.** Each play has a `test backup_cmd` task that executes the backup command (output discarded) at provisioning time. If Postgres/Gitea/kubectl is unreachable when you run stage 5, provisioning fails fast — by design.
|
||||
> - **Cron runs as `root`, scripts live in app dirs.** The `backup.sh`/`restore.sh` are written into the app's docker-compose `scripts/` dir (or `/opt/k3s_volumes`); the cron job invokes them as root. Don't relocate the compose dirs without re-running stage 5.
|
||||
> - **Gitea restore needs the hook regeneration.** Skipping `gitea admin regenerate hooks` leaves repos with broken push hooks — the `restore.sh` already does it, so use the script rather than a manual untar.
|
||||
> - **Postgres and Gitea DB are backed up by *different* jobs.** Gitea dumps with `--skip-db`; its database rows come from the Postgres `pg_dumpall`. Restoring Gitea fully means restoring **both** archives.
|
||||
|
||||
---
|
||||
|
||||
## Where stage 5 sits
|
||||
|
||||
```mermaid
%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'14px'}}}%%
flowchart LR
classDef done fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb;
classDef here fill:#4c1d95,stroke:#7c3aed,color:#f5f3ff;
classDef rec fill:#5f1e1e,stroke:#ef4444,color:#fef2f2;

s04["04 · Tools"]:::done
s05["05 · Backup<br/>Postgres · Gitea · K3s PVC"]:::here
rec["recover/*<br/>(on disaster)"]:::rec

s04 --> s05
s05 -. "feeds restore" .-> rec
```
|
||||
|
||||
1. **04 · Tools** stood up Vault and CrowdSec — the secret store stage 5's dumps help protect.
|
||||
2. **05 · Backup** (this page) is the last linear stage: it schedules the daily dumps.
|
||||
3. The artifacts here are the **input** to the on-demand [06 · Recover](06-recover.md) branch — the `.volumes` dump in particular gates whether recovery is fast (CRDs present) or slow (block-device).
|
||||
149
vibe/guidebooks/factory-provisioning/ansible/06-recover.md
Normal file
@@ -0,0 +1,149 @@
|
||||
[vibe](../../../README.md) > [Guidebooks](../../README.md) > [Factory provisioning](../README.md) > [Ansible](README.md) > **06 · Recover**
|
||||
|
||||
# 06 · Recover — Longhorn disaster recovery
|
||||
|
||||
> [!NOTE]
|
||||
> **Status:** 🟡 beta · **Last Updated:** 2026-06-23
|
||||
> **Upstream:** [Ansible sub-hub](README.md) · [Factory provisioning hub](../README.md)
|
||||
> **Downstream:** [05 · Backup](05-backup.md) — the dumps these playbooks consume
|
||||
> **Related:** [Storage & recovery](../../lab-ecosystem/storage-and-recovery.md) · [PRD — QA strategy](../../../PRD/safe-prod-like-environment/qa-strategy.md) · [Longhorn PVC recovery ADR](../../../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md)
|
||||
|
||||
The `recover/` playbooks are **not** part of the linear `01..05` pipeline — they are an **on-demand disaster-recovery branch**, invoked only after a power cut or data loss. There are two, and which one you run depends on a single question: **do the Longhorn Volume CRDs still exist?**
|
||||
|
||||
> [!IMPORTANT]
|
||||
> **Decision — pick the right playbook before you start:**
|
||||
> - **Volume CRDs still present** (e.g. they were captured by the [05 · Backup k3s_pvc dump](05-backup.md), or never wiped) → run [`recover/longhorn.yml`](../../../../ansible/arcodange/factory/playbooks/recover/longhorn.yml). Fast: it re-applies the CRDs and the surviving on-disk replicas are re-adopted.
|
||||
> - **Volume CRDs are GONE** (a nuclear Longhorn reinstall assigned new engine IDs) but the raw replica `.img` files survive on disk → run [`recover/longhorn_data.yml`](../../../../ansible/arcodange/factory/playbooks/recover/longhorn_data.yml). Slow: it merges replica layers at the block-device level and injects the data into a fresh volume.
|
||||
|
||||
```mermaid
%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'14px'}}}%%
flowchart TD
classDef q fill:#5f4a1e,stroke:#d97706,color:#fffbeb;
classDef fast fill:#14532d,stroke:#22c55e,color:#f0fdf4;
classDef slow fill:#5f1e1e,stroke:#ef4444,color:#fef2f2;
classDef dead fill:#6b7280,stroke:#4b5563,color:#fff;

Q{"Do the Longhorn<br/>Volume CRDs<br/>still exist?"}:::q
F["longhorn.yml<br/>CSI/CRD recovery (fast)"]:::fast
S{"Raw replica<br/>.img files<br/>survive?"}:::q
D["longhorn_data.yml<br/>block-device recovery (slow)"]:::slow
X["Data unrecoverable<br/>(replicas zeroed)"]:::dead

Q -- "yes" --> F
Q -- "no" --> S
S -- "yes" --> D
S -- "no" --> X
```
|
||||
|
||||
1. **CRDs present?** Yes → `longhorn.yml` re-applies the Volume CRDs and the on-disk replicas re-attach. Done fast.
|
||||
2. **CRDs gone?** Then ask whether the raw replica `.img` files survived on disk.
|
||||
3. **Replicas survive?** Yes → `longhorn_data.yml` reconstructs the filesystem at the block level and injects it into a new volume.
|
||||
4. **Replicas zeroed** by Longhorn reconciliation → the data is unrecoverable; there is no playbook for this.
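
The decision above is small enough to write down as a shell function. `choose_recovery` is a hypothetical helper for illustration only; how you answer the two questions (for instance by listing `volumes.longhorn.io` objects, or by inspecting the replica directories on disk) is left to the operator.

```sh
# Sketch: map the two yes/no questions to the playbook to run.
# Usage: choose_recovery <crds_present: yes|no> <replicas_survive: yes|no>
choose_recovery() {
    crds_present=$1 replicas_survive=$2
    if [ "$crds_present" = yes ]; then
        echo "playbooks/recover/longhorn.yml"        # fast CSI/CRD path
    elif [ "$replicas_survive" = yes ]; then
        echo "playbooks/recover/longhorn_data.yml"   # slow block-device path
    else
        echo "unrecoverable: no playbook covers zeroed replicas" >&2
        return 1
    fi
}

choose_recovery no yes
```

The example call models the "CRDs gone, replicas intact" case and prints the block-device playbook path.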
|
||||
|
||||
> [!NOTE]
|
||||
> This branch sits at step 1 of the broader tested startup order — **Longhorn first, then Vault unseal, then VSO re-auth, ERP scaled up last**. The full order, the engine-ID failure mode, and the once-real-once-rehearsed history are in [Storage & recovery](../../lab-ecosystem/storage-and-recovery.md). The single tested-recovery record (1-key/threshold-1 unseal, the four-step order) lives in CLUSTER_RECOVERY.md, kept at the lab root outside this repo.
|
||||
|
||||
---
|
||||
|
||||
## `longhorn.yml` — CSI/CRD recovery (CRDs present)
|
||||
|
||||
Runs against `raspberries:&local` as root. It diagnoses how broken Longhorn is and applies the **least invasive** fix that works, escalating only if needed. Most logic runs `run_once` on `pi1`, delegating cluster reads to `localhost`.
|
||||
|
||||
| Phase | What it does |
| --- | --- |
| **0 · Pre-flight** | Verifies the data dir `/mnt/arcodange/longhorn` exists on `pi1` (fails hard if missing) and that at least one `backup_*.volumes` dump exists in the primary or fallback backup dir. |
| **1 · Diagnosis** | Checks the `longhorn-system` namespace, the `driver.longhorn.io` **CSIDriver** registration, and the `longhorn-manager` pods, then sets `recovery_phase` = `soft` (CSI driver gone), `hard` (managers unhealthy), or `none`. |
| **2 · Soft** | Touches `longhorn-install.yaml` to make k3s reconcile the HelmChart, waits, and checks pods recreate. |
| **3 · Hard** | Force-deletes the `longhorn-driver-deployer` pods so the HelmChart recreates them. |
| **4 · Nuclear** | Full reinstall: delete the HelmChart, strip finalizers off all Longhorn CRs / PVCs / the namespace, delete + redeploy the `longhorn-install` HelmChart manifest (`v1.9.1`, `defaultDataPath` preserved), wait for pods. |
| **5 · Restore** | Waits for managers to be ready, then `kubectl apply`s the latest `backup_*.volumes` dump (PV/PVC + Longhorn CRDs) and any `longhorn_metadata_*.yaml`. |
| **6 · Verify** | Polls until the CSIDriver is registered, ≥3 managers are Running, the CSI socket exists, and the replica data dir is present; prints a summary. |
|
||||
|
||||
> [!IMPORTANT]
|
||||
> Phase 5 is exactly where the [05 · Backup k3s_pvc dump](05-backup.md) pays off: re-applying the captured **Volume CRDs** lets Longhorn re-adopt the surviving replica directories instead of forcing the block-device path. The playbook is **idempotent** — it re-diagnoses and escalates only as far as needed, so re-running after a partial recovery is safe.
|
||||
|
||||
---
|
||||
|
||||
## `longhorn_data.yml` — block-device data recovery (CRDs gone)
|
||||
|
||||
This is the fallback when a nuclear reinstall has destroyed the Volume CRDs and assigned new engine IDs, leaving the real data in **orphaned** replica directories. It bypasses Kubernetes objects entirely and reconstructs the filesystem at the block level. It is **driven by a vars file** — `vars/recovery_volumes.yml`, one entry per volume — and the format is documented in [`longhorn_data_vars.example.yml`](../../../../ansible/arcodange/factory/playbooks/recover/longhorn_data_vars.example.yml).
|
||||
|
||||
```sh
ansible-playbook -i inventory/hosts.yml \
    playbooks/recover/longhorn_data.yml \
    -e @vars/recovery_volumes.yml
```
|
||||
|
||||
```mermaid
%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'14px'}}}%%
flowchart TD
classDef pre fill:#5f4a1e,stroke:#d97706,color:#fffbeb;
classDef merge fill:#4c1d95,stroke:#7c3aed,color:#f5f3ff;
classDef k8s fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb;
classDef done fill:#14532d,stroke:#22c55e,color:#f0fdf4;

P0["Pre-flight + Phase 0:<br/>auto-discover largest replica dir (>16K)"]:::pre
P1["Phase 1: back up untouched replica dir<br/>(safe copy before any op)"]:::merge
P2["Phase 2: merge-longhorn-layers.py<br/>→ single .img · test-mount RO"]:::merge
P3["Phase 3: create Volume CRD<br/>(scale down workload, clear stuck PVCs)"]:::k8s
P5["Phase 5: attach via maintenance ticket<br/>→ /dev/longhorn/<pv>"]:::k8s
P6["Phase 6: mkfs + rsync merged image<br/>into live block device"]:::merge
P8["Phase 8: recreate PV (Retain) + PVC<br/>pinned by volumeName"]:::k8s
P9["Phase 9: scale workload up · verify"]:::done

P0 --> P1 --> P2 --> P3 --> P5 --> P6 --> P8 --> P9
```
|
||||
|
||||
1. **Pre-flight + Phase 0.** Fail fast if no volumes are defined, the merge tool is missing, or Longhorn managers aren't Running. Then **auto-discover** the best replica source for each volume: the **largest dir >16K** across `pi1/pi2/pi3`, skipping any replica still `Rebuilding`. `source_node`/`source_dir` in the vars file override this.
|
||||
2. **Phase 1.** `cp -a` the untouched replica dir to a backup location *before* touching anything, and verify it contains `volume.meta`.
|
||||
3. **Phase 2.** Run `merge-longhorn-layers.py` to collapse the snapshot + head `.img` layers into one image, then test-mount it read-only to confirm the filesystem is sound.
|
||||
4. **Phase 3.** Scale the workload to 0 and clear any stuck `Terminating` PV/PVCs *before* creating a fresh Longhorn `Volume` CRD (order matters — StatefulSet controllers re-provision empty PVCs otherwise).
|
||||
5. **Phase 5.** Attach the volume via a Longhorn `VolumeAttachment` **maintenance ticket** so `/dev/longhorn/<pv>` appears on the source node, with the frontend enabled.
|
||||
6. **Phase 6.** `mkfs.ext4` the live block device if unformatted, then `rsync` the merged recovery image into it (`--ignore-errors`; rsync rc=23 partial-transfer is treated as success for power-cut partitions).
|
||||
7. **Phase 8.** Detach the recovery ticket, recreate the PV (`Retain`, no `claimRef`) and a PVC pinned by `volumeName`, and wait for Bound.
|
||||
8. **Phase 9.** Scale the workload back up, wait for ready replicas, and run the optional per-volume `verify_cmd` inside the pod.
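
The Phase 0 selection rule can be sketched in shell. This is a minimal illustration and not the playbook's actual task: the function name `pick_largest_replica`, the `SIZE_KIB PATH` input format (what `du -sk` prints), and the threshold argument are assumptions, and the `Rebuilding` filter is left out.

```sh
#!/bin/sh
# Hypothetical helper: read "SIZE_KIB PATH" lines on stdin (the format that
# `du -sk` prints) and print the path of the largest entry above MIN_KIB.
# Assumes replica paths contain no whitespace.
pick_largest_replica() {
    min_kib=$1
    awk -v min="$min_kib" '
        $1 + 0 > min + 0 && $1 + 0 > best { best = $1 + 0; path = $2 }
        END { if (path == "") exit 1; print path }
    '
}

# Example with made-up sizes: the 8 KiB stub is ignored, the largest dir wins.
printf '%s\n' \
    '8 /replicas/vol-a-1111' \
    '2097152 /replicas/vol-a-2222' \
    '524288 /replicas/vol-a-3333' \
    | pick_largest_replica 16
```

The function exits non-zero when no candidate clears the threshold, which matches the fail-fast spirit of the pre-flight. An explicit `source_node`/`source_dir` override would simply bypass it.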
|
||||
|
||||
> [!CAUTION]
> The `merge-longhorn-layers.py` tool is invoked **per replica dir via `dmsetup`** to stack the copy-on-write layers correctly. Never recover by simply renaming the orphaned replica directory to the new engine ID — Longhorn reconciliation can pick the *empty* new replica as the rebuild source and **overwrite your data**. The block-device injection is the only proven-safe path. The full method comparison is in the [Longhorn PVC recovery ADR](../../../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md).

> [!NOTE]
> **Tested against the 2026-04-13 power cut.** This block-device path was proven end to end by recovering the **url-shortener's SQLite database** after the power cut forced a nuclear Longhorn reinstall (verified on `2026-04-14` with `sqlite3 … 'SELECT COUNT(*) FROM urls;'`). That scenario is the worked example in [`longhorn_data_vars.example.yml`](../../../../ansible/arcodange/factory/playbooks/recover/longhorn_data_vars.example.yml).
|
||||
|
||||
---
|
||||
|
||||
## Gotchas
|
||||
|
||||
> [!WARNING]
> - **Run `longhorn.yml` first if there is any chance the CRDs survived.** It is fast and idempotent; falling straight to `longhorn_data.yml` is unnecessary block-level work when a `kubectl apply` would have sufficed.
> - **`longhorn_data.yml` needs a healthy Longhorn control plane.** Its pre-flight aborts unless ≥1 `longhorn-manager` is Running — it recovers *data into* a working Longhorn, it does not bring Longhorn back. Use `longhorn.yml` for that.
> - **Process volumes one at a time first.** The example vars file recommends validating a single volume before batching — a misidentified `source_dir` can pin the PVC to the wrong (empty) replica.
> - **`python3` on every node.** Phase 0's replica scan and the merge tool both require `python3` on `pi1/pi2/pi3`.
> - **The merge tool path is repo-relative.** `longhorn_data.yml` resolves `merge-longhorn-layers.py` from `docs/incidents/2026-04-13-power-cut/tools/` and `scp`s it to the source node — run the playbook from inside the collection so that path resolves.
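
The control-plane gotcha can be checked by hand before launching `longhorn_data.yml`. The sketch below is an illustration only: the namespace `longhorn-system` and the label `app=longhorn-manager` are assumed Longhorn defaults, and the function names are hypothetical rather than taken from the playbook.

```sh
#!/bin/sh
# KUBECTL can be overridden (for example with a stub); defaults to kubectl.
KUBECTL=${KUBECTL:-kubectl}

# Count longhorn-manager pods in phase Running.
# Assumptions: namespace "longhorn-system", label "app=longhorn-manager".
running_longhorn_managers() {
    "$KUBECTL" get pods -n longhorn-system -l app=longhorn-manager \
        --field-selector=status.phase=Running -o name | wc -l | tr -d ' '
}

# Return non-zero (abort) unless at least one manager is Running.
preflight_longhorn() {
    count=$(running_longhorn_managers)
    if [ "$count" -lt 1 ]; then
        echo "no longhorn-manager pod is Running; run longhorn.yml first" >&2
        return 1
    fi
    echo "longhorn-manager pods Running: $count"
}
```

Calling `preflight_longhorn` against the live cluster gives a quick yes/no answer on whether the data-recovery path can start at all.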
|
||||
|
||||
---
|
||||
|
||||
## Why this is rehearsed
|
||||
|
||||
A recovery procedure run once under outage stress is a liability. These two playbooks — and the CRDs-present-vs-gone decision — are **rehearsed deliberately in the production-like sandbox**: kill the cluster, lose the engine IDs on a test volume, and walk both recovery paths back to green without risking production data. That turns the drill into routine QA rather than one-shot incident memory. See the PRD's [QA strategy](../../../PRD/safe-prod-like-environment/qa-strategy.md) for how recovery drills become a regular exercise, and [Storage & recovery](../../lab-ecosystem/storage-and-recovery.md) for the full startup order these drills validate.
|
||||
|
||||
---
|
||||
|
||||
## Where this branch sits
|
||||
|
||||
```mermaid
%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'14px'}}}%%
flowchart LR
    classDef done fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb;
    classDef here fill:#5f1e1e,stroke:#ef4444,color:#fef2f2;

    s05["05 · Backup<br/>(produces .volumes dump)"]:::done
    rec["recover/*<br/>longhorn.yml · longhorn_data.yml"]:::here
    s01["01 · System<br/>(rejoin pipeline)"]:::done

    s05 -. "on disaster" .-> rec
    rec -. "once recovered" .-> s01
```
|
||||
|
||||
1. **05 · Backup** produced the `.volumes` dump that `longhorn.yml`'s restore phase replays.
2. **recover/** (this page) is invoked only on disaster — pick `longhorn.yml` (CRDs present) or `longhorn_data.yml` (CRDs gone).
3. Once volumes are healthy, the cluster **re-enters the normal pipeline** at [01 · System](01-system.md), and you re-run a fresh [05 · Backup](05-backup.md) once everything is green.
|
||||
120
vibe/guidebooks/factory-provisioning/ansible/README.md
Normal file
@@ -0,0 +1,120 @@
|
||||
[vibe](../../../README.md) > [Guidebooks](../../README.md) > [Factory provisioning](../README.md) > **Ansible**
|
||||
|
||||
# Ansible — factory provisioning
|
||||
|
||||
> [!NOTE]
> **Status:** ✅ active · **Last Updated:** 2026-06-23
> **Upstream:** [Factory provisioning hub](../README.md) · [Lab ecosystem · 01 factory](../../lab-ecosystem/01-factory.md)
> **Downstream:** [01 · System](01-system.md) · [02 · Setup](02-setup.md) · [03 · CI/CD](03-cicd.md) · [04 · Tools](04-tools.md) · [05 · Backup](05-backup.md) · [06 · Recover](06-recover.md) · [Inventory & variables](inventory.md) · [Roles reference](roles.md)
> **Related:** [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) · [Storage & recovery](../../lab-ecosystem/storage-and-recovery.md) · [Naming conventions](../../lab-ecosystem/naming-conventions.md) · [ADR-0001 safe prod-like environment](../../../ADR/0001-safe-prod-like-environment.md)
|
||||
|
||||
Ansible is the **imperative half** of the factory: it takes three bare Raspberry Pis (`pi1`, `pi2`, `pi3`) and turns them into a running K3s cluster with Docker, Longhorn storage, Gitea CI runners, CrowdSec, and Vault. OpenTofu (the declarative half) then provisions everything that lives *outside* the cluster — see the [OpenTofu sub-hub](../opentofu/README.md).
|
||||
|
||||
---
|
||||
|
||||
## Collection layout
|
||||
|
||||
Everything ships as a single Ansible **collection** committed under [`ansible/arcodange/factory/`](../../../../ansible/arcodange/factory). The collection root, not the repo root, is what `ansible-galaxy collection install` and the FQCN references (`arcodange.factory.<role>`) resolve against.
|
||||
|
||||
| File | Path | What it declares |
| --- | --- | --- |
| `galaxy.yml` | [`ansible/arcodange/factory/galaxy.yml`](../../../../ansible/arcodange/factory/galaxy.yml) | Collection identity: **namespace `arcodange`**, **name `factory`**, **version `1.0.0`**. Together they form the FQCN prefix `arcodange.factory.*` used by every role and playbook import. |
| `requirements.yml` | [`ansible/requirements.yml`](../../../../ansible/requirements.yml) | External dependencies pulled at install time (see table below). |
| `ansible.cfg` | [`ansible/arcodange/factory/ansible.cfg`](../../../../ansible/arcodange/factory/ansible.cfg) | `collections_path = ~/.ansible/collections` and `scp_if_ssh = True` for the SSH connection plugin. |
| `inventory/` | [`ansible/arcodange/factory/inventory/`](../../../../ansible/arcodange/factory/inventory) | `hosts.yml` + `group_vars/`. Detailed in [Inventory & variables](inventory.md). |
| `playbooks/` | [`ansible/arcodange/factory/playbooks/`](../../../../ansible/arcodange/factory/playbooks) | The numbered pipeline `01..05` plus the `recover/` branch. |
| `roles/` | [`ansible/arcodange/factory/roles/`](../../../../ansible/arcodange/factory/roles) | Seven reusable roles. Detailed in [Roles reference](roles.md). |
|
||||
|
||||
### External dependencies (`requirements.yml`)
|
||||
|
||||
| Dependency | Type | Why it is needed |
| --- | --- | --- |
| `geerlingguy.docker` | role | Installs and configures the Docker engine on each Pi. |
| `ansible.posix` | collection | POSIX primitives (mounts, sysctl, `synchronize`). |
| `community.crypto` | collection | Certificate/key generation for the step-ca PKI and Traefik. |
| `community.docker` | collection | Manages containers and Compose stacks (Gitea, act_runner). |
| `community.general` | collection | Broad utility modules used across the pipeline. |
| `kubernetes.core` | collection | `k8s` / `helm` modules used by every K3s-facing task. Needs the `kubernetes` Python lib at runtime. |
| `k3s-ansible` (`git+https://github.com/k3s-io/k3s-ansible.git`) | git role/collection | Upstream playbooks that install and cluster K3s itself. |
|
||||
|
||||
> [!TIP]
> The runtime Python libraries (`kubernetes`, `jmespath`, `dnspython`) that `kubernetes.core` and friends import are declared in the **repo-root `pyproject.toml`**, not in `requirements.yml`. `uv sync` installs them; `ansible-galaxy` installs the Galaxy/git content. Both steps are required.
|
||||
|
||||
---
|
||||
|
||||
## Invocation pattern
|
||||
|
||||
The control node runs Ansible from a `uv`-managed venv. The `localhost` inventory entry sets `ansible_python_interpreter: "{{ ansible_playbook_python }}"`, so `uv run` is enough to put Ansible on the venv's Python — no hardcoded interpreter path. Full recipe lives in [`ansible/README.md`](../../../../ansible/README.md).
|
||||
|
||||
1. **Sync the venv** — installs `ansible-core` plus the runtime Python deps:

   ```sh
   uv sync
   ```

2. **Install collection dependencies** — pulls the Galaxy + git content from `requirements.yml`:

   ```sh
   uv run ansible-galaxy collection install -r ansible/requirements.yml
   ```

3. **Run a stage** — point `-i` at the inventory directory and pass one numbered playbook:

   ```sh
   uv run ansible-playbook \
     -i ansible/arcodange/factory/inventory \
     ansible/arcodange/factory/playbooks/<NN_name>.yml
   ```
|
||||
|
||||
### The vault password (`ANSIBLE_VAULT_PASSWORD_FILE`)
|
||||
|
||||
Encrypted vars are decrypted with a password that is **sourced from the cluster, not stored on disk**. `ANSIBLE_VAULT_PASSWORD_FILE` points at a tiny executable script that reads the K8s secret `arcodange-ansible-vault` from the `kube-system` namespace:
|
||||
|
||||
```sh
kubectl get secret -n kube-system arcodange-ansible-vault \
  --template='{{index .data.pass | base64decode}}'
```
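
One way to wire this up is sketched below. Ansible treats an executable vault password file as a script and reads the password from its standard output, so the script only needs to print the decoded secret. The script name `ansible-vault-pass.sh` and the `SCRIPT_DIR` variable are hypothetical; the actual location used by the repo is not stated here.

```sh
#!/bin/sh
# Hypothetical location for the password script (a temp dir unless SCRIPT_DIR is set).
script_dir=${SCRIPT_DIR:-$(mktemp -d)}
script="$script_dir/ansible-vault-pass.sh"

# Write the script. The quoted heredoc delimiter keeps the Go template literal.
cat > "$script" <<'EOF'
#!/bin/sh
exec kubectl get secret -n kube-system arcodange-ansible-vault \
    --template='{{index .data.pass | base64decode}}'
EOF
chmod 0700 "$script"

# Ansible runs an executable password file and uses what it prints on stdout.
export ANSIBLE_VAULT_PASSWORD_FILE="$script"
echo "ANSIBLE_VAULT_PASSWORD_FILE=$ANSIBLE_VAULT_PASSWORD_FILE"
```

Nothing secret is written to disk in this arrangement: the file holds only the `kubectl` command, and the password exists on the control node just for the lifetime of the playbook run.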
|
||||
|
||||
> [!IMPORTANT]
> The same `arcodange-ansible-vault` secret in `kube-system` is consumed by the Gitea CI runners (needed for the Gitea mailer). Create it once with `kubectl create secret generic arcodange-ansible-vault --from-literal="pass=<ansible_vault_password>" -n kube-system`. See [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) for how this fits the broader secret model.
|
||||
|
||||
---
|
||||
|
||||
## The provisioning pipeline
|
||||
|
||||
The numbered playbooks are meant to be run **in order** on a fresh cluster. Each is a thin wrapper that uses `import_playbook` to pull in the playbook from its stage directory (e.g. `01_system.yml` → `system/system.yml`). The `recover/` playbooks are **not** part of the linear sequence; they are an on-demand branch used only during disaster recovery.
|
||||
|
||||
```mermaid
%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'14px'}}}%%
flowchart LR
    classDef stage fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb;
    classDef recover fill:#5f1e1e,stroke:#ef4444,color:#fef2f2;

    s01["01 · System<br/>Docker · K3s · Longhorn · DNS · SSL"]:::stage
    s02["02 · Setup<br/>Gitea · Postgres · NFS backup"]:::stage
    s03["03 · CI/CD<br/>act_runner registration"]:::stage
    s04["04 · Tools<br/>CrowdSec · Vault"]:::stage
    s05["05 · Backup<br/>cron reports · PVC/db dumps"]:::stage
    rec["recover/*<br/>Longhorn + data restore"]:::recover

    s01 --> s02 --> s03 --> s04 --> s05
    s05 -. "on disaster" .-> rec
    rec -. "rejoin pipeline" .-> s01
```
|
||||
|
||||
1. **`01 · System`** — base OS hardening on each Pi, then Docker, Longhorn disk prep + iSCSI, K3s install, CoreDNS, the step-ca cert issuer, and final K3s config (kubeconfig, Longhorn, Traefik).
2. **`02 · Setup`** — deploys the cluster-resident services: Gitea, PostgreSQL (on `pi2`), and the NFS backup target.
3. **`03 · CI/CD`** — fetches a Gitea runner-registration token and rolls out the `act_runner` Docker Compose stack on every non-Gitea Pi so CI jobs have executors.
4. **`04 · Tools`** — installs the operational tooling layer: CrowdSec (WAF/IPS) and HashiCorp Vault.
5. **`05 · Backup`** — schedules the cron-driven backup + email-report jobs and the Gitea / Postgres / K3s-PVC dump routines.
6. **`recover/*` (on demand)** — invoked only after data loss to rebuild Longhorn and replay volume data; once recovered, the cluster re-enters the normal pipeline at `01 · System`.
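
Running the stages in order can be sketched as a small wrapper. This is an illustration under stated assumptions, not a script from the repo: the function name `run_pipeline` and the `RUNNER` variable are hypothetical, and it assumes the numbered wrappers all match `NN_*.yml` as `01_system.yml` does. Shell globs expand in sorted order, so the stages run as `01`, `02`, and so on, and the `recover/` subdirectory is never matched.

```sh
#!/bin/sh
# Hypothetical wrapper: run every numbered playbook in a directory, in order,
# and stop at the first failure. RUNNER defaults to "echo", which makes the
# default behaviour a dry run that only prints the commands.
run_pipeline() {
    playbook_dir=$1
    inventory=$2
    for playbook in "$playbook_dir"/[0-9][0-9]_*.yml; do
        [ -e "$playbook" ] || continue   # the glob matched nothing
        ${RUNNER:-echo} ansible-playbook -i "$inventory" "$playbook" || return 1
    done
}
```

With the defaults, `run_pipeline ansible/arcodange/factory/playbooks ansible/arcodange/factory/inventory` prints the commands it would run. Setting `RUNNER="uv run"` executes them through the venv instead.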
|
||||
|
||||
---
|
||||
|
||||
## Index
|
||||
|
||||
| # | Page | Covers | State |
| --- | --- | --- | --- |
| 01 | [System](01-system.md) | RPi hardening, Docker, K3s, Longhorn/iSCSI, CoreDNS, step-ca SSL | ✅ |
| 02 | [Setup](02-setup.md) | Gitea, PostgreSQL, NFS backup target | ✅ |
| 03 | [CI/CD](03-cicd.md) | Gitea `act_runner` registration & Compose deploy | ✅ |
| 04 | [Tools](04-tools.md) | CrowdSec, HashiCorp Vault | ✅ |
| 05 | [Backup](05-backup.md) | Cron report jobs, Gitea/Postgres/PVC dumps | ✅ |
| 06 | [Recover](06-recover.md) | Longhorn + data restore (on-demand DR branch) | 🟡 |
| — | [Inventory & variables](inventory.md) | `hosts.yml` groups, `group_vars/` layering, host→service mapping | ✅ |
| — | [Roles reference](roles.md) | The seven `arcodange.factory.*` roles | ✅ |
|
||||
111
vibe/guidebooks/factory-provisioning/ansible/inventory.md
Normal file
@@ -0,0 +1,111 @@
|
||||
[vibe](../../../README.md) > [Guidebooks](../../README.md) > [Factory provisioning](../README.md) > [Ansible](README.md) > **Inventory & variables**
|
||||
|
||||
# Inventory & variables
|
||||
|
||||
> [!NOTE]
> **Status:** ✅ active · **Last Updated:** 2026-06-23
> **Upstream:** [Ansible sub-hub](README.md) · [Lab ecosystem · 01 factory](../../lab-ecosystem/01-factory.md)
> **Downstream:** [Roles reference](roles.md)
> **Related:** [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) · [Storage & recovery](../../lab-ecosystem/storage-and-recovery.md) · [Naming conventions](../../lab-ecosystem/naming-conventions.md) · [ADR-0001 safe prod-like environment](../../../ADR/0001-safe-prod-like-environment.md) · [PRD · isolation boundary](../../../PRD/safe-prod-like-environment/isolation-boundary.md)
|
||||
|
||||
The inventory is the single source of truth for **which machines exist** and **which service each machine runs**. It is a directory inventory — [`inventory/hosts.yml`](../../../../ansible/arcodange/factory/inventory/hosts.yml) plus a layered [`group_vars/`](../../../../ansible/arcodange/factory/inventory/group_vars) tree — passed to every playbook with `-i ansible/arcodange/factory/inventory`.
|
||||
|
||||
> [!IMPORTANT]
> This inventory describes **live production**. The three IPs `192.168.1.201-203` are the real Pis that run the public CMS, the Dolibarr ERP, and business email. A playbook pointed at this inventory mutates prod. The safe-environment work treats this file as the prod blast radius and requires a **separate sandbox inventory + a prod-IP guard** before any sandbox apply. See [ADR-0001](../../../ADR/0001-safe-prod-like-environment.md) and the first row of the [PRD isolation boundary](../../../PRD/safe-prod-like-environment/isolation-boundary.md).
|
||||
|
||||
---
|
||||
|
||||
## Hosts
|
||||
|
||||
Defined in [`inventory/hosts.yml`](../../../../ansible/arcodange/factory/inventory/hosts.yml). Three physical Pis are each reachable two ways — over the LAN (the canonical path) and through an internet port-forward managed at the firewall — plus the control node as `localhost`.
|
||||
|
||||
| Host | `ansible_host` | `preferred_ip` | Port | Reach |
| --- | --- | --- | --- | --- |
| `pi1` | `pi1.home` | `192.168.1.201` | 22 | LAN |
| `pi2` | `pi2.home` | `192.168.1.202` | 22 | LAN |
| `pi3` | `pi3.home` | `192.168.1.203` | 22 | LAN |
| `internetPi1` | `rg-evry.changeip.co` | — | `51022` | WAN port-forward → `pi1` |
| `internetPi2` | `rg-evry.changeip.co` | — | `52022` | WAN port-forward → `pi2` |
| `internetPi3` | `rg-evry.changeip.co` | — | `53022` | WAN port-forward → `pi3` |
| `localhost` | (local connection) | — | — | control node |
|
||||
|
||||
> [!NOTE]
> The `internetPiN` entries share one DNS name (`rg-evry.changeip.co`) and differ only by SSH port (`5N022`). The hosts file documents the choice of `changeip.co` over `arcodange.duckdns.org`: changeip is **managed directly with the firewall** rather than depending on a DuckDNS registry update, so the forward is stable. `preferred_ip` is a custom hostvar (not a connection variable) — roles read it to build DNS records, the Gitea SSH domain, and the Pi-hole local-DNS table.
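
The `5N022` port pattern is easy to express as a helper. The sketch below is illustrative only; the function names are hypothetical, and the `pi` login comes from the `ansible_user` set on the `raspberries` group.

```sh
#!/bin/sh
# Map a Pi number (1-3) to its WAN SSH port following the 5N022 pattern.
wan_ssh_port() {
    case $1 in
        [1-3]) printf '5%s022\n' "$1" ;;
        *) echo "unknown Pi number: $1" >&2; return 1 ;;
    esac
}

# Print (not run) the SSH command for a Pi reached over the port-forward.
wan_ssh_command() {
    port=$(wan_ssh_port "$1") || return 1
    echo "ssh -p $port pi@rg-evry.changeip.co"
}

wan_ssh_command 2
```

The last line prints the command for `pi2`, which uses port `52022`.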
|
||||
|
||||
---
|
||||
|
||||
## Groups
|
||||
|
||||
Groups map machines to roles. The membership is small and deliberate; read the table as "this service runs on these hosts".
|
||||
|
||||
| Group | Members | Defined as | What it is for |
| --- | --- | --- | --- |
| `raspberries` | `pi1`, `pi2`, `pi3` + `internetPi1-3` | explicit hosts | Every Pi, LAN and WAN handles. Carries the shared `ansible_user: pi`. |
| `local` | `localhost`, `pi1`, `pi2`, `pi3` | explicit hosts | The control-node-facing group; `localhost` runs `kubectl`/`tofu`/`docker` tasks that talk to the cluster. |
| `postgres` | `pi2` | explicit host | The single PostgreSQL node. `pi2` is the database host. |
| `gitea` | `pi2` (via `children: postgres`) | child of `postgres` | Gitea co-locates with its database, so the group simply inherits `postgres`. `groups.gitea[0]` resolves to `pi2` everywhere. |
| `pihole` | `pi1`, `pi3` | explicit hosts | The HA DNS pair (Pi-hole + Gravity Sync). |
| `step_ca` | `pi1`, `pi2`, `pi3` | explicit hosts | Every Pi runs a step-ca node (primary `pi1`, standbys `pi2`/`pi3`). |
| `all` | everything (`children: raspberries`) | implicit + child | Ansible's universal group; `group_vars/all/` applies to all hosts. |
|
||||
|
||||
> [!TIP]
> Because `gitea` is a **child of `postgres`** and `postgres` has exactly one host, every reference to `groups.gitea[0]` (the Gitea container, the API base URL `http://{{ groups.gitea[0] }}:3000`, the SSH domain) points at `pi2`. Move Postgres and Gitea follows automatically.
|
||||
|
||||
---
|
||||
|
||||
## Connection variables
|
||||
|
||||
| Variable | Where set | Value / effect |
| --- | --- | --- |
| `ansible_user` | `raspberries.vars` | `pi` — the SSH login on every Pi. |
| `ansible_ssh_extra_args` | per-host (`pi1`/`pi2`/`pi3`) | `-o StrictHostKeyChecking=no` — Pis get reimaged, so host-key churn is expected; the check is disabled rather than forcing `known_hosts` edits. |
| `ansible_port` | `internetPiN` | `51022` / `52022` / `53022` — the firewall's per-Pi SSH forwards. |
| `ansible_connection` | `localhost` | `local` — run on the control node, no SSH. |
| `ansible_python_interpreter` | `localhost` | `"{{ ansible_playbook_python }}"` — uses the `uv`-managed venv's Python, no hardcoded path. |

The SSH file-transfer setting (`scp_if_ssh = True`) is not an inventory variable: it is set in [`ansible.cfg`](../../../../ansible/arcodange/factory/ansible.cfg), and `collections_path` lives there too.
|
||||
|
||||
---
|
||||
|
||||
## `group_vars/` layering
|
||||
|
||||
Variables are split by group so each service owns its own file. The path `group_vars/<group>/<file>.yml` is auto-loaded for every host in `<group>`.
|
||||
|
||||
| File | Scope | Declares |
| --- | --- | --- |
| [`all/common.yml`](../../../../ansible/arcodange/factory/inventory/group_vars/all/common.yml) | all hosts | `user_home` — the control user's `$HOME`, looked up from the environment. |
| [`all/ssh.yml`](../../../../ansible/arcodange/factory/inventory/group_vars/all/ssh.yml) | all hosts | SSH-public-key discovery: `first_found` over `id_ed25519_arcodange.pub` → `id_ed25519.pub` → `id_rsa.pub`, then splits the file into `ssh_public_key`, `ssh_key_title`, `ssh_key_algorithm`. Roles push this key to authorized hosts. |
| [`all/gitea.yml`](../../../../ansible/arcodange/factory/inventory/group_vars/all/gitea.yml) | all hosts | `gitea_secret_propagation_users: [arcodange]` — user namespaces that must also receive org-level Gitea Action secrets (see the [`gitea_secret`](roles.md) role). |
| [`gitea/gitea.yml`](../../../../ansible/arcodange/factory/inventory/group_vars/gitea/gitea.yml) | `gitea` | `gitea_version: 1.25.5`, the `gitea_database` triple, and the full Gitea Docker Compose: Postgres backend (`postgres:5432`), the `smtps`/orange.fr mailer, SSH on `2222:22`, `ROOT_URL https://gitea.arcodange.lab/`, registration disabled. SSH domain is built from `hostvars[groups.gitea[0]].preferred_ip`. |
| [`gitea/gitea_vault.yml`](../../../../ansible/arcodange/factory/inventory/group_vars/gitea/gitea_vault.yml) | `gitea` | **VAULTED.** The `gitea_vault.*` map — `GITEA__mailer__PASSWD` (consumed by the compose above) plus the `github_api_token` / `gitlab_api_token` read by the mirror roles. |
| [`postgres/postgres.yml`](../../../../ansible/arcodange/factory/inventory/group_vars/postgres/postgres.yml) | `postgres` | The Postgres Docker Compose — `postgres:16.3-alpine`, `5432:5432`, data under `/home/pi/arcodange/docker_composes/postgres/data` — plus the `pgbouncer` auth-user block. |
| [`step_ca/step_ca.yml`](../../../../ansible/arcodange/factory/inventory/group_vars/step_ca/step_ca.yml) | `step_ca` | `step_ca_primary: pi1`, `step_ca_fqdn: ssl-ca.arcodange.lab`, the `step` user/home/dir, and `step_ca_listen_address: ":8443"`. |
| [`step_ca/step_ca_vault.yml`](../../../../ansible/arcodange/factory/inventory/group_vars/step_ca/step_ca_vault.yml) | `step_ca` | **VAULTED.** `vault_step_ca_password` (the CA root password) and `vault_step_ca_jwk_password` (the cert-manager JWK provisioner password). |
|
||||
|
||||
> [!NOTE]
> Encrypted files are conventionally suffixed `_vault.yml`. They are normal `group_vars` files whose **contents** are `ansible-vault`-encrypted; non-vault siblings hold the plaintext structure that references the vaulted keys (e.g. `gitea/gitea.yml` interpolates `gitea_vault.GITEA__mailer__PASSWD`).
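
The SSH-key discovery in `all/ssh.yml` is a `first_found` lookup followed by a split of the key file. A shell equivalent is sketched below to show the logic; it is not the Ansible implementation. The function names are hypothetical, and the mapping of the three whitespace-separated fields of a public key (`ALGORITHM KEY COMMENT`) onto `ssh_key_algorithm`, `ssh_public_key`, and `ssh_key_title` is an assumption based on the variable names.

```sh
#!/bin/sh
# Print the first existing public key, in the documented preference order.
first_found_pubkey() {
    ssh_dir=$1
    for name in id_ed25519_arcodange.pub id_ed25519.pub id_rsa.pub; do
        if [ -f "$ssh_dir/$name" ]; then
            echo "$ssh_dir/$name"
            return 0
        fi
    done
    return 1
}

# Split "ALGORITHM KEY TITLE..." into three labelled lines.
# Which field feeds which variable is an assumption (see the lead-in).
split_pubkey() {
    read -r algorithm key title < "$1"
    printf 'ssh_key_algorithm=%s\nssh_public_key=%s\nssh_key_title=%s\n' \
        "$algorithm" "$key" "$title"
}
```

A typical call would be `split_pubkey "$(first_found_pubkey "$HOME/.ssh")"`, assuming the keys live in `~/.ssh`.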
|
||||
|
||||
---
|
||||
|
||||
## The vault model
|
||||
|
||||
Two distinct mechanisms share the word "vault" here — keep them apart:
|
||||
|
||||
1. **`ansible-vault`** encrypts the `*_vault.yml` files at rest in git (AES256). Decryption happens transparently at playbook runtime.
2. **The vault password itself is never on disk.** `ANSIBLE_VAULT_PASSWORD_FILE` points at a tiny executable that fetches the password from the K8s secret `arcodange-ansible-vault` in the `kube-system` namespace:

   ```sh
   kubectl get secret -n kube-system arcodange-ansible-vault \
     --template='{{index .data.pass | base64decode}}'
   ```
|
||||
|
||||
So decrypting any `*_vault.yml` requires `kubectl` access to the live cluster — the cluster *is* the key custodian. The setup recipe (and the `kubectl create secret` to seed it) lives in [`ansible/README.md`](../../../../ansible/README.md); how this fits the broader secret hierarchy is in [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md).
|
||||
|
||||
> [!CAUTION]
> This is **not** HashiCorp Vault. HashiCorp Vault (`vault.arcodange.lab`) is a separate, cluster-resident service installed by the [`hashicorp_vault`](roles.md) role in the `04 · Tools` stage. The `arcodange-ansible-vault` K8s secret only holds the `ansible-vault` password and is also read by the Gitea CI runners for the mailer.
|
||||
|
||||
---
|
||||
|
||||
## Why this page matters for safe-prod
|
||||
|
||||
The variables above bind Ansible directly to live infrastructure: the host IPs, the prod Vault address, the prod Postgres superuser, and the prod Gitea forge. The safe-environment design maps each of these to a sandbox control — a parallel `inventory/sandbox/hosts.yml` with VM/cloud hosts, a pre-task guard that aborts on any `192.168.1.201-203` target unless `i_mean_prod=true`, and per-service overrides — detailed in the [PRD isolation boundary](../../../PRD/safe-prod-like-environment/isolation-boundary.md). Until that lands, **assume every run is a prod run**.
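
The guard logic described above is small enough to sketch. The design calls for an Ansible pre-task; the shell version below only illustrates the decision rule, and the function names are hypothetical.

```sh
#!/bin/sh
# Return 0 when the address is one of the three production Pis.
is_prod_ip() {
    case $1 in
        192.168.1.201|192.168.1.202|192.168.1.203) return 0 ;;
        *) return 1 ;;
    esac
}

# Abort on a prod target unless the caller passed i_mean_prod=true.
#   $1 target IP, $2 value of i_mean_prod (defaults to false)
guard_target() {
    target_ip=$1
    i_mean_prod=${2:-false}
    if is_prod_ip "$target_ip" && [ "$i_mean_prod" != "true" ]; then
        echo "refusing to touch prod host $target_ip (set i_mean_prod=true to override)" >&2
        return 1
    fi
}
```

The default is the safe one: a prod address is rejected unless the override is spelled out, and any non-prod address passes without it.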
|
||||
186
vibe/guidebooks/factory-provisioning/ansible/roles.md
Normal file
@@ -0,0 +1,186 @@
|
||||
[vibe](../../../README.md) > [Guidebooks](../../README.md) > [Factory provisioning](../README.md) > [Ansible](README.md) > **Roles reference**
|
||||
|
||||
# Roles reference
|
||||
|
||||
> [!NOTE]
> **Status:** ✅ active · **Last Updated:** 2026-06-23
> **Upstream:** [Ansible sub-hub](README.md) · [Lab ecosystem · 01 factory](../../lab-ecosystem/01-factory.md)
> **Downstream:** [Inventory & variables](inventory.md)
> **Related:** [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) · [Storage & recovery](../../lab-ecosystem/storage-and-recovery.md) · [Naming conventions](../../lab-ecosystem/naming-conventions.md) · [ADR-0001 safe prod-like environment](../../../ADR/0001-safe-prod-like-environment.md)
|
||||
|
||||
Roles live in two places, by reuse scope:
|
||||
|
||||
- **Shared roles** — reusable across stages — live in [`ansible/arcodange/factory/roles/`](../../../../ansible/arcodange/factory/roles) and are referenced by FQCN `arcodange.factory.<role>`.
- **Nested roles** — owned by one playbook stage — live under [`playbooks/<stage>/roles/`](../../../../ansible/arcodange/factory/playbooks) and are auto-discovered by that stage's playbook.
|
||||
|
||||
This page is split by **altitude**. Tier 1 covers the heavyweight platform-service roles (one subsection each); Tier 2 is a single table of the smaller building-block roles.
|
||||
|
||||
---
|
||||
|
||||
## Tier 1 — platform-service roles
|
||||
|
||||
### `hashicorp_vault`
|
||||
|
||||
[`playbooks/tools/roles/hashicorp_vault`](../../../../ansible/arcodange/factory/playbooks/tools/roles/hashicorp_vault) · runs on `localhost` in the `04 · Tools` stage. It initializes and unseals the cluster Vault and wires Gitea as an OIDC provider so CI jobs can authenticate to Vault.
|
||||
|
||||
The [`tasks/main.yml`](../../../../ansible/arcodange/factory/playbooks/tools/roles/hashicorp_vault/tasks/main.yml) flow is:
|
||||
|
||||
1. **Init** ([`init.yml`](../../../../ansible/arcodange/factory/playbooks/tools/roles/hashicorp_vault/tasks/init.yml)) — first run only. Lists the Vault server pods in the `tools` namespace, checks `vault operator init -status`, and if uninitialized runs `vault operator init` with **`key-shares=1`, `key-threshold=1`** (defaults from [`defaults/main.yml`](../../../../ansible/arcodange/factory/playbooks/tools/roles/hashicorp_vault/defaults/main.yml)). The JSON output — unseal keys + initial root token — is written to `~/.arcodange/cluster-keys.json` (dir `0700`, file `0600`).
2. **Unseal** ([`unseal.yml`](../../../../ansible/arcodange/factory/playbooks/tools/roles/hashicorp_vault/tasks/unseal.yml)) — required after every reboot. Reads the keys file and runs `vault operator unseal` for each server, then revokes the *initial* root token (idempotent — tolerates an already-revoked token).
3. **Generate a fresh root token** ([`new_root_token.yml`](../../../../ansible/arcodange/factory/playbooks/tools/roles/hashicorp_vault/tasks/new_root_token.yml)) — runs the `generate-root` OTP/nonce dance using the unseal keys to mint a short-lived `vault_root_token`.
4. **Set up Gitea OIDC** ([`gitea_oidc_auth.yml`](../../../../ansible/arcodange/factory/playbooks/tools/roles/hashicorp_vault/tasks/gitea_oidc_auth.yml)) — drives Gitea through the bundled [`playwright_setupGiteaApp.js`](../../../../ansible/arcodange/factory/playbooks/tools/roles/hashicorp_vault/files/playwright_setupGiteaApp.js) (via the [`playwright`](#tier-2--building-block-roles) role) to create an OAuth2 app, then applies the bundled OpenTofu [`hashicorp_vault.tf`](../../../../ansible/arcodange/factory/playbooks/tools/roles/hashicorp_vault/files/hashicorp_vault.tf) inside a disposable `ghcr.io/opentofu/opentofu` container (state on a throwaway docker volume) to provision the Vault JWT/OIDC backend. Finally it renders [`oidc_jwt_token.sh.j2`](../../../../ansible/arcodange/factory/playbooks/tools/roles/hashicorp_vault/templates/oidc_jwt_token.sh.j2) into the Gitea Actions secret **`vault_oauth__sh_b64`** (base64) at **org** scope, then propagates the same secret to each user in `gitea_secret_propagation_users` (Action secrets are per-owner, so user-owned repos can't read org secrets).
5. **Revoke the temp root token** — the `always` block of `main.yml` revokes `vault_root_token` no matter how step 4 ended, so no long-lived root token survives the run.
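
The mint, use, always-revoke shape of steps 3 to 5 can be illustrated outside Ansible. The sketch below is not the role's code: `with_temp_root_token` and its callback arguments are hypothetical stand-ins, with the minting callback playing the part of the `generate-root` step and the revoking callback wrapping something like `vault token revoke`. The work runs in a subshell that exports `VAULT_TOKEN`, the environment variable the Vault CLI reads.

```sh
#!/bin/sh
# Run a command with a temporary token and revoke the token afterwards,
# whether the command succeeded or not.
#   $1  command that prints a fresh token (stand-in for the generate-root step)
#   $2  command that revokes the token passed as its first argument
#   $@  the work to run; it sees the token as VAULT_TOKEN
with_temp_root_token() {
    mint=$1
    revoke=$2
    shift 2
    token=$("$mint") || return 1
    rc=0
    ( VAULT_TOKEN=$token; export VAULT_TOKEN; "$@" ) || rc=$?
    "$revoke" "$token"      # always runs, like the `always` block
    return "$rc"            # report the outcome of the work itself
}
```

The work's exit status is preserved, so a failed OIDC setup still fails the run, but only after the token has been revoked.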
|
||||
|
||||
| Var | Default | Meaning |
| --- | --- | --- |
| `vault_unseal_keys_path` | `~/.arcodange/cluster-keys.json` | Where unseal keys + root token are stored. |
| `vault_unseal_keys_shares` / `_key_threshold` | `1` / `1` | Single-key seal (lab posture; `threshold <= shares`). |
| `vault_address` | `https://vault.arcodange.lab` | The cluster Vault endpoint. |
| `gitea_admin_user` / `gitea_admin_password` | `arcodange@gmail.com` / (prompted) | Credentials Playwright uses to create the OAuth app. |
| `vault_oidc_force_reset` | `false` | When `true`, `vault auth disable gitea` + `gitea_jwt` before re-applying. |
|
||||
|
||||
> [!CAUTION]
> `vault_oidc_force_reset=true` is **destructive**: it disables and wipes **all** `gitea_cicd_*` per-app JWT roles created by the bundled tofu, every run. Default is off. Likewise, losing `~/.arcodange/cluster-keys.json` means the Vault can never be unsealed again — that file is the single point of failure for the whole secret plane (see [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md)).
|
||||
|
||||
### `step_ca`
|
||||
|
||||
[`playbooks/ssl/roles/step_ca`](../../../../ansible/arcodange/factory/playbooks/ssl/roles/step_ca) · runs on the `step_ca` group (all three Pis) in the `01 · System` stage via [`ssl/step-ca.yml`](../../../../ansible/arcodange/factory/playbooks/ssl/step-ca.yml). It is the lab's internal ACME/CA for `*.arcodange.lab` certificates, run **active/standby**: primary `pi1`, replicas `pi2`/`pi3`. The [`tasks/main.yml`](../../../../ansible/arcodange/factory/playbooks/ssl/roles/step_ca/tasks/main.yml) imports five task files in order:
|
||||
|
||||
1. **install** — install the `step` / `step-ca` binaries.
2. **init** ([`init.yml`](../../../../ansible/arcodange/factory/playbooks/ssl/roles/step_ca/tasks/init.yml)) — primary only. `step ca init` (non-interactive, password file) with `creates:` guard so it is idempotent. The CA name is `Arcodange Lab CA`, DNS `ssl-ca.arcodange.lab`, listen `:8443`.
3. **sync** ([`sync.yml`](../../../../ansible/arcodange/factory/playbooks/ssl/roles/step_ca/tasks/sync.yml)) — replicates the CA from primary to standbys. It takes a **lockfile** on the primary (`.sync.lock`), computes a deterministic `tar | sha256sum` **checksum** of `~/.step`, compares it to the last checksum cached on the controller, and only `rsync`s (pull → controller → push to standbys) when the checksum changed. This is how the standbys hold an identical CA without a shared filesystem.
4. **systemd** — install/enable the `step-ca` unit (the `restart step-ca` handler fires on cert/config change).
5. **provisioners** ([`provisioners.yml`](../../../../ansible/arcodange/factory/playbooks/ssl/roles/step_ca/tasks/provisioners.yml)) — primary only. Ensures a **JWK provisioner named `cert-manager`** exists: lists provisioners, generates the JWK keypair (`creates:` guard) under `~/.step/provisioners/`, and `step ca provisioner add`s it. This is what lets in-cluster cert-manager request certs from the CA.
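
The checksum gate in the sync step can be sketched as follows. This is an illustration of the idea and not the contents of `sync.yml`: the function names are hypothetical, the exact `tar` flags the role uses are not stated here, and `--sort=name` is a GNU tar option assumed to be available. The lockfile and the pull/push `rsync` legs are left out; the sync itself is passed in as a command.

```sh
#!/bin/sh
# Checksum of a directory tree: tar with sorted member names piped to sha256sum.
# --sort=name is a GNU tar option (assumed available on the node).
dir_checksum() {
    tar --sort=name -cf - -C "$1" . | sha256sum | cut -d' ' -f1
}

# Run the sync command only when the checksum differs from the cached one.
#   $1 source dir, $2 cache file holding the last checksum, $@ sync command
sync_if_changed() {
    src=$1
    cache=$2
    shift 2
    current=$(dir_checksum "$src")
    previous=$(cat "$cache" 2>/dev/null || true)
    if [ "$current" = "$previous" ]; then
        echo "unchanged, skipping sync"
        return 0
    fi
    "$@" || return 1                      # sync failed: keep the old checksum
    printf '%s\n' "$current" > "$cache"   # remember what was last synced
}
```

The cache is only updated after a successful sync, so a failed transfer is retried on the next run. The cache file should live outside the directory being checksummed.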
|
||||
|
||||
| Var | Default | Meaning |
|
||||
| --- | --- | --- |
|
||||
| `step_ca_primary` | `pi1` | The writable CA node; standbys sync from it. |
|
||||
| `step_ca_fqdn` | `ssl-ca.arcodange.lab` | CA DNS name; URL is `https://{fqdn}:8443`. |
|
||||
| `step_ca_provisioner_name` / `_type` | `cert-manager` / `JWK` | The cert-manager provisioner. |
|
||||
| `step_ca_force_reinit` | `false` | When `true`, stops the service and **wipes `~/.step`** before re-init. |
|
||||
|
||||
| Secret | Source |
|
||||
| --- | --- |
|
||||
| `vault_step_ca_password` | CA root password — from vaulted [`step_ca/step_ca_vault.yml`](../../../../ansible/arcodange/factory/inventory/group_vars/step_ca/step_ca_vault.yml). |
|
||||
| `vault_step_ca_jwk_password` | cert-manager JWK provisioner password — same vaulted file. |
|
||||
|
||||
> [!CAUTION]
|
||||
> `step_ca_force_reinit=true` **wipes the entire CA** (`~/.step`) on the primary and re-issues a new root — every previously issued `*.arcodange.lab` cert immediately becomes untrusted until clients reload the new root. Use only for a deliberate PKI rebuild.
|
||||
|
||||
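
The checksum gate in the **sync** step above can be sketched in shell. This is a minimal illustration, not the role's actual tasks: the function names, the GNU `tar` flags used to make the archive byte-stable, and the cache-file location are all assumptions.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hash a directory tree reproducibly: sorted entries, fixed mtime and owner,
# so identical content always yields the same digest (GNU tar flags; assumed).
dir_checksum() {
  tar --sort=name --mtime='@0' --owner=0 --group=0 --numeric-owner \
      -cf - -C "$1" . | sha256sum | cut -d' ' -f1
}

# Print "sync" and refresh the cache when the tree changed, else print "skip".
sync_decision() {
  local src="$1" cache_file="$2" current cached=""
  current="$(dir_checksum "$src")"
  if [ -f "$cache_file" ]; then
    cached="$(cat "$cache_file")"
  fi
  if [ "$current" = "$cached" ]; then
    echo skip
  else
    echo "$current" > "$cache_file"
    echo sync
  fi
}

# Demo on a throwaway tree standing in for ~/.step (hypothetical content).
work="$(mktemp -d)"
mkdir -p "$work/step"
echo "root-cert" > "$work/step/root_ca.crt"
sync_decision "$work/step" "$work/last.sha256"   # nothing cached yet
sync_decision "$work/step" "$work/last.sha256"   # tree unchanged
echo "provisioner" > "$work/step/ca.json"
sync_decision "$work/step" "$work/last.sha256"   # tree changed
rm -rf "$work"
```

Keeping the cached checksum outside the hashed tree matters: if the cache file lived inside `~/.step`, writing it would change the next checksum and force a sync on every run.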
### `crowdsec`

[`playbooks/tools/roles/crowdsec`](../../../../ansible/arcodange/factory/playbooks/tools/roles/crowdsec) · runs on `localhost` in the `04 · Tools` stage. It wires CrowdSec's decisions into Traefik as a bouncer middleware with a Turnstile CAPTCHA. The [`tasks/main.yml`](../../../../ansible/arcodange/factory/playbooks/tools/roles/crowdsec/tasks/main.yml) flow:

1. **Vault → K8s secret plumbing**: creates a `ServiceAccount` (`factory-ansible-tool-crowdsec-traefik-plugin`), a `VaultAuth` (kubernetes auth, role `factory_crowdsec_conf`), and a `VaultStaticSecret` that reads **`kvv2/cms/factory/turnstile`** into a K8s secret (`refreshAfter: 30s`). The Turnstile sitekey/secret come from there.
2. **Bouncer key**: finds the CrowdSec LAPI pod in `tools` and runs `cscli bouncers add traefik-plugin` (deletes + re-adds on conflict) to obtain the bouncer API key.
3. **CAPTCHA HTML**: `inject_captcha_html.yml` pushes `captcha.html` into the Traefik PVC; this task is **tagged `never`** (opt-in only) so the default run skips it.
4. **Traefik Middleware**: applies a `traefik.io/v1alpha1` `Middleware` named **`crowdsec-bouncer`** (`crowdsec` in `kube-system`) configured with the bouncer key, stream mode, Turnstile (`captchaProvider: turnstile` + site/secret keys), and a **Redis cache at `redis.tools:6379`**.
5. **Restart Traefik**: scales the Traefik Deployment to 0 then back to 1 (with a `rescue`/`always` guard guaranteeing it scales back up) to load the new middleware.

| Var | Default | Meaning |
| --- | --- | --- |
| `traefik_pvc_name` | `traefik` | The PVC the (tagged-`never`) captcha.html inject targets. |

| Secret | Source |
| --- | --- |
| Turnstile sitekey + secret | Vault `kvv2/cms/factory/turnstile`, surfaced via `VaultStaticSecret`. |
| Bouncer API key | Minted at runtime by `cscli bouncers add`. |

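
The guarded restart in step 5 can be approximated in shell as follows. This is only a sketch of the "always scale back up" idea: the Deployment name `traefik`, the `kube-system` namespace, and the overridable `KUBECTL` variable are assumptions for illustration, not values read from the role.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Command used to talk to the cluster; overridable so the sketch can be dry-run.
KUBECTL="${KUBECTL:-kubectl}"

# Scale Traefik to 0, then always back to 1, mirroring a rescue/always guard.
# Deployment name and namespace are assumptions, not taken from the role.
restart_traefik() {
  local rc=0
  "$KUBECTL" scale deployment/traefik --replicas=0 --namespace=kube-system || rc=$?
  # "always" branch: runs whether or not the scale-down succeeded.
  "$KUBECTL" scale deployment/traefik --replicas=1 --namespace=kube-system || rc=$?
  return "$rc"
}

# Dry run: print the kubectl invocations instead of executing them.
KUBECTL=echo restart_traefik
```

The function records the first failure but still attempts the scale-up, so a failed scale-down never leaves the ingress at zero replicas without at least trying to recover.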
### `pihole`

[`playbooks/dns/roles/pihole`](../../../../ansible/arcodange/factory/playbooks/dns/roles/pihole) · runs on the `pihole` group (`pi1`, `pi3`) in the `01 · System` stage. It configures **HA DNS**: two Pi-hole nodes kept in sync. The [`tasks/main.yml`](../../../../ansible/arcodange/factory/playbooks/dns/roles/pihole/tasks/main.yml) includes three task files:

1. **`ha_pihole_setup.yml`**: **waits for a manual Pi-hole install** (it prints the `curl … | sudo bash` command and `wait_for`s `/etc/pihole/pihole-FTL.db` for up to 10 minutes; Pi-hole itself is not installed by Ansible). It then patches [`pihole.toml`](../../../../ansible/arcodange/factory/playbooks/dns/roles/pihole/tasks/ha_pihole_setup.yml) (listen port, `listeningMode = "ALL"`, enable `/etc/dnsmasq.d`) and writes three dnsmasq drop-ins: `10-custom-rules.conf` (wildcard `address=/fqdn/ip` from `pihole_custom_dns`), `20-rpis.conf` (`<host>.home` → `preferred_ip` for every Pi), and `99-upstream.conf` (explicit upstream from `pihole_upstream_dns`).
2. **`gravity_setup.yml`**: sets up **Gravity Sync** between the two nodes: a `pihole_gravity` system user with a freshly **rotated ed25519 keypair** each run, cross-authorized `authorized_keys`, full **sudo** (`/etc/sudoers.d/gravity-sync`), the installer, and a generated `gravity-sync.conf` (each node points `REMOTE_HOST` at the other), then runs the sync.
3. **`client_setup.yml`**: points DNS clients at the Pi-hole pair by editing `/etc/resolv.conf` (insert nameservers after `search`) and the active NetworkManager connections via `nmcli` (per-interface `ipv4.dns` + `dns-priority`, eth0 50 / wlan0 100).

| Var | Default | Meaning |
| --- | --- | --- |
| `pihole_primary` | `pi1` | First node; the other is derived as the secondary. |
| `pihole_ports` | `8081o,443os,…` | Web-interface listen ports. |
| `pihole_custom_dns` | `{}` | FQDN→IP wildcard records (validated as IPv4). |
| `pihole_upstream_dns` | `[8.8.8.8, 1.1.1.1, 8.8.4.4]` | Explicit upstreams (avoids DHCP-provided DNS). |

> [!WARNING]
> This role is **not fully idempotent**: it depends on a human running the Pi-hole installer first, it **rotates the gravity SSH key on every run**, and it grants the `pihole_gravity` user passwordless **sudo ALL**. Treat reruns as state-changing, not no-ops.

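
To make the `10-custom-rules.conf` drop-in concrete, here is a small shell sketch of turning FQDN→IP pairs into dnsmasq `address=/fqdn/ip` lines with an IPv4 check. The function names and the sample record are hypothetical; the role itself renders this from `pihole_custom_dns` in Ansible.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Succeed only for a dotted-quad IPv4 address with every octet in 0..255.
is_ipv4() {
  local ip="$1" octet
  [[ "$ip" =~ ^([0-9]{1,3}\.){3}[0-9]{1,3}$ ]] || return 1
  local IFS=.
  for octet in $ip; do
    (( 10#$octet <= 255 )) || return 1
  done
}

# Read "fqdn ip" pairs on stdin and emit dnsmasq wildcard records.
render_custom_rules() {
  local fqdn ip
  while read -r fqdn ip; do
    [ -n "$fqdn" ] || continue
    if ! is_ipv4 "$ip"; then
      echo "invalid IPv4 for $fqdn: $ip" >&2
      return 1
    fi
    printf 'address=/%s/%s\n' "$fqdn" "$ip"
  done
}

# Hypothetical record; the real values come from pihole_custom_dns.
printf '%s\n' 'arcodange.lab 192.168.1.50' | render_custom_rules
```

In dnsmasq, `address=/arcodange.lab/<ip>` answers for the domain and every name beneath it, which is why the table above calls these wildcard records.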
### `deploy_docker_compose`

[`roles/deploy_docker_compose`](../../../../ansible/arcodange/factory/roles/deploy_docker_compose) · shared. This is the **generic compose mechanism** every app deploy builds on. The caller passes a `dockercompose_content` dict, and [`tasks/main.yml`](../../../../ansible/arcodange/factory/roles/deploy_docker_compose/tasks/main.yml) then:

1. Derives `app_name` from `dockercompose_content.name` and creates `<root_path>/<partition>/<app_name>/` plus `data/` and `scripts/`.
2. Writes the compose file with `to_nice_yaml` and **validates it** with `validate: 'docker compose -f %s config'`, so a bad compose fails the task before anything is written live.
3. Writes a small wrapper script `scripts/docker-compose` that runs `docker compose -f <the file> "$@"`, so the app can be driven without remembering the path.

| Var | Default | Meaning |
| --- | --- | --- |
| `app_name` | `(dockercompose_content.name)` | App directory name. |
| `app_owner` / `app_group` | `pi` / `docker` | File ownership. |
| `root_path` | `/home/pi/arcodange` | Base path; `partition` (`docker_composes`) nests under it. |

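
The wrapper from step 3 is easy to picture as a few lines of shell. The sketch below is an approximation under stated assumptions: the compose file name `docker-compose.yml` and the helper name `write_compose_wrapper` are hypothetical, and the real role writes these files through Ansible modules rather than a script.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Create <app_dir>/{data,scripts} and a wrapper that pins the compose file.
# The file name docker-compose.yml is an assumption for this sketch.
write_compose_wrapper() {
  local app_dir="$1"
  local compose_file="$app_dir/docker-compose.yml"
  mkdir -p "$app_dir/data" "$app_dir/scripts"
  cat > "$app_dir/scripts/docker-compose" <<EOF
#!/usr/bin/env bash
# Forward every argument to docker compose with the app's file preselected.
exec docker compose -f "$compose_file" "\$@"
EOF
  chmod +x "$app_dir/scripts/docker-compose"
}

# Demo in a throwaway directory standing in for <root_path>/<partition>/<app_name>.
demo_root="$(mktemp -d)"
write_compose_wrapper "$demo_root/gitea"
cat "$demo_root/gitea/scripts/docker-compose"
rm -rf "$demo_root"
```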
---
## Tier 2 — building-block roles

Smaller roles, mostly Gitea/forge plumbing and one-shot helpers. Shared roles live in [`roles/`](../../../../ansible/arcodange/factory/roles); `deploy_gitea`/`deploy_postgresql` are nested under [`playbooks/setup/roles/`](../../../../ansible/arcodange/factory/playbooks/setup/roles).

| Role | Purpose | Key vars / notes | Secrets |
| --- | --- | --- | --- |
| [`gitea_repo`](../../../../ansible/arcodange/factory/roles/gitea_repo) | Ensure a repo exists across Gitea + GitHub + GitLab and add **8h push mirrors** (`sync_on_commit: true`) to GitHub/GitLab. | Creates missing repos on each forge; mirror URLs + namespace IDs in [`vars/main.yml`](../../../../ansible/arcodange/factory/roles/gitea_repo/vars/main.yml). | `github_api_token`, `gitlab_api_token` (from `gitea_vault`). |
| [`gitea_token`](../../../../ansible/arcodange/factory/roles/gitea_token) | Generate / replace / delete a Gitea access token via `docker exec … gitea admin user generate-access-token`. | Stores the raw token in the fact named by `gitea_token_fact_name`; `gitea_token_replace` / `gitea_token_delete` toggles; scopes default to `write:admin,organization,package,repository,user`. | The minted token itself (a fact, not persisted). |
| [`gitea_secret`](../../../../ansible/arcodange/factory/roles/gitea_secret) | `PUT` a Gitea **Actions secret** at user or org scope. | `gitea_secret_name` / `_value`; `gitea_owner_type` (`user`\|`org`) selects the API path. | `gitea_api_token` (Authorization). |
| [`gitea_sync`](../../../../ansible/arcodange/factory/roles/gitea_sync) | List repos on all **three forges**, diff them, and call `gitea_repo` for the repos missing somewhere. | Computes `repos_incomplete = all − common`; loops `gitea_repo` over the gaps. | GitHub/GitLab/Gitea API tokens. |
| [`traefik_certs`](../../../../ansible/arcodange/factory/roles/traefik_certs) | Extract the live **`*.arcodange.lab`** cert from Traefik's `acme.json`. | `kubectl exec` into Traefik → `jq` the LetsEncrypt wildcard cert → `traefik_cert_pem` fact; no-op if already set. | None (reads the in-cluster `acme.json`). |
| [`playwright`](../../../../ansible/arcodange/factory/roles/playwright) | Run a Playwright browser-automation script in Docker. | Builds `playwright:<version>` (default `1.47.0`) from `files/`, runs the script with `playwright_env` injected as `-e`; default script `loginGitea.js`. Used by `hashicorp_vault` for the OIDC app setup. | Script-specific env (e.g. Gitea admin creds). |
| [`deploy_gitea`](../../../../ansible/arcodange/factory/playbooks/setup/roles/deploy_gitea) | Deploy Gitea: template [`app.ini.j2`](../../../../ansible/arcodange/factory/playbooks/setup/roles/deploy_gitea/tasks/main.yml), `docker compose up`, then **health-check `:3000`** until ready. | Compose source is `/home/pi/arcodange/docker_composes/gitea`; admin user `arcodange`. | (consumes the vaulted Gitea compose env). |
| [`deploy_postgresql`](../../../../ansible/arcodange/factory/playbooks/setup/roles/deploy_postgresql) | Deploy Postgres via compose, then per-app **create DB + user** ([`create_db_and_user.yml`](../../../../ansible/arcodange/factory/playbooks/setup/roles/deploy_postgresql/tasks/create_db_and_user.yml)). | Waits on `pg_isready`, loops `applications_databases` (`{app: {db_name, db_user, db_password}}`). | Per-app DB passwords from `applications_databases`. |

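
The `repos_incomplete = all − common` computation in the `gitea_sync` row can be illustrated with plain text tools. This is a hedged sketch, not the role's Jinja/Ansible logic: the function name and the sample repo listings are hypothetical, and a repo counts as "common" here when it appears in every list passed in.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Each argument is a file listing one repo name per line (one file per forge).
# Prints the repos that are missing from at least one forge: all - common.
repos_incomplete() {
  local forges=$# list
  # De-duplicate within each forge, then count how many forges carry each repo.
  for list in "$@"; do
    sort -u "$list"
  done | sort | uniq -c | awk -v n="$forges" '$1 < n { print $2 }'
}

# Hypothetical listings; the role would fetch these from the three forge APIs.
lists="$(mktemp -d)"
printf '%s\n' factory cms tools > "$lists/gitea"
printf '%s\n' factory cms       > "$lists/github"
printf '%s\n' factory tools     > "$lists/gitlab"
repos_incomplete "$lists/gitea" "$lists/github" "$lists/gitlab"
rm -rf "$lists"
```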
---

## Role dependency view

How the roles relate: shared building blocks feed the `setup`-stage app deploys, and a few platform-service roles include shared roles directly.

```mermaid
%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'14px'}}}%%
flowchart TD
    classDef shared fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb;
    classDef setup fill:#1e4620,stroke:#22c55e,color:#f9fafb;
    classDef platform fill:#4a2c1e,stroke:#f59e0b,color:#f9fafb;

    dc["deploy_docker_compose<br/>generic compose writer"]:::shared
    pw["playwright<br/>browser automation"]:::shared
    gt["gitea_token<br/>mint access token"]:::shared
    gs["gitea_secret<br/>PUT Actions secret"]:::shared
    gr["gitea_repo<br/>mirror to GitHub/GitLab"]:::shared
    gsync["gitea_sync<br/>diff 3 forges"]:::shared
    tc["traefik_certs<br/>extract lab cert"]:::shared

    dpg["deploy_postgresql"]:::setup
    dgi["deploy_gitea"]:::setup

    hv["hashicorp_vault"]:::platform
    sca["step_ca"]:::platform
    cs["crowdsec"]:::platform
    ph["pihole"]:::platform

    gsync --> gr
    hv --> pw
    hv --> gs
    dc -. "used by app deploys" .-> dpg
    dc -. "used by app deploys" .-> dgi
```

1. **`gitea_sync` → `gitea_repo`**: the sync role include-loops `gitea_repo` for each repo missing from one of the three forges.
2. **`hashicorp_vault` → `playwright`**: Vault's OIDC setup drives Gitea through Playwright to create the OAuth app.
3. **`hashicorp_vault` → `gitea_secret`**: the rendered `vault_oauth__sh_b64` is published as a Gitea Actions secret at org and user scope.
4. **`deploy_docker_compose` → `deploy_postgresql` / `deploy_gitea`**: the generic compose writer is the substrate the `setup`-stage app deploys lean on.
5. **`step_ca`, `crowdsec`, `pihole`** stand alone: they configure their own services (PKI, WAF, DNS) without including other roles.

---

## See also

- [Inventory & variables](inventory.md): the groups (`gitea`, `postgres`, `step_ca`, `pihole`) these roles target, and the vaulted `group_vars` they read.
- [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md): where `hashicorp_vault`'s OIDC tokens and the `kvv2/cms/factory/turnstile` path fit the broader secret model.
- [Storage & recovery](../../lab-ecosystem/storage-and-recovery.md): how the compose `data/` dirs and the step-ca state relate to backup and disaster recovery.

95
vibe/guidebooks/factory-provisioning/opentofu/README.md
Normal file
@@ -0,0 +1,95 @@
[vibe](../../../README.md) > [Guidebooks](../../README.md) > [Factory provisioning](../README.md) > **OpenTofu**
# OpenTofu — factory provisioning

> [!NOTE]
> **Status:** ✅ active · **Last Updated:** 2026-06-23
> **Upstream:** [Factory provisioning hub](../README.md) · [Lab ecosystem · 01 factory](../../lab-ecosystem/01-factory.md)
> **Downstream:** [factory iac](factory-iac.md) · [postgres iac](postgres-iac.md) · [CI apply flow](ci-apply-flow.md)
> **Related:** [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) · [Storage & recovery](../../lab-ecosystem/storage-and-recovery.md) · [Naming conventions](../../lab-ecosystem/naming-conventions.md) · [ADR-0001 safe prod-like environment](../../../ADR/0001-safe-prod-like-environment.md)

OpenTofu is the **declarative half** of the factory. It provisions everything that lives *outside* the K3s cluster: Gitea repos and CI users, Vault policies, Cloudflare DNS, OVH domains, a GCS backup bucket, and the PostgreSQL roles and databases on `pi2`. The imperative half (the cluster itself) is built by [Ansible](../ansible/README.md).

OpenTofu is pinned to **`1.8.2`** in CI (`OPENTOFU_VERSION`).

---

## Two independent state roots

There are **two separate Terraform/OpenTofu roots**, each with its own `backend.tf`, its own GCS state prefix, its own provider set, and its own CI workflow. They never share state and can be applied independently.

| Root | Code path | State backend (GCS) | Triggered by |
| --- | --- | --- | --- |
| **factory iac** | [`iac/`](../../../../iac) | `gs://arcodange-tf/factory/main` | changes under `iac/**` → [`.gitea/workflows/iac.yaml`](../../../../.gitea/workflows/iac.yaml) |
| **postgres iac** | [`postgres/iac/`](../../../../postgres/iac) | `gs://arcodange-tf/factory/postgres` | changes under `postgres/**` → [`.gitea/workflows/postgres.yaml`](../../../../.gitea/workflows/postgres.yaml) |

> [!NOTE]
> Both roots share the same GCS **bucket** (`arcodange-tf`) but live under **distinct prefixes** (`factory/main` vs `factory/postgres`), so their state objects never collide.

---

## Providers

| Provider | Version | Endpoint / scope | Auth |
| --- | --- | --- | --- |
| `go-gitea/gitea` | `0.6.0` | `https://gitea.arcodange.lab` | `GITEA_TOKEN` env var |
| `vault` | `4.4.0` | `https://vault.arcodange.lab` | JWT login: mount `gitea_jwt`, role `gitea_cicd` |
| `google` | `7.0.1` | project `arcodange`, region `US-EAST1` | `GOOGLE_CREDENTIALS` (factory) / `GOOGLE_BACKEND_CREDENTIALS` (postgres backend) |
| `cloudflare/cloudflare` | `~> 5` | DNS / IAM | `CLOUDFLARE_API_TOKEN` env var |
| `ovh/ovh` | `2.8.0` | endpoint `ovh-eu` | `OVH_APPLICATION_KEY` / `OVH_APPLICATION_SECRET` / `OVH_CONSUMER_KEY` |
| `cyrilgdn/postgresql` | `1.24.0` | `192.168.1.202` (pi2), `superuser` | `POSTGRES_USERNAME` / `POSTGRES_PASSWORD` (TF vars) |

The first five providers belong to the **factory iac** root ([`iac/providers.tf`](../../../../iac/providers.tf)); the **postgres iac** root ([`postgres/iac/providers.tf`](../../../../postgres/iac/providers.tf)) declares only `postgresql` + `vault`. Both roots configure the `vault` provider identically (JWT, mount `gitea_jwt`, role `gitea_cicd`).

---

## The Vault-JWT auth model

Neither root carries long-lived Vault credentials. Instead, CI mints a short-lived Gitea OIDC token and exchanges it for Vault access:

1. The first job decodes the base64 secret **`vault_oauth__sh_b64`** and runs it (`base64 -d | bash`), producing a **Gitea OIDC JWT** as a job output (`gitea_vault_jwt`).
2. That JWT is exported into the apply job as **`TERRAFORM_VAULT_AUTH_JWT`**.
3. The `vault` provider's `auth_login_jwt` block consumes it against mount `gitea_jwt` / role `gitea_cicd`, yielding a scoped Vault token used to read the per-provider secrets (Google creds, Gitea token, Cloudflare token, OVH app keys, Postgres creds).

See [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) for the full Vault policy/mount design and [CI apply flow](ci-apply-flow.md) for the job-by-job walkthrough.

---

## CI apply flow

Both workflows share the same two-job shape: authenticate, then apply. The trigger paths differ (`iac/**` vs `postgres/**`) but the structure is identical.

```mermaid
%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'14px'}}}%%
flowchart TD
    classDef trigger fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb;
    classDef job fill:#1e4620,stroke:#22c55e,color:#f0fdf4;
    classDef danger fill:#5f1e1e,stroke:#ef4444,color:#fef2f2;

    push["push / PR touching<br/> iac/** or postgres/**"]:::trigger
    auth["job: gitea_vault_auth<br/>decode vault_oauth__sh_b64<br/> mint Gitea OIDC JWT"]:::job
    tofu["job: tofu<br/>read Vault secrets via JWT<br/> set provider env vars"]:::job
    apply["dflook/terraform-apply@v1<br/> auto_approve: true"]:::danger

    push --> auth
    auth -- "gitea_vault_jwt output" --> tofu
    tofu --> apply
```

1. A **push or PR** that touches files under `iac/**` (factory) or `postgres/**` (postgres) starts the matching workflow; `workflow_dispatch` allows a manual run.
2. The **`gitea_vault_auth`** job decodes `vault_oauth__sh_b64` and emits the Gitea OIDC JWT as `gitea_vault_jwt`.
3. The **`tofu`** job (`needs: gitea_vault_auth`) sets `TERRAFORM_VAULT_AUTH_JWT` from that output, reads the provider secrets out of Vault, and prepares the homelab CA cert (`VAULT_CACERT`).
4. The job runs **`dflook/terraform-apply@v1`** against the root's `path` (`iac` or `postgres/iac`) with **`auto_approve: true`**.

> [!CAUTION]
> **Applies are auto-approve.** There is no manual plan-review gate: once a change to `iac/**` or `postgres/**` lands on `main`, CI applies it to the real Gitea, Vault, Cloudflare, OVH, GCS, and PostgreSQL targets without further confirmation. Treat every merge as a production change and review the diff *before* merging, not after. This trade-off is recorded in [ADR-0001 · safe prod-like environment](../../../ADR/0001-safe-prod-like-environment.md).

---

## Index

| Page | Covers | State |
| --- | --- | --- |
| [factory iac](factory-iac.md) | `iac/` root: Gitea, Vault, Google/GCS backup, Cloudflare, OVH | ✅ |
| [postgres iac](postgres-iac.md) | `postgres/iac/` root: PostgreSQL roles & databases on pi2 | ✅ |
| [CI apply flow](ci-apply-flow.md) | Both Gitea workflows, the Vault-JWT exchange, auto-approve apply | ✅ |

114
vibe/guidebooks/factory-provisioning/opentofu/ci-apply-flow.md
Normal file
@@ -0,0 +1,114 @@
[vibe](../../../README.md) > [Guidebooks](../../README.md) > [Factory provisioning](../README.md) > [OpenTofu](README.md) > **CI apply flow**

# CI apply flow

> [!NOTE]
> **Status:** ✅ active · **Last Updated:** 2026-06-23
> **Upstream:** [`.gitea/workflows/iac.yaml`](../../../../.gitea/workflows/iac.yaml), [`.gitea/workflows/postgres.yaml`](../../../../.gitea/workflows/postgres.yaml)
> **Downstream:** [factory iac](factory-iac.md), [postgres iac](postgres-iac.md)
> **Related:** [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) · [ADR-0001 · Safe prod-like environment](../../../ADR/0001-safe-prod-like-environment.md) · [QA strategy](../../../PRD/safe-prod-like-environment/qa-strategy.md)

Two Gitea Actions workflows turn every commit that touches the OpenTofu code into a live `apply`. `IAC` ([`.gitea/workflows/iac.yaml`](../../../../.gitea/workflows/iac.yaml)) drives the factory infrastructure under [`iac/`](../../../../iac/); `Postgres` ([`.gitea/workflows/postgres.yaml`](../../../../.gitea/workflows/postgres.yaml)) drives the database stack under [`postgres/iac/`](../../../../postgres/iac/). They share the same two-job shape: a short OIDC-auth job feeds a Vault JWT to a `tofu` job that reads secrets and runs `terraform apply`.

> [!CAUTION]
> **`auto_approve: true` means every merge to `main` applies immediately; there is no plan gate.** The `dflook/terraform-apply@v1` step skips the interactive approval, so any change that lands on `main` (or any matched `push`) rewrites real cloud and homelab state without a human reviewing the plan. Mitigations are entirely upstream of CI: (1) **mandatory code review** on the PR before merge, and (2) **least-privilege Vault policies** on the `gitea_cicd` role so a runaway apply can only touch the resources its token is scoped to. See [ADR-0001](../../../ADR/0001-safe-prod-like-environment.md): the sandbox lane runs the *same* tofu but **plan-only** against a `sandbox/` state prefix and a throwaway DNS zone, so contributors can validate changes without an auto-apply.

## Triggers

Both workflows fire on the same three events; only the watched path globs differ.

| Event | `IAC` (factory) | `Postgres` |
| --- | --- | --- |
| `push` | `iac/*.tf`, `iac/*.tfvars`, `iac/**/*.tf`, `iac/**/*.tfvars` | `postgres/**/*.tf`, `postgres/**/*.tfvars` |
| `pull_request` | same globs (YAML anchor `*tofuPaths`) | same globs (YAML anchor `*postgresTofuPaths`) |
| `workflow_dispatch` | manual, no inputs | manual, no inputs |

> [!IMPORTANT]
> `concurrency` is keyed on `${{ github.ref }}-${{ github.workflow }}` with `cancel-in-progress: true`, so a newer push to the same branch cancels an in-flight run. A `pull_request` event also triggers the workflow, and the `apply` step still runs on it, so the safety contract is "review **before** merge", not "CI only plans on PRs".

## Job 1 — `gitea_vault_auth`

Mints a Gitea OIDC token that Vault will trust. The whole job is one step:

```bash
echo -n "${{ secrets.vault_oauth__sh_b64 }}" | base64 -d | bash
```

| Field | Value |
| --- | --- |
| Runner | `ubuntu-latest` |
| Secret consumed | `vault_oauth__sh_b64`, a base64-encoded shell script |
| Step id | `gitea_vault_jwt` |
| Output | `gitea_vault_jwt` ← `steps.gitea_vault_jwt.outputs.id_token` |

The decoded script asks Gitea for an OIDC `id_token` and emits it as a step output. The `tofu` job declares `needs: [gitea_vault_auth]` so it receives `needs.gitea_vault_auth.outputs.gitea_vault_jwt`.

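
The decode-and-run step can be rehearsed locally with a stand-in secret. The sketch below is illustrative only: the real script's contents are not shown in this guidebook, so the placeholder token, the use of the `GITHUB_OUTPUT` file to record the `id_token` output, and the parsing at the end are assumptions.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Stand-in for the secret: a base64-encoded script that sets the id_token output.
# The token value is a placeholder; the real script requests it from Gitea.
script='echo "id_token=placeholder.jwt.value" >> "$GITHUB_OUTPUT"'
vault_oauth__sh_b64="$(printf '%s' "$script" | base64 | tr -d '\n')"

# The runner normally provides GITHUB_OUTPUT; a temp file plays that role here.
export GITHUB_OUTPUT="${GITHUB_OUTPUT:-$(mktemp)}"

# Same shape as the workflow step: decode the secret and pipe it to bash.
echo -n "$vault_oauth__sh_b64" | base64 -d | bash

# Later steps and jobs read the value recorded under the id_token key.
id_token="$(grep '^id_token=' "$GITHUB_OUTPUT" | tail -n 1 | cut -d= -f2-)"
echo "step output id_token: $id_token"
```

Shipping the script as a secret keeps the OIDC client details out of the workflow file, at the cost that the step runs code reviewers cannot see in the repository.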
## Job 2 — `tofu`

| Field | `IAC` | `Postgres` |
| --- | --- | --- |
| Job name | `Tofu` | `Tofu - Postgres` |
| `needs` | `gitea_vault_auth` | `gitea_vault_auth` |
| `OPENTOFU_VERSION` | `1.8.2` | `1.8.2` |
| `TERRAFORM_VAULT_AUTH_JWT` | `needs.gitea_vault_auth.outputs.gitea_vault_jwt` | same |
| `VAULT_CACERT` | `${{ github.workspace }}/homelab.pem` | same |
| Apply path | `iac` | `postgres/iac` |

Step order inside the job:

1. **read vault secret**: the shared `*vault_step` anchor (see below).
2. **`actions/checkout@v4`**: pulls the repo into the workspace.
3. **prepare vault self signed cert**: `echo -n "${{ secrets.HOMELAB_CA_CERT }}" | base64 -d > $VAULT_CACERT`, writing the homelab CA to `homelab.pem` so the runner trusts `https://vault.arcodange.lab`.
4. **terraform apply**: `dflook/terraform-apply@v1` with the path above and `auto_approve: true`.

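
Step 3 is a plain base64 round trip, which the following sketch reproduces with a placeholder certificate. The PEM body and the temporary workspace are hypothetical; only the `echo -n … | base64 -d > $VAULT_CACERT` shape comes from the workflow.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Placeholder PEM body; the real secret holds the homelab CA certificate.
pem=$'-----BEGIN CERTIFICATE-----\nplaceholder\n-----END CERTIFICATE-----'
HOMELAB_CA_CERT="$(printf '%s' "$pem" | base64 | tr -d '\n')"

# Same shape as the workflow step: decode the secret into the file VAULT_CACERT names.
workspace="$(mktemp -d)"
export VAULT_CACERT="$workspace/homelab.pem"
echo -n "$HOMELAB_CA_CERT" | base64 -d > "$VAULT_CACERT"

# Cheap sanity check before any TLS client relies on the file.
if head -n 1 "$VAULT_CACERT" | grep -q -- '-----BEGIN CERTIFICATE-----'; then
  echo "CA bundle written to $VAULT_CACERT"
fi
```

Storing the certificate base64-encoded sidesteps multi-line secret handling, since the encoded value is a single line.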
### Vault secret reads (`*vault_step`)

The `read vault secret` step uses [`arcodange-org/vault-action`](https://gitea.arcodange.lab/arcodange-org/vault-action), authenticating with `method: jwt`, `path: gitea_jwt`, `role: gitea_cicd`, `url: https://vault.arcodange.lab`, `caCertificate: ${{ secrets.HOMELAB_CA_CERT }}`, and `jwtGiteaOIDC` set to the auth job's output. The secrets it exports into the job env differ per workflow:

| Workflow | Vault path | Selector | Exported as |
| --- | --- | --- | --- |
| `IAC` | `kvv1/google/credentials` | `credentials` | `GOOGLE_CREDENTIALS` |
| `IAC` | `kvv1/admin/gitea` | `token` | `GITEA_TOKEN` |
| `IAC` | `kvv1/admin/cloudflare` | `iam_token` | `CLOUDFLARE_API_TOKEN` |
| `IAC` | `kvv1/admin/ovh/app` | `*` (all keys) | `OVH_*` |
| `Postgres` | `kvv1/google/credentials` | `credentials` | `GOOGLE_BACKEND_CREDENTIALS` |
| `Postgres` | `kvv1/postgres/credentials` | `*` (all keys) | `TF_VAR_postgres_*` |

`GOOGLE_CREDENTIALS` / `GOOGLE_BACKEND_CREDENTIALS` authenticate the GCS state backend (in the `IAC` workflow, `GOOGLE_CREDENTIALS` also authenticates the `google` provider); the `TF_VAR_postgres_*` fan-out feeds the Postgres module's input variables directly. See [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) for how the `gitea_cicd` role and KV v1 mounts are provisioned.

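
The `TF_VAR_postgres_*` fan-out relies on OpenTofu reading any `TF_VAR_<name>` environment variable as the input variable `<name>`. A rough shell equivalent of the fan-out is sketched below; the helper name and the key names `username` and `password` are hypothetical, since the actual keys are whatever `kvv1/postgres/credentials` holds.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Read key=value pairs on stdin and export each as TF_VAR_postgres_<key>.
# Key names are hypothetical; the real ones are whatever the Vault path holds.
export_postgres_tf_vars() {
  local key value
  while IFS='=' read -r key value; do
    [ -n "$key" ] || continue
    export "TF_VAR_postgres_${key}=${value}"
  done
}

# Feed the pairs through a here-document so the exports land in the current shell.
export_postgres_tf_vars <<'EOF'
username=example_admin
password=example-secret
EOF

env | grep '^TF_VAR_postgres_' | sort
```

Piping into the function instead would run it in a subshell and the exports would be lost, which is why the sketch uses a here-document.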
## End-to-end flow

```mermaid
%%{init: {'theme': 'base'}}%%
flowchart TD
    push["push / PR / workflow_dispatch<br>on iac/** or postgres/** .tf .tfvars"] --> auth["job: gitea_vault_auth<br>base64 -d | bash -> Gitea OIDC id_token"]
    auth -->|"gitea_vault_jwt output"| tofu["job: tofu<br>OPENTOFU_VERSION 1.8.2"]
    tofu --> readvault["read vault secret<br>vault-action jwt role gitea_cicd"]
    readvault -->|"GOOGLE_CREDENTIALS, TF_VAR_postgres_*, ..."| init["tofu init<br>GCS backend, state prefix"]
    init --> apply["dflook/terraform-apply@v1<br>auto_approve: true"]
    apply --> state["state updated in GCS<br>real cloud + homelab mutated"]

    classDef trigger fill:#1f3a5f,stroke:#7fb0ff,color:#eaf2ff;
    classDef job fill:#3a2f5f,stroke:#b39dff,color:#f3eeff;
    classDef secret fill:#5f3a2f,stroke:#ffb38a,color:#fff1e8;
    classDef danger fill:#5f1f2f,stroke:#ff8a9d,color:#ffe8ec;
    class push trigger;
    class auth,tofu,init job;
    class readvault secret;
    class apply,state danger;
```

1. A **push**, **pull_request**, or **workflow_dispatch** event matching the `iac/**` or `postgres/**` path globs starts the workflow.
2. Job **`gitea_vault_auth`** runs `base64 -d | bash` on the `vault_oauth__sh_b64` secret to obtain a Gitea OIDC `id_token`, published as the `gitea_vault_jwt` output.
3. Job **`tofu`** (gated by `needs: gitea_vault_auth`) starts on `ubuntu-latest` with `OPENTOFU_VERSION 1.8.2` and `TERRAFORM_VAULT_AUTH_JWT` set to that output.
4. The **read vault secret** step exchanges the JWT (role `gitea_cicd`, path `gitea_jwt`) for the workflow's secrets and exports them as env vars (`GOOGLE_CREDENTIALS` / `GOOGLE_BACKEND_CREDENTIALS`, `GITEA_TOKEN`, `CLOUDFLARE_API_TOKEN`, `OVH_*`, or `TF_VAR_postgres_*`).
5. **`tofu init`** configures the GCS backend, binding the working dir to its state prefix using the Google credentials just read.
6. **`dflook/terraform-apply@v1`** runs against `iac` (or `postgres/iac`) with `auto_approve: true`, with no plan gate.
7. The **state** in GCS is updated and the real cloud + homelab resources are mutated to match the committed code.

## Related pages

- [factory iac](factory-iac.md): what the `iac/` stack provisions (the `IAC` workflow's target).
- [postgres iac](postgres-iac.md): the `postgres/iac/` database stack (the `Postgres` workflow's target).
- [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md): the `gitea_cicd` role, OIDC trust, and KV mounts behind every secret read here.
- [ADR-0001 · Safe prod-like environment](../../../ADR/0001-safe-prod-like-environment.md): the sandbox lane runs the same tofu plan-only against a `sandbox/` state prefix and a throwaway zone.

148
vibe/guidebooks/factory-provisioning/opentofu/factory-iac.md
Normal file
@@ -0,0 +1,148 @@
[vibe](../../../README.md) > [Guidebooks](../../README.md) > [Factory provisioning](../README.md) > [OpenTofu](README.md) > **factory iac**
# factory iac — the `iac/` state root

> [!NOTE]
> **Status:** ✅ active · **Last Updated:** 2026-06-23
> **Code:** [`iac/`](../../../../iac) · **State backend:** `gs://arcodange-tf/factory/main` ([`iac/backend.tf`](../../../../iac/backend.tf))
> **Upstream:** [OpenTofu hub](README.md) · [Factory provisioning hub](../README.md) · [Lab ecosystem · 01 factory](../../lab-ecosystem/01-factory.md)
> **Related:** [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) · [Storage & recovery](../../lab-ecosystem/storage-and-recovery.md) · [Naming conventions](../../lab-ecosystem/naming-conventions.md) · [CI apply flow](ci-apply-flow.md) · [postgres iac](postgres-iac.md) · [ADR-0001 safe prod-like environment](../../../ADR/0001-safe-prod-like-environment.md)

The `iac/` root provisions everything that lives **outside** the K3s cluster: the Cloudflare R2 bucket that holds the [cms](https://gitea.arcodange.lab/arcodange-org/cms) repo's own OpenTofu state, the per-service Cloudflare and OVH API tokens that repo consumes, a restricted Gitea CI user for reading private module repos, and the GCS bucket that backs up Longhorn volumes. Each provisioned credential is written **both** to a Gitea Actions secret (where the consuming workflow expects it) **and** to a Vault path (the durable source of truth; see [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md)).

This root's own state lives at `gs://arcodange-tf/factory/main` and is applied by [`.gitea/workflows/iac.yaml`](../../../../.gitea/workflows/iac.yaml) on any change under `iac/**`. See [CI apply flow](ci-apply-flow.md) for the job-by-job walkthrough.

---

## Providers

Declared in [`iac/providers.tf`](../../../../iac/providers.tf).

| Provider | Source | Version | Endpoint / scope | Auth |
| --- | --- | --- | --- | --- |
| `gitea` | `go-gitea/gitea` | `0.6.0` | `https://gitea.arcodange.lab` | `GITEA_TOKEN` env var |
| `vault` | `vault` | `4.4.0` | `https://vault.arcodange.lab` | JWT login: mount `gitea_jwt`, role `gitea_cicd` |
| `google` | `google` | `7.0.1` | project `arcodange`, region `US-EAST1` | `GOOGLE_CREDENTIALS` env var |
| `cloudflare` | `cloudflare/cloudflare` | `~> 5` | DNS / Pages / R2 / IAM | `CLOUDFLARE_API_TOKEN` env var |
| `ovh` | `ovh/ovh` | `2.8.0` | endpoint `ovh-eu` | `OVH_APPLICATION_KEY` / `OVH_APPLICATION_SECRET` / `OVH_CONSUMER_KEY` |

> [!NOTE]
> The Cloudflare account ID is **not** hard-coded. It is resolved at plan time from `data.cloudflare_account.arcodange`, filtered on the account name `arcodange@gmail.com` ([`iac/cloudflare.tf`](../../../../iac/cloudflare.tf)), and exposed as `local.cloudflare_account_id`.

---

## Cloudflare — R2 backend bucket & service tokens

Defined in [`iac/cloudflare.tf`](../../../../iac/cloudflare.tf). Two tokens are minted through the [`modules/cloudflare_token`](#the-cloudflare_token-module) mechanism: one scoped to the R2 state bucket, one broad token handed to the cms repo.

| Resource | Type | Identity / scope | Secret destination |
| --- | --- | --- | --- |
| `cloudflare_r2_bucket.arcodange_tf` | R2 bucket | name `arcodange-tf`, jurisdiction `eu` | — (holds the *cms* repo's own OpenTofu state) |
| `module.cf_r2_arcodange_tf_token` | module → `cloudflare_account_token` | account: `Workers R2 Storage Read`, `Account Settings Read`; bucket: `Workers R2 Storage Bucket Item Write` | `vault_kv_secret.cf_r2_arcodange_tf` → `kvv1/cloudflare/r2/arcodange-tf` (S3 access key, secret, `https://<account_id>.eu.r2.cloudflarestorage.com` endpoint) |
| `vault_policy.cf_r2_arcodange_tf` | Vault policy | name `factory__cf_r2_arcodange_tf` | read on `kvv1/cloudflare/r2/arcodange-tf` **and** `kvv1/zoho/self_client` (the Zoho mail client is created manually) |
| `module.cf_arcodange_cms_token` | module → `cloudflare_account_token` | account-scope: `Pages Write`, `Account DNS Settings Write`, `Account Settings Read`, `Zone Write`, `Zone Settings Write`, `DNS Write`, `Cloudflare Tunnel Write`, `Turnstile Sites Write` | Gitea secrets `CLOUDFLARE_API_TOKEN` + `CLOUDFLARE_ACCOUNT_ID` on the `cms` repo; Vault `kvv1/cloudflare/cms/cf_arcodange_cms_token` |

The `cms` repo (`data.gitea_repo.cms`, owner `arcodange-org`) receives the broad token because it manages the public site end to end: Cloudflare Pages deploys, DNS records, zone settings, the Tunnel, and Turnstile.

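
For orientation, the S3-style credentials stored at `kvv1/cloudflare/r2/arcodange-tf` can be reproduced by hand. The sketch below is written in Python rather than HCL and assumes the derivation this page attributes to the `cloudflare_token` module outputs (access key = token id, secret = SHA-256 of the token value). The function name, the `endpoint` key, and the sample values are hypothetical placeholders, not names taken from the repo.

```python
import hashlib


def r2_s3_credentials(token_id: str, token_value: str, account_id: str) -> dict:
    """Derive S3-compatible R2 credentials from a Cloudflare account token.

    The access key is the token id, the secret is the hex SHA-256 digest of
    the token value, and the endpoint is the EU-jurisdiction R2 host.
    """
    return {
        "access_key_id": token_id,
        "secret_access_key": hashlib.sha256(token_value.encode("utf-8")).hexdigest(),
        "endpoint": f"https://{account_id}.eu.r2.cloudflarestorage.com",
    }


# Hypothetical placeholder values, not real identifiers.
creds = r2_s3_credentials("tok-id-123", "tok-value-abc", "0123abcd")
```

An S3 client pointed at `creds["endpoint"]` with these two keys should then be able to reach the bucket, within the permissions listed in the table.
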
> [!CAUTION]
> Both tokens are minted with **`expires_on = null`** — they never expire. A leaked `cf_arcodange_cms_token` grants standing DNS/Pages/Tunnel/Turnstile write on the whole account until manually revoked. There is no automatic rotation; rotation means tainting the module's `cloudflare_account_token` and re-applying.

---
## OVH — OAuth2 client for the cms domain

Defined in [`iac/ovh.tf`](../../../../iac/ovh.tf). A `CLIENT_CREDENTIALS` OAuth2 client lets the cms workflow edit DNS nameservers for `arcodange.fr`, constrained by an IAM policy.

| Resource | Type | Scope |
| --- | --- | --- |
| `ovh_me_api_oauth2_client.cms` | OAuth2 client | name `cms repo`, flow `CLIENT_CREDENTIALS` — "arcodange.fr management" |
| `ovh_iam_policy.cms` | IAM policy | name `cms_manager`; identity = the OAuth2 client; resources = account URN + `urn:v1:eu:resource:domain:arcodange.fr`; allow = a handful of `me/*` reads, all domain **READ** reference-actions (computed via `data.ovh_iam_reference_actions.domain`), plus `domain:apiovh:nameServer/edit` |
| `gitea_repository_actions_secret.ovh_cms_client_id` | Gitea secret | `OVH_CLIENT_ID` on the `cms` repo |
| `gitea_repository_actions_secret.ovh_cms_client_secret` | Gitea secret | `OVH_CLIENT_SECRET` on the `cms` repo |
| `vault_kv_secret.ovh_cms_token` | Vault secret | `kvv1/ovh/cms/app` — `client_id`, `client_secret`, `urn` |

> [!NOTE]
> The write surface is deliberately narrow: the policy grants **only** `nameServer/edit` for writes; everything else is read-only. This lets the cms pipeline point `arcodange.fr` at Cloudflare nameservers without exposing the broader OVH account.

---
## Gitea — restricted CI module-reader user

Defined in [`iac/gitea_tofu_ci_user.tf`](../../../../iac/gitea_tofu_ci_user.tf). A locked-down Gitea account whose SSH key lets CI clone private Terraform module repos without exposing a privileged token.

| Resource | Type | Notes |
| --- | --- | --- |
| `random_password.tofu` | password | length 32 — the user's login password |
| `gitea_user.tofu` | Gitea user | username `tofu_module_reader`, email `tofu-module-reader@arcodange.fake`, `restricted = true`, `visibility = private`, `prohibit_login = false` |
| `tls_private_key.tofu` | keypair | algorithm **ED25519** |
| `gitea_public_key.tofu` | SSH key | public half attached to `tofu_module_reader` |
| `vault_kv_secret.gitea_admin_token` | Vault secret | `kvv1/gitea/tofu_module_reader` — `ssh_private_key` + `ssh_public_key` |

> [!NOTE]
> Despite the Terraform resource name `gitea_admin_token`, the stored payload is the **SSH keypair**, not an admin token. The user is `restricted`, so it can only read repos it is explicitly granted access to.

---
## Google / GCS — Longhorn backup target

Defined in [`iac/gcs_backup.tf`](../../../../iac/gcs_backup.tf). A GCS bucket plus an HMAC key wired into Vault so the in-cluster Longhorn controller can pull S3-compatible backup credentials. See [Storage & recovery](../../lab-ecosystem/storage-and-recovery.md) for how this fits the cluster-recovery story.

| Resource | Type | Value |
| --- | --- | --- |
| `google_storage_bucket.longhorn_backup` | GCS bucket | name `arcodange-backup`, location `NAM4` (dual-region), `force_destroy = true`, `public_access_prevention = enforced` |
| `google_service_account.longhorn_backup` | service account | account_id `longhorn-backup` |
| `google_storage_bucket_iam_member.longhorn_backup` | IAM binding | `roles/storage.admin` on the bucket, member = the SA |
| `google_storage_hmac_key.longhorn_backup` | HMAC key | S3-compatible access_id + secret for that SA |
| `vault_kv_secret_v2.longhorn_gcs_backup` | Vault **KVv2** secret | mount `kvv2`, name `longhorn/gcs-backup`, `cas = 1`, `delete_all_versions = true` — `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_ENDPOINTS = https://storage.googleapis.com` |
| `vault_policy.longhorn_gcs_backup` | Vault policy | name `longhorn-gcs-backup` — read on `kvv2/data/longhorn/gcs-backup` |
| `vault_kubernetes_auth_backend_role.longhorn` | Vault k8s auth role | role `longhorn`, bound SA `longhorn-vault-secret-reader` in namespace `longhorn-system`, audience `vault`, policy `longhorn-gcs-backup` |

The bound service-account name `longhorn-vault-secret-reader` must match the `VaultAuth` manifest in-cluster — that's the handshake that lets Longhorn read the HMAC creds at runtime.

> [!WARNING]
> The HMAC key is an **S3-compatible** credential and is weaker than a native GCS service-account key: it is a long-lived static secret with no key rotation built into this config, and `roles/storage.admin` grants full read/write/delete on the backup bucket. Combined with `force_destroy = true`, a state operation that destroys `arcodange-backup` will delete every Longhorn backup without prompting. Treat this bucket as critical and irreplaceable infrastructure.

---
## The `cloudflare_token` module

Source: [`iac/modules/cloudflare_token/`](../../../../iac/modules/cloudflare_token). This local module turns **human-readable permission names** into a working Cloudflare account token, so callers never hard-code permission-group UUIDs.

How it works ([`main.tf`](../../../../iac/modules/cloudflare_token/main.tf)):

1. It reads **all** available permission groups via `data.cloudflare_account_api_token_permission_groups_list`, then builds `local.permission_map`: `"<scope>:<name>" => id` (e.g. `"account:Pages Write" => <uuid>`), keyed by the last dotted segment of the group's scope.
2. Caller-supplied names (`var.permissions.account` / `var.permissions.bucket`) are looked up against that map; any name with no match lands in `local.missing_permissions` and trips a **`precondition`** that fails the apply with a clear "Permissions introuvables" error.
3. Policies are assembled dynamically — an `account` policy targeting `com.cloudflare.api.account.<id>` and, if `var.bucket` is set, a `bucket` policy targeting `com.cloudflare.edge.r2.bucket.<id>_<jurisdiction>_<name>`.
4. The `cloudflare_account_token.token` resource sets `expires_on = null` and **ignores** drift on `expires_on` and `policies` (the upstream permission IDs are unstable). Instead, a `null_resource.cloudflare_account_token_replace` hashes the **sorted permission names** into its triggers, and `replace_triggered_by` forces a fresh token whenever the *names* change — surviving id churn while still rotating on a real permission change.
5. Outputs ([`outputs.tf`](../../../../iac/modules/cloudflare_token/outputs.tf)): `token` (sensitive), `token_id`, `token_sha256`, and — when `var.bucket` is set — `r2_credentials` mapping `access_key_id = token.id` and `secret_access_key = sha256(token.value)` for S3-compatible R2 access.

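
Steps 1, 2 and 4 above can be modelled outside OpenTofu to see why name-based lookups survive id churn. This is a rough Python sketch, not the module's HCL: the shape of each permission-group record (`id`, `name`, `scopes`), the helper names, and the ids are assumptions made for illustration, and the exact hash input used by the real trigger is not shown in this page.

```python
import hashlib


def build_permission_map(groups: list[dict]) -> dict[str, str]:
    """Key every permission group as "<scope>:<name>" -> id.

    The scope part is the last dotted segment of each scope string,
    e.g. "com.cloudflare.api.account" -> "account".
    """
    permission_map = {}
    for group in groups:
        for scope in group["scopes"]:
            key = f'{scope.rsplit(".", 1)[-1]}:{group["name"]}'
            permission_map[key] = group["id"]
    return permission_map


def resolve(permission_map: dict[str, str], requested: dict[str, list[str]]):
    """Return (ids per scope, unmatched keys); the module fails on any unmatched key."""
    resolved, missing = {}, []
    for scope, names in requested.items():
        for name in names:
            key = f"{scope}:{name}"
            if key in permission_map:
                resolved.setdefault(scope, []).append(permission_map[key])
            else:
                missing.append(key)
    return resolved, missing


def replace_trigger(requested: dict[str, list[str]]) -> str:
    """Hash the sorted permission names, so a change of ids alone changes nothing."""
    keys = sorted(f"{scope}:{name}" for scope, names in requested.items() for name in names)
    return hashlib.sha256("\n".join(keys).encode("utf-8")).hexdigest()


# Hypothetical permission groups; real ids are UUIDs returned by Cloudflare.
groups = [
    {"id": "id-1", "name": "Pages Write", "scopes": ["com.cloudflare.api.account"]},
    {"id": "id-2", "name": "Workers R2 Storage Bucket Item Write",
     "scopes": ["com.cloudflare.edge.r2.bucket"]},
]
permission_map = build_permission_map(groups)
resolved, missing = resolve(permission_map, {
    "account": ["Pages Write", "No Such Permission"],
    "bucket": ["Workers R2 Storage Bucket Item Write"],
})
```
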
---

## Vault layout: mixed KVv1 / KVv2

This root writes to **both** KV engines, which is easy to trip over.

| Path | Engine | Written by |
| --- | --- | --- |
| `kvv1/cloudflare/r2/arcodange-tf` | KVv1 (`vault_kv_secret`) | R2 backend token |
| `kvv1/cloudflare/cms/cf_arcodange_cms_token` | KVv1 | cms Cloudflare token |
| `kvv1/ovh/cms/app` | KVv1 | OVH OAuth2 client |
| `kvv1/gitea/tofu_module_reader` | KVv1 | CI user SSH key |
| `kvv2/longhorn/gcs-backup` | KVv2 (`vault_kv_secret_v2`) | Longhorn GCS HMAC |

> [!WARNING]
> Most secrets here use the **KVv1** engine (`vault_kv_secret`), but the Longhorn backup secret uses **KVv2** (`vault_kv_secret_v2`). The policy paths differ accordingly — KVv2 reads target `kvv2/data/longhorn/gcs-backup` (note the `/data/` segment), whereas KVv1 policies read the literal path. Mixing the two engines means a policy copied from one secret to another will silently grant nothing. See [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) for the engine-level design.

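
The path difference is mechanical, so it can be captured in a small helper. The following Python sketch is illustrative only (the function name is hypothetical and nothing like it exists in the repo); it assumes the `kvv1` and `kvv2` mount names used in the table above.

```python
def policy_path(mount: str, secret: str, kv_version: int) -> str:
    """Return the path a Vault policy must name to read `secret`.

    KVv1 policies use the literal path; KVv2 policies need the extra
    `data/` segment between the mount and the secret name.
    """
    if kv_version == 1:
        return f"{mount}/{secret}"
    if kv_version == 2:
        return f"{mount}/data/{secret}"
    raise ValueError(f"unsupported KV version: {kv_version}")


v1_path = policy_path("kvv1", "cloudflare/r2/arcodange-tf", 1)
v2_path = policy_path("kvv2", "longhorn/gcs-backup", 2)
```
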
---

## Outputs

The root exposes a single top-level `output "token"` (sensitive) = the cms Cloudflare token ([`iac/cloudflare.tf`](../../../../iac/cloudflare.tf)). Everything else is delivered side-effect-style into Gitea secrets and Vault paths rather than as Terraform outputs.

---

## See also

- [CI apply flow](ci-apply-flow.md) — how `iac/**` changes reach `gs://arcodange-tf/factory/main` via the Vault-JWT exchange and auto-approve apply.
- [postgres iac](postgres-iac.md) — the sibling root that provisions in-cluster PostgreSQL.
- [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) · [Storage & recovery](../../lab-ecosystem/storage-and-recovery.md) · [Naming conventions](../../lab-ecosystem/naming-conventions.md).

116
vibe/guidebooks/factory-provisioning/opentofu/postgres-iac.md
Normal file
@@ -0,0 +1,116 @@
[vibe](../../../README.md) > [Guidebooks](../../README.md) > [Factory provisioning](../README.md) > [OpenTofu](README.md) > **postgres iac**

# postgres iac — the `postgres/iac/` state root

> [!NOTE]
> **Status:** ✅ active · **Last Updated:** 2026-06-23
> **Code:** [`postgres/iac/`](../../../../postgres/iac) · **State backend:** `gs://arcodange-tf/factory/postgres` ([`postgres/iac/backend.tf`](../../../../postgres/iac/backend.tf))
> **Upstream:** [OpenTofu hub](README.md) · [Factory provisioning hub](../README.md) · [Lab ecosystem · 01 factory](../../lab-ecosystem/01-factory.md)
> **Related:** [Naming conventions](../../lab-ecosystem/naming-conventions.md) · [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) · [CI apply flow](ci-apply-flow.md) · [factory iac](factory-iac.md) · [ADR-0001 safe prod-like environment](../../../ADR/0001-safe-prod-like-environment.md)

The `postgres/iac/` root provisions **PostgreSQL roles, databases, and the pgbouncer auth function** on the live cluster database — one strand of the per-application `<app>` join key described in [Naming conventions](../../lab-ecosystem/naming-conventions.md). For each application it creates a non-login owner role, an `<app>` database owned by that role, and a `user_lookup()` function that lets PgBouncer authenticate against `pg_shadow`. A single `credentials_editor` login role (whose password is stored in Vault) is granted admin over every per-app role so that downstream tooling can mint application credentials without superuser rights.

This root's state lives at `gs://arcodange-tf/factory/postgres` and is applied by [`.gitea/workflows/postgres.yaml`](../../../../.gitea/workflows/postgres.yaml) on any change under `postgres/**` — see [CI apply flow](ci-apply-flow.md).

> [!CAUTION]
> This root runs as a **PostgreSQL superuser** ([`postgres/iac/providers.tf`](../../../../postgres/iac/providers.tf): `superuser = true`) pinned to the live database at **`192.168.1.202`** (pi2) **through PgBouncer**, with `sslmode = disable`. The provider can therefore **drop or alter live application databases**: an errant `tofu destroy` or a renamed `applications` entry will delete real data. And because the only route to Postgres is via PgBouncer on that host, **if PgBouncer is down OpenTofu cannot connect and no apply can run.** Treat every `postgres/**` merge as a production database change ([ADR-0001](../../../ADR/0001-safe-prod-like-environment.md)).

---
## Providers

Declared in [`postgres/iac/providers.tf`](../../../../postgres/iac/providers.tf).

| Provider | Source | Version | Connection | Auth |
| --- | --- | --- | --- | --- |
| `postgresql` | `cyrilgdn/postgresql` | `1.24.0` | host `192.168.1.202` (pi2), via PgBouncer, `sslmode = disable`, `superuser = true` | `var.POSTGRES_USERNAME` / `var.POSTGRES_PASSWORD` (TF vars from `TF_VAR_POSTGRES_*`, sourced from Vault in CI) |
| `vault` | `vault` | `4.4.0` | `https://vault.arcodange.lab` | JWT login — mount `gitea_jwt`, role `gitea_cicd` |

The two `POSTGRES_*` variables are declared `sensitive` in the same file; CI populates them from Vault as `TF_VAR_POSTGRES_USERNAME` / `TF_VAR_POSTGRES_PASSWORD` (see [CI apply flow](ci-apply-flow.md)).

---
## The application set

Everything in this root fans out over one variable. `var.applications` is a `set(string)` ([`variables.tf`](../../../../postgres/iac/variables.tf)) whose members are listed in [`terraform.tfvars`](../../../../postgres/iac/terraform.tfvars):

| `applications` member |
| --- |
| `webapp` |
| `erp` |
| `crowdsec` |
| `plausible` |
| `dance-lessons-coach` |

Adding an app to that list creates a full role + database + lookup-function bundle on the next apply; **removing** one would `DROP` the live database (see the caution above).

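
Because each set member becomes its own resource instance under `for_each`, a rename is seen as one removal plus one addition. The Python sketch below models that comparison; it is an illustration of the set arithmetic only, not of how OpenTofu computes a plan, and the function name and the `erp2` entry are hypothetical.

```python
def plan_application_changes(current: set[str], desired: set[str]) -> dict[str, list[str]]:
    """Compare two versions of the application set.

    Members only in `desired` would be created; members only in `current`
    would be destroyed, which for this root means dropping a live database.
    """
    return {
        "create": sorted(desired - current),
        "destroy": sorted(current - desired),
    }


current = {"webapp", "erp", "crowdsec", "plausible", "dance-lessons-coach"}
# Renaming "erp" to a hypothetical "erp2" is a destroy plus a create, not an in-place rename.
renamed = (current - {"erp"}) | {"erp2"}
changes = plan_application_changes(current, renamed)
```
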
---

## The `credentials_editor` role

Defined in [`postgres/iac/main.tf`](../../../../postgres/iac/main.tf). A single login role, granted admin over every per-app role, whose credentials downstream tooling uses to provision application logins.

| Resource | Type | Detail |
| --- | --- | --- |
| `random_password.credentials_editor` | password | length 24, `override_special = "-:!+<>"` |
| `postgresql_role.credentials_editor` | role | `login = true`, `create_role = true`; `lifecycle { ignore_changes = [roles] }` so its grant membership isn't reverted |
| `vault_kv_secret.postgres_admin_credentials` | Vault **KVv1** secret | `kvv1/postgres/credentials_editor/credentials` — `username` + `password` |

---
## Per-application resources

For each member of `var.applications`, `main.tf` creates the following (all `for_each` over the set):

| Resource | Type | What it creates |
| --- | --- | --- |
| `postgresql_role.app_role["<app>"]` | role | non-login role `<app>_role` (`login = false`) — owns the database |
| `postgresql_grant_role.credentials_editor_app_role["<app>"]` | grant | `credentials_editor` → `<app>_role` **WITH ADMIN OPTION** |
| `postgresql_database.app_db["<app>"]` | database | database `<app>`, owner `<app>_role`, `template = template0`, `alter_object_ownership = true` |
| `postgresql_function.pgbouncer_user_lookup["<app>"]` | function | `user_lookup(i_username text)` in db `<app>` — see below |
| `postgresql_grant.pgbouncer_user_lookup_public_revoke["<app>"]` | grant | revoke (empty `privileges`) of `user_lookup` from role `public` in schema `public` |
| `postgresql_grant.pgbouncer_user_lookup["<app>"]` | grant | `EXECUTE` on `user_lookup` to role `pgbouncer_auth`; `depends_on` the public-revoke (the two grants can't run in parallel) |

So `webapp` yields role `webapp_role`, database `webapp`, function `webapp.user_lookup`, and the matching grants; likewise for `erp`, `crowdsec`, `plausible`, and `dance-lessons-coach`.

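
The naming rule is simple enough to express as a lookup, which can help when checking the live server by hand. This Python sketch only restates the convention from the table above; the helper names are hypothetical. One practical detail it highlights: names such as `dance-lessons-coach_role` contain hyphens, so they must be double-quoted in hand-written SQL.

```python
def quote_ident(name: str) -> str:
    """Double-quote a PostgreSQL identifier, escaping any embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


def per_app_names(app: str) -> dict[str, str]:
    """Object names this page says are created for one application."""
    return {
        "role": f"{app}_role",         # non-login owner role
        "database": app,               # database named after the app
        "function": "user_lookup",     # created inside that database
        "function_grantee": "pgbouncer_auth",
    }


applications = ["webapp", "erp", "crowdsec", "plausible", "dance-lessons-coach"]
names = {app: per_app_names(app) for app in applications}
quoted_role = quote_ident(names["dance-lessons-coach"]["role"])
```
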
### The pgbouncer `user_lookup()` function

`postgresql_function.pgbouncer_user_lookup` defines a `plpgsql` function with **`security_definer = true`** and `parallel = "SAFE"`. It takes `i_username` (IN, text) and returns a record of `uname` + `phash`:

```sql
BEGIN
  SELECT usename, passwd FROM pg_catalog.pg_shadow
  WHERE usename = i_username INTO uname, phash;
  RETURN;
END;
```

PgBouncer's `auth_query` calls this to fetch the stored password hash. Because reading `pg_shadow` is privileged, the function is `SECURITY DEFINER` (runs as its owner). Access is locked down in two steps: first **revoke** the default `public` execute grant, then **grant** `EXECUTE` only to the `pgbouncer_auth` role — the `pgbouncer_auth` role itself is expected to already exist on the server (it is not created by this root).

> [!NOTE]
> The two grants are ordered with an explicit `depends_on`: `postgresql_grant.pgbouncer_user_lookup` waits for `postgresql_grant.pgbouncer_user_lookup_public_revoke` because the provider can't apply both grants on the same object concurrently.

---
## Vault layout

This root writes a single KVv1 secret.

| Path | Engine | Contents |
| --- | --- | --- |
| `kvv1/postgres/credentials_editor/credentials` | KVv1 (`vault_kv_secret`) | `username`, `password` of the `credentials_editor` login role |

---

## No outputs

There is **no `outputs.tf`** in this root. Nothing is exported as a Terraform output — the `credentials_editor` credentials are delivered into Vault, and the per-app roles/databases/functions are side effects on the live server. Consumers read the credentials from `kvv1/postgres/credentials_editor/credentials`, not from state outputs.

---

## See also

- [Naming conventions](../../lab-ecosystem/naming-conventions.md) — the `<app>` databases here are one strand of the per-application `<app>` join key (alongside namespaces, Vault paths, and repos).
- [CI apply flow](ci-apply-flow.md) — how `postgres/**` changes reach `gs://arcodange-tf/factory/postgres` and where `TF_VAR_POSTGRES_*` come from.
- [factory iac](factory-iac.md) — the sibling root for everything outside the cluster.
- [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md).
