diff --git a/vibe/guidebooks/README.md b/vibe/guidebooks/README.md index 1c11363..fd2f104 100644 --- a/vibe/guidebooks/README.md +++ b/vibe/guidebooks/README.md @@ -35,6 +35,7 @@ flowchart LR | Guidebook | What it maps | Status | |---|---|---| | [Lab ecosystem](lab-ecosystem/README.md) | End-to-end map of `factory` + `tools` + `cms`: repos, the `` join key, secrets via Vault, CI/CD, ArgoCD, and the data/control flows that connect them | ✅ Active | +| [Factory provisioning](factory-provisioning/README.md) | Deep dive into how factory provisions everything: Ansible playbooks + roles and OpenTofu | ✅ Active | ## Rules to contribute diff --git a/vibe/guidebooks/factory-provisioning/README.md b/vibe/guidebooks/factory-provisioning/README.md new file mode 100644 index 0000000..79a6eae --- /dev/null +++ b/vibe/guidebooks/factory-provisioning/README.md @@ -0,0 +1,88 @@ +[vibe](../../README.md) > [Guidebooks](../README.md) > **Factory provisioning** + +# Factory provisioning + +> **Status:** ✅ Active +> **Last Updated:** 2026-06-23 +> **Upstream:** [Lab ecosystem guidebook](../lab-ecosystem/README.md) · [01 · factory](../lab-ecosystem/01-factory.md) +> **Related:** [safe-prod-like-environment ADR](../../ADR/0001-safe-prod-like-environment.md) · [safe-prod-like-environment PRD](../../PRD/safe-prod-like-environment/README.md) + +This guidebook is the deep dive into **how the `factory` repo turns three Raspberry Pis + a handful of cloud accounts into the running lab.** Where the [lab-ecosystem](../lab-ecosystem/README.md) map shows *which* components exist and how they join, this guidebook drills into the two provisioning **engines** that build and maintain them: the Ansible collection that the operator runs from the Mac, and the OpenTofu modules that Gitea CI applies. Every page below describes the engine *as it is wired right now* — playbook imports, role responsibilities, inventory placement, provider versions, state backends, and the CI flow that ties Tofu to Vault. 
+ +## Two engines, two trigger models + +The factory splits provisioning along a hard line: **imperative, operator-driven host/cluster build** (Ansible) versus **declarative, CI-driven forge/cloud/database state** (OpenTofu). They never overlap on the same resource, and they run at different moments. + +| Engine | Trigger | Runs from | Owns | Lives at | +|---|---|---|---|---| +| **Ansible** | One-shot, operator-run on demand | The Mac (control node) | The cluster + base layer + stateful services: k3s, Longhorn, Pi-hole, step-ca, PostgreSQL, Gitea, Vault, CrowdSec — plus the disaster-recovery playbooks | [`ansible/`](../../../ansible/) → [sub-hub](ansible/README.md) | +| **OpenTofu** | CI-applied on Gitea (path-filtered `push`/`pull_request` + `workflow_dispatch`) | Gitea act-runners | Forge/cloud edge state (Cloudflare, OVH, GCP, Gitea, Vault) and **per-app PostgreSQL databases** | [`iac/`](../../../iac/) + [`postgres/`](../../../postgres/) → [sub-hub](opentofu/README.md) | + +> [!NOTE] +> Ansible is **imperative and human-gated** because it touches bare hosts and one-time bootstrap (disk prep, k3s install, Vault init). OpenTofu is **declarative and machine-gated** because its targets are reconcilable API objects (a DNS record, a bucket, a database) whose desired state belongs in version control and converges on every merge. + +## How a green-field lab comes up + +```mermaid +%%{init: {'theme': 'base'}}%% +flowchart LR + classDef op fill:#1e3a8a,stroke:#1e40af,color:#fff + classDef eng fill:#059669,stroke:#047857,color:#fff + classDef host fill:#7c3aed,stroke:#6d28d9,color:#fff + classDef store fill:#b45309,stroke:#92400e,color:#fff + + OP["Operator
at the Mac"]:::op -->|"runs playbooks 01→05"| ANS["Ansible collection
arcodange.factory"]:::eng + ANS -->|"OS · k3s · Longhorn · base layer"| PIS["3× Raspberry Pi
pi1 / pi2 / pi3"]:::host + PIS -->|"hosts Gitea + act-runners"| CI["Gitea CI
act-runners"]:::store + CI -->|"path-filtered apply"| TOFU["OpenTofu
iac/ + postgres/iac/"]:::eng + TOFU -->|"forge · cloud · PG state"| EDGE["Cloudflare · OVH · GCP
Gitea · Vault · PostgreSQL"]:::store + TOFU -. "state in GCS gs://arcodange-tf" .- EDGE +``` + +1. The **operator**, working from the **Mac control node**, runs the numbered Ansible playbooks `01_system` → `05_backup` in order. +2. **Ansible** lays the OS, k3s (`v1.34.3+k3s1`), Longhorn, and the base layer (Pi-hole, step-ca, Vault, CrowdSec) plus the stateful out-of-cluster services (PostgreSQL + Gitea) onto the **three Raspberry Pis** (`pi1`/`pi2`/`pi3`). +3. Once `pi2` is hosting **Gitea** and `pi1`/`pi3` are running the **act-runners** (registered by `03_cicd`), the forge can run CI. +4. A push or merge to `factory` that touches `iac/**` or `postgres/**` triggers the corresponding **Gitea CI** workflow on those runners. +5. The CI job authenticates to Vault via Gitea OIDC JWT and runs **OpenTofu**, which reconciles the **forge/cloud/database edge** — Cloudflare, OVH, GCP, Gitea action-secrets, Vault KV/policies, and the per-app PostgreSQL objects. +6. All OpenTofu state is kept in **GCS** under `gs://arcodange-tf` (prefix `factory/main` for the cloud edge, `factory/postgres` for the databases), so each CI run reads and writes the authoritative state remotely. 
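
As a rough illustration of steps 4 to 6, the mapping from a changed path to the OpenTofu stack and its GCS state prefix can be modelled in a few lines of Python. This is a hypothetical sketch, not the Gitea workflow itself: the `STACKS` table, the `stacks_to_apply` helper, and the example file names are invented for illustration, and the workflow's real path matching may differ from `fnmatch`. Only the path filters (`iac/**`, `postgres/**`), the bucket `gs://arcodange-tf`, and the prefixes `factory/main` / `factory/postgres` come from this guidebook.

```python
from fnmatch import fnmatchcase

# Path filter -> (stack directory, GCS state prefix), as described above.
# The table layout and helper names are illustrative assumptions.
STATE_BUCKET = "gs://arcodange-tf"
STACKS = {
    "iac/**": ("iac/", "factory/main"),
    "postgres/**": ("postgres/iac/", "factory/postgres"),
}


def stacks_to_apply(changed_paths):
    """Return the sorted (stack_dir, state_url) pairs touched by a push."""
    hits = set()
    for path in changed_paths:
        for pattern, (stack_dir, prefix) in STACKS.items():
            # fnmatch's "*" also matches "/", so "iac/**" covers nested files.
            if fnmatchcase(path, pattern):
                hits.add((stack_dir, f"{STATE_BUCKET}/{prefix}"))
    return sorted(hits)


if __name__ == "__main__":
    # "iac/dns.tf" is a made-up file name used only for the demo.
    for stack_dir, state_url in stacks_to_apply(["iac/dns.tf", "README.md"]):
        print(f"apply {stack_dir} (state: {state_url})")
```

A push that touches neither directory yields an empty list, which corresponds to no Tofu workflow being triggered.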
+ +## Master index + +| Sub-hub | What it maps | Status | +|---|---|---| +| [Ansible](ansible/README.md) | The `arcodange.factory` collection: numbered playbooks `01`–`06`, the inventory + group_vars, and the reusable roles that build hosts, the cluster, and the stateful services | ✅ Active | +| [OpenTofu](opentofu/README.md) | The CI-applied IaC: the cloud/forge edge (`iac/`), the per-app PostgreSQL provisioning (`postgres/iac/`), and the Gitea-OIDC → Vault apply flow | ✅ Active | + +### All pages + +- **Ansible** +  - [System (`01`)](ansible/01-system.md) — OS, DNS, SSL, disks, Docker, iSCSI, k3s, CoreDNS, cert-issuer, Longhorn/Traefik config +  - [Setup (`02`)](ansible/02-setup.md) — PostgreSQL + Gitea docker-compose on `pi2` (and the optional backup-NFS share) +  - [CI/CD (`03`)](ansible/03-cicd.md) — Gitea act-runner registration on `pi1`/`pi3` and the ArgoCD/Image-Updater install (designed, but currently commented out and not deployed) +  - [Tools (`04`)](ansible/04-tools.md) — Vault + CrowdSec bootstrap into the cluster +  - [Backup (`05`)](ansible/05-backup.md) — scheduled PostgreSQL / Gitea / k3s-PVC backups to `/mnt/backups` +  - [Recover (`06`)](ansible/06-recover.md) — the Longhorn disaster-recovery playbooks (`recover/`) +  - [Inventory & variables](ansible/inventory.md) — `hosts.yml` groups and the `group_vars` tree +  - [Roles reference](ansible/roles.md) — `deploy_docker_compose`, the `gitea_*` family, `traefik_certs`, `playwright`, and the service sub-roles +- **OpenTofu** +  - [factory iac](opentofu/factory-iac.md) — `iac/`: Cloudflare/OVH/GCP/Gitea/Vault edge + the `cloudflare_token` module +  - [postgres iac](opentofu/postgres-iac.md) — `postgres/iac/`: per-app databases, roles, and the pgbouncer `user_lookup()` function +  - [CI apply flow](opentofu/ci-apply-flow.md) — the Gitea workflows, OIDC-JWT → Vault auth, and the GCS state backend + +## Maintenance rule + +> [!IMPORTANT] +> **Alter a documented component → update its page in the same change.** If you change a playbook, a role, an inventory entry, a 
provider version, a Tofu resource, or the CI flow, the matching page in this guidebook MUST be edited in the same PR. A provisioning map that drifts from the code sends operators (and agents) down dead paths during a rebuild or a recovery — exactly when the map matters most. + +## Why this guidebook earns its keep + +The safe-prod-like-environment work rehearses **exactly these playbooks and Tofu modules** in a throwaway sandbox before they touch the real lab: the sandbox stands up the same `01`–`05` narrative and runs the same `iac/` + `postgres/iac/` apply, so the rehearsal only holds if this guidebook tracks the engines faithfully. See the [safe-prod-like-environment ADR](../../ADR/0001-safe-prod-like-environment.md) for the decision and the [PRD](../../PRD/safe-prod-like-environment/README.md) (with its [QA strategy](../../PRD/safe-prod-like-environment/qa-strategy.md)) for what the sandbox must reproduce. + +## Cross-references + +- [Lab ecosystem guidebook](../lab-ecosystem/README.md) — the higher-altitude whole-lab map; this guidebook is its provisioning deep dive. +- [01 · factory](../lab-ecosystem/01-factory.md) — the four-pillar summary of the `factory` repo that this guidebook expands. +- [secrets-and-vault.md](../lab-ecosystem/secrets-and-vault.md) — Gitea OIDC JWT for Tofu/CI and the dynamic PostgreSQL credentials these engines set up. +- [storage-and-recovery.md](../lab-ecosystem/storage-and-recovery.md) — Longhorn + GCS backup + the power-cut recovery the `06 · recover` playbooks serve. +- [naming-conventions.md](../lab-ecosystem/naming-conventions.md) — the `` join key shared by the OpenTofu state prefixes and per-app PostgreSQL objects. +- [safe-prod-like-environment ADR](../../ADR/0001-safe-prod-like-environment.md) · [PRD](../../PRD/safe-prod-like-environment/README.md) — the sandbox that rehearses these engines before they touch the real lab. 
diff --git a/vibe/guidebooks/factory-provisioning/ansible/01-system.md b/vibe/guidebooks/factory-provisioning/ansible/01-system.md new file mode 100644 index 0000000..8e4b03f --- /dev/null +++ b/vibe/guidebooks/factory-provisioning/ansible/01-system.md @@ -0,0 +1,94 @@ +[vibe](../../../README.md) > [Guidebooks](../../README.md) > [Factory provisioning](../README.md) > [Ansible](README.md) > **01 · System** + +# 01 · System — base OS, Docker, K3s, Longhorn, DNS, SSL + +> [!NOTE] +> **Status:** ✅ active · **Last Updated:** 2026-06-23 +> **Upstream:** [Ansible sub-hub](README.md) · [Factory provisioning hub](../README.md) +> **Downstream:** [02 · Setup](02-setup.md) · [03 · CI/CD](03-cicd.md) +> **Related:** [Storage & recovery](../../lab-ecosystem/storage-and-recovery.md) · [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) · [Naming conventions](../../lab-ecosystem/naming-conventions.md) · [ADR-0001 safe prod-like environment](../../../ADR/0001-safe-prod-like-environment.md) + +## What it does + +`01 · System` takes three bare Raspberry Pis (`pi1`, `pi2`, `pi3`) and turns them into a configured K3s cluster. The wrapper [`playbooks/01_system.yml`](../../../../ansible/arcodange/factory/playbooks/01_system.yml) does nothing but `import_playbook` the stage orchestrator [`playbooks/system/system.yml`](../../../../ansible/arcodange/factory/playbooks/system/system.yml), which in turn imports ten sub-playbooks **in strict order**. Each sub-play layers one capability: hostname/DNS hygiene, Pi-hole HA DNS, the step-ca PKI, the external backup disk, Docker, the iSCSI/dm-crypt prerequisites for Longhorn, K3s itself, CoreDNS forwarding, the cert-manager issuer, and finally the cluster config (Longhorn + Traefik). + +All host-facing plays target `raspberries:&local` — the intersection of the `raspberries` group and the `local` group, which resolves to `pi1`/`pi2`/`pi3` (see [Inventory & variables](inventory.md)). 
The K3s server/agent split is decided at runtime: the **first host (alphabetically) becomes the server**, the rest become agents. + +## Ordered steps + +| # | Sub-playbook | Purpose | Key vars / versions | +| --- | --- | --- | --- | +| 1 | [`system/rpi.yml`](../../../../ansible/arcodange/factory/playbooks/system/rpi.yml) | Set each node's hostname to its `inventory_hostname`. On Pi-hole nodes (`pi1`/`pi3`) add `dnsmasq` to the `dip` group, then **stop & disable `dnsmasq`** to free port 53 for `pihole-FTL`. | `tags: never` (opt-in only) | +| 2 | [`dns/dns.yml`](../../../../ansible/arcodange/factory/playbooks/dns/dns.yml) → [`dns/pihole.yml`](../../../../ansible/arcodange/factory/playbooks/dns/pihole.yml) | Install & configure **Pi-hole HA DNS** via the `pihole` role. Adds custom records mapping `.arcodange.lab` and `.arcodange.duckdns.org` to `pi1`. | `pihole_custom_dns` → `pi1.preferred_ip` | +| 3 | [`ssl/ssl.yml`](../../../../ansible/arcodange/factory/playbooks/ssl/ssl.yml) → [`ssl/step-ca.yml`](../../../../ansible/arcodange/factory/playbooks/ssl/step-ca.yml) | Install **step-ca** (the `step_ca` role) on all three Pis; fetch the root CA from `pi1`; build a **Gitea runner image that trusts the CA** (`runner-images:ubuntu-latest-ca`) and push it to the registry. | `step_ca_primary: pi1`, root at `/home/step/.step/certs/root_ca.crt` | +| 4 | [`system/prepare_disks.yml`](../../../../ansible/arcodange/factory/playbooks/system/prepare_disks.yml) | Auto-detect the largest external (non-`mmcblk0`) USB partition, format it **ext4 with label `arcodange_500`**, mount at `/mnt/arcodange`, and persist in `fstab`. Skips format if the label already exists. 
**`pause` confirm before any format.** | `mount_point: /mnt/arcodange`, `disk_label: arcodange_500` | +| 5 | [`system/system_docker.yml`](../../../../ansible/arcodange/factory/playbooks/system/system_docker.yml) | Install Docker via `geerlingguy.docker`; write `daemon.json` with **json-file logging** (`max-size 10m`, `max-file 5`) and **`data-root: /mnt/arcodange/docker`** (only when the external disk is mounted). | `tags: never`; `storage-driver: overlay2` | +| 6 | [`system/iscsi_longhorn.yml`](../../../../ansible/arcodange/factory/playbooks/system/iscsi_longhorn.yml) | Install `open-iscsi` (+ enable `iscsid`) and `cryptsetup`, and load the **`dm_crypt`** kernel module (persisted in `/etc/modules`) — Longhorn's encrypted-volume prerequisites. Creates `/mnt/arcodange/longhorn`. | module `dm_crypt` | +| 7 | [`system/system_k3s.yml`](../../../../ansible/arcodange/factory/playbooks/system/system_k3s.yml) | Build the K3s inventory dynamically (first sorted host → `server`, rest → `agent`), install the `k3s-ansible` content, run `k3s.orchestration.site`, then **fetch the kubeconfig** to `~/.kube/config` (rewriting `127.0.0.1` → server IP). | **k3s `v1.34.3+k3s1`**; server args `--docker --disable traefik` | +| 8 | [`system/k3s_dns.yml`](../../../../ansible/arcodange/factory/playbooks/system/k3s_dns.yml) | Create the **`coredns-custom`** ConfigMap so cluster DNS forwards `arcodange.lab:53` to the Pi-hole IPs; also patch the main CoreDNS Corefile to forward to the same HA Pi-holes. | `pihole_ips` (extracted from hostvars) | +| 9 | [`system/k3s_ssl.yml`](../../../../ansible/arcodange/factory/playbooks/system/k3s_ssl.yml) | Deploy **cert-manager** + **step-issuer** as k3s static HelmCharts; create the `StepClusterIssuer` `step-ca` wired to the JWK provisioner and root CA. 
| cert-manager `v1.19.2`, step-issuer `1.9.11`, `caUrl: https://ssl-ca.arcodange.lab:8443`, **ARM64 `kube-rbac-proxy` override** | +| 10 | [`system/k3s_config.yml`](../../../../ansible/arcodange/factory/playbooks/system/k3s_config.yml) | Deploy **Longhorn** + **Traefik** as HelmCharts; issue the wildcard cert, set the default `TLSStore`, wire Gitea, the IP-allow-list middleware, and the CrowdSec bouncer plugin; then **delete the old Traefik** to force a redeploy. | Longhorn `v1.9.1`, Traefik `v37.4.0` (see detail below) | + +## How the stages fit together + +```mermaid +%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'13px'}}}%% +flowchart TD + classDef host fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb; + classDef cluster fill:#1e4032,stroke:#22c55e,color:#f0fdf4; + classDef danger fill:#5f1e1e,stroke:#ef4444,color:#fef2f2; + + rpi["1 · rpi.yml
hostname + dnsmasq off"]:::host + dns["2 · pihole
HA DNS"]:::host + ssl["3 · step-ca
root CA + CA-trusting runner image"]:::host + disk["4 · prepare_disks.yml
ext4 arcodange_500 -> /mnt/arcodange"]:::danger + docker["5 · system_docker.yml
data-root on external disk"]:::host + iscsi["6 · iscsi_longhorn.yml
open-iscsi + dm_crypt"]:::host + k3s["7 · system_k3s.yml
k3s v1.34.3 (--disable traefik)"]:::cluster + cdns["8 · k3s_dns.yml
coredns-custom -> Pi-hole"]:::cluster + cmgr["9 · k3s_ssl.yml
cert-manager + step-issuer"]:::cluster + cfg["10 · k3s_config.yml
Longhorn + Traefik + redeploy"]:::cluster + + rpi --> dns --> ssl --> disk --> docker --> iscsi --> k3s --> cdns --> cmgr --> cfg +``` + +1. **`rpi.yml`** fixes the hostname and, on Pi-hole nodes, stops `dnsmasq` so `pihole-FTL` can own port 53. +2. **Pi-hole** comes up as the HA DNS authority for `arcodange.lab`. +3. **step-ca** is installed; its root CA is fetched and baked into a Gitea runner image so CI can trust internal TLS. +4. **`prepare_disks.yml`** formats and mounts the external USB disk at `/mnt/arcodange` (with a confirmation pause). +5. **Docker** installs with its data-root pointed at that disk and capped logging. +6. **iSCSI + dm_crypt** prerequisites land so Longhorn can attach (and encrypt) volumes. +7. **K3s** installs with the first host as server, Docker as the container runtime, and Traefik disabled. +8. **CoreDNS** is reconfigured to forward `arcodange.lab` to the Pi-holes. +9. **cert-manager + step-issuer** wire the in-cluster issuer to step-ca. +10. **`k3s_config.yml`** deploys Longhorn and a fully-customized Traefik, then deletes the old Traefik so the helm-controller redeploys with the new config. + +## `k3s_config.yml` — Longhorn & Traefik detail + +| Resource | Value | Notes | +| --- | --- | --- | +| Longhorn HelmChart | `v1.9.1` | `defaultSettings.defaultDataPath: /mnt/arcodange/longhorn` — volumes live on the external disk. | +| Traefik HelmChart | `v37.4.0` | Deployed as a k3s static manifest (`traefik-v3.yaml`) with an inline `traefik-configmap`. | +| Wildcard cert | `wildcard-arcodange-lab` | `Certificate` for `arcodange.lab` + `*.arcodange.lab`, issued by the `step-issuer` `StepClusterIssuer`. | +| `TLSStore` `default` | `defaultCertificate: wildcard-arcodange-lab` | Makes the wildcard cert the cluster-wide default. | +| Gitea exposure | `gitea-external` `ExternalName` Service → `pi2` port 3000 | Gitea runs **outside** K3s as Docker Compose on `pi2`; Traefik routes `gitea.arcodange.lab` to it. 
| +| `localIp` middleware | `ipAllowList` | Restricts dashboard/Gitea routers to LAN + pod CIDR + the detected public IP. | +| CrowdSec bouncer | plugin `v1.3.3` | Traefik experimental plugin `crowdsec-bouncer-traefik-plugin` (config completed in [04 · Tools](04-tools.md)). | +| DuckDNS token | `traefik-duckdns-token` Secret → `DUCKDNS_TOKEN` | Consumed by the `letsencrypt` ACME DNS-challenge resolver via `envFrom`. | + +## Gotchas + +> [!CAUTION] +> **Step 4 formats a disk — data loss is real.** `prepare_disks.yml` picks the **largest non-system partition** and runs `mkfs.ext4 -F` on it when the `arcodange_500` label is absent. The `run_once` `pause` prompt ("tapez 'oui' pour continuer") is the only guard, and a wrong USB stick plugged into the wrong Pi will be wiped. Confirm `target_device` in the debug output before answering. If a candidate already carries the label, the format is skipped and the disk is only (re)mounted. + +> [!WARNING] +> **K3s ships with `--disable traefik`.** The bundled Traefik is intentionally turned off in step 7 so step 10 can deploy its own fully-customized `v37.4.0`. If you re-enable the bundled Traefik or run `k3s_config.yml` out of order, two Traefiks will fight over the ingress ports. + +> [!WARNING] +> **ARM64 needs the `kube-rbac-proxy` image override.** step-issuer's default `gcr.io/kubebuilder/kube-rbac-proxy:v0.15.0` is AMD64-only and **crash-loops on `pi3` (ARM64)**. `k3s_ssl.yml` overrides it to `quay.io/brancz/kube-rbac-proxy:v0.15.0`. Do not remove this override. + +> [!WARNING] +> **Traefik is force-redeployed.** The last play of `k3s_config.yml` deletes the `traefik` Deployment **and** the `helm-install-traefik` Job so the k3s helm-controller re-runs the install against the new manifest. Expect a brief ingress outage during this window; the play then waits for the new Deployment to come back before finishing. 
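
To make the step 4 caution concrete, the selection rule can be modelled as below. This is a simplified Python sketch, not the playbook's actual tasks: the `Partition` record, the `plan_disk` helper, and the device names are hypothetical, and the real play detects devices on the host. It also assumes that any candidate already carrying the label is preferred over the largest one, which is one plausible reading of the text. Only the rules themselves (skip `mmcblk0`, reuse an existing `arcodange_500` label, otherwise format the largest candidate after confirmation) are taken from this page.

```python
from dataclasses import dataclass
from typing import Optional

DISK_LABEL = "arcodange_500"
SYSTEM_DISK = "mmcblk0"  # the SD card holding the OS is never a candidate


@dataclass(frozen=True)
class Partition:
    name: str  # e.g. "sda1" (hypothetical device name)
    size_bytes: int
    label: Optional[str] = None


def plan_disk(partitions):
    """Return (target, needs_format), or (None, False) when nothing qualifies."""
    candidates = [p for p in partitions if not p.name.startswith(SYSTEM_DISK)]
    if not candidates:
        return None, False
    # An already-labelled partition is only (re)mounted, never formatted.
    labelled = [p for p in candidates if p.label == DISK_LABEL]
    if labelled:
        return labelled[0], False
    # Otherwise the largest external partition is the format target.
    return max(candidates, key=lambda p: p.size_bytes), True
```

The last branch is the one the caution is about: with no label present, the largest external partition wins, whatever it happens to contain.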
+ +> [!NOTE] +> **`tags: never` plays are opt-in.** `rpi.yml` and `system_docker.yml` carry `tags: never`, so they are skipped unless you explicitly pass their tag (e.g. `--tags rpi` / `--tags ...`) or `--tags all`. The K3s/Longhorn/Traefik plays run on a normal invocation. diff --git a/vibe/guidebooks/factory-provisioning/ansible/02-setup.md b/vibe/guidebooks/factory-provisioning/ansible/02-setup.md new file mode 100644 index 0000000..383ef58 --- /dev/null +++ b/vibe/guidebooks/factory-provisioning/ansible/02-setup.md @@ -0,0 +1,82 @@ +[vibe](../../../README.md) > [Guidebooks](../../README.md) > [Factory provisioning](../README.md) > [Ansible](README.md) > **02 · Setup** + +# 02 · Setup — Postgres, Gitea, NFS backup target + +> [!NOTE] +> **Status:** ✅ active · **Last Updated:** 2026-06-23 +> **Upstream:** [Ansible sub-hub](README.md) · [01 · System](01-system.md) +> **Downstream:** [03 · CI/CD](03-cicd.md) +> **Related:** [Inventory & variables](inventory.md) · [Roles reference](roles.md) · [Storage & recovery](../../lab-ecosystem/storage-and-recovery.md) · [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) + +## What it does + +`02 · Setup` deploys the **stateful services the rest of the platform leans on**: a PostgreSQL server and a Gitea instance — both running as **Docker Compose stacks on `pi2`, outside K3s** — plus the in-cluster NFS backup target. The wrapper [`playbooks/02_setup.yml`](../../../../ansible/arcodange/factory/playbooks/02_setup.yml) imports [`playbooks/setup/setup.yml`](../../../../ansible/arcodange/factory/playbooks/setup/setup.yml), which pings the Pis, then imports three sub-playbooks: `backup_nfs.yml` (tagged `never`), `postgres.yml`, and `gitea.yml`. + +> [!IMPORTANT] +> **Postgres and Gitea do not run in Kubernetes.** They are Docker Compose stacks on `pi2` (the sole member of the `postgres` group, which `gitea` inherits as a child — see [Inventory & variables](inventory.md)). 
K3s only references them: Traefik exposes Gitea via an `ExternalName` Service, and the `pg-fix-table-ownership` CronJob reaches Postgres over the LAN. This keeps the two services available even when the cluster is being rebuilt. + +## Ordered steps + +| # | Sub-playbook | Purpose | Key vars / versions | +| --- | --- | --- | --- | +| 1 | [`setup/backup_nfs.yml`](../../../../ansible/arcodange/factory/playbooks/setup/backup_nfs.yml) | Provision the shared backup volume: a **Longhorn RWX PVC `backups-rwx` (50Gi)**, a Longhorn `RecurringJob`, a `busybox` deploy to spawn the share-manager, then mount the resulting NFS share at `/mnt/backups` on every Pi. | `tags: never`; `backup_size: 50Gi`, RecurringJob `thrice-a-month-backup` (`cron 0 5 */2 * *`, retain 2) | +| 2 | [`setup/postgres.yml`](../../../../ansible/arcodange/factory/playbooks/setup/postgres.yml) | Deploy the Postgres Compose stack (`deploy_docker_compose` + `deploy_postgresql` role), create the `gitea` DB/user, create the **pgbouncer auth_user + `user_lookup()` functions** in both `postgres` and `gitea` DBs, publish the K8s Secret `postgres-admin-credentials`, and install the **`pg-fix-table-ownership` CronJob**. | **Postgres `16.3-alpine`**; container `postgres`; CronJob daily `0 3 * * *` | +| 3 | [`setup/gitea.yml`](../../../../ansible/arcodange/factory/playbooks/setup/gitea.yml) | Deploy the Gitea Compose stack (`deploy_docker_compose` + `deploy_gitea` role), create admin `arcodange`, mint an API token via `gitea_token`, upload the avatar, register the SSH key, create org `arcodange-org`, then **delete the temp token**. 
| **Gitea `1.25.5`**; base URL `http://pi2:3000` | + +## NFS backup target — how the share is born + +```mermaid +%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'13px'}}}%% +flowchart TD + classDef cluster fill:#1e4032,stroke:#22c55e,color:#f0fdf4; + classDef host fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb; + + pvc["RWX PVC backups-rwx (50Gi)
longhorn-system"]:::cluster + rj["RecurringJob thrice-a-month-backup
cron 0 5 */2 * *"]:::cluster + dep["busybox Deployment rwx-nfs
mounts the PVC"]:::cluster + sm["Longhorn share-manager
(spawned by the mount)"]:::cluster + svc["Service nfs-backups-rwx
ClusterIP :2049"]:::cluster + mount["mount /mnt/backups on pi1/pi2/pi3
NFS vers=4.1"]:::host + + pvc --> rj + pvc --> dep --> sm --> svc --> mount +``` + +1. A **ReadWriteMany Longhorn PVC** (`backups-rwx`, 50Gi) is created in `longhorn-system`. +2. A **`RecurringJob`** is attached to the volume so Longhorn snapshots/backs it up on the `0 5 */2 * *` schedule. +3. A **`busybox` Deployment (`rwx-nfs`)** mounts the PVC — the act of mounting an RWX volume makes Longhorn spawn an **NFS share-manager** pod. +4. A stable **ClusterIP Service** (`nfs-backups-rwx`, port 2049) is created (or reused) to front the share-manager. +5. Each Pi installs `nfs-common` and **mounts the share at `/mnt/backups`** (`vers=4.1`, `nofail`, `x-systemd.automount`), persisted in `fstab`. + +## Postgres — what gets created + +| Artifact | Where | Purpose | +| --- | --- | --- | +| Compose stack `arcodange_factory` | `pi2` Docker | Runs `postgres:16.3-alpine`, container `postgres`, port `5432`, data under `/home/pi/arcodange/docker_composes/postgres/data`. | +| `gitea` DB + user | inside Postgres | Created by the `deploy_postgresql` role from `applications_databases.gitea` (`gitea_database`). | +| pgbouncer `auth_user` (`pgbouncer_auth`) | `postgres` + `gitea` DBs | Login role used by the [pgbouncer pooler](../../lab-ecosystem/02-tools.md) for SCRAM lookups. | +| `user_lookup(text)` function | `postgres` + `gitea` DBs | `SECURITY DEFINER` function over `pg_shadow`; `EXECUTE` granted only to `pgbouncer_auth`. | +| K8s Secret `postgres-admin-credentials` | `kube-system` | Base64 admin user/password so the in-cluster CronJob can authenticate. | +| CronJob `pg-fix-table-ownership` | `kube-system` | Runs `postgres:16.3` daily at **03:00**; discovers `%_role` roles, derives each DB by stripping `_role`, and re-`ALTER TABLE ... OWNER TO` every public table — repairing ownership after a restore. | + +## Gitea — bootstrap sequence + +1. 
**Compose deploy** via `deploy_docker_compose`, then the `deploy_gitea` role wires Gitea to the Postgres DB (host/db/user/password pulled from the compose env). +2. **Admin user** `arcodange` (`arcodange@gmail.com`) is created with `--random-password --admin` if absent. +3. **API token** is minted by the `gitea_token` role and used for the next HTTP calls. +4. **Avatar** upload, **SSH public key** registration (idempotent), and **org `arcodange-org`** (full name "Arcodange") creation + avatar. +5. **Cleanup** — a `post_tasks` invocation of `gitea_token` with `gitea_token_delete: true` removes the temporary token. + +## Gotchas + +> [!WARNING] +> **The NFS play is `never`-tagged and order-sensitive.** `backup_nfs.yml` only runs when explicitly tagged, and several of its tasks (`Créer PVC RWX`, `Lancer un Deployment pour déclencher NFS`, `Attendre que le pod rwx-nfs soit Running`) are themselves `tags: never`. The RWX volume must already exist for the busybox deploy to spawn the share-manager; running the mount step before the share-manager is `Running` will hang on the `until` retry loop. + +> [!WARNING] +> **Postgres lives on `pi2` outside K3s.** Treat it as a single-host service: there is no Postgres pod to `kubectl get`. The cluster only sees the `postgres-admin-credentials` Secret and the `pg-fix-table-ownership` CronJob, both of which reach the DB over the LAN at `pi2:5432`. A `pi2` outage takes Postgres (and Gitea) down regardless of cluster health. + +> [!CAUTION] +> **`pg-fix-table-ownership` exists because restores break ownership.** After a Longhorn/data recovery, tables can come back owned by the wrong role and apps lose write access. The daily CronJob silently re-owns every `public` table to the `_role` matching each `%_role` PostgreSQL role. If you add a database whose owning role does **not** follow the `_role` naming convention, this job will not fix it — see [Naming conventions](../../lab-ecosystem/naming-conventions.md). 
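
The naming dependency in the caution above can be illustrated with a short Python model. It sketches the rule as described on this page (roles matching `%_role`, database name obtained by stripping the `_role` suffix) and is not the CronJob's actual SQL; the helper names are hypothetical and the role names used with it are made up.

```python
ROLE_SUFFIX = "_role"


def database_for_role(role):
    """Return the database a role is assumed to own, or None if it does not match."""
    # Note: if the job uses SQL LIKE '%_role', the "_" there is a
    # single-character wildcard, so the SQL match would be slightly looser
    # than this literal suffix check.
    if role.endswith(ROLE_SUFFIX) and len(role) > len(ROLE_SUFFIX):
        return role[: -len(ROLE_SUFFIX)]
    return None


def ownership_plan(roles):
    """Map each derived database to the role that should own its public tables."""
    plan = {}
    for role in roles:
        database = database_for_role(role)
        if database is not None:
            plan[database] = role
    return plan
```

A role such as `pgbouncer_auth` maps to `None` here, which mirrors the gotcha: anything outside the naming convention is simply never visited by the repair job.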
+ +> [!NOTE] +> **The admin password is random and printed once.** Gitea's admin is created with `--random-password`; capture it from the play output (or reset it via `docker exec`) — it is not stored in the inventory. The bootstrap API token is deliberately deleted at the end, so re-running the play re-mints a fresh one. diff --git a/vibe/guidebooks/factory-provisioning/ansible/03-cicd.md b/vibe/guidebooks/factory-provisioning/ansible/03-cicd.md new file mode 100644 index 0000000..51b6aa7 --- /dev/null +++ b/vibe/guidebooks/factory-provisioning/ansible/03-cicd.md @@ -0,0 +1,34 @@ +[vibe](../../../README.md) > [Guidebooks](../../README.md) > [Factory provisioning](../README.md) > [Ansible](README.md) > **03 · CI/CD** + +# 03 · CI/CD — Gitea Actions runners + +> [!NOTE] +> **Status:** ✅ active · **Last Updated:** 2026-06-23 +> **Upstream:** [Ansible sub-hub](README.md) · [02 · Setup](02-setup.md) +> **Downstream:** [04 · Tools](04-tools.md) +> **Related:** [Lab ecosystem · 01 factory (ArgoCD caveat)](../../lab-ecosystem/01-factory.md) · [Roles reference](roles.md) · [Inventory & variables](inventory.md) + +## What it does + +`03 · CI/CD` registers and deploys the **Gitea Actions runner (`act_runner`)** on every Pi that is *not* the Gitea host, so CI jobs have executors. The whole stage is one playbook, [`playbooks/03_cicd.yml`](../../../../ansible/arcodange/factory/playbooks/03_cicd.yml) — there is no stage subdirectory. + +It targets `raspberries:&local:!gitea`, i.e. the raspberries that are local **minus** the `gitea` group. Since `gitea` resolves to `pi2`, the runner lands on **`pi1` and `pi3`** (see [Inventory & variables](inventory.md)). + +## Steps + +| # | Task / role | Purpose | Key detail | +| --- | --- | --- | --- | +| 1 | role `arcodange.factory.gitea_token` | Mint a `gitea_api_token` for later API use. | Reused across the collection (see [Roles reference](roles.md)). 
| +| 2 | `gitea actions generate-runner-token` (delegated to the Gitea host) | Fetch a **runner registration token** by `docker exec`-ing into the `gitea` container. | `delegate_to: groups.gitea[0]` | +| 3 | role `arcodange.factory.deploy_docker_compose` | Render the `act_runner` Compose stack with the registration token, instance URL, runner name, and labels. | image `gitea/act_runner:latest`; labels point at `runner-images:ubuntu-latest-ca` | +| 4 | `community.docker.docker_compose_v2` (down→up loop) | Apply the stack: a `loop: [absent, present]` recreates the runner so token/label changes take effect. | cache dirs under `/mnt/arcodange/gitea-runner-*` | + +The runner registers with `GITEA_INSTANCE_URL: http://:3000`, names itself `arcodange_global_runner_`, and advertises the **`ubuntu-latest` / `ubuntu-latest-ca`** labels — both mapped to the CA-trusting image built back in [01 · System](01-system.md). It mounts the Docker socket and the host CA store (`/etc/ssl/certs`, `/usr/local/share/ca-certificates`) so jobs trust internal TLS, and runs with `insecure: true` against the Gitea TLS endpoint. + +## Gotchas + +> [!WARNING] +> **ArgoCD is present in design but not deployed.** The factory pipeline intends `03_cicd` to also bring up ArgoCD (the app-of-apps), but **that step is commented out / not currently deployed in-cluster** — this stage only deploys the Gitea runners. Treat ArgoCD as "designed, not live" until the install is enabled. See the [ArgoCD caveat in lab-ecosystem · 01 factory](../../lab-ecosystem/01-factory.md). + +> [!WARNING] +> **The registration token is single-use and host-delegated.** Step 2 generates a fresh token every run via the Gitea container, so the runner re-registers on each apply. If the Gitea host (`pi2`) is down, token generation fails and no runner can register. 
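
The host pattern from the top of this page, `raspberries:&local:!gitea`, is set algebra over inventory groups. Below is a minimal Python sketch of that resolution. The group membership is an assumption made for illustration (the authoritative source is the collection's `hosts.yml`, which may list more hosts), and `resolve` is a hypothetical helper that handles only the three pattern forms used here.

```python
# Group membership assumed for illustration only.
INVENTORY = {
    "raspberries": {"pi1", "pi2", "pi3"},
    "local": {"pi1", "pi2", "pi3"},
    "gitea": {"pi2"},
}


def resolve(pattern, inventory):
    """Resolve a colon-separated Ansible-style host pattern.

    Supports plain group names (union), "&group" (intersection) and
    "!group" (exclusion), applied in that order.
    """
    terms = [t for t in pattern.split(":") if t]
    include = [t for t in terms if not t.startswith(("&", "!"))]
    intersect = [t[1:] for t in terms if t.startswith("&")]
    exclude = [t[1:] for t in terms if t.startswith("!")]

    hosts = set()
    for group in include:
        hosts |= inventory[group]
    for group in intersect:
        hosts &= inventory[group]
    for group in exclude:
        hosts -= inventory[group]
    return sorted(hosts)


if __name__ == "__main__":
    print(resolve("raspberries:&local:!gitea", INVENTORY))
```

Under these assumptions the pattern resolves to `pi1` and `pi3`, matching the runner placement described above. Moving Gitea to another host would shift the runners accordingly.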
diff --git a/vibe/guidebooks/factory-provisioning/ansible/04-tools.md b/vibe/guidebooks/factory-provisioning/ansible/04-tools.md new file mode 100644 index 0000000..aed81b3 --- /dev/null +++ b/vibe/guidebooks/factory-provisioning/ansible/04-tools.md @@ -0,0 +1,125 @@ +[vibe](../../../README.md) > [Guidebooks](../../README.md) > [Factory provisioning](../README.md) > [Ansible](README.md) > **04 · Tools** + +# 04 · Tools — Vault + CrowdSec + +> [!NOTE] +> **Status:** ✅ active · **Last Updated:** 2026-06-23 +> **Upstream:** [Ansible sub-hub](README.md) · [Factory provisioning hub](../README.md) +> **Downstream:** [Roles reference](roles.md) — deep mechanics of the `hashicorp_vault` and `crowdsec` roles +> **Related:** [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) · [05 · Backup](05-backup.md) · [03 · CI/CD](03-cicd.md) · [ADR-0001 safe prod-like environment](../../../ADR/0001-safe-prod-like-environment.md) + +Stage 4 installs the **operational tooling layer** on top of a running cluster: HashiCorp **Vault** (the lab's single secret store) and **CrowdSec** (the WAF/IPS that fronts Traefik). The entry point [`playbooks/04_tools.yml`](../../../../ansible/arcodange/factory/playbooks/04_tools.yml) is a one-line wrapper that imports [`playbooks/tools/tools.yml`](../../../../ansible/arcodange/factory/playbooks/tools/tools.yml), which in turn chains two sub-playbooks — `hashicorp_vault.yml` then `crowdsec.yml`. Both run against `localhost` (they drive the cluster through `kubectl` / `kubernetes.core`, not over SSH to the Pis). + +> [!IMPORTANT] +> Vault is the chokepoint of the whole secret model. This page covers **what the playbook orchestrates**; the byte-level role internals (init, unseal, root-token minting, the OpenTofu OIDC backend) live in the [Roles reference](roles.md). Read [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) first for the conceptual model — the two auth backends, the unseal posture, and why there is no secret material in git. 
+ +--- + +## What stage 4 deploys + +| Sub-playbook | File | Builds | Role invoked | +| --- | --- | --- | --- | +| Vault | [`tools/hashicorp_vault.yml`](../../../../ansible/arcodange/factory/playbooks/tools/hashicorp_vault.yml) | Initialises + unseals Vault, wires the Gitea OIDC/JWT auth backends via OpenTofu, publishes the `vault_oauth__sh_b64` Gitea Action secret | `hashicorp_vault` | +| CrowdSec | [`tools/crowdsec.yml`](../../../../ansible/arcodange/factory/playbooks/tools/crowdsec.yml) | A `VaultAuth` + `VaultStaticSecret` for the Turnstile captcha keys, a fresh bouncer API key, and the Traefik `crowdsec` middleware | `crowdsec` | + +--- + +## Step 1 — `hashicorp_vault.yml` + +### The credential prompt + +The play opens with a single `vars_prompt` for the **Gitea admin password** (`gitea_admin_password`, marked `unsafe: true` because the password may contain characters such as `{` that Jinja templating would otherwise mangle). This is the only interactive input the stage needs — everything else is derived or minted on the fly. + +### Orchestration flow + +```mermaid +%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'14px'}}}%% +flowchart TD + classDef prompt fill:#5f4a1e,stroke:#d97706,color:#fffbeb; + classDef mint fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb; + classDef vault fill:#4c1d95,stroke:#7c3aed,color:#f5f3ff; + classDef revoke fill:#5f1e1e,stroke:#ef4444,color:#fef2f2; + + P["vars_prompt:
gitea_admin_password"]:::prompt + T["Mint temp GITEA_ADMIN_TOKEN
(role gitea_token, replace=true)"]:::mint + R["Run hashicorp_vault role:
init · unseal · OIDC backend · gitea secret"]:::vault + D["post_tasks:
delete GITEA_ADMIN_TOKEN"]:::revoke + + P --> T --> R --> D +``` + +1. **Mint a temporary token.** The `arcodange.factory.gitea_token` role generates a `GITEA_ADMIN_TOKEN` with scopes `write:admin,write:organization,write:repository,write:user` (and `gitea_token_replace: true`, so any stale token of the same name is rotated). It is stashed in the fact `vault_GITEA_ADMIN_TOKEN`. +2. **Run the `hashicorp_vault` role.** Invoked with three derived vars: the Postgres admin credentials (read straight out of the Postgres host's docker-compose `environment` via `hostvars[groups.postgres[0]]`), the `gitea_admin_token` (= the temp token), and the prompted `gitea_admin_password`. The role does the heavy lifting — see below. +3. **Revoke the temporary token.** A `post_tasks` block re-invokes `gitea_token` with `gitea_token_delete: true`, so the admin token never outlives the run. + +### What the `hashicorp_vault` role does + +The role's [`tasks/main.yml`](../../../../ansible/arcodange/factory/playbooks/tools/roles/hashicorp_vault/tasks/main.yml) runs a fixed sequence; the OIDC backend setup is wrapped in a `block`/`always` so the freshly minted **root token is always revoked**, even on failure: + +| Phase | Task file | What happens | +| --- | --- | --- | +| **Init** | `init.yml` | First-time only. Checks `vault operator init -status`; if uninitialised, runs `vault operator init` with **1 key share / threshold 1** and writes the keys to `~/.arcodange/cluster-keys.json` (mode `600`). Idempotent on re-run. | +| **Unseal** | `unseal.yml` | Reads `cluster-keys.json` and runs `vault operator unseal` on every server pod. Required on **every reboot** — Vault always restarts sealed. | +| **Root token** | `new_root_token.yml` | Mints a one-shot root token via the `generate-root` OTP/nonce dance (using the unseal key), needed to authenticate the OpenTofu apply. 
| +| **OIDC backend** | `gitea_oidc_auth.yml` | Drives a Playwright script to register/read the Gitea OAuth app, then runs **OpenTofu in a throwaway Docker volume** to provision the `gitea` (OIDC) + `gitea_jwt` (JWT) auth backends, the admin identity, and the `kvv1` static secrets. Finally writes the `vault_oauth__sh_b64` script to Gitea Actions secrets. | +| **Revoke** | `revoke_token.yml` (in `always`) | Revokes the root token unconditionally. | + +> [!IMPORTANT] +> The OpenTofu apply runs the [`hashicorp_vault.tf`](../../../../ansible/arcodange/factory/playbooks/tools/roles/hashicorp_vault/files/hashicorp_vault.tf) inside an ephemeral Docker volume (`docker volume create` → `tofu init` + `tofu apply` → `docker volume rm`), with the state in a GCS backend (`gs://arcodange-tf`, prefix `tools/hashicorp_vault/gitea_oidc`). The CA is mounted read-only via `VAULT_CACERT`. The destroy step is commented out by design — this provisions, it does not tear down. + +### The `vault_oauth__sh_b64` Gitea secret + +The last act of the role renders [`oidc_jwt_token.sh.j2`](../../../../ansible/arcodange/factory/playbooks/tools/roles/hashicorp_vault/templates/oidc_jwt_token.sh.j2) (an OIDC authorization-code → access-token helper for CI), base64-encodes it, and publishes it as the **org-level** Gitea Action secret `vault_oauth__sh_b64`. Because Gitea Action secrets are scoped per owner, the role then **re-publishes the identical secret to each user-owned namespace** listed in `gitea_secret_propagation_users` — repos under a personal account cannot read org-level secrets. This is what lets a Gitea Actions workflow obtain the OIDC JWT that authenticates to Vault under the `gitea_cicd_` role (the CI half of the [secret model](../../lab-ecosystem/secrets-and-vault.md)). + +> [!CAUTION] +> The role has an **off-by-default** `vault_oidc_force_reset` flag. 
When set, it runs `vault auth disable gitea` **and** `gitea_jwt` before re-applying — which **wipes every `gitea_cicd_` per-app JWT role** created by the tools-repo IaC. Leave it `false` unless you are deliberately rebuilding the OIDC backend from scratch (e.g. `bound_issuer` config drift). + +--- + +## Step 2 — `crowdsec.yml` + +The CrowdSec sub-playbook is a thin wrapper that runs the `crowdsec` role to bolt a CrowdSec-bouncer middleware onto Traefik. The role's [`tasks/main.yml`](../../../../ansible/arcodange/factory/playbooks/tools/roles/crowdsec/tasks/main.yml) wires three things together. + +| Step | What it creates | Detail | +| --- | --- | --- | +| **Turnstile secret** | `ServiceAccount` + `VaultAuth` + `VaultStaticSecret` in `kube-system` | Authenticates via the Kubernetes auth backend (role `factory_crowdsec_conf`) and pulls the Cloudflare Turnstile keys from `kvv2` path `cms/factory/turnstile` into a K8s Secret (`refreshAfter: 30s`). | +| **Bouncer key** | A CrowdSec LAPI bouncer named `traefik-plugin` | Runs `cscli bouncers add traefik-plugin` inside the LAPI pod; on collision it deletes and re-adds, so the run is repeatable. | +| **Traefik middleware** | A `traefik.io/v1alpha1` `Middleware` named `crowdsec` | Stream mode, captcha provider `turnstile` (site/secret keys from the Turnstile secret), Redis cache, trusted-IP allow-lists. | + +After applying the middleware the role **cleans up `Failed` CrowdSec pods** and **bounces Traefik** (scale to 0 → back to 1, inside a `block`/`rescue`/`always` that guarantees Traefik returns to 1 replica no matter what) so the new middleware config is loaded. + +> [!NOTE] +> The Turnstile keys come from the **CMS-managed** Vault path `cms/factory/turnstile` — they are provisioned outside this stage. CrowdSec only *reads* them here. See [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) for how `VaultStaticSecret` materialises a Vault path into a Kubernetes Secret. 
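
The Traefik bounce described above (scale to 0, then always back to 1) can be sketched in shell. This is only an illustration of the `block`/`always` guarantee, not the role's actual tasks: the function name `bounce_deployment` is hypothetical, and the namespace and deployment name passed to it are assumptions about where Traefik runs in this lab.

```sh
# Hypothetical sketch of the scale-to-0-then-1 bounce. The crowdsec role
# itself uses an Ansible block/rescue/always, not this function.
# Usage: bounce_deployment <namespace> <deployment>
bounce_deployment() {
    ns=$1
    deploy=$2
    rc=0

    # "block": scale down so the pods restart with the new middleware config.
    kubectl -n "$ns" scale deployment "$deploy" --replicas=0 || rc=$?

    # "always": restore one replica whether or not the scale-down worked.
    kubectl -n "$ns" scale deployment "$deploy" --replicas=1 || rc=$?

    # Report failure if either step failed, but only after the restore ran.
    return "$rc"
}
```

A call such as `bounce_deployment kube-system traefik` would apply if Traefik runs as the k3s-bundled deployment; that placement is an assumption, so check the actual namespace before reusing the sketch.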
+ +--- + +## Gotchas + +> [!WARNING] +> - **Vault must be unsealed before anything secret-dependent recovers.** Stage 4's unseal step reads `~/.arcodange/cluster-keys.json`; if that file is missing, init/unseal cannot proceed and the OpenTofu apply (which needs a live Vault) fails. The same file gates step 2 of the [power-cut recovery order](../../lab-ecosystem/storage-and-recovery.md). +> - **Docker is required on the control node.** The OIDC backend provisioning shells out to `docker run … opentofu` and `docker volume`. The Playwright step also runs containerised. A control node without Docker will fail this stage. +> - **`gitea_admin_password` is `unsafe`.** Do not strip the `unsafe: true` flag from the prompt — passwords with `{`/`}` are mangled by Jinja templating otherwise. +> - **Re-running is safe by default.** Init and unseal are idempotent; the temp admin token and root token are both revoked on the way out. Only `vault_oidc_force_reset` makes a re-run destructive. +> - **CrowdSec bounces Traefik.** The middleware step briefly scales Traefik to 0 — expect a short ingress blip during stage 4. The `always` block restores it to 1 even if the scale-down errors. + +--- + +## Where stage 4 sits + +```mermaid +%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'14px'}}}%% +flowchart LR + classDef done fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb; + classDef here fill:#4c1d95,stroke:#7c3aed,color:#f5f3ff; + classDef next fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb; + + s03["03 · CI/CD"]:::done + s04["04 · Tools
Vault · CrowdSec"]:::here + s05["05 · Backup"]:::next + + s03 --> s04 --> s05 +``` + +1. **03 · CI/CD** registered the `act_runner` executors — a prerequisite, since the `vault_oauth__sh_b64` secret published here is consumed by those CI runners. +2. **04 · Tools** (this page) stands up Vault and CrowdSec. +3. **05 · Backup** is next — it schedules the cron dumps that protect the state the cluster now holds. diff --git a/vibe/guidebooks/factory-provisioning/ansible/05-backup.md b/vibe/guidebooks/factory-provisioning/ansible/05-backup.md new file mode 100644 index 0000000..6bf7df0 --- /dev/null +++ b/vibe/guidebooks/factory-provisioning/ansible/05-backup.md @@ -0,0 +1,107 @@ +[vibe](../../../README.md) > [Guidebooks](../../README.md) > [Factory provisioning](../README.md) > [Ansible](README.md) > **05 · Backup** + +# 05 · Backup — daily cron dumps + +> [!NOTE] +> **Status:** ✅ active · **Last Updated:** 2026-06-23 +> **Upstream:** [Ansible sub-hub](README.md) · [Factory provisioning hub](../README.md) +> **Downstream:** [06 · Recover](06-recover.md) — how these dumps are replayed +> **Related:** [Storage & recovery](../../lab-ecosystem/storage-and-recovery.md) · [04 · Tools](04-tools.md) · [ADR-0001 safe prod-like environment](../../../ADR/0001-safe-prod-like-environment.md) + +Stage 5 installs three independent **cron-driven backup jobs** that protect the platform's persistent state: the PostgreSQL database, the Gitea instance, and the K3s volume metadata (PV/PVC + Longhorn CRDs). The entry point [`playbooks/05_backup.yml`](../../../../ansible/arcodange/factory/playbooks/05_backup.yml) imports [`playbooks/backup/backup.yml`](../../../../ansible/arcodange/factory/playbooks/backup/backup.yml), which chains the three sub-playbooks, each passing `backup_root_dir: /mnt/backups`. 
+ +Every job follows the **same anatomy**: run a daily cron at **04:00**, write a date-stamped archive to `/mnt/backups//`, prune anything older than **3 days**, and drop a matching `restore.sh` next to the backup script. `/mnt/backups` is a Longhorn RWX volume, so Longhorn itself snapshots, replicates, and ships these archives off-site — the cron jobs only produce the dumps. + +> [!NOTE] +> All three sub-playbooks **install** scripts and cron entries; they do not run a backup themselves (beyond a one-shot `test backup_cmd` smoke check that pipes to `/dev/null`). The actual backups fire from cron. To read failures, SSH to the host and use `sudo su` → `mails` (see [`backup/README.md`](../../../../ansible/arcodange/factory/playbooks/backup/README.md)). + +--- + +## The three jobs + +| Job | Sub-playbook | Host | Backup command | Artifact | Scripts dir | +| --- | --- | --- | --- | --- | --- | +| **Postgres** | [`backup/postgres.yml`](../../../../ansible/arcodange/factory/playbooks/backup/postgres.yml) | `postgres` | `docker exec pg_dumpall -U ` ∣ `gzip` | `backup_YYYYMMDD.sql.gz` | `…/docker_composes/postgres/scripts` | +| **Gitea** | [`backup/gitea.yml`](../../../../ansible/arcodange/factory/playbooks/backup/gitea.yml) | `gitea` | `docker exec -u git gitea dump --skip-log --skip-db --skip-package-data --type tar.gz` | `backup_YYYYMMDD.gitea.gz` | `…/docker_composes/gitea/scripts` | +| **K3s PVC** | [`backup/k3s_pvc.yml`](../../../../ansible/arcodange/factory/playbooks/backup/k3s_pvc.yml) | `pi1` | `kubectl get pv,pvc` + `volumes.longhorn.io` + `settings.longhorn.io` (YAML) | `backup_YYYYMMDD.volumes` | `/opt/k3s_volumes` | + +All three share: `keep_days: 3`, cron `minute: 0 hour: 4 user: root`, and `backup_dir: /mnt/backups/`. 
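
The shared anatomy (date-stamped archive, then prune by age) can be sketched as one small shell function. This is a minimal illustration under stated assumptions, not the `backup.sh` that the playbooks generate: the function name `run_backup` and its argument order are hypothetical, and the dump command is passed in by the caller rather than built from docker-compose facts.

```sh
# Hypothetical sketch of the shared backup anatomy; the real backup.sh
# is generated by the playbooks and may differ.
# Usage: run_backup <backup_dir> <suffix> <keep_days> <dump command...>
run_backup() {
    backup_dir=$1
    suffix=$2
    keep_days=$3
    shift 3

    mkdir -p "$backup_dir" || return 1
    archive="$backup_dir/backup_$(date +%Y%m%d).$suffix"

    # Run the dump command and capture its stdout as today's archive.
    "$@" > "$archive" || return 1

    # Prune archives whose mtime falls outside the retention window.
    find "$backup_dir" -type f -name 'backup_*' -mtime +"$keep_days" -delete

    # Print the archive path so callers can log or verify it.
    printf '%s\n' "$archive"
}
```

For a harmless dry run, `run_backup /tmp/demo txt 3 echo hello` writes `/tmp/demo/backup_YYYYMMDD.txt` and removes any `backup_*` file in that directory that `find` considers older than 3 days.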
+ +```mermaid +%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'14px'}}}%% +flowchart TD + classDef cron fill:#5f4a1e,stroke:#d97706,color:#fffbeb; + classDef job fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb; + classDef store fill:#4c1d95,stroke:#7c3aed,color:#f5f3ff; + classDef ship fill:#14532d,stroke:#22c55e,color:#f0fdf4; + + C["cron · daily 04:00 · user root"]:::cron + PG["postgres.yml
pg_dumpall ∣ gzip"]:::job + GT["gitea.yml
gitea dump tar.gz"]:::job + PV["k3s_pvc.yml
PV · PVC · Longhorn CRDs"]:::job + D["/mnt/backups/{postgres,gitea,k3s_pvc}/
keep 3 days"]:::store + L["Longhorn:
snapshot · replicate · off-site"]:::ship + + C --> PG --> D + C --> GT --> D + C --> PV --> D + D --> L +``` + +1. Each job has its own daily **04:00 root cron** entry, which triggers that job's `backup.sh`. +2. **postgres.yml** runs `pg_dumpall` through `gzip`, **gitea.yml** streams a `gitea dump` tarball, **k3s_pvc.yml** serialises the volume metadata. +3. Each writes a date-stamped archive into `/mnt/backups//` and prunes files older than 3 days (`find … -mtime +3 -delete`). +4. Because `/mnt/backups` is a Longhorn RWX volume, Longhorn snapshots, replicates across nodes, and ships an off-site copy — no separate upload step in the cron. + +--- + +## Job details + +### Postgres — `postgres.yml` + +The backup command is built from the Postgres host's docker-compose facts (`container_name`, `POSTGRES_USER`). `pg_dumpall` captures **all databases plus globals (roles)** in one logical dump, gzipped. The generated `restore.sh` takes an optional `YYYYMMDD` argument (defaults to the latest dump), `docker cp`s it into the container, gunzips, and replays with `psql -f`. If the restore misbehaves, the script reminds you to wipe the data dir before replaying. + +### Gitea — `gitea.yml` + +The dump runs as the `git` user with `--skip-db` (Postgres is backed up separately by the Postgres job) and `--skip-package-data`, streamed to stdout (`-f -`) so it never lands on the container's own disk. The `restore.sh` unpacks the tarball back into `/data/gitea` (config/data) and `/data/git/repositories` (repos), fixes `git:git` ownership, and **regenerates hooks** (`gitea admin regenerate hooks`) — without that step the restored repos have stale hook paths. + +### K3s PVC — `k3s_pvc.yml` + +This job does **not** back up volume *data* (Longhorn handles the bytes). It backs up the **Kubernetes objects** needed to re-bind those volumes: all `pv` + `pvc`, the **`volumes.longhorn.io` CRDs**, and `settings.longhorn.io`, concatenated into one `.volumes` YAML (`---`-separated). 
It writes the dump to both `/mnt/backups/k3s_pvc/` *and* a copy alongside the script. The `restore.sh` prefers a fallback dir (`/home/pi/arcodange/backups/k3s_pvc`) then the primary, picks the latest (or a dated) dump, and `kubectl apply`s it. + +> [!IMPORTANT] +> **Backing up the Longhorn `volumes.longhorn.io` CRDs is what enables *fast* recovery.** With the Volume CRDs in the backup, recovery is a single `kubectl apply` that re-associates the surviving on-disk replicas with their PVs (see [06 · Recover → `longhorn.yml`](06-recover.md)). **Without** the Volume CRDs, a Longhorn reinstall assigns **new engine IDs**, cannot adopt the orphaned replica directories, and you fall through to the slow **block-device data recovery** (`longhorn_data.yml`). The k3s_pvc backup_cmd carries an inline comment to this effect and points at the [Longhorn PVC recovery ADR](../../../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md). This is the prevention half of the [storage failure mode](../../lab-ecosystem/storage-and-recovery.md). + +--- + +## Gotchas + +> [!WARNING] +> - **3-day retention is tight.** A failure that goes unnoticed for 3 days loses all recoverable history. The off-site Longhorn copy is the longer-horizon safety net — the local `/mnt/backups` files are short-lived. +> - **The smoke test runs the real dump.** Each play has a `test backup_cmd` task that executes the backup command (output discarded) at provisioning time. If Postgres/Gitea/kubectl is unreachable when you run stage 5, provisioning fails fast — by design. +> - **Cron runs as `root`, scripts live in app dirs.** The `backup.sh`/`restore.sh` are written into the app's docker-compose `scripts/` dir (or `/opt/k3s_volumes`); the cron job invokes them as root. Don't relocate the compose dirs without re-running stage 5. 
+> - **Gitea restore needs the hook regeneration.** Skipping `gitea admin regenerate hooks` leaves repos with broken push hooks — the `restore.sh` already does it, so use the script rather than a manual untar. +> - **Postgres and Gitea DB are backed up by *different* jobs.** Gitea dumps with `--skip-db`; its database rows come from the Postgres `pg_dumpall`. Restoring Gitea fully means restoring **both** archives. + +--- + +## Where stage 5 sits + +```mermaid +%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'14px'}}}%% +flowchart LR + classDef done fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb; + classDef here fill:#4c1d95,stroke:#7c3aed,color:#f5f3ff; + classDef rec fill:#5f1e1e,stroke:#ef4444,color:#fef2f2; + + s04["04 · Tools"]:::done + s05["05 · Backup
Postgres · Gitea · K3s PVC"]:::here + rec["recover/*
(on disaster)"]:::rec + + s04 --> s05 + s05 -. "feeds restore" .-> rec +``` + +1. **04 · Tools** stood up Vault and CrowdSec — the secret store stage 5's dumps help protect. +2. **05 · Backup** (this page) is the last linear stage: it schedules the daily dumps. +3. The artifacts here are the **input** to the on-demand [06 · Recover](06-recover.md) branch — the `.volumes` dump in particular gates whether recovery is fast (CRDs present) or slow (block-device). diff --git a/vibe/guidebooks/factory-provisioning/ansible/06-recover.md b/vibe/guidebooks/factory-provisioning/ansible/06-recover.md new file mode 100644 index 0000000..a57b45d --- /dev/null +++ b/vibe/guidebooks/factory-provisioning/ansible/06-recover.md @@ -0,0 +1,149 @@ +[vibe](../../../README.md) > [Guidebooks](../../README.md) > [Factory provisioning](../README.md) > [Ansible](README.md) > **06 · Recover** + +# 06 · Recover — Longhorn disaster recovery + +> [!NOTE] +> **Status:** 🟡 beta · **Last Updated:** 2026-06-23 +> **Upstream:** [Ansible sub-hub](README.md) · [Factory provisioning hub](../README.md) +> **Downstream:** [05 · Backup](05-backup.md) — the dumps these playbooks consume +> **Related:** [Storage & recovery](../../lab-ecosystem/storage-and-recovery.md) · [PRD — QA strategy](../../../PRD/safe-prod-like-environment/qa-strategy.md) · [Longhorn PVC recovery ADR](../../../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md) + +The `recover/` playbooks are **not** part of the linear `01..05` pipeline — they are an **on-demand disaster-recovery branch**, invoked only after a power cut or data loss. There are two, and which one you run depends on a single question: **do the Longhorn Volume CRDs still exist?** + +> [!IMPORTANT] +> **Decision — pick the right playbook before you start:** +> - **Volume CRDs still present** (e.g. 
they were captured by the [05 · Backup k3s_pvc dump](05-backup.md), or never wiped) → run [`recover/longhorn.yml`](../../../../ansible/arcodange/factory/playbooks/recover/longhorn.yml). Fast: it re-applies the CRDs and the surviving on-disk replicas are re-adopted. +> - **Volume CRDs are GONE** (a nuclear Longhorn reinstall assigned new engine IDs) but the raw replica `.img` files survive on disk → run [`recover/longhorn_data.yml`](../../../../ansible/arcodange/factory/playbooks/recover/longhorn_data.yml). Slow: it merges replica layers at the block-device level and injects the data into a fresh volume. + +```mermaid +%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'14px'}}}%% +flowchart TD + classDef q fill:#5f4a1e,stroke:#d97706,color:#fffbeb; + classDef fast fill:#14532d,stroke:#22c55e,color:#f0fdf4; + classDef slow fill:#5f1e1e,stroke:#ef4444,color:#fef2f2; + classDef dead fill:#6b7280,stroke:#4b5563,color:#fff; + + Q{"Do the Longhorn
Volume CRDs
still exist?"}:::q + F["longhorn.yml
CSI/CRD recovery (fast)"]:::fast + S{"Raw replica
.img files
survive?"}:::q + D["longhorn_data.yml
block-device recovery (slow)"]:::slow + X["Data unrecoverable
(replicas zeroed)"]:::dead + + Q -- "yes" --> F + Q -- "no" --> S + S -- "yes" --> D + S -- "no" --> X +``` + +1. **CRDs present?** Yes → `longhorn.yml` re-applies the Volume CRDs and the on-disk replicas re-attach. Done fast. +2. **CRDs gone?** Then ask whether the raw replica `.img` files survived on disk. +3. **Replicas survive?** Yes → `longhorn_data.yml` reconstructs the filesystem at the block level and injects it into a new volume. +4. **Replicas zeroed** by Longhorn reconciliation → the data is unrecoverable; there is no playbook for this. + +> [!NOTE] +> This branch sits at step 1 of the broader tested startup order — **Longhorn first, then Vault unseal, then VSO re-auth, ERP scaled up last**. The full order, the engine-ID failure mode, and the once-real-once-rehearsed history are in [Storage & recovery](../../lab-ecosystem/storage-and-recovery.md). The single tested-recovery record (1-key/threshold-1 unseal, the four-step order) lives in CLUSTER_RECOVERY.md, kept at the lab root outside this repo. + +--- + +## `longhorn.yml` — CSI/CRD recovery (CRDs present) + +Runs against `raspberries:&local` as root. It diagnoses how broken Longhorn is and applies the **least invasive** fix that works, escalating only if needed. Most logic runs `run_once` on `pi1`, delegating cluster reads to `localhost`. + +| Phase | What it does | +| --- | --- | +| **0 · Pre-flight** | Verifies the data dir `/mnt/arcodange/longhorn` exists on `pi1` (fails hard if missing) and that at least one `backup_*.volumes` dump exists in the primary or fallback backup dir. | +| **1 · Diagnosis** | Checks the `longhorn-system` namespace, the `driver.longhorn.io` **CSIDriver** registration, and the `longhorn-manager` pods, then sets `recovery_phase` = `soft` (CSI driver gone), `hard` (managers unhealthy), or `none`. | +| **2 · Soft** | Touches `longhorn-install.yaml` to make k3s reconcile the HelmChart, waits, and checks pods recreate. 
| +| **3 · Hard** | Force-deletes the `longhorn-driver-deployer` pods so the HelmChart recreates them. | +| **4 · Nuclear** | Full reinstall: delete the HelmChart, strip finalizers off all Longhorn CRs / PVCs / the namespace, delete + redeploy the `longhorn-install` HelmChart manifest (`v1.9.1`, `defaultDataPath` preserved), wait for pods. | +| **5 · Restore** | Waits for managers to be ready, then `kubectl apply`s the latest `backup_*.volumes` dump (PV/PVC + Longhorn CRDs) and any `longhorn_metadata_*.yaml`. | +| **6 · Verify** | Polls until the CSIDriver is registered, ≥3 managers are Running, the CSI socket exists, and the replica data dir is present; prints a summary. | + +> [!IMPORTANT] +> Phase 5 is exactly where the [05 · Backup k3s_pvc dump](05-backup.md) pays off: re-applying the captured **Volume CRDs** lets Longhorn re-adopt the surviving replica directories instead of forcing the block-device path. The playbook is **idempotent** — it re-diagnoses and escalates only as far as needed, so re-running after a partial recovery is safe. + +--- + +## `longhorn_data.yml` — block-device data recovery (CRDs gone) + +This is the fallback when a nuclear reinstall has destroyed the Volume CRDs and assigned new engine IDs, leaving the real data in **orphaned** replica directories. It bypasses Kubernetes objects entirely and reconstructs the filesystem at the block level. It is **driven by a vars file** — `vars/recovery_volumes.yml`, one entry per volume — and the format is documented in [`longhorn_data_vars.example.yml`](../../../../ansible/arcodange/factory/playbooks/recover/longhorn_data_vars.example.yml). 
+ +```sh +ansible-playbook -i inventory/hosts.yml \ + playbooks/recover/longhorn_data.yml \ + -e @vars/recovery_volumes.yml +``` + +```mermaid +%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'14px'}}}%% +flowchart TD + classDef pre fill:#5f4a1e,stroke:#d97706,color:#fffbeb; + classDef merge fill:#4c1d95,stroke:#7c3aed,color:#f5f3ff; + classDef k8s fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb; + classDef done fill:#14532d,stroke:#22c55e,color:#f0fdf4; + + P0["Pre-flight + Phase 0:
auto-discover largest replica dir (>16K)"]:::pre + P1["Phase 1: back up untouched replica dir
(safe copy before any op)"]:::merge + P2["Phase 2: merge-longhorn-layers.py
→ single .img · test-mount RO"]:::merge + P3["Phase 3: create Volume CRD
(scale down workload, clear stuck PVCs)"]:::k8s + P5["Phase 5: attach via maintenance ticket
→ /dev/longhorn/<pv>"]:::k8s + P6["Phase 6: mkfs + rsync merged image
into live block device"]:::merge + P8["Phase 8: recreate PV (Retain) + PVC
pinned by volumeName"]:::k8s + P9["Phase 9: scale workload up · verify"]:::done + + P0 --> P1 --> P2 --> P3 --> P5 --> P6 --> P8 --> P9 +``` + +1. **Pre-flight + Phase 0.** Fail fast if no volumes are defined, the merge tool is missing, or Longhorn managers aren't Running. Then **auto-discover** the best replica source for each volume: the **largest dir >16K** across `pi1/pi2/pi3`, skipping any replica still `Rebuilding`. `source_node`/`source_dir` in the vars file override this. +2. **Phase 1.** `cp -a` the untouched replica dir to a backup location *before* touching anything, and verify it contains `volume.meta`. +3. **Phase 2.** Run `merge-longhorn-layers.py` to collapse the snapshot + head `.img` layers into one image, then test-mount it read-only to confirm the filesystem is sound. +4. **Phase 3.** Scale the workload to 0 and clear any stuck `Terminating` PV/PVCs *before* creating a fresh Longhorn `Volume` CRD (order matters — StatefulSet controllers re-provision empty PVCs otherwise). +5. **Phase 5.** Attach the volume via a Longhorn `VolumeAttachment` **maintenance ticket** so `/dev/longhorn/` appears on the source node, with the frontend enabled. +6. **Phase 6.** `mkfs.ext4` the live block device if unformatted, then `rsync` the merged recovery image into it (`--ignore-errors`; rsync rc=23 partial-transfer is treated as success for power-cut partitions). +7. **Phase 8.** Detach the recovery ticket, recreate the PV (`Retain`, no `claimRef`) and a PVC pinned by `volumeName`, and wait for Bound. +8. **Phase 9.** Scale the workload back up, wait for ready replicas, and run the optional per-volume `verify_cmd` inside the pod. + +> [!CAUTION] +> The `merge-longhorn-layers.py` tool is invoked **per replica dir via `dmsetup`** to stack the copy-on-write layers correctly. 
Never recover by simply renaming the orphaned replica directory to the new engine ID — Longhorn reconciliation can pick the *empty* new replica as the rebuild source and **overwrite your data**. The block-device injection is the only proven-safe path. The full method comparison is in the [Longhorn PVC recovery ADR](../../../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md). + +> [!NOTE] +> **Tested 2026-04-13 power-cut.** This block-device path was proven end to end recovering the **url-shortener's SQLite database** after that power cut forced a nuclear Longhorn reinstall (verified `2026-04-14` with `sqlite3 … 'SELECT COUNT(*) FROM urls;'`). That scenario is the worked example in [`longhorn_data_vars.example.yml`](../../../../ansible/arcodange/factory/playbooks/recover/longhorn_data_vars.example.yml). + +--- + +## Gotchas + +> [!WARNING] +> - **Run `longhorn.yml` first if there is any chance the CRDs survived.** It is fast and idempotent; falling straight to `longhorn_data.yml` is unnecessary block-level work when a `kubectl apply` would have sufficed. +> - **`longhorn_data.yml` needs a healthy Longhorn control plane.** Its pre-flight aborts unless ≥1 `longhorn-manager` is Running — it recovers *data into* a working Longhorn, it does not bring Longhorn back. Use `longhorn.yml` for that. +> - **Process volumes one at a time first.** The example vars file recommends validating a single volume before batching — a misidentified `source_dir` can pin the PVC to the wrong (empty) replica. +> - **`python3` on every node.** Phase 0's replica scan and the merge tool both require `python3` on `pi1/pi2/pi3`. +> - **The merge tool path is repo-relative.** `longhorn_data.yml` resolves `merge-longhorn-layers.py` from `docs/incidents/2026-04-13-power-cut/tools/` and `scp`s it to the source node — run the playbook from inside the collection so that path resolves. 
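
The playbook choice set out at the top of this page reduces to two yes/no questions, which can be sketched as a small shell helper. This is an illustrative sketch only: the function name `pick_recovery_playbook` and its `yes`/`no` arguments are hypothetical, and answering the two questions (do the Volume CRDs exist, do the raw replica `.img` files survive) is left to the operator.

```sh
# Hypothetical helper mirroring the decision tree; not part of the collection.
# Usage: pick_recovery_playbook <crds_present: yes|no> <replica_imgs_present: yes|no>
pick_recovery_playbook() {
    if [ "$1" = yes ]; then
        # Volume CRDs exist: fast CSI/CRD recovery.
        echo "playbooks/recover/longhorn.yml"
    elif [ "$2" = yes ]; then
        # CRDs gone but raw replica images survive: slow block-device recovery.
        echo "playbooks/recover/longhorn_data.yml"
    else
        # Neither survives: there is no playbook for this case.
        echo "no playbook: data unrecoverable" >&2
        return 1
    fi
}
```

One way to answer the first question, assuming Longhorn runs in the `longhorn-system` namespace checked by the diagnosis phase, is to list the CRs with `kubectl -n longhorn-system get volumes.longhorn.io` and treat an empty result as "no".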
+ +--- + +## Why this is rehearsed + +A recovery procedure run once under outage stress is a liability. These two playbooks — and the CRDs-present-vs-gone decision — are **rehearsed deliberately in the production-like sandbox**: kill the cluster, lose the engine IDs on a test volume, and walk both recovery paths back to green without risking production data. That turns the drill into routine QA rather than one-shot incident memory. See the PRD's [QA strategy](../../../PRD/safe-prod-like-environment/qa-strategy.md) for how recovery drills become a regular exercise, and [Storage & recovery](../../lab-ecosystem/storage-and-recovery.md) for the full startup order these drills validate. + +--- + +## Where this branch sits + +```mermaid +%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'14px'}}}%% +flowchart LR + classDef done fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb; + classDef here fill:#5f1e1e,stroke:#ef4444,color:#fef2f2; + + s05["05 · Backup
(produces .volumes dump)"]:::done + rec["recover/*
longhorn.yml · longhorn_data.yml"]:::here + s01["01 · System
(rejoin pipeline)"]:::done + + s05 -. "on disaster" .-> rec + rec -. "once recovered" .-> s01 +``` + +1. **05 · Backup** produced the `.volumes` dump that `longhorn.yml`'s restore phase replays. +2. **recover/** (this page) is invoked only on disaster — pick `longhorn.yml` (CRDs present) or `longhorn_data.yml` (CRDs gone). +3. Once volumes are healthy, the cluster **re-enters the normal pipeline** at [01 · System](01-system.md), and you re-run a fresh [05 · Backup](05-backup.md) once everything is green. diff --git a/vibe/guidebooks/factory-provisioning/ansible/README.md b/vibe/guidebooks/factory-provisioning/ansible/README.md new file mode 100644 index 0000000..f5dd516 --- /dev/null +++ b/vibe/guidebooks/factory-provisioning/ansible/README.md @@ -0,0 +1,120 @@ +[vibe](../../../README.md) > [Guidebooks](../../README.md) > [Factory provisioning](../README.md) > **Ansible** + +# Ansible — factory provisioning + +> [!NOTE] +> **Status:** ✅ active · **Last Updated:** 2026-06-23 +> **Upstream:** [Factory provisioning hub](../README.md) · [Lab ecosystem · 01 factory](../../lab-ecosystem/01-factory.md) +> **Downstream:** [01 · System](01-system.md) · [02 · Setup](02-setup.md) · [03 · CI/CD](03-cicd.md) · [04 · Tools](04-tools.md) · [05 · Backup](05-backup.md) · [06 · Recover](06-recover.md) · [Inventory & variables](inventory.md) · [Roles reference](roles.md) +> **Related:** [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) · [Storage & recovery](../../lab-ecosystem/storage-and-recovery.md) · [Naming conventions](../../lab-ecosystem/naming-conventions.md) · [ADR-0001 safe prod-like environment](../../../ADR/0001-safe-prod-like-environment.md) + +Ansible is the **imperative half** of the factory: it takes three bare Raspberry Pis (`pi1`, `pi2`, `pi3`) and turns them into a running K3s cluster with Docker, Longhorn storage, Gitea CI runners, CrowdSec, and Vault. 
OpenTofu (the declarative half) then provisions everything that lives *outside* the cluster — see the [OpenTofu sub-hub](../opentofu/README.md). + +--- + +## Collection layout + +Everything ships as a single Ansible **collection** committed under [`ansible/arcodange/factory/`](../../../../ansible/arcodange/factory). The collection root, not the repo root, is what `ansible-galaxy collection install` and the FQCN references (`arcodange.factory.`) resolve against. + +| File | Path | What it declares | +| --- | --- | --- | +| `galaxy.yml` | [`ansible/arcodange/factory/galaxy.yml`](../../../../ansible/arcodange/factory/galaxy.yml) | Collection identity: **namespace `arcodange`**, **name `factory`**, **version `1.0.0`**. Together they form the FQCN prefix `arcodange.factory.*` used by every role and playbook import. | +| `requirements.yml` | [`ansible/requirements.yml`](../../../../ansible/requirements.yml) | External dependencies pulled at install time (see table below). | +| `ansible.cfg` | [`ansible/arcodange/factory/ansible.cfg`](../../../../ansible/arcodange/factory/ansible.cfg) | `collections_path = ~/.ansible/collections` and `scp_if_ssh = True` for the SSH connection plugin. | +| `inventory/` | [`ansible/arcodange/factory/inventory/`](../../../../ansible/arcodange/factory/inventory) | `hosts.yml` + `group_vars/`. Detailed in [Inventory & variables](inventory.md). | +| `playbooks/` | [`ansible/arcodange/factory/playbooks/`](../../../../ansible/arcodange/factory/playbooks) | The numbered pipeline `01..05` plus the `recover/` branch. | +| `roles/` | [`ansible/arcodange/factory/roles/`](../../../../ansible/arcodange/factory/roles) | Seven reusable roles. Detailed in [Roles reference](roles.md). | + +### External dependencies (`requirements.yml`) + +| Dependency | Type | Why it is needed | +| --- | --- | --- | +| `geerlingguy.docker` | role | Installs and configures the Docker engine on each Pi. 
| +| `ansible.posix` | collection | POSIX primitives (mounts, sysctl, `synchronize`). | +| `community.crypto` | collection | Certificate/key generation for the step-ca PKI and Traefik. | +| `community.docker` | collection | Manages containers and Compose stacks (Gitea, act_runner). | +| `community.general` | collection | Broad utility modules used across the pipeline. | +| `kubernetes.core` | collection | `k8s` / `helm` modules used by every K3s-facing task. Needs the `kubernetes` Python lib at runtime. | +| `k3s-ansible` (`git+https://github.com/k3s-io/k3s-ansible.git`) | git role/collection | Upstream playbooks that install and cluster K3s itself. | + +> [!TIP] +> The runtime Python libraries (`kubernetes`, `jmespath`, `dnspython`) that `kubernetes.core` and friends import are declared in the **repo-root `pyproject.toml`**, not in `requirements.yml`. `uv sync` installs them; `ansible-galaxy` installs the Galaxy/git content. Both steps are required. + +--- + +## Invocation pattern + +The control node runs Ansible from a `uv`-managed venv. The `localhost` inventory entry sets `ansible_python_interpreter: "{{ ansible_playbook_python }}"`, so `uv run` is enough to put Ansible on the venv's Python — no hardcoded interpreter path. Full recipe lives in [`ansible/README.md`](../../../../ansible/README.md). + +1. **Sync the venv** — installs `ansible-core` plus the runtime Python deps: + ```sh + uv sync + ``` +2. **Install collection dependencies** — pulls the Galaxy + git content from `requirements.yml`: + ```sh + uv run ansible-galaxy collection install -r ansible/requirements.yml + ``` +3. **Run a stage** — point `-i` at the inventory directory and pass one numbered playbook: + ```sh + uv run ansible-playbook \ + -i ansible/arcodange/factory/inventory \ + ansible/arcodange/factory/playbooks/.yml + ``` + +### The vault password (`ANSIBLE_VAULT_PASSWORD_FILE`) + +Encrypted vars are decrypted with a password that is **sourced from the cluster, not stored on disk**. 
`ANSIBLE_VAULT_PASSWORD_FILE` points at a tiny executable script that reads the K8s secret `arcodange-ansible-vault` from the `kube-system` namespace: + +```sh +kubectl get secret -n kube-system arcodange-ansible-vault \ + --template='{{index .data.pass | base64decode}}' +``` + +> [!IMPORTANT] +> The same `arcodange-ansible-vault` secret in `kube-system` is consumed by the Gitea CI runners (needed for the Gitea mailer). Create it once with `kubectl create secret generic arcodange-ansible-vault --from-literal="pass=" -n kube-system`. See [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) for how this fits the broader secret model. + +--- + +## The provisioning pipeline + +The numbered playbooks are meant to be run **in order** on a fresh cluster — each is a thin wrapper that `import_playbook`s a stage directory (e.g. `01_system.yml` → `system/system.yml`). The `recover/` playbooks are **not** part of the linear sequence; they are an on-demand branch used only during disaster recovery. + +```mermaid +%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'14px'}}}%% +flowchart LR + classDef stage fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb; + classDef recover fill:#5f1e1e,stroke:#ef4444,color:#fef2f2; + + s01["01 · System
Docker · K3s · Longhorn · DNS · SSL"]:::stage + s02["02 · Setup
Gitea · Postgres · NFS backup"]:::stage + s03["03 · CI/CD
act_runner registration"]:::stage + s04["04 · Tools
CrowdSec · Vault"]:::stage + s05["05 · Backup
cron reports · PVC/db dumps"]:::stage + rec["recover/*
Longhorn + data restore"]:::recover + + s01 --> s02 --> s03 --> s04 --> s05 + s05 -. "on disaster" .-> rec + rec -. "rejoin pipeline" .-> s01 +``` + +1. **`01 · System`** — base OS hardening on each Pi, then Docker, Longhorn disk prep + iSCSI, K3s install, CoreDNS, the step-ca cert issuer, and final K3s config (kubeconfig, Longhorn, Traefik). +2. **`02 · Setup`** — deploys the cluster-resident services: Gitea, PostgreSQL (on `pi2`), and the NFS backup target. +3. **`03 · CI/CD`** — fetches a Gitea runner-registration token and rolls out the `act_runner` Docker Compose stack on every non-Gitea Pi so CI jobs have executors. +4. **`04 · Tools`** — installs the operational tooling layer: CrowdSec (WAF/IPS) and HashiCorp Vault. +5. **`05 · Backup`** — schedules the cron-driven backup + email-report jobs and the Gitea / Postgres / K3s-PVC dump routines. +6. **`recover/*` (on demand)** — invoked only after data loss to rebuild Longhorn and replay volume data; once recovered, the cluster re-enters the normal pipeline at `01 · System`. 
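The in-order run described above can be sketched as a small shell helper. This is a hypothetical convenience wrapper, not something the factory repo is known to ship: `run_pipeline`, `DRY_RUN`, `INVENTORY`, and `PLAYBOOKS` are illustrative names, and the glob assumes every numbered wrapper follows the `01_system.yml` naming pattern (only that one filename is confirmed here).

```sh
#!/bin/sh
# Hypothetical helper: run the numbered stage playbooks in order.
# Assumption: wrappers are named NN_<stage>.yml, like the confirmed 01_system.yml.

INVENTORY="${INVENTORY:-ansible/arcodange/factory/inventory}"
PLAYBOOKS="${PLAYBOOKS:-ansible/arcodange/factory/playbooks}"

run_pipeline() {
  # Pathname expansion is sorted, so 01_* comes before 02_*, and so on.
  # recover/ is a subdirectory, so this pattern never matches it.
  for playbook in "$PLAYBOOKS"/[0-9][0-9]_*.yml; do
    # With no match the pattern stays unexpanded; skip it.
    [ -e "$playbook" ] || continue
    if [ "${DRY_RUN:-1}" = "1" ]; then
      # Default: only print the command that would run.
      echo "uv run ansible-playbook -i $INVENTORY $playbook"
    else
      # Stop at the first failing stage; later stages depend on earlier ones.
      uv run ansible-playbook -i "$INVENTORY" "$playbook" || return 1
    fi
  done
}
```

Calling `run_pipeline` prints the commands by default; `DRY_RUN=0 run_pipeline` would execute them. Defaulting to a dry run is a deliberate choice in this sketch, since every run against the current inventory is a prod run.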
+ +--- + +## Index + +| # | Page | Covers | State | +| --- | --- | --- | --- | +| 01 | [System](01-system.md) | RPi hardening, Docker, K3s, Longhorn/iSCSI, CoreDNS, step-ca SSL | ✅ | +| 02 | [Setup](02-setup.md) | Gitea, PostgreSQL, NFS backup target | ✅ | +| 03 | [CI/CD](03-cicd.md) | Gitea `act_runner` registration & Compose deploy | ✅ | +| 04 | [Tools](04-tools.md) | CrowdSec, HashiCorp Vault | ✅ | +| 05 | [Backup](05-backup.md) | Cron report jobs, Gitea/Postgres/PVC dumps | ✅ | +| 06 | [Recover](06-recover.md) | Longhorn + data restore (on-demand DR branch) | 🟡 | +| — | [Inventory & variables](inventory.md) | `hosts.yml` groups, `group_vars/` layering, host→service mapping | ✅ | +| — | [Roles reference](roles.md) | The seven `arcodange.factory.*` roles | ✅ | diff --git a/vibe/guidebooks/factory-provisioning/ansible/inventory.md b/vibe/guidebooks/factory-provisioning/ansible/inventory.md new file mode 100644 index 0000000..d997498 --- /dev/null +++ b/vibe/guidebooks/factory-provisioning/ansible/inventory.md @@ -0,0 +1,111 @@ +[vibe](../../../README.md) > [Guidebooks](../../README.md) > [Factory provisioning](../README.md) > [Ansible](README.md) > **Inventory & variables** + +# Inventory & variables + +> [!NOTE] +> **Status:** ✅ active · **Last Updated:** 2026-06-23 +> **Upstream:** [Ansible sub-hub](README.md) · [Lab ecosystem · 01 factory](../../lab-ecosystem/01-factory.md) +> **Downstream:** [Roles reference](roles.md) +> **Related:** [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) · [Storage & recovery](../../lab-ecosystem/storage-and-recovery.md) · [Naming conventions](../../lab-ecosystem/naming-conventions.md) · [ADR-0001 safe prod-like environment](../../../ADR/0001-safe-prod-like-environment.md) · [PRD · isolation boundary](../../../PRD/safe-prod-like-environment/isolation-boundary.md) + +The inventory is the single source of truth for **which machines exist** and **which service each machine runs**. 
It is a directory inventory — [`inventory/hosts.yml`](../../../../ansible/arcodange/factory/inventory/hosts.yml) plus a layered [`group_vars/`](../../../../ansible/arcodange/factory/inventory/group_vars) tree — passed to every playbook with `-i ansible/arcodange/factory/inventory`. + +> [!IMPORTANT] +> This inventory describes **live production**. The three IPs `192.168.1.201-203` are the real Pis that run the public CMS, the Dolibarr ERP, and business email. A playbook pointed at this inventory mutates prod. The safe-environment work treats this file as the prod blast-radius and requires a **separate sandbox inventory + a prod-IP guard** before any sandbox apply — see the [ADR-0001](../../../ADR/0001-safe-prod-like-environment.md) and the first row of the [PRD isolation boundary](../../../PRD/safe-prod-like-environment/isolation-boundary.md). + +--- + +## Hosts + +Defined in [`inventory/hosts.yml`](../../../../ansible/arcodange/factory/inventory/hosts.yml). Three physical Pis are each reachable two ways — over the LAN (the canonical path) and through an internet port-forward managed at the firewall — plus the control node as `localhost`. + +| Host | `ansible_host` | `preferred_ip` | Port | Reach | +| --- | --- | --- | --- | --- | +| `pi1` | `pi1.home` | `192.168.1.201` | 22 | LAN | +| `pi2` | `pi2.home` | `192.168.1.202` | 22 | LAN | +| `pi3` | `pi3.home` | `192.168.1.203` | 22 | LAN | +| `internetPi1` | `rg-evry.changeip.co` | — | `51022` | WAN port-forward → `pi1` | +| `internetPi2` | `rg-evry.changeip.co` | — | `52022` | WAN port-forward → `pi2` | +| `internetPi3` | `rg-evry.changeip.co` | — | `53022` | WAN port-forward → `pi3` | +| `localhost` | (local connection) | — | — | control node | + +> [!NOTE] +> The `internetPiN` entries share one DNS name (`rg-evry.changeip.co`) and differ only by SSH port (`5N022`). 
The hosts file documents the choice of `changeip.co` over `arcodange.duckdns.org`: changeip is **managed directly with the firewall** rather than depending on a DuckDNS registry update, so the forward is stable. `preferred_ip` is a custom hostvar (not a connection variable) — roles read it to build DNS records, the Gitea SSH domain, and the Pi-hole local-DNS table. + +--- + +## Groups + +Groups map machines to roles. The membership is small and deliberate; read the table as "this service runs on these hosts". + +| Group | Members | Defined as | What it is for | +| --- | --- | --- | --- | +| `raspberries` | `pi1`, `pi2`, `pi3` + `internetPi1-3` | explicit hosts | Every Pi, LAN and WAN handles. Carries the shared `ansible_user: pi`. | +| `local` | `localhost`, `pi1`, `pi2`, `pi3` | explicit hosts | The control-node-facing group; `localhost` runs `kubectl`/`tofu`/`docker` tasks that talk to the cluster. | +| `postgres` | `pi2` | explicit host | The single PostgreSQL node. `pi2` is the database host. | +| `gitea` | `pi2` (via `children: postgres`) | child of `postgres` | Gitea co-locates with its database, so the group simply inherits `postgres`. `groups.gitea[0]` resolves to `pi2` everywhere. | +| `pihole` | `pi1`, `pi3` | explicit hosts | The HA DNS pair (Pi-hole + Gravity Sync). | +| `step_ca` | `pi1`, `pi2`, `pi3` | explicit hosts | Every Pi runs a step-ca node (primary `pi1`, standbys `pi2`/`pi3`). | +| `all` | everything (`children: raspberries`) | implicit + child | Ansible's universal group; `group_vars/all/` applies to all hosts. | + +> [!TIP] +> Because `gitea` is a **child of `postgres`** and `postgres` has exactly one host, every reference to `groups.gitea[0]` (the Gitea container, the API base URL `http://{{ groups.gitea[0] }}:3000`, the SSH domain) points at `pi2`. Move Postgres and Gitea follows automatically. 
+ +--- + +## Connection variables + +| Variable | Where set | Value / effect | +| --- | --- | --- | +| `ansible_user` | `raspberries.vars` | `pi` — the SSH login on every Pi. | +| `ansible_ssh_extra_args` | per-host (`pi1`/`pi2`/`pi3`) | `-o StrictHostKeyChecking=no` — Pis get reimaged, so host-key churn is expected; the check is disabled rather than forcing `known_hosts` edits. | +| `ansible_port` | `internetPiN` | `51022` / `52022` / `53022` — the firewall's per-Pi SSH forwards. | +| `ansible_connection` | `localhost` | `local` — run on the control node, no SSH. | +| `ansible_python_interpreter` | `localhost` | `"{{ ansible_playbook_python }}"` — uses the `uv`-managed venv's Python, no hardcoded path. | + +The control-node tooling chain (`scp_if_ssh = True`) is set in [`ansible.cfg`](../../../../ansible/arcodange/factory/ansible.cfg); the `collections_path` lives there too. + +--- + +## `group_vars/` layering + +Variables are split by group so each service owns its own file. The path `group_vars//.yml` is auto-loaded for every host in ``. + +| File | Scope | Declares | +| --- | --- | --- | +| [`all/common.yml`](../../../../ansible/arcodange/factory/inventory/group_vars/all/common.yml) | all hosts | `user_home` — the control user's `$HOME`, looked up from the environment. | +| [`all/ssh.yml`](../../../../ansible/arcodange/factory/inventory/group_vars/all/ssh.yml) | all hosts | SSH-public-key discovery: `first_found` over `id_ed25519_arcodange.pub` → `id_ed25519.pub` → `id_rsa.pub`, then splits the file into `ssh_public_key`, `ssh_key_title`, `ssh_key_algorithm`. Roles push this key to authorized hosts. | +| [`all/gitea.yml`](../../../../ansible/arcodange/factory/inventory/group_vars/all/gitea.yml) | all hosts | `gitea_secret_propagation_users: [arcodange]` — user namespaces that must also receive org-level Gitea Action secrets (see the [`gitea_secret`](roles.md) role). 
| +| [`gitea/gitea.yml`](../../../../ansible/arcodange/factory/inventory/group_vars/gitea/gitea.yml) | `gitea` | `gitea_version: 1.25.5`, the `gitea_database` triple, and the full Gitea Docker Compose: Postgres backend (`postgres:5432`), the `smtps`/orange.fr mailer, SSH on `2222:22`, `ROOT_URL https://gitea.arcodange.lab/`, registration disabled. SSH domain is built from `hostvars[groups.gitea[0]].preferred_ip`. | +| [`gitea/gitea_vault.yml`](../../../../ansible/arcodange/factory/inventory/group_vars/gitea/gitea_vault.yml) | `gitea` | **VAULTED.** The `gitea_vault.*` map — `GITEA__mailer__PASSWD` (consumed by the compose above) plus the `github_api_token` / `gitlab_api_token` read by the mirror roles. | +| [`postgres/postgres.yml`](../../../../ansible/arcodange/factory/inventory/group_vars/postgres/postgres.yml) | `postgres` | The Postgres Docker Compose — `postgres:16.3-alpine`, `5432:5432`, data under `/home/pi/arcodange/docker_composes/postgres/data` — plus the `pgbouncer` auth-user block. | +| [`step_ca/step_ca.yml`](../../../../ansible/arcodange/factory/inventory/group_vars/step_ca/step_ca.yml) | `step_ca` | `step_ca_primary: pi1`, `step_ca_fqdn: ssl-ca.arcodange.lab`, the `step` user/home/dir, and `step_ca_listen_address: ":8443"`. | +| [`step_ca/step_ca_vault.yml`](../../../../ansible/arcodange/factory/inventory/group_vars/step_ca/step_ca_vault.yml) | `step_ca` | **VAULTED.** `vault_step_ca_password` (the CA root password) and `vault_step_ca_jwk_password` (the cert-manager JWK provisioner password). | + +> [!NOTE] +> Encrypted files are conventionally suffixed `_vault.yml`. They are normal `group_vars` files whose **contents** are `ansible-vault`-encrypted; non-vault siblings hold the plaintext structure that references the vaulted keys (e.g. `gitea/gitea.yml` interpolates `gitea_vault.GITEA__mailer__PASSWD`). + +--- + +## The vault model + +Two distinct mechanisms share the word "vault" here — keep them apart: + +1. 
**`ansible-vault`** encrypts the `*_vault.yml` files at rest in git (AES256). Decryption happens transparently at playbook runtime. +2. **The vault password itself is never on disk.** `ANSIBLE_VAULT_PASSWORD_FILE` points at a tiny executable that fetches the password from the K8s secret `arcodange-ansible-vault` in the `kube-system` namespace: + +```sh +kubectl get secret -n kube-system arcodange-ansible-vault \ + --template='{{index .data.pass | base64decode}}' +``` + +So decrypting any `*_vault.yml` requires `kubectl` access to the live cluster — the cluster *is* the key custodian. The setup recipe (and the `kubectl create secret` to seed it) lives in [`ansible/README.md`](../../../../ansible/README.md); how this fits the broader secret hierarchy is in [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md). + +> [!CAUTION] +> This is **not** HashiCorp Vault. HashiCorp Vault (`vault.arcodange.lab`) is a separate, cluster-resident service installed by the [`hashicorp_vault`](roles.md) role in the `04 · Tools` stage. The `arcodange-ansible-vault` K8s secret only holds the `ansible-vault` password and is also read by the Gitea CI runners for the mailer. + +--- + +## Why this page matters for safe-prod + +The variables above bind Ansible directly to live infrastructure: the host IPs, the prod Vault address, the prod Postgres superuser, and the prod Gitea forge. The safe-environment design maps each of these to a sandbox control — a parallel `inventory/sandbox/hosts.yml` with VM/cloud hosts, a pre-task guard that aborts on any `192.168.1.201-203` target unless `i_mean_prod=true`, and per-service overrides — detailed in the [PRD isolation boundary](../../../PRD/safe-prod-like-environment/isolation-boundary.md). Until that lands, **assume every run is a prod run**. 
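The prod-IP guard mentioned above is planned as an Ansible pre-task; the sketch below only illustrates the same decision logic in shell, for example as a pre-flight check in a wrapper script. The function names `is_prod_ip` and `guard_target` and the `I_MEAN_PROD` environment variable are assumptions made for this example (the latter mirrors `i_mean_prod=true`), not names taken from the factory repo.

```sh
#!/bin/sh
# Hypothetical sketch of the prod-IP guard logic; not the factory's implementation.

# Succeeds when the address is one of the three prod Pis (192.168.1.201-203).
is_prod_ip() {
  case "$1" in
    192.168.1.201|192.168.1.202|192.168.1.203) return 0 ;;
    *) return 1 ;;
  esac
}

# Refuses a prod target unless the caller explicitly opted in.
# I_MEAN_PROD is an assumed variable name mirroring i_mean_prod=true.
guard_target() {
  if is_prod_ip "$1" && [ "${I_MEAN_PROD:-false}" != "true" ]; then
    echo "refusing prod target $1 (set I_MEAN_PROD=true to override)" >&2
    return 1
  fi
  return 0
}
```

A wrapper could call `guard_target "$ip" || exit 1` for each resolved target before handing off to `ansible-playbook`. The opt-in defaults to off, so forgetting the flag fails closed.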
diff --git a/vibe/guidebooks/factory-provisioning/ansible/roles.md b/vibe/guidebooks/factory-provisioning/ansible/roles.md new file mode 100644 index 0000000..b334e83 --- /dev/null +++ b/vibe/guidebooks/factory-provisioning/ansible/roles.md @@ -0,0 +1,186 @@ +[vibe](../../../README.md) > [Guidebooks](../../README.md) > [Factory provisioning](../README.md) > [Ansible](README.md) > **Roles reference** + +# Roles reference + +> [!NOTE] +> **Status:** ✅ active · **Last Updated:** 2026-06-23 +> **Upstream:** [Ansible sub-hub](README.md) · [Lab ecosystem · 01 factory](../../lab-ecosystem/01-factory.md) +> **Downstream:** [Inventory & variables](inventory.md) +> **Related:** [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) · [Storage & recovery](../../lab-ecosystem/storage-and-recovery.md) · [Naming conventions](../../lab-ecosystem/naming-conventions.md) · [ADR-0001 safe prod-like environment](../../../ADR/0001-safe-prod-like-environment.md) + +Roles live in two places, by reuse scope: + +- **Shared roles** — reusable across stages — live in [`ansible/arcodange/factory/roles/`](../../../../ansible/arcodange/factory/roles) and are referenced by FQCN `arcodange.factory.`. +- **Nested roles** — owned by one playbook stage — live under [`playbooks//roles/`](../../../../ansible/arcodange/factory/playbooks) and are auto-discovered by that stage's playbook. + +This page is split by **altitude**. Tier 1 covers the heavyweight platform-service roles (one subsection each); Tier 2 is a single table of the smaller building-block roles. + +--- + +## Tier 1 — platform-service roles + +### `hashicorp_vault` + +[`playbooks/tools/roles/hashicorp_vault`](../../../../ansible/arcodange/factory/playbooks/tools/roles/hashicorp_vault) · runs on `localhost` in the `04 · Tools` stage. It initializes and unseals the cluster Vault and wires Gitea as an OIDC provider so CI jobs can authenticate to Vault. 
+ +The [`tasks/main.yml`](../../../../ansible/arcodange/factory/playbooks/tools/roles/hashicorp_vault/tasks/main.yml) flow is: + +1. **Init** ([`init.yml`](../../../../ansible/arcodange/factory/playbooks/tools/roles/hashicorp_vault/tasks/init.yml)) — first run only. Lists the Vault server pods in the `tools` namespace, checks `vault operator init -status`, and if uninitialized runs `vault operator init` with **`key-shares=1`, `key-threshold=1`** (defaults from [`defaults/main.yml`](../../../../ansible/arcodange/factory/playbooks/tools/roles/hashicorp_vault/defaults/main.yml)). The JSON output — unseal keys + initial root token — is written to `~/.arcodange/cluster-keys.json` (dir `0700`, file `0600`). +2. **Unseal** ([`unseal.yml`](../../../../ansible/arcodange/factory/playbooks/tools/roles/hashicorp_vault/tasks/unseal.yml)) — required after every reboot. Reads the keys file and runs `vault operator unseal` for each server, then revokes the *initial* root token (idempotent — tolerates an already-revoked token). +3. **Generate a fresh root token** ([`new_root_token.yml`](../../../../ansible/arcodange/factory/playbooks/tools/roles/hashicorp_vault/tasks/new_root_token.yml)) — runs the `generate-root` OTP/nonce dance using the unseal keys to mint a short-lived `vault_root_token`. +4. 
**Set up Gitea OIDC** ([`gitea_oidc_auth.yml`](../../../../ansible/arcodange/factory/playbooks/tools/roles/hashicorp_vault/tasks/gitea_oidc_auth.yml)) — drives Gitea through the bundled [`playwright_setupGiteaApp.js`](../../../../ansible/arcodange/factory/playbooks/tools/roles/hashicorp_vault/files/playwright_setupGiteaApp.js) (via the [`playwright`](#tier-2--building-block-roles) role) to create an OAuth2 app, then applies the bundled OpenTofu [`hashicorp_vault.tf`](../../../../ansible/arcodange/factory/playbooks/tools/roles/hashicorp_vault/files/hashicorp_vault.tf) inside a disposable `ghcr.io/opentofu/opentofu` container (state on a throwaway docker volume) to provision the Vault JWT/OIDC backend. Finally it renders [`oidc_jwt_token.sh.j2`](../../../../ansible/arcodange/factory/playbooks/tools/roles/hashicorp_vault/templates/oidc_jwt_token.sh.j2) into the Gitea Actions secret **`vault_oauth__sh_b64`** (base64) at **org** scope, then propagates the same secret to each user in `gitea_secret_propagation_users` (Action secrets are per-owner, so user-owned repos can't read org secrets). +5. **Revoke the temp root token** — the `always` block of `main.yml` revokes `vault_root_token` no matter how step 4 ended, so no long-lived root token survives the run. + +| Var | Default | Meaning | +| --- | --- | --- | +| `vault_unseal_keys_path` | `~/.arcodange/cluster-keys.json` | Where unseal keys + root token are stored. | +| `vault_unseal_keys_shares` / `_key_threshold` | `1` / `1` | Single-key seal (lab posture; `threshold <= shares`). | +| `vault_address` | `https://vault.arcodange.lab` | The cluster Vault endpoint. | +| `gitea_admin_user` / `gitea_admin_password` | `arcodange@gmail.com` / (prompted) | Credentials Playwright uses to create the OAuth app. | +| `vault_oidc_force_reset` | `false` | When `true`, `vault auth disable gitea` + `gitea_jwt` before re-applying. 
| + +> [!CAUTION] +> `vault_oidc_force_reset=true` is **destructive**: it disables and wipes **all** `gitea_cicd_*` per-app JWT roles created by the bundled tofu, every run. Default is off. Likewise, losing `~/.arcodange/cluster-keys.json` means the Vault can never be unsealed again — that file is the single point of failure for the whole secret plane (see [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md)). + +### `step_ca` + +[`playbooks/ssl/roles/step_ca`](../../../../ansible/arcodange/factory/playbooks/ssl/roles/step_ca) · runs on the `step_ca` group (all three Pis) in the `01 · System` stage via [`ssl/step-ca.yml`](../../../../ansible/arcodange/factory/playbooks/ssl/step-ca.yml). It is the lab's internal ACME/CA for `*.arcodange.lab` certificates, run **active/standby**: primary `pi1`, replicas `pi2`/`pi3`. The [`tasks/main.yml`](../../../../ansible/arcodange/factory/playbooks/ssl/roles/step_ca/tasks/main.yml) imports five task files in order: + +1. **install** — install the `step` / `step-ca` binaries. +2. **init** ([`init.yml`](../../../../ansible/arcodange/factory/playbooks/ssl/roles/step_ca/tasks/init.yml)) — primary only. `step ca init` (non-interactive, password file) with `creates:` guard so it is idempotent. The CA name is `Arcodange Lab CA`, DNS `ssl-ca.arcodange.lab`, listen `:8443`. +3. **sync** ([`sync.yml`](../../../../ansible/arcodange/factory/playbooks/ssl/roles/step_ca/tasks/sync.yml)) — replicates the CA from primary to standbys. It takes a **lockfile** on the primary (`.sync.lock`), computes a deterministic `tar | sha256sum` **checksum** of `~/.step`, compares it to the last checksum cached on the controller, and only `rsync`s (pull → controller → push to standbys) when the checksum changed. This is how the standbys hold an identical CA without a shared filesystem. +4. **systemd** — install/enable the `step-ca` unit (the `restart step-ca` handler fires on cert/config change). +5. 
**provisioners** ([`provisioners.yml`](../../../../ansible/arcodange/factory/playbooks/ssl/roles/step_ca/tasks/provisioners.yml)) — primary only. Ensures a **JWK provisioner named `cert-manager`** exists: lists provisioners, generates the JWK keypair (`creates:` guard) under `~/.step/provisioners/`, and `step ca provisioner add`s it. This is what lets in-cluster cert-manager request certs from the CA. + +| Var | Default | Meaning | +| --- | --- | --- | +| `step_ca_primary` | `pi1` | The writable CA node; standbys sync from it. | +| `step_ca_fqdn` | `ssl-ca.arcodange.lab` | CA DNS name; URL is `https://{fqdn}:8443`. | +| `step_ca_provisioner_name` / `_type` | `cert-manager` / `JWK` | The cert-manager provisioner. | +| `step_ca_force_reinit` | `false` | When `true`, stops the service and **wipes `~/.step`** before re-init. | + +| Secret | Source | +| --- | --- | +| `vault_step_ca_password` | CA root password — from vaulted [`step_ca/step_ca_vault.yml`](../../../../ansible/arcodange/factory/inventory/group_vars/step_ca/step_ca_vault.yml). | +| `vault_step_ca_jwk_password` | cert-manager JWK provisioner password — same vaulted file. | + +> [!CAUTION] +> `step_ca_force_reinit=true` **wipes the entire CA** (`~/.step`) on the primary and re-issues a new root — every previously issued `*.arcodange.lab` cert immediately becomes untrusted until clients reload the new root. Use only for a deliberate PKI rebuild. + +### `crowdsec` + +[`playbooks/tools/roles/crowdsec`](../../../../ansible/arcodange/factory/playbooks/tools/roles/crowdsec) · runs on `localhost` in the `04 · Tools` stage. It wires CrowdSec's decisions into Traefik as a bouncer middleware with a Turnstile CAPTCHA. The [`tasks/main.yml`](../../../../ansible/arcodange/factory/playbooks/tools/roles/crowdsec/tasks/main.yml) flow: + +1. 
**Vault → K8s secret plumbing** — creates a `ServiceAccount` (`factory-ansible-tool-crowdsec-traefik-plugin`), a `VaultAuth` (kubernetes auth, role `factory_crowdsec_conf`), and a `VaultStaticSecret` that reads **`kvv2/cms/factory/turnstile`** into a K8s secret (`refreshAfter: 30s`). The Turnstile sitekey/secret come from there. +2. **Bouncer key** — finds the CrowdSec LAPI pod in `tools` and runs `cscli bouncers add traefik-plugin` (deletes + re-adds on conflict) to obtain the bouncer API key. +3. **CAPTCHA HTML** — `inject_captcha_html.yml` pushes `captcha.html` into the Traefik PVC; this task is **tagged `never`** (opt-in only) so the default run skips it. +4. **Traefik Middleware** — applies a `traefik.io/v1alpha1` `Middleware` named **`crowdsec-bouncer`** (`crowdsec` in `kube-system`) configured with the bouncer key, stream mode, Turnstile (`captchaProvider: turnstile` + site/secret keys), and a **Redis cache at `redis.tools:6379`**. +5. **Restart Traefik** — scales the Traefik Deployment to 0 then back to 1 (with a `rescue`/`always` guard guaranteeing it scales back up) to load the new middleware. + +| Var | Default | Meaning | +| --- | --- | --- | +| `traefik_pvc_name` | `traefik` | The PVC the (tagged-`never`) captcha.html inject targets. | + +| Secret | Source | +| --- | --- | +| Turnstile sitekey + secret | Vault `kvv2/cms/factory/turnstile`, surfaced via `VaultStaticSecret`. | +| Bouncer API key | Minted at runtime by `cscli bouncers add`. | + +### `pihole` + +[`playbooks/dns/roles/pihole`](../../../../ansible/arcodange/factory/playbooks/dns/roles/pihole) · runs on the `pihole` group (`pi1`, `pi3`) in the `01 · System` stage. It configures **HA DNS**: two Pi-hole nodes kept in sync. The [`tasks/main.yml`](../../../../ansible/arcodange/factory/playbooks/dns/roles/pihole/tasks/main.yml) includes three task files: + +1. 
**`ha_pihole_setup.yml`** — **waits for a manual Pi-hole install** (it prints the `curl … | sudo bash` command and `wait_for`s `/etc/pihole/pihole-FTL.db` for up to 10 minutes; Pi-hole itself is not installed by Ansible). It then patches [`pihole.toml`](../../../../ansible/arcodange/factory/playbooks/dns/roles/pihole/tasks/ha_pihole_setup.yml) (listen port, `listeningMode = "ALL"`, enable `/etc/dnsmasq.d`) and writes three dnsmasq drop-ins: `10-custom-rules.conf` (wildcard `address=/fqdn/ip` from `pihole_custom_dns`), `20-rpis.conf` (`.home` → `preferred_ip` for every Pi), and `99-upstream.conf` (explicit upstream from `pihole_upstream_dns`). +2. **`gravity_setup.yml`** — sets up **Gravity Sync** between the two nodes: a `pihole_gravity` system user with a freshly **rotated ed25519 keypair** each run, cross-authorized `authorized_keys`, full **sudo** (`/etc/sudoers.d/gravity-sync`), the installer, and a generated `gravity-sync.conf` (each node points `REMOTE_HOST` at the other), then runs the sync. +3. **`client_setup.yml`** — points DNS clients at the Pi-hole pair by editing `/etc/resolv.conf` (insert nameservers after `search`) and the active NetworkManager connections via `nmcli` (per-interface `ipv4.dns` + `dns-priority`, eth0 50 / wlan0 100). + +| Var | Default | Meaning | +| --- | --- | --- | +| `pihole_primary` | `pi1` | First node; the other is derived as the secondary. | +| `pihole_ports` | `8081o,443os,…` | Web-interface listen ports. | +| `pihole_custom_dns` | `{}` | FQDN→IP wildcard records (validated as IPv4). | +| `pihole_upstream_dns` | `[8.8.8.8, 1.1.1.1, 8.8.4.4]` | Explicit upstreams (avoids DHCP-provided DNS). | + +> [!WARNING] +> This role is **not fully idempotent**: it depends on a human running the Pi-hole installer first, it **rotates the gravity SSH key on every run**, and it grants the `pihole_gravity` user passwordless **sudo ALL**. Treat reruns as state-changing, not no-ops. 
+ +### `deploy_docker_compose` + +[`roles/deploy_docker_compose`](../../../../ansible/arcodange/factory/roles/deploy_docker_compose) · shared. This is the **generic compose mechanism** every app deploy builds on. The caller passes a `dockercompose_content` dict; the [`tasks/main.yml`](../../../../ansible/arcodange/factory/roles/deploy_docker_compose/tasks/main.yml): + +1. Derives `app_name` from `dockercompose_content.name` and creates `////` plus `data/` and `scripts/`. +2. Writes the compose file with `to_nice_yaml` and **validates it** with `validate: 'docker compose -f %s config'` — a bad compose fails the task before anything is written live. +3. Writes a small wrapper script `scripts/docker-compose` that runs `docker compose -f "$@"`, so the app can be driven without remembering the path. + +| Var | Default | Meaning | +| --- | --- | --- | +| `app_name` | `(dockercompose_content.name)` | App directory name. | +| `app_owner` / `app_group` | `pi` / `docker` | File ownership. | +| `root_path` | `/home/pi/arcodange` | Base path; `partition` (`docker_composes`) nests under it. | + +--- + +## Tier 2 — building-block roles + +Smaller roles, mostly Gitea/forge plumbing and one-shot helpers. Shared roles live in [`roles/`](../../../../ansible/arcodange/factory/roles); `deploy_gitea`/`deploy_postgresql` are nested under [`playbooks/setup/roles/`](../../../../ansible/arcodange/factory/playbooks/setup/roles). + +| Role | Purpose | Key vars / notes | Secrets | +| --- | --- | --- | --- | +| [`gitea_repo`](../../../../ansible/arcodange/factory/roles/gitea_repo) | Ensure a repo exists across Gitea + GitHub + GitLab and add **8h push mirrors** (`sync_on_commit: true`) to GitHub/GitLab. | Creates missing repos on each forge; mirror URLs + namespace IDs in [`vars/main.yml`](../../../../ansible/arcodange/factory/roles/gitea_repo/vars/main.yml). | `github_api_token`, `gitlab_api_token` (from `gitea_vault`). 
| +| [`gitea_token`](../../../../ansible/arcodange/factory/roles/gitea_token) | Generate / replace / delete a Gitea access token via `docker exec … gitea admin user generate-access-token`. | Stores the raw token in the fact named by `gitea_token_fact_name`; `gitea_token_replace` / `gitea_token_delete` toggles; scopes default to `write:admin,organization,package,repository,user`. | The minted token itself (a fact, not persisted). | +| [`gitea_secret`](../../../../ansible/arcodange/factory/roles/gitea_secret) | `PUT` a Gitea **Actions secret** at user or org scope. | `gitea_secret_name` / `_value`; `gitea_owner_type` (`user`\|`org`) selects the API path. | `gitea_api_token` (Authorization). | +| [`gitea_sync`](../../../../ansible/arcodange/factory/roles/gitea_sync) | List repos on all **three forges**, diff them, and call `gitea_repo` for the repos missing somewhere. | Computes `repos_incomplete = all − common`; loops `gitea_repo` over the gaps. | GitHub/GitLab/Gitea API tokens. | +| [`traefik_certs`](../../../../ansible/arcodange/factory/roles/traefik_certs) | Extract the live **`*.arcodange.lab`** cert from Traefik's `acme.json`. | `kubectl exec` into Traefik → `jq` the LetsEncrypt wildcard cert → `traefik_cert_pem` fact; no-op if already set. | — (reads in-cluster acme.json). | +| [`playwright`](../../../../ansible/arcodange/factory/roles/playwright) | Run a Playwright browser-automation script in Docker. | Builds `playwright:` (default `1.47.0`) from `files/`, runs the script with `playwright_env` injected as `-e`; default script `loginGitea.js`. Used by `hashicorp_vault` for the OIDC app setup. | Script-specific env (e.g. Gitea admin creds). | +| [`deploy_gitea`](../../../../ansible/arcodange/factory/playbooks/setup/roles/deploy_gitea) | Deploy Gitea: template [`app.ini.j2`](../../../../ansible/arcodange/factory/playbooks/setup/roles/deploy_gitea/tasks/main.yml), `docker compose up`, then **health-check `:3000`** until ready. 
| Compose source is `/home/pi/arcodange/docker_composes/gitea`; admin user `arcodange`. | (consumes the vaulted Gitea compose env). | +| [`deploy_postgresql`](../../../../ansible/arcodange/factory/playbooks/setup/roles/deploy_postgresql) | Deploy Postgres via compose, then per-app **create DB + user** ([`create_db_and_user.yml`](../../../../ansible/arcodange/factory/playbooks/setup/roles/deploy_postgresql/tasks/create_db_and_user.yml)). | Waits on `pg_isready`, loops `applications_databases` (`{app: {db_name, db_user, db_password}}`). | Per-app DB passwords from `applications_databases`. | + +--- + +## Role dependency view + +How the roles relate: shared building blocks feed the `setup`-stage app deploys, and a few platform-service roles include shared roles directly. + +```mermaid +%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'14px'}}}%% +flowchart TD + classDef shared fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb; + classDef setup fill:#1e4620,stroke:#22c55e,color:#f9fafb; + classDef platform fill:#4a2c1e,stroke:#f59e0b,color:#f9fafb; + + dc["deploy_docker_compose
generic compose writer"]:::shared + pw["playwright
browser automation"]:::shared + gt["gitea_token
mint access token"]:::shared + gs["gitea_secret
PUT Actions secret"]:::shared + gr["gitea_repo
mirror to GitHub/GitLab"]:::shared + gsync["gitea_sync
diff 3 forges"]:::shared + tc["traefik_certs
extract lab cert"]:::shared + + dpg["deploy_postgresql"]:::setup + dgi["deploy_gitea"]:::setup + + hv["hashicorp_vault"]:::platform + sca["step_ca"]:::platform + cs["crowdsec"]:::platform + ph["pihole"]:::platform + + gsync --> gr + hv --> pw + hv --> gs + dc -. "used by app deploys" .-> dpg + dc -. "used by app deploys" .-> dgi +``` + +1. **`gitea_sync` → `gitea_repo`** — the sync role include-loops `gitea_repo` for each repo missing from one of the three forges. +2. **`hashicorp_vault` → `playwright`** — Vault's OIDC setup drives Gitea through Playwright to create the OAuth app. +3. **`hashicorp_vault` → `gitea_secret`** — the rendered `vault_oauth__sh_b64` is published as a Gitea Actions secret at org and user scope. +4. **`deploy_docker_compose` → `deploy_postgresql` / `deploy_gitea`** — the generic compose writer is the substrate the `setup`-stage app deploys lean on. +5. **`step_ca`, `crowdsec`, `pihole`** stand alone — they configure their own services (PKI, WAF, DNS) without including other roles. + +--- + +## See also + +- [Inventory & variables](inventory.md) — the groups (`gitea`, `postgres`, `step_ca`, `pihole`) these roles target, and the vaulted `group_vars` they read. +- [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) — where `hashicorp_vault`'s OIDC tokens and the `kvv2/cms/factory/turnstile` path fit the broader secret model. +- [Storage & recovery](../../lab-ecosystem/storage-and-recovery.md) — how the compose `data/` dirs and the step-ca state relate to backup and disaster recovery. 
diff --git a/vibe/guidebooks/factory-provisioning/opentofu/README.md b/vibe/guidebooks/factory-provisioning/opentofu/README.md new file mode 100644 index 0000000..9362cd6 --- /dev/null +++ b/vibe/guidebooks/factory-provisioning/opentofu/README.md @@ -0,0 +1,95 @@ +[vibe](../../../README.md) > [Guidebooks](../../README.md) > [Factory provisioning](../README.md) > **OpenTofu** + +# OpenTofu — factory provisioning + +> [!NOTE] +> **Status:** ✅ active · **Last Updated:** 2026-06-23 +> **Upstream:** [Factory provisioning hub](../README.md) · [Lab ecosystem · 01 factory](../../lab-ecosystem/01-factory.md) +> **Downstream:** [factory iac](factory-iac.md) · [postgres iac](postgres-iac.md) · [CI apply flow](ci-apply-flow.md) +> **Related:** [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) · [Storage & recovery](../../lab-ecosystem/storage-and-recovery.md) · [Naming conventions](../../lab-ecosystem/naming-conventions.md) · [ADR-0001 safe prod-like environment](../../../ADR/0001-safe-prod-like-environment.md) + +OpenTofu is the **declarative half** of the factory: it provisions the resources that live *outside* the K3s cluster (Gitea repos & CI users, Vault policies, Cloudflare DNS, OVH domains, a GCS backup bucket) as well as the in-cluster PostgreSQL roles/databases. The imperative half (the cluster itself) is built by [Ansible](../ansible/README.md). + +OpenTofu is pinned to **`1.8.2`** in CI (`OPENTOFU_VERSION`). + +--- + +## Two independent state roots + +There are **two separate Terraform/OpenTofu roots**, each with its own `backend.tf`, its own GCS state prefix, its own provider set, and its own CI workflow. They never share state and can be applied independently. 
+ +| Root | Code path | State backend (GCS) | Triggered by | +| --- | --- | --- | --- | +| **factory iac** | [`iac/`](../../../../iac) | `gs://arcodange-tf/factory/main` | changes under `iac/**` → [`.gitea/workflows/iac.yaml`](../../../../.gitea/workflows/iac.yaml) | +| **postgres iac** | [`postgres/iac/`](../../../../postgres/iac) | `gs://arcodange-tf/factory/postgres` | changes under `postgres/**` → [`.gitea/workflows/postgres.yaml`](../../../../.gitea/workflows/postgres.yaml) | + +> [!NOTE] +> Both roots share the same GCS **bucket** (`arcodange-tf`) but live under **distinct prefixes** (`factory/main` vs `factory/postgres`), so their state objects never collide. + +--- + +## Providers + +| Provider | Version | Endpoint / scope | Auth | +| --- | --- | --- | --- | +| `go-gitea/gitea` | `0.6.0` | `https://gitea.arcodange.lab` | `GITEA_TOKEN` env var | +| `vault` | `4.4.0` | `https://vault.arcodange.lab` | JWT login — mount `gitea_jwt`, role `gitea_cicd` | +| `google` | `7.0.1` | project `arcodange`, region `US-EAST1` | `GOOGLE_CREDENTIALS` (factory) / `GOOGLE_BACKEND_CREDENTIALS` (postgres backend) | +| `cloudflare/cloudflare` | `~> 5` | DNS / IAM | `CLOUDFLARE_API_TOKEN` env var | +| `ovh/ovh` | `2.8.0` | endpoint `ovh-eu` | `OVH_APPLICATION_KEY` / `OVH_APPLICATION_SECRET` / `OVH_CONSUMER_KEY` | +| `cyrilgdn/postgresql` | `1.24.0` | `192.168.1.202` (pi2), `superuser` | `POSTGRES_USERNAME` / `POSTGRES_PASSWORD` (TF vars) | + +The first five providers belong to the **factory iac** root ([`iac/providers.tf`](../../../../iac/providers.tf)); the **postgres iac** root ([`postgres/iac/providers.tf`](../../../../postgres/iac/providers.tf)) declares only `postgresql` + `vault`. Both roots configure the `vault` provider identically (JWT, mount `gitea_jwt`, role `gitea_cicd`). + +--- + +## The Vault-JWT auth model + +Neither root carries long-lived Vault credentials. Instead CI mints a short-lived Gitea OIDC token and exchanges it for Vault access: + +1. 
A first job decodes the base64 secret **`vault_oauth__sh_b64`** and runs it (`base64 -d | bash`), producing a **Gitea OIDC JWT** as a job output (`gitea_vault_jwt`). +2. That JWT is exported into the apply job as **`TERRAFORM_VAULT_AUTH_JWT`**. +3. The `vault` provider's `auth_login_jwt` block consumes it against mount `gitea_jwt` / role `gitea_cicd`, yielding a scoped Vault token used to read the per-provider secrets (Google creds, Gitea token, Cloudflare token, OVH app keys, Postgres creds). + +See [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) for the full Vault policy/mount design and [CI apply flow](ci-apply-flow.md) for the job-by-job walkthrough. + +--- + +## CI apply flow + +Both workflows share the same two-job shape: authenticate, then apply. The trigger paths differ (`iac/**` vs `postgres/**`) but the structure is identical. + +```mermaid +%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'14px'}}}%% +flowchart TD + classDef trigger fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb; + classDef job fill:#1e4620,stroke:#22c55e,color:#f0fdf4; + classDef danger fill:#5f1e1e,stroke:#ef4444,color:#fef2f2; + + push["push / PR touching
iac/** or postgres/**"]:::trigger + auth["job: gitea_vault_auth
decode vault_oauth__sh_b64
mint Gitea OIDC JWT"]:::job + tofu["job: tofu
read Vault secrets via JWT
set provider env vars"]:::job + apply["dflook/terraform-apply@v1
auto_approve: true"]:::danger + + push --> auth + auth -- "gitea_vault_jwt output" --> tofu + tofu --> apply +``` + +1. A **push or PR** that touches files under `iac/**` (factory) or `postgres/**` (postgres) starts the matching workflow; `workflow_dispatch` allows a manual run. +2. The **`gitea_vault_auth`** job decodes `vault_oauth__sh_b64` and emits the Gitea OIDC JWT as `gitea_vault_jwt`. +3. The **`tofu`** job (`needs: gitea_vault_auth`) sets `TERRAFORM_VAULT_AUTH_JWT` from that output, reads the provider secrets out of Vault, and prepares the homelab CA cert (`VAULT_CACERT`). +4. The job runs **`dflook/terraform-apply@v1`** against the root's `path` (`iac` or `postgres/iac`) with **`auto_approve: true`**. + +> [!CAUTION] +> **Applies are auto-approved.** There is no manual plan-review gate — once a change to `iac/**` or `postgres/**` lands on `main`, CI applies it to the real Gitea, Vault, Cloudflare, OVH, GCS, and PostgreSQL targets without further confirmation. Treat every merge as a production change and review the diff *before* merging, not after. This trade-off is recorded in [ADR-0001 · safe prod-like environment](../../../ADR/0001-safe-prod-like-environment.md). 
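The Vault-JWT exchange that both workflows rely on boils down to one login call against the JWT auth mount. The sketch below only builds that request, without sending it, using the mount `gitea_jwt` and role `gitea_cicd` named on this page. In the real workflows the exchange is performed by the `vault` provider and the vault action, not by hand-written code; the helper name `build_vault_jwt_login` and the placeholder JWT are assumptions for illustration.

```python
import json
import urllib.request


def build_vault_jwt_login(vault_addr: str, mount: str, role: str, jwt: str) -> urllib.request.Request:
    """Build (but do not send) a login request for Vault's JWT auth method."""
    # Vault's JWT auth method expects the role name and the signed token in the body.
    body = json.dumps({"role": role, "jwt": jwt}).encode()
    return urllib.request.Request(
        url=f"{vault_addr.rstrip('/')}/v1/auth/{mount}/login",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    # The JWT here is a placeholder; in CI it would be the `gitea_vault_jwt` job output.
    req = build_vault_jwt_login(
        "https://vault.arcodange.lab", "gitea_jwt", "gitea_cicd", "placeholder.jwt.value"
    )
    print(req.get_method(), req.full_url)
```

A successful login returns a short-lived client token scoped by the role's policies, which is what keeps long-lived Vault credentials out of both roots.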
+ +--- + +## Index + +| Page | Covers | State | +| --- | --- | --- | +| [factory iac](factory-iac.md) | `iac/` root — Gitea, Vault, Google/GCS backup, Cloudflare, OVH | ✅ | +| [postgres iac](postgres-iac.md) | `postgres/iac/` root — PostgreSQL roles & databases on pi2 | ✅ | +| [CI apply flow](ci-apply-flow.md) | Both Gitea workflows, the Vault-JWT exchange, auto-approve apply | ✅ | diff --git a/vibe/guidebooks/factory-provisioning/opentofu/ci-apply-flow.md b/vibe/guidebooks/factory-provisioning/opentofu/ci-apply-flow.md new file mode 100644 index 0000000..b521895 --- /dev/null +++ b/vibe/guidebooks/factory-provisioning/opentofu/ci-apply-flow.md @@ -0,0 +1,114 @@ +[vibe](../../../README.md) > [Guidebooks](../../README.md) > [Factory provisioning](../README.md) > [OpenTofu](README.md) > **CI apply flow** + +# CI apply flow + +> [!NOTE] +> **Status:** ✅ active · **Last Updated:** 2026-06-23 +> **Upstream:** [`.gitea/workflows/iac.yaml`](../../../../.gitea/workflows/iac.yaml), [`.gitea/workflows/postgres.yaml`](../../../../.gitea/workflows/postgres.yaml) +> **Downstream:** [factory iac](factory-iac.md), [postgres iac](postgres-iac.md) +> **Related:** [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) · [ADR-0001 · Safe prod-like environment](../../../ADR/0001-safe-prod-like-environment.md) · [QA strategy](../../../PRD/safe-prod-like-environment/qa-strategy.md) + +Two Gitea Actions workflows turn every commit that touches the OpenTofu code into a live `apply`. `IAC` ([`.gitea/workflows/iac.yaml`](../../../../.gitea/workflows/iac.yaml)) drives the factory infrastructure under [`iac/`](../../../../iac/); `Postgres` ([`.gitea/workflows/postgres.yaml`](../../../../.gitea/workflows/postgres.yaml)) drives the database stack under [`postgres/iac/`](../../../../postgres/). They share the same two-job shape: a short OIDC-auth job feeds a Vault JWT to a `tofu` job that reads secrets and runs `terraform apply`. 
+ +> [!CAUTION] +> **`auto_approve: true` means every merge to `main` applies immediately — there is no plan-gate.** The `dflook/terraform-apply@v1` step skips the interactive approval, so any change that lands on `main` (or any matched `push`) rewrites real cloud and homelab state without a human reviewing the plan. Mitigations are entirely upstream of CI: (1) **mandatory code review** on the PR before merge, and (2) **least-privilege Vault policies** on the `gitea_cicd` role so a runaway apply can only touch the resources its token is scoped to. See [ADR-0001](../../../ADR/0001-safe-prod-like-environment.md): the sandbox lane runs the *same* tofu but **plan-only** against a `sandbox/` state prefix and a throwaway DNS zone, so contributors can validate changes without an auto-apply. + +## Triggers + +Both workflows fire on the same three events; only the watched path globs differ. + +| Event | `IAC` (factory) | `Postgres` | +| --- | --- | --- | +| `push` | `iac/*.tf`, `iac/*.tfvars`, `iac/**/*.tf`, `iac/**/*.tfvars` | `postgres/**/*.tf`, `postgres/**/*.tfvars` | +| `pull_request` | same globs (YAML anchor `*tofuPaths`) | same globs (YAML anchor `*postgresTofuPaths`) | +| `workflow_dispatch` | manual, no inputs | manual, no inputs | + +> [!IMPORTANT] +> `concurrency` is keyed on `${{ github.ref }}-${{ github.workflow }}` with `cancel-in-progress: true`, so a newer push to the same branch cancels an in-flight run. A `pull_request` event also triggers the workflow, and the `apply` step still runs for it, so the safety contract is "review **before** merge", not "CI only plans on PRs". + +## Job 1 — `gitea_vault_auth` + +Mints a Gitea OIDC token that Vault will trust. 
The whole job is one step: + +```bash +echo -n "${{ secrets.vault_oauth__sh_b64 }}" | base64 -d | bash +``` + +| Field | Value | +| --- | --- | +| Runner | `ubuntu-latest` | +| Secret consumed | `vault_oauth__sh_b64` — a base64-encoded shell script | +| Step id | `gitea_vault_jwt` | +| Output | `gitea_vault_jwt` ← `steps.gitea_vault_jwt.outputs.id_token` | + +The decoded script asks Gitea for an OIDC `id_token` and emits it as a step output. The `tofu` job declares `needs: [gitea_vault_auth]` so it receives `needs.gitea_vault_auth.outputs.gitea_vault_jwt`. + +## Job 2 — `tofu` + +| Field | `IAC` | `Postgres` | +| --- | --- | --- | +| Job name | `Tofu` | `Tofu - Postgres` | +| `needs` | `gitea_vault_auth` | `gitea_vault_auth` | +| `OPENTOFU_VERSION` | `1.8.2` | `1.8.2` | +| `TERRAFORM_VAULT_AUTH_JWT` | `needs.gitea_vault_auth.outputs.gitea_vault_jwt` | same | +| `VAULT_CACERT` | `${{ github.workspace }}/homelab.pem` | same | +| Apply path | `iac` | `postgres/iac` | + +Step order inside the job: + +1. **read vault secret** — the shared `*vault_step` anchor (see below). +2. **`actions/checkout@v4`** — pull the repo into the workspace. +3. **prepare vault self signed cert** — `echo -n "${{ secrets.HOMELAB_CA_CERT }}" | base64 -d > $VAULT_CACERT`, writing the homelab CA to `homelab.pem` so the runner trusts `https://vault.arcodange.lab`. +4. **terraform apply** — `dflook/terraform-apply@v1` with the path above and `auto_approve: true`. + +### Vault secret reads (`*vault_step`) + +The `read vault secret` step uses [`arcodange-org/vault-action`](https://gitea.arcodange.lab/arcodange-org/vault-action), authenticating with `method: jwt`, `path: gitea_jwt`, `role: gitea_cicd`, `url: https://vault.arcodange.lab`, `caCertificate: ${{ secrets.HOMELAB_CA_CERT }}`, and `jwtGiteaOIDC` set to the auth job's output. 
The secrets it exports into the job env differ per workflow: + +| Workflow | Vault path | Selector | Exported as | +| --- | --- | --- | --- | +| `IAC` | `kvv1/google/credentials` | `credentials` | `GOOGLE_CREDENTIALS` | +| `IAC` | `kvv1/admin/gitea` | `token` | `GITEA_TOKEN` | +| `IAC` | `kvv1/admin/cloudflare` | `iam_token` | `CLOUDFLARE_API_TOKEN` | +| `IAC` | `kvv1/admin/ovh/app` | `*` (all keys) | `OVH_*` | +| `Postgres` | `kvv1/google/credentials` | `credentials` | `GOOGLE_BACKEND_CREDENTIALS` | +| `Postgres` | `kvv1/postgres/credentials` | `*` (all keys) | `TF_VAR_postgres_*` | + +`GOOGLE_CREDENTIALS` / `GOOGLE_BACKEND_CREDENTIALS` authenticate the GCS state backend; the `TF_VAR_postgres_*` fan-out feeds the Postgres module's input variables directly. See [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) for how the `gitea_cicd` role and KV v1 mounts are provisioned. + +## End-to-end flow + +```mermaid +%%{init: {'theme': 'base'}}%% +flowchart TD + push["push / PR / workflow_dispatch
on iac/** or postgres/** .tf .tfvars"] --> auth["job: gitea_vault_auth
base64 -d | bash -> Gitea OIDC id_token"] + auth -->|"gitea_vault_jwt output"| tofu["job: tofu
OPENTOFU_VERSION 1.8.2"] + tofu --> readvault["read vault secret
vault-action jwt role gitea_cicd"] + readvault -->|"GOOGLE_CREDENTIALS, TF_VAR_postgres_*, ..."| init["tofu init
GCS backend, state prefix"] + init --> apply["dflook/terraform-apply@v1
auto_approve: true"] + apply --> state["state updated in GCS
real cloud + homelab mutated"] + + classDef trigger fill:#1f3a5f,stroke:#7fb0ff,color:#eaf2ff; + classDef job fill:#3a2f5f,stroke:#b39dff,color:#f3eeff; + classDef secret fill:#5f3a2f,stroke:#ffb38a,color:#fff1e8; + classDef danger fill:#5f1f2f,stroke:#ff8a9d,color:#ffe8ec; + class push trigger; + class auth,tofu,init job; + class readvault secret; + class apply,state danger; +``` + +1. A **push**, **pull_request**, or **workflow_dispatch** event matching the `iac/**` or `postgres/**` path globs starts the workflow. +2. Job **`gitea_vault_auth`** runs `base64 -d | bash` on the `vault_oauth__sh_b64` secret to obtain a Gitea OIDC `id_token`, published as the `gitea_vault_jwt` output. +3. Job **`tofu`** (gated by `needs: gitea_vault_auth`) starts on `ubuntu-latest` with `OPENTOFU_VERSION 1.8.2` and `TERRAFORM_VAULT_AUTH_JWT` set to that output. +4. The **read vault secret** step exchanges the JWT (role `gitea_cicd`, path `gitea_jwt`) for the workflow's secrets and exports them as env vars (`GOOGLE_CREDENTIALS` / `GOOGLE_BACKEND_CREDENTIALS`, `GITEA_TOKEN`, `CLOUDFLARE_API_TOKEN`, `OVH_*`, or `TF_VAR_postgres_*`). +5. **`tofu init`** configures the GCS backend, binding the working dir to its state prefix using the Google credentials just read. +6. **`dflook/terraform-apply@v1`** runs against `iac` (or `postgres/iac`) with `auto_approve: true` — no plan-gate. +7. The **state** in GCS is updated and the real cloud + homelab resources are mutated to match the committed code. + +## Related pages + +- [factory iac](factory-iac.md) — what the `iac/` stack provisions (the `IAC` workflow's target). +- [postgres iac](postgres-iac.md) — the `postgres/iac/` database stack (the `Postgres` workflow's target). +- [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) — the `gitea_cicd` role, OIDC trust, and KV mounts behind every secret read here. 
+- [ADR-0001 · Safe prod-like environment](../../../ADR/0001-safe-prod-like-environment.md) — the sandbox lane runs the same tofu plan-only against a `sandbox/` state prefix and a throwaway zone. diff --git a/vibe/guidebooks/factory-provisioning/opentofu/factory-iac.md b/vibe/guidebooks/factory-provisioning/opentofu/factory-iac.md new file mode 100644 index 0000000..ba73c6a --- /dev/null +++ b/vibe/guidebooks/factory-provisioning/opentofu/factory-iac.md @@ -0,0 +1,148 @@ +[vibe](../../../README.md) > [Guidebooks](../../README.md) > [Factory provisioning](../README.md) > [OpenTofu](README.md) > **factory iac** + +# factory iac — the `iac/` state root + +> [!NOTE] +> **Status:** ✅ active · **Last Updated:** 2026-06-23 +> **Code:** [`iac/`](../../../../iac) · **State backend:** `gs://arcodange-tf/factory/main` ([`iac/backend.tf`](../../../../iac/backend.tf)) +> **Upstream:** [OpenTofu hub](README.md) · [Factory provisioning hub](../README.md) · [Lab ecosystem · 01 factory](../../lab-ecosystem/01-factory.md) +> **Related:** [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) · [Storage & recovery](../../lab-ecosystem/storage-and-recovery.md) · [Naming conventions](../../lab-ecosystem/naming-conventions.md) · [CI apply flow](ci-apply-flow.md) · [postgres iac](postgres-iac.md) · [ADR-0001 safe prod-like environment](../../../ADR/0001-safe-prod-like-environment.md) + +The `iac/` root provisions everything that lives **outside** the K3s cluster: the Cloudflare R2 bucket that holds the *cms* repo's own OpenTofu state, the per-service Cloudflare and OVH API tokens consumed by the [cms](https://gitea.arcodange.lab/arcodange-org/cms) repo, a restricted Gitea CI user for reading private module repos, and the GCS bucket that backs up Longhorn volumes. Credentials consumed by the cms repo are written **both** to a Gitea Actions secret (where the consuming workflow expects it) **and** to a Vault path (the durable source of truth; see [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md)); the remaining secrets (the R2 token, the CI user's SSH keypair, the Longhorn HMAC key) are written to Vault only. 
+ +This root's state lives at `gs://arcodange-tf/factory/main` and is applied by [`.gitea/workflows/iac.yaml`](../../../../.gitea/workflows/iac.yaml) on any change under `iac/**` — see [CI apply flow](ci-apply-flow.md) for the job-by-job walkthrough. + +--- + +## Providers + +Declared in [`iac/providers.tf`](../../../../iac/providers.tf). + +| Provider | Source | Version | Endpoint / scope | Auth | +| --- | --- | --- | --- | --- | +| `gitea` | `go-gitea/gitea` | `0.6.0` | `https://gitea.arcodange.lab` | `GITEA_TOKEN` env var | +| `vault` | `vault` | `4.4.0` | `https://vault.arcodange.lab` | JWT login — mount `gitea_jwt`, role `gitea_cicd` | +| `google` | `google` | `7.0.1` | project `arcodange`, region `US-EAST1` | `GOOGLE_CREDENTIALS` env var | +| `cloudflare` | `cloudflare/cloudflare` | `~> 5` | DNS / Pages / R2 / IAM | `CLOUDFLARE_API_TOKEN` env var | +| `ovh` | `ovh/ovh` | `2.8.0` | endpoint `ovh-eu` | `OVH_APPLICATION_KEY` / `OVH_APPLICATION_SECRET` / `OVH_CONSUMER_KEY` | + +> [!NOTE] +> The Cloudflare account ID is **not** hard-coded — it is resolved at plan time from `data.cloudflare_account.arcodange` filtered on the account name `arcodange@gmail.com` ([`iac/cloudflare.tf`](../../../../iac/cloudflare.tf)) and exposed as `local.cloudflare_account_id`. + +--- + +## Cloudflare — R2 backend bucket & service tokens + +Defined in [`iac/cloudflare.tf`](../../../../iac/cloudflare.tf). Two tokens are minted through the [`modules/cloudflare_token`](#the-cloudflare_token-module) mechanism: one scoped to the R2 state bucket, one broad token handed to the cms repo. 
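As a rough model of what the `cloudflare_token` module does with the permission names a caller passes in, the sketch below resolves `scope:name` keys against a permission map, fails loudly on unknown names, and derives S3-compatible R2 credentials from a token. It is Python rather than the module's HCL; the helper names, the ids, and the sample values are placeholders, not the module's actual code.

```python
import hashlib


def resolve_permissions(permission_map: dict[str, str], scope: str, names: list[str]) -> list[str]:
    """Translate human-readable permission names into ids (illustrative helper)."""
    keys = [f"{scope}:{name}" for name in names]
    missing = [key for key in keys if key not in permission_map]
    if missing:
        # Stand-in for the module's `precondition`, which aborts the apply.
        raise ValueError(f"Permissions introuvables: {missing}")
    return [permission_map[key] for key in keys]


def r2_credentials(token_id: str, token_value: str) -> dict[str, str]:
    """Derive R2 S3-style credentials: key id = token id, secret = SHA-256 of the token value."""
    return {
        "access_key_id": token_id,
        "secret_access_key": hashlib.sha256(token_value.encode()).hexdigest(),
    }


if __name__ == "__main__":
    # Placeholder ids; real permission-group ids come from the Cloudflare API.
    sample_map = {"account:Pages Write": "id-pages-write"}
    print(resolve_permissions(sample_map, "account", ["Pages Write"]))
    print(sorted(r2_credentials("token-id", "token-value")))
```

Keying on names rather than ids is what lets callers stay readable while the lookup absorbs changes in the underlying permission-group ids.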
+ +| Resource | Type | Identity / scope | Secret destination | +| --- | --- | --- | --- | +| `cloudflare_r2_bucket.arcodange_tf` | R2 bucket | name `arcodange-tf`, jurisdiction `eu` | — (holds the *cms* repo's own OpenTofu state) | +| `module.cf_r2_arcodange_tf_token` | module → `cloudflare_account_token` | account: `Workers R2 Storage Read`, `Account Settings Read`; bucket: `Workers R2 Storage Bucket Item Write` | `vault_kv_secret.cf_r2_arcodange_tf` → `kvv1/cloudflare/r2/arcodange-tf` (S3 access key, secret, `https://.eu.r2.cloudflarestorage.com` endpoint) | +| `vault_policy.cf_r2_arcodange_tf` | Vault policy | name `factory__cf_r2_arcodange_tf` | read on `kvv1/cloudflare/r2/arcodange-tf` **and** `kvv1/zoho/self_client` (the Zoho mail client is created manually) | +| `module.cf_arcodange_cms_token` | module → `cloudflare_account_token` | account-scope: `Pages Write`, `Account DNS Settings Write`, `Account Settings Read`, `Zone Write`, `Zone Settings Write`, `DNS Write`, `Cloudflare Tunnel Write`, `Turnstile Sites Write` | Gitea secrets `CLOUDFLARE_API_TOKEN` + `CLOUDFLARE_ACCOUNT_ID` on the `cms` repo; Vault `kvv1/cloudflare/cms/cf_arcodange_cms_token` | + +The `cms` repo (`data.gitea_repo.cms`, owner `arcodange-org`) receives the broad token because it manages the public site end to end: Cloudflare Pages deploys, DNS records, zone settings, the Tunnel, and Turnstile. + +> [!CAUTION] +> Both tokens are minted with **`expires_on = null`** — they never expire. A leaked `cf_arcodange_cms_token` grants standing DNS/Pages/Tunnel/Turnstile write on the whole account until manually revoked. There is no automatic rotation; rotation means tainting the module's `cloudflare_account_token` and re-applying. + +--- + +## OVH — OAuth2 client for the cms domain + +Defined in [`iac/ovh.tf`](../../../../iac/ovh.tf). A `CLIENT_CREDENTIALS` OAuth2 client lets the cms workflow edit DNS nameservers for `arcodange.fr`, constrained by an IAM policy. 
+ +| Resource | Type | Scope | +| --- | --- | --- | +| `ovh_me_api_oauth2_client.cms` | OAuth2 client | name `cms repo`, flow `CLIENT_CREDENTIALS` — "arcodange.fr management" | +| `ovh_iam_policy.cms` | IAM policy | name `cms_manager`; identity = the OAuth2 client; resources = account URN + `urn:v1:eu:resource:domain:arcodange.fr`; allow = a handful of `me/*` reads, all domain **READ** reference-actions (computed via `data.ovh_iam_reference_actions.domain`), plus `domain:apiovh:nameServer/edit` | +| `gitea_repository_actions_secret.ovh_cms_client_id` | Gitea secret | `OVH_CLIENT_ID` on the `cms` repo | +| `gitea_repository_actions_secret.ovh_cms_client_secret` | Gitea secret | `OVH_CLIENT_SECRET` on the `cms` repo | +| `vault_kv_secret.ovh_cms_token` | Vault secret | `kvv1/ovh/cms/app` — `client_id`, `client_secret`, `urn` | + +> [!NOTE] +> The write surface is deliberately narrow: the policy grants **only** `nameServer/edit` for writes; everything else is read-only. This lets the cms pipeline point `arcodange.fr` at Cloudflare nameservers without exposing the broader OVH account. + +--- + +## Gitea — restricted CI module-reader user + +Defined in [`iac/gitea_tofu_ci_user.tf`](../../../../iac/gitea_tofu_ci_user.tf). A locked-down Gitea account whose SSH key lets CI clone private Terraform module repos without exposing a privileged token. 
+ +| Resource | Type | Notes | +| --- | --- | --- | +| `random_password.tofu` | password | length 32 — the user's login password | +| `gitea_user.tofu` | Gitea user | username `tofu_module_reader`, email `tofu-module-reader@arcodange.fake`, `restricted = true`, `visibility = private`, `prohibit_login = false` | +| `tls_private_key.tofu` | keypair | algorithm **ED25519** | +| `gitea_public_key.tofu` | SSH key | public half attached to `tofu_module_reader` | +| `vault_kv_secret.gitea_admin_token` | Vault secret | `kvv1/gitea/tofu_module_reader` — `ssh_private_key` + `ssh_public_key` | + +> [!NOTE] +> Despite the Terraform resource name `gitea_admin_token`, the stored payload is the **SSH keypair**, not an admin token. The user is `restricted`, so it can only read repos it is explicitly granted access to. + +--- + +## Google / GCS — Longhorn backup target + +Defined in [`iac/gcs_backup.tf`](../../../../iac/gcs_backup.tf). A GCS bucket plus an HMAC key wired into Vault so the in-cluster Longhorn controller can pull S3-compatible backup credentials. See [Storage & recovery](../../lab-ecosystem/storage-and-recovery.md) for how this fits the cluster-recovery story. 
+ +| Resource | Type | Value | +| --- | --- | --- | +| `google_storage_bucket.longhorn_backup` | GCS bucket | name `arcodange-backup`, location `NAM4` (dual-region), `force_destroy = true`, `public_access_prevention = enforced` | +| `google_service_account.longhorn_backup` | service account | account_id `longhorn-backup` | +| `google_storage_bucket_iam_member.longhorn_backup` | IAM binding | `roles/storage.admin` on the bucket, member = the SA | +| `google_storage_hmac_key.longhorn_backup` | HMAC key | S3-compatible access_id + secret for that SA | +| `vault_kv_secret_v2.longhorn_gcs_backup` | Vault **KVv2** secret | mount `kvv2`, name `longhorn/gcs-backup`, `cas = 1`, `delete_all_versions = true` — `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_ENDPOINTS = https://storage.googleapis.com` | +| `vault_policy.longhorn_gcs_backup` | Vault policy | name `longhorn-gcs-backup` — read on `kvv2/data/longhorn/gcs-backup` | +| `vault_kubernetes_auth_backend_role.longhorn` | Vault k8s auth role | role `longhorn`, bound SA `longhorn-vault-secret-reader` in namespace `longhorn-system`, audience `vault`, policy `longhorn-gcs-backup` | + +The bound service-account name `longhorn-vault-secret-reader` must match the `VaultAuth` manifest in-cluster — that's the handshake that lets Longhorn read the HMAC creds at runtime. + +> [!WARNING] +> The HMAC key is an **S3-compatible** credential and is weaker than a native GCS service-account key: it is a long-lived static secret with no key rotation built into this config, and `roles/storage.admin` grants full read/write/delete on the backup bucket. Combined with `force_destroy = true`, a state operation that destroys `arcodange-backup` will delete every Longhorn backup without prompting. Treat this bucket as critical and irreplaceable infrastructure. + +--- + +## The `cloudflare_token` module + +Source: [`iac/modules/cloudflare_token/`](../../../../iac/modules/cloudflare_token). 
This local module turns **human-readable permission names** into a working Cloudflare account token, so callers never hard-code permission-group UUIDs. + +How it works ([`main.tf`](../../../../iac/modules/cloudflare_token/main.tf)): + +1. It reads **all** available permission groups via `data.cloudflare_account_api_token_permission_groups_list`, then builds `local.permission_map`: `":" => id` (e.g. `"account:Pages Write" => `), keyed by the last dotted segment of the group's scope. +2. Caller-supplied names (`var.permissions.account` / `var.permissions.bucket`) are looked up against that map; any name with no match lands in `local.missing_permissions` and trips a **`precondition`** that fails the apply with a clear "Permissions introuvables" error. +3. Policies are assembled dynamically — an `account` policy targeting `com.cloudflare.api.account.` and, if `var.bucket` is set, a `bucket` policy targeting `com.cloudflare.edge.r2.bucket.__`. +4. The `cloudflare_account_token.token` resource sets `expires_on = null` and **ignores** drift on `expires_on` and `policies` (the upstream permission IDs are unstable). Instead, a `null_resource.cloudflare_account_token_replace` hashes the **sorted permission names** into its triggers, and `replace_triggered_by` forces a fresh token whenever the *names* change — surviving id churn while still rotating on a real permission change. +5. Outputs ([`outputs.tf`](../../../../iac/modules/cloudflare_token/outputs.tf)): `token` (sensitive), `token_id`, `token_sha256`, and — when `var.bucket` is set — `r2_credentials` mapping `access_key_id = token.id` and `secret_access_key = sha256(token.value)` for S3-compatible R2 access. + +--- + +## Vault layout: mixed KVv1 / KVv2 + +This root writes to **both** KV engines, which is easy to trip over. 
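To make the difference concrete, the sketch below computes the path a read policy has to name for a secret on each engine. It is a minimal illustration in Python, and the helper name `policy_path` is an assumption rather than anything defined in this repo.

```python
def policy_path(mount: str, name: str, kv_version: int) -> str:
    """Return the path a Vault read policy must name for a secret (illustrative helper)."""
    if kv_version == 2:
        # KV v2 serves secret data under an extra `data/` segment after the mount.
        return f"{mount}/data/{name}"
    if kv_version == 1:
        # KV v1 policies name the literal secret path.
        return f"{mount}/{name}"
    raise ValueError(f"unsupported KV version: {kv_version}")


if __name__ == "__main__":
    print(policy_path("kvv2", "longhorn/gcs-backup", 2))
    print(policy_path("kvv1", "ovh/cms/app", 1))
```

A policy written for one engine and pasted onto a secret in the other names a path that never matches, which is why it grants nothing.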
+ +| Path | Engine | Written by | +| --- | --- | --- | +| `kvv1/cloudflare/r2/arcodange-tf` | KVv1 (`vault_kv_secret`) | R2 backend token | +| `kvv1/cloudflare/cms/cf_arcodange_cms_token` | KVv1 | cms Cloudflare token | +| `kvv1/ovh/cms/app` | KVv1 | OVH OAuth2 client | +| `kvv1/gitea/tofu_module_reader` | KVv1 | CI user SSH key | +| `kvv2/longhorn/gcs-backup` | KVv2 (`vault_kv_secret_v2`) | Longhorn GCS HMAC | + +> [!WARNING] +> Most secrets here use the **KVv1** engine (`vault_kv_secret`), but the Longhorn backup secret uses **KVv2** (`vault_kv_secret_v2`). The policy paths differ accordingly — KVv2 reads target `kvv2/data/longhorn/gcs-backup` (note the `/data/` segment), whereas KVv1 policies read the literal path. Mixing the two engines means a policy copied from one secret to another will silently grant nothing. See [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) for the engine-level design. + +--- + +## Outputs + +The root exposes a single top-level `output "token"` (sensitive) = the cms Cloudflare token ([`iac/cloudflare.tf`](../../../../iac/cloudflare.tf)). Everything else is delivered side-effect-style into Gitea secrets and Vault paths rather than as Terraform outputs. + +--- + +## See also + +- [CI apply flow](ci-apply-flow.md) — how `iac/**` changes reach `gs://arcodange-tf/factory/main` via the Vault-JWT exchange and auto-approve apply. +- [postgres iac](postgres-iac.md) — the sibling root that provisions in-cluster PostgreSQL. +- [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) · [Storage & recovery](../../lab-ecosystem/storage-and-recovery.md) · [Naming conventions](../../lab-ecosystem/naming-conventions.md). 
diff --git a/vibe/guidebooks/factory-provisioning/opentofu/postgres-iac.md b/vibe/guidebooks/factory-provisioning/opentofu/postgres-iac.md
new file mode 100644
index 0000000..6935824
--- /dev/null
+++ b/vibe/guidebooks/factory-provisioning/opentofu/postgres-iac.md
@@ -0,0 +1,116 @@
+[vibe](../../../README.md) > [Guidebooks](../../README.md) > [Factory provisioning](../README.md) > [OpenTofu](README.md) > **postgres iac**
+
+# postgres iac — the `postgres/iac/` state root
+
+> [!NOTE]
+> **Status:** ✅ active · **Last Updated:** 2026-06-23
+> **Code:** [`postgres/iac/`](../../../../postgres/iac) · **State backend:** `gs://arcodange-tf/factory/postgres` ([`postgres/iac/backend.tf`](../../../../postgres/iac/backend.tf))
+> **Upstream:** [OpenTofu hub](README.md) · [Factory provisioning hub](../README.md) · [Lab ecosystem · 01 factory](../../lab-ecosystem/01-factory.md)
+> **Related:** [Naming conventions](../../lab-ecosystem/naming-conventions.md) · [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) · [CI apply flow](ci-apply-flow.md) · [factory iac](factory-iac.md) · [ADR-0001 safe prod-like environment](../../../ADR/0001-safe-prod-like-environment.md)
+
+The `postgres/iac/` root provisions **PostgreSQL roles, databases, and the pgbouncer auth function** on the live cluster database — one strand of the per-application `` join key described in [Naming conventions](../../lab-ecosystem/naming-conventions.md). For each application it creates a non-login owner role, an `` database owned by that role, and a `user_lookup()` function that lets PgBouncer authenticate against `pg_shadow`. A single `credentials_editor` login role (whose password is stored in Vault) is granted admin over every per-app role so that downstream tooling can mint application credentials without superuser rights.
+
+This root's state lives at `gs://arcodange-tf/factory/postgres` and is applied by [`.gitea/workflows/postgres.yaml`](../../../../.gitea/workflows/postgres.yaml) on any change under `postgres/**` — see [CI apply flow](ci-apply-flow.md).
+
+> [!CAUTION]
+> This root runs as a **PostgreSQL superuser** ([`postgres/iac/providers.tf`](../../../../postgres/iac/providers.tf): `superuser = true`) pinned to the live database at **`192.168.1.202`** (pi2) **through PgBouncer**, with `sslmode = disable`. The provider can therefore **drop or alter live application databases** — an errant `terraform destroy` or a renamed `applications` entry will delete real data. And because the only route to Postgres is via PgBouncer on that host, **if PgBouncer is down OpenTofu cannot connect and no apply can run.** Treat every `postgres/**` merge as a production database change ([ADR-0001](../../../ADR/0001-safe-prod-like-environment.md)).
+
+---
+
+## Providers
+
+Declared in [`postgres/iac/providers.tf`](../../../../postgres/iac/providers.tf).
+
+| Provider | Source | Version | Connection | Auth |
+| --- | --- | --- | --- | --- |
+| `postgresql` | `cyrilgdn/postgresql` | `1.24.0` | host `192.168.1.202` (pi2), via PgBouncer, `sslmode = disable`, `superuser = true` | `var.POSTGRES_USERNAME` / `var.POSTGRES_PASSWORD` (TF vars from `TF_VAR_POSTGRES_*`, sourced from Vault in CI) |
+| `vault` | `vault` | `4.4.0` | `https://vault.arcodange.lab` | JWT login — mount `gitea_jwt`, role `gitea_cicd` |
+
+The two `POSTGRES_*` variables are declared `sensitive` in the same file; CI populates them from Vault as `TF_VAR_POSTGRES_USERNAME` / `TF_VAR_POSTGRES_PASSWORD` (see [CI apply flow](ci-apply-flow.md)).
+
+---
+
+## The application set
+
+Everything in this root fans out over one variable. `var.applications` is a `set(string)` ([`variables.tf`](../../../../postgres/iac/variables.tf)) whose members are listed in [`terraform.tfvars`](../../../../postgres/iac/terraform.tfvars):
+
+| `applications` member |
+| --- |
+| `webapp` |
+| `erp` |
+| `crowdsec` |
+| `plausible` |
+| `dance-lessons-coach` |
+
+Adding an app to that list creates a full role + database + lookup-function bundle on the next apply; **removing** one would `DROP` the live database (see the caution above).
+
+---
+
+## The `credentials_editor` role
+
+Defined in [`postgres/iac/main.tf`](../../../../postgres/iac/main.tf). A single login role, granted admin over every per-app role, whose credentials downstream tooling uses to provision application logins.
+
+| Resource | Type | Detail |
+| --- | --- | --- |
+| `random_password.credentials_editor` | password | length 24, `override_special = "-:!+<>"` |
+| `postgresql_role.credentials_editor` | role | `login = true`, `create_role = true`; `lifecycle { ignore_changes = [roles] }` so its grant membership isn't reverted |
+| `vault_kv_secret.postgres_admin_credentials` | Vault **KVv1** secret | `kvv1/postgres/credentials_editor/credentials` — `username` + `password` |
+
+---
+
+## Per-application resources
+
+For each member of `var.applications`, `main.tf` creates the following (all `for_each` over the set):
+
+| Resource | Type | What it creates |
+| --- | --- | --- |
+| `postgresql_role.app_role[""]` | role | non-login role `_role` (`login = false`) — owns the database |
+| `postgresql_grant_role.credentials_editor_app_role[""]` | grant | `credentials_editor` → `_role` **WITH ADMIN OPTION** |
+| `postgresql_database.app_db[""]` | database | database ``, owner `_role`, `template = template0`, `alter_object_ownership = true` |
+| `postgresql_function.pgbouncer_user_lookup[""]` | function | `user_lookup(i_username text)` in db `` — see below |
+| `postgresql_grant.pgbouncer_user_lookup_public_revoke[""]` | grant | revoke (empty `privileges`) of `user_lookup` from role `public` in schema `public` |
+| `postgresql_grant.pgbouncer_user_lookup[""]` | grant | `EXECUTE` on `user_lookup` to role `pgbouncer_auth`; `depends_on` the public-revoke (the two grants can't run in parallel) |
+
+So `webapp` yields role `webapp_role`, database `webapp`, function `webapp.user_lookup`, and the matching grants; likewise for `erp`, `crowdsec`, `plausible`, and `dance-lessons-coach`.
+
+### The pgbouncer `user_lookup()` function
+
+`postgresql_function.pgbouncer_user_lookup` defines a `plpgsql` function with **`security_definer = true`** and `parallel = "SAFE"`. It takes `i_username` (IN, text) and returns a record of `uname` + `phash`:
+
+```sql
+BEGIN
+  SELECT usename, passwd FROM pg_catalog.pg_shadow
+    WHERE usename = i_username INTO uname, phash;
+  RETURN;
+END;
+```
+
+PgBouncer's `auth_query` calls this to fetch the stored password hash. Because reading `pg_shadow` is privileged, the function is `SECURITY DEFINER` (runs as its owner). Access is locked down in two steps: first **revoke** the default `public` execute grant, then **grant** `EXECUTE` only to the `pgbouncer_auth` role — the `pgbouncer_auth` role itself is expected to already exist on the server (it is not created by this root).
+
+> [!NOTE]
+> The two grants are ordered with an explicit `depends_on`: `postgresql_grant.pgbouncer_user_lookup` waits for `postgresql_grant.pgbouncer_user_lookup_public_revoke` because the provider can't apply both grants on the same object concurrently.
+
+---
+
+## Vault layout
+
+This root writes a single KVv1 secret.
+
+| Path | Engine | Contents |
+| --- | --- | --- |
+| `kvv1/postgres/credentials_editor/credentials` | KVv1 (`vault_kv_secret`) | `username`, `password` of the `credentials_editor` login role |
+
+---
+
+## No outputs
+
+There is **no `outputs.tf`** in this root. Nothing is exported as a Terraform output — the `credentials_editor` credentials are delivered into Vault, and the per-app roles/databases/functions are side effects on the live server. Consumers read the credentials from `kvv1/postgres/credentials_editor/credentials`, not from state outputs.
+
+---
+
+## See also
+
+- [Naming conventions](../../lab-ecosystem/naming-conventions.md) — the `` databases here are one strand of the per-application `` join key (alongside namespaces, Vault paths, and repos).
+- [CI apply flow](ci-apply-flow.md) — how `postgres/**` changes reach `gs://arcodange-tf/factory/postgres` and where `TF_VAR_POSTGRES_*` come from.
+- [factory iac](factory-iac.md) — the sibling root for everything outside the cluster.
+- [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md).
diff --git a/vibe/guidebooks/lab-ecosystem/01-factory.md b/vibe/guidebooks/lab-ecosystem/01-factory.md
index 79e54cd..a8cc875 100644
--- a/vibe/guidebooks/lab-ecosystem/01-factory.md
+++ b/vibe/guidebooks/lab-ecosystem/01-factory.md
@@ -5,6 +5,7 @@
 > **Status:** ✅ Active
 > **Last Updated:** 2026-06-23
 > **Downstream:** [02 · tools](02-tools.md) · [03 · cms](03-cms.md)
+> **Deeper dive:** [Factory provisioning guidebook](../factory-provisioning/README.md) — page-by-page walkthrough of the Ansible playbooks/roles and OpenTofu modules summarized here
 > **Related:** [naming-conventions.md](naming-conventions.md) · [secrets-and-vault.md](secrets-and-vault.md) · [storage-and-recovery.md](storage-and-recovery.md)
 
 `factory` is the **cornerstone admin repo**: it provisions the hosts and the cluster, declares what gets deployed, and owns the platform-level cloud/Gitea/Vault/Postgres state that every app leans on. It has four pillars — **Ansible** (imperative host & cluster setup), **ArgoCD** (declarative app-of-apps), **`iac/`** (OpenTofu for the cloud/Gitea/Vault edge), and **`postgres/iac/`** (per-app PostgreSQL provisioning). The repos `tools` and `cms` are deployed *by* factory's ArgoCD and are mapped in [02 · tools](02-tools.md) and [03 · cms](03-cms.md).