docs(vibe): add factory-provisioning guidebook (Ansible + OpenTofu)

Deep, code-grounded tree-docs guidebook under vibe/guidebooks/factory-provisioning/,
explored from the actual playbooks/roles and tofu code:

- Hub: the two provisioning engines (operator-run Ansible vs CI-applied OpenTofu),
  a green-field bring-up flow, master index, maintenance rule.
- ansible/ sub-tree: ordered pages 01-system .. 06-recover, an inventory & variables
  concept page, and a Tier-1/Tier-2 roles reference (hashicorp_vault, step_ca,
  crowdsec, pihole, deploy_docker_compose + the gitea_* family and helpers).
- opentofu/ sub-tree: factory-iac (Cloudflare/OVH/GCP/Gitea/Vault edge +
  cloudflare_token module), postgres-iac (per-app DB/role/pgbouncer lookup),
  ci-apply-flow (Gitea OIDC-JWT -> Vault -> auto-approve apply).

Cross-linked bidirectionally with the lab-ecosystem guidebook and the safe-env
ADR/PRD (the sandbox rehearses exactly these engines). 14 mermaid diagrams
MCP-validated; zero dead links. Authored by the Lab Cartographer cohort.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 21:11:51 +02:00
parent b886f06824
commit dbe32161dc
16 changed files with 1571 additions and 0 deletions

@@ -0,0 +1,82 @@
[vibe](../../../README.md) > [Guidebooks](../../README.md) > [Factory provisioning](../README.md) > [Ansible](README.md) > **02 · Setup**
# 02 · Setup — Postgres, Gitea, NFS backup target
> [!NOTE]
> **Status:** ✅ active · **Last Updated:** 2026-06-23
> **Upstream:** [Ansible sub-hub](README.md) · [01 · System](01-system.md)
> **Downstream:** [03 · CI/CD](03-cicd.md)
> **Related:** [Inventory & variables](inventory.md) · [Roles reference](roles.md) · [Storage & recovery](../../lab-ecosystem/storage-and-recovery.md) · [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md)
## What it does
`02 · Setup` deploys the **stateful services the rest of the platform leans on**: a PostgreSQL server and a Gitea instance — both running as **Docker Compose stacks on `pi2`, outside K3s** — plus the in-cluster NFS backup target. The wrapper [`playbooks/02_setup.yml`](../../../../ansible/arcodange/factory/playbooks/02_setup.yml) imports [`playbooks/setup/setup.yml`](../../../../ansible/arcodange/factory/playbooks/setup/setup.yml), which pings the Pis, then imports three sub-playbooks: `backup_nfs.yml` (tagged `never`), `postgres.yml`, and `gitea.yml`.
> [!IMPORTANT]
> **Postgres and Gitea do not run in Kubernetes.** They are Docker Compose stacks on `pi2` (the sole member of the `postgres` group, which `gitea` inherits as a child — see [Inventory & variables](inventory.md)). K3s only references them: Traefik exposes Gitea via an `ExternalName` Service, and the `pg-fix-table-ownership` CronJob reaches Postgres over the LAN. This keeps the two services available even when the cluster is being rebuilt.
## Ordered steps
| # | Sub-playbook | Purpose | Key vars / versions |
| --- | --- | --- | --- |
| 1 | [`setup/backup_nfs.yml`](../../../../ansible/arcodange/factory/playbooks/setup/backup_nfs.yml) | Provision the shared backup volume: a **Longhorn RWX PVC `backups-rwx` (50Gi)**, a Longhorn `RecurringJob`, a `busybox` deploy to spawn the share-manager, then mount the resulting NFS share at `/mnt/backups` on every Pi. | `tags: never`; `backup_size: 50Gi`, RecurringJob `thrice-a-month-backup` (`cron 0 5 */2 * *`, retain 2) |
| 2 | [`setup/postgres.yml`](../../../../ansible/arcodange/factory/playbooks/setup/postgres.yml) | Deploy the Postgres Compose stack (`deploy_docker_compose` + `deploy_postgresql` role), create the `gitea` DB/user, create the **pgbouncer auth_user + `user_lookup()` functions** in both `postgres` and `gitea` DBs, publish the K8s Secret `postgres-admin-credentials`, and install the **`pg-fix-table-ownership` CronJob**. | **Postgres `16.3-alpine`**; container `postgres`; CronJob daily `0 3 * * *` |
| 3 | [`setup/gitea.yml`](../../../../ansible/arcodange/factory/playbooks/setup/gitea.yml) | Deploy the Gitea Compose stack (`deploy_docker_compose` + `deploy_gitea` role), create admin `arcodange`, mint an API token via `gitea_token`, upload the avatar, register the SSH key, create org `arcodange-org`, then **delete the temp token**. | **Gitea `1.25.5`**; base URL `http://pi2:3000` |
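
The two schedules in the table are ordinary five-field cron expressions. As a rough illustration (not code from the repository), the sketch below expands each field so the firing days become explicit. `expand_field` and `describe` are hypothetical helper names, and the parser only handles `*`, `*/n`, single values, ranges, and comma lists.

```python
def expand_field(field: str, lo: int, hi: int) -> list[int]:
    """Expand one cron field into the sorted list of values it matches."""
    values: set[int] = set()
    for part in field.split(","):
        base, _, step = part.partition("/")
        stride = int(step) if step else 1
        if base == "*":
            start, end = lo, hi
        elif "-" in base:
            first, last = base.split("-")
            start, end = int(first), int(last)
        else:
            start = int(base)
            # "5/2" means "from 5 to the field maximum, every 2"; plain "5" is just 5.
            end = hi if step else start
        values.update(range(start, end + 1, stride))
    return sorted(values)


def describe(expr: str) -> dict[str, list[int]]:
    """Split a five-field cron expression and expand every field."""
    minute, hour, day_of_month, month, day_of_week = expr.split()
    return {
        "minute": expand_field(minute, 0, 59),
        "hour": expand_field(hour, 0, 23),
        "day_of_month": expand_field(day_of_month, 1, 31),
        "month": expand_field(month, 1, 12),
        "day_of_week": expand_field(day_of_week, 0, 6),
    }


if __name__ == "__main__":
    for expr in ("0 5 */2 * *", "0 3 * * *"):
        print(expr, "->", describe(expr)["day_of_month"])
```

With these definitions, `describe("0 5 */2 * *")["day_of_month"]` yields the odd days 1 through 31, so the RecurringJob expression matches 16 days in a 31-day month at 05:00, while `0 3 * * *` matches every day at 03:00.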
## NFS backup target — how the share is born
```mermaid
%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'13px'}}}%%
flowchart TD
classDef cluster fill:#1e4032,stroke:#22c55e,color:#f0fdf4;
classDef host fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb;
pvc["RWX PVC backups-rwx (50Gi)<br>longhorn-system"]:::cluster
rj["RecurringJob thrice-a-month-backup<br>cron 0 5 */2 * *"]:::cluster
dep["busybox Deployment rwx-nfs<br>mounts the PVC"]:::cluster
sm["Longhorn share-manager<br>(spawned by the mount)"]:::cluster
svc["Service nfs-backups-rwx<br>ClusterIP :2049"]:::cluster
mount["mount /mnt/backups on pi1/pi2/pi3<br>NFS vers=4.1"]:::host
pvc --> rj
pvc --> dep --> sm --> svc --> mount
```
1. A **ReadWriteMany Longhorn PVC** (`backups-rwx`, 50Gi) is created in `longhorn-system`.
2. A **`RecurringJob`** is attached to the volume so Longhorn snapshots/backs it up on the `0 5 */2 * *` schedule, which fires at 05:00 on every second day of the month (days 1, 3, 5, and so on).
3. A **`busybox` Deployment (`rwx-nfs`)** mounts the PVC — the act of mounting an RWX volume makes Longhorn spawn an **NFS share-manager** pod.
4. A stable **ClusterIP Service** (`nfs-backups-rwx`, port 2049) is created (or reused) to front the share-manager.
5. Each Pi installs `nfs-common` and **mounts the share at `/mnt/backups`** (`vers=4.1`, `nofail`, `x-systemd.automount`), persisted in `fstab`.
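
Step 5 persists the mount in `fstab`. The following is a minimal sketch of how such an entry can be assembled from the options listed above. The server address and export path are placeholders (hypothetical values), since the real ones come from the `nfs-backups-rwx` Service and the Longhorn volume, and `fstab_entry` is an illustrative helper, not something the playbook defines.

```python
# Mount options named in the guidebook for the /mnt/backups share.
MOUNT_OPTIONS = ("vers=4.1", "nofail", "x-systemd.automount")


def fstab_entry(server, export, mountpoint="/mnt/backups", options=MOUNT_OPTIONS):
    """Return one fstab line: device, mountpoint, fstype, options, dump, pass."""
    device = f"{server}:{export}"
    return f"{device} {mountpoint} nfs {','.join(options)} 0 0"


if __name__ == "__main__":
    # 192.0.2.10 is a documentation-only address and /example-export is a
    # made-up path; both are assumptions for the sake of the example.
    print(fstab_entry("192.0.2.10", "/example-export"))
```

`nofail` lets a Pi finish booting when the share is unavailable, and `x-systemd.automount` defers the actual mount until the path is first accessed, which suits a share that only exists once the share-manager pod is up.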
## Postgres — what gets created
| Artifact | Where | Purpose |
| --- | --- | --- |
| Compose stack `arcodange_factory` | `pi2` Docker | Runs `postgres:16.3-alpine`, container `postgres`, port `5432`, data under `/home/pi/arcodange/docker_composes/postgres/data`. |
| `gitea` DB + user | inside Postgres | Created by the `deploy_postgresql` role from `applications_databases.gitea` (`gitea_database`). |
| pgbouncer `auth_user` (`pgbouncer_auth`) | `postgres` + `gitea` DBs | Login role used by the [pgbouncer pooler](../../lab-ecosystem/02-tools.md) for SCRAM lookups. |
| `user_lookup(text)` function | `postgres` + `gitea` DBs | `SECURITY DEFINER` function over `pg_shadow`; `EXECUTE` granted only to `pgbouncer_auth`. |
| K8s Secret `postgres-admin-credentials` | `kube-system` | Base64 admin user/password so the in-cluster CronJob can authenticate. |
| CronJob `pg-fix-table-ownership` | `kube-system` | Runs `postgres:16.3` daily at **03:00**; discovers `%_role` roles, derives each database name by stripping the `_role` suffix, and runs `ALTER TABLE ... OWNER TO` on every table in the `public` schema, which repairs ownership after a restore. |
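
The role-to-database derivation used by `pg-fix-table-ownership` can be sketched as follows. This is an illustration of the naming rule only, not the CronJob's actual script: the helper names, the `gitea_role` role, and the table names are assumptions made for the example.

```python
from typing import Optional

ROLE_SUFFIX = "_role"


def database_for_role(role: str) -> Optional[str]:
    """Return the database implied by a '<db>_role' name, or None if it does not fit."""
    if role.endswith(ROLE_SUFFIX) and len(role) > len(ROLE_SUFFIX):
        return role[: -len(ROLE_SUFFIX)]
    return None


def quote_ident(name: str) -> str:
    """Quote an SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def ownership_statements(role: str, tables: list[str]) -> list[str]:
    """Build one ALTER TABLE statement per table in the public schema."""
    return [
        f"ALTER TABLE public.{quote_ident(table)} OWNER TO {quote_ident(role)};"
        for table in tables
    ]


if __name__ == "__main__":
    # Hypothetical role and table names, for illustration only.
    for role in ("gitea_role", "pgbouncer_auth"):
        db = database_for_role(role)
        if db is None:
            print(f"{role}: skipped (does not follow <db>_role)")
            continue
        print(f"-- in database {db}")
        for stmt in ownership_statements(role, ["repository", "issue"]):
            print(stmt)
```

A role such as `pgbouncer_auth` returns `None` and is skipped, which mirrors why a database whose owner does not follow the convention is never repaired by the job.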
## Gitea — bootstrap sequence
1. **Compose deploy** via `deploy_docker_compose`, then the `deploy_gitea` role wires Gitea to the Postgres DB (host/db/user/password pulled from the compose env).
2. **Admin user** `arcodange` (`arcodange@gmail.com`) is created with `--random-password --admin` if absent.
3. **API token** is minted by the `gitea_token` role and used for the next HTTP calls.
4. **Avatar** upload, **SSH public key** registration (idempotent), and **org `arcodange-org`** (full name "Arcodange") creation + avatar.
5. **Cleanup** — a `post_tasks` invocation of `gitea_token` with `gitea_token_delete: true` removes the temporary token.
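
The mint, use, delete lifecycle in steps 3 to 5 resembles a scoped resource. The sketch below models it with a context manager against an in-memory stand-in; `FakeGiteaClient` and `temporary_token` are hypothetical names, and nothing here calls the real Gitea API or the `gitea_token` role.

```python
from contextlib import contextmanager


class FakeGiteaClient:
    """In-memory stand-in for a Gitea server that tracks live API tokens."""

    def __init__(self):
        self.tokens = {}

    def create_token(self, name):
        value = f"secret-for-{name}"
        self.tokens[name] = value
        return value

    def delete_token(self, name):
        self.tokens.pop(name, None)


@contextmanager
def temporary_token(client, name):
    """Mint a token, hand it to the caller, and always delete it afterwards."""
    token = client.create_token(name)
    try:
        yield token
    finally:
        client.delete_token(name)


if __name__ == "__main__":
    client = FakeGiteaClient()
    with temporary_token(client, "bootstrap") as token:
        # The avatar, SSH key, and org calls would use `token` here.
        print("live tokens during bootstrap:", sorted(client.tokens))
    print("live tokens after cleanup:", sorted(client.tokens))
```

One difference is worth keeping in mind: the `finally` clause runs even when the body raises, whereas Ansible `post_tasks` are skipped for a host that failed earlier in the play, so a failed run may leave the temporary token behind.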
## Gotchas
> [!WARNING]
> **The NFS play is `never`-tagged and order-sensitive.** `backup_nfs.yml` only runs when explicitly tagged, and several of its tasks (`Créer PVC RWX`, "create the RWX PVC"; `Lancer un Deployment pour déclencher NFS`, "launch a Deployment to trigger NFS"; `Attendre que le pod rwx-nfs soit Running`, "wait until the rwx-nfs pod is Running") are themselves `tags: never`. The RWX volume must already exist for the busybox Deployment to spawn the share-manager, and running the mount step before the share-manager is `Running` will hang on the `until` retry loop.
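
The ordering constraint can be expressed as a small dependency graph, transcribed from the flowchart earlier on this page. This is a sketch for reasoning about the order; the step names are shorthand chosen for the example, and `respects_dependencies` is a hypothetical helper.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps that must be finished before it.
DEPENDS_ON = {
    "pvc": set(),
    "recurring_job": {"pvc"},
    "busybox_deployment": {"pvc"},
    "share_manager": {"busybox_deployment"},
    "service": {"share_manager"},
    "host_mount": {"service"},
}


def bring_up_order():
    """Return one order in which every step follows its prerequisites."""
    return list(TopologicalSorter(DEPENDS_ON).static_order())


def respects_dependencies(steps):
    """Check that each step appears only after all of its prerequisites."""
    done = set()
    for step in steps:
        if not DEPENDS_ON[step] <= done:
            return False
        done.add(step)
    return True


if __name__ == "__main__":
    print(bring_up_order())
    # Mounting before the share-manager exists violates the graph.
    print(respects_dependencies(["pvc", "host_mount"]))
```

Any sequence that places `host_mount` before `share_manager` and `service` fails the check, which is the situation the warning describes.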
> [!WARNING]
> **Postgres lives on `pi2` outside K3s.** Treat it as a single-host service: there is no Postgres pod to `kubectl get`. The cluster only holds the `postgres-admin-credentials` Secret and the `pg-fix-table-ownership` CronJob; the CronJob uses that Secret to reach the DB over the LAN at `pi2:5432`. A `pi2` outage takes Postgres (and Gitea) down regardless of cluster health.
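
Because there is no pod to inspect, a plain TCP probe is one way to check whether the database port answers. The helper below is a minimal sketch with a hypothetical name (`tcp_reachable`); it only tests that a connection can be opened and says nothing about authentication or database health.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    # Host and port as named in this page; run from a machine on the same LAN.
    print("postgres reachable:", tcp_reachable("pi2", 5432))
```

The same call with port `3000` would probe the Gitea base URL mentioned in the ordered steps.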
> [!CAUTION]
> **`pg-fix-table-ownership` exists because restores break ownership.** After a Longhorn/data recovery, tables can come back owned by the wrong role and apps lose write access. For each PostgreSQL role matching `%_role`, the daily CronJob silently reassigns every table in the `public` schema of the corresponding database to that role. If you add a database whose owning role does **not** follow the `<db>_role` naming convention, this job will not fix it; see [Naming conventions](../../lab-ecosystem/naming-conventions.md).
> [!NOTE]
> **The admin password is random and printed once.** Gitea's admin is created with `--random-password`; capture it from the play output (or reset it via `docker exec`) — it is not stored in the inventory. The bootstrap API token is deliberately deleted at the end, so re-running the play re-mints a fresh one.