factory/vibe/guidebooks/factory-provisioning/ansible/01-system.md
Gabriel Radureau dbe32161dc docs(vibe): add factory-provisioning guidebook (Ansible + OpenTofu)
Deep, code-grounded tree-docs guidebook under vibe/guidebooks/factory-provisioning/,
explored from the actual playbooks/roles and tofu code:

- Hub: the two provisioning engines (operator-run Ansible vs CI-applied OpenTofu),
  a green-field bring-up flow, master index, maintenance rule.
- ansible/ sub-tree: ordered pages 01-system .. 06-recover, an inventory & variables
  concept page, and a Tier-1/Tier-2 roles reference (hashicorp_vault, step_ca,
  crowdsec, pihole, deploy_docker_compose + the gitea_* family and helpers).
- opentofu/ sub-tree: factory-iac (Cloudflare/OVH/GCP/Gitea/Vault edge +
  cloudflare_token module), postgres-iac (per-app DB/role/pgbouncer lookup),
  ci-apply-flow (Gitea OIDC-JWT -> Vault -> auto-approve apply).

Cross-linked bidirectionally with the lab-ecosystem guidebook and the safe-env
ADR/PRD (the sandbox rehearses exactly these engines). 14 mermaid diagrams
MCP-validated; zero dead links. Authored by the Lab Cartographer cohort.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 21:11:51 +02:00

01 · System — base OS, Docker, K3s, Longhorn, DNS, SSL

Note

Status: active · Last Updated: 2026-06-23
Upstream: Ansible sub-hub · Factory provisioning hub
Downstream: 02 · Setup · 03 · CI/CD
Related: Storage & recovery · Secrets & Vault · Naming conventions · ADR-0001 safe prod-like environment

What it does

01 · System takes three bare Raspberry Pis (pi1, pi2, pi3) and turns them into a configured K3s cluster. The wrapper playbooks/01_system.yml does nothing but import_playbook the stage orchestrator playbooks/system/system.yml, which in turn imports ten sub-playbooks in strict order. Each sub-play layers one capability: hostname/DNS hygiene, Pi-hole HA DNS, the step-ca PKI, the external backup disk, Docker, the iSCSI/dm-crypt prerequisites for Longhorn, K3s itself, CoreDNS forwarding, the cert-manager issuer, and finally the cluster config (Longhorn + Traefik).

All host-facing plays target raspberries:&local — the intersection of the raspberries group and the local group, which resolves to pi1/pi2/pi3 (see Inventory & variables). The K3s server/agent split is decided at runtime: the first host (alphabetically) becomes the server, the rest become agents.
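As a rough illustration of the targeting and role split described above, here is a minimal Python sketch. It is not the playbook's code: the group contents beyond pi1/pi2/pi3 and the helper names `select_hosts` and `split_roles` are hypothetical, and it assumes "first host" means first in plain string sort order.

```python
def select_hosts(groups):
    """Model the pattern raspberries:&local as a set intersection."""
    return sorted(set(groups["raspberries"]) & set(groups["local"]))


def split_roles(hosts):
    """First host in sorted order becomes the K3s server, the rest agents."""
    ordered = sorted(hosts)
    return {"server": ordered[:1], "agent": ordered[1:]}


# Hypothetical inventory: only pi1..pi3 come from this page.
groups = {
    "raspberries": ["pi3", "pi1", "pi2", "pi-remote"],
    "local": ["pi2", "pi1", "pi3", "laptop"],
}

targets = select_hosts(groups)
roles = split_roles(targets)
print(targets)
print(roles)
```

With this made-up inventory, hosts that belong to only one of the two groups drop out, and pi1 ends up as the server because it sorts first.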

Ordered steps

| # | Sub-playbook | Purpose | Key vars / versions |
|---|---|---|---|
| 1 | system/rpi.yml | Set each node's hostname to its inventory_hostname. On Pi-hole nodes (pi1/pi3) add dnsmasq to the dip group, then stop & disable dnsmasq to free port 53 for pihole-FTL. | tags: never (opt-in only) |
| 2 | dns/dns.yml<br>dns/pihole.yml | Install & configure Pi-hole HA DNS via the pihole role. Adds custom records mapping .arcodange.lab and .arcodange.duckdns.org to pi1. | pihole_custom_dns<br>pi1.preferred_ip |
| 3 | ssl/ssl.yml<br>ssl/step-ca.yml | Install step-ca (the step_ca role) on all three Pis; fetch the root CA from pi1; build a Gitea runner image that trusts the CA (runner-images:ubuntu-latest-ca) and push it to the registry. | step_ca_primary: pi1, root at /home/step/.step/certs/root_ca.crt |
| 4 | system/prepare_disks.yml | Auto-detect the largest external (non-mmcblk0) USB partition, format it ext4 with label arcodange_500, mount at /mnt/arcodange, and persist in fstab. Skips the format if the label already exists; a pause prompt asks for confirmation before any format. | mount_point: /mnt/arcodange, disk_label: arcodange_500 |
| 5 | system/system_docker.yml | Install Docker via geerlingguy.docker; write daemon.json with json-file logging (max-size 10m, max-file 5) and data-root: /mnt/arcodange/docker (only when the external disk is mounted). | tags: never; storage-driver: overlay2 |
| 6 | system/iscsi_longhorn.yml | Install open-iscsi (+ enable iscsid) and cryptsetup, and load the dm_crypt kernel module (persisted in /etc/modules). These are Longhorn's encrypted-volume prerequisites. Creates /mnt/arcodange/longhorn. | module dm_crypt |
| 7 | system/system_k3s.yml | Build the K3s inventory dynamically (first sorted host → server, rest → agent), install the k3s-ansible content, run k3s.orchestration.site, then fetch the kubeconfig to ~/.kube/config (rewriting 127.0.0.1 → server IP). | k3s v1.34.3+k3s1; server args --docker --disable traefik |
| 8 | system/k3s_dns.yml | Create the coredns-custom ConfigMap so cluster DNS forwards arcodange.lab:53 to the Pi-hole IPs; also patch the main CoreDNS Corefile to forward to the same HA Pi-holes. | pihole_ips (extracted from hostvars) |
| 9 | system/k3s_ssl.yml | Deploy cert-manager + step-issuer as k3s static HelmCharts; create the StepClusterIssuer step-ca wired to the JWK provisioner and root CA. | cert-manager v1.19.2, step-issuer 1.9.11, caUrl: https://ssl-ca.arcodange.lab:8443, ARM64 kube-rbac-proxy override |
| 10 | system/k3s_config.yml | Deploy Longhorn + Traefik as HelmCharts; issue the wildcard cert, set the default TLSStore, wire Gitea, the IP-allow-list middleware, and the CrowdSec bouncer plugin; then delete the old Traefik to force a redeploy. | Longhorn v1.9.1, Traefik v37.4.0 (see detail below) |
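The kubeconfig rewrite in step 7 amounts to a string substitution on the fetched file. The sketch below shows the idea in Python rather than the playbook's actual task; the function name, the trimmed kubeconfig fragment, and the address 192.0.2.10 (a documentation-only IP) are all assumptions made for illustration.

```python
def point_kubeconfig_at_server(kubeconfig_text: str, server_ip: str) -> str:
    """Swap the loopback API endpoint for the server's reachable address."""
    return kubeconfig_text.replace("127.0.0.1", server_ip)


# Trimmed, hypothetical kubeconfig fragment as fetched from the server node.
fetched = (
    "clusters:\n"
    "- cluster:\n"
    "    server: https://127.0.0.1:6443\n"
    "  name: default\n"
)

rewritten = point_kubeconfig_at_server(fetched, "192.0.2.10")
print(rewritten)
```

Without this rewrite, a kubeconfig copied to an operator machine would still point at that machine's own loopback address instead of the K3s server.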

How the stages fit together

```mermaid
%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'13px'}}}%%
flowchart TD
  classDef host fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb;
  classDef cluster fill:#1e4032,stroke:#22c55e,color:#f0fdf4;
  classDef danger fill:#5f1e1e,stroke:#ef4444,color:#fef2f2;

  rpi["1 · rpi.yml<br>hostname + dnsmasq off"]:::host
  dns["2 · pihole<br>HA DNS"]:::host
  ssl["3 · step-ca<br>root CA + CA-trusting runner image"]:::host
  disk["4 · prepare_disks.yml<br>ext4 arcodange_500 -> /mnt/arcodange"]:::danger
  docker["5 · system_docker.yml<br>data-root on external disk"]:::host
  iscsi["6 · iscsi_longhorn.yml<br>open-iscsi + dm_crypt"]:::host
  k3s["7 · system_k3s.yml<br>k3s v1.34.3 (--disable traefik)"]:::cluster
  cdns["8 · k3s_dns.yml<br>coredns-custom -> Pi-hole"]:::cluster
  cmgr["9 · k3s_ssl.yml<br>cert-manager + step-issuer"]:::cluster
  cfg["10 · k3s_config.yml<br>Longhorn + Traefik + redeploy"]:::cluster

  rpi --> dns --> ssl --> disk --> docker --> iscsi --> k3s --> cdns --> cmgr --> cfg
```

1. rpi.yml fixes the hostname and, on Pi-hole nodes, stops dnsmasq so pihole-FTL can own port 53.
2. Pi-hole comes up as the HA DNS authority for arcodange.lab.
3. step-ca is installed; its root CA is fetched and baked into a Gitea runner image so CI can trust internal TLS.
4. prepare_disks.yml formats and mounts the external USB disk at /mnt/arcodange (with a confirmation pause).
5. Docker installs with its data-root pointed at that disk and capped logging.
6. iSCSI + dm_crypt prerequisites land so Longhorn can attach (and encrypt) volumes.
7. K3s installs with the first host as server, Docker as the container runtime, and Traefik disabled.
8. CoreDNS is reconfigured to forward arcodange.lab to the Pi-holes.
9. cert-manager + step-issuer wire the in-cluster issuer to step-ca.
10. k3s_config.yml deploys Longhorn and a fully-customized Traefik, then deletes the old Traefik so the helm-controller redeploys with the new config.
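To make step 5 concrete, the following Python sketch assembles a daemon.json with the settings this page lists: json-file logging capped at 10m × 5 files, overlay2, and a data-root that is only set when the external disk is mounted. The function name and the boolean flag are hypothetical; the real file is written by the playbook, and its exact layout may differ.

```python
import json


def build_daemon_config(external_disk_mounted: bool) -> dict:
    """Assemble Docker daemon settings as described for system_docker.yml."""
    config = {
        "log-driver": "json-file",
        "log-opts": {"max-size": "10m", "max-file": "5"},
        "storage-driver": "overlay2",
    }
    # Only relocate Docker's data when the external disk is actually mounted,
    # otherwise images and volumes would land on an empty mount point.
    if external_disk_mounted:
        config["data-root"] = "/mnt/arcodange/docker"
    return config


with_disk = build_daemon_config(True)
without_disk = build_daemon_config(False)
print(json.dumps(with_disk, indent=2))
```

Making data-root conditional is what ties step 5 to step 4: Docker moves onto the external disk only once prepare_disks.yml has mounted it.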

k3s_config.yml — Longhorn & Traefik detail

| Resource | Value | Notes |
|---|---|---|
| Longhorn HelmChart | v1.9.1 | defaultSettings.defaultDataPath: /mnt/arcodange/longhorn; volumes live on the external disk. |
| Traefik HelmChart | v37.4.0 | Deployed as a k3s static manifest (traefik-v3.yaml) with an inline traefik-configmap. |
| Wildcard cert | wildcard-arcodange-lab | Certificate for arcodange.lab + `*.arcodange.lab`, issued by the step-issuer StepClusterIssuer. |
| TLSStore default | defaultCertificate: wildcard-arcodange-lab | Makes the wildcard cert the cluster-wide default. |
| Gitea exposure | gitea-external ExternalName Service → pi2 port 3000 | Gitea runs outside K3s as Docker Compose on pi2; Traefik routes gitea.arcodange.lab to it. |
| localIp middleware | ipAllowList | Restricts dashboard/Gitea routers to LAN + pod CIDR + the detected public IP. |
| CrowdSec bouncer | plugin v1.3.3 | Traefik experimental plugin crowdsec-bouncer-traefik-plugin (config completed in 04 · Tools). |
| DuckDNS token | traefik-duckdns-token Secret → DUCKDNS_TOKEN | Consumed by the letsencrypt ACME DNS-challenge resolver via envFrom. |
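The effect of the localIp allow-list can be approximated with Python's standard ipaddress module. This is only a model of the matching rule, not Traefik's implementation, and every range below is an assumption: 192.168.1.0/24 stands in for the LAN, 10.42.0.0/16 for the pod CIDR, and 203.0.113.7 for the detected public IP.

```python
import ipaddress

# Hypothetical ranges: LAN, pod CIDR, and a single public address.
ALLOWED_RANGES = ["192.168.1.0/24", "10.42.0.0/16", "203.0.113.7/32"]


def is_allowed(client_ip: str, ranges=ALLOWED_RANGES) -> bool:
    """Return True when the client address falls inside any allowed range."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in ranges)


for ip in ["192.168.1.50", "10.42.3.9", "203.0.113.7", "198.51.100.1"]:
    print(ip, is_allowed(ip))
```

A request from outside all three ranges, such as 198.51.100.1 here, would be rejected before reaching the dashboard or Gitea router.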

Gotchas

Caution

Step 4 formats a disk — data loss is real. prepare_disks.yml picks the largest non-system partition and runs mkfs.ext4 -F on it when the arcodange_500 label is absent. The run_once pause prompt ("tapez 'oui' pour continuer") is the only guard, and a wrong USB stick plugged into the wrong Pi will be wiped. Confirm target_device in the debug output before answering. If a candidate already carries the label, the format is skipped and the disk is only (re)mounted.
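The selection rule behind this caution can be sketched in a few lines of Python. It is a simplified model under stated assumptions, not the playbook's logic: the partition records, device names, and the `pick_target` helper are hypothetical, and the USB-only filter is reduced to "anything that is not on mmcblk0".

```python
def pick_target(partitions, label="arcodange_500"):
    """Choose the partition to mount and decide whether it must be formatted."""
    candidates = [p for p in partitions if not p["name"].startswith("mmcblk0")]
    if not candidates:
        return None
    # An already-labelled partition is reused as-is: no format, only a mount.
    labelled = [p for p in candidates if p.get("label") == label]
    if labelled:
        return {"device": labelled[0]["name"], "format": False}
    # Otherwise the largest candidate is picked and would be formatted.
    largest = max(candidates, key=lambda p: p["size_bytes"])
    return {"device": largest["name"], "format": True}


# Hypothetical block devices on one Pi.
fresh = [
    {"name": "mmcblk0p2", "size_bytes": 64_000_000_000},
    {"name": "sda1", "size_bytes": 500_000_000_000},
    {"name": "sdb1", "size_bytes": 16_000_000_000},
]
prepared = [dict(p) for p in fresh]
prepared[1]["label"] = "arcodange_500"

print(pick_target(fresh))
print(pick_target(prepared))
```

The first call shows why the prompt matters: size alone decides the target, so any large unlabelled stick becomes the format candidate.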

Warning

K3s is installed with --disable traefik. The bundled Traefik is intentionally turned off in step 7 so that step 10 can deploy its own fully customized Traefik (chart v37.4.0). If you re-enable the bundled Traefik or run k3s_config.yml out of order, two Traefiks will fight over the ingress ports.

Warning

ARM64 needs the kube-rbac-proxy image override. step-issuer's default gcr.io/kubebuilder/kube-rbac-proxy:v0.15.0 is AMD64-only and crash-loops on pi3 (ARM64). k3s_ssl.yml overrides it to quay.io/brancz/kube-rbac-proxy:v0.15.0. Do not remove this override.

Warning

Traefik is force-redeployed. The last play of k3s_config.yml deletes the traefik Deployment and the helm-install-traefik Job so the k3s helm-controller re-runs the install against the new manifest. Expect a brief ingress outage during this window; the play then waits for the new Deployment to come back before finishing.

Note

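tags: never plays are opt-in. rpi.yml and system_docker.yml carry tags: never, so a normal invocation skips them. They run only when you explicitly request one of their tags (e.g. --tags rpi / --tags ...) or the never tag itself (--tags never); --tags all on its own does not select never-tagged plays. The K3s/Longhorn/Traefik plays run on a normal invocation.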
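The opt-in behaviour can be modelled with a small Python function. This is a deliberately simplified picture of Ansible's tag selection: it ignores special tags such as always and all, and --skip-tags. The tag sets attached to each play are assumptions for illustration (in particular that rpi.yml carries an rpi tag next to never), and `runs` is a hypothetical helper.

```python
def runs(play_tags, requested=None):
    """Decide whether a play runs; requested=None models no --tags option."""
    tags = set(play_tags)
    if requested is None:
        # Plain invocation: everything runs except never-tagged plays.
        return "never" not in tags
    # With --tags, a play runs only if it carries one of the requested tags.
    return bool(tags & set(requested))


# Hypothetical tag assignments for two plays of this stage.
plays = {
    "system/rpi.yml": ["never", "rpi"],
    "system/system_k3s.yml": ["k3s"],
}

for name, tags in plays.items():
    print(name, "default:", runs(tags), "--tags rpi:", runs(tags, ["rpi"]))
```

Under this model a plain run executes the K3s play and skips rpi.yml, while requesting the rpi tag does the reverse.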