46 Commits

Author SHA1 Message Date
801724e1bc Merge pull request 'chore(iac): remove spent R2 import block' (#14) from arcodange/r2-import-cleanup into main 2026-06-24 13:24:09 +02:00
7727b244ad chore(iac): remove spent R2 import block
The one-time import block from the previous change reconciled
cloudflare_r2_bucket.arcodange_tf into state (run #29: "Import complete",
"Apply complete! Resources: 1 imported"). It is now a no-op, so remove it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 13:23:42 +02:00
e2a79a08a7 Merge pull request 'fix(iac): import existing EU R2 bucket into state' (#13) from arcodange/r2-state-import into main 2026-06-24 13:19:56 +02:00
a0fbe5c655 fix(iac): import existing EU R2 bucket into state
Run #28 applied cleanly except cloudflare_r2_bucket.arcodange_tf: the bucket
exists in the EU jurisdiction, but its prior state entry lacked the jurisdiction,
so cloudflare provider >=5.20 read it as not-found, removed it from state, and
then failed to recreate it ("already exists"). Add a config-driven import block
with the jurisdiction-qualified id (<account_id>/<bucket_name>/<jurisdiction>) so
the next apply adopts the real bucket. No-op once reconciled; removable after.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 13:19:32 +02:00
fc28c52b85 Merge pull request 'fix(iac): pin cloudflare provider + lockfile, trust homelab CA in gitea provider' (#12) from arcodange/iac-provider-fixes into main 2026-06-24 13:03:16 +02:00
bfa05ff633 Merge pull request 'fix(ci): run factory tofu workflows on the CA-trusting runner' (#11) from arcodange/focused-dirac-151213 into main 2026-06-24 13:02:58 +02:00
9b545e6f8f fix(iac): pin cloudflare provider + lockfile, trust homelab CA in gitea provider
With the runner CA fix (#11) the iac workflow now runs far enough to apply,
which exposed two provider problems:

cloudflare drift — `cloudflare/cloudflare` floated on `~> 5` with no committed
lock file, so CI pulled v5.21.1 where `cloudflare_account_token.policies[].resources`
is a JSON string, not a map ("Incorrect attribute value type"). Fix:
- pin to `~> 5.21` and commit a multi-platform `.terraform.lock.hcl`
  (linux_arm64 for the runner + darwin_arm64 for local);
- `jsonencode(...)` the module's policy resources;
- bind the cloudflare_token module to `cloudflare/cloudflare` explicitly (it was
  defaulting to `hashicorp/cloudflare`, pulling a redundant provider);
- stop `.gitignore` from hiding the lock file (the old `.terraform.*` rule did).

gitea provider TLS — it runs inside the dflook/terraform-apply container, which
doesn't trust the homelab CA (only the ubuntu-latest-ca runner does), so it
failed `x509: certificate signed by unknown authority` reaching
gitea.arcodange.lab. Fix: feed it the homelab CA via the provider's `cacert_file`
(TF_VAR_gitea_cacert_file -> the homelab.pem the workflow already materializes).

Validated locally with `tofu validate` + provider-schema inspection (no prod
calls). Complements #11. Out of scope (need a live run / operator): the OVH
consumer-key scope, and the R2 bucket "not found" on refresh (a state reconcile).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 12:56:46 +02:00
e5c537a967 fix(ci): run factory tofu workflows on the CA-trusting runner
After the move to the self-signed internal DNS (gitea.arcodange.lab /
vault.arcodange.lab), the default `ubuntu-latest` runner image does not
trust the homelab CA, so the `uses:` clone of the vault-action over HTTPS
fails TLS verification. webapp's workflows already moved to the
`ubuntu-latest-ca` runner (whose image ships the homelab CA); apply the
same to the factory `iac` and `postgres` tofu workflows.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 11:22:54 +02:00
3b0919b804 Merge pull request 'docs(vibe): bootstrap vibe/ knowledge tree + ecosystem AGENTS.md' (#10) from arcodange/focused-dirac-151213 into main
Reviewed-on: #10
2026-06-24 11:01:09 +02:00
053b04337a chore: gitignore .claude/worktrees
Per-session Claude Code checkouts live under .claude/worktrees/<slug>/
on the trunk; keep them out of git so the main checkout stays clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 10:55:43 +02:00
1824a1885d docs(vibe): add maintenance rule to the ansible + opentofu sub-hubs
The two factory-provisioning sub-hubs were the only guidebook index pages without
the "alter a documented component -> update its page in the same PR" reminder that
every sibling hub carries. Add a scoped maintenance rule to each, pointing back to
the factory-provisioning maintenance rule and the guidebooks' Rules to contribute,
so no folder hub silently drifts.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 23:42:24 +02:00
2d76eb45c1 docs(vibe): add new-tool and new-app runbooks (grounded in real PRs)
Two agent-oriented runbooks under vibe/runbooks/ with [AGENT]/[HUMAN] step
markers, grounded in real diffs:

- new-tool.md : add a platform component to the tools repo so ArgoCD deploys it
  into the tools namespace (wrapper Chart.yaml + the tool library + a row in
  chart/values.yaml; optional iac/ for secrets). Mirrors the prometheus/crowdsec
  additions.
- new-app.md  : stand up a brand-new application across THREE repos (app +
  factory + tools) with the strict ordering dependency and the TERRAFORM_SSH_KEY
  pitfall. Phase-by-phase mapped to the dance-lessons-coach onboarding PRs
  (#89/#97/#98/#99/#100), factory #1/#2, tools #1; the FR doc/runbooks/new-web-app
  is linked as the detailed companion.

2 mermaid diagrams MCP-validated; zero dead links across the vibe tree.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 22:22:09 +02:00
7bf83e75ed docs(vibe): add erp/ guidebook (Dolibarr deployment + backup/recovery + ops)
Dedicated tree-docs guidebook under vibe/guidebooks/erp/ for the lab's most
data-critical app, cross-linked from the applications hub (bidirectional):

- README.md             : Dolibarr 22.0.4 on Postgres; data-criticality; overview
  diagram; the Vault-unseal-before-scale recovery ordering (CAUTION).
- deployment.md         : upstream image + custom entrypoint (MySQL->psql), the
  50Gi Longhorn RWX documents PVC, Vault CRDs + the shared app_roles iac, init
  scripts (conf.php creds, table-ownership), ingress, CI.
- backup-and-recovery.md: the Ansible CronJob pg_dump (daily 04:00, 15-day
  retention) + restore Job (scale-0 -> restore -> scale-1); the cluster recovery
  ordering (Longhorn -> Vault unseal -> erp scale-up).
- operations.md         : the read-only bin/arcodange CLI, static/company.json,
  Deno+Playwright tests, day-2 ops.

erp code via full gitea URLs; CLUSTER_RECOVERY.md by name; 2 mermaid diagrams
MCP-validated; zero dead links.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 22:12:11 +02:00
4823394e0e docs(vibe): add applications/ guidebook (webapp + url-shortener)
Tree-docs guidebook under vibe/guidebooks/applications/ documenting the common
app pattern and two contrasting archetypes, drilling into lab-ecosystem/01-factory
(bidirectional):

- README.md  : the shared app pattern (repo = Dockerfile + chart + optional iac +
  CI; ArgoCD app-of-apps; the <app> join key; .fr vs .lab ingress conventions) +
  a two-archetype comparison.
- webapp.md  : canonical Go + Postgres exemplar (chart, VaultAuth/Static/Dynamic
  CRDs, inline iac vs the shared app_roles module, CI); notes the current nuance
  that the live pod still uses the static pgbouncer_auth DATABASE_URL.
- url-shortener.md : Rust + SQLite-on-Longhorn-RWO counterpart (single replica,
  no iac/no Vault, CI mirrors the upstream image); the power-cut recovery story.

erp is referenced in prose only (its own guidebook lands next). Sibling-repo code
via full gitea URLs; 2 mermaid diagrams MCP-validated; zero dead links.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 21:58:36 +02:00
548dacfc44 docs(vibe): add tools/ and cms/ guidebooks
Two code-grounded tree-docs guidebooks under vibe/guidebooks/, drilling into the
lab-ecosystem 02-tools and 03-cms pages (bidirectional):

- tools/  : hub + components.md (Vault+VSO, Prometheus, Grafana, CrowdSec,
  pgbouncer, Redis/KeyDB, Plausible, ClickHouse; pgcat/tool as Tier-2) +
  secrets-and-vso.md (Vault engines/auth, the app_roles/app_policy modules =
  the <app> join-key machinery, VSO CRDs, secret-paths inventory).
- cms/    : hub + site.md (Nuxt + dual Pages/k3s deploy) + cloudflare.md
  (zone via OVH->CF, Pages, cloudflared tunnel, Turnstile, R2 state) +
  zoho-email.md (OAuth, MX/SPF/DKIM/DMARC/BIMI, the 7 aliases).

Sibling-repo code linked via full gitea URLs; vibe-internal links bidirectional.
Reconciled the cloudflared tunnel token path to kvv2 cms/cloudflared (the chart
VaultStaticSecret is kv-v2; the kvv1 tofu reference is a commented-out stub).
6 mermaid diagrams MCP-validated; zero dead links. Lab Cartographer cohort.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 21:41:15 +02:00
dbe32161dc docs(vibe): add factory-provisioning guidebook (Ansible + OpenTofu)
Deep, code-grounded tree-docs guidebook under vibe/guidebooks/factory-provisioning/,
explored from the actual playbooks/roles and tofu code:

- Hub: the two provisioning engines (operator-run Ansible vs CI-applied OpenTofu),
  a green-field bring-up flow, master index, maintenance rule.
- ansible/ sub-tree: ordered pages 01-system .. 06-recover, an inventory & variables
  concept page, and a Tier-1/Tier-2 roles reference (hashicorp_vault, step_ca,
  crowdsec, pihole, deploy_docker_compose + the gitea_* family and helpers).
- opentofu/ sub-tree: factory-iac (Cloudflare/OVH/GCP/Gitea/Vault edge +
  cloudflare_token module), postgres-iac (per-app DB/role/pgbouncer lookup),
  ci-apply-flow (Gitea OIDC-JWT -> Vault -> auto-approve apply).

Cross-linked bidirectionally with the lab-ecosystem guidebook and the safe-env
ADR/PRD (the sandbox rehearses exactly these engines). 14 mermaid diagrams
MCP-validated; zero dead links. Authored by the Lab Cartographer cohort.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 21:11:51 +02:00
b886f06824 docs(vibe): backfill PR #10 crosslink into ADR-0001 + PRD STATUS
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 11:53:39 +02:00
7647a68cdc docs(vibe): bootstrap vibe/ knowledge tree + ecosystem AGENTS.md
Add a root AGENTS.md (ecosystem map of factory/tools/cms + agent operating
rules + the persona cohort & workflow) and a new vibe/ knowledge base for LLM
agents, modeled on tree-docs conventions and the factory house style.

vibe/ folders (each with a README hub + contribution rules):
- ADR/      optimized MADR-lite; canonical home going forward (doc/adr stays historical)
- PRD/      one subfolder per PRD, mandatory STATUS.md, QA strategy for big ones
- investigations/  single INV-NNN-slug.md, or stub + folder w/ notebooks
- guidebooks/      tree-docs maps; lab-ecosystem guidebook of factory+tools+cms
- runbooks/        [AGENT]/[HUMAN] step procedures (EN; doc/runbooks stays FR)
- shareouts/       dated FR handouts (decks/mp4)

Seed content (first ADR + PRD): a safe, production-like environment to rehearse
risky changes and recovery without touching real prod — local-only sandbox
(k3d + arm64 VMs) with a hard prod/sandbox isolation boundary. Includes
INV-001 (prod blast-radius couplings), the ecosystem guidebook, and a FR shareout.

Conventions enforced: no-tombstone rule, breadcrumb spine, bidirectional
cross-links, theme:base mermaid (MCP-validated) + ordered-list-after-diagram.
Built with a Workflow + persona cohort; 24 files, zero dead links.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 11:52:37 +02:00
827af6b392 Merge pull request 'docs(runbooks): runbook "bring a new web application into service"' (#9) from claude/new-web-app-runbook into main 2026-05-31 17:25:37 +02:00
8330d82225 docs(runbooks): add "new web app" setup runbook under doc/runbooks/
Document, as a tree-docs tree, the end-to-end procedure to stand up a new
web application on the Arcodange platform. The procedure is spread across the
factory, tools and app repos, with non-trivial ordering dependencies.

Covers: Gitea repo creation (org-secret inheritance), Postgres DB + owner
role (factory/postgres/iac), platform Vault declaration (gitea_cicd_<app>
+ policies, tools/hashicorp-vault/iac), the app Helm chart (VSO dynamic
secrets via pgbouncer), the app Terraform (app_roles module), the CI
workflows (tofu apply + image build, incl. the copy-pasted role pitfall),
and ArgoCD registration (factory/argocd/values.yaml). Adds a naming-
conventions concept page and an ordered checklist.

Wires the legacy doc/adr "setup hello world web app" item and the factory
README to the runbook. New docs live under doc/ (singular) per the PR #8
convention.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-31 17:22:30 +02:00
54b3092305 Merge pull request 'fix(docs): place ADR under doc/adr (singular) per convention' (#8) from fix/adr-path-doc-singular into main
Reviewed-on: #8
2026-05-09 15:29:22 +02:00
e0fb337a5f docs: place new ADR under doc/adr (singular) per convention
The 20260509 ADR landed in docs/adr/ (plural) by mistake. Convention
is doc/adr/ (alongside the existing 00_*, 01_*, … docs and the
network-architecture/cicd-architecture ADRs that pre-existed there).

Note: 20260407-*.md files in the typo'd docs/adr/ are still untracked
(never committed); removing them is a separate cleanup task.
2026-05-09 14:25:37 +02:00
ea500abe62 Merge pull request 'docs(adr): telegram-gateway auth (Phase 1.5)' (#7) from docs/telegram-gateway-auth-adr into main
Reviewed-on: #7
2026-05-09 14:22:12 +02:00
62673a2d65 docs(adr): telegram-gateway auth (Phase 1.5)
Documents the authentication layer added to telegram-gateway in
Phase 1.5:

- principal bot @arcodange_factory_bot (handler=auth) handles /auth, /whoami, /logout
- 24h Redis session keyed by Telegram from.id (TTL via AUTH_SESSION_TTL)
- optional allowlist (ALLOWED_USERS): silent drop before the gate
- requireAuth is secure-by-default (true), with explicit per-bot opt-out
- handler=auth forces requireAuth=false (chicken-and-egg)

Bidirectional cross-links with the code (Gitea URLs to
arcodange/telegram-gateway), AUTH.md (user-facing) and HOWTO_ADD_BOT.md
(Case 2 updated). Mermaid diagrams with explicit contrasts.
2026-05-09 13:58:27 +02:00
4163b06659 Merge pull request 'argocd: add telegram-gateway application' (#6) from feat/homelab-gateway-app into main
Reviewed-on: #6
2026-05-09 12:41:49 +02:00
3fb7544351 argocd: rename homelab-gateway → telegram-gateway
Aligns with the upstream repo rename
(arcodange/homelab-gateway → arcodange/telegram-gateway) so the name
matches the public URL tg.arcodange.fr and Arcodange's naming
conventions.
2026-05-09 12:35:37 +02:00
5038956332 argocd: add homelab-gateway application
Adds the homelab-gateway Argo CD Application pointing at
arcodange/homelab-gateway (user space, like dance-lessons-coach).

Image Updater watches gitea.arcodange.lab/arcodange/homelab-gateway:latest
with digest strategy.

Phase 1 of the Telegram webhook gateway — a long-running pod that
receives webhooks (no more polling) and routes per-bot to handler
implementations. Initial bot: @arcodange_factory_bot, slug=factory,
echo handler.
2026-05-09 12:25:30 +02:00
6ede249da9 🔒 fix(ansible): gate vault auth disable behind vault_oidc_force_reset (default off) (#5)
Co-authored-by: Gabriel Radureau <arcodange@gmail.com>
Co-committed-by: Gabriel Radureau <arcodange@gmail.com>
2026-05-06 15:03:33 +02:00
9e821e1626 ♻️ refactor(ansible): move gitea secret user-propagation list to inventory (#4)
Co-authored-by: Gabriel Radureau <arcodange@gmail.com>
Co-committed-by: Gabriel Radureau <arcodange@gmail.com>
2026-05-06 14:48:05 +02:00
69b7e9ddcb Merge remote-tracking branch 'origin/main' 2026-05-06 14:38:01 +02:00
069edd72f1 chore(cicd): drop temporary commented-out tasks from 03_cicd.yml
Removes the commented PACKAGES_TOKEN/HOMELAB_CA_CERT blocks and the legacy
"Deploy Argo CD" play that were left behind during the migration to
Helm-based ArgoCD.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 14:37:48 +02:00
a644436746 🔒 fix(ansible): propagate vault_oauth__sh_b64 to user-owned namespaces (arcodange) (#3)
Co-authored-by: Gabriel Radureau <arcodange@gmail.com>
Co-committed-by: Gabriel Radureau <arcodange@gmail.com>
2026-05-06 14:18:06 +02:00
a3526e51f8 Merge remote-tracking branch 'origin/main' 2026-05-06 12:58:01 +02:00
01f0f37691 chore(ansible): add per-collection ansible.cfg + drop trailing whitespace
ansible/arcodange/factory/ansible.cfg sets collections_path so ansible
commands run from inside the collection directory still find user-installed
collections under ~/.ansible/collections.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 12:55:54 +02:00
f114d7e6f0 feat(argocd): allow per-app syncPolicy override in values.yaml
The apps template hardcoded automated{prune,selfHeal} for every app. Some
apps (e.g. tools, where Vault unseal is manual) need a custom syncPolicy
without selfHeal. Read $app_attr.syncPolicy when set, fall back to the
existing automated default otherwise. Use the override on `tools` to keep
the existing behavior explicit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 12:55:49 +02:00
1688fe0dfd fix(crowdsec): clean up Failed pods before Traefik middleware reload
Re-running the role would leave behind crowdsec pods stuck in Failed phase
(typically after a config error on a previous run), which then blocked the
Traefik middleware refresh. Delete them up front so the next reconcile
schedules fresh pods.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 12:55:39 +02:00
499410a160 feat(cicd): persist gitea act-runner cache + isolate on dedicated docker network
Pins the actcache server to a fixed port (43707) and exposes it, then
mounts /mnt/arcodange/gitea-runner-cache and /mnt/arcodange/gitea-runner-act
into the runner so the actions/cache and act image layer cache survive
container restarts. Moves the runner onto a dedicated `gitea_action_network`
so CI job containers can reach the cache server by name without sharing the
host network.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 12:55:34 +02:00
e3e0decd98 docs(adr): extend network-architecture ADR with .lab SSL/TLS deep dive
Replaces the placeholder "Success Metrics" section with a detailed
walkthrough of the internal PKI: Step CA provisioners, cert-manager +
StepClusterIssuer wiring, certificate issuance/renewal sequence diagram,
device-trust installation steps, and troubleshooting playbook for the
common stuck-CertificateRequest / Traefik TLS / device-trust failures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 12:55:27 +02:00
1ae28cb944 docs(longhorn): document 2026-04-13 power-cut recovery + add data-recovery tooling
Captures the post-mortem of the April 13 power-cut: incident timeline,
retrospective, and architecture/role diagrams. Adds an ADR explaining why
Longhorn cannot re-associate orphaned replica directories after a nuclear
reinstall (engine-id naming), plus block-device recovery runbooks and the
`playbooks/recover/longhorn_data.yml` automation that wires `merge-longhorn-layers.py`
to rebuild PVCs from raw `volume-head-*.img` chains.

Also extends the k3s_pvc backup to capture Longhorn `volumes`/`settings` CRDs
(needed for the fast-path restore) and rewrites the restore script with a
fallback dir + English messages.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 12:55:18 +02:00
934b62d922 chore(ansible): use project-local uv venv for ansible runtime deps
Moves the local ansible runtime from a global `uv tool install ansible-core`
(which required remembering `--with kubernetes --with jmespath --with dnspython`)
to a project-managed venv described by `pyproject.toml` + `uv.lock`. Fixes the
"Failed to import the required Python library (kubernetes)" error on localhost.

The localhost inventory entry now derives `ansible_python_interpreter` from
`{{ ansible_playbook_python }}`, so `uv run ansible-playbook` is enough — no
more hardcoded user-specific paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 12:35:28 +02:00
09a270d179 🤖 ci(postgres): declare dance-lessons-coach DB + role + pgbouncer lookup (#2)
Co-authored-by: Gabriel Radureau <arcodange@gmail.com>
Co-committed-by: Gabriel Radureau <arcodange@gmail.com>
2026-05-06 08:18:30 +02:00
0ce004cc6a 🤖 ci(argocd): enroll dance-lessons-coach + per-app org override in apps template (#1)
Co-authored-by: Gabriel Radureau <arcodange@gmail.com>
Co-committed-by: Gabriel Radureau <arcodange@gmail.com>
2026-05-06 08:01:50 +02:00
e6fc24c101 fix(dns): harden DNS resilience after power-cut incident
During the 2026-04-13 power cut recovery, DNS resolution failures blocked
Longhorn reinstall. Root causes:
- CoreDNS forwarded to a single hardcoded Pi-hole IP instead of both HA instances
- CoreDNS main Corefile forwarded to /etc/resolv.conf which pointed to itself on pi3
- Pi-hole lacked explicit upstream DNS, relying on DHCP-provided config
- dnsmasq system service conflicted with pihole-FTL on port 53

Changes:
- k3s_dns: forward CoreDNS to both Pi-hole HA instances (pi1 + pi3) dynamically
- k3s_dns: update main CoreDNS Corefile to forward to Pi-holes instead of resolv.conf
- pihole defaults: add explicit upstream DNS servers (8.8.8.8, 1.1.1.1, 8.8.4.4)
- pihole ha_setup: write /etc/dnsmasq.d/99-upstream.conf with explicit upstreams
- rpi: add dnsmasq user to dip group and disable conflicting dnsmasq service on Pi-hole nodes

See docs/adr/20260414-internal-dns-architecture.md for full rationale.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 10:54:42 +02:00
355ab11c4d fix(system_docker): fix daemon.json corruption on re-run
Two bugs caused daemon.json to be overwritten with invalid content:
- Invalid `when` condition using unsupported Ansible inline stat syntax,
  causing the existing file read to be silently skipped and docker_config
  to always reset to {}
- Folded scalar `>` in set_fact converted the dict to a Python string
  representation, which to_nice_json serialized as a JSON string instead
  of an object

Fixes identified during 2026-04-13 power cut incident post-mortem.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 10:52:27 +02:00
ad70b424cf Add sequence diagram to Docker storage ADR
This commit adds a detailed sequence diagram to the Docker storage optimization ADR, illustrating the workflow for configuring Docker storage, pinning images, and maintaining Longhorn performance.

Generated by Mistral Vibe.
Co-Authored-By: Mistral Vibe <vibe@mistral.ai>
2026-04-08 11:33:03 +02:00
b299469d00 Consolidate ADRs into docs/adr/
This commit moves Architecture Decision Records (ADRs) from ../../../docs/adr/ to docs/adr/ in the arcodange/factory repository. This centralizes all ADRs in one location for better maintainability and discoverability.

Generated by Mistral Vibe.
Co-Authored-By: Mistral Vibe <vibe@mistral.ai>
2026-04-08 11:09:34 +02:00
119 changed files with 15245 additions and 293 deletions


@@ -36,7 +36,7 @@ concurrency:
jobs:
gitea_vault_auth:
name: Auth with gitea for vault
- runs-on: ubuntu-latest
+ runs-on: ubuntu-latest-ca
outputs:
gitea_vault_jwt: ${{steps.gitea_vault_jwt.outputs.id_token}}
steps:
@@ -50,7 +50,7 @@ jobs:
name: Tofu
needs:
- gitea_vault_auth
- runs-on: ubuntu-latest
+ runs-on: ubuntu-latest-ca
env:
OPENTOFU_VERSION: 1.8.2
TERRAFORM_VAULT_AUTH_JWT: ${{ needs.gitea_vault_auth.outputs.gitea_vault_jwt }}
@@ -62,6 +62,10 @@ jobs:
run: echo -n "${{ secrets.HOMELAB_CA_CERT }}" | base64 -d > $VAULT_CACERT
- name: terraform apply
uses: dflook/terraform-apply@v1
+ env:
+ # the apply runs in dflook's container, which doesn't trust the homelab CA;
+ # hand the gitea provider the CA cert the step above wrote to the workspace
+ TF_VAR_gitea_cacert_file: "${{ github.workspace }}/homelab.pem"
with:
path: iac
auto_approve: true


@@ -33,7 +33,7 @@ concurrency:
jobs:
gitea_vault_auth:
name: Auth with gitea for vault
- runs-on: ubuntu-latest
+ runs-on: ubuntu-latest-ca
outputs:
gitea_vault_jwt: ${{steps.gitea_vault_jwt.outputs.id_token}}
steps:
@@ -47,7 +47,7 @@ jobs:
name: Tofu - Postgres
needs:
- gitea_vault_auth
- runs-on: ubuntu-latest
+ runs-on: ubuntu-latest-ca
env:
OPENTOFU_VERSION: 1.8.2
TERRAFORM_VAULT_AUTH_JWT: ${{ needs.gitea_vault_auth.outputs.gitea_vault_jwt }}

10
.gitignore vendored

@@ -1,4 +1,10 @@
.terraform
.terraform.*
.terraform/
*.tfstate
*.tfstate.*
# keep .terraform.lock.hcl tracked (it pins provider versions; the old `.terraform.*` rule hid it)
.DS_Store
node_modules/
.venv/
# Claude Code worktrees (per-session checkouts under .claude/worktrees/<slug>/)
.claude/worktrees/

131
AGENTS.md Normal file

@@ -0,0 +1,131 @@
# Arcodange Lab — Agent Guide & Operating Rules
The Arcodange lab is a self-hosted home/company platform running on three Raspberry Pi (pi1/pi2/pi3) behind a home Livebox, driven from a MacBook Pro M4 control node. **factory** is the cornerstone admin repo: it provisions the cluster (k3s), defines what gets deployed (ArgoCD app-of-apps), manages cloud/forge state (OpenTofu), and provisions databases (PostgreSQL). Two sibling repos carry workloads: **tools** (platform services — Vault, Prometheus, Grafana, CrowdSec, poolers) and **cms** (the public Nuxt site `arcodange.fr`). Everything is deployed into k3s namespaces via ArgoCD, every secret comes from Vault, and public traffic enters through a Cloudflared Zero-Trust tunnel into the internal Traefik.
```mermaid
%%{init: {'theme':'base'}}%%
flowchart TB
subgraph control["Control node (MacBook Pro M4)"]
ansible["Ansible"]:::proc
tofu["OpenTofu"]:::proc
end
factory["factory repo<br>orchestrator: ansible + argocd + iac + postgres + doc"]:::src
tools["tools repo<br>Vault, Prometheus, Grafana, CrowdSec, poolers"]:::src
cms["cms repo<br>Nuxt site arcodange.fr"]:::src
subgraph cluster["k3s cluster (pi1 server, pi2/pi3 agents)"]
argocd["ArgoCD app-of-apps"]:::proc
vault["Vault + VSO"]:::store
nsapps["namespaces: tools / cms / webapp / erp / ..."]:::proc
traefik["internal Traefik"]:::proc
end
cflared["Cloudflared Zero-Trust tunnel"]:::proc
public["public: *.arcodange.fr"]:::store
ansible -- "provision k3s + base" --> cluster
tofu -- "state in GCS" --> factory
factory -- "defines Applications" --> argocd
argocd -- "deploy charts" --> tools
argocd -- "deploy charts" --> cms
tools --> nsapps
cms --> nsapps
vault -- "inject secrets" --> nsapps
nsapps --> traefik
cflared -- "ingress" --> traefik
public --> cflared
classDef src fill:#2563eb,stroke:#1e40af,color:#fff
classDef proc fill:#059669,stroke:#047857,color:#fff
classDef store fill:#7c3aed,stroke:#6d28d9,color:#fff
```
1. The **control node** (MacBook Pro M4) runs Ansible and OpenTofu — the two hands that build everything.
2. **Ansible** provisions the **k3s cluster** (pi1 server, pi2/pi3 agents) and its base layer.
3. **OpenTofu** manages forge/cloud state, persisted in **GCS** (`gs://arcodange-tf`), serving the **factory** repo's intent.
4. **factory** defines the **ArgoCD app-of-apps**, which is the single deployment authority.
5. **ArgoCD** deploys the Helm charts of **tools** and **cms** into per-app k3s **namespaces**.
6. **Vault** (with the Vault Secrets Operator) injects secrets into those namespace pods — no secrets live in git.
7. Workload pods route through the **internal Traefik**.
8. The **Cloudflared Zero-Trust tunnel** is the only public ingress, forwarding `*.arcodange.fr` traffic into internal Traefik.
---
## Repos at a glance
| Repo | Purpose | Key dirs | How deployed |
|---|---|---|---|
| **factory** | Cornerstone admin repo: provisions the cluster, defines deployments, owns infra state and DBs, holds canonical docs | [ansible/](ansible/) · [argocd/](argocd/) · [iac/](iac/) · [postgres/](postgres/) · [doc/](doc/) | Ansible (cluster + base) + OpenTofu (forge/cloud/PG state); ArgoCD reads its app-of-apps |
| **tools** | Platform services in the `tools` namespace | [hashicorp-vault](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/hashicorp-vault), [prometheus](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/prometheus), [grafana](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/grafana), [crowdsec](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/crowdsec), [pgbouncer](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/pgbouncer) | Helm/Kustomize charts deployed via ArgoCD |
| **cms** | Public Nuxt static site `arcodange.fr` + its zone/email IaC | [chart](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/chart), [cloudflare](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/cloudflare), [zoho](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/zoho) | Helm chart via ArgoCD; Cloudflare/Zoho via OpenTofu |
Self-hosted Gitea is at `gitea.arcodange.lab` (org `arcodange-org`). **Pick the forge tool from the remote**: these repos live on Gitea, so use the `mcp__gitea__*` MCP tools for PRs/issues/releases — `gh` will silently fail.
## The `<app>` join key
One kebab-case identifier — `<app>` — is reused **identically** across the Gitea repo, the PG database + `<app>_role`, Vault (`postgres/creds/<app>`, k8s auth role `<app>`, policies `<app>` / `<app>-ops`, `gitea_cicd_<app>`), the k8s namespace + ServiceAccount, the ArgoCD Application, the GCS state prefix `<app>/main`, and DNS (`<app>.arcodange.lab` / `<app>.arcodange.fr`). Bricks wire together **by name convention, not explicit config**, so a single typo breaks the chain silently. Source of truth: [doc/runbooks/new-web-app/conventions.md](doc/runbooks/new-web-app/conventions.md); concept page (going forward): [vibe/guidebooks/lab-ecosystem/naming-conventions.md](vibe/guidebooks/lab-ecosystem/naming-conventions.md).
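A minimal sketch of that convention, not code from any of the repos: every derived name can be computed from `<app>` in one place, so a typo surfaces once instead of silently breaking one brick. The helper name `derive_names`, its dictionary keys, and the regular expression used as a reading of "kebab-case" are assumptions made for illustration; the name patterns themselves are the ones listed above.

```python
import re

# Assumed reading of "kebab-case": lowercase alphanumeric words joined by single hyphens.
KEBAB_CASE = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")


def derive_names(app: str) -> dict[str, str]:
    """Return the names derived from the <app> join key (hypothetical helper)."""
    if not KEBAB_CASE.fullmatch(app):
        raise ValueError(f"{app!r} is not kebab-case")
    return {
        "gitea_repo": app,
        "pg_database": app,
        "pg_role": f"{app}_role",
        "vault_db_creds_path": f"postgres/creds/{app}",
        "vault_k8s_auth_role": app,
        "vault_policy": app,
        "vault_ops_policy": f"{app}-ops",
        "vault_cicd_name": f"gitea_cicd_{app}",
        "k8s_namespace": app,
        "k8s_service_account": app,
        "argocd_application": app,
        "gcs_state_prefix": f"{app}/main",
        "dns_internal": f"{app}.arcodange.lab",
        "dns_public": f"{app}.arcodange.fr",
    }


# Usage: list every name that must exist for one app.
for key, value in derive_names("url-shortener").items():
    print(f"{key}: {value}")
```

Rejecting a non-kebab-case identifier up front is a design choice of this sketch: it catches a malformed `<app>` before any name derived from it is used.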
## Where knowledge lives
Start at the knowledge-base front door: [vibe/README.md](vibe/README.md). The six `vibe/` folders:
- [vibe/ADR/](vibe/ADR/README.md) — architecture decision records (the *why*); canonical home going forward.
- [vibe/PRD/](vibe/PRD/README.md) — product/project requirement docs, each with a mandatory `STATUS.md`.
- [vibe/investigations/](vibe/investigations/README.md) — numbered investigations (`INV-NNN-slug`), with notebooks when data-heavy.
- [vibe/guidebooks/](vibe/guidebooks/README.md) — tree-docs that map the lab's components (the *how it fits together*).
- [vibe/runbooks/](vibe/runbooks/README.md) — step-by-step operational procedures with `[AGENT]` / `[HUMAN]` markers.
- [vibe/shareouts/](vibe/shareouts/README.md) — handouts and presentations (FRENCH; the one exception to the English rule).
Historical infra docs still live under [doc/](doc/) (ADRs, the new-web-app runbook) — see also `CLUSTER_RECOVERY.md` (at the lab root, **outside** this repo) for tested power-cut recovery.
## Operating rules for agents
### No-tombstone rule (FOREMOST)
Write every file as **currently true**. NEVER leave "Correction (date): …", "previously X, now Y", changelogs, or "updated to …" notes — git history is the audit trail. The **only** allowed exception is a forward-looking `> [!CAUTION]` about a live operational risk.
### Mermaid preferences
Begin each block with an init directive selecting `theme base` (or `forest`). Define a `classDef` palette legible on both light and dark backgrounds (dark fills + light text), e.g. `classDef src fill:#2563eb,stroke:#1e40af,color:#fff`. Use HTML `<br>` for line breaks (never `\n`). Put a leading space before any label starting with a slash, and escape angle brackets inside labels. **Validate every diagram with the Mermaid MCP** before committing, and **immediately after each diagram add a numbered ordered list restating the same flow in words**.
### Tree-docs (guidebooks & big PRDs)
Guidebooks and large PRDs are written as navigable trees: every file's **first line is a breadcrumb** (ancestors are relative links, current page is bold-unlinked, separator ` > `); **every folder has a `README.md` index hub** (a table of its children — link + one-line summary + status — sorted by importance/sequence, not alphabetically); **cross-references are bidirectional** (if A links B, B links A); use numbered file prefixes only for ordered narratives; **stamp `Last Updated: 2026-06-23` at each tree root**.
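The breadcrumb rule is mechanical enough to generate. A minimal sketch, assuming the caller already knows each ancestor's title and relative link (the function name and the example paths are hypothetical):

```python
def breadcrumb(ancestors: list[tuple[str, str]], current: str) -> str:
    """Build a first-line breadcrumb: linked ancestors, bold unlinked current page."""
    parts = [f"[{title}]({href})" for title, href in ancestors]
    parts.append(f"**{current}**")
    return " > ".join(parts)


if __name__ == "__main__":
    print(
        breadcrumb(
            [("vibe", "../../README.md"), ("guidebooks", "../README.md")],
            "lab-ecosystem",
        )
    )
```

Generating the line rather than typing it keeps the separator and the bold, unlinked last segment consistent across the tree.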
### Optimized ADR format (MADR-lite)
Sections: **Context / Decision / Consequences / Alternatives / QA & validation / References**. Once an ADR is **Accepted, the body is immutable** — only the status field mutates (Proposed → Accepted → Superseded). The canonical home going forward is [vibe/ADR/](vibe/ADR/README.md); [doc/adr/](doc/adr/README.md) stays as the historical record.
### Investigations
Prefer a **single `INV-NNN-slug.md`** when the finding fits in one file. When data-heavy, write a short stub `.md` beside a same-named folder containing notebooks (`.ipynb` + paired `.py`), a `_data/` dir, and a plain-language `notebook_simple.md` (visuals anyone can read).
### PRD convention
**One subfolder per PRD.** A `STATUS.md` is **mandatory** and must be updated whenever something ships. Big PRDs use tree-docs and **must detail a QA strategy**. The flow is Problem → … → QA strategy → STATUS.md.
### PR crosslinking (bidirectional)
**Every PR body references the ADR/PRD it advances.** In return, the ADR's **References** section and the PRD's **STATUS.md** link back to that PR. Links must be bidirectional — never one-way.
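Checking bidirectionality reduces to finding edges whose reverse is missing. The sketch below assumes the links have already been extracted into a mapping from each document or PR to the set of items it links to; the identifiers in the example are hypothetical.

```python
def one_way_links(links: dict[str, set[str]]) -> list[tuple[str, str]]:
    """Return (source, target) pairs where target does not link back to source."""
    missing = []
    for source, targets in links.items():
        for target in targets:
            if source not in links.get(target, set()):
                missing.append((source, target))
    return sorted(missing)


if __name__ == "__main__":
    graph = {
        "PR-12": {"ADR-7"},
        "ADR-7": {"PR-12"},
        "PR-13": {"PRD-2/STATUS.md"},  # STATUS.md never links back
    }
    print(one_way_links(graph))
```

An empty result means every link has its return link; anything else lists the pairs that still need one.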
### Guidebook maintenance
Altering a component that is documented in `guidebooks/` **requires updating that guidebook page in the same change**. A code/infra change that leaves its guidebook stale is incomplete.
### Language policy
**English** for everything in `vibe/` and for `AGENTS.md`/`CLAUDE.md` (this tree is for LLM agents). The single exception: **shareouts handouts are FRENCH**.
## The cohort + workflow
These personas are spawned via the Agent tool or a Workflow; document and reuse them across sessions:
| Persona | Role |
|---|---|
| **Lab Cartographer** | Explores the three repos and maps them into guidebooks (tree-docs). Read-mostly; never edits infra. |
| **ADR Scribe** | Writes optimized MADR-lite ADRs; enforces immutability (status-only mutation) and PR crosslinks. |
| **PRD Architect** | Writes PRDs (Problem → … → QA strategy → STATUS.md); uses tree-docs for big ones. |
| **Runbook Engineer** | Writes step-by-step runbooks with `[AGENT]` (read-only, safe) and `[HUMAN]` (prod-mutating, needs approval) markers. |
| **Investigator** | Writes investigations: a single `INV-NNN-slug.md` when possible, else a stub `.md` beside a same-named folder with `.ipynb` + paired `.py`, a `_data/` dir, and `notebook_simple.md`. |
| **Diagram Smith** | Authors and validates every mermaid diagram with the Mermaid MCP validator; enforces the mermaid preferences + the ordered-list-after-diagram rule. |
| **Continuity Warden** | The adversarial reviewer: checks the no-tombstone rule, breadcrumb depth, bidirectional links, dead links, naming, STATUS/PR crosslinks, and the Last Updated stamp. |
**Recommended workflow for substantial `vibe/` contributions:**
1. **Scaffold** — folders + README hubs + templates.
2. **Author** — personas in parallel, each on distinct files.
3. **Validate** — Diagram Smith runs the Mermaid MCP on every diagram.
4. **Review** — Continuity Warden runs the adversarial checklist.
5. **Assemble** — wire cross-refs, run the dead-link self-test, stamp `Last Updated`.


@@ -80,4 +80,9 @@ classDef done fill:gold,stroke:indigo,stroke-width:4px,color:blue;
class prepare_hd,nodeId2 done;
```
## Documentation
- 📚 [`doc/`](doc/README.md): ADRs (architecture decisions) + runbooks.
- 🚀 [Runbook: bringing a new web application into service](doc/runbooks/new-web-app/README.md): Gitea repository, database, Vault, Helm chart, Terraform, CI, ArgoCD.
🏹💻🪽


@@ -1,5 +1,17 @@
# Use Ansible
## Run locally (uv)
A project-local venv is defined in `pyproject.toml` at the repo root (ansible-core + the `kubernetes`, `jmespath`, `dnspython` libraries that `kubernetes.core` and friends need at runtime).
```sh
uv sync # creates .venv/ and installs ansible-core + python deps
uv run ansible-galaxy collection install -r ansible/requirements.yml
uv run ansible-playbook -i ansible/arcodange/factory/inventory ansible/arcodange/factory/playbooks/<playbook>.yml
```
The localhost entry in the inventory uses `ansible_python_interpreter: "{{ ansible_playbook_python }}"`, so `uv run` is enough — Ansible picks up the venv's Python automatically without any hardcoded path.
## Run with docker ssh agent side proxy
### build docker images
@@ -67,31 +79,25 @@ ansible -i ,localhost -c local localhost -m raw -a "echo hello world {{ inventor
### local python environment with uv
#### Install UV
`python3 -m pip install uv`
`python3 -m uv python install 3.10 3.11 3.12`
`echo "export PATH=\"$(find ~/Library/Python/*/bin/uv | xargs dirname)\"" >> ~/.zshenv`
`echo 'export PATH="~/.local/bin:$PATH"' >> ~/.zshenv`
#### Set python version to 3.12
`uv python pin 3.12` (edit .python-version file)
#### Install ansible
`uv tool install ansible-core --with dnspython --with jmespath --with kubernetes`
`echo 'export PATH="~/.local/share/uv/tools/ansible-core/bin:$PATH"' >> ~/.zshenv`
#### Install this project dependencies
#### Install UV (one-time)
```sh
python3 -m pip install uv
python3 -m uv python install 3.12
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.zshenv
```
ansible-galaxy collection install --token 11bebd8fd1ad4009f700bdedbeb80b19743ce3d3 -r ansible/requirements.yml # token is used by a rate limiter and can be sensitive
#### Bootstrap the project venv
```sh
uv sync # honors .python-version (3.12) and pyproject.toml
uv run ansible-galaxy collection install -r ansible/requirements.yml
# `--token <token>` is only needed if you hit galaxy.ansible.com rate limits
```
#### Run
```
ansible-galaxy collection install ./ansible/arcodange/factory -f
ansible-playbook -i ansible/arcodange/factory/inventory ansible/arcodange/factory/playbooks/02_setup.yml
```sh
uv run ansible-galaxy collection install ./ansible/arcodange/factory -f
uv run ansible-playbook -i ansible/arcodange/factory/inventory ansible/arcodange/factory/playbooks/02_setup.yml
```


@@ -0,0 +1,5 @@
[defaults]
collections_path = ~/.ansible/collections
[ssh_connection]
scp_if_ssh = True


@@ -0,0 +1,160 @@
# ADR 20260407: CI/CD Architecture with ArgoCD, Gitea, and Vault
## Status
Proposed
## Context
The home lab requires a secure and automated CI/CD pipeline to deploy applications to the k3s cluster. The pipeline must integrate with:
- **Gitea**: For Git repository management and CI runners.
- **ArgoCD**: For GitOps-based continuous deployment.
- **Vault**: For secrets management and OIDC authentication.
- **Gitea Act Runner**: For executing CI jobs.
## Decision
We will implement a **GitOps-driven CI/CD pipeline** with the following components:
### 1. Gitea OIDC Authentication with Vault
- Gitea is registered as an OIDC application in Vault.
- Vault issues short-lived tokens for Gitea users.
- The `gitea_oidc_auth.yml` playbook automates this setup using Playwright and OpenTofu.
- **OIDC Workflow**:
1. The `oidc_jwt_token.sh` script (base64-encoded in `secrets.vault_oauth__sh_b64`) handles the OIDC flow.
2. Gitea Act Runner executes the script to obtain an ID token from Gitea.
3. The ID token is used to authenticate with Vault and retrieve secrets.
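Step 3 can be illustrated with Vault's JWT auth login endpoint, which accepts a role and a JWT and returns a Vault token. This is a sketch only: the actual `oidc_jwt_token.sh` is not shown here, so the mount path (`jwt`), the role name, and the Vault address below are assumptions, and the request is built but deliberately not sent.

```python
import json
import urllib.request


def build_vault_jwt_login(
    vault_addr: str, role: str, id_token: str, mount: str = "jwt"
) -> urllib.request.Request:
    """Build (without sending) a login request for Vault's JWT auth method."""
    url = f"{vault_addr.rstrip('/')}/v1/auth/{mount}/login"
    body = json.dumps({"role": role, "jwt": id_token}).encode()
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    # Hypothetical address, role and token, for illustration only.
    req = build_vault_jwt_login(
        "https://vault.example.lab", "gitea_cicd_webapp", "<id-token>"
    )
    print(req.get_method(), req.full_url)
```

On success, Vault's response carries the short-lived token under `auth.client_token`, which the runner then uses to read secrets.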
### 2. Gitea Act Runner
- Deployed on `pi1` and `pi3` (not on the Gitea host, which is `pi2`).
- Uses Docker-in-Docker for job execution.
- **Custom Runner Image (`ubuntu-latest-ca`)**: Required due to the self-signed `.lab` domain. The custom image includes the local CA certificate to trust the Gitea instance (`gitea.arcodange.lab`).
- Managed via Docker Compose (`03_cicd.yml`).
### 3. ArgoCD
- Deployed on the k3s cluster (via HelmChart in `/var/lib/rancher/k3s/server/manifests/argocd.yaml`).
- Uses Gitea as the source of truth for GitOps.
- Synchronizes the `factory` repository to deploy applications.
- Configured with Traefik for TLS termination.
### 4. Vault Secrets Operator
- Deployed in the `tools` namespace.
- Manages secrets for applications deployed via ArgoCD.
- Integrates with Gitea OIDC for authentication.
- **Helm Chart Integration**:
- `VaultAuth`: Authenticates with Vault using Kubernetes service accounts.
- `VaultStaticSecret`: Retrieves static secrets (e.g., `kvv2/webapp/config`).
- `VaultDynamicSecret`: Generates dynamic secrets (e.g., PostgreSQL credentials).
### 5. Security
- **TLS**: Traefik terminates TLS using Let's Encrypt.
- **OIDC**: Gitea authentication via Vault.
- **Secrets**: Stored in Vault, injected via the Vault Secrets Operator.
## Architecture Diagram
```mermaid
%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#333333', 'edgeLabelBackground':'#f0f0f0', 'tertiaryColor': '#e67e22'}}}%%
graph TD
%% Styles
classDef gitea fill:#ffcc99,stroke:#cc9966,color:#333;
classDef argocd fill:#99ffcc,stroke:#66cc99,color:#333;
classDef vault fill:#ccccff,stroke:#6666cc,color:#333;
classDef k3s fill:#ff9999,stroke:#cc0000,color:#333;
classDef runner fill:#ffff99,stroke:#cccc00,color:#333;
%% Components
Gitea["Gitea (pi2)"]:::gitea
ArgoCD["ArgoCD (k3s)"]:::argocd
Vault["Vault (k3s/tools)"]:::vault
Runner1["Gitea Act Runner (pi1)"]:::runner
Runner2["Gitea Act Runner (pi3)"]:::runner
VaultOperator["Vault Secrets Operator (k3s/tools)"]:::vault
k3s["k3s Cluster"]:::k3s
%% Workflow
Gitea -->|OIDC Auth| Vault
Gitea -->|Trigger CI| Runner1
Gitea -->|Trigger CI| Runner2
Runner1 -->|Deploy to| k3s
Runner2 -->|Deploy to| k3s
ArgoCD -->|GitOps Sync| Gitea
ArgoCD -->|Deploy Apps| k3s
VaultOperator -->|Inject Secrets| k3s
Vault -->|Secrets| VaultOperator
%% Annotations
linkStyle 0,1,2,3,4,5,6,7,8 stroke:#999,stroke-width:1px;
```
## Consequences
### Positive
- **Automated Deployments**: ArgoCD ensures the cluster state matches Git.
- **Secure Secrets**: Vault centralizes secret management.
- **Scalable CI**: Gitea Act Runners can be added to any host.
- **OIDC Integration**: Secure authentication via Vault.
### Negative
- **Complexity**: Multiple moving parts (Gitea, ArgoCD, Vault).
- **Dependency on Vault**: If Vault fails, CI/CD may be disrupted.
- **Learning Curve**: Requires familiarity with GitOps and Vault.
## Alternatives Considered
### Alternative 1: GitHub Actions
- **Rejected**: Self-hosted Gitea aligns better with the home lab's privacy goals.
### Alternative 2: Jenkins
- **Rejected**: ArgoCD + Gitea Act Runner is lighter and more GitOps-native.
### Alternative 3: No CI/CD
- **Rejected**: Manual deployments are error-prone and unscalable.
## Sequence Diagrams
### 1. CI/CD Workflow for OpenTofu/Terraform
```mermaid
sequenceDiagram
participant Gitea
participant Runner as Gitea Act Runner (pi1/pi3)
participant Vault
participant WebApp as WebApp (k3s)
Gitea->>Runner: Trigger vault.yaml workflow
Runner->>Gitea: Execute vault_oauth__sh_b64 (OIDC)
Gitea-->>Runner: Return ID Token
Runner->>Vault: Authenticate with ID Token
Vault-->>Runner: Return Vault Token
Runner->>Runner: Run OpenTofu/Terraform
Runner->>Vault: Fetch Secrets (via Vault Action)
Vault-->>Runner: Return Secrets
Runner->>WebApp: Deploy Changes
```
### 2. Vault Secrets Operator Workflow
```mermaid
sequenceDiagram
participant ArgoCD
participant WebApp as WebApp (k3s)
participant VaultOperator as Vault Secrets Operator
participant Vault
ArgoCD->>WebApp: Deploy Helm Chart
WebApp->>VaultOperator: Create VaultAuth (K8s Auth)
VaultOperator->>Vault: Authenticate (K8s Service Account)
Vault-->>VaultOperator: Return Vault Token
WebApp->>VaultOperator: Create VaultStaticSecret (kvv2/webapp/config)
VaultOperator->>Vault: Fetch Static Secret
Vault-->>VaultOperator: Return Secret
VaultOperator->>WebApp: Inject Secret (secretkv)
WebApp->>VaultOperator: Create VaultDynamicSecret (postgres/creds/webapp)
VaultOperator->>Vault: Generate Dynamic Secret
Vault-->>VaultOperator: Return Credentials
VaultOperator->>WebApp: Inject Credentials (vso-db-credentials)
WebApp->>WebApp: Restart Pods (Rollout)
```
## Success Metrics
- Gitea Act Runners successfully execute CI jobs.
- ArgoCD synchronizes the `factory` repository without errors.
- Vault Secrets Operator injects secrets into deployed applications.


@@ -0,0 +1,152 @@
# ADR 20260407: Docker Storage Optimization for Gitea Act Runner
## Status
Proposed
## Context
The `pi3` machine (Raspberry Pi) is running both Docker and k3s, with the following storage constraints:
- Root filesystem (`/dev/mmcblk0p2`): 58G total, 89% used (6.4G free)
- External disk (`/dev/sda1`): 458G total, 22G used (413G free)
Gitea Act Runner images (`ubuntu-latest` and `ubuntu-latest-ca`) are frequently deleted, likely due to Docker's automatic garbage collection triggered by low disk space. This disrupts CI/CD pipelines.
### Current Setup
- Docker is configured via Ansible (`system_docker.yml`) using the `geerlingguy.docker` role.
- k3s is configured to use Docker as the container runtime (`--docker` flag).
- Longhorn is used for persistent storage in k3s, and we want to preserve its performance.
## Decision
We will implement a **hybrid storage strategy** to prevent Gitea Act Runner image deletion while maintaining Longhorn performance:
### Docker Storage Optimization Flow
```mermaid
%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#333333', 'edgeLabelBackground':'#f0f0f0', 'tertiaryColor': '#e67e22'}}}%%
sequenceDiagram
participant Ansible
participant Docker
participant ExternalDisk
participant GiteaRunner
participant Longhorn
Ansible->>Docker: Configure /etc/docker/daemon.json
Docker->>ExternalDisk: Use /mnt/arcodange/docker for storage
Ansible->>Docker: Restart Docker
Docker->>GiteaRunner: Pull ubuntu-latest-ca image
Ansible->>Docker: Pin image (dummy container)
Docker->>GiteaRunner: Start CI job
GiteaRunner->>Longhorn: Use persistent storage (unaffected)
Docker->>ExternalDisk: Store images (413G free)
Docker->>Docker: Skip garbage collection (pinned)
```
### 1. Pin Critical Images
Use a dummy container to pin the Gitea Act Runner images:
```yaml
# Add to system_docker.yml or a new playbook
- name: Pin Gitea Act Runner images
  community.docker.docker_container:
    name: pin-gitea-runner-ubuntu-latest-ca
    image: gitea.arcodange.lab/arcodange-org/runner-images:ubuntu-latest-ca
    state: present
    command: ["sh", "-c", "sleep infinity"]
    auto_remove: false
    restart_policy: unless-stopped
```
### 2. Configure Docker Storage with Overlay on External Disk
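Modify `/etc/docker/daemon.json` so that `data-root` points at the external disk. Docker then keeps its images, containers, and volumes under `/mnt/arcodange/docker`; only the daemon configuration file itself stays on the root filesystem: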
```json
{
  "data-root": "/mnt/arcodange/docker",
  "storage-driver": "overlay2",
  "storage-opts": ["overlay2.override_kernel_check=true"]
}
```
### 3. Ansible Implementation
Update `system_docker.yml` to:
1. Create `/mnt/arcodange/docker` if it doesn't exist.
2. Configure Docker to use the external disk.
3. Pin critical images post-installation.
```yaml
# Add to system_docker.yml tasks
- name: Ensure Docker storage directory exists on external disk
  ansible.builtin.file:
    path: /mnt/arcodange/docker
    state: directory
    mode: '0755'
    owner: root
    group: docker
- name: Configure Docker to use external storage
  ansible.builtin.copy:
    dest: /etc/docker/daemon.json
    content: |
      {
        "data-root": "/mnt/arcodange/docker",
        "storage-driver": "overlay2",
        "storage-opts": ["overlay2.override_kernel_check=true"],
        "log-driver": "json-file",
        "log-opts": {
          "max-size": "10m",
          "max-file": "5"
        }
      }
    mode: '0644'
  notify: Redémarrer Docker
- name: Pin Gitea Act Runner images
  community.docker.docker_container:
    name: "{{ item.name }}"
    image: "{{ item.image }}"
    state: present
    command: ["sh", "-c", "sleep infinity"]
    auto_remove: false
    restart_policy: unless-stopped
  loop:
    - { name: "pin-gitea-runner-ubuntu-latest", image: "gitea/runner-images:ubuntu-latest" }
    - { name: "pin-gitea-runner-ubuntu-latest-ca", image: "gitea.arcodange.lab/arcodange-org/runner-images:ubuntu-latest-ca" }
```
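The `copy` task above writes `daemon.json` wholesale. If a host already carries other daemon settings, one possible variation is to merge the new keys into the existing file instead. The sketch below only illustrates that merge on strings; the function name is hypothetical and nothing here is part of `system_docker.yml`.

```python
import json


def merge_daemon_json(existing: str, overrides: dict) -> str:
    """Merge override keys into existing daemon.json text, keeping unrelated keys."""
    current = json.loads(existing) if existing.strip() else {}
    current.update(overrides)  # shallow merge: override keys win
    return json.dumps(current, indent=2, sort_keys=True) + "\n"


if __name__ == "__main__":
    before = '{"log-driver": "json-file"}'
    print(
        merge_daemon_json(
            before,
            {"data-root": "/mnt/arcodange/docker", "storage-driver": "overlay2"},
        )
    )
```

The merge is shallow, so a nested object such as `log-opts` is replaced as a whole rather than combined key by key.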
## Consequences
### Positive
- **Prevents Image Deletion**: Critical images are pinned and won't be garbage-collected.
- **Preserves Longhorn Performance**: Longhorn continues to use the root filesystem for its operations, maintaining performance.
- **Scalable Storage**: Docker images are stored on the external disk (413G free), preventing root filesystem exhaustion.
- **No k3s Changes Required**: k3s continues to use Docker as the runtime without modification.
### Negative
- **Migration Effort**: Existing Docker data must be migrated to the external disk (one-time operation).
- **Dependency on External Disk**: If `/dev/sda1` fails, Docker will not function until the disk is remounted or the configuration is reverted.
- **Slight Performance Overhead**: Accessing images from the external disk may be slightly slower than the root filesystem (mitigated by SSD/HDD performance).
## Alternatives Considered
### Alternative 1: Increase Root Filesystem Size
- **Rejected**: The SD card is already at capacity, and expanding it is not feasible.
### Alternative 2: Disable Docker Garbage Collection
- **Rejected**: This would risk filling the root filesystem completely, causing system instability.
### Alternative 3: Use k3s Image Garbage Collection
- **Rejected**: k3s does not provide fine-grained control over image retention for non-k8s workloads (e.g., Gitea Act Runner).
### Alternative 4: Save/Load Images Manually
- **Rejected**: Manual intervention is not scalable and does not address the root cause.
## Migration Plan
1. **Backup**: Save critical images to `/mnt/arcodange`:
```bash
docker save gitea.arcodange.lab/arcodange-org/runner-images:ubuntu-latest-ca -o /mnt/arcodange/gitea-runner-backup.tar
```
2. **Update Ansible**: Apply the changes to `system_docker.yml`.
3. **Run Playbook**: Execute the playbook to reconfigure Docker.
4. **Verify**: Ensure Gitea Act Runner functions correctly post-migration.
## Success Metrics
- Gitea Act Runner images are no longer deleted between runs.
- Root filesystem usage drops below 80%.
- CI/CD pipelines complete without image pull errors.
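
As a rough check on the second metric, the figures from the Context section (58G total, 89% used) give the amount that has to leave the root filesystem. The helper below is a plain arithmetic sketch; its name is hypothetical and it ignores reserved blocks and rounding in the reported sizes.

```python
def space_to_free(total: float, used_fraction: float, target_fraction: float) -> float:
    """Amount (in the same unit as `total`) to free to reach the target usage."""
    return max(0.0, total * (used_fraction - target_fraction))


if __name__ == "__main__":
    needed = space_to_free(58, 0.89, 0.80)
    print(f"free at least {needed:.2f}G to get below 80%")
```

With these inputs the result is about 5.2G, so moving Docker's data to the external disk only meets the metric if Docker currently holds at least that much on the root filesystem.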


@@ -0,0 +1,576 @@
# ADR 20260407: Network Architecture
## Status
Proposed
## Context
The home lab requires a secure and resilient network architecture to support:
- Internal services (`.lab` domain).
- External services (`.arcodange.fr` domain).
- DNS resolution and ad-blocking (Pi-hole).
- TLS certificate management (Step CA).
- Ingress routing (Traefik).
- CDN and DDoS protection (Cloudflare).
## Decision
We will implement a **multi-layered network architecture** with the following components:
### 1. External Layer (Internet)
- **Cloudflare**: CDN, DDoS protection, and DNS for `.arcodange.fr`.
- **DuckDNS**: Dynamic DNS for external access.
- **Livebox**: ISP-provided gateway (NAT, DHCP, firewall).
### 2. Internal Layer (Home Lab)
- **Pi-hole (pi1, pi3)**: DNS sinkhole for ad-blocking and internal DNS resolution.
- **Step CA (pi1)**: Internal certificate authority for `.lab` domain.
- **Traefik (k3s)**: Ingress controller with TLS termination.
- **k3s Cluster**: Hosts internal services with Longhorn storage.
### 3. DNS Architecture
- **Pi-hole**: Primary DNS for internal clients.
- Forwards `.lab` queries to Step CA.
- Forwards external queries to Cloudflare (1.1.1.1).
- **Step CA**: Issues certificates for `.lab` services.
- **Cloudflare**: Manages `.arcodange.fr` DNS records.
### 4. Ingress and TLS
- **Traefik**: Terminates TLS for both `.lab` and `.arcodange.fr` domains.
- Uses Let's Encrypt for `.arcodange.fr`.
- Uses Step CA for `.lab`.
- **Helm Chart Annotations**:
- `traefik.ingress.kubernetes.io/router.entrypoints: websecure`
- `traefik.ingress.kubernetes.io/router.tls.certresolver: letsencrypt`
- `traefik.ingress.kubernetes.io/router.middlewares: localIp@file`
### 5. Security
- **Cloudflare Tunnel**: Securely exposes internal services without port forwarding.
- **CrowdSec**: Intrusion detection and banning.
- **Traefik Middlewares**: IP filtering, rate limiting, and authentication.
- **Cloudflare Turnstile**: CAPTCHA protection for public-facing services.
## Architecture Diagrams
### 0. High-Level Network Architecture (Architecture Beta)
```mermaid
%%{init: {'theme': 'neutral', 'themeVariables': {
'primaryColor': '#f0f0f0',
'primaryBorderColor': '#333333',
'primaryTextColor': '#333333',
'lineColor': '#333333',
'tertiaryColor': '#e67e22'
}}}%%
architectureBeta
%% External Layer
box "Internet" #f9f9f9
component Cloudflare["Cloudflare\n(CDN/DNS)"] #f9f9f9
component DuckDNS["DuckDNS\n(DDNS)"] #f9f9f9
end
%% External Gateway
box "External Gateway" #e6e6e6
component Livebox["Livebox\n(NAT/Firewall)"] #e6e6e6
end
%% Internal Layer
box "Internal Network\n(192.168.1.0/24)" #d4d4d4
%% DNS Layer
box "DNS" #ffff99
component PiHole1["Pi-hole\n(pi1)"] #ffff99
component PiHole3["Pi-hole\n(pi3)"] #ffff99
component StepCA["Step CA\n(pi1)"] #ccccff
end
%% k3s Layer
box "k3s Cluster" #ff9999
component Traefik["Traefik\n(Ingress)"] #ff9999
component CrowdSec["CrowdSec\n(Security)"] #ff9999
component Gitea["Gitea\n(pi2)"] #ffcc99
component Vault["Vault\n(Secrets)"] #ccccff
end
end
%% Connections
Cloudflare --> Livebox : "DNS"
DuckDNS --> Livebox : "DDNS"
Livebox --> PiHole1 : "NAT"
Livebox --> PiHole3 : "NAT"
Livebox --> Traefik : "NAT"
PiHole1 --> StepCA : "Forward .lab"
PiHole1 --> Cloudflare : "Forward External"
PiHole3 --> StepCA : "Forward .lab"
PiHole3 --> Cloudflare : "Forward External"
Traefik --> Cloudflare : "TLS (Let's Encrypt)"
Traefik --> StepCA : "TLS (Step CA)"
CrowdSec --> Traefik : "Ban IPs"
Traefik --> Gitea : "Route"
Traefik --> Vault : "Route"
```
### 1. High-Level Network Architecture
```mermaid
%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#333333', 'edgeLabelBackground':'#f0f0f0', 'tertiaryColor': '#f89136'}}}%%
graph TD
%% Styles
classDef internet fill:#f9f9f9,stroke:#999,color:#333;
classDef external fill:#e6e6e6,stroke:#555,color:#333;
classDef internal fill:#d4d4d4,stroke:#777,color:#333;
classDef security fill:#ff9999,stroke:#cc0000,color:#333;
classDef dns fill:#ffff99,stroke:#cccc00,color:#333;
classDef ca fill:#ccccff,stroke:#6666cc,color:#333;
%% Internet
subgraph "Internet"
Cloudflare["Cloudflare (CDN/DNS)"]:::internet
DuckDNS["DuckDNS (DDNS)"]:::internet
end
%% External Gateway
subgraph "External Gateway"
Livebox["Livebox (NAT/Firewall)"]:::external
end
%% Internal Network
subgraph "Internal Network (192.168.1.0/24)"
%% Pi-hole DNS
PiHole1["Pi-hole (pi1)"]:::dns
PiHole3["Pi-hole (pi3)"]:::dns
%% Step CA
StepCA["Step CA (pi1)"]:::ca
%% k3s Cluster
k3s["k3s Cluster"]:::internal
Traefik["Traefik (k3s)"]:::internal
CrowdSec["CrowdSec (k3s)"]:::security
%% Services
Gitea["Gitea (pi2)"]:::internal
Vault["Vault (k3s)"]:::internal
end
%% Connections
Cloudflare -->|DNS| Livebox
DuckDNS -->|DDNS| Livebox
Livebox -->|NAT| PiHole1
Livebox -->|NAT| PiHole3
Livebox -->|NAT| k3s
%% Internal DNS
PiHole1 -->|Forward .lab| StepCA
PiHole1 -->|Forward External| Cloudflare
PiHole3 -->|Forward .lab| StepCA
PiHole3 -->|Forward External| Cloudflare
%% Ingress
Traefik -->|"TLS (Let's Encrypt)"| Cloudflare
Traefik -->|"TLS (Step CA)"| StepCA
CrowdSec -->|Ban IPs| Traefik
%% Service Access
Traefik -->|Route| Gitea
Traefik -->|Route| Vault
```
### 2. DNS Resolution Flow
```mermaid
sequenceDiagram
participant Client
participant PiHole
participant StepCA
participant Cloudflare
participant ExternalDNS
Client->>PiHole: Query example.lab
PiHole->>StepCA: Forward .lab query
StepCA-->>PiHole: Return A record
PiHole-->>Client: Return response
Client->>PiHole: Query example.com
PiHole->>Cloudflare: Forward to 1.1.1.1
Cloudflare->>ExternalDNS: Resolve externally
ExternalDNS-->>Cloudflare: Return response
Cloudflare-->>PiHole: Return response
PiHole-->>Client: Return response
```
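The forwarding decision in this flow is a suffix match on the query name. A minimal sketch, assuming only the two upstreams named above; the function name and the `"internal"` label are placeholders for illustration, not Pi-hole configuration.

```python
def pick_upstream(qname: str) -> str:
    """Choose the upstream for a DNS query name, following the flow above."""
    name = qname.rstrip(".").lower()
    # Match on a label boundary so that "collab" is not mistaken for ".lab".
    if name == "lab" or name.endswith(".lab"):
        return "internal"
    return "1.1.1.1"


if __name__ == "__main__":
    for query in ("gitea.arcodange.lab", "example.com.", "collab"):
        print(query, "->", pick_upstream(query))
```

Matching on `.lab` rather than on the bare string `lab` is what keeps unrelated public names that merely end in those letters on the external path.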
### 3. Ingress and TLS Flow
```mermaid
sequenceDiagram
participant User
participant Cloudflare
participant Traefik
participant StepCA
participant Service
User->>Cloudflare: HTTPS Request (webapp.arcodange.fr)
Cloudflare->>Traefik: Forward to internal IP
Traefik->>Let's Encrypt: Request Certificate
Let's Encrypt-->>Traefik: Issue Certificate
Traefik->>Service: Route request
Service-->>Traefik: Return response
Traefik-->>Cloudflare: Return HTTPS response
Cloudflare-->>User: Return response
User->>Traefik: HTTPS Request (webapp.arcodange.lab)
Traefik->>StepCA: Request Certificate
StepCA-->>Traefik: Issue Certificate
Traefik->>Service: Route request
Service-->>Traefik: Return response
Traefik-->>User: Return HTTPS response
```
### 4. Security Flow (CrowdSec + Traefik)
```mermaid
sequenceDiagram
participant Attacker
participant Traefik
participant CrowdSec
participant BannedIPs
Attacker->>Traefik: Malicious Request
Traefik->>CrowdSec: Log suspicious activity
CrowdSec->>BannedIPs: Add IP to ban list
BannedIPs-->>Traefik: Update middleware
Traefik-->>Attacker: Block request (403)
```
## Playbook and Role Analysis
### 1. Pi-hole Deployment
- **Playbook**: `playbooks/system/pihole.yml`
- **Role**: `arcodange.factory.pihole`
- **Configuration**:
- Upstream DNS: Cloudflare (1.1.1.1) and Step CA for `.lab`.
- Blocklists: Ad-blocking and malware domains.
### 2. Step CA Deployment
- **Playbook**: `playbooks/ssl/ssl.yml`
- **Role**: `step_ca`
- **Configuration**:
- Internal CA for `.lab` domain.
- Short-lived certificates (default: 24h).
### 3. Traefik Deployment
- **Playbook**: `playbooks/system/system_k3s.yml` (via k3s)
- **Helm Chart**: `traefik` (installed via k3s)
- **Key Annotations**:
```yaml
traefik.ingress.kubernetes.io/router.entrypoints: websecure
traefik.ingress.kubernetes.io/router.tls.certresolver: letsencrypt
traefik.ingress.kubernetes.io/router.middlewares: localIp@file
```
### 4. CrowdSec Deployment
- **Playbook**: `playbooks/tools/crowdsec.yml`
- **Role**: `arcodange.factory.crowdsec`
- **Configuration**:
- Bouncer integration with Traefik.
- Custom scenarios for brute-force and bot detection.
## Consequences
### Positive
- **Resilient DNS**: Pi-hole provides ad-blocking and internal DNS resolution.
- **Secure TLS**: Step CA for internal services, Let's Encrypt for external.
- **DDoS Protection**: Cloudflare absorbs external attacks.
- **Intrusion Detection**: CrowdSec bans malicious IPs automatically.
### Negative
- **Complexity**: Multiple layers require careful configuration.
- **Single Point of Failure**: Pi-hole is critical for internal DNS.
- **Certificate Management**: Step CA requires maintenance for `.lab` domain.
## Alternatives Considered
### Alternative 1: Public DNS for `.lab`
- **Rejected**: Exposing internal domains is a security risk.
### Alternative 2: No Ad-Blocking
- **Rejected**: Pi-hole provides essential security and privacy.
### Alternative 3: Self-Signed Certificates
- **Rejected**: Step CA provides better usability with short-lived certs.
### 5. Cloudflare Turnstile + CrowdSec Flow
```mermaid
sequenceDiagram
participant User
participant Cloudflare
participant Turnstile
participant Traefik
participant CrowdSec
participant BannedIPs
User->>Cloudflare: Request protected endpoint
Cloudflare->>Turnstile: Challenge (CAPTCHA)
Turnstile-->>Cloudflare: Return token
Cloudflare->>Traefik: Forward request with token
alt Valid Token
Traefik->>Service: Route request
Service-->>Traefik: Return response
Traefik-->>Cloudflare: Return response
Cloudflare-->>User: Return success
else Invalid Token
Traefik->>CrowdSec: Log suspicious activity
CrowdSec->>BannedIPs: Add IP to ban list
BannedIPs-->>Traefik: Update middleware
Traefik-->>Cloudflare: Block request (403)
Cloudflare-->>User: Return "Access Denied"
end
```
## Deep Dive: `.lab` Domain SSL/TLS Architecture
### Overview
The `.lab` domain relies on a **zero-trust internal PKI** (Public Key Infrastructure) powered by **Step CA**, integrated with **k3s**, **Traefik**, and **cert-manager**. This section details the components, interactions, and operational workflows.
### Core Components
#### 1. **Step CA (Certificate Authority)**
- **Host**: `pi1` (primary), with standby nodes for resilience.
- **Ports**: `8443` (HTTPS), `443` (ACME).
- **Provisioners**:
- `cert-manager`: Dedicated for k3s workloads.
- `admin`: For manual certificate issuance.
- **Certificate Lifecycle**:
- **Short-lived certificates** (default: 24h).
- **Automatic renewal** via cert-manager.
- **OCSP stapling** for revocation checks.
#### 2. **cert-manager**
- **Namespace**: `cert-manager`.
- **CRDs**:
- `Certificate`: Defines desired certificates.
- `CertificateRequest`: Requests signed by Step CA.
- `ClusterIssuer`/`Issuer`: References Step CA.
- `StepClusterIssuer`: Custom resource for Step CA integration.
#### 3. **StepClusterIssuer**
- **Purpose**: Bridges cert-manager with Step CA.
- **Configuration**:
```yaml
apiVersion: certmanager.step.sm/v1beta1
kind: StepClusterIssuer
metadata:
  name: step-issuer
  namespace: cert-manager
spec:
  url: "https://ssl-ca.arcodange.lab:8443"
  caBundle: "<base64-encoded-root-ca>"
  provisioner:
    name: cert-manager
    kid: "<key-id>"
    passwordRef:
      name: step-jwk-password
      key: password
```
- **Workflow**:
1. cert-manager creates a `CertificateRequest`.
2. `StepClusterIssuer` forwards the request to Step CA.
3. Step CA signs the certificate and returns it to cert-manager.
4. cert-manager stores the certificate in a Kubernetes `Secret`.
#### 4. **Traefik Ingress Controller**
- **Namespace**: `kube-system`.
- **TLS Configuration**:
- **EntryPoints**: `websecure` (HTTPS), `web` (HTTP → redirect).
- **Certificate Resolvers**:
- `letsencrypt`: For `.arcodange.fr` (public).
- `step-ca`: For `.lab` (internal).
- **Middlewares**:
- `localIp@file`: IP allowlisting.
- `crowdsec-bouncer`: Intrusion prevention.
#### 5. **Certificate and CertificateRequest**
- **Example `Certificate` for `.lab`**:
```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: wildcard-arcodange-lab
  namespace: kube-system
spec:
  secretName: wildcard-arcodange-lab-tls
  issuerRef:
    name: step-issuer
    kind: StepClusterIssuer
    group: certmanager.step.sm
  dnsNames:
    - "*.arcodange.lab"
    - "arcodange.lab"
```
- **Generated `CertificateRequest`**:
- Automatically created by cert-manager.
- References the `StepClusterIssuer`.
- Status transitions: `Pending` → `Approved` → `Ready`.
#### 6. **k3s Cluster Integration**
- **Nodes**: `pi1` (control plane), `pi2`, `pi3` (workers).
- **Storage**: Longhorn for persistent volumes.
- **Networking**:
- **CNI**: Flannel.
- **Service Mesh**: Traefik for ingress, Linkerd (optional).
### Workflow: Certificate Issuance and Renewal
```mermaid
sequenceDiagram
participant App as Application (e.g., Gitea)
participant Cert as Certificate
participant CR as CertificateRequest
participant SCI as StepClusterIssuer
participant StepCA as Step CA
participant Secret as Kubernetes Secret
participant Traefik as Traefik
App->>Cert: Declare desired certificate
Cert->>CR: Create CertificateRequest
CR->>SCI: Forward to StepClusterIssuer
SCI->>StepCA: Sign CSR (via JWK provisioner)
StepCA-->>SCI: Return signed certificate
SCI->>Secret: Store certificate/key
Secret-->>Traefik: Mount as TLS secret
Traefik->>App: Route traffic with TLS
loop Every 2/3 of certificate lifetime
Cert->>CR: Trigger renewal
CR->>SCI: Re-sign CSR
SCI->>StepCA: Request new certificate
StepCA-->>SCI: Return signed certificate
SCI->>Secret: Update secret
end
```
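The renewal loop in the diagram fires at two thirds of the certificate lifetime. For the 24h default mentioned earlier, that works out as follows; this is a worked-arithmetic sketch with a hypothetical function name, not cert-manager code.

```python
from datetime import datetime, timedelta, timezone


def renewal_time(
    not_before: datetime,
    not_after: datetime,
    numerator: int = 2,
    denominator: int = 3,
) -> datetime:
    """Return the instant at numerator/denominator of the certificate lifetime."""
    lifetime = not_after - not_before
    # Integer arithmetic on the timedelta avoids float rounding surprises.
    return not_before + (lifetime * numerator) / denominator


if __name__ == "__main__":
    issued = datetime(2026, 1, 1, tzinfo=timezone.utc)
    expires = issued + timedelta(hours=24)
    print(renewal_time(issued, expires).isoformat())
```

A 24h certificate is therefore renewed 16 hours after issuance, leaving an 8 hour window in which Step CA can be unreachable before the certificate expires.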
### Device Trust: Adding `.lab` CA to External Devices
#### **Manual Trust Installation**
1. **Export Root CA**:
```bash
scp pi1:/home/step/.step/certs/root_ca.crt ./arcodange-lab-ca.crt
```
2. **Install on Devices**:
- **macOS**:
```bash
sudo security add-trusted-cert -d -r trustRoot -k /Library/Keychains/System.keychain ./arcodange-lab-ca.crt
```
- **Linux (Debian/Ubuntu)**:
```bash
sudo cp arcodange-lab-ca.crt /usr/local/share/ca-certificates/
sudo update-ca-certificates
```
- **Windows**:
- Import via `certmgr.msc` → **Trusted Root Certification Authorities**.
- **Android/iOS**:
- Email the `.crt` and install via device settings.
- **Raspberry Pi**:
```bash
sudo cp arcodange-lab-ca.crt /usr/local/share/ca-certificates/
sudo update-ca-certificates
```
#### **Automated Trust via Ansible**
- **Playbook**: `playbooks/system/trust_ca.yml`
- **Role**: `arcodange.factory.trust_ca`
- **Targets**: All nodes in `raspberries` group.
### Troubleshooting Common Issues
#### 1. **Certificate Not Issued**
- **Symptoms**: `CertificateRequest` stuck in `Pending`.
- **Causes**:
- Step CA unreachable.
- Incorrect `caBundle` or provisioner `kid`.
- Network policies blocking egress to Step CA.
- **Fixes**:
```bash
# Check StepClusterIssuer status
kubectl -n cert-manager describe stepclusterissuer step-issuer
# Verify Step CA connectivity
kubectl -n cert-manager logs -l app.kubernetes.io/name=step-issuer
# Test Step CA manually
step ca certificate --ca-url https://ssl-ca.arcodange.lab:8443 \
--root /home/step/.step/certs/root_ca.crt \
test.lab test.crt test.key
```
#### 2. **Traefik TLS Errors**
- **Symptoms**: `502 Bad Gateway` or TLS handshake failures.
- **Causes**:
- Missing certificate in `Secret`.
- Incorrect SNI routing.
- Expired certificates.
- **Fixes**:
```bash
# Check Traefik logs
kubectl -n kube-system logs -l app.kubernetes.io/name=traefik
# Verify certificate secret
kubectl -n kube-system get secret wildcard-arcodange-lab-tls -o yaml
# Restart Traefik
kubectl -n kube-system rollout restart deployment/traefik
```
#### 3. **Device Trust Issues**
- **Symptoms**: Browser warnings (`NET::ERR_CERT_AUTHORITY_INVALID`).
- **Causes**:
- CA not installed in device trust store.
- Clock skew (certificate validity).
- **Fixes**:
- Reinstall CA certificate.
- Sync device clock with NTP:
```bash
sudo ntpdate pool.ntp.org
```
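Clock skew matters because a certificate is rejected when the device clock falls outside its validity window. A minimal sketch of a skew check follows, assuming a reference timestamp is available from a trusted host. The `clock_skew_ok` name and the 300 second default tolerance are illustrative assumptions, not values from this setup:

```bash
# clock_skew_ok LOCAL_EPOCH REFERENCE_EPOCH [MAX_SKEW_SECONDS]
# Succeeds when the absolute difference between the two clocks is within the tolerance.
# Hypothetical helper; the 300 second default is an assumption.
clock_skew_ok() {
  local_epoch=$1
  ref_epoch=$2
  max_skew=${3:-300}
  diff=$((local_epoch - ref_epoch))
  if [ "$diff" -lt 0 ]; then
    diff=$((0 - diff))
  fi
  [ "$diff" -le "$max_skew" ]
}

# Example with fixed values: 100 seconds apart is accepted, 1000 seconds is not.
clock_skew_ok 1000 1100 && echo "clock within tolerance"
clock_skew_ok 1000 2000 || echo "clock skew too large"
```

In practice the two arguments could come from `date +%s` on the device and on a node such as `pi1`.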
### Security Considerations
#### 1. **Provisioner Security**
- **JWK Provisioner**: Encrypted with a password stored in Kubernetes `Secret`.
- **Password Rotation**:
```bash
# Rotate JWK password via Ansible
ansible-playbook playbooks/ssl/rotate_jwk_password.yml
```
#### 2. **Certificate Revocation**
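- **Passive revocation**: Open-source Step CA revokes a certificate by refusing to renew it; it does not serve OCSP responses, so a revoked certificate stays trusted by clients until it expires.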
- **Manual Revocation**:
```bash
step ca revoke <serial> --reason superseded
```
#### 3. **Network Isolation**
- **Step CA Access**: Restricted to k3s cluster IPs via firewall rules.
- **Traefik Middlewares**: Enforce IP allowlisting for internal services.
### Future Enhancements
1. **Automated Device Onboarding**:
- MDM (Mobile Device Management) integration for CA trust.
- Ansible playbook for bulk device enrollment.
2. **Step CA High Availability**:
- Multi-node Step CA with RAFT consensus.
- Automatic failover for provisioners.
3. **Certificate Transparency**:
- Log all `.lab` certificates to a private CT log.
4. **Short-Lived Certificates**:
- Reduce default TTL to 1h for critical services.
### References
- [Step CA Documentation](https://smallstep.com/docs/step-ca/)
- [cert-manager Step Issuer](https://smallstep.com/docs/step-certificates/kubernetes/)
- [Traefik TLS Configuration](https://doc.traefik.io/traefik/https/tls/)


@@ -0,0 +1,126 @@
# ADR 20260414: Internal DNS Architecture
## Status
Accepted
## Context
During the 2026-04-13 power cut incident, cluster recovery was blocked by DNS resolution failures. The investigation revealed:
1. **CoreDNS forwarding loop**: CoreDNS was configured to forward queries to `/etc/resolv.conf`, which on the node (pi3) pointed to itself (`192.168.1.203`) - a host without a running DNS service
2. **Pi-hole HA misconfiguration**: Both pi1 and pi3 run Pi-hole (pihole-FTL) but:
- pi1's `dnsmasq` service was in a **failed state** due to missing `dip` group membership
- pi3's Pi-hole was running but CoreDNS couldn't reach it due to the forwarding configuration
3. **No explicit upstream DNS**: Pi-hole instances lacked explicitly configured upstream DNS servers
The cluster's HelmChart controller requires external DNS resolution to fetch charts from `charts.longhorn.io`, making DNS a critical dependency for storage provisioning and thus the entire cluster recovery process.
## Decision
### 1. DNS Service Hierarchy
```
┌─────────────────┐     ┌─────────────────┐
│   CoreDNS Pod   │────▶│  Pi-hole (pi1)  │──┐
│  (kube-system)  │     │  Pi-hole (pi3)  │  │
└─────────────────┘     └─────────────────┘  │
                                      ┌──────────────┐
                                      │   8.8.8.8    │
                                      │   1.1.1.1    │
                                      │   8.8.4.4    │
                                      └──────────────┘
```
### 2. CoreDNS Configuration
CoreDNS will forward **all non-cluster DNS queries** to **both Pi-hole instances** in HA configuration:
```coredns
.:53 {
    errors
    health
    ready
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    hosts /etc/coredns/NodeHosts {
        ttl 60
        reload 15s
        fallthrough
    }
    prometheus :9153
    cache 30
    loop
    reload
    import /etc/coredns/custom/*.override
    import /etc/coredns/custom/*.server
    forward . 192.168.1.201:53 192.168.1.203:53
}
```
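The `forward` line is the part of this Corefile that changed after the incident. A small sketch of how it could be generated from a list of upstream addresses, rejecting loopback entries that would send queries back to the resolver itself, is shown below. The `render_forward` helper is hypothetical and is not taken from `playbooks/dns/k3s_dns.yml`:

```bash
# render_forward UPSTREAM_IP...
# Prints a CoreDNS forward directive for the given upstream resolvers on port 53.
# Fails when no upstream is given or when an upstream is a loopback address.
# Hypothetical helper for illustration only.
render_forward() {
  if [ "$#" -eq 0 ]; then
    echo "at least one upstream is required" >&2
    return 1
  fi
  line="forward ."
  for ip in "$@"; do
    case "$ip" in
      127.*)
        echo "refusing loopback upstream $ip" >&2
        return 1
        ;;
    esac
    line="$line $ip:53"
  done
  echo "$line"
}

render_forward 192.168.1.201 192.168.1.203
# prints: forward . 192.168.1.201:53 192.168.1.203:53
```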
### 3. Pi-hole HA Configuration
- **Primary**: pi1 (192.168.1.201)
- **Secondary**: pi3 (192.168.1.203)
- **Synchronization**: Gravity Sync for configuration consistency
- **Upstream DNS**: Explicitly configured to Cloudflare (1.1.1.1) and Google (8.8.8.8, 8.8.4.4)
### 4. Pi-hole DNS Service Fix
The `dnsmasq` user must be a member of the `dip` group to bind to privileged port 53:
```bash
usermod -aG dip dnsmasq
```
This is managed via Ansible in `playbooks/system/rpi.yml`.
## Consequences
### Positive
- **Resilience**: DNS resolution continues if one Pi-hole node fails
- **Consistency**: Both Pi-hole instances maintain synchronized configuration via Gravity Sync
- **Recovery**: Cluster can recover from power failures without manual DNS intervention
- **Explicit configuration**: Upstream DNS servers are explicitly defined, avoiding reliance on DHCP-provided config
### Negative
- **Complexity**: Additional Ansible tasks required to maintain DNS infrastructure
- **Dependency**: Cluster recovery depends on Pi-hole availability (mitigated by HA)
## Implementation
See related changes in:
- `playbooks/system/rpi.yml` - dnsmasq group membership fix
- `playbooks/dns/k3s_dns.yml` - CoreDNS forwarding to HA Pi-hole instances
- `playbooks/dns/roles/pihole/defaults/main.yml` - Explicit upstream DNS configuration
## Post-Implementation Notes
### Issue Encountered: dnsmasq vs pihole-FTL Port Conflict
During execution, we discovered that **dnsmasq** and **pihole-FTL** both attempt to bind to port 53. On pi1:
- pihole-FTL was running and handling DNS on port 53
- dnsmasq service was failing because port 53 was already in use
**Resolution**: The dnsmasq service on Pi-hole nodes is **not needed** when pihole-FTL is running, as pihole-FTL includes its own DNS server (dnsmasq) internally. The system dnsmasq service should remain **disabled** on Pi-hole nodes to avoid conflicts.
### Verification Commands
Check DNS resolution from cluster:
```bash
kubectl run dns-test --image=busybox:1.28 -it --rm --restart=Never -- \
nslookup charts.longhorn.io 192.168.1.201
# Check CoreDNS forward to both Pi-holes
kubectl get cm -n kube-system coredns -o yaml
# Check Pi-hole instances
ssh pi1 "dig @127.0.0.1 google.com +short"
ssh pi3 "dig @127.0.0.1 google.com +short"
```
## Related Incidents
- [2026-04-13-power-cut](../incidents/2026-04-13-power-cut/README.md) - Power cut caused DNS resolution failure, blocking Longhorn reinstall and Traefik recovery


@@ -0,0 +1,550 @@
# ADR 20260414: Longhorn PVC Recovery When Reinstalled
---
## 📋 **Executive Summary**
After the April 13, 2026 power cut incident and subsequent cluster recovery, we discovered a **critical gap** in Longhorn volume restoration. While the **raw replica data files** (`volume-head-*.img`) remain intact on disk across all nodes, Longhorn cannot automatically **re-associate** them with new Volume CRDs due to its internal engine ID naming scheme. This document explains the problem and compares five recovery approaches (Methods A to E).
---
---
## 🔍 **The Root Problem**
### **What Happened**
1. **Power cut** → Longhorn CSI driver lost connection
2. **Force-deletion of Longhorn pods** → Webhook circular dependency
3. **Nuclear cleanup** → All Longhorn CRDs (Volume, Engine, Replica) were deleted
4. **Reinstallation** → New Volume CRDs created with new engine IDs
### **Directory Structure Issue**
Longhorn stores replica data in directories named by **volume name + engine ID**:
```
/mnt/arcodange/longhorn/replicas/
├── pvc-cc8a3cbb-dbc2-47a2-a0cc-a02136122b90-cd16e459/ # ← OLD (orphaned)
│ ├── volume-head-002.img # ← Actual Traefik data (128Mi)
│ ├── volume-head-002.img.meta
│ └── volume-snap-*.img
├── pvc-cc8a3cbb-dbc2-47a2-a0cc-a02136122b90-8c7d8ab4/ # ← NEW (empty)
│ ├── volume-head-002.img # ← Empty 128Mi
│ └── volume-head-002.img.meta
└── ...
```
**The Problem:** When you recreate a Volume CRD, Longhorn generates a **new engine ID** (e.g., `8c7d8ab4`), creating a **new empty directory** instead of adopting the existing one (`cd16e459`).
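To make the naming scheme concrete, a replica directory name can be split into its volume name and trailing ID with plain parameter expansion. This is a sketch based only on the directory names shown above; the two helper names are hypothetical:

```bash
# replica_dir_volume DIR  -> prints the volume part of a replica directory name
# replica_dir_suffix DIR  -> prints the trailing ID after the last hyphen
# Hypothetical helpers; they rely only on the "<volume>-<id>" layout shown above.
replica_dir_volume() {
  dir_name=$(basename "$1")
  echo "${dir_name%-*}"
}

replica_dir_suffix() {
  dir_name=$(basename "$1")
  echo "${dir_name##*-}"
}

old=/mnt/arcodange/longhorn/replicas/pvc-cc8a3cbb-dbc2-47a2-a0cc-a02136122b90-cd16e459
replica_dir_volume "$old"   # prints pvc-cc8a3cbb-dbc2-47a2-a0cc-a02136122b90
replica_dir_suffix "$old"   # prints cd16e459
```

Grouping directories by the volume part makes it easy to spot a volume that has both an old and a new directory on the same node.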
### **Why This Matters**
| Component | Persistence | Recovery Path |
|-----------|-------------|---------------|
| **Replica `.img` files** | ✅ **Survives** on disk | Manual intervention required |
| **Volume CRD** | ❌ **Deleted** | Must recreate |
| **Engine/Replica CRDs** | ❌ **Deleted** | Auto-recreated by Longhorn |
| **Engine ID** | ❌ **Changes** | **Cannot be recovered without backup** |
**Without the original Volume CRD backup, Longhorn cannot match orphaned replica directories to new Volume CRDs.**
---
---
## 🎯 **Recovery Methods Comparison**
| Method | Complexity | Data Safety | Downtime | Best For |
|--------|------------|-------------|----------|----------|
| **[A: Manual `dd` Copy](#method-a-manual-dd-copy)** | ⭐⭐⭐⭐ | ✅✅✅✅ | Medium | Critical data, no app backup |
| **[B: Directory Rename](#method-b-directory-rename)** | ⭐⭐⭐ | ✅✅ | Low | Small volumes, no Rebuilding replicas |
| **[C: Fresh Volume + App Restore](#method-c-fresh-volume--app-restore)** | ⭐⭐ | ✅✅✅✅✅ | Low | Non-critical data, app backups exist |
| **[D: Block-Device Injection (Automated)](#method-d-block-device-injection-automated)** | ⭐⭐⭐ | ✅✅✅✅ | Medium | **Recommended — any volume, no dir swap needed** |
| **[E: Longhorn Google Storage Restore](#method-e-longhorn-google-storage-restore)** | ⭐⭐ | ✅✅✅✅✅ | Low | Volumes with Longhorn backup configured |
**Method B was proven risky** (2026-04-13 recovery): Longhorn reconciliation finds `Dirty: true`
metadata + a clean empty pi1 replica → silently rebuilds from the empty source, destroying data.
Use Method D for any volume larger than ~128Mi or with Rebuilding replicas.
---
---
## 🛠️ **Method A: Manual `dd` Copy**
### **Concept**
Manually copy the data from the orphaned `.img` file to the new replica directory that Longhorn created for the new Volume CRD.
### **Prerequisites**
- Root access to all nodes
- Volume CRD already recreated (with new engine ID)
- Longhorn has created new empty replica directories
- `dd` and `qemu-img` tools available
### **Steps**
```bash
# 1. Identify source (old data) and destination (new empty)
SOURCE_NODE=pi2
SOURCE_DIR=/mnt/arcodange/longhorn/replicas/pvc-cc8a3cbb-dbc2-47a2-a0cc-a02136122b90-cd16e459
SOURCE_IMG=$(ssh "$SOURCE_NODE" "ls $SOURCE_DIR/volume-head-*.img | head -1")
DEST_DIRS=(
  pi1:/mnt/arcodange/longhorn/replicas/pvc-cc8a3cbb-dbc2-47a2-a0cc-a02136122b90-8c7d8ab4
  pi2:/mnt/arcodange/longhorn/replicas/pvc-cc8a3cbb-dbc2-47a2-a0cc-a02136122b90-8c7d8ab4
  pi3:/mnt/arcodange/longhorn/replicas/pvc-cc8a3cbb-dbc2-47a2-a0cc-a02136122b90-8c7d8ab4
)
# 2. Copy data to each node
# The image only exists on SOURCE_NODE, so stream it over SSH to each destination.
for DEST in "${DEST_DIRS[@]}"; do
  NODE=${DEST%%:*}
  DEST_PATH=${DEST#*:}  # do not name this PATH: that would break command lookup
  ssh "$SOURCE_NODE" "sudo dd if=$SOURCE_IMG bs=4M" \
    | ssh "$NODE" "sudo mkdir -p $DEST_PATH && sudo dd of=$DEST_PATH/volume-head-002.img bs=4M"
done
# 3. Restart Longhorn engine pods to pick up new data
kubectl delete pod -n longhorn-system -l longhorn.io/component=engine
# 4. Verify data is accessible
kubectl get volume -n longhorn-system pvc-cc8a3cbb-dbc2-47a2-a0cc-a02136122b90
# Should show: state=attached, robustness=healthy
```
### **Pros**
- ✅ Guaranteed data recovery
- ✅ Works for any volume size
- ✅ Preserves all snapshots and metadata
### **Cons**
- ⚠️ Requires manual intervention on each node
- ⚠️ Must know source and destination paths
- ⚠️ Risk of data corruption if `dd` fails mid-copy
- ⚠️ Volume must be in detached state during copy
### **Risk Mitigation**
- Verify checksums after copy: `sha256sum /path/to/image.img`
- Copy to one node at a time, verify between each
- Use `pv` for progress: `ssh $SOURCE_NODE "sudo cat $SOURCE_IMG" | pv | ssh $NODE "sudo dd of=<dest-dir>/volume-head-002.img bs=4M"`
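
The checksum comparison can be sketched as a small helper that takes the `sha256sum` output line from each node and compares the hash fields. The `checksums_match` name is hypothetical, and gathering the two lines (for example over SSH) is left to the caller:

```bash
# checksums_match SOURCE_LINE DEST_LINE
# Each argument is one line of sha256sum output ("<hash>  <file>").
# Succeeds only when both hashes are present and identical.
# Hypothetical helper for illustration only.
checksums_match() {
  src_hash=${1%% *}
  dst_hash=${2%% *}
  [ -n "$src_hash" ] && [ "$src_hash" = "$dst_hash" ]
}

# Example with placeholder hashes rather than real image digests.
if checksums_match "abc123  volume-head-002.img" "abc123  volume-head-002.img"; then
  echo "copy verified"
fi
```

Running the check after each node's copy, before moving on to the next node, keeps a failed transfer from spreading.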
---
---
## 🏷️ **Method B: Directory Rename**
### **Concept**
Rename the orphaned replica directory to match the **engine ID** that Longhorn expects for the new Volume CRD.
### **Prerequisites**
- Volume CRD already recreated
- Longhorn has created engine CRDs (check: `kubectl get engines -n longhorn-system`)
- Must act quickly before Longhorn initializes new empty replicas
### **Steps**
```bash
# 1. Find the new engine ID for the volume
ENGINE=$(kubectl get engines -n longhorn-system -l longhorn.io/volume=pvc-cc8a3cbb-dbc2-47a2-a0cc-a02136122b90 -o jsonpath='{.items[0].metadata.name}')
# Example: pvc-cc8a3cbb-dbc2-47a2-a0cc-a02136122b90-e-0
ENGINE_ID=${ENGINE#pvc-cc8a3cbb-dbc2-47a2-a0cc-a02136122b90-} # Extract suffix: e-0
# But the directory uses a different format...
# 2. Check actual directory names
kubectl get replicas -n longhorn-system | grep pvc-cc8a
# Output: pvc-cc8a3cbb-dbc2-47a2-a0cc-a02136122b90-r-8c7d8ab4
# 3. Rename on the node where orphaned data exists
# .spec.nodeID is the node the new replica is scheduled on
REPLICA_NODE=$(kubectl get replicas -n longhorn-system pvc-cc8a3cbb-dbc2-47a2-a0cc-a02136122b90-r-8c7d8ab4 -o jsonpath='{.spec.nodeID}')
ssh "$REPLICA_NODE" "sudo mv /mnt/arcodange/longhorn/replicas/pvc-cc8a3cbb-dbc2-47a2-a0cc-a02136122b90-cd16e459 \
/mnt/arcodange/longhorn/replicas/pvc-cc8a3cbb-dbc2-47a2-a0cc-a02136122b90-8c7d8ab4"
# 4. Restart the replica pod (a label selector replaces the invalid JSONPath filter)
kubectl delete pod -n longhorn-system -l longhorn.io/replica=pvc-cc8a3cbb-dbc2-47a2-a0cc-a02136122b90
```
### **Pros**
- ✅ Fastest method
- ✅ No data copying required
- ✅ Preserves all existing data and snapshots
### **Cons**
- ⚠️ **High risk of mismatch** - wrong directory rename = data loss
- ⚠️ Must identify correct engine ID for each node
- ⚠️ Replica directories exist on multiple nodes - must rename on ALL
- ⚠️ Longhorn may have already initialized new empty replicas
### **Critical Warning**
**Each volume has replicas on ALL nodes.** You must:
1. Identify which node has which orphaned directory
2. Rename each to match the corresponding new engine's expected path
3. Ensure consistency across all nodes
**Example for pvc-cc8a:**
```bash
# Orphaned dirs:
# pi2: pvc-cc8a...-cd16e459
# pi3: pvc-cc8a...-011b54b3
# New engine paths (from kubectl get replicas):
# pi1: pvc-cc8a...-r-8c7d8ab4
# pi2: pvc-cc8a...-r-32aa3e1e
# pi3: pvc-cc8a...-r-3e84c460
# Must rename EACH orphaned dir to match new engine on SAME node
```
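Because a wrong rename destroys data, it can help to print the planned commands before running anything. The sketch below derives the target directory from a replica CR name by dropping the `-r-` marker, as in the example above, and only echoes the resulting `mv`. Both helper names are hypothetical and nothing is executed on the nodes:

```bash
# replica_cr_to_dir REPLICA_CR_NAME
# Converts "<volume>-r-<id>" into the directory name "<volume>-<id>".
replica_cr_to_dir() {
  case "$1" in
    *-r-*) echo "${1%-r-*}-${1##*-r-}" ;;
    *) echo "not a replica CR name: $1" >&2; return 1 ;;
  esac
}

# plan_rename NODE ORPHAN_DIR REPLICA_CR_NAME
# Prints (does not run) the rename for one node. Hypothetical dry-run helper.
plan_rename() {
  node=$1
  orphan_dir=$2
  replica_cr=$3
  base=/mnt/arcodange/longhorn/replicas
  new_dir=$(replica_cr_to_dir "$replica_cr") || return 1
  echo "ssh $node sudo mv $base/$orphan_dir $base/$new_dir"
}

vol=pvc-cc8a3cbb-dbc2-47a2-a0cc-a02136122b90
plan_rename pi2 "$vol-cd16e459" "$vol-r-32aa3e1e"
plan_rename pi3 "$vol-011b54b3" "$vol-r-3e84c460"
```

Reviewing the printed plan against `kubectl get replicas` output is a cheap guard against pairing a directory with a replica on a different node.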
---
---
## 🆕 **Method C: Fresh Volume + App Restore** *(Recommended for Traefik)*
### **Concept**
1. Let Longhorn create a **new empty volume** for the PVC
2. Restore the **application data** (Traefik's `acme.json`) from application-level backups
### **Prerequisites**
- Application-level backup exists (e.g., Traefik config, certificates)
- Data is non-critical or easily restorable
- Storage requirements are small (128Mi for Traefik)
### **Steps**
```bash
# 1. Delete the problematic Volume CRD (if any)
kubectl delete volume -n longhorn-system pvc-cc8a3cbb-dbc2-47a2-a0cc-a02136122b90 --ignore-not-found
# 2. Delete the PVC
kubectl delete pvc -n kube-system traefik
# 3. Let StorageClass provision a fresh volume
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: traefik
  namespace: kube-system
spec:
  accessModes: [ReadWriteOnce]
  resources: {requests: {storage: 128Mi}}
  storageClassName: longhorn
  volumeMode: Filesystem
EOF
# 4. Wait for PV to be provisioned
kubectl wait --for=jsonpath='{.status.phase}'=Bound pvc -n kube-system traefik
# 5. Restore Traefik data from backup
BACKUP_FILE="/path/to/traefik-backup/acme.json"
kubectl cp $BACKUP_FILE kube-system/traefik-XXXXXX-XXXX:/data/acme.json
kubectl exec -n kube-system traefik-XXXXXX-XXXX -- chown 65532:65532 /data/acme.json
kubectl exec -n kube-system traefik-XXXXXX-XXXX -- chmod 600 /data/acme.json
```
### **Traefik-Specific Recovery**
For Traefik, the critical data is:
- `/data/acme.json` - TLS certificates obtained from Let's Encrypt
- `/data/tls.yml` - (if used)
- Secrets in Kubernetes (separate from PVC)
**Backup locations to check:**
```bash
# Check if we have Traefik data backups
ssh pi1 "ls -la /home/pi/arcodange/backups/traefik/ 2>/dev/null || echo 'No backup found'"
# Check ArgoCD apps (if Traefik was deployed via GitOps)
kubectl get app -n argocd | grep traefik
```
### **Pros**
- ✅ **Simplest and safest** method
- ✅ No risk of Longhorn directory mismatches
- ✅ Works even without Longhorn CRD backups
- ✅ Verifiable - you can confirm data was restored
- ✅ Clean state - no orphaned directories
### **Cons**
- ⚠️ Requires application-level backups
- ⚠️ TLS certificates may have expired (need to re-issue)
---
---
## 🏆 **Recommendation: Method C for Traefik**
### **Why Method C is Best for This Case**
| Factor | Assessment |
|--------|------------|
| **Volume Size** | 128Mi (small) |
| **Data Criticality** | TLS certs can be re-generated |
| **Backup Availability** | Likely exists in ArgoCD/Git |
| **Complexity** | Low |
| **Risk** | Minimal |
| **Time Required** | ~5 minutes |
### **Data Loss Assessment for Traefik**
The **worst case** (no Traefik backup):
- TLS certificates will be **re-issued** automatically by cert-manager + Let's Encrypt
- No permanent data loss - certificates are ephemeral
- Client impact: Brief TLS warning during re-issuance (~1-2 minutes)
**Verdict:** 🟢 **Method C is the safest and most practical approach.**
---
## 🔧 **Prevention: What We Must Fix**
### **1. Update Backup Playbook** (`playbooks/backup/k3s_pvc.yml`) ✅ Done 2026-04-16
`backup_cmd` now captures:
1. All PersistentVolumes (PV)
2. All PersistentVolumeClaims (PVC)
3. **All Longhorn Volumes** (critical — enables fast restore via `kubectl apply` instead of block-device injection)
4. All Longhorn Settings (backup target configuration)
### **2. Test Backups Regularly**
```bash
# Monthly test: Restore a non-critical volume
# Pick a test volume, delete it, restore from backup
kubectl delete volume -n longhorn-system <test-volume>
kubectl apply -f <backup-file>
kubectl get volume -n longhorn-system <test-volume> -w
```
### **3. Validate Backup Files**
```bash
# Check backup contains Longhorn resources
grep "longhorn.io/v1beta2" /path/to/backup-*.volumes
grep "kind: Volume" /path/to/backup-*.volumes
```
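The two `grep` checks can be wrapped in a helper that also rejects an empty file, which is the failure mode seen during the incident. This is a sketch under the assumption that the backup is a plain YAML file; `validate_backup` is a hypothetical name and is not part of `playbooks/backup/k3s_pvc.yml`:

```bash
# validate_backup FILE
# Succeeds when the file is non-empty and mentions both the Longhorn API version
# and at least one Volume object. Hypothetical helper for illustration only.
validate_backup() {
  file=$1
  if [ ! -s "$file" ]; then
    echo "backup is empty or missing: $file" >&2
    return 1
  fi
  if ! grep -qF 'longhorn.io/v1beta2' "$file"; then
    echo "no Longhorn apiVersion found in $file" >&2
    return 1
  fi
  if ! grep -qF 'kind: Volume' "$file"; then
    echo "no Volume objects found in $file" >&2
    return 1
  fi
  return 0
}
```

A call such as `validate_backup /path/to/backup.volumes` could run at the end of the backup job so that an unusable backup fails loudly. The path here is a placeholder.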
### **4. Document Recovery Procedure**
- [ ] Create `docs/admin/longhorn-recovery.md` with these steps
- [ ] Add to team runbook
- [ ] Include in incident response training
---
## 📊 **Test Scenario: Battle Testing PVC Recovery**
### **Test Setup**
```bash
# 1. Create a test namespace
kubectl create ns longhorn-test
# 2. Create a test PVC
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-longhorn-recovery
  namespace: longhorn-test
  labels:
    purpose: test
spec:
  accessModes: [ReadWriteOnce]
  resources: {requests: {storage: 1Gi}}
  storageClassName: longhorn
EOF
# 3. Deploy a test pod to write data
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: test-writer
  namespace: longhorn-test
spec:
  containers:
    - name: writer
      image: alpine
      command: [sh, -c, "echo 'test data for recovery' > /data/testfile.txt && echo 'more data' >> /data/testfile.txt && tail -f /dev/null"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: test-longhorn-recovery
EOF
# 4. Write and verify data
kubectl exec -n longhorn-test test-writer -- cat /data/testfile.txt
# Should show: "test data for recovery\nmore data"
# 5. Backup everything
kubectl get -A pv,pvc -o yaml > /tmp/test-backup-pv-pvc.yaml
echo '---' >> /tmp/test-backup-pv-pvc.yaml
kubectl get -A volumes.longhorn.io -o yaml >> /tmp/test-backup-pv-pvc.yaml
echo '---' >> /tmp/test-backup-pv-pvc.yaml
kubectl get -A settings.longhorn.io -o yaml >> /tmp/test-backup-pv-pvc.yaml
```
### **Test Execution: Simulate Disaster**
```bash
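# 6. Simulate disaster - delete everything
# Record the volume name first: it cannot be read once the PVC is gone.
VOLUME_NAME=$(kubectl get pvc -n longhorn-test test-longhorn-recovery -o jsonpath='{.spec.volumeName}')
# Delete the pod before the PVC, otherwise the PVC deletion waits on the pod.
kubectl delete pod -n longhorn-test test-writer
kubectl delete pvc -n longhorn-test test-longhorn-recovery
kubectl delete volume -n longhorn-system "$VOLUME_NAME"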
# 7. Restore from backup
kubectl apply -f /tmp/test-backup-pv-pvc.yaml
# 8. Verify recovery
kubectl get pvc -n longhorn-test test-longhorn-recovery
kubectl get volumes -n longhorn-system | grep test-longhorn-recovery
# 9. Deploy test reader pod
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: test-reader
  namespace: longhorn-test
spec:
  containers:
    - name: reader
      image: alpine
      command: [sh, -c, "cat /data/testfile.txt && tail -f /dev/null"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: test-longhorn-recovery
EOF
# 10. Check if data is recovered
kubectl logs -n longhorn-test test-reader
# Should show: "test data for recovery\nmore data"
```
### **Expected Results**
| Test Step | Pass Criteria |
|-----------|---------------|
| Volume CRD restored | `kubectl get volumes` shows the test volume |
| PVC bound | `kubectl get pvc` shows status=Bound |
| Data accessible | Test reader pod shows original data |
### **Test Cleanup**
```bash
kubectl delete ns longhorn-test
```
---
---
---
## 🛠️ **Method D: Block-Device Injection (Automated)**
### **Concept**
Bypass Longhorn's replica reconciliation entirely. Create a fresh Volume CRD, attach it in
maintenance mode, then inject the recovered filesystem directly into the live block device via
`rsync`. The old replica dirs are never renamed or touched — the data is copied into the new
Longhorn-managed volume.
### **Implementation**
See `playbooks/recover/longhorn_data.yml` — a 9-phase Ansible playbook that automates the full
sequence for one or more volumes in a single run.
### **Key Steps**
```
Phase 0: Auto-discover best replica dir (skip Rebuilding:true, rank by actual disk usage)
Phase 1: Backup untouched replica dir
Phase 2: Merge sparse snapshot+head layers → single flat image (merge-longhorn-layers.py)
Phase 3: Create Longhorn Volume CRD, wait for replicas
Phase 4: Scale down workload
Phase 5: Attach via VolumeAttachment maintenance ticket
Phase 6: mkfs.ext4 + mount + rsync from merged image
Phase 7: Remove maintenance ticket
Phase 8: Recreate PV (Retain, no claimRef) + PVC (volumeName pinned)
Phase 9: Scale up, wait readyReplicas ≥ 1
```
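Phase 0's selection rule (skip replicas marked as rebuilding, then prefer the one using the most disk) can be sketched as a filter over a simple text listing. The input format of `<dir> <rebuilding> <used_bytes>` per line and the `pick_best_replica` name are assumptions made for this example; the playbook's actual implementation may differ:

```bash
# pick_best_replica
# Reads lines of "<dir> <rebuilding:true|false> <used_bytes>" on stdin and prints
# the directory with the largest usage among replicas that are not rebuilding.
# Exits non-zero when no candidate qualifies. Hypothetical helper.
pick_best_replica() {
  awk '
    $2 == "false" && ($3 + 0) > best { best = $3 + 0; dir = $1 }
    END { if (dir == "") exit 1; print dir }
  '
}

# Example listing with made-up sizes: the rebuilding replica is ignored
# even though it reports the largest usage.
printf '%s\n' \
  'pi1:/replicas/a true 900' \
  'pi2:/replicas/b false 500' \
  'pi3:/replicas/c false 700' | pick_best_replica
```

Ranking by actual usage rather than apparent file size matters because replica images are sparse files.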
### **Usage**
```bash
ansible-playbook -i inventory/hosts.yml playbooks/recover/longhorn_data.yml \
-e @playbooks/recover/longhorn_data_vars.yml
```
Vars file format:
```yaml
longhorn_recovery_volumes:
  - pv_name: pvc-abc123
    pvc_name: myapp-data
    namespace: myapp
    size_bytes: "134217728"
    size_human: 128Mi
    access_mode: ReadWriteOnce
    workload_kind: Deployment
    workload_name: myapp
    # source_node and source_dir are auto-discovered if omitted
    verify_cmd: ""
```
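The vars file carries the size twice, as `size_bytes` and `size_human`, so the two values must agree. A small conversion sketch for whole-number binary suffixes is shown below. The `to_bytes` helper is hypothetical and handles only `Ki`, `Mi`, `Gi` and plain integers:

```bash
# to_bytes SIZE
# Converts a Kubernetes-style binary quantity (whole numbers only) into bytes.
# Hypothetical helper; fractional values and other suffixes are not handled.
to_bytes() {
  case "$1" in
    *Ki) echo $(( ${1%Ki} * 1024 )) ;;
    *Mi) echo $(( ${1%Mi} * 1024 * 1024 )) ;;
    *Gi) echo $(( ${1%Gi} * 1024 * 1024 * 1024 )) ;;
    *[!0-9]*|'') echo "unsupported size: $1" >&2; return 1 ;;
    *) echo "$1" ;;
  esac
}

to_bytes 128Mi   # prints 134217728
to_bytes 1Gi     # prints 1073741824
```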
### **Pros**
- ✅ Fully automated — handles all phases including PV/PVC recreation
- ✅ Auto-discovers best replica (skips Rebuilding dirs)
- ✅ Idempotent — safe to re-run (skips backup/merge if already done)
- ✅ Works for RWO and RWX volumes
### **Cons**
- ⚠️ Requires ~2× volume size in temporary disk space for merged image
- ⚠️ The new volume has 3 fresh replicas (not the original topology) — Longhorn will resync
---
---
## 🗄️ **Method E: Longhorn Google Storage Restore**
### **Concept**
Some volumes are configured with Longhorn's built-in backup feature targeting a Google Storage
bucket. For those volumes, a Longhorn backup can be restored into a new volume without needing
the raw replica files.
### **Applicable Volumes**
- `backups-rwx` (`pvc-efda1d2f`) — the cluster backup volume itself has a Longhorn GCS backup configured
### **When to use**
Use when:
- The local replica dirs are missing or corrupted (Method D cannot be used)
- A clean point-in-time restore is preferred over a raw replica merge
### **Status**
A playbook for this method (`playbooks/recover/longhorn_gcs_restore.yml`) is **planned but not
yet implemented**. In the 2026-04-13 incident, `backups-rwx` was successfully recovered via
Method D (local replica merge), so Method E was not needed.
When the playbook is implemented, it will use `kubectl apply` of a `BackupVolume` + `Backup`
restore CR pointing to the GCS bucket configured in Longhorn settings.
---
---
## 📚 **References**
- [Longhorn Documentation: Disaster Recovery](https://longhorn.io/docs/1.6.0/deploy/uninstall/disaster-recovery/)
- [Longhorn Volume CRD Spec](https://github.com/longhorn/longhorn/blob/master/types/types.go)
- [Original Issue: Longhorn GitHub #4837](https://github.com/longhorn/longhorn/issues/4837) (Replica orphan handling)
- [Related ADR: Internal DNS Architecture](./20260414-internal-dns-architecture.md)
- [Related Incident: 2026-04-13 Power Cut](../incidents/2026-04-13-power-cut/README.md)
---
---
*Document created: 2026-04-14*
*Last updated: 2026-04-15*
*Status: Method D (block-device injection) implemented and battle-tested on 5 volumes (2026-04-14/15)*


@@ -0,0 +1,420 @@
---
title: Power Cut - Longhorn Storage System Failure
incident_id: 2026-04-13-001
date: 2026-04-13
time_start: 15:23:57 UTC
time_end: "2026-04-15 (ongoing — Vault/ERP manual recovery deferred)"
status: Mostly Resolved
severity: SEV-1
tags:
- kubernetes
- longhorn
- storage
- k3s
- power-cut
- csi-driver
- block-device-recovery
---
# Power Cut - Longhorn Storage System Failure
## Summary
A power cut caused a cascading failure of the Longhorn distributed storage system in the k3s cluster. The Longhorn CSI driver (`driver.longhorn.io`) lost its registration with kubelet, preventing all Persistent Volume Claims (PVCs) from mounting. This affected ~43 pods across 12 namespaces, including critical infrastructure like Traefik ingress controller, application pods, and monitoring tools.
The actual volume data stored in Longhorn replicas at `/mnt/arcodange/longhorn/replicas/` on each node **remains intact**. Recovery efforts are focused on restoring CSI driver registration and Longhorn manager functionality.
## Impact
### Affected Services
- **Critical**: Longhorn storage system (all CSI components)
- **Critical**: Traefik ingress controller (cannot mount PVC)
- **High**: Application pods using Longhorn PVCs (cms, webapp, erp, clickhouse, etc.)
- **High**: Tool pods (grafana, prometheus, hashicorp-vault, redis, crowdsec)
- **Medium**: Docker storage corruption on nodes (overlay2)
- **Low**: NFS backup mount unavailable
### User Impact
- External access to services via Traefik: **DOWN**
- Gitea registry image pulls: **FAILING**
- Persistent data access: **DEGRADED** (data exists but inaccessible)
- Monitoring dashboards: **DOWN**
### Metrics
- **Failed Pods**: 43 pods in error state (CrashLoopBackOff, Error, ImagePullBackOff)
- **Healthy Pods**: ~37 pods running
- **Longhorn Pods**: 25 total, ~12 currently healthy
- **Nodes**: 3/3 Ready (pi1 control-plane, pi2, pi3)
## Component Roles
### Longhorn Components
| Component | Role | Current Status | Importance |
|-----------|------|----------------|------------|
| **longhorn-manager** | Orchestrates Longhorn volumes, handles volume operations | 2/3 running, 1 partial | CRITICAL |
| **longhorn-driver-deployer** | Deploys the CSI driver to nodes | Init:0/1 (BLOCKED) | CRITICAL |
| **longhorn-csi-plugin** | CSI plugin daemonset - handles node-level CSI operations | 0/3 Error | CRITICAL |
| **csi-attacher** | Handles volume attachment to nodes | 2/3 running, 1 Error | CRITICAL |
| **csi-provisioner** | Creates volumes from PVC requests | 2/3 running, 1 Error | CRITICAL |
| **csi-resizer** | Handles volume resizing | 1/3 running, 2 Error | HIGH |
| **csi-snapshotter** | Handles volume snapshots | 2/3 running, 1 Error | MEDIUM |
| **engine-image** | Pulls and manages engine binaries | 3/3 Running | HIGH |
| **longhorn-ui** | Web UI for Longhorn management | 0/2 CrashLoopBackOff | MEDIUM |
| **rwx-nfs** | NFS server for backup volume | 0/1 ContainerCreating | MEDIUM |
| **share-manager** | Manages NFS shares for volumes | 0/2 Error | MEDIUM |
### Other Affected Components
| Component | Role | Dependencies | Status |
|-----------|------|--------------|--------|
| **Traefik** | Ingress controller, routes external traffic | Requires PVC for certs | Error (cannot mount PVC) |
| **coredns** | Cluster DNS | Docker storage | Crashing (overlay2 corruption) |
| **svclb-traefik** | Service load balancer for Traefik | Docker storage | Crashing (overlay2 corruption) |
| **Application Pods** | Various services (cms, webapp, erp, etc.) | Longhorn PVCs | Error/ImagePullBackOff |
## Timeline
| Time (UTC) | Event | Owner | Notes |
|------------|-------|-------|-------|
| ~15:23 | Power cut occurred | - | Cluster lost power |
| 15:23:57 | Incident detection started | Mistral Vibe | Initial assessment began |
| 15:24:05 | Baseline documented | Mistral Vibe | 43 pods in error, Longhorn down |
| 15:24:10 | Root cause identified | Mistral Vibe | CSI driver `driver.longhorn.io` not registered |
| 15:24:30 | Recovery plan formulated | Mistral Vibe | HelmChart manifest touch, then pod deletion |
| 15:24:50 | Step 1: Touch longhorn-install.yaml | Mistral Vibe | Manifest timestamp updated on pi1 |
| 15:25:50 | Step 1 outcome: Insufficient | Mistral Vibe | Only 1 pod affected, CSI still down |
| 15:32:15 | Step 2: Delete all longhorn-system pods | Mistral Vibe | Force deleted 24 pods — created webhook circular dependency |
| 15:32:30 | Step 2 outcome: Partial recovery | Mistral Vibe | Managers recovering, CSI still failing |
| 16:15:00 | Root cause 2 identified | Mistral Vibe | Webhook circular dependency — decided nuclear cleanup |
| 16:30:00 | Backups secured | Mistral Vibe | PV/PVC and Longhorn CRDs backed up to pi1 |
| 16:35:00 | Backup script bug fixed | Claude Code | `backup_cmd` fixed to produce valid YAML |
| 17:00:00 | Nuclear cleanup executed | Claude Code | Removed all Longhorn CRDs, PVC finalizers, restarted k3s |
| 17:08:00 | Longhorn namespace deleted | Claude Code | Clean slate confirmed |
| 17:09:00 | Longhorn reinstall started | Claude Code | `playbooks/recover/longhorn.yml` run on pi1 |
| 17:30:00 | Docker config corruption found | Claude Code | daemon.json had Python string not JSON |
| 17:35:00 | Docker config fixed | Claude Code | Valid JSON deployed to all nodes |
| 17:50:00 | DNS failure identified | Claude Code | CoreDNS cannot resolve external domains |
| ~19:00 | DNS fixed | Claude Code | Pi-hole dnsmasq group + CoreDNS upstream config |
| ~19:30 | Longhorn reinstall completed | Claude Code | All Longhorn pods Running, CSI registered |
| 2026-04-14 00:00 | PVC recovery work started | Claude Code | Block-device recovery approach developed |
| 2026-04-14 | Traefik recovered | Claude Code | Simple PV recreation (no data loss for certs) |
| 2026-04-14 | url-shortener recovered | Claude Code | Method B (dir rename) + PV/PVC recreate |
| 2026-04-14 | Block-device recovery developed | Claude Code | `merge-longhorn-layers.py` + 9-phase playbook |
| 2026-04-14 | Clickhouse recovered | Claude Code | `longhorn_data.yml` playbook — first automated run |
| 2026-04-15 | Automated recovery for 4 volumes | Claude Code | prometheus, alertmanager, redis, backups-rwx |
| 2026-04-15 | Vault/ERP recovery deferred | - | Too sensitive for automated approach, manual later |
## Root Cause Analysis
### Primary Root Cause
**Power cut caused Longhorn CSI driver registration to be lost.**
The Longhorn CSI driver (`driver.longhorn.io`) is registered with the kubelet on each node. When the power cut occurred:
1. K3s/kubelet processes crashed
2. Longhorn manager pods crashed
3. CSI driver registration was lost
4. On restart, the Longhorn pods tried to come back up, but:
   - The `longhorn-driver-deployer` pod has an init container (`wait-longhorn-manager`) that waits for the managers to be ready
   - Longhorn managers were slow to recover (some were still in CrashLoopBackOff)
   - CSI pods (attacher, provisioner, resizer, snapshotter) cannot start without the CSI socket at `/var/lib/kubelet/plugins/driver.longhorn.io/csi.sock`
   - Custom Resource Definitions (Volumes, Replicas, etc.) exist, but the CSI driver cannot communicate with them
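The registration state described above can be checked by comparing each node's `CSINode` object against the expected driver name. The following is a minimal sketch, not part of the original runbook: `nodes_missing_driver` is a hypothetical helper name, and it only parses text piped in from `kubectl`.

```bash
#!/usr/bin/env bash
# Hypothetical helper (an assumption, not taken from the incident playbooks).
# stdin: one line per node, "<node-name><TAB><space-separated CSI driver names>"
# stdout: every node whose driver list does not contain the requested driver.
nodes_missing_driver() {
  local driver="$1"
  awk -F'\t' -v want="$driver" '
    {
      found = 0
      n = split($2, names, " ")
      for (i = 1; i <= n; i++) {
        if (names[i] == want) found = 1
      }
      if (!found) print $1
    }'
}
```

It could be fed with `kubectl get csinode -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.drivers[*].name}{"\n"}{end}' | nodes_missing_driver driver.longhorn.io`. An empty result would suggest that every node lists the driver.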
### Secondary Issues
1. **Docker overlay2 corruption**: Docker storage at `/mnt/arcodange/docker/overlay2/` was corrupted on at least pi1, affecting coredns and svclb-traefik pods
2. **NFS backup mount unavailable**: The Longhorn share-manager pod (which exports NFS) is in Error state, making `/mnt/backups/` inaccessible
3. **Backup scripts bug**: The `backup.volumes` file at `/opt/k3s_volumes/backup.volumes` is empty due to a script formatting bug
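An empty `backup.volumes` file is easy to miss until a restore is needed. A small guard such as the one below could be run right after the backup job. It is only a sketch: `check_backup_file` is a hypothetical name, and the `kind:` test is a rough heuristic rather than real YAML validation.

```bash
#!/usr/bin/env bash
# Hypothetical guard (assumption: the backup file is kubectl-style YAML,
# so at least one "kind:" line is expected in a usable backup).
check_backup_file() {
  local file="$1"
  if [ ! -s "$file" ]; then
    echo "backup file missing or empty: $file" >&2
    return 1
  fi
  if ! grep -q '^[[:space:]]*kind:' "$file"; then
    echo "no Kubernetes objects found in: $file" >&2
    return 1
  fi
  return 0
}
```

For example, `check_backup_file /opt/k3s_volumes/backup.volumes || exit 1` would make the backup script fail loudly instead of leaving an empty file behind.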
### Failure Propagation
```mermaid
%%{init: { 'theme': 'forest' }}%%
graph TD
A[Power Cut] --> B[Kubelet Crashes]
A --> C[Docker Daemon Crashes]
B --> D[Longhorn Manager Pods Crash]
B --> E[CSI Driver Registration Lost]
C --> F[Overlay2 Filesystem Corrupt]
D --> G[Driver-Deployer Init Container Waits]
E --> H[CSI Socket Disappears]
G --> I[CSI Driver Not Deployed]
H --> J[CSI Pods Cannot Start]
I --> J
J --> K[PVC Mounts Fail]
K --> L[Application Pods Crash]
F --> M[Docker Containers Fail to Start]
M --> N[CoreDNS Crashes]
M --> O[Service Load Balancers Crash]
N --> P[DNS Resolution Fails]
O --> P
P --> L
K --> L
```
### Why Data Is Safe
The Longhorn volume data is stored in replicas across all three nodes at `/mnt/arcodange/longhorn/replicas/`. Checking the Longhorn volumes shows:
```
All 12 volumes: state="attached", robustness="healthy"
```
This confirms that:
1. Volume metadata is intact in etcd
2. Replica data is intact on disk
3. Once CSI driver is restored, volumes will be accessible again
4. **No permanent data loss has occurred**
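The volume check above can be repeated with a small filter over `kubectl` output. This is a sketch under assumptions: `unhealthy_volumes` is a hypothetical helper, and it reads only the status that Longhorn reports in its Volume objects, not the replica files on disk.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not from the original report).
# stdin: "<volume> <state> <robustness>" per line, e.g. produced with
#   kubectl get volumes.longhorn.io -n longhorn-system --no-headers \
#     -o custom-columns=NAME:.metadata.name,STATE:.status.state,ROBUSTNESS:.status.robustness
# stdout: volumes that are not both attached and healthy.
# exit status: 0 if every volume is attached and healthy, 1 otherwise.
unhealthy_volumes() {
  awk '
    $2 != "attached" || $3 != "healthy" {
      print $1 " state=" $2 " robustness=" $3
      bad = 1
    }
    END { exit bad + 0 }'
}
```

Note that a volume that is intentionally detached would also be listed, so the output needs to be read with the workload state in mind.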
## Recovery Actions Taken
### Attempt 1: HelmChart Manifest Touch (15:24:50 - 15:25:50)
**Action:** Touched `/var/lib/rancher/k3s/server/manifests/longhorn-install.yaml` on pi1
**Command:**
```bash
ssh pi@pi1 "sudo touch /var/lib/rancher/k3s/server/manifests/longhorn-install.yaml"
```
**Outcome:** Only triggered reconcile for 1 pod (longhorn-manager-w85v6). CSI driver still not registered.
**Decision:** Insufficient. Need more aggressive approach.
### Attempt 2: Force Delete All Longhorn Pods (15:32:15 - Present)
**Action:** Force deleted all 24 pods in longhorn-system namespace
**Command:**
```bash
kubectl delete pods -n longhorn-system --all --force --grace-period=0
```
**Outcome:**
- HelmChart controller detected changes and recreated all pods
- **Success**: 23/25 pods now in Running state (15:34:30)
- **Blocking**: `longhorn-driver-deployer` stuck in Init:0/1
- **Blocking**: All `longhorn-csi-plugin` pods in Error
- **Investigation**: driver-deployer's `wait-longhorn-manager` init container waiting for manager readiness
### Current Investigation (15:34:30)
**Focus:** Why driver-deployer is stuck in Init state
The `longhorn-driver-deployer` pod has an init container that waits for Longhorn manager to be ready before deploying the CSI driver. Despite 3 manager pods running, the wait condition is not being met.
**Hypotheses:**
1. Manager pods are not fully healthy (readiness probes failing)
2. Network connectivity between driver-deployer and managers
3. RBAC or service account permissions issue
4. Configuration mismatch in HelmChart values
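The first hypothesis can be narrowed down from the READY column of `kubectl get pods`. The helper below is a sketch only: `pods_not_ready` is a hypothetical name, and the `app=longhorn-manager` label in the usage note is an assumption about how the chart labels its pods.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not from the original report).
# stdin: output of `kubectl get pods --no-headers` (NAME READY STATUS ...).
# stdout: pods whose ready count differs from the container count,
#         or whose status is not Running.
pods_not_ready() {
  awk '
    {
      split($2, ready, "/")
      if (ready[1] != ready[2] || $3 != "Running") {
        print $1, $2, $3
      }
    }'
}
```

A possible invocation is `kubectl get pods -n longhorn-system -l app=longhorn-manager --no-headers | pods_not_ready`. Any manager listed as `1/2 Running` would be consistent with a failing readiness probe, which is what the `wait-longhorn-manager` init container is waiting on.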
## Current Status (2026-04-15)
### Longhorn System
- **All Longhorn pods**: Running ✅ (reinstalled 2026-04-13)
- **CSI driver**: Registered ✅
### Volume Recovery Status
| PVC | Namespace | Size | Status |
|-----|-----------|------|--------|
| `traefik` | kube-system | 128Mi | ✅ Recovered (2026-04-14) |
| `url-shortener-data` | url-shortener | 128Mi | ✅ Recovered (2026-04-14) |
| `clickhouse-storage-clickhouse-0` | tools | 16Gi | ✅ Recovered (2026-04-14) |
| `prometheus-server` | tools | 8Gi | ⏳ In progress (2026-04-15) |
| `storage-prometheus-alertmanager-0` | tools | 2Gi | ⏳ In progress (2026-04-15) |
| `redis-storage-redis-0` | tools | 1Gi | ⏳ In progress (2026-04-15) |
| `backups-rwx` | longhorn-system | 50Gi | ⏳ In progress (2026-04-15) |
| `data-hashicorp-vault-0` | tools | 10Gi | 🔴 Deferred — manual recovery |
| `audit-hashicorp-vault-0` | tools | 10Gi | 🔴 Deferred — manual recovery |
| `erp` | erp | 50Gi | 🔴 Deferred — manual recovery |
## Next Steps
### Immediate
1. Confirm prometheus, alertmanager, redis, backups-rwx fully recovered via `longhorn_data.yml`
2. Verify monitoring stack (Grafana dashboards, alert routing) is functional
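One way to confirm that the recovered claims are usable is to list any PVC that is not `Bound`. This is a sketch with a hypothetical helper name (`pvcs_not_bound`); it checks binding only and says nothing about the data inside the volume.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not from the original report).
# stdin: output of `kubectl get pvc -A --no-headers`
#        (NAMESPACE NAME STATUS ...).
# stdout: "<namespace>/<name> <status>" for every PVC that is not Bound.
pvcs_not_bound() {
  awk '$3 != "Bound" { print $1 "/" $2 " " $3 }'
}
```

For example: `kubectl get pvc -A --no-headers | pvcs_not_bound`. An empty result would indicate that every claim is bound to a PV.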
### Short-term
3. Manual recovery of Vault (`data-hashicorp-vault-0`, `audit-hashicorp-vault-0`) — see Vault runbook
4. Manual recovery of ERP (`erp`) — coordinate with application owner
5. Update backup playbook to include Longhorn Volume CRDs (see ADR 20260414-longhorn-pvc-recovery)
6. Prepare Longhorn Google Storage restore playbook for `backups-rwx` alternative recovery path
### Long-term
- Implement UPS for the Raspberry Pi cluster
- Add Longhorn volume health monitoring to Grafana
- Regular backup restore drills
## Architecture Context
```mermaid
%%{init: { 'theme': 'forest' }}%%
flowchart TB
subgraph K3s Control Plane
A[pi1: Control Plane] -->|runs| B[kubelet]
B --> C[k3s server]
C --> D[HelmChart Controller]
end
subgraph Storage Layer
E[Longhorn HelmChart] --> F[Longhorn Manager Pods]
F --> G[Driver Deployer]
G --> H[CSI Driver Registration]
H --> I[CSI Socket: /var/lib/kubelet/plugins/driver.longhorn.io/csi.sock]
F --> J[Longhorn Volumes]
J --> K[Replicas on all 3 nodes]
end
subgraph CSI Components
H --> L[csi-attacher Pods]
H --> M[csi-provisioner Pods]
H --> N[csi-resizer Pods]
H --> O[csi-snapshotter Pods]
H --> P[csi-plugin DaemonSet]
end
subgraph Data Path
I --> Q[/mnt/arcodange/longhorn/]
Q --> R[replicas/]
end
subgraph Docker Storage
S[Docker Daemon] --> T[/mnt/arcodange/docker/]
T --> U[overlay2/]
end
L -->|mounts volumes| V[Application Pods]
M -->|creates volumes| J
P -->|node-level ops| I
classDef critical fill:#c00,color:#fff,stroke:#000
classDef healthy fill:#0a0,color:#000,stroke:#000
classDef degraded fill:#ff0,color:#000,stroke:#000
class H,L,M,N,O,P critical
class F,G,E degraded
class I,J,Q,R,U healthy
```
## Component Details
### Longhorn Manager
- **Role**: Primary controller for Longhorn, manages volumes, replicas, snapshots
- **Image**: `longhornio/longhorn-manager:v1.9.1`
- **Ports**: 9500 (manager), 9501 (webhook health), 9502 (metrics)
- **Data Path**: `/mnt/arcodange/longhorn` (configured in HelmChart values)
- **Health Check**: `https://<pod-ip>:9501/v1/healthz`
### Longhorn Driver Deployer
- **Role**: Deploys the CSI driver to each node
- **Image**: `longhornio/longhorn-manager:v1.9.1`
- **Init Container**: `wait-longhorn-manager` - waits for manager to be ready
- **Blocker**: Stuck in init during the incident, which prevented CSI driver deployment (cleared by the 2026-04-13 reinstall)
### CSI Driver
- **Role**: Implements the CSI (Container Storage Interface) specification for Longhorn
- **Socket**: `/var/lib/kubelet/plugins/driver.longhorn.io/csi.sock`
- **Registration**: Must be registered with kubelet via CSINode
- **Images**:
- `longhornio/csi-attacher:v4.9.0-20250709`
- `longhornio/csi-provisioner:v5.3.0-20250709`
- `longhornio/csi-resizer:v1.14.0-20250709`
- `longhornio/csi-snapshotter:v8.3.0-20250709`
- `longhornio/csi-node-driver-registrar:v2.14.0-20250709`
### CSI Node Driver Registrar
- **Role**: Registers the CSI driver with kubelet
- **Image**: `longhornio/csi-node-driver-registrar:v2.14.0-20250709`
- **Mechanism**: Registers the driver through the kubelet plugin registration socket; the kubelet then records it in the node's `CSINode` object
## Action Items
### Immediate (resolved)
- [x] Investigate and resolve driver-deployer init container blocker
- [x] Restore CSI driver registration
- [x] Fix Docker overlay2 corruption / daemon.json on all nodes
- [x] Fix DNS (CoreDNS + Pi-hole dnsmasq config)
- [x] Longhorn reinstalled and healthy
- [x] Traefik ingress controller functional
- [x] Fix backup script (empty backup.volumes bug)
### Short-term (resolved)
- [x] url-shortener data recovered
- [x] Clickhouse data recovered
- [x] Develop automated block-device recovery playbook (`playbooks/recover/longhorn_data.yml`)
- [x] Backup restore procedure documented and tested
### Medium-term (in progress)
- [ ] prometheus, alertmanager, redis, backups-rwx recovered (playbook running 2026-04-15)
- [ ] Vault manual recovery
- [ ] ERP manual recovery
- [ ] Update backup playbook to include Longhorn Volume CRDs
- [ ] Prepare Longhorn Google Storage restore playbook
### Long-term
- [ ] Implement UPS for Raspberry Pi cluster
- [ ] Add Longhorn volume health monitoring to Grafana
- [ ] Add CSI socket health check to monitoring
- [ ] Regular backup restore drills (monthly)
## Lessons Learned
### What Went Well
- Quick identification of root cause (CSI driver registration)
- Longhorn volume data remained intact (good replica design)
- Force-deleting the Longhorn pods triggered a partial recovery
- K3s HelmChart approach allows easy manifest-based recovery
### What Could Be Improved
- Need better CSI driver health monitoring and alerting
- Longhorn driver-deployer init container timeout may be too short
- Docker overlay2 on external storage needs better corruption recovery
- Backup script has bugs that prevent reliable backups
- No UPS protection for power cuts
### Technical Debt Identified
- Backup script formatting bug (extra newlines create invalid YAML)
- No automated Longhorn health checks
- Manual intervention required for CSI driver recovery
## Related Files
- **Ansible Playbook**: `playbooks/system/k3s_config.yml` (Longhorn HelmChart creation)
- **HelmChart Manifest**: `/var/lib/rancher/k3s/server/manifests/longhorn-install.yaml` on pi1
- **Backup Scripts**: `/opt/k3s_volumes/backup.sh` and `/opt/k3s_volumes/restore.sh` on pi1
- **Inventory**: `inventory/hosts.yml` (required for all playbooks)
## Commands Reference
### Check Longhorn Status
```bash
kubectl get pods -n longhorn-system
kubectl get volumes -n longhorn-system
kubectl get replicas -n longhorn-system
kubectl get settings -n longhorn-system
```
### Force Longhorn Recovery (k3s-specific)
```bash
# Method 1: Touch manifest (soft reconcile)
sudo touch /var/lib/rancher/k3s/server/manifests/longhorn-install.yaml
# Method 2: Delete all pods (force recreate)
kubectl delete pods -n longhorn-system --all --force --grace-period=0
# Method 3: Delete a specific pod by label (kubectl does not expand "*" in pod names)
kubectl delete pod -n longhorn-system -l app=longhorn-driver-deployer
```
### Check CSI Driver Registration
```bash
kubectl get csidriver
kubectl get csinodes
kubectl describe csidriver driver.longhorn.io
```
### Check Longhorn StorageClass ConfigMap
```bash
kubectl describe cm -n longhorn-system longhorn-storageclass
```


@@ -0,0 +1,209 @@
%%{init: { 'theme': 'forest', 'themeVariables': {
'primaryColor': '#1e293b',
'primaryTextColor': '#f8fafc',
'lineColor': '#334155',
'secondaryColor': '#475569',
'tertiaryColor': '#94a3b8',
'edgeLabelBackground':'#fff',
'edgeLabelColor': '#1e293b'
}}}%%
flowchart TD
subgraph Cluster["K3s Cluster (v1.34.3+k3s1)"]
direction TB
subgraph Nodes["Physical Nodes"]
pi1["pi1: 192.168.1.201\nControl Plane"]
pi2["pi2: 192.168.1.202\nWorker"]
pi3["pi3: 192.168.1.203\nWorker"]
end
subgraph K3sComponents["K3s Control Plane Components"]
kubelet1["kubelet"]
kubelet2["kubelet"]
kubelet3["kubelet"]
k3s_server["k3s server"]
helm_controller["HelmChart Controller"]
end
pi1 --> kubelet1
pi2 --> kubelet2
pi3 --> kubelet3
pi1 --> k3s_server
k3s_server --> helm_controller
end
subgraph LonghornStorage["Longhorn Storage System"]
direction TB
subgraph HelmChart["HelmChart Installation"]
manifest[("longhorn-install.yaml")]
end
subgraph Manager["Longhorn Manager layer"]
lh_manager1["longhorn-manager-r6sd2\n2/2 Running\npi2"]
lh_manager2["longhorn-manager-sjc56\n1/2 Running\npi3"]
lh_manager3["longhorn-manager-t9b45\n1/2 Running\npi1"]
webhook["Webhook Leader: pi2"]
end
subgraph DriverDeployer["CSI Driver Deployer"]
deployer["longhorn-driver-deployer\n0/1 Init:0/1\npi3"]
wait_container["wait-longhorn-manager\nwaiting..."]
end
subgraph CSIDriver["CSI Driver Components"]
csi_socket[("/var/lib/kubelet/plugins/driver.longhorn.io/csi.sock")]
csi_registrar["CSI Node Driver Registrar"]
end
subgraph CSIContainers["CSI Containers (Sidecars)"]
attacher1["csi-attacher-54ld9\n1/1 Running\npi2"]
attacher2["csi-attacher-dqq9v\n1/1 Running\npi3"]
attacher3["csi-attacher-k5jmx\n0/1 Error\npi1"]
provisioner1["csi-provisioner-9z79d\n0/1 Error\npi2"]
provisioner2["csi-provisioner-zjwdr\n1/1 Running\npi1"]
provisioner3["csi-provisioner-zk5kp\n1/1 Running\npi3"]
resizer1["csi-resizer-8mrld\n1/1 Running\npi3"]
resizer2["csi-resizer-ddhl2\n0/1 Error\npi1"]
resizer3["csi-resizer-qv5n9\n0/1 Error\npi2"]
snapshotter1["csi-snapshotter-9rzf4\n1/1 Running\npi3"]
snapshotter2["csi-snapshotter-bqdtd\n0/1 Error\npi2"]
snapshotter3["csi-snapshotter-jv6pj\n1/1 Running\npi1"]
end
subgraph CSIPlugin["CSI Plugin DaemonSet"]
plugin1["longhorn-csi-plugin-f44jp\n0/3 Error\npi3"]
plugin2["longhorn-csi-plugin-q2sgh\n1/3 Error\npi1"]
plugin3["longhorn-csi-plugin-vzld8\n2/3 Error\npi2"]
end
subgraph DataLayer["Longhorn Data Layer"]
engine1["engine-image-ei-8ktd9\n1/1 Running\npi1"]
engine2["engine-image-ei-dcjq8\n1/1 Running\npi3"]
engine3["engine-image-ei-m76jf\n1/1 Running\npi2"]
volumes[("12 Longhorn Volumes")]
replicas[("/mnt/arcodange/longhorn/replicas/")]
end
subgraph UIAndTools["UI & Backup"]
ui1["longhorn-ui-8gb4s\n0/1 CrashLoop\npi1"]
ui2["longhorn-ui-hmxz6\n0/1 CrashLoop\npi3"]
share_mgr1["share-manager-...70b4\n0/1 Error\npi1"]
share_mgr2["share-manager-...7ffa\n0/1 Error\npi3"]
nfs["rwx-nfs-4cn9h\n0/1 ContainerCreating\npi3"]
end
manifest --> lh_manager1 & lh_manager2 & lh_manager3
helm_controller --> manifest
lh_manager1 & lh_manager2 & lh_manager3 --> webhook
deployer --> wait_container
wait_container -.->|waits for| lh_manager1 & lh_manager2 & lh_manager3
deployer --> csi_registrar
csi_registrar --> csi_socket
csi_socket --> kubelet1
csi_socket --> kubelet2
csi_socket --> kubelet3
attacher1 & attacher2 & attacher3 --> csi_socket
provisioner1 & provisioner2 & provisioner3 --> csi_socket
resizer1 & resizer2 & resizer3 --> csi_socket
snapshotter1 & snapshotter2 & snapshotter3 --> csi_socket
plugin1 & plugin2 & plugin3 --> csi_socket
lh_manager1 & lh_manager2 & lh_manager3 --> volumes
volumes --> replicas
replicas --> pi1_disk[("pi1: /mnt/arcodange/longhorn")]
replicas --> pi2_disk[("pi2: /mnt/arcodange/longhorn")]
replicas --> pi3_disk[("pi3: /mnt/arcodange/longhorn")]
share_mgr1 & share_mgr2 --> nfs
nfs --> backup_pvc[("PVC: backups-rwx\n50Gi")]
end
subgraph DockerStorage["Docker Storage layer"]
docker1["Docker daemon\npi1"]
docker2["Docker daemon\npi2"]
docker3["Docker daemon\npi3"]
storage1[("/mnt/arcodange/docker/overlay2/")]
docker1 --> storage1
docker2 --> storage1
docker3 --> storage1
end
subgraph ApplicationLayer["Application Pods (Affected)"]
traefik["traefik-5c67cb6889-8b5nk\n0/1 Error\nkube-system"]
cms["cms-arcodange-cms-...\n0/1 ImagePullBackOff\ncms"]
webapp["webapp-6588455979-...\n0/1 ImagePullBackOff\nwebapp"]
erp["erp-648748b4f5-bntd9\n0/1 Error\nerp"]
grafana["grafana-5d496f9668-...\n0/3 Error\ntools"]
vault["hashicorp-vault-0\n0/1 Error\ntools"]
end
subgraph NetworkServices["Network Services"]
coredns["coredns-67476ddb48-jrcg2\n1/1 Running\nkube-system"]
svclb["svclb-traefik-*\n3/3 Running\nkube-system"]
end
%% Connections showing failure paths
csi_socket --x traefik
csi_socket --x cms
csi_socket --x webapp
csi_socket --x erp
csi_socket --x grafana
csi_socket --x vault
docker1 --x coredns
docker1 --x svclb
%% Healthy connections
volumes -->|provides storage| traefik
volumes -->|provides storage| cms
volumes -->|provides storage| webapp
volumes -->|provides storage| erp
volumes -->|provides storage| grafana
volumes -->|provides storage| vault
classDef node fill:#0ea5e9,color:#000,stroke:#06b6d4
classDef k3s fill:#84cc16,color:#000,stroke:#65a30d
classDef longhorn fill:#a855f7,color:#fff,stroke:#8b5cf6
classDef csi fill:#f59e0b,color:#000,stroke:#d97706
classDef data fill:#10b981,color:#000,stroke:#059669
classDef app fill:#ec4899,color:#fff,stroke:#db2777
classDef network fill:#6366f1,color:#fff,stroke:#4f46e5
classDef error fill:#ef4444,color:#fff,stroke:#dc2626
classDef waiting fill:#fbbf24,color:#000,stroke:#f59e0b
class pi1,pi2,pi3 node
class kubelet1,kubelet2,kubelet3,k3s_server,helm_controller k3s
class manifest,webhook longhorn
class lh_manager1,lh_manager2,lh_manager3,engine1,engine2,engine3,volumes,replicas,share_mgr1,share_mgr2 data
class deployer,wait_container,csi_registrar,csi_socket longhorn
class attacher1,attacher2,attacher3,provisioner1,provisioner2,provisioner3,resizer1,resizer2,resizer3,snapshotter1,snapshotter2,snapshotter3 csi
class plugin1,plugin2,plugin3 csi
class traefik,cms,webapp,erp,grafana,vault app
class coredns,svclb network
class docker1,docker2,docker3 data
class deployer,wait_container error
class attacher3,provisioner1,resizer2,resizer3,snapshotter2 error
class plugin1,plugin2,plugin3 error
class ui1,ui2,share_mgr1,share_mgr2 error
class traefik,cms,webapp,erp,grafana,vault error
class nfs waiting
class lh_manager2,lh_manager3 waiting
classDef clusterBox stroke:#334155,stroke-width:2px,color:#94a3b8
class Cluster clusterBox
class LonghornStorage clusterBox
class DockerStorage clusterBox
class ApplicationLayer clusterBox
class NetworkServices clusterBox


@@ -0,0 +1,200 @@
%%{init: { 'theme': 'forest', 'themeVariables': {
'primaryColor': '#7c3aed',
'primaryTextColor': '#ffffff',
'lineColor': '#6d28d9',
'secondaryColor': '#8b5cf6',
'tertiaryColor': '#a78bfa',
'edgeLabelBackground':'#5b21b6',
'edgeLabelColor': '#ffffff'
}}}%%
flowchart TD
root((Longhorn Storage System))
%% ===== CONTROL PLANE COMPONENTS =====
ControlPlane[Control Plane]
Manager[longhorn-manager]
Role1["Role: Primary controller for Longhorn"]
Responsibilities1["• Manages volumes, replicas, snapshots\n• Handles volume lifecycle\n• Coordinates with etcd\n• Exposes API (port 9500)"]
Health1["Health Check: :9501/v1/healthz"]
Webhook1["Metrics: :9502/metrics"]
DriverDeployer[longhorn-driver-deployer]
Role2["Role: CSI driver deployment controller"]
Responsibilities2["• Deploys CSI driver to each node\n• Runs via init container (wait-longhorn-manager)\n• Creates csi.sock on each node"]
WaitCmd["Command: longhorn-manager wait -d <namespace>"]
Blocking["⚠️ BLOCKED: Init container waiting for managers"]
%% ===== CSI COMPONENTS =====
CSILayer[CSI Interface]
CSISocket[("/var/lib/kubelet/plugins/driver.longhorn.io/csi.sock")]
SocketRole["Role: Unix domain socket for CSI communication"]
Attacher[csi-attacher]
AttacherRole["Role: Attaches volumes to nodes"]
AttacherResp["• Monitors VolumeAttachment objects\n• Calls CSI ControllerPublishVolume\n• Handles detach operations"]
AttacherStatus["Status: 2/3 Running, 1 Error"]
Provisioner[csi-provisioner]
ProvisionerRole["Role: Creates volumes from PVCs"]
ProvisionerResp["• Watches PVC objects\n• Calls CSI CreateVolume\n• Handles volume deletion"]
ProvisionerStatus["Status: 2/3 Running, 1 Error"]
Resizer[csi-resizer]
ResizerRole["Role: Handles volume resizing"]
ResizerResp["• Watches PVC size changes\n• Calls CSI ExpandVolume"]
ResizerStatus["Status: 1/3 Running, 2 Error"]
Snapshotter[csi-snapshotter]
SnapshotterRole["Role: Manages volume snapshots"]
SnapshotterResp["• Watches VolumeSnapshot objects\n• Calls CSI CreateSnapshot\n• Handles snapshot deletion"]
SnapshotterStatus["Status: 2/3 Running, 1 Error"]
NodeRegistrar[csi-node-driver-registrar]
RegistrarRole["Role: Registers driver with kubelet"]
RegistrarResp["• Creates CSINode resource\n• Registers via kubelet plugin registry API"]
Plugin[csi-plugin]
PluginRole["Role: Node-level CSI operations"]
PluginResp["• Runs on each node (DaemonSet)\n• Handles NodePublish/UnpublishVolume\n• Manages mount/unmount operations"]
PluginStatus["⚠️ BLOCKED: All 3 pods in Error (no CSI socket)"]
%% ===== DATA LAYER COMPONENTS =====
DataLayer[Data Layer]
Engine[engine-image]
EngineRole["Role: Engine and instance manager"]
EngineResp["• Pulls and manages engine binaries\n• Runs as sidecar in DaemonSet\n• Maintains engine processes"]
EngineStatus["Status: ✅ 3/3 Running"]
Volumes[Longhorn Volumes]
VolumeRole["Role: Logical volume representation"]
VolumeResp["• Managed via Longhorn CRDs\n• Replicated across nodes\n• Supports RWO, RWX access modes"]
VolumeStatus["Status: ✅ All 12 volumes attached & healthy"]
Replicas[Volume Replicas]
ReplicaRole["Role: Physical data storage"]
ReplicaResp["• 3-way replication across nodes\n• Stored at /mnt/arcodange/longhorn/replicas/\n• Data intact after power cut"]
ReplicaPath["Path: pi1, pi2, pi3: /mnt/arcodange/longhorn/replicas/"]
Backups[Backup System]
NFS[RWX NFS Share]
NFSRole["Role: NFS export for backup volume"]
NFSCreate["Created via: playbooks/setup/backup_nfs.yml"]
NFSStatus["⚠️ OFFLINE: share-manager pods in Error"]
BackupPVC[Backup PVC]
BackupPVCRole["Role: Persistent storage for backups"]
BackupPVCDetails["Name: backups-rwx\nNamespace: longhorn-system\nSize: 50Gi\nClass: longhorn"]
ShareManager[share-manager]
ShareRole["Role: Manages NFS exports for Longhorn volumes"]
ShareStatus["⚠️ BLOCKED: 2 pods in Error"]
%% ===== UI & TOOLS =====
UI[Web UI]
UIRole["Role: Longhorn management dashboard"]
UIAccess["Access: Port 9500 on manager pods"]
UIStatus["⚠️ BLOCKED: 2 pods in CrashLoopBackOff"]
%% ===== INFRASTRUCTURE =====
Infrastructure[Underlying Infrastructure]
Nodes[Raspberry Pi Nodes]
pi1["pi1: 192.168.1.201\nRole: Control Plane"]
pi2["pi2: 192.168.1.202\nRole: Worker"]
pi3["pi3: 192.168.1.203\nRole: Worker"]
K3s["Kubernetes (k3s v1.34.3+k3s1)"]
Kubelet["kubelet (3 instances)"]
APIServer["API Server (on pi1)"]
etcd["etcd (on pi1)"]
HelmCtrl["HelmChart Controller"]
Docker[Docker Engine]
DockerRole["Role: Container runtime"]
DockerStorage["Storage: /mnt/arcodange/docker/"]
Overlay2["⚠️ ISSUE: overlay2 filesystem corrupted"]
%% ===== EXTERNAL DEPENDENCIES =====
Dependencies[External Dependencies]
CSIRegistration[CSI Driver Registration]
CSIRole["Role: k8s CSI registration"]
CSIDriver["Driver: driver.longhorn.io"]
CSIDriverStatus["⚠️ LOST: Not registered with kubelet"]
%% ===== CONNECTIONS =====
root --> ControlPlane
root --> CSILayer
root --> DataLayer
root --> UI
root --> Infrastructure
root --> Dependencies
ControlPlane --> Manager
ControlPlane --> DriverDeployer
CSILayer --> CSISocket
CSILayer --> Attacher
CSILayer --> Provisioner
CSILayer --> Resizer
CSILayer --> Snapshotter
CSILayer --> NodeRegistrar
CSILayer --> Plugin
CSISocket --> Attacher
CSISocket --> Provisioner
CSISocket --> Resizer
CSISocket --> Snapshotter
CSISocket --> Plugin
CSISocket --> NodeRegistrar
DriverDeployer --> NodeRegistrar
NodeRegistrar --> CSISocket
DataLayer --> Engine
DataLayer --> Volumes
DataLayer --> Replicas
DataLayer --> Backups
Backups --> NFS
Backups --> BackupPVC
Backups --> ShareManager
Infrastructure --> Nodes
Infrastructure --> K3s
Infrastructure --> Docker
Dependencies --> CSIRegistration
CSIRegistration --> CSISocket
%% ===== YET TO BE RESTORED =====
Dependencies --x EmptyCSI["⚠️ CSI Socket Missing"]
EmptyCSI --x Attacher
EmptyCSI --x Provisioner
EmptyCSI --x Resizer
EmptyCSI --x Snapshotter
EmptyCSI --x Plugin
%% ===== STYLES =====
classDef component fill:#8b5cf6,color:#fff,stroke:#7c3aed,stroke-width:2px
classDef role fill:#a78bfa,color:#000,stroke:#8b5cf6
classDef responsibility fill:#c4b5fd,color:#000,stroke:#8b5cf6
classDef status_good fill:#10b981,color:#fff,stroke:#059669
classDef status_bad fill:#ef4444,color:#fff,stroke:#dc2626
classDef status_warn fill:#f59e0b,color:#000,stroke:#d97706
classDef infinite fill:#3b82f6,color:#fff,stroke:#2563eb
class root infinite
class ControlPlane,CSILayer,DataLayer,UI,Infrastructure,Dependencies component
class Manager,Attacher,Provisioner,Resizer,Snapshotter,NodeRegistrar,Plugin,Engine,Volumes,Replicas,NFS,BackupPVC,ShareManager,UIRole,Nodes,K3s,Docker,CSIRegistration component
class Role1,Role2,AttacherRole,ProvisionerRole,ResizerRole,SnapshotterRole,RegistrarRole,PluginRole,EngineRole,VolumeRole,ReplicaRole,NFSRole,ShareRole,UIRole,Kubelet,APIServer,etcd,HelmCtrl,DockerRole,CSIRole,CSIDriver component
class Responsibilities1,Responsibilities2,AttacherResp,ProvisionerResp,ResizerResp,SnapshotterResp,RegistrarResp,PluginResp,EngineResp,VolumeResp,ReplicaResp,NFSRole,BackupPVCDetails,ShareRole,UIAccess,ShareStatus,NFSStatus role
class EngineStatus,VolumeStatus,ReplicaPath status_good
class Blocking,PluginStatus,UIStatus,ShareStatus,NFSStatus,CSIDriverStatus status_bad
class AttacherStatus,ProvisionerStatus,ResizerStatus,SnapshotterStatus status_warn
classDef mindmapTitle fill:#4c1d95,color:#fff,stroke:#5b21b6,font-size:20px,font-weight:bold
class root mindmapTitle


@@ -0,0 +1,131 @@
%%{init: { 'theme': 'forest', 'themeVariables': {
'primaryColor': '#059669',
'primaryTextColor': '#fff',
'lineColor': '#065f46',
'secondaryColor': '#10b981',
'edgeLabelBackground':'#064e3b',
'edgeLabelColor': '#ffffff'
}}}%%
flowchart TD
%% ===== POWER CUT EVENT =====
Start([Power Cut Event]) -->|Electricity Lost| Crash[Kubernetes Components Crash]
%% ===== IMMEDIATE IMPACT =====
Crash --> KubeletCrash[Kubelet Processes Crash<br>on all 3 nodes]
Crash --> DockerCrash[Docker Daemons Crash<br>on all 3 nodes]
Crash --> K3sCrash[K3s Server Process Crash<br>on pi1]
%% ===== DOCKER STORAGE CORRUPTION =====
DockerCrash --> Overlay2[ /mnt/arcodange/docker/overlay2/<br>Filesystem Corrupted]
Overlay2 --> DockerFail[Docker containers cannot start<br>missing layer files]
DockerFail --> CoreDNSPod[CoreDNS Pod<br>CrashLoopBackOff]
DockerFail --> TraefikLB[svclb-traefik Pods<br>CrashLoopBackOff]
%% ===== LONGHORN IMPACT =====
KubeletCrash --> CSIUnreg[CSI Driver Registration Lost<br>driver.longhorn.io unregistered]
K3sCrash --> HelmCtrl[HelmChart Controller<br>Unresponsive]
CSIUnreg --> CSISocket[ /var/lib/kubelet/plugins/.../csi.sock<br>Disappears]
%% ===== LONGHORN MANAGER LOSS =====
KubeletCrash --> LHManagers[Longhorn Manager Pods<br>Crash, 3 pods]
LHManagers --> NoQuorum[No Manager Quorum<br>Cannot coordinate]
NoQuorum --> VolumesFrozen[Existing Volumes<br>Still healthy but inaccessible]
CSISocket --> CSISidecars[CSI Pods Cannot Start<br>csi-attacher, provisioner, resizer, snapshotter]
CSISocket --> CSIPlugin[CSI Plugin DaemonSet<br>Cannot register driver]
%% ===== VOLUME MOUNT FAILURES =====
CSISidecars --> NoMounts[PVC Mounts Fail<br>All Longhorn PVs inaccessible]
CSIPlugin --> NoMounts
%% ===== APPLICATION CASCADING FAILURES =====
NoMounts --> TraefikDown[Traefik Pod<br>PVC mount failed<br>Error state]
NoMounts --> AppPods1[Application Pods<br>PVC mount failed<br>Error state<br>cms, webapp, erp, clickhouse, etc.]
%% ===== BACKUP SYSTEM IMPACT =====
NoQuorum --> NFSDown[NFS Share-Manager Pods<br>Error state]
NFSDown --> BackupMount[ /mnt/backups/ NFS Mount<br>Unavailable]
%% ===== DISCOVERY & RECOVERY =====
Discovery[15:23:57<br>Incident Discovered] --> Assessment[15:24:05<br>Assessment Complete]
Assessment --> Identify[15:24:10<br>Root Cause: CSI Driver Unregistered]
Identify --> CheckData[15:24:15<br>Verify Volume Health]
CheckData --> DataIntact[All 12 volumes:<br>state=attached<br>robustness=healthy]
%% ===== RECOVERY ATTEMPTS =====
Identify --> Attempt1[15:24:50<br>Attempt 1: Touch HelmChart Manifest]
Attempt1 --> Partial1[Only 1 manager pod affected]
Partial1 --> NeedMore[Insufficient recovery]
NeedMore --> Attempt2[15:32:15<br>Attempt 2: Delete All Longhorn Pods]
Attempt2 --> HelmReconcile[HelmChart Controller<br>Recreates All 24 Pods]
HelmReconcile --> Progress[15+ Pods Running<br>Managers, Engine-Image, Some CSI]
Progress --> Blocked[Driver-Deployer<br>Stuck in Init:0/1]
Blocked --> Investigate[15:34:30<br>Investigate wait-longhorn-manager]
Investigate --> WaitLoop[Init container runs:<br>longhorn-manager wait -d longhorn-system]
WaitLoop --> WaitingManagers[Waiting for all managers<br>to pass readiness probes]
%% ===== CURRENT STATE (15:35:30) =====
WaitingManagers --> CurrentState
subgraph CurrentState["Current State<br>15:35:30 UTC"]
direction TB
Resolved[Resolved ✅] --> ManagersOk[Manager Pods:<br>2/2, 1/2, 2/2 Running<br>pi1, pi2, pi3]
Resolved --> EngineOk[Engine Image:<br>3/3 Running]
Resolved --> CSIPartial[CSI Sidecars:<br>~50% Running]
Resolved --> VolumeData[Volume Data:<br>All intact]
BlockedNow[Blocked ❌] --> DriverDeployer[Driver Deployer:<br>Init:0/1 8+ min<br>waiting for managers]
BlockedNow --> CSIPluginAll[CSI Plugin:<br>0/3 Error, all pods]
BlockedNow --> UI[Longhorn UI:<br>0/2 CrashLoop]
BlockedNow --> ShareMgr[Share Manager:<br>0/2 Error]
BlockedNow --> NFSPod[RWX NFS:<br>ContainerCreating]
BlockedNow --> AppImpact[Application Impact:<br>~30 pods still failed<br>down from 43]
end
%% ===== RECOVERY PATH =====
CurrentState --> NextStep[Next: Resolve driver-deployer<br>wait-longhorn-manager blockage]
NextStep --> CheckHealth[Check manager health endpoints<br>https://<ip>:9501/v1/healthz]
CheckHealth -->|If healthy| WaitContainerIssue[Wait container bug/timeout]
CheckHealth -->|If unhealthy| FixManagers[Investigate manager readiness]
WaitContainerIssue --> Option1[Option 1: Delete driver-deployer pod]
WaitContainerIssue --> Option2[Option 2: Touch manifest again]
FixManagers --> CheckLogs[Check manager container logs]
CheckLogs --> ResolveManagers[Fix manager readiness]
Option1 --> CSIDriver[CSI Driver deployed]
Option2 --> CSIDriver
ResolveManagers --> CSIDriver
CSIDriver --> CSISocketRestored[CSI Socket Restored]
CSISocketRestored --> PodsRecover[All Longhorn pods recover]
PodsRecover --> PVCMounts[PVC Mounts resume]
PVCMounts --> AppRecovery[Application pods auto-recover]
AppRecovery --> ResolvedState[Resolved ✅]
%% ===== STYLES =====
classDef event fill:#10b981,color:#fff,stroke:#059669
classDef impact fill:#d97706,color:#000,stroke:#b45309
classDef action fill:#3b82f6,color:#fff,stroke:#2563eb
classDef resolved fill:#10b981,color:#fff,stroke:#059669
classDef blocked fill:#ef4444,color:#fff,stroke:#dc2626
classDef current fill:#8b5cf6,color:#fff,stroke:#7c3aed
class Start,Crash,KubeletCrash,DockerCrash,K3sCrash event
class Overlay2,DockerFail,CSIUnreg,CSISocket,NoQuorum,NoMounts impact
class Discovery,Assessment,Identify,CheckData,Attempt1,Attempt2,Investigate action
class ManagersOk,EngineOk,CSIPartial,VolumeData resolved
class DriverDeployer,CSIPluginAll,UI,ShareMgr,NFSPod,AppImpact blocked
class WaitLoop,CurrentState,NextStep,CheckHealth,Option1,Option2,ResolvedState current
classDef subtitle fill:#64748b,color:#fff,stroke:#475569,font-size:12px
class CurrentState,CurrentStateLabel subtitle

File diff suppressed because it is too large


@@ -0,0 +1,416 @@
---
title: PVC Recovery — Post-Reinstall Volume Restoration
incident_id: 2026-04-13-001
date: 2026-04-14
status: Mostly Resolved
operator: Claude Code
---
# PVC Recovery — Post-Reinstall Volume Restoration
## Situation as of 2026-04-14
Longhorn has been fully reinstalled and is healthy. The cluster nodes are all Ready. However,
**all application volumes are inaccessible** because the nuclear cleanup deleted the Longhorn
Volume/Engine/Replica CRDs, and the reinstalled Longhorn has no knowledge of the old volumes.
### Longhorn Health (verified)
```
NAME READY STATUS AGE
csi-attacher (3 pods) 1/1 Running 30m
csi-provisioner (3 pods) 1/1 Running 30m
csi-resizer (3 pods) 1/1 Running 30m
csi-snapshotter (3 pods) 1/1 Running 30m
engine-image-ei-b4bcf0a5 (3 pods) 1/1 Running 31m
instance-manager (3 pods) 1/1 Running 30m
longhorn-csi-plugin (3 pods) 3/3 Running 30m
longhorn-driver-deployer 1/1 Running 31m
longhorn-manager (3 pods) 2/2 Running 14m
longhorn-ui (2 pods) 1/1 Running 31m
CSIDriver driver.longhorn.io: Registered (AGE: 110d — restored)
```
Longhorn only knows about 3 volumes (crowdsec-config, crowdsec-db, traefik) — all newly provisioned
after reinstall. The other 9 volumes are missing from Longhorn's knowledge.
---
## Backup Files Available
| File | Location | Contents | Gap |
|------|----------|----------|-----|
| `backup_20260413.volumes` | `/home/pi/arcodange/backups/k3s_pvc/` | PV + PVC YAML (kubectl get -A pv,pvc) | No Longhorn CRDs |
| `longhorn_metadata_20260413.yaml` | `/home/pi/arcodange/backups/k3s_pvc/` | Engines + Replicas CRDs | **No Volume CRDs** |
**Critical gap:** The metadata backup was collected with `kubectl get -n longhorn-system volumes.longhorn.io,replicas.longhorn.io,engines.longhorn.io -o yaml` but the resulting file contains only Engines and Replicas in 3 separate Lists. The Volume CRDs are absent.
Attempting `kubectl apply -f longhorn_metadata_20260413.yaml` fails with:
```
Error from server (Invalid): admission webhook "validator.longhorn.io" denied the request:
volume does not exist for engine
```
The webhook requires Volume CRDs to exist before Engines can be created. Without Volume CRDs in the
backup, the metadata file cannot be applied as-is.
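The ordering constraint can be checked before running `kubectl apply`. The following is a minimal sketch, assuming the backup has already been parsed into a list of Python dicts (for example with PyYAML). The function names are hypothetical, and the assumption that Replicas depend on their Volume in the same way Engines do is not confirmed by the error message above.

```python
from collections import Counter

# Assumed apply order: parents first. Only the Volume -> Engine dependency
# is confirmed by the webhook error; Replica placement is an assumption.
APPLY_ORDER = ["Volume", "Engine", "Replica"]


def flatten(docs):
    """Yield individual objects, expanding kubectl `List` wrappers."""
    for doc in docs:
        if not doc:
            continue
        if "items" in doc:
            yield from doc.get("items") or []
        else:
            yield doc


def count_kinds(docs):
    """Count objects per kind so that a missing kind is visible up front."""
    return Counter(obj.get("kind") for obj in flatten(docs))


def sort_for_apply(docs):
    """Return all objects ordered so that Volumes come before their dependents."""
    rank = {kind: i for i, kind in enumerate(APPLY_ORDER)}
    return sorted(flatten(docs), key=lambda o: rank.get(o.get("kind"), len(rank)))
```

A `count_kinds(...)["Volume"]` of zero would have flagged the gap in `longhorn_metadata_20260413.yaml` at backup time rather than at restore time.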
---
## Data Survival Assessment
### Pi1 — Replica directories
Pi1 is the control plane. Its old replica directories were **deleted** during the nuclear cleanup.
Only 3 new directories exist (created after reinstall):
```
pvc-01b93e30-...-b1530c1d (crowdsec-config — NEW)
pvc-4785dc60-...-2f031b60 (crowdsec-db — NEW)
pvc-5391fa2b-...-0e2ff956 (traefik — NEW)
```
### Pi2 — Replica directories (OLD data preserved)
```
pvc-01b93e30-...-8649439a (crowdsec-config — new post-reinstall)
pvc-1251909b-...-e7a20fdf ← OLD DATA (clickhouse 16Gi)
pvc-14ccc47e-...-09021065 ← OLD DATA (crowdsec-db old PV)
pvc-4785dc60-...-4b48fdf1 (crowdsec-db — new post-reinstall)
pvc-5391fa2b-...-d3503612 (traefik — new post-reinstall)
pvc-63244de1-...-6076eb08 (unknown — not in engine backup)
pvc-6d2ea1c7-...-c7f287d8 ← OLD DATA (audit-vault 10Gi)
pvc-7971918e-...-2028617e ← OLD DATA (erp 50Gi)
pvc-88e18c7f-...-910583f6 ← OLD DATA (prometheus-server 8Gi)
pvc-abc7666c-...-34bec9b0 (unknown — not in engine backup)
pvc-aed7f2c4-...-41c20064 ← OLD DATA (alertmanager 2Gi)
pvc-ca5567d3-...-b537ca60 ← OLD DATA (data-vault 10Gi)
pvc-cc8a3cbb-...-cd16e459 ← OLD DATA (old traefik 128Mi)
pvc-cdd434d1-...-b2695689 ← OLD DATA (url-shortener 128Mi)
pvc-d1d5482b-...-e0a8cdbc ← OLD DATA (redis 1Gi)
pvc-efda1d2f-...-30c849a6 ← OLD DATA (backups-rwx 50Gi)
pvc-f9fe3504-...-20f64e9e ← OLD DATA (old crowdsec-config 100Mi)
pvc-fca13978-...-4749b404 (unknown — not in engine backup)
```
### Pi3 — Replica directories (OLD data preserved, multiple dirs per volume)
```
pvc-01b93e30-...-29592f50 (crowdsec-config — new post-reinstall)
pvc-1251909b-...-1163420b ← OLD DATA (clickhouse — replica 1)
pvc-1251909b-...-3a569b0a ← OLD DATA (clickhouse — replica 2)
pvc-1251909b-...-ccd05947 ← OLD DATA (clickhouse — replica 3 or stale)
pvc-14ccc47e-...-3856d64d ← OLD DATA (old crowdsec-db)
pvc-2e60385f-...-48e27d5a (unknown)
pvc-4785dc60-...-869f0e99 (crowdsec-db — new post-reinstall)
pvc-5391fa2b-...-958cd868 (traefik — new post-reinstall)
pvc-6d2ea1c7-...-0e73550d ← OLD DATA (audit-vault — dir 1)
pvc-6d2ea1c7-...-787ffefa ← OLD DATA (audit-vault — dir 2)
pvc-6d2ea1c7-...-e0f58d64 ← OLD DATA (audit-vault — dir 3 or stale)
pvc-7971918e-...-33191046 ← OLD DATA (erp — dir 1)
pvc-7971918e-...-88fc1dfc ← OLD DATA (erp — dir 2)
pvc-7971918e-...-b5c5530d ← OLD DATA (erp — dir 3 or stale)
pvc-88e18c7f-...-5d508830 ← OLD DATA (prometheus-server — dir 1)
pvc-88e18c7f-...-92c0ebfd ← OLD DATA (prometheus-server — dir 2)
pvc-88e18c7f-...-deea6182 ← OLD DATA (prometheus-server — dir 3 or stale)
pvc-abe09e90-...-a748d11b (unknown)
pvc-aed7f2c4-...-3452358f ← OLD DATA (alertmanager — dir 1)
pvc-aed7f2c4-...-826f05aa ← OLD DATA (alertmanager — dir 2)
pvc-ca5567d3-...-0ed6f691 ← OLD DATA (data-vault — dir 1)
pvc-ca5567d3-...-808d72b4 ← OLD DATA (data-vault — dir 2)
pvc-ca5567d3-...-9051ef48 ← OLD DATA (data-vault — dir 3 or stale)
pvc-cc8a3cbb-...-011b54b3 ← OLD DATA (old traefik — dir 1)
pvc-cc8a3cbb-...-a24fd91e ← OLD DATA (old traefik — dir 2)
pvc-cdd434d1-...-70197659 ← OLD DATA (url-shortener — dir 1)
pvc-cdd434d1-...-998f49ff ← OLD DATA (url-shortener — dir 2)
pvc-d1d5482b-...-6a730f00 ← OLD DATA (redis — dir 1)
pvc-d1d5482b-...-75da16fd ← OLD DATA (redis — dir 2)
pvc-efda1d2f-...-62fb04c9 ← OLD DATA (backups-rwx — dir 1)
pvc-efda1d2f-...-688f30f5 ← OLD DATA (backups-rwx — dir 2)
pvc-efda1d2f-...-69454dd0 ← OLD DATA (backups-rwx — dir 3 or stale)
pvc-f9fe3504-...-418df608 ← OLD DATA (old crowdsec-config)
```
**Note on multiple directories per volume on pi3:** A healthy volume has one replica directory per
node. Multiple directories indicate either rebuild attempts from before the nuclear cleanup or stale
snapshots. Check the actual `.img` disk usage in each directory before renaming anything.
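To see at a glance which volumes need this manual check, the directory listing can be grouped by volume. This is a rough sketch that assumes the naming pattern visible in the listings above (`<volume-name>-<suffix>`, with the suffix after the last hyphen); the function names are hypothetical.

```python
from collections import defaultdict


def group_by_volume(dir_names):
    """Map volume name -> list of replica directory names found on one node."""
    groups = defaultdict(list)
    for name in dir_names:
        # Everything before the last hyphen is treated as the volume name.
        volume, _, _suffix = name.rpartition("-")
        groups[volume].append(name)
    return dict(groups)


def ambiguous_volumes(dir_names):
    """Return only the volumes that have more than one directory on the node."""
    return {
        volume: sorted(dirs)
        for volume, dirs in group_by_volume(dir_names).items()
        if len(dirs) > 1
    }
```

Feeding it the output of `ls /mnt/arcodange/longhorn/replicas/` on pi3 would list each volume whose directories must be compared before any rename.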
---
## Volume → PVC Mapping (from backup_20260413.volumes)
| PV Name | PVC | Namespace | Size | Status |
|---------|-----|-----------|------|--------|
| `pvc-1251909b-3cef-40c6-881c-3bb6e929a596` | `clickhouse-storage-clickhouse-0` | tools | 16Gi | Terminating |
| `pvc-6d2ea1c7-9327-4992-a02c-93ae604eda70` | `audit-hashicorp-vault-0` | tools | 10Gi | Terminating |
| `pvc-7971918e-e47f-4739-a976-965ea2d770b4` | `erp` | erp | 50Gi | Terminating |
| `pvc-88e18c7f-2cfd-45e3-be5b-78c31ab829e9` | `prometheus-server` | tools | 8Gi | Terminating |
| `pvc-aed7f2c4-1948-487a-8d10-d8a1372289b4` | `storage-prometheus-alertmanager-0` | tools | 2Gi | Terminating |
| `pvc-ca5567d3-a682-4cee-8ff1-2b8e23260635` | `data-hashicorp-vault-0` | tools | 10Gi | Terminating |
| `pvc-cc8a3cbb-dbc2-47a2-a0cc-a02136122b90` | `traefik` | kube-system | 128Mi | Terminating |
| `pvc-cdd434d1-88b4-4588-8fd2-8c7eafc56d07` | `url-shortener` | url-shortener | 128Mi | Terminating |
| `pvc-d1d5482b-81c8-4d7c-a528-7a57ef47a5ce` | `redis-storage-redis-0` | tools | 1Gi | Terminating |
| `pvc-efda1d2f-1db8-46dd-9a97-3d11f1807ffa` | `backups-rwx` | longhorn-system | 50Gi | Lost |
| `pvc-14ccc47e-0b8c-49d4-97bb-70e550f644b0` | `crowdsec-db-pvc` | tools | 1Gi | already replaced |
| `pvc-f9fe3504-70ce-4401-8cda-bc6bb68bc1bf` | `crowdsec-config-pvc` | tools | 100Mi | already replaced |
CrowdSec volumes (`pvc-14ccc47e`, `pvc-f9fe3504`) are the old PVs — CrowdSec already got new volumes
(`pvc-4785dc60`, `pvc-01b93e30`) and is running. These old dirs can be cleaned up later.
---
## Recovery Plan
### Why not restore PVCs
New PVCs will be created by the workloads themselves when they restart. Restoring old PVCs would
conflict with both the stuck Terminating ones and any new ones pods may already be creating.
**Restore PVs only** — strip `claimRef` so they become `Available`, and new PVCs bind to them via
`storageClassName` + `accessMode` + `capacity` matching.
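The matching rule can be pictured as a predicate over a PV and a claim. The sketch below is a deliberate simplification: it ignores label selectors, `volumeMode` and pre-set `volumeName`, and the helper names are hypothetical rather than part of any Kubernetes client library.

```python
UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_quantity(quantity):
    """Convert a binary-suffixed quantity such as '16Gi' to bytes."""
    for suffix, factor in UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # plain byte count


def pv_can_satisfy(pv, pvc):
    """Simplified check: could this PV be bound to this claim?"""
    pv_spec, pvc_spec = pv["spec"], pvc["spec"]
    if "claimRef" in pv_spec:  # still reserved for another claim
        return False
    if pv_spec.get("storageClassName") != pvc_spec.get("storageClassName"):
        return False
    if not set(pvc_spec["accessModes"]) <= set(pv_spec["accessModes"]):
        return False
    wanted = parse_quantity(pvc_spec["resources"]["requests"]["storage"])
    return parse_quantity(pv_spec["capacity"]["storage"]) >= wanted
```

Under this rule two PVs with the same class and size, such as the two 10Gi Vault volumes, both satisfy either claim, so matching alone does not guarantee that a claim lands on the PV holding its own data.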
### Step 1 — Clear stuck Terminating PVs
The old PVs are stuck in `Terminating` with `kubernetes.io/pv-protection` finalizers. Remove the finalizers:
```bash
for pv in \
  pvc-1251909b-3cef-40c6-881c-3bb6e929a596 \
  pvc-6d2ea1c7-9327-4992-a02c-93ae604eda70 \
  pvc-7971918e-e47f-4739-a976-965ea2d770b4 \
  pvc-88e18c7f-2cfd-45e3-be5b-78c31ab829e9 \
  pvc-aed7f2c4-1948-487a-8d10-d8a1372289b4 \
  pvc-ca5567d3-a682-4cee-8ff1-2b8e23260635 \
  pvc-cc8a3cbb-dbc2-47a2-a0cc-a02136122b90 \
  pvc-cdd434d1-88b4-4588-8fd2-8c7eafc56d07 \
  pvc-d1d5482b-81c8-4d7c-a528-7a57ef47a5ce \
  pvc-efda1d2f-1db8-46dd-9a97-3d11f1807ffa; do
  kubectl patch pv "$pv" -p '{"metadata":{"finalizers":null}}' --type=merge
done
```
### Step 2 — Restore PVs with claimRef removed and Retain policy
Extract PVs from the backup, strip `claimRef` and set `persistentVolumeReclaimPolicy: Retain`,
then apply:
```bash
# As written, this reads the PV objects from the live cluster. If Step 1 has
# already let them be deleted, feed the PV documents from
# backup_20260413.volumes to the same python3 filter instead.
ssh pi1 "sudo kubectl get pv \
  pvc-1251909b-3cef-40c6-881c-3bb6e929a596 \
  pvc-6d2ea1c7-9327-4992-a02c-93ae604eda70 \
  pvc-7971918e-e47f-4739-a976-965ea2d770b4 \
  pvc-88e18c7f-2cfd-45e3-be5b-78c31ab829e9 \
  pvc-aed7f2c4-1948-487a-8d10-d8a1372289b4 \
  pvc-ca5567d3-a682-4cee-8ff1-2b8e23260635 \
  pvc-cc8a3cbb-dbc2-47a2-a0cc-a02136122b90 \
  pvc-cdd434d1-88b4-4588-8fd2-8c7eafc56d07 \
  pvc-d1d5482b-81c8-4d7c-a528-7a57ef47a5ce \
  pvc-efda1d2f-1db8-46dd-9a97-3d11f1807ffa \
  -o yaml 2>/dev/null | \
  python3 -c \"
import sys, yaml
docs = list(yaml.safe_load_all(sys.stdin))
for doc in docs:
    if not doc:
        continue
    # kubectl wraps several objects in a List; a single object has no 'items'
    items = doc.get('items', [doc])
    for pv in items:
        if pv.get('kind') != 'PersistentVolume':
            continue
        spec = pv.get('spec', {})
        spec.pop('claimRef', None)
        spec['persistentVolumeReclaimPolicy'] = 'Retain'
        pv.pop('status', None)
        meta = pv.get('metadata', {})
        meta.pop('resourceVersion', None)
        meta.pop('uid', None)
        meta.pop('creationTimestamp', None)
        print('---')
        print(yaml.dump(pv))
\" | kubectl apply -f -"
```
Expected result: PVs become `Available` (no claimRef = unbound).
### Step 3 — Longhorn creates new Volume CRDs + replica dirs
When new PVCs bind to the restored PVs and pods attempt to mount them, Longhorn's CSI provisioner
will create new Volume CRDs for each. These new Volume CRDs will have new engine IDs, and Longhorn
will create **new empty replica directories** on pi1, pi2, pi3.
At this point the volume directory layout will be:
```
/mnt/arcodange/longhorn/replicas/
pvc-1251909b-...-<OLD_SUFFIX> ← pi2/pi3: OLD data
pvc-1251909b-...-<NEW_SUFFIX> ← pi1/pi2/pi3: NEW empty dirs
```
### Step 4 — Map old dirs to new dirs, verify data presence
For each volume, on each node, identify:
- OLD dir: exists before new binding (larger .img file size, older timestamp)
- NEW dir: created after binding (empty or minimal .img file)
```bash
# Example: check sizes on pi2 for clickhouse
ssh pi2 "du -sh /mnt/arcodange/longhorn/replicas/pvc-1251909b-*"
```
### Step 5 — Swap directories (Method B)
For each volume on each node that has an old dir with data:
```bash
# Scale down the workload first
kubectl scale statefulset clickhouse -n tools --replicas=0
# Wait for volume to detach
kubectl wait --for=jsonpath='{.status.state}'=detached \
volume/pvc-1251909b-3cef-40c6-881c-3bb6e929a596 \
-n longhorn-system --timeout=60s
# On pi2: rename new empty dir, move old data dir to new name.
# \$( ... ) is escaped so the substitution runs on pi2, not on the local machine,
# and the cd makes the relative names in the mv commands resolve correctly.
ssh pi2 "
  cd /mnt/arcodange/longhorn/replicas || exit 1
  NEW=\$(ls | grep pvc-1251909b | \
    xargs -I{} stat --format='%Y {}' {} | \
    sort -rn | head -1 | awk '{print \$2}')
  OLD=\$(ls | grep pvc-1251909b | \
    xargs -I{} stat --format='%Y {}' {} | \
    sort -n | head -1 | awk '{print \$2}')
  echo \"OLD: \$OLD\"
  echo \"NEW: \$NEW\"
  sudo mv \"\$NEW\" \"\${NEW}.empty_backup\"
  sudo mv \"\$OLD\" \"\$NEW\"
"
# Repeat on pi3
# Restart the instance manager on affected node to pick up new dir
kubectl delete pod -n longhorn-system -l \
longhorn.io/node=pi2,longhorn.io/component=instance-manager
```
### Step 6 — Scale workloads back up and verify
```bash
kubectl scale statefulset clickhouse -n tools --replicas=1
kubectl get pvc -n tools clickhouse-storage-clickhouse-0
kubectl get volumes -n longhorn-system pvc-1251909b-3cef-40c6-881c-3bb6e929a596
```
---
## Priority Order for Recovery
Given data criticality:
1. **HashiCorp Vault data** (`pvc-ca5567d3` + `pvc-6d2ea1c7`) — credentials/secrets store
2. **ERP** (`pvc-7971918e`) — 50Gi, business data
3. **Prometheus** (`pvc-88e18c7f`) — 8Gi, metrics history (degraded OK, can rebuild)
4. **Redis** (`pvc-d1d5482b`) — 1Gi, cache (can rebuild from scratch if needed)
5. **Alertmanager** (`pvc-aed7f2c4`) — 2Gi, alert history (can rebuild)
6. **Clickhouse** (`pvc-1251909b`) — 16Gi
7. **URL shortener** (`pvc-cdd434d1`) — 128Mi
8. **Traefik** (`pvc-cc8a3cbb`) — 128Mi (TLS certs, can re-issue via cert-manager)
9. **Longhorn backups-rwx** (`pvc-efda1d2f`) — 50Gi, backup volume itself
---
## Caution: Multiple Dirs on Pi3
Several volumes have 3 directories on pi3. This likely happened during the incident when Longhorn
attempted rebuilds before the nuclear cleanup. **Do not blindly take the newest or oldest** — check
actual `.img` file size to identify the one with data:
```bash
ssh pi3 "du -sh /mnt/arcodange/longhorn/replicas/pvc-1251909b-*"
# The largest .img is the one with actual data
```
---
## Lessons for Backup Script
The current backup command `kubectl get -A pv,pvc -o yaml && echo '---' && kubectl get -A pvc -o yaml`
captures PV/PVC but not Longhorn Volume CRDs. The backup command must be updated to include:
```bash
kubectl get -A pv -o yaml && echo '---' \
&& kubectl get -A pvc -o yaml && echo '---' \
&& kubectl get -n longhorn-system volumes.longhorn.io -o yaml
```
This is tracked in ADR `docs/adr/20260414-longhorn-pvc-recovery.md` under "Prevention".
---
## Volume Recovery Status
| PV Name | PVC | Namespace | Size | Method | Status |
|---------|-----|-----------|------|--------|--------|
| `pvc-5391fa2b` | `traefik` | kube-system | 128Mi | PV claimRef remove | ✅ 2026-04-14 |
| `pvc-cdd434d1` | `url-shortener-data` | url-shortener | 128Mi | Method B (dir rename) | ✅ 2026-04-14 |
| `pvc-1251909b` | `clickhouse-storage-clickhouse-0` | tools | 16Gi | Block-device (playbook) | ✅ 2026-04-14 |
| `pvc-88e18c7f` | `prometheus-server` | tools | 8Gi | Block-device (playbook) | ⏳ 2026-04-15 |
| `pvc-aed7f2c4` | `storage-prometheus-alertmanager-0` | tools | 2Gi | Block-device (playbook) | ⏳ 2026-04-15 |
| `pvc-d1d5482b` | `redis-storage-redis-0` | tools | 1Gi | Block-device (playbook) | ⏳ 2026-04-15 |
| `pvc-efda1d2f` | `backups-rwx` | longhorn-system | 50Gi | Block-device (playbook) | ⏳ 2026-04-15 |
| `pvc-ca5567d3` | `data-hashicorp-vault-0` | tools | 10Gi | Manual (deferred) | 🔴 Pending |
| `pvc-6d2ea1c7` | `audit-hashicorp-vault-0` | tools | 10Gi | Manual (deferred) | 🔴 Pending |
| `pvc-7971918e` | `erp` | erp | 50Gi | Manual (deferred) | 🔴 Pending |
**Vault and ERP are excluded from automated recovery** — they require coordinated manual procedures
(Vault unseal key management; ERP business data verification). Use `docs/runbooks/longhorn-block-device-recovery.md`
with extra validation steps for those volumes.
---
## Automated Recovery: Block-Device Injection
Directory rename (Method B) proved too risky for large volumes: Longhorn detects `Dirty: true` +
inconsistency across replicas and silently rebuilds from the empty pi1 replica, destroying data.
**The approach that works** (implemented in `playbooks/recover/longhorn_data.yml`):
1. **Phase 0** — Auto-discover best replica dir per volume (skip `Rebuilding: true`, rank by actual disk usage)
2. **Phase 1** — Backup untouched replica dir before touching anything
3. **Phase 2** — Merge sparse snapshot + head layers into a flat image (`merge-longhorn-layers.py`)
4. **Phase 3** — Create Longhorn Volume CRD, wait for replicas
5. **Phase 4** — Scale down workload
6. **Phase 5** — Attach volume via VolumeAttachment maintenance ticket
7. **Phase 6** — `mkfs.ext4` the live block device, rsync data from merged image
8. **Phase 7** — Remove maintenance attachment ticket
9. **Phase 8** — Recreate PV (Retain, no claimRef) + PVC (pinned to PV)
10. **Phase 9** — Scale up, wait for readyReplicas ≥ 1, optional verify_cmd
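The layer merge in Phase 2 follows a simple per-block rule: a non-zero head block wins, and an all-zero head block falls through to the snapshot. The snippet below illustrates that rule on in-memory bytes. It is a sketch of the idea only, not the code of `merge-longhorn-layers.py`, and `merge_layers` is a hypothetical name.

```python
BLOCK = 4096


def merge_layers(snapshot: bytes, head: bytes, block: int = BLOCK) -> bytes:
    """Overlay head on snapshot block by block; all-zero head blocks fall through."""
    out = bytearray(len(head))  # starts as all zeros, like a sparse output file
    for offset in range(0, len(head), block):
        head_block = head[offset:offset + block]
        snap_block = snapshot[offset:offset + block]
        chosen = head_block if any(head_block) else snap_block
        out[offset:offset + len(chosen)] = chosen
    return bytes(out)
```

One consequence of this rule is that a block deliberately overwritten with zeros in the head cannot be told apart from an untouched block, so the snapshot content reappears there.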
**Pitfall discovered (2026-04-15):** `du -sb` returns apparent size for sparse files, making a
`Rebuilding: true` replica (1.3 GiB actual, 24 GiB apparent) beat healthy 11 GiB replicas.
Fixed by checking `Rebuilding` flag in `volume.meta` and using `du -sk` (actual usage).
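The corrected selection logic can be sketched as follows. This is an illustration under stated assumptions, not the playbook's implementation: the candidate dict keys (`path`, `rebuilding`, `actual_bytes`) and the function names are hypothetical, and `st_blocks` is taken to be in 512-byte units as on Linux.

```python
import os


def actual_bytes(path):
    """Real disk usage of one file in bytes (allocated blocks, not apparent size)."""
    return os.stat(path).st_blocks * 512


def dir_actual_bytes(replica_dir):
    """Sum the real disk usage of every regular file directly inside a replica dir."""
    return sum(
        actual_bytes(entry.path)
        for entry in os.scandir(replica_dir)
        if entry.is_file(follow_symlinks=False)
    )


def pick_best_replica(candidates):
    """Pick the candidate with the most real data, skipping rebuilding replicas."""
    healthy = [c for c in candidates if not c["rebuilding"]]
    if not healthy:
        return None
    return max(healthy, key=lambda c: c["actual_bytes"])
```

Ranking by allocated blocks and filtering on the `Rebuilding` flag are independent safeguards; either one alone would have avoided the pitfall described above.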
**Usage:**
```bash
ansible-playbook -i inventory/hosts.yml playbooks/recover/longhorn_data.yml \
-e @playbooks/recover/longhorn_data_vars_remaining.yml
```
Vars files:
- `playbooks/recover/longhorn_data_vars_clickhouse.yml` — clickhouse (already recovered, archived)
- `playbooks/recover/longhorn_data_vars_remaining.yml` — prometheus, alertmanager, redis, backups-rwx
- `playbooks/recover/longhorn_data_vars.example.yml` — template for future use
---
## Tested Recovery Procedure (url-shortener — 2026-04-14)
Method B confirmed working for this volume (small, no Rebuilding replicas). Full sequence:
1. Create Longhorn Volume CRD manually (size 128Mi, rwo, 3 replicas)
2. Create Longhorn VolumeAttachment ticket to pi1 (disableFrontend: true) → triggers replica dir creation
3. Remove attachment ticket → volume detaches
4. On pi2: `mv new-dir new-dir.empty && mv old-dir new-dir`
5. On pi3: same (chose `-70197659` over `-998f49ff` based on newer mtime: Apr 7 vs Apr 6)
6. Clear finalizers on stuck Terminating PV/PVC → both deleted
7. Recreate PV (Retain policy, no claimRef, same CSI volumeHandle)
8. Recreate PVC with `volumeName:` pinned to the PV
9. Delete old Error pod (was blocking volume attach)
10. New pod comes up 1/1 Running, volume attached healthy on pi3, all 3 replicas running
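Step 8 relies on `spec.volumeName` to bind a claim to one specific PV instead of leaving the choice to the matcher. A minimal sketch of such a manifest, built as a Python dict: the `longhorn` storage class name and the `ReadWriteOnce` default are assumptions, and `pinned_pvc` is a hypothetical helper. The example values are taken from the clickhouse row of the mapping table.

```python
import json


def pinned_pvc(name, namespace, pv_name, size,
               storage_class="longhorn", access_modes=("ReadWriteOnce",)):
    """Build a PVC manifest that binds to one specific, pre-existing PV."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "accessModes": list(access_modes),
            "storageClassName": storage_class,  # assumed class name
            "volumeName": pv_name,              # the pin: bind to this PV only
            "resources": {"requests": {"storage": size}},
        },
    }


if __name__ == "__main__":
    manifest = pinned_pvc(
        "clickhouse-storage-clickhouse-0",
        "tools",
        "pvc-1251909b-3cef-40c6-881c-3bb6e929a596",
        "16Gi",
    )
    # kubectl accepts JSON as well as YAML on stdin.
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be piped into `kubectl apply -f -` once the PV is `Available`.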
**Traefik** was simpler — PV `pvc-5391fa2b` already existed in Longhorn (Released). Just removed
claimRef (→ Available), created `kube-system/traefik` PVC with `volumeName:` pinned. Bound immediately.
**For all subsequent volumes** — use `playbooks/recover/longhorn_data.yml`. Method B is too risky.


@@ -0,0 +1,70 @@
---
# Automated Longhorn Recovery Playbook (DRAFT)
# Purpose: Break circular dependency and restore CSI driver after power-cut
#
# REQUIREMENTS:
# - Ansible >= 2.15
# - kubectl on control plane (pi1)
# - Backup scripts from playbooks/backup/k3s_pvc.yml must be deployed
#
# USAGE:
# ansible-playbook -i inventory/hosts.yml docs/incidents/2026-04-13-power-cut/recover_longhorn.yml
#
# REFERENCE FILES:
# - playbooks/system/k3s_config.yml (Longhorn HelmChart template)
# - playbooks/backup/k3s_pvc.yml (Backup/restore scripts)
# - inventory/hosts.yml (Target hosts)
# - /mnt/arcodange/longhorn/replicas/ (Data - MUST NOT be touched)
# - /home/pi/arcodange/backups/k3s_pvc/ (Fallback backup location)
#
#
# PLAYBOOK FLOW:
#
# Phase 1: DIAGNOSIS (idempotent, safe to run anytime)
# - Check CSI driver registration status
# - Check Longhorn manager health
# - Identify which recovery phase is needed
#
# Phase 2: SOFT RECOVERY (least destructive)
# - Touch longhorn-install.yaml manifest
# - Wait 60s for k3s HelmChart controller to reconcile
# - Verify pod recreation
#
# Phase 3: HARD RECOVERY (if soft fails)
# - Delete driver-deployer pod
# - Delete all longhorn-driver-deployer pods
# - Wait for HelmChart to recreate
#
# Phase 4: NUCLEAR RECOVERY (if hard fails)
# - Delete HelmChart resource
# - Remove manifest file
# - Force-delete longhorn-system namespace (after removing finalizers)
# - Reinstall Longhorn via manifest
#
# Phase 5: RESTORE FROM BACKUP (idempotent)
# - Apply PV/PVC from backup
# - Apply Longhorn CRs from backup
# - Data auto-discovered from disk
#
# DESIGNED TO HANDLE:
# - CSI driver registration lost
# - Longhorn manager webhook circular dependency
# - Partial pod crashes
# - Full Longhorn namespace corruption
#
# LIMITATIONS:
# - Requires pi1 (control plane) to be reachable
# - Data in /mnt/arcodange/longhorn/ MUST survive
# - Docker must be functional on at least 1 node
# - Does NOT handle Docker overlay2 corruption
#
# TESTED SCENARIOS:
# - [ ] CSI driver not registered (primary use case)
# - [ ] Longhorn manager CrashLoopBackOff
# - [ ] Full namespace deletion needed
# - [ ] Backup restore validation
#
# TODO:
# - Add Docker storage health check
# - Add pre-recovery data verification
# - Add post-recovery validation


@@ -0,0 +1,153 @@
---
title: Recovery Approach Analysis — Post-Incident Review
incident_id: 2026-04-13-001
date: 2026-04-13
author: Claude Code (external review)
---
# Recovery Approach Analysis
## TL;DR
The incident escalated from a **~5 minute fix** to a **full Longhorn reinstall with backup restore** because the simplest remediation (k3s restart) was never attempted, and a single aggressive command (`kubectl delete pods --all --force`) created a new problem that did not previously exist.
---
## What Was Skipped
### 1. Restart k3s on all nodes (never attempted)
This should have been the **first or second action** after the manifest touch failed.
```bash
systemctl restart k3s # pi1 — control plane
systemctl restart k3s-agent # pi2, pi3 — agent nodes
```
After a power cut, k3s/kubelet state is dirty. Restarting k3s:
- Forces kubelet to reinitialize the plugin registry cleanly
- Allows Longhorn pods to restart in correct dependency order
- Avoids the simultaneous-restart race condition that causes webhook issues
- Takes ~2 minutes with no destructive side effects
This was listed as a last resort in the runbook consulted at incident start. It should have been tried **before any pod deletion**, not after.
### 2. Stale CSI socket check on each node (never attempted)
```bash
# On each node (pi1, pi2, pi3):
ls /var/lib/kubelet/plugins/driver.longhorn.io/
# If a stale .sock file exists:
rm /var/lib/kubelet/plugins/driver.longhorn.io/csi.sock
```
The incident log confirms the CSI socket was missing/stale, but no one went to the nodes to verify and clean this up. Removing a stale socket + restarting the `longhorn-csi-plugin` daemonset is a targeted, low-risk fix.
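Before removing the socket it can help to confirm that it really is stale. The sketch below uses one possible definition, which is an assumption and not taken from the incident log: the socket file exists but nothing accepts connections on it. `socket_state` is a hypothetical helper and would need to run on the node with enough privileges to read the kubelet plugin directory.

```python
import os
import socket
import stat


def socket_state(path):
    """Classify a Unix socket path as 'missing', 'not-a-socket', 'stale' or 'live'."""
    try:
        mode = os.stat(path).st_mode
    except FileNotFoundError:
        return "missing"
    if not stat.S_ISSOCK(mode):
        return "not-a-socket"
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as probe:
        probe.settimeout(1.0)
        try:
            probe.connect(path)
        except (ConnectionRefusedError, socket.timeout):
            return "stale"  # file exists but no process is listening
        return "live"
```

Only a `stale` result would justify the `rm` shown above; a `live` socket points to a different problem.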
---
## Where the Direction Went Wrong
### The pivotal mistake: force deleting all 24 pods simultaneously
**Command run at 15:32:15:**
```bash
kubectl delete pods -n longhorn-system --all --force --grace-period=0
```
This command created the **webhook circular dependency problem**, which did not exist before it was run.
**Why it caused the circular dependency:**
In normal operation, Longhorn managers start sequentially. One becomes the webhook leader and begins serving on port 9501 before others register as service endpoints.
When all 24 pods are force-deleted simultaneously:
1. All 3 manager pods race-start at the same time
2. All 3 IPs are registered as `longhorn-conversion-webhook` service endpoints immediately
3. The health check (`https://<pod-ip>:9501/v1/healthz`) is run against all 3
4. Only the elected leader actually serves port 9501 — the other 2 fail the probe
5. Failing managers crash: `"conversion webhook service is not accessible after 1m0s"`
6. `longhorn-driver-deployer` init container waits for healthy managers indefinitely
7. CSI socket is never created, CSI driver never registers
**The original problem was only a lost CSI socket registration.** The webhook circular dependency is a new problem introduced by the recovery attempt.
---
## The Escalation Cascade
Each step created a harder problem than the one it was meant to solve:
```
Power cut
→ CSI socket lost (original problem — simple fix)
→ Force delete all pods
→ Webhook circular dependency (new problem)
→ Delete HelmChart + manifest
→ 84 finalizers blocking namespace deletion (new problem)
→ Full reinstall required
→ Backup restore required
→ Risk to volume metadata
```
The original problem required touching 1 socket file and restarting k3s. The current state requires:
- Manually patching finalizers off 84+ resources
- Full Longhorn reinstall
- Restoring PV/PVC and Longhorn CRs from backup
- Verifying data auto-discovery from replicas
---
## Correct Recovery Sequence (Hindsight)
### Step 1 — k3s restart (should have been tried at ~15:27)
```bash
ansible -i inventory/hosts.yml all -m shell -a "sudo systemctl restart k3s || sudo systemctl restart k3s-agent"
```
Wait 3 minutes. In most power-cut scenarios, this alone restores CSI registration.
### Step 2 — If still broken: targeted daemonset restart (not force-delete-all)
```bash
kubectl rollout restart daemonset/longhorn-manager -n longhorn-system
kubectl rollout status daemonset/longhorn-manager -n longhorn-system
```
Graceful restart respects the dependency order. Wait for managers to stabilize before touching CSI pods.
### Step 3 — Check and clean stale sockets on each node
```bash
# Run on pi1, pi2, pi3:
ls /var/lib/kubelet/plugins/driver.longhorn.io/
rm -f /var/lib/kubelet/plugins/driver.longhorn.io/csi.sock
kubectl rollout restart daemonset/longhorn-csi-plugin -n longhorn-system
```
### Step 4 — Verify CSI driver registered
```bash
kubectl get csidriver
kubectl get csinodes
```
### Step 5 — Only if all above failed: delete driver-deployer pod only
```bash
kubectl delete pod -n longhorn-system -l app=longhorn-driver-deployer
```
Not all pods. One targeted pod.
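The sequence above is an escalation ladder: apply the least destructive step, check health, and escalate only if the check still fails. A minimal sketch of that control flow, with hypothetical names throughout; in practice each action would wrap one of the commands above and the health check would inspect `kubectl get csidriver`.

```python
def run_ladder(steps, is_healthy):
    """Run recovery steps from least to most destructive.

    `steps` is an ordered list of (name, action) pairs and `is_healthy` is a
    zero-argument callable. Execution stops as soon as the health check
    passes. Returns the names of the steps that were actually executed.
    """
    executed = []
    if is_healthy():
        return executed  # nothing to do
    for name, action in steps:
        action()
        executed.append(name)
        if is_healthy():
            break
    return executed
```

Because the loop re-checks health after every step, a destructive step further down the list can only run after every gentler step before it has been tried and has failed.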
---
## What Was Done Well
- Quick identification of the original root cause (CSI registration)
- Confirming volume data integrity early (`robustness="healthy"`)
- Securing backups before destructive operations (16:30)
- Fixing the backup script bug (useful regardless of incident)
- Detailed logging throughout
---
## Action Items for Future Incidents
- [ ] Add k3s restart as **step 2** in the Longhorn recovery runbook (before any pod deletion)
- [ ] Add CSI socket cleanup to the runbook as an explicit step on each node
- [ ] Add a "minimum destructive action" principle: prefer `rollout restart` over `delete --force --all`
- [ ] Implement `recover_longhorn.yml` playbook with the phased approach (soft → targeted → hard) to prevent ad-hoc escalation
- [ ] Add a pre-action checklist: "have I tried restarting the service before deleting its resources?"


@@ -0,0 +1,107 @@
#!/usr/bin/env python3
"""
Merge Longhorn snapshot + head layers into a single mountable raw image.

Longhorn stores replica data as sparse raw images in a chain:
    volume-snap-<id>.img — full state at the time the snapshot was taken
    volume-head-NNN.img  — delta (only changed blocks) since the snapshot

To reconstruct the full filesystem, head blocks take priority over snapshot
blocks. Sparse (all-zero) blocks in the head fall through to the snapshot.

Usage:
    sudo python3 merge-longhorn-layers.py <replica-dir> <output.img>

Example:
    sudo python3 merge-longhorn-layers.py \\
        /mnt/arcodange/longhorn/replicas/pvc-cdd434d1-...-998f49ff \\
        /tmp/merged.img

    # Then mount and inspect:
    sudo mount -o loop /tmp/merged.img /mnt/recovery
    ls /mnt/recovery/

Proven useful during incident 2026-04-13 to recover the url-shortener SQLite
database from a Longhorn replica that was never touched by the nuclear cleanup
(pi3, dir suffix -998f49ff, Apr 6 snapshot).

Key lesson: always identify the untouched replica dir (oldest timestamps,
never renamed) before attempting directory swaps. Back it up first.
"""
# Keeps the `str | None` annotations valid on Python versions before 3.10.
from __future__ import annotations

import os
import sys
import json

BLOCK = 4096


def find_layers(replica_dir: str) -> tuple[str | None, str]:
    """
    Read volume.meta to find head filename and snapshot parent.
    Returns (snapshot_path, head_path). snapshot_path is None for base volumes.
    """
    meta_path = os.path.join(replica_dir, "volume.meta")
    with open(meta_path) as f:
        meta = json.load(f)
    head_name = meta["Head"]
    parent_name = meta.get("Parent", "")
    head_path = os.path.join(replica_dir, head_name)
    snap_path = os.path.join(replica_dir, parent_name) if parent_name else None
    return snap_path, head_path


def merge(snap_path: str | None, head_path: str, out_path: str) -> None:
    size = os.path.getsize(head_path)
    print(f"Volume size: {size // (1024 * 1024)} MiB")
    print(f"Snapshot: {snap_path or '(none — base volume)'}")
    print(f"Head: {head_path}")
    print(f"Output: {out_path}")
    snap_f = open(snap_path, "rb") if snap_path else None
    head_f = open(head_path, "rb")
    with open(out_path, "wb") as out:
        # Pre-size the output; blocks that are never written stay sparse.
        out.truncate(size)
        # Guard against division by zero for images smaller than one block.
        blocks = max(size // BLOCK, 1)
        for i, offset in enumerate(range(0, size, BLOCK)):
            head_f.seek(offset)
            hb = head_f.read(BLOCK)
            if hb and any(hb):
                # Non-zero head block: head takes priority.
                out.seek(offset)
                out.write(hb)
            elif snap_f:
                # All-zero head block: fall through to the snapshot.
                snap_f.seek(offset)
                sb = snap_f.read(BLOCK)
                if sb and any(sb):
                    out.seek(offset)
                    out.write(sb)
            if i % 4096 == 0:
                pct = (i / blocks) * 100
                print(f"\r {pct:.0f}%", end="", flush=True)
    print("\r 100% — done.")
    if snap_f:
        snap_f.close()
    head_f.close()


if __name__ == "__main__":
    if len(sys.argv) != 3:
        print(__doc__)
        sys.exit(1)
    replica_dir = sys.argv[1]
    out_path = sys.argv[2]
    if not os.path.isdir(replica_dir):
        print(f"Error: {replica_dir} is not a directory", file=sys.stderr)
        sys.exit(1)
    snap, head = find_layers(replica_dir)
    merge(snap, head, out_path)


@@ -0,0 +1,312 @@
# Incident Documentation
This directory contains incident reports, postmortems, and recovery logs for the Arcodange Factory infrastructure.
## Purpose
Document all infrastructure incidents to:
- Track root causes and resolutions
- Maintain a knowledge base for future troubleshooting
- Improve system reliability through lessons learned
- Provide clear guidance for on-call responders
## Structure
Each incident is documented in its own directory under `docs/incidents/` with the following naming convention:
```
docs/incidents/
├── YYYY-MM-DD-incident-name/
│ ├── README.md # Incident summary and timeline
│ ├── status.md # Real-time status updates (optional)
│ ├── log.md # Detailed recovery actions and logs
│ ├── root-cause.md # Technical analysis (optional)
│ └── diagrams/ # Architecture/flow diagrams (optional)
│ └── *.mmd # Mermaid diagrams
└── ...
```
## Incident Directory Contents
### 1. `README.md` (Required)
The primary incident document. Must include:
- **Incident ID**: Unique identifier (e.g., `2026-04-13-001`)
- **Title**: Clear, descriptive title
- **Date/Time**: Start and end timestamps
- **Status**: Open / Investigating / Resolved / Monitoring
- **Severity**: SEV-1 (Critical) / SEV-2 (High) / SEV-3 (Medium) / SEV-4 (Low)
- **Impact**: Brief description of affected services
- **Summary**: What happened
- **Timeline**: Key events with timestamps
- **Root Cause**: Technical analysis
- **Resolution**: Steps taken to resolve
- **Action Items**: Follow-up tasks
- **Lessons Learned**: Key takeaways
**Front matter template:**
```markdown
---
title: Incident Title
incident_id: YYYY-MM-DD-NNN
date: YYYY-MM-DD
time_start: HH:MM:SS UTC
time_end: HH:MM:SS UTC
status: Resolved
severity: SEV-2
tags:
- kubernetes
- longhorn
- storage
---
```
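The template lends itself to a quick automated check. The sketch below assumes the front matter has already been parsed into a dict, that `NNN` is a three-digit sequence number as in `2026-04-13-001`, and that the five fields in `REQUIRED` are the mandatory ones; the function name and that field selection are hypothetical.

```python
import re

INCIDENT_ID = re.compile(r"^\d{4}-\d{2}-\d{2}-\d{3}$")
STATUSES = {"Open", "Investigating", "Resolved", "Monitoring"}
SEVERITIES = {"SEV-1", "SEV-2", "SEV-3", "SEV-4"}
REQUIRED = ("title", "incident_id", "date", "status", "severity")  # assumed set


def front_matter_problems(meta):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = [f"missing field: {key}" for key in REQUIRED if key not in meta]
    if "incident_id" in meta and not INCIDENT_ID.match(str(meta["incident_id"])):
        problems.append("incident_id must look like YYYY-MM-DD-NNN")
    if "status" in meta and meta["status"] not in STATUSES:
        problems.append(f"unknown status: {meta['status']}")
    if "severity" in meta and meta["severity"] not in SEVERITIES:
        problems.append(f"unknown severity: {meta['severity']}")
    return problems
```

A check like this could run in CI over every `docs/incidents/*/README.md` so that free-form status values are caught at review time.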
### 2. `log.md` (Recommended)
Detailed technical log of all recovery actions. Must include:
- Commands executed with timestamps
- Command output (relevant portions)
- Decision rationale for each action
- Outcome of each action
- Next steps identified
Format:
````markdown
## [Time] Action Description
**Command:** `actual command run`
**Output:**
```
relevant output
```
**Decision:** Why this action was taken
**Outcome:** What happened
**Next:** What to do next
````
### 3. Mermaid Diagrams
Include at least one Mermaid diagram in each incident to visualize:
- Architecture/flow before incident
- Failure propagation
- Recovery process
- New architecture after fixes
**Example theme usage:**
```mermaid
%%{init: { 'theme': 'forest', 'themeVariables': { 'primaryColor': '#ffdfd3', 'edgeLabelBackground':'#fff' }}}%%
```
Available themes: `default`, `base`, `forest`, `dark`, `neutral`
**Recommended diagrams:**
- `incident-flow.mmd`: Timeline/flow of the incident
- `architecture.mmd`: Affected components architecture
- `recovery-flow.mmd`: Recovery steps visualization
- `dependency-tree.mmd`: Component dependencies showing failure path
## Incident Severity Definitions
| Severity | Description | Response Time | Impact |
|----------|-------------|---------------|--------|
| SEV-1 | Critical system-wide outage | Immediate (24/7) | Multiple services down, potential data loss |
| SEV-2 | Major service degradation | < 1 hour | Single critical service down |
| SEV-3 | Partial service degradation | < 4 hours | Non-critical service affected |
| SEV-4 | Minor issue | Next business day | Cosmetic or non-impacting |
## Available Ansible Playbooks for Recovery
This collection provides comprehensive infrastructure management via Ansible.
Always use `-i inventory/hosts.yml` when running playbooks.
### Master Playbooks (Run in order for full recovery)
| Playbook | Purpose | Targets |
|----------|---------|---------|
| `playbooks/01_system.yml` | System setup (hostnames, iSCSI, Docker, Longhorn, DNS) | raspberries |
| `playbooks/02_setup.yml` | Infrastructure setup (NFS backup, PostgreSQL, Gitea) | localhost, postgres, gitea |
| `playbooks/03_cicd.yml` | CI/CD pipeline (Gitea tokens, Docker Compose, ArgoCD) | localhost, gitea |
| `playbooks/04_tools.yml` | Tool deployment (HashiCorp Vault, CrowdSec) | tools group |
| `playbooks/05_backup.yml` | Backup configuration | localhost |
### Component-Specific Playbooks
#### System
| Playbook | Purpose | Notes |
|----------|---------|-------|
| `playbooks/system/rpi.yml` | Raspberry Pi hostname setup | |
| `playbooks/system/dns.yml` | DNS/pi-hole configuration | |
| `playbooks/system/ssl.yml` | SSL certificate setup with step-ca | |
| `playbooks/system/prepare_disks.yml` | Disk partitioning and formatting | |
| `playbooks/system/system_docker.yml` | Docker installation with custom storage | Storage at `/mnt/arcodange/docker` |
| `playbooks/system/k3s_config.yml` | K3s configuration (Traefik, Longhorn HelmCharts) | **Key for k3s** |
| `playbooks/system/system_k3s.yml` | K3s cluster deployment | Uses k3s-ansible collection |
| `playbooks/system/iscsi_longhorn.yml` | iSCSI client for Longhorn | Prerequisite for Longhorn |
| `playbooks/system/k3s_dns.yml` | K3s DNS configuration | |
| `playbooks/system/k3s_ssl.yml` | K3s SSL/traefik certificates | |
#### Storage
| Playbook | Purpose | Notes |
|----------|---------|-------|
| `playbooks/setup/backup_nfs.yml` | Longhorn RWX NFS backup volume | Creates 50Gi PVC + recurring backups |
| `playbooks/backup/k3s_pvc.yml` | PVC backup scripts | Creates `/opt/k3s_volumes/backup.sh` and `restore.sh` |
#### Backup
| Playbook | Purpose | Notes |
|----------|---------|-------|
| `playbooks/backup/backup.yml` | Main backup orchestration | Calls postgres, gitea, k3s_pvc |
| `playbooks/backup/postgres.yml` | PostgreSQL database backup | Docker exec pg_dumpall |
| `playbooks/backup/gitea.yml` | Gitea backup | Uses gitea dump command |
| `playbooks/backup/cron_report.yml` | Mail utility for cron reports | |
| `playbooks/backup/cron_report_mailutility.yml` | MTA configuration | |
### Inventory File
**File:** `inventory/hosts.yml`
**Groups:**
- `raspberries`: pi1, pi2, pi3 (Raspberry Pi nodes)
- `local`: localhost, pi1, pi2, pi3
- `postgres`: pi2 (PostgreSQL host)
- `gitea`: pi2 (Gitea host, inherits postgres)
- `pihole`: pi1, pi3 (DNS hosts)
- `step_ca`: pi1, pi2, pi3 (Certificate authority)
- `all`: All above groups
**Important:** All playbooks MUST be run with the `-i inventory/hosts.yml` flag:
```bash
ansible-playbook -i inventory/hosts.yml playbooks/01_system.yml
```
### Handy Commands for Incident Response
```bash
# Check all pods
kubectl get pods -A
# Check Longhorn specifically
kubectl get pods -n longhorn-system
kubectl get volumes.longhorn.io -n longhorn-system
kubectl get replicas.longhorn.io -n longhorn-system
# Check storage
kubectl get pv -A
kubectl get pvc -A
kubectl get csidriver
# Check nodes
kubectl get nodes -o wide
kubectl describe node <nodename>
# Force Longhorn HelmChart reconcile (k3s-specific)
sudo touch /var/lib/rancher/k3s/server/manifests/longhorn-install.yaml
# Restart Longhorn
kubectl delete pods -n longhorn-system --all --force --grace-period=0
# Check Longhorn data on disk
ls /mnt/arcodange/longhorn/replicas/
# Check Docker storage
ls /mnt/arcodange/docker/overlay2/ | head
# Run ansible playbook (dry-run first)
ansible-playbook -i inventory/hosts.yml playbooks/01_system.yml --check --diff
ansible-playbook -i inventory/hosts.yml playbooks/01_system.yml --limit pi1
```
### K3s-Specific Recovery Notes
Longhorn is installed via **HelmChart manifest** (k3s native):
- File: `/var/lib/rancher/k3s/server/manifests/longhorn-install.yaml`
- To trigger reconcile: `touch` the file (k3s watches for changes)
- DO NOT use `helm install` directly - it may conflict with k3s HelmChart controller
Traefik is also installed via HelmChart manifest:
- File: `/var/lib/rancher/k3s/server/manifests/traefik-v3.yaml`
## Incident Templates
### Quick Start Template
````markdown
---
title: [Short Description]
incident_id: YYYY-MM-DD-NNN
date: $(date +%Y-%m-%d)
time_start: $(date +%H:%M:%S)
status: Investigating
severity: SEV-2
tags:
- tag1
- tag2
---
## Summary
[1-2 sentences describing the issue]
## Impact
[What services/users are affected]
## Timeline
| Time | Event | Owner |
|------|-------|-------|
| HH:MM | Initial detection | @user |
| HH:MM | Investigation started | @user |
| HH:MM | Root cause identified | @user |
| HH:MM | Resolution applied | @user |
| HH:MM | Service restored | @user |
## Root Cause
[Technical analysis]
## Resolution
[Step-by-step what was done]
## Mermaid Diagram
```mermaid
%%{init: { 'theme': 'forest' }}%%
graph TD
A[Component A] -->|depends on| B[Component B]
B -->|failed due to| C[Component C]
C -->|power cut| D[Root Cause]
```
*Remember to always do this for labels:*
- have a space before a filepath
- no parentheses '()'
- use <br> instead of \n for new lines
## Action Items
- [ ] Task 1
- [ ] Task 2
## Lessons Learned
- Lesson 1
- Lesson 2
````
## Contributing to Incident Documentation
1. **During Incident**: Focus on resolution, log commands and outputs in `log.md`
2. **After Resolution**: Create or update the `README.md` with full incident details
3. **Add Diagrams**: Include at least one Mermaid diagram to visualize the issue
4. **Peer Review**: Have another team member review before closing
5. **Update Templates**: Improve templates based on what was missing
## Directory Index
| Incident | Date | Severity | Status |
|----------|------|----------|--------|
| [2026-04-13-power-cut](./2026-04-13-power-cut/README.md) | 2026-04-13 | SEV-1 | In Progress |


@@ -0,0 +1,244 @@
# Cluster Recovery Agent Instructions
You are recovering the Arcodange homelab k3s cluster after an outage (power cut, node failure, or
Longhorn reinstall). Your job is to assess damage, run the appropriate Ansible playbooks and
kubectl commands, and bring the cluster back to a fully healthy state.
You do NOT need to modify any code. All recovery tooling already exists.
---
## Cluster Overview
| Component | Details |
|-----------|---------|
| Nodes | pi1, pi2, pi3 (Raspberry Pi, SSH via `pi<N>.home`) |
| k8s distribution | k3s |
| Storage | Longhorn (`/mnt/arcodange/longhorn/`) |
| GitOps | ArgoCD (apps auto-sync from `gitea.arcodange.lab/arcodange-org/`) |
| Secrets | HashiCorp Vault (`tools` namespace, manual unseal) |
| Ingress | Traefik + CrowdSec bouncer |
| Working dir | `/Users/gabrielradureau/Work/Arcodange/factory/ansible/arcodange/factory/` |
| Inventory | `inventory/hosts.yml` |
**Critical dependency:** ERP (Dolibarr) uses Vault-rotated DB credentials written to its PVC.
**Always recover and unseal Vault before scaling ERP up.**
---
## Step 0 — Assess Damage
Run these first to understand what is broken:
```bash
# Overall pod health
kubectl get pods -A | grep -v Running | grep -v Completed
# PVC health (anything not Bound is a problem)
kubectl get pvc -A | grep -v Bound
# Longhorn volume states
kubectl get volumes.longhorn.io -n longhorn-system
# Longhorn manager health (prerequisite for all recovery)
kubectl get pods -n longhorn-system -l app=longhorn-manager
```
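When the pod list is long, the same filter can be applied by column instead of by substring. The following is a minimal Python sketch, assuming the default `kubectl get pods -A` column order (NAMESPACE, NAME, READY, STATUS, ...); `unhealthy_pods` is a hypothetical helper and not part of the existing recovery tooling.

```python
from typing import List, Tuple


def unhealthy_pods(kubectl_output: str) -> List[Tuple[str, str, str]]:
    """Return (namespace, name, status) for pods that are not Running or Completed.

    Expects the plain-text table printed by `kubectl get pods -A`.
    """
    rows = []
    for line in kubectl_output.splitlines():
        fields = line.split()
        # Skip blank lines, short lines, and the header row.
        if len(fields) < 4 or fields[0] == "NAMESPACE":
            continue
        namespace, name, _ready, status = fields[:4]
        if status not in ("Running", "Completed"):
            rows.append((namespace, name, status))
    return rows
```

Unlike `grep -v Running`, this only looks at the STATUS column, so a pod whose name happens to contain "Running" is not hidden.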
---
## Step 1 — Longhorn Volume Recovery
### Path A — Fast path (backup file exists, Volume CRDs were backed up)
Check if a recent backup exists on pi1:
```bash
ssh pi1.home "ls -lt /mnt/backups/k3s_pvc/backup_*.volumes | head -5"
```
If a backup file exists and is recent (from before the incident):
```bash
ssh pi1.home "kubectl apply -f /mnt/backups/k3s_pvc/backup_<YYYYMMDD>.volumes"
```
Then verify that the PVCs are `Bound` and skip to Step 2.
### Path B — Block-device injection (no usable backup, raw replica files intact)
Use this when PVCs are `Lost`/`Terminating` and no Volume CRD backup is available.
**Check which volumes need recovery:**
```bash
# Volumes with no PVC or Lost/Terminating PVC
kubectl get pvc -A | grep -v Bound
```
**For each failed volume, create a vars file** following the pattern in:
`playbooks/recover/longhorn_data_vars.example.yml`
Existing vars files from the 2026-04-13 incident (reusable as references):
- `playbooks/recover/longhorn_data_vars_remaining.yml` — prometheus, alertmanager, redis, backups-rwx
- `playbooks/recover/longhorn_data_vars_erp_vault.yml` — erp, hashicorp-vault (audit + data)
- `playbooks/recover/longhorn_data_vars_clickhouse.yml` — clickhouse
**Key rules for the vars file:**
- `source_node`/`source_dir` can be omitted — Phase 0 auto-discovers the largest non-Rebuilding replica
- Set `workload_name: ""` for ERP — it must not scale up until Vault is unsealed
- For StatefulSets with multiple PVCs (e.g. Vault), set `workload_name: ""` on all but the last entry
**Run the recovery playbook:**
```bash
ansible-playbook -i inventory/hosts.yml playbooks/recover/longhorn_data.yml \
-e @playbooks/recover/longhorn_data_vars_<NAME>.yml
```
The playbook is **idempotent** — safe to re-run if it fails midway.
**Playbook phases (for context when troubleshooting):**
| Phase | What it does |
|-------|-------------|
| 0 | Auto-discovers best replica dir (skips `Rebuilding: true`) |
| 1 | Backs up untouched replica dir to `/home/pi/arcodange/backups/longhorn-recovery/` |
| 2 | Merges snapshot+head layers into a single `.img` via `merge-longhorn-layers.py` |
| 3 | **Scales down workloads first**, then clears stuck Terminating PVCs, creates Volume CRD |
| 4 | Scale down (second pass, idempotent) |
| 5 | Attaches volume via maintenance ticket to source node |
| 6 | `mkfs.ext4` (if unformatted) + `rsync` from merged image into live block device |
| 7 | Removes maintenance ticket (volume detaches) |
| 8 | Creates PV (Retain, no claimRef) + PVC pinned to PV |
| 9 | Scales up workloads, waits for readyReplicas ≥ 1 (failures here are `ignore_errors: yes`) |
**Common Phase 8 failure — StatefulSet re-creates PVCs before they can be pinned:**
The playbook handles this automatically (scales down before finalizer removal). If you still hit it:
```bash
kubectl scale statefulset <name> -n <namespace> --replicas=0
kubectl patch pvc <pvc-name> -n <namespace> --type=merge -p '{"metadata":{"finalizers":null}}'
kubectl delete pvc <pvc-name> -n <namespace>
# Then re-run the playbook
```
---
## Step 2 — Unseal HashiCorp Vault
After Vault's PVCs are recovered, the pod boots **sealed**. Check:
```bash
kubectl get pod hashicorp-vault-0 -n tools
kubectl exec hashicorp-vault-0 -n tools -- vault status 2>/dev/null | grep Sealed
```
If sealed, run the unseal playbook (requires interactive terminal for the Gitea password prompt):
```bash
ansible-playbook -i inventory/hosts.yml playbooks/tools/hashicorp_vault.yml
```
Unseal keys are at `~/.arcodange/cluster-keys.json` on the local machine. The playbook reads them automatically.
After the playbook completes, verify:
```bash
kubectl get pod hashicorp-vault-0 -n tools # must be 1/1 Ready
kubectl exec hashicorp-vault-0 -n tools -- vault status | grep Sealed # must be false
```
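If the seal check needs to be scripted rather than read by eye, the `Sealed` row can be parsed from the `vault status` table. This is a rough Python sketch, assuming the default two-column table output of `vault status`; `vault_is_sealed` is a hypothetical helper name.

```python
from typing import Optional


def vault_is_sealed(status_output: str) -> Optional[bool]:
    """Parse `vault status` table output.

    Returns True/False for the Sealed row, or None if the row is missing
    (for example when the command produced no output).
    """
    for line in status_output.splitlines():
        fields = line.split()
        # Match the exact key so "Seal Type" is not confused with "Sealed".
        if len(fields) >= 2 and fields[0] == "Sealed":
            return fields[1].lower() == "true"
    return None
```

A `None` result is worth treating as "unknown, do not scale ERP up yet" rather than as unsealed.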
---
## Step 3 — Scale Up ERP
Only after Vault is unsealed and Ready:
```bash
kubectl scale deployment erp -n erp --replicas=1
kubectl rollout status deployment/erp -n erp
```
---
## Step 4 — Reconfigure Tools (CrowdSec, etc.)
Run if CrowdSec bouncer or Traefik middleware needs reconfiguring:
```bash
# Standard run (bouncer key + Traefik middleware + restart)
ansible-playbook -i inventory/hosts.yml playbooks/tools/crowdsec.yml
# Include captcha HTML injection (use when captcha page is broken)
ansible-playbook -i inventory/hosts.yml playbooks/tools/crowdsec.yml --tags never,all
```
If crowdsec-agent or crowdsec-appsec pods are stuck in `Error` after a long outage,
the playbook handles restarting them automatically.
---
## Step 5 — Re-enable ArgoCD selfHeal
Check if `selfHeal` was disabled during recovery (look for `selfHeal: false` in the tools app):
```bash
grep -A5 "tools:" /Users/gabrielradureau/Work/Arcodange/factory/argocd/values.yaml
```
If disabled, re-enable it by editing `argocd/values.yaml` and setting `selfHeal: true`,
then check the sync status of the ArgoCD app:
```bash
kubectl get app tools -n argocd
```
---
## Step 6 — Final Verification
```bash
# All pods running
kubectl get pods -A | grep -v Running | grep -v Completed | grep -v "^NAME"
# All PVCs bound
kubectl get pvc -A | grep -v Bound
# All Longhorn volumes healthy
kubectl get volumes.longhorn.io -n longhorn-system
# Run a fresh backup to capture the recovered state
ansible-playbook -i inventory/hosts.yml playbooks/backup/backup.yml \
-e backup_root_dir=/mnt/backups
```
---
## Key Files Reference
| File | Purpose |
|------|---------|
| `playbooks/recover/longhorn_data.yml` | Main block-device recovery playbook |
| `playbooks/recover/longhorn.yml` | Recovery when Volume CRDs still exist |
| `playbooks/recover/longhorn_data_vars.example.yml` | Template for recovery vars |
| `playbooks/recover/longhorn_data_vars_erp_vault.yml` | Vars for erp + vault (2026-04-13 incident) |
| `playbooks/recover/longhorn_data_vars_remaining.yml` | Vars for other volumes (2026-04-13 incident) |
| `playbooks/backup/backup.yml` | Full backup (postgres + gitea + k3s PVCs + Longhorn CRDs) |
| `playbooks/backup/k3s_pvc.yml` | PV/PVC/Longhorn Volume CRD backup |
| `playbooks/tools/hashicorp_vault.yml` | Vault unseal + OIDC reconfiguration |
| `playbooks/tools/crowdsec.yml` | CrowdSec bouncer + Traefik middleware setup |
| `docs/adr/20260414-longhorn-pvc-recovery.md` | Full incident ADR with all recovery methods |
| `~/.arcodange/cluster-keys.json` | Vault unseal keys (local machine only) |
---
## Decision Tree
```
Cluster down after outage
├─ kubectl works? ──No──▶ Check k3s: `systemctl status k3s` on pi1/pi2/pi3
└─ Yes
├─ PVCs all Bound? ──Yes──▶ Skip to Step 2 (check Vault)
└─ No
├─ Recent .volumes backup on pi1? ──Yes──▶ Path A (kubectl apply backup)
└─ No
├─ Longhorn Volume CRDs exist? ──Yes──▶ playbooks/recover/longhorn.yml
└─ No ──▶ Path B (longhorn_data.yml block-device injection)
Check replica dirs exist first:
for n in pi1 pi2 pi3; do ssh $n.home "sudo du -sh /mnt/arcodange/longhorn/replicas/pvc-*"; done
```
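The decision tree above can also be read as a chain of early returns. A minimal Python sketch follows, assuming the four yes/no answers have already been gathered by hand; the function name `recovery_path` and its parameter names are hypothetical.

```python
def recovery_path(
    kubectl_works: bool,
    pvcs_all_bound: bool,
    recent_volumes_backup: bool,
    volume_crds_exist: bool,
) -> str:
    """Return the next action from the decision tree, checked top to bottom."""
    if not kubectl_works:
        return "Check k3s: systemctl status k3s on pi1/pi2/pi3"
    if pvcs_all_bound:
        return "Skip to Step 2 (check Vault)"
    if recent_volumes_backup:
        return "Path A (kubectl apply backup)"
    if volume_crds_exist:
        return "playbooks/recover/longhorn.yml"
    return "Path B (longhorn_data.yml block-device injection)"
```

The order of the checks matters: a recent `.volumes` backup is preferred over existing Volume CRDs, exactly as in the tree.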


@@ -0,0 +1,360 @@
# Runbook: Longhorn Block-Device Data Recovery
**When to use:** Longhorn has been fully reinstalled (nuclear cleanup). Volume CRDs are gone.
Application PVCs are stuck `Terminating` or `Lost`. The raw replica `.img` files still exist
on disk across the nodes. kubectl/k8s objects cannot help — we must work directly with the
Longhorn replica directories and block devices.
**Automated version:** `playbooks/recover/longhorn_data.yml`
---
## Mental Model
Longhorn stores each replica as a chain of sparse raw image files inside a directory named
`<pv-name>-<random-hex>` under `<longhorn_data_path>/replicas/`. Each directory contains:
```
volume.meta — engine state (Head filename, Parent snapshot, Dirty flag)
volume-head-NNN.img — active write log (sparse, only changed blocks)
volume-head-NNN.img.meta — head metadata
volume-snap-<uuid>.img — snapshot at a point in time (sparse, full state)
volume-snap-<uuid>.img.meta — snapshot metadata
revision.counter — monotonically increasing write counter
```
After a nuclear cleanup + reinstall, Longhorn creates **new empty replica directories** with
new random hex suffixes. The old directories (with data) are left on disk but orphaned.
**Why directory-swap fails:** the old `volume.meta` has a different engine generation and
`Dirty: true`. Longhorn detects the inconsistency across replicas and rebuilds from the
"cleanest" source (the new empty pi1 replica), overwriting the old data.
**What works:** extract the filesystem from the untouched replica directory directly, then
inject the data files into the live Longhorn block device while the volume is temporarily
attached in maintenance mode.
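To make the replica metadata concrete, here is a small Python sketch that reads `volume.meta` from a replica directory. It assumes the file is JSON (as the `python3 -c` check later in this runbook implies) and that the keys are named `Head`, `Parent`, `Dirty`, and `Rebuilding`; the `Head` and `Parent` key names are assumptions based on the description above, and `describe_replica` is a hypothetical helper.

```python
import json
from pathlib import Path


def describe_replica(replica_dir: str) -> dict:
    """Summarize the engine state recorded in <replica_dir>/volume.meta."""
    meta = json.loads((Path(replica_dir) / "volume.meta").read_text())
    return {
        "head": meta.get("Head"),            # active head image filename (assumed key)
        "parent": meta.get("Parent"),        # parent snapshot filename (assumed key)
        "dirty": meta.get("Dirty"),          # True after an unclean shutdown
        "rebuilding": meta.get("Rebuilding"),  # True if the replica was mid-rebuild
    }
```

Run it against the backup copy of the directory, not the live one, so nothing under `replicas/` is touched.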
---
## Decision Tree
```
Are Volume CRDs present in Longhorn?
├── YES → normal PV/PVC restore is enough, use playbooks/recover/longhorn.yml
└── NO
└── Are replica directories present on disk?
├── NO → data is lost, provision fresh volumes
└── YES
└── Is there an untouched replica dir (timestamps from before the incident)?
├── NO → data likely unrecoverable (all dirs were zeroed during reconciliation)
└── YES → follow this runbook
```
---
## Step 0 — Pre-flight: Inventory Surviving Replica Directories
On each node, list replica dirs and their sizes. Dirs with actual data are large (>16K).
New empty dirs created by Longhorn are always exactly 16K.
```bash
for node in pi1 pi2 pi3; do
echo "=== $node ==="
ssh $node "sudo du -sh /mnt/arcodange/longhorn/replicas/pvc-<VOLUME>-* 2>/dev/null"
done
```
**Key rule:** identify the replica dir that was **never touched** by the reinstall — it has
old timestamps (from before the incident) and its size matches the original volume usage.
This is your recovery source. **Back it up before touching anything.**
```bash
# On the node that has the untouched dir:
sudo mkdir -p /home/pi/arcodange/backups/longhorn-recovery/<pvc-name>/
sudo cp -a /mnt/arcodange/longhorn/replicas/<pv-name>-<old-hex>/ \
/home/pi/arcodange/backups/longhorn-recovery/<pvc-name>/
```
---
## Step 1 — Reconstruct the Filesystem
The replica directory contains a snapshot chain. Each layer is a sparse raw image — unchanged
blocks appear as zeroed sparse regions, only written blocks contain data. To reconstruct the
full filesystem, layers must be merged: head takes priority, then snapshot.
Use `docs/incidents/2026-04-13-power-cut/tools/merge-longhorn-layers.py`:
```bash
# On the node holding the backup:
sudo python3 merge-longhorn-layers.py \
/home/pi/arcodange/backups/longhorn-recovery/<pvc-name>/<pv-name>-<old-hex>/ \
/tmp/<pvc-name>-merged.img
# Verify the filesystem mounts
sudo mkdir -p /mnt/recovery-<pvc-name>
sudo mount -o loop /tmp/<pvc-name>-merged.img /mnt/recovery-<pvc-name>
sudo ls -lah /mnt/recovery-<pvc-name>/
sudo umount /mnt/recovery-<pvc-name>
```
If mount fails with "wrong fs type" or "bad superblock":
- The snapshot `.img` is all-zero (was overwritten by a prior Longhorn reconciliation)
- Try the next oldest replica dir from another node
- Check for non-zero content with `sudo od -A x -t x1z -v volume-snap-*.img | grep -v ' 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00' | head -5`
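The layer-merge rule described in this step (head takes priority, then snapshot) can be sketched in a few lines of Python. This is a simplified illustration and not the actual `merge-longhorn-layers.py`: it treats an all-zero block in the head as "unwritten", whereas a real tool may inspect sparse-file extents instead. The block size and the function name `merge_layers` are assumptions.

```python
BLOCK = 4096  # hypothetical block size, chosen only for illustration


def merge_layers(snapshot: bytes, head: bytes, block: int = BLOCK) -> bytes:
    """Overlay head onto snapshot block by block.

    A head block that is entirely zero is treated as unwritten, so the
    snapshot's block is used in its place.
    """
    size = max(len(snapshot), len(head))
    # Pad the shorter layer with zeros so both cover the same range.
    snapshot = snapshot.ljust(size, b"\0")
    head = head.ljust(size, b"\0")
    zero = bytes(block)
    merged = bytearray()
    for offset in range(0, size, block):
        head_block = head[offset:offset + block]
        if head_block != zero[:len(head_block)]:
            merged += head_block
        else:
            merged += snapshot[offset:offset + block]
    return bytes(merged)
```

One consequence of this simplification: a block that was deliberately overwritten with zeros in the head would fall back to the snapshot's content, so the sketch is only suitable for explaining the idea.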
---
## Step 2 — Create the Longhorn Volume CRD
Longhorn needs to know about the volume before its block device can be used.
```bash
kubectl apply -f - <<EOF
apiVersion: longhorn.io/v1beta2
kind: Volume
metadata:
name: <pv-name>
namespace: longhorn-system
spec:
accessMode: rwo # or rwx
dataEngine: v1
frontend: blockdev
numberOfReplicas: 3
size: "<size-in-bytes>" # e.g. "134217728" for 128Mi
EOF
```
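The `size` field takes a byte count as a string, while PVC specs usually carry a quantity such as `128Mi`. A minimal Python sketch of the conversion is shown below; it assumes integer quantities with binary suffixes only (`Ki`, `Mi`, `Gi`, `Ti`), and `k8s_size_to_bytes` is a hypothetical helper name.

```python
# Binary suffixes only; decimal suffixes (K, M, G) and fractions are not handled.
_UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}


def k8s_size_to_bytes(size: str) -> str:
    """Convert a quantity like "128Mi" to the byte-count string used in the Volume spec."""
    for suffix, factor in _UNITS.items():
        if size.endswith(suffix):
            return str(int(size[: -len(suffix)]) * factor)
    # No suffix: the value is already a plain byte count.
    return str(int(size))
```

For example, `k8s_size_to_bytes("128Mi")` gives `"134217728"`, matching the value in the manifest comment above.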
Wait for replicas to appear:
```bash
kubectl get replicas.longhorn.io -n longhorn-system | grep <pv-name>
# Expect 3 replicas in "stopped" state
```
---
## Step 3 — Attach the Volume in Maintenance Mode
Longhorn only creates the block device (`/dev/longhorn/<pv-name>`) when the volume is
attached to a node. Use a `VolumeAttachment` ticket to attach without a pod.
Choose `<target-node>` = the same node where the backup/merged image is stored (avoids
copying large files across the network).
```bash
kubectl apply -f - <<EOF
apiVersion: longhorn.io/v1beta2
kind: VolumeAttachment
metadata:
name: <pv-name>
namespace: longhorn-system
spec:
attachmentTickets:
recovery:
generation: 0
id: recovery
nodeID: <target-node>
parameters:
disableFrontend: "false"
type: longhorn-api
volume: <pv-name>
EOF
kubectl wait --for=jsonpath='{.status.state}'=attached \
volumes.longhorn.io/<pv-name> -n longhorn-system --timeout=120s
```
---
## Step 4 — Scale Down the Workload
Always stop the workload before touching the data to prevent concurrent writes and filesystem
corruption.
```bash
# For a Deployment:
kubectl scale deployment <name> -n <namespace> --replicas=0
# For a StatefulSet:
kubectl scale statefulset <name> -n <namespace> --replicas=0
```
---
## Step 5 — Inject Data Files via Block Device
```bash
ssh <target-node> bash <<'SHELL'
# Mount the live block device
sudo mkdir -p /mnt/recovery-live
sudo mount /dev/longhorn/<pv-name> /mnt/recovery-live
# Mount the reconstructed image (if not already mounted)
sudo mkdir -p /mnt/recovery-src
sudo mount -o loop /tmp/<pvc-name>-merged.img /mnt/recovery-src
# Sync: only the application data files, not lost+found
sudo rsync -av --exclude='lost+found' /mnt/recovery-src/ /mnt/recovery-live/
# Verify
sudo ls -lah /mnt/recovery-live/
# Unmount both
sudo umount /mnt/recovery-src
sudo umount /mnt/recovery-live
SHELL
```
---
## Step 6 — Detach the Volume
```bash
kubectl patch volumeattachments.longhorn.io <pv-name> \
-n longhorn-system --type json \
-p '[{"op":"remove","path":"/spec/attachmentTickets/recovery"}]'
kubectl wait --for=jsonpath='{.status.state}'=detached \
volumes.longhorn.io/<pv-name> -n longhorn-system --timeout=60s
```
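As the pitfalls table in this runbook notes, waiting for `detached` can time out when a workload immediately adds its own ticket. An alternative is to check only that the `recovery` ticket is gone. The Python sketch below assumes its input is the JSON printed by `kubectl get volumeattachments.longhorn.io <pv-name> -n longhorn-system -o json`; `recovery_ticket_present` is a hypothetical helper.

```python
import json


def recovery_ticket_present(volumeattachment_json: str, ticket: str = "recovery") -> bool:
    """Return True while the named ticket is still listed in spec.attachmentTickets."""
    obj = json.loads(volumeattachment_json)
    # attachmentTickets may be missing or null once every ticket has been removed.
    tickets = (obj.get("spec") or {}).get("attachmentTickets") or {}
    return ticket in tickets
```

Polling this until it returns `False` confirms the patch took effect even if another ticket keeps the volume attached.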
---
## Step 7 — Restore PV and PVC
Clear stuck Terminating PV/PVC finalizers first if they exist:
```bash
kubectl patch pv <pv-name> --type=merge -p '{"metadata":{"finalizers":null}}' 2>/dev/null
kubectl patch pvc <pvc-name> -n <namespace> --type=merge \
-p '{"metadata":{"finalizers":null}}' 2>/dev/null
# Wait a moment for them to delete
```
Recreate the PV with `Retain` policy and no `claimRef`:
```bash
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolume
metadata:
name: <pv-name>
annotations:
pv.kubernetes.io/provisioned-by: driver.longhorn.io
spec:
accessModes: [ReadWriteOnce] # match original
capacity:
storage: <size> # e.g. 128Mi
csi:
driver: driver.longhorn.io
fsType: ext4
volumeHandle: <pv-name>
volumeAttributes:
dataEngine: v1
dataLocality: disabled
disableRevisionCounter: "true"
numberOfReplicas: "3"
staleReplicaTimeout: "30"
persistentVolumeReclaimPolicy: Retain
storageClassName: longhorn
volumeMode: Filesystem
EOF
```
Recreate the PVC pinned to this PV:
```bash
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: <pvc-name>
namespace: <namespace>
spec:
accessModes: [ReadWriteOnce]
resources:
requests:
storage: <size>
storageClassName: longhorn
volumeMode: Filesystem
volumeName: <pv-name>
EOF
```
---
## Step 8 — Scale Up and Verify
```bash
kubectl scale deployment <name> -n <namespace> --replicas=1
kubectl wait --for=condition=Ready pod -l app=<name> -n <namespace> --timeout=120s
```
---
## Pitfalls Learned During 2026-04-13 Recovery
| Pitfall | What happened | Prevention |
|---------|--------------|------------|
| **Directory swap corrupts data** | Longhorn found old `Dirty: true` volume.meta + empty pi1 replica → rebuilt from empty source | Never swap dirs. Use merge tool + block device injection instead |
| **Snapshot is zeroed after swap** | Longhorn reconciliation overwrote snapshot images when rebuilding from empty replica | Back up the untouched dir FIRST before any rename |
| **Multiple dirs per volume on pi3** | Rebuild attempts during the incident created extra dirs | Identify the untouched dir by timestamp AND verify non-zero content with `od` |
| **`Rebuilding: true` replica → all-zeros merged image** | Phase 0 picked a replica mid-rebuild (1.3 GiB actual data, sparse files look large) — merge tool produced an all-zeros image | Check `volume.meta` and skip any dir with `"Rebuilding": true` before merging |
| **`du -sb` gives misleading apparent sizes** | Sparse replica files (8 GiB file, 1.3 GiB actual) appeared larger than healthy 11 GiB replicas | Use `du -sk` (actual disk blocks) not `du -sb` (apparent/logical size) to rank replicas |
| **Dirty journal prevents ro mount** | `mount -o loop,ro` fails with "bad superblock" on an ext4 with unclean shutdown | Use `mount -o loop,ro,noload` to skip journal replay for read-only access |
| **New volume is unformatted** | `mount /dev/longhorn/<pv>` fails with "wrong fs type" on a freshly created volume | Run `mkfs.ext4 -F` before mounting; guard with `blkid` to skip if already formatted |
| **rsync rc=23 on power-cut partitions** | Some filesystem blocks were unreadable ("Structure needs cleaning") → rsync exits 23 | Use `rsync --ignore-errors`; rc=23 is a partial transfer, not a total failure |
| **pod blocks volume re-attach** | Old Error-state pod held a volume attachment claim | Delete old Error pods before scaling up new ones |
| **`kubectl cp` needs `tar`** | Distroless container had no `tar` binary | Mount block device directly on the node instead |
| **VolumeAttachment ticket removal** | Deleting a VolumeAttachment object causes Longhorn to immediately recreate it | Patch the `recovery` key out of `spec.attachmentTickets` instead of deleting the object |
| **Phase 7 wait for `detached` times out** | After removing the recovery ticket, a workload may immediately create its own ticket | Wait for the `recovery` ticket to disappear from `spec.attachmentTickets`, not for full detach |
| **StatefulSet pods not found by label** | `kubectl get pod -l app=<name>` returns nothing for StatefulSet pods | Wait on `readyReplicas ≥ 1` on the StatefulSet object, not on pod labels |
| **`set_fact` overridden by `-e @file`** | Ansible extra vars have highest precedence — `set_fact: longhorn_recovery_volumes` was silently ignored | Use a different variable name (`_volumes`) for the resolved list, never reassign the extra var name |
---
## Identifying the Right Replica Directory
When multiple old dirs exist for the same volume on a node, pick the one to use for recovery:
1. **Skip `Rebuilding: true`:** check `volume.meta` first — a dir that was being rebuilt when
the incident happened has incomplete data (sparse files are allocated but mostly zeroed):
```bash
python3 -c "import json; d=json.load(open('volume.meta')); print('Rebuilding:', d['Rebuilding'])"
```
Only consider dirs where `Rebuilding: false`.
2. **Actual size:** `sudo du -sk <dir>` (actual disk usage in KB — not `du -sb` which returns
apparent/logical size and is misleading for sparse files). Pick the largest actual size.
3. **Timestamps:** prefer the most recently modified before the incident date.
4. **Snapshot chain:** if Rebuilding is false on multiple dirs, check `volume.meta` for
`"Dirty": false` (clean shutdown) vs `"Dirty": true`. Prefer clean if available.
5. **Content check:** verify the snapshot is not all zeros:
```bash
sudo od -A x -t x1z -v volume-snap-*.img | grep -v ' 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00' | head -3
```
If the output is empty (all zeros), the snapshot was overwritten. Try another node.
**Summary rule:** `Rebuilding: false` → largest `du -sk` → non-zero snapshot content.
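The summary rule can be expressed as a filter followed by a maximum. The Python sketch below is illustrative only: the candidate dictionary keys (`rebuilding`, `actual_kb`, `snapshot_nonzero`) and both function names are hypothetical, and `actual_kb` only approximates `du -sk` by summing allocated 512-byte blocks of regular files.

```python
import os
from typing import List, Optional


def actual_kb(path: str) -> int:
    """Approximate `du -sk`: allocated disk blocks in KiB, not apparent file size."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            # st_blocks counts 512-byte units actually allocated on disk.
            total += os.lstat(os.path.join(root, name)).st_blocks * 512
    return total // 1024


def pick_replica(candidates: List[dict]) -> Optional[dict]:
    """Apply the rule: not rebuilding, non-zero snapshot, then largest actual size."""
    usable = [
        c for c in candidates
        if not c["rebuilding"] and c["snapshot_nonzero"]
    ]
    return max(usable, key=lambda c: c["actual_kb"], default=None)
```

Filtering on non-zero snapshot content before ranking gives the same result as picking the largest directory and falling back to the next one when its snapshot turns out to be all zeros.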
---
## Reference: Key Commands
```bash
# List all replica dirs for a volume across all nodes
for n in pi1 pi2 pi3; do echo "==$n=="; ssh $n "sudo ls /mnt/arcodange/longhorn/replicas/ | grep <pv-prefix>"; done
# Check Longhorn volume state
kubectl get volumes.longhorn.io -n longhorn-system <pv-name>
# Check VolumeAttachment tickets
kubectl get volumeattachments.longhorn.io -n longhorn-system <pv-name> \
-o jsonpath='{.spec.attachmentTickets}'
# Check Longhorn block device existence on a node
ssh <node> "ls /dev/longhorn/<pv-name>"
# Verify filesystem content without starting the app
ssh <node> "sudo mount /dev/longhorn/<pv-name> /mnt/check && sudo ls /mnt/check && sudo umount /mnt/check"
```


@@ -0,0 +1,11 @@
---
# Gitea ownership configuration consumed by playbooks running on `localhost`
# (e.g. tools/hashicorp_vault.yml). Role-level defaults (gitea_username,
# gitea_organization) live in roles/gitea_secret/defaults/main.yml ; this file
# is for fact lists that the inventory should declare.
# Users (Gitea owner_type=user) to which org-level Gitea Action secrets must
# also be propagated. Repos owned by these users cannot read org-level secrets,
# so the secret propagation playbook iterates over this list.
gitea_secret_propagation_users:
- arcodange


@@ -30,6 +30,7 @@ local:
hosts:
localhost:
ansible_connection: local
ansible_python_interpreter: "{{ ansible_playbook_python }}"
pi1:
pi2:
pi3:


@@ -33,14 +33,23 @@
GITEA_RUNNER_REGISTRATION_TOKEN: "{{ gitea_runner_token_cmd.stdout }}"
GITEA_RUNNER_NAME: arcodange_global_runner_{{ inventory_hostname }}
GITEA_RUNNER_LABELS: ubuntu-latest:docker://gitea.arcodange.lab/arcodange-org/runner-images:ubuntu-latest-ca,ubuntu-latest-ca:docker://gitea.arcodange.lab/arcodange-org/runner-images:ubuntu-latest-ca
ports:
- "43707:43707"
networks:
- gitea_action_network
volumes:
- /var/run/docker.sock:/var/run/docker.sock
- /etc/timezone:/etc/timezone:ro
- /etc/localtime:/etc/localtime:ro
- /etc/ssl/certs:/etc/ssl/certs:ro
- /usr/local/share/ca-certificates/:/usr/local/share/ca-certificates/:ro
- /mnt/arcodange/gitea-runner-cache:/home/git/.cache/actcache
- /mnt/arcodange/gitea-runner-act:/root/.cache/act
configs:
- config.yaml
networks:
gitea_action_network:
name: gitea_action_network
configs:
config.yaml:
content: |
@@ -87,14 +96,14 @@
enabled: true
# The directory to store the cache data.
# If it's empty, the cache data will be stored in $HOME/.cache/actcache.
dir: ""
dir: "/home/git/.cache/actcache"
# The host of the cache server.
# It's not for the address to listen, but the address to connect from job containers.
# So 0.0.0.0 is a bad choice, leave it empty to detect automatically.
host: "{{ ansible_default_ipv4.address }}"
# The port of the cache server.
# 0 means to use a random available port.
port: 0
port: 43707
# The external cache server URL. Valid only when enable is true.
# If it's specified, act_runner will use this URL as the ACTIONS_CACHE_URL rather than start a server by itself.
# The URL should generally end with "/".
@@ -148,235 +157,3 @@
loop: ["absent", "present"]
loop_control:
loop_var: docker_compose_down_then_up
# - name: Set PACKAGES_TOKEN secret to upload packages from CI
# run_once: True
# block:
# - name: Generate cicd PACKAGES_TOKEN
# include_role:
# name: arcodange.factory.gitea_token
# vars:
# gitea_token_name: PACKAGES_TOKEN
# gitea_token_fact_name: cicd_PACKAGES_TOKEN
# gitea_token_scopes: write:package
# gitea_token_replace: true
# - name: Register cicd PACKAGES_TOKEN secrets
# include_role:
# name: arcodange.factory.gitea_secret
# vars:
# gitea_secret_name: PACKAGES_TOKEN
# gitea_secret_value: "{{ cicd_PACKAGES_TOKEN }}"
# loop: ["organization", "user"]
# loop_control:
# loop_var: gitea_owner_type # Can be "user" or "organization"
# - name: Set HOMELAB_CA_CERT secret to validate self signed ssl
# run_once: True
# block:
# - name: Download homelab CA certificate
# ansible.builtin.uri:
# url: "https://ssl-ca.arcodange.lab:8443/roots.pem"
# return_content: yes
# validate_certs: no
# register: homelab_ca_cert
# - name: Debug cert
# debug:
# msg: "{{ homelab_ca_cert.content }}..."
# - name: Register cicd HOMELAB_CA_CERT secrets
# include_role:
# name: arcodange.factory.gitea_secret
# vars:
# gitea_secret_name: HOMELAB_CA_CERT
# gitea_secret_value: "{{ homelab_ca_cert.content | b64encode }}"
# loop: ["organization", "user"]
# loop_control:
# loop_var: gitea_owner_type # Can be "user" or "organization"
# post_tasks:
# - include_role:
# name: arcodange.factory.gitea_token
# vars:
# gitea_token_delete: true
# - name: Deploy Argo CD
# hosts: localhost
# roles:
# - role: arcodange.factory.gitea_token # generate gitea_api_token used to replace generated token with set name if required
# tags:
# - gitea_sync
# tasks:
# - name: Set factory repo
# include_role:
# name: arcodange.factory.gitea_repo
# vars:
# gitea_repo_name: factory
# - name: Sync other repos
# tags: gitea_sync
# include_role:
# name: arcodange.factory.gitea_sync
# apply:
# tags: gitea_sync
# - name: Generate Argo CD token
# include_role:
# name: arcodange.factory.gitea_token
# vars:
# gitea_token_name: ARGOCD_TOKEN
# gitea_token_fact_name: argocd_token
# gitea_token_scopes: read:repository,read:package
# gitea_token_replace: true
# - name: Figure out k3s master node
# shell:
# kubectl get nodes -l node-role.kubernetes.io/control-plane=true -o name | sed s'#node/##'
# register: get_k3s_master_node
# changed_when: false
# - name: Get kubernetes server internal url
# command: >-
# echo https://kubernetes.default.svc
# # {%raw%}
# # kubectl get svc/kubernetes -o template="{{.spec.clusterIP}}:{{(index .spec.ports 0).port}}"
# # {%endraw%}
# register: get_k3s_internal_server_url
# changed_when: false
# - set_fact:
# k3s_master_node: "{{ get_k3s_master_node.stdout }}"
# k3s_internal_server_url: "{{ get_k3s_internal_server_url.stdout }}"
# - name: Read Step CA root certificate from k3s master
# become: true
# delegate_to: "{{ k3s_master_node }}"
# slurp:
# src: /home/step/.step/certs/root_ca.crt
# register: step_ca_root_cert
# - name: Decode Step CA root certificate
# set_fact:
# step_ca_root_cert_pem: "{{ step_ca_root_cert.content | b64decode }}"
# - name: Install Argo CD
# become: true
# delegate_to: "{{ k3s_master_node }}"
# vars:
# gitea_credentials:
# username: arcodange
# password: "{{ argocd_token }}"
# argocd_helm_values: # https://github.com/argoproj/argo-helm/blob/main/charts/argo-cd/values.yaml
# global:
# domain: argocd.arcodange.lab
# configs:
# cm:
# kustomize.buildOptions: "--enable-helm"
# helm.enablePostRenderer: "true"
# exec.enabled: "true"
# params:
# server.insecure: true # let k3s traefik do TLS termination
# ansible.builtin.copy:
# dest: /var/lib/rancher/k3s/server/manifests/argocd.yaml
# content: |-
# apiVersion: v1
# kind: Namespace
# metadata:
# name: argocd
# ---
# apiVersion: v1
# kind: ConfigMap
# metadata:
# name: argocd-tls-certs-cm
# namespace: argocd
# data:
# gitea.arcodange.lab: |
# {{ step_ca_root_cert_pem | indent(4) }}
# ---
# apiVersion: helm.cattle.io/v1
# kind: HelmChart
# metadata:
# name: argocd
# namespace: kube-system
# spec:
# repo: https://argoproj.github.io/argo-helm
# chart: argo-cd
# targetNamespace: argocd
# valuesContent: |-
# {{ argocd_helm_values | to_nice_yaml | indent( width=4 ) }}
# ---
# apiVersion: networking.k8s.io/v1
# kind: Ingress
# metadata:
# name: argocd-server-ingress
# namespace: argocd
# annotations:
# # For Traefik v2.x
# traefik.ingress.kubernetes.io/router.entrypoints: websecure
# traefik.ingress.kubernetes.io/router.tls: "true"
# traefik.ingress.kubernetes.io/router.tls.certresolver: letsencrypt
# traefik.ingress.kubernetes.io/router.tls.domains.0.main: arcodange.lab
# traefik.ingress.kubernetes.io/router.tls.domains.0.sans: argocd.arcodange.lab
# traefik.ingress.kubernetes.io/router.middlewares: localIp@file
# spec:
# rules:
# - host: argocd.arcodange.lab
# http:
# paths:
# - path: /
# pathType: Prefix
# backend:
# service:
# name: argocd-server
# port:
# number: 80 #TLS is terminated at Traefik
# ---
# apiVersion: v1
# kind: Secret
# metadata:
# name: gitea-arcodangeorg-factory-repo
# namespace: argocd
# labels:
# argocd.argoproj.io/secret-type: repository
# stringData:
# type: git
# url: https://gitea.arcodange.lab/arcodange-org/factory
# ---
# apiVersion: v1
# kind: Secret
# metadata:
# name: gitea-arcodangeorg-repo-creds
# namespace: argocd
# labels:
# argocd.argoproj.io/secret-type: repo-creds
# stringData:
# type: git
# url: https://gitea.arcodange.lab/arcodange-org
# password: {{ gitea_credentials.password }}
# username: {{ gitea_credentials.username }}
# ---
# apiVersion: argoproj.io/v1alpha1
# kind: Application
# metadata:
# name: factory
# namespace: argocd
# spec:
# project: default
# source:
# repoURL: https://gitea.arcodange.lab/arcodange-org/factory
# targetRevision: HEAD
# path: argocd
# destination:
# server: {{ k3s_internal_server_url }}
# namespace: argocd
# syncPolicy:
# automated:
# prune: true
# selfHeal: true
# - name: touch manifests/argocd.yaml to trigger update
# delegate_to: "{{ k3s_master_node }}"
# ansible.builtin.file:
# path: /var/lib/rancher/k3s/server/manifests/argocd.yaml
# state: touch
# become: true
# post_tasks:
# - include_role:
# name: arcodange.factory.gitea_token
# apply:
# tags: gitea_sync
# tags:
# - gitea_sync
# vars:
# gitea_token_delete: true


@@ -24,12 +24,15 @@
  - name: define backup command
  set_fact:
- backup_cmd: |-
- echo "
- $(kubectl get -A pv -o yaml)
- ---
- $(kubectl get -A pvc -o yaml)
- "
+ # PVs + PVCs + Longhorn Volume CRDs (critical for fast recovery: without Volume CRDs,
+ # Longhorn cannot re-associate orphaned replica dirs after a reinstall and forces
+ # full block-device injection recovery. See docs/adr/20260414-longhorn-pvc-recovery.md)
+ backup_cmd: >-
+ kubectl get -A pv,pvc -o yaml
+ && echo '---'
+ && kubectl get -A volumes.longhorn.io -o yaml
+ && echo '---'
+ && kubectl get -A settings.longhorn.io -o yaml
  - name: test backup_cmd
  ansible.builtin.shell: |
@@ -65,19 +68,34 @@
  #!/bin/bash
  set -e
- BACKUP_DIR="{{ backup_dir }}"
+ PRIMARY_BACKUP_DIR="{{ backup_dir }}"
+ FALLBACK_BACKUP_DIR="/home/pi/arcodange/backups/k3s_pvc"
+ # Check if fallback directory exists and has backups
+ if [ -d "$FALLBACK_BACKUP_DIR" ] && ls "$FALLBACK_BACKUP_DIR"/*.volumes 1>/dev/null 2>&1; then
+ BACKUP_DIR="$FALLBACK_BACKUP_DIR"
+ echo "Using fallback backup directory: $BACKUP_DIR"
+ elif [ -d "$PRIMARY_BACKUP_DIR" ] && ls "$PRIMARY_BACKUP_DIR"/*.volumes 1>/dev/null 2>&1; then
+ BACKUP_DIR="$PRIMARY_BACKUP_DIR"
+ else
+ echo "No backup directory found"
+ exit 1
+ fi
  if [ -z "$1" ]; then
  FILE=$(ls -1t "$BACKUP_DIR"/backup_*.volumes | head -n 1)
- echo "Aucune date fournie, restauration du dernier dump : $FILE"
+ echo "No date provided, restoring latest dump: $FILE"
  else
  FILE="$BACKUP_DIR/backup_$1.volumes"
  if [ ! -f "$FILE" ]; then
- echo "Fichier $FILE introuvable"
+ echo "File $FILE not found"
  exit 1
  fi
  fi
  kubectl apply -f "$FILE"
- echo "Restauration des volumes k3s terminée."
+ echo "K3S volumes restoration complete."
+ echo "NOTE: file includes PVs, PVCs, and Longhorn Volume CRDs."
+ echo "If Longhorn replica dirs are still orphaned after this restore,"
+ echo "fall back to: ansible-playbook playbooks/recover/longhorn_data.yml"
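
The selection logic of the restore script (check the fallback directory first, then the primary one; restore the newest dump unless a date is passed) can be sketched in Python as follows. This is an illustrative sketch only and not part of the repository: the function names `pick_backup_dir` and `pick_backup_file` are hypothetical, and "newest" is taken to mean most recently modified, as `ls -1t` does.

```python
from pathlib import Path
from typing import Optional


def pick_backup_dir(primary: Path, fallback: Path) -> Path:
    """Return the first directory that holds at least one *.volumes file.

    Mirrors the script's order: the fallback directory is checked first.
    """
    for candidate in (fallback, primary):
        if candidate.is_dir() and any(candidate.glob("*.volumes")):
            return candidate
    raise FileNotFoundError("No backup directory found")


def pick_backup_file(backup_dir: Path, date: Optional[str] = None) -> Path:
    """Return backup_<date>.volumes, or the most recently modified dump."""
    if date:
        wanted = backup_dir / f"backup_{date}.volumes"
        if not wanted.is_file():
            raise FileNotFoundError(f"File {wanted} not found")
        return wanted
    # Equivalent of `ls -1t ... | head -n 1`: newest modification time first.
    dumps = sorted(
        backup_dir.glob("backup_*.volumes"),
        key=lambda path: path.stat().st_mtime,
        reverse=True,
    )
    if not dumps:
        raise FileNotFoundError(f"No backup_*.volumes file in {backup_dir}")
    return dumps[0]
```

A caller would resolve the directory once and then pass an optional date string, for example `pick_backup_file(pick_backup_dir(primary, fallback), date)`.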


@@ -5,3 +5,4 @@ pihole_dns_domain: lab
pihole_ports: '8081o,443os,[::]:8081o,[::]:443os' # web interface
pihole_gravity_conf: /etc/gravity-sync/gravity-sync.conf # should not be changed
pihole_custom_dns: {}
pihole_upstream_dns: ["8.8.8.8", "1.1.1.1", "8.8.4.4"] # Explicit upstream DNS servers


@@ -98,3 +98,17 @@
address=/{{ host }}.home/{{ hostvars[host].preferred_ip }}
{% endfor %}
notify: Restart Pi-hole
- name: Configure explicit upstream DNS servers for Pi-hole
copy:
dest: /etc/dnsmasq.d/99-upstream.conf
owner: root
group: root
mode: '0644'
content: |
# Generated by Ansible: explicit upstream DNS servers
# Fixes issue where Pi-hole relies on DHCP-provided DNS which may be unavailable
{% for dns_server in pihole_upstream_dns %}
server={{ dns_server }}
{% endfor %}
notify: Restart Pi-hole
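
The template loop above emits one dnsmasq `server=` line per entry of `pihole_upstream_dns`. A minimal Python sketch of the same rendering is shown here for illustration; the function name `render_upstream_conf` is hypothetical, the header comments are omitted, and the IP validation step is an addition that the Ansible task itself does not perform.

```python
import ipaddress
from typing import Iterable


def render_upstream_conf(servers: Iterable[str]) -> str:
    """Return dnsmasq `server=` lines, one per upstream resolver."""
    lines = []
    for server in servers:
        # ip_address() raises ValueError for anything that is not an
        # IPv4 or IPv6 literal (extra check, not in the playbook).
        ipaddress.ip_address(server)
        lines.append(f"server={server}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Same values as the pihole_upstream_dns default in this change.
    print(render_upstream_conf(["8.8.8.8", "1.1.1.1", "8.8.4.4"]), end="")
```

Rejecting malformed entries before writing the file avoids restarting Pi-hole with a configuration that dnsmasq cannot parse.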


@@ -0,0 +1,536 @@
---
- name: Recover Longhorn from Power Cut - CSI Driver Registration Loss
hosts: raspberries:&local
gather_facts: yes
become: yes
vars:
# Backup locations
primary_backup_dir: "/mnt/backups/k3s_pvc"
fallback_backup_dir: "/home/pi/arcodange/backups/k3s_pvc"
scripts_dir: "/opt/k3s_volumes"
# Longhorn configuration
longhorn_manifest_path: "/var/lib/rancher/k3s/server/manifests/longhorn-install.yaml"
longhorn_namespace: "longhorn-system"
longhorn_chart_name: "longhorn-install"
longhorn_chart_namespace: "kube-system"
# Data paths (DO NOT MODIFY - points to actual volume data)
longhorn_data_path: "/mnt/arcodange/longhorn"
tasks:
# ========================================================================
# PHASE 0: Pre-flight Checks
# ========================================================================
- name: Verify data directory exists on control plane
ansible.builtin.stat:
path: "{{ longhorn_data_path }}"
register: data_dir
when: inventory_hostname == 'pi1'
run_once: true
- name: FAIL if data directory missing
ansible.builtin.fail:
msg: "CRITICAL: Longhorn data directory {{ longhorn_data_path }} does not exist. Aborting recovery."
when: inventory_hostname == 'pi1' and not data_dir.stat.exists
run_once: true
- name: Check for fallback backups on pi1
ansible.builtin.shell: ls {{ fallback_backup_dir }}/backup_*.volumes 2>/dev/null
register: fallback_backup_check
changed_when: false
when: inventory_hostname == 'pi1'
run_once: true
ignore_errors: yes
- name: Check for primary backups on pi1
ansible.builtin.shell: ls {{ primary_backup_dir }}/backup_*.volumes 2>/dev/null
register: primary_backup_check
changed_when: false
when: inventory_hostname == 'pi1'
run_once: true
ignore_errors: yes
- name: Set backup fact
ansible.builtin.set_fact:
has_backups: "{{ (fallback_backup_check.rc == 0 and fallback_backup_check.stdout | trim != '') or (primary_backup_check.rc == 0 and primary_backup_check.stdout | trim != '') }}"
when: inventory_hostname == 'pi1'
run_once: true
- name: FAIL if no backups found
ansible.builtin.fail:
msg: "No backup files found in {{ primary_backup_dir }} or {{ fallback_backup_dir }}. Cannot proceed."
when: inventory_hostname == 'pi1' and not has_backups | bool
run_once: true
# ========================================================================
# PHASE 1: Diagnosis - Check Current State
# ========================================================================
- name: Gather Longhorn namespace status
block:
- name: Check if longhorn-system namespace exists
kubernetes.core.k8s_info:
kind: Namespace
name: "{{ longhorn_namespace }}"
register: longhorn_ns
ignore_errors: yes
run_once: true
delegate_to: localhost
- name: Check CSI driver registration
kubernetes.core.k8s_info:
kind: CSIDriver
name: driver.longhorn.io
register: csi_driver
ignore_errors: yes
run_once: true
delegate_to: localhost
- name: Check Longhorn manager pods
kubernetes.core.k8s_info:
kind: Pod
namespace: "{{ longhorn_namespace }}"
label_selectors:
- app=longhorn-manager
register: managers
ignore_errors: yes
run_once: true
delegate_to: localhost
- name: Set recovery_phase fact
ansible.builtin.set_fact:
recovery_phase: "none"
run_once: true
delegate_to: localhost
- name: Determine recovery phase needed
ansible.builtin.set_fact:
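# k8s_info succeeds with an empty resources list when the object is absent,
# so a missing CSIDriver must be detected from the list, not only from .failed.
# Whitespace control ({%- ... -%}) keeps the value an exact 'soft'/'hard'/'none'
# so the later recovery_phase == '...' comparisons can match.
recovery_phase: >-
{%- if csi_driver.failed or (csi_driver.resources | default([]) | length == 0) -%}
soft
{%- elif managers.failed or managers.resources | default([]) | selectattr('status.phase', 'defined') | selectattr('status.phase', 'ne', 'Running') | list | length > 0 -%}
hard
{%- elif longhorn_ns.failed -%}
none
{%- else -%}
none
{%- endif -%}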
run_once: true
delegate_to: localhost
- name: Display recovery diagnosis
ansible.builtin.debug:
msg: "Diagnosis: recovery_phase={{ recovery_phase | default('none') }}. CSI Driver exists: {{ not csi_driver.failed | bool }}, Managers healthy: {{ managers.failed | ternary('unknown', managers.resources | default([]) | selectattr('status.phase', 'defined') | selectattr('status.phase', 'eq', 'Running') | list | length >= 3) | bool }}"
run_once: true
delegate_to: localhost
when: inventory_hostname == 'pi1'
run_once: true
# ========================================================================
# PHASE 2: Soft Recovery - Touch Manifest
# ========================================================================
- name: Execute soft recovery - touch Longhorn manifest
block:
- name: Touch longhorn-install.yaml manifest
ansible.builtin.file:
path: "{{ longhorn_manifest_path }}"
state: touch
register: manifest_touch
when: inventory_hostname == 'pi1'
- name: Wait for k3s to detect manifest change
ansible.builtin.pause:
minutes: 1
when: manifest_touch is changed
- name: Check if Longhorn pods are recreating
kubernetes.core.k8s_info:
kind: Pod
namespace: "{{ longhorn_namespace }}"
register: longhorn_pods
ignore_errors: yes
run_once: true
delegate_to: localhost
- name: Verify soft recovery success
ansible.builtin.set_fact:
soft_recovery_success: >-
{{ (longhorn_pods.resources | default([]) | selectattr('metadata.creationTimestamp', 'defined') | list | length) >= 10 }}
run_once: true
delegate_to: localhost
when: recovery_phase == 'soft' and inventory_hostname == 'pi1'
run_once: true
# ========================================================================
# PHASE 3: Hard Recovery - Delete Driver-Deployer
# ========================================================================
- name: Execute hard recovery - delete driver-deployer pods
block:
- name: Get driver-deployer pods
kubernetes.core.k8s_info:
kind: Pod
namespace: "{{ longhorn_namespace }}"
label_selectors:
- app=longhorn-driver-deployer
register: driver_deployer_pods
ignore_errors: yes
run_once: true
delegate_to: localhost
- name: Delete driver-deployer pods
kubernetes.core.k8s:
state: absent
kind: Pod
namespace: "{{ longhorn_namespace }}"
name: "{{ item.metadata.name }}"
force: yes
grace_period: 0
loop: "{{ driver_deployer_pods.resources | default([]) }}"
when: driver_deployer_pods.resources | default([]) | length > 0
run_once: true
delegate_to: localhost
- name: Wait for HelmChart to recreate driver-deployer
ansible.builtin.pause:
minutes: 2
- name: Check driver-deployer status
kubernetes.core.k8s_info:
kind: Pod
namespace: "{{ longhorn_namespace }}"
label_selectors:
- app=longhorn-driver-deployer
register: new_driver_deployer
ignore_errors: yes
run_once: true
delegate_to: localhost
when: (recovery_phase == 'hard' or (recovery_phase == 'soft' and not soft_recovery_success | default(false))) and inventory_hostname == 'pi1'
run_once: true
# ========================================================================
# PHASE 4: Nuclear Recovery - Full Reinstall
# ========================================================================
- name: Execute nuclear recovery - full Longhorn reinstall
block:
# Step 1: Delete HelmChart
- name: Delete Longhorn HelmChart
kubernetes.core.k8s:
state: absent
kind: HelmChart
namespace: "{{ longhorn_chart_namespace }}"
name: "{{ longhorn_chart_name }}"
force: yes
grace_period: 0
register: helmchart_deleted
ignore_errors: yes
run_once: true
delegate_to: localhost
- name: Wait for HelmChart to be fully removed
ansible.builtin.pause:
seconds: 30
when: helmchart_deleted is changed
run_once: true
# Step 2: Remove Longhorn manifest from filesystem
- name: Remove Longhorn manifest file
ansible.builtin.file:
path: "{{ longhorn_manifest_path }}"
state: absent
when: inventory_hostname == 'pi1'
register: manifest_removed
# Step 3: Remove finalizers from all Longhorn resources
- name: Get list of all Longhorn CRDs
kubernetes.core.k8s_info:
kind: CustomResourceDefinition
label_selectors:
- app=longhorn
register: longhorn_crds
ignore_errors: yes
run_once: true
delegate_to: localhost
- name: Get all Longhorn CR instances
kubernetes.core.k8s_info:
kind: "{{ item.spec.names.kind }}"
namespace: "{{ longhorn_namespace }}"
api_version: "{{ item.spec.group ~ '/' ~ item.spec.versions[0].name }}"
register: cr_instances
ignore_errors: yes
loop: "{{ longhorn_crds.resources | default([]) }}"
run_once: true
delegate_to: localhost
- name: Remove finalizers from all Longhorn CR instances
kubernetes.core.k8s_json_patch:
kind: "{{ item.0.item.spec.names.kind }}"
namespace: "{{ longhorn_namespace }}"
name: "{{ item.1.metadata.name }}"
api_version: "{{ item.0.item.spec.group ~ '/' ~ item.0.item.spec.versions[0].name }}"
patch:
- op: replace
path: /metadata/finalizers
value: []
# Each entry of cr_instances.results is one CRD lookup: item.0.item is the CRD,
# item.1 is one instance it returned. Lookups without resources are skipped.
loop: "{{ cr_instances.results | default([]) | subelements('resources', skip_missing=True) }}"
loop_control:
label: "{{ item.0.item.spec.names.kind }}/{{ item.1.metadata.name }}"
when: cr_instances.results | default([]) | length > 0
run_once: true
delegate_to: localhost
ignore_errors: yes
# Step 4: Remove finalizers from PVCs
- name: Get all PVCs with longhorn storage class
kubernetes.core.k8s_info:
kind: PersistentVolumeClaim
register: all_pvcs
ignore_errors: yes
run_once: true
delegate_to: localhost
- name: Remove finalizers from PVCs
kubernetes.core.k8s_json_patch:
kind: PersistentVolumeClaim
namespace: "{{ item.metadata.namespace }}"
name: "{{ item.metadata.name }}"
patch:
- op: replace
path: /metadata/finalizers
value: []
loop: "{{ all_pvcs.resources | default([]) | selectattr('spec.storageClassName', 'defined') | selectattr('spec.storageClassName', 'match', 'longhorn.*') | list }}"
run_once: true
delegate_to: localhost
ignore_errors: yes
# Step 5: Remove namespace finalizers
- name: Remove finalizers from longhorn-system namespace
kubernetes.core.k8s_json_patch:
kind: Namespace
name: "{{ longhorn_namespace }}"
patch:
- op: replace
path: /spec/finalizers
value: []
run_once: true
delegate_to: localhost
ignore_errors: yes
- name: Delete longhorn-system namespace
kubernetes.core.k8s:
state: absent
kind: Namespace
name: "{{ longhorn_namespace }}"
force: yes
grace_period: 0
run_once: true
delegate_to: localhost
ignore_errors: yes
- name: Wait for namespace deletion
ansible.builtin.pause:
seconds: 15
run_once: true
# Step 6: Reinstall Longhorn via manifest
- name: Deploy Longhorn HelmChart manifest
ansible.builtin.copy:
dest: "{{ longhorn_manifest_path }}"
content: |
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
annotations:
helmcharts.cattle.io/managed-by: helm-controller
finalizers:
- wrangler.cattle.io/on-helm-chart-remove
name: longhorn-install
namespace: kube-system
spec:
version: v1.9.1
chart: longhorn
repo: https://charts.longhorn.io
failurePolicy: abort
targetNamespace: longhorn-system
createNamespace: true
valuesContent: |-
defaultSettings:
defaultDataPath: {{ longhorn_data_path }}
when: inventory_hostname == 'pi1'
register: manifest_deployed
- name: Trigger k3s reconcile by touching manifest
ansible.builtin.file:
path: "{{ longhorn_manifest_path }}"
state: touch
when: manifest_deployed is changed and inventory_hostname == 'pi1'
- name: Wait for Longhorn pods to be created
ansible.builtin.pause:
minutes: 3
when: manifest_deployed is changed
run_once: true
when: >-
(recovery_phase == 'hard' and not new_driver_deployer.resources | default([]) | selectattr('status.phase', 'eq', 'Running') | list | length > 0)
or (recovery_phase == 'soft' and not soft_recovery_success | default(false) and not new_driver_deployer.resources | default([]) | selectattr('status.phase', 'eq', 'Running') | list | length > 0)
or recovery_phase == 'none'
run_once: true
# ========================================================================
# PHASE 5: Restore from Backup
# ========================================================================
- name: Execute restore from backup
block:
- name: Determine backup directory to use
ansible.builtin.set_fact:
# Whitespace control keeps the path free of padding; the value is an empty
# string when neither directory contains a backup.
backup_dir_to_use: >-
{%- if fallback_backup_dir and lookup('fileglob', fallback_backup_dir ~ '/backup_*.volumes') | length > 0 -%}
{{ fallback_backup_dir }}
{%- elif primary_backup_dir and lookup('fileglob', primary_backup_dir ~ '/backup_*.volumes') | length > 0 -%}
{{ primary_backup_dir }}
{%- endif -%}
run_once: true
delegate_to: localhost
- name: FAIL if no backup directory found
ansible.builtin.fail:
msg: "No valid backup directory found with backup_*.volumes files"
when: backup_dir_to_use == ""
run_once: true
- name: Find latest backup file
ansible.builtin.set_fact:
# fileglob returns plain path strings (no stat data), so sort by filename.
# Assumption: the backup_<date> suffix sorts chronologically.
latest_backup: "{{ lookup('fileglob', backup_dir_to_use ~ '/backup_*.volumes', wantlist=True) | sort(reverse=True) | first | default('') }}"
run_once: true
delegate_to: localhost
- name: FAIL if no backup files found
ansible.builtin.fail:
msg: "No backup files found in {{ backup_dir_to_use }}"
when: latest_backup | default('') == ''
run_once: true
- name: Wait for Longhorn managers to be ready
kubernetes.core.k8s_info:
kind: Pod
namespace: "{{ longhorn_namespace }}"
label_selectors:
- app=longhorn-manager
register: managers_status
until: (managers_status.resources | default([]) | selectattr('status.phase', 'eq', 'Running') | list | length) >= 1
retries: 30
delay: 10
run_once: true
delegate_to: localhost
- name: Apply PV/PVC backup
kubernetes.core.k8s:
state: present
src: "{{ latest_backup }}"
run_once: true
delegate_to: localhost
- name: Find Longhorn metadata backup
ansible.builtin.set_fact:
# fileglob returns plain path strings (no stat data), so sort by filename.
# Assumption: the longhorn_metadata_<date> suffix sorts chronologically.
longhorn_backup: "{{ lookup('fileglob', backup_dir_to_use ~ '/longhorn_metadata_*.yaml', wantlist=True) | sort(reverse=True) | first | default('') }}"
run_once: true
delegate_to: localhost
- name: Apply Longhorn metadata backup (if exists)
kubernetes.core.k8s:
state: present
src: "{{ longhorn_backup | default(omit) }}"
namespace: "{{ longhorn_namespace }}"
when: longhorn_backup | default('') != ''
run_once: true
delegate_to: localhost
when: inventory_hostname == 'pi1'
run_once: true
# ========================================================================
# PHASE 6: Post-Recovery Verification
# ========================================================================
- name: Verify recovery success
block:
- name: Check CSI driver registration
kubernetes.core.k8s_info:
kind: CSIDriver
name: driver.longhorn.io
register: csi_final
until: csi_final.resources | length > 0
retries: 10
delay: 10
run_once: true
delegate_to: localhost
- name: Check Longhorn manager health
kubernetes.core.k8s_info:
kind: Pod
namespace: "{{ longhorn_namespace }}"
label_selectors:
- app=longhorn-manager
register: managers_final
until: (managers_final.resources | default([]) | selectattr('status.phase', 'eq', 'Running') | list | length) >= 3
retries: 15
delay: 10
run_once: true
delegate_to: localhost
- name: Check CSI socket exists (on pi1)
ansible.builtin.stat:
path: /var/lib/kubelet/plugins/driver.longhorn.io/csi.sock
register: csi_socket
when: inventory_hostname == 'pi1'
- name: Verify volume data is still present
ansible.builtin.stat:
path: "{{ longhorn_data_path }}/replicas"
register: replicas_dir
when: inventory_hostname == 'pi1'
- name: Display recovery summary
ansible.builtin.debug:
msg: |
===== Longhorn Recovery Summary =====
CSI Driver Registered: {{ (not (csi_final.failed | default(false))) | ternary('✓', '✗') }}
Managers Running: {{ (managers_final.resources | default([]) | selectattr('status.phase', 'eq', 'Running') | list | length) }}/3
CSI Socket Exists: {{ csi_socket.stat.exists | default(false) | bool | ternary('✓', '✗') }}
Volume Data Present: {{ replicas_dir.stat.exists | default(false) | bool | ternary('✓', '✗') }}
Backup Used: {{ latest_backup | default('none') }}
======================================
run_once: true
when: inventory_hostname == 'pi1'
run_once: true
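
The Phase 1 diagnosis and the manager health checks in this playbook reduce to counting pods by `status.phase`. The Python sketch below restates that logic for illustration only; the names `running_count` and `diagnose` are hypothetical, the pods are plain dictionaries shaped like Kubernetes API objects, and the escalation to later phases is deliberately left out.

```python
from typing import Any, Dict, List

Pod = Dict[str, Any]


def running_count(pods: List[Pod]) -> int:
    """Count pods whose status.phase is exactly 'Running'."""
    return sum(1 for pod in pods if pod.get("status", {}).get("phase") == "Running")


def diagnose(csi_check_failed: bool, managers_check_failed: bool,
             manager_pods: List[Pod]) -> str:
    """Return 'soft', 'hard' or 'none', following the Phase 1 branches."""
    if csi_check_failed:
        return "soft"
    # Pods that report a phase other than Running (pods with no phase are ignored).
    not_running = [
        pod for pod in manager_pods
        if pod.get("status", {}).get("phase") not in (None, "Running")
    ]
    if managers_check_failed or not_running:
        return "hard"
    return "none"


if __name__ == "__main__":
    pods = [{"status": {"phase": "Running"}}, {"status": {"phase": "Pending"}}]
    print(diagnose(False, False, pods), running_count(pods))
```

Keeping the diagnosis a pure function of the query results makes each branch easy to exercise without a cluster.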


@@ -0,0 +1,914 @@
---
# Longhorn Block-Device Data Recovery Playbook
#
# PURPOSE:
# Recover application data directly from raw Longhorn replica files when Volume CRDs
# are missing (e.g. after a nuclear cleanup + reinstall). Bypasses k8s objects entirely
# and works at the block-device level.
#
# WHEN TO USE:
# - Longhorn has been fully reinstalled (Volume CRDs are gone)
# - Application PVCs are stuck Terminating / Lost
# - The raw replica .img files still exist on disk
# → See docs/runbooks/longhorn-block-device-recovery.md for the manual equivalent
#
# WHEN NOT TO USE:
# - Volume CRDs still exist → use playbooks/recover/longhorn.yml instead
# - All replica dirs were zeroed by Longhorn reconciliation (data is unrecoverable)
#
# USAGE:
# ansible-playbook -i inventory/hosts.yml playbooks/recover/longhorn_data.yml \
# -e @vars/recovery_volumes.yml
#
# VARS FILE FORMAT (vars/recovery_volumes.yml):
# longhorn_recovery_volumes:
# - pv_name: pvc-abc123 # Longhorn volume name (== PV name)
# pvc_name: myapp-data # PVC name in the namespace
# namespace: myapp # namespace where the PVC lives
# size_bytes: "134217728" # volume size in bytes (string)
# size_human: 128Mi # human-readable, used in PVC spec
# access_mode: ReadWriteOnce # ReadWriteOnce or ReadWriteMany
# workload_kind: Deployment # Deployment or StatefulSet
# workload_name: myapp # name of the workload to scale down/up
# source_node: pi3 # [OPTIONAL] node with untouched replica dir
# source_dir: pvc-abc123-998f49ff # [OPTIONAL] exact replica dir name
# verify_cmd: "" # optional: command to run inside pod to verify data after recovery
#
# source_node and source_dir are auto-discovered (largest dir using more than 16 MiB
# on disk, across all nodes) when not specified. Override manually only to force a
# specific replica dir.
#
# REQUIREMENTS:
# - python3 on all cluster nodes
# - kubectl configured on the Ansible controller (localhost)
# - longhorn-system namespace running and healthy before this playbook starts
# - kubernetes.core collection: ansible-galaxy collection install kubernetes.core
#
# TESTED SCENARIO:
# 2026-04-13 power cut — nuclear Longhorn reinstall — url-shortener SQLite recovery
# Proven working as of 2026-04-14.
- name: Longhorn Block-Device Data Recovery
hosts: localhost
gather_facts: no
vars:
longhorn_data_path: /mnt/arcodange/longhorn
longhorn_namespace: longhorn-system
longhorn_nodes: [pi1, pi2, pi3]
merge_tool_local: "{{ playbook_dir }}/../../docs/incidents/2026-04-13-power-cut/tools/merge-longhorn-layers.py"
merge_tool_remote: /home/pi/merge-longhorn-layers.py
backup_base: /home/pi/arcodange/backups/longhorn-recovery
merged_base: /tmp/longhorn-recovery-merged
recovery_mount: /mnt/recovery-src
live_mount: /mnt/recovery-live
longhorn_recovery_volumes: [] # override with -e @vars/recovery_volumes.yml
tasks:
# =========================================================================
# PRE-FLIGHT
# =========================================================================
- name: "Pre-flight | Fail fast if no volumes defined"
ansible.builtin.fail:
msg: >
No recovery volumes defined. Pass -e @vars/recovery_volumes.yml with a
longhorn_recovery_volumes list. See playbook header for format.
when: longhorn_recovery_volumes | length == 0
- name: "Pre-flight | Verify merge tool exists locally"
ansible.builtin.stat:
path: "{{ merge_tool_local }}"
register: merge_tool_stat
delegate_to: localhost
- name: "Pre-flight | Fail if merge tool missing"
ansible.builtin.fail:
msg: "merge-longhorn-layers.py not found at {{ merge_tool_local }}"
when: not merge_tool_stat.stat.exists
- name: "Pre-flight | Check Longhorn is healthy"
kubernetes.core.k8s_info:
kind: Pod
namespace: "{{ longhorn_namespace }}"
label_selectors:
- app=longhorn-manager
register: lh_managers
delegate_to: localhost
- name: "Pre-flight | Fail if Longhorn managers are not running"
ansible.builtin.fail:
msg: >
Longhorn managers not running (found {{ lh_managers.resources | default([]) |
selectattr('status.phase', 'eq', 'Running') | list | length }} Running pods).
Ensure Longhorn is healthy before attempting data recovery.
when: >
(lh_managers.resources | default([]) |
selectattr('status.phase', 'eq', 'Running') | list | length) < 1
- name: "Pre-flight | Summary"
ansible.builtin.debug:
msg: >
Longhorn healthy ({{ lh_managers.resources |
selectattr('status.phase', 'eq', 'Running') | list | length }} managers running).
Recovering {{ longhorn_recovery_volumes | length }} volume(s):
{{ longhorn_recovery_volumes | map(attribute='pv_name') | list | join(', ') }}
# =========================================================================
# PHASE 0 — AUTO-DISCOVER BEST REPLICA DIR (when source_node/source_dir absent)
# =========================================================================
- name: "Phase 0 | Scan replica dirs on all nodes"
ansible.builtin.shell: |
result=""
for dir in {{ longhorn_data_path }}/replicas/{{ item.1.pv_name }}-*; do
[ -d "$dir" ] || continue
# Skip replicas that were being rebuilt — their data is incomplete
meta="$dir/volume.meta"
if [ -f "$meta" ]; then
rebuilding=$(python3 -c "import json; d=json.load(open('$meta')); print(d.get('Rebuilding', False))" 2>/dev/null)
[ "$rebuilding" = "True" ] && continue
fi
# Use actual disk usage (not apparent/sparse size) to rank replicas
size=$(du -sk "$dir" 2>/dev/null | cut -f1)
name=$(basename "$dir")
result="$result\n$size $name"
done
printf '%b' "$result" | grep -v '^$' || true
delegate_to: "{{ item.0 }}"
become: yes
loop: "{{ longhorn_nodes | product(longhorn_recovery_volumes) | list }}"
loop_control:
label: "{{ item.0 }}: {{ item.1.pv_name }}"
register: dir_scan_raw
changed_when: false
when: item.1.source_node | default('') == '' or item.1.source_dir | default('') == ''
- name: "Phase 0 | Pick best source (largest dir with data, >16 MiB)"
ansible.builtin.set_fact:
_discovered_sources: "{{ _build | from_json }}"
vars:
_build: >-
{% set ns = namespace(result={}) %}
{% for res in dir_scan_raw.results | default([]) %}
{% if not res.skipped | default(false) and res.stdout | default('') != '' %}
{% set node = res.item.0 %}
{% set vol = res.item.1.pv_name %}
{% for line in res.stdout_lines %}
{% set parts = line.split() %}
{% if parts | length == 2 %}
{% set size = parts[0] | int %}
{% set dir = parts[1] %}
{% if size > 16384 and (vol not in ns.result or size > ns.result[vol].size) %}
{# size is in KB (from du -sk); 16384 KB = 16 MiB minimum real replica #}
{% set _ = ns.result.update({vol: {'node': node, 'dir': dir, 'size': size}}) %}
{% endif %}
{% endif %}
{% endfor %}
{% endif %}
{% endfor %}
{{ ns.result | to_json }}
- name: "Phase 0 | Show discovered sources"
ansible.builtin.debug:
msg: >-
{% for vol in longhorn_recovery_volumes %}
{{ vol.pv_name }}:
{% if vol.source_node | default('') != '' %}
source: MANUAL → {{ vol.source_node }}/{{ vol.source_dir }}
{% elif vol.pv_name in _discovered_sources %}
source: AUTO → {{ _discovered_sources[vol.pv_name].node }}/{{ _discovered_sources[vol.pv_name].dir }}
({{ (_discovered_sources[vol.pv_name].size / 1024) | round(0) | int }} MiB)
{% else %}
source: NOT FOUND: no dir >16 MiB on any node for this volume
{% endif %}
{% endfor %}
- name: "Phase 0 | Fail if source not found for any volume"
ansible.builtin.fail:
msg: >
No replica dir with data found for {{ item.pv_name }} on any node
({{ longhorn_nodes | join(', ') }}). Check that the replica files survived.
loop: "{{ longhorn_recovery_volumes }}"
loop_control:
label: "{{ item.pv_name }}"
when: >
(item.source_node | default('') == '' or
item.source_dir | default('') == '') and
item.pv_name not in _discovered_sources
- name: "Phase 0 | Initialize merged volume list"
ansible.builtin.set_fact:
_merged_volumes: []
- name: "Phase 0 | Append each volume with resolved source"
ansible.builtin.set_fact:
_merged_volumes: "{{ _merged_volumes + [item | combine(_source)] }}"
vars:
_manual: "{{ item.source_node | default('') != '' and item.source_dir | default('') != '' }}"
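# Inline if/else is evaluated lazily; ternary() would evaluate both branches and
# fail on _discovered_sources for volumes whose source was set manually.
_source: "{{ {'source_node': item.source_node, 'source_dir': item.source_dir}
if _manual | bool else
{'source_node': _discovered_sources[item.pv_name].node,
'source_dir': _discovered_sources[item.pv_name].dir} }}"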
loop: "{{ longhorn_recovery_volumes }}"
loop_control:
label: "{{ item.pv_name }}"
- name: "Phase 0 | Apply resolved volume list"
ansible.builtin.set_fact:
_volumes: "{{ _merged_volumes }}"
# =========================================================================
# PHASE 1 — UPLOAD MERGE TOOL AND BACK UP REPLICA DIRS
# =========================================================================
- name: "Phase 1 | Upload merge tool to source nodes"
ansible.builtin.command: >
scp -o StrictHostKeyChecking=no
{{ merge_tool_local }}
pi@{{ item.source_node }}.home:{{ merge_tool_remote }}
delegate_to: localhost
become: no
loop: "{{ _volumes }}"
loop_control:
label: "{{ item.pv_name }} → {{ item.source_node }}"
changed_when: true
- name: "Phase 1 | Create backup directory on source node"
ansible.builtin.file:
path: "{{ backup_base }}/{{ item.pvc_name }}"
state: directory
mode: "0755"
delegate_to: "{{ item.source_node }}"
become: yes
loop: "{{ _volumes }}"
loop_control:
label: "{{ item.pvc_name }}"
- name: "Phase 1 | Check if backup already exists (skip if re-running)"
ansible.builtin.stat:
path: "{{ backup_base }}/{{ item.pvc_name }}/{{ item.source_dir }}/volume.meta"
register: backup_exists
delegate_to: "{{ item.source_node }}"
become: yes
loop: "{{ _volumes }}"
loop_control:
label: "{{ item.pvc_name }}"
- name: "Phase 1 | Back up untouched replica dir (safe copy before any operation)"
ansible.builtin.shell: >
cp -a {{ longhorn_data_path }}/replicas/{{ item.item.source_dir }}
{{ backup_base }}/{{ item.item.pvc_name }}/
delegate_to: "{{ item.item.source_node }}"
become: yes
loop: "{{ backup_exists.results }}"
loop_control:
label: "{{ item.item.pvc_name }}"
when: not item.stat.exists
changed_when: true
- name: "Phase 1 | Verify backup contains volume.meta"
ansible.builtin.stat:
path: "{{ backup_base }}/{{ item.pvc_name }}/{{ item.source_dir }}/volume.meta"
register: backup_meta
delegate_to: "{{ item.source_node }}"
become: yes
loop: "{{ _volumes }}"
loop_control:
label: "{{ item.pvc_name }}"
- name: "Phase 1 | Fail if backup is incomplete"
ansible.builtin.fail:
msg: >
Backup for {{ item.item.pvc_name }} is missing volume.meta — the source dir
{{ item.item.source_dir }} may not exist or backup copy failed.
loop: "{{ backup_meta.results }}"
loop_control:
label: "{{ item.item.pvc_name }}"
when: not item.stat.exists
# =========================================================================
# PHASE 2 — RECONSTRUCT FILESYSTEMS FROM REPLICA LAYERS
# =========================================================================
- name: "Phase 2 | Create merged output directory"
ansible.builtin.file:
path: "{{ merged_base }}"
state: directory
mode: "0755"
delegate_to: "{{ item.source_node }}"
become: yes
loop: "{{ _volumes }}"
loop_control:
label: "{{ item.pvc_name }}"
- name: "Phase 2 | Check if merged image already exists"
ansible.builtin.stat:
path: "{{ merged_base }}/{{ item.pvc_name }}.img"
register: merged_exists
delegate_to: "{{ item.source_node }}"
become: yes
loop: "{{ _volumes }}"
loop_control:
label: "{{ item.pvc_name }}"
- name: "Phase 2 | Merge snapshot + head layers into single image"
ansible.builtin.command: >
python3 {{ merge_tool_remote }}
{{ backup_base }}/{{ item.item.pvc_name }}/{{ item.item.source_dir }}
{{ merged_base }}/{{ item.item.pvc_name }}.img
delegate_to: "{{ item.item.source_node }}"
become: yes
loop: "{{ merged_exists.results }}"
loop_control:
label: "{{ item.item.pvc_name }}"
when: not item.stat.exists
changed_when: true
register: merge_output
- name: "Phase 2 | Show merge output"
ansible.builtin.debug:
msg: "{{ item.stdout_lines | default([]) }}"
loop: "{{ merge_output.results | default([]) }}"
loop_control:
label: "{{ item.item.item.pvc_name | default('') }}"
when: item.stdout_lines is defined
- name: "Phase 2 | Test mount merged image to verify filesystem"
ansible.builtin.shell: |
mkdir -p {{ recovery_mount }}-{{ item.pvc_name }}
mount -o loop,ro,noload {{ merged_base }}/{{ item.pvc_name }}.img {{ recovery_mount }}-{{ item.pvc_name }}
ls {{ recovery_mount }}-{{ item.pvc_name }}/
umount {{ recovery_mount }}-{{ item.pvc_name }}
delegate_to: "{{ item.source_node }}"
become: yes
loop: "{{ _volumes }}"
loop_control:
label: "{{ item.pvc_name }}"
register: mount_test
changed_when: false
- name: "Phase 2 | Show filesystem contents"
ansible.builtin.debug:
msg: "{{ item.item.pvc_name }}: {{ item.stdout_lines }}"
loop: "{{ mount_test.results }}"
loop_control:
label: "{{ item.item.pvc_name }}"
# =========================================================================
# PHASE 3 — CREATE LONGHORN VOLUME CRDs
# =========================================================================
# Scale down StatefulSets BEFORE removing PVC finalizers.
# StatefulSet controllers auto-recreate PVCs as soon as they are deleted; if we
# remove finalizers while the StatefulSet is still running, the controller
# immediately provisions a new empty PVC (bound to a fresh volume), making the
# PVC spec immutable by the time Phase 8 tries to pin it to our recovered PV.
# Deployments are less urgent here but scaled early for consistency.
- name: "Phase 3 | Pre-scale down Deployments (before PVC finalizer removal)"
kubernetes.core.k8s_scale:
kind: Deployment
name: "{{ item.workload_name }}"
namespace: "{{ item.namespace }}"
replicas: 0
wait: yes
wait_timeout: 60
delegate_to: localhost
loop: "{{ _volumes }}"
loop_control:
label: "{{ item.namespace }}/{{ item.workload_name }}"
when: item.workload_kind == 'Deployment' and item.workload_name != ''
ignore_errors: yes
- name: "Phase 3 | Pre-scale down StatefulSets (before PVC finalizer removal)"
kubernetes.core.k8s_scale:
kind: StatefulSet
name: "{{ item.workload_name }}"
namespace: "{{ item.namespace }}"
replicas: 0
wait: yes
wait_timeout: 60
delegate_to: localhost
loop: "{{ _volumes }}"
loop_control:
label: "{{ item.namespace }}/{{ item.workload_name }}"
when: item.workload_kind == 'StatefulSet' and item.workload_name != ''
ignore_errors: yes
# Clear any stuck Terminating PVs/PVCs BEFORE creating Volume CRDs.
# If old Terminating PVCs still exist when we create the Volume CRD, Longhorn
# associates them and deletes the Volume CRD when the PVC finishes terminating.
- name: "Phase 3 | Check PVC state before touching finalizers"
ansible.builtin.shell: >
kubectl get pvc {{ item.pvc_name }} -n {{ item.namespace }}
-o jsonpath='{.metadata.deletionTimestamp}' 2>/dev/null || true
register: pvc_deletion_ts
delegate_to: localhost
loop: "{{ _volumes }}"
loop_control:
label: "{{ item.namespace }}/{{ item.pvc_name }}"
changed_when: false
- name: "Phase 3 | Remove finalizers from stuck PV (if Terminating)"
ansible.builtin.shell: >
kubectl patch pv {{ item.pv_name }} --type=merge
-p '{"metadata":{"finalizers":null}}' 2>/dev/null || true
delegate_to: localhost
loop: "{{ _volumes }}"
loop_control:
label: "{{ item.pv_name }}"
changed_when: false
- name: "Phase 3 | Remove finalizers from stuck PVC (if Terminating)"
ansible.builtin.shell: >
kubectl patch pvc {{ item.pvc_name }} -n {{ item.namespace }}
--type=merge -p '{"metadata":{"finalizers":null}}' 2>/dev/null || true
delegate_to: localhost
loop: "{{ pvc_deletion_ts.results }}"
loop_control:
label: "{{ item.item.namespace }}/{{ item.item.pvc_name }}"
when: item.stdout != ''
changed_when: false
- name: "Phase 3 | Wait for stuck PVCs to fully delete before creating Volume CRDs"
kubernetes.core.k8s_info:
kind: PersistentVolumeClaim
name: "{{ item.item.pvc_name }}"
namespace: "{{ item.item.namespace }}"
register: pvc_pre_check
until: pvc_pre_check.resources | default([]) | length == 0
retries: 12
delay: 5
delegate_to: localhost
loop: "{{ pvc_deletion_ts.results }}"
loop_control:
label: "{{ item.item.namespace }}/{{ item.item.pvc_name }}"
when: item.stdout != ''
- name: "Phase 3 | Check if Longhorn Volume CRD already exists"
kubernetes.core.k8s_info:
kind: Volume
api_version: longhorn.io/v1beta2
namespace: "{{ longhorn_namespace }}"
name: "{{ item.pv_name }}"
register: volume_crd_check
delegate_to: localhost
loop: "{{ _volumes }}"
loop_control:
label: "{{ item.pv_name }}"
- name: "Phase 3 | Create Longhorn Volume CRD"
kubernetes.core.k8s:
state: present
definition:
apiVersion: longhorn.io/v1beta2
kind: Volume
metadata:
name: "{{ item.item.pv_name }}"
namespace: "{{ longhorn_namespace }}"
spec:
accessMode: "{{ item.item.access_mode | lower | replace('readwriteonce', 'rwo') | replace('readwritemany', 'rwx') }}"
dataEngine: v1
frontend: blockdev
numberOfReplicas: 3
size: "{{ item.item.size_bytes }}"
delegate_to: localhost
loop: "{{ volume_crd_check.results }}"
loop_control:
label: "{{ item.item.pv_name }}"
when: item.resources | default([]) | length == 0
- name: "Phase 3 | Wait for Longhorn replicas to appear (stopped state)"
kubernetes.core.k8s_info:
kind: Replica
api_version: longhorn.io/v1beta2
namespace: "{{ longhorn_namespace }}"
label_selectors:
- "longhornvolume={{ item.pv_name }}"
register: replicas_check
until: replicas_check.resources | default([]) | length >= 1
retries: 24
delay: 5
delegate_to: localhost
loop: "{{ _volumes }}"
loop_control:
label: "{{ item.pv_name }}"
- name: "Phase 3 | Wait for Volume status to be populated (webhook cache)"
kubernetes.core.k8s_info:
kind: Volume
api_version: longhorn.io/v1beta2
namespace: "{{ longhorn_namespace }}"
name: "{{ item.pv_name }}"
register: vol_ready
until: >
(vol_ready.resources | default([]) | first | default({}) ).status.state | default('') != ''
retries: 24
delay: 5
delegate_to: localhost
loop: "{{ _volumes }}"
loop_control:
label: "{{ item.pv_name }}"
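The `accessMode` expression in the "Create Longhorn Volume CRD" task above lowercases the Kubernetes access mode and rewrites it to Longhorn's short form. A minimal Python sketch of the same mapping, assuming only the two modes used in the vars files occur; the function and table names are hypothetical and not part of the playbook:

```python
# Hypothetical helper mirroring the Jinja chain:
#   access_mode | lower | replace('readwriteonce', 'rwo') | replace('readwritemany', 'rwx')
K8S_TO_LONGHORN = {
    "readwriteonce": "rwo",
    "readwritemany": "rwx",
}


def to_longhorn_access_mode(k8s_mode: str) -> str:
    """Translate a Kubernetes access mode into Longhorn's short form."""
    lowered = k8s_mode.lower()
    # Any other mode passes through lowercased, as the replace() chain would leave it.
    return K8S_TO_LONGHORN.get(lowered, lowered)


print(to_longhorn_access_mode("ReadWriteOnce"))
print(to_longhorn_access_mode("ReadWriteMany"))
```

A lookup table makes an unmapped mode easy to spot, whereas the chained `replace` calls silently pass it through.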
# =========================================================================
# PHASE 4 — SCALE DOWN WORKLOADS
# =========================================================================
- name: "Phase 4 | Scale down Deployments"
kubernetes.core.k8s_scale:
kind: Deployment
name: "{{ item.workload_name }}"
namespace: "{{ item.namespace }}"
replicas: 0
wait: yes
wait_timeout: 60
delegate_to: localhost
loop: "{{ _volumes }}"
loop_control:
label: "{{ item.namespace }}/{{ item.workload_name }}"
when: item.workload_kind == 'Deployment' and item.workload_name != ''
ignore_errors: yes
- name: "Phase 4 | Scale down StatefulSets"
kubernetes.core.k8s_scale:
kind: StatefulSet
name: "{{ item.workload_name }}"
namespace: "{{ item.namespace }}"
replicas: 0
wait: yes
wait_timeout: 60
delegate_to: localhost
loop: "{{ _volumes }}"
loop_control:
label: "{{ item.namespace }}/{{ item.workload_name }}"
when: item.workload_kind == 'StatefulSet' and item.workload_name != ''
ignore_errors: yes
- name: "Phase 4 | Delete any lingering Error-state pods that may hold volume attachments"
ansible.builtin.shell: |
kubectl get pods -n {{ item.namespace }} \
--field-selector='status.phase=Failed' -o name | xargs -r kubectl delete -n {{ item.namespace }}
delegate_to: localhost
loop: "{{ _volumes }}"
loop_control:
label: "{{ item.namespace }}"
changed_when: false
ignore_errors: yes
# =========================================================================
# PHASE 5 — ATTACH VOLUME VIA MAINTENANCE TICKET
# =========================================================================
- name: "Phase 5 | Create VolumeAttachment maintenance ticket"
kubernetes.core.k8s:
state: present
definition:
apiVersion: longhorn.io/v1beta2
kind: VolumeAttachment
metadata:
name: "{{ item.pv_name }}"
namespace: "{{ longhorn_namespace }}"
spec:
attachmentTickets:
recovery:
generation: 0
id: recovery
nodeID: "{{ item.source_node }}"
parameters:
disableFrontend: "false"
type: longhorn-api
volume: "{{ item.pv_name }}"
delegate_to: localhost
loop: "{{ _volumes }}"
loop_control:
label: "{{ item.pv_name }} → {{ item.source_node }}"
- name: "Phase 5 | Wait for volume to reach attached state"
kubernetes.core.k8s_info:
kind: Volume
api_version: longhorn.io/v1beta2
namespace: "{{ longhorn_namespace }}"
name: "{{ item.pv_name }}"
register: vol_state
until: >
(vol_state.resources | default([]) | first | default({}) ).status.state | default('') == 'attached'
retries: 24
delay: 5
delegate_to: localhost
loop: "{{ _volumes }}"
loop_control:
label: "{{ item.pv_name }}"
- name: "Phase 5 | Verify block device exists on target node"
ansible.builtin.stat:
path: "/dev/longhorn/{{ item.pv_name }}"
register: blockdev_check
delegate_to: "{{ item.source_node }}"
become: yes
loop: "{{ _volumes }}"
loop_control:
label: "{{ item.pv_name }}"
- name: "Phase 5 | Fail if block device not present"
ansible.builtin.fail:
msg: >
Block device /dev/longhorn/{{ item.item.pv_name }} not found on
{{ item.item.source_node }} after volume attached — check Longhorn logs.
loop: "{{ blockdev_check.results }}"
loop_control:
label: "{{ item.item.pv_name }}"
when: not item.stat.exists
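The wait tasks in Phases 3 and 5 all use the same `until` / `retries: 24` / `delay: 5` pattern, which bounds the time spent sleeping at roughly 24 × 5 s = 120 s per volume. The sketch below illustrates that polling budget in Python. It assumes one initial attempt followed by up to `retries` further attempts; the exact attempt count in Ansible may differ by one, and `wait_until` is a hypothetical name:

```python
import time
from typing import Callable


def wait_until(
    check: Callable[[], bool],
    retries: int = 24,
    delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `check` until it returns True or the retry budget is spent."""
    if check():
        return True
    for _ in range(retries):
        sleep(delay)  # worst case: retries * delay seconds of sleeping in total
        if check():
            return True
    return False


# A condition that is already satisfied returns immediately, without sleeping.
print(wait_until(lambda: True))
```

Injecting `sleep` keeps the helper testable without real delays.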
# =========================================================================
# PHASE 6 — INJECT DATA INTO LIVE BLOCK DEVICE
# =========================================================================
- name: "Phase 6 | Inject data via block device (mount, rsync, umount)"
ansible.builtin.shell: |
LIVE="{{ live_mount }}-{{ item.pvc_name }}"
SRC="{{ recovery_mount }}-{{ item.pvc_name }}"
BLOCKDEV="/dev/longhorn/{{ item.pv_name }}"
MERGED="{{ merged_base }}/{{ item.pvc_name }}.img"
# Always unmount on exit (success or partial failure)
cleanup() {
mountpoint -q "$SRC" && umount "$SRC" || true
mountpoint -q "$LIVE" && umount "$LIVE" || true
}
trap cleanup EXIT
mkdir -p "$LIVE" "$SRC"
# Format if not already formatted (idempotent — safe on re-run)
if ! blkid "$BLOCKDEV" | grep -q 'TYPE='; then
mkfs.ext4 -F "$BLOCKDEV"
fi
# Mount live block device if not already mounted
if ! mountpoint -q "$LIVE"; then
mount "$BLOCKDEV" "$LIVE"
fi
# Mount merged recovery image read-only if not already mounted
if ! mountpoint -q "$SRC"; then
mount -o loop,ro,noload "$MERGED" "$SRC"
fi
# Sync data — exclude lost+found
# --ignore-errors: continue past unreadable files (e.g. corrupted parts from power cut)
# rc=23 (partial transfer) is treated as success — bulk data transferred
rsync -av --ignore-errors --exclude='lost+found' "$SRC/" "$LIVE/" || \
{ RC=$?; [ $RC -eq 23 ] && echo "WARNING: rsync rc=23 (some files unreadable in source — expected for power-cut partitions)" || exit $RC; }
delegate_to: "{{ item.source_node }}"
become: yes
loop: "{{ _volumes }}"
loop_control:
label: "{{ item.pvc_name }}"
register: inject_output
changed_when: true
- name: "Phase 6 | Show rsync output"
ansible.builtin.debug:
msg: "{{ item.stdout_lines | default([]) }}"
loop: "{{ inject_output.results }}"
loop_control:
label: "{{ item.item.pvc_name }}"
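The Phase 6 script treats rsync exit code 23 (partial transfer due to error) as a warning and any other non-zero code as a failure. The decision can be sketched in Python as follows; `rsync_outcome` is a hypothetical name used only for illustration, and the playbook itself does this in shell:

```python
def rsync_outcome(rc: int) -> str:
    """Classify an rsync exit code the way the Phase 6 shell snippet does."""
    if rc == 0:
        return "ok"
    if rc == 23:
        # Some source files were unreadable; the bulk of the data was still copied.
        return "partial"
    raise RuntimeError(f"rsync failed with exit code {rc}")


print(rsync_outcome(0))
print(rsync_outcome(23))
```

Accepting code 23 trades completeness for progress, so the rsync output printed by the next task is worth reading to see which files were skipped.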
# =========================================================================
# PHASE 7 — DETACH VOLUME
# =========================================================================
- name: "Phase 7 | Remove recovery attachment ticket"
kubernetes.core.k8s_json_patch:
kind: VolumeAttachment
api_version: longhorn.io/v1beta2
namespace: "{{ longhorn_namespace }}"
name: "{{ item.pv_name }}"
patch:
- op: remove
path: /spec/attachmentTickets/recovery
delegate_to: localhost
loop: "{{ _volumes }}"
loop_control:
label: "{{ item.pv_name }}"
ignore_errors: yes
- name: "Phase 7 | Wait for recovery ticket to be gone"
kubernetes.core.k8s_info:
kind: VolumeAttachment
api_version: longhorn.io/v1beta2
namespace: "{{ longhorn_namespace }}"
name: "{{ item.pv_name }}"
register: va_state
until: >
(va_state.resources | default([]) | first | default({}) ).spec.attachmentTickets.recovery is not defined
retries: 24
delay: 5
delegate_to: localhost
loop: "{{ _volumes }}"
loop_control:
label: "{{ item.pv_name }}"
# =========================================================================
# PHASE 8 — RESTORE PV AND PVC
# =========================================================================
- name: "Phase 8 | Create PersistentVolume (Retain, no claimRef)"
kubernetes.core.k8s:
state: present
definition:
apiVersion: v1
kind: PersistentVolume
metadata:
name: "{{ item.pv_name }}"
annotations:
pv.kubernetes.io/provisioned-by: driver.longhorn.io
spec:
accessModes:
- "{{ item.access_mode }}"
capacity:
storage: "{{ item.size_human }}"
csi:
driver: driver.longhorn.io
fsType: ext4
volumeHandle: "{{ item.pv_name }}"
volumeAttributes:
dataEngine: v1
dataLocality: disabled
disableRevisionCounter: "true"
numberOfReplicas: "3"
staleReplicaTimeout: "30"
persistentVolumeReclaimPolicy: Retain
storageClassName: longhorn
volumeMode: Filesystem
delegate_to: localhost
loop: "{{ _volumes }}"
loop_control:
label: "{{ item.pv_name }}"
- name: "Phase 8 | Wait for PV to be Available or Bound"
kubernetes.core.k8s_info:
kind: PersistentVolume
name: "{{ item.pv_name }}"
register: pv_state
until: >
(pv_state.resources | default([]) | first | default({}) ).status.phase | default('')
in ['Available', 'Bound']
retries: 12
delay: 5
delegate_to: localhost
loop: "{{ _volumes }}"
loop_control:
label: "{{ item.pv_name }}"
- name: "Phase 8 | Check if PVC already bound to correct PV"
ansible.builtin.shell: >
kubectl get pvc {{ item.pvc_name }} -n {{ item.namespace }}
-o jsonpath='{.spec.volumeName}' 2>/dev/null || true
register: pvc_current_volume
delegate_to: localhost
loop: "{{ _volumes }}"
loop_control:
label: "{{ item.namespace }}/{{ item.pvc_name }}"
changed_when: false
- name: "Phase 8 | Create PersistentVolumeClaim pinned to PV"
kubernetes.core.k8s:
state: present
definition:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: "{{ item.item.pvc_name }}"
namespace: "{{ item.item.namespace }}"
spec:
accessModes:
- "{{ item.item.access_mode }}"
resources:
requests:
storage: "{{ item.item.size_human }}"
storageClassName: longhorn
volumeMode: Filesystem
volumeName: "{{ item.item.pv_name }}"
delegate_to: localhost
loop: "{{ pvc_current_volume.results }}"
loop_control:
label: "{{ item.item.namespace }}/{{ item.item.pvc_name }}"
when: item.stdout != item.item.pv_name
- name: "Phase 8 | Wait for PVC to be Bound"
kubernetes.core.k8s_info:
kind: PersistentVolumeClaim
namespace: "{{ item.namespace }}"
name: "{{ item.pvc_name }}"
register: pvc_state
until: >
(pvc_state.resources | default([]) | first | default({}) ).status.phase | default('') == 'Bound'
retries: 12
delay: 5
delegate_to: localhost
loop: "{{ _volumes }}"
loop_control:
label: "{{ item.namespace }}/{{ item.pvc_name }}"
# =========================================================================
# PHASE 9 — SCALE UP AND VERIFY
# =========================================================================
- name: "Phase 9 | Scale up Deployments"
kubernetes.core.k8s_scale:
kind: Deployment
name: "{{ item.workload_name }}"
namespace: "{{ item.namespace }}"
replicas: 1
wait: yes
wait_timeout: 120
delegate_to: localhost
loop: "{{ _volumes }}"
loop_control:
label: "{{ item.namespace }}/{{ item.workload_name }}"
when: item.workload_kind == 'Deployment' and item.workload_name != ''
ignore_errors: yes
- name: "Phase 9 | Scale up StatefulSets"
kubernetes.core.k8s_scale:
kind: StatefulSet
name: "{{ item.workload_name }}"
namespace: "{{ item.namespace }}"
replicas: 1
wait: yes
wait_timeout: 120
delegate_to: localhost
loop: "{{ _volumes }}"
loop_control:
label: "{{ item.namespace }}/{{ item.workload_name }}"
when: item.workload_kind == 'StatefulSet' and item.workload_name != ''
ignore_errors: yes
- name: "Phase 9 | Wait for workload to report ready replicas"
kubernetes.core.k8s_info:
kind: "{{ item.workload_kind }}"
name: "{{ item.workload_name }}"
namespace: "{{ item.namespace }}"
register: workload_state
until: >
(workload_state.resources | default([]) | first | default({}) ).status.readyReplicas | default(0) | int >= 1
retries: 24
delay: 5
delegate_to: localhost
loop: "{{ _volumes }}"
loop_control:
label: "{{ item.namespace }}/{{ item.workload_name }}"
when: item.workload_name != ''
ignore_errors: yes
- name: "Phase 9 | Run optional verification command in pod"
ansible.builtin.shell: >
kubectl exec -n {{ item.namespace }}
$(kubectl get pod -n {{ item.namespace }}
-l statefulset.kubernetes.io/pod-name={{ item.workload_name }}-0
--no-headers -o custom-columns=':metadata.name' 2>/dev/null ||
kubectl get pod -n {{ item.namespace }} {{ item.workload_name }}-0
--no-headers -o custom-columns=':metadata.name' 2>/dev/null)
-- sh -c '{{ item.verify_cmd }}'
delegate_to: localhost
loop: "{{ _volumes }}"
loop_control:
label: "{{ item.namespace }}/{{ item.workload_name }}"
when: item.verify_cmd | default('') != ''
register: verify_output
changed_when: false
ignore_errors: yes
- name: "Phase 9 | Show verification output"
ansible.builtin.debug:
msg: "{{ item.stdout_lines | default([]) }}"
loop: "{{ verify_output.results | default([]) }}"
loop_control:
label: "{{ item.item.pvc_name | default('') }}"
when: item.stdout_lines is defined and item.item.verify_cmd | default('') != ''
# =========================================================================
# RECOVERY SUMMARY
# =========================================================================
- name: "Summary | Recovery complete"
ansible.builtin.debug:
msg: |
╔══════════════════════════════════════════════════════╗
║ Longhorn Block-Device Recovery Complete ║
╚══════════════════════════════════════════════════════╝
Volumes recovered:
{% for v in _volumes %}
• {{ v.pvc_name }} ({{ v.namespace }}) ← {{ v.source_node }}:{{ v.source_dir }}
{% endfor %}
Backups retained at: {{ backup_base }}/<pvc-name>/
Merged images at: {{ merged_base }}/<pvc-name>.img
Next steps:
1. Verify application data through the app UI / API
2. Repeat for remaining volumes (update vars file)
3. Run a fresh k8s_pvc backup once all volumes are healthy

@@ -0,0 +1,84 @@
---
# Example vars file for playbooks/recover/longhorn_data.yml
#
# Usage:
# ansible-playbook -i inventory/hosts.yml playbooks/recover/longhorn_data.yml \
# -e @playbooks/recover/longhorn_data_vars.example.yml
#
# HOW TO FILL THIS IN:
#
# 1. Find untouched replica dirs across all nodes:
# for node in pi1 pi2 pi3; do
# echo "=== $node ==="
# ssh $node "sudo du -sh /mnt/arcodange/longhorn/replicas/pvc-<VOLUME>-* 2>/dev/null"
# done
# Pick the dir with the largest size (>16K) and oldest timestamps (from before the incident).
#
# 2. Get pv_name and pvc_name from PV/PVC backup:
# cat /home/pi/arcodange/backups/k3s_pvc/backup_*.volumes | grep -A5 "kind: PersistentVolume"
#
# 3. Get size_bytes from Longhorn volume spec or from:
# cat /mnt/arcodange/longhorn/replicas/<source_dir>/volume.meta
#
# 4. source_node = the node where the untouched dir lives
# source_dir = the exact directory name (e.g. pvc-abc123-998f49ff)
#
# Fields:
# pv_name — Longhorn volume name, equals the PV name (pvc-<uuid>) [REQUIRED]
# pvc_name — PVC name in the namespace [REQUIRED]
# namespace — namespace where the PVC lives [REQUIRED]
# size_bytes — volume capacity in bytes as a string (from volume spec) [REQUIRED]
# size_human — human-readable size for PVC spec (e.g. 128Mi, 8Gi) [REQUIRED]
# access_mode — ReadWriteOnce or ReadWriteMany [REQUIRED]
# workload_kind — Deployment or StatefulSet [REQUIRED]
# workload_name — name of the workload to scale down/up [REQUIRED]
# source_node — node holding the untouched replica dir (pi1/pi2/pi3) [OPTIONAL — auto-discovered]
# source_dir — exact replica dir name on source_node [OPTIONAL — auto-discovered]
# verify_cmd — shell command to run inside pod to confirm data after restore [OPTIONAL]
#
# source_node and source_dir are auto-discovered by Phase 0 (largest dir >16K across all
# nodes). Override them manually only if you want to force a specific replica dir.
longhorn_recovery_volumes:
# --- url-shortener (example, already recovered 2026-04-14) ---
- pv_name: pvc-cdd434d1-c8b4-4a75-acde-2978ec9febd4
pvc_name: url-shortener-data
namespace: url-shortener
size_bytes: "134217728"
size_human: 128Mi
access_mode: ReadWriteOnce
workload_kind: Deployment
workload_name: url-shortener
source_node: pi3
source_dir: pvc-cdd434d1-c8b4-4a75-acde-2978ec9febd4-998f49ff
verify_cmd: "sqlite3 /data/urls.db 'SELECT COUNT(*) FROM urls;'"
# --- traefik (example, already recovered 2026-04-14) ---
# - pv_name: pvc-<traefik-uuid>
# pvc_name: traefik-data
# namespace: traefik
# size_bytes: "134217728"
# size_human: 128Mi
# access_mode: ReadWriteOnce
# workload_kind: Deployment
# workload_name: traefik
# source_node: pi3
# source_dir: pvc-<traefik-uuid>-<hex>
# verify_cmd: ""
# --- vault (uncomment and fill for recovery) ---
# - pv_name: pvc-<vault-uuid>
# pvc_name: vault-data
# namespace: vault
# size_bytes: "1073741824"
# size_human: 1Gi
# access_mode: ReadWriteOnce
# workload_kind: StatefulSet
# workload_name: vault
# source_node: pi2
# source_dir: pvc-<vault-uuid>-<hex>
# verify_cmd: ""
# Add more volumes here following the same pattern.
# Process one at a time first to validate, then batch.
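The comments above describe the selection rule for the source replica: the largest directory above 16K across all nodes. Phase 0 itself is not shown in this excerpt, so the following is only a sketch of that rule under stated assumptions: directory sizes are already collected in bytes, and all names (`ReplicaDir`, `pick_source`, the sample directory suffixes) are hypothetical.

```python
from typing import Iterable, NamedTuple, Optional


class ReplicaDir(NamedTuple):
    node: str
    name: str
    size_bytes: int


# Assumption: directories at or below 16 KiB are treated as empty replicas.
MIN_SIZE_BYTES = 16 * 1024


def pick_source(candidates: Iterable[ReplicaDir]) -> Optional[ReplicaDir]:
    """Return the largest replica dir above the threshold, or None if none qualifies."""
    usable = [c for c in candidates if c.size_bytes > MIN_SIZE_BYTES]
    return max(usable, key=lambda c: c.size_bytes, default=None)


found = pick_source([
    ReplicaDir("pi1", "pvc-example-aaaa1111", 16 * 1024),
    ReplicaDir("pi2", "pvc-example-bbbb2222", 40 * 1024 * 1024),
    ReplicaDir("pi3", "pvc-example-cccc3333", 90 * 1024 * 1024),
])
print(found)
```

Size alone does not prove a replica predates the incident, which is why the manual instructions also ask for the oldest timestamps.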

@@ -0,0 +1,17 @@
---
# Recovery vars for Clickhouse
# Source: pi3, dir pvc-1251909b-...-1163420b (2.6G — largest, snapshot verified non-zero)
# Generated: 2026-04-14
longhorn_recovery_volumes:
- pv_name: pvc-1251909b-3cef-40c6-881c-3bb6e929a596
pvc_name: clickhouse-storage-clickhouse-0
namespace: tools
size_bytes: "17179869184" # 16Gi
size_human: 16Gi
access_mode: ReadWriteOnce
workload_kind: StatefulSet
workload_name: clickhouse
source_node: pi3
source_dir: pvc-1251909b-3cef-40c6-881c-3bb6e929a596-1163420b
verify_cmd: "clickhouse-client --query 'SHOW DATABASES'"
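Each entry carries the same capacity twice, as `size_human` (binary suffix) and as `size_bytes` (decimal string), and the two must agree. A small Python sketch for deriving one from the other, assuming only the binary suffixes Ki, Mi, Gi and Ti are used; the helper name is hypothetical:

```python
# Binary (power-of-two) suffixes as used in Kubernetes resource quantities.
UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}


def to_size_bytes(size_human: str) -> str:
    """Convert e.g. '16Gi' into the decimal byte string expected by size_bytes."""
    for suffix, factor in UNITS.items():
        if size_human.endswith(suffix):
            return str(int(size_human[: -len(suffix)]) * factor)
    raise ValueError(f"unsupported size: {size_human!r}")


print(to_size_bytes("16Gi"))   # 17179869184
print(to_size_bytes("128Mi"))  # 134217728
```

Running this against a vars file before a recovery would catch a mismatch between the two fields early.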

@@ -0,0 +1,38 @@
---
# Recovery vars for erp and hashicorp-vault volumes
# source_node/source_dir omitted — auto-discovered by Phase 0
longhorn_recovery_volumes:
- pv_name: pvc-7971918e-e47f-4739-a976-965ea2d770b4
pvc_name: erp
namespace: erp
size_bytes: "53687091200"
size_human: 50Gi
access_mode: ReadWriteMany
workload_kind: Deployment
workload_name: "" # intentionally blank — ERP needs Vault unsealed first; scale up manually
verify_cmd: ""
# hashicorp-vault StatefulSet has two PVCs (audit + data).
# workload_name is set only on the last entry so the StatefulSet is scaled up
# once after both volumes are ready, not between them.
- pv_name: pvc-6d2ea1c7-9327-4992-a02c-93ae604eda70
pvc_name: audit-hashicorp-vault-0
namespace: tools
size_bytes: "10737418240"
size_human: 10Gi
access_mode: ReadWriteOnce
workload_kind: StatefulSet
workload_name: ""
verify_cmd: ""
- pv_name: pvc-ca5567d3-a682-4cee-8ff1-2b8e23260635
pvc_name: data-hashicorp-vault-0
namespace: tools
size_bytes: "10737418240"
size_human: 10Gi
access_mode: ReadWriteOnce
workload_kind: StatefulSet
workload_name: hashicorp-vault
verify_cmd: ""

@@ -0,0 +1,47 @@
---
# Recovery vars for remaining volumes (prometheus, alertmanager, redis, backups-rwx)
# source_node and source_dir are omitted (auto-discovered by Phase 0), except for
# prometheus-server, which pins a specific replica dir
longhorn_recovery_volumes:
- pv_name: pvc-88e18c7f-2cfd-45e3-be5b-78c31ab829e9
pvc_name: prometheus-server
namespace: tools
size_bytes: "8589934592"
size_human: 8Gi
access_mode: ReadWriteOnce
workload_kind: Deployment
workload_name: prometheus-server
source_node: pi2
source_dir: pvc-88e18c7f-2cfd-45e3-be5b-78c31ab829e9-910583f6
verify_cmd: ""
- pv_name: pvc-aed7f2c4-1948-487a-8d10-d8a1372289b4
pvc_name: storage-prometheus-alertmanager-0
namespace: tools
size_bytes: "2147483648"
size_human: 2Gi
access_mode: ReadWriteOnce
workload_kind: StatefulSet
workload_name: prometheus-alertmanager
verify_cmd: ""
- pv_name: pvc-d1d5482b-81c8-4d7c-a528-7a57ef47a5ce
pvc_name: redis-storage-redis-0
namespace: tools
size_bytes: "1073741824"
size_human: 1Gi
access_mode: ReadWriteOnce
workload_kind: StatefulSet
workload_name: redis
verify_cmd: "redis-cli ping"
- pv_name: pvc-efda1d2f-1db8-46dd-9a97-3d11f1807ffa
pvc_name: backups-rwx
namespace: longhorn-system
size_bytes: "53687091200"
size_human: 50Gi
access_mode: ReadWriteMany
workload_kind: Deployment
workload_name: ""
verify_cmd: ""

@@ -5,7 +5,7 @@
gather_facts: false
vars:
-pihole_ip: "192.168.1.201"
+pihole_ips: "{{ groups['pihole'] | map('extract', hostvars) | map(attribute='preferred_ip') | list }}"
coredns_namespace: "kube-system"
tasks:
@@ -23,5 +23,38 @@
arcodange.lab:53 {
errors
cache 30
-forward . {{ pihole_ip }}:53
+forward . {{ pihole_ips | map('regex_replace', '^(.*)$', '\1:53') | join(' ') }}
}
- name: "Update the main CoreDNS ConfigMap to use the HA Pi-holes"
kubernetes.core.k8s:
state: present
definition:
apiVersion: v1
kind: ConfigMap
metadata:
name: coredns
namespace: "{{ coredns_namespace }}"
data:
Corefile: |
.:53 {
errors
health
ready
kubernetes cluster.local in-addr.arpa ip6.arpa {
pods insecure
fallthrough in-addr.arpa ip6.arpa
}
hosts /etc/coredns/NodeHosts {
ttl 60
reload 15s
fallthrough
}
prometheus :9153
cache 30
loop
reload
import /etc/coredns/custom/*.override
import /etc/coredns/custom/*.server
forward . {{ pihole_ips | map('regex_replace', '^(.*)$', '\1:53') | join(' ') }}
}
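The `forward` line builds its upstream list by appending `:53` to every Pi-hole address and joining the results with spaces. The Python sketch below reproduces that transformation. `192.168.1.201` comes from the removed `pihole_ip` variable, while the second address is a made-up example, since the real `preferred_ip` values live in the inventory:

```python
from typing import Iterable


def forward_targets(ips: Iterable[str], port: int = 53) -> str:
    """Equivalent of: ips | map('regex_replace', '^(.*)$', '\\1:53') | join(' ')."""
    return " ".join(f"{ip}:{port}" for ip in ips)


# The second IP is hypothetical, used only to show the multi-upstream form.
print("forward . " + forward_targets(["192.168.1.201", "192.168.1.202"]))
```

With more than one upstream listed, CoreDNS can keep resolving when a single Pi-hole is down, which is the point of the HA change.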

@@ -11,3 +11,17 @@
name: "{{ inventory_hostname }}"
become: yes
when: inventory_hostname != ansible_hostname
- name: Ensure dnsmasq user is in dip group for Pi-hole DNS
ansible.builtin.user:
name: dnsmasq
groups: dip
append: yes
when: "'pihole' in group_names"
- name: Disable dnsmasq service on Pi-hole nodes to avoid port 53 conflict with pihole-FTL
ansible.builtin.systemd:
name: dnsmasq
state: stopped
enabled: no
when: "'pihole' in group_names"

@@ -35,12 +35,16 @@
state: directory
mode: '0755'
- name: Check if daemon.json exists
ansible.builtin.stat:
path: /etc/docker/daemon.json
register: docker_config_stat
- name: Read the existing Docker configuration
ansible.builtin.command: "cat /etc/docker/daemon.json"
register: docker_config_raw
ignore_errors: yes
changed_when: false
-when: (ansible.builtin.stat.path='/etc/docker/daemon.json').stat.exists
+when: docker_config_stat.stat.exists
- name: Initialize the Docker config variable
ansible.builtin.set_fact:
@@ -82,12 +86,7 @@
- name: Ensure docker_config is a dictionary
ansible.builtin.set_fact:
-docker_config: >
-{% if docker_config is mapping %}
-{{ docker_config }}
-{% else %}
-{}
-{% endif %}
+docker_config: "{{ docker_config if docker_config is mapping else {} }}"
- name: Write the updated configuration
ansible.builtin.copy:

@@ -147,6 +147,13 @@
redisCacheDatabase: "0"
redisCacheUnreachableBlock: false
- name: Delete crowdsec pods in Error state to force them to restart
ansible.builtin.shell: |
kubectl get pods -n tools -l k8s-app=crowdsec \
--field-selector=status.phase=Failed -o name | xargs -r kubectl delete -n tools
changed_when: false
ignore_errors: yes
- name: Restart traefik to pick up the new middleware configuration
block:
# ---------------------

@@ -36,6 +36,11 @@
# WARNING: this disables AND wipes ALL gitea_cicd_* per-app JWT roles
# (created by tools/hashicorp-vault/iac/) every time it runs. Default is OFF
# to preserve those roles across normal ansible runs; opt in only when you
# really want to rebuild the OIDC backend from scratch (e.g. config drift on
# bound_issuer or similar).
- name: Delete existing Gitea OIDC backends if they exist
include_tasks: vault_cmd.yml
vars:
@@ -48,6 +53,7 @@
- gitea_jwt
loop_control:
loop_var: backend_name
when: vault_oidc_force_reset | default(false) | bool
- name: use tofu to provision vault
block:
@@ -106,3 +112,23 @@
'OIDC_CLIENT_SECRET': gitea_app.secret,
}) | b64encode }}
gitea_owner_type: 'org' # value != 'user'
# Also propagate the same secret to user-owned namespaces. Gitea Action secrets
# are scoped per owner, so repos under a user account cannot read org-level
# secrets. Extend this list if other personal-namespace apps need vault auth.
- name: Propagate vault_oauth__sh_b64 to user-owned namespaces
include_role:
name: arcodange.factory.gitea_secret
vars:
gitea_secret_name: vault_oauth__sh_b64
gitea_secret_value: >-
{{ lookup('ansible.builtin.template', 'oidc_jwt_token.sh.j2', template_vars = {
'GITEA_BASE_URL': 'https://gitea.arcodange.lab',
'OIDC_CLIENT_ID': gitea_app.id,
'OIDC_CLIENT_SECRET': gitea_app.secret,
}) | b64encode }}
gitea_owner_type: 'user'
gitea_owner_name: '{{ item }}'
loop: '{{ gitea_secret_propagation_users }}'
loop_control:
label: '{{ item }}'

@@ -1,4 +1,5 @@
{{- range $app_name, $app_attr := .Values.gitea_applications -}}
{{- $org := default "arcodange-org" $app_attr.org -}}
---
apiVersion: argoproj.io/v1alpha1
kind: Application
@@ -14,16 +15,20 @@ metadata:
spec:
project: default
source:
-repoURL: https://gitea.arcodange.lab/arcodange-org/{{ $app_name }}
+repoURL: https://gitea.arcodange.lab/{{ $org }}/{{ $app_name }}
targetRevision: HEAD
path: chart
destination:
server: https://kubernetes.default.svc
namespace: {{ $app_name }}
syncPolicy:
{{- if $app_attr.syncPolicy }}
{{- toYaml $app_attr.syncPolicy | nindent 4 }}
{{- else }}
automated:
prune: true
selfHeal: true
{{- end }}
syncOptions:
- CreateNamespace=true
{{ end }}
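The template change introduces two fallbacks: `org` defaults to `arcodange-org`, and `syncPolicy` defaults to automated prune and self-heal unless the application supplies its own. A Python sketch of that resolution logic, written under the assumption that an empty or missing value triggers the default (as Helm's `default` does); `resolve_app` is a hypothetical name and not part of the chart:

```python
DEFAULT_ORG = "arcodange-org"
DEFAULT_SYNC_POLICY = {"automated": {"prune": True, "selfHeal": True}}


def resolve_app(name: str, attrs: dict) -> dict:
    """Compute the repoURL and syncPolicy the template would render for one app."""
    org = attrs.get("org") or DEFAULT_ORG
    # Copy so the shared default is never mutated.
    sync_policy = dict(attrs.get("syncPolicy") or DEFAULT_SYNC_POLICY)
    # The template always appends syncOptions after the if/else branch.
    sync_policy["syncOptions"] = ["CreateNamespace=true"]
    return {
        "repoURL": f"https://gitea.arcodange.lab/{org}/{name}",
        "syncPolicy": sync_policy,
    }


print(resolve_app("telegram-gateway", {"org": "arcodange"})["repoURL"])
print(resolve_app("erp", {})["repoURL"])
```

One caveat this makes visible: a per-app `syncPolicy` that already contains `syncOptions` would collide with the key the template appends.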

@@ -6,16 +6,30 @@ gitea_applications:
annotations: {}
tools:
annotations: {}
syncPolicy:
automated:
prune: true
selfHeal: true
webapp:
annotations:
argocd-image-updater.argoproj.io/image-list: webapp=gitea.arcodange.lab/arcodange-org/webapp:latest
argocd-image-updater.argoproj.io/webapp.update-strategy: digest
telegram-gateway:
org: arcodange
annotations:
argocd-image-updater.argoproj.io/image-list: telegram-gateway=gitea.arcodange.lab/arcodange/telegram-gateway:latest
argocd-image-updater.argoproj.io/telegram-gateway.update-strategy: digest
erp:
annotations: {}
cms:
annotations:
argocd-image-updater.argoproj.io/image-list: cms=gitea.arcodange.lab/arcodange-org/cms:latest
argocd-image-updater.argoproj.io/cms.update-strategy: digest
dance-lessons-coach:
org: arcodange
annotations:
argocd-image-updater.argoproj.io/image-list: dance-lessons-coach=gitea.arcodange.lab/arcodange/dance-lessons-coach:latest
argocd-image-updater.argoproj.io/dance-lessons-coach.update-strategy: digest
argocd_image_updater_chart_values:
config:

doc/README.md Normal file
@@ -0,0 +1,35 @@
[Factory](../README.md) > **Doc**
# Factory Documentation
> **Last Updated:** 2026-05-31
> **Status:** Production · actively maintained
> **Related:** [Repository root README](../README.md) · [Ansible collection `arcodange.factory`](../ansible/arcodange/factory/README.md)
## What is this?
The `doc/` folder gathers the documentation for the Arcodange platform (k3s + Gitea + ArgoCD + OpenTofu + Vault + Postgres, self-hosted on 3 Raspberry Pi). It holds two families of documents:
- **ADRs** (architecture decision records), which explain *why* the platform is built the way it is;
- **runbooks**, which explain *how* to carry out an operational procedure end to end.
> [!NOTE]
> Repository convention (see PR #8): documentation lives under `doc/` (**singular**). An old `docs/` (plural) folder may still be lying around untracked by git; do not add content to it.
## Sections
| Section | What it contains | Status |
|---|---|---|
| [Runbooks](runbooks/README.md) | Step-by-step operational procedures (creating an app, etc.) | ✅ |
| [ADR](adr/README.md) | Architecture decisions + platform setup checklist | ✅ |
## Status legend
✅ active · 🟡 degraded/beta · 🔴 critical/EOL · ⚠️ known issue · ❌ disabled
## How to edit this documentation
1. **Add a page** → create it from the appropriate template **and** add its row to the index table of the parent `README.md`.
2. **Remove a page** → first mark it *Decommissioned (date)*; delete the file and its index row together once it is really gone.
3. **Keep cross-links bidirectional** → when linking A→B, add B→A.
4. **Update `Last Updated:`** at the top of the affected root page after any structural change.


@@ -0,0 +1,261 @@
[← ADRs](.) · [factory](../..) · **20260509 — telegram-gateway auth**
> **Cross-references** (bidirectional: every file listed must cite this ADR at its top)
>
> - **Code** (repo `arcodange/telegram-gateway`) :
> [`auth.go`](https://gitea.arcodange.lab/arcodange/telegram-gateway/src/branch/main/auth.go) ·
> [`handler_auth.go`](https://gitea.arcodange.lab/arcodange/telegram-gateway/src/branch/main/handler_auth.go) ·
> [`allowlist.go`](https://gitea.arcodange.lab/arcodange/telegram-gateway/src/branch/main/allowlist.go) ·
> [`server.go`](https://gitea.arcodange.lab/arcodange/telegram-gateway/src/branch/main/server.go) ·
> [`chart/values.yaml`](https://gitea.arcodange.lab/arcodange/telegram-gateway/src/branch/main/chart/values.yaml)
> - **User docs** :
> [`AUTH.md`](https://gitea.arcodange.lab/arcodange/telegram-gateway/src/branch/main/AUTH.md) ·
> [`HOWTO_ADD_BOT.md`](https://gitea.arcodange.lab/arcodange/telegram-gateway/src/branch/main/HOWTO_ADD_BOT.md)
> - **Related ADR** :
> [`20260407-network-architecture.md`](20260407-network-architecture.md) (Cloudflare / Traefik / CrowdSec stack)
> - **Implementation plan** : `~/.claude/plans/pour-les-notifications-on-inherited-seal.md` § Phase 1.5
# ADR 20260509: Telegram Gateway — Authentication Layer
## Status
Proposed
## Context
The `telegram-gateway` service (Phase 1, delivered on 2026-05-09) exposes Telegram bots through public webhooks at `tg.arcodange.fr/bot/<slug>`. At this stage:
- Any Telegram user who knows a bot's handle can DM it and trigger its handler.
- The gateway validates the Telegram `secret_token` (which proves that **Telegram** sent the webhook), not the identity of the **user** behind the message.
- Before opening the gateway to other useful bots (`/build` commands, Ollama scripts, etc.), an authentication protocol is needed.
The business need:
- A **main bot** (`@arcodange_factory_bot`, internal slug `factory`) acts as the authentication entry point.
- A **`/auth <code>`** command validates a session for the Telegram user who sends it.
- The other bots on the gateway only respond to **already authenticated users**, **by default** (secure-by-default).
- As an additional safeguard, an **allowlist of Telegram IDs** can filter which users may talk to the bots, independently of auth (silent drop before any processing).
## Decision
### 1. User identity
Telegram does **not** expose the user's **IP** to the bot. The stable key is **`from.id`** (Telegram user ID, `int64`, identical for a given account across all devices). It is used as the session identifier.
> Out of scope: auth bound to a device/IP, which would require a separate auth channel (web UI on the LAN, etc.).
### 2. Session storage
- **Redis** (`redis.tools.svc.cluster.local:6379`, already deployed in the `tools` namespace).
- Key: `tg-gw:auth:<from.id>` → value `1` (or JSON metadata if enriched later).
- TTL: **24 h by default**, configurable via the `AUTH_SESSION_TTL` env variable (Go duration: `12h`, `168h`, etc.; Go's standard `time.ParseDuration` has no day unit, so 7 days is written `168h`).
- Refresh: each successful `/auth` resets the TTL to its full duration.
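The key format and TTL handling above can be sketched in Go as follows. This is a minimal illustration, not the gateway's actual code: the helper names `sessionKey` and `parseSessionTTL` are hypothetical, and the sketch assumes the TTL is parsed with the standard `time.ParseDuration`.

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

// defaultSessionTTL mirrors the 24 h default stated in the ADR.
const defaultSessionTTL = 24 * time.Hour

// sessionKey builds the Redis key for a Telegram user ID (from.id).
func sessionKey(userID int64) string {
	return "tg-gw:auth:" + strconv.FormatInt(userID, 10)
}

// parseSessionTTL reads an AUTH_SESSION_TTL-style value (hypothetical helper).
// An empty string falls back to the default; anything else must be a
// positive Go duration such as "12h" or "168h".
func parseSessionTTL(raw string) (time.Duration, error) {
	if raw == "" {
		return defaultSessionTTL, nil
	}
	d, err := time.ParseDuration(raw)
	if err != nil {
		return 0, fmt.Errorf("invalid AUTH_SESSION_TTL %q: %w", raw, err)
	}
	if d <= 0 {
		return 0, fmt.Errorf("AUTH_SESSION_TTL must be positive, got %s", d)
	}
	return d, nil
}

func main() {
	fmt.Println(sessionKey(12345))
	for _, raw := range []string{"", "12h", "168h", "7d"} {
		ttl, err := parseSessionTTL(raw)
		fmt.Printf("%q -> ttl=%v err=%v\n", raw, ttl, err)
	}
}
```

Rejecting an unparsable value at startup, rather than silently falling back to the default, keeps a typo in the environment from changing the session lifetime unnoticed.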
### 3. Main bot & commands
The `factory` bot switches from the `echo` handler to the `auth` handler. The `auth` handler recognizes:
| Command | Effect |
|---|---|
| `/start` | Welcome message + list of available commands |
| `/auth <code>` | Compares `<code>` to `AUTH_SECRET` in constant time; if OK → Redis SET, deleteMessage on the original message (replay defense), reply "✅ Authentifié pour 24 h" |
| `/whoami` | Shows the user_id and the remaining TTL (or "non authentifié") |
| `/logout` | Redis DEL, reply "Déconnecté" |
| _other_ | Reminder of the commands |
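A minimal sketch of the `/auth <code>` check, under stated assumptions: `parseAuthCommand` and `codeMatches` are hypothetical helper names, and hashing both values with SHA-256 before comparing is a choice made here (not something the ADR specifies) so that `subtle.ConstantTimeCompare` always sees equal-length inputs.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"fmt"
	"strings"
)

// parseAuthCommand extracts the code from a "/auth <code>" message.
// It reports false for any other text, including a bare "/auth".
func parseAuthCommand(text string) (string, bool) {
	fields := strings.Fields(text)
	if len(fields) != 2 || fields[0] != "/auth" {
		return "", false
	}
	return fields[1], true
}

// codeMatches compares the candidate code to the shared secret.
// Both sides are hashed first so the compared slices have equal length:
// subtle.ConstantTimeCompare returns immediately when lengths differ.
func codeMatches(candidate, secret string) bool {
	if secret == "" {
		return false // an empty secret must never authenticate anyone
	}
	a := sha256.Sum256([]byte(candidate))
	b := sha256.Sum256([]byte(secret))
	return subtle.ConstantTimeCompare(a[:], b[:]) == 1
}

func main() {
	const secret = "s3cr3t" // stand-in for AUTH_SECRET
	for _, msg := range []string{"/auth s3cr3t", "/auth nope", "/whoami"} {
		code, ok := parseAuthCommand(msg)
		fmt.Printf("%q -> isAuth=%v valid=%v\n", msg, ok, ok && codeMatches(code, secret))
	}
}
```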
### 4. Allowlist safeguard
Env `ALLOWED_USERS`: CSV of Telegram `from.id` values (`12345,67890`). Behavior:
- Empty or unset → open to everyone (backward compatible with Phase 1).
- Set → any `from.id` not on the list is **silently dropped** (empty HTTP 200 to Telegram, INFO log on the gateway side, **no reply to the user**).
- The silent drop happens **before** the auth gate, which hides the very existence of the bots from strangers.
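The allowlist semantics above can be sketched as follows. The function names are hypothetical, and failing on a malformed entry (instead of skipping it) is an assumption of this sketch, chosen so that a typo cannot silently narrow or widen the list.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseAllowedUsers turns an ALLOWED_USERS-style CSV ("12345,67890")
// into a set of Telegram user IDs. Blank entries are ignored.
func parseAllowedUsers(csv string) (map[int64]struct{}, error) {
	allowed := make(map[int64]struct{})
	for _, field := range strings.Split(csv, ",") {
		field = strings.TrimSpace(field)
		if field == "" {
			continue
		}
		id, err := strconv.ParseInt(field, 10, 64)
		if err != nil {
			return nil, fmt.Errorf("invalid Telegram user ID %q: %w", field, err)
		}
		allowed[id] = struct{}{}
	}
	return allowed, nil
}

// isAllowed implements the documented rule: an empty allowlist means
// "open to everyone"; otherwise only listed IDs pass.
func isAllowed(allowed map[int64]struct{}, userID int64) bool {
	if len(allowed) == 0 {
		return true
	}
	_, ok := allowed[userID]
	return ok
}

func main() {
	allowed, err := parseAllowedUsers("12345, 67890")
	if err != nil {
		panic(err)
	}
	fmt.Println(isAllowed(allowed, 12345), isAllowed(allowed, 42))
}
```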
### 5. Per-bot `requireAuth` gate: secure-by-default
A boolean field in `chart/values.yaml`, set per bot. Semantics:
- **Default = `true`** (secure-by-default). Any bot that omits this field is gated.
- To make a bot public, add `requireAuth: false` explicitly.
- For `handler: auth` (the main bot), `requireAuth` is automatically **forced to `false`** (chicken-and-egg: if auth itself is gated, nobody can authenticate).
```yaml
bots:
  factory:
    handler: auth # requireAuth automatically forced to false
  pingbot:
    handler: echo # requireAuth: true (implicit default)
  statusbot:
    handler: echo
    requireAuth: false # explicit opt-out, public bot
```
When `requireAuth: true` and the user is not authenticated, the gateway replies:
> 🔒 Authentifie-toi d'abord avec `/auth <code>` chez @arcodange_factory_bot
… then acks 200 to Telegram. The bot's handler is **not** called.
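One way to model "omitted means `true`" in Go is a pointer field, so that an absent value can be told apart from an explicit `false`. The sketch below is illustrative only; the `BotConfig` type and its method are hypothetical and not taken from the gateway's `config.go`.

```go
package main

import "fmt"

// BotConfig is a hypothetical per-bot configuration entry.
// RequireAuth is a pointer so that "field omitted" (nil) can be
// distinguished from an explicit "requireAuth: false".
type BotConfig struct {
	Handler     string
	RequireAuth *bool
}

// EffectiveRequireAuth applies the rules from this section:
// the auth handler is never gated, and an omitted field means gated.
func (b BotConfig) EffectiveRequireAuth() bool {
	if b.Handler == "auth" {
		return false
	}
	if b.RequireAuth == nil {
		return true
	}
	return *b.RequireAuth
}

func main() {
	optOut := false
	bots := []struct {
		slug string
		cfg  BotConfig
	}{
		{"factory", BotConfig{Handler: "auth"}},
		{"pingbot", BotConfig{Handler: "echo"}},
		{"statusbot", BotConfig{Handler: "echo", RequireAuth: &optOut}},
	}
	for _, b := range bots {
		fmt.Printf("%s: gated=%v\n", b.slug, b.cfg.EffectiveRequireAuth())
	}
}
```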
### 6. Fail-at-startup
If `AUTH_SECRET` is empty AND at least one bot has `handler=auth` or `requireAuth: true` (including by default) → the pod **fails at boot** with a clear message. This avoids the "auth silently off, bots accessible to everyone without anyone knowing" scenario. With `requireAuth: true` as the default, in practice every deployment requires `AUTH_SECRET` (unless all bots explicitly opt out).
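The startup check can be sketched as a pure validation function that the program calls before serving traffic. The `Bot` type and `validateStartup` name are assumptions made for illustration.

```go
package main

import "fmt"

// Bot is a hypothetical per-bot configuration entry; a nil RequireAuth
// means the field was omitted and the secure default (true) applies.
type Bot struct {
	Handler     string
	RequireAuth *bool
}

// needsSecret reports whether this bot makes AUTH_SECRET mandatory.
func (b Bot) needsSecret() bool {
	if b.Handler == "auth" {
		return true
	}
	return b.RequireAuth == nil || *b.RequireAuth
}

// validateStartup returns an error when AUTH_SECRET is empty while at
// least one bot relies on it, so the process can refuse to boot.
func validateStartup(authSecret string, bots map[string]Bot) error {
	if authSecret != "" {
		return nil
	}
	for slug, b := range bots {
		if b.needsSecret() {
			return fmt.Errorf("AUTH_SECRET is empty but bot %q requires it", slug)
		}
	}
	return nil
}

func main() {
	bots := map[string]Bot{"pingbot": {Handler: "echo"}}
	if err := validateStartup("", bots); err != nil {
		fmt.Println("refusing to start:", err)
	}
}
```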
## Architecture Diagrams
### 1. Flow `/auth` (login)
```mermaid
%%{init: {'theme':'neutral'}}%%
sequenceDiagram
participant U as Utilisateur
participant TG as Telegram
participant GW as telegram-gateway
participant R as Redis (tools)
U->>TG: /auth s3cr3t (DM @arcodange_factory_bot)
TG->>GW: POST /bot/factory<br/>X-Telegram-Bot-Api-Secret-Token: …
GW->>GW: verify secret_token (Telegram→GW)
GW->>GW: check ALLOWED_USERS (si configuré)
GW->>GW: factory.handler = auth, parse "/auth s3cr3t"
GW->>GW: subtle.ConstantTimeCompare(s3cr3t, AUTH_SECRET)
alt Code valide
GW->>R: SET tg-gw:auth:<from.id> EX 24h
GW->>TG: deleteMessage (replay defense)
GW->>TG: sendMessage "✅ Authentifié pour 24h"
GW->>TG: 200 OK (ack webhook)
TG->>U: "✅ Authentifié pour 24h"
else Code invalide
GW->>TG: sendMessage "❌ Mauvais code"
GW->>TG: 200 OK
TG->>U: "❌ Mauvais code"
end
```
### 2. Accessing a gated bot (`requireAuth: true`, the default)
```mermaid
%%{init: {'theme':'neutral'}}%%
sequenceDiagram
participant U as Utilisateur
participant TG as Telegram
participant GW as telegram-gateway
participant R as Redis
participant H as Bot handler (echo / http / shell…)
U->>TG: ping (DM @autre_bot)
TG->>GW: POST /bot/autre_bot
GW->>GW: verify secret_token + parse Update
GW->>GW: ALLOWED_USERS check
GW->>R: EXISTS tg-gw:auth:<from.id>
alt Authentifié
R-->>GW: 1
GW->>H: Handler.Handle(update, bot)
H->>TG: sendMessage (réponse métier)
GW->>TG: 200 OK
else Non authentifié
R-->>GW: 0
GW->>TG: sendMessage "🔒 /auth chez @arcodange_factory_bot"
GW->>TG: 200 OK
end
```
### 3. Overall decision when a webhook arrives
```mermaid
%%{init: {'theme':'neutral'}}%%
graph TD
%% classDef avec contraste explicite : fond clair → texte sombre
classDef ok fill:#d4edda,stroke:#28a745,color:#155724;
classDef block fill:#f8d7da,stroke:#dc3545,color:#721c24;
classDef neutral fill:#e2e3e5,stroke:#6c757d,color:#383d41;
Start[Webhook POST /bot/&lt;slug&gt;]:::neutral
SecretCheck{secret_token<br/>match ?}:::neutral
AllowlistCheck{from.id ∈<br/>ALLOWED_USERS ?}:::neutral
HandlerKind{handler == auth ?}:::neutral
AuthGate{requireAuth ?<br/>+ session valide ?}:::neutral
Reject401[401 Unauthorized]:::block
SilentDrop[200 vide<br/>silent drop]:::block
Forbidden[reply &quot;🔒 /auth …&quot;<br/>200 OK]:::block
AuthHandler[handler auth<br/>/auth /whoami /logout]:::ok
BotHandler[Bot handler<br/>echo / http / shell]:::ok
Start --> SecretCheck
SecretCheck -- non --> Reject401
SecretCheck -- oui --> AllowlistCheck
AllowlistCheck -- non --> SilentDrop
AllowlistCheck -- oui --> HandlerKind
HandlerKind -- oui --> AuthHandler
HandlerKind -- non --> AuthGate
AuthGate -- pas autorisé --> Forbidden
AuthGate -- OK --> BotHandler
```
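The decision graph above maps naturally onto an ordered `switch`. The following Go sketch mirrors the graph's branches one for one; the `Webhook` and `Outcome` types are hypothetical and only encode the facts each node tests.

```go
package main

import "fmt"

// Outcome is what the gateway does with an incoming webhook.
type Outcome string

const (
	Reject401      Outcome = "401 Unauthorized"
	SilentDrop     Outcome = "200 empty (silent drop)"
	RunAuthHandler Outcome = "auth handler"
	PromptAuth     Outcome = "reply with auth prompt, 200 OK"
	RunBotHandler  Outcome = "bot handler"
)

// Webhook carries the facts the decision graph branches on.
type Webhook struct {
	SecretTokenOK bool // secret_token matches
	Allowlisted   bool // ALLOWED_USERS unset, or it contains from.id
	HandlerIsAuth bool // the targeted bot uses handler == auth
	RequireAuth   bool // effective requireAuth for the targeted bot
	SessionValid  bool // a session exists for from.id
}

// decide walks the graph top to bottom; the case order is the check order.
func decide(w Webhook) Outcome {
	switch {
	case !w.SecretTokenOK:
		return Reject401
	case !w.Allowlisted:
		return SilentDrop
	case w.HandlerIsAuth:
		return RunAuthHandler
	case w.RequireAuth && !w.SessionValid:
		return PromptAuth
	default:
		return RunBotHandler
	}
}

func main() {
	w := Webhook{SecretTokenOK: true, Allowlisted: true, RequireAuth: true}
	fmt.Println(decide(w))
}
```

Keeping the decision free of I/O makes the ordering (secret token, then allowlist, then auth gate) easy to unit-test in isolation.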
## Consequences
### Positive
- **Confidentiality**: business bots only respond to authenticated Telegram accounts, **by default**.
- **Defense in depth**: `ALLOWED_USERS` (allowlist), `secret_token` (Telegram→GW), `AUTH_SECRET` (user→bot), session TTL.
- **Simple UX**: a one-off `/auth <code>`, valid for 24 h.
- **No migration** for Phase 2/3: the gate slots in cleanly before the enqueue or the forward.
- **Replay defense**: the message containing the code is deleted from the chat after a successful login.
- **Secure-by-default**: a new bot added to the gateway requires a session with nothing to configure.
### Negative
- **Shared code**: `AUTH_SECRET` is global (no TOTP, not per-user). If compromised → manual rotation (change the Secret + redeploy).
- **No rate limit** on `/auth`: nothing stops a user listed in `ALLOWED_USERS` from trying codes repeatedly. Mitigation: `ALLOWED_USERS` restricts who can try at all, and a code with 128+ bits of entropy makes brute force impractical.
- **Redis dependency**: if Redis goes down, no user is considered authenticated anymore → every gated bot replies "🔒". Acceptable (fail-closed); Phase 1 already restored Redis cleanly.
- **No explicit multi-device session**: `from.id` is the same on all devices of an account → auth already covers all devices, which is the expected behavior.
## Alternatives Considered
### Alternative 1: IP-based auth
**Rejected**. Telegram does not expose the user's IP to the bot. It would have required a secondary auth channel (web UI on the LAN, arcodange.fr landing page) and device binding. Significant cost for an unclear benefit.
### Alternative 2: TOTP / rotating OTP
**Rejected at this stage**. More secure than the shared code, but it adds:
- An enrollment step (display a QR code, scan it with an app).
- Synchronized clocks on the gateway side and the user side.
- User-facing complexity (opening the app for every /auth).
To be reconsidered if the shared code leaks regularly or if access is opened to more users.
### Alternative 3: Postgres instead of Redis for sessions
**Rejected**. Postgres would be needed for Phase 2 (durable queue), but for short-TTL sessions Redis is the idiomatic tool:
- Sub-millisecond latency.
- Native TTL (`SET … EX 86400`).
- Already deployed and in use (CrowdSec bouncer).
### Alternative 4: no session, code checked on every message
**Rejected**. Terrible UX (retyping the code for every DM) with no benefit (the plaintext code lingers longer in the chat).
### Alternative 5: `requireAuth: false` by default (insecure-by-default)
**Rejected** (initially chosen, then reversed). Defaulting to `requireAuth: false` means a bot added without care is accessible to everyone. For a gateway designed to be "private by design", the secure default `true` is a much better fit.
## Implementation plan
See `~/.claude/plans/pour-les-notifications-on-inherited-seal.md` § Phase 1.5.
Summary of the files touched:
- **New** (repo `arcodange/telegram-gateway`): `auth.go`, `handler_auth.go`, `allowlist.go`, `AUTH.md`
- **Modified**: `telegram_types.go`, `telegram.go`, `handlers.go`, `config.go`, `server.go`, `main.go`, `go.mod`, `chart/values.yaml`, `chart/templates/deployment.yaml`, `HOWTO_ADD_BOT.md`
- **Cluster**: `kubectl patch secret telegram-gateway-bots` to add `AUTH_SECRET` and (optionally) `ALLOWED_USERS`
## Success Metrics
- `/auth <wrong>` → 100 % rejected, 0 Redis SET.
- `/auth <right>` → 100 % success, best-effort deleteMessage executed.
- A bot with `requireAuth: true` (default) replies with the "🔒 …" message to 100 % of unauthenticated users.
- The session actually expires after the TTL (checked via `kubectl exec redis-0 -- redis-cli TTL …`).
- No secret (code, bot token) in the logs.
- Latency added by the gate < 5 ms (local Redis EXISTS).


@@ -18,9 +18,9 @@
- [ ] terrakube
- [ ] prometheus/grafana
- [ ] ansible AWX
- [ ] setup hello world web app
- [ ] manage postgres credentials
- [ ] protect public endpoint (crowdsec)
- [ ] setup hello world web app: 📖 full procedure in the [runbook "New web application"](../runbooks/new-web-app/README.md)
- [ ] manage postgres credentials → [database](../runbooks/new-web-app/02-database.md) + [platform Vault](../runbooks/new-web-app/03-vault-platform.md)
- [ ] protect public endpoint (crowdsec) → [chart: public ingress](../runbooks/new-web-app/04-helm-chart.md)
> [!NOTE]
> Reference: [Arcodange _**Factory**_ Ansible Collection](/ansible/arcodange/factory/README.md)

doc/runbooks/README.md Normal file

@@ -0,0 +1,21 @@
[Factory](../../README.md) > [Doc](../README.md) > **Runbooks**
# Runbooks
> **Scope:** step-by-step operational procedures for the Arcodange platform. Each runbook reads from start to finish and leads to a verifiable result. For the *why* (architecture decisions), see the [ADRs](../adr/README.md).
```mermaid
%%{init: {'theme': 'base'}}%%
flowchart LR
classDef node fill:#059669,stroke:#047857,color:#fff
OP["Operator"]:::node --> RB["Runbook<br>(ordered procedure)"]:::node --> RES["Verifiable result<br>(app online, etc.)"]:::node
```
## Index
| Runbook | Summary | Status |
|---|---|---|
| [New web application](new-web-app/README.md) | Create a web app from scratch in a new Gitea repository: repository, database, Vault, Helm chart, Terraform, CI, ArgoCD | ✅ |
> [!TIP]
> To add a runbook: create a `kebab-case/` folder with its `README.md` (front door: intro + diagram + index), then add its row above.


@@ -0,0 +1,88 @@
[Factory](../../../README.md) > [Doc](../../README.md) > [Runbooks](../README.md) > [New web application](README.md) > **1. Gitea repository**
# 1. Create the Gitea repository
> **Status:** ✅ Active
> **Downstream:** [2. Database](02-database.md), [3. Platform Vault](03-vault-platform.md)
> **Related:** [Naming conventions](conventions.md) · [6. CI workflows](06-ci-workflows.md)
---
## Summary
Everything starts from a Gitea repository named exactly `<app>` (see [conventions](conventions.md)). Creating it **under the `arcodange-org` organization** is not a detail: it is what makes it **automatically inherit the organization-level Actions secrets** that all workflows depend on. The repository contains a fixed skeleton that the following steps fill in.
## Why under `arcodange-org`
The `.gitea/workflows/*` workflows (see [step 6](06-ci-workflows.md)) reference secrets that are **not** defined in the repository but at the organization level, and are inherited by all its repositories:
| Organization secret | Usage |
|---|---|
| `HOMELAB_CA_CERT` | Internal CA (base64) for talking TLS to `vault.arcodange.lab` |
| `vault_oauth__sh_b64` | Script (base64) that performs the Gitea OIDC → Vault JWT exchange |
| `PACKAGES_TOKEN` | Push token for the `gitea.arcodange.lab` image registry |
These secrets are propagated by the Ansible role [`gitea_secret`](https://gitea.arcodange.lab/arcodange-org/factory/src/branch/main/ansible/arcodange/factory/roles/gitea_secret/defaults/main.yml) (`gitea_owner_type: organization`).
> [!IMPORTANT]
> A repository created **outside** `arcodange-org` (e.g. under the `arcodange` org) does not necessarily inherit these secrets. If you override the org (see `org:` in [step 7](07-argocd-register.md)), make sure the same secrets exist there.
## Options for creating the repository
| Method | When | How |
|---|---|---|
| Gitea UI | one-off, manual | `https://gitea.arcodange.lab` → New Repository under `arcodange-org` |
| Gitea MCP | from an agent | `mcp__gitea__create_repo` tool (see the "Gitea = MCP, not `gh`" rule in the global guide) |
| Ansible role `gitea_repo` | reproducible/inventory-driven | [`roles/gitea_repo`](https://gitea.arcodange.lab/arcodange-org/factory/src/branch/main/ansible/arcodange/factory/roles/gitea_repo/defaults/main.yml) |
| Terraform resource `gitea_repository` | everything as IaC | in [`factory/iac`](https://gitea.arcodange.lab/arcodange-org/factory/src/branch/main/iac) (provider `go-gitea/gitea` already configured) |
## Minimal repository skeleton
```
<app>/
├── chart/                  # Helm chart: ArgoCD deploys THIS folder (path: chart) → step 4
│   ├── Chart.yaml
│   ├── values.yaml
│   └── templates/
├── iac/                    # the app's Terraform/OpenTofu → step 5
│   ├── providers.tf
│   ├── backend.tf
│   └── main.tf
├── .gitea/workflows/       # CI (tofu apply + image build) → step 6
│   ├── vault.yaml
│   └── dockerimage.yaml    # only if the app builds its own image
├── Dockerfile              # only if the app builds its own image
├── README.md
└── .gitignore
```
> [!IMPORTANT]
> The chart folder **must** be named `chart/` and sit at the repository root: the ArgoCD template enforces `path: chart` (see [step 7](07-argocd-register.md)). No `helm/`, no sub-folder.
## Recommended `.gitignore`
Aligned with the existing repositories (excludes any local secret):
```gitignore
.terraform
.terraform.*
.env
*.key
secrets/
.DS_Store
```
## The `tofu_module_reader` bot
The app's CI clones the shared Terraform module `tools` **over SSH** (see [step 5](05-app-terraform.md)). The restricted user `tofu_module_reader` (created in [`factory/iac/gitea_tofu_ci_user.tf`](https://gitea.arcodange.lab/arcodange-org/factory/src/branch/main/iac/gitea_tofu_ci_user.tf), private key in Vault at `kvv1/gitea/tofu_module_reader`) serves as the read identity. Nothing special to do, but the `tools` repository must remain readable by this bot.
## Notes / constraints
- The repository name = `<app>` **exactly** (lowercase kebab-case). See [conventions](conventions.md): this name propagates everywhere.
- No need to protect `main` beyond the team's worktree/PR convention; ArgoCD tracks `HEAD`.
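Since the repository name propagates into database, role, and policy names, it can be worth validating it before creating anything. The sketch below is an assumption-laden illustration: the authoritative rules live in `conventions.md`, which is not reproduced here, and the regular expression is only one reasonable reading of "lowercase kebab-case".

```go
package main

import (
	"fmt"
	"regexp"
)

// kebabCase is an assumed interpretation of "lowercase kebab-case":
// lowercase alphanumeric segments joined by single hyphens.
var kebabCase = regexp.MustCompile(`^[a-z0-9]+(-[a-z0-9]+)*$`)

// isValidAppName reports whether name fits the assumed convention.
func isValidAppName(name string) bool {
	return kebabCase.MatchString(name)
}

func main() {
	for _, name := range []string{"erp", "dance-lessons-coach", "Web_App", "a--b"} {
		fmt.Printf("%-20s valid=%v\n", name, isValidAppName(name))
	}
}
```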
## Related
- [2. Database](02-database.md): provision the database, in parallel with step 3.
- [3. Platform Vault](03-vault-platform.md): declare the app on the Vault side, in parallel with step 2.
- [6. CI workflows](06-ci-workflows.md): consumes the org secrets inherited here.


@@ -0,0 +1,93 @@
[Factory](../../../README.md) > [Doc](../../README.md) > [Runbooks](../README.md) > [New web application](README.md) > **2. Database**
# 2. Provision the database
> **Status:** ✅ Active
> **Upstream:** [1. Gitea repository](01-gitea-repo.md)
> **Downstream:** [5. App Terraform](05-app-terraform.md)
> **Related:** [3. Platform Vault](03-vault-platform.md) · [4. Helm chart](04-helm-chart.md) · [Naming conventions](conventions.md)
---
## Summary
The app's database and its **owner role** `<app>_role` are created by a **platform-side** Terraform (`factory/postgres/iac`), not in the app's repository. You simply add the app's name to a list, and the `factory` CI applies it. The app itself will **never** connect with a static password: it obtains ephemeral credentials from Vault (see [step 4](04-helm-chart.md)) that inherit from `<app>_role`.
## Action
Add `"<app>"` to the `applications` set in [`factory/postgres/iac/terraform.tfvars`](https://gitea.arcodange.lab/arcodange-org/factory/src/branch/main/postgres/iac/terraform.tfvars):
```hcl
applications = [
  "webapp",
  "erp",
  "crowdsec",
  "plausible",
  "dance-lessons-coach",
  "<app>", # ← add
]
```
Then push: the CI applies automatically (see below).
## What this creates
[`postgres/iac/main.tf`](https://gitea.arcodange.lab/arcodange-org/factory/src/branch/main/postgres/iac/main.tf) iterates with `for_each` over the set and creates, **per app**:
| Resource | Name | Role |
|---|---|---|
| `postgresql_role` | `<app>_role` | **Non-login** role, owner of the database |
| `postgresql_grant_role` | `<app>_role` → `credentials_editor` (WITH ADMIN OPTION) | Lets Vault attach dynamic users to this role |
| `postgresql_database` | `<app>` | The database (owner `<app>_role`, `template0`, `alter_object_ownership`) |
| `postgresql_function` | `user_lookup()` (in the `<app>` database) | pgbouncer authentication (reads `pg_shadow`) |
| `postgresql_grant` | EXECUTE on `user_lookup` → `pgbouncer_auth` | Allows pgbouncer to resolve users |
Key excerpt:
```hcl
resource "postgresql_role" "app_role" {
  for_each = var.applications
  name     = "${each.value}_role"
  login    = false # non-login: only serves as a "privilege carrier"
}
resource "postgresql_database" "app_db" {
  for_each               = var.applications
  name                   = each.value
  owner                  = postgresql_role.app_role[each.value].name
  template               = "template0"
  alter_object_ownership = true
}
```
## How it is applied
The workflow [`factory/.gitea/workflows/postgres.yaml`](https://gitea.arcodange.lab/arcodange-org/factory/src/branch/main/.gitea/workflows/postgres.yaml) triggers on any change to `postgres/**/*.tf` or `*.tfvars`:
```mermaid
%%{init: {'theme': 'base'}}%%
flowchart LR
classDef ci fill:#059669,stroke:#047857,color:#fff
classDef db fill:#2563eb,stroke:#1e40af,color:#fff
PUSH["push tfvars"]:::ci --> JWT["OIDC Gitea → JWT Vault<br>(role gitea_cicd)"]:::ci
JWT --> READ["lit kvv1/postgres/credentials<br>→ TF_VAR_postgres_*"]:::ci
READ --> APPLY["tofu apply postgres/iac"]:::ci
APPLY --> DB["base app + app_role"]:::db
```
The PostgreSQL provider targets host `192.168.1.202` (`sslmode=disable`, `superuser=true`) and authenticates with the `credentials_editor` account, whose credentials are stored in Vault at `kvv1/postgres/credentials_editor/credentials`.
## The connection model (worth remembering)
> [!IMPORTANT]
> The application does **not** connect directly to Postgres with a fixed user. It targets **`pgbouncer.tools:5432`** and uses **short-lived dynamic users** issued by Vault, which inherit from `<app>_role` (and therefore its rights on the `<app>` database). Step 4 (chart + VSO) and step 5 (Vault role `creds/<app>`) wire this up. Here, we only establish *the database and the owner role*.
## Notes / constraints
- `credentials_editor` is a single account shared by all apps, with high privileges (it can create roles). It also serves as the connection account for Vault's Postgres engine (see [step 3](03-vault-platform.md)).
- The `user_lookup()` function is essential to pgbouncer's `auth_query` mode; it is `security_definer` and executable only by `pgbouncer_auth`.
## Related
- [3. Platform Vault](03-vault-platform.md): the Vault→Postgres connection reuses `credentials_editor`; do it in parallel.
- [5. App Terraform](05-app-terraform.md): the `app_roles` module runs `GRANT <app>_role TO …`, so it **requires** `<app>_role` to already exist (created here).
- [4. Helm chart](04-helm-chart.md): where the `pgbouncer.tools` connection + dynamic creds are configured.


@@ -0,0 +1,89 @@
[Factory](../../../README.md) > [Doc](../../README.md) > [Runbooks](../README.md) > [New web application](README.md) > **3. Platform Vault**
# 3. Declare the app on the platform Vault side
> **Status:** ✅ Active
> **Upstream:** [1. Gitea repository](01-gitea-repo.md)
> **Downstream:** [5. App Terraform](05-app-terraform.md), [6. CI workflows](06-ci-workflows.md)
> **Related:** [2. Database](02-database.md) · [4. Helm chart](04-helm-chart.md) · [Naming conventions](conventions.md)
---
## Summary
Before the app's CI can manage its own secrets, Vault must know the app: it needs a **CI JWT role** (`gitea_cicd_<app>`) so the pipeline can authenticate, a **CI policy** (`<app>-ops`) that allows it to create its Postgres/K8s roles, and a **runtime policy** (`<app>`) that the pod will use. All of this is generated by a module, from a single line added on the platform side in the `tools` repository.
## Action
Add an entry for the app to the `applications` set in [`tools/hashicorp-vault/iac/terraform.tfvars`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/hashicorp-vault/iac/terraform.tfvars):
```hcl
applications = [
  { name = "webapp" },
  { name = "erp" },
  { name = "<app>" }, # ← add
  # possible options:
  # {
  #   name                       = "<app>"
  #   ops_policies               = ["factory__cf_r2_arcodange_tf"] # extra ops policies (e.g. Cloudflare token)
  #   service_account_names      = ["cloudflared"]                 # additional SAs allowed to use the runtime policy
  #   service_account_namespaces = ["tools"]                       # additional namespaces
  # },
]
```
Then push: the CI [`tools/.gitea/workflows/vault.yaml`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/.gitea/workflows/vault.yaml) applies the Terraform.
## What this creates
The [`app_policy`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/hashicorp-vault/iac/modules/app_policy/main.tf) module (called with `for_each` from [`main.tf`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/hashicorp-vault/iac/main.tf)) creates, **per app**:
| Vault resource | Name | Purpose |
|---|---|---|
| `vault_jwt_auth_backend_role` | `gitea_cicd_<app>` | **CI identity**: the app's pipeline authenticates against it (mount `gitea_jwt`, `user_claim=email`, `bound_audiences=[gitea_app_id]`) |
| `vault_policy` (ops) | `<app>-ops` | CI rights: create `postgres/roles/<app>*`, `auth/kubernetes/role/<app>*`, edit `kvv2/<app>/*`, read the google/gitea bootstrap secrets |
| `vault_identity_group` | `<app>-ops` | Vault group attaching Gitea accounts to the ops policy |
| `vault_policy` (runtime) | `<app>` | **Pod** rights: read `kvv2/data/<app>/*` and `postgres/creds/<app>*` |
Key excerpts:
```hcl
resource "vault_jwt_auth_backend_role" "gitea_jwt_cicd" {
  backend         = data.vault_auth_backend.gitea_jwt.path # "gitea_jwt"
  role_name       = "gitea_cicd_${local.name}"
  token_policies  = concat(["default"], var.ops_policies)
  bound_audiences = [var.gitea_app_id]
  user_claim      = "email"
  role_type       = "jwt"
}
resource "vault_policy" "app" { # the pod's runtime policy
  name   = local.name # = "<app>"
  policy = data.vault_policy_document.app.hcl # read kvv2/data/<app>/* + postgres/creds/<app>*
}
```
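Because every per-app name in this runbook derives from `<app>`, the naming scheme can be summarized in one small function. The Go sketch below is purely illustrative (the `AppNames` type and `namesFor` function are hypothetical and exist nowhere in the platform); the name patterns themselves are the ones quoted in steps 2 to 4.

```go
package main

import "fmt"

// AppNames gathers the per-app names that the runbook steps derive from <app>.
type AppNames struct {
	Database         string // Postgres database
	PostgresRole     string // non-login owner role
	CIJWTRole        string // Vault JWT role used by the app's CI
	OpsPolicy        string // Vault policy (and identity group) for CI
	RuntimePolicy    string // Vault policy used by the pod
	DynamicCredsPath string // Vault path for dynamic Postgres credentials
	StaticConfigPath string // Vault KV v2 path for static config
}

// namesFor applies the naming patterns to a given app name.
func namesFor(app string) AppNames {
	return AppNames{
		Database:         app,
		PostgresRole:     app + "_role",
		CIJWTRole:        "gitea_cicd_" + app,
		OpsPolicy:        app + "-ops",
		RuntimePolicy:    app,
		DynamicCredsPath: "postgres/creds/" + app,
		StaticConfigPath: "kvv2/" + app + "/config",
	}
}

func main() {
	fmt.Printf("%+v\n", namesFor("erp"))
}
```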
## Platform prerequisites (already in place)
These foundations live in [`tools/hashicorp-vault/iac/main.tf`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/hashicorp-vault/iac/main.tf) and do **not** need to be recreated:
- the `kvv2` (KV v2), `postgres` (database engine), and `transit` (VSO cache) mounts, plus the `kubernetes` auth method;
- the **Vault→Postgres connection** (`vault_database_secret_backend_connection`), which connects to `pgbouncer.tools:5432/postgres` with the `credentials_editor` account (from [step 2](02-database.md));
- `var.gitea_app_id` = the id of the Gitea OAuth2 application, set once at setup time ([`gitea_oidc_auth.yml`](https://gitea.arcodange.lab/arcodange-org/factory/src/branch/main/ansible/arcodange/factory/playbooks/tools/roles/hashicorp_vault/tasks/gitea_oidc_auth.yml)).
## Why this step comes before the app's CI
> [!IMPORTANT]
> This is where `gitea_cicd_<app>` **comes into existence**. The app's `iac/providers.tf` ([step 5](05-app-terraform.md)) and the workflow's `vault-action` step ([step 6](06-ci-workflows.md)) authenticate with this role. If it does not exist yet, the very first run of the app's CI fails at Vault authentication. **Apply this step (and [step 2](02-database.md)) before pushing the app's `iac/`.**
## Notes / constraints
- Privilege separation: the **`<app>-ops`** policy (CI, broad) is distinct from the **`<app>`** policy (runtime, read-only on its own secrets). The pod can never create roles.
- `ops_policies` grants the CI cross-cutting rights (e.g. `cms` reads a Cloudflare R2 token via `factory__cf_r2_arcodange_tf`).
## Related
- [2. Database](02-database.md): provides `credentials_editor`, reused by the Vault→Postgres connection.
- [5. App Terraform](05-app-terraform.md): authenticates with `gitea_cicd_<app>` and creates `creds/<app>` + the K8s role `<app>`.
- [6. CI workflows](06-ci-workflows.md): the `vault-action` step and `tofu apply` use `gitea_cicd_<app>`.
- [4. Helm chart](04-helm-chart.md): the pod uses the runtime policy `<app>` through its ServiceAccount.


@@ -0,0 +1,183 @@
[Factory](../../../README.md) > [Doc](../../README.md) > [Runbooks](../README.md) > [New web application](README.md) > **4. Helm chart**
# 4. The application's Helm chart
> **Status:** ✅ Active
> **Upstream:** [5. App Terraform](05-app-terraform.md) (creates the Vault roles the chart consumes)
> **Downstream:** [7. ArgoCD registration](07-argocd-register.md) (deploys this chart)
> **Related:** [2. Database](02-database.md) · [3. Platform Vault](03-vault-platform.md) · [6. CI workflows](06-ci-workflows.md) · [Naming conventions](conventions.md)
---
## Summary
The `chart/` folder is the **deployment unit**: it is what ArgoCD applies (`path: chart`). It describes the Deployment, the Service, the Ingress, and above all the secret wiring through the **Vault Secrets Operator (VSO)**: no plaintext password anywhere, and **dynamic** Postgres credentials injected at runtime. Start from `helm create <app>`, then add the Vault CRDs and the ConfigMap, and adjust the ingress.
## Chart structure
```
chart/
├── Chart.yaml                  # name: <app>, version, appVersion (= default image tag)
├── values.yaml                 # image, service, ingress, serviceAccount, autoscaling…
└── templates/
    ├── deployment.yaml         # consumes vso-db-credentials + secretkv
    ├── service.yaml            # ClusterIP
    ├── ingress.yaml            # Traefik (see patterns below)
    ├── serviceaccount.yaml     # SA <app> (serviceAccount.create: true)
    ├── config.yaml             # ConfigMap: non-secret env (DB host = pgbouncer)
    ├── vaultauth.yaml          # VaultAuth: SA <app> ↔ Vault role <app>
    ├── vaultdynamicsecret.yaml # dynamic Postgres creds (postgres/creds/<app>)
    ├── vaultsecret.yaml        # static config (kvv2/<app>/config)
    ├── hpa.yaml                # disabled by default
    ├── _helpers.tpl            # name/fullname/labels
    └── NOTES.txt
```
> [!TIP]
> Bootstrap: `helm create <app>` generates deployment/service/ingress/serviceaccount/hpa/_helpers/NOTES. What remains is to **add** `config.yaml` (ConfigMap) and the 3 Vault CRDs (`vaultauth`, `vaultdynamicsecret`, `vaultsecret`), and to **adjust** `values.yaml` (image, ingress) + `deployment.yaml` (envFrom). Copying the ones from [`erp/chart`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart) or [`webapp/chart`](https://gitea.arcodange.lab/arcodange-org/webapp/src/branch/main/chart) is the fastest route.
## Database connection: through pgbouncer, never direct
The ConfigMap points to **`pgbouncer.tools`** (the pooler in the `tools` namespace), port 5432, database `<app>`. No hard-coded Postgres host, no static user.
```yaml
# chart/templates/config.yaml (erp example)
data:
  DOLI_DB_TYPE: pgsql
  DOLI_DB_HOST: pgbouncer.tools # ← the pooler, not postgres directly
  DOLI_DB_HOST_PORT: !!str 5432
  DOLI_DB_NAME: erp # = <app>
```
```yaml
# chart/templates/config.yaml (webapp example, connection string)
data:
  DATABASE_URL: postgres://pgbouncer_auth:pgbouncer_auth@pgbouncer.tools/postgres?sslmode=disable
```
The **real** user/password comes from the K8s Secret `vso-db-credentials` (see below), not from the ConfigMap.
## Secrets through VSO (the core)
Three Vault Secrets Operator CRDs (excerpts from the `erp` chart):
```yaml
# vaultauth.yaml: authenticates the pod to Vault
apiVersion: secrets.hashicorp.com/v1beta1
kind: VaultAuth
spec:
  method: kubernetes
  mount: kubernetes
  kubernetes:
    role: erp # = Vault K8s role <app> (created in step 5)
    serviceAccount: {{ include "erp.serviceAccountName" . }}
    audiences: [vault]
```
```yaml
# vaultdynamicsecret.yaml: dynamic Postgres credentials
apiVersion: secrets.hashicorp.com/v1beta1
kind: VaultDynamicSecret
spec:
  mount: postgres
  path: creds/erp # = postgres/creds/<app>
  destination:
    create: true
    name: vso-db-credentials # K8s Secret consumed by the Deployment
  rolloutRestartTargets:
    - kind: Deployment
      name: {{ include "erp.fullname" . }} # restarts the pod on every rotation
  vaultAuthRef: auth
```
```yaml
# vaultsecret.yaml: static config (initial admin passwords, etc.)
apiVersion: secrets.hashicorp.com/v1beta1
kind: VaultStaticSecret
spec:
  type: kv-v2
  mount: kvv2
  path: erp/config # = kvv2/<app>/config (populated in step 5)
  destination:
    name: secretkv
    create: true
  refreshAfter: 24h
  vaultAuthRef: auth
```
The Deployment then consumes `vso-db-credentials` (keys `username`/`password`) and `secretkv` (via `envFrom`/`secretKeyRef`).
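To make the split between ConfigMap and Secret concrete, here is a minimal sketch of how an application could assemble its connection string at startup. It assumes the Deployment maps the `username`/`password` keys of `vso-db-credentials` to environment variables named `DB_USERNAME` and `DB_PASSWORD`; those two names, and the function itself, are hypothetical and not taken from the `erp` or `webapp` charts.

```python
from urllib.parse import quote


def build_database_url(env, host="pgbouncer.tools", port=5432, database="erp"):
    """Assemble a Postgres URL from credentials injected into the pod.

    DB_USERNAME / DB_PASSWORD are hypothetical variable names: they stand for
    the `username` / `password` keys of the `vso-db-credentials` Secret.
    The host and port are the non-secret values carried by the ConfigMap.
    """
    # Percent-encode both parts so characters such as '@' or '/' in a
    # generated password cannot break the URL structure.
    user = quote(env["DB_USERNAME"], safe="")
    password = quote(env["DB_PASSWORD"], safe="")
    return f"postgres://{user}:{password}@{host}:{port}/{database}"


if __name__ == "__main__":
    sample_env = {"DB_USERNAME": "v-kube-erp", "DB_PASSWORD": "p@ss/word"}
    print(build_database_url(sample_env))
```

Because the pod is restarted on every rotation (`rolloutRestartTargets`), reading the credentials once at startup is enough in this sketch.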
### Runtime flow
```mermaid
%%{init: {'theme': 'base'}}%%
flowchart LR
classDef pod fill:#b45309,stroke:#92400e,color:#fff
classDef vault fill:#7c3aed,stroke:#6d28d9,color:#fff
classDef db fill:#2563eb,stroke:#1e40af,color:#fff
SA["Pod · SA app"]:::pod
VA["VaultAuth<br>role app (k8s)"]:::vault
VDS["VaultDynamicSecret<br>postgres/creds/app"]:::vault
VSS["VaultStaticSecret<br>kvv2/app/config"]:::vault
SEC["Secret K8s<br>vso-db-credentials + secretkv"]:::pod
PGB["pgbouncer.tools:5432"]:::db
DB["database app<br>(dynamic user → app_role)"]:::db
SA --> VA
VA --> VDS
VA --> VSS
VDS --> SEC
VSS --> SEC
SEC --> SA
SA --> PGB --> DB
```
> [!IMPORTANT]
> `serviceAccount.create` must be `true` in `values.yaml`: this `<app>` SA is what `VaultAuth` binds to the Vault role `<app>`. Without it there is no authentication, and therefore no DB credentials.
## Ingress: internal `.lab` vs public `.fr`
Two patterns depending on the desired exposure (Traefik annotations in `values.yaml`):
| | Internal (`.lab`), e.g. `erp` | Public (`.fr`), e.g. `webapp` |
|---|---|---|
| `entrypoints` | `websecure` | `web` |
| TLS | `router.tls: "true"` + `certresolver: letsencrypt` | (terminated upstream) |
| Middleware | `localIp@file` (restricted to the LAN) | `kube-system-crowdsec@kubernetescrd` (WAF) |
| `nodeSelector` | (none) | `kubernetes.io/hostname: pi1` (keeps the source IP, network entry point) |
| Host | `<app>.arcodange.lab` | `<app>.arcodange.fr` |
```yaml
# values.yaml: internal .lab ingress (erp excerpt)
ingress:
  enabled: true
  annotations:
    traefik.ingress.kubernetes.io/router.entrypoints: websecure
    traefik.ingress.kubernetes.io/router.tls: "true"
    traefik.ingress.kubernetes.io/router.tls.certresolver: letsencrypt
    traefik.ingress.kubernetes.io/router.middlewares: localIp@file
  hosts:
    - host: erp.arcodange.lab
      paths: [{ path: /, pathType: Prefix }]
```
## Image
| Case | `image.repository` | Example |
|---|---|---|
| **Home-built** image (built in the repository) | `gitea.arcodange.lab/arcodange-org/<app>` | `webapp` (see [step 6](06-ci-workflows.md)) |
| **Public** image | the upstream image | `erp` → `dolibarr/dolibarr` |
## Notes / constraints
- If the app stores files, add a `pvc.yaml` (`storageClassName: longhorn`, `accessModes: [ReadWriteMany]`, annotation `helm.sh/resource-policy: keep` so it survives a `helm uninstall`); see `erp/chart/templates/pvc.yaml`.
- The VSO CRDs assume that VSO runs in the cluster (deployed by `tools`) and that the K8s role `<app>` and the `creds/<app>` role exist (step 5).
## Related
- [5. App Terraform](05-app-terraform.md): creates `postgres/creds/<app>` and the K8s role `<app>`, and populates `kvv2/<app>/config`, all of which these CRDs consume.
- [2. Database](02-database.md): the `<app>` database and `pgbouncer.tools` targeted here.
- [3. Vault platform](03-vault-platform.md): the runtime policy `<app>` used by `VaultAuth`.
- [6. CI workflows](06-ci-workflows.md): builds the image referenced by `image.repository`.
- [Authoritative VSO/secrets reference](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/hashicorp-vault/iac/modules/README.md): detailed explanation of VaultConnection/VaultAuth/VaultDynamicSecret on the `tools` side (not to be duplicated here).


@@ -0,0 +1,96 @@
[Factory](../../../README.md) > [Doc](../../README.md) > [Runbooks](../README.md) > [Nouvelle application web](README.md) > **5. Terraform de l'app**
# 5. The application's Terraform
> **Status:** ✅ Active
> **Upstream:** [2. Database](02-database.md), [3. Vault platform](03-vault-platform.md)
> **Downstream:** [4. Helm chart](04-helm-chart.md), [6. CI workflows](06-ci-workflows.md)
> **Related:** [Naming conventions](conventions.md)
---
## Summary
The repository's `iac/` directory declares the Vault resources **specific to the app**: a dynamic Postgres role (`postgres/creds/<app>`), a Kubernetes authentication role (`<app>`), and the KV config secrets. Most of the work is done by a **shared module** (`app_roles`, in `tools`); the repository only calls it with its name and adds its own secrets.
## The three files
### `providers.tf`: authenticate to Vault with the app's CI role
```hcl
terraform {
  required_providers {
    vault = { source = "vault", version = "4.4.0" }
  }
}
provider "vault" {
  address = "https://vault.arcodange.lab"
  auth_login_jwt { # JWT supplied by CI through TERRAFORM_VAULT_AUTH_JWT
    mount = "gitea_jwt"
    role  = "gitea_cicd_<app>" # ← created in step 3; MUST exist before the first apply
  }
}
```
### `backend.tf`: remote state on GCS, one prefix per app
```hcl
terraform {
  backend "gcs" {
    bucket = "arcodange-tf"
    prefix = "<app>/main" # ← a dedicated state prefix per app
  }
}
```
### `main.tf`: call the shared module and add the app's secrets
```hcl
module "app_roles" {
  source = "git::ssh://git@192.168.1.202:2222/arcodange-org/tools.git//hashicorp-vault/iac/modules/app_roles?depth=1&ref=main"
  name   = "<app>"
  # database = "<other>" # optional; defaults to name
}
# Example: the app's static config secrets, written to kvv2/<app>/config
resource "vault_kv_secret_v2" "config" {
  mount = module.app_roles.mount_paths.kvv2 # "kvv2"
  name  = format("%sconfig", module.app_roles.kvv2_path_prefix) # "<app>/config"
  data_json = jsonencode({
    # … app-specific keys (e.g. erp: DOLI_ADMIN_LOGIN/PASSWORD, DOLI_INSTANCE_UNIQUE_ID) …
  })
}
```
## What the `app_roles` module creates
[`modules/app_roles/main.tf`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/hashicorp-vault/iac/modules/app_roles/main.tf):
| Resource | Effect |
|---|---|
| `vault_database_secret_backend_role` → `postgres/creds/<app>` | On each request: `CREATE ROLE "…" LOGIN PASSWORD … VALID UNTIL … ; GRANT <app>_role TO "…"`. The ephemeral user **inherits** from `<app>_role` (and therefore its rights on the database). On revocation: `REASSIGN OWNED … TO <app>_role` + `REVOKE`. |
| `vault_kubernetes_auth_backend_role` → `<app>` | Binds the `<app>` SA of the `<app>` namespace to the `default` + `<app>` policies (TTL 3600 s). This is what `VaultAuth` targets (step 4). |
Useful outputs: `mount_paths` (`{k8s, pg, kvv2}`), `kvv2_path_prefix` (`<app>/`), `name`, `database`.
## Dependencies (must be respected)
> [!IMPORTANT]
> This Terraform **assumes** that two things already exist:
> - the Postgres role `<app>_role` (created in [step 2](02-database.md)); otherwise the dynamic role's `GRANT <app>_role TO …` is invalid;
> - the JWT role `gitea_cicd_<app>` and the `<app>` policy (created in [step 3](03-vault-platform.md)); otherwise the provider authentication fails, or the K8s role cannot reference the policy.
>
> This `iac/` is applied by the app's CI, see [step 6](06-ci-workflows.md). Do not push this directory before steps 2 and 3 have been applied.
## Notes / constraints
- The module is fetched **over SSH** through the `tofu_module_reader` bot (see [step 1](01-gitea-repo.md)); `?ref=main&depth=1` selects the `main` branch and keeps the clone shallow.
- State is isolated by `prefix = "<app>/main"`: no collisions between apps in the `arcodange-tf` bucket.
- `erp` and `webapp` show two variants: `erp` goes through `module "app_roles"`, while `webapp` still inlines the resources (`vault_database_secret_backend_role` + `vault_kubernetes_auth_backend_role`). Prefer the module for a new app.
## Related
- [2. Database](02-database.md): provides `<app>_role`.
- [3. Vault platform](03-vault-platform.md): provides `gitea_cicd_<app>` and the `<app>` policy.
- [4. Helm chart](04-helm-chart.md): consumes `postgres/creds/<app>`, the K8s role `<app>` and `kvv2/<app>/config`.
- [6. CI workflows](06-ci-workflows.md): applies this `iac/`.


@@ -0,0 +1,108 @@
[Factory](../../../README.md) > [Doc](../../README.md) > [Runbooks](../README.md) > [Nouvelle application web](README.md) > **6. Workflows CI**
# 6. CI workflows (`.gitea/workflows/`)
> **Status:** ✅ Active
> **Upstream:** [1. Gitea repository](01-gitea-repo.md) (org secrets), [3. Vault platform](03-vault-platform.md) (`gitea_cicd_<app>`)
> **Related:** [4. Helm chart](04-helm-chart.md) · [5. App Terraform](05-app-terraform.md) · [7. ArgoCD registration](07-argocd-register.md) · [Naming conventions](conventions.md)
---
## Summary
Two Gitea Actions workflows live in the repository: **`vault.yaml`** applies the app's Terraform (`iac/`) after authenticating to Vault through OIDC, and **`dockerimage.yaml`** (optional) builds the image and pushes it to the Gitea registry. The deployment itself is not part of CI: ArgoCD handles it ([step 7](07-argocd-register.md)).
## `vault.yaml`: apply the app's `iac/`
Triggered by any change to `iac/*.tf`. Two jobs: obtain a JWT from Gitea, then `tofu apply`.
```yaml
on:
  workflow_dispatch: {}
  push: { paths: ['iac/*.tf'] }
  pull_request: { paths: ['iac/*.tf'] }
# job 1: Gitea OIDC exchange → JWT (base64 script supplied as an org secret)
#   run: echo -n "${{ secrets.vault_oauth__sh_b64 }}" | base64 -d | bash
# job 2: read the bootstrap secrets, then apply
- name: read vault secret
  uses: https://gitea.arcodange.lab/arcodange-org/vault-action.git@main
  with:
    url: https://vault.arcodange.lab
    caCertificate: ${{ secrets.HOMELAB_CA_CERT }}
    jwtGiteaOIDC: ${{ needs.gitea_vault_auth.outputs.gitea_vault_jwt }}
    role: gitea_cicd_<app> # ← the app's JWT role (step 3)
    method: jwt
    path: gitea_jwt
    secrets: |
      kvv1/google/credentials credentials | GOOGLE_BACKEND_CREDENTIALS ;
      kvv1/gitea/tofu_module_reader ssh_private_key | TERRAFORM_SSH_KEY ;
- uses: actions/checkout@v4
- name: terraform apply
  uses: dflook/terraform-apply@v1
  with: { path: iac, auto_approve: true }
```
The two secrets read here serve the backend (GCS key `GOOGLE_BACKEND_CREDENTIALS`) and the SSH clone of the shared module (`TERRAFORM_SSH_KEY`, see [step 5](05-app-terraform.md)).
> [!WARNING]
> **The copy-pasted `role:` trap.** The `role:` of the `vault-action` step **and** the `role` in `iac/providers.tf` must both be `gitea_cicd_<app>`. The `erp` example still carries `role: gitea_cicd_webapp` in its [`vault.yaml`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/.gitea/workflows/vault.yaml) (a copy-paste leftover) even though its `providers.tf` correctly uses `gitea_cicd_erp`. Check and align on the name of **your** app, otherwise CI reads/writes with the wrong identity.
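This mismatch is easy to catch mechanically. The sketch below is a hypothetical lint, not something that exists in the repositories: it extracts the first `role` value from the text of `vault.yaml` and of `providers.tf` with regular expressions and compares both to `gitea_cicd_<app>`. It assumes each file contains a single relevant `role` entry, which holds for the excerpts shown here but may not hold for a larger workflow.

```python
import re


def check_ci_roles(app, workflow_yaml, providers_tf):
    """Return a list of problems; an empty list means both roles are aligned."""
    expected = f"gitea_cicd_{app}"
    found = {
        # YAML form:  role: gitea_cicd_<app>
        "vault.yaml": re.search(r"^\s*role:\s*([\w-]+)", workflow_yaml, re.MULTILINE),
        # HCL form:   role = "gitea_cicd_<app>"
        "providers.tf": re.search(r'^\s*role\s*=\s*"([^"]+)"', providers_tf, re.MULTILINE),
    }
    problems = []
    for source, match in found.items():
        actual = match.group(1) if match else None
        if actual != expected:
            problems.append(f"{source}: expected {expected}, found {actual}")
    return problems


if __name__ == "__main__":
    workflow = "with:\n  role: gitea_cicd_webapp # leftover\n  method: jwt\n"
    providers = 'auth_login_jwt {\n  mount = "gitea_jwt"\n  role  = "gitea_cicd_erp"\n}\n'
    for problem in check_ci_roles("erp", workflow, providers):
        print(problem)
```

Run against the `erp` situation described above, the check reports the workflow file and leaves `providers.tf` alone.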
## `dockerimage.yaml`: build the image (for a home-built image)
Add it **only** if the app builds its own image (not for a public image such as `erp`/Dolibarr). Triggered on push to `main`, ignoring `README.md` and `chart/**` (changing the chart does not rebuild the image).
```yaml
on:
  push: { branches: [main], paths-ignore: ['README.md', 'chart/**'] }
jobs:
  build-and-push-image:
    steps:
      - uses: docker/login-action@v3
        with:
          registry: gitea.arcodange.lab
          username: ${{ github.actor }}
          password: ${{ secrets.PACKAGES_TOKEN }} # org secret (step 1)
      - uses: actions/checkout@v4
      - run: |
          TAGS="latest ${{ github.ref_name }}"
          docker build -t app .
          for TAG in $TAGS; do
            docker tag app gitea.arcodange.lab/${{ github.repository }}:$TAG
            docker push gitea.arcodange.lab/${{ github.repository }}:$TAG
          done
```
The image therefore lands at `gitea.arcodange.lab/arcodange-org/<app>:latest`, exactly what the chart's `image.repository` references ([step 4](04-helm-chart.md)). A multi-stage `Dockerfile` at the repository root works well (see [`webapp/Dockerfile`](https://gitea.arcodange.lab/arcodange-org/webapp/src/branch/main/Dockerfile)).
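The tagging loop of the workflow pushes two references per build, not one. As a small illustration, the function below reproduces that loop in Python; `image_refs` is a hypothetical helper, and the sample values stand in for `github.repository` and `github.ref_name`.

```python
def image_refs(repository, ref_name, registry="gitea.arcodange.lab"):
    """List the image references pushed by one run of the workflow loop."""
    # Same tag list as the workflow: TAGS="latest <ref_name>"
    tags = ["latest", ref_name]
    return [f"{registry}/{repository}:{tag}" for tag in tags]


if __name__ == "__main__":
    for ref in image_refs("arcodange-org/webapp", "main"):
        print(ref)
```

Since the workflow only triggers on `main`, the second tag is `main` in practice; `latest` is the one watched for automatic redeployment.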
## Overview
```mermaid
%%{init: {'theme': 'base'}}%%
flowchart TB
classDef ci fill:#059669,stroke:#047857,color:#fff
classDef out fill:#7c3aed,stroke:#6d28d9,color:#fff
PUSH["push to main"]:::ci
PUSH -->|"iac/*.tf changed"| TF["vault.yaml<br>OIDC → JWT → tofu apply iac/"]:::ci
PUSH -->|"code changed"| IMG["dockerimage.yaml<br>build + push image"]:::ci
TF --> VR["creds/app + K8s role app + KV"]:::out
IMG --> REG["registry gitea.arcodange.lab/arcodange-org/app"]:::out
```
## Automatic deployment on a new image
To have ArgoCD redeploy when a new image is pushed, **nothing** is added to CI: the `argocd-image-updater` annotations set in [step 7](07-argocd-register.md) (`digest` strategy) watch the `latest` tag.
## Notes / constraints
- `concurrency: cancel-in-progress` is enabled on both workflows: a new push cancels the previous run on the same ref.
- `vault-action` is itself a Gitea repository (`arcodange-org/vault-action`), referenced at `@main`.
## Related
- [3. Vault platform](03-vault-platform.md): where `gitea_cicd_<app>` comes from.
- [5. App Terraform](05-app-terraform.md): what `vault.yaml` applies.
- [4. Helm chart](04-helm-chart.md): `image.repository` = the image pushed here.
- [7. ArgoCD registration](07-argocd-register.md): deploys, and carries the auto-update annotations.


@@ -0,0 +1,91 @@
[Factory](../../../README.md) > [Doc](../../README.md) > [Runbooks](../README.md) > [Nouvelle application web](README.md) > **7. Enregistrement ArgoCD**
# 7. Register the app in ArgoCD
> **Status:** ✅ Active
> **Upstream:** [4. Helm chart](04-helm-chart.md), [6. CI workflows](06-ci-workflows.md)
> **Related:** [Naming conventions](conventions.md) · [Checklist](08-checklist.md)
---
## Summary
ArgoCD runs in **app-of-apps** mode: the `factory/argocd` chart reads a list of applications from its `values.yaml` and generates one `Application` resource per entry. Registering the app means adding its name to that list. ArgoCD then takes care of cloning the repository, deploying `chart/` into the `<app>` namespace, and resyncing on every push.
## Action
Add `<app>` under `gitea_applications` in [`factory/argocd/values.yaml`](https://gitea.arcodange.lab/arcodange-org/factory/src/branch/main/argocd/values.yaml):
```yaml
gitea_applications:
  webapp: { … }
  erp:
    annotations: {}
  <app>: # ← simple case
    annotations: {}
```
Variant with **automatic deployment** on a new image (recommended for a home-built image):
```yaml
<app>:
  annotations:
    argocd-image-updater.argoproj.io/image-list: <app>=gitea.arcodange.lab/arcodange-org/<app>:latest
    argocd-image-updater.argoproj.io/<app>.update-strategy: digest
```
Additional options:
| Field | When to use it | Effect |
|---|---|---|
| `org: arcodange` | repository outside `arcodange-org` | changes the `repoURL` (default `arcodange-org`) |
| `syncPolicy: …` | manual control | overrides the policy (default: `automated {prune, selfHeal}`) |
## What it generates
The [`argocd/templates/apps.yaml`](https://gitea.arcodange.lab/arcodange-org/factory/src/branch/main/argocd/templates/apps.yaml) template renders, for each entry:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: <app>
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://gitea.arcodange.lab/arcodange-org/<app> # <org> = arcodange-org by default
    targetRevision: HEAD
    path: chart # ← hence the requirement for a chart/ directory
  destination:
    server: https://kubernetes.default.svc
    namespace: <app> # = app name
  syncPolicy:
    automated: { prune: true, selfHeal: true }
    syncOptions: [ CreateNamespace=true ] # the namespace is created automatically
```
```mermaid
%%{init: {'theme': 'base'}}%%
flowchart LR
classDef gitops fill:#7c3aed,stroke:#6d28d9,color:#fff
classDef k8s fill:#2563eb,stroke:#1e40af,color:#fff
VAL["values.yaml<br>gitea_applications.app"]:::gitops --> APP["Application ArgoCD<br>app"]:::gitops
APP -->|"path: chart, HEAD"| SYNC["repository sync<br>arcodange-org/app"]:::gitops
SYNC --> NS["namespace app<br>(CreateNamespace=true)"]:::k8s
NS --> DEP["Deployment + Service + Ingress + CRD Vault"]:::k8s
```
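The mapping from one `gitea_applications` entry to one `Application` can be sketched as a plain function. This is an illustration of the rendering described above, not the actual Helm template: where the entry's `annotations` end up and whether a per-entry `syncPolicy` replaces the default wholesale are assumptions, flagged in the comments.

```python
def render_application(name, entry=None):
    """Build the Application manifest (as a dict) for one values.yaml entry."""
    entry = entry or {}
    org = entry.get("org", "arcodange-org")  # default organization
    default_sync = {
        "automated": {"prune": True, "selfHeal": True},
        "syncOptions": ["CreateNamespace=true"],
    }
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        "metadata": {
            "name": name,
            "namespace": "argocd",
            # Assumption: entry annotations are copied onto the Application.
            "annotations": entry.get("annotations", {}),
        },
        "spec": {
            "project": "default",
            "source": {
                "repoURL": f"https://gitea.arcodange.lab/{org}/{name}",
                "targetRevision": "HEAD",
                "path": "chart",  # fixed, not configurable per entry
            },
            "destination": {
                "server": "https://kubernetes.default.svc",
                "namespace": name,  # derived from the app name
            },
            # Assumption: a per-entry syncPolicy replaces the default entirely.
            "syncPolicy": entry.get("syncPolicy", default_sync),
        },
    }


if __name__ == "__main__":
    app = render_application("erp", {"annotations": {}})
    print(app["spec"]["source"]["repoURL"])
```

Note that only `org`, `annotations` and `syncPolicy` vary per entry in this sketch; the name drives everything else.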
## Notes / constraints
> [!IMPORTANT]
> `path: chart` and `namespace: <app>` are **derived from the name**, not configurable per entry. This is why the directory must be named `chart/` ([step 1](01-gitea-repo.md)) and the name must be consistent everywhere ([conventions](conventions.md)).
- The `factory/argocd` chart is itself reconciled by ArgoCD (root app-of-apps): committing `values.yaml` to `main` is enough to make the new `Application` appear and sync. No manual `kubectl apply`.
- `prune: true` + `selfHeal: true`: ArgoCD deletes whatever is no longer in the chart and overwrites manual drift. Keep this in mind before any `kubectl edit`.
## Related
- [4. Helm chart](04-helm-chart.md): the deployed content (the `chart/` directory).
- [6. CI workflows](06-ci-workflows.md): the `argocd-image-updater` annotations work together with the image pushed there.
- [8. Checklist](08-checklist.md): verify that the `Application` reaches `Healthy`/`Synced`.


@@ -0,0 +1,77 @@
[Factory](../../../README.md) > [Doc](../../README.md) > [Runbooks](../README.md) > [Nouvelle application web](README.md) > **8. Checklist**
# 8. Checklist & Definition of Done
> **Status:** ✅ Active
> **Upstream:** [7. ArgoCD registration](07-argocd-register.md)
> **Related:** [Runbook README](README.md) · [Naming conventions](conventions.md)
---
## Summary
Printable recap of the whole procedure, in dependency order. Tick items off as you go; each line links back to its detailed page.
## Dependency order
```
[01] Gitea repo ──┬──> [02] database + <app>_role ──┐
                  └──> [03] gitea_cicd_<app> ────┤ (02 and 03 before 05)
[04+05+06] chart/ + iac/ + .gitea/ ── push → CI
[07] ArgoCD ── deploys ── Runtime
```
## Checklist
**Preparation**
- [ ] Name `<app>` chosen (lowercase kebab-case), consistent everywhere ([conventions](conventions.md))
**1 · Repository** ([details](01-gitea-repo.md))
- [ ] Repository `arcodange-org/<app>` created (inherits `HOMELAB_CA_CERT`, `vault_oauth__sh_b64`, `PACKAGES_TOKEN`)
- [ ] Skeleton in place: `chart/`, `iac/`, `.gitea/workflows/`, `.gitignore` (+ `Dockerfile` for a home-built image)
**2 · Database** ([details](02-database.md))
- [ ] `"<app>"` added to `factory/postgres/iac/terraform.tfvars`
- [ ] `factory` "Postgres" CI green → database `<app>` + role `<app>_role` created
**3 · Vault platform** ([details](03-vault-platform.md))
- [ ] `{ name = "<app>" }` added to `tools/hashicorp-vault/iac/terraform.tfvars`
- [ ] `tools` "Vault" CI green → `gitea_cicd_<app>`, policies `<app>` / `<app>-ops` created
> [!IMPORTANT]
> Do not push the app's `iac/` (steps 5/6) until **2 and 3** have been applied: the app's Terraform CI depends on them.
**4 · Helm chart** ([details](04-helm-chart.md))
- [ ] `Chart.yaml` (`name: <app>`), `values.yaml` (image, ingress, `serviceAccount.create: true`)
- [ ] `config.yaml` points to `pgbouncer.tools` / database `<app>`
- [ ] `vaultauth.yaml` (role `<app>`), `vaultdynamicsecret.yaml` (`creds/<app>`), `vaultsecret.yaml` (`kvv2/<app>/config`)
- [ ] Ingress chosen: internal `.lab` (websecure + letsencrypt + `localIp@file`) or public `.fr` (web + crowdsec + `nodeSelector pi1`)
**5 · App Terraform** ([details](05-app-terraform.md))
- [ ] `providers.tf`: role `gitea_cicd_<app>`
- [ ] `backend.tf`: prefix `<app>/main`
- [ ] `main.tf`: `module "app_roles"` (`name = "<app>"`) + secrets `kvv2/<app>/config`
**6 · CI workflows** ([details](06-ci-workflows.md))
- [ ] `vault.yaml`: `role: gitea_cicd_<app>` **aligned** (not a copy-paste leftover) ⚠️
- [ ] `dockerimage.yaml` + `Dockerfile` (for a home-built image) → push `gitea.arcodange.lab/arcodange-org/<app>`
- [ ] Push → "vault" CI green (`creds/<app>` + K8s role `<app>` + KV created), "image" CI green
**7 · ArgoCD** ([details](07-argocd-register.md))
- [ ] `<app>` added under `gitea_applications` in `factory/argocd/values.yaml` (+ image-updater annotations if wanted)
- [ ] Commit on `main` → `Application` `<app>` appears in ArgoCD
## Definition of Done
- [ ] ArgoCD `Application` `<app>` = **Synced** + **Healthy**
- [ ] Pod `Running` in the `<app>` namespace (SA `<app>`)
- [ ] K8s Secret `vso-db-credentials` present and **rotated** by VSO (TTL ~1 h); the pod restarts on rotation
- [ ] The app responds on its ingress (`<app>.arcodange.lab` or `<app>.arcodange.fr`)
- [ ] DB connection OK through `pgbouncer.tools` with a dynamic user inheriting from `<app>_role`
## Related
- [Runbook README](README.md): overview + end-to-end map.
- [Naming conventions](conventions.md): consistency of the `<app>` name, the source of most failures.


@@ -0,0 +1,104 @@
[Factory](../../../README.md) > [Doc](../../README.md) > [Runbooks](../README.md) > **Nouvelle application web**
# Bringing a new web application into service
> **Last Updated:** 2026-05-31
> **Status:** ✅ Current procedure
> **Related:** [Naming conventions](conventions.md) · [Checklist](08-checklist.md) · [ADR CI/CD](../../adr/03_cicd_gitea_action_argocd.md) · [ADR Vault](../../adr/04_tool_hashicorp_vault.md)
## What is it?
This runbook describes, from scratch, how to bring a new web application to life on the Arcodange platform. The pattern is **GitOps**: the app lives in its own Gitea repository, its database and Vault access are provisioned by Terraform/OpenTofu, and **ArgoCD** deploys its Helm chart into a dedicated namespace. Postgres credentials are never written in plaintext: Vault generates them on the fly and the **Vault Secrets Operator (VSO)** injects them into the pod.
The machinery is spread across **three repositories**: the platform repository [`factory`](https://gitea.arcodange.lab/arcodange-org/factory), the shared-services repository [`tools`](https://gitea.arcodange.lab/arcodange-org/tools), and the **new app repository**, with strict **ordering dependencies** (see below). The reference examples are [`erp`](https://gitea.arcodange.lab/arcodange-org/erp) (public image + DB) and [`webapp`](https://gitea.arcodange.lab/arcodange-org/webapp) (home-built image + DB).
**Used to:**
1. Create a Gitea repository and its skeleton (`chart/`, `iac/`, `.gitea/workflows/`).
2. Provision the database, its owner role, and the Vault access (static + dynamic).
3. Deploy the app through ArgoCD and expose it behind Traefik (internal `.lab` or public `.fr` + CrowdSec).
## End-to-end map
```mermaid
%%{init: {'theme': 'base'}}%%
flowchart TB
classDef step fill:#2563eb,stroke:#1e40af,color:#fff
classDef plat fill:#059669,stroke:#047857,color:#fff
classDef gitops fill:#7c3aed,stroke:#6d28d9,color:#fff
classDef run fill:#b45309,stroke:#92400e,color:#fff
REPO["1 · Gitea repo<br>arcodange-org/app"]:::step
DB["2 · factory/postgres/iac<br>database app + role app_role"]:::plat
VAULT["3 · tools/hashicorp-vault/iac<br>gitea_cicd_app + policies app / app-ops"]:::plat
CONTENT["4·5·6 · chart/ + iac/ + .gitea/<br>push → CI build image &amp; tofu apply"]:::step
ARGO["7 · factory/argocd/values.yaml<br>→ Application ArgoCD (ns app)"]:::gitops
POD["Runtime · Pod(SA app) → VSO → Vault<br>creds/app → pgbouncer.tools → database app"]:::run
REPO --> DB
REPO --> VAULT
DB --> CONTENT
VAULT --> CONTENT
CONTENT --> ARGO
ARGO --> POD
```
## Order of operations (the most important point)
> [!IMPORTANT]
> The steps are not interchangeable. The CI JWT role `gitea_cicd_<app>` (step 3) and the Postgres role `<app>_role` (step 2) must **exist before** the app's Terraform CI (step 6 applying step 5) runs. Otherwise the CI's Vault authentication fails, or the `app_roles` module has no `<app>_role` to attach the dynamic credentials to.
```
[01] Gitea repo under arcodange-org (inherits the org CI secrets)
├──> [02] factory/postgres/iac → database <app> + <app>_role + user_lookup()
└──> [03] tools/hashicorp-vault/iac → gitea_cicd_<app> (CI JWT) + policies <app> / <app>-ops
│ (02 and 03 are independent of each other, but BOTH come before 05)
[04+05+06] Repository content: chart/ + iac/ + .gitea/workflows/ (+ Dockerfile)
│ push → "dockerimage" CI builds the image · "vault" CI applies iac/
│ → creds/<app> (dynamic DB role) + K8s role <app> + KV secrets
[07] factory/argocd/values.yaml → ArgoCD creates the Application → deploys the chart into the <app> namespace
Runtime: Pod(SA <app>) → VSO → VaultAuth(role <app>) → creds/<app>
→ dynamic PG user inheriting from <app>_role → pgbouncer.tools → database <app>
```
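The ordering above is a small dependency graph, and any valid execution order is a topological sort of it. The sketch below encodes the graph with Python's standard `graphlib` module; the step labels are illustrative names chosen for this example, and steps 04 to 06 are grouped as one node because they are pushed together.

```python
from graphlib import TopologicalSorter

# Each key depends on the steps listed in its value (its predecessors).
STEP_DEPENDENCIES = {
    "01-gitea-repo": set(),
    "02-database": {"01-gitea-repo"},
    "03-vault-platform": {"01-gitea-repo"},
    # chart/ + iac/ + .gitea/ : needs BOTH 02 and 03 before the first push
    "04-05-06-repo-content": {"02-database", "03-vault-platform"},
    "07-argocd-register": {"04-05-06-repo-content"},
}


def valid_order():
    """Return one execution order that respects every dependency."""
    return list(TopologicalSorter(STEP_DEPENDENCIES).static_order())


if __name__ == "__main__":
    print(" -> ".join(valid_order()))
```

Steps 02 and 03 may appear in either order in the result, which matches the note that they are independent of each other.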
## Platform prerequisites (already in place)
These foundations exist and do **not** need to be redone for each app:
| Building block | Where | Role |
|---|---|---|
| Vault mounts `kvv2`, `postgres`, `transit`, auth `kubernetes` | [`tools/hashicorp-vault/iac/main.tf`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/hashicorp-vault/iac/main.tf) | Secret engines + K8s auth |
| Vault→Postgres connection (through `pgbouncer.tools`, user `credentials_editor`) | same | Lets Vault issue dynamic PG users |
| Bootstrap JWT role `gitea_cicd` + Gitea OAuth2 app (`gitea_app_id`) | [`gitea_oidc_auth.yml`](https://gitea.arcodange.lab/arcodange-org/factory/src/branch/main/ansible/arcodange/factory/playbooks/tools/roles/hashicorp_vault/tasks/gitea_oidc_auth.yml) | Gitea OIDC → Vault JWT exchange in CI |
| `tofu_module_reader` bot (SSH key in `kvv1/gitea/tofu_module_reader`) | [`factory/iac/gitea_tofu_ci_user.tf`](https://gitea.arcodange.lab/arcodange-org/factory/src/branch/main/iac/gitea_tofu_ci_user.tf) | Lets CI clone the shared `tools` module over SSH |
| Organization Actions secrets (`HOMELAB_CA_CERT`, `vault_oauth__sh_b64`, `PACKAGES_TOKEN`) | Gitea org `arcodange-org` | Inherited by every repository in the org |
## Step index
| # | Page | What is done there | Status |
|---|---|---|---|
| - | [Naming conventions](conventions.md) | The `<app>` name reused verbatim everywhere (read this first) | ✅ |
| 01 | [Gitea repository](01-gitea-repo.md) | Create the repository under `arcodange-org` + skeleton | ✅ |
| 02 | [Database](02-database.md) | `factory/postgres/iac` → database `<app>` + role `<app>_role` | ✅ |
| 03 | [Vault platform](03-vault-platform.md) | `tools/hashicorp-vault/iac` → `gitea_cicd_<app>` + policies | ✅ |
| 04 | [Helm chart](04-helm-chart.md) | The app's chart (DB through pgbouncer, VSO secrets, ingress) | ✅ |
| 05 | [App Terraform](05-app-terraform.md) | `iac/` → `app_roles` module (dynamic creds + K8s role) | ✅ |
| 06 | [CI workflows](06-ci-workflows.md) | `.gitea/workflows/`: `tofu apply` + image build | ✅ |
| 07 | [ArgoCD registration](07-argocd-register.md) | `factory/argocd/values.yaml` → Application + deployment | ✅ |
| 08 | [Checklist](08-checklist.md) | Ordered recap + definition of done | ✅ |
## Status legend
✅ active · 🟡 degraded/beta · 🔴 critical/EOL · ⚠️ known issue · ❌ disabled
## How to edit this runbook
1. **Add a page** → create it from the appropriate tree-docs template **and** add its row to the index above.
2. **Keep cross-links bidirectional** → every dependency cited in a page (`Upstream`/`Downstream`) must have its counterpart on the other page.
3. **Update `Last Updated:`** above after any structural change.
4. The cited examples (`erp`, `webapp`) are live: re-check the snippets against the real code before relying on them.


@@ -0,0 +1,55 @@
[Factory](../../../README.md) > [Doc](../../README.md) > [Runbooks](../README.md) > [Nouvelle application web](README.md) > **Conventions de nommage**
# Naming conventions
> **Why this page.** A single name, `<app>`, is reused verbatim across about ten systems (Gitea, Postgres, Vault, Kubernetes, ArgoCD, GCS, DNS). Every step of the runbook depends on it; it is centralized here to avoid re-explaining it everywhere.
> **Audience.** Anyone creating or auditing an app on the platform.
> **Status.** Active · reviewed on 2026-05-31.
---
## TL;DR
Choose **one** `<app>` identifier in **lowercase kebab-case** (`webapp`, `erp`, `dance-lessons-coach`, `url-shortener`). This name becomes, **without variation**, the key to every resource. A single typo anywhere breaks the chain (Vault auth, DB attachment, ArgoCD sync).
## The `<app>` name in each system
| System | Identifier derived from `<app>` | Example (`erp`) | Source of truth |
|---|---|---|---|
| Gitea repository | `arcodange-org/<app>` (or `arcodange/<app>` if `org` is overridden) | `arcodange-org/erp` | [argocd/templates/apps.yaml](https://gitea.arcodange.lab/arcodange-org/factory/src/branch/main/argocd/templates/apps.yaml) |
| PostgreSQL database | `<app>` | `erp` | [postgres/iac/main.tf](https://gitea.arcodange.lab/arcodange-org/factory/src/branch/main/postgres/iac/main.tf) |
| PG owner role (non-login) | `<app>_role` | `erp_role` | postgres/iac/main.tf |
| Vault dynamic DB role | `postgres/creds/<app>` | `postgres/creds/erp` | [modules/app_roles/main.tf](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/hashicorp-vault/iac/modules/app_roles/main.tf) |
| Kubernetes auth role (Vault) | `<app>` | `erp` | modules/app_roles/main.tf |
| Vault **runtime** policy (pod) | `<app>` | `erp` | [modules/app_policy/main.tf](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/hashicorp-vault/iac/modules/app_policy/main.tf) |
| Vault **CI** policy (ops) | `<app>-ops` | `erp-ops` | modules/app_policy/main.tf |
| CI JWT role (Vault) | `gitea_cicd_<app>` | `gitea_cicd_erp` | modules/app_policy/main.tf |
| Vault identity group | `<app>-ops` | `erp-ops` | modules/app_policy/main.tf |
| KV config secret | `kvv2/<app>/config` | `kvv2/erp/config` | modules/app_roles (`kvv2_path_prefix` output) |
| Kubernetes namespace | `<app>` | `erp` | apps.yaml (`CreateNamespace=true`) |
| Kubernetes ServiceAccount | `<app>` | `erp` | [chart/templates/serviceaccount.yaml](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/templates/serviceaccount.yaml) |
| ArgoCD Application | `<app>` | `erp` | apps.yaml |
| OpenTofu state prefix (GCS) | `<app>/main` | `erp/main` | [iac/backend.tf](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/iac/backend.tf) |
| Internal domain | `<app>.arcodange.lab` | `erp.arcodange.lab` | chart/values.yaml (ingress) |
| Public domain | `<app>.arcodange.fr` | `webapp.arcodange.fr` | chart/values.yaml (ingress) |
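Since every identifier in the table is a pure function of `<app>`, the derivation can be written down and used to validate a name before anything is created. The following is a minimal sketch: the function, the dictionary keys and the exact kebab-case regular expression are assumptions made for illustration, not code from the platform repositories.

```python
import re

# Assumed reading of "lowercase kebab-case": lowercase alphanumeric words
# joined by single hyphens, starting with a letter.
KEBAB_CASE = re.compile(r"[a-z][a-z0-9]*(?:-[a-z0-9]+)*")


def derive_identifiers(app):
    """Return the per-system identifiers derived from one app name."""
    if not KEBAB_CASE.fullmatch(app):
        raise ValueError(f"{app!r} is not lowercase kebab-case")
    return {
        "gitea_repo": f"arcodange-org/{app}",
        "pg_database": app,
        "pg_owner_role": f"{app}_role",
        "vault_db_role": f"postgres/creds/{app}",
        "vault_k8s_role": app,
        "vault_runtime_policy": app,
        "vault_ci_policy": f"{app}-ops",
        "vault_ci_jwt_role": f"gitea_cicd_{app}",
        "kv_config_path": f"kvv2/{app}/config",
        "k8s_namespace": app,
        "tofu_state_prefix": f"{app}/main",
        "internal_domain": f"{app}.arcodange.lab",
        "public_domain": f"{app}.arcodange.fr",
    }


if __name__ == "__main__":
    for system, identifier in derive_identifiers("erp").items():
        print(f"{system}: {identifier}")
```

Rejecting `my_app` or `MyApp` up front is the point of the check: the variants would otherwise only surface later as a failed Vault login or a missing database role.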
## Why this uniformity is structural
The building blocks plug into one another **by naming convention**, not through explicit configuration:
- The `app_roles` module generates a dynamic PG user with `GRANT <app>_role TO …` → it **assumes** that `<app>_role` (created in [step 2](02-database.md)) has exactly that name.
- The chart's `VaultDynamicSecret` reads `postgres/creds/<app>` → it **assumes** that the Vault role (created in [step 5](05-app-terraform.md)) is named exactly `<app>`.
- The ArgoCD `Application` derives `repoURL=.../<app>`, `path=chart`, and `namespace=<app>` from the name alone → the repository ([step 1](01-gitea-repo.md)) and the namespace must match.
**Use a short, stable, kebab-case name** from the start.
**Do not introduce** variants (`my_app` vs `my-app`, `MyApp`, plurals): nothing will warn you, and the app will silently fail to connect or to deploy.
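
Because a wrong name fails silently, a small pre-flight check can catch variants before anything is created. The following is a minimal sketch, not part of the repository: the `AppNames` class, the `derive_names` helper, and the exact kebab-case regular expression are assumptions made for illustration. Only the name patterns come from the table above.

```python
import re
from dataclasses import dataclass

# Assumed reading of "short, stable, kebab-case": lowercase alphanumerics
# separated by single hyphens, starting with a letter.
KEBAB_CASE = re.compile(r"[a-z][a-z0-9]*(-[a-z0-9]+)*")


@dataclass(frozen=True)
class AppNames:
    """A subset of the names derived from <app> (hypothetical helper type)."""

    pg_role: str
    vault_db_role: str
    vault_ci_policy: str
    vault_ci_jwt_role: str
    kv_config_path: str
    tofu_state_prefix: str
    internal_domain: str


def derive_names(app: str) -> AppNames:
    """Validate <app>, then expand it with the patterns from the naming table."""
    if not KEBAB_CASE.fullmatch(app):
        raise ValueError(f"{app!r} is not kebab-case")
    return AppNames(
        pg_role=f"{app}_role",
        vault_db_role=f"postgres/creds/{app}",
        vault_ci_policy=f"{app}-ops",
        vault_ci_jwt_role=f"gitea_cicd_{app}",
        kv_config_path=f"kvv2/{app}/config",
        tofu_state_prefix=f"{app}/main",
        internal_domain=f"{app}.arcodange.lab",
    )


if __name__ == "__main__":
    print(derive_names("erp"))
```

Calling `derive_names("my_app")` or `derive_names("MyApp")` raises `ValueError`, which turns the silent mismatch into an explicit error.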
## Cross-references
- [01 · Gitea repository](01-gitea-repo.md): sets `<app>` as the repository name under `arcodange-org`.
- [02 · Database](02-database.md): creates `<app>` and `<app>_role`.
- [03 · Vault platform](03-vault-platform.md): creates `gitea_cicd_<app>` and the `<app>` / `<app>-ops` policies.
- [04 · Helm chart](04-helm-chart.md): references `<app>`, `creds/<app>`, `kvv2/<app>/config`.
- [05 · App Terraform](05-app-terraform.md): calls `app_roles` with `name=<app>`.
- [06 · CI workflows](06-ci-workflows.md): authenticates with `gitea_cicd_<app>`.
- [07 · ArgoCD registration](07-argocd-register.md): declares `<app>` in `gitea_applications`.

174
iac/.terraform.lock.hcl generated Normal file

@@ -0,0 +1,174 @@
# This file is maintained automatically by "tofu init".
# Manual edits may be lost in future updates.
provider "registry.opentofu.org/cloudflare/cloudflare" {
version = "5.21.1"
constraints = ">= 5.20.0, ~> 5.21"
hashes = [
"h1:gNF1Sro3G9nXhtdkitXwDVKxI1jpBAf8KPv+Y4kAJwk=",
"h1:iWJb0lHfVWmCJQSyroXOT8zQlFOT8k1caHcfaooG5wk=",
"zh:049719425b8be43d9d4f0c208217aca0baa22374f061d7ff92f02563490f649c",
"zh:0a8a3c1b26680b437fe9e7910ca81e532d36f8efacfb14f45690b6a779856993",
"zh:32b61f80892243f7ab8e453fa038c1f3e2aac733ccb98307c2cfe798b2793b32",
"zh:42c27f3cd62979e70716c51f682a3d131d51ad76d86dff83d8cdbfffcebac841",
"zh:4c8cd464f9b6ecde5cd4430bbba4be3b810826105e51ef6328b6a2b69f821443",
"zh:586ea42ef74d6c5bc4c9b89da6b1f8618a19f4e80272fe8d615e7d5b11c491af",
"zh:b09b86c7cac7085e01c9b7a828f09d13c44589d3e3cd42f0b694ca3e4cd3ed0a",
"zh:eac80665e60c701b37a6318f4e405d67f1720f8da5f93135c6256049282d3367",
"zh:f809ab383cca0a5f83072981c64208cbd7fa67e986a86ee02dd2c82333221e32",
]
}
provider "registry.opentofu.org/go-gitea/gitea" {
version = "0.6.0"
constraints = "0.6.0"
hashes = [
"h1:DB9cn3EvZt6yEDAW/4s7clYOQhIwXQpSMQ+kDAK+o9Q=",
"h1:MTo8bBuGgh5t3u/UuBI6oMJ/pT7a3GwdyVS1i/aPsh8=",
"zh:16269c27d36157a9248cdb3acd4ee507950cb84bc0eaf843f74b302b3d194285",
"zh:20951c7b571853def841942499a161bc806325b5c7d17de3cb49516bfcab3863",
"zh:3d9a69119d4a76de25a4562d9ed87ea72773733b97bc98084a9ba7572c5124c4",
"zh:49a8fb4735c12169cb0f66e1dd286a3cc008ebc212e486a82758fe3c50456e52",
"zh:4daa6ce8136204aa60f47b519c2da0a551e9ed45fdc684cafb8c3170c106f5d3",
"zh:88df966ec884351492f1284fbc55a5c35d3723a863b58f1d9d98039e3b7bc7c6",
"zh:9b12af85486a96aedd8d7984b0ff811a4b42e3d88dad1a3fb4c0b580d04fa425",
"zh:a107ceacfca8341c4141574daacbbc6f91fb6414e0c541b27ac79948d4456ba9",
"zh:a3231c31c194f0606f01e06f44f74d37414f7a74e52452d6e93366ab2bcbcd4e",
"zh:a39af6f3dcb1d4fe6c19a8d64755a13fffb2febbec0be428291aa62c37295d17",
"zh:ac0c894d15a9c57e51ea67667fe9fdea0cf40dc7b97398e7e69fec03568801f7",
"zh:c3b70df30b8882b5d38e75b1709c292522a6f8a9bb226bc0a3258c4942c17c9d",
"zh:edd887c4eb5f721dcf6e288b0e1e599c2a7742135c1b3ab5493957f9a8f9dbe1",
"zh:f60b1f57123d11109caeff030c1c4456eed659bb88ec60b8d01b09c6a6954f00",
"zh:f7086eec6d90c2c0bd6385b7cccc80fffb5ce4300736f4bb982d70bf6eeecf48",
]
}
provider "registry.opentofu.org/hashicorp/google" {
version = "7.0.1"
constraints = "7.0.1"
hashes = [
"h1:kYx0VRlMuHcgOxEfbvORwTVGH+3WQUJJJJf1+PNh3k8=",
"h1:n9AyrMUKkTDkmfy1UBwaOh2ANepQ8i3Oa8ILLS6oaMI=",
"zh:0c1f204c23de0d63a5e3bf993a7f12d0b594f6a8020ef6dbac4ab711b2fc22d3",
"zh:2578d65af13c8b1971e6fb7c4725bbf93284c1e46a39d6528ec2323c17c84fb8",
"zh:3555b358d6c029929109fe629192ae19599d4efe1fee86d497d58b692a9313dd",
"zh:4475bc4fd37a51c962e5268a4ad65e059bcb074e5e0a9bac0d092bc23fda0927",
"zh:49af845bb5e1117bcb8885b9ecd4cce37dee00b43a1a08617392239c74398e8d",
"zh:ad5128adc7f3f1cb8ffbfdf98c1295c54e65da6d1e59849671081aac5caca01f",
"zh:c30baca3b476ea7ae9ad11f81ea85e8113b7f51ec21b4d6239142556131ada68",
"zh:c6de66d3674adc23abc65a3eea09829e9afbb1864aa563b140e1e5207671279e",
"zh:e9dda7a294a0c8f972c7ba20861be2c6fa7ee4c3c86550952e3a9199efd95d0e",
"zh:efdc977432a7bfe77a50dccdf1e890a7d0d9a8fb75dcd3a963cd0416ce175e8d",
]
}
provider "registry.opentofu.org/hashicorp/null" {
version = "3.3.0"
hashes = [
"h1:EvvCOc4FJY3NitSm6BpzCcUPU53LayVCB/tPOxYmy7U=",
"h1:mdu+qpyVmjDDLMrcL1JFy+cSyF58I3TFJwB5NssCZ58=",
"zh:083dcc0bec53f8abfa3f2aa2ce9d732a9675338fd60ae7d61162e25db7cb08bf",
"zh:19f7456b5a2ad16595860974714bfdb25b87bc16356ea9d5c7453892aaa27864",
"zh:222c0ed1fed4e4c677ebe626104dbfdba66763e264de0d9c27c58ce60104ee69",
"zh:271711d6caa7dd5a4e9b79fe8c679fab61a840bcf80040a0f5ebb425d1b27d97",
"zh:5adcf35f30baaea13f80c2a2c774deb9369892719493049687e23476c9dff40f",
"zh:5bcfd19df16e73d7f0ad75bd09e2b3b86cf6700d09822d585d68304b71de1d97",
"zh:604edecf263e38674decb35bb4e0e048fdc951f26fa103c33065ff9728f0313b",
"zh:782acbfb4fa4807e273e588fe45b4aaea9dd0fd1136f76ec3200f6f4db3af8d6",
"zh:84411a596d528fe67294e5c1cfd0c2036b08802497bcc4215ce518924f3c9a4a",
"zh:85e79eecf3f5348975cffec3016b0eba3baf605646102d4348796ccd2df2e5f6",
"zh:95669535ca17aeefef307ebfd59ce6930953173baae5637e8cbbf0297ec7ad58",
"zh:d04d9b177747bfd66b4a45b5d911a2a7822aa8451f5e35621971fb7a4206b530",
"zh:e6d9c924475283e90833450a14a732f4deb6d9bb131db8f86ab856e894270836",
"zh:ebcab0c8a1334c86ed7cfa53f571a17ad6d27e9901f27a8854ea622a74b54bb6",
"zh:ef9c757bb2c83d2103811a3d86b6ec5be06b0ffc337b84db1582d023bce7cdcd",
]
}
provider "registry.opentofu.org/hashicorp/random" {
version = "3.9.0"
hashes = [
"h1:U8KXqGCoNI9/guYbTvzgdtVk3fRthoG0UXwm1JoEpIs=",
"h1:gGDdPPibmw2EWROx+sh1RGLjR5+nPwZyrf6/N9jXfeM=",
"zh:03f1114cc20b8913523735ab76e0f0a2b16ce13c92923a53304bf85f07fc0dbc",
"zh:105b678ee72322a3067f105d7e05e940f6143238f377f6e87ff4ec909246ac2a",
"zh:55f3bbf13ea18cbace61a706566a80f25f33fe2b1780b6f3d7b582af2a05b6d2",
"zh:63adf996db48f082f7a6351eb485e219cd88795fc71e6ec60a837263ab0d2cb1",
"zh:7e99550738a4e3cc68b8a467714b0d69371025fe95e3326d5323d026d55653e9",
"zh:8342b54af3a18a37e075eeae61be57f4de2ba71b35d95c5075d402dd2c1f289d",
"zh:83ee18e32ac9dd5fc91298554b7c4cfa4c3a1db50f4c797945637cc93c0844ae",
"zh:993ecc0adbf6bd535a59fbc9b735d8c33950e6f6eb5e621d750da9b71d65d80a",
"zh:ad722bc59d4edbf1415e827fc007c0efe6e0e9462d5568bae20b34be1058a261",
"zh:ae9448e1f87b2f9a6c5197a0e9862162ec6b137cb3a3835e11522995d8939e7c",
"zh:bc9cdd3aac784f759125c6627f6f6416e8726a1c184eb9cf3e55b9edbc94c627",
"zh:c8e35b89572ba1c40a9b20022e033a3395fb8d42e7604d50c900f193ba10382e",
"zh:e2deaa8a9975ef81d9f62baed12c41286918b0a10908e0e031f13f69a3b730a1",
"zh:ee39707557210a0ab1098aa357d2cdfe502e5a312d0dbdffb09d08facc4d3fc5",
"zh:f81afe4eb63e8aa9e0ea71be6c990f0dc69cb360e7191c0742a991f4a5081b64",
]
}
provider "registry.opentofu.org/hashicorp/tls" {
version = "4.3.0"
hashes = [
"h1:GizReb5vbh71HnhHlGphHhVFj3ghwAaC2MKqb2d8Ye8=",
"h1:ZxKvDInYHzss9rv75M778pInFm08ME6hY31XMyFP4IA=",
"zh:07bb8c6e64124dada7dff57a38a46f2f323b3fd77920404c0c550293d1cf6188",
"zh:0b3bfda2df39c52f1c5452d05cf3107bedd5d20ab6977c90ede540c695fb6c3e",
"zh:110a055289f0400a63ac172bedb0e671d059b7a5ba22d4a3f5f246ccac0ad676",
"zh:15e532d8c711377499dece832e60170a8bef39830125b8154f4bda81d9721d29",
"zh:22ca65d96e9fc1be5605372d855c9e1eba2d86d510f7ac8593968f5649435e47",
"zh:36df38dfd03e8c1298c5704fd85e28b69a3927ed0b339f9628d0b56dac99c6b5",
"zh:429e2bfcb81656e1fe90b7b284767d1453c1a4100b16d27e4b29c34aa12f0ce1",
"zh:5b6679953065f0279bf018426c6fb06dd93a851a7a9369f2e3a1fec5bc417e83",
"zh:6a72c88d5aa945ddb32041350755377c96681563136decfe7e05c7cdea7988f1",
"zh:6f05757c50da9f8354a735b5756bd63a71126fcd142129525b90c56bfd081d61",
"zh:751703b7a4d40c3a111c4ed0d5da3ec91c14f880faf6f010a5000a2eb5366011",
"zh:87a5279e61b8198798a2fe86cfe3b74e5340bb486f4e148bb5b4d46f860cf1db",
"zh:942af95e9fd73327a7e9ab0803c4d701b782ddacd78c9b7ce9c91e38b3051522",
"zh:a457d0efea3c404178a182d240ba21cdeb0c620ffabeeb9a8977b024a85e1360",
"zh:d5eac8f4f0ae1ff41cbcc1008e6a74a8491dc27f4c6e5a0c32c5c4b6ef2e4087",
]
}
provider "registry.opentofu.org/hashicorp/vault" {
version = "4.4.0"
constraints = "4.4.0"
hashes = [
"h1:IhKDv0pTgpy89K3QYmDX872H75Wl7kZKR2scUQynuiA=",
"h1:t74F5RJkOMm0N/PbcvxPGyi0V1hwHjuOv0lFZ7lII6c=",
"zh:0309ea8f81386e17ab13c06c5991ca959708c55c815b0cfba2bbcd865e0d606e",
"zh:40e56199ccd266bffa216e8ebbcdc2e29b6ef5145b39377be766e763cac759c8",
"zh:6fad1f073bd2e53e34736e000f98db581137e153ac80bbb5c4f1a1e38b46a1d2",
"zh:74564fd4759decccf7f3c952aa2feba1012f103a66ec354aa3b3292a2f1b2412",
"zh:7aae012c1a43e6e5dae6f608ec0f08cdb3f95fa121a32e413fe7ee37cb99947f",
"zh:7c83f508e164844b1dd9bafe9de0fe60c7be7b55a02e704a6e2f50cff38b7d96",
"zh:873a42322b68d9fba4a38217b97ee04a1eb617e811d7f9954016f5c3eb6cb0bc",
"zh:9db2b13472cf91a5f18f0a7c6ae532277c05b0980d87f492341426b981679f7b",
"zh:ac1cbd2926265db80efe3f1814bed82901f7d8a7d4e5b1e22592e1eef234b1c7",
"zh:f465a955cc96f640e7426a648ba672c169a4a2959bad6146fe61583d67642561",
]
}
provider "registry.opentofu.org/ovh/ovh" {
version = "2.8.0"
constraints = "2.8.0"
hashes = [
"h1:wfhxUnZfCPsc6veiUOkEBXwyvF9ZGi2SwR85cp2CUws=",
"h1:zbnPL6Y4k/dY1X2u2JVyTid5hXwcCIfz65VC9UbkDrE=",
"zh:026d6590900388d8845af9d99a438e3cd90fcf50ef5f95a24b9dc646f391aa5c",
"zh:1375f3947bbdfe19c05abf0dbc0cb6f319d79976909282a269f4eb934a67fb18",
"zh:13cc7536d366935cb31b89f2b714c5ac8eac7e825e6897477fe56caebb04992e",
"zh:388696109f5f03c95775407df10dca822d0651237872a579fe7e953312a75ff6",
"zh:3ca9fd5e6756fe9f448066f74e7d6d7de5e7c0f34f923032d3a976ab6772a86a",
"zh:43ab0d8e362e2b22cac53747f609798de9e267a3eceaa66146b36e8ed6b16a98",
"zh:456d80cf53e21258d4df1a239ba3f7b1482631e558497cd797fafd25f8eea3ca",
"zh:54d46a83305120a9331b1dc12e6039b895b5285434bb96904d30f1fe277bbde7",
"zh:5b6b2628ef1a00579e769d7f67482fb8b59534f8761b399e7baf683e716e5d88",
"zh:68e6df5c16b92601d4545739855ec309b1ce7fce6597d8d6e4776357a5da7a7c",
"zh:80745afe134180fc441cc1c34c3a9ea20756f01ae793ba625255ce92817f5f5d",
"zh:a81a6896e60526588f8d16168d06018842c083ff5a1d73193cf7e9b26c3a4076",
"zh:ce68d4e6ca846f5e97de06fce5a4d6aca16154ddd8cf43580fd89b581e1ee471",
"zh:e498f560263abebf96a2cc698492b603c5a78851f77235d141c1ee7336ab866c",
]
}


@@ -41,13 +41,13 @@ locals {
  length(local.selected_account_permissions) > 0 ? {
  effect = "allow"
  permission_groups = [for id in local.selected_account_permissions : { id = id }]
- resources = local.account_resource
+ resources = jsonencode(local.account_resource) # cloudflare provider >=5.20 types policies[].resources as a JSON string
  } : null,
  length(local.selected_bucket_permissions) > 0 ? {
  effect = "allow"
  permission_groups = [for id in local.selected_bucket_permissions : { id = id }]
- resources = local.bucket_resource
+ resources = jsonencode(local.bucket_resource) # cloudflare provider >=5.20 types policies[].resources as a JSON string
  } : null
  ] : policy if policy != null]
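
The hunk above does two things: it serializes each `resources` value to a JSON string, and it drops the policies whose permission list is empty. A rough Python analogue of that shape, offered only as a sketch: `build_policy` and `build_policies` are hypothetical helpers, and the resource maps passed in are placeholders, since the real contents of `local.account_resource` and `local.bucket_resource` are not shown in this diff.

```python
import json


def build_policy(permission_ids, resources):
    """Build one policy entry; `resources` ends up as a JSON string, not a map."""
    return {
        "effect": "allow",
        "permission_groups": [{"id": pid} for pid in permission_ids],
        # Compact separators approximate the whitespace-free output of jsonencode().
        "resources": json.dumps(resources, separators=(",", ":")),
    }


def build_policies(account_ids, bucket_ids, account_resource, bucket_resource):
    """Emit a policy only when its permission list is non-empty, then drop the gaps."""
    candidates = [
        build_policy(account_ids, account_resource) if account_ids else None,
        build_policy(bucket_ids, bucket_resource) if bucket_ids else None,
    ]
    return [policy for policy in candidates if policy is not None]


if __name__ == "__main__":
    # Placeholder resource maps, for illustration only.
    policies = build_policies(["perm-a"], [], {"account-key": "*"}, {"bucket-key": "*"})
    print(policies)
```

The point of the sketch is the type of `resources`: consumers that expect a string receive one, and the original map can still be recovered with `json.loads`.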


@@ -0,0 +1,12 @@
# Bind the module's cloudflare_* resources to the cloudflare/cloudflare provider explicitly.
# Without this, OpenTofu defaults the module's provider source to hashicorp/cloudflare, pulling a
# second (redundant) provider into the lock file and relying on a registry redirect.
# >= 5.20 because policies[].resources is now a JSON string (set via jsonencode in main.tf).
terraform {
required_providers {
cloudflare = {
source = "cloudflare/cloudflare"
version = ">= 5.20"
}
}
}
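
The module's `>= 5.20` is combined with the root configuration's `~> 5.21`, which is why the lock file records `>= 5.20.0, ~> 5.21` for the cloudflare provider. The sketch below illustrates how such a combined constraint narrows the acceptable versions. It is a simplified model written for this note, not OpenTofu's resolver: the helper names are hypothetical, it ignores pre-release versions, and it only handles `~>` constraints with at least two segments.

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Parse '5.21' or '5.21.1' into a 3-segment tuple, padding with zeros."""
    parts = tuple(int(p) for p in text.strip().split("."))
    return parts + (0,) * (3 - len(parts))


def satisfies(version: str, constraint: str) -> bool:
    """Check one constraint such as '>= 5.20', '~> 5.21' or a bare '0.6.0'."""
    tokens = constraint.split()
    op, wanted = ("=", tokens[0]) if len(tokens) == 1 else (tokens[0], tokens[1])
    have, want = parse_version(version), parse_version(wanted)
    if op == "=":
        return have == want
    if op == ">=":
        return have >= want
    if op == "~>":
        segments = [int(p) for p in wanted.split(".")]
        if len(segments) < 2:
            raise ValueError("this sketch only handles ~> with two or more segments")
        # Only the rightmost written segment may grow, so the upper bound
        # is the next value of the segment just before it.
        upper = ".".join(str(s) for s in segments[:-2] + [segments[-2] + 1])
        return want <= have < parse_version(upper)
    raise ValueError(f"unsupported operator: {op}")


def satisfies_all(version: str, constraints: str) -> bool:
    """Check a comma-separated list of constraints; all of them must hold."""
    return all(satisfies(version, c) for c in constraints.split(","))


if __name__ == "__main__":
    for candidate in ("5.20.3", "5.21.1", "6.0.0"):
        print(candidate, satisfies_all(candidate, ">= 5.20.0, ~> 5.21"))
```

Under this model, `5.21.1` passes both constraints, while `5.20.3` satisfies the module's lower bound but is rejected by the root pin.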


@@ -14,7 +14,7 @@ terraform {
  }
  cloudflare = {
  source = "cloudflare/cloudflare"
- version = "~> 5"
+ version = "~> 5.21" # pinned + .terraform.lock.hcl committed to avoid silent v5.x drift
  }
  ovh = {
  source = "ovh/ovh"
@@ -23,8 +23,18 @@ terraform {
  }
  }
+ variable "gitea_cacert_file" {
+ # The gitea provider runs inside the dflook/terraform-apply container, which does NOT trust the
+ # homelab CA (unlike the ubuntu-latest-ca runner). Point it at the CA the workflow already writes
+ # so it can verify https://gitea.arcodange.lab. Set via TF_VAR_gitea_cacert_file in CI; null locally.
+ description = "Path to the homelab CA cert for the Gitea provider (set in CI). Null = use system trust."
+ type = string
+ default = null
+ }
  provider "gitea" { # https://registry.terraform.io/providers/go-gitea/gitea/latest/docs
  base_url = "https://gitea.arcodange.lab"
+ cacert_file = var.gitea_cacert_file
  # use GITEA_TOKEN env var
  }
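
The "null = use system trust" behaviour of `gitea_cacert_file` has a close analogue in Python's standard `ssl` module, which can help when reproducing the TLS failure outside OpenTofu. This is a sketch of the same idea, not the Gitea provider's implementation; the helper names `cacert_from_env` and `make_tls_context` are made up for this example.

```python
import os
import ssl
from typing import Mapping, Optional


def cacert_from_env(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the CA path that CI exports as TF_VAR_gitea_cacert_file, or None."""
    return env.get("TF_VAR_gitea_cacert_file") or None


def make_tls_context(cacert_file: Optional[str] = None) -> ssl.SSLContext:
    """Build a verifying TLS context.

    With cacert_file=None the system trust store is loaded; with a path,
    only the certificates in that file are trusted.
    """
    return ssl.create_default_context(cafile=cacert_file)


if __name__ == "__main__":
    context = make_tls_context(cacert_from_env())
    print(context.verify_mode, context.check_hostname)
```

In both cases certificate verification stays enabled; the variable only changes which CA bundle is consulted.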


@@ -3,4 +3,5 @@ applications = [
"erp",
"crowdsec",
"plausible",
"dance-lessons-coach",
]

11
pyproject.toml Normal file

@@ -0,0 +1,11 @@
[project]
name = "arcodange-factory"
version = "0.0.0"
description = "Ansible automation for the Arcodange factory homelab"
requires-python = ">=3.12,<3.13"
dependencies = [
"ansible-core",
"kubernetes",
"jmespath",
"dnspython",
]

342
uv.lock generated Normal file

@@ -0,0 +1,342 @@
version = 1
revision = 3
requires-python = "==3.12.*"
[[package]]
name = "ansible-core"
version = "2.20.5"
source = { registry = "https://pypi.org/simple" }
dependencies = [
{ name = "cryptography" },
{ name = "jinja2" },
{ name = "packaging" },
{ name = "pyyaml" },
{ name = "resolvelib" },
]
sdist = { url = "https://files.pythonhosted.org/packages/9d/ec/690cc73e38c3546eabc8ef4118e0d7be1758a598bc23eed3e24ca1f346a7/ansible_core-2.20.5.tar.gz", hash = "sha256:82e3049d95e6e02e5d20d4a5a8e10533a55e0cc52e878e4cf77166c45410f16f", size = 3339511, upload-time = "2026-04-21T00:48:27.175Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/f9/e1/4454505e725b84ae670229565dfc20c4075480199647bf4874cf337c560e/ansible_core-2.20.5-py3-none-any.whl", hash = "sha256:ff6ff15c6a37fda07dc7400207e17e93727b24173ca48c068b3311a50d75ecc3", size = 2416843, upload-time = "2026-04-21T00:48:25.413Z" },
]
[[package]]
name = "arcodange-factory"
version = "0.0.0"
source = { virtual = "." }
dependencies = [
{ name = "ansible-core" },
{ name = "dnspython" },
{ name = "jmespath" },
{ name = "kubernetes" },
]
[package.metadata]
requires-dist = [
{ name = "ansible-core" },
{ name = "dnspython" },
{ name = "jmespath" },
{ name = "kubernetes" },
]
[[package]]
name = "certifi"
version = "2026.4.22"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/25/ee/6caf7a40c36a1220410afe15a1cc64993a1f864871f698c0f93acb72842a/certifi-2026.4.22.tar.gz", hash = "sha256:8d455352a37b71bf76a79caa83a3d6c25afee4a385d632127b6afb3963f1c580", size = 137077, upload-time = "2026-04-22T11:26:11.191Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/22/30/7cd8fdcdfbc5b869528b079bfb76dcdf6056b1a2097a662e5e8c04f42965/certifi-2026.4.22-py3-none-any.whl", hash = "sha256:3cb2210c8f88ba2318d29b0388d1023c8492ff72ecdde4ebdaddbb13a31b1c4a", size = 135707, upload-time = "2026-04-22T11:26:09.372Z" },
]
[[package]]
name = "cffi"
version = "2.0.0"
source = { registry = "https://pypi.org/simple" }
dependencies = [
{ name = "pycparser", marker = "implementation_name != 'PyPy'" },
]
sdist = { url = "https://files.pythonhosted.org/packages/eb/56/b1ba7935a17738ae8453301356628e8147c79dbb825bcbc73dc7401f9846/cffi-2.0.0.tar.gz", hash = "sha256:44d1b5909021139fe36001ae048dbdde8214afa20200eda0f64c068cac5d5529", size = 523588, upload-time = "2025-09-08T23:24:04.541Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/ea/47/4f61023ea636104d4f16ab488e268b93008c3d0bb76893b1b31db1f96802/cffi-2.0.0-cp312-cp312-macosx_10_13_x86_64.whl", hash = "sha256:6d02d6655b0e54f54c4ef0b94eb6be0607b70853c45ce98bd278dc7de718be5d", size = 185271, upload-time = "2025-09-08T23:22:44.795Z" },
{ url = "https://files.pythonhosted.org/packages/df/a2/781b623f57358e360d62cdd7a8c681f074a71d445418a776eef0aadb4ab4/cffi-2.0.0-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:8eca2a813c1cb7ad4fb74d368c2ffbbb4789d377ee5bb8df98373c2cc0dee76c", size = 181048, upload-time = "2025-09-08T23:22:45.938Z" },
{ url = "https://files.pythonhosted.org/packages/ff/df/a4f0fbd47331ceeba3d37c2e51e9dfc9722498becbeec2bd8bc856c9538a/cffi-2.0.0-cp312-cp312-manylinux1_i686.manylinux2014_i686.manylinux_2_17_i686.manylinux_2_5_i686.whl", hash = "sha256:21d1152871b019407d8ac3985f6775c079416c282e431a4da6afe7aefd2bccbe", size = 212529, upload-time = "2025-09-08T23:22:47.349Z" },
{ url = "https://files.pythonhosted.org/packages/d5/72/12b5f8d3865bf0f87cf1404d8c374e7487dcf097a1c91c436e72e6badd83/cffi-2.0.0-cp312-cp312-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:b21e08af67b8a103c71a250401c78d5e0893beff75e28c53c98f4de42f774062", size = 220097, upload-time = "2025-09-08T23:22:48.677Z" },
{ url = "https://files.pythonhosted.org/packages/c2/95/7a135d52a50dfa7c882ab0ac17e8dc11cec9d55d2c18dda414c051c5e69e/cffi-2.0.0-cp312-cp312-manylinux2014_ppc64le.manylinux_2_17_ppc64le.whl", hash = "sha256:1e3a615586f05fc4065a8b22b8152f0c1b00cdbc60596d187c2a74f9e3036e4e", size = 207983, upload-time = "2025-09-08T23:22:50.06Z" },
{ url = "https://files.pythonhosted.org/packages/3a/c8/15cb9ada8895957ea171c62dc78ff3e99159ee7adb13c0123c001a2546c1/cffi-2.0.0-cp312-cp312-manylinux2014_s390x.manylinux_2_17_s390x.whl", hash = "sha256:81afed14892743bbe14dacb9e36d9e0e504cd204e0b165062c488942b9718037", size = 206519, upload-time = "2025-09-08T23:22:51.364Z" },
{ url = "https://files.pythonhosted.org/packages/78/2d/7fa73dfa841b5ac06c7b8855cfc18622132e365f5b81d02230333ff26e9e/cffi-2.0.0-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:3e17ed538242334bf70832644a32a7aae3d83b57567f9fd60a26257e992b79ba", size = 219572, upload-time = "2025-09-08T23:22:52.902Z" },
{ url = "https://files.pythonhosted.org/packages/07/e0/267e57e387b4ca276b90f0434ff88b2c2241ad72b16d31836adddfd6031b/cffi-2.0.0-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:3925dd22fa2b7699ed2617149842d2e6adde22b262fcbfada50e3d195e4b3a94", size = 222963, upload-time = "2025-09-08T23:22:54.518Z" },
{ url = "https://files.pythonhosted.org/packages/b6/75/1f2747525e06f53efbd878f4d03bac5b859cbc11c633d0fb81432d98a795/cffi-2.0.0-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:2c8f814d84194c9ea681642fd164267891702542f028a15fc97d4674b6206187", size = 221361, upload-time = "2025-09-08T23:22:55.867Z" },
{ url = "https://files.pythonhosted.org/packages/7b/2b/2b6435f76bfeb6bbf055596976da087377ede68df465419d192acf00c437/cffi-2.0.0-cp312-cp312-win32.whl", hash = "sha256:da902562c3e9c550df360bfa53c035b2f241fed6d9aef119048073680ace4a18", size = 172932, upload-time = "2025-09-08T23:22:57.188Z" },
{ url = "https://files.pythonhosted.org/packages/f8/ed/13bd4418627013bec4ed6e54283b1959cf6db888048c7cf4b4c3b5b36002/cffi-2.0.0-cp312-cp312-win_amd64.whl", hash = "sha256:da68248800ad6320861f129cd9c1bf96ca849a2771a59e0344e88681905916f5", size = 183557, upload-time = "2025-09-08T23:22:58.351Z" },
{ url = "https://files.pythonhosted.org/packages/95/31/9f7f93ad2f8eff1dbc1c3656d7ca5bfd8fb52c9d786b4dcf19b2d02217fa/cffi-2.0.0-cp312-cp312-win_arm64.whl", hash = "sha256:4671d9dd5ec934cb9a73e7ee9676f9362aba54f7f34910956b84d727b0d73fb6", size = 177762, upload-time = "2025-09-08T23:22:59.668Z" },
]
[[package]]
name = "charset-normalizer"
version = "3.4.7"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/e7/a1/67fe25fac3c7642725500a3f6cfe5821ad557c3abb11c9d20d12c7008d3e/charset_normalizer-3.4.7.tar.gz", hash = "sha256:ae89db9e5f98a11a4bf50407d4363e7b09b31e55bc117b4f7d80aab97ba009e5", size = 144271, upload-time = "2026-04-02T09:28:39.342Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/0c/eb/4fc8d0a7110eb5fc9cc161723a34a8a6c200ce3b4fbf681bc86feee22308/charset_normalizer-3.4.7-cp312-cp312-macosx_10_13_universal2.whl", hash = "sha256:eca9705049ad3c7345d574e3510665cb2cf844c2f2dcfe675332677f081cbd46", size = 311328, upload-time = "2026-04-02T09:26:24.331Z" },
{ url = "https://files.pythonhosted.org/packages/f8/e3/0fadc706008ac9d7b9b5be6dc767c05f9d3e5df51744ce4cc9605de7b9f4/charset_normalizer-3.4.7-cp312-cp312-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:6178f72c5508bfc5fd446a5905e698c6212932f25bcdd4b47a757a50605a90e2", size = 208061, upload-time = "2026-04-02T09:26:25.568Z" },
{ url = "https://files.pythonhosted.org/packages/42/f0/3dd1045c47f4a4604df85ec18ad093912ae1344ac706993aff91d38773a2/charset_normalizer-3.4.7-cp312-cp312-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:e1421b502d83040e6d7fb2fb18dff63957f720da3d77b2fbd3187ceb63755d7b", size = 229031, upload-time = "2026-04-02T09:26:26.865Z" },
{ url = "https://files.pythonhosted.org/packages/dc/67/675a46eb016118a2fbde5a277a5d15f4f69d5f3f5f338e5ee2f8948fcf43/charset_normalizer-3.4.7-cp312-cp312-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:edac0f1ab77644605be2cbba52e6b7f630731fc42b34cb0f634be1a6eface56a", size = 225239, upload-time = "2026-04-02T09:26:28.044Z" },
{ url = "https://files.pythonhosted.org/packages/4b/f8/d0118a2f5f23b02cd166fa385c60f9b0d4f9194f574e2b31cef350ad7223/charset_normalizer-3.4.7-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:5649fd1c7bade02f320a462fdefd0b4bd3ce036065836d4f42e0de958038e116", size = 216589, upload-time = "2026-04-02T09:26:29.239Z" },
{ url = "https://files.pythonhosted.org/packages/b1/f1/6d2b0b261b6c4ceef0fcb0d17a01cc5bc53586c2d4796fa04b5c540bc13d/charset_normalizer-3.4.7-cp312-cp312-manylinux_2_31_armv7l.whl", hash = "sha256:203104ed3e428044fd943bc4bf45fa73c0730391f9621e37fe39ecf477b128cb", size = 202733, upload-time = "2026-04-02T09:26:30.5Z" },
{ url = "https://files.pythonhosted.org/packages/6f/c0/7b1f943f7e87cc3db9626ba17807d042c38645f0a1d4415c7a14afb5591f/charset_normalizer-3.4.7-cp312-cp312-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:298930cec56029e05497a76988377cbd7457ba864beeea92ad7e844fe74cd1f1", size = 212652, upload-time = "2026-04-02T09:26:31.709Z" },
{ url = "https://files.pythonhosted.org/packages/38/dd/5a9ab159fe45c6e72079398f277b7d2b523e7f716acc489726115a910097/charset_normalizer-3.4.7-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:708838739abf24b2ceb208d0e22403dd018faeef86ddac04319a62ae884c4f15", size = 211229, upload-time = "2026-04-02T09:26:33.282Z" },
{ url = "https://files.pythonhosted.org/packages/d5/ff/531a1cad5ca855d1c1a8b69cb71abfd6d85c0291580146fda7c82857caa1/charset_normalizer-3.4.7-cp312-cp312-musllinux_1_2_armv7l.whl", hash = "sha256:0f7eb884681e3938906ed0434f20c63046eacd0111c4ba96f27b76084cd679f5", size = 203552, upload-time = "2026-04-02T09:26:34.845Z" },
{ url = "https://files.pythonhosted.org/packages/c1/4c/a5fb52d528a8ca41f7598cb619409ece30a169fbdf9cdce592e53b46c3a6/charset_normalizer-3.4.7-cp312-cp312-musllinux_1_2_ppc64le.whl", hash = "sha256:4dc1e73c36828f982bfe79fadf5919923f8a6f4df2860804db9a98c48824ce8d", size = 230806, upload-time = "2026-04-02T09:26:36.152Z" },
{ url = "https://files.pythonhosted.org/packages/59/7a/071feed8124111a32b316b33ae4de83d36923039ef8cf48120266844285b/charset_normalizer-3.4.7-cp312-cp312-musllinux_1_2_riscv64.whl", hash = "sha256:aed52fea0513bac0ccde438c188c8a471c4e0f457c2dd20cdbf6ea7a450046c7", size = 212316, upload-time = "2026-04-02T09:26:37.672Z" },
{ url = "https://files.pythonhosted.org/packages/fd/35/f7dba3994312d7ba508e041eaac39a36b120f32d4c8662b8814dab876431/charset_normalizer-3.4.7-cp312-cp312-musllinux_1_2_s390x.whl", hash = "sha256:fea24543955a6a729c45a73fe90e08c743f0b3334bbf3201e6c4bc1b0c7fa464", size = 227274, upload-time = "2026-04-02T09:26:38.93Z" },
{ url = "https://files.pythonhosted.org/packages/8a/2d/a572df5c9204ab7688ec1edc895a73ebded3b023bb07364710b05dd1c9be/charset_normalizer-3.4.7-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:bb6d88045545b26da47aa879dd4a89a71d1dce0f0e549b1abcb31dfe4a8eac49", size = 218468, upload-time = "2026-04-02T09:26:40.17Z" },
{ url = "https://files.pythonhosted.org/packages/86/eb/890922a8b03a568ca2f336c36585a4713c55d4d67bf0f0c78924be6315ca/charset_normalizer-3.4.7-cp312-cp312-win32.whl", hash = "sha256:2257141f39fe65a3fdf38aeccae4b953e5f3b3324f4ff0daf9f15b8518666a2c", size = 148460, upload-time = "2026-04-02T09:26:41.416Z" },
{ url = "https://files.pythonhosted.org/packages/35/d9/0e7dffa06c5ab081f75b1b786f0aefc88365825dfcd0ac544bdb7b2b6853/charset_normalizer-3.4.7-cp312-cp312-win_amd64.whl", hash = "sha256:5ed6ab538499c8644b8a3e18debabcd7ce684f3fa91cf867521a7a0279cab2d6", size = 159330, upload-time = "2026-04-02T09:26:42.554Z" },
{ url = "https://files.pythonhosted.org/packages/9e/5d/481bcc2a7c88ea6b0878c299547843b2521ccbc40980cb406267088bc701/charset_normalizer-3.4.7-cp312-cp312-win_arm64.whl", hash = "sha256:56be790f86bfb2c98fb742ce566dfb4816e5a83384616ab59c49e0604d49c51d", size = 147828, upload-time = "2026-04-02T09:26:44.075Z" },
{ url = "https://files.pythonhosted.org/packages/db/8f/61959034484a4a7c527811f4721e75d02d653a35afb0b6054474d8185d4c/charset_normalizer-3.4.7-py3-none-any.whl", hash = "sha256:3dce51d0f5e7951f8bb4900c257dad282f49190fdbebecd4ba99bcc41fef404d", size = 61958, upload-time = "2026-04-02T09:28:37.794Z" },
]
[[package]]
name = "cryptography"
version = "48.0.0"
source = { registry = "https://pypi.org/simple" }
dependencies = [
{ name = "cffi", marker = "platform_python_implementation != 'PyPy'" },
]
sdist = { url = "https://files.pythonhosted.org/packages/9f/a9/db8f313fdcd85d767d4973515e1db101f9c71f95fced83233de224673757/cryptography-48.0.0.tar.gz", hash = "sha256:5c3932f4436d1cccb036cb0eaef46e6e2db91035166f1ad6505c3c9d5a635920", size = 832984, upload-time = "2026-05-04T22:59:38.133Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/df/3d/01f6dd9190170a5a241e0e98c2d04be3664a9e6f5b9b872cde63aff1c3dd/cryptography-48.0.0-cp311-abi3-macosx_10_9_universal2.whl", hash = "sha256:0c558d2cdffd8f4bbb30fc7134c74d2ca9a476f830bb053074498fbc86f41ed6", size = 8001587, upload-time = "2026-05-04T22:57:36.803Z" },
{ url = "https://files.pythonhosted.org/packages/b2/6e/e90527eef33f309beb811cf7c982c3aeffcce8e3edb178baa4ca3ae4a6fa/cryptography-48.0.0-cp311-abi3-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:f5333311663ea94f75dd408665686aaf426563556bb5283554a3539177e03b8c", size = 4690433, upload-time = "2026-05-04T22:57:40.373Z" },
{ url = "https://files.pythonhosted.org/packages/90/04/673510ed51ddff56575f306cf1617d80411ee76831ccd3097599140efdfe/cryptography-48.0.0-cp311-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:7995ef305d7165c3f11ae07f2517e5a4f1d5c18da1376a0a9ed496336b69e5f3", size = 4710620, upload-time = "2026-05-04T22:57:42.935Z" },
{ url = "https://files.pythonhosted.org/packages/14/d5/e9c4ef932c8d800490c34d8bd589d64a31d5890e27ec9e9ad532be893294/cryptography-48.0.0-cp311-abi3-manylinux_2_28_aarch64.whl", hash = "sha256:40ba1f85eaa6959837b1d51c9767e230e14612eea4ef110ee8854ada22da1bf5", size = 4696283, upload-time = "2026-05-04T22:57:45.294Z" },
{ url = "https://files.pythonhosted.org/packages/0c/29/174b9dfb60b12d59ecfc6cfa04bc88c21b42a54f01b8aae09bb6e51e4c7f/cryptography-48.0.0-cp311-abi3-manylinux_2_28_ppc64le.whl", hash = "sha256:369a6348999f94bbd53435c894377b20ab95f25a9065c283570e70150d8abc3c", size = 5296573, upload-time = "2026-05-04T22:57:47.933Z" },
{ url = "https://files.pythonhosted.org/packages/95/38/0d29a6fd7d0d1373f0c0c88a04ba20e359b257753ac497564cd660fc1d55/cryptography-48.0.0-cp311-abi3-manylinux_2_28_x86_64.whl", hash = "sha256:a0e692c683f4df67815a2d258b324e66f4738bd7a96a218c826dce4f4bd05d8f", size = 4743677, upload-time = "2026-05-04T22:57:50.067Z" },
{ url = "https://files.pythonhosted.org/packages/30/be/eef653013d5c63b6a490529e0316f9ac14a37602965d4903efed1399f32b/cryptography-48.0.0-cp311-abi3-manylinux_2_31_armv7l.whl", hash = "sha256:18349bbc56f4743c8b12dc32e2bccb2cf83ee8b69a3bba74ef8ae857e26b3d25", size = 4330808, upload-time = "2026-05-04T22:57:52.301Z" },
{ url = "https://files.pythonhosted.org/packages/84/9e/500463e87abb7a0a0f9f256ec21123ecde0a7b5541a15e840ea54551fd81/cryptography-48.0.0-cp311-abi3-manylinux_2_34_aarch64.whl", hash = "sha256:7e8eac43dfca5c4cccc6dad9a80504436fca53bb9bc3100a2386d730fbe6b602", size = 4695941, upload-time = "2026-05-04T22:57:54.603Z" },
{ url = "https://files.pythonhosted.org/packages/e3/dc/7303087450c2ec9e7fbb750e17c2abfbc658f23cbd0e54009509b7cc4091/cryptography-48.0.0-cp311-abi3-manylinux_2_34_ppc64le.whl", hash = "sha256:9ccdac7d40688ecb5a3b4a604b8a88c8002e3442d6c60aead1db2a89a041560c", size = 5252579, upload-time = "2026-05-04T22:57:57.207Z" },
{ url = "https://files.pythonhosted.org/packages/d0/c0/7101d3b7215edcdc90c45da544961fd8ed2d6448f77577460fa75a8443f7/cryptography-48.0.0-cp311-abi3-manylinux_2_34_x86_64.whl", hash = "sha256:bd72e68b06bb1e96913f97dd4901119bc17f39d4586a5adf2d3e47bc2b9d58b5", size = 4743326, upload-time = "2026-05-04T22:57:59.535Z" },
{ url = "https://files.pythonhosted.org/packages/ac/d8/5b833bad13016f562ab9d063d68199a4bd121d18458e439515601d3357ec/cryptography-48.0.0-cp311-abi3-musllinux_1_2_aarch64.whl", hash = "sha256:59baa2cb386c4f0b9905bd6eb4c2a79a69a128408fd31d32ca4d7102d4156321", size = 4826672, upload-time = "2026-05-04T22:58:01.996Z" },
{ url = "https://files.pythonhosted.org/packages/98/e1/7074eb8bf3c135558c73fc2bcf0f5633f912e6fb87e868a55c454080ef09/cryptography-48.0.0-cp311-abi3-musllinux_1_2_x86_64.whl", hash = "sha256:9249e3cd978541d665967ac2cb2787fd6a62bddf1e75b3e347a594d7dacf4f74", size = 4972574, upload-time = "2026-05-04T22:58:03.968Z" },
{ url = "https://files.pythonhosted.org/packages/04/70/e5a1b41d325f797f39427aa44ef8baf0be500065ab6d8e10369d850d4a4f/cryptography-48.0.0-cp311-abi3-win32.whl", hash = "sha256:9c459db21422be75e2809370b829a87eb37f74cd785fc4aa9ea1e5f43b47cda4", size = 3294868, upload-time = "2026-05-04T22:58:06.467Z" },
{ url = "https://files.pythonhosted.org/packages/f4/ac/8ac51b4a5fc5932eb7ee5c517ba7dc8cd834f0048962b6b352f00f41ebf9/cryptography-48.0.0-cp311-abi3-win_amd64.whl", hash = "sha256:5b012212e08b8dd5edc78ef54da83dd9892fd9105323b3993eff6bea65dc21d7", size = 3817107, upload-time = "2026-05-04T22:58:08.845Z" },
{ url = "https://files.pythonhosted.org/packages/f2/63/61d4a4e1c6b6bab6ce1e213cd36a24c415d90e76d78c5eb8577c5541d2e8/cryptography-48.0.0-cp39-abi3-macosx_10_9_universal2.whl", hash = "sha256:58d00498e8933e4a194f3076aee1b4a97dfec1a6da444535755822fe5d8b0b86", size = 7983482, upload-time = "2026-05-04T22:58:43.769Z" },
{ url = "https://files.pythonhosted.org/packages/d5/ac/f5b5995b87770c693e2596559ffafe195b4033a57f14a82268a2842953f3/cryptography-48.0.0-cp39-abi3-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:614d0949f4790582d2cc25553abd09dd723025f0c0e7c67376a1d77196743d6e", size = 4683266, upload-time = "2026-05-04T22:58:46.064Z" },
{ url = "https://files.pythonhosted.org/packages/ec/c6/8b14f67e18338fbc4adb76f66c001f5c3610b3e2d1837f268f47a347dbbb/cryptography-48.0.0-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:7ce4bfae76319a532a2dc68f82cc32f5676ee792a983187dac07183690e5c66f", size = 4696228, upload-time = "2026-05-04T22:58:48.22Z" },
{ url = "https://files.pythonhosted.org/packages/ea/73/f808fbae9514bd91b47875b003f13e284c8c6bdfd904b7944e803937eec1/cryptography-48.0.0-cp39-abi3-manylinux_2_28_aarch64.whl", hash = "sha256:2eb992bbd4661238c5a397594c83f5b4dc2bc5b848c365c8f991b6780efcc5c7", size = 4689097, upload-time = "2026-05-04T22:58:50.9Z" },
{ url = "https://files.pythonhosted.org/packages/93/01/d86632d7d28db8ae83221995752eeb6639ffb374c2d22955648cf8d52797/cryptography-48.0.0-cp39-abi3-manylinux_2_28_ppc64le.whl", hash = "sha256:22a5cb272895dce158b2cacdfdc3debd299019659f42947dbdac6f32d68fe832", size = 5283582, upload-time = "2026-05-04T22:58:53.017Z" },
{ url = "https://files.pythonhosted.org/packages/02/e1/50edc7a50334807cc4791fc4a0ce7468b4a1416d9138eab358bfc9a3d70b/cryptography-48.0.0-cp39-abi3-manylinux_2_28_x86_64.whl", hash = "sha256:2b4d59804e8408e2fea7d1fbaf218e5ec984325221db76e6a241a9abd6cdd95c", size = 4730479, upload-time = "2026-05-04T22:58:55.611Z" },
{ url = "https://files.pythonhosted.org/packages/6f/af/99a582b1b1641ff5911ac559beb45097cf79efd4ead4657f578ef1af2d47/cryptography-48.0.0-cp39-abi3-manylinux_2_31_armv7l.whl", hash = "sha256:984a20b0f62a26f48a3396c72e4bc34c66e356d356bf370053066b3b6d54634a", size = 4326481, upload-time = "2026-05-04T22:58:57.607Z" },
{ url = "https://files.pythonhosted.org/packages/90/ee/89aa26a06ef0a7d7611788ffd571a7c50e368cc6a4d5eef8b4884e866edb/cryptography-48.0.0-cp39-abi3-manylinux_2_34_aarch64.whl", hash = "sha256:5a5ed8fde7a1d09376ca0b40e68cd59c69fe23b1f9768bd5824f54681626032a", size = 4688713, upload-time = "2026-05-04T22:59:00.077Z" },
{ url = "https://files.pythonhosted.org/packages/70/ba/bcb1b0bb7a33d4c7c0c4d4c7874b4a62ae4f56113a5f4baefa362dfb1f0f/cryptography-48.0.0-cp39-abi3-manylinux_2_34_ppc64le.whl", hash = "sha256:8cd666227ef7af430aa5914a9910e0ddd703e75f039cef0825cd0da71b6b711a", size = 5238165, upload-time = "2026-05-04T22:59:02.317Z" },
{ url = "https://files.pythonhosted.org/packages/c9/70/ca4003b1ce5ca3dc3186ada51908c8a9b9ff7d5cab83cc0d43ee14ec144f/cryptography-48.0.0-cp39-abi3-manylinux_2_34_x86_64.whl", hash = "sha256:9071196d81abc88b3516ac8cdfad32e2b66dd4a5393a8e68a961e9161ddc6239", size = 4729947, upload-time = "2026-05-04T22:59:05.255Z" },
{ url = "https://files.pythonhosted.org/packages/44/a0/4ec7cf774207905aef1a8d11c3750d5a1db805eb380ee4e16df317870128/cryptography-48.0.0-cp39-abi3-musllinux_1_2_aarch64.whl", hash = "sha256:1e2d54c8be6152856a36f0882ab231e70f8ec7f14e93cf87db8a2ed056bf160c", size = 4822059, upload-time = "2026-05-04T22:59:07.802Z" },
{ url = "https://files.pythonhosted.org/packages/1e/75/a2e55f99c16fcac7b5d6c1eb19ad8e00799854d6be5ca845f9259eae1681/cryptography-48.0.0-cp39-abi3-musllinux_1_2_x86_64.whl", hash = "sha256:a5da777e32ffed6f85a7b2b3f7c5cbc88c146bfcd0a1d7baf5fcc6c52ee35dd4", size = 4960575, upload-time = "2026-05-04T22:59:09.851Z" },
{ url = "https://files.pythonhosted.org/packages/b8/23/6e6f32143ab5d8b36ca848a502c4bcd477ae75b9e1677e3530d669062578/cryptography-48.0.0-cp39-abi3-win32.whl", hash = "sha256:77a2ccbbe917f6710e05ba9adaa25fb5075620bf3ea6fb751997875aff4ae4bd", size = 3279117, upload-time = "2026-05-04T22:59:12.019Z" },
{ url = "https://files.pythonhosted.org/packages/9d/9a/0fea98a70cf1749d41d738836f6349d97945f7c89433a259a6c2642eefeb/cryptography-48.0.0-cp39-abi3-win_amd64.whl", hash = "sha256:16cd65b9330583e4619939b3a3843eec1e6e789744bb01e7c7e2e62e33c239c8", size = 3792100, upload-time = "2026-05-04T22:59:14.884Z" },
]
[[package]]
name = "dnspython"
version = "2.8.0"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/8c/8b/57666417c0f90f08bcafa776861060426765fdb422eb10212086fb811d26/dnspython-2.8.0.tar.gz", hash = "sha256:181d3c6996452cb1189c4046c61599b84a5a86e099562ffde77d26984ff26d0f", size = 368251, upload-time = "2025-09-07T18:58:00.022Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/ba/5a/18ad964b0086c6e62e2e7500f7edc89e3faa45033c71c1893d34eed2b2de/dnspython-2.8.0-py3-none-any.whl", hash = "sha256:01d9bbc4a2d76bf0db7c1f729812ded6d912bd318d3b1cf81d30c0f845dbf3af", size = 331094, upload-time = "2025-09-07T18:57:58.071Z" },
]
[[package]]
name = "durationpy"
version = "0.10"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/9d/a4/e44218c2b394e31a6dd0d6b095c4e1f32d0be54c2a4b250032d717647bab/durationpy-0.10.tar.gz", hash = "sha256:1fa6893409a6e739c9c72334fc65cca1f355dbdd93405d30f726deb5bde42fba", size = 3335, upload-time = "2025-05-17T13:52:37.26Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/b0/0d/9feae160378a3553fa9a339b0e9c1a048e147a4127210e286ef18b730f03/durationpy-0.10-py3-none-any.whl", hash = "sha256:3b41e1b601234296b4fb368338fdcd3e13e0b4fb5b67345948f4f2bf9868b286", size = 3922, upload-time = "2025-05-17T13:52:36.463Z" },
]
[[package]]
name = "idna"
version = "3.13"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/ce/cc/762dfb036166873f0059f3b7de4565e1b5bc3d6f28a414c13da27e442f99/idna-3.13.tar.gz", hash = "sha256:585ea8fe5d69b9181ec1afba340451fba6ba764af97026f92a91d4eef164a242", size = 194210, upload-time = "2026-04-22T16:42:42.314Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/5d/13/ad7d7ca3808a898b4612b6fe93cde56b53f3034dcde235acb1f0e1df24c6/idna-3.13-py3-none-any.whl", hash = "sha256:892ea0cde124a99ce773decba204c5552b69c3c67ffd5f232eb7696135bc8bb3", size = 68629, upload-time = "2026-04-22T16:42:40.909Z" },
]
[[package]]
name = "jinja2"
version = "3.1.6"
source = { registry = "https://pypi.org/simple" }
dependencies = [
{ name = "markupsafe" },
]
sdist = { url = "https://files.pythonhosted.org/packages/df/bf/f7da0350254c0ed7c72f3e33cef02e048281fec7ecec5f032d4aac52226b/jinja2-3.1.6.tar.gz", hash = "sha256:0137fb05990d35f1275a587e9aee6d56da821fc83491a0fb838183be43f66d6d", size = 245115, upload-time = "2025-03-05T20:05:02.478Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/62/a1/3d680cbfd5f4b8f15abc1d571870c5fc3e594bb582bc3b64ea099db13e56/jinja2-3.1.6-py3-none-any.whl", hash = "sha256:85ece4451f492d0c13c5dd7c13a64681a86afae63a5f347908daf103ce6d2f67", size = 134899, upload-time = "2025-03-05T20:05:00.369Z" },
]
[[package]]
name = "jmespath"
version = "1.1.0"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/d3/59/322338183ecda247fb5d1763a6cbe46eff7222eaeebafd9fa65d4bf5cb11/jmespath-1.1.0.tar.gz", hash = "sha256:472c87d80f36026ae83c6ddd0f1d05d4e510134ed462851fd5f754c8c3cbb88d", size = 27377, upload-time = "2026-01-22T16:35:26.279Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/14/2f/967ba146e6d58cf6a652da73885f52fc68001525b4197effc174321d70b4/jmespath-1.1.0-py3-none-any.whl", hash = "sha256:a5663118de4908c91729bea0acadca56526eb2698e83de10cd116ae0f4e97c64", size = 20419, upload-time = "2026-01-22T16:35:24.919Z" },
]
[[package]]
name = "kubernetes"
version = "35.0.0"
source = { registry = "https://pypi.org/simple" }
dependencies = [
{ name = "certifi" },
{ name = "durationpy" },
{ name = "python-dateutil" },
{ name = "pyyaml" },
{ name = "requests" },
{ name = "requests-oauthlib" },
{ name = "six" },
{ name = "urllib3" },
{ name = "websocket-client" },
]
sdist = { url = "https://files.pythonhosted.org/packages/2c/8f/85bf51ad4150f64e8c665daf0d9dfe9787ae92005efb9a4d1cba592bd79d/kubernetes-35.0.0.tar.gz", hash = "sha256:3d00d344944239821458b9efd484d6df9f011da367ecb155dadf9513f05f09ee", size = 1094642, upload-time = "2026-01-16T01:05:27.76Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/0c/70/05b685ea2dffcb2adbf3cdcea5d8865b7bc66f67249084cf845012a0ff13/kubernetes-35.0.0-py2.py3-none-any.whl", hash = "sha256:39e2b33b46e5834ef6c3985ebfe2047ab39135d41de51ce7641a7ca5b372a13d", size = 2017602, upload-time = "2026-01-16T01:05:25.991Z" },
]
[[package]]
name = "markupsafe"
version = "3.0.3"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/7e/99/7690b6d4034fffd95959cbe0c02de8deb3098cc577c67bb6a24fe5d7caa7/markupsafe-3.0.3.tar.gz", hash = "sha256:722695808f4b6457b320fdc131280796bdceb04ab50fe1795cd540799ebe1698", size = 80313, upload-time = "2025-09-27T18:37:40.426Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/5a/72/147da192e38635ada20e0a2e1a51cf8823d2119ce8883f7053879c2199b5/markupsafe-3.0.3-cp312-cp312-macosx_10_13_x86_64.whl", hash = "sha256:d53197da72cc091b024dd97249dfc7794d6a56530370992a5e1a08983ad9230e", size = 11615, upload-time = "2025-09-27T18:36:30.854Z" },
{ url = "https://files.pythonhosted.org/packages/9a/81/7e4e08678a1f98521201c3079f77db69fb552acd56067661f8c2f534a718/markupsafe-3.0.3-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:1872df69a4de6aead3491198eaf13810b565bdbeec3ae2dc8780f14458ec73ce", size = 12020, upload-time = "2025-09-27T18:36:31.971Z" },
{ url = "https://files.pythonhosted.org/packages/1e/2c/799f4742efc39633a1b54a92eec4082e4f815314869865d876824c257c1e/markupsafe-3.0.3-cp312-cp312-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:3a7e8ae81ae39e62a41ec302f972ba6ae23a5c5396c8e60113e9066ef893da0d", size = 24332, upload-time = "2025-09-27T18:36:32.813Z" },
{ url = "https://files.pythonhosted.org/packages/3c/2e/8d0c2ab90a8c1d9a24f0399058ab8519a3279d1bd4289511d74e909f060e/markupsafe-3.0.3-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:d6dd0be5b5b189d31db7cda48b91d7e0a9795f31430b7f271219ab30f1d3ac9d", size = 22947, upload-time = "2025-09-27T18:36:33.86Z" },
{ url = "https://files.pythonhosted.org/packages/2c/54/887f3092a85238093a0b2154bd629c89444f395618842e8b0c41783898ea/markupsafe-3.0.3-cp312-cp312-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:94c6f0bb423f739146aec64595853541634bde58b2135f27f61c1ffd1cd4d16a", size = 21962, upload-time = "2025-09-27T18:36:35.099Z" },
{ url = "https://files.pythonhosted.org/packages/c9/2f/336b8c7b6f4a4d95e91119dc8521402461b74a485558d8f238a68312f11c/markupsafe-3.0.3-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:be8813b57049a7dc738189df53d69395eba14fb99345e0a5994914a3864c8a4b", size = 23760, upload-time = "2025-09-27T18:36:36.001Z" },
{ url = "https://files.pythonhosted.org/packages/32/43/67935f2b7e4982ffb50a4d169b724d74b62a3964bc1a9a527f5ac4f1ee2b/markupsafe-3.0.3-cp312-cp312-musllinux_1_2_riscv64.whl", hash = "sha256:83891d0e9fb81a825d9a6d61e3f07550ca70a076484292a70fde82c4b807286f", size = 21529, upload-time = "2025-09-27T18:36:36.906Z" },
{ url = "https://files.pythonhosted.org/packages/89/e0/4486f11e51bbba8b0c041098859e869e304d1c261e59244baa3d295d47b7/markupsafe-3.0.3-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:77f0643abe7495da77fb436f50f8dab76dbc6e5fd25d39589a0f1fe6548bfa2b", size = 23015, upload-time = "2025-09-27T18:36:37.868Z" },
{ url = "https://files.pythonhosted.org/packages/2f/e1/78ee7a023dac597a5825441ebd17170785a9dab23de95d2c7508ade94e0e/markupsafe-3.0.3-cp312-cp312-win32.whl", hash = "sha256:d88b440e37a16e651bda4c7c2b930eb586fd15ca7406cb39e211fcff3bf3017d", size = 14540, upload-time = "2025-09-27T18:36:38.761Z" },
{ url = "https://files.pythonhosted.org/packages/aa/5b/bec5aa9bbbb2c946ca2733ef9c4ca91c91b6a24580193e891b5f7dbe8e1e/markupsafe-3.0.3-cp312-cp312-win_amd64.whl", hash = "sha256:26a5784ded40c9e318cfc2bdb30fe164bdb8665ded9cd64d500a34fb42067b1c", size = 15105, upload-time = "2025-09-27T18:36:39.701Z" },
{ url = "https://files.pythonhosted.org/packages/e5/f1/216fc1bbfd74011693a4fd837e7026152e89c4bcf3e77b6692fba9923123/markupsafe-3.0.3-cp312-cp312-win_arm64.whl", hash = "sha256:35add3b638a5d900e807944a078b51922212fb3dedb01633a8defc4b01a3c85f", size = 13906, upload-time = "2025-09-27T18:36:40.689Z" },
]
[[package]]
name = "oauthlib"
version = "3.3.1"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/0b/5f/19930f824ffeb0ad4372da4812c50edbd1434f678c90c2733e1188edfc63/oauthlib-3.3.1.tar.gz", hash = "sha256:0f0f8aa759826a193cf66c12ea1af1637f87b9b4622d46e866952bb022e538c9", size = 185918, upload-time = "2025-06-19T22:48:08.269Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/be/9c/92789c596b8df838baa98fa71844d84283302f7604ed565dafe5a6b5041a/oauthlib-3.3.1-py3-none-any.whl", hash = "sha256:88119c938d2b8fb88561af5f6ee0eec8cc8d552b7bb1f712743136eb7523b7a1", size = 160065, upload-time = "2025-06-19T22:48:06.508Z" },
]
[[package]]
name = "packaging"
version = "26.2"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/d7/f1/e7a6dd94a8d4a5626c03e4e99c87f241ba9e350cd9e6d75123f992427270/packaging-26.2.tar.gz", hash = "sha256:ff452ff5a3e828ce110190feff1178bb1f2ea2281fa2075aadb987c2fb221661", size = 228134, upload-time = "2026-04-24T20:15:23.917Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/df/b2/87e62e8c3e2f4b32e5fe99e0b86d576da1312593b39f47d8ceef365e95ed/packaging-26.2-py3-none-any.whl", hash = "sha256:5fc45236b9446107ff2415ce77c807cee2862cb6fac22b8a73826d0693b0980e", size = 100195, upload-time = "2026-04-24T20:15:22.081Z" },
]
[[package]]
name = "pycparser"
version = "3.0"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/1b/7d/92392ff7815c21062bea51aa7b87d45576f649f16458d78b7cf94b9ab2e6/pycparser-3.0.tar.gz", hash = "sha256:600f49d217304a5902ac3c37e1281c9fe94e4d0489de643a9504c5cdfdfc6b29", size = 103492, upload-time = "2026-01-21T14:26:51.89Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/0c/c3/44f3fbbfa403ea2a7c779186dc20772604442dde72947e7d01069cbe98e3/pycparser-3.0-py3-none-any.whl", hash = "sha256:b727414169a36b7d524c1c3e31839a521725078d7b2ff038656844266160a992", size = 48172, upload-time = "2026-01-21T14:26:50.693Z" },
]
[[package]]
name = "python-dateutil"
version = "2.9.0.post0"
source = { registry = "https://pypi.org/simple" }
dependencies = [
{ name = "six" },
]
sdist = { url = "https://files.pythonhosted.org/packages/66/c0/0c8b6ad9f17a802ee498c46e004a0eb49bc148f2fd230864601a86dcf6db/python-dateutil-2.9.0.post0.tar.gz", hash = "sha256:37dd54208da7e1cd875388217d5e00ebd4179249f90fb72437e91a35459a0ad3", size = 342432, upload-time = "2024-03-01T18:36:20.211Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/ec/57/56b9bcc3c9c6a792fcbaf139543cee77261f3651ca9da0c93f5c1221264b/python_dateutil-2.9.0.post0-py2.py3-none-any.whl", hash = "sha256:a8b2bc7bffae282281c8140a97d3aa9c14da0b136dfe83f850eea9a5f7470427", size = 229892, upload-time = "2024-03-01T18:36:18.57Z" },
]
[[package]]
name = "pyyaml"
version = "6.0.3"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/05/8e/961c0007c59b8dd7729d542c61a4d537767a59645b82a0b521206e1e25c2/pyyaml-6.0.3.tar.gz", hash = "sha256:d76623373421df22fb4cf8817020cbb7ef15c725b9d5e45f17e189bfc384190f", size = 130960, upload-time = "2025-09-25T21:33:16.546Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/d1/33/422b98d2195232ca1826284a76852ad5a86fe23e31b009c9886b2d0fb8b2/pyyaml-6.0.3-cp312-cp312-macosx_10_13_x86_64.whl", hash = "sha256:7f047e29dcae44602496db43be01ad42fc6f1cc0d8cd6c83d342306c32270196", size = 182063, upload-time = "2025-09-25T21:32:11.445Z" },
{ url = "https://files.pythonhosted.org/packages/89/a0/6cf41a19a1f2f3feab0e9c0b74134aa2ce6849093d5517a0c550fe37a648/pyyaml-6.0.3-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:fc09d0aa354569bc501d4e787133afc08552722d3ab34836a80547331bb5d4a0", size = 173973, upload-time = "2025-09-25T21:32:12.492Z" },
{ url = "https://files.pythonhosted.org/packages/ed/23/7a778b6bd0b9a8039df8b1b1d80e2e2ad78aa04171592c8a5c43a56a6af4/pyyaml-6.0.3-cp312-cp312-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:9149cad251584d5fb4981be1ecde53a1ca46c891a79788c0df828d2f166bda28", size = 775116, upload-time = "2025-09-25T21:32:13.652Z" },
{ url = "https://files.pythonhosted.org/packages/65/30/d7353c338e12baef4ecc1b09e877c1970bd3382789c159b4f89d6a70dc09/pyyaml-6.0.3-cp312-cp312-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:5fdec68f91a0c6739b380c83b951e2c72ac0197ace422360e6d5a959d8d97b2c", size = 844011, upload-time = "2025-09-25T21:32:15.21Z" },
{ url = "https://files.pythonhosted.org/packages/8b/9d/b3589d3877982d4f2329302ef98a8026e7f4443c765c46cfecc8858c6b4b/pyyaml-6.0.3-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:ba1cc08a7ccde2d2ec775841541641e4548226580ab850948cbfda66a1befcdc", size = 807870, upload-time = "2025-09-25T21:32:16.431Z" },
{ url = "https://files.pythonhosted.org/packages/05/c0/b3be26a015601b822b97d9149ff8cb5ead58c66f981e04fedf4e762f4bd4/pyyaml-6.0.3-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:8dc52c23056b9ddd46818a57b78404882310fb473d63f17b07d5c40421e47f8e", size = 761089, upload-time = "2025-09-25T21:32:17.56Z" },
{ url = "https://files.pythonhosted.org/packages/be/8e/98435a21d1d4b46590d5459a22d88128103f8da4c2d4cb8f14f2a96504e1/pyyaml-6.0.3-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:41715c910c881bc081f1e8872880d3c650acf13dfa8214bad49ed4cede7c34ea", size = 790181, upload-time = "2025-09-25T21:32:18.834Z" },
{ url = "https://files.pythonhosted.org/packages/74/93/7baea19427dcfbe1e5a372d81473250b379f04b1bd3c4c5ff825e2327202/pyyaml-6.0.3-cp312-cp312-win32.whl", hash = "sha256:96b533f0e99f6579b3d4d4995707cf36df9100d67e0c8303a0c55b27b5f99bc5", size = 137658, upload-time = "2025-09-25T21:32:20.209Z" },
{ url = "https://files.pythonhosted.org/packages/86/bf/899e81e4cce32febab4fb42bb97dcdf66bc135272882d1987881a4b519e9/pyyaml-6.0.3-cp312-cp312-win_amd64.whl", hash = "sha256:5fcd34e47f6e0b794d17de1b4ff496c00986e1c83f7ab2fb8fcfe9616ff7477b", size = 154003, upload-time = "2025-09-25T21:32:21.167Z" },
{ url = "https://files.pythonhosted.org/packages/1a/08/67bd04656199bbb51dbed1439b7f27601dfb576fb864099c7ef0c3e55531/pyyaml-6.0.3-cp312-cp312-win_arm64.whl", hash = "sha256:64386e5e707d03a7e172c0701abfb7e10f0fb753ee1d773128192742712a98fd", size = 140344, upload-time = "2025-09-25T21:32:22.617Z" },
]
[[package]]
name = "requests"
version = "2.33.1"
source = { registry = "https://pypi.org/simple" }
dependencies = [
{ name = "certifi" },
{ name = "charset-normalizer" },
{ name = "idna" },
{ name = "urllib3" },
]
sdist = { url = "https://files.pythonhosted.org/packages/5f/a4/98b9c7c6428a668bf7e42ebb7c79d576a1c3c1e3ae2d47e674b468388871/requests-2.33.1.tar.gz", hash = "sha256:18817f8c57c6263968bc123d237e3b8b08ac046f5456bd1e307ee8f4250d3517", size = 134120, upload-time = "2026-03-30T16:09:15.531Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/d7/8e/7540e8a2036f79a125c1d2ebadf69ed7901608859186c856fa0388ef4197/requests-2.33.1-py3-none-any.whl", hash = "sha256:4e6d1ef462f3626a1f0a0a9c42dd93c63bad33f9f1c1937509b8c5c8718ab56a", size = 64947, upload-time = "2026-03-30T16:09:13.83Z" },
]
[[package]]
name = "requests-oauthlib"
version = "2.0.0"
source = { registry = "https://pypi.org/simple" }
dependencies = [
{ name = "oauthlib" },
{ name = "requests" },
]
sdist = { url = "https://files.pythonhosted.org/packages/42/f2/05f29bc3913aea15eb670be136045bf5c5bbf4b99ecb839da9b422bb2c85/requests-oauthlib-2.0.0.tar.gz", hash = "sha256:b3dffaebd884d8cd778494369603a9e7b58d29111bf6b41bdc2dcd87203af4e9", size = 55650, upload-time = "2024-03-22T20:32:29.939Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/3b/5d/63d4ae3b9daea098d5d6f5da83984853c1bbacd5dc826764b249fe119d24/requests_oauthlib-2.0.0-py2.py3-none-any.whl", hash = "sha256:7dd8a5c40426b779b0868c404bdef9768deccf22749cde15852df527e6269b36", size = 24179, upload-time = "2024-03-22T20:32:28.055Z" },
]
[[package]]
name = "resolvelib"
version = "1.2.1"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/1d/14/4669927e06631070edb968c78fdb6ce8992e27c9ab2cde4b3993e22ac7af/resolvelib-1.2.1.tar.gz", hash = "sha256:7d08a2022f6e16ce405d60b68c390f054efcfd0477d4b9bd019cc941c28fad1c", size = 24575, upload-time = "2025-10-11T01:07:44.582Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/e2/23/c941a0d0353681ca138489983c4309e0f5095dfd902e1357004f2357ddf2/resolvelib-1.2.1-py3-none-any.whl", hash = "sha256:fb06b66c8da04172d9e72a21d7d06186d8919e32ae5ab5cdf5b9d920be805ac2", size = 18737, upload-time = "2025-10-11T01:07:43.081Z" },
]
[[package]]
name = "six"
version = "1.17.0"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/94/e7/b2c673351809dca68a0e064b6af791aa332cf192da575fd474ed7d6f16a2/six-1.17.0.tar.gz", hash = "sha256:ff70335d468e7eb6ec65b95b99d3a2836546063f63acc5171de367e834932a81", size = 34031, upload-time = "2024-12-04T17:35:28.174Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/b7/ce/149a00dd41f10bc29e5921b496af8b574d8413afcd5e30dfa0ed46c2cc5e/six-1.17.0-py2.py3-none-any.whl", hash = "sha256:4721f391ed90541fddacab5acf947aa0d3dc7d27b2e1e8eda2be8970586c3274", size = 11050, upload-time = "2024-12-04T17:35:26.475Z" },
]
[[package]]
name = "urllib3"
version = "2.6.3"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/c7/24/5f1b3bdffd70275f6661c76461e25f024d5a38a46f04aaca912426a2b1d3/urllib3-2.6.3.tar.gz", hash = "sha256:1b62b6884944a57dbe321509ab94fd4d3b307075e0c2eae991ac71ee15ad38ed", size = 435556, upload-time = "2026-01-07T16:24:43.925Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/39/08/aaaad47bc4e9dc8c725e68f9d04865dbcb2052843ff09c97b08904852d84/urllib3-2.6.3-py3-none-any.whl", hash = "sha256:bf272323e553dfb2e87d9bfd225ca7b0f467b919d7bbd355436d3fd37cb0acd4", size = 131584, upload-time = "2026-01-07T16:24:42.685Z" },
]
[[package]]
name = "websocket-client"
version = "1.9.0"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/2c/41/aa4bf9664e4cda14c3b39865b12251e8e7d239f4cd0e3cc1b6c2ccde25c1/websocket_client-1.9.0.tar.gz", hash = "sha256:9e813624b6eb619999a97dc7958469217c3176312b3a16a4bd1bc7e08a46ec98", size = 70576, upload-time = "2025-10-07T21:16:36.495Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/34/db/b10e48aa8fff7407e67470363eac595018441cf32d5e1001567a7aeba5d2/websocket_client-1.9.0-py3-none-any.whl", hash = "sha256:af248a825037ef591efbf6ed20cc5faa03d3b47b9e5a2230a529eeee1c1fc3ef", size = 82616, upload-time = "2025-10-07T21:16:34.951Z" },
]

@@ -0,0 +1,84 @@
[vibe](../README.md) > [ADR](README.md) > **0001 · Safe, production-like environment**
# ADR-0001: Safe, production-like environment for the lab
> **Status**: Accepted
> **Date**: 2026-06-23
> **Deciders**: @arcodange
## Context
The Arcodange lab doubles as company production. The same three Raspberry Pis and one MacBook control node run the public CMS (`arcodange.fr`), Zoho-backed business email, the Dolibarr ERP holding accounting and business records, the url-shortener, the telegram-gateway, and the dance-lessons-coach. A single operator administers all of it from one laptop holding the kubeconfig, the Vault root token, and every cloud admin token.
There is no separation between *where I experiment* and *where the business runs*. Every risky change is tested directly in production. The 2026-04-13 power-cut proved recovery is manual, multi-step, and only ever validated by a real incident — never rehearsed.
The danger is concentrated in a handful of change-classes. Each one can cause silent, fleet-wide, or data-losing damage when applied to the live environment:
| Change-class | Blast radius if wrong |
| --- | --- |
| Ansible playbook edits | Can wipe disks, reset k3s, or corrupt Longhorn across the fleet. |
| Vault policy / auth / mount changes | Lock out the Vault Secrets Operator → fleet-wide secret outage; a botched init could overwrite the single unseal key. |
| Postgres migrations / role changes | The superuser provider on `192.168.1.202` can drop or alter live databases → ERP data loss. |
| ArgoCD sync / app-of-apps changes | `prune` + `selfHeal` auto-prunes live resources fleet-wide. |
| Cloudflare / DNS / email changes | A wrong MX/SPF/DKIM silently breaks `arcodange.fr` mail for days. |
| Longhorn / storage ops | Volume recreation orphans replicas via new engine IDs. |
| Recovery drills | The runbook is only validated by real incidents, never rehearsed. |
| Cert / PKI re-init | Rotates the internal CA, invalidating every issued `*.arcodange.lab` cert. |
A change-management process is not enough: the operator needs a place to *make the mistake first*, where the mistake cannot reach production.
## Decision
We will build a **local-only safe environment on the MacBook control node**, seeded from the *same* GitOps repos via a dedicated sandbox inventory, with two modes:
- **(a) k3d single-node "fast inner loop"** (~60s bring-up) for app, Vault, and ArgoCD iteration.
- **(b) Three arm64 VMs** (multipass or Vagrant on the M4) reproducing the three-node topology — Postgres + Gitea as docker-compose *outside* k3s on the "pi2-equivalent" VM, Longhorn across the three VM disks — for Ansible, Longhorn, and recovery work.
The load-bearing requirement is the **isolation boundary**: the sandbox must be *unable* to mutate real production even on a wrong command. Each production coupling maps to a concrete guardrail — a separate sandbox inventory with a prod-IP abort guard, sandbox GCS state prefixes, a separate sandbox Vault with its own unseal-key path, a sandbox Postgres host check, and plan-only DNS against a throwaway zone. The real `arcodange.fr` Cloudflare/Zoho tokens are never exported into a sandbox shell. Because the `<app>` convention keys everything *within* a cluster/Vault/DB, the sandbox reuses identical `<app>` names with no collision — the boundary is the cluster + Vault + state + DNS zone, not the names, so runbooks read identically in both environments.
See the [isolation-boundary leaf](../PRD/safe-prod-like-environment/isolation-boundary.md) for the full coupling→control mapping, and the [PRD](../PRD/safe-prod-like-environment/README.md) for the complete product view.
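As an illustration of the prod-IP abort guard named above, here is a minimal sketch. It assumes the guard receives the sandbox inventory's host strings and compares any IP literals against a known production set. The names `PROD_ADDRESSES`, `ProdTargetError`, and `assert_sandbox_only` are hypothetical, and the only production address taken from this ADR is `192.168.1.202`; the real guard may be implemented differently (for example inside the Ansible inventory itself).

```python
import ipaddress

# Hypothetical production set. Only 192.168.1.202 is named in this ADR;
# the real list would also carry the three Pi addresses.
PROD_ADDRESSES = {ipaddress.ip_address("192.168.1.202")}


class ProdTargetError(RuntimeError):
    """Raised when a sandbox run would target a production address."""


def assert_sandbox_only(hosts, prod_addresses=PROD_ADDRESSES):
    """Return True if no host is a production IP, otherwise raise.

    Hostnames that are not IP literals are skipped in this sketch; a fuller
    guard would probably also check a production hostname list.
    """
    offenders = []
    for host in hosts:
        try:
            addr = ipaddress.ip_address(host)
        except ValueError:
            continue  # not an IP literal, nothing to compare
        if addr in prod_addresses:
            offenders.append(host)
    if offenders:
        raise ProdTargetError(
            f"refusing to run: production address(es) in inventory: {offenders}"
        )
    return True
```

Calling the guard before any playbook or migration step makes a wrong inventory fail closed rather than proceed.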
## Consequences
- **+** $0 inner loop that runs on the existing control node — no new hardware.
- **+** Rehearses every dangerous change-class except real public DNS/email.
- **+** The 2026-04-13 recovery sequence becomes a repeatable drill instead of a once-per-incident gamble.
- **+** Identical `<app>` names mean runbooks are environment-agnostic.
- **−** x86/ARM nuance must be handled (use arm64 VMs/images on the M4).
- **−** New guardrail and parity-manifest maintenance burden.
- **−** Single-laptop resource limits: k3d for speed, VMs only when multi-node fidelity is actually needed.
- **→** Real public DNS/ACME and physical-ARM always-on testing remain unsolved by design; revisit only if recurring game-days demand them.
## Alternatives considered
| Option | Fidelity | Isolation | $ / effort | Verdict |
| --- | --- | --- | --- | --- |
| 1 · Ephemeral local cluster (k3d/kind) | Medium (single-node) | Full (separate cluster) | $0 / low | ✅ **Chosen** as the fast mode. |
| 2 · Three arm64 VMs reproducing the topology | High (3-node, PG+Gitea outside k3s, Longhorn) | Full (separate VMs) | $0 / medium | ✅ **Chosen** for fidelity. |
| 3 · Sandbox namespace on the real cluster | High | None — shared Vault/PG/Longhorn/ArgoCD | $0 / low | ❌ **Rejected**: shared blast radius fails the core isolation requirement. |
| 4 · Dedicated physical node (4th Pi / mini-PC) | High (real ARM, always-on) | Full | $$ hardware / medium | ⛔ **Out of scope**: hardware cost; revisit only for recurring always-on ARM game-days. |
| 5 · Disposable cloud k3s for real public DNS/ACME | High infra, but arch drift | Full | $ recurring / medium | ⛔ **Out of scope**: cost + ARM drift; its only unique value is real DNS/email, which we explicitly do not test. |
## QA & validation
- **Parity manifest** — k3s `v1.34.3+k3s1`, same Longhorn/Vault/VSO, same app-of-apps list, PG + Gitea outside k3s, three nodes. Any drift from this manifest is a failure.
- **Provisioning-parity test** — run the new-web-app runbook for a throwaway `<app>` "canary" and assert the convention chain resolves end to end: Gitea repo → PG db + role → Vault creds + policies → ArgoCD Healthy/Synced → VSO injects → pod Running.
- **Idempotence gate** — `ansible-playbook --check --diff` reports `changed=0` on the converged sandbox *before* any change is promoted to prod.
- **`tofu plan` diff gate** — plan against sandbox state; for DNS, assert it touches *only* the throwaway zone.
- **Chaos drills** (mapped to CLUSTER_RECOVERY.md sections): node-kill; Vault-seal (unseal via the *sandbox* key; VSO re-auths); Longhorn volume corruption (run `recover/longhorn*.yml`, validate engine-ID re-association per the [Longhorn PVC recovery ADR](../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md)); DB drop/restore; full power-cut simulation (execute CLUSTER_RECOVERY.md top-to-bottom against the sandbox to green); ArgoCD bad-sync (observe prune/selfHeal, author a rollback runbook section); cert/PKI re-issue.
- **Monthly game-day** — the operator follows *only* the runbook; any improvised step becomes a runbook PR. This is how CLUSTER_RECOVERY.md gets validated against the sandbox instead of waiting for the next real incident.
- **Promotion gate** — no infra/Vault/storage/DNS change reaches prod until it has been applied to the sandbox *and* survived the matching drill. Each drill records time-to-recover and which step failed or was improvised.
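The idempotence gate in the list above could be automated roughly as follows. This is a sketch, assuming the captured `ansible-playbook --check --diff` output is plain (uncoloured) text ending in the usual `PLAY RECAP` section; the function names `recap_counters` and `is_converged` are hypothetical. It also treats `failed` and `unreachable` hosts as a gate failure, which goes slightly beyond the `changed=0` wording.

```python
import re

# One recap line looks like: "vm1 : ok=12 changed=0 unreachable=0 failed=0 ..."
RECAP_LINE = re.compile(r"^(?P<host>\S+)\s*:\s*(?P<counters>(?:\w+=\d+\s*)+)$")


def recap_counters(output):
    """Parse the PLAY RECAP section into {host: {counter: int}}."""
    results = {}
    in_recap = False
    for line in output.splitlines():
        if line.startswith("PLAY RECAP"):
            in_recap = True
            continue
        if not in_recap:
            continue
        match = RECAP_LINE.match(line.strip())
        if match:
            pairs = (item.split("=") for item in match.group("counters").split())
            results[match.group("host")] = {key: int(value) for key, value in pairs}
    return results


def is_converged(output):
    """True only if a recap exists and every host reports no changes or errors."""
    recap = recap_counters(output)
    return bool(recap) and all(
        counters.get("changed", 0) == 0
        and counters.get("failed", 0) == 0
        and counters.get("unreachable", 0) == 0
        for counters in recap.values()
    )
```

An empty or missing recap counts as a failure here, so a run that crashed before reporting cannot pass the gate by accident.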
See the [QA strategy leaf](../PRD/safe-prod-like-environment/qa-strategy.md) for the detailed drill table and evidence trail.
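For the `tofu plan` diff gate, one possible check is to render the saved plan as JSON (`tofu show -json <planfile>`) and confirm that every DNS-related change stays inside the throwaway zone. The sketch below relies on the `resource_changes` list of that JSON output. The `cloudflare_` type prefix and the `zone_id` attribute name are assumptions about how the DNS resources are declared, and `dns_changes_outside_zone` is a hypothetical helper name.

```python
import json


def dns_changes_outside_zone(plan_json, allowed_zone_id,
                             type_prefix="cloudflare_", zone_attr="zone_id"):
    """Return addresses of planned changes that touch a zone other than the allowed one.

    Values unknown at plan time are absent from the JSON and are therefore
    not checked by this sketch.
    """
    plan = json.loads(plan_json)
    offenders = []
    for resource in plan.get("resource_changes", []):
        if not resource.get("type", "").startswith(type_prefix):
            continue
        change = resource.get("change", {})
        if change.get("actions", []) in (["no-op"], ["read"]):
            continue  # nothing would be mutated
        for side in ("before", "after"):
            state = change.get(side) or {}
            zone = state.get(zone_attr)
            if zone is not None and zone != allowed_zone_id:
                offenders.append(resource.get("address"))
                break
    return offenders
```

The gate passes when the returned list is empty; any address in it names a resource whose change would land outside the sandbox zone.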
## References
- [PRD · Safe, production-like environment](../PRD/safe-prod-like-environment/README.md) — full product view, requirements, and phased rollout.
- [INV-001 · Prod blast-radius couplings](../investigations/INV-001-prod-blast-radius-couplings.md) — the investigation that mapped every prod coupling.
- [Guidebook · Lab ecosystem](../guidebooks/lab-ecosystem/README.md) — the end-to-end map of the prod topology this sandbox mirrors.
- [Guidebook · Storage and recovery](../guidebooks/lab-ecosystem/storage-and-recovery.md) — how Longhorn and the recovery sequence work today.
- [doc/adr index](../../doc/adr/README.md) — foundational infrastructure ADRs (read-only history).
- [Longhorn PVC recovery ADR](../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md) — engine-ID re-association, exercised by the Longhorn chaos drill.
- [new-web-app conventions](../../doc/runbooks/new-web-app/conventions.md) — the `<app>` convention reused identically across sandbox and prod.
- CLUSTER_RECOVERY.md — the tested power-cut recovery sequence, lives at the lab root (outside this repo); the chaos drills rehearse it section by section.
- PRs: [#10 — bootstrap vibe/ tree + ecosystem AGENTS.md](https://gitea.arcodange.lab/arcodange-org/factory/pulls/10).

45
vibe/ADR/README.md Normal file
@@ -0,0 +1,45 @@
[vibe](../README.md) > **ADR**
# Architecture Decision Records
> **Status**: 🟢 Active
> **Last Updated**: 2026-06-23
> **Related**: [vibe/PRD](../PRD/README.md) · [vibe/Investigations](../investigations/README.md)
> **Historical**: [doc/adr](../../doc/adr/README.md) (foundational infra) · [ansible/.../docs/adr](../../ansible/arcodange/factory/docs/adr/) (dated infra ADRs)
`vibe/ADR/` is the **canonical home for Architecture Decision Records going forward**. The format is MADR-lite: one short, self-contained Markdown file per decision, focused on the *why* rather than the *how*. Use the [`_template.md`](_template.md) skeleton to start a new one.
## Where ADRs live
There are three ADR locations in this repo. Only the first accepts new records; the other two are read-only history kept for context.
| Location | Role | Accepts new ADRs? |
| --- | --- | --- |
| `vibe/ADR/` (this folder) | Canonical, MADR-lite, going forward | ✅ Yes |
| [`doc/adr/`](../../doc/adr/README.md) | Foundational infrastructure ADRs (DNS, k3s, CI/CD, Vault, telegram-gateway auth) | ❌ Historical |
| [`ansible/arcodange/factory/docs/adr/`](../../ansible/arcodange/factory/docs/adr/) | Dated infra ADRs (network, CI/CD, Longhorn PVC recovery, internal DNS) | ❌ Historical |
When a new decision *supersedes* one of the historical records, write the new ADR here, set the old one's status note to `Superseded by ADR-NNNN`, and cross-link both ways.
## Rules
- **One file per decision**, named `NNNN-kebab-title.md` (zero-padded sequence, e.g. `0001-safe-prod-like-environment.md`).
- **The body is immutable once `Accepted`.** A decision is a historical fact: do not rewrite the Context/Decision/Consequences after acceptance. The *only* mutation allowed on an accepted ADR is its **status** (e.g. flipping `Accepted` → `Superseded`).
- **Statuses**: `Proposed` (under discussion) → `Accepted` (decided, body frozen) → `Superseded` (replaced; points to the successor ADR). A `Proposed` ADR may still be edited freely.
- **No-tombstone rule.** Each file reads as currently true. Never leave "previously X, now Y", changelog lines, or "updated to ..." notes inside an ADR — git history is the audit trail. A superseded ADR keeps its original frozen body; the supersession is recorded only in its status line and the successor's References.
- **PR cross-link both ways.** The ADR References section links the PR that introduced it; the PR description links back to the ADR. Keep links bidirectional.
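The status and immutability rules above amount to a small state machine. The sketch below encodes them for illustration, assuming the three statuses listed are the only ones and that transitions only move forward; `ALLOWED_TRANSITIONS`, `can_transition`, and `body_is_editable` are hypothetical names, not part of any existing tooling in this repo.

```python
# Forward-only lifecycle taken from the Statuses rule:
# Proposed -> Accepted -> Superseded.
ALLOWED_TRANSITIONS = {
    "Proposed": {"Accepted"},
    "Accepted": {"Superseded"},
    "Superseded": set(),
}


def can_transition(current, new):
    """True if an ADR may move from status `current` to status `new`."""
    return new in ALLOWED_TRANSITIONS.get(current, set())


def body_is_editable(status):
    """Only a Proposed ADR may have its body edited; later statuses freeze it."""
    return status == "Proposed"
```

A lint step in CI could use checks like these to reject a PR that edits the body of an accepted ADR or moves a status backwards.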
## Index
| # | Title | Status | Date |
| --- | --- | --- | --- |
| [0001](0001-safe-prod-like-environment.md) | Safe, production-like environment | 🟢 Accepted | 2026-06-23 |
## Rules to contribute
1. Copy [`_template.md`](_template.md) to `NNNN-kebab-title.md` using the next free sequence number and delete the top HTML-comment note.
2. Fill in the blockquote (Status/Date/Deciders), then Context, Decision, Consequences, Alternatives considered, QA & validation, References.
3. Open the ADR with status `Proposed`. Flip it to `Accepted` once the decision is settled — and from that point treat the body as frozen.
4. Add a row to the Index table above (newest at the bottom to preserve chronological numbering).
5. In the PR that lands the ADR, link to the ADR file; in the ADR's References, link back to the PR. Bidirectional links are mandatory.
6. If this ADR supersedes a historical one in `doc/adr/` or the Ansible ADR folder, update the old record's status note and cross-reference both directions.
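
Step 1 asks for "the next free sequence number". A small Python sketch of how that could be derived from the folder contents is shown below; `next_adr_filename` is a hypothetical helper, and it assumes numbers are never reused, so the next number is one above the highest existing one.

```python
import re
from pathlib import Path

# Matches numbered ADR files only; README.md and _template.md are ignored.
ADR_FILE = re.compile(r"^(\d{4})-[a-z0-9-]+\.md$")


def next_adr_filename(adr_dir: Path, kebab_title: str) -> str:
    """Build the filename for a new ADR from the highest existing sequence."""
    numbers = []
    for path in adr_dir.iterdir():
        match = ADR_FILE.match(path.name)
        if match:
            numbers.append(int(match.group(1)))
    # An empty folder starts the sequence at 0001.
    return f"{max(numbers, default=0) + 1:04d}-{kebab_title}.md"
```

With only `0001-safe-prod-like-environment.md` present, a call with an illustrative title such as `"example-decision"` yields `0002-example-decision.md`.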

41
vibe/ADR/_template.md Normal file

@@ -0,0 +1,41 @@
[vibe](../README.md) > [ADR](README.md) > **_template**
<!-- Copy this file to NNNN-kebab-title.md, fill in, delete this note. -->
# ADR-NNNN: Title
> **Status**: Proposed | Accepted | Superseded by ADR-NNNN
> **Date**: YYYY-MM-DD
> **Deciders**: name(s)
## Context
What forces are at play? Describe the problem, the constraints, and the situation that makes a decision necessary. State facts, not opinions. Keep it short enough that a future reader understands *why* a decision was needed without prior context.
## Decision
The decision, stated in the active voice: "We will ...". One clear choice. If the decision has sub-parts, use a short bulleted list.
## Consequences
What becomes easier or harder as a result of this decision?
- **+** A positive outcome / something now enabled.
- **−** A trade-off / cost / new constraint accepted.
- **→** A future follow-up this implies (work deferred, a door left open, a re-evaluation trigger).
## Alternatives considered
| Option | Why not |
| --- | --- |
| Alternative A | Reason it was rejected. |
| Alternative B | Reason it was rejected. |
## QA & validation
How was (or will be) this decision validated? Tests, smoke checks, manual verification, rollback plan, or the criteria that would tell us the decision was wrong.
## References
- Link to the PR that introduces this ADR (and ensure the PR links back here).
- Related ADRs, PRDs, investigations, or external docs (descriptive link text, never "here"/"this").

34
vibe/PRD/README.md Normal file

@@ -0,0 +1,34 @@
[vibe](../README.md) > **PRD**
# Product Requirement Documents
> **Status**: 🟢 Active
> **Last Updated**: 2026-06-23
> **Related**: [vibe/ADR](../ADR/README.md) · [vibe/Investigations](../investigations/README.md)
`vibe/PRD/` holds the Product Requirement Documents that drive larger pieces of work in the lab. A PRD captures *what* we want and *why it matters*; the matching ADRs capture *how we decided to build it*, and investigations capture *what we learned* along the way.
## Convention
- **One subfolder per PRD**, kebab-case (e.g. `safe-prod-like-environment/`).
- Each subfolder **MUST** contain:
  - `README.md` — the PRD hub: problem, goals/non-goals, requirements, success criteria, and a QA strategy.
  - `STATUS.md` — the implementation tracker. **Update it whenever something ships** (a PR merges, a brick lands, a milestone closes). It is the living view of "where are we" against the PRD.
- A **big PRD uses tree-docs**: the `README.md` stays a hub and detail lives in leaf pages (each with its own breadcrumb and bidirectional cross-links). A tree-sized PRD **MUST** detail an explicit **QA strategy** — how the delivered work will be verified, and what "done and safe" means.
- **PRs cross-link to the PRD**, and the PRD's `STATUS.md` **cross-links back** to the PRs/ADRs/investigations that realised each part. Links are bidirectional.
- **No-tombstone rule** applies: the PRD reads as currently true. Progress lives in `STATUS.md` (which *is* a tracker and may legitimately list shipped items), not as "previously / now" edits scattered through the hub.
## Index
| PRD | Hub | Status |
| --- | --- | --- |
| Safe, production-like environment | [safe-prod-like-environment/README.md](safe-prod-like-environment/README.md) | 🟡 In design |
## Rules to contribute
1. Create a kebab-case subfolder named for the PRD.
2. Add `README.md` (the hub) and `STATUS.md` (the tracker). Both carry a breadcrumb first line and the leaf header blockquote (Status / Last Updated / Related).
3. In the hub, state the problem, goals and non-goals, requirements, success criteria, and the QA strategy. If the PRD is large, split detail into leaf pages and keep the README as a navigable hub.
4. Keep `STATUS.md` current: every time a piece ships, record it there and link the PR/ADR that delivered it.
5. Add a row to the Index table above.
6. Ensure every PR that implements part of the PRD links to the PRD, and that `STATUS.md` links back. Bidirectional links are mandatory.
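
The "each subfolder MUST contain `README.md` and `STATUS.md`" convention lends itself to an automated check. The sketch below is one possible shape, assuming it is pointed at the `vibe/PRD/` directory; the function name is hypothetical and nothing like it is confirmed to exist in the repo.

```python
from pathlib import Path

# Files every PRD subfolder must carry, per the convention above.
REQUIRED = ("README.md", "STATUS.md")


def missing_prd_files(prd_root: Path) -> dict:
    """Map each PRD subfolder name to the required files it lacks."""
    problems = {}
    for sub in sorted(p for p in prd_root.iterdir() if p.is_dir()):
        missing = [name for name in REQUIRED if not (sub / name).is_file()]
        if missing:
            problems[sub.name] = missing
    return problems
```

An empty result means every PRD subfolder satisfies the layout rule; a non-empty one names the folder and the missing file.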


@@ -0,0 +1,69 @@
[vibe](../../README.md) > [PRD](../README.md) > **Safe, production-like environment**
# Safe, production-like environment
> **Status:** In design
> **Last Updated:** 2026-06-23
> **Design record:** [ADR 0001 — Safe, production-like environment](../../ADR/0001-safe-prod-like-environment.md)
> **Adjacent:** [INV-001 — prod blast-radius couplings](../../investigations/INV-001-prod-blast-radius-couplings.md)
> **Map:** [Lab ecosystem guidebook](../../guidebooks/lab-ecosystem/README.md)
## Problem
The lab doubles as company production. The same three Raspberry Pis that host experiments also run the public CMS ([arcodange.fr](https://gitea.arcodange.lab/arcodange-org/cms)), Zoho-backed email, the Dolibarr ERP (accounting and business records), the url-shortener, the telegram-gateway, and the dance-lessons-coach. A single operator administers all of it from one MacBook that holds the kubeconfig, the Vault root, and the cloud admin tokens.
There is no separation between "where I experiment" and "where the business runs": every risky change is currently tested directly in prod. The 2026-04-13 power-cut proved that recovery is manual, non-trivial, and validated only by real incidents. A wrong Ansible play, Vault policy, Postgres migration, ArgoCD sync, or DNS record can silently break the business for days.
## Users & personas
A **single operator wearing two hats**:
- **The inner-loop developer** — iterating on an app, a Helm chart, a Vault policy, or an ArgoCD app. Wants a fast feedback cycle (seconds, not a fleet round-trip) and zero fear of touching the wrong thing.
- **The game-day / recovery operator** — rehearsing dangerous change-classes and disaster recovery (node-kill, Vault-seal, Longhorn corruption, DB drop, full power-cut). Wants high fidelity to the real topology so a drill actually predicts prod behaviour, and a runbook that gets validated against the sandbox instead of waiting for the next outage.
Both hats belong to the same person on the same laptop. The environment must serve the fast loop and the faithful loop without ever letting either reach into prod.
## Goals & non-goals
**Goals**
- Rehearse **every dangerous change-class** (Ansible, Vault, Postgres, ArgoCD, Cloudflare/DNS, Longhorn, recovery drills, PKI) with **zero prod blast radius**.
- Validate `CLUSTER_RECOVERY.md` against the **sandbox**, not against prod — turn the recovery runbook from incident-validated into drill-validated.
- A **fast inner loop** for app / Vault / ArgoCD iteration, seeded from the same GitOps repos as prod.
- Run on the **existing control node** (the MacBook) at **$0** marginal cost.
**Non-goals**
- No real `arcodange.fr` DNS or email tests. DNS/email modules run **plan-only against a throwaway zone**; real public DNS/ACME end-to-end is out of scope.
- No **physical-node tier** (4th Pi / mini-PC) — explicitly rejected, not deferred.
- No **cloud tier** for disposable real-DNS clusters — explicitly rejected, not deferred.
- Not a **performance-benchmark** environment — fidelity is about topology and the convention chain, not throughput numbers on laptop-class hardware.
## Requirements
Functional:
- **One-command bring-up**, seeded from the same GitOps repos as prod.
- **Sandbox inventory + guards** — a separate `inventory/sandbox/hosts.yml` plus a pre-task guard that aborts on any prod IP (`192.168.1.201-203`) unless `i_mean_prod=true`. See [Isolation boundary](isolation-boundary.md).
- **Parity manifest** — same k3s `v1.34.3+k3s1`, same Longhorn/Vault/VSO, same app-of-apps list, Postgres + Gitea outside k3s, three nodes. Drift = fail.
- **Two modes** seeded from the same repos via the sandbox inventory:
  - **(a) k3d single-node fast inner loop** (~60s bring-up) for app / Vault / ArgoCD iteration.
  - **(b) 3 arm64 VMs** (multipass or Vagrant on the M4) reproducing the 3-node topology — Postgres + Gitea as docker-compose outside k3s on the pi2-equivalent VM, Longhorn across the three VM disks — for Ansible / Longhorn / recovery work.
The **isolation boundary** (the table mapping each prod coupling to its sandbox control) and the **`<app>` naming note** are detailed in [isolation-boundary.md](isolation-boundary.md).
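
As a rough illustration of "Drift = fail" for the parity manifest, the Python sketch below compares observed sandbox facts with the expected values named in the requirement. The dictionary keys are hypothetical, and the Longhorn / Vault / VSO versions are left out because this page does not state them.

```python
# Expected values taken from the parity-manifest requirement above.
# Key names are illustrative; component versions not given in the PRD are omitted.
EXPECTED = {
    "k3s_version": "v1.34.3+k3s1",
    "node_count": 3,
    "postgres_outside_k3s": True,
    "gitea_outside_k3s": True,
}


def parity_drift(observed: dict) -> dict:
    """Return {key: (expected, observed)} for every key that does not match."""
    return {
        key: (want, observed.get(key))
        for key, want in EXPECTED.items()
        if observed.get(key) != want
    }


def parity_gate(observed: dict) -> bool:
    """The gate passes only when there is no drift at all."""
    return not parity_drift(observed)
```

How the observed facts are collected (for example from `kubectl` or the Ansible inventory) is outside this sketch.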
## QA strategy
The QA strategy combines four fidelity gates (the parity manifest, a `canary` provisioning-parity test that drives the new-web-app runbook end to end, an `ansible-playbook --check --diff` `changed=0` idempotence gate, and a `tofu plan` diff gate) with chaos drills mapped section by section to `CLUSTER_RECOVERY.md`, a recurring monthly game-day, and a promotion gate: no infra / Vault / storage / DNS change reaches prod until it has been applied to the sandbox **and** survived the matching drill. Full detail is in [qa-strategy.md](qa-strategy.md).
## Implementation status
Rollout is phased (Phase 0 guardrails → Phase 1 k3d → Phase 2 3-VM → Phase 3 game-day; Phase 4 out of scope). Live tracker: [STATUS.md](STATUS.md).
## Leaves
| Page | Summary | Status |
| --- | --- | --- |
| [Isolation boundary](isolation-boundary.md) | Prod-coupling → sandbox-control table; the `<app>` naming note; the token caution. | 🟡 In design |
| [QA strategy](qa-strategy.md) | Fidelity gates, chaos-drill table, game-day cadence, promotion gate, evidence trail. | 🟡 In design |
| [STATUS](STATUS.md) | Phase tracker (all not-started) and PR log. | ⬜ Not started |


@@ -0,0 +1,52 @@
[vibe](../../README.md) > [PRD](../README.md) > [Safe, production-like environment](README.md) > **STATUS**
# STATUS — Safe, production-like environment
> **Last Updated:** 2026-06-23
Legend: ⬜ not started · 🟡 in progress · ✅ done
> [!IMPORTANT]
> This file MUST be updated whenever something ships. Every PR that advances a phase crosslinks back here (and the matching checkbox flips), and the [PRs](#prs) table gets a row.
## Phase 0 — Isolation guardrails
*Must land before any sandbox run.*
- [ ] ⬜ Sandbox inventory `inventory/sandbox/hosts.yml` (VM/cloud hosts only)
- [ ] ⬜ Prod-IP abort guard (aborts on `192.168.1.201-203` unless `i_mean_prod=true`)
- [ ] ⬜ Sandbox GCS state prefixes (`sandbox/...`) or `gs://arcodange-tf-sandbox`
- [ ] ⬜ Sandbox Vault unseal-key path (`~/.arcodange/sandbox/cluster-keys.json`)
- [ ] ⬜ Sandbox env profile / plan-only DNS against a throwaway zone
## Phase 1 — Tier-1 k3d fast mode
- [ ] ⬜ One-command bring-up seeded from GitOps
- [ ] ⬜ Parity manifest v1
- [ ] ⬜ Canary provisioning-parity test
- [ ] ⬜ `changed=0` idempotence gate documented
## Phase 2 — Tier-1 3-VM cluster
- [ ] ⬜ Three arm64 VMs (multipass / Vagrant on the M4)
- [ ] ⬜ Same `system_k3s`; Postgres + Gitea outside k3s on the pi2-equivalent VM
- [ ] ⬜ Longhorn across the three VM disks
- [ ] ⬜ Chaos drills: node-kill / Vault-seal / DB-drop
- [ ] ⬜ First full `CLUSTER_RECOVERY` dry-run against the sandbox
## Phase 3 — Game-day operationalization
- [ ] ⬜ Monthly cadence + promotion gate in the PR checklist
- [ ] ⬜ Longhorn engine-ID drill
- [ ] ⬜ ArgoCD bad-sync rollback runbook
- [ ] ⬜ Evidence trail for ≥1 cycle
## Phase 4 — out of scope
Not planned: dedicated physical node (4th Pi / mini-PC) and disposable cloud k3s for real public DNS/ACME. See [ADR 0001](../../ADR/0001-safe-prod-like-environment.md) for the rejected-alternatives rationale.
## PRs
| PR | Scope | Phase | Merged |
| --- | --- | --- | --- |
| [#10](https://gitea.arcodange.lab/arcodange-org/factory/pulls/10) | Bootstrap the `vibe/` tree + ecosystem `AGENTS.md` (PRD scaffold, not a phase deliverable) | — | 🟡 open |


@@ -0,0 +1,31 @@
[vibe](../../README.md) > [PRD](../README.md) > [Safe, production-like environment](README.md) > **Isolation boundary**
# Isolation boundary
> **Status:** In design
> **Last Updated:** 2026-06-23
> **Upstream:** [Safe, production-like environment](README.md)
> **Related:** [ADR 0001](../../ADR/0001-safe-prod-like-environment.md) · [INV-001 — prod blast-radius couplings](../../investigations/INV-001-prod-blast-radius-couplings.md)
The isolation boundary is the load-bearing part of this PRD: the sandbox must be **unable to mutate real prod even on a wrong command**. Every prod coupling that a sandbox run could touch is mapped below to a concrete control. The boundary is the **cluster + Vault + state + DNS zone** — not the names (see the naming note).
## Prod couplings → sandbox controls
| Prod coupling | What it can break in prod | Sandbox control |
| --- | --- | --- |
| Ansible inventory `hosts.yml` → `192.168.1.201-203` | Wipe disks, reset k3s, corrupt Longhorn on the live Pis. | Separate `inventory/sandbox/hosts.yml` (VM/cloud hosts only) **plus** a pre-task guard that **aborts** if any target IP is in `192.168.1.201-203` unless `i_mean_prod=true` is set explicitly. |
| OpenTofu state in `gs://arcodange-tf` (prefixes) | A sandbox apply rewrites live state and re-plans prod resources. | A sandbox prefix family (`sandbox/factory/main`, `sandbox/tools/...`, `sandbox/factory/postgres`) via a backend-config override, **or** a separate bucket `gs://arcodange-tf-sandbox`. Sandbox runs never touch prod state. |
| Gitea provider `base_url` `gitea.arcodange.lab` + ArgoCD `repoURL` / `targetRevision` | Sandbox commits/pushes into the prod forge; ArgoCD syncs sandbox refs onto the prod cluster. | Sandbox Gitea on the sandbox cluster (or org `arcodange-sandbox`); the sandbox app-of-apps points at a **sandbox branch** so the sandbox cluster syncs only sandbox refs. |
| Vault provider `address` `vault.arcodange.lab` + unseal key `~/.arcodange/cluster-keys.json` | Sandbox writes clobber prod policies/auth/mounts; a botched init overwrites the prod unseal key. | A **separate sandbox Vault**; override the unseal-key path to `~/.arcodange/sandbox/cluster-keys.json` so prod's key can never be overwritten. |
| PostgreSQL provider `host` `192.168.1.202` (superuser) | Drop or alter live DBs — including ERP business records. | Sandbox PG is the docker-compose on the sandbox pi2-equivalent; a guard **refuses apply** if `host == 192.168.1.202` and `workspace != prod`. |
| Cloudflare account / OVH `arcodange.fr` / Zoho live mail | A wrong MX/SPF/DKIM silently breaks `arcodange.fr` mail for days. | DNS/email modules run **plan-only** against a throwaway zone/subdomain with a separate token. The real `arcodange.fr` token is **never** exported into a sandbox shell. Real public DNS/ACME is out of scope. |
| Longhorn backup bucket | A restore drill overwrites prod backups. | Sandbox backup target is a **separate bucket/prefix** so restore drills cannot overwrite prod backups. |
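
Two of the controls in the table are simple predicates: the prod-IP abort guard and the Postgres apply guard. The Python sketch below models only their logic; the real guards are described as an Ansible pre-task and an apply-time check, and the function and exception names here are hypothetical. It assumes inventory targets have already been resolved to IP addresses.

```python
import ipaddress

# Prod node range and Postgres host, as listed in the table above.
PROD_IPS = {ipaddress.ip_address(f"192.168.1.{n}") for n in range(201, 204)}
PROD_PG_HOST = "192.168.1.202"


class GuardError(RuntimeError):
    """Raised when a run would reach prod without an explicit opt-in."""


def check_inventory(targets, i_mean_prod=False):
    """Abort if any target IP is a prod node, unless i_mean_prod is set."""
    hits = sorted(
        str(ip) for ip in map(ipaddress.ip_address, targets) if ip in PROD_IPS
    )
    if hits and not i_mean_prod:
        raise GuardError(f"prod IPs in inventory: {', '.join(hits)}")


def check_postgres(host, workspace):
    """Refuse apply when the prod Postgres host is used outside the prod workspace."""
    if host == PROD_PG_HOST and workspace != "prod":
        raise GuardError(f"workspace {workspace!r} must not target {host}")
```

Both functions fail closed: the only way past either guard is an explicit prod signal (`i_mean_prod=True` or `workspace == "prod"`).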
## The `<app>` naming note
The `<app>` key threads one kebab-case identifier through the Gitea repo, the PG db + role, the Vault paths/policies, the k8s namespace + SA, the ArgoCD Application, the GCS state prefix, and DNS — see [conventions](../../../doc/runbooks/new-web-app/conventions.md).
Because `<app>` keys everything **within** a cluster / Vault / DB / zone, the sandbox can reuse **identical `<app>` names with no collision**. The isolation boundary is the cluster + Vault + state + DNS zone, not the names. This is deliberate: runbooks read **identically** in both environments, so a drill exercises the exact same convention chain an operator runs in prod.
> [!CAUTION]
> The real `arcodange.fr` Cloudflare token must **never** be exported into a sandbox shell. DNS/email work in the sandbox is plan-only against a throwaway zone with its own separate token. Exporting the prod token into a sandbox session would defeat the entire isolation boundary — a single `tofu apply` could rewrite live public DNS or mail records.


@@ -0,0 +1,49 @@
[vibe](../../README.md) > [PRD](../README.md) > [Safe, production-like environment](README.md) > **QA strategy**
# QA strategy
> **Status:** In design
> **Last Updated:** 2026-06-23
> **Upstream:** [Safe, production-like environment](README.md)
> **Related:** [ADR 0001](../../ADR/0001-safe-prod-like-environment.md) · [Isolation boundary](isolation-boundary.md) · [INV-001](../../investigations/INV-001-prod-blast-radius-couplings.md)
The sandbox is only useful if a green run there reliably predicts a green run in prod. That requires two things: the sandbox must stay **faithful** to prod (fidelity gates), and dangerous change-classes must be **rehearsed** before they ship (chaos drills + promotion gate).
## Fidelity gates
- **Parity manifest** — the sandbox must match prod on: k3s `v1.34.3+k3s1`, the same Longhorn / Vault / VSO versions, the same app-of-apps list, Postgres + Gitea running **outside** k3s, and three nodes. Any drift from this manifest is a failed gate.
- **Provisioning-parity canary test** — run the [new-web-app runbook](../../../doc/runbooks/new-web-app/conventions.md) for a throwaway `<app>` named `canary` and assert the convention chain resolves end to end: Gitea repo → PG db + role → Vault creds + policies → ArgoCD `Healthy`/`Synced` → VSO injects → pod `Running`. One typo anywhere in the chain fails this test.
- **Idempotence gate (changed=0)** — `ansible-playbook --check --diff` must report `changed=0` on the converged sandbox **before** the same change is promoted to prod. A non-zero diff means the play is not idempotent and is not ready.
- **`tofu plan` diff gate** — `tofu plan` runs against **sandbox** state; for DNS it must assert the plan touches **only** the throwaway zone. A plan that proposes to touch anything else fails the gate.
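
The `changed=0` idempotence gate can be evaluated by reading the `PLAY RECAP` that `ansible-playbook` prints at the end of a run. The following Python sketch parses that recap from captured output; the function names are hypothetical, and the host names in the usage example are illustrative only.

```python
import re

# A recap line looks like: "<host> : ok=12 changed=0 unreachable=0 failed=0 ..."
RECAP_LINE = re.compile(
    r"^(?P<host>\S+)\s*:\s.*\bchanged=(?P<changed>\d+)", re.MULTILINE
)


def hosts_with_changes(output: str) -> dict:
    """Return {host: changed_count} for every host reporting changed > 0."""
    return {
        m["host"]: int(m["changed"])
        for m in RECAP_LINE.finditer(output)
        if int(m["changed"]) > 0
    }


def idempotence_gate(output: str) -> bool:
    """Pass only if a recap was found and no host reports a change."""
    return bool(RECAP_LINE.search(output)) and not hosts_with_changes(output)
```

Requiring at least one recap line keeps the gate from passing on empty or truncated output.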
## Chaos drills
Each drill maps to a section of `CLUSTER_RECOVERY.md`. The drill is "passed" only when the acceptance condition is met by following the runbook — any improvised step becomes a runbook PR.
| Drill | Action | Acceptance | Recovery section |
| --- | --- | --- | --- |
| Node-kill | Stop one sandbox VM (agent, then server). | Workloads reschedule; cluster returns to `Ready`/`Healthy`. | Node-loss / reschedule section. |
| Vault-seal | Seal the **sandbox** Vault. | Unseal via the **sandbox** key (`~/.arcodange/sandbox/cluster-keys.json`); VSO re-authenticates and resumes secret injection. | Vault unseal + VSO re-auth section. |
| Longhorn volume corruption | Corrupt/recreate a sandbox volume. | Run `recover/longhorn*.yml`; validate engine-ID re-association per the [Longhorn PVC recovery ADR](../../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md). | Longhorn restore section. |
| DB drop/restore | Drop a sandbox DB. | Restore from the sandbox backup bucket; app reconnects, data intact. | Postgres restore section. |
| Full power-cut simulation | Cold-stop all three sandbox VMs. | Execute `CLUSTER_RECOVERY.md` top-to-bottom against the sandbox to green (Longhorn restore → Vault unseal → VSO re-auth → ERP scaled up last). | Whole runbook. |
| ArgoCD bad-sync | Push a deliberately broken sandbox ref. | Observe `prune`/`selfHeal`; author/validate a rollback runbook section. | ArgoCD rollback (to be authored). |
| Cert/PKI re-issue | Re-init the sandbox Step-CA. | Internal `*.arcodange.lab` (sandbox) certs re-issue and chains validate. | PKI re-issue section. |
1. **Node-kill** stops a sandbox VM and confirms workloads reschedule and the cluster returns to a healthy state.
2. **Vault-seal** seals the sandbox Vault, then unseals it with the **sandbox** key and confirms VSO re-authenticates and resumes injecting secrets.
3. **Longhorn corruption** corrupts a sandbox volume, runs the `recover/longhorn*.yml` playbooks, and validates engine-ID re-association against the Longhorn PVC recovery ADR.
4. **DB drop/restore** drops a sandbox database and restores it from the sandbox backup bucket, confirming the app reconnects with data intact.
5. **Full power-cut** cold-stops all three sandbox VMs and runs `CLUSTER_RECOVERY.md` end to end to green, with the ERP scaled up last.
6. **ArgoCD bad-sync** pushes a broken sandbox ref, observes `prune`/`selfHeal` behaviour, and produces a rollback runbook section.
7. **Cert/PKI re-issue** re-initialises the sandbox Step-CA and confirms internal certificates re-issue and validate.
## Recurring game-day & promotion gate
The operator runs a **recurring monthly game-day** and follows **only** the runbook. Any improvised step becomes a runbook PR; this is how `CLUSTER_RECOVERY.md` gets validated against the sandbox instead of waiting for the next real incident.
**Promotion gate:** no infra / Vault / storage / DNS change reaches prod until it has been applied to the sandbox **and** survived the matching drill. This gate belongs in the PR checklist and crosslinks to [STATUS.md](STATUS.md).
## Evidence trail
Each drill records **time-to-recover** and **which step failed or was improvised**. The evidence trail accumulates over at least one game-day cycle and is the proof that a change-class is rehearsed and that the recovery runbook is current. Failed/improvised steps feed directly back into runbook PRs.
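
One way to picture the evidence trail and the promotion gate together is as a small record per drill. The Python sketch below is an assumption about shape only: the field and function names are hypothetical, and the document does not prescribe a storage format.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class DrillRecord:
    """One game-day entry: what was drilled, how long recovery took, what went off-script."""
    drill: str
    time_to_recover_s: float
    failed_or_improvised: List[str] = field(default_factory=list)

    @property
    def passed(self) -> bool:
        # A drill passes only when the runbook was followed with no improvised step.
        return not self.failed_or_improvised


def runbook_followups(records: List[DrillRecord]) -> List[Tuple[str, str]]:
    """Every failed or improvised step becomes a candidate runbook PR."""
    return [(r.drill, step) for r in records for step in r.failed_or_improvised]


def may_promote(applied_to_sandbox: bool, matching_drill: Optional[DrillRecord]) -> bool:
    """Promotion gate: applied to the sandbox AND the matching drill passed."""
    return applied_to_sandbox and matching_drill is not None and matching_drill.passed
```

A change with no matching drill record is treated as not promotable, which mirrors the "survived the matching drill" wording.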

50
vibe/README.md Normal file

@@ -0,0 +1,50 @@
# vibe/ — Arcodange Knowledge Base
You-are-here: the **root** of the `vibe/` knowledge tree — the front door for every doc agents write and read.
Up: [factory](../README.md) / [AGENTS.md](../AGENTS.md)
> **Status:** Active
> **Last Updated:** 2026-06-23
## What is `vibe/`?
`vibe/` is the knowledge base dedicated to **LLM agents** working on the Arcodange lab. It collects the *why* (ADRs), the *what/when* (PRDs), the *what-we-found* (investigations), the *how-it-fits-together* (guidebooks), the *how-to-do-it* (runbooks), and the *what-we-told-humans* (shareouts). Everything here is written in **English** — the single exception is **shareouts handouts, which are FRENCH**. Operating rules (no-tombstone, mermaid prefs, tree-docs, ADR/PRD/investigation conventions, PR crosslinking, language policy) are defined authoritatively in [AGENTS.md](../AGENTS.md); this page summarizes them and points there.
## Folder map
| Folder | When to use it | Status |
|---|---|---|
| [ADR](ADR/README.md) | Recording an architecture **decision** (MADR-lite; body immutable once Accepted). Canonical home going forward. | ⬜ |
| [PRD](PRD/README.md) | Specifying a **product/project**: Problem → … → QA strategy → `STATUS.md` (mandatory, kept current). | ⬜ |
| [investigations](investigations/README.md) | Capturing a **finding/analysis** — single `INV-NNN-slug.md`, or stub + notebooks when data-heavy. | ⬜ |
| [guidebooks](guidebooks/README.md) | Mapping a **component or the ecosystem** as navigable tree-docs (the lab cartography). | ⬜ |
| [runbooks](runbooks/README.md) | Documenting an **operational procedure** step-by-step with `[AGENT]` / `[HUMAN]` markers. | ⬜ |
| [shareouts](shareouts/README.md) | Producing **handouts/presentations** for humans (FRENCH). | ⬜ |
Status legend: ✅ done · 🟡 beta · 🔴 critical · ⚠️ known issue · ❌ disabled · ⬜ not started.
## Conventions at a glance
- **No-tombstone rule (foremost)** — write each file as currently true; never leave "previously X, now Y", changelogs, or "updated to …" notes. Git history is the audit trail. Only exception: a forward-looking `> [!CAUTION]` about a live risk.
- **Breadcrumb spine** — every non-root file starts with a breadcrumb: ancestors as relative links, current page bold-unlinked, separator ` > `. This root has no breadcrumb (it uses the you-are-here + up-link above instead).
- **README hub per folder** — each folder's `README.md` is an index table of its children (link + one-line summary + status), sorted by importance/sequence.
- **Bidirectional links** — if A references B as related, B references A. Use descriptive link text (never "here"/"this").
- **Mermaid prefs** — `theme base`/`forest` init directive; legible `classDef` palette (dark fills + light text); `<br>` not `\n`; leading space before slash-labels; validate with the Mermaid MCP; **a numbered ordered list restating the flow after every diagram**.
- **GitHub alert legend** — `[!NOTE]` info/forward-looking · `[!TIP]` aside · `[!IMPORTANT]` inherent constraint · `[!WARNING]` degraded-but-working · `[!CAUTION]` data-loss/breaking.
- **Status emoji legend** — ✅ done · 🟡 beta · 🔴 critical · ⚠️ known issue · ❌ disabled · ⬜ not started.
- **Language policy** — English throughout `vibe/`; FRENCH only for shareouts handouts.
Authority for all of the above: [AGENTS.md](../AGENTS.md).
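
The breadcrumb convention above is regular enough to lint. Here is a hedged Python sketch of such a check, assuming "relative link" means a Markdown link whose target is neither an `http(s)` URL nor an absolute path; the function name is hypothetical.

```python
import re

# Ancestor: a Markdown link with a relative target (no scheme, no leading slash).
LINK = re.compile(r"^\[[^\]]+\]\((?!https?://|/)[^)]+\)$")
# Current page: bold text, not a link.
CURRENT = re.compile(r"^\*\*[^*]+\*\*$")


def is_valid_breadcrumb(first_line: str) -> bool:
    """Check 'ancestors as relative links > **current page**' with ' > ' separators."""
    parts = first_line.strip().split(" > ")
    if len(parts) < 2:
        return False  # a non-root page needs at least one ancestor
    *ancestors, current = parts
    return all(LINK.match(a) for a in ancestors) and CURRENT.match(current) is not None
```

The root `vibe/README.md` has no breadcrumb by design, so a linter built on this would need to skip that one file.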
## Maintenance policy
- **Adding a page** → also add its row to the parent folder's `README.md` index table.
- **Keep links bidirectional** → when you link A→B, add B→A.
- **Stamp `Last Updated:`** at each tree root (this file and every guidebook/big-PRD root) after any structural change.
- **Never tombstone** → edit content in place; let git carry the history.
- **Guidebook coupling** → changing a documented component means updating its guidebook page in the same change.
- **PR crosslinks** → every PR references the ADR/PRD it advances; that ADR's References and the PRD's `STATUS.md` link back.
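
The "keep links bidirectional" rule can be expressed as a check over a page-to-links map. The Python sketch below assumes that map has already been extracted from the Markdown files (the extraction step is not shown), and the function name is hypothetical.

```python
def missing_backlinks(links: dict) -> list:
    """Return sorted (page, should_link_to) pairs where the B->A link is absent.

    `links` maps each page to the set of pages it references as related.
    """
    missing = []
    for page, targets in links.items():
        for target in targets:
            # A -> B exists; B must list A in return.
            if page not in links.get(target, set()):
                missing.append((target, page))
    return sorted(missing)
```

For example, if `ADR/0001` links to `PRD/README` but not the other way round, the result is `[("PRD/README", "ADR/0001")]`: the first element is the page that needs editing.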
## Cohort + workflow (recap)
Docs here are produced by a cohort of persona subagents — Lab Cartographer, ADR Scribe, PRD Architect, Runbook Engineer, Investigator, Diagram Smith, Continuity Warden — spawned via the Agent tool or a Workflow. The recommended pipeline for substantial contributions is **Scaffold → Author → Validate → Review → Assemble**. Full descriptions and responsibilities live in [AGENTS.md](../AGENTS.md).

52
vibe/guidebooks/README.md Normal file

@@ -0,0 +1,52 @@
[vibe](../README.md) > **Guidebooks**
# Guidebooks
> **Status:** Active
> **Last Updated:** 2026-06-23
> **Related:** [vibe runbooks](../runbooks/README.md) · [vibe shareouts](../shareouts/README.md) · canonical docs under [doc/](../../doc/README.md)
## What a guidebook is
A **guidebook** is a *tree-doc reference map* of the lab: a navigable set of linked Markdown pages (a root index, per-folder README hubs, and leaf pages wired with breadcrumbs and bidirectional cross-references) whose job is to **describe how the system is actually wired right now** — components, the conventions that join them, and the data/control flows between them.
Guidebooks are descriptive maps, not procedures. They answer *"how does this fit together?"* For *"how do I execute X step by step?"* see the [runbooks](../runbooks/README.md). For *"why was it built this way?"* see the architecture decision records under [doc/adr](../../doc/adr/README.md).
```mermaid
%%{init: {'theme': 'base'}}%%
flowchart LR
classDef src fill:#2563eb,stroke:#1e40af,color:#fff
classDef proc fill:#059669,stroke:#047857,color:#fff
classDef store fill:#7c3aed,stroke:#6d28d9,color:#fff
SYS["Lab system<br>(factory + tools + cms)"]:::src --> GB["Guidebook<br>(tree-doc reference map)"]:::proc --> READER["Reader<br>(human or agent)<br>understands the wiring"]:::store
```
1. The lab system spans three repos — `factory`, `tools`, and `cms` — joined by the `<app>` naming convention.
2. A guidebook surveys that system and renders it as a tree-doc reference map: indexed folders, breadcrumb-linked leaves, Mermaid flow diagrams.
3. A reader (a human onboarding, or an agent planning a change) consumes the guidebook to understand how the pieces wire together before touching anything.
## Key maintenance rule
> [!IMPORTANT]
> **If a component documented in a guidebook is altered, the guidebook page describing it MUST be updated in the same change.** A reference map that drifts from reality is worse than no map — it sends readers (and agents) confidently down dead paths. Treat the guidebook edit as part of the diff, not a follow-up: the PR that changes the component is the PR that updates its guidebook page.
## Index
| Guidebook | What it maps | Status |
|---|---|---|
| [Lab ecosystem](lab-ecosystem/README.md) | End-to-end map of `factory` + `tools` + `cms`: repos, the `<app>` join key, secrets via Vault, CI/CD, ArgoCD, and the data/control flows that connect them | ✅ Active |
| [Factory provisioning](factory-provisioning/README.md) | Deep dive into how factory provisions everything: Ansible playbooks + roles and OpenTofu | ✅ Active |
| [Tools](tools/README.md) | Deep dive into the lab platform services in the `tools` namespace (Vault+VSO, Prometheus, Grafana, CrowdSec, poolers, Redis, Plausible, ClickHouse) | ✅ Active |
| [CMS](cms/README.md) | Deep dive into the public Nuxt site arcodange.fr + its Cloudflare DNS/tunnel/Turnstile and Zoho email IaC | ✅ Active |
| [Applications](applications/README.md) | The deployed apps and the common pattern they share — webapp (Go + Postgres) and url-shortener (Rust + SQLite); erp has its own guidebook | ✅ Active |
| [ERP](erp/README.md) | The lab's Dolibarr ERP: its deployment on Postgres, its document storage + backup/restore, and the read-only ops CLI (the most data-critical app) | ✅ Active |
## Rules to contribute
1. **Use the `tree-docs` skill.** Guidebooks are tree-docs: author and grow them with the skill so breadcrumbs, hubs, and cross-links stay consistent.
2. **Breadcrumb spine on every file.** The first line of each page is its breadcrumb trail: ancestors are relative links, the current page is the bold-unlinked last item, separator is ` > ` (space-gt-space).
3. **README hub per subfolder.** Every folder carries a `README.md` index hub: a table of its children (link + one-line summary + status), sorted by importance/sequence, never alphabetically.
4. **Bidirectional links.** When page A references page B as related, page B references A back. Use descriptive link text — never "here" or "this".
5. **Mermaid preferences.** Begin each diagram with a `%%{init: {'theme': 'base'}}%%` directive, define a `classDef` palette legible on both light and dark backgrounds (dark fills, light text), use HTML `<br>` for line breaks, and follow every diagram immediately with a numbered ordered list restating the same flow in words.
6. **Status legend.** ✅ done · 🟡 beta · 🔴 critical · ⚠️ known issue · ❌ disabled · ⬜ not started.
7. **Honour the maintenance rule above** — update the relevant guidebook page in the same change that alters the component it documents.


@@ -0,0 +1,118 @@
[vibe](../../README.md) > [Guidebooks](../README.md) > **Applications**
# Applications
> **Status:** ✅ Active
> **Last Updated:** 2026-06-23
> **Upstream:** [Lab ecosystem hub](../lab-ecosystem/README.md) · [01 · factory](../lab-ecosystem/01-factory.md)
> **Downstream:** [webapp](webapp.md) · [url-shortener](url-shortener.md)
> **Related:** [naming-conventions](../lab-ecosystem/naming-conventions.md) · [tools secrets-and-vso](../tools/secrets-and-vso.md) · [factory postgres-iac](../factory-provisioning/opentofu/postgres-iac.md) · [safe-prod-like-environment ADR](../../ADR/0001-safe-prod-like-environment.md)
This guidebook maps the **deployed applications** — the workloads ArgoCD runs in their own `<app>` namespace — and, more importantly, the **single repeatable pattern** every one of them follows. Once you know the pattern, every app reads as a variation on the same skeleton: a Gitea repo whose contents (Dockerfile + Helm chart + optional Vault IaC + CI) and whose `<app>` name fully determine how it builds, deploys, gets its secrets, and is reached from the network.
Two apps are presented in depth as the canonical archetypes: [webapp](webapp.md) (Go + external Postgres) and [url-shortener](url-shortener.md) (Rust + embedded SQLite). Other apps in the cluster — `erp`, `dance-lessons-coach`, `telegram-gateway`, `plausible` — instantiate the same pattern; `erp` has its own [ERP guidebook](../erp/README.md) because it carries far more moving parts than the two archetypes.
## The common app pattern
Every application is a self-contained Gitea repo under the `arcodange-org` (or `arcodange`) org that carries the same four ingredients. The `<app>` name — the repo name — is the join key that threads through all of them (see the [naming-conventions concept](../lab-ecosystem/naming-conventions.md)).
| Ingredient | Path in the app repo | What it is | Required? |
|---|---|---|---|
| **Dockerfile** | `Dockerfile` | Multi-stage build producing the runtime image, pushed to the Gitea container registry as `gitea.arcodange.lab/<org>/<app>` | ✅ always |
| **Helm chart** | `chart/` | `Chart.yaml` + `values.yaml` + `templates/` (deployment, service, ingress, serviceaccount, hpa, config, NOTES, optional PVC, optional Vault CRDs) — the unit ArgoCD syncs | ✅ always |
| **Vault IaC** | `iac/` | OpenTofu that declares the app's Vault objects: a Postgres dynamic-secret role keyed on `<app>` + a Kubernetes auth role bound to the `<app>` ServiceAccount. The canonical form pulls the [`app_roles` module](../tools/secrets-and-vso.md) from `tools`; the privileged `app_policy` half is declared centrally so the app repo never holds it | 🟡 only apps needing Postgres / Vault KV |
| **CI workflows** | `.gitea/workflows/` | A `dockerimage` job that builds + pushes the image on every `main` push, and (when `iac/` exists) a `vault` job that runs `tofu apply` against Vault, gated to changes under `iac/*.tf` | ✅ image build · 🟡 vault apply |
### How a chart becomes a running app
Factory's [ArgoCD app-of-apps](../lab-ecosystem/01-factory.md) emits **one `Application` CRD per app**, and every field is derived mechanically from the `<app>` name:
| Application field | Value | Source |
|---|---|---|
| `repoURL` | `https://gitea.arcodange.lab/<org>/<app>` | `<app>` + optional org override |
| `path` | `chart` | fixed convention |
| `namespace` | `<app>` (`CreateNamespace=true`) | `<app>` |
| `syncPolicy` | `automated` with `prune: true` + `selfHeal: true` | app-of-apps default |
The same `<app>` name is also the Postgres database/role name, the Vault role name, the KV path prefix, and the ServiceAccount name — one string keying the whole stack. See [naming-conventions](../lab-ecosystem/naming-conventions.md).
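
The derivation in the table above can be sketched in a few lines of Python. This is an illustration of the convention, not factory's actual app-of-apps template: the function name `derive_application`, the dataclass, and the choice of `arcodange-org` as the default org are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArgoApplication:
    """The subset of Application fields the convention derives from <app>."""
    repo_url: str
    path: str
    namespace: str
    sync_options: tuple
    automated: dict


def derive_application(app: str, org: str = "arcodange-org") -> ArgoApplication:
    """Build the Application fields from the app name (hypothetical helper)."""
    if not app or "/" in app:
        raise ValueError("app must be a bare repo name")
    return ArgoApplication(
        repo_url=f"https://gitea.arcodange.lab/{org}/{app}",
        path="chart",                      # fixed convention
        namespace=app,                     # one namespace per app
        sync_options=("CreateNamespace=true",),
        automated={"prune": True, "selfHeal": True},
    )


if __name__ == "__main__":
    print(derive_application("webapp"))
```

Passing `org="arcodange"` models the optional org override; every other field stays a pure function of the app name.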
### Ingress convention — `.fr` public vs `.lab` internal
Every app that serves HTTP exposes itself through two Traefik ingresses with a fixed split by domain suffix:
| Ingress | Domain | Traefik entrypoint | Middlewares | TLS / cert | Reached via |
|---|---|---|---|---|---|
| **Public** | `<app>.arcodange.fr` | `web` | `kube-system-crowdsec@kubernetescrd` (CrowdSec bouncer) | terminated at the edge | the **Cloudflared tunnel** — the public web entrypoint |
| **Internal** | `<app>.arcodange.lab` | `websecure` | `localIp@file` (LAN-only allow-list) | a cert from **either** the Traefik `letsencrypt` resolver **or** cert-manager's `step-issuer` (`StepClusterIssuer`) | the LAN directly |
> [!NOTE]
> The two archetypes differ only in cert mechanism, not in the convention: webapp's internal ingress carries the `letsencrypt` certresolver annotations, while url-shortener's internal ingress requests its cert from cert-manager's `step-issuer`. Both still ride `websecure` + `localIp@file`; both still expose a `.fr` twin behind the CrowdSec middleware.
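
As a rough sketch of the split described above, the two ingresses can be modelled as a pair of host + annotation sets. The helper name `ingress_pair` and the dictionary layout are hypothetical; the annotation keys are the standard Traefik ingress router annotations, and the cert mechanism is left out because it differs per app. The host label is a parameter because it is not always the app name (url-shortener uses `url`).

```python
TRAEFIK_PREFIX = "traefik.ingress.kubernetes.io/router."


def ingress_pair(host_label: str) -> dict:
    """Return the public (.fr) and internal (.lab) ingress settings (illustrative)."""
    return {
        "public": {
            "host": f"{host_label}.arcodange.fr",
            "annotations": {
                TRAEFIK_PREFIX + "entrypoints": "web",
                # CrowdSec bouncer middleware on the public route
                TRAEFIK_PREFIX + "middlewares": "kube-system-crowdsec@kubernetescrd",
            },
        },
        "internal": {
            "host": f"{host_label}.arcodange.lab",
            "annotations": {
                TRAEFIK_PREFIX + "entrypoints": "websecure",
                # LAN-only allow-list middleware on the internal route
                TRAEFIK_PREFIX + "middlewares": "localIp@file",
            },
        },
    }


if __name__ == "__main__":
    for name, spec in ingress_pair("webapp").items():
        print(name, spec["host"], spec["annotations"])
```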
## Two archetypes compared
The deployed apps fall into two shapes. Pick the matching archetype's page when adding or modifying an app.
| Aspect | [webapp](webapp.md) | [url-shortener](url-shortener.md) |
|---|---|---|
| Language / build | **Go** (golang:1.23 → alpine runtime) | **Rust** (cargo-chef → `scratch` runtime) |
| State | **External Postgres**, reached through the `tools` **pgbouncer** pooler with credentials delivered by **VSO** | **Embedded SQLite** on a `/data` file |
| Persistence | none in-cluster (DB lives on `pi2`) | a **Longhorn RWO PVC** (`storageClassName: longhorn`, `helm.sh/resource-policy: keep`) mounted at `/data` |
| Replicas | **scalable** (stateless pods; HPA-ready) | **single** — RWO volume cannot be shared across pods |
| `iac/` + Vault | **yes** — declares a Postgres dynamic-secret role + a k8s auth role; the chart ships `VaultAuth` + `VaultDynamicSecret` + `VaultStaticSecret` CRDs for **dynamic, rotating** DB creds (webapp's live pod still reads the static `pgbouncer_auth` URL from its ConfigMap; see [webapp](webapp.md)) | **none** — no Vault objects, no DB role |
| Recovery | restore from the **PostgreSQL backup** (factory `05_backup` → `/mnt/backups`) | **Longhorn block-device recovery** of the PVC (raw replica `.img` files) — see [ansible recover](../factory-provisioning/ansible/06-recover.md) |
The choice is essentially *"shared/scalable state that survives a single node"* (Postgres, webapp shape) versus *"self-contained single-writer state co-located with the pod"* (SQLite-on-Longhorn, url-shortener shape). The trade-off and why both are kept prod-like is recorded in the [safe-prod-like-environment ADR](../../ADR/0001-safe-prod-like-environment.md).
## Generic app lifecycle
```mermaid
%%{init: {'theme': 'base'}}%%
flowchart LR
classDef src fill:#2563eb,stroke:#1e40af,color:#fff
classDef proc fill:#059669,stroke:#047857,color:#fff
classDef store fill:#7c3aed,stroke:#6d28d9,color:#fff
classDef net fill:#b45309,stroke:#92400e,color:#fff
REPO["app repo<br>Dockerfile + chart/ + iac/ + .gitea/workflows"]:::src
IMG["image pushed<br>gitea registry &lt;org&gt;/&lt;app&gt;"]:::store
VAULT["tofu apply<br>Vault role + Postgres role"]:::store
ARGO["ArgoCD<br>deploys chart (ns &lt;app&gt;)"]:::proc
POD["pod<br>Postgres via pgbouncer + VSO<br>OR SQLite on Longhorn PVC"]:::proc
TR["Traefik ingress<br>.fr public / .lab internal"]:::net
REPO -- "dockerimage CI" --> IMG
REPO -- "vault CI (iac apps only)" --> VAULT
IMG --> ARGO
VAULT -. "creds for DB apps" .- POD
ARGO --> POD
POD --> TR
```
1. The **app repo** holds the four ingredients: a Dockerfile, a `chart/`, an optional `iac/`, and `.gitea/workflows`.
2. On a push to `main`, the **dockerimage** workflow builds the image and pushes it to the Gitea container registry as `<org>/<app>`.
3. For apps with an `iac/`, the **vault** workflow runs `tofu apply` to declare the app's Postgres dynamic-secret role and Kubernetes auth role in Vault.
4. **ArgoCD** (factory's app-of-apps) syncs the chart into the `<app>` namespace, deriving `repoURL`/`path`/`namespace` from the `<app>` name.
5. The **pod** comes up; a Postgres-backed app receives rotating DB credentials through pgbouncer + VSO, while a SQLite-backed app mounts its Longhorn PVC at `/data`.
6. **Traefik** publishes the pod through two ingresses: the `.fr` public route (CrowdSec middleware, via the Cloudflared tunnel) and the `.lab` internal route (`websecure` + `localIp` + a letsencrypt/step-issuer cert).
## Index
| Page | Archetype | Status |
|---|---|---|
| [webapp](webapp.md) | Canonical **Go + external Postgres** exemplar — `iac/` + Vault dynamic creds, scalable stateless pods | ✅ Active |
| [url-shortener](url-shortener.md) | **Rust + embedded SQLite** counterpart — single replica on a Longhorn RWO PVC, no Vault | ✅ Active |
`erp` and the other apps (`dance-lessons-coach`, `telegram-gateway`, `plausible`) follow the same pattern; `erp` is documented in depth in its own [ERP guidebook](../erp/README.md).
## Maintenance rule
> [!IMPORTANT]
> **When an app's repo changes shape, its page here changes in the same PR.** If you alter the chart structure, the ingress convention, the Vault wiring, the persistence model, or the CI workflows of a deployed app, update this hub and the relevant archetype page in the same change. A reference map that drifts from the real `chart/` and `iac/` sends agents confidently down dead paths.
## Cross-references
- [01 · factory](../lab-ecosystem/01-factory.md) — the ArgoCD app-of-apps that emits one `Application` per app on this page.
- [tools secrets-and-vso](../tools/secrets-and-vso.md) — the `app_policy` + `app_roles` module pair that turns `<app>` into Vault policies, roles, and CI identities; the VSO runtime path the Postgres archetype rides.
- [factory postgres-iac](../factory-provisioning/opentofu/postgres-iac.md) — the per-app PostgreSQL database + `<app>_role` the webapp archetype depends on.
- [naming-conventions](../lab-ecosystem/naming-conventions.md) — the `<app>` join key that threads through repo, image, namespace, DB, Vault role, and ServiceAccount.
- [safe-prod-like-environment ADR](../../ADR/0001-safe-prod-like-environment.md) — why the lab keeps apps deployed prod-like and the state/recovery trade-offs behind the two archetypes.


@@ -0,0 +1,172 @@
[vibe](../../README.md) > [Guidebooks](../README.md) > [Applications](README.md) > **url-shortener**
# url-shortener (Chhoto URL)
> **Status:** ✅ active · **Last Updated:** 2026-06-23
> **Upstream:** [Applications index](README.md) · [lab-ecosystem hub](../lab-ecosystem/README.md)
> **Downstream:** [storage-and-recovery](../lab-ecosystem/storage-and-recovery.md) · [Ansible recover playbooks](../factory-provisioning/ansible/06-recover.md)
> **Related:** [webapp](webapp.md) (the stateless counterpart) · [naming-conventions](../lab-ecosystem/naming-conventions.md) · [secrets-and-vault](../lab-ecosystem/secrets-and-vault.md)
`url-shortener` is the lab's **stateful** application — the deliberate mirror image of [webapp](webapp.md). Where webapp is a horizontally-scalable, Postgres-backed, Vault-credentialed reference app, `url-shortener` is a single-pod, SQLite-on-a-disk service that proves the lab can run a genuinely stateful workload on Longhorn block storage and recover it after a crash.
The application itself is **Chhoto URL**, a tiny Rust/[actix-web](https://gitea.arcodange.lab/arcodange-org/url-shortener/src/branch/main/actix/Cargo.toml) URL shortener, mirrored into the lab from the upstream project [`sintan1729/chhoto-url`](https://github.com/SinTan1729/chhoto-url). The lab does not fork the source — it mirrors the published Docker image (see [§5 CI](#5-ci--mirror-not-build)) — but keeps a copy of the source and a custom Helm chart in [the `url-shortener` repo](https://gitea.arcodange.lab/arcodange-org/url-shortener/src/branch/main) so the lab owns its packaging.
> [!NOTE]
> A second public web app, the **erp** application, is documented separately in its own [ERP guidebook](../erp/README.md).
---
## 1. The app & image
Chhoto URL is a single self-contained binary: an `actix-web` HTTP server with a bundled SQLite engine and a plain static frontend. There is no separate database process and no application runtime to install.
| Aspect | Detail | Source |
| --- | --- | --- |
| Language / framework | Rust, [`actix-web`](https://gitea.arcodange.lab/arcodange-org/url-shortener/src/branch/main/actix/src/main.rs) `4.5.x` | [`actix/Cargo.toml`](https://gitea.arcodange.lab/arcodange-org/url-shortener/src/branch/main/actix/Cargo.toml) |
| Database driver | [`rusqlite`](https://gitea.arcodange.lab/arcodange-org/url-shortener/src/branch/main/actix/src/database.rs) with the `bundled` feature — SQLite is compiled **into** the binary | `Cargo.toml` |
| Sessions | `actix-session` cookie store; the session key is regenerated at boot, so a restart invalidates all logins | [`actix/src/main.rs`](https://gitea.arcodange.lab/arcodange-org/url-shortener/src/branch/main/actix/src/main.rs) |
| Frontend | Plain HTML/CSS/JS served by `actix-files` from [`resources/`](https://gitea.arcodange.lab/arcodange-org/url-shortener/src/branch/main/resources) (`index.html`, `static/styles.css`, `static/script.js`, `static/404.html`) — no build step, no SPA framework | `resources/` |
| Listen port | `4567` (overridable via the `port` env var) | `actix/src/main.rs` |
| Slug generation | Adjective-name `Pair` or `UID` styles; the lab forces `UID` (see [§3](#3-the-chart)) | `actix/src/utils.rs` |
### Image build
| Image | Base / approach | Result | File |
| --- | --- | --- | --- |
| `Dockerfile` | `cargo-chef` dependency caching → musl static build (`x86_64-unknown-linux-musl`) → `FROM scratch`, copying only the binary + `resources/` | A single-arch, ~6 MB image with **no** OS, shell, or libc | [`Dockerfile`](https://gitea.arcodange.lab/arcodange-org/url-shortener/src/branch/main/Dockerfile) |
| `Dockerfile.multiarch` | Per-arch `FROM scratch` stages selected by `TARGETARCH`, copying pre-built musl binaries | `amd64` / `arm64` / `armv7` images from a local cross-compile | [`Dockerfile.multiarch`](https://gitea.arcodange.lab/arcodange-org/url-shortener/src/branch/main/Dockerfile.multiarch) |
| `compose.yaml` | Pulls the upstream `sintan1729/chhoto-url:latest`, binds `4567`, mounts a `db` volume at the SQLite path | Local-dev only — not the deployed topology | [`compose.yaml`](https://gitea.arcodange.lab/arcodange-org/url-shortener/src/branch/main/compose.yaml) |
The `FROM scratch` design is what makes the storage story so stark: the container has nothing **except** the SQLite file on its mounted volume. All durable state is one file.
---
## 2. Storage — the key contrast
This is the section that distinguishes `url-shortener` from every other lab app. The entire database is a single SQLite file, `/data/urls.sqlite`, living on a Longhorn PersistentVolumeClaim.
| PVC field | Value | Why it matters |
| --- | --- | --- |
| `accessModes` | `ReadWriteOnce` (RWO) | Only **one** node can mount the volume at a time |
| `resources.requests.storage` | `128Mi` | A URL table is tiny; this is generous |
| `storageClassName` | `longhorn` | Replicated block storage; the source of durability |
| `annotations` | `helm.sh/resource-policy: keep` | The PVC (and its data) **survives `helm uninstall`** |
| Mount | `/data` → SQLite at `/data/urls.sqlite` | Set by the ConfigMap `db_url` ([§3](#3-the-chart)) |
Source: [`chart/templates/pvc.yaml`](https://gitea.arcodange.lab/arcodange-org/url-shortener/src/branch/main/chart/templates/pvc.yaml) and the volume mount in [`chart/templates/deployment.yaml`](https://gitea.arcodange.lab/arcodange-org/url-shortener/src/branch/main/chart/templates/deployment.yaml).
Because the volume is **RWO**, `replicaCount` is hard-pinned to `1`. A second pod scheduled on another node cannot mount the same volume, so there is **no HA and no rolling update**: the next pod can only start after the previous one has released the disk. This is the exact data shape that was reconstructed during the **2026-04-13 power-cut** incident via Longhorn block-device recovery (see [§6](#6-recovery)).
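
The constraint can be expressed as a small pre-deploy check over the chart values. This is a sketch under assumptions: the function name and the `persistence.accessModes` key are hypothetical (the real chart hardcodes the access mode in `pvc.yaml`), while `replicaCount`, `autoscaling.enabled` and `autoscaling.maxReplicas` follow the usual `helm create` value names.

```python
def rwo_replica_problems(values: dict) -> list[str]:
    """Return human-readable problems; an empty list means the values are consistent."""
    access_modes = values.get("persistence", {}).get("accessModes", [])
    if "ReadWriteOnce" not in access_modes:
        return []  # the single-replica rule only applies to RWO volumes

    problems = []
    if values.get("replicaCount", 1) > 1:
        problems.append("replicaCount must be 1 with a ReadWriteOnce volume")

    autoscaling = values.get("autoscaling", {})
    if autoscaling.get("enabled", False) and autoscaling.get("maxReplicas", 1) > 1:
        problems.append("autoscaling.maxReplicas must be 1 with a ReadWriteOnce volume")
    return problems


if __name__ == "__main__":
    values = {
        "replicaCount": 1,
        "persistence": {"accessModes": ["ReadWriteOnce"]},
        "autoscaling": {"enabled": False, "maxReplicas": 100},
    }
    print(rwo_replica_problems(values))
```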
---
## 3. The chart
`url-shortener` ships its own Helm chart at [`chart/`](https://gitea.arcodange.lab/arcodange-org/url-shortener/src/branch/main/chart). It is deployed by ArgoCD like every other lab app, following the [naming conventions](../lab-ecosystem/naming-conventions.md).
| Concern | Setting | Source |
| --- | --- | --- |
| Image | `gitea.arcodange.lab/arcodange-org/url-shortener`, tag defaults to `.Chart.AppVersion` (`6.5.3`) | [`values.yaml`](https://gitea.arcodange.lab/arcodange-org/url-shortener/src/branch/main/chart/values.yaml), [`Chart.yaml`](https://gitea.arcodange.lab/arcodange-org/url-shortener/src/branch/main/chart/Chart.yaml) |
| Service | `ClusterIP`, port `4567` | [`service.yaml`](https://gitea.arcodange.lab/arcodange-org/url-shortener/src/branch/main/chart/templates/service.yaml) |
| Internal ingress | Host `url.arcodange.lab` on `websecure`; TLS via cert-manager `StepClusterIssuer` (`step-issuer`); `localIp@file` middleware (LAN-only) | [`values.yaml`](https://gitea.arcodange.lab/arcodange-org/url-shortener/src/branch/main/chart/values.yaml) |
| Public ingress | Host derived from the internal host by the `.lab → .fr` substitution in `_helpers.tpl`; `PathRegexp` matcher (`/[^/]+`) so it intercepts shortlink redirects; **no `localIp`** (it carries the `crowdsec` middleware instead, since the public path must be reachable) | [`public-ingress.yaml`](https://gitea.arcodange.lab/arcodange-org/url-shortener/src/branch/main/chart/templates/public-ingress.yaml), [`_helpers.tpl`](https://gitea.arcodange.lab/arcodange-org/url-shortener/src/branch/main/chart/templates/_helpers.tpl) |
| ConfigMap | `db_url=/data/urls.sqlite`, `site_url` = the `.fr` FQDN, `slug_style=UID`, `slug_length=4` | [`config.yaml`](https://gitea.arcodange.lab/arcodange-org/url-shortener/src/branch/main/chart/templates/config.yaml) |
| Probes | Liveness + readiness HTTP `GET /` on the `http` port | [`values.yaml`](https://gitea.arcodange.lab/arcodange-org/url-shortener/src/branch/main/chart/values.yaml) |
| Autoscaling | **Disabled.** `maxReplicas > 1` would fail under RWO (a second pod cannot mount the volume) | [`hpa.yaml`](https://gitea.arcodange.lab/arcodange-org/url-shortener/src/branch/main/chart/templates/hpa.yaml), [`values.yaml`](https://gitea.arcodange.lab/arcodange-org/url-shortener/src/branch/main/chart/values.yaml) |
> [!NOTE]
> The `_helpers.tpl` `url-shortener.fqdn` template builds the public `site_url` by taking the first internal ingress host and running `replace ".lab" ".fr"` — so the same value drives the internal `url.arcodange.lab` and the public `.fr` redirect domain. This is the same `.lab ↔ .fr` split documented in [naming-conventions](../lab-ecosystem/naming-conventions.md).
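
The substitution described in the note is small enough to mirror in Python. This is only an analogue of the template logic, not the chart code itself; the function name `public_fqdn` is hypothetical.

```python
def public_fqdn(internal_hosts: list[str]) -> str:
    """Derive the public host from the first internal ingress host (illustrative)."""
    if not internal_hosts:
        raise ValueError("at least one internal ingress host is required")
    # str.replace rewrites every occurrence, so a host that contained ".lab"
    # elsewhere in its name would be rewritten there too.
    return internal_hosts[0].replace(".lab", ".fr")


if __name__ == "__main__":
    print(public_fqdn(["url.arcodange.lab"]))
```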
---
## 4. No iac/, no Vault — a deliberate deviation
`url-shortener` has **no `iac/` directory and no Vault CRDs**, and this is intentional, not an oversight.
| Convention (see [webapp](webapp.md)) | Why url-shortener skips it |
| --- | --- |
| `iac/` OpenTofu module declaring a Postgres role | There is no Postgres. SQLite is embedded; there is no database server to provision. |
| Vault `app_roles` + VSO-synced dynamic credentials | There are no credentials to issue — the app talks to a local file, not a networked database. See [secrets-and-vault](../lab-ecosystem/secrets-and-vault.md) for the pattern url-shortener opts out of. |
For a SQLite-on-a-file app, **the Helm chart *is* the IaC**: the PVC, the ConfigMap, and the ingress fully describe the deployable surface. There is no second provisioning tier. Compare with [webapp](webapp.md), where the Postgres role and Vault role are first-class infrastructure objects managed outside the chart.
---
## 5. CI — mirror, not build
The single workflow, [`.gitea/workflows/dockerimage.yaml`](https://gitea.arcodange.lab/arcodange-org/url-shortener/src/branch/main/.gitea/workflows/dockerimage.yaml), does **not** build the local `actix/` source. It mirrors the upstream image.
| Step | Action |
| --- | --- |
| 1. Discover version | `wget` the upstream `Cargo.toml` from `SinTan1729/chhoto-url` on GitHub and parse `version` |
| 2. Login | Authenticate to `gitea.arcodange.lab` using the `PACKAGES_TOKEN` secret |
| 3. Pull | `docker pull sintan1729/chhoto-url:<tag>` from Docker Hub (`latest` and the discovered version) |
| 4. Retag | `docker tag` to `gitea.arcodange.lab/<repo>:<tag>` |
| 5. Push | `docker push` both `latest` and the version tag to the Gitea registry |
So the in-repo `actix/` source and the `Dockerfile`/`Dockerfile.multiarch` exist for **reference and local cross-compilation** (`Makefile` targets `build-release` / `docker-release` use `cross` for `amd64` / `arm64` / `armv7`), but the cluster runs the **mirrored upstream image**. The multi-arch build is a manual, local-developer flow — it is not run in CI.
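
The five mirror steps can be sketched as a version parser plus a command planner. This is a hedged illustration, not the workflow's real script: the workflow itself uses `wget` and shell tooling, the function names are hypothetical, and the regex assumes the first line starting with `version =` in `Cargo.toml` is the package version. The login step is omitted because it depends on the `PACKAGES_TOKEN` secret.

```python
import re

SOURCE = "sintan1729/chhoto-url"
TARGET = "gitea.arcodange.lab/arcodange-org/url-shortener"


def upstream_version(cargo_toml_text: str) -> str:
    """Extract the package version from Cargo.toml text (simplified parsing)."""
    match = re.search(r'^version\s*=\s*"([^"]+)"', cargo_toml_text, flags=re.MULTILINE)
    if match is None:
        raise ValueError("no version field found")
    return match.group(1)


def mirror_commands(version: str) -> list[list[str]]:
    """Plan pull, retag and push for both `latest` and the discovered version."""
    commands = []
    for tag in ("latest", version):
        commands.append(["docker", "pull", f"{SOURCE}:{tag}"])
        commands.append(["docker", "tag", f"{SOURCE}:{tag}", f"{TARGET}:{tag}"])
        commands.append(["docker", "push", f"{TARGET}:{tag}"])
    return commands


if __name__ == "__main__":
    sample = '[package]\nname = "chhoto-url"\nversion = "6.5.3"\n'
    for cmd in mirror_commands(upstream_version(sample)):
        print(" ".join(cmd))
```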
---
## 6. Recovery
`url-shortener`'s SQLite-on-RWO is the canonical example that the lab's recovery tooling targets. When a node dies mid-write or the cluster loses power, the durable artifact to reconstruct is exactly this kind of Longhorn block volume.
- The concept and policy live in [storage-and-recovery](../lab-ecosystem/storage-and-recovery.md).
- The mechanics live in the [Ansible recover playbooks](../factory-provisioning/ansible/06-recover.md): the `longhorn_data.yml` block-device recovery flow reconstructs precisely this single-file SQLite volume, which is what was recovered in the **2026-04-13** power-cut.
> [!WARNING]
> Single replica + RWO means **downtime on any pod move**: a node drain, an upgrade, or a reschedule cannot overlap the old and new pods — the new one waits for the disk to detach. There is **no application-level redundancy**. The only durability is **Longhorn volume replication plus backups**; if the volume is lost and unbacked, the URL database is gone. Treat backup health as the single point of failure for this app.
---
## 7. Deviations from convention (vs webapp)
A side-by-side of where `url-shortener` (stateful) departs from [webapp](webapp.md) (the stateless reference):
| Dimension | webapp | url-shortener |
| --- | --- | --- |
| Database | PostgreSQL (external server) | SQLite (embedded, single file) |
| Vault / secrets | `app_roles` + VSO-synced dynamic creds | **None** — no networked credentials |
| `iac/` directory | Yes (Postgres role, Vault role) | **None** — the Helm chart is the IaC |
| Replicas | Scalable (HPA-eligible) | **1, hard-pinned** (RWO forbids more) |
| Rolling update / HA | Yes | **No** — single pod, downtime on move |
| CI | Builds the app source | **Mirrors** the upstream image |
| Recovery shape | Postgres backup/restore | Longhorn block-device recovery |
---
## 8. Deploy + storage path
```mermaid
%%{init: {'theme':'base'}}%%
flowchart LR
upstream["Docker Hub<br>sintan1729/chhoto-url"]
ci["Gitea Actions<br>dockerimage.yaml (mirror)"]
reg["Gitea registry<br>gitea.arcodange.lab/<br>arcodange-org/url-shortener"]
argo["ArgoCD + Helm chart"]
pod["Pod (replicaCount 1)<br>actix on :4567"]
pvc["Longhorn PVC (RWO, 128Mi)<br>keep policy"]
db["/data/urls.sqlite"]
upstream -- "pull + retag" --> ci
ci -- "push latest + version" --> reg
reg -- "image ref" --> argo
argo -- "deploy" --> pod
pod -- "mounts (RWO)" --> pvc
pvc -- "holds" --> db
classDef box fill:#1f2933,stroke:#7b8794,color:#f5f7fa;
class upstream,ci,reg,argo,pod,pvc,db box;
```
1. **Upstream** — the canonical `sintan1729/chhoto-url` image is published to Docker Hub by the original maintainer.
2. **CI mirror** — the lab's `dockerimage.yaml` workflow pulls that image, retags it, and pushes it to the Gitea registry (it does not build from `actix/`).
3. **Gitea registry** — `gitea.arcodange.lab/arcodange-org/url-shortener` holds both `latest` and the version tag.
4. **ArgoCD + Helm** — the chart references the registry image (tag defaults to `appVersion`) and renders the Deployment, Service, ingresses, ConfigMap, and PVC.
5. **Pod** — a single `actix` pod listens on `4567`; HPA and rolling updates are off.
6. **Longhorn PVC** — the pod mounts the RWO volume at `/data`; only one pod can hold it.
7. **SQLite file** — all durable state is the single `/data/urls.sqlite` file, which is what [Longhorn block-device recovery](../factory-provisioning/ansible/06-recover.md) reconstructs.
---
See also: [webapp](webapp.md) (the stateless, Postgres-backed contrast) · [Applications index](README.md) · [naming-conventions](../lab-ecosystem/naming-conventions.md) · [storage-and-recovery](../lab-ecosystem/storage-and-recovery.md).


@@ -0,0 +1,155 @@
[vibe](../../README.md) > [Guidebooks](../README.md) > [Applications](README.md) > **webapp**
# webapp
> **Status:** ✅ Active
> **Last Updated:** 2026-06-23
> **Upstream:** [Applications](README.md) · [01 · factory](../lab-ecosystem/01-factory.md) · [tools secrets-and-vso](../tools/secrets-and-vso.md)
> **Downstream:** the template every simple Postgres-backed app (`erp`, `dance-lessons-coach`) is cloned from
> **Related:** [url-shortener](url-shortener.md) · [naming-conventions](../lab-ecosystem/naming-conventions.md) · [secrets-and-vault concept](../lab-ecosystem/secrets-and-vault.md) · [factory postgres-iac](../factory-provisioning/opentofu/postgres-iac.md) · [tofu CI apply flow](../factory-provisioning/opentofu/ci-apply-flow.md) · [safe-prod-like-environment ADR](../../ADR/0001-safe-prod-like-environment.md)
`webapp` is the **canonical simple-app exemplar** — a deliberately small Go diagnostic app whose whole job is to exercise the lab's plumbing so every other Postgres-backed app can be cloned from its shape. It ships the four ingredients of the [common app pattern](README.md) (Dockerfile, `chart/`, `iac/`, `.gitea/workflows`) in their most legible form, with no business logic to read around. When you add a new simple app, this repo is the skeleton you copy.
What it actually does at runtime is a handful of probes:
| Endpoint | Purpose |
|---|---|
| `GET /` | Serves an HTML form that posts a number to `/query` |
| `GET /query?param=N` | Runs the parameterized query `SELECT 42 + $1` against Postgres and renders the result — the end-to-end "is the DB reachable and answering" check |
| `GET /liveness` | Always-`200 OK` liveness probe (no DB touch) |
| `GET /readiness` | Calls `db.Ping()`; returns `503 NOT READY` if Postgres is unreachable — so the pod only takes traffic once the DB is live |
| `GET /display-info` | Dumps the request's cookies, client IP, and headers — used to confirm the real client IP survives the ingress path |
| `GET /oauth-callback`, `/retrieve`, `/test-oauth-callback` | OAuth device-flow test endpoints (see below) — a workaround for Gitea lacking the OIDC device grant |
> [!NOTE]
> The OAuth endpoints exist because Gitea's OIDC provider only advertises `authorization_code` + `refresh_token`, not the device grant. `webapp` stands in as the redirect target: `/oauth-callback` stores the `code` keyed by the client-chosen `state` in an in-memory cache (5-minute TTL), and a CLI client then polls `/retrieve?state=…` to exchange its `state` for the `code`. `/retrieve` is IP-gated — it only answers callers in the LAN CIDRs (`192.168.0.0/16`, the IPv6 prefix, the k3s `10.42.0.0/16`) or an explicit `OAUTH_DEVICE_CODE_ALLOWED_IPS` allow-list — which is exactly why the pod must see the **real client IP** (see the `nodeSelector` note in [the chart](#2-the-chart)).
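
The IP gate on `/retrieve` can be approximated as follows. This is a Python sketch of the idea, not a port of the Go handler: the function name is hypothetical, the comma-separated format assumed for `OAUTH_DEVICE_CODE_ALLOWED_IPS` is a guess, and the IPv6 prefix is omitted because the text above does not spell it out.

```python
import ipaddress

# The two IPv4 ranges named in the note above.
LAN_CIDRS = ("192.168.0.0/16", "10.42.0.0/16")


def is_allowed(client_ip: str, extra_allowed: str = "") -> bool:
    """Return True when client_ip is in a LAN CIDR or in the explicit allow-list."""
    try:
        ip = ipaddress.ip_address(client_ip)
    except ValueError:
        return False  # unparseable addresses are rejected

    networks = [ipaddress.ip_network(cidr) for cidr in LAN_CIDRS]
    for entry in (item.strip() for item in extra_allowed.split(",")):
        if entry:
            # strict=False lets a bare address act as a single-host network
            networks.append(ipaddress.ip_network(entry, strict=False))
    return any(ip in network for network in networks)


if __name__ == "__main__":
    for candidate in ("192.168.1.20", "10.42.3.4", "8.8.8.8"):
        print(candidate, is_allowed(candidate))
```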
---
## 1) The app & image
A single-file Go program with two third-party dependencies, built into a tiny runtime image.
| Aspect | Value |
|---|---|
| Source | [`main.go`](https://gitea.arcodange.lab/arcodange-org/webapp/src/branch/main/main.go) — one file, `package main` |
| Module | [`go.mod`](https://gitea.arcodange.lab/arcodange-org/webapp/src/branch/main/go.mod) · `gitea.arcodange.lab/arcodange-org/webapp` · Go 1.23 |
| Postgres driver | `github.com/lib/pq` v1.10.9 — registered as the `postgres` driver; the connection string comes from `DATABASE_URL` |
| Cache | `github.com/patrickmn/go-cache` v2.1.0 — the in-memory `state → code` store for the OAuth callback (5 min default expiry, 10 min cleanup) |
| Listen port | `:8080`, plain HTTP (`net/http` default mux) — TLS is terminated upstream at Traefik |
| Query | `SELECT 42 + $1` — parameterized (`$1` bound, not interpolated) so the diagnostic endpoint is not an injection vector |
| Dockerfile | [`Dockerfile`](https://gitea.arcodange.lab/arcodange-org/webapp/src/branch/main/Dockerfile) — multi-stage: `golang:1.23-alpine` builder (`go build -o app .`) → `alpine:latest` runtime with `ca-certificates`, `EXPOSE 8080`, `CMD ["./app"]` |
| Image | `gitea.arcodange.lab/arcodange-org/webapp` — pushed to the Gitea container registry by CI (tags `latest` + the git ref name) |
The runtime image carries no config of its own: everything (`DATABASE_URL`, `OAUTH_ALLOWED_HOSTS`, the optional `OAUTH_DEVICE_CODE_ALLOWED_IPS`) arrives as environment variables from the chart's ConfigMap.
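
To illustrate why the bound parameter in the Query row matters, here is a minimal Python analogue. It is an assumption-laden stand-in: it uses the standard-library `sqlite3` module with an in-memory database instead of Postgres, so the placeholder is `?` rather than `$1`, and the function name `answer` is hypothetical.

```python
import sqlite3


def answer(param: int) -> int:
    """Run the diagnostic query with the input bound as a parameter."""
    conn = sqlite3.connect(":memory:")
    try:
        # The value travels separately from the SQL text, so it can never
        # change the shape of the statement.
        row = conn.execute("SELECT 42 + ?", (param,)).fetchone()
        return row[0]
    finally:
        conn.close()


if __name__ == "__main__":
    print(answer(8))
```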
---
## 2) The chart
The Helm chart at [`chart/`](https://gitea.arcodange.lab/arcodange-org/webapp/src/branch/main/chart) ([`Chart.yaml`](https://gitea.arcodange.lab/arcodange-org/webapp/src/branch/main/chart/Chart.yaml): `name: webapp`, `appVersion: "latest"`) is the unit ArgoCD syncs into the `webapp` namespace. It is the boilerplate `helm create` scaffold plus the Vault CRDs and a hardcoded second ingress.
| Chart object | Template | Shape |
|---|---|---|
| **Deployment** | [`deployment.yaml`](https://gitea.arcodange.lab/arcodange-org/webapp/src/branch/main/chart/templates/deployment.yaml) | `replicaCount: 1`; `revisionHistoryLimit: 3`; one container on `containerPort: 8080`; `envFrom` the ConfigMap; liveness + readiness probes |
| **Node pinning** | `nodeSelector` in [`values.yaml`](https://gitea.arcodange.lab/arcodange-org/webapp/src/branch/main/chart/values.yaml) | `kubernetes.io/hostname: pi1` — pinned to the **network entrypoint node** so traffic avoids NAT and the pod sees the **real client IP** (load-bearing for the IP-gated `/retrieve` and for `/display-info`) |
| **ConfigMap** | [`config.yaml`](https://gitea.arcodange.lab/arcodange-org/webapp/src/branch/main/chart/templates/config.yaml) | `OAUTH_ALLOWED_HOSTS: webapp.arcodange.lab,webapp.arcodange.fr`; `DATABASE_URL: postgres://pgbouncer_auth:pgbouncer_auth@pgbouncer.tools/postgres?sslmode=disable` |
| **Public ingress** | [`ingress.yaml`](https://gitea.arcodange.lab/arcodange-org/webapp/src/branch/main/chart/templates/ingress.yaml) (values-driven) | host `webapp.arcodange.fr`, Traefik `web` entrypoint (HTTP), middleware `kube-system-crowdsec@kubernetescrd` (the CrowdSec bouncer) |
| **Internal ingress** | [`localIngress.yaml`](https://gitea.arcodange.lab/arcodange-org/webapp/src/branch/main/chart/templates/localIngress.yaml) (hardcoded manifest) | `Ingress/webapp-local`, host `webapp.arcodange.lab`, Traefik `websecure` entrypoint, `letsencrypt` certresolver, middleware `localIp@file` (LAN-only) |
| **Service** | [`service.yaml`](https://gitea.arcodange.lab/arcodange-org/webapp/src/branch/main/chart/templates/service.yaml) | `ClusterIP` on port 8080 → `http` target |
| **ServiceAccount** | [`serviceaccount.yaml`](https://gitea.arcodange.lab/arcodange-org/webapp/src/branch/main/chart/templates/serviceaccount.yaml) | created as `webapp`, token auto-mounted — the identity VSO uses to authenticate to Vault |
| **Probes** | `livenessProbe` / `readinessProbe` in [`values.yaml`](https://gitea.arcodange.lab/arcodange-org/webapp/src/branch/main/chart/values.yaml) | liveness → `/liveness` (cheap), readiness → `/readiness` (**pings the DB**), both on the `http` port |
| **HPA** | [`hpa.yaml`](https://gitea.arcodange.lab/arcodange-org/webapp/src/branch/main/chart/templates/hpa.yaml) | gated on `autoscaling.enabled`, which is `false` → **HPA disabled**; the single replica is fixed |
> [!NOTE]
> The internal `.lab` ingress is shipped as a **hardcoded `localIngress.yaml`** (a plain manifest, not the templated `ingress.yaml`), and the equivalent block in `values.yaml` is left commented out. That is why webapp has two ingress *templates* but the values file only configures the `.fr` public one. The split — `web`/CrowdSec public vs `websecure`/`localIp`/letsencrypt internal — is the lab-wide [ingress convention](README.md#ingress-convention--fr-public-vs-lab-internal).
---
## 3) Vault CRDs in the chart
The chart ships the **full Vault Secrets Operator (VSO) wiring** as three CRDs. Together they let the pod authenticate to Vault as `webapp` and pull both static config and dynamic DB credentials.
| CRD | Template | What it declares |
|---|---|---|
| **VaultAuth** | [`vaultauth.yaml`](https://gitea.arcodange.lab/arcodange-org/webapp/src/branch/main/chart/templates/vaultauth.yaml) | `kubernetes` auth method on mount `kubernetes`, **role `webapp`**, ServiceAccount `webapp`, audience `vault` — the login other CRDs reference via `vaultAuthRef: auth` |
| **VaultStaticSecret** | [`vaultsecret.yaml`](https://gitea.arcodange.lab/arcodange-org/webapp/src/branch/main/chart/templates/vaultsecret.yaml) | `kv-v2` on mount `kvv2`, path **`webapp/config`** → k8s Secret **`secretkv`** (created by VSO), `refreshAfter: 30s` |
| **VaultDynamicSecret** | [`vaultdynamicsecret.yaml`](https://gitea.arcodange.lab/arcodange-org/webapp/src/branch/main/chart/templates/vaultdynamicsecret.yaml) | mount `postgres`, path **`creds/webapp`** → k8s Secret **`vso-db-credentials`** (created by VSO), with a `rolloutRestartTargets` entry on the `webapp` **Deployment** so the pod restarts when creds rotate |
The runtime path these ride — VSO reading Vault on the pod's behalf and materializing k8s Secrets — is documented in [tools secrets-and-vso](../tools/secrets-and-vso.md).
---
## 4) The key nuance — wiring shipped, live pod still on static creds
> [!NOTE]
> **webapp provisions the complete dynamic-DB-credentials path but does not yet consume it.** The chart's `VaultDynamicSecret` (path `postgres/creds/webapp` → Secret `vso-db-credentials`, with a `rolloutRestart` on the Deployment) and the matching `iac/` role together stand up the **entire** per-app dynamic-credentials machinery end to end. But the **running Deployment takes `DATABASE_URL` from the ConfigMap**, which points at the **shared static `pgbouncer_auth` user** (`postgres://pgbouncer_auth:pgbouncer_auth@pgbouncer.tools/postgres?sslmode=disable`), and it does **not** mount the `vso-db-credentials` Secret. So `webapp` demonstrates the dynamic-creds wiring in full as a reference, while its live pod runs on the shared static account. Switching the live pod to rotating per-app credentials is simply a matter of consuming the `vso-db-credentials` Secret (e.g. project its fields into `DATABASE_URL` instead of the ConfigMap value). This is the current state, by design as an exemplar — not a misconfiguration.
This is the one place to read `webapp` carefully: as a **template** it shows you every CRD and IaC resource a dynamic-creds app needs; as a **deployed workload** it is still on the shared pooler user. When you clone it for a real app, the last step is to wire the pod to `vso-db-credentials`.
---
## 5) iac/ — Vault objects declared inline
The OpenTofu under [`iac/`](https://gitea.arcodange.lab/arcodange-org/webapp/src/branch/main/iac) declares webapp's Vault objects. Unlike `erp` and `dance-lessons-coach`, which call the shared **`app_roles` module** from `tools` (see [tools secrets-and-vso](../tools/secrets-and-vso.md)), webapp declares them **inline** in `main.tf` — which is exactly why it reads as the legible reference: every resource is visible in one file rather than hidden behind a module call.
| File | Contents |
|---|---|
| [`providers.tf`](https://gitea.arcodange.lab/arcodange-org/webapp/src/branch/main/iac/providers.tf) | `vault` provider v4.4.0 at `https://vault.arcodange.lab`; authenticates via `auth_login_jwt { mount = "gitea_jwt", role = "gitea_cicd_webapp" }` (token from `TERRAFORM_VAULT_AUTH_JWT`) |
| [`backend.tf`](https://gitea.arcodange.lab/arcodange-org/webapp/src/branch/main/iac/backend.tf) | GCS backend, bucket `arcodange-tf`, prefix **`webapp/main`** |
| [`main.tf`](https://gitea.arcodange.lab/arcodange-org/webapp/src/branch/main/iac/main.tf) | three inline resources (below) |
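
As a rough sketch of what the first two files plausibly look like, assuming only the values listed in the table (the real files may set more arguments, and the `required_providers` version pin is left out here):

```hcl
# backend.tf (sketch): state lives in the GCS bucket under the per-app prefix.
terraform {
  backend "gcs" {
    bucket = "arcodange-tf"
    prefix = "webapp/main"
  }
}

# providers.tf (sketch): log in to Vault with the Gitea-issued JWT.
provider "vault" {
  address = "https://vault.arcodange.lab"

  auth_login_jwt {
    mount = "gitea_jwt"
    role  = "gitea_cicd_webapp"
    # `jwt` is deliberately not set: the provider reads the token from the
    # TERRAFORM_VAULT_AUTH_JWT environment variable.
  }
}
```

Keeping the JWT out of the configuration means the same files work unchanged in CI, where the token is minted per run.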
The three resources in `main.tf`:
| Resource | Vault object | Detail |
|---|---|---|
| `vault_database_secret_backend_role.role` | `postgres/creds/webapp` | creation SQL `CREATE ROLE "{{name}}" WITH LOGIN PASSWORD … VALID UNTIL '{{expiration}}'` then **`GRANT webapp_role TO "{{name}}"`**; revocation `REVOKE ALL ON DATABASE webapp FROM …` |
| `vault_kubernetes_auth_backend_role.role` | k8s auth role `webapp` | bound to SA `webapp` + namespace `webapp`, `audience = vault`, `token_policies = ["default", "webapp"]`, `token_ttl = 3600` |
| `vault_kv_secret_v2.webapp_config` | `kvv2/webapp/config` | the KV-v2 config secret VSO reads into the `secretkv` k8s Secret |
> [!IMPORTANT]
> The `GRANT webapp_role TO …` statement depends on the **`webapp_role`** Postgres group role being created first by factory's [postgres-iac](../factory-provisioning/opentofu/postgres-iac.md). webapp's IaC mints *short-lived login roles* that inherit the privileges of that pre-existing `webapp_role`; if `webapp_role` does not exist, the dynamic-credential creation fails at grant time.
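
For orientation, the Kubernetes auth role row of the table could be written roughly as follows. This is a sketch assembled from the values above, not a copy of `main.tf`; the `backend` value assumes the `kubernetes` mount named in the `VaultAuth` CRD.

```hcl
# Sketch of the k8s auth role: only pods running as ServiceAccount `webapp`
# in namespace `webapp` can log in and receive the listed policies.
resource "vault_kubernetes_auth_backend_role" "role" {
  backend                          = "kubernetes" # assumed: mount from the VaultAuth CRD
  role_name                        = "webapp"
  bound_service_account_names      = ["webapp"]
  bound_service_account_namespaces = ["webapp"]
  audience                         = "vault"
  token_policies                   = ["default", "webapp"]
  token_ttl                        = 3600 # seconds
}
```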
---
## 6) CI workflows
Two Gitea Actions workflows under [`.gitea/workflows/`](https://gitea.arcodange.lab/arcodange-org/webapp/src/branch/main/.gitea/workflows), each gated to the part of the repo it owns.
| Workflow | File | Trigger | What it does |
|---|---|---|---|
| **Hashicorp Vault** | [`vault.yaml`](https://gitea.arcodange.lab/arcodange-org/webapp/src/branch/main/.gitea/workflows/vault.yaml) | push / PR touching `iac/*.tf` (+ manual) | a `gitea_vault_auth` job mints the Gitea OIDC token, then a `tofu` job runs **`terraform apply`** on `iac/` with **OpenTofu 1.8.2**; reads `kvv1/google/credentials` for the GCS backend; `VAULT_CACERT` is built from the `HOMELAB_CA_CERT` secret |
| **Docker Build** | [`dockerimage.yaml`](https://gitea.arcodange.lab/arcodange-org/webapp/src/branch/main/.gitea/workflows/dockerimage.yaml) | push to `main` (ignoring `README.md`, `chart/**`) + manual | logs into the Gitea registry with `PACKAGES_TOKEN`, `docker build`, pushes **`latest`** and the **git ref-name** tag to `gitea.arcodange.lab/<repo>` |
The `vault.yaml` flow — Gitea OIDC → Vault JWT login → `tofu apply` with GCS state — is the lab-standard CI apply path described in [tofu CI apply flow](../factory-provisioning/opentofu/ci-apply-flow.md). The image is then rolled out cluster-side: ArgoCD Image Updater (factory side) watches the registry on a **digest** strategy and bumps the running Deployment when the `latest` digest changes.
---
## 7) `<app>` convention mapping for webapp
The single string `webapp` is the join key threading through every layer of the stack (the lab-wide [naming convention](../lab-ecosystem/naming-conventions.md)):
| Layer | Value for `webapp` |
|---|---|
| Gitea repo | `arcodange-org/webapp` |
| Container image | `gitea.arcodange.lab/arcodange-org/webapp` (tags `latest` + ref-name) |
| PostgreSQL | database `webapp`, group role **`webapp_role`** (from factory [postgres-iac](../factory-provisioning/opentofu/postgres-iac.md)) |
| Vault — dynamic DB | `postgres/creds/webapp` |
| Vault — KV config | `kvv2/webapp/config` |
| Vault — k8s auth role | `webapp` (policies `default`, `webapp`; SA + ns `webapp`) |
| Vault — CI JWT role | `gitea_cicd_webapp` (mount `gitea_jwt`) |
| Terraform state | GCS `arcodange-tf` prefix `webapp/main` |
| Kubernetes | namespace `webapp`, ServiceAccount `webapp` |
| ArgoCD | `Application` `webapp` (chart synced into ns `webapp`) |
| Ingress hosts | public `webapp.arcodange.fr` · internal `webapp.arcodange.lab` |
---
## Cross-references
- [url-shortener](url-shortener.md) — the **stateful contrast**: Rust + embedded SQLite on a Longhorn PVC, single replica, **no `iac/` and no Vault objects** at all. webapp (shared/scalable Postgres) vs url-shortener (self-contained single-writer SQLite) are the two archetypes of the [Applications hub](README.md).
- [tools secrets-and-vso](../tools/secrets-and-vso.md) — the VSO runtime path the Vault CRDs ride, and the shared `app_roles` module that `erp`/`dance-lessons-coach` use *instead* of webapp's inline declarations.
- [factory postgres-iac](../factory-provisioning/opentofu/postgres-iac.md) — creates the `webapp` database and `webapp_role` that webapp's `GRANT` depends on.
- [tofu CI apply flow](../factory-provisioning/opentofu/ci-apply-flow.md) — the Gitea OIDC → Vault JWT → `tofu apply` pipeline behind `vault.yaml`.
- [naming-conventions](../lab-ecosystem/naming-conventions.md) — the `<app>` join key tabulated in section 7.
- [safe-prod-like-environment ADR](../../ADR/0001-safe-prod-like-environment.md) — why a throwaway diagnostic app is still deployed prod-like, complete with dynamic-creds wiring.

@@ -0,0 +1,75 @@
[vibe](../../README.md) > [Guidebooks](../README.md) > **CMS**
# CMS
> **Status:** ✅ Active
> **Last Updated:** 2026-06-23
> **Upstream:** [lab-ecosystem 03 · cms](../lab-ecosystem/03-cms.md)
> **Downstream:** [Site (Nuxt)](site.md) · [Cloudflare](cloudflare.md) · [Zoho email](zoho-email.md)
> **Related:** [tools CrowdSec](../tools/components.md) · [secrets-and-vault concept](../lab-ecosystem/secrets-and-vault.md) · [tofu CI flow](../factory-provisioning/opentofu/ci-apply-flow.md) · [safe-env ADR](../../ADR/0001-safe-prod-like-environment.md)
This guidebook maps the [`cms` repo](https://gitea.arcodange.lab/arcodange-org/cms) — the one app in the lab whose primary audience is the open Internet. It serves the public site **arcodange.fr** and owns the OpenTofu that wires its Cloudflare edge, its Cloudflared tunnel into the cluster, its Turnstile CAPTCHA, and its Zoho email.
## Two faces of one repo
The `cms` repo holds two distinct concerns that share a domain but live in different directories.
| Face | Where | What it is |
|---|---|---|
| **The SITE** | repo root ([`app/`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/app), [`content/`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/content), [`chart/`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/chart)) | A **Nuxt 4** application (Nuxt Content + Nuxt Studio) built to static output and deployed **two ways**: to **Cloudflare Pages** (public `arcodange.fr` / `www`) and into **k3s** via a Helm chart (ArgoCD app **`cms`**), reachable through the Cloudflared tunnel (e.g. `cms-rec.arcodange.fr`) or the lab ingress (`www.arcodange.lab`) |
| **The IaC** | [`cloudflare/`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/cloudflare) | **OpenTofu** managing the `arcodange.fr` zone (registered at OVH, DNS delegated to Cloudflare), Cloudflare **Pages**, the **Cloudflared** Zero-Trust tunnel into internal Traefik, a **Turnstile** CAPTCHA feeding CrowdSec, and **Zoho** email |
The site is *what visitors see*; the IaC is *how they reach it and how mail flows*. Both deploy from the same Gitea repo through Gitea Actions.
## Public request + email flow
```mermaid
%%{init: {'theme': 'base'}}%%
flowchart LR
classDef edge fill:#d97706,stroke:#b45309,color:#fff
classDef proc fill:#059669,stroke:#047857,color:#fff
classDef store fill:#7c3aed,stroke:#6d28d9,color:#fff
USER(["Visitor"]):::edge
CFDNS["Cloudflare DNS<br>arcodange.fr zone"]:::edge
PAGES["Cloudflare Pages<br>(static Nuxt build)"]:::proc
TUN["Cloudflared tunnel"]:::edge
TRAEFIK["internal Traefik"]:::proc
CS["CrowdSec bouncer<br>(Turnstile-backed)"]:::proc
CMS["cms pod (Nuxt)<br>cms-rec.arcodange.fr"]:::proc
MAIL(["Sender"]):::edge
ZOHO["Zoho<br>MX / SPF / DKIM / DMARC / BIMI"]:::store
USER --> CFDNS
CFDNS -- "arcodange.fr / www" --> PAGES
CFDNS -- "*.arcodange.fr" --> TUN
TUN --> TRAEFIK --> CS --> CMS
MAIL -- "MX lookup arcodange.fr" --> ZOHO
```
1. A **visitor** resolves a hostname under `arcodange.fr` through **Cloudflare DNS** (the zone OpenTofu manages).
2. The apex and `www` records (proxied CNAMEs) land on **Cloudflare Pages**, which serves the static Nuxt build directly from the edge.
3. Wildcard `*.arcodange.fr` hostnames route through the **Cloudflared** Zero-Trust tunnel — no home-LAN ports are opened — onto **internal Traefik**, which passes the request through the **CrowdSec** bouncer (its CAPTCHA challenge backed by Turnstile) to the in-cluster **`cms`** Nuxt pod (e.g. `cms-rec.arcodange.fr`).
4. Separately, **email** to `arcodange.fr` follows the **MX** record to **Zoho**, with **SPF/DKIM/DMARC/BIMI** authenticating and presenting the mail.
## Index
| Page | What it maps | Status |
|---|---|---|
| [Site (Nuxt)](site.md) | The Nuxt 4 app: Nuxt Content + Studio, static build, the dual deploy to Cloudflare Pages and to k3s via the Helm chart / ArgoCD app `cms` | ✅ Active |
| [Cloudflare](cloudflare.md) | The `cloudflare/` OpenTofu: zone (OVH-registered, CF-delegated), Pages, the Cloudflared tunnel into Traefik, and the Turnstile CAPTCHA for CrowdSec | ✅ Active |
| [Zoho email](zoho-email.md) | Zoho mail IaC: domain verification, MX/SPF/DKIM/DMARC/BIMI records, and the public aliases | ✅ Active |
## Maintenance rule
> [!IMPORTANT]
> **If any component documented in this guidebook is altered, update the page describing it in the same change.** A reference map that drifts from the real `cms` repo sends readers and agents down dead paths. The PR that changes a component is the PR that updates its CMS guidebook page.
## Cross-references
- [lab-ecosystem 03 · cms](../lab-ecosystem/03-cms.md) — the whole-lab view of where `cms` sits among `factory` + `tools`.
- [tools CrowdSec](../tools/components.md) — the Traefik bouncer the Turnstile challenge feeds for public-edge decisioning.
- [secrets-and-vault concept](../lab-ecosystem/secrets-and-vault.md) — where the Cloudflared tunnel token, Turnstile secret, and Cloudflare/Zoho/OVH credentials live in Vault.
- [tofu CI flow](../factory-provisioning/opentofu/ci-apply-flow.md) — the OpenTofu apply pipeline pattern the `cloudflare/` IaC follows in Gitea Actions.
- [safe-env ADR](../../ADR/0001-safe-prod-like-environment.md) — why public-facing surfaces like this one are isolated from a safe prod-like environment.
- Repo: [arcodange-org/cms](https://gitea.arcodange.lab/arcodange-org/cms).

@@ -0,0 +1,182 @@
[vibe](../../README.md) > [Guidebooks](../README.md) > [CMS](README.md) > **Cloudflare**
# Cloudflare
> **Status:** ✅ Active
> **Last Updated:** 2026-06-23
> **Upstream:** [CMS](README.md) · [lab-ecosystem 03 · cms](../lab-ecosystem/03-cms.md)
> **Downstream:** [tools CrowdSec](../tools/components.md) (consumes the Turnstile widget)
> **Related:** [Zoho email](zoho-email.md) · [tofu CI flow](../factory-provisioning/opentofu/ci-apply-flow.md) · [secrets-and-vault concept](../lab-ecosystem/secrets-and-vault.md) · [naming conventions](../lab-ecosystem/naming-conventions.md) · [safe-env ADR](../../ADR/0001-safe-prod-like-environment.md)
This page maps [`cms/cloudflare/`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/cloudflare), the OpenTofu root that owns the **`arcodange.fr`** edge. One `tofu apply` manages the domain registration at OVH (imported into state, not created by OpenTofu), **delegates its DNS to Cloudflare**, publishes the public site on **Cloudflare Pages**, opens a **Cloudflared** Zero-Trust tunnel into the in-cluster Traefik, mints the **Turnstile** CAPTCHA the [tools CrowdSec bouncer](../tools/components.md) challenges with, and (via a sibling module) wires **Zoho** mail. The Nuxt site itself is not built here; see [Site (Nuxt)](site.md).
## Providers
Declared in [`providers.tf`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/cloudflare/providers.tf). Versions pinned in [`.terraform.lock.hcl`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/cloudflare/.terraform.lock.hcl).
| Provider | Source | Version | Auth | Purpose |
|---|---|---|---|---|
| `cloudflare` | `cloudflare/cloudflare` | `~> 5` | `CLOUDFLARE_API_TOKEN` env | Zone, Pages, DNS records, Zero-Trust tunnel, Turnstile, zone settings |
| `ovh` | `ovh/ovh` | `~> 2.8` | `OVH_*` env (`ovh-eu` endpoint) | Domain registration + nameserver delegation |
| `vault` | `vault` | `5.5.0` | `auth_login_jwt` (mount `gitea_jwt`, role `gitea_cicd_cms`) at `https://vault.arcodange.lab` | Persists the Turnstile secret/sitekey; reads tunnel token |
> [!NOTE]
> The Vault provider authenticates with a **Gitea-issued OIDC JWT** (`TERRAFORM_VAULT_AUTH_JWT`), the same OIDC→Vault pattern the [tofu CI flow](../factory-provisioning/opentofu/ci-apply-flow.md) documents lab-wide.
## State backend — S3 on Cloudflare R2
[`backend.tf`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/cloudflare/backend.tf) keeps state in an **S3-compatible bucket on Cloudflare R2**, not AWS. The `skip_*` flags and `use_path_style` are what let the AWS S3 backend talk to R2.
| Setting | Value |
|---|---|
| `bucket` | `arcodange-tf` |
| `key` | `cms/terraform.tfstate` |
| `region` | `auto` |
| `endpoints.s3` | `var.CLOUDFLARE_S3_ENDPOINT` (R2 S3 API URL) |
| `access_key` / `secret_key` | `var.CLOUDFLARE_S3_ACCESS_KEY` / `var.CLOUDFLARE_S3_SECRET_ACCESS_KEY` |
| Flags | `skip_credentials_validation`, `skip_metadata_api_check`, `skip_region_validation`, `skip_requesting_account_id`, `skip_s3_checksum`, `use_path_style` |
> [!WARNING]
> The R2 backend credentials are **Terraform variables**, so they must be present in the environment *before* `tofu init` can read state. CI injects them from Vault path `kvv1/cloudflare/r2/arcodange-tf` (mapped to `TF_VAR_CLOUDFLARE_*` — see [CI](#ci--cloudflareyaml) below). Without those creds nothing — not even a read-only plan — can run.
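
The settings table translates into a backend block along these lines. It is a sketch under the assumption that the three variables are declared as plain strings; the declarations shown are illustrative, not copied from the repo. Referencing variables inside a `backend` block relies on OpenTofu's early evaluation (available from 1.8), which is why the values must be supplied before `tofu init`.

```hcl
# Illustrative declarations; in CI the values arrive as TF_VAR_CLOUDFLARE_* env vars.
variable "CLOUDFLARE_S3_ENDPOINT" {
  type = string
}

variable "CLOUDFLARE_S3_ACCESS_KEY" {
  type = string
}

variable "CLOUDFLARE_S3_SECRET_ACCESS_KEY" {
  type = string
}

terraform {
  backend "s3" {
    bucket = "arcodange-tf"
    key    = "cms/terraform.tfstate"
    region = "auto" # R2 has no AWS region

    endpoints = {
      s3 = var.CLOUDFLARE_S3_ENDPOINT
    }
    access_key = var.CLOUDFLARE_S3_ACCESS_KEY
    secret_key = var.CLOUDFLARE_S3_SECRET_ACCESS_KEY

    # Turn off AWS-specific checks that R2 cannot answer.
    skip_credentials_validation = true
    skip_metadata_api_check     = true
    skip_region_validation      = true
    skip_requesting_account_id  = true
    skip_s3_checksum            = true
    use_path_style              = true
  }
}
```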
## Resource graph
```mermaid
%%{init: {'theme': 'base'}}%%
flowchart TD
classDef ovh fill:#1e3a8a,stroke:#1e40af,color:#fff
classDef cf fill:#d97706,stroke:#b45309,color:#fff
classDef mod fill:#059669,stroke:#047857,color:#fff
classDef vault fill:#7c3aed,stroke:#6d28d9,color:#fff
OVHDOM["ovh_domain_name<br>arcodange.fr"]:::ovh
OVHNS["ovh_domain_name_servers<br>delegate NS"]:::ovh
ZONE["cloudflare_zone<br>arcodange.fr"]:::cf
PAGES["cloudflare_pages_project<br>arcodange-cms (branch main)"]:::cf
PDOM["cloudflare_pages_domain<br>arcodange.fr + www"]:::cf
DNS["cloudflare_dns_record<br>@ + www CNAME (proxied)"]:::cf
TUN["module.cf_tunnel<br>Zero-Trust tunnel 'lab'"]:::mod
CAP["module.cf_captcha_for_crowdsec<br>Turnstile widget"]:::mod
ZOHO["module.zoho<br>mail records"]:::mod
VBACK["module.vault_backend<br>cms app role (cloudflared)"]:::vault
OVHDOM --> ZONE
ZONE -- "name_servers" --> OVHNS
ZONE --> PAGES --> PDOM
PAGES -- "subdomain target" --> DNS
ZONE --> DNS
ZONE --> TUN
ZONE --> ZOHO
OVHDOM --> CAP
```
1. **`ovh_domain_name "arcodange.fr"`** anchors the registration (imported into state, not created by OpenTofu).
2. **`cloudflare_zone`** creates the Cloudflare zone for that domain under the `arcodange@gmail.com` account.
3. **`ovh_domain_name_servers`** writes Cloudflare's assigned nameservers back at OVH, **delegating DNS to Cloudflare**.
4. **`cloudflare_pages_project "arcodange-cms"`** (production branch `main`) plus two **`cloudflare_pages_domain`** resources attach `arcodange.fr` and `www.arcodange.fr` to Pages.
5. **`cloudflare_dns_record`** publishes apex (`@`) and `www` as **proxied CNAMEs** pointing at the Pages project's `.pages.dev` subdomain.
6. The three **modules** (`cf_tunnel`, `cf_captcha_for_crowdsec`, `zoho`) and `vault_backend` hang off the same zone/domain/account.
### DNS & zone resources
| Resource | Name | Detail |
|---|---|---|
| `ovh_domain_name.arcodange_fr` | `arcodange.fr` | Registration; `# was terraform imported into state` |
| `cloudflare_zone.arcodange_fr` | `arcodange.fr` | Zone under account resolved from `arcodange@gmail.com` |
| `ovh_domain_name_servers.arcodange_fr` | — | Delegates NS to `cloudflare_zone…name_servers` (or `original_name_servers` when rolling back) |
| `terraform_data.arcodange_fr_initial_conf` | — | Snapshot of OVH's pre-Cloudflare config, kept for rollback inspection (`ignore_changes`) |
| `cloudflare_pages_project.arcodange_fr` | `arcodange-cms` | `production_branch = "main"` |
| `cloudflare_pages_domain.arcodange_fr` | `arcodange.fr` | Custom domain on Pages |
| `cloudflare_pages_domain.www_arcodange_fr` | `www.arcodange.fr` | Custom domain on Pages |
| `cloudflare_dns_record.root_cname` | `@` | CNAME → Pages `subdomain`, `proxied = true`, `ttl = 1` |
| `cloudflare_dns_record.www_cname` | `www` | CNAME → Pages `subdomain`, `proxied = true`, `ttl = 1` |
All wiring lives in [`iac.tf`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/cloudflare/iac.tf). The account id is resolved at plan time via `data.cloudflare_account` filtered on the `arcodange@gmail.com` account name.
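
As an illustration of the last two rows of the table, the apex record could look roughly like this. It is a sketch that assumes the resource addresses listed above; the exact expressions in `iac.tf` may differ.

```hcl
# Sketch: apex CNAME pointing at the Pages project's *.pages.dev subdomain.
resource "cloudflare_dns_record" "root_cname" {
  zone_id = cloudflare_zone.arcodange_fr.id
  name    = "@"
  type    = "CNAME"
  content = cloudflare_pages_project.arcodange_fr.subdomain
  proxied = true
  ttl     = 1 # 1 means "automatic" in the Cloudflare API
}
```

The `www` record would follow the same shape with `name = "www"`.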
## Module: `cloudflared_tunnel`
[`modules/cloudflared_tunnel/`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/cloudflare/modules/cloudflared_tunnel). A **Zero-Trust Cloudflared tunnel** that lets public hostnames reach in-cluster services **without opening any home-LAN port**: the `cloudflared` connector dials out from inside the cluster to Cloudflare's edge, so no inbound connection is needed. Instantiated as `module.cf_tunnel` with `tunnel_name = "lab"`.
| Resource | Role |
|---|---|
| `cloudflare_zero_trust_tunnel_cloudflared.tunnel` | The tunnel named **`lab`** under the account |
| `cloudflare_zero_trust_tunnel_cloudflared_config.tunnel_config` | Ingress rules from `hostname_mappings`, terminating in a catch-all `http_status:404` |
| `data.cloudflare_zone.arcodange` | Looks up the zone (created by the root module) |
| `cloudflare_zone_setting.setting` | Sets **`always_use_https = on`** |
| `cloudflare_dns_record.dns` | One **proxied CNAME** per mapping → `<tunnel_id>.cfargotunnel.com` |
The single ingress mapping passed from the root is:
| Hostname | Service |
|---|---|
| `*.arcodange.fr` | `http://traefik.kube-system.svc.cluster.local:80` |
So every wildcard subdomain under `arcodange.fr` lands on the cluster's **internal Traefik** (`origin_request.no_tls_verify = true`), which then routes to the right in-cluster app (e.g. the `cms` Nuxt pod or Grafana). This pairs with the apex/`www` Pages records above, which are *not* tunneled.
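
The ingress rules described above can be sketched as follows. This hard-codes the single mapping for readability, whereas the module builds the list from `hostname_mappings`; the `account_id` variable name is hypothetical.

```hcl
# Hypothetical input name, for illustration only.
variable "account_id" {
  type = string
}

# Sketch of the tunnel config: one wildcard rule, then the mandatory catch-all.
resource "cloudflare_zero_trust_tunnel_cloudflared_config" "tunnel_config" {
  account_id = var.account_id
  tunnel_id  = cloudflare_zero_trust_tunnel_cloudflared.tunnel.id # tunnel from the table above

  config = {
    ingress = [
      {
        hostname = "*.arcodange.fr"
        service  = "http://traefik.kube-system.svc.cluster.local:80"
        origin_request = {
          no_tls_verify = true
        }
      },
      {
        # Catch-all: the last rule carries no hostname and answers 404.
        service = "http_status:404"
      },
    ]
  }
}
```

Rules are matched in order, so the hostname-less catch-all has to stay last.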
> [!CAUTION]
> **The tunnel token is created by hand and rotation is not automated.** Cloudflare only issues a connector token from the web console, so it is **manually stored in Vault** under the KV-v2 mount `kvv2` at path `cms/cloudflared` (the in-repo `vault_kv_secret` resource is commented out for exactly this reason). The cluster-side `cloudflared` Deployment reads it via a `VaultStaticSecret` (Vault Secrets Operator), role `cms`, refreshed hourly. If the token is rotated in the console, the Vault entry must be updated **manually** — nothing in this IaC will do it. `module.vault_backend` provisions the `cms` Vault app role (service account `cloudflared`) that grants that read; see [secrets-and-vault](../lab-ecosystem/secrets-and-vault.md).
## Module: `cloudflared_captcha_for_crowdsec`
[`modules/cloudflared_captcha_for_crowdsec/`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/cloudflare/modules/cloudflared_captcha_for_crowdsec). Mints a **Cloudflare Turnstile widget** and stores its keys in Vault for the [tools CrowdSec bouncer](../tools/components.md) to serve as a CAPTCHA challenge on remediated requests.
| Resource | Detail |
|---|---|
| `cloudflare_turnstile_widget.turnstile` | `name = "crowdsec captcha"`, `mode = "invisible"`, `clearance_level = "interactive"`, `region = "world"`; `bot_fight_mode`/`ephemeral_id`/`offlabel` all `false` |
| `vault_kv_secret_v2.turnstile` | Writes `{ sitekey, secret }` to KV-v2 (`cas = 1`) |
Instantiated as `module.cf_captcha_for_crowdsec` with `domain_names = [arcodange.fr, arcodange.lab, arcodange.duckdns.org]` and `vault_path = "cms/factory/turnstile"`.
| What | Where |
|---|---|
| **Turnstile mode** | Invisible widget, interactive clearance — challenges only when CrowdSec flags a request |
| **Vault destination** | `kvv2/cms/factory/turnstile` → keys `sitekey` + `secret` |
| **Consumer** | The [CrowdSec Traefik bouncer in `tools`](../tools/components.md) reads sitekey + secret to render and verify the challenge |
This is the one knot that ties the **`cms`** edge to the **`tools`** security stack: `cms` produces the Turnstile keys; `tools` consumes them.
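
A minimal sketch of the two resources, assuming the module inputs are named `domain_names` and `vault_path` as in the instantiation above and that the widget exposes `sitekey` and `secret` attributes. The `account_id` variable name is hypothetical.

```hcl
# Declarations are illustrative; only domain_names and vault_path appear in the source.
variable "account_id" {
  type = string
}

variable "domain_names" {
  type = list(string)
}

variable "vault_path" {
  type = string # e.g. "cms/factory/turnstile"
}

resource "cloudflare_turnstile_widget" "turnstile" {
  account_id      = var.account_id
  name            = "crowdsec captcha"
  domains         = var.domain_names
  mode            = "invisible"
  clearance_level = "interactive"
  region          = "world"
  bot_fight_mode  = false
  ephemeral_id    = false
  offlabel        = false
}

# Hand the generated keys to Vault so the CrowdSec bouncer in `tools` can read them.
resource "vault_kv_secret_v2" "turnstile" {
  mount = "kvv2"
  name  = var.vault_path
  cas   = 1
  data_json = jsonencode({
    sitekey = cloudflare_turnstile_widget.turnstile.sitekey
    secret  = cloudflare_turnstile_widget.turnstile.secret
  })
}
```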
## Sibling module: Zoho mail
`module.zoho` (source `./zoho`) lives in **this same OpenTofu root** and writes mail records into the same `cloudflare_zone`. It is documented separately on [Zoho email](zoho-email.md). Note that a `cms/cloudflare` apply touches mail DNS too, so mail-record changes in its plan output are expected.
## CI — `cloudflare.yaml`
[`.gitea/workflows/cloudflare.yaml`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/.gitea/workflows/cloudflare.yaml). Manual-only (`workflow_dispatch`), same Gitea-OIDC→Vault→`tofu apply` shape as the [tofu CI flow concept](../factory-provisioning/opentofu/ci-apply-flow.md).
1. **`gitea_vault_auth`** — mints a Gitea OIDC id-token (decodes `vault_oauth__sh_b64` and runs it), exported as `gitea_vault_jwt`.
2. **`tofu`** — depends on the auth job; a shared `*vault_step` reads all secrets from Vault (role `gitea_cicd_cms`, mount `gitea_jwt`), prepares the homelab CA cert, then runs **`dflook/terraform-apply@v1`** on `path: cloudflare/` with **`auto_approve: true`** at **OpenTofu `1.8.2`**.
### Vault secrets read by the workflow
| Vault path | Mapped to | Used for |
|---|---|---|
| `kvv1/cloudflare/cms/cf_arcodange_cms_token` (`token`) | `CLOUDFLARE_API_TOKEN` | Cloudflare provider auth |
| `kvv1/cloudflare/r2/arcodange-tf` (`*`) | `TF_VAR_CLOUDFLARE_*` | R2/S3 state backend creds + endpoint |
| `kvv1/gitea/tofu_module_reader` (`ssh_private_key`) | `TERRAFORM_SSH_KEY` | SSH key to clone the `tools` git module (`vault_backend`) |
| `kvv1/ovh/cms/app` (`*`) | `OVH_*` | OVH provider auth |
| `kvv1/zoho/self_client` (`*`) | `ZOHO_*` **and** `TF_VAR_ZOHO_*` | Zoho API auth for `module.zoho` |
> [!CAUTION]
> **`auto_approve: true` applies without a human gate.** Any dispatch of this workflow on any ref runs `tofu apply` straight against the live `arcodange.fr` edge and Vault. There is no plan-review step; review happens in the PR before merge, not in the apply. Treat a dispatch as a production change.
## Gotchas
> [!CAUTION]
> **Cloudflared tunnel token — manual, unrotated.** Created in the Cloudflare console and hand-placed in Vault under `kvv2` at path `cms/cloudflared`. No IaC rotates it. (Repeated here because it is the most common surprise.)
> [!WARNING]
> **OVH → Cloudflare nameserver delegation is the live cutover.** `ovh_domain_name_servers` points OVH at Cloudflare's nameservers. The `use_ovh_initial_name_servers` variable (default `false`) is meant to flip delegation back to OVH's `original_name_servers`, but that **rollback path is untested** — `terraform_data.arcodange_fr_initial_conf` only *snapshots* the pre-Cloudflare config for inspection. Do not assume a clean revert.
> [!WARNING]
> **R2-backed state creds gate everything.** State lives on Cloudflare R2 and the access/secret keys are `TF_VAR_` inputs (from `kvv1/cloudflare/r2/arcodange-tf`). If those creds are missing or rotated out from under the workflow, even `tofu init` fails — there is no fallback backend.
## Cross-references
- [CMS](README.md) — the guidebook hub; the public-request + email flow diagram.
- [Site (Nuxt)](site.md) — the Nuxt app served by the Pages project and the in-cluster pod this tunnel fronts.
- [Zoho email](zoho-email.md) — `module.zoho` lives in this same OpenTofu root.
- [tools CrowdSec](../tools/components.md) — consumer of the Turnstile widget minted here.
- [tofu CI flow concept](../factory-provisioning/opentofu/ci-apply-flow.md) — the shared Gitea-OIDC→Vault→apply pattern.
- [secrets-and-vault concept](../lab-ecosystem/secrets-and-vault.md) — where the tunnel token, Turnstile keys, and provider creds live.
- [safe-env ADR](../../ADR/0001-safe-prod-like-environment.md) — why this Internet-facing surface is isolated from the safe prod-like environment.
- Code: [`cms/cloudflare/`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/cloudflare).

vibe/guidebooks/cms/site.md Normal file
@@ -0,0 +1,165 @@
[vibe](../../README.md) > [Guidebooks](../README.md) > [CMS](README.md) > **Site (Nuxt)**
# Site (Nuxt)
> **Status:** ✅ Active
> **Last Updated:** 2026-06-23
> **Upstream:** [CMS](README.md) · [lab-ecosystem 03 · cms](../lab-ecosystem/03-cms.md)
> **Downstream:** [Cloudflare](cloudflare.md)
> **Related:** [Zoho email](zoho-email.md) · [tools CrowdSec](../tools/components.md) · [secrets-and-vault concept](../lab-ecosystem/secrets-and-vault.md)
The public site face of the [`cms` repo](https://gitea.arcodange.lab/arcodange-org/cms): a **Nuxt 4** application built to **static HTML** and shipped two ways from one image — to **Cloudflare Pages** (the live public `arcodange.fr`) and into **k3s** via a Helm chart behind the Cloudflared tunnel. This page maps the Nuxt app, its Docker build, the Helm chart, and the Gitea Actions that drive both deploys.
## The Nuxt 4 application
Configured in [`nuxt.config.ts`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/nuxt.config.ts). It runs `ssr: true` for dev but is shipped via **`nuxt generate`** — a full static prerender — so production is plain HTML served by a static file server, no Node runtime.
| Concern | Setting | Notes |
|---|---|---|
| Rendering | `ssr: true`, shipped via `nuxt generate` | Static prerender to `.output/public`; Nitro `prerender.autoSubfolderIndex: false` |
| Site identity | `site.url: https://arcodange.fr`, `site.name: Arcodange`, `trailingSlash: true` | Drives canonical URLs, sitemap, robots via `@nuxtjs/seo` |
| Content | `@nuxt/content` collections | Markdown under [`content/`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/content); mermaid highlight enabled |
| Editing | **Nuxt Studio** at route **`/admin`** | `nuxt-studio` module; repo `arcodange-org/cms`, commits to `main` |
| Sitemap / robots | `@nuxtjs/sitemap` (`zeroRuntime: true`), `@nuxtjs/seo` | No runtime sitemap server — fully prerendered |
| Analytics | `@nuxtjs/plausible` | `apiHost: https://analytics.arcodange.fr`, `hashMode: true`, outbound tracking on, `localhost` ignored |
| i18n | `@nuxtjs/i18n` | Single locale **`fr`** (default `fr`); `htmlAttrs.lang: fr` |
| Images | `@nuxt/image` | `webp`/`jpeg`, quality 80 |
| Fonts | `@nuxt/fonts` | Local **Noto Emoji** preloaded |
| UI | `@nuxt/ui` | Plus `@nuxt/scripts`, `@nuxtjs/device`, `nuxt-booster`, `@compodium/nuxt` |
### Content collections
Declared in [`content.config.ts`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/content.config.ts). Every collection is wrapped with `asSeoCollection()` (from `@nuxtjs/seo`) and sourced from a folder of Markdown under [`content/`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/content).
| Collection | Source glob | Type | Schema extras |
|---|---|---|---|
| `parcours` | `parcours/*.md` | `page` | — |
| `site` | `site/*.md` | `page` | — |
| `tech` | `tech/*.md` | `page` | `date` (required), `image` (media), `featured` (default `false`) |
| `experiences` | `experiences/*.md` | `page` | `date`, `enddate`, `icon` (default `i-lucide-rocket`), `image`, `secondaryImage`, `descriptionHTML` |
A content build transformer `~~/content/transformers/description-md` runs at build time, and Markdown highlighting registers the `mermaid` language.
## Docker build: one image, two static trees
[`Dockerfile`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/Dockerfile) is a multi-stage build that produces **two** static outputs from the same source and packs them into a static web server image.
```mermaid
%%{init: {'theme': 'base'}}%%
flowchart LR
classDef base fill:#7c3aed,stroke:#6d28d9,color:#fff
classDef build fill:#059669,stroke:#047857,color:#fff
classDef out fill:#d97706,stroke:#b45309,color:#fff
DEPS["cms-deps:TAG<br>(Dockerfile.deps base)"]:::base
BUILD["build stage<br>npm ci"]:::build
PROD["nuxt generate<br>→ /app/prod"]:::out
STG["NUXT_SITE_ENV=staging<br>nuxt generate → /app/.output/public"]:::out
SWS["static-web-server:2<br>serves /public"]:::build
DEPS --> BUILD
BUILD --> PROD
BUILD --> STG
PROD --> SWS
STG --> SWS
```
1. The **build stage** starts `FROM gitea.arcodange.lab/arcodange-org/cms-deps:${BASE_IMAGE_TAG}`, a prebuilt base ([`Dockerfile.deps`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/Dockerfile.deps): `node:24-slim` plus `python3`/`make`/`g++`/`sqlite3`/`libvips`, needed to build native modules such as `better-sqlite3`), then copies the source and runs `npm ci`.
2. The **prod** build: `npx nuxt generate`, then the output is moved to **`/app/prod`**.
3. The **staging** build: `NUXT_SITE_ENV="staging" npx nuxt generate`, leaving its output at **`/app/.output/public`**.
4. The **server stage** is `FROM joseluisq/static-web-server:2`; it copies the staging tree to **`/public`** and the prod tree to **`/prod`**, plus [`webserver.config.toml`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/webserver.config.toml) as `/sws.toml`, and serves on port 80.
> [!NOTE]
> **`/public` is staging, `/prod` is production.** The static-web-server serves `root = "./public"` (the **staging** build) by default — that is what the in-cluster k3s deploy exposes (e.g. `cms-rec.arcodange.fr`). The **prod** build at `/prod` is the tree extracted and pushed to Cloudflare Pages by the `arcodange_fr` workflow. One image therefore carries both faces.
The final image is pushed to **`gitea.arcodange.lab/arcodange-org/cms`** (tags `latest` and the branch ref).
## Helm chart
[`chart/`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/chart) deploys the in-cluster face. The pod is just the static-web-server image above, fronted by Traefik with a CrowdSec middleware and reached either over the lab ingress (`www.arcodange.lab`) or through a sidecar Cloudflared tunnel (`cms-rec.arcodange.fr`).
| Key | Value | Source |
|---|---|---|
| Chart name / version | `arcodange-cms` / `0.1.0`, `appVersion: latest` | [`Chart.yaml`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/chart/Chart.yaml) |
| Image | `gitea.arcodange.lab/arcodange-org/cms:latest`, `pullPolicy: Always` | [`values.yaml`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/chart/values.yaml) |
| Replicas | `1` (autoscaling disabled) | `replicaCount: 1`, `autoscaling.enabled: false` |
| Service | `ClusterIP`, port **80** (named `http`) | `service.port: 80` |
| Probes | liveness + readiness `httpGet /` on `http` | — |
| ServiceAccount | created, name **`cms`**, automount on | `serviceAccount.name: cms` |
| Lab ingress | `www.arcodange.lab`, path `/` Prefix | Traefik `websecure`, TLS via `letsencrypt` resolver (`arcodange.lab` + SAN `www.arcodange.lab`) |
| Edge middleware | `kube-system-crowdsec@kubernetescrd` | applied on both ingresses |
| Tunnel ingress | `cms-rec.arcodange.fr`, Traefik `web` entrypoint | `ingress.cloudflared.host` |
| Cloudflared sidecar | enabled, `Deployment`, `1` replica, image `cloudflare/cloudflared:latest` | `cloudflared.*` |
| Tunnel token | Vault KV-v2 `kvv2` path `cms/cloudflared`, role `cms`, refresh `1h` | `cloudflared.vault.*` |
### Chart templates
The chart renders these objects (in [`chart/templates/`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/chart/templates)):
| Template | Renders |
|---|---|
| [`deployment.yaml`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/chart/templates/deployment.yaml) | the `cms` static-web-server pod, port `http`/80 |
| [`service.yaml`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/chart/templates/service.yaml) | ClusterIP service on 80 |
| [`ingress.yaml`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/chart/templates/ingress.yaml) | lab Traefik ingress for `www.arcodange.lab` + CrowdSec middleware |
| [`ingress_cloudflared.yaml`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/chart/templates/ingress_cloudflared.yaml) | `<fullname>-cloudflared` ingress for `cms-rec.arcodange.fr` (web entrypoint) |
| [`cloudflared_tunnel.yaml`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/chart/templates/cloudflared_tunnel.yaml) | `cloudflared` SA, `VaultAuth`, `VaultStaticSecret`, and the cloudflared `Deployment`/`DaemonSet` |
| [`serviceaccount.yaml`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/chart/templates/serviceaccount.yaml) | the `cms` ServiceAccount |
| [`ingress_gitea.yaml`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/chart/templates/ingress_gitea.yaml), [`hpa.yaml`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/chart/templates/hpa.yaml) | optional Gitea ingress; HPA (disabled) |
### Cloudflared tunnel template
The cloudflared sidecar pulls its tunnel token from Vault through the [VSO](../tools/components.md) operator, never from a static manifest:
1. A `ServiceAccount` **`cloudflared`** is created with a `VaultAuth` (Kubernetes auth, mount `kubernetes`, role from `cloudflared.vault.role` = `cms`, audience `vault`).
2. A **`VaultStaticSecret`** named `cloudflared-tunnel-token` reads **KV-v2** mount **`kvv2`** at path **`cms/cloudflared`** (refresh `1h`) and materialises a `cloudflared-tunnel-token` Secret.
3. The cloudflared `Deployment` (1 replica, pinned to a `control-plane` node via affinity) runs `cloudflared tunnel --no-autoupdate run --token $(TUNNEL_TOKEN) --no-tls-verify`, with `TUNNEL_TOKEN` injected from that Secret's `token` key.
This connects Cloudflare's edge to internal Traefik so `cms-rec.arcodange.fr` reaches the in-cluster `cms` service without opening any home-LAN port — the cluster side of the tunnel whose Cloudflare side lives in the [Cloudflare IaC](cloudflare.md).
## CI: building and deploying
Three Gitea Actions workflows under [`.gitea/workflows/`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/.gitea/workflows) cover the site. (A fourth, `cloudflare.yaml`, drives the OpenTofu configuration; see [Cloudflare](cloudflare.md).)
| Workflow | Triggers | What it does |
|---|---|---|
| [`docker-dependencies.yaml`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/.gitea/workflows/docker-dependencies.yaml) | `workflow_dispatch`; push to `main` touching `package.json`, `package-lock.json`, `Dockerfile.deps` | Builds the **deps** base image, pushes `gitea.arcodange.lab/<repo>-deps:{latest,YYYYMMDD-SHA8}`, then creates+pushes a **git tag `deps-YYYYMMDD-SHA8`** (with retry, up to 30 attempts) |
| [`docker-content.yaml`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/.gitea/workflows/docker-content.yaml) | `workflow_dispatch`; push to `main` touching `nuxt.config.ts`, `app/**`, `content.config.ts`, `content/**`, `public/**`, `package*.json`, `Dockerfile` | Finds the latest `deps-*` git tag, strips `deps-` to get `BASE_TAG`, builds the **full image** with `--build-arg BASE_IMAGE_TAG=$BASE_TAG`, pushes `gitea.arcodange.lab/<repo>:{latest,<ref>}` |
| [`arcodange_fr.yaml`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/.gitea/workflows/arcodange_fr.yaml) | `workflow_dispatch` (input `image_tag`, default `main`) | Pulls `cms:<image_tag>`, `docker create` + `docker cp` to extract **`/prod`** to `./public`, writes a minimal `wrangler.toml`, then **`wrangler pages deploy`** to project `arcodange-cms`, branch `main` |
> [!IMPORTANT]
> **The deps tag is the contract between the two Docker workflows.** `docker-dependencies` publishes both the `-deps` image and a matching **git tag** `deps-YYYYMMDD-SHA8`; `docker-content` discovers that tag (`git tag --list "deps-*" | sort -V | tail -n1`) to pin its `BASE_IMAGE_TAG`. Touch `package.json`/lockfile/`Dockerfile.deps` and the deps build must land first, or the content build pins a stale base.
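The tag discovery described above can be sketched as a small shell function. This is a minimal sketch, assuming GNU `sort -V` is available; the function name `resolve_base_tag` is hypothetical, and tag names are read from stdin so the logic can be exercised without a repository.

```shell
# Hypothetical helper (not from the repository): pick the newest deps-* tag
# from a list of tag names on stdin and strip the "deps-" prefix.
resolve_base_tag() {
  latest="$(grep '^deps-' | sort -V | tail -n1)"
  # No deps-* tag at all: fail rather than build against an empty base tag.
  [ -n "$latest" ] || return 1
  printf '%s\n' "${latest#deps-}"
}

# In a checkout this would typically be fed from git:
#   BASE_TAG="$(git tag --list 'deps-*' | resolve_base_tag)"
#   docker build --build-arg BASE_IMAGE_TAG="$BASE_TAG" .
```

One consequence of version-sorting `YYYYMMDD-SHA8` names: two tags cut on the same day are ordered by the SHA fragment rather than by commit time, so the "latest" tag on such a day is not guaranteed to be the most recent build.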
### From image to Cloudflare Pages
```mermaid
%%{init: {'theme': 'base'}}%%
flowchart LR
classDef ci fill:#059669,stroke:#047857,color:#fff
classDef reg fill:#7c3aed,stroke:#6d28d9,color:#fff
classDef edge fill:#d97706,stroke:#b45309,color:#fff
DEP["docker-dependencies<br>→ -deps image + git tag deps-*"]:::ci
CON["docker-content<br>pins BASE_IMAGE_TAG"]:::ci
REG["registry<br>gitea.arcodange.lab/arcodange-org/cms"]:::reg
FR["arcodange_fr<br>extract /prod"]:::ci
PAGES["Cloudflare Pages<br>project arcodange-cms"]:::edge
K3S["k3s Helm chart<br>serves /public (staging)"]:::edge
DEP --> CON --> REG
REG --> FR --> PAGES
REG --> K3S
```
1. **`docker-dependencies`** publishes the `-deps` base image and a `deps-YYYYMMDD-SHA8` git tag whenever dependencies change.
2. **`docker-content`** resolves that tag, builds the full dual-tree image, and pushes it to **`gitea.arcodange.lab/arcodange-org/cms`**.
3. **`arcodange_fr`** (manual) pulls that image, extracts the **`/prod`** tree, and deploys it to **Cloudflare Pages** project `arcodange-cms` on branch `main` — this is the live public `arcodange.fr`.
4. In parallel, the k3s **Helm chart** runs the same image and serves the **`/public`** (staging) tree behind Traefik + CrowdSec and the Cloudflared tunnel (`cms-rec.arcodange.fr`, `www.arcodange.lab`).
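The manual `arcodange_fr` step can be pictured as the command sequence below. It is a dry-run sketch that only prints the commands: the function name `pages_deploy_plan` is hypothetical, the real workflow may use different flags and credential handling, and the generated `wrangler.toml` is left out because its contents are not shown here.

```shell
# Hypothetical dry-run sketch of the arcodange_fr steps. It prints the
# commands instead of running them, so it needs neither docker nor wrangler.
pages_deploy_plan() {
  tag="${1:-main}"
  image="gitea.arcodange.lab/arcodange-org/cms:${tag}"
  printf '%s\n' \
    "docker pull ${image}" \
    "cid=\$(docker create ${image})" \
    "docker cp \"\$cid\":/prod ./public" \
    "docker rm \"\$cid\"" \
    "npx wrangler pages deploy ./public --project-name arcodange-cms --branch main"
}

pages_deploy_plan main
```

`docker create` never starts the container, which is why the prod tree can be copied out of an image whose default command would otherwise serve the staging tree.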
## Cross-references
- [CMS](README.md) — the guidebook hub: the two faces of the repo and the public request/email flow.
- [Cloudflare](cloudflare.md) — the Cloudflare side of the tunnel, the Pages project, and the zone the deploys publish into.
- [Zoho email](zoho-email.md) — mail for the same `arcodange.fr` domain.
- [tools CrowdSec](../tools/components.md) — the Traefik bouncer middleware fronting both chart ingresses.
- [secrets-and-vault concept](../lab-ecosystem/secrets-and-vault.md) — where the Cloudflared tunnel token (`kvv2` `cms/cloudflared`) and registry/CF credentials live.
- Repo: [arcodange-org/cms](https://gitea.arcodange.lab/arcodange-org/cms).


@@ -0,0 +1,116 @@
[vibe](../../README.md) > [Guidebooks](../README.md) > [CMS](README.md) > **Zoho email**
# Zoho email
> **Status:** ✅ Active
> **Last Updated:** 2026-06-23
> **Upstream:** [CMS](README.md) · [Cloudflare](cloudflare.md)
> **Downstream:** [secrets-and-vault concept](../lab-ecosystem/secrets-and-vault.md)
> **Related:** [lab-ecosystem 03 · cms](../lab-ecosystem/03-cms.md) · [tofu CI flow](../factory-provisioning/opentofu/ci-apply-flow.md) · [safe-env ADR](../../ADR/0001-safe-prod-like-environment.md) · [safe-env PRD](../../PRD/safe-prod-like-environment/README.md)
Email for **arcodange.fr** is hosted at **Zoho Mail (EU region)** and provisioned *entirely from OpenTofu*. There is no Zoho web-console click-ops in the steady state: the same `tofu apply` that owns the Cloudflare zone also drives the Zoho REST API to read the organization, publish the DNS records mail delivery depends on, and create one mailbox alias + one Inbox sub-folder per address. The module lives under [`cms/cloudflare/zoho/`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/cloudflare/zoho), a sub-module of the [Cloudflare](cloudflare.md) tofu root.
> [!CAUTION]
> **DNS/email changes here are high-stakes and slow to fail.** A wrong MX, SPF, DKIM, or DMARC record silently degrades or breaks `arcodange.fr` deliverability for **days** — receivers cache TTLs, reputation decays, and there is no synchronous error to catch in CI. DMARC is published as **`p=reject`**, so a broken SPF/DKIM alignment means conforming receivers *drop* legitimate mail outright rather than quarantine it. This is a prime motivation for the **safe environment**: changes to this module must be validated **plan-only against a throwaway/clone zone**, never iterated directly against the live `arcodange.fr` zone. See the [safe-env ADR](../../ADR/0001-safe-prod-like-environment.md) and the [safe-env PRD](../../PRD/safe-prod-like-environment/README.md).
## How the module is wired
The Cloudflare root ([`cloudflare/iac.tf`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/cloudflare/iac.tf)) instantiates `module "zoho"`, passing it the live zone and domain plus the OAuth client credentials:
| Input | Source | Purpose |
|---|---|---|
| `domain_name` | `ovh_domain_name.arcodange_fr.domain_name` | the domain to manage (`arcodange.fr`) |
| `dns_zone_id` | `cloudflare_zone.arcodange_fr.id` | Cloudflare zone the DNS records land in |
| `zoho_client_id` | `var.ZOHO_CLIENT_ID` (Vault `kvv1/zoho/self_client`) | OAuth2 self-client id |
| `zoho_client_secret` | `var.ZOHO_CLIENT_SECRET` (Vault `kvv1/zoho/self_client`) | OAuth2 self-client secret |
In CI the secrets are injected by the `vault-action` step in [`.gitea/workflows/cloudflare.yaml`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/.gitea/workflows/cloudflare.yaml), which maps the whole `kvv1/zoho/self_client` KV-v1 secret into **both** the shell env (`ZOHO_*`, consumed by the helper scripts) and the tofu vars (`TF_VAR_ZOHO_*`, consumed by `config.tf`):
```
kvv1/zoho/self_client * | ZOHO_ ;
kvv1/zoho/self_client * | TF_VAR_ZOHO_ ;
```
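To make the two wildcard lines concrete, the sketch below shows how a key from the secret would be named once a prefix is applied. It assumes the secret holds keys such as `client_id` and `client_secret` (illustrative names, inferred from the `ZOHO_CLIENT_ID` / `ZOHO_CLIENT_SECRET` variables) and that the single-`*` selector upper-cases key names; `wildcard_env_name` is a hypothetical helper, not part of the workflow.

```shell
# Hypothetical illustration of wildcard naming: prefix + upper-cased key.
# Assumption: the key names passed in below are illustrative only.
wildcard_env_name() {
  prefix="$1"
  key="$2"
  printf '%s%s\n' "$prefix" "$(printf '%s' "$key" | tr '[:lower:]' '[:upper:]')"
}

for key in client_id client_secret; do
  wildcard_env_name ZOHO_ "$key"         # consumed by the helper scripts
  wildcard_env_name TF_VAR_ZOHO_ "$key"  # consumed by tofu as var.ZOHO_*
done
```

Mapping the same secret twice avoids re-exporting values by hand: the scripts read `ZOHO_*` directly, while tofu picks up the `TF_VAR_`-prefixed copies as input variables.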
## OAuth2: client-credentials flow
Zoho is a self-client (machine-to-machine) integration on the **EU** datacenter — every host is `*.zoho.eu` / `accounts.zoho.eu`. Authentication uses the OAuth2 **`client_credentials`** grant; there is no interactive user consent in the running flow (a commented device-code flow remains in [`.env`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/cloudflare/.env) as historical bootstrap).
The token is minted in [`config.tf`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/cloudflare/zoho/config.tf) via a `data "http"` POST to `https://accounts.zoho.eu/oauth/v2/token` with `grant_type=client_credentials` and the comma-joined scope list. The bearer is then folded into an `Authorization: Zoho-oauthtoken <token>` header (`local.auth_headers`) reused by every subsequent read.
| Scope | Access | Why it is needed |
|---|---|---|
| `ZohoMail.partner.organization.READ` | READ (org) | resolve the org **ZOID** |
| `ZohoMail.organization.accounts.READ` | READ (accounts) | find the super-admin **account id / zuid** |
| `ZohoMail.organization.accounts.UPDATE` | UPDATE (accounts) | add / remove email aliases |
| `ZohoMail.organization.domains.READ` | READ (domains) | fetch the domain verification code + DKIM public key |
| `ZohoMail.folders.ALL` | ALL (folders) | list and create per-alias Inbox sub-folders |
Lookup chain (each step feeds the next):
1. `GET https://mail.zoho.eu/api/organization` → `local.org`, from which `zoid` builds `local.api_prefix = https://mail.zoho.eu/api/organization/<zoid>`.
2. `GET {api_prefix}/domains/{domain_name}` ([`dns.tf`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/cloudflare/zoho/dns.tf)) → `local.domain`, exposing `CNAMEVerificationCode` and `dkimDetailList[0].publicKey`.
3. `GET {api_prefix}/accounts` ([`email_aliases.tf`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/cloudflare/zoho/email_aliases.tf)) → the single `iamUserRole == "super_admin"` account, giving its `accountId` and `zuid`.
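The token request and the header reused by the lookup chain can be sketched in shell. This is an illustration rather than the module's code (which uses `data "http"` in tofu): the function names are hypothetical, and `request_token` is defined but not called because it needs network access and real credentials.

```shell
# Join scopes with commas, the form the token request carries them in.
join_scopes() {
  out=""
  for scope in "$@"; do
    out="${out:+$out,}$scope"
  done
  printf '%s\n' "$out"
}

# Build the header that every subsequent read reuses.
auth_header() {
  printf 'Authorization: Zoho-oauthtoken %s\n' "$1"
}

# Not invoked here: requires ZOHO_CLIENT_ID / ZOHO_CLIENT_SECRET and network.
request_token() {
  curl -fsS -X POST 'https://accounts.zoho.eu/oauth/v2/token' \
    --data-urlencode 'grant_type=client_credentials' \
    --data-urlencode "client_id=${ZOHO_CLIENT_ID}" \
    --data-urlencode "client_secret=${ZOHO_CLIENT_SECRET}" \
    --data-urlencode "scope=$(join_scopes "$@")"
}

# A first lookup would then look roughly like:
#   curl -fsS -H "$(auth_header "$token")" https://mail.zoho.eu/api/organization
```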
## DNS records published on the Cloudflare zone
[`modules/zoho_mail_dns`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/cloudflare/zoho/modules/zoho_mail_dns) materialises every `cloudflare_dns_record` Zoho mail needs onto the live zone. The DKIM key and verification code are read live from the Zoho domain API (step 2 above) and passed in as module inputs, so the records always track what Zoho actually expects. All records use **TTL 3600** and apply to the apex (`@`) unless noted.
| Name | Type | Value | Purpose |
|---|---|---|---|
| `@` | TXT | `"zoho-verification=<CNAMEVerificationCode>.zmverify.zoho.eu"` | proves domain ownership to Zoho |
| `@` | MX | `mx.zoho.eu` (priority **10**) | primary inbound mail exchanger |
| `@` | MX | `mx2.zoho.eu` (priority **20**) | secondary mail exchanger |
| `@` | MX | `mx3.zoho.eu` (priority **50**) | tertiary mail exchanger |
| `@` | TXT | `"v=spf1 include:zohomail.eu ~all"` | SPF: authorise Zoho to send for the domain |
| `zmail._domainkey` | TXT | `"<dkim_public_key>"` (from `dkimDetailList[0].publicKey`) | DKIM public key for outbound signing |
| `_dmarc` | TXT | `"v=DMARC1; p=reject; rua=mailto:arcodange@gmail.com; ruf=mailto:arcodange@gmail.com; sp=reject; adkim=r; aspf=r; pct=100"` | DMARC policy: **reject** non-aligned mail, 100% coverage, aggregate+forensic reports to `arcodange@gmail.com` |
| `default._bimi` | TXT | `"v=BIMI1; l=https://arcodange.fr/.well-known/logo.svg; avp=brand;"` | BIMI: display the brand logo beside authenticated mail (created only when `bimi_logo_url != null`) |
> [!WARNING]
> The DMARC policy is the strictest tier: `p=reject` **and** `sp=reject` (subdomains) with relaxed alignment (`adkim=r`, `aspf=r`) and `pct=100`. There is no `quarantine` grace band — any message that fails both SPF *and* DKIM alignment is rejected by conforming receivers. Validate SPF/DKIM correctness in the safe environment before touching the live `_dmarc` or apex records.
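When validating a candidate record in the safe environment, it can help to pull individual tags out of the DMARC string and assert on them. The helpers below are a hypothetical sketch (names are not from the repository) that works on the record text itself, for example the unquoted output of a TXT lookup on `_dmarc`.

```shell
# Print the value of one tag from a DMARC record; fail if the tag is absent.
dmarc_tag() {
  record="$1"
  tag="$2"
  printf '%s\n' "$record" | tr ';' '\n' |
    sed 's/^[[:space:]]*//; s/[[:space:]]*$//' |
    awk -F= -v t="$tag" '
      $1 == t { print substr($0, length(t) + 2); found = 1 }
      END { exit !found }
    '
}

# True only for the policy described above: reject for the domain and its
# subdomains, applied to all mail.
dmarc_rejects_everything() {
  [ "$(dmarc_tag "$1" p)" = "reject" ] &&
    [ "$(dmarc_tag "$1" sp)" = "reject" ] &&
    [ "$(dmarc_tag "$1" pct)" = "100" ]
}
```

Checking the parsed tags rather than comparing whole strings keeps the assertion stable if the report addresses change.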
## Email aliases
Seven addresses are defined as a single map in [`email_aliases.tf`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/cloudflare/zoho/email_aliases.tf) (`local.email_aliases`). Each is provisioned **twice** against the super-admin mailbox: as an **email alias** on the account, and as a matching **Inbox sub-folder** so mail to that address can be filtered into its own folder.
| Alias (`@arcodange.fr`) | Display name | Purpose |
|---|---|---|
| `bonjour` | `Service Bonjour` | commercial / sales |
| `bureaux` | `Bureaux Arcodange` | official bodies (URSSAF, administration) |
| `contact` | `Premier Contact` | website contact form |
| `helloworld` | `✅ Arcodange 🏹💻🪽` | social networks, newsletter |
| `analytics` | `Analytics 📊🔍` | social networks, newsletter |
| `books` | `Accounting 📒🧮` | accounting / bookkeeping |
| `abonnements` | `Abonnements 📱🤖` | subscriptions (phone, AI, services) |
Provisioning is *imperative-inside-declarative*: each alias is a `terraform_data` resource whose `triggers_replace` watches whether the alias/folder is already present, and whose `local-exec` provisioners shell out to [`zoho_api_call.sh`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/cloudflare/zoho/zoho_api_call.sh) on **create** and **destroy**:
1. **Alias create**: `PUT {api_prefix}/accounts/{zuid}` with `mode=addEmailAlias`, scope `ZohoMail.organization.accounts.UPDATE`; fails fast if the response contains `OPERATION_NOT_PERMITTED`.
2. **Alias destroy**: same endpoint with `mode=deleteEmailAlias` (the bare local-part, split off the `alias:display` key).
3. **Folder create**: `POST /api/accounts/{accountId}/folders` with `parentFolderId` = the resolved **Inbox** folder id, scope `ZohoMail.folders.ALL`.
4. **Folder destroy**: looks the folder id up by name, `DELETE`s it, then also sweeps the corresponding `/Trash/<name>` (or `/Trash/Inbox_<name>`) folder Zoho leaves behind.
> [!NOTE]
> `terraform_data` + `local-exec` is used because aliases and folders are Zoho-side mutations with no first-class Terraform provider. The `triggers_replace = { missing = !contains(...) }` guard makes the apply idempotent: the provisioner only re-runs when the alias/folder is genuinely absent, so a clean plan is a no-op rather than a re-create.
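The intent of the `missing` guard, creating only what is absent, can be shown with a plain shell analogue. This sketch models the idea only and does not reproduce how tofu evaluates `triggers_replace`; both function names are hypothetical.

```shell
# Returns 0 (true) when $1 is NOT among the remaining arguments.
is_missing() {
  wanted="$1"
  shift
  for present in "$@"; do
    [ "$present" = "$wanted" ] && return 1
  done
  return 0
}

# Decide whether an alias needs creating, given the aliases already present.
provision_if_missing() {
  alias_name="$1"
  shift
  if is_missing "$alias_name" "$@"; then
    echo "create $alias_name"
  else
    echo "skip $alias_name"
  fi
}

provision_if_missing contact bonjour bureaux   # prints: create contact
provision_if_missing contact bonjour contact   # prints: skip contact
```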
## Helper scripts
Both scripts live beside the tofu and are invoked from `local-exec`. They share the OAuth client env vars (`ZOHO_CLIENT_ID`, `ZOHO_CLIENT_SECRET`, `ZOHO_TOKEN_ENDPOINT`) injected from Vault.
| Script | Role |
|---|---|
| [`zoho_api_call.sh`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/cloudflare/zoho/zoho_api_call.sh) | Thin HTTP wrapper. Parses `--endpoint`, `-x=<METHOD>`, `--scope`, `--data_json` / `--data_url`, and `--fail_if_str_in_resp`; sources `zoho_gen_token.sh`, attaches the bearer header, `curl`s the call, fails if a sentinel string (e.g. `OPERATION_NOT_PERMITTED`) appears, and emits compact JSON via `jq`. |
| [`zoho_gen_token.sh`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/cloudflare/zoho/zoho_gen_token.sh) | OAuth token cache. `gen_zoho_token <scope>` returns a cached token from `/tmp/zoho_oauth_tokens.cache` when fresh, otherwise mints a new `client_credentials` token and stores it. |
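The sentinel check behind `--fail_if_str_in_resp` matters because an API can answer with a success status while reporting a refusal in the body. A minimal sketch of that check, assuming the response body arrives on stdin (the function below is illustrative, not the script's actual implementation):

```shell
# Read a response body on stdin; fail when the sentinel string is present,
# otherwise pass the body through unchanged.
fail_if_str_in_resp() {
  sentinel="$1"
  body="$(cat)"
  if printf '%s' "$body" | grep -Fq -- "$sentinel"; then
    echo "sentinel '$sentinel' found in response" >&2
    return 1
  fi
  printf '%s\n' "$body"
}
```

Using a fixed-string match (`grep -F`) keeps the sentinel from being interpreted as a regular expression.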
`zoho_gen_token.sh` is **lock-based and TTL-bounded**:
- A mutex is taken by `mkdir /tmp/zoho_oauth_tokens.lock` (atomic dir creation), with up to 10 one-second retries, so concurrent `local-exec` provisioners don't corrupt the cache. The lock is released on every function exit via `trap`.
- Tokens are keyed by scope in `/tmp/zoho_oauth_tokens.cache` (file mode `600`). A token is reused only while younger than **3600 s (1 h)**; `cleanup_cache` prunes expired entries on each call.
- The wrapper runs `cleanup_cache` before each request and re-traps it on `INT TERM EXIT`, so stale tokens never leak past their TTL.
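The locking and TTL behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not the contents of `zoho_gen_token.sh`: the paths, function names, and the `<scope> <epoch> <token>` line format are hypothetical, and the lock is released explicitly instead of through `trap`.

```shell
# Illustrative paths; the real script uses its own cache and lock locations.
CACHE_FILE="${CACHE_FILE:-/tmp/example_oauth_tokens.cache}"
LOCK_DIR="${LOCK_DIR:-/tmp/example_oauth_tokens.lock}"
TTL_SECONDS=3600

# mkdir is atomic: exactly one caller succeeds while the directory is absent.
acquire_lock() {
  tries=0
  until mkdir "$LOCK_DIR" 2>/dev/null; do
    tries=$((tries + 1))
    [ "$tries" -lt 10 ] || return 1   # give up after 10 attempts
    sleep 1
  done
}

release_lock() { rmdir "$LOCK_DIR" 2>/dev/null || true; }

# store_token <scope> <epoch-seconds> <token>
store_token() {
  acquire_lock || return 1
  ( umask 077; printf '%s %s %s\n' "$1" "$2" "$3" >> "$CACHE_FILE" )
  release_lock
}

# cached_token <scope> <now-epoch-seconds>: print a token younger than the TTL.
cached_token() {
  [ -f "$CACHE_FILE" ] || return 1
  awk -v s="$1" -v now="$2" -v ttl="$TTL_SECONDS" '
    $1 == s && (now - $2) < ttl { tok = $3 }
    END { if (tok == "") exit 1; print tok }
  ' "$CACHE_FILE"
}
```

Passing the current time in as an argument keeps the expiry logic easy to test; a caller would normally supply `$(date +%s)`.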
## Cross-references
- **Parent tofu / zone & Pages:** [Cloudflare](cloudflare.md) — owns `cloudflare_zone.arcodange_fr` that this module writes records into, and the `vault-action` CI step that supplies the credentials.
- **Where these secrets come from:** [secrets-and-vault concept](../lab-ecosystem/secrets-and-vault.md) (`kvv1/zoho/self_client`).
- **How apply runs:** [tofu CI flow](../factory-provisioning/opentofu/ci-apply-flow.md).
- **Why a safe environment exists:** [safe-env ADR](../../ADR/0001-safe-prod-like-environment.md) · [safe-env PRD](../../PRD/safe-prod-like-environment/README.md).


@@ -0,0 +1,87 @@
[vibe](../../README.md) > [Guidebooks](../README.md) > **ERP**
# ERP
> **Status:** ✅ Active
> **Last Updated:** 2026-06-23
> **Upstream:** [Applications hub](../applications/README.md) · [01 · factory](../lab-ecosystem/01-factory.md)
> **Downstream:** [Deployment](deployment.md) · [Backup & recovery](backup-and-recovery.md) · [Operations](operations.md)
> **Related:** [tools secrets-and-vso](../tools/secrets-and-vso.md) · [factory postgres-iac](../factory-provisioning/opentofu/postgres-iac.md) · [storage concept](../lab-ecosystem/storage-and-recovery.md) · [factory recover playbooks](../factory-provisioning/ansible/06-recover.md) · [safe-prod-like-environment ADR](../../ADR/0001-safe-prod-like-environment.md)
This guidebook maps **erp** — the lab's [Dolibarr **22.0.4**](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/Chart.yaml) accounting/business ERP and its **single most data-critical application**. It is a PHP/Apache workload built from the upstream `dolibarr/dolibarr` image, served internally at `erp.arcodange.lab` (Traefik `websecure` + `localIp@file` + a `letsencrypt`-resolver cert). Everything a reader needs to deploy it, keep its data safe, and operate it lives in the three child pages below; this page is the orientation map.
## What makes erp special
`erp` is the complex sibling of the [webapp / url-shortener archetypes](../applications/README.md). It carries the **same four-ingredient app pattern** (Dockerfile-less reuse of an upstream image, a `chart/`, an `iac/`, `.gitea/workflows`) but layers several things on top that the archetypes do not:
| Trait | erp specifics | Why it matters |
|---|---|---|
| **Upstream image** | `dolibarr/dolibarr:22.0.4` — not a repo-built image | No custom Dockerfile; the chart adapts the upstream container at runtime |
| **Postgres, not MySQL** | Dolibarr classically assumes MySQL; erp runs on **PostgreSQL** (`DOLI_DB_TYPE: pgsql`) | A [custom entrypoint](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/scripts/custom_entrypoint.sh) rewrites the upstream `docker-run.sh` `mysql` invocation into `psql` before launch |
| **DB path** | pod → `pgbouncer.tools:5432` → the `erp` Postgres database on `pi2` | Shares the [tools pgbouncer pooler](../tools/secrets-and-vso.md) like the webapp archetype |
| **Vault wiring** | **dynamic** rotating DB creds (`postgres/creds/erp`) + **static** KV config (`kvv2` `erp/config`) via the shared [`app_roles` module](../tools/secrets-and-vso.md) | The pod cannot start without VSO-injected `DOLI_DB_USER` / `DOLI_DB_PASSWORD` |
| **Document persistence** | a **50Gi Longhorn RWX PVC** (`storageClassName: longhorn`, `accessModes: ReadWriteMany`, `helm.sh/resource-policy: keep`) mounting `/var/www/documents`, `/var/www/html/custom`, and `/var/backups` | Uploaded invoices/PDFs/attachments are real business records — losing them is the worst case |
| **Backup + ops** | its own [backup/restore subsystem](backup-and-recovery.md) plus a **read-only ops CLI** (`bin/arcodange`) | Data-criticality demands both an escape hatch for restores and a safe way to inspect live state |
## Overview — how erp is wired
```mermaid
%%{init: {'theme': 'base'}}%%
flowchart LR
classDef src fill:#2563eb,stroke:#1e40af,color:#fff
classDef proc fill:#059669,stroke:#047857,color:#fff
classDef store fill:#7c3aed,stroke:#6d28d9,color:#fff
classDef net fill:#b45309,stroke:#92400e,color:#fff
CI["factory / erp CI<br>tofu apply (iac/)"]:::src
VAULT["Vault<br>postgres/creds/erp (dynamic)<br>kvv2 erp/config (static)"]:::store
ARGO["ArgoCD<br>syncs chart/ (ns erp)"]:::proc
POD["Dolibarr pod<br>dolibarr/dolibarr:22.0.4<br>custom entrypoint → psql"]:::proc
VSO["VSO<br>VaultAuth + VaultDynamicSecret<br>+ VaultStaticSecret"]:::proc
PGB["pgbouncer.tools:5432"]:::net
PG["Postgres erp db<br>(pi2)"]:::store
PVC["50Gi Longhorn RWX PVC<br> /var/www/documents"]:::store
BK["backup CronJob / runner<br>pg_dump → documents/admin/backup"]:::proc
CI --> VAULT
ARGO --> POD
VAULT -. "creds + config" .-> VSO
VSO -- "vso-db-credentials + secretkv" --> POD
POD --> PGB --> PG
PVC -- "mounts /var/www/documents" --- POD
BK -- "dumps DB + writes to" --> PVC
```
1. **factory / erp CI** runs `tofu apply` over `iac/` to declare erp's Vault objects — a Postgres dynamic-secret role and a Kubernetes auth role — through the shared [`app_roles` module](../tools/secrets-and-vso.md), and seeds the static `kvv2` `erp/config` KV (admin login, instance UUID).
2. **ArgoCD** (factory's [app-of-apps](../lab-ecosystem/01-factory.md)) syncs the `chart/` into the `erp` namespace.
3. The **Dolibarr pod** comes up from `dolibarr/dolibarr:22.0.4`; its [custom entrypoint](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/scripts/custom_entrypoint.sh) rewrites the upstream `docker-run.sh` so SQL runs through `psql` instead of `mysql`.
4. **VSO** authenticates to **Vault** (the `auth` `VaultAuth` CRD), materialising `vso-db-credentials` (dynamic, rotating DB user/password from `postgres/creds/erp`) and `secretkv` (static config from `kvv2` `erp/config`); both are injected into the pod, and a credential rotation triggers a rollout restart.
5. The pod connects to the **`erp` Postgres database** through the [tools `pgbouncer.tools:5432` pooler](../tools/secrets-and-vso.md).
6. A **50Gi Longhorn RWX PVC** mounts `/var/www/documents` (plus `/var/www/html/custom` and `/var/backups`), holding every uploaded document and generated PDF.
7. The **backup subsystem** dumps the `erp` database with a version-matched `pg_dump` and lands the archive under `documents/admin/backup` on that same PVC — see [Backup & recovery](backup-and-recovery.md).
> [!CAUTION]
> **Recovery ordering: Vault MUST be unsealed before erp is scaled up.** The Dolibarr pod has no usable DB credentials of its own — it depends entirely on VSO materialising `vso-db-credentials` from `postgres/creds/erp`. If erp is scaled up while Vault is still sealed, the pod crash-loops with no database access. During a cluster rebuild, unseal Vault first, confirm VSO has reconciled the erp secrets, and only then scale erp. The full sequence (cluster bring-up → Vault unseal → storage → apps) is covered by [Backup & recovery](backup-and-recovery.md), the [storage concept](../lab-ecosystem/storage-and-recovery.md), the [factory recover playbooks](../factory-provisioning/ansible/06-recover.md), and the cluster-wide CLUSTER_RECOVERY.md runbook.
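The ordering rule above can be expressed as a gate that refuses to scale erp until Vault reports unsealed and the VSO secret exists. This is a hedged sketch: `scale_erp_when_ready` is hypothetical, the deployment name `erp` is an assumption to check against the chart, and the function is defined but not called because it needs `vault` and `kubectl` access.

```shell
# Succeeds when the JSON on stdin (the shape printed by
# `vault status -format=json`) reports "sealed": false.
vault_is_unsealed() {
  grep -Eq '"sealed"[[:space:]]*:[[:space:]]*false'
}

# Not invoked here. Assumption: the workload is a Deployment named "erp".
scale_erp_when_ready() {
  vault status -format=json | vault_is_unsealed || {
    echo "Vault is sealed; not scaling erp" >&2
    return 1
  }
  kubectl -n erp get secret vso-db-credentials >/dev/null || {
    echo "vso-db-credentials not present yet" >&2
    return 1
  }
  kubectl -n erp scale deployment/erp --replicas=1
}
```

The existence of the secret is only a rough proxy: a stale secret can survive a rebuild, so confirming that VSO has actually reconciled it is still worth doing by hand.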
## Index
| Page | What it covers | Status |
|---|---|---|
| [Deployment](deployment.md) | The chart, the upstream image + custom entrypoint, the Postgres-over-pgbouncer wiring, the Vault CRDs (dynamic creds + static config), and the ingress | ✅ Active |
| [Backup & recovery](backup-and-recovery.md) | The document PVC, the `pg_dump`-based backup subsystem, restore procedure, and where erp sits in cluster-recovery ordering | ✅ Active |
| [Operations](operations.md) | The read-only `bin/arcodange` ops CLI and day-to-day operational tasks (table-ownership fix-ups, liveness checks, audits) | ✅ Active |
## Maintenance rule
> [!IMPORTANT]
> **When the erp repo changes shape, these pages change in the same PR.** If you alter the chart structure, the custom entrypoint, the Vault wiring, the document PVC, the backup subsystem, or the ops CLI, update this hub and the relevant child page in the same change. A reference map that drifts from the real `chart/`, `iac/`, and `backup/` sends agents confidently down dead paths, and for the lab's most data-critical app that risk is at its highest.
## Cross-references
- [Applications hub](../applications/README.md) — the common four-ingredient app pattern; erp is its complex sibling, beside the webapp and url-shortener archetypes.
- [01 · factory](../lab-ecosystem/01-factory.md) — the ArgoCD app-of-apps that emits erp's `Application` CRD.
- [tools secrets-and-vso](../tools/secrets-and-vso.md) — the `app_roles` module + VSO runtime that delivers erp's dynamic DB creds and static config, and the pgbouncer pooler the pod connects through.
- [factory postgres-iac](../factory-provisioning/opentofu/postgres-iac.md) — the per-app `erp` PostgreSQL database + role erp runs on.
- [storage concept](../lab-ecosystem/storage-and-recovery.md) — how the 50Gi Longhorn RWX document PVC is provisioned and recovered.
- [factory recover playbooks](../factory-provisioning/ansible/06-recover.md) — the Ansible recovery steps that must precede scaling erp back up.
- [safe-prod-like-environment ADR](../../ADR/0001-safe-prod-like-environment.md) — why the lab keeps erp deployed prod-like and the data-criticality trade-offs behind it.


@@ -0,0 +1,207 @@
[vibe](../../README.md) > [Guidebooks](../README.md) > [ERP](README.md) > **Backup & recovery**
# Backup & recovery
> **Status:** ✅ Active
> **Last Updated:** 2026-06-23
> **Upstream:** [ERP](README.md) · [Deployment](deployment.md)
> **Downstream:** [Operations](operations.md)
> **Related:** [storage concept](../lab-ecosystem/storage-and-recovery.md) · [factory recover playbooks](../factory-provisioning/ansible/06-recover.md) · [tools secrets-and-vso](../tools/secrets-and-vso.md) · [factory postgres-iac](../factory-provisioning/opentofu/postgres-iac.md)
`erp` is the lab's **single most data-critical application**, so it carries its own backup/restore subsystem layered on top of the cluster's storage and secrets machinery. Two independent data stores have to survive an incident: the **`erp` PostgreSQL database** (captured by a `pg_dump`) and the **uploaded documents** on the Longhorn PVC (captured by Longhorn snapshots/backups, *not* the `pg_dump`). This page covers both, the daily backup CronJob, the restore Job, and the load-bearing recovery ordering that keeps erp from crash-looping during a cluster rebuild.
## Backup mechanism
The recurring backup is an Ansible-deployed Kubernetes **CronJob** named `dolibarr-backup` in namespace `erp`, declared by [`ansible/arcodange/erp/playbooks/recurrentBackup.yml`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/ansible/arcodange/erp/playbooks/recurrentBackup.yml). Each scheduled tick spawns a one-shot `postgres:16.3` Job that takes a **logical** dump of the `erp` database, gzips it, and lands the archive on the same Longhorn PVC that holds erp's documents.
The pipeline inside each run:
1. **Detect version**: `psql ... -c "SELECT value FROM llx_const WHERE name='MAIN_VERSION_LAST_UPGRADE';"` reads the live Dolibarr version straight from the database, so the archive name records exactly which schema it came from.
2. **Dump + compress**: `pg_dump -d erp --no-tablespaces --inserts | gzip > <archive>`. The `--inserts` flag emits row-by-row `INSERT` statements (portable, version-tolerant restores) and `--no-tablespaces` strips host-specific tablespace clauses.
3. **Write to PVC**: the archive lands at `/documents/admin/backup/pg_dump_erp_<version>_<timestamp>.sql.gz`, where the container mounts the `erp` PVC with `subPath: documents/admin/backup`.
4. **Prune**: `find /documents/admin/backup -name "pg_dump_erp_*.sql.gz" -type f -mtime +15 -delete` removes anything older than 15 days.
DB credentials are supplied by the **VSO-materialised `vso-db-credentials` secret** (`envFrom` + `PGPASSWORD` from its `password` key), the same dynamic `postgres/creds/erp` secret the pod uses — see [tools secrets-and-vso](../tools/secrets-and-vso.md). The Job runs `backoffLimit: 0` with `restartPolicy: Never`, so a failed run leaves an inspectable terminated pod rather than retrying blindly.
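The four pipeline steps can be put together as the sketch below. It is an approximation, not the playbook's script: the function names are hypothetical, the timestamp format is an assumption, connection settings are expected to come from the `PG*` environment variables, and `run_backup` is defined but not called because it needs a reachable database.

```shell
# Build the archive file name from the detected version and a timestamp.
archive_name() {
  printf 'pg_dump_erp_%s_%s.sql.gz\n' "$1" "$2"
}

# Delete archives older than 15 days from the given directory.
prune_old_archives() {
  find "$1" -name 'pg_dump_erp_*.sql.gz' -type f -mtime +15 -delete
}

# Not invoked here. Assumption: the timestamp format below is illustrative.
run_backup() {
  dir=/documents/admin/backup
  version="$(psql -d erp -Atc "SELECT value FROM llx_const WHERE name='MAIN_VERSION_LAST_UPGRADE';")"
  stamp="$(date +%Y%m%d_%H%M%S)"
  pg_dump -d erp --no-tablespaces --inserts | gzip > "$dir/$(archive_name "$version" "$stamp")"
  prune_old_archives "$dir"
}
```

Pruning after the dump, rather than before, means a run that fails early does not also shrink the set of existing archives.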
### Schedule, retention & artifacts
| Property | Value | Source |
|---|---|---|
| Resource | CronJob `dolibarr-backup` (ns `erp`) | [recurrentBackup.yml](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/ansible/arcodange/erp/playbooks/recurrentBackup.yml) |
| Schedule | `0 4 * * *` (04:00 daily) | `spec.schedule` |
| Successful job history | `successfulJobsHistoryLimit: 3` | `spec` |
| Failed job history | `failedJobsHistoryLimit: 3` | `spec` |
| Retention | 15 days (`find -mtime +15 -delete`) | dump script |
| Dump image | `postgres:16.3` | `jobTemplate` container |
| Dump command | `pg_dump --no-tablespaces --inserts` (logical) | dump script |
| Compression | `gzip` (CronJob) / `tar -czf` (ad-hoc) | dump scripts |
| Archive path | `/documents/admin/backup/pg_dump_erp_<version>_<timestamp>.sql.gz` | mount + dump script |
| Mount | PVC `erp`, `subPath: documents/admin/backup` | `volumeMounts` |
| Failure policy | `backoffLimit: 0`, `restartPolicy: Never` | `jobTemplate` |
### Ad-hoc & manual alternatives
Two escape hatches exist for an on-demand dump outside the 04:00 schedule:
| Tool | What it does | When to reach for it |
|---|---|---|
| [`ansible/.../playbooks/backup.yml`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/ansible/arcodange/erp/playbooks/backup.yml) | One-shot Ansible **Job** `dolibarr-backup` (`postgres:16.3`); fetches the ERP version by scraping `https://erp.arcodange.lab/`, dumps with the same `--no-tablespaces --inserts` flags, `tar -czf` into the PVC, and waits for completion | A single immediate dump driven from a control host that can reach the cluster API |
| [`backup/create_backup.sh`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/backup/create_backup.sh) | Pure `kubectl` shell: `kubectl run pg-dump-temp` (`postgres:16.3`) to `pg_dump -c` locally, then `kubectl cp` the archive into the running erp pod's `/var/www/documents/admin/backup/` | A laptop with `kubectl` access but no Ansible setup |
> [!WARNING]
> **`pg_dump` and the server must match major versions.** The lab's Postgres is **16.3**, so every dump/restore container pins `postgres:16.3`. Dolibarr's built-in *Tools → Database backup* page (`/admin/tools/dolibarr_export.php`) historically shells out to the image's bundled `pg_dump` (e.g. 11.x), which aborts with `server version mismatch`. Use the CronJob, the Ansible playbooks, or `create_backup.sh` — never the in-app export against a newer server.
## What is — and is NOT — in the dump
The `pg_dump` captures **only the relational database**. Everything users *upload* lives on the Longhorn PVC and is protected by a completely separate mechanism. Conflating the two is the classic way to lose business records.
| Data | Where it lives | Protected by |
|---|---|---|
| Invoices, third parties, accounting rows, config rows (`llx_*` tables) | `erp` Postgres DB | `pg_dump` archives (this page) |
| Uploaded documents, generated PDFs, attachments | `/var/www/documents` on the Longhorn RWX PVC | **Longhorn** snapshots / backups |
| Custom modules / overrides | `/var/www/html/custom` on the same PVC | **Longhorn** snapshots / backups |
> [!IMPORTANT]
> A `pg_dump` alone does **not** make erp recoverable. A full recovery needs *both* the latest `pg_dump_erp_*.sql.gz` **and** the Longhorn-restored document volume. The backup archive itself sits on that same PVC (`/documents/admin/backup`), so it rides along with the Longhorn snapshot — but treat the database and the documents as two artifacts that must be restored together. See the [storage concept](../lab-ecosystem/storage-and-recovery.md) for how the Longhorn volume is snapshotted and recovered.
## Restore
The restore is the Ansible-driven Job `dolibarr-restore` from [`ansible/.../playbooks/restore.yml`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/ansible/arcodange/erp/playbooks/restore.yml) (`postgres:16.3`). It **auto-discovers the most recent** `pg_dump_erp_*.sql.gz` via `ls -t ... | head -n1`, or you can pin a specific archive with `-e backup_file=...`. It runs `backoffLimit: 0` / `restartPolicy: Never` so a failed restore leaves a terminated pod you can inspect. **Restoring into a live database corrupts state** — scale erp to zero first.
Ordered procedure:
1. **[AGENT]** Confirm the cluster is healthy and inspect available archives (read-only).
2. **[HUMAN]** Scale the `erp` Deployment to **0** to stop all writes.
3. **[HUMAN]** Run the restore Job (latest archive, or a pinned `backup_file`); it `tar -xzf`s the archive and `psql -f`s it into the `erp` DB.
4. **[AGENT]** Watch the Job to completion and read its logs.
5. **[HUMAN]** Scale the `erp` Deployment back to **1**.
6. **[AGENT]** Validate erp is serving and the data is present.
```bash
# [AGENT] read-only: cluster health + list backup archives on the PVC
kubectl get deploy,pods -n erp
kubectl exec -n erp deploy/erp -- ls -t /var/www/documents/admin/backup/pg_dump_erp_*.sql.gz
```
```bash
# [HUMAN] prod-mutating: stop writes before restoring
kubectl scale deploy/erp -n erp --replicas=0
kubectl rollout status deploy/erp -n erp --watch=false # expect 0 replicas
```
```bash
# [HUMAN] prod-mutating: run the restore Job
# default = newest pg_dump_erp_*.sql.gz auto-discovered on the PVC
ansible-playbook ansible/arcodange/erp/playbooks/restore.yml
# or pin an explicit archive:
ansible-playbook ansible/arcodange/erp/playbooks/restore.yml \
-e backup_file=/documents/admin/backup/pg_dump_erp_22.0.4_2606231819.sql.gz
```
```bash
# [AGENT] read-only: follow the restore Job and its logs
kubectl get job/dolibarr-restore -n erp -o wide
kubectl logs -n erp job/dolibarr-restore
```
```bash
# [HUMAN] prod-mutating: bring erp back up
kubectl scale deploy/erp -n erp --replicas=1
kubectl rollout status deploy/erp -n erp
```
```bash
# [AGENT] read-only: validate erp is serving after restore
kubectl get pods -n erp
kubectl exec -n erp deploy/erp -- curl -sf -o /dev/null -w '%{http_code}\n' http://localhost/
```
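The archive selection used by the restore (newest file unless one is pinned) can be sketched as follows. The function names are hypothetical and `restore.yml` may implement the same idea differently; the sketch only illustrates the `ls -t ... | head -n1` discovery with a `backup_file`-style override.

```bash
# Sketch only: function names are hypothetical; restore.yml may implement this differently.

# Print the most recently modified archive in a directory (empty output if none exist).
latest_archive() {
  ls -t "${1:-/documents/admin/backup}"/pg_dump_erp_*.sql.gz 2>/dev/null | head -n1
}

# Prefer a pinned archive (the playbook's backup_file), else fall back to the newest one.
resolve_archive() {
  local pinned="${1:-}" dir="${2:-/documents/admin/backup}"
  if [ -n "$pinned" ]; then
    printf '%s\n' "$pinned"
  else
    latest_archive "$dir"
  fi
}
```

Selection is by modification time, not by the timestamp embedded in the file name, so an old archive that was copied back onto the PVC recently would win; pinning the archive explicitly avoids that ambiguity.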
> [!WARNING]
> **Always scale erp to 0 before restoring.** The restore loads SQL straight into the live `erp` database; concurrent writes from a running Dolibarr pod produce a half-restored, inconsistent state. Scaling back to 1 only after the Job succeeds is part of the procedure, not an optional flourish.
## Recovery ordering (cluster rebuild)
> [!CAUTION]
> **Vault MUST be unsealed before erp is scaled up.** The Dolibarr pod has no DB credentials of its own — it depends entirely on VSO materialising `vso-db-credentials` from `postgres/creds/erp` (`DOLI_DB_USER` / `DOLI_DB_PASSWORD`). If erp is scaled up while Vault is still sealed, VSO cannot reconcile the secret and the pod crash-loops with no database access. During a cluster rebuild the order is fixed:
>
> 1. **Recover Longhorn volumes** — bring the document PVC (and the `documents/admin/backup` archives riding on it) back online.
> 2. **Unseal Vault** — so VSO can issue erp's dynamic DB credentials and static config.
> 3. **Scale erp to 1** — only now does the pod come up with usable creds.
> 4. **(Optional) restore data** — if the DB needs rolling back to a `pg_dump`, scale to 0, run the restore Job, scale back to 1 (see [Restore](#restore) above).
>
> This sequence is the storage→secrets→apps backbone described in the [storage concept](../lab-ecosystem/storage-and-recovery.md) and executed by the [factory recover playbooks](../factory-provisioning/ansible/06-recover.md); the cluster-wide ordering lives in the CLUSTER_RECOVERY.md runbook.
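
The "unseal before scale-up" rule can be enforced with a small preflight guard. This is a hedged sketch: the helper names are hypothetical, and it assumes the `vault` and `kubectl` CLIs are installed with `VAULT_ADDR` pointing at the lab Vault (`https://vault.arcodange.lab`).

```bash
# Sketch only: helper names are hypothetical.

# Reads `vault status -format=json` output on stdin; succeeds only if "sealed" is false.
vault_unsealed() {
  grep -Eq '"sealed"[[:space:]]*:[[:space:]]*false'
}

# [HUMAN] prod-mutating: scale erp up only when Vault reports unsealed
scale_up_erp() {
  if vault status -format=json | vault_unsealed; then
    kubectl scale deploy/erp -n erp --replicas=1
  else
    echo "Vault is sealed or unreachable; not scaling erp" >&2
    return 1
  fi
}
```

`vault status` exits non-zero when Vault is sealed but still prints its JSON status, which is why the guard inspects the `sealed` field rather than relying on the exit code alone.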
## The ownership fix
After activating a new Dolibarr module — or whenever the dynamic DB user rotates and creates tables under a fresh role — `public` schema tables can end up owned by a credential that no longer exists, breaking subsequent migrations and dumps. Two SQL helpers reassign ownership back to the stable **`erp_role`**:
| Script | Mechanism | Use |
|---|---|---|
| [`backup/erp_role_as_table_owner.sql`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/backup/erp_role_as_table_owner.sql) | Loops every `public` table and `ALTER TABLE ... OWNER TO erp_role` | Force-set ownership table-by-table |
| [`chart/scripts/update_ownership.sql`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/scripts/update_ownership.sql) | Detects the current schema owner and `REASSIGN OWNED BY <owner> TO erp_role` only when it differs | The idempotent, chart-shipped fix |
The chart wires `update_ownership.sql` into a `pg-fix-table-ownership` CronJob; trigger it on demand after a module activation:
```bash
# [HUMAN] prod-mutating: reassign public-schema table ownership to erp_role
kubectl create job \
--from=cronjob/pg-fix-table-ownership \
pg-fix-table-ownership-manual-trigger-$(date +%Y%m%d%H%M%S) \
-n kube-system
```
Run this **before** a backup if you suspect ownership drift, so the dump records the correct owner. More on day-to-day fix-ups and audits in [Operations](operations.md).
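For orientation, the table-by-table mechanism described for `erp_role_as_table_owner.sql` could look roughly like the SQL below, wrapped in a shell helper so it can be piped to `psql`. This approximates the described loop and is not copied from the repository script; `ownership_sql` is a hypothetical name.

```bash
# Sketch only: ownership_sql is a hypothetical helper; the SQL approximates the
# table-by-table loop described above and is not the repository's actual script.
ownership_sql() {
  cat <<'SQL'
DO $$
DECLARE
  t record;
BEGIN
  -- Visit every table in the public schema and hand it to the stable role.
  FOR t IN SELECT tablename FROM pg_tables WHERE schemaname = 'public' LOOP
    EXECUTE format('ALTER TABLE public.%I OWNER TO erp_role', t.tablename);
  END LOOP;
END
$$;
SQL
}

# [HUMAN] prod-mutating: apply it with a role allowed to change ownership, for example:
#   ownership_sql | psql -d erp -v ON_ERROR_STOP=1
```

The quoted heredoc delimiter keeps the shell from expanding `$$`, and `format(... %I ...)` quotes each table name as an identifier so unusual names do not break the statement.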
## Flow
```mermaid
%%{init: {'theme': 'base'}}%%
flowchart LR
classDef sched fill:#2563eb,stroke:#1e40af,color:#fff
classDef proc fill:#059669,stroke:#047857,color:#fff
classDef store fill:#7c3aed,stroke:#6d28d9,color:#fff
classDef db fill:#b45309,stroke:#92400e,color:#fff
VSO["VSO secret<br>vso-db-credentials<br>(postgres/creds/erp)"]:::store
CRON["CronJob dolibarr-backup<br>schedule 0 4 * * *"]:::sched
DUMPJOB["pg_dump Job<br>postgres:16.3"]:::proc
GZIP["gzip stream<br>--inserts --no-tablespaces"]:::proc
PVC["Longhorn RWX PVC<br> /documents/admin/backup<br>pg_dump_erp_*.sql.gz"]:::store
RESTOREJOB["Restore Job dolibarr-restore<br>postgres:16.3"]:::proc
PSQL["psql -f dump.sql"]:::proc
DB["erp Postgres DB<br>via pgbouncer.tools"]:::db
CRON -- "spawns" --> DUMPJOB
DUMPJOB -- "pg_dump" --> GZIP
GZIP -- "writes archive" --> PVC
PVC -- "ls -t latest .sql.gz" --> RESTOREJOB
RESTOREJOB -- "tar -xzf then" --> PSQL
PSQL -- "loads into" --> DB
DUMPJOB -- "dumps from" --> DB
VSO -. "DB creds" .-> DUMPJOB
VSO -. "DB creds" .-> RESTOREJOB
```
1. The **CronJob `dolibarr-backup`** fires at `0 4 * * *` and **spawns** a `pg_dump` Job (`postgres:16.3`).
2. The Job **dumps** the live `erp` database (logical, `--inserts --no-tablespaces`) — reading credentials from the **VSO `vso-db-credentials`** secret.
3. The dump streams through **gzip** and the resulting `pg_dump_erp_<version>_<timestamp>.sql.gz` is **written** to `/documents/admin/backup` on the **Longhorn RWX PVC**; archives older than 15 days are pruned.
4. On restore, the **`dolibarr-restore` Job** picks the **newest** `.sql.gz` (`ls -t | head -n1`, or a pinned `backup_file`) from the PVC — also using the **VSO** credentials.
5. The restore Job **`tar -xzf`s** the archive and **`psql -f`s** it back **into** the `erp` database (with erp scaled to 0 first).
## Gotchas
> [!WARNING]
> - **15-day retention only.** The CronJob deletes any `pg_dump_erp_*.sql.gz` older than 15 days. If you need long-term or compliance copies, pull archives **off-cluster** before they age out — nothing here keeps a month-old dump.
> - **Version match is mandatory.** `pg_dump`/`psql` major version must equal the server's (16.x). Every Job pins `postgres:16.3`; the in-app Dolibarr export against the newer server aborts with `server version mismatch`.
> - **Scale to 0 before restore.** Restoring into a running erp produces an inconsistent database; scale the Deployment to 0, restore, then back to 1.
> - **Vault unseal precedes scale-up.** erp's DB creds come from VSO; a sealed Vault means a crash-looping pod. Follow the [recovery ordering](#recovery-ordering-cluster-rebuild) on any rebuild.
> - **The admin/Postgres password lives in OpenTofu state.** The per-app database and role are declared in IaC, so the authoritative credential material is held in the **TF state** — treat that state as a secret and recover it alongside Vault. See [factory postgres-iac](../factory-provisioning/opentofu/postgres-iac.md).
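
Because nothing on the cluster keeps a dump beyond 15 days, a long-term copy has to be pulled off-cluster by hand. One possible way to do that, assuming `kubectl` access and using a hypothetical `pull_archive` helper, is to stream the file out of the running erp pod and verify it locally:

```bash
# Sketch only: pull_archive and its arguments are hypothetical.
# [AGENT] read-only on the cluster: stream one archive to a local directory.
pull_archive() {
  local remote="$1" dest_dir="$2" out
  out="$dest_dir/$(basename "$remote")"
  kubectl exec -n erp deploy/erp -- cat "$remote" > "$out" || return 1
  # Reject truncated or corrupt copies before trusting them as long-term backups.
  gzip -t "$out"
}

# Example (paths as seen from the erp pod):
#   pull_archive /var/www/documents/admin/backup/pg_dump_erp_22.0.4_2606231819.sql.gz ~/erp-archives
```

The `gzip -t` check matters because a dropped connection during `kubectl exec` can leave a partial file that looks like a valid archive by name only.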
## Cross-references
- [Deployment](deployment.md) — the chart, the document PVC, and the Vault CRDs (dynamic creds + static config) that this subsystem depends on.
- [Operations](operations.md) — day-to-day operational tasks including the table-ownership fix-ups and liveness checks.
- [storage concept](../lab-ecosystem/storage-and-recovery.md) — how the Longhorn document PVC (and its riding backup archives) is snapshotted and recovered.
- [factory recover playbooks](../factory-provisioning/ansible/06-recover.md) — the Ansible recovery steps that must run before erp is scaled back up.
- [tools secrets-and-vso](../tools/secrets-and-vso.md) — the VSO runtime that materialises `vso-db-credentials`, feeding both the backup and restore Jobs.
- [factory postgres-iac](../factory-provisioning/opentofu/postgres-iac.md) — the per-app `erp` PostgreSQL database + role, and the TF state that holds its admin password.


@@ -0,0 +1,189 @@
[vibe](../../README.md) > [Guidebooks](../README.md) > [ERP](README.md) > **Deployment**
# Deployment
> **Status:** ✅ Active
> **Last Updated:** 2026-06-23
> **Upstream:** [ERP hub](README.md) · [Applications hub](../applications/README.md) · [01 · factory](../lab-ecosystem/01-factory.md)
> **Downstream:** [Backup & recovery](backup-and-recovery.md) · [Operations](operations.md)
> **Related:** [tools secrets-and-vso](../tools/secrets-and-vso.md) · [factory postgres-iac](../factory-provisioning/opentofu/postgres-iac.md) · [factory ci-apply-flow](../factory-provisioning/opentofu/ci-apply-flow.md) · [naming-conventions](../lab-ecosystem/naming-conventions.md) · [webapp](../applications/webapp.md)
This page maps how **erp** is deployed: the chart that wraps the **upstream Dolibarr image**, the runtime trick that makes a MySQL-assuming application speak **PostgreSQL**, the **50Gi document PVC** that holds every business record, the Vault CRDs that feed it credentials, and the OpenTofu + CI that declare its Vault objects. It is the most data-critical app in the lab; its `iac/` directory runs through the same `tofu apply` pipeline as every other app (see [factory ci-apply-flow](../factory-provisioning/opentofu/ci-apply-flow.md)).
## 1 · App & image
erp is **Dolibarr** pulled straight from the upstream `dolibarr/dolibarr` Docker Hub image — there is **no repo-built image** and no `Dockerfile`. The chart adapts the upstream container at runtime instead of forking it.
| Field | Value | Source |
|---|---|---|
| Application | Dolibarr ERP/CRM (PHP / Apache) | [chart/Chart.yaml](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/Chart.yaml) |
| Version | **22.0.4** (chart `appVersion: "22.0.4"`) | [chart/Chart.yaml](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/Chart.yaml) |
| Image | `dolibarr/dolibarr:22.0.4` — upstream, `pullPolicy: IfNotPresent` | [chart/values.yaml](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/values.yaml) |
| Image tag | `image.tag` empty → defaults to chart `appVersion` (`{{ .Values.image.tag \| default .Chart.AppVersion }}`) | [chart/templates/deployment.yaml](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/templates/deployment.yaml) |
| Served at | `https://erp.arcodange.lab` (internal only) | [chart/templates/config.yaml](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/templates/config.yaml) |
| Container command | `["/bin/bash", "/usr/local/bin/custom-entrypoint.sh", "apache2-foreground"]` | [chart/templates/deployment.yaml](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/templates/deployment.yaml) |
> [!NOTE]
> Because erp consumes an upstream image, there is **no `docker-build-and-push` workflow** in `.gitea/workflows/` — unlike [webapp](../applications/webapp.md), which builds and pushes its own image. erp's only workflow is the OpenTofu/Vault one (see [§8](#8--ci--vaultyaml)).
## 2 · Postgres, not MySQL
Dolibarr classically assumes MySQL, but erp runs on **PostgreSQL**. Two pieces make that work, both at startup, both inside the [custom entrypoint](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/scripts/custom_entrypoint.sh), which wraps the upstream `docker-run.sh` and hands off to it as a final third step:
1. **MySQL → psql rewrite.** When `DOLI_DB_TYPE == "pgsql"`, the entrypoint `sed`s the upstream `/usr/local/bin/docker-run.sh` in place, replacing its `mysql -u ... < ${file}` SQL invocation with `PGPASSWORD=... psql -U ... -h ... -p ... -d ... < ${file}`.
2. **Apache `ServerName`.** It strips the scheme from `DOLI_URL_ROOT` and sets the Apache `ServerName` in `000-default.conf` (and appends to `apache2.conf`) so the vhost matches `erp.arcodange.lab`.
3. It then `exec`s the original `docker-run.sh "$@"` (i.e. `apache2-foreground`).
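The entrypoint steps above can be sketched in shell. This is an illustration under stated assumptions: the function names and the exact `mysql ...` line being matched are hypothetical (the real pattern lives in `custom_entrypoint.sh` and the upstream `docker-run.sh`), and the explicit "did it match?" guard is an addition for illustration, not something the source says the entrypoint does.

```bash
# Sketch only: function names and the matched mysql line are hypothetical.

# Replace the mysql import line with a psql one, keeping ${...} literal so that
# docker-run.sh expands the variables itself at run time.
rewrite_sql_client() {
  local script="$1"
  sed -i 's|mysql -u .* < \${file}|PGPASSWORD=${DOLI_DB_PASSWORD} psql -U ${DOLI_DB_USER} -h ${DOLI_DB_HOST} -p ${DOLI_DB_HOST_PORT} -d ${DOLI_DB_NAME} < ${file}|' "$script"
  # Fail loudly when upstream changed the line and the substitution no longer matches.
  if ! grep -q 'psql -U' "$script"; then
    echo "psql rewrite did not match in $script" >&2
    return 1
  fi
}

# Strip the scheme from DOLI_URL_ROOT to obtain the Apache ServerName.
server_name_from_url() {
  printf '%s\n' "${1#*://}"
}

# Usage inside an entrypoint (only when DOLI_DB_TYPE is pgsql):
#   rewrite_sql_client /usr/local/bin/docker-run.sh
#   server_name_from_url "$DOLI_URL_ROOT"
```

A guard of this kind would turn a silently missed substitution into an immediate startup failure, which is easier to diagnose than a later `mysql: command not found`.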
The non-secret database wiring lives in the `erp-config` ConfigMap, injected via `envFrom`:
| Env var | Value | Meaning |
|---|---|---|
| `DOLI_DB_TYPE` | `pgsql` | Selects PostgreSQL — triggers the entrypoint rewrite |
| `DOLI_DB_HOST` | `pgbouncer.tools` | Connects through the [tools pgbouncer pooler](../tools/secrets-and-vso.md) |
| `DOLI_DB_HOST_PORT` | `5432` | Pooler port |
| `DOLI_DB_NAME` | `erp` | The per-app database (provisioned by [factory postgres-iac](../factory-provisioning/opentofu/postgres-iac.md)) |
| `DOLI_URL_ROOT` | `https://erp.arcodange.lab` | Drives the Apache `ServerName` |
| `DOLI_ENABLE_MODULES` | `Societe,Facture` | Third-parties + invoicing modules |
| `DOLI_COMPANY_NAME` | `Arcodange` | Seeded company name |
| `DOLI_COMPANY_COUNTRYCODE` | `FR` | Seeded country |
| `PHP_INI_DATE_TIMEZONE` | `Europe/Paris` | PHP timezone |
| `DOLI_AUTH` | `dolibarr` | Native Dolibarr auth |
| `DOLI_CRON` | `0` | In-container cron disabled |
`DOLI_DB_USER` / `DOLI_DB_PASSWORD` are **not** in the ConfigMap — they come from Vault (see [§5](#5--vault-crds)).
> [!WARNING]
> The psql rewrite is a **textual `sed` against an upstream file**. If a future Dolibarr image changes the exact `mysql ... < ${file}` line in `docker-run.sh`, the substitution silently stops matching and SQL imports fall back to `mysql` (which is absent) — startup SQL then fails. Re-verify the entrypoint pattern whenever the `appVersion` is bumped.
## 3 · Persistence — the document PVC
A single PVC named `erp` holds every business record. It is the most important object in the chart.
| Field | Value | Source |
|---|---|---|
| Name | `erp` (`erp.fullname`) | [chart/templates/pvc.yaml](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/templates/pvc.yaml) |
| Access mode | `ReadWriteMany` (RWX) | [chart/templates/pvc.yaml](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/templates/pvc.yaml) |
| Size | `50Gi` | [chart/templates/pvc.yaml](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/templates/pvc.yaml) |
| StorageClass | `longhorn` | [chart/templates/pvc.yaml](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/templates/pvc.yaml) |
| Retention | annotation `helm.sh/resource-policy: keep` — survives a `helm uninstall` | [chart/templates/pvc.yaml](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/templates/pvc.yaml) |
The Deployment mounts the **same PVC** at three paths via `subPath`:
| Mount path | subPath | Holds |
|---|---|---|
| `/var/www/documents` | `documents` | **Invoices, attachments, generated PDFs — the critical business data** |
| `/var/www/html/custom` | `custom` | Custom/installed Dolibarr modules |
| `/var/backups` | `backups` | In-pod backup landing area |
> [!CAUTION]
> **Losing this PVC loses all business documents.** `/var/www/documents` contains the only copy of uploaded invoices, attachments, and generated PDFs — these are real accounting records, not regenerable cache. The `helm.sh/resource-policy: keep` annotation protects it from a chart uninstall, but it does **not** protect against a Longhorn-volume loss or a node failure. Treat the PVC as primary data and rely on [Backup & recovery](backup-and-recovery.md) for off-volume copies.
## 4 · Chart shape
| Aspect | Value | Source |
|---|---|---|
| `replicaCount` | **1** (single replica) | [chart/values.yaml](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/values.yaml) |
| Autoscaling | **disabled** (`autoscaling.enabled: false`; no HPA rendered) | [chart/values.yaml](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/values.yaml) |
| Service | `ClusterIP`, port `80` → `targetPort: http` | [chart/templates/service.yaml](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/templates/service.yaml) |
| Ingress host | `erp.arcodange.lab`, path `/` (`Prefix`) | [chart/templates/ingress.yaml](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/templates/ingress.yaml) |
| Ingress entrypoint | Traefik `websecure` + `router.tls: "true"` | [chart/values.yaml](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/values.yaml) |
| TLS cert | `certresolver: letsencrypt`, domain `arcodange.lab` / SAN `erp.arcodange.lab` | [chart/values.yaml](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/values.yaml) |
| Middleware | `localIp@file`: **internal only**, no public `.fr` host | [chart/values.yaml](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/values.yaml) |
| revisionHistoryLimit | `5` | [chart/templates/deployment.yaml](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/templates/deployment.yaml) |
> [!WARNING]
> **Single replica on an RWX PVC.** `replicaCount: 1` with autoscaling off means erp has **no redundancy** — a node or pod failure is a full outage until rescheduled. A credential rotation or config change triggers a rollout that briefly takes the only pod down. This is deliberate for a stateful, low-traffic internal app, but do not raise the replica count without first confirming Dolibarr tolerates concurrent writes to the shared `documents` volume.
The Deployment carries `configmap-hash` / `configmap2-hash` / `configmap3-hash` annotations (sha256 of the three ConfigMaps) so a change to config or the init scripts forces a pod roll.
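The hash-annotation mechanism works because any byte change in a ConfigMap changes its sha256, which changes the pod template and therefore triggers a rollout. A minimal sketch of that property, using a hypothetical `config_hash` helper:

```bash
# Sketch only: config_hash is a hypothetical helper mirroring what a sha256 annotation captures.
# Reads content on stdin and prints its sha256 digest as 64 hex characters.
config_hash() {
  sha256sum | cut -d' ' -f1
}

# Example: compare the digest of two renderings of the same ConfigMap.
#   config_hash < rendered-before.yaml
#   config_hash < rendered-after.yaml
```

Identical input always yields the same digest, so an unchanged ConfigMap never causes a spurious roll.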
## 5 · Vault CRDs
erp cannot start without VSO-injected credentials. Three CRDs (from the chart) wire it to Vault — see [tools secrets-and-vso](../tools/secrets-and-vso.md) for the VSO runtime.
| CRD | Name | What it does |
|---|---|---|
| `VaultAuth` | `auth` | Kubernetes auth — `mount: kubernetes`, `role: erp`, ServiceAccount `erp`, audience `vault`. Every other CRD references it via `vaultAuthRef: auth`. |
| `VaultStaticSecret` | `vault-kv-app` | `type: kv-v2`, `mount: kvv2`, `path: erp/config` → k8s Secret **`secretkv`**, `refreshAfter: 24h`. Injected via `envFrom` `secretRef`. Holds `DOLI_ADMIN_LOGIN`, `DOLI_ADMIN_PASSWORD`, `DOLI_INSTANCE_UNIQUE_ID`. |
| `VaultDynamicSecret` | `vso-db` | `mount: postgres`, `path: creds/erp` → k8s Secret **`vso-db-credentials`** (rotating DB user/password). `rolloutRestartTargets` the erp Deployment so a rotation rolls the pod. `DOLI_DB_USER` / `DOLI_DB_PASSWORD` are wired into the pod via `secretKeyRef`. |
Credential delivery in the Deployment:
- `envFrom: secretRef: secretkv` — static admin config + instance UUID.
- `env: DOLI_DB_USER` / `DOLI_DB_PASSWORD`, each a `secretKeyRef` on `vso-db-credentials` (`username` / `password`).
| Sources | [chart/templates/vaultauth.yaml](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/templates/vaultauth.yaml) · [vaultsecret.yaml](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/templates/vaultsecret.yaml) · [vaultdynamicsecret.yaml](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/templates/vaultdynamicsecret.yaml) |
|---|---|
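Since erp cannot start without both Secrets, a read-only check that VSO has materialised the expected keys can save a crash-loop investigation. The sketch below uses hypothetical helper names and assumes both Secrets live in the `erp` namespace; it prints key names only, never values.

```bash
# Sketch only: missing_keys and secret_keys are hypothetical helpers.

# Prints every required key (arguments) that is absent from the key list on stdin.
missing_keys() {
  local present key
  present="$(cat)"
  for key in "$@"; do
    printf '%s\n' "$present" | grep -qxF "$key" || printf '%s\n' "$key"
  done
}

# [AGENT] read-only: list the key names (not the values) of a Secret.
secret_keys() {
  kubectl get secret "$1" -n erp \
    -o go-template='{{range $k, $v := .data}}{{$k}}{{"\n"}}{{end}}'
}

# Usage: empty output means every expected key is present.
#   secret_keys secretkv | missing_keys DOLI_ADMIN_LOGIN DOLI_ADMIN_PASSWORD DOLI_INSTANCE_UNIQUE_ID
#   secret_keys vso-db-credentials | missing_keys username password
```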
## 6 · Init scripts (mounted from ConfigMaps)
Three scripts ship in `chart/scripts/` and are mounted into the pod via ConfigMaps. The entrypoint runs at container start; the `before-starting.d/` scripts run before Apache.
| Script | Mounted at | Role |
|---|---|---|
| [custom_entrypoint.sh](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/scripts/custom_entrypoint.sh) | `/usr/local/bin/custom-entrypoint.sh` (ConfigMap `dolibarr-custom-entrypoint-script`) | Wraps `docker-run.sh`: MySQL→psql `sed` rewrite + Apache `ServerName` from `DOLI_URL_ROOT` (see [§2](#2--postgres-not-mysql)) |
| [update_conf_db_credentials.sh](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/scripts/update_conf_db_credentials.sh) | `/var/www/scripts/before-starting.d/` (ConfigMap `dolibarr-before-start-scripts`) | `sed`s the Vault-injected `DOLI_DB_USER` / `DOLI_DB_PASSWORD` into Dolibarr's `conf.php` at startup, so the running app uses the freshly rotated creds |
| [update_ownership.sql](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/scripts/update_ownership.sql) | `/var/www/scripts/before-starting.d/update_table_ownership.sql` | `REASSIGN OWNED BY` the current `public`-schema owner → `erp_role`. Run if you hit read-only-filesystem / permission errors after a credential change |
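The credential injection performed by `update_conf_db_credentials.sh` can be sketched as a pair of `sed` substitutions. This is an assumption-laden illustration: the function names are hypothetical, and it assumes `conf.php` holds lines of the form `$dolibarr_main_db_user='...';` and `$dolibarr_main_db_pass='...';`. Values containing a single quote would need extra PHP escaping that this sketch does not attempt.

```bash
# Sketch only: function names and the conf.php line layout are assumptions.

# Escape characters that are special in a sed replacement using "|" as delimiter.
escape_sed_replacement() {
  printf '%s' "$1" | sed 's/[&|\\]/\\&/g'
}

# Write the Vault-injected DOLI_DB_USER / DOLI_DB_PASSWORD into a Dolibarr conf.php.
update_conf_credentials() {
  local conf="$1" user pass
  user="$(escape_sed_replacement "$DOLI_DB_USER")"
  pass="$(escape_sed_replacement "$DOLI_DB_PASSWORD")"
  sed -i \
    -e "s|^\([\$]dolibarr_main_db_user=\).*|\1'${user}';|" \
    -e "s|^\([\$]dolibarr_main_db_pass=\).*|\1'${pass}';|" \
    "$conf"
}
```

Escaping the replacement text matters because generated passwords can contain `&` or `|`, which `sed` would otherwise interpret rather than copy.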
> [!CAUTION]
> **The ownership SQL must run after the DB role behind the dynamic creds changes.** Because `postgres/creds/erp` mints a **new** Postgres user on each rotation, freshly created tables can end up owned by a transient user. If the in-pod `update_table_ownership.sql` cannot write its temp file (`Read-only file system`), it is skipped and Dolibarr eventually loses query rights once Vault rotates creds. The fix is to run that SQL by hand against the `erp` database — see [Operations](operations.md). The script reassigns ownership to the stable **`erp_role`** created by [factory postgres-iac](../factory-provisioning/opentofu/postgres-iac.md).
## 7 · iac/ — Vault objects via the shared module
erp's `iac/` declares only its Vault footprint; the Postgres database and `erp_role` themselves come from factory ([postgres-iac](../factory-provisioning/opentofu/postgres-iac.md)).
| Element | Value | Source |
|---|---|---|
| Shared module | `app_roles` from `arcodange-org/tools` (`hashicorp-vault/iac/modules/app_roles`, `ref=main`), `name = "erp"` | [iac/main.tf](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/iac/main.tf) |
| What the module provisions | the `postgres/creds/erp` dynamic role + a Kubernetes auth `role erp` + the `kvv2` path prefix | [tools secrets-and-vso](../tools/secrets-and-vso.md) |
| Admin password | `random_password.admin_initial_password` (length 32) → `DOLI_ADMIN_PASSWORD` | [iac/main.tf](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/iac/main.tf) |
| Instance ID | `random_uuid.dolibarr_id` with `lifecycle { prevent_destroy = true }` → `DOLI_INSTANCE_UNIQUE_ID` (encryption salt + module licensing) | [iac/main.tf](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/iac/main.tf) |
| KV secret | `vault_kv_secret_v2` at `<kvv2 prefix>config` (i.e. `erp/config`), data = `DOLI_ADMIN_LOGIN` + `DOLI_ADMIN_PASSWORD` + `DOLI_INSTANCE_UNIQUE_ID` | [iac/main.tf](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/iac/main.tf) |
| Backend | GCS bucket `arcodange-tf`, prefix `erp/main` | [iac/backend.tf](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/iac/backend.tf) |
| Vault provider | `address = https://vault.arcodange.lab`, `auth_login_jwt` `mount = gitea_jwt`, `role = gitea_cicd_erp`, provider `vault` `4.4.0` | [iac/providers.tf](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/iac/providers.tf) |
The Postgres dynamic role created here `GRANT`s **`erp_role`** — the stable role created by factory ([postgres-iac](../factory-provisioning/opentofu/postgres-iac.md)) — so every rotated DB user inherits the right schema privileges. This is the same KV secret the chart's `VaultStaticSecret` reads back as `secretkv`, closing the loop between `iac/` (writes config) and `chart/` (consumes it).
> [!WARNING]
> **The OpenTofu state holds the plaintext admin password.** `random_password.admin_initial_password` is stored unencrypted in the GCS state at `arcodange-tf/erp/main`. Anyone with read access to that state bucket can read `DOLI_ADMIN_PASSWORD`. Treat the `erp/main` state prefix as a secret; do not copy it locally unprotected. The `random_uuid` instance ID is similarly in state but is guarded by `prevent_destroy` because losing it breaks decryption of stored data and invalidates purchased modules.
## 8 · CI — vault.yaml
| Element | Value | Source |
|---|---|---|
| Workflow | `Hashicorp Vault` (`.gitea/workflows/vault.yaml`) | [.gitea/workflows/vault.yaml](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/.gitea/workflows/vault.yaml) |
| Triggers | `workflow_dispatch`, plus `push` / `pull_request` on `iac/*.tf` | [.gitea/workflows/vault.yaml](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/.gitea/workflows/vault.yaml) |
| Job 1 | `gitea_vault_auth` — mints a Gitea OIDC JWT for Vault | [.gitea/workflows/vault.yaml](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/.gitea/workflows/vault.yaml) |
| Job 2 | `tofu`: `dflook/terraform-apply` over `iac/`, `auto_approve: true`, **OpenTofu `1.8.2`** | [.gitea/workflows/vault.yaml](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/.gitea/workflows/vault.yaml) |
| Secrets | `TERRAFORM_SSH_KEY` (SSH key to clone the `app_roles` module from `tools`) + `HOMELAB_CA_CERT` (Vault self-signed CA) + `GOOGLE_BACKEND_CREDENTIALS` (GCS state) | [.gitea/workflows/vault.yaml](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/.gitea/workflows/vault.yaml) |
This `tofu apply` follows the lab-wide pattern documented in [factory ci-apply-flow](../factory-provisioning/opentofu/ci-apply-flow.md). There is **no application-image build step** — the chart is delivered by ArgoCD ([01 · factory](../lab-ecosystem/01-factory.md)), and the image is upstream.
## 9 · `<app>` convention mapping
erp follows the lab's per-app naming convention — see [naming-conventions](../lab-ecosystem/naming-conventions.md). With `<app> = erp`:
| `<app>` slot | erp value |
|---|---|
| Repo | `arcodange-org/erp` |
| K8s namespace | `erp` |
| Internal host | `erp.arcodange.lab` |
| ServiceAccount | `erp` |
| Vault Kubernetes auth role | `erp` |
| Vault KV path | `kvv2` `erp/config` → Secret `secretkv` |
| Vault dynamic DB path | `postgres/creds/erp` → Secret `vso-db-credentials` |
| Postgres database | `erp` |
| Postgres stable role | `erp_role` |
| OpenTofu state prefix | GCS `arcodange-tf/erp/main` |
| Gitea CI Vault role | `gitea_cicd_erp` |
| Document PVC | `erp` (50Gi Longhorn RWX) |
## Cross-references
- [ERP hub](README.md) — the orientation map for the whole guidebook.
- [Backup & recovery](backup-and-recovery.md) — protecting the 50Gi document PVC and the `erp` database; cluster-recovery ordering (unseal Vault before scaling erp up).
- [Operations](operations.md) — day-to-day operational tasks, including running the table-ownership SQL by hand.
- [tools secrets-and-vso](../tools/secrets-and-vso.md) — the `app_roles` module, the VSO runtime that materialises `secretkv` + `vso-db-credentials`, and the `pgbouncer.tools` pooler.
- [factory postgres-iac](../factory-provisioning/opentofu/postgres-iac.md) — provisions the `erp` database and the stable `erp_role` the dynamic creds inherit.
- [factory ci-apply-flow](../factory-provisioning/opentofu/ci-apply-flow.md) — the shared `tofu apply` CI pattern erp's `vault.yaml` follows.
- [naming-conventions](../lab-ecosystem/naming-conventions.md) — the `<app>` slots filled in [§9](#9--app-convention-mapping).
- [webapp](../applications/webapp.md) — the archetype that *does* build its own image; erp differs by reusing the upstream Dolibarr image.


@@ -0,0 +1,173 @@
[vibe](../../README.md) > [Guidebooks](../README.md) > [ERP](README.md) > **Operations**
# ERP Operations
> **Status:** ✅ Active
> **Last Updated:** 2026-06-23
> **Upstream:** [ERP hub](README.md) · [Deployment](deployment.md)
> **Downstream:** [Backup & recovery](backup-and-recovery.md)
> **Related:** [Applications hub](../applications/README.md) · [Web app](../applications/webapp.md) · [Secrets & VSO](../tools/secrets-and-vso.md) · [Postgres IaC](../factory-provisioning/opentofu/postgres-iac.md)
This page covers day-2 operation of the Arcodange Dolibarr ERP: the read-only operations CLI, the static identity assets, the Playwright bootstrap test suite, and the recurring scaling / module-activation / storage chores. For how the workload is deployed onto the cluster, see [Deployment](deployment.md). For backups and disaster recovery, see [Backup & recovery](backup-and-recovery.md).
---
## 1. The read-only ops CLI — `bin/arcodange`
[`bin/arcodange`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/bin/arcodange) is a Bash dispatcher that gives a human-friendly entry point to safe, **strictly read-only** Dolibarr operations against `erp.arcodange.lab`. Every subcommand `exec`s a script under `.claude/skills/<skill>/scripts/`; the dispatcher itself only locates the project root (via `git rev-parse --show-toplevel`, falling back to walking up from the script) and routes arguments.
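The dispatcher pattern described here (find the project root, then `exec` a skill script) can be sketched as below. The function names, the use of a `.claude/skills` directory as the root marker, and the `<subcommand>.sh` script naming are all hypothetical; the real logic is in `bin/arcodange`.

```bash
# Sketch only: names, the marker directory, and the script naming are hypothetical.

# Walk up from a starting directory until a directory containing .claude/skills is found.
find_root_from() {
  local dir="$1"
  while [ "$dir" != "/" ] && [ "$dir" != "." ]; do
    if [ -d "$dir/.claude/skills" ]; then
      printf '%s\n' "$dir"
      return 0
    fi
    dir="$(dirname "$dir")"
  done
  return 1
}

# Prefer git's idea of the root; fall back to walking up from the script's own location.
project_root() {
  git rev-parse --show-toplevel 2>/dev/null \
    || find_root_from "$(cd "$(dirname "$0")" && pwd)"
}

# Route `<skill> <subcommand> [args...]` to the matching script and replace this process.
dispatch() {
  local skill="$1" sub="$2" root
  shift 2
  root="$(project_root)" || { echo "project root not found" >&2; return 1; }
  exec "$root/.claude/skills/$skill/scripts/$sub.sh" "$@"
}
```

Using `exec` means the skill script's exit code becomes the CLI's exit code directly, with no wrapper process left in between.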
> [!IMPORTANT]
> The CLI authenticates with credentials read from [`.claude/skills/dolibarr/.env`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/bin/arcodange) — a **gitignored** file expected at **mode `600`**. The underlying API key belongs to the `ai_agent` service account, which has **no write permissions**: the CLI cannot mutate Dolibarr. Corrections always go through the Dolibarr web UI.
### Command map
| Command | Subcommand | What it does |
| --- | --- | --- |
| `ping` | — | `GET /status` — liveness probe + reports the running Dolibarr version |
| `whoami` | — | `GET /users/info` — confirms auth is the `ai_agent` service account |
| `invoice` | `list [--since YYYY-MM-DD]` | Table of KissMetrics customer invoices with payment state |
| `invoice` | `audit <invoice-id>` | JSON facts + PDF mandatory-mention audit for one invoice |
| `payments` | `state [--since YYYY-MM-DD]` | Per-invoice TTC vs payments reconciliation |
| `payments` | `timeline [--year\|--since\|--until]` | Payment timeline with cumulative balance |
| `payments` | `by-month [--year\|--all-clients]` | Monthly cash-receipt aggregation |
| `tva` | `summary [--year\|--since\|--until]` | CA3-ready monthly TVA summary (collectée and déductible) |
| `tva` | `collect` / `collect-detail` | TVA collectée by month × rate (CA3 A1/A4/E2) + per-line audit |
| `tva` | `deductible` / `deductible-detail` | TVA déductible by month × rate (CA3 19/20/17+24) + per-line audit |
| `thirdparty` | `audit <socid>` | Country-aware completeness audit for one thirdparty |
| `thirdparty` | `audit-all [--clients-only\|--suppliers-only]` | Audit every visible thirdparty |
| `templates` | `list [--max-id N]` / `inspect <id>` | Enumerate / health-check recurring invoice templates |
| `bank` | `probe` / `balance` / `match` / `qonto-transactions` / `wise-transactions` / `curl` | Qonto + Wise bank data and Dolibarr reconciliation |
| `email` | `list` / `inspect <id>` / `curl` | Supplier-invoice ingestion from the Zoho mailbox |
| `snapshot` | `--out FILE` (or `--print-only`) | Bundle the full read-only state into one JSON dump (with `content_hash`) |
| `curl` | `<path>` | Raw read-only call through `dol-curl.sh` (e.g. `arcodange curl /invoices/12`) |
| `help` | `[command]` | Full command tree, or per-command help |
### Health checks first
| Check | Command | Expected outcome |
| --- | --- | --- |
| Is Dolibarr up? | `bin/arcodange ping` | HTTP `200` + Dolibarr version string |
| Is auth wired? | `bin/arcodange whoami` | The `ai_agent` user record |
| Full state dump | `bin/arcodange snapshot --out /tmp/erp.json` | One JSON file with a `content_hash` |
> [!TIP]
> `snapshot` is the fastest way to capture a point-in-time, read-only view of the ERP (invoices, payments, TVA, thirdparties, templates) for offline diffing or for attaching to an incident. It does not touch the database — it only reads.
The CLI's per-domain credentials beyond Dolibarr (Qonto/Wise for `bank`, Zoho OAuth for `email`) also live in the same gitignored `.env`. The skills' `SKILL.md` files remain the source of business-logic documentation; the CLI is just the ergonomic front door.
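The three health checks from the table above can be chained so that the first failure stops the sequence. This is a minimal sketch, assuming the CLI exits non-zero when a check fails (not confirmed by this page); `ARCODANGE_CLI` and `erp_health_checks` are hypothetical names introduced for illustration.

```sh
# Assumption: run from the erp repository root, where bin/arcodange lives.
# ARCODANGE_CLI is a hypothetical override, not something the CLI itself defines.
erp_health_checks() {
  cli="${ARCODANGE_CLI:-bin/arcodange}"
  out="${1:-/tmp/erp.json}"
  "$cli" ping >/dev/null || { echo "ping failed: is Dolibarr up?" >&2; return 1; }
  "$cli" whoami >/dev/null || { echo "whoami failed: is auth wired?" >&2; return 2; }
  "$cli" snapshot --out "$out" >/dev/null || { echo "snapshot failed" >&2; return 3; }
  echo "ok: snapshot written to $out"
}
```

The distinct return codes (1, 2, 3) make it easy to tell which check failed when the function is called from another script.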
---
## 2. Static identity assets — `static/`
[`static/`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/static) holds the company's legal identity and branding, consumed by the Playwright bootstrap suite when it configures a fresh Dolibarr install.
| Path | Purpose |
| --- | --- |
| [`static/config/company.json`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/static/config/company.json) | Legal identity used for Dolibarr company setup and display |
| [`static/img/logo512.png`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/static/img/logo512.png) | Company logo (referenced from `company.json` as `$IMG/logo512.png`) |
| [`static/img/loginBackground.jpeg`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/static/img/loginBackground.jpeg) | Login-page background image |
`company.json` carries two blocks — `info` (postal/contact identity) and `ID` (legal identity):
| Field | Value |
| --- | --- |
| Raison sociale | Arcodange |
| Forme juridique | SAS (Société par actions simplifiée) |
| Adresse | 73 Boulevard de l'Yerres, 91000 Évry-Courcouronnes, France (FR) |
| Site / email | arcodange.fr · gabrielradureau@arcodange.fr |
| SIREN / SIRET | (legal registration identifiers) |
| NAF / APE | 62.02A |
| N° TVA | (intra-community VAT number) |
| Capital | (share capital) |
| RCS | R.C.S. Évry |
| Mois début d'exercice | Juillet |
| Logo | `$IMG/logo512.png` |
> [!NOTE]
> The `$IMG` token in `company.json` resolves to `static/img/` via the test harness's `IMG_FOLDER` (see §3). The same image folder feeds the optional login-page background and logo upload during display setup.
---
## 3. Bootstrap test suite — `test/`
[`test/`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/test) is a **Deno + Playwright** UI suite that drives a real browser through the Dolibarr first-install and admin configuration flows. It is not a unit-test runner — it is the scripted bootstrap that stands up a fresh Dolibarr instance and applies the company identity.
| File | Role |
| --- | --- |
| [`test/main.ts`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/test/main.ts) | Entry point: launches Chromium (`fr-FR` locale), wires `globalCtx`, runs install + admin setup steps |
| [`test/deno.json`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/test/deno.json) | Imports `npm:playwright` and `jsr:@std/dotenv/load`; `checkJs: true` |
| [`test/.env.example`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/test/.env.example) | Template for `DOLIBARR_ADDRESS`, DB password, admin login, `ROOT_FOLDER` |
| [`test/scripts/admin/`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/test/scripts/admin) | `initialSetup.ts`, `companySetup.ts`, `displaySetup.ts`, `moduleSetup.ts` |
### Run the suite
1. Install Deno and the Playwright browsers:
- `curl -fsSL https://deno.land/install.sh | sh`
- `deno run --allow-all npm:playwright install`
2. Populate `test/.env` from the live cluster secrets (DB password from the `vso-db-credentials` secret; admin password from `secretkv`). See [Secrets & VSO](../tools/secrets-and-vso.md) for how those secrets land in the namespace.
3. Run from the `test/` directory: `deno run --allow-all main.ts`.
### Lock the installer after install
> [!CAUTION]
> Dolibarr's `install/` wizard stays reachable until an `install.lock` exists. After a successful first install, **always** create the lock — an unlocked installer is a live takeover risk on a production-like instance.
The post-install step touches the lock file inside the pod and chowns it to `www-data`:
```sh
kubectl -n erp exec $(kubectl get pod -n erp -l app.kubernetes.io/name=erp -o name) -- \
/bin/bash -c '/usr/bin/touch /var/www/html/install.lock && /bin/chown www-data:www-data /var/www/html/install.lock'
```
`initialSetup.isUpgradeLocked()` checks for the same locked state before deciding whether to (re)run the installer, so the lock is both a safety gate and the suite's idempotency signal.
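To confirm by hand that the lock is in place, a check along these lines may help. It is a sketch under assumptions: the container image is assumed to provide a `test` binary, and `KUBECTL` and `erp_installer_locked` are hypothetical names, not part of the repository.

```sh
# KUBECTL is a hypothetical override, used here only to make the sketch testable.
erp_installer_locked() {
  kubectl_bin="${KUBECTL:-kubectl}"
  # Same pod selector as the lock command above; take the first match.
  pod="$("$kubectl_bin" get pod -n erp -l app.kubernetes.io/name=erp -o name | head -n 1)"
  if [ -z "$pod" ]; then
    echo "no erp pod found" >&2
    return 2
  fi
  # kubectl exec propagates the exit status of the remote command.
  if "$kubectl_bin" -n erp exec "$pod" -- test -f /var/www/html/install.lock; then
    echo "install.lock present on $pod"
  else
    echo "install.lock MISSING on $pod" >&2
    return 1
  fi
}
```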
---
## 4. Day-2 operations
### Scaling — manual only
| Setting | Value | Source |
| --- | --- | --- |
| `replicaCount` | `1` | [`chart/values.yaml`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/values.yaml) |
| `autoscaling.enabled` | `false` | [`chart/values.yaml`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/values.yaml) |
Dolibarr runs as a **single replica with no HorizontalPodAutoscaler**. The instance is backed by a `ReadWriteOnce` filesystem PVC, so scaling out is not a supported topology — scaling is a deliberate, manual `replicaCount` change in the chart values, applied through the normal [deployment](deployment.md) path. Treat the workload as a single-writer system.
### After activating a new Dolibarr module — fix table ownership
Activating a Dolibarr module creates new SQL tables, and Dolibarr's migration runner creates them under whatever role the live VSO-rotated credentials happen to map to. If that role is not `erp_role`, a subsequent credential rotation by Vault can leave the new tables unreadable.
> [!IMPORTANT]
> After enabling any new module (e.g. via `moduleSetup.configureModule`), run the table-ownership reassignment so the new objects are owned by `erp_role`. The `REASSIGN OWNED BY … TO erp_role` logic lives in [`chart/scripts/update_ownership.sql`](https://gitea.arcodange.lab/arcodange-org/erp/src/branch/main/chart/scripts/update_ownership.sql) and is also mounted into the pod entrypoint as `update_table_ownership.sql`.
Apply it inside the pod:
```sh
kubectl exec -n erp $(kubectl get pod -n erp -l app.kubernetes.io/name=erp -o name) -c erp -- \
sh -c 'PGPASSWORD=${DOLI_DB_PASSWORD} psql -U ${DOLI_DB_USER} -h ${DOLI_DB_HOST} \
-p ${DOLI_DB_HOST_PORT} ${DOLI_DB_NAME} -f /var/www/scripts/before-starting.d/update_table_ownership.sql'
```
If the pod logged `Read-only file system` for the `update_table_ownership.sql` step at startup (the entrypoint cannot write its temp file), the reassignment never ran — run the command above by hand. If the live DB user has already lost rights, run the same SQL with the **admin Postgres credentials** instead. The role model is described in [Postgres IaC](../factory-provisioning/opentofu/postgres-iac.md).
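Before or after running the reassignment, it can be useful to list which tables are not owned by `erp_role`. The sketch below only builds the query; it relies on the standard PostgreSQL `pg_tables` view, while the function name `ownership_drift_sql` and the `$POD` variable in the usage comment are hypothetical.

```sh
# Print a query listing user tables whose owner is not the given role.
# pg_tables is a standard PostgreSQL catalog view; the role name comes from this page.
ownership_drift_sql() {
  role="${1:-erp_role}"
  printf "SELECT schemaname, tablename, tableowner FROM pg_tables WHERE schemaname NOT IN ('pg_catalog', 'information_schema') AND tableowner <> '%s' ORDER BY 1, 2;\n" "$role"
}

# Usage (same connection variables as the command above; $POD is hypothetical):
#   ownership_drift_sql | kubectl exec -i -n erp "$POD" -c erp -- \
#     sh -c 'PGPASSWORD=${DOLI_DB_PASSWORD} psql -U ${DOLI_DB_USER} \
#       -h ${DOLI_DB_HOST} -p ${DOLI_DB_HOST_PORT} ${DOLI_DB_NAME}'
```

An empty result set suggests that every user table is already owned by the role; any rows returned are candidates for the reassignment script.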
### Watch PVC usage so backups do not fill the volume
ERP data and (where co-located) backup artifacts share storage on a single PVC. If on-volume backup snapshots accumulate, they can exhaust the volume and take Dolibarr down.
| Watch | Why |
| --- | --- |
| PVC used vs. capacity | A full volume crashes Dolibarr (no room for sessions/temp/migrations) |
| Backup artifact growth | Old dumps left on the volume eat the same space data needs |
> [!WARNING]
> Monitor PVC usage and prune/offload old backup artifacts before they fill the **50Gi** volume. Backup retention, artifact layout, and the off-volume target are documented in [Backup & recovery](backup-and-recovery.md) — keep on-volume copies short-lived. See [Storage & recovery](../lab-ecosystem/storage-and-recovery.md) for the durable-copy model.
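One way to watch usage is to parse `df -P` output from inside the pod. This is a minimal sketch: the mount path `/var/www/documents` and the 80% threshold in the usage comment are assumptions, not values taken from the chart, and the helper names are hypothetical.

```sh
# Read `df -P` output on stdin and print the integer use% for one mount point.
pvc_usage_pct() {
  awk -v mount="$1" 'NR > 1 && $6 == mount { sub(/%/, "", $5); print $5 }'
}

# Succeeds (exit 0) when usage is at or above the threshold.
pvc_over_threshold() {
  [ "$1" -ge "$2" ]
}

# Usage ($POD and the mount path are hypothetical):
#   pct="$(kubectl -n erp exec "$POD" -- df -P | pvc_usage_pct /var/www/documents)"
#   pvc_over_threshold "$pct" 80 && echo "prune or offload backup artifacts"
```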
---
## See also
- [ERP hub](README.md) — overview and entry point for the ERP guidebook.
- [Deployment](deployment.md) — how the workload, chart, and credentials reach the cluster.
- [Backup & recovery](backup-and-recovery.md) — backup artifacts, retention, restore drills.
- [Secrets & VSO](../tools/secrets-and-vso.md) — how `vso-db-credentials` / `secretkv` land in the namespace.


@@ -0,0 +1,88 @@
[vibe](../../README.md) > [Guidebooks](../README.md) > **Factory provisioning**
# Factory provisioning
> **Status:** ✅ Active
> **Last Updated:** 2026-06-23
> **Upstream:** [Lab ecosystem guidebook](../lab-ecosystem/README.md) · [01 · factory](../lab-ecosystem/01-factory.md)
> **Related:** [safe-prod-like-environment ADR](../../ADR/0001-safe-prod-like-environment.md) · [safe-prod-like-environment PRD](../../PRD/safe-prod-like-environment/README.md)
This guidebook is the deep dive into **how the `factory` repo turns three Raspberry Pis + a handful of cloud accounts into the running lab.** Where the [lab-ecosystem](../lab-ecosystem/README.md) map shows *which* components exist and how they join, this guidebook drills into the two provisioning **engines** that build and maintain them: the Ansible collection that the operator runs from the Mac, and the OpenTofu modules that Gitea CI applies. Every page below describes the engine *as it is wired right now* — playbook imports, role responsibilities, inventory placement, provider versions, state backends, and the CI flow that ties Tofu to Vault.
## Two engines, two trigger models
The factory splits provisioning along a hard line: **imperative, operator-driven host/cluster build** (Ansible) versus **declarative, CI-driven forge/cloud/database state** (OpenTofu). They never overlap on the same resource, and they run at different moments.
| Engine | Trigger | Runs from | Owns | Lives at |
|---|---|---|---|---|
| **Ansible** | One-shot, operator-run on demand | The Mac (control node) | The cluster + base layer + stateful services: k3s, Longhorn, Pi-hole, step-ca, PostgreSQL, Gitea, Vault, CrowdSec — plus the disaster-recovery playbooks | [`ansible/`](../../../ansible/) → [sub-hub](ansible/README.md) |
| **OpenTofu** | CI-applied on Gitea (path-filtered `push`/`pull_request` + `workflow_dispatch`) | Gitea act-runners | Forge/cloud edge state (Cloudflare, OVH, GCP, Gitea, Vault) and **per-app PostgreSQL databases** | [`iac/`](../../../iac/) + [`postgres/`](../../../postgres/) → [sub-hub](opentofu/README.md) |
> [!NOTE]
> Ansible is **imperative and human-gated** because it touches bare hosts and one-time bootstrap (disk prep, k3s install, Vault init). OpenTofu is **declarative and machine-gated** because its targets are reconcilable API objects (a DNS record, a bucket, a database) whose desired state belongs in version control and converges on every merge.
## How a green-field lab comes up
```mermaid
%%{init: {'theme': 'base'}}%%
flowchart LR
classDef op fill:#1e3a8a,stroke:#1e40af,color:#fff
classDef eng fill:#059669,stroke:#047857,color:#fff
classDef host fill:#7c3aed,stroke:#6d28d9,color:#fff
classDef store fill:#b45309,stroke:#92400e,color:#fff
OP["Operator<br>at the Mac"]:::op -->|"runs playbooks 01→05"| ANS["Ansible collection<br>arcodange.factory"]:::eng
ANS -->|"OS · k3s · Longhorn · base layer"| PIS["3× Raspberry Pi<br>pi1 / pi2 / pi3"]:::host
PIS -->|"hosts Gitea + act-runners"| CI["Gitea CI<br>act-runners"]:::store
CI -->|"path-filtered apply"| TOFU["OpenTofu<br>iac/ + postgres/iac/"]:::eng
TOFU -->|"forge · cloud · PG state"| EDGE["Cloudflare · OVH · GCP<br>Gitea · Vault · PostgreSQL"]:::store
TOFU -. "state in GCS gs://arcodange-tf" .- EDGE
```
1. The **operator**, working from the **Mac control node**, runs the numbered Ansible playbooks `01_system` through `05_backup` in order.
2. **Ansible** lays the OS, k3s (`v1.34.3+k3s1`), Longhorn, and the base layer (Pi-hole, step-ca, Vault, CrowdSec) plus the stateful out-of-cluster services (PostgreSQL + Gitea) onto the **three Raspberry Pis** (`pi1`/`pi2`/`pi3`).
3. Once `pi2` is hosting **Gitea** and `pi1`/`pi3` are running the **act-runners** (registered by `03_cicd`), the forge can run CI.
4. A push or merge to `factory` that touches `iac/**` or `postgres/**` triggers the corresponding **Gitea CI** workflow on those runners.
5. The CI job authenticates to Vault via Gitea OIDC JWT and runs **OpenTofu**, which reconciles the **forge/cloud/database edge** — Cloudflare, OVH, GCP, Gitea action-secrets, Vault KV/policies, and the per-app PostgreSQL objects.
6. All OpenTofu state is kept in **GCS** under `gs://arcodange-tf` (prefix `factory/main` for the cloud edge, `factory/postgres` for the databases), so each CI run reads and writes the authoritative state remotely.
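Steps 4 and 6 above amount to two lookups: changed path to Tofu root, and Tofu root to state location. The sketch below restates them as shell functions. The function names and the short labels `iac` and `postgres` are illustrative only; the real path filters live in the Gitea workflow files.

```sh
# Map a changed path to the Tofu root it would trigger (labels are illustrative).
tofu_root_for_path() {
  case "$1" in
    iac/*)      echo "iac" ;;
    postgres/*) echo "postgres" ;;
    *)          echo "none" ;;
  esac
}

# Map a Tofu root to its GCS state prefix, per step 6 above.
state_prefix_for_root() {
  case "$1" in
    iac)      echo "gs://arcodange-tf/factory/main" ;;
    postgres) echo "gs://arcodange-tf/factory/postgres" ;;
    *)        return 1 ;;
  esac
}
```

Keeping the two roots on separate prefixes means a cloud-edge apply and a database apply never contend for the same state.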
## Master index
| Sub-hub | What it maps | Status |
|---|---|---|
| [Ansible](ansible/README.md) | The `arcodange.factory` collection: numbered playbooks `01` through `06`, the inventory + group_vars, and the reusable roles that build hosts, the cluster, and the stateful services | ✅ Active |
| [OpenTofu](opentofu/README.md) | The CI-applied IaC: the cloud/forge edge (`iac/`), the per-app PostgreSQL provisioning (`postgres/iac/`), and the Gitea-OIDC → Vault apply flow | ✅ Active |
### All pages
- **Ansible**
- [System (`01`)](ansible/01-system.md) — OS, DNS, SSL, disks, Docker, iSCSI, k3s, CoreDNS, cert-issuer, Longhorn/Traefik config
- [Setup (`02`)](ansible/02-setup.md) — PostgreSQL + Gitea docker-compose on `pi2` (and the optional backup-NFS share)
- [CI/CD (`03`)](ansible/03-cicd.md) — Gitea act-runner registration on `pi1`/`pi3` and the ArgoCD/Image-Updater install
- [Tools (`04`)](ansible/04-tools.md) — Vault + CrowdSec bootstrap into the cluster
- [Backup (`05`)](ansible/05-backup.md) — scheduled PostgreSQL / Gitea / k3s-PVC backups to `/mnt/backups`
- [Recover (`06`)](ansible/06-recover.md) — the Longhorn disaster-recovery playbooks (`recover/`)
- [Inventory & variables](ansible/inventory.md) — `hosts.yml` groups and the `group_vars` tree
- [Roles reference](ansible/roles.md) — `deploy_docker_compose`, the `gitea_*` family, `traefik_certs`, `playwright`, and the service sub-roles
- **OpenTofu**
- [factory iac](opentofu/factory-iac.md) — `iac/`: Cloudflare/OVH/GCP/Gitea/Vault edge + the `cloudflare_token` module
- [postgres iac](opentofu/postgres-iac.md) — `postgres/iac/`: per-app databases, roles, and the pgbouncer `user_lookup()` function
- [CI apply flow](opentofu/ci-apply-flow.md) — the Gitea workflows, OIDC-JWT → Vault auth, and the GCS state backend
## Maintenance rule
> [!IMPORTANT]
> **Alter a documented component → update its page in the same change.** If you change a playbook, a role, an inventory entry, a provider version, a Tofu resource, or the CI flow, the matching page in this guidebook MUST be edited in the same PR. A provisioning map that drifts from the code sends operators (and agents) down dead paths during a rebuild or a recovery — exactly when the map matters most.
## Why this guidebook earns its keep
The safe-prod-like-environment work rehearses **exactly these playbooks and Tofu modules** in a throwaway sandbox before they touch the real lab: the sandbox stands up the same `01` through `05` narrative and runs the same `iac/` + `postgres/iac/` apply, so the rehearsal only holds if this guidebook tracks the engines faithfully. See the [safe-prod-like-environment ADR](../../ADR/0001-safe-prod-like-environment.md) for the decision and the [PRD](../../PRD/safe-prod-like-environment/README.md) (with its [QA strategy](../../PRD/safe-prod-like-environment/qa-strategy.md)) for what the sandbox must reproduce.
## Cross-references
- [Lab ecosystem guidebook](../lab-ecosystem/README.md) — the higher-altitude whole-lab map; this guidebook is its provisioning deep dive.
- [01 · factory](../lab-ecosystem/01-factory.md) — the four-pillar summary of the `factory` repo that this guidebook expands.
- [secrets-and-vault.md](../lab-ecosystem/secrets-and-vault.md) — Gitea OIDC JWT for Tofu/CI and the dynamic PostgreSQL credentials these engines set up.
- [storage-and-recovery.md](../lab-ecosystem/storage-and-recovery.md) — Longhorn + GCS backup + the power-cut recovery the `06 · recover` playbooks serve.
- [naming-conventions.md](../lab-ecosystem/naming-conventions.md) — the `<app>` join key shared by the OpenTofu state prefixes and per-app PostgreSQL objects.
- [safe-prod-like-environment ADR](../../ADR/0001-safe-prod-like-environment.md) · [PRD](../../PRD/safe-prod-like-environment/README.md) — the sandbox that rehearses these engines before they touch the real lab.


@@ -0,0 +1,94 @@
[vibe](../../../README.md) > [Guidebooks](../../README.md) > [Factory provisioning](../README.md) > [Ansible](README.md) > **01 · System**
# 01 · System — base OS, Docker, K3s, Longhorn, DNS, SSL
> [!NOTE]
> **Status:** ✅ active · **Last Updated:** 2026-06-23
> **Upstream:** [Ansible sub-hub](README.md) · [Factory provisioning hub](../README.md)
> **Downstream:** [02 · Setup](02-setup.md) · [03 · CI/CD](03-cicd.md)
> **Related:** [Storage & recovery](../../lab-ecosystem/storage-and-recovery.md) · [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) · [Naming conventions](../../lab-ecosystem/naming-conventions.md) · [ADR-0001 safe prod-like environment](../../../ADR/0001-safe-prod-like-environment.md)
## What it does
`01 · System` takes three bare Raspberry Pis (`pi1`, `pi2`, `pi3`) and turns them into a configured K3s cluster. The wrapper [`playbooks/01_system.yml`](../../../../ansible/arcodange/factory/playbooks/01_system.yml) does nothing but `import_playbook` the stage orchestrator [`playbooks/system/system.yml`](../../../../ansible/arcodange/factory/playbooks/system/system.yml), which in turn imports ten sub-playbooks **in strict order**. Each sub-play layers one capability: hostname/DNS hygiene, Pi-hole HA DNS, the step-ca PKI, the external backup disk, Docker, the iSCSI/dm-crypt prerequisites for Longhorn, K3s itself, CoreDNS forwarding, the cert-manager issuer, and finally the cluster config (Longhorn + Traefik).
All host-facing plays target `raspberries:&local` — the intersection of the `raspberries` group and the `local` group, which resolves to `pi1`/`pi2`/`pi3` (see [Inventory & variables](inventory.md)). The K3s server/agent split is decided at runtime: the **first host (alphabetically) becomes the server**, the rest become agents.
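The server/agent rule can be illustrated with a few lines of shell. This is only a sketch of the selection logic, not the playbook's actual task; the function name `k3s_roles` is hypothetical, and byte-order sorting (`LC_ALL=C`) is an assumption about how "alphabetically" is applied.

```sh
# Print "<host> <role>" per line: the first host in sorted order is the server,
# every other host is an agent.
k3s_roles() {
  printf '%s\n' "$@" | LC_ALL=C sort |
    awk 'NR == 1 { print $0 " server"; next } { print $0 " agent" }'
}

# Usage: k3s_roles pi3 pi1 pi2
```

With the three hosts named on this page, the order in which they are passed does not matter: `pi1` sorts first and takes the server role.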
## Ordered steps
| # | Sub-playbook | Purpose | Key vars / versions |
| --- | --- | --- | --- |
| 1 | [`system/rpi.yml`](../../../../ansible/arcodange/factory/playbooks/system/rpi.yml) | Set each node's hostname to its `inventory_hostname`. On Pi-hole nodes (`pi1`/`pi3`) add `dnsmasq` to the `dip` group, then **stop & disable `dnsmasq`** to free port 53 for `pihole-FTL`. | `tags: never` (opt-in only) |
| 2 | [`dns/dns.yml`](../../../../ansible/arcodange/factory/playbooks/dns/dns.yml) → [`dns/pihole.yml`](../../../../ansible/arcodange/factory/playbooks/dns/pihole.yml) | Install & configure **Pi-hole HA DNS** via the `pihole` role. Adds custom records mapping `.arcodange.lab` and `.arcodange.duckdns.org` to `pi1`. | `pihole_custom_dns` → `pi1.preferred_ip` |
| 3 | [`ssl/ssl.yml`](../../../../ansible/arcodange/factory/playbooks/ssl/ssl.yml) → [`ssl/step-ca.yml`](../../../../ansible/arcodange/factory/playbooks/ssl/step-ca.yml) | Install **step-ca** (the `step_ca` role) on all three Pis; fetch the root CA from `pi1`; build a **Gitea runner image that trusts the CA** (`runner-images:ubuntu-latest-ca`) and push it to the registry. | `step_ca_primary: pi1`, root at `/home/step/.step/certs/root_ca.crt` |
| 4 | [`system/prepare_disks.yml`](../../../../ansible/arcodange/factory/playbooks/system/prepare_disks.yml) | Auto-detect the largest external (non-`mmcblk0`) USB partition, format it **ext4 with label `arcodange_500`**, mount at `/mnt/arcodange`, and persist in `fstab`. Skips format if the label already exists. **`pause` confirm before any format.** | `mount_point: /mnt/arcodange`, `disk_label: arcodange_500` |
| 5 | [`system/system_docker.yml`](../../../../ansible/arcodange/factory/playbooks/system/system_docker.yml) | Install Docker via `geerlingguy.docker`; write `daemon.json` with **json-file logging** (`max-size 10m`, `max-file 5`) and **`data-root: /mnt/arcodange/docker`** (only when the external disk is mounted). | `tags: never`; `storage-driver: overlay2` |
| 6 | [`system/iscsi_longhorn.yml`](../../../../ansible/arcodange/factory/playbooks/system/iscsi_longhorn.yml) | Install `open-iscsi` (+ enable `iscsid`) and `cryptsetup`, and load the **`dm_crypt`** kernel module (persisted in `/etc/modules`) — Longhorn's encrypted-volume prerequisites. Creates `/mnt/arcodange/longhorn`. | module `dm_crypt` |
| 7 | [`system/system_k3s.yml`](../../../../ansible/arcodange/factory/playbooks/system/system_k3s.yml) | Build the K3s inventory dynamically (first sorted host → `server`, rest → `agent`), install the `k3s-ansible` content, run `k3s.orchestration.site`, then **fetch the kubeconfig** to `~/.kube/config` (rewriting `127.0.0.1` → server IP). | **k3s `v1.34.3+k3s1`**; server args `--docker --disable traefik` |
| 8 | [`system/k3s_dns.yml`](../../../../ansible/arcodange/factory/playbooks/system/k3s_dns.yml) | Create the **`coredns-custom`** ConfigMap so cluster DNS forwards `arcodange.lab:53` to the Pi-hole IPs; also patch the main CoreDNS Corefile to forward to the same HA Pi-holes. | `pihole_ips` (extracted from hostvars) |
| 9 | [`system/k3s_ssl.yml`](../../../../ansible/arcodange/factory/playbooks/system/k3s_ssl.yml) | Deploy **cert-manager** + **step-issuer** as k3s static HelmCharts; create the `StepClusterIssuer` `step-ca` wired to the JWK provisioner and root CA. | cert-manager `v1.19.2`, step-issuer `1.9.11`, `caUrl: https://ssl-ca.arcodange.lab:8443`, **ARM64 `kube-rbac-proxy` override** |
| 10 | [`system/k3s_config.yml`](../../../../ansible/arcodange/factory/playbooks/system/k3s_config.yml) | Deploy **Longhorn** + **Traefik** as HelmCharts; issue the wildcard cert, set the default `TLSStore`, wire Gitea, the IP-allow-list middleware, and the CrowdSec bouncer plugin; then **delete the old Traefik** to force a redeploy. | Longhorn `v1.9.1`, Traefik `v37.4.0` (see detail below) |
## How the stages fit together
```mermaid
%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'13px'}}}%%
flowchart TD
classDef host fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb;
classDef cluster fill:#1e4032,stroke:#22c55e,color:#f0fdf4;
classDef danger fill:#5f1e1e,stroke:#ef4444,color:#fef2f2;
rpi["1 · rpi.yml<br>hostname + dnsmasq off"]:::host
dns["2 · pihole<br>HA DNS"]:::host
ssl["3 · step-ca<br>root CA + CA-trusting runner image"]:::host
disk["4 · prepare_disks.yml<br>ext4 arcodange_500 -> /mnt/arcodange"]:::danger
docker["5 · system_docker.yml<br>data-root on external disk"]:::host
iscsi["6 · iscsi_longhorn.yml<br>open-iscsi + dm_crypt"]:::host
k3s["7 · system_k3s.yml<br>k3s v1.34.3 (--disable traefik)"]:::cluster
cdns["8 · k3s_dns.yml<br>coredns-custom -> Pi-hole"]:::cluster
cmgr["9 · k3s_ssl.yml<br>cert-manager + step-issuer"]:::cluster
cfg["10 · k3s_config.yml<br>Longhorn + Traefik + redeploy"]:::cluster
rpi --> dns --> ssl --> disk --> docker --> iscsi --> k3s --> cdns --> cmgr --> cfg
```
1. **`rpi.yml`** fixes the hostname and, on Pi-hole nodes, stops `dnsmasq` so `pihole-FTL` can own port 53.
2. **Pi-hole** comes up as the HA DNS authority for `arcodange.lab`.
3. **step-ca** is installed; its root CA is fetched and baked into a Gitea runner image so CI can trust internal TLS.
4. **`prepare_disks.yml`** formats and mounts the external USB disk at `/mnt/arcodange` (with a confirmation pause).
5. **Docker** installs with its data-root pointed at that disk and capped logging.
6. **iSCSI + dm_crypt** prerequisites land so Longhorn can attach (and encrypt) volumes.
7. **K3s** installs with the first host as server, Docker as the container runtime, and Traefik disabled.
8. **CoreDNS** is reconfigured to forward `arcodange.lab` to the Pi-holes.
9. **cert-manager + step-issuer** wire the in-cluster issuer to step-ca.
10. **`k3s_config.yml`** deploys Longhorn and a fully-customized Traefik, then deletes the old Traefik so the helm-controller redeploys with the new config.
## `k3s_config.yml` — Longhorn & Traefik detail
| Resource | Value | Notes |
| --- | --- | --- |
| Longhorn HelmChart | `v1.9.1` | `defaultSettings.defaultDataPath: /mnt/arcodange/longhorn` — volumes live on the external disk. |
| Traefik HelmChart | `v37.4.0` | Deployed as a k3s static manifest (`traefik-v3.yaml`) with an inline `traefik-configmap`. |
| Wildcard cert | `wildcard-arcodange-lab` | `Certificate` for `arcodange.lab` + `*.arcodange.lab`, issued by the `step-issuer` `StepClusterIssuer`. |
| `TLSStore` `default` | `defaultCertificate: wildcard-arcodange-lab` | Makes the wildcard cert the cluster-wide default. |
| Gitea exposure | `gitea-external` `ExternalName` Service → `pi2` port 3000 | Gitea runs **outside** K3s as Docker Compose on `pi2`; Traefik routes `gitea.arcodange.lab` to it. |
| `localIp` middleware | `ipAllowList` | Restricts dashboard/Gitea routers to LAN + pod CIDR + the detected public IP. |
| CrowdSec bouncer | plugin `v1.3.3` | Traefik experimental plugin `crowdsec-bouncer-traefik-plugin` (config completed in [04 · Tools](04-tools.md)). |
| DuckDNS token | `traefik-duckdns-token` Secret → `DUCKDNS_TOKEN` | Consumed by the `letsencrypt` ACME DNS-challenge resolver via `envFrom`. |
## Gotchas
> [!CAUTION]
> **Step 4 formats a disk — data loss is real.** `prepare_disks.yml` picks the **largest non-system partition** and runs `mkfs.ext4 -F` on it when the `arcodange_500` label is absent. The `run_once` `pause` prompt ("tapez 'oui' pour continuer") is the only guard, and a wrong USB stick plugged into the wrong Pi will be wiped. Confirm `target_device` in the debug output before answering. If a candidate already carries the label, the format is skipped and the disk is only (re)mounted.
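Before answering the prompt, it may help to preview which partition a "largest non-system partition" rule would select on a given Pi. The sketch below approximates that rule from `lsblk` output; it is not the playbook's detection code, and the function name `largest_external_partition` is hypothetical.

```sh
# Read `lsblk -b -n -r -o NAME,SIZE,TYPE` output on stdin and print the largest
# partition that is not on mmcblk0 (the SD card holding the system).
largest_external_partition() {
  awk '$3 == "part" && $1 !~ /^mmcblk0/ && ($2 + 0) > max { max = $2 + 0; name = $1 }
       END { if (name != "") print name }'
}

# Usage on a Pi:
#   lsblk -b -n -r -o NAME,SIZE,TYPE | largest_external_partition
```

If the printed device is not the disk you expect, stop and check what is plugged in before running step 4.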
> [!WARNING]
> **K3s ships with `--disable traefik`.** The bundled Traefik is intentionally turned off in step 7 so step 10 can deploy its own fully-customized `v37.4.0`. If you re-enable the bundled Traefik or run `k3s_config.yml` out of order, two Traefiks will fight over the ingress ports.
> [!WARNING]
> **ARM64 needs the `kube-rbac-proxy` image override.** step-issuer's default `gcr.io/kubebuilder/kube-rbac-proxy:v0.15.0` is AMD64-only and **crash-loops on `pi3` (ARM64)**. `k3s_ssl.yml` overrides it to `quay.io/brancz/kube-rbac-proxy:v0.15.0`. Do not remove this override.
> [!WARNING]
> **Traefik is force-redeployed.** The last play of `k3s_config.yml` deletes the `traefik` Deployment **and** the `helm-install-traefik` Job so the k3s helm-controller re-runs the install against the new manifest. Expect a brief ingress outage during this window; the play then waits for the new Deployment to come back before finishing.
> [!NOTE]
> **`tags: never` plays are opt-in.** `rpi.yml` and `system_docker.yml` carry `tags: never`, so they are skipped unless you explicitly request one of their tags (e.g. `--tags rpi` / `--tags ...`) or pass `--tags never`. Note that `--tags all` does **not** select `never`-tagged plays. The K3s/Longhorn/Traefik plays run on a normal invocation.


@@ -0,0 +1,82 @@
[vibe](../../../README.md) > [Guidebooks](../../README.md) > [Factory provisioning](../README.md) > [Ansible](README.md) > **02 · Setup**
# 02 · Setup — Postgres, Gitea, NFS backup target
> [!NOTE]
> **Status:** ✅ active · **Last Updated:** 2026-06-23
> **Upstream:** [Ansible sub-hub](README.md) · [01 · System](01-system.md)
> **Downstream:** [03 · CI/CD](03-cicd.md)
> **Related:** [Inventory & variables](inventory.md) · [Roles reference](roles.md) · [Storage & recovery](../../lab-ecosystem/storage-and-recovery.md) · [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md)
## What it does
`02 · Setup` deploys the **stateful services the rest of the platform leans on**: a PostgreSQL server and a Gitea instance — both running as **Docker Compose stacks on `pi2`, outside K3s** — plus the in-cluster NFS backup target. The wrapper [`playbooks/02_setup.yml`](../../../../ansible/arcodange/factory/playbooks/02_setup.yml) imports [`playbooks/setup/setup.yml`](../../../../ansible/arcodange/factory/playbooks/setup/setup.yml), which pings the Pis, then imports three sub-playbooks: `backup_nfs.yml` (tagged `never`), `postgres.yml`, and `gitea.yml`.
> [!IMPORTANT]
> **Postgres and Gitea do not run in Kubernetes.** They are Docker Compose stacks on `pi2` (the sole member of the `postgres` group, which `gitea` inherits as a child — see [Inventory & variables](inventory.md)). K3s only references them: Traefik exposes Gitea via an `ExternalName` Service, and the `pg-fix-table-ownership` CronJob reaches Postgres over the LAN. This keeps the two services available even when the cluster is being rebuilt.
## Ordered steps
| # | Sub-playbook | Purpose | Key vars / versions |
| --- | --- | --- | --- |
| 1 | [`setup/backup_nfs.yml`](../../../../ansible/arcodange/factory/playbooks/setup/backup_nfs.yml) | Provision the shared backup volume: a **Longhorn RWX PVC `backups-rwx` (50Gi)**, a Longhorn `RecurringJob`, a `busybox` deploy to spawn the share-manager, then mount the resulting NFS share at `/mnt/backups` on every Pi. | `tags: never`; `backup_size: 50Gi`, RecurringJob `thrice-a-month-backup` (`cron 0 5 */2 * *`, retain 2) |
| 2 | [`setup/postgres.yml`](../../../../ansible/arcodange/factory/playbooks/setup/postgres.yml) | Deploy the Postgres Compose stack (`deploy_docker_compose` + `deploy_postgresql` role), create the `gitea` DB/user, create the **pgbouncer auth_user + `user_lookup()` functions** in both `postgres` and `gitea` DBs, publish the K8s Secret `postgres-admin-credentials`, and install the **`pg-fix-table-ownership` CronJob**. | **Postgres `16.3-alpine`**; container `postgres`; CronJob daily `0 3 * * *` |
| 3 | [`setup/gitea.yml`](../../../../ansible/arcodange/factory/playbooks/setup/gitea.yml) | Deploy the Gitea Compose stack (`deploy_docker_compose` + `deploy_gitea` role), create admin `arcodange`, mint an API token via `gitea_token`, upload the avatar, register the SSH key, create org `arcodange-org`, then **delete the temp token**. | **Gitea `1.25.5`**; base URL `http://pi2:3000` |
## NFS backup target — how the share is born
```mermaid
%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'13px'}}}%%
flowchart TD
classDef cluster fill:#1e4032,stroke:#22c55e,color:#f0fdf4;
classDef host fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb;
pvc["RWX PVC backups-rwx (50Gi)<br>longhorn-system"]:::cluster
rj["RecurringJob thrice-a-month-backup<br>cron 0 5 */2 * *"]:::cluster
dep["busybox Deployment rwx-nfs<br>mounts the PVC"]:::cluster
sm["Longhorn share-manager<br>(spawned by the mount)"]:::cluster
svc["Service nfs-backups-rwx<br>ClusterIP :2049"]:::cluster
mount["mount /mnt/backups on pi1/pi2/pi3<br>NFS vers=4.1"]:::host
pvc --> rj
pvc --> dep --> sm --> svc --> mount
```
1. A **ReadWriteMany Longhorn PVC** (`backups-rwx`, 50Gi) is created in `longhorn-system`.
2. A **`RecurringJob`** is attached to the volume so Longhorn snapshots/backs it up on the `0 5 */2 * *` schedule.
3. A **`busybox` Deployment (`rwx-nfs`)** mounts the PVC — the act of mounting an RWX volume makes Longhorn spawn an **NFS share-manager** pod.
4. A stable **ClusterIP Service** (`nfs-backups-rwx`, port 2049) is created (or reused) to front the share-manager.
5. Each Pi installs `nfs-common` and **mounts the share at `/mnt/backups`** (`vers=4.1`, `nofail`, `x-systemd.automount`), persisted in `fstab`.
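
The mount in step 5 can be pictured as a single `fstab` entry. The sketch below is an illustration only: the mount options come from the text above, while the server address, the export path, and the `nfs` filesystem type are assumptions, and `backup_fstab_line` is a hypothetical helper rather than anything in the playbook.

```sh
# Print an fstab entry for the backup share.
# $1 = NFS server address (assumption: e.g. the ClusterIP of nfs-backups-rwx)
# $2 = export path on that server (assumption: placeholder value)
backup_fstab_line() {
  printf '%s:%s /mnt/backups nfs vers=4.1,nofail,x-systemd.automount 0 0\n' "$1" "$2"
}

# Example values are made up for illustration.
backup_fstab_line 10.43.0.50 /backups-rwx
```

`nofail` keeps a Pi booting when the share is unavailable, and `x-systemd.automount` defers the actual mount until the path is first accessed.
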
## Postgres — what gets created
| Artifact | Where | Purpose |
| --- | --- | --- |
| Compose stack `arcodange_factory` | `pi2` Docker | Runs `postgres:16.3-alpine`, container `postgres`, port `5432`, data under `/home/pi/arcodange/docker_composes/postgres/data`. |
| `gitea` DB + user | inside Postgres | Created by the `deploy_postgresql` role from `applications_databases.gitea` (`gitea_database`). |
| pgbouncer `auth_user` (`pgbouncer_auth`) | `postgres` + `gitea` DBs | Login role used by the [pgbouncer pooler](../../lab-ecosystem/02-tools.md) for SCRAM lookups. |
| `user_lookup(text)` function | `postgres` + `gitea` DBs | `SECURITY DEFINER` function over `pg_shadow`; `EXECUTE` granted only to `pgbouncer_auth`. |
| K8s Secret `postgres-admin-credentials` | `kube-system` | Base64 admin user/password so the in-cluster CronJob can authenticate. |
| CronJob `pg-fix-table-ownership` | `kube-system` | Runs `postgres:16.3` daily at **03:00**; discovers `%_role` roles, derives each DB by stripping `_role`, and re-`ALTER TABLE ... OWNER TO` every public table — repairing ownership after a restore. |
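
The naming rule behind `pg-fix-table-ownership` can be sketched in a few lines of shell. This is not the CronJob's actual script: `db_for_role` and `reown_statements` are hypothetical helpers, and the table names in the demo are placeholders. It only shows how a `<db>_role` name maps to a database and what the resulting statements could look like.

```sh
# Derive the database name from a role such as "gitea_role" by stripping the suffix.
db_for_role() {
  printf '%s\n' "${1%_role}"
}

# Emit one ALTER TABLE statement per table name read from stdin.
# $1 = role that should own the tables.
reown_statements() {
  while IFS= read -r table; do
    [ -n "$table" ] || continue
    printf 'ALTER TABLE public."%s" OWNER TO "%s";\n' "$table" "$1"
  done
}

role=gitea_role
printf 'Database for %s: %s\n' "$role" "$(db_for_role "$role")"
printf 'repository\nissue\n' | reown_statements "$role"
```

A role that does not end in `_role` passes through `db_for_role` unchanged, which mirrors why databases outside the naming convention are not repaired.
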
## Gitea — bootstrap sequence
1. **Compose deploy** via `deploy_docker_compose`, then the `deploy_gitea` role wires Gitea to the Postgres DB (host/db/user/password pulled from the compose env).
2. **Admin user** `arcodange` (`arcodange@gmail.com`) is created with `--random-password --admin` if absent.
3. **API token** is minted by the `gitea_token` role and used for the next HTTP calls.
4. **Avatar** upload, **SSH public key** registration (idempotent), and **org `arcodange-org`** (full name "Arcodange") creation + avatar.
5. **Cleanup** — a `post_tasks` invocation of `gitea_token` with `gitea_token_delete: true` removes the temporary token.
## Gotchas
> [!WARNING]
> **The NFS play is `never`-tagged and order-sensitive.** `backup_nfs.yml` only runs when explicitly tagged, and several of its tasks (`Créer PVC RWX`, `Lancer un Deployment pour déclencher NFS`, `Attendre que le pod rwx-nfs soit Running`) are themselves `tags: never`. The RWX volume must already exist for the busybox deploy to spawn the share-manager; running the mount step before the share-manager is `Running` will hang on the `until` retry loop.
> [!WARNING]
> **Postgres lives on `pi2` outside K3s.** Treat it as a single-host service: there is no Postgres pod to `kubectl get`. The cluster only sees the `postgres-admin-credentials` Secret and the `pg-fix-table-ownership` CronJob, both of which reach the DB over the LAN at `pi2:5432`. A `pi2` outage takes Postgres (and Gitea) down regardless of cluster health.
> [!CAUTION]
> **`pg-fix-table-ownership` exists because restores break ownership.** After a Longhorn/data recovery, tables can come back owned by the wrong role and apps lose write access. The daily CronJob silently re-owns every `public` table to the `<db>_role` matching each `%_role` PostgreSQL role. If you add a database whose owning role does **not** follow the `<db>_role` naming convention, this job will not fix it — see [Naming conventions](../../lab-ecosystem/naming-conventions.md).
> [!NOTE]
> **The admin password is random and printed once.** Gitea's admin is created with `--random-password`; capture it from the play output (or reset it via `docker exec`) — it is not stored in the inventory. The bootstrap API token is deliberately deleted at the end, so re-running the play re-mints a fresh one.


@@ -0,0 +1,34 @@
[vibe](../../../README.md) > [Guidebooks](../../README.md) > [Factory provisioning](../README.md) > [Ansible](README.md) > **03 · CI/CD**
# 03 · CI/CD — Gitea Actions runners
> [!NOTE]
> **Status:** ✅ active · **Last Updated:** 2026-06-23
> **Upstream:** [Ansible sub-hub](README.md) · [02 · Setup](02-setup.md)
> **Downstream:** [04 · Tools](04-tools.md)
> **Related:** [Lab ecosystem · 01 factory (ArgoCD caveat)](../../lab-ecosystem/01-factory.md) · [Roles reference](roles.md) · [Inventory & variables](inventory.md)
## What it does
`03 · CI/CD` registers and deploys the **Gitea Actions runner (`act_runner`)** on every Pi that is *not* the Gitea host, so CI jobs have executors. The whole stage is one playbook, [`playbooks/03_cicd.yml`](../../../../ansible/arcodange/factory/playbooks/03_cicd.yml) — there is no stage subdirectory.
It targets `raspberries:&local:!gitea`, i.e. the raspberries that are local **minus** the `gitea` group. Since `gitea` resolves to `pi2`, the runner lands on **`pi1` and `pi3`** (see [Inventory & variables](inventory.md)).
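
The host pattern reads as set arithmetic: intersect `raspberries` with `local`, then subtract `gitea`. The sketch below replays that logic in plain shell. The group memberships are assumptions chosen to match the result described above; the real lists live in the inventory.

```sh
# Assumed group membership, for illustration only.
raspberries="pi1 pi2 pi3"
local_group="pi1 pi2 pi3"
gitea="pi2"

# Succeeds when host $1 appears in the space-separated list $2.
in_group() {
  case " $2 " in
    *" $1 "*) return 0 ;;
    *) return 1 ;;
  esac
}

# raspberries:&local:!gitea => in raspberries AND in local AND NOT in gitea
runner_hosts() {
  for host in $raspberries; do
    if in_group "$host" "$local_group" && ! in_group "$host" "$gitea"; then
      printf '%s\n' "$host"
    fi
  done
}

runner_hosts
```

With Ansible itself, `ansible-playbook --list-hosts` on the playbook reports the same resolution against the real inventory.
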
## Steps
| # | Task / role | Purpose | Key detail |
| --- | --- | --- | --- |
| 1 | role `arcodange.factory.gitea_token` | Mint a `gitea_api_token` for later API use. | Reused across the collection (see [Roles reference](roles.md)). |
| 2 | `gitea actions generate-runner-token` (delegated to the Gitea host) | Fetch a **runner registration token** by `docker exec`-ing into the `gitea` container. | `delegate_to: groups.gitea[0]` |
| 3 | role `arcodange.factory.deploy_docker_compose` | Render the `act_runner` Compose stack with the registration token, instance URL, runner name, and labels. | image `gitea/act_runner:latest`; labels point at `runner-images:ubuntu-latest-ca` |
| 4 | `community.docker.docker_compose_v2` (down→up loop) | Apply the stack: a `loop: [absent, present]` recreates the runner so token/label changes take effect. | cache dirs under `/mnt/arcodange/gitea-runner-*` |
The runner registers with `GITEA_INSTANCE_URL: http://<gitea-host>:3000`, names itself `arcodange_global_runner_<host>`, and advertises the **`ubuntu-latest` / `ubuntu-latest-ca`** labels — both mapped to the CA-trusting image built back in [01 · System](01-system.md). It mounts the Docker socket and the host CA store (`/etc/ssl/certs`, `/usr/local/share/ca-certificates`) so jobs trust internal TLS, and runs with `insecure: true` against the Gitea TLS endpoint.
## Gotchas
> [!WARNING]
> **ArgoCD is present in design but not deployed.** The factory pipeline intends `03_cicd` to also bring up ArgoCD (the app-of-apps), but **that step is commented out / not currently deployed in-cluster** — this stage only deploys the Gitea runners. Treat ArgoCD as "designed, not live" until the install is enabled. See the [ArgoCD caveat in lab-ecosystem · 01 factory](../../lab-ecosystem/01-factory.md).
> [!WARNING]
> **The registration token is single-use and host-delegated.** Step 2 generates a fresh token every run via the Gitea container, so the runner re-registers on each apply. If the Gitea host (`pi2`) is down, token generation fails and no runner can register.


@@ -0,0 +1,125 @@
[vibe](../../../README.md) > [Guidebooks](../../README.md) > [Factory provisioning](../README.md) > [Ansible](README.md) > **04 · Tools**
# 04 · Tools — Vault + CrowdSec
> [!NOTE]
> **Status:** ✅ active · **Last Updated:** 2026-06-23
> **Upstream:** [Ansible sub-hub](README.md) · [Factory provisioning hub](../README.md)
> **Downstream:** [Roles reference](roles.md) — deep mechanics of the `hashicorp_vault` and `crowdsec` roles
> **Related:** [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) · [05 · Backup](05-backup.md) · [03 · CI/CD](03-cicd.md) · [ADR-0001 safe prod-like environment](../../../ADR/0001-safe-prod-like-environment.md)
Stage 4 installs the **operational tooling layer** on top of a running cluster: HashiCorp **Vault** (the lab's single secret store) and **CrowdSec** (the WAF/IPS that fronts Traefik). The entry point [`playbooks/04_tools.yml`](../../../../ansible/arcodange/factory/playbooks/04_tools.yml) is a one-line wrapper that imports [`playbooks/tools/tools.yml`](../../../../ansible/arcodange/factory/playbooks/tools/tools.yml), which in turn chains two sub-playbooks — `hashicorp_vault.yml` then `crowdsec.yml`. Both run against `localhost` (they drive the cluster through `kubectl` / `kubernetes.core`, not over SSH to the Pis).
> [!IMPORTANT]
> Vault is the chokepoint of the whole secret model. This page covers **what the playbook orchestrates**; the byte-level role internals (init, unseal, root-token minting, the OpenTofu OIDC backend) live in the [Roles reference](roles.md). Read [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) first for the conceptual model — the two auth backends, the unseal posture, and why there is no secret material in git.
---
## What stage 4 deploys
| Sub-playbook | File | Builds | Role invoked |
| --- | --- | --- | --- |
| Vault | [`tools/hashicorp_vault.yml`](../../../../ansible/arcodange/factory/playbooks/tools/hashicorp_vault.yml) | Initialises + unseals Vault, wires the Gitea OIDC/JWT auth backends via OpenTofu, publishes the `vault_oauth__sh_b64` Gitea Action secret | `hashicorp_vault` |
| CrowdSec | [`tools/crowdsec.yml`](../../../../ansible/arcodange/factory/playbooks/tools/crowdsec.yml) | A `VaultAuth` + `VaultStaticSecret` for the Turnstile captcha keys, a fresh bouncer API key, and the Traefik `crowdsec` middleware | `crowdsec` |
---
## Step 1 — `hashicorp_vault.yml`
### The credential prompt
The play opens with a single `vars_prompt` for the **Gitea admin password** (`gitea_admin_password`, marked `unsafe: true` because the password may contain template-hostile characters like `{`, which Ansible would otherwise try to interpret as Jinja). This is the only interactive input the stage needs; everything else is derived or minted on the fly.
### Orchestration flow
```mermaid
%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'14px'}}}%%
flowchart TD
classDef prompt fill:#5f4a1e,stroke:#d97706,color:#fffbeb;
classDef mint fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb;
classDef vault fill:#4c1d95,stroke:#7c3aed,color:#f5f3ff;
classDef revoke fill:#5f1e1e,stroke:#ef4444,color:#fef2f2;
P["vars_prompt:<br/>gitea_admin_password"]:::prompt
T["Mint temp GITEA_ADMIN_TOKEN<br/>(role gitea_token, replace=true)"]:::mint
R["Run hashicorp_vault role:<br/>init · unseal · OIDC backend · gitea secret"]:::vault
D["post_tasks:<br/>delete GITEA_ADMIN_TOKEN"]:::revoke
P --> T --> R --> D
```
1. **Mint a temporary token.** The `arcodange.factory.gitea_token` role generates a `GITEA_ADMIN_TOKEN` with scopes `write:admin,write:organization,write:repository,write:user` (and `gitea_token_replace: true`, so any stale token of the same name is rotated). It is stashed in the fact `vault_GITEA_ADMIN_TOKEN`.
2. **Run the `hashicorp_vault` role.** Invoked with three vars: the Postgres admin credentials (read straight out of the Postgres host's docker-compose `environment` via `hostvars[groups.postgres[0]]`), the `gitea_admin_token` (= the temp token), and the prompted `gitea_admin_password`. The role does the heavy lifting, as described below.
3. **Revoke the temporary token.** A `post_tasks` block re-invokes `gitea_token` with `gitea_token_delete: true`, so the admin token never outlives the run.
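
The mint, run, revoke sequence above is a "cleanup always runs" pattern. A rough shell analogue, assuming nothing about the real role, uses an `EXIT` trap inside a subshell. The three functions are hypothetical stand-ins that only log their name; `ROLE_RC` is an illustrative knob to simulate a failing role.

```sh
LOG=$(mktemp)   # records the order in which the stand-ins ran

# Hypothetical stand-ins for the real tasks; they only log their name.
mint_token()   { echo mint   >> "$LOG"; }
run_role()     { echo role   >> "$LOG"; return "${ROLE_RC:-0}"; }
revoke_token() { echo revoke >> "$LOG"; }

# The parenthesised body runs in a subshell, so the EXIT trap fires when
# that subshell ends, whether run_role succeeded or failed.
with_temp_token() (
  mint_token
  trap revoke_token EXIT
  run_role
)
```

Calling `with_temp_token` with `ROLE_RC=1` still leaves `revoke` as the last entry in `$LOG`, and the function returns the failing status so the caller can react to it.
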
### What the `hashicorp_vault` role does
The role's [`tasks/main.yml`](../../../../ansible/arcodange/factory/playbooks/tools/roles/hashicorp_vault/tasks/main.yml) runs a fixed sequence; the OIDC backend setup is wrapped in a `block`/`always` so the freshly minted **root token is always revoked**, even on failure:
| Phase | Task file | What happens |
| --- | --- | --- |
| **Init** | `init.yml` | First-time only. Checks `vault operator init -status`; if uninitialised, runs `vault operator init` with **1 key share / threshold 1** and writes the keys to `~/.arcodange/cluster-keys.json` (mode `600`). Idempotent on re-run. |
| **Unseal** | `unseal.yml` | Reads `cluster-keys.json` and runs `vault operator unseal` on every server pod. Required on **every reboot** — Vault always restarts sealed. |
| **Root token** | `new_root_token.yml` | Mints a one-shot root token via the `generate-root` OTP/nonce dance (using the unseal key), needed to authenticate the OpenTofu apply. |
| **OIDC backend** | `gitea_oidc_auth.yml` | Drives a Playwright script to register/read the Gitea OAuth app, then runs **OpenTofu in a throwaway Docker volume** to provision the `gitea` (OIDC) + `gitea_jwt` (JWT) auth backends, the admin identity, and the `kvv1` static secrets. Finally writes the `vault_oauth__sh_b64` script to Gitea Actions secrets. |
| **Revoke** | `revoke_token.yml` (in `always`) | Revokes the root token unconditionally. |
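
Because both init and unseal hinge on the key file, a pre-flight check before running the stage can save a failed run. The helper below is a hypothetical sketch, not part of the role: it only verifies that a file exists, is non-empty, and has exactly the mode the init phase is described as writing.

```sh
# Succeeds only if the key file exists, is non-empty, and has mode 600.
# $1 = path to the key file (per the table above: ~/.arcodange/cluster-keys.json)
check_unseal_keys() {
  [ -s "$1" ] || { echo "missing or empty: $1" >&2; return 1; }
  # find prints the path only when the permission bits are exactly 600.
  [ -n "$(find "$1" -perm 600 2>/dev/null)" ] || {
    echo "unexpected permissions on: $1" >&2
    return 1
  }
}
```

A typical call would be `check_unseal_keys "$HOME/.arcodange/cluster-keys.json"` on the control node before starting stage 4.
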
> [!IMPORTANT]
> The OpenTofu apply runs the [`hashicorp_vault.tf`](../../../../ansible/arcodange/factory/playbooks/tools/roles/hashicorp_vault/files/hashicorp_vault.tf) inside an ephemeral Docker volume (`docker volume create` → `tofu init` + `tofu apply` → `docker volume rm`), with the state in a GCS backend (`gs://arcodange-tf`, prefix `tools/hashicorp_vault/gitea_oidc`). The CA is mounted read-only via `VAULT_CACERT`. The destroy step is commented out by design — this provisions, it does not tear down.
### The `vault_oauth__sh_b64` Gitea secret
The last act of the role renders [`oidc_jwt_token.sh.j2`](../../../../ansible/arcodange/factory/playbooks/tools/roles/hashicorp_vault/templates/oidc_jwt_token.sh.j2) (an OIDC authorization-code → access-token helper for CI), base64-encodes it, and publishes it as the **org-level** Gitea Action secret `vault_oauth__sh_b64`. Because Gitea Action secrets are scoped per owner, the role then **re-publishes the identical secret to each user-owned namespace** listed in `gitea_secret_propagation_users` — repos under a personal account cannot read org-level secrets. This is what lets a Gitea Actions workflow obtain the OIDC JWT that authenticates to Vault under the `gitea_cicd_<app>` role (the CI half of the [secret model](../../lab-ecosystem/secrets-and-vault.md)).
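
The `_b64` suffix implies a simple round trip: the script is base64-encoded before it is stored, and a consumer decodes it before running it. The sketch below illustrates that round trip with a dummy script. How the real workflows decode and invoke the secret is an assumption here, and `encode_script` / `decode_script` are hypothetical helpers.

```sh
# Encode a script as single-line base64, a convenient shape for a secret value.
encode_script() {
  base64 < "$1" | tr -d '\n'
}

# Decode the secret value back into script text.
decode_script() {
  printf '%s' "$1" | base64 -d
}

# Dummy script standing in for the rendered template.
script=$(mktemp)
printf '#!/bin/sh\necho "hello from the helper"\n' > "$script"

secret=$(encode_script "$script")
decode_script "$secret" | sh
```

Encoding keeps quotes, newlines, and other shell-sensitive characters intact while the script travels through a secret store.
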
> [!CAUTION]
> The role has an **off-by-default** `vault_oidc_force_reset` flag. When set, it runs `vault auth disable gitea` **and** `gitea_jwt` before re-applying — which **wipes every `gitea_cicd_<app>` per-app JWT role** created by the tools-repo IaC. Leave it `false` unless you are deliberately rebuilding the OIDC backend from scratch (e.g. `bound_issuer` config drift).
---
## Step 2 — `crowdsec.yml`
The CrowdSec sub-playbook is a thin wrapper that runs the `crowdsec` role to bolt a CrowdSec-bouncer middleware onto Traefik. The role's [`tasks/main.yml`](../../../../ansible/arcodange/factory/playbooks/tools/roles/crowdsec/tasks/main.yml) wires three things together.
| Step | What it creates | Detail |
| --- | --- | --- |
| **Turnstile secret** | `ServiceAccount` + `VaultAuth` + `VaultStaticSecret` in `kube-system` | Authenticates via the Kubernetes auth backend (role `factory_crowdsec_conf`) and pulls the Cloudflare Turnstile keys from `kvv2` path `cms/factory/turnstile` into a K8s Secret (`refreshAfter: 30s`). |
| **Bouncer key** | A CrowdSec LAPI bouncer named `traefik-plugin` | Runs `cscli bouncers add traefik-plugin` inside the LAPI pod; on collision it deletes and re-adds, so the run is repeatable. |
| **Traefik middleware** | A `traefik.io/v1alpha1` `Middleware` named `crowdsec` | Stream mode, captcha provider `turnstile` (site/secret keys from the Turnstile secret), Redis cache, trusted-IP allow-lists. |
After applying the middleware the role **cleans up `Failed` CrowdSec pods** and **bounces Traefik** (scale to 0 → back to 1, inside a `block`/`rescue`/`always` that guarantees Traefik returns to 1 replica no matter what) so the new middleware config is loaded.
> [!NOTE]
> The Turnstile keys come from the **CMS-managed** Vault path `cms/factory/turnstile` — they are provisioned outside this stage. CrowdSec only *reads* them here. See [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) for how `VaultStaticSecret` materialises a Vault path into a Kubernetes Secret.
---
## Gotchas
> [!WARNING]
> - **Vault must be unsealed before anything secret-dependent recovers.** Stage 4's unseal step reads `~/.arcodange/cluster-keys.json`; if that file is missing on an already-initialised Vault, unseal cannot proceed and the OpenTofu apply (which needs a live Vault) fails. The same file gates step 2 of the [power-cut recovery order](../../lab-ecosystem/storage-and-recovery.md).
> - **Docker is required on the control node.** The OIDC backend provisioning shells out to `docker run … opentofu` and `docker volume`. The Playwright step also runs containerised. A control node without Docker will fail this stage.
> - **`gitea_admin_password` is `unsafe`.** Do not strip the `unsafe: true` flag from the prompt — passwords with `{`/`}` are mangled by Jinja templating otherwise.
> - **Re-running is safe by default.** Init and unseal are idempotent; the temp admin token and root token are both revoked on the way out. Only `vault_oidc_force_reset` makes a re-run destructive.
> - **CrowdSec bounces Traefik.** The middleware step briefly scales Traefik to 0 — expect a short ingress blip during stage 4. The `always` block restores it to 1 even if the scale-down errors.
---
## Where stage 4 sits
```mermaid
%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'14px'}}}%%
flowchart LR
classDef done fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb;
classDef here fill:#4c1d95,stroke:#7c3aed,color:#f5f3ff;
classDef next fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb;
s03["03 · CI/CD"]:::done
s04["04 · Tools<br/>Vault · CrowdSec"]:::here
s05["05 · Backup"]:::next
s03 --> s04 --> s05
```
1. **03 · CI/CD** registered the `act_runner` executors — a prerequisite, since the `vault_oauth__sh_b64` secret published here is consumed by those CI runners.
2. **04 · Tools** (this page) stands up Vault and CrowdSec.
3. **05 · Backup** is next — it schedules the cron dumps that protect the state the cluster now holds.


@@ -0,0 +1,107 @@
[vibe](../../../README.md) > [Guidebooks](../../README.md) > [Factory provisioning](../README.md) > [Ansible](README.md) > **05 · Backup**
# 05 · Backup — daily cron dumps
> [!NOTE]
> **Status:** ✅ active · **Last Updated:** 2026-06-23
> **Upstream:** [Ansible sub-hub](README.md) · [Factory provisioning hub](../README.md)
> **Downstream:** [06 · Recover](06-recover.md) — how these dumps are replayed
> **Related:** [Storage & recovery](../../lab-ecosystem/storage-and-recovery.md) · [04 · Tools](04-tools.md) · [ADR-0001 safe prod-like environment](../../../ADR/0001-safe-prod-like-environment.md)
Stage 5 installs three independent **cron-driven backup jobs** that protect the platform's persistent state: the PostgreSQL database, the Gitea instance, and the K3s volume metadata (PV/PVC + Longhorn CRDs). The entry point [`playbooks/05_backup.yml`](../../../../ansible/arcodange/factory/playbooks/05_backup.yml) imports [`playbooks/backup/backup.yml`](../../../../ansible/arcodange/factory/playbooks/backup/backup.yml), which chains the three sub-playbooks, each passing `backup_root_dir: /mnt/backups`.
Every job follows the **same anatomy**: run a daily cron at **04:00**, write a date-stamped archive to `/mnt/backups/<kind>/`, prune anything older than **3 days**, and drop a matching `restore.sh` next to the backup script. `/mnt/backups` is a Longhorn RWX volume, so Longhorn itself snapshots, replicates, and ships these archives off-site — the cron jobs only produce the dumps.
> [!NOTE]
> All three sub-playbooks **install** scripts and cron entries; they do not run a backup themselves (beyond a one-shot `test backup_cmd` smoke check that pipes to `/dev/null`). The actual backups fire from cron. To read failures, SSH to the host and use `sudo su` → `mails` (see [`backup/README.md`](../../../../ansible/arcodange/factory/playbooks/backup/README.md)).
---
## The three jobs
| Job | Sub-playbook | Host | Backup command | Artifact | Scripts dir |
| --- | --- | --- | --- | --- | --- |
| **Postgres** | [`backup/postgres.yml`](../../../../ansible/arcodange/factory/playbooks/backup/postgres.yml) | `postgres` | `docker exec <pg> pg_dumpall -U <user> \| gzip` | `backup_YYYYMMDD.sql.gz` | `…/docker_composes/postgres/scripts` |
| **Gitea** | [`backup/gitea.yml`](../../../../ansible/arcodange/factory/playbooks/backup/gitea.yml) | `gitea` | `docker exec -u git <gitea> gitea dump --skip-log --skip-db --skip-package-data --type tar.gz` | `backup_YYYYMMDD.gitea.gz` | `…/docker_composes/gitea/scripts` |
| **K3s PVC** | [`backup/k3s_pvc.yml`](../../../../ansible/arcodange/factory/playbooks/backup/k3s_pvc.yml) | `pi1` | `kubectl get pv,pvc` + `volumes.longhorn.io` + `settings.longhorn.io` (YAML) | `backup_YYYYMMDD.volumes` | `/opt/k3s_volumes` |
All three share: `keep_days: 3`, cron `minute: 0 hour: 4 user: root`, and `backup_dir: /mnt/backups/<kind>`.
```mermaid
%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'14px'}}}%%
flowchart TD
classDef cron fill:#5f4a1e,stroke:#d97706,color:#fffbeb;
classDef job fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb;
classDef store fill:#4c1d95,stroke:#7c3aed,color:#f5f3ff;
classDef ship fill:#14532d,stroke:#22c55e,color:#f0fdf4;
C["cron · daily 04:00 · user root"]:::cron
PG["postgres.yml<br/>pg_dumpall | gzip"]:::job
GT["gitea.yml<br/>gitea dump tar.gz"]:::job
PV["k3s_pvc.yml<br/>PV · PVC · Longhorn CRDs"]:::job
D["/mnt/backups/{postgres,gitea,k3s_pvc}/<br/>keep 3 days"]:::store
L["Longhorn:<br/>snapshot · replicate · off-site"]:::ship
C --> PG --> D
C --> GT --> D
C --> PV --> D
D --> L
```
1. Each job has its own daily **04:00 root cron** entry, which triggers that job's `backup.sh`.
2. **postgres.yml** runs `pg_dumpall` through `gzip`, **gitea.yml** streams a `gitea dump` tarball, **k3s_pvc.yml** serialises the volume metadata.
3. Each writes a date-stamped archive into `/mnt/backups/<kind>/` and prunes files older than 3 days (`find … -mtime +3 -delete`).
4. Because `/mnt/backups` is a Longhorn RWX volume, Longhorn snapshots, replicates across nodes, and ships an off-site copy — no separate upload step in the cron.
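
The shared anatomy (date-stamped archive, then prune) fits in one small function. The sketch below is not the generated `backup.sh`: `backup_once` is a hypothetical helper that takes the dump on stdin so it can be tried without a database. The demo uses a throwaway directory and a fake one-line dump.

```sh
# Read a dump on stdin, store it date-stamped and gzipped, then prune old archives.
# $1 = backup directory, $2 = retention in days
backup_once() {
  dir=$1
  keep_days=$2
  mkdir -p "$dir" || return 1
  gzip -c > "$dir/backup_$(date +%Y%m%d).sql.gz" || return 1
  # -mtime +N matches files whose age, in whole days, is greater than N.
  find "$dir" -type f -name 'backup_*' -mtime +"$keep_days" -delete
}

# Demo with a temporary directory instead of /mnt/backups.
demo_dir=$(mktemp -d)
echo 'SELECT 1;' | backup_once "$demo_dir" 3
ls "$demo_dir"
```
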
---
## Job details
### Postgres — `postgres.yml`
The backup command is built from the Postgres host's docker-compose facts (`container_name`, `POSTGRES_USER`). `pg_dumpall` captures **all databases plus globals (roles)** in one logical dump, gzipped. The generated `restore.sh` takes an optional `YYYYMMDD` argument (defaults to the latest dump), `docker cp`s it into the container, gunzips, and replays with `psql -f`. If the restore misbehaves, the script reminds you to wipe the data dir before replaying.
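
The "optional date, else latest" selection can be sketched as follows. How the generated `restore.sh` actually picks the file is not shown in this guide, so `pick_dump` is a hypothetical helper that relies on one assumption: `YYYYMMDD` file names sort chronologically.

```sh
# Print the path of the dump to restore.
# $1 = backup directory, $2 = optional YYYYMMDD; without it the newest name wins.
pick_dump() {
  if [ -n "${2:-}" ]; then
    candidate="$1/backup_$2.sql.gz"
  else
    # YYYYMMDD names sort chronologically, so the last one is the latest.
    candidate=$(ls "$1"/backup_*.sql.gz 2>/dev/null | sort | tail -n 1)
  fi
  [ -n "$candidate" ] && [ -f "$candidate" ] || return 1
  printf '%s\n' "$candidate"
}
```

`pick_dump /mnt/backups/postgres` would print the newest archive, while `pick_dump /mnt/backups/postgres 20260620` targets one specific day and fails if that file is absent.
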
### Gitea — `gitea.yml`
The dump runs as the `git` user with `--skip-db` (Postgres is backed up separately by the Postgres job) and `--skip-package-data`, streamed to stdout (`-f -`) so it never lands on the container's own disk. The `restore.sh` unpacks the tarball back into `/data/gitea` (config/data) and `/data/git/repositories` (repos), fixes `git:git` ownership, and **regenerates hooks** (`gitea admin regenerate hooks`) — without that step the restored repos have stale hook paths.
### K3s PVC — `k3s_pvc.yml`
This job does **not** back up volume *data* (Longhorn handles the bytes). It backs up the **Kubernetes objects** needed to re-bind those volumes: all `pv` + `pvc`, the **`volumes.longhorn.io` CRDs**, and `settings.longhorn.io`, concatenated into one `.volumes` YAML (`---`-separated). It writes the dump to both `/mnt/backups/k3s_pvc/` *and* a copy alongside the script. The `restore.sh` prefers a fallback dir (`/home/pi/arcodange/backups/k3s_pvc`) then the primary, picks the latest (or a dated) dump, and `kubectl apply`s it.
> [!IMPORTANT]
> **Backing up the Longhorn `volumes.longhorn.io` CRDs is what enables *fast* recovery.** With the Volume CRDs in the backup, recovery is a single `kubectl apply` that re-associates the surviving on-disk replicas with their PVs (see [06 · Recover → `longhorn.yml`](06-recover.md)). **Without** the Volume CRDs, a Longhorn reinstall assigns **new engine IDs**, cannot adopt the orphaned replica directories, and you fall through to the slow **block-device data recovery** (`longhorn_data.yml`). The k3s_pvc backup_cmd carries an inline comment to this effect and points at the [Longhorn PVC recovery ADR](../../../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md). This is the prevention half of the [storage failure mode](../../lab-ecosystem/storage-and-recovery.md).
---
## Gotchas
> [!WARNING]
> - **3-day retention is tight.** A failure that goes unnoticed for 3 days loses all recoverable history. The off-site Longhorn copy is the longer-horizon safety net — the local `/mnt/backups` files are short-lived.
> - **The smoke test runs the real dump.** Each play has a `test backup_cmd` task that executes the backup command (output discarded) at provisioning time. If Postgres/Gitea/kubectl is unreachable when you run stage 5, provisioning fails fast — by design.
> - **Cron runs as `root`, scripts live in app dirs.** The `backup.sh`/`restore.sh` are written into the app's docker-compose `scripts/` dir (or `/opt/k3s_volumes`); the cron job invokes them as root. Don't relocate the compose dirs without re-running stage 5.
> - **Gitea restore needs the hook regeneration.** Skipping `gitea admin regenerate hooks` leaves repos with broken push hooks — the `restore.sh` already does it, so use the script rather than a manual untar.
> - **Postgres and Gitea DB are backed up by *different* jobs.** Gitea dumps with `--skip-db`; its database rows come from the Postgres `pg_dumpall`. Restoring Gitea fully means restoring **both** archives.
---
## Where stage 5 sits
```mermaid
%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'14px'}}}%%
flowchart LR
classDef done fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb;
classDef here fill:#4c1d95,stroke:#7c3aed,color:#f5f3ff;
classDef rec fill:#5f1e1e,stroke:#ef4444,color:#fef2f2;
s04["04 · Tools"]:::done
s05["05 · Backup<br/>Postgres · Gitea · K3s PVC"]:::here
rec["recover/*<br/>(on disaster)"]:::rec
s04 --> s05
s05 -. "feeds restore" .-> rec
```
1. **04 · Tools** stood up Vault and CrowdSec — the secret store stage 5's dumps help protect.
2. **05 · Backup** (this page) is the last linear stage: it schedules the daily dumps.
3. The artifacts here are the **input** to the on-demand [06 · Recover](06-recover.md) branch — the `.volumes` dump in particular gates whether recovery is fast (CRDs present) or slow (block-device).


@@ -0,0 +1,149 @@
[vibe](../../../README.md) > [Guidebooks](../../README.md) > [Factory provisioning](../README.md) > [Ansible](README.md) > **06 · Recover**
# 06 · Recover — Longhorn disaster recovery
> [!NOTE]
> **Status:** 🟡 beta · **Last Updated:** 2026-06-23
> **Upstream:** [Ansible sub-hub](README.md) · [Factory provisioning hub](../README.md)
> **Downstream:** [05 · Backup](05-backup.md) — the dumps these playbooks consume
> **Related:** [Storage & recovery](../../lab-ecosystem/storage-and-recovery.md) · [PRD — QA strategy](../../../PRD/safe-prod-like-environment/qa-strategy.md) · [Longhorn PVC recovery ADR](../../../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md)
The `recover/` playbooks are **not** part of the linear `01..05` pipeline — they are an **on-demand disaster-recovery branch**, invoked only after a power cut or data loss. There are two, and which one you run depends on a single question: **do the Longhorn Volume CRDs still exist?**
> [!IMPORTANT]
> **Decision — pick the right playbook before you start:**
> - **Volume CRDs still present** (e.g. they were captured by the [05 · Backup k3s_pvc dump](05-backup.md), or never wiped) → run [`recover/longhorn.yml`](../../../../ansible/arcodange/factory/playbooks/recover/longhorn.yml). Fast: it re-applies the CRDs and the surviving on-disk replicas are re-adopted.
> - **Volume CRDs are GONE** (a nuclear Longhorn reinstall assigned new engine IDs) but the raw replica `.img` files survive on disk → run [`recover/longhorn_data.yml`](../../../../ansible/arcodange/factory/playbooks/recover/longhorn_data.yml). Slow: it merges replica layers at the block-device level and injects the data into a fresh volume.
```mermaid
%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'14px'}}}%%
flowchart TD
classDef q fill:#5f4a1e,stroke:#d97706,color:#fffbeb;
classDef fast fill:#14532d,stroke:#22c55e,color:#f0fdf4;
classDef slow fill:#5f1e1e,stroke:#ef4444,color:#fef2f2;
classDef dead fill:#6b7280,stroke:#4b5563,color:#fff;
Q{"Do the Longhorn<br/>Volume CRDs<br/>still exist?"}:::q
F["longhorn.yml<br/>CSI/CRD recovery (fast)"]:::fast
S{"Raw replica<br/>.img files<br/>survive?"}:::q
D["longhorn_data.yml<br/>block-device recovery (slow)"]:::slow
X["Data unrecoverable<br/>(replicas zeroed)"]:::dead
Q -- "yes" --> F
Q -- "no" --> S
S -- "yes" --> D
S -- "no" --> X
```
1. **CRDs present?** Yes → `longhorn.yml` re-applies the Volume CRDs and the on-disk replicas re-attach. Done fast.
2. **CRDs gone?** Then ask whether the raw replica `.img` files survived on disk.
3. **Replicas survive?** Yes → `longhorn_data.yml` reconstructs the filesystem at the block level and injects it into a new volume.
4. **Replicas zeroed** by Longhorn reconciliation → the data is unrecoverable; there is no playbook for this.
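
The decision tree reduces to two yes/no answers. A minimal sketch, using a hypothetical `choose_recovery` helper that takes those answers as arguments rather than probing the cluster itself:

```sh
# Print the recovery path for the two yes/no answers from the decision tree.
# $1 = "yes" if the Longhorn Volume CRDs still exist
# $2 = "yes" if the raw replica .img files survive on disk
choose_recovery() {
  if [ "$1" = yes ]; then
    echo "playbooks/recover/longhorn.yml"
  elif [ "$2" = yes ]; then
    echo "playbooks/recover/longhorn_data.yml"
  else
    echo "unrecoverable"
    return 1
  fi
}

choose_recovery no yes
```

One way to answer the first question, assuming `kubectl` access to the cluster, is `kubectl -n longhorn-system get volumes.longhorn.io`.
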
> [!NOTE]
> This branch sits at step 1 of the broader tested startup order — **Longhorn first, then Vault unseal, then VSO re-auth, ERP scaled up last**. The full order, the engine-ID failure mode, and the once-real-once-rehearsed history are in [Storage & recovery](../../lab-ecosystem/storage-and-recovery.md). The single tested-recovery record (1-key/threshold-1 unseal, the four-step order) lives in CLUSTER_RECOVERY.md, kept at the lab root outside this repo.
---
## `longhorn.yml` — CSI/CRD recovery (CRDs present)
Runs against `raspberries:&local` as root. It diagnoses how broken Longhorn is and applies the **least invasive** fix that works, escalating only if needed. Most logic runs `run_once` on `pi1`, delegating cluster reads to `localhost`.
| Phase | What it does |
| --- | --- |
| **0 · Pre-flight** | Verifies the data dir `/mnt/arcodange/longhorn` exists on `pi1` (fails hard if missing) and that at least one `backup_*.volumes` dump exists in the primary or fallback backup dir. |
| **1 · Diagnosis** | Checks the `longhorn-system` namespace, the `driver.longhorn.io` **CSIDriver** registration, and the `longhorn-manager` pods, then sets `recovery_phase` = `soft` (CSI driver gone), `hard` (managers unhealthy), or `none`. |
| **2 · Soft** | Touches `longhorn-install.yaml` to make k3s reconcile the HelmChart, waits, and checks pods recreate. |
| **3 · Hard** | Force-deletes the `longhorn-driver-deployer` pods so the HelmChart recreates them. |
| **4 · Nuclear** | Full reinstall: delete the HelmChart, strip finalizers off all Longhorn CRs / PVCs / the namespace, delete + redeploy the `longhorn-install` HelmChart manifest (`v1.9.1`, `defaultDataPath` preserved), wait for pods. |
| **5 · Restore** | Waits for managers to be ready, then `kubectl apply`s the latest `backup_*.volumes` dump (PV/PVC + Longhorn CRDs) and any `longhorn_metadata_*.yaml`. |
| **6 · Verify** | Polls until the CSIDriver is registered, ≥3 managers are Running, the CSI socket exists, and the replica data dir is present; prints a summary. |
> [!IMPORTANT]
> Phase 5 is exactly where the [05 · Backup k3s_pvc dump](05-backup.md) pays off: re-applying the captured **Volume CRDs** lets Longhorn re-adopt the surviving replica directories instead of forcing the block-device path. The playbook is **idempotent** — it re-diagnoses and escalates only as far as needed, so re-running after a partial recovery is safe.
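
The "least invasive fix first, escalate only as far as needed" behaviour, and the reason a re-run is safe, can be pictured as a short loop. This is a sketch of the idea only, not a transcription of the playbook's Ansible tasks; `escalate` and the callables passed to it are hypothetical:

```python
from typing import Callable, Optional, Sequence, Tuple

Fix = Tuple[str, Callable[[], None]]


def escalate(fixes: Sequence[Fix], is_healthy: Callable[[], bool]) -> Optional[str]:
    """Apply fixes from least to most invasive until the health check passes.

    Returns the name of the fix that restored health, or None when the
    system was already healthy (the idempotent re-run case).
    """
    if is_healthy():
        return None
    for name, apply_fix in fixes:
        apply_fix()
        if is_healthy():
            return name
    raise RuntimeError("all recovery phases exhausted without reaching a healthy state")
```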
---
## `longhorn_data.yml` — block-device data recovery (CRDs gone)
This is the fallback when a nuclear reinstall has destroyed the Volume CRDs and assigned new engine IDs, leaving the real data in **orphaned** replica directories. It bypasses Kubernetes objects entirely and reconstructs the filesystem at the block level. It is **driven by a vars file**, `vars/recovery_volumes.yml`, with one entry per volume; the format is documented in [`longhorn_data_vars.example.yml`](../../../../ansible/arcodange/factory/playbooks/recover/longhorn_data_vars.example.yml).
```sh
ansible-playbook -i inventory/hosts.yml \
playbooks/recover/longhorn_data.yml \
-e @vars/recovery_volumes.yml
```
```mermaid
%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'14px'}}}%%
flowchart TD
classDef pre fill:#5f4a1e,stroke:#d97706,color:#fffbeb;
classDef merge fill:#4c1d95,stroke:#7c3aed,color:#f5f3ff;
classDef k8s fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb;
classDef done fill:#14532d,stroke:#22c55e,color:#f0fdf4;
P0["Pre-flight + Phase 0:<br/>auto-discover largest replica dir (>16K)"]:::pre
P1["Phase 1: back up untouched replica dir<br/>(safe copy before any op)"]:::merge
P2["Phase 2: merge-longhorn-layers.py<br/>→ single .img · test-mount RO"]:::merge
P3["Phase 3: create Volume CRD<br/>(scale down workload, clear stuck PVCs)"]:::k8s
P5["Phase 5: attach via maintenance ticket<br/>→ /dev/longhorn/&lt;pv&gt;"]:::k8s
P6["Phase 6: mkfs + rsync merged image<br/>into live block device"]:::merge
P8["Phase 8: recreate PV (Retain) + PVC<br/>pinned by volumeName"]:::k8s
P9["Phase 9: scale workload up · verify"]:::done
P0 --> P1 --> P2 --> P3 --> P5 --> P6 --> P8 --> P9
```
1. **Pre-flight + Phase 0.** Fail fast if no volumes are defined, the merge tool is missing, or Longhorn managers aren't Running. Then **auto-discover** the best replica source for each volume — the **largest dir >16 MiB** across `pi1/pi2/pi3`, skipping any replica still `Rebuilding`. `source_node`/`source_dir` in the vars file override this.
2. **Phase 1.** `cp -a` the untouched replica dir to a backup location *before* touching anything, and verify it contains `volume.meta`.
3. **Phase 2.** Run `merge-longhorn-layers.py` to collapse the snapshot + head `.img` layers into one image, then test-mount it read-only to confirm the filesystem is sound.
4. **Phase 3.** Scale the workload to 0 and clear any stuck `Terminating` PV/PVCs *before* creating a fresh Longhorn `Volume` CRD (order matters — StatefulSet controllers re-provision empty PVCs otherwise).
5. **Phase 5.** Attach the volume via a Longhorn `VolumeAttachment` **maintenance ticket** so `/dev/longhorn/<pv>` appears on the source node, with the frontend enabled.
6. **Phase 6.** `mkfs.ext4` the live block device if unformatted, then `rsync` the merged recovery image into it (`--ignore-errors`; rsync rc=23 partial-transfer is treated as success for power-cut partitions).
7. **Phase 8.** Detach the recovery ticket, recreate the PV (`Retain`, no `claimRef`) and a PVC pinned by `volumeName`, and wait for Bound.
8. **Phase 9.** Scale the workload back up, wait for ready replicas, and run the optional per-volume `verify_cmd` inside the pod.
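
The Phase 0 source selection in step 1 can be sketched as a filter followed by a maximum. The sketch assumes the per-node scan has already produced a size and a rebuilding flag for each candidate; `ReplicaDir`, `pick_source_replica` and the `min_bytes` parameter are hypothetical, and the size threshold is passed in rather than hard-coded:

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ReplicaDir:
    node: str         # e.g. "pi1"
    path: str         # replica directory on that node
    size_bytes: int   # disk usage of the directory
    rebuilding: bool  # True while Longhorn still reports the replica as Rebuilding


def pick_source_replica(candidates: Iterable[ReplicaDir],
                        min_bytes: int,
                        override: Optional[ReplicaDir] = None) -> Optional[ReplicaDir]:
    """Return the explicit override if given, else the largest usable replica dir."""
    if override is not None:
        # Mirrors source_node/source_dir in the vars file taking precedence.
        return override
    usable = [c for c in candidates
              if not c.rebuilding and c.size_bytes > min_bytes]
    return max(usable, key=lambda c: c.size_bytes, default=None)
```

Returning `None` when nothing qualifies keeps the "no usable source" case explicit, so a caller can fail before any backup or merge step runs.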
> [!CAUTION]
> The `merge-longhorn-layers.py` tool is invoked **per replica dir via `dmsetup`** to stack the copy-on-write layers correctly. Never recover by simply renaming the orphaned replica directory to the new engine ID — Longhorn reconciliation can pick the *empty* new replica as the rebuild source and **overwrite your data**. The block-device injection is the only proven-safe path. The full method comparison is in the [Longhorn PVC recovery ADR](../../../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md).
> [!NOTE]
> **Tested 2026-04-13 power-cut.** This block-device path was proven end to end recovering the **url-shortener's SQLite database** after that power cut forced a nuclear Longhorn reinstall (verified `2026-04-14` with `sqlite3 … 'SELECT COUNT(*) FROM urls;'`). That scenario is the worked example in [`longhorn_data_vars.example.yml`](../../../../ansible/arcodange/factory/playbooks/recover/longhorn_data_vars.example.yml).
---
## Gotchas
> [!WARNING]
> - **Run `longhorn.yml` first if there is any chance the CRDs survived.** It is fast and idempotent; falling straight to `longhorn_data.yml` is unnecessary block-level work when a `kubectl apply` would have sufficed.
> - **`longhorn_data.yml` needs a healthy Longhorn control plane.** Its pre-flight aborts unless ≥1 `longhorn-manager` is Running — it recovers *data into* a working Longhorn, it does not bring Longhorn back. Use `longhorn.yml` for that.
> - **Process volumes one at a time first.** The example vars file recommends validating a single volume before batching — a misidentified `source_dir` can pin the PVC to the wrong (empty) replica.
> - **`python3` on every node.** Phase 0's replica scan and the merge tool both require `python3` on `pi1/pi2/pi3`.
> - **The merge tool path is repo-relative.** `longhorn_data.yml` resolves `merge-longhorn-layers.py` from `docs/incidents/2026-04-13-power-cut/tools/` and `scp`s it to the source node — run the playbook from inside the collection so that path resolves.
---
## Why this is rehearsed
A recovery procedure run once under outage stress is a liability. These two playbooks — and the CRDs-present-vs-gone decision — are **rehearsed deliberately in the production-like sandbox**: kill the cluster, lose the engine IDs on a test volume, and walk both recovery paths back to green without risking production data. That turns the drill into routine QA rather than one-shot incident memory. See the PRD's [QA strategy](../../../PRD/safe-prod-like-environment/qa-strategy.md) for how recovery drills become a regular exercise, and [Storage & recovery](../../lab-ecosystem/storage-and-recovery.md) for the full startup order these drills validate.
---
## Where this branch sits
```mermaid
%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'14px'}}}%%
flowchart LR
classDef done fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb;
classDef here fill:#5f1e1e,stroke:#ef4444,color:#fef2f2;
s05["05 · Backup<br/>(produces .volumes dump)"]:::done
rec["recover/*<br/>longhorn.yml · longhorn_data.yml"]:::here
s01["01 · System<br/>(rejoin pipeline)"]:::done
s05 -. "on disaster" .-> rec
rec -. "once recovered" .-> s01
```
1. **05 · Backup** produced the `.volumes` dump that `longhorn.yml`'s restore phase replays.
2. **recover/** (this page) is invoked only on disaster — pick `longhorn.yml` (CRDs present) or `longhorn_data.yml` (CRDs gone).
3. Once volumes are healthy, the cluster **re-enters the normal pipeline** at [01 · System](01-system.md), and you re-run a fresh [05 · Backup](05-backup.md) once everything is green.


@@ -0,0 +1,127 @@
[vibe](../../../README.md) > [Guidebooks](../../README.md) > [Factory provisioning](../README.md) > **Ansible**
# Ansible — factory provisioning
> [!NOTE]
> **Status:** ✅ active · **Last Updated:** 2026-06-23
> **Upstream:** [Factory provisioning hub](../README.md) · [Lab ecosystem · 01 factory](../../lab-ecosystem/01-factory.md)
> **Downstream:** [01 · System](01-system.md) · [02 · Setup](02-setup.md) · [03 · CI/CD](03-cicd.md) · [04 · Tools](04-tools.md) · [05 · Backup](05-backup.md) · [06 · Recover](06-recover.md) · [Inventory & variables](inventory.md) · [Roles reference](roles.md)
> **Related:** [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) · [Storage & recovery](../../lab-ecosystem/storage-and-recovery.md) · [Naming conventions](../../lab-ecosystem/naming-conventions.md) · [ADR-0001 safe prod-like environment](../../../ADR/0001-safe-prod-like-environment.md)
Ansible is the **imperative half** of the factory: it takes three bare Raspberry Pis (`pi1`, `pi2`, `pi3`) and turns them into a running K3s cluster with Docker, Longhorn storage, Gitea CI runners, CrowdSec, and Vault. OpenTofu (the declarative half) then provisions everything that lives *outside* the cluster — see the [OpenTofu sub-hub](../opentofu/README.md).
---
## Collection layout
Everything ships as a single Ansible **collection** committed under [`ansible/arcodange/factory/`](../../../../ansible/arcodange/factory). The collection root, not the repo root, is what `ansible-galaxy collection install` and the FQCN references (`arcodange.factory.<role>`) resolve against.
| File | Path | What it declares |
| --- | --- | --- |
| `galaxy.yml` | [`ansible/arcodange/factory/galaxy.yml`](../../../../ansible/arcodange/factory/galaxy.yml) | Collection identity: **namespace `arcodange`**, **name `factory`**, **version `1.0.0`**. Together they form the FQCN prefix `arcodange.factory.*` used by every role and playbook import. |
| `requirements.yml` | [`ansible/requirements.yml`](../../../../ansible/requirements.yml) | External dependencies pulled at install time (see table below). |
| `ansible.cfg` | [`ansible/arcodange/factory/ansible.cfg`](../../../../ansible/arcodange/factory/ansible.cfg) | `collections_path = ~/.ansible/collections` and `scp_if_ssh = True` for the SSH connection plugin. |
| `inventory/` | [`ansible/arcodange/factory/inventory/`](../../../../ansible/arcodange/factory/inventory) | `hosts.yml` + `group_vars/`. Detailed in [Inventory & variables](inventory.md). |
| `playbooks/` | [`ansible/arcodange/factory/playbooks/`](../../../../ansible/arcodange/factory/playbooks) | The numbered pipeline `01..05` plus the `recover/` branch. |
| `roles/` | [`ansible/arcodange/factory/roles/`](../../../../ansible/arcodange/factory/roles) | Seven reusable roles. Detailed in [Roles reference](roles.md). |
### External dependencies (`requirements.yml`)
| Dependency | Type | Why it is needed |
| --- | --- | --- |
| `geerlingguy.docker` | role | Installs and configures the Docker engine on each Pi. |
| `ansible.posix` | collection | POSIX primitives (mounts, sysctl, `synchronize`). |
| `community.crypto` | collection | Certificate/key generation for the step-ca PKI and Traefik. |
| `community.docker` | collection | Manages containers and Compose stacks (Gitea, act_runner). |
| `community.general` | collection | Broad utility modules used across the pipeline. |
| `kubernetes.core` | collection | `k8s` / `helm` modules used by every K3s-facing task. Needs the `kubernetes` Python lib at runtime. |
| `k3s-ansible` (`git+https://github.com/k3s-io/k3s-ansible.git`) | git role/collection | Upstream playbooks that install and cluster K3s itself. |
> [!TIP]
> The runtime Python libraries (`kubernetes`, `jmespath`, `dnspython`) that `kubernetes.core` and friends import are declared in the **repo-root `pyproject.toml`**, not in `requirements.yml`. `uv sync` installs them; `ansible-galaxy` installs the Galaxy/git content. Both steps are required.
---
## Invocation pattern
The control node runs Ansible from a `uv`-managed venv. The `localhost` inventory entry sets `ansible_python_interpreter: "{{ ansible_playbook_python }}"`, so `uv run` is enough to put Ansible on the venv's Python — no hardcoded interpreter path. Full recipe lives in [`ansible/README.md`](../../../../ansible/README.md).
1. **Sync the venv** — installs `ansible-core` plus the runtime Python deps:
```sh
uv sync
```
2. **Install collection dependencies** — pulls the Galaxy + git content from `requirements.yml`:
```sh
uv run ansible-galaxy collection install -r ansible/requirements.yml
```
3. **Run a stage** — point `-i` at the inventory directory and pass one numbered playbook:
```sh
uv run ansible-playbook \
-i ansible/arcodange/factory/inventory \
ansible/arcodange/factory/playbooks/<NN_name>.yml
```
### The vault password (`ANSIBLE_VAULT_PASSWORD_FILE`)
Encrypted vars are decrypted with a password that is **sourced from the cluster, not stored on disk**. `ANSIBLE_VAULT_PASSWORD_FILE` points at a tiny executable script that reads the K8s secret `arcodange-ansible-vault` from the `kube-system` namespace:
```sh
kubectl get secret -n kube-system arcodange-ansible-vault \
--template='{{index .data.pass | base64decode}}'
```
> [!IMPORTANT]
> The same `arcodange-ansible-vault` secret in `kube-system` is consumed by the Gitea CI runners (needed for the Gitea mailer). Create it once with `kubectl create secret generic arcodange-ansible-vault --from-literal="pass=<ansible_vault_password>" -n kube-system`. See [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) for how this fits the broader secret model.
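
For readers who prefer a script to the one-liner, the same lookup can be sketched in Python. Ansible runs an executable vault password file and uses its standard output as the password, so a wrapper that prints the result of `fetch_vault_password()` would fill that role. The function names are hypothetical, and the sketch reads the raw secret with `-o jsonpath` and decodes it in Python instead of using the Go template shown above:

```python
import base64
import subprocess


def decode_secret_value(encoded: str) -> str:
    """Kubernetes stores secret data base64-encoded; return the plain text."""
    return base64.b64decode(encoded).decode("utf-8")


def fetch_vault_password() -> str:
    """Read the `pass` key of the arcodange-ansible-vault secret via kubectl."""
    result = subprocess.run(
        ["kubectl", "get", "secret", "-n", "kube-system",
         "arcodange-ansible-vault", "-o", "jsonpath={.data.pass}"],
        check=True,           # raise if kubectl fails (e.g. cluster unreachable)
        capture_output=True,
        text=True,
    )
    return decode_secret_value(result.stdout.strip())
```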
---
## The provisioning pipeline
The numbered playbooks are meant to be run **in order** on a fresh cluster — each is a thin wrapper that `import_playbook`s a stage directory (e.g. `01_system.yml` → `system/system.yml`). The `recover/` playbooks are **not** part of the linear sequence; they are an on-demand branch used only during disaster recovery.
```mermaid
%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'14px'}}}%%
flowchart LR
classDef stage fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb;
classDef recover fill:#5f1e1e,stroke:#ef4444,color:#fef2f2;
s01["01 · System<br/>Docker · K3s · Longhorn · DNS · SSL"]:::stage
s02["02 · Setup<br/>Gitea · Postgres · NFS backup"]:::stage
s03["03 · CI/CD<br/>act_runner registration"]:::stage
s04["04 · Tools<br/>CrowdSec · Vault"]:::stage
s05["05 · Backup<br/>cron reports · PVC/db dumps"]:::stage
rec["recover/*<br/>Longhorn + data restore"]:::recover
s01 --> s02 --> s03 --> s04 --> s05
s05 -. "on disaster" .-> rec
rec -. "rejoin pipeline" .-> s01
```
1. **`01 · System`** — base OS hardening on each Pi, then Docker, Longhorn disk prep + iSCSI, K3s install, CoreDNS, the step-ca cert issuer, and final K3s config (kubeconfig, Longhorn, Traefik).
2. **`02 · Setup`** — deploys the cluster-resident services: Gitea, PostgreSQL (on `pi2`), and the NFS backup target.
3. **`03 · CI/CD`** — fetches a Gitea runner-registration token and rolls out the `act_runner` Docker Compose stack on every non-Gitea Pi so CI jobs have executors.
4. **`04 · Tools`** — installs the operational tooling layer: CrowdSec (WAF/IPS) and HashiCorp Vault.
5. **`05 · Backup`** — schedules the cron-driven backup + email-report jobs and the Gitea / Postgres / K3s-PVC dump routines.
6. **`recover/*` (on demand)** — invoked only after data loss to rebuild Longhorn and replay volume data; once recovered, the cluster re-enters the normal pipeline at `01 · System`.
---
## Index
| # | Page | Covers | State |
| --- | --- | --- | --- |
| 01 | [System](01-system.md) | RPi hardening, Docker, K3s, Longhorn/iSCSI, CoreDNS, step-ca SSL | ✅ |
| 02 | [Setup](02-setup.md) | Gitea, PostgreSQL, NFS backup target | ✅ |
| 03 | [CI/CD](03-cicd.md) | Gitea `act_runner` registration & Compose deploy | ✅ |
| 04 | [Tools](04-tools.md) | CrowdSec, HashiCorp Vault | ✅ |
| 05 | [Backup](05-backup.md) | Cron report jobs, Gitea/Postgres/PVC dumps | ✅ |
| 06 | [Recover](06-recover.md) | Longhorn + data restore (on-demand DR branch) | 🟡 |
| — | [Inventory & variables](inventory.md) | `hosts.yml` groups, `group_vars/` layering, host→service mapping | ✅ |
| — | [Roles reference](roles.md) | The seven `arcodange.factory.*` roles | ✅ |
---
## Maintenance rule
> [!IMPORTANT]
> **Alter a playbook, role, inventory entry, or `group_vars` → update the matching page here in the same change.** Adding a stage, renaming a role, bumping the K3s version or a `requirements.yml` dependency, or moving a host between groups all change what the pages above describe — edit the page in the PR that changes the code, never as a follow-up. This is the [factory-provisioning maintenance rule](../README.md#maintenance-rule) applied to the Ansible half; the guidebooks' full [Rules to contribute](../../README.md#rules-to-contribute) also apply.


@@ -0,0 +1,111 @@
[vibe](../../../README.md) > [Guidebooks](../../README.md) > [Factory provisioning](../README.md) > [Ansible](README.md) > **Inventory & variables**
# Inventory & variables
> [!NOTE]
> **Status:** ✅ active · **Last Updated:** 2026-06-23
> **Upstream:** [Ansible sub-hub](README.md) · [Lab ecosystem · 01 factory](../../lab-ecosystem/01-factory.md)
> **Downstream:** [Roles reference](roles.md)
> **Related:** [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) · [Storage & recovery](../../lab-ecosystem/storage-and-recovery.md) · [Naming conventions](../../lab-ecosystem/naming-conventions.md) · [ADR-0001 safe prod-like environment](../../../ADR/0001-safe-prod-like-environment.md) · [PRD · isolation boundary](../../../PRD/safe-prod-like-environment/isolation-boundary.md)
The inventory is the single source of truth for **which machines exist** and **which service each machine runs**. It is a directory inventory — [`inventory/hosts.yml`](../../../../ansible/arcodange/factory/inventory/hosts.yml) plus a layered [`group_vars/`](../../../../ansible/arcodange/factory/inventory/group_vars) tree — passed to every playbook with `-i ansible/arcodange/factory/inventory`.
> [!IMPORTANT]
> This inventory describes **live production**. The three IPs `192.168.1.201-203` belong to the real Pis that run the public CMS, the Dolibarr ERP, and business email. A playbook pointed at this inventory mutates prod. The safe-environment work treats this file as the prod blast-radius and requires a **separate sandbox inventory + a prod-IP guard** before any sandbox apply; see [ADR-0001](../../../ADR/0001-safe-prod-like-environment.md) and the first row of the [PRD isolation boundary](../../../PRD/safe-prod-like-environment/isolation-boundary.md).
---
## Hosts
Defined in [`inventory/hosts.yml`](../../../../ansible/arcodange/factory/inventory/hosts.yml). Three physical Pis are each reachable two ways — over the LAN (the canonical path) and through an internet port-forward managed at the firewall — plus the control node as `localhost`.
| Host | `ansible_host` | `preferred_ip` | Port | Reach |
| --- | --- | --- | --- | --- |
| `pi1` | `pi1.home` | `192.168.1.201` | 22 | LAN |
| `pi2` | `pi2.home` | `192.168.1.202` | 22 | LAN |
| `pi3` | `pi3.home` | `192.168.1.203` | 22 | LAN |
| `internetPi1` | `rg-evry.changeip.co` | — | `51022` | WAN port-forward → `pi1` |
| `internetPi2` | `rg-evry.changeip.co` | — | `52022` | WAN port-forward → `pi2` |
| `internetPi3` | `rg-evry.changeip.co` | — | `53022` | WAN port-forward → `pi3` |
| `localhost` | (local connection) | — | — | control node |
> [!NOTE]
> The `internetPiN` entries share one DNS name (`rg-evry.changeip.co`) and differ only by SSH port (`5N022`). The hosts file documents the choice of `changeip.co` over `arcodange.duckdns.org`: changeip is **managed directly with the firewall** rather than depending on a DuckDNS registry update, so the forward is stable. `preferred_ip` is a custom hostvar (not a connection variable) — roles read it to build DNS records, the Gitea SSH domain, and the Pi-hole local-DNS table.
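
The `5N022` port scheme is plain arithmetic on the Pi number. A small sketch, with hypothetical helper names, that reproduces the three forwarded ports from the hosts table:

```python
def wan_ssh_port(pi_index: int) -> int:
    """Return the forwarded SSH port for Pi N, following the 5N022 pattern."""
    if pi_index not in (1, 2, 3):
        raise ValueError(f"no WAN forward is defined for pi{pi_index}")
    return 50000 + pi_index * 1000 + 22


def wan_alias(pi_index: int) -> str:
    """Return the inventory alias used for the WAN handle of Pi N."""
    return f"internetPi{pi_index}"
```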
---
## Groups
Groups map machines to roles. The membership is small and deliberate; read the table as "this service runs on these hosts".
| Group | Members | Defined as | What it is for |
| --- | --- | --- | --- |
| `raspberries` | `pi1`, `pi2`, `pi3` + `internetPi1-3` | explicit hosts | Every Pi, LAN and WAN handles. Carries the shared `ansible_user: pi`. |
| `local` | `localhost`, `pi1`, `pi2`, `pi3` | explicit hosts | The control-node-facing group; `localhost` runs `kubectl`/`tofu`/`docker` tasks that talk to the cluster. |
| `postgres` | `pi2` | explicit host | The single PostgreSQL node. `pi2` is the database host. |
| `gitea` | `pi2` (via `children: postgres`) | child of `postgres` | Gitea co-locates with its database, so the group simply inherits `postgres`. `groups.gitea[0]` resolves to `pi2` everywhere. |
| `pihole` | `pi1`, `pi3` | explicit hosts | The HA DNS pair (Pi-hole + Gravity Sync). |
| `step_ca` | `pi1`, `pi2`, `pi3` | explicit hosts | Every Pi runs a step-ca node (primary `pi1`, standbys `pi2`/`pi3`). |
| `all` | everything (`children: raspberries`) | implicit + child | Ansible's universal group; `group_vars/all/` applies to all hosts. |
> [!TIP]
> Because `gitea` is a **child of `postgres`** and `postgres` has exactly one host, every reference to `groups.gitea[0]` (the Gitea container, the API base URL `http://{{ groups.gitea[0] }}:3000`, the SSH domain) points at `pi2`. Move Postgres and Gitea follows automatically.
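
Why `groups.gitea[0]` follows Postgres can be seen by resolving group membership by hand. The sketch below models only three groups from the table as a plain dictionary and walks `children` recursively; it illustrates the inheritance rule and is not how Ansible's inventory manager is implemented (host ordering in real Ansible may differ). `INVENTORY` and `resolve_hosts` are hypothetical names:

```python
from typing import Dict, List

# Simplified model of part of hosts.yml: a group has its own hosts and child groups.
INVENTORY: Dict[str, Dict[str, List[str]]] = {
    "postgres": {"hosts": ["pi2"], "children": []},
    "gitea": {"hosts": [], "children": ["postgres"]},
    "pihole": {"hosts": ["pi1", "pi3"], "children": []},
}


def resolve_hosts(group: str, inventory: Dict[str, Dict[str, List[str]]]) -> List[str]:
    """Return the group's own hosts followed by the hosts of all child groups."""
    entry = inventory[group]
    hosts = list(entry.get("hosts", []))
    for child in entry.get("children", []):
        for host in resolve_hosts(child, inventory):
            if host not in hosts:  # keep each host once, preserving order
                hosts.append(host)
    return hosts
```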
---
## Connection variables
| Variable | Where set | Value / effect |
| --- | --- | --- |
| `ansible_user` | `raspberries.vars` | `pi` — the SSH login on every Pi. |
| `ansible_ssh_extra_args` | per-host (`pi1`/`pi2`/`pi3`) | `-o StrictHostKeyChecking=no` — Pis get reimaged, so host-key churn is expected; the check is disabled rather than forcing `known_hosts` edits. |
| `ansible_port` | `internetPiN` | `51022` / `52022` / `53022` — the firewall's per-Pi SSH forwards. |
| `ansible_connection` | `localhost` | `local` — run on the control node, no SSH. |
| `ansible_python_interpreter` | `localhost` | `"{{ ansible_playbook_python }}"` — uses the `uv`-managed venv's Python, no hardcoded path. |
The SSH transfer setting `scp_if_ssh = True` is set in [`ansible.cfg`](../../../../ansible/arcodange/factory/ansible.cfg); `collections_path` lives there too.
---
## `group_vars/` layering
Variables are split by group so each service owns its own file. The path `group_vars/<group>/<file>.yml` is auto-loaded for every host in `<group>`.
| File | Scope | Declares |
| --- | --- | --- |
| [`all/common.yml`](../../../../ansible/arcodange/factory/inventory/group_vars/all/common.yml) | all hosts | `user_home` — the control user's `$HOME`, looked up from the environment. |
| [`all/ssh.yml`](../../../../ansible/arcodange/factory/inventory/group_vars/all/ssh.yml) | all hosts | SSH-public-key discovery: `first_found` over `id_ed25519_arcodange.pub` → `id_ed25519.pub` → `id_rsa.pub`, then splits the file into `ssh_public_key`, `ssh_key_title`, `ssh_key_algorithm`. Roles push this key to authorized hosts. |
| [`all/gitea.yml`](../../../../ansible/arcodange/factory/inventory/group_vars/all/gitea.yml) | all hosts | `gitea_secret_propagation_users: [arcodange]` — user namespaces that must also receive org-level Gitea Action secrets (see the [`gitea_secret`](roles.md) role). |
| [`gitea/gitea.yml`](../../../../ansible/arcodange/factory/inventory/group_vars/gitea/gitea.yml) | `gitea` | `gitea_version: 1.25.5`, the `gitea_database` triple, and the full Gitea Docker Compose: Postgres backend (`postgres:5432`), the `smtps`/orange.fr mailer, SSH on `2222:22`, `ROOT_URL https://gitea.arcodange.lab/`, registration disabled. SSH domain is built from `hostvars[groups.gitea[0]].preferred_ip`. |
| [`gitea/gitea_vault.yml`](../../../../ansible/arcodange/factory/inventory/group_vars/gitea/gitea_vault.yml) | `gitea` | **VAULTED.** The `gitea_vault.*` map — `GITEA__mailer__PASSWD` (consumed by the compose above) plus the `github_api_token` / `gitlab_api_token` read by the mirror roles. |
| [`postgres/postgres.yml`](../../../../ansible/arcodange/factory/inventory/group_vars/postgres/postgres.yml) | `postgres` | The Postgres Docker Compose — `postgres:16.3-alpine`, `5432:5432`, data under `/home/pi/arcodange/docker_composes/postgres/data` — plus the `pgbouncer` auth-user block. |
| [`step_ca/step_ca.yml`](../../../../ansible/arcodange/factory/inventory/group_vars/step_ca/step_ca.yml) | `step_ca` | `step_ca_primary: pi1`, `step_ca_fqdn: ssl-ca.arcodange.lab`, the `step` user/home/dir, and `step_ca_listen_address: ":8443"`. |
| [`step_ca/step_ca_vault.yml`](../../../../ansible/arcodange/factory/inventory/group_vars/step_ca/step_ca_vault.yml) | `step_ca` | **VAULTED.** `vault_step_ca_password` (the CA root password) and `vault_step_ca_jwk_password` (the cert-manager JWK provisioner password). |
> [!NOTE]
> Encrypted files are conventionally suffixed `_vault.yml`. They are normal `group_vars` files whose **contents** are `ansible-vault`-encrypted; non-vault siblings hold the plaintext structure that references the vaulted keys (e.g. `gitea/gitea.yml` interpolates `gitea_vault.GITEA__mailer__PASSWD`).
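
The key discovery described for `all/ssh.yml` in the table above combines two steps: take the first candidate file that exists, then split its single line into parts. A Python sketch of both, assuming the standard OpenSSH public-key layout of `algorithm key comment`; how the three fields map onto `ssh_public_key`, `ssh_key_title` and `ssh_key_algorithm` in the real vars file is not shown here, and the helper names are hypothetical:

```python
from pathlib import Path
from typing import Iterable, Optional, Tuple

# Candidate file names in priority order, as listed for all/ssh.yml.
KEY_CANDIDATES = ("id_ed25519_arcodange.pub", "id_ed25519.pub", "id_rsa.pub")


def first_found(ssh_dir: Path, names: Iterable[str] = KEY_CANDIDATES) -> Optional[Path]:
    """Return the first candidate that exists in ssh_dir, or None."""
    for name in names:
        candidate = ssh_dir / name
        if candidate.is_file():
            return candidate
    return None


def split_public_key(line: str) -> Tuple[str, str, str]:
    """Split 'algorithm key comment' into its parts; the comment may be empty."""
    algorithm, key, *comment = line.strip().split(None, 2)
    return algorithm, key, (comment[0] if comment else "")
```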
---
## The vault model
Two distinct mechanisms share the word "vault" here — keep them apart:
1. **`ansible-vault`** encrypts the `*_vault.yml` files at rest in git (AES256). Decryption happens transparently at playbook runtime.
2. **The vault password itself is never on disk.** `ANSIBLE_VAULT_PASSWORD_FILE` points at a tiny executable that fetches the password from the K8s secret `arcodange-ansible-vault` in the `kube-system` namespace:
```sh
kubectl get secret -n kube-system arcodange-ansible-vault \
--template='{{index .data.pass | base64decode}}'
```
So decrypting any `*_vault.yml` requires `kubectl` access to the live cluster — the cluster *is* the key custodian. The setup recipe (and the `kubectl create secret` to seed it) lives in [`ansible/README.md`](../../../../ansible/README.md); how this fits the broader secret hierarchy is in [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md).
> [!CAUTION]
> This is **not** HashiCorp Vault. HashiCorp Vault (`vault.arcodange.lab`) is a separate, cluster-resident service installed by the [`hashicorp_vault`](roles.md) role in the `04 · Tools` stage. The `arcodange-ansible-vault` K8s secret only holds the `ansible-vault` password and is also read by the Gitea CI runners for the mailer.
---
## Why this page matters for safe-prod
The variables above bind Ansible directly to live infrastructure: the host IPs, the prod Vault address, the prod Postgres superuser, and the prod Gitea forge. The safe-environment design maps each of these to a sandbox control — a parallel `inventory/sandbox/hosts.yml` with VM/cloud hosts, a pre-task guard that aborts on any `192.168.1.201-203` target unless `i_mean_prod=true`, and per-service overrides — detailed in the [PRD isolation boundary](../../../PRD/safe-prod-like-environment/isolation-boundary.md). Until that lands, **assume every run is a prod run**.
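
The planned prod-IP guard amounts to a membership test with an explicit opt-in. The sketch below only illustrates that rule and is not the guard itself, which has not landed yet; `assert_not_prod` is a hypothetical name, and it assumes the targets have already been resolved to IP addresses (the inventory addresses the Pis by `pi1.home`-style names, so a real guard would need to resolve those or check `preferred_ip`):

```python
import ipaddress
from typing import Iterable

# The three production addresses named on this page.
PROD_IPS = frozenset(ipaddress.ip_address(f"192.168.1.{n}") for n in (201, 202, 203))


def assert_not_prod(targets: Iterable[str], i_mean_prod: bool = False) -> None:
    """Raise unless no target is a prod IP or the caller explicitly opted in."""
    hits = sorted(str(ip) for ip in map(ipaddress.ip_address, targets)
                  if ip in PROD_IPS)
    if hits and not i_mean_prod:
        raise RuntimeError(
            "refusing to run against production hosts: " + ", ".join(hits)
        )
```

Making the opt-in a separate, explicitly named flag means a sandbox run can never reach prod by omission; it has to be requested.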


@@ -0,0 +1,186 @@
[vibe](../../../README.md) > [Guidebooks](../../README.md) > [Factory provisioning](../README.md) > [Ansible](README.md) > **Roles reference**
# Roles reference
> [!NOTE]
> **Status:** ✅ active · **Last Updated:** 2026-06-23
> **Upstream:** [Ansible sub-hub](README.md) · [Lab ecosystem · 01 factory](../../lab-ecosystem/01-factory.md)
> **Downstream:** [Inventory & variables](inventory.md)
> **Related:** [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) · [Storage & recovery](../../lab-ecosystem/storage-and-recovery.md) · [Naming conventions](../../lab-ecosystem/naming-conventions.md) · [ADR-0001 safe prod-like environment](../../../ADR/0001-safe-prod-like-environment.md)
Roles live in two places, by reuse scope:
- **Shared roles** — reusable across stages — live in [`ansible/arcodange/factory/roles/`](../../../../ansible/arcodange/factory/roles) and are referenced by FQCN `arcodange.factory.<role>`.
- **Nested roles** — owned by one playbook stage — live under [`playbooks/<stage>/roles/`](../../../../ansible/arcodange/factory/playbooks) and are auto-discovered by that stage's playbook.
This page is split by **altitude**. Tier 1 covers the heavyweight platform-service roles (one subsection each); Tier 2 is a single table of the smaller building-block roles.
---
## Tier 1 — platform-service roles
### `hashicorp_vault`
[`playbooks/tools/roles/hashicorp_vault`](../../../../ansible/arcodange/factory/playbooks/tools/roles/hashicorp_vault) · runs on `localhost` in the `04 · Tools` stage. It initializes and unseals the cluster Vault and wires Gitea as an OIDC provider so CI jobs can authenticate to Vault.
The [`tasks/main.yml`](../../../../ansible/arcodange/factory/playbooks/tools/roles/hashicorp_vault/tasks/main.yml) flow is:
1. **Init** ([`init.yml`](../../../../ansible/arcodange/factory/playbooks/tools/roles/hashicorp_vault/tasks/init.yml)) — first run only. Lists the Vault server pods in the `tools` namespace, checks `vault operator init -status`, and if uninitialized runs `vault operator init` with **`key-shares=1`, `key-threshold=1`** (defaults from [`defaults/main.yml`](../../../../ansible/arcodange/factory/playbooks/tools/roles/hashicorp_vault/defaults/main.yml)). The JSON output — unseal keys + initial root token — is written to `~/.arcodange/cluster-keys.json` (dir `0700`, file `0600`).
2. **Unseal** ([`unseal.yml`](../../../../ansible/arcodange/factory/playbooks/tools/roles/hashicorp_vault/tasks/unseal.yml)) — required after every reboot. Reads the keys file and runs `vault operator unseal` for each server, then revokes the *initial* root token (idempotent — tolerates an already-revoked token).
3. **Generate a fresh root token** ([`new_root_token.yml`](../../../../ansible/arcodange/factory/playbooks/tools/roles/hashicorp_vault/tasks/new_root_token.yml)) — runs the `generate-root` OTP/nonce dance using the unseal keys to mint a short-lived `vault_root_token`.
4. **Set up Gitea OIDC** ([`gitea_oidc_auth.yml`](../../../../ansible/arcodange/factory/playbooks/tools/roles/hashicorp_vault/tasks/gitea_oidc_auth.yml)) — drives Gitea through the bundled [`playwright_setupGiteaApp.js`](../../../../ansible/arcodange/factory/playbooks/tools/roles/hashicorp_vault/files/playwright_setupGiteaApp.js) (via the [`playwright`](#tier-2--building-block-roles) role) to create an OAuth2 app, then applies the bundled OpenTofu [`hashicorp_vault.tf`](../../../../ansible/arcodange/factory/playbooks/tools/roles/hashicorp_vault/files/hashicorp_vault.tf) inside a disposable `ghcr.io/opentofu/opentofu` container (state on a throwaway docker volume) to provision the Vault JWT/OIDC backend. Finally it renders [`oidc_jwt_token.sh.j2`](../../../../ansible/arcodange/factory/playbooks/tools/roles/hashicorp_vault/templates/oidc_jwt_token.sh.j2) into the Gitea Actions secret **`vault_oauth__sh_b64`** (base64) at **org** scope, then propagates the same secret to each user in `gitea_secret_propagation_users` (Action secrets are per-owner, so user-owned repos can't read org secrets).
5. **Revoke the temp root token** — the `always` block of `main.yml` revokes `vault_root_token` no matter how step 4 ended, so no long-lived root token survives the run.
| Var | Default | Meaning |
| --- | --- | --- |
| `vault_unseal_keys_path` | `~/.arcodange/cluster-keys.json` | Where unseal keys + root token are stored. |
| `vault_unseal_keys_shares` / `_key_threshold` | `1` / `1` | Single-key seal (lab posture; `threshold <= shares`). |
| `vault_address` | `https://vault.arcodange.lab` | The cluster Vault endpoint. |
| `gitea_admin_user` / `gitea_admin_password` | `arcodange@gmail.com` / (prompted) | Credentials Playwright uses to create the OAuth app. |
| `vault_oidc_force_reset` | `false` | When `true`, runs `vault auth disable` on the `gitea` and `gitea_jwt` auth paths before re-applying. |
> [!CAUTION]
> `vault_oidc_force_reset=true` is **destructive**: it disables and wipes **all** `gitea_cicd_*` per-app JWT roles created by the bundled tofu, every run. Default is off. Likewise, losing `~/.arcodange/cluster-keys.json` means the Vault can never be unsealed again — that file is the single point of failure for the whole secret plane (see [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md)).
### `step_ca`
[`playbooks/ssl/roles/step_ca`](../../../../ansible/arcodange/factory/playbooks/ssl/roles/step_ca) · runs on the `step_ca` group (all three Pis) in the `01 · System` stage via [`ssl/step-ca.yml`](../../../../ansible/arcodange/factory/playbooks/ssl/step-ca.yml). It is the lab's internal ACME/CA for `*.arcodange.lab` certificates, run **active/standby**: primary `pi1`, replicas `pi2`/`pi3`. The [`tasks/main.yml`](../../../../ansible/arcodange/factory/playbooks/ssl/roles/step_ca/tasks/main.yml) imports five task files in order:
1. **install** — install the `step` / `step-ca` binaries.
2. **init** ([`init.yml`](../../../../ansible/arcodange/factory/playbooks/ssl/roles/step_ca/tasks/init.yml)) — primary only. `step ca init` (non-interactive, password file) with `creates:` guard so it is idempotent. The CA name is `Arcodange Lab CA`, DNS `ssl-ca.arcodange.lab`, listen `:8443`.
3. **sync** ([`sync.yml`](../../../../ansible/arcodange/factory/playbooks/ssl/roles/step_ca/tasks/sync.yml)) — replicates the CA from primary to standbys. It takes a **lockfile** on the primary (`.sync.lock`), computes a deterministic `tar | sha256sum` **checksum** of `~/.step`, compares it to the last checksum cached on the controller, and only `rsync`s (pull → controller → push to standbys) when the checksum changed. This is how the standbys hold an identical CA without a shared filesystem.
4. **systemd** — install/enable the `step-ca` unit (the `restart step-ca` handler fires on cert/config change).
5. **provisioners** ([`provisioners.yml`](../../../../ansible/arcodange/factory/playbooks/ssl/roles/step_ca/tasks/provisioners.yml)) — primary only. Ensures a **JWK provisioner named `cert-manager`** exists: lists provisioners, generates the JWK keypair (`creates:` guard) under `~/.step/provisioners/`, and `step ca provisioner add`s it. This is what lets in-cluster cert-manager request certs from the CA.
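
The checksum gate in the **sync** step can be sketched as follows. This is an approximation under stated assumptions: the role pipes `tar` into `sha256sum`, whereas this sketch hashes sorted relative paths and file contents directly, and `tree_checksum` / `needs_sync` are hypothetical helper names.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Optional


def tree_checksum(root: Path) -> str:
    """Hash relative paths and contents in sorted order so the digest is deterministic."""
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode())
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()


def needs_sync(root: Path, cached: Optional[str]) -> bool:
    """Sync only when the current checksum differs from the cached one."""
    return tree_checksum(root) != cached


# Usage against a throwaway directory standing in for ~/.step.
with tempfile.TemporaryDirectory() as tmp:
    ca_dir = Path(tmp)
    (ca_dir / "ca.json").write_text("{}")
    first = tree_checksum(ca_dir)
    unchanged = needs_sync(ca_dir, first)
    (ca_dir / "ca.json").write_text('{"changed": true}')
    changed = needs_sync(ca_dir, first)
```

Sorting the file list is what makes the digest reproducible across runs. Without a fixed order, two identical trees could hash differently and trigger a needless sync.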
| Var | Default | Meaning |
| --- | --- | --- |
| `step_ca_primary` | `pi1` | The writable CA node; standbys sync from it. |
| `step_ca_fqdn` | `ssl-ca.arcodange.lab` | CA DNS name; URL is `https://{fqdn}:8443`. |
| `step_ca_provisioner_name` / `_type` | `cert-manager` / `JWK` | The cert-manager provisioner. |
| `step_ca_force_reinit` | `false` | When `true`, stops the service and **wipes `~/.step`** before re-init. |
| Secret | Source |
| --- | --- |
| `vault_step_ca_password` | CA root password — from vaulted [`step_ca/step_ca_vault.yml`](../../../../ansible/arcodange/factory/inventory/group_vars/step_ca/step_ca_vault.yml). |
| `vault_step_ca_jwk_password` | cert-manager JWK provisioner password — same vaulted file. |
> [!CAUTION]
> `step_ca_force_reinit=true` **wipes the entire CA** (`~/.step`) on the primary and re-issues a new root — every previously issued `*.arcodange.lab` cert immediately becomes untrusted until clients reload the new root. Use only for a deliberate PKI rebuild.
### `crowdsec`
[`playbooks/tools/roles/crowdsec`](../../../../ansible/arcodange/factory/playbooks/tools/roles/crowdsec) · runs on `localhost` in the `04 · Tools` stage. It wires CrowdSec's decisions into Traefik as a bouncer middleware with a Turnstile CAPTCHA. The [`tasks/main.yml`](../../../../ansible/arcodange/factory/playbooks/tools/roles/crowdsec/tasks/main.yml) flow:
1. **Vault → K8s secret plumbing** — creates a `ServiceAccount` (`factory-ansible-tool-crowdsec-traefik-plugin`), a `VaultAuth` (kubernetes auth, role `factory_crowdsec_conf`), and a `VaultStaticSecret` that reads **`kvv2/cms/factory/turnstile`** into a K8s secret (`refreshAfter: 30s`). The Turnstile sitekey/secret come from there.
2. **Bouncer key** — finds the CrowdSec LAPI pod in `tools` and runs `cscli bouncers add traefik-plugin` (deletes + re-adds on conflict) to obtain the bouncer API key.
3. **CAPTCHA HTML** — `inject_captcha_html.yml` pushes `captcha.html` into the Traefik PVC; this task is **tagged `never`** (opt-in only) so the default run skips it.
4. **Traefik Middleware** — applies a `traefik.io/v1alpha1` `Middleware` named **`crowdsec-bouncer`** (`crowdsec` in `kube-system`) configured with the bouncer key, stream mode, Turnstile (`captchaProvider: turnstile` + site/secret keys), and a **Redis cache at `redis.tools:6379`**.
5. **Restart Traefik** — scales the Traefik Deployment to 0 then back to 1 (with a `rescue`/`always` guard guaranteeing it scales back up) to load the new middleware.
| Var | Default | Meaning |
| --- | --- | --- |
| `traefik_pvc_name` | `traefik` | The PVC the (tagged-`never`) captcha.html inject targets. |
| Secret | Source |
| --- | --- |
| Turnstile sitekey + secret | Vault `kvv2/cms/factory/turnstile`, surfaced via `VaultStaticSecret`. |
| Bouncer API key | Minted at runtime by `cscli bouncers add`. |
### `pihole`
[`playbooks/dns/roles/pihole`](../../../../ansible/arcodange/factory/playbooks/dns/roles/pihole) · runs on the `pihole` group (`pi1`, `pi3`) in the `01 · System` stage. It configures **HA DNS**: two Pi-hole nodes kept in sync. The [`tasks/main.yml`](../../../../ansible/arcodange/factory/playbooks/dns/roles/pihole/tasks/main.yml) includes three task files:
1. **`ha_pihole_setup.yml`** — **waits for a manual Pi-hole install** (it prints the `curl … | sudo bash` command and `wait_for`s `/etc/pihole/pihole-FTL.db` for up to 10 minutes; Pi-hole itself is not installed by Ansible). It then patches [`pihole.toml`](../../../../ansible/arcodange/factory/playbooks/dns/roles/pihole/tasks/ha_pihole_setup.yml) (listen port, `listeningMode = "ALL"`, enable `/etc/dnsmasq.d`) and writes three dnsmasq drop-ins: `10-custom-rules.conf` (wildcard `address=/fqdn/ip` from `pihole_custom_dns`), `20-rpis.conf` (`<host>.home` → `preferred_ip` for every Pi), and `99-upstream.conf` (explicit upstream from `pihole_upstream_dns`).
2. **`gravity_setup.yml`** — sets up **Gravity Sync** between the two nodes: a `pihole_gravity` system user with a freshly **rotated ed25519 keypair** each run, cross-authorized `authorized_keys`, full **sudo** (`/etc/sudoers.d/gravity-sync`), the installer, and a generated `gravity-sync.conf` (each node points `REMOTE_HOST` at the other), then runs the sync.
3. **`client_setup.yml`** — points DNS clients at the Pi-hole pair by editing `/etc/resolv.conf` (insert nameservers after `search`) and the active NetworkManager connections via `nmcli` (per-interface `ipv4.dns` + `dns-priority`, eth0 50 / wlan0 100).
| Var | Default | Meaning |
| --- | --- | --- |
| `pihole_primary` | `pi1` | First node; the other is derived as the secondary. |
| `pihole_ports` | `8081o,443os,…` | Web-interface listen ports. |
| `pihole_custom_dns` | `{}` | FQDN→IP wildcard records (validated as IPv4). |
| `pihole_upstream_dns` | `[8.8.8.8, 1.1.1.1, 8.8.4.4]` | Explicit upstreams (avoids DHCP-provided DNS). |
> [!WARNING]
> This role is **not fully idempotent**: it depends on a human running the Pi-hole installer first, it **rotates the gravity SSH key on every run**, and it grants the `pihole_gravity` user passwordless **sudo ALL**. Treat reruns as state-changing, not no-ops.
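
As a rough illustration of how `pihole_custom_dns` could turn into the `10-custom-rules.conf` drop-in, the sketch below renders one dnsmasq `address=/fqdn/ip` line per record and rejects non-IPv4 values. It is not the role's template: `render_custom_rules` is a hypothetical helper and the record shown uses a documentation-range address.

```python
import ipaddress
from typing import Dict, List


def render_custom_rules(custom_dns: Dict[str, str]) -> List[str]:
    """Render dnsmasq wildcard records, validating every value as IPv4."""
    lines = []
    for fqdn, ip in sorted(custom_dns.items()):
        # Raises ValueError (AddressValueError) when the value is not a valid IPv4 address.
        ipaddress.IPv4Address(ip)
        lines.append(f"address=/{fqdn}/{ip}")
    return lines


# Usage with a made-up record.
rules = render_custom_rules({"example.lab": "192.0.2.10"})
```

In dnsmasq, an `address=/domain/ip` line answers for the domain and all of its subdomains, which is why the table calls these wildcard records.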
### `deploy_docker_compose`
[`roles/deploy_docker_compose`](../../../../ansible/arcodange/factory/roles/deploy_docker_compose) · shared. This is the **generic compose mechanism** every app deploy builds on. The caller passes a `dockercompose_content` dict; the [`tasks/main.yml`](../../../../ansible/arcodange/factory/roles/deploy_docker_compose/tasks/main.yml):
1. Derives `app_name` from `dockercompose_content.name` and creates `/<root_path>/<partition>/<app_name>/` plus `data/` and `scripts/`.
2. Writes the compose file with `to_nice_yaml` and **validates it** with `validate: 'docker compose -f %s config'` — a bad compose fails the task before anything is written live.
3. Writes a small wrapper script `scripts/docker-compose` that runs `docker compose -f <the file> "$@"`, so the app can be driven without remembering the path.
| Var | Default | Meaning |
| --- | --- | --- |
| `app_name` | `(dockercompose_content.name)` | App directory name. |
| `app_owner` / `app_group` | `pi` / `docker` | File ownership. |
| `root_path` | `/home/pi/arcodange` | Base path; `partition` (`docker_composes`) nests under it. |
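
The directory layout the role derives can be sketched like this, assuming the defaults from the table above. `compose_layout` is a hypothetical helper and the compose file name is deliberately left out because the text does not state it.

```python
from pathlib import PurePosixPath
from typing import Dict


def compose_layout(
    dockercompose_content: dict,
    root_path: str = "/home/pi/arcodange",
    partition: str = "docker_composes",
) -> Dict[str, PurePosixPath]:
    """Derive the per-app directories from the compose dict's `name` key."""
    app_name = dockercompose_content["name"]
    app_dir = PurePosixPath(root_path) / partition / app_name
    return {
        "app_dir": app_dir,
        "data_dir": app_dir / "data",
        "scripts_dir": app_dir / "scripts",
        "wrapper": app_dir / "scripts" / "docker-compose",
    }


# Usage: the name comes straight from the compose content.
layout = compose_layout({"name": "gitea"})
```

With the defaults, the `gitea` app lands in `/home/pi/arcodange/docker_composes/gitea`, which matches the compose source path quoted for `deploy_gitea`.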
---
## Tier 2 — building-block roles
Smaller roles, mostly Gitea/forge plumbing and one-shot helpers. Shared roles live in [`roles/`](../../../../ansible/arcodange/factory/roles); `deploy_gitea`/`deploy_postgresql` are nested under [`playbooks/setup/roles/`](../../../../ansible/arcodange/factory/playbooks/setup/roles).
| Role | Purpose | Key vars / notes | Secrets |
| --- | --- | --- | --- |
| [`gitea_repo`](../../../../ansible/arcodange/factory/roles/gitea_repo) | Ensure a repo exists across Gitea + GitHub + GitLab and add **8h push mirrors** (`sync_on_commit: true`) to GitHub/GitLab. | Creates missing repos on each forge; mirror URLs + namespace IDs in [`vars/main.yml`](../../../../ansible/arcodange/factory/roles/gitea_repo/vars/main.yml). | `github_api_token`, `gitlab_api_token` (from `gitea_vault`). |
| [`gitea_token`](../../../../ansible/arcodange/factory/roles/gitea_token) | Generate / replace / delete a Gitea access token via `docker exec … gitea admin user generate-access-token`. | Stores the raw token in the fact named by `gitea_token_fact_name`; `gitea_token_replace` / `gitea_token_delete` toggles; scopes default to `write:admin,organization,package,repository,user`. | The minted token itself (a fact, not persisted). |
| [`gitea_secret`](../../../../ansible/arcodange/factory/roles/gitea_secret) | `PUT` a Gitea **Actions secret** at user or org scope. | `gitea_secret_name` / `_value`; `gitea_owner_type` (`user`\|`org`) selects the API path. | `gitea_api_token` (Authorization). |
| [`gitea_sync`](../../../../ansible/arcodange/factory/roles/gitea_sync) | List repos on all **three forges**, diff them, and call `gitea_repo` for the repos missing somewhere. | Computes `repos_incomplete` as all repos minus those common to every forge, then loops `gitea_repo` over the gaps. | GitHub/GitLab/Gitea API tokens. |
| [`traefik_certs`](../../../../ansible/arcodange/factory/roles/traefik_certs) | Extract the live **`*.arcodange.lab`** cert from Traefik's `acme.json`. | `kubectl exec` into Traefik → `jq` the LetsEncrypt wildcard cert → `traefik_cert_pem` fact; no-op if already set. | — (reads in-cluster acme.json). |
| [`playwright`](../../../../ansible/arcodange/factory/roles/playwright) | Run a Playwright browser-automation script in Docker. | Builds `playwright:<version>` (default `1.47.0`) from `files/`, runs the script with `playwright_env` injected as `-e`; default script `loginGitea.js`. Used by `hashicorp_vault` for the OIDC app setup. | Script-specific env (e.g. Gitea admin creds). |
| [`deploy_gitea`](../../../../ansible/arcodange/factory/playbooks/setup/roles/deploy_gitea) | Deploy Gitea: template [`app.ini.j2`](../../../../ansible/arcodange/factory/playbooks/setup/roles/deploy_gitea/tasks/main.yml), `docker compose up`, then **health-check `:3000`** until ready. | Compose source is `/home/pi/arcodange/docker_composes/gitea`; admin user `arcodange`. | (consumes the vaulted Gitea compose env). |
| [`deploy_postgresql`](../../../../ansible/arcodange/factory/playbooks/setup/roles/deploy_postgresql) | Deploy Postgres via compose, then per-app **create DB + user** ([`create_db_and_user.yml`](../../../../ansible/arcodange/factory/playbooks/setup/roles/deploy_postgresql/tasks/create_db_and_user.yml)). | Waits on `pg_isready`, loops `applications_databases` (`{app: {db_name, db_user, db_password}}`). | Per-app DB passwords from `applications_databases`. |
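
The three-forge diff behind `gitea_sync` reduces to set arithmetic. The sketch below assumes `repos_incomplete` means "every repo seen on any forge, minus the repos present on all of them". `incomplete_repos` and the repo names are hypothetical.

```python
from typing import Dict, Set


def incomplete_repos(repos_by_forge: Dict[str, Set[str]]) -> Set[str]:
    """Return repos that exist on at least one forge but not on every forge."""
    if not repos_by_forge:
        return set()
    forge_sets = list(repos_by_forge.values())
    all_repos = set().union(*forge_sets)
    common = set.intersection(*forge_sets)
    return all_repos - common


# Usage with made-up repo names.
gaps = incomplete_repos(
    {
        "gitea": {"a", "b", "c"},
        "github": {"a", "b"},
        "gitlab": {"a", "c"},
    }
)
```

Here `a` is on all three forges and is skipped, while `b` and `c` are each missing from one forge and would be handed to `gitea_repo`.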
---
## Role dependency view
How the roles relate: shared building blocks feed the `setup`-stage app deploys, and a few platform-service roles include shared roles directly.
```mermaid
%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'14px'}}}%%
flowchart TD
classDef shared fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb;
classDef setup fill:#1e4620,stroke:#22c55e,color:#f9fafb;
classDef platform fill:#4a2c1e,stroke:#f59e0b,color:#f9fafb;
dc["deploy_docker_compose<br/>generic compose writer"]:::shared
pw["playwright<br/>browser automation"]:::shared
gt["gitea_token<br/>mint access token"]:::shared
gs["gitea_secret<br/>PUT Actions secret"]:::shared
gr["gitea_repo<br/>mirror to GitHub/GitLab"]:::shared
gsync["gitea_sync<br/>diff 3 forges"]:::shared
tc["traefik_certs<br/>extract lab cert"]:::shared
dpg["deploy_postgresql"]:::setup
dgi["deploy_gitea"]:::setup
hv["hashicorp_vault"]:::platform
sca["step_ca"]:::platform
cs["crowdsec"]:::platform
ph["pihole"]:::platform
gsync --> gr
hv --> pw
hv --> gs
dc -. "used by app deploys" .-> dpg
dc -. "used by app deploys" .-> dgi
```
1. **`gitea_sync` → `gitea_repo`** — the sync role include-loops `gitea_repo` for each repo missing from one of the three forges.
2. **`hashicorp_vault` → `playwright`** — Vault's OIDC setup drives Gitea through Playwright to create the OAuth app.
3. **`hashicorp_vault` → `gitea_secret`** — the rendered `vault_oauth__sh_b64` is published as a Gitea Actions secret at org and user scope.
4. **`deploy_docker_compose` → `deploy_postgresql` / `deploy_gitea`** — the generic compose writer is the substrate the `setup`-stage app deploys lean on.
5. **`step_ca`, `crowdsec`, `pihole`** stand alone — they configure their own services (PKI, WAF, DNS) without including other roles.
---
## See also
- [Inventory & variables](inventory.md) — the groups (`gitea`, `postgres`, `step_ca`, `pihole`) these roles target, and the vaulted `group_vars` they read.
- [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) — where `hashicorp_vault`'s OIDC tokens and the `kvv2/cms/factory/turnstile` path fit the broader secret model.
- [Storage & recovery](../../lab-ecosystem/storage-and-recovery.md) — how the compose `data/` dirs and the step-ca state relate to backup and disaster recovery.


@@ -0,0 +1,102 @@
[vibe](../../../README.md) > [Guidebooks](../../README.md) > [Factory provisioning](../README.md) > **OpenTofu**
# OpenTofu — factory provisioning
> [!NOTE]
> **Status:** ✅ active · **Last Updated:** 2026-06-23
> **Upstream:** [Factory provisioning hub](../README.md) · [Lab ecosystem · 01 factory](../../lab-ecosystem/01-factory.md)
> **Downstream:** [factory iac](factory-iac.md) · [postgres iac](postgres-iac.md) · [CI apply flow](ci-apply-flow.md)
> **Related:** [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) · [Storage & recovery](../../lab-ecosystem/storage-and-recovery.md) · [Naming conventions](../../lab-ecosystem/naming-conventions.md) · [ADR-0001 safe prod-like environment](../../../ADR/0001-safe-prod-like-environment.md)
OpenTofu is the **declarative half** of the factory. It provisions everything that lives *outside* the K3s cluster: Gitea repos & CI users, Vault policies, Cloudflare DNS, OVH domains, a GCS backup bucket, and the PostgreSQL roles/databases on pi2. The imperative half (the cluster itself) is built by [Ansible](../ansible/README.md).
OpenTofu is pinned to **`1.8.2`** in CI (`OPENTOFU_VERSION`).
---
## Two independent state roots
There are **two separate Terraform/OpenTofu roots**, each with its own `backend.tf`, its own GCS state prefix, its own provider set, and its own CI workflow. They never share state and can be applied independently.
| Root | Code path | State backend (GCS) | Triggered by |
| --- | --- | --- | --- |
| **factory iac** | [`iac/`](../../../../iac) | `gs://arcodange-tf/factory/main` | changes under `iac/**` → [`.gitea/workflows/iac.yaml`](../../../../.gitea/workflows/iac.yaml) |
| **postgres iac** | [`postgres/iac/`](../../../../postgres/iac) | `gs://arcodange-tf/factory/postgres` | changes under `postgres/**` → [`.gitea/workflows/postgres.yaml`](../../../../.gitea/workflows/postgres.yaml) |
> [!NOTE]
> Both roots share the same GCS **bucket** (`arcodange-tf`) but live under **distinct prefixes** (`factory/main` vs `factory/postgres`), so their state objects never collide.
---
## Providers
| Provider | Version | Endpoint / scope | Auth |
| --- | --- | --- | --- |
| `go-gitea/gitea` | `0.6.0` | `https://gitea.arcodange.lab` | `GITEA_TOKEN` env var |
| `vault` | `4.4.0` | `https://vault.arcodange.lab` | JWT login — mount `gitea_jwt`, role `gitea_cicd` |
| `google` | `7.0.1` | project `arcodange`, region `US-EAST1` | `GOOGLE_CREDENTIALS` (factory) / `GOOGLE_BACKEND_CREDENTIALS` (postgres backend) |
| `cloudflare/cloudflare` | `~> 5` | DNS / IAM | `CLOUDFLARE_API_TOKEN` env var |
| `ovh/ovh` | `2.8.0` | endpoint `ovh-eu` | `OVH_APPLICATION_KEY` / `OVH_APPLICATION_SECRET` / `OVH_CONSUMER_KEY` |
| `cyrilgdn/postgresql` | `1.24.0` | `192.168.1.202` (pi2), `superuser` | `POSTGRES_USERNAME` / `POSTGRES_PASSWORD` (TF vars) |
The first five providers belong to the **factory iac** root ([`iac/providers.tf`](../../../../iac/providers.tf)); the **postgres iac** root ([`postgres/iac/providers.tf`](../../../../postgres/iac/providers.tf)) declares only `postgresql` + `vault`. Both roots configure the `vault` provider identically (JWT, mount `gitea_jwt`, role `gitea_cicd`).
---
## The Vault-JWT auth model
Neither root carries long-lived Vault credentials. Instead CI mints a short-lived Gitea OIDC token and exchanges it for Vault access:
1. A first job decodes the base64 secret **`vault_oauth__sh_b64`** and runs it (`base64 -d | bash`), producing a **Gitea OIDC JWT** as a job output (`gitea_vault_jwt`).
2. That JWT is exported into the apply job as **`TERRAFORM_VAULT_AUTH_JWT`**.
3. The `vault` provider's `auth_login_jwt` block consumes it against mount `gitea_jwt` / role `gitea_cicd`, yielding a scoped Vault token used to read the per-provider secrets (Google creds, Gitea token, Cloudflare token, OVH app keys, Postgres creds).
See [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) for the full Vault policy/mount design and [CI apply flow](ci-apply-flow.md) for the job-by-job walkthrough.
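
When debugging why Vault rejects a login, it can help to look at the claims inside the token. The sketch below assumes the Gitea OIDC token is a standard compact JWT (three base64url segments) and decodes the payload **without verifying the signature**, so it is for inspection only. The token and its claims are fabricated for the example.

```python
import base64
import json


def _b64url(data: bytes) -> str:
    """Encode bytes as unpadded base64url, the form used inside a JWT."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def decode_jwt_payload(token: str) -> dict:
    """Return the claims of a compact JWT without checking its signature."""
    _header_b64, payload_b64, _signature = token.split(".")
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


# Build a fake token so the example runs without any real credentials.
fake_token = ".".join(
    [
        _b64url(json.dumps({"alg": "none"}).encode()),
        _b64url(json.dumps({"sub": "ci-job", "exp": 1750000000}).encode()),
        "signature-placeholder",
    ]
)
claims = decode_jwt_payload(fake_token)
```

Vault itself performs the real validation against the `gitea_jwt` mount. A helper like this would only show which claims the role binding has to match.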
---
## CI apply flow
Both workflows share the same two-job shape: authenticate, then apply. The trigger paths differ (`iac/**` vs `postgres/**`) but the structure is identical.
```mermaid
%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#1f2937','primaryTextColor':'#f9fafb','lineColor':'#6b7280','fontSize':'14px'}}}%%
flowchart TD
classDef trigger fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb;
classDef job fill:#1e4620,stroke:#22c55e,color:#f0fdf4;
classDef danger fill:#5f1e1e,stroke:#ef4444,color:#fef2f2;
push["push / PR touching<br/> iac/** or postgres/**"]:::trigger
auth["job: gitea_vault_auth<br/>decode vault_oauth__sh_b64<br/> mint Gitea OIDC JWT"]:::job
tofu["job: tofu<br/>read Vault secrets via JWT<br/> set provider env vars"]:::job
apply["dflook/terraform-apply@v1<br/> auto_approve: true"]:::danger
push --> auth
auth -- "gitea_vault_jwt output" --> tofu
tofu --> apply
```
1. A **push or PR** that touches files under `iac/**` (factory) or `postgres/**` (postgres) starts the matching workflow; `workflow_dispatch` allows a manual run.
2. The **`gitea_vault_auth`** job decodes `vault_oauth__sh_b64` and emits the Gitea OIDC JWT as `gitea_vault_jwt`.
3. The **`tofu`** job (`needs: gitea_vault_auth`) sets `TERRAFORM_VAULT_AUTH_JWT` from that output, reads the provider secrets out of Vault, and prepares the homelab CA cert (`VAULT_CACERT`).
4. The job runs **`dflook/terraform-apply@v1`** against the root's `path` (`iac` or `postgres/iac`) with **`auto_approve: true`**.
> [!CAUTION]
> **Applies are auto-approve.** There is no manual plan-review gate — once a change to `iac/**` or `postgres/**` lands on `main`, CI applies it to the real Gitea, Vault, Cloudflare, OVH, GCS, and PostgreSQL targets without further confirmation. Treat every merge as a production change and review the diff *before* merging, not after. This trade-off is recorded in [ADR-0001 · safe prod-like environment](../../../ADR/0001-safe-prod-like-environment.md).
---
## Index
| Page | Covers | State |
| --- | --- | --- |
| [factory iac](factory-iac.md) | `iac/` root — Gitea, Vault, Google/GCS backup, Cloudflare, OVH | ✅ |
| [postgres iac](postgres-iac.md) | `postgres/iac/` root — PostgreSQL roles & databases on pi2 | ✅ |
| [CI apply flow](ci-apply-flow.md) | Both Gitea workflows, the Vault-JWT exchange, auto-approve apply | ✅ |
---
## Maintenance rule
> [!IMPORTANT]
> **Alter a `.tf` resource, a provider version, a state backend, or a CI workflow → update the matching page here in the same change.** Adding a resource to `iac/`, changing the `postgres/iac/` application list, bumping a provider pin, or editing `iac.yaml`/`postgres.yaml` all change what the pages above describe — edit the page in the PR that changes the code, never as a follow-up. This is the [factory-provisioning maintenance rule](../README.md#maintenance-rule) applied to the OpenTofu half; the guidebooks' full [Rules to contribute](../../README.md#rules-to-contribute) also apply.


@@ -0,0 +1,114 @@
[vibe](../../../README.md) > [Guidebooks](../../README.md) > [Factory provisioning](../README.md) > [OpenTofu](README.md) > **CI apply flow**
# CI apply flow
> [!NOTE]
> **Status:** ✅ active · **Last Updated:** 2026-06-23
> **Upstream:** [`.gitea/workflows/iac.yaml`](../../../../.gitea/workflows/iac.yaml), [`.gitea/workflows/postgres.yaml`](../../../../.gitea/workflows/postgres.yaml)
> **Downstream:** [factory iac](factory-iac.md), [postgres iac](postgres-iac.md)
> **Related:** [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) · [ADR-0001 · Safe prod-like environment](../../../ADR/0001-safe-prod-like-environment.md) · [QA strategy](../../../PRD/safe-prod-like-environment/qa-strategy.md)
Two Gitea Actions workflows turn every commit that touches the OpenTofu code into a live `apply`. `IAC` ([`.gitea/workflows/iac.yaml`](../../../../.gitea/workflows/iac.yaml)) drives the factory infrastructure under [`iac/`](../../../../iac/); `Postgres` ([`.gitea/workflows/postgres.yaml`](../../../../.gitea/workflows/postgres.yaml)) drives the database stack under [`postgres/iac/`](../../../../postgres/iac/). They share the same two-job shape: a short OIDC-auth job feeds a Vault JWT to a `tofu` job that reads secrets and runs `terraform apply`.
> [!CAUTION]
> **`auto_approve: true` means every merge to `main` applies immediately — there is no plan-gate.** The `dflook/terraform-apply@v1` step skips the interactive approval, so any change that lands on `main` (or any matched `push`) rewrites real cloud and homelab state without a human reviewing the plan. Mitigations are entirely upstream of CI: (1) **mandatory code review** on the PR before merge, and (2) **least-privilege Vault policies** on the `gitea_cicd` role so a runaway apply can only touch the resources its token is scoped to. See [ADR-0001](../../../ADR/0001-safe-prod-like-environment.md): the sandbox lane runs the *same* tofu but **plan-only** against a `sandbox/` state prefix and a throwaway DNS zone, so contributors can validate changes without an auto-apply.
## Triggers
Both workflows fire on the same three events; only the watched path globs differ.
| Event | `IAC` (factory) | `Postgres` |
| --- | --- | --- |
| `push` | `iac/*.tf`, `iac/*.tfvars`, `iac/**/*.tf`, `iac/**/*.tfvars` | `postgres/**/*.tf`, `postgres/**/*.tfvars` |
| `pull_request` | same globs (YAML anchor `*tofuPaths`) | same globs (YAML anchor `*postgresTofuPaths`) |
| `workflow_dispatch` | manual, no inputs | manual, no inputs |
> [!IMPORTANT]
> `concurrency` is keyed on `${{ github.ref }}-${{ github.workflow }}` with `cancel-in-progress: true`, so a newer push to the same branch cancels an in-flight run. A `pull_request` event also triggers the workflow, and the `apply` step still runs on it, so the safety contract is "review **before** merge", not "CI only plans on PRs".
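
To reason about which workflow a change set would start, the path filters in the table can be approximated with prefix and suffix checks. This is a simplification, not the runner's glob engine: it assumes `**` also matches files directly under the watched directory, and `triggered_workflows` plus the file names are hypothetical.

```python
from typing import List

TOFU_SUFFIXES = (".tf", ".tfvars")


def triggered_workflows(changed_files: List[str]) -> List[str]:
    """Return the workflow names whose path filters match any changed file."""
    hits = []
    if any(f.startswith("iac/") and f.endswith(TOFU_SUFFIXES) for f in changed_files):
        hits.append("IAC")
    if any(f.startswith("postgres/") and f.endswith(TOFU_SUFFIXES) for f in changed_files):
        hits.append("Postgres")
    return hits


# Usage with made-up change sets.
only_factory = triggered_workflows(["iac/cloudflare.tf"])
only_postgres = triggered_workflows(["postgres/iac/main.tf", "README.md"])
docs_only = triggered_workflows(["iac/README.md"])
```

A docs-only change under `iac/` starts nothing, because the filters watch `.tf` and `.tfvars` files rather than the whole directory.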
## Job 1 — `gitea_vault_auth`
Mints a Gitea OIDC token that Vault will trust. The whole job is one step:
```bash
echo -n "${{ secrets.vault_oauth__sh_b64 }}" | base64 -d | bash
```
| Field | Value |
| --- | --- |
| Runner | `ubuntu-latest` |
| Secret consumed | `vault_oauth__sh_b64` — a base64-encoded shell script |
| Step id | `gitea_vault_jwt` |
| Output | `gitea_vault_jwt`, set from `steps.gitea_vault_jwt.outputs.id_token` |
The decoded script asks Gitea for an OIDC `id_token` and emits it as a step output. The `tofu` job declares `needs: [gitea_vault_auth]` so it receives `needs.gitea_vault_auth.outputs.gitea_vault_jwt`.
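
The secret is just a shell script wrapped in base64, so the encode side (done when the secret is published) and the decode side (done by this job) are symmetric. The sketch below shows that round trip with a stand-in script. It never executes the script, and `encode_script` / `decode_script` are hypothetical names.

```python
import base64


def encode_script(script: str) -> str:
    """Encode a script as single-line base64, suitable for storing as a secret value."""
    return base64.b64encode(script.encode("utf-8")).decode("ascii")


def decode_script(secret_b64: str) -> str:
    """Reverse of encode_script: what `base64 -d` yields before it is piped to bash."""
    return base64.b64decode(secret_b64).decode("utf-8")


# Stand-in content; the real script is rendered from oidc_jwt_token.sh.j2.
stand_in = "#!/usr/bin/env bash\necho 'id_token=placeholder'\n"
secret_value = encode_script(stand_in)
roundtrip = decode_script(secret_value)
```

Storing the script as base64 keeps newlines and quoting intact inside a single secret value. The trade-off is that anyone able to update that secret controls what the auth job executes.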
## Job 2 — `tofu`
| Field | `IAC` | `Postgres` |
| --- | --- | --- |
| Job name | `Tofu` | `Tofu - Postgres` |
| `needs` | `gitea_vault_auth` | `gitea_vault_auth` |
| `OPENTOFU_VERSION` | `1.8.2` | `1.8.2` |
| `TERRAFORM_VAULT_AUTH_JWT` | `needs.gitea_vault_auth.outputs.gitea_vault_jwt` | same |
| `VAULT_CACERT` | `${{ github.workspace }}/homelab.pem` | same |
| Apply path | `iac` | `postgres/iac` |
Step order inside the job:
1. **read vault secret** — the shared `*vault_step` anchor (see below).
2. **`actions/checkout@v4`** — pull the repo into the workspace.
3. **prepare vault self signed cert** — `echo -n "${{ secrets.HOMELAB_CA_CERT }}" | base64 -d > $VAULT_CACERT`, writing the homelab CA to `homelab.pem` so the runner trusts `https://vault.arcodange.lab`.
4. **terraform apply** — `dflook/terraform-apply@v1` with the path above and `auto_approve: true`.
### Vault secret reads (`*vault_step`)
The `read vault secret` step uses [`arcodange-org/vault-action`](https://gitea.arcodange.lab/arcodange-org/vault-action), authenticating with `method: jwt`, `path: gitea_jwt`, `role: gitea_cicd`, `url: https://vault.arcodange.lab`, `caCertificate: ${{ secrets.HOMELAB_CA_CERT }}`, and `jwtGiteaOIDC` set to the auth job's output. The secrets it exports into the job env differ per workflow:
| Workflow | Vault path | Selector | Exported as |
| --- | --- | --- | --- |
| `IAC` | `kvv1/google/credentials` | `credentials` | `GOOGLE_CREDENTIALS` |
| `IAC` | `kvv1/admin/gitea` | `token` | `GITEA_TOKEN` |
| `IAC` | `kvv1/admin/cloudflare` | `iam_token` | `CLOUDFLARE_API_TOKEN` |
| `IAC` | `kvv1/admin/ovh/app` | `*` (all keys) | `OVH_*` |
| `Postgres` | `kvv1/google/credentials` | `credentials` | `GOOGLE_BACKEND_CREDENTIALS` |
| `Postgres` | `kvv1/postgres/credentials` | `*` (all keys) | `TF_VAR_postgres_*` |
`GOOGLE_CREDENTIALS` / `GOOGLE_BACKEND_CREDENTIALS` authenticate the GCS state backend; the `TF_VAR_postgres_*` fan-out feeds the Postgres module's input variables directly. See [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) for how the `gitea_cicd` role and KV v1 mounts are provisioned.
## End-to-end flow
```mermaid
%%{init: {'theme': 'base'}}%%
flowchart TD
push["push / PR / workflow_dispatch<br>on iac/** or postgres/** .tf .tfvars"] --> auth["job: gitea_vault_auth<br>base64 -d | bash -&gt; Gitea OIDC id_token"]
auth -->|"gitea_vault_jwt output"| tofu["job: tofu<br>OPENTOFU_VERSION 1.8.2"]
tofu --> readvault["read vault secret<br>vault-action jwt role gitea_cicd"]
readvault -->|"GOOGLE_CREDENTIALS, TF_VAR_postgres_*, ..."| init["tofu init<br>GCS backend, state prefix"]
init --> apply["dflook/terraform-apply@v1<br>auto_approve: true"]
apply --> state["state updated in GCS<br>real cloud + homelab mutated"]
classDef trigger fill:#1f3a5f,stroke:#7fb0ff,color:#eaf2ff;
classDef job fill:#3a2f5f,stroke:#b39dff,color:#f3eeff;
classDef secret fill:#5f3a2f,stroke:#ffb38a,color:#fff1e8;
classDef danger fill:#5f1f2f,stroke:#ff8a9d,color:#ffe8ec;
class push trigger;
class auth,tofu,init job;
class readvault secret;
class apply,state danger;
```
1. A **push**, **pull_request**, or **workflow_dispatch** event matching the `iac/**` or `postgres/**` path globs starts the workflow.
2. Job **`gitea_vault_auth`** runs `base64 -d | bash` on the `vault_oauth__sh_b64` secret to obtain a Gitea OIDC `id_token`, published as the `gitea_vault_jwt` output.
3. Job **`tofu`** (gated by `needs: gitea_vault_auth`) starts on `ubuntu-latest` with `OPENTOFU_VERSION 1.8.2` and `TERRAFORM_VAULT_AUTH_JWT` set to that output.
4. The **read vault secret** step exchanges the JWT (role `gitea_cicd`, path `gitea_jwt`) for the workflow's secrets and exports them as env vars (`GOOGLE_CREDENTIALS` / `GOOGLE_BACKEND_CREDENTIALS`, `GITEA_TOKEN`, `CLOUDFLARE_API_TOKEN`, `OVH_*`, or `TF_VAR_postgres_*`).
5. **`tofu init`** configures the GCS backend, binding the working dir to its state prefix using the Google credentials just read.
6. **`dflook/terraform-apply@v1`** runs against `iac` (or `postgres/iac`) with `auto_approve: true` — no plan-gate.
7. The **state** in GCS is updated and the real cloud + homelab resources are mutated to match the committed code.
## Related pages
- [factory iac](factory-iac.md) — what the `iac/` stack provisions (the `IAC` workflow's target).
- [postgres iac](postgres-iac.md) — the `postgres/iac/` database stack (the `Postgres` workflow's target).
- [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) — the `gitea_cicd` role, OIDC trust, and KV mounts behind every secret read here.
- [ADR-0001 · Safe prod-like environment](../../../ADR/0001-safe-prod-like-environment.md) — the sandbox lane runs the same tofu plan-only against a `sandbox/` state prefix and a throwaway zone.


@@ -0,0 +1,148 @@
[vibe](../../../README.md) > [Guidebooks](../../README.md) > [Factory provisioning](../README.md) > [OpenTofu](README.md) > **factory iac**
# factory iac — the `iac/` state root
> [!NOTE]
> **Status:** ✅ active · **Last Updated:** 2026-06-23
> **Code:** [`iac/`](../../../../iac) · **State backend:** `gs://arcodange-tf/factory/main` ([`iac/backend.tf`](../../../../iac/backend.tf))
> **Upstream:** [OpenTofu hub](README.md) · [Factory provisioning hub](../README.md) · [Lab ecosystem · 01 factory](../../lab-ecosystem/01-factory.md)
> **Related:** [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) · [Storage & recovery](../../lab-ecosystem/storage-and-recovery.md) · [Naming conventions](../../lab-ecosystem/naming-conventions.md) · [CI apply flow](ci-apply-flow.md) · [postgres iac](postgres-iac.md) · [ADR-0001 safe prod-like environment](../../../ADR/0001-safe-prod-like-environment.md)
The `iac/` root provisions everything that lives **outside** the K3s cluster: the Cloudflare R2 bucket that holds the *cms* repo's OpenTofu state, the per-service Cloudflare and OVH API tokens consumed by the [cms](https://gitea.arcodange.lab/arcodange-org/cms) repo, a restricted Gitea CI user for reading private module repos, and the GCS bucket that backs up Longhorn volumes. Each provisioned credential is written **both** to a Gitea Actions secret (where the consuming workflow expects it) **and** to a Vault path (the durable source of truth; see [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md)).
This root's state lives at `gs://arcodange-tf/factory/main` and is applied by [`.gitea/workflows/iac.yaml`](../../../../.gitea/workflows/iac.yaml) on any change under `iac/**` — see [CI apply flow](ci-apply-flow.md) for the job-by-job walkthrough.
---
## Providers
Declared in [`iac/providers.tf`](../../../../iac/providers.tf).
| Provider | Source | Version | Endpoint / scope | Auth |
| --- | --- | --- | --- | --- |
| `gitea` | `go-gitea/gitea` | `0.6.0` | `https://gitea.arcodange.lab` | `GITEA_TOKEN` env var |
| `vault` | `vault` | `4.4.0` | `https://vault.arcodange.lab` | JWT login — mount `gitea_jwt`, role `gitea_cicd` |
| `google` | `google` | `7.0.1` | project `arcodange`, region `US-EAST1` | `GOOGLE_CREDENTIALS` env var |
| `cloudflare` | `cloudflare/cloudflare` | `~> 5` | DNS / Pages / R2 / IAM | `CLOUDFLARE_API_TOKEN` env var |
| `ovh` | `ovh/ovh` | `2.8.0` | endpoint `ovh-eu` | `OVH_APPLICATION_KEY` / `OVH_APPLICATION_SECRET` / `OVH_CONSUMER_KEY` |
> [!NOTE]
> The Cloudflare account ID is **not** hard-coded — it is resolved at plan time from `data.cloudflare_account.arcodange` filtered on the account name `arcodange@gmail.com` ([`iac/cloudflare.tf`](../../../../iac/cloudflare.tf)) and exposed as `local.cloudflare_account_id`.
---
## Cloudflare — R2 backend bucket & service tokens
Defined in [`iac/cloudflare.tf`](../../../../iac/cloudflare.tf). Two tokens are minted through the [`modules/cloudflare_token`](#the-cloudflare_token-module) mechanism: one scoped to the R2 state bucket, one broad token handed to the cms repo.
| Resource | Type | Identity / scope | Secret destination |
| --- | --- | --- | --- |
| `cloudflare_r2_bucket.arcodange_tf` | R2 bucket | name `arcodange-tf`, jurisdiction `eu` | — (holds the *cms* repo's own OpenTofu state) |
| `module.cf_r2_arcodange_tf_token` | module → `cloudflare_account_token` | account: `Workers R2 Storage Read`, `Account Settings Read`; bucket: `Workers R2 Storage Bucket Item Write` | `vault_kv_secret.cf_r2_arcodange_tf` → `kvv1/cloudflare/r2/arcodange-tf` (S3 access key, secret, `https://<account_id>.eu.r2.cloudflarestorage.com` endpoint) |
| `vault_policy.cf_r2_arcodange_tf` | Vault policy | name `factory__cf_r2_arcodange_tf` | read on `kvv1/cloudflare/r2/arcodange-tf` **and** `kvv1/zoho/self_client` (the Zoho mail client is created manually) |
| `module.cf_arcodange_cms_token` | module → `cloudflare_account_token` | account-scope: `Pages Write`, `Account DNS Settings Write`, `Account Settings Read`, `Zone Write`, `Zone Settings Write`, `DNS Write`, `Cloudflare Tunnel Write`, `Turnstile Sites Write` | Gitea secrets `CLOUDFLARE_API_TOKEN` + `CLOUDFLARE_ACCOUNT_ID` on the `cms` repo; Vault `kvv1/cloudflare/cms/cf_arcodange_cms_token` |
The `cms` repo (`data.gitea_repo.cms`, owner `arcodange-org`) receives the broad token because it manages the public site end to end: Cloudflare Pages deploys, DNS records, zone settings, the Tunnel, and Turnstile.
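The R2 row in the table above stores S3-style credentials in Vault. According to the module outputs described on this page, the access key is the token id and the secret key is the SHA-256 of the token value. A minimal Python sketch of that derivation follows; the function name, the sample token values, and the sample account id are hypothetical and only illustrate the shape of the stored secret.

```python
import hashlib


def r2_s3_credentials(token_id: str, token_value: str, account_id: str) -> dict:
    """Derive S3-compatible R2 credentials from a Cloudflare API token.

    Mirrors the mapping described for the module's `r2_credentials` output:
    access_key_id = token id, secret_access_key = sha256(token value), hex-encoded.
    """
    return {
        "access_key_id": token_id,
        "secret_access_key": hashlib.sha256(token_value.encode("utf-8")).hexdigest(),
        # EU-jurisdiction endpoint, as listed in the table above.
        "endpoint": f"https://{account_id}.eu.r2.cloudflarestorage.com",
    }


# Hypothetical values, for illustration only.
creds = r2_s3_credentials("example-token-id", "example-token-value", "0123abcd")
print(creds["endpoint"])
```

Because the secret key is derived from the token value, replacing the token changes both halves of the S3 credential at once.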
> [!CAUTION]
> Both tokens are minted with **`expires_on = null`** — they never expire. A leaked `cf_arcodange_cms_token` grants standing DNS/Pages/Tunnel/Turnstile write on the whole account until manually revoked. There is no automatic rotation; rotation means tainting the module's `cloudflare_account_token` and re-applying.
---
## OVH — OAuth2 client for the cms domain
Defined in [`iac/ovh.tf`](../../../../iac/ovh.tf). A `CLIENT_CREDENTIALS` OAuth2 client lets the cms workflow edit DNS nameservers for `arcodange.fr`, constrained by an IAM policy.
| Resource | Type | Scope |
| --- | --- | --- |
| `ovh_me_api_oauth2_client.cms` | OAuth2 client | name `cms repo`, flow `CLIENT_CREDENTIALS` — "arcodange.fr management" |
| `ovh_iam_policy.cms` | IAM policy | name `cms_manager`; identity = the OAuth2 client; resources = account URN + `urn:v1:eu:resource:domain:arcodange.fr`; allow = a handful of `me/*` reads, all domain **READ** reference-actions (computed via `data.ovh_iam_reference_actions.domain`), plus `domain:apiovh:nameServer/edit` |
| `gitea_repository_actions_secret.ovh_cms_client_id` | Gitea secret | `OVH_CLIENT_ID` on the `cms` repo |
| `gitea_repository_actions_secret.ovh_cms_client_secret` | Gitea secret | `OVH_CLIENT_SECRET` on the `cms` repo |
| `vault_kv_secret.ovh_cms_token` | Vault secret | `kvv1/ovh/cms/app` → `client_id`, `client_secret`, `urn` |
> [!NOTE]
> The write surface is deliberately narrow: the policy grants **only** `nameServer/edit` for writes; everything else is read-only. This lets the cms pipeline point `arcodange.fr` at Cloudflare nameservers without exposing the broader OVH account.
---
## Gitea — restricted CI module-reader user
Defined in [`iac/gitea_tofu_ci_user.tf`](../../../../iac/gitea_tofu_ci_user.tf). A locked-down Gitea account whose SSH key lets CI clone private Terraform module repos without exposing a privileged token.
| Resource | Type | Notes |
| --- | --- | --- |
| `random_password.tofu` | password | length 32 — the user's login password |
| `gitea_user.tofu` | Gitea user | username `tofu_module_reader`, email `tofu-module-reader@arcodange.fake`, `restricted = true`, `visibility = private`, `prohibit_login = false` |
| `tls_private_key.tofu` | keypair | algorithm **ED25519** |
| `gitea_public_key.tofu` | SSH key | public half attached to `tofu_module_reader` |
| `vault_kv_secret.gitea_admin_token` | Vault secret | `kvv1/gitea/tofu_module_reader` → `ssh_private_key` + `ssh_public_key` |
> [!NOTE]
> Despite the Terraform resource name `gitea_admin_token`, the stored payload is the **SSH keypair**, not an admin token. The user is `restricted`, so it can only read repos it is explicitly granted access to.
---
## Google / GCS — Longhorn backup target
Defined in [`iac/gcs_backup.tf`](../../../../iac/gcs_backup.tf). A GCS bucket plus an HMAC key wired into Vault so the in-cluster Longhorn controller can pull S3-compatible backup credentials. See [Storage & recovery](../../lab-ecosystem/storage-and-recovery.md) for how this fits the cluster-recovery story.
| Resource | Type | Value |
| --- | --- | --- |
| `google_storage_bucket.longhorn_backup` | GCS bucket | name `arcodange-backup`, location `NAM4` (dual-region), `force_destroy = true`, `public_access_prevention = enforced` |
| `google_service_account.longhorn_backup` | service account | account_id `longhorn-backup` |
| `google_storage_bucket_iam_member.longhorn_backup` | IAM binding | `roles/storage.admin` on the bucket, member = the SA |
| `google_storage_hmac_key.longhorn_backup` | HMAC key | S3-compatible access_id + secret for that SA |
| `vault_kv_secret_v2.longhorn_gcs_backup` | Vault **KVv2** secret | mount `kvv2`, name `longhorn/gcs-backup`, `cas = 1`, `delete_all_versions = true`; keys: `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_ENDPOINTS = https://storage.googleapis.com` |
| `vault_policy.longhorn_gcs_backup` | Vault policy | name `longhorn-gcs-backup` — read on `kvv2/data/longhorn/gcs-backup` |
| `vault_kubernetes_auth_backend_role.longhorn` | Vault k8s auth role | role `longhorn`, bound SA `longhorn-vault-secret-reader` in namespace `longhorn-system`, audience `vault`, policy `longhorn-gcs-backup` |
The bound service-account name `longhorn-vault-secret-reader` must match the `VaultAuth` manifest in-cluster — that's the handshake that lets Longhorn read the HMAC creds at runtime.
> [!WARNING]
> The HMAC key is an **S3-compatible** credential and is weaker than a native GCS service-account key: it is a long-lived static secret with no key rotation built into this config, and `roles/storage.admin` grants full read/write/delete on the backup bucket. Combined with `force_destroy = true`, a state operation that destroys `arcodange-backup` will delete every Longhorn backup without prompting. Treat this bucket as critical and irreplaceable infrastructure.
---
## The `cloudflare_token` module
Source: [`iac/modules/cloudflare_token/`](../../../../iac/modules/cloudflare_token). This local module turns **human-readable permission names** into a working Cloudflare account token, so callers never hard-code permission-group UUIDs.
How it works ([`main.tf`](../../../../iac/modules/cloudflare_token/main.tf)):
1. It reads **all** available permission groups via `data.cloudflare_account_api_token_permission_groups_list`, then builds `local.permission_map`: `"<scope>:<name>" => id` (e.g. `"account:Pages Write" => <uuid>`), keyed by the last dotted segment of the group's scope.
2. Caller-supplied names (`var.permissions.account` / `var.permissions.bucket`) are looked up against that map; any name with no match lands in `local.missing_permissions` and trips a **`precondition`** that fails the apply with a clear "Permissions introuvables" (French for "permissions not found") error.
3. Policies are assembled dynamically — an `account` policy targeting `com.cloudflare.api.account.<id>` and, if `var.bucket` is set, a `bucket` policy targeting `com.cloudflare.edge.r2.bucket.<id>_<jurisdiction>_<name>`.
4. The `cloudflare_account_token.token` resource sets `expires_on = null` and **ignores** drift on `expires_on` and `policies` (the upstream permission IDs are unstable). Instead, a `null_resource.cloudflare_account_token_replace` hashes the **sorted permission names** into its triggers, and `replace_triggered_by` forces a fresh token whenever the *names* change — surviving id churn while still rotating on a real permission change.
5. Outputs ([`outputs.tf`](../../../../iac/modules/cloudflare_token/outputs.tf)): `token` (sensitive), `token_id`, `token_sha256`, and — when `var.bucket` is set — `r2_credentials` mapping `access_key_id = token.id` and `secret_access_key = sha256(token.value)` for S3-compatible R2 access.
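The lookup, validation, and replace-trigger logic in steps 1 to 4 can be modelled outside OpenTofu. The following Python sketch is not the module's code: the shape of the permission-group records (`id`, `name`, `scopes`), the sample ids, and all function names are assumptions made for illustration. It only shows why keying on names, rather than ids, survives upstream id churn.

```python
import hashlib

# Hypothetical permission-group records; the real data source may differ in shape.
GROUPS = [
    {"id": "id-pages", "name": "Pages Write",
     "scopes": ["com.cloudflare.api.account"]},
    {"id": "id-r2-item", "name": "Workers R2 Storage Bucket Item Write",
     "scopes": ["com.cloudflare.edge.r2.bucket"]},
]


def build_permission_map(groups):
    """Map "<scope>:<name>" to the group id, keyed by the last dotted scope segment."""
    permission_map = {}
    for group in groups:
        for scope in group["scopes"]:
            key = f"{scope.rsplit('.', 1)[-1]}:{group['name']}"
            permission_map[key] = group["id"]
    return permission_map


def resolve(permission_map, requested):
    """Return (ids grouped by scope, list of unmatched "<scope>:<name>" keys)."""
    resolved, missing = {}, []
    for scope, perms in requested.items():
        for name in perms:
            key = f"{scope}:{name}"
            if key in permission_map:
                resolved.setdefault(scope, []).append(permission_map[key])
            else:
                missing.append(key)
    return resolved, missing


def replace_trigger(requested):
    """Hash the sorted permission names so that only a name change alters the hash."""
    keys = sorted(f"{scope}:{name}" for scope, perms in requested.items() for name in perms)
    return hashlib.sha256("\n".join(keys).encode("utf-8")).hexdigest()


permission_map = build_permission_map(GROUPS)
resolved, missing = resolve(permission_map, {"account": ["Pages Write", "Pages Wrte"]})
if missing:
    # The module fails the apply at this point via its precondition.
    print("Permissions introuvables:", missing)
```

In this model, reordering the requested names leaves `replace_trigger` unchanged, and a change in the upstream ids does not affect it at all, which is the property the `null_resource` trigger relies on.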
---
## Vault layout: mixed KVv1 / KVv2
This root writes to **both** KV engines, which is easy to trip over.
| Path | Engine | Written by |
| --- | --- | --- |
| `kvv1/cloudflare/r2/arcodange-tf` | KVv1 (`vault_kv_secret`) | R2 backend token |
| `kvv1/cloudflare/cms/cf_arcodange_cms_token` | KVv1 | cms Cloudflare token |
| `kvv1/ovh/cms/app` | KVv1 | OVH OAuth2 client |
| `kvv1/gitea/tofu_module_reader` | KVv1 | CI user SSH key |
| `kvv2/longhorn/gcs-backup` | KVv2 (`vault_kv_secret_v2`) | Longhorn GCS HMAC |
> [!WARNING]
> Most secrets here use the **KVv1** engine (`vault_kv_secret`), but the Longhorn backup secret uses **KVv2** (`vault_kv_secret_v2`). The policy paths differ accordingly: KVv2 reads target `kvv2/data/longhorn/gcs-backup` (note the `/data/` segment), whereas KVv1 policies read the literal path. Because the engines are mixed, a policy copied from a KVv1 secret to a KVv2 secret (or the reverse) without adjusting the path silently grants nothing. See [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) for the engine-level design.
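The path difference can be captured in a small helper. This is a Python sketch under the assumption that policies are written against the mount and secret names listed in the table above; the function name `policy_path` is hypothetical.

```python
def policy_path(mount: str, secret: str, kv_version: int) -> str:
    """Return the path a Vault read policy must target for a KV secret.

    KVv1 policies use the literal path; KVv2 policies need a `data/` segment
    between the mount and the secret name.
    """
    if kv_version == 1:
        return f"{mount}/{secret}"
    if kv_version == 2:
        return f"{mount}/data/{secret}"
    raise ValueError(f"unsupported KV version: {kv_version}")


print(policy_path("kvv1", "ovh/cms/app", 1))
print(policy_path("kvv2", "longhorn/gcs-backup", 2))
```

Deriving the path from the engine version, instead of copying it by hand, avoids the mismatch the warning describes.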
---
## Outputs
The root exposes a single top-level `output "token"` (sensitive) = the cms Cloudflare token ([`iac/cloudflare.tf`](../../../../iac/cloudflare.tf)). Everything else is delivered side-effect-style into Gitea secrets and Vault paths rather than as Terraform outputs.
---
## See also
- [CI apply flow](ci-apply-flow.md) — how `iac/**` changes reach `gs://arcodange-tf/factory/main` via the Vault-JWT exchange and auto-approve apply.
- [postgres iac](postgres-iac.md) — the sibling root that provisions in-cluster PostgreSQL.
- [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) · [Storage & recovery](../../lab-ecosystem/storage-and-recovery.md) · [Naming conventions](../../lab-ecosystem/naming-conventions.md).

@@ -0,0 +1,116 @@
# postgres iac — the `postgres/iac/` state root
> [!NOTE]
> **Status:** ✅ active · **Last Updated:** 2026-06-23
> **Code:** [`postgres/iac/`](../../../../postgres/iac) · **State backend:** `gs://arcodange-tf/factory/postgres` ([`postgres/iac/backend.tf`](../../../../postgres/iac/backend.tf))
> **Upstream:** [OpenTofu hub](README.md) · [Factory provisioning hub](../README.md) · [Lab ecosystem · 01 factory](../../lab-ecosystem/01-factory.md)
> **Related:** [Naming conventions](../../lab-ecosystem/naming-conventions.md) · [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md) · [CI apply flow](ci-apply-flow.md) · [factory iac](factory-iac.md) · [ADR-0001 safe prod-like environment](../../../ADR/0001-safe-prod-like-environment.md)
The `postgres/iac/` root provisions **PostgreSQL roles, databases, and the pgbouncer auth function** on the live cluster database — one strand of the per-application `<app>` join key described in [Naming conventions](../../lab-ecosystem/naming-conventions.md). For each application it creates a non-login owner role, an `<app>` database owned by that role, and a `user_lookup()` function that lets PgBouncer authenticate against `pg_shadow`. A single `credentials_editor` login role (whose password is stored in Vault) is granted admin over every per-app role so that downstream tooling can mint application credentials without superuser rights.
This root's state lives at `gs://arcodange-tf/factory/postgres` and is applied by [`.gitea/workflows/postgres.yaml`](../../../../.gitea/workflows/postgres.yaml) on any change under `postgres/**` — see [CI apply flow](ci-apply-flow.md).
> [!CAUTION]
> This root runs as a **PostgreSQL superuser** ([`postgres/iac/providers.tf`](../../../../postgres/iac/providers.tf): `superuser = true`) pinned to the live database at **`192.168.1.202`** (pi2) **through PgBouncer**, with `sslmode = disable`. The provider can therefore **drop or alter live application databases**: an errant `tofu destroy` or a renamed `applications` entry will delete real data. And because the only route to Postgres is via PgBouncer on that host, **if PgBouncer is down OpenTofu cannot connect and no apply can run.** Treat every `postgres/**` merge as a production database change ([ADR-0001](../../../ADR/0001-safe-prod-like-environment.md)).
---
## Providers
Declared in [`postgres/iac/providers.tf`](../../../../postgres/iac/providers.tf).
| Provider | Source | Version | Connection | Auth |
| --- | --- | --- | --- | --- |
| `postgresql` | `cyrilgdn/postgresql` | `1.24.0` | host `192.168.1.202` (pi2), via PgBouncer, `sslmode = disable`, `superuser = true` | `var.POSTGRES_USERNAME` / `var.POSTGRES_PASSWORD` (TF vars from `TF_VAR_POSTGRES_*`, sourced from Vault in CI) |
| `vault` | `vault` | `4.4.0` | `https://vault.arcodange.lab` | JWT login — mount `gitea_jwt`, role `gitea_cicd` |
The two `POSTGRES_*` variables are declared `sensitive` in the same file; CI populates them from Vault as `TF_VAR_POSTGRES_USERNAME` / `TF_VAR_POSTGRES_PASSWORD` (see [CI apply flow](ci-apply-flow.md)).
---
## The application set
Everything in this root fans out over one variable. `var.applications` is a `set(string)` ([`variables.tf`](../../../../postgres/iac/variables.tf)) whose members are listed in [`terraform.tfvars`](../../../../postgres/iac/terraform.tfvars):
| `applications` member |
| --- |
| `webapp` |
| `erp` |
| `crowdsec` |
| `plausible` |
| `dance-lessons-coach` |
Adding an app to that list creates a full role + database + lookup-function bundle on the next apply; **removing** one would `DROP` the live database (see the caution above).
---
## The `credentials_editor` role
Defined in [`postgres/iac/main.tf`](../../../../postgres/iac/main.tf). A single login role, granted admin over every per-app role, whose credentials downstream tooling uses to provision application logins.
| Resource | Type | Detail |
| --- | --- | --- |
| `random_password.credentials_editor` | password | length 24, `override_special = "-:!+<>"` |
| `postgresql_role.credentials_editor` | role | `login = true`, `create_role = true`; `lifecycle { ignore_changes = [roles] }` so its grant membership isn't reverted |
| `vault_kv_secret.postgres_admin_credentials` | Vault **KVv1** secret | `kvv1/postgres/credentials_editor/credentials` → `username` + `password` |
---
## Per-application resources
For each member of `var.applications`, `main.tf` creates the following (all `for_each` over the set):
| Resource | Type | What it creates |
| --- | --- | --- |
| `postgresql_role.app_role["<app>"]` | role | non-login role `<app>_role` (`login = false`) — owns the database |
| `postgresql_grant_role.credentials_editor_app_role["<app>"]` | grant | grants `<app>_role` to `credentials_editor` **WITH ADMIN OPTION** |
| `postgresql_database.app_db["<app>"]` | database | database `<app>`, owner `<app>_role`, `template = template0`, `alter_object_ownership = true` |
| `postgresql_function.pgbouncer_user_lookup["<app>"]` | function | `user_lookup(i_username text)` in db `<app>` — see below |
| `postgresql_grant.pgbouncer_user_lookup_public_revoke["<app>"]` | grant | revoke (empty `privileges`) of `user_lookup` from role `public` in schema `public` |
| `postgresql_grant.pgbouncer_user_lookup["<app>"]` | grant | `EXECUTE` on `user_lookup` to role `pgbouncer_auth`; `depends_on` the public-revoke (the two grants can't run in parallel) |
So `webapp` yields role `webapp_role`, database `webapp`, a `user_lookup` function inside that database, and the matching grants; likewise for `erp`, `crowdsec`, `plausible`, and `dance-lessons-coach`.
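The fan-out from one set member to a bundle of named objects can be modelled in a few lines. This Python sketch is not the root's HCL; the `AppBundle` type and both function names are hypothetical, and the sketch only reproduces the naming rules from the table above, plus the consequence of removing a member.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AppBundle:
    role: str             # non-login owner role
    database: str         # database owned by that role
    lookup_function: str  # function created inside the database
    lookup_grantee: str   # only role allowed to EXECUTE the function


def bundle_for(app: str) -> AppBundle:
    """Apply the naming rules: role `<app>_role`, database `<app>`."""
    return AppBundle(
        role=f"{app}_role",
        database=app,
        lookup_function="user_lookup",
        lookup_grantee="pgbouncer_auth",
    )


def databases_dropped(before: set, after: set) -> set:
    """Databases that would be destroyed if the set changes from `before` to `after`."""
    return before - after


APPLICATIONS = {"webapp", "erp", "crowdsec", "plausible", "dance-lessons-coach"}
bundles = {app: bundle_for(app) for app in sorted(APPLICATIONS)}
print(bundles["webapp"])
print(databases_dropped(APPLICATIONS, APPLICATIONS - {"erp"}))
```

Note that a rename shows up as one removal plus one addition, which is why a renamed `applications` entry is as destructive as a deletion.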
### The pgbouncer `user_lookup()` function
`postgresql_function.pgbouncer_user_lookup` defines a `plpgsql` function with **`security_definer = true`** and `parallel = "SAFE"`. It takes `i_username` (IN, text) and returns a record of `uname` + `phash`:
```sql
BEGIN
SELECT usename, passwd FROM pg_catalog.pg_shadow
WHERE usename = i_username INTO uname, phash;
RETURN;
END;
```
PgBouncer's `auth_query` calls this to fetch the stored password hash. Because reading `pg_shadow` is privileged, the function is `SECURITY DEFINER` (runs as its owner). Access is locked down in two steps: first **revoke** the default `public` execute grant, then **grant** `EXECUTE` only to the `pgbouncer_auth` role — the `pgbouncer_auth` role itself is expected to already exist on the server (it is not created by this root).
> [!NOTE]
> The two grants are ordered with an explicit `depends_on`: `postgresql_grant.pgbouncer_user_lookup` waits for `postgresql_grant.pgbouncer_user_lookup_public_revoke` because the provider can't apply both grants on the same object concurrently.
---
## Vault layout
This root writes a single KVv1 secret.
| Path | Engine | Contents |
| --- | --- | --- |
| `kvv1/postgres/credentials_editor/credentials` | KVv1 (`vault_kv_secret`) | `username`, `password` of the `credentials_editor` login role |
---
## No outputs
There is **no `outputs.tf`** in this root. Nothing is exported as a Terraform output — the `credentials_editor` credentials are delivered into Vault, and the per-app roles/databases/functions are side effects on the live server. Consumers read the credentials from `kvv1/postgres/credentials_editor/credentials`, not from state outputs.
---
## See also
- [Naming conventions](../../lab-ecosystem/naming-conventions.md) — the `<app>` databases here are one strand of the per-application `<app>` join key (alongside namespaces, Vault paths, and repos).
- [CI apply flow](ci-apply-flow.md) — how `postgres/**` changes reach `gs://arcodange-tf/factory/postgres` and where `TF_VAR_POSTGRES_*` come from.
- [factory iac](factory-iac.md) — the sibling root for everything outside the cluster.
- [Secrets & Vault](../../lab-ecosystem/secrets-and-vault.md).

@@ -0,0 +1,127 @@
# 01 · factory
> **Status:** ✅ Active
> **Last Updated:** 2026-06-23
> **Downstream:** [02 · tools](02-tools.md) · [03 · cms](03-cms.md)
> **Deeper dive:** [Factory provisioning guidebook](../factory-provisioning/README.md) — page-by-page walkthrough of the Ansible playbooks/roles and OpenTofu modules summarized here
> **Related:** [naming-conventions.md](naming-conventions.md) · [secrets-and-vault.md](secrets-and-vault.md) · [storage-and-recovery.md](storage-and-recovery.md)
`factory` is the **cornerstone admin repo**: it provisions the hosts and the cluster, declares what gets deployed, and owns the platform-level cloud/Gitea/Vault/Postgres state that every app leans on. It has four pillars — **Ansible** (imperative host & cluster setup), **ArgoCD** (declarative app-of-apps), **`iac/`** (OpenTofu for the cloud/Gitea/Vault edge), and **`postgres/iac/`** (per-app PostgreSQL provisioning). The repos `tools` and `cms` are deployed *by* factory's ArgoCD and are mapped in [02 · tools](02-tools.md) and [03 · cms](03-cms.md).
## Pillar 1 — Ansible ([`ansible/`](../../../ansible/))
The collection lives at `ansible/arcodange/factory/`. The inventory groups the three Pis and pins the service placement; numbered playbooks run an ordered narrative from bare OS to backups; `recover/` holds the disaster-recovery playbooks.
### Inventory (`inventory/hosts.yml`)
| Group | Hosts | Purpose |
|---|---|---|
| `raspberries` | `pi1`, `pi2`, `pi3` (`192.168.1.201-203`) | All three Pis; `ansible_user: pi` |
| `postgres` | `pi2` | The PostgreSQL host (docker-compose, outside k3s) |
| `gitea` | children of `postgres` (→ `pi2`) | Gitea co-located with PG on `pi2` |
| `pihole` | `pi1`, `pi3` | Internal DNS resolvers |
| `step_ca` | `pi1`, `pi2`, `pi3` | Step-CA PKI for `*.arcodange.lab` (primary `pi1`, replicas `pi2`/`pi3`) |
| `local` | `localhost` + the Pis | Control-node-local tasks |
### Numbered playbooks (`playbooks/`)
| Playbook | Imports / does | Notes |
|---|---|---|
| `01_system` | `system/system.yml` → rpi base, DNS, SSL, prepare disks, Docker, iSCSI, **k3s install** (`--docker --disable traefik`), CoreDNS, cert-issuer, Longhorn/Traefik config | k3s `v1.34.3+k3s1` via upstream `k3s-ansible`; pi1 server, pi2/pi3 agents |
| `02_setup` | `setup/setup.yml` → PostgreSQL + Gitea docker-compose; optional backup-NFS share | Stands up the two out-of-cluster source-of-truth services on `pi2` |
| `03_cicd` | Gitea **act-runner** docker-compose on `pi1`/`pi3` (`raspberries:&local:!gitea`), plus the ArgoCD/Image-Updater install | See the ArgoCD caveat below |
| `04_tools` | `tools/tools.yml` → `hashicorp_vault.yml`, `crowdsec.yml` | Platform tooling that bootstraps the cluster's Vault + CrowdSec |
| `05_backup` | `backup/backup.yml` → `postgres.yml`, `gitea.yml`, `k3s_pvc.yml` to `/mnt/backups` | Scheduled PG/Gitea/PVC backups; cron-report wiring present |
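The host pattern `raspberries:&local:!gitea` used by `03_cicd` combines Ansible's intersection (`:&`) and exclusion (`:!`) operators. As a worked example, the same selection can be written with Python sets over the inventory groups listed earlier; the `INVENTORY` dictionary below is a hand-copied illustration, not a parser for `hosts.yml`.

```python
# Group membership copied from the inventory table (illustrative, not parsed).
INVENTORY = {
    "raspberries": {"pi1", "pi2", "pi3"},
    "postgres": {"pi2"},
    "gitea": {"pi2"},  # children of `postgres`
    "local": {"localhost", "pi1", "pi2", "pi3"},
}

# raspberries:&local  -> hosts that are in both groups
# ...:!gitea          -> minus the hosts in `gitea`
runner_hosts = (INVENTORY["raspberries"] & INVENTORY["local"]) - INVENTORY["gitea"]
print(sorted(runner_hosts))  # ['pi1', 'pi3']
```

The result matches the table: the act-runners land on `pi1` and `pi3`, keeping them off the host that runs Gitea and PostgreSQL.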
### Recovery playbooks (`playbooks/recover/`)
| Playbook | When to use |
|---|---|
| `longhorn.yml` | Recover Longhorn after a power cut when **Volume CRDs still exist** (CSI driver registration loss) |
| `longhorn_data.yml` | Recover app data from **raw replica `.img` files** when Volume CRDs are gone (block-device level) |
The tested power-cut recovery sequence (Longhorn restore → Vault unseal → VSO re-auth → ERP scaled up last) is documented in `CLUSTER_RECOVERY.md` at the lab root (outside this repo) and summarized in [storage-and-recovery.md](storage-and-recovery.md). Background on PVC recovery is in the [Longhorn PVC recovery ADR](../../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md).
### Key roles
`deploy_docker_compose` (renders compose stacks), `gitea_repo` / `gitea_token` / `gitea_secret` / `gitea_sync` (Gitea repo/token/secret/mirror management), `traefik_certs`, `playwright`, plus sub-roles `step_ca`, `hashicorp_vault`, `crowdsec`, `pihole`.
## Pillar 2 — ArgoCD app-of-apps ([`argocd/`](../../../argocd/))
A Helm chart whose `templates/apps.yaml` loops over `values.gitea_applications` and emits one `Application` CRD per app. Each Application derives everything from the app name: `repoURL = https://gitea.arcodange.lab/<org>/<app>`, `path = chart`, `namespace = <app>` (`CreateNamespace=true`), with `syncPolicy.automated` `prune: true` + `selfHeal: true` by default.
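To make the "everything derives from the app name" rule concrete, here is a Python sketch of the derivation. It is not the Helm template: the function name and the flat dictionary layout are assumptions for illustration and do not reproduce the real `Application` CRD structure. The `org` argument is passed explicitly because the chart's default org is not stated here.

```python
def application_spec(app: str, org: str) -> dict:
    """Derive the per-app Application settings from the app name and its Gitea org."""
    return {
        "name": app,
        "repoURL": f"https://gitea.arcodange.lab/{org}/{app}",
        "path": "chart",
        "namespace": app,
        "syncOptions": ["CreateNamespace=true"],
        # Defaults described above; individual apps may override them.
        "automated": {"prune": True, "selfHeal": True},
    }


spec = application_spec("cms", "arcodange-org")
print(spec["repoURL"])
```

Since the namespace equals the app name, the same string also lines up with the other per-app names used across the pillars.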
> [!TIP]
> **Deeper dive:** the [Applications guidebook](../applications/README.md) maps what these `Application` CRDs deploy — the common app-repo pattern (Dockerfile + `chart/` + optional `iac/` + CI) every app in the list below shares, and the two archetypes (Go + Postgres vs Rust + SQLite).
| App | Org override | Image Updater |
|---|---|---|
| `url-shortener` | — | — |
| `tools` | — | explicit `prune`+`selfHeal` |
| `webapp` | — | ✅ digest strategy |
| `telegram-gateway` | `arcodange` | ✅ digest strategy |
| `erp` | — | — |
| `cms` | — | ✅ digest strategy |
| `dance-lessons-coach` | `arcodange` | ✅ digest strategy |
> [!NOTE]
> The chart also templates a `longhorn_backup_target` and the ArgoCD Image Updater config (`argocd.arcodange.lab`). **ArgoCD itself is not currently deployed in-cluster** — its install is commented out in `03_cicd`. This page documents the intended steady state; treat ArgoCD as "designed, not live" until that step is enabled.
## Pillar 3 — OpenTofu ([`iac/`](../../../iac/))
Manages the cloud/Gitea/Vault edge. State lives in **GCS** (`backend "gcs"`, bucket `arcodange-tf`, prefix `factory/main`). Tofu authenticates to Vault via **Gitea OIDC JWT** (mount `gitea_jwt`, role `gitea_cicd`).
| Provider | Used for |
|---|---|
| `go-gitea/gitea` (`0.6.0`) | Repos, users, action secrets (e.g. the restricted `tofu_module_reader` CI user, CMS secrets) |
| `vault` (`4.4.0`) | KV secrets + policies + k8s auth roles (e.g. Longhorn GCS-backup creds & policy) |
| `google` (`7.0.1`) | GCS backup bucket + service account + HMAC key for Longhorn |
| `cloudflare/cloudflare` (`~> 5`) | R2 bucket, API tokens, CMS edge wiring (detailed in [03 · cms](03-cms.md)) |
| `ovh/ovh` (`2.8.0`) | OAuth2 client + IAM policy for the `arcodange.fr` domain (registrar = OVH) |
`modules/cloudflare_token` is a reusable scoped-token factory. The root as a whole uses the repo name as the `<app>` segment of its GCS state prefix (`<app>/main`, here `factory/main`); see [naming-conventions.md](naming-conventions.md).
## Pillar 4 — per-app PostgreSQL ([`postgres/iac/`](../../../postgres/))
OpenTofu using the `cyrilgdn/postgresql` provider against PG on `192.168.1.202` (state prefix `factory/postgres`). It iterates over a `var.applications` set and, **per app**, creates:
| Resource | Name pattern | Purpose |
|---|---|---|
| Database | `<app>` | The app's database (`template0`, owned by the role) |
| Owner role (non-login) | `<app>_role` | Database owner; granted to dynamic users by Vault |
| Editor role (login) | `credentials_editor` | Shared admin role that can grant the per-app roles |
| `user_lookup()` function | per-`<app>` db | `SECURITY DEFINER` lookup for **pgbouncer** auth (granted to `pgbouncer_auth`, revoked from `public`) |
Current `applications` set: `webapp`, `erp`, `crowdsec`, `plausible`, `dance-lessons-coach`. Vault's PostgreSQL secrets engine then issues **dynamic** credentials on top of these roles — see [secrets-and-vault.md](secrets-and-vault.md). The pooler (`pgbouncer`) that consumes `user_lookup()` lives in the `tools` namespace — see [02 · tools](02-tools.md).
## Provisioning order
```mermaid
%%{init: {'theme': 'base'}}%%
flowchart LR
classDef proc fill:#059669,stroke:#047857,color:#fff
classDef store fill:#7c3aed,stroke:#6d28d9,color:#fff
S1["01_system<br>OS + k3s + Longhorn"]:::proc --> S2["02_setup<br>PG + Gitea (pi2)"]:::proc --> S3["03_cicd<br>runners + ArgoCD"]:::proc --> S4["04_tools<br>Vault + CrowdSec"]:::proc --> S5["05_backup<br>PG/Gitea/PVC"]:::proc
IAC["iac/ + postgres/iac<br>(OpenTofu state in GCS)"]:::store -. "declares cloud/Gitea/Vault/PG" .- S2
```
1. **`01_system`** lays the OS, disks, Docker, and k3s with Longhorn + Traefik onto the three Pis.
2. **`02_setup`** stands up PostgreSQL and Gitea as docker-compose on `pi2` — the out-of-cluster source-of-truth services.
3. **`03_cicd`** registers the Gitea act-runners (and is where ArgoCD would install, currently commented out).
4. **`04_tools`** bootstraps the cluster's Vault and CrowdSec.
5. **`05_backup`** schedules PostgreSQL, Gitea, and k3s-PVC backups to `/mnt/backups`.
6. In parallel, **OpenTofu** (`iac/` and `postgres/iac/`) declares the cloud, Gitea, Vault, and PostgreSQL objects, keeping state in GCS.
## Cross-references
- [Lab ecosystem hub](README.md) — the whole-lab map this page sits under.
- [Applications guidebook](../applications/README.md) — the apps ArgoCD's app-of-apps deploys: the common app-repo pattern and the Go+Postgres / Rust+SQLite archetypes.
- [02 · tools](02-tools.md) — what ArgoCD deploys into the `tools` namespace (incl. pgbouncer that consumes the PG `user_lookup()`).
- [03 · cms](03-cms.md) — the CMS edge that `iac/cloudflare.tf` and `iac/ovh.tf` wire up.
- [naming-conventions.md](naming-conventions.md) — the `<app>` join key these pillars share.
- [secrets-and-vault.md](secrets-and-vault.md) — Gitea OIDC JWT for Tofu/CI and dynamic PG creds.
- [storage-and-recovery.md](storage-and-recovery.md) — Longhorn + GCS backup + power-cut recovery.
- [new-web-app runbook](../../../doc/runbooks/new-web-app/README.md) · [conventions](../../../doc/runbooks/new-web-app/conventions.md) — the step-by-step procedure these pillars support.
- [doc/adr](../../../doc/adr/README.md) — the canonical infrastructure ADRs.
- [Longhorn PVC recovery ADR](../../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md) — recovery background.
