Commit Graph

129 Commits

Author SHA1 Message Date
a38c8b39f1 Merge pull request 'feat(multi-env): Phase D1 — provision erp-sandbox Postgres DB + role' (#17) from claude/phaseD-erp-sandbox-postgres into main 2026-06-28 17:09:45 +02:00
00a838799b feat(multi-env): Phase D1 — provision erp-sandbox Postgres DB + role
Activates the sandbox environment for the ERP on the Postgres side
(ADR-0002 Phase D). `erp` gains `envs = ["prod", "sandbox"]`, so the
elision flatten now materialises a second instance `erp-sandbox`:
  - database `erp-sandbox`
  - owner role `erp_sandbox_role` (snake-case per the convention)
  - pgbouncer user_lookup function + grants for the new DB

The prod `erp` instance is unchanged (db `erp`, role `erp_role`) — the
apply is purely additive (~6 resources for erp-sandbox, 0 changed,
0 destroyed on everything else). Verified the flatten output with a
standalone tofu apply before pushing.

This is D1 of the Phase D activation. D2 (tools Vault policies),
D3 (erp iac creds + KV), D4 (ArgoCD Application) follow in order.

Refs ADR-0002 (factory#15), Phase B (factory#16).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-28 17:05:50 +02:00
235ff72ac0 Merge pull request 'feat(multi-env): Phase B — factory machinery env-capable (no activation)' (#16) from claude/multi-env-phaseb into main 2026-06-28 16:53:39 +02:00
c00c4cdd5c feat(multi-env): Phase B — make factory machinery env-capable (no activation)
ADR-0002 Phase B. Makes postgres/iac, argocd, and the conventions docs
multi-environment-capable WITHOUT activating any sandbox yet — every app
stays prod-only, so this change is behaviour-neutral:
  - postgres/iac `tofu plan` is a no-op (proven: the elision flatten keys
    are bare app names, db=<app>, role=<app>_role — identical addresses)
  - the argocd apps.yaml render is byte-identical (181→181 lines, diff
    empty) since no app declares `envs`

postgres/iac:
- variables.tf: `applications` becomes set(object({name, envs=optional(["prod"])}))
- main.tf: a `local.app_instances` flatten of applications × envs keyed by the
  elided instance id (env=prod → "<app>"); per-app resources iterate it and
  reference each.key / each.value.{database,role}. For prod-only apps every
  resource address + attribute is unchanged. (main.tf also got a full
  `tofu fmt` pass — the pgbouncer function block reindents 4→2 spaces, which
  is cosmetic; the correctness gate is the CI tofu plan, not the text diff.)
- terraform.tfvars: string entries → { name = "..." } objects.

argocd/templates/apps.yaml:
- after the prod Application, a `range $app_attr.envs` loop renders one extra
  Application per non-prod env: name/namespace `<app>-<env>`, shared repoURL,
  helm.valueFiles [values.yaml, values-<env>.yaml], per-env syncPolicy override.
  Renders nothing while no app sets `envs` → prod render unchanged.

docs:
- doc/runbooks/new-web-app/conventions.md (FR, authoritative): new section
  "Plusieurs environnements pour une même app" — elision rule, suffix rule,
  snake-case owner-role exception, erp/erp-sandbox table, ADR-0002 link.
- vibe/guidebooks/lab-ecosystem/naming-conventions.md (EN mirror): the env
  coordinate section + a "Two sandbox models" section reconciling the
  separate-cluster (ADR-0001, names repeat) vs in-cluster sibling (ADR-0002,
  <env> suffix) strategies; Last Updated bumped; ADR-0002 cross-links.

Activation (erp gets envs=["prod","sandbox"] in postgres tfvars + argocd
values + erp/iac) is Phase D, gated by its own plan review.

Refs ADR-0002 (factory#15). Phase A = tools#2 (merged). Phase C = erp#11 (merged).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-28 16:28:28 +02:00
8a1a63ee10 Merge pull request 'docs(adr): ADR-0002 — per-application environments via an env coordinate' (#15) from claude/adr-multi-env into main 2026-06-28 16:17:37 +02:00
c35b510040 docs(adr): fill the ADR-0002 ↔ PR backlink (factory#15)
Replaces the placeholder References line with the PR URL so the
ADR↔PR crosslink is bidirectional per the AGENTS.md rule.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-25 14:56:09 +02:00
3961914613 docs(adr): ADR-0002 — per-application environments via an env coordinate
Records the decision to extend the <app> join key with a second
coordinate <env>, governed by an elision rule (env=prod elides → every
existing app's derived names are byte-identical and its tofu plan is a
no-op; non-prod envs take the <app>-<env> suffix, with the Postgres
owner role staying snake-case <app>_<env>_role).

Motivated by the ERP's incoming write-capable AI-agent skill: it needs
an in-cluster sandbox instance (erp-sandbox) with a prod-like Dolibarr
API + isolated database to rehearse writes before a human promotes them
to prod. The ADR reconciles this against ADR-0001 honestly — ADR-0001
rejected an in-cluster sandbox for INFRA-change rehearsal (shared
fleet-wide control planes); ADR-0002 operates one layer up where the
agent's only reach is the app's HTTP API against an isolated DB, so the
fleet blast radius is not in scope. The two are complementary; ADR-0002
does not supersede ADR-0001.

Also:
- vibe/ADR/README.md: index row for 0002 + Last Updated 2026-06-25
- PRD safe-prod-like-environment README: bidirectional back-link to
  ADR-0002 on the Adjacent line + Last Updated 2026-06-25

Authored via the ADR Scribe persona, validated via the Continuity Warden
checklist (no-tombstone, breadcrumb, MADR-lite sections, dead-link scan,
bidirectional links).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-25 14:55:19 +02:00
801724e1bc Merge pull request 'chore(iac): remove spent R2 import block' (#14) from arcodange/r2-import-cleanup into main 2026-06-24 13:24:09 +02:00
7727b244ad chore(iac): remove spent R2 import block
The one-time import block from the previous change reconciled
cloudflare_r2_bucket.arcodange_tf into state (run #29: "Import complete",
"Apply complete! Resources: 1 imported"). It is now a no-op, so remove it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 13:23:42 +02:00
e2a79a08a7 Merge pull request 'fix(iac): import existing EU R2 bucket into state' (#13) from arcodange/r2-state-import into main 2026-06-24 13:19:56 +02:00
a0fbe5c655 fix(iac): import existing EU R2 bucket into state
Run #28 applied cleanly except cloudflare_r2_bucket.arcodange_tf: the bucket
exists in the EU jurisdiction, but its prior state entry lacked the jurisdiction,
so cloudflare provider >=5.20 read it as not-found, removed it from state, and
then failed to recreate it ("already exists"). Add a config-driven import block
with the jurisdiction-qualified id (<account_id>/<bucket_name>/<jurisdiction>) so
the next apply adopts the real bucket. No-op once reconciled; removable after.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 13:19:32 +02:00
fc28c52b85 Merge pull request 'fix(iac): pin cloudflare provider + lockfile, trust homelab CA in gitea provider' (#12) from arcodange/iac-provider-fixes into main 2026-06-24 13:03:16 +02:00
bfa05ff633 Merge pull request 'fix(ci): run factory tofu workflows on the CA-trusting runner' (#11) from arcodange/focused-dirac-151213 into main 2026-06-24 13:02:58 +02:00
9b545e6f8f fix(iac): pin cloudflare provider + lockfile, trust homelab CA in gitea provider
With the runner CA fix (#11) the iac workflow now runs far enough to apply,
which exposed two provider problems:

cloudflare drift — `cloudflare/cloudflare` floated on `~> 5` with no committed
lock file, so CI pulled v5.21.1 where `cloudflare_account_token.policies[].resources`
is a JSON string, not a map ("Incorrect attribute value type"). Fix:
- pin to `~> 5.21` and commit a multi-platform `.terraform.lock.hcl`
  (linux_arm64 for the runner + darwin_arm64 for local);
- `jsonencode(...)` the module's policy resources;
- bind the cloudflare_token module to `cloudflare/cloudflare` explicitly (it was
  defaulting to `hashicorp/cloudflare`, pulling a redundant provider);
- stop `.gitignore` from hiding the lock file (the old `.terraform.*` rule did).

gitea provider TLS — it runs inside the dflook/terraform-apply container, which
doesn't trust the homelab CA (only the ubuntu-latest-ca runner does), so it
failed `x509: certificate signed by unknown authority` reaching
gitea.arcodange.lab. Fix: feed it the homelab CA via the provider's `cacert_file`
(TF_VAR_gitea_cacert_file -> the homelab.pem the workflow already materializes).

Validated locally with `tofu validate` + provider-schema inspection (no prod
calls). Complements #11. Out of scope (need a live run / operator): the OVH
consumer-key scope, and the R2 bucket "not found" on refresh (a state reconcile).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 12:56:46 +02:00
e5c537a967 fix(ci): run factory tofu workflows on the CA-trusting runner
After the move to the self-signed internal DNS (gitea.arcodange.lab /
vault.arcodange.lab), the default `ubuntu-latest` runner image does not
trust the homelab CA, so the `uses:` clone of the vault-action over HTTPS
fails TLS verification. webapp's workflows already moved to the
`ubuntu-latest-ca` runner (whose image ships the homelab CA); apply the
same to the factory `iac` and `postgres` tofu workflows.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 11:22:54 +02:00
3b0919b804 Merge pull request 'docs(vibe): bootstrap vibe/ knowledge tree + ecosystem AGENTS.md' (#10) from arcodange/focused-dirac-151213 into main
Reviewed-on: #10
2026-06-24 11:01:09 +02:00
053b04337a chore: gitignore .claude/worktrees
Per-session Claude Code checkouts live under .claude/worktrees/<slug>/
on the trunk; keep them out of git so the main checkout stays clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 10:55:43 +02:00
1824a1885d docs(vibe): add maintenance rule to the ansible + opentofu sub-hubs
The two factory-provisioning sub-hubs were the only guidebook index pages without
the "alter a documented component -> update its page in the same PR" reminder that
every sibling hub carries. Add a scoped maintenance rule to each, pointing back to
the factory-provisioning maintenance rule and the guidebooks' Rules to contribute,
so no folder hub silently drifts.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 23:42:24 +02:00
2d76eb45c1 docs(vibe): add new-tool and new-app runbooks (grounded in real PRs)
Two agent-oriented runbooks under vibe/runbooks/ with [AGENT]/[HUMAN] step
markers, grounded in real diffs:

- new-tool.md : add a platform component to the tools repo so ArgoCD deploys it
  into the tools namespace (wrapper Chart.yaml + the tool library + a row in
  chart/values.yaml; optional iac/ for secrets). Mirrors the prometheus/crowdsec
  additions.
- new-app.md  : stand up a brand-new application across THREE repos (app +
  factory + tools) with the strict ordering dependency and the TERRAFORM_SSH_KEY
  pitfall. Phase-by-phase mapped to the dance-lessons-coach onboarding PRs
  (#89/#97/#98/#99/#100), factory #1/#2, tools #1; the FR doc/runbooks/new-web-app
  is linked as the detailed companion.

2 mermaid diagrams MCP-validated; zero dead links across the vibe tree.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 22:22:09 +02:00
7bf83e75ed docs(vibe): add erp/ guidebook (Dolibarr deployment + backup/recovery + ops)
Dedicated tree-docs guidebook under vibe/guidebooks/erp/ for the lab's most
data-critical app, cross-linked from the applications hub (bidirectional):

- README.md             : Dolibarr 22.0.4 on Postgres; data-criticality; overview
  diagram; the Vault-unseal-before-scale recovery ordering (CAUTION).
- deployment.md         : upstream image + custom entrypoint (MySQL->psql), the
  50Gi Longhorn RWX documents PVC, Vault CRDs + the shared app_roles iac, init
  scripts (conf.php creds, table-ownership), ingress, CI.
- backup-and-recovery.md: the Ansible CronJob pg_dump (daily 04:00, 15-day
  retention) + restore Job (scale-0 -> restore -> scale-1); the cluster recovery
  ordering (Longhorn -> Vault unseal -> erp scale-up).
- operations.md         : the read-only bin/arcodange CLI, static/company.json,
  Deno+Playwright tests, day-2 ops.

erp code via full gitea URLs; CLUSTER_RECOVERY.md by name; 2 mermaid diagrams
MCP-validated; zero dead links.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 22:12:11 +02:00
4823394e0e docs(vibe): add applications/ guidebook (webapp + url-shortener)
Tree-docs guidebook under vibe/guidebooks/applications/ documenting the common
app pattern and two contrasting archetypes, drilling into lab-ecosystem/01-factory
(bidirectional):

- README.md  : the shared app pattern (repo = Dockerfile + chart + optional iac +
  CI; ArgoCD app-of-apps; the <app> join key; .fr vs .lab ingress conventions) +
  a two-archetype comparison.
- webapp.md  : canonical Go + Postgres exemplar (chart, VaultAuth/Static/Dynamic
  CRDs, inline iac vs the shared app_roles module, CI); notes the current nuance
  that the live pod still uses the static pgbouncer_auth DATABASE_URL.
- url-shortener.md : Rust + SQLite-on-Longhorn-RWO counterpart (single replica,
  no iac/no Vault, CI mirrors the upstream image); the power-cut recovery story.

erp is referenced in prose only (its own guidebook lands next). Sibling-repo code
via full gitea URLs; 2 mermaid diagrams MCP-validated; zero dead links.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 21:58:36 +02:00
548dacfc44 docs(vibe): add tools/ and cms/ guidebooks
Two code-grounded tree-docs guidebooks under vibe/guidebooks/, drilling into the
lab-ecosystem 02-tools and 03-cms pages (bidirectional):

- tools/  : hub + components.md (Vault+VSO, Prometheus, Grafana, CrowdSec,
  pgbouncer, Redis/KeyDB, Plausible, ClickHouse; pgcat/tool as Tier-2) +
  secrets-and-vso.md (Vault engines/auth, the app_roles/app_policy modules =
  the <app> join-key machinery, VSO CRDs, secret-paths inventory).
- cms/    : hub + site.md (Nuxt + dual Pages/k3s deploy) + cloudflare.md
  (zone via OVH->CF, Pages, cloudflared tunnel, Turnstile, R2 state) +
  zoho-email.md (OAuth, MX/SPF/DKIM/DMARC/BIMI, the 7 aliases).

Sibling-repo code linked via full gitea URLs; vibe-internal links bidirectional.
Reconciled the cloudflared tunnel token path to kvv2 cms/cloudflared (the chart
VaultStaticSecret is kv-v2; the kvv1 tofu reference is a commented-out stub).
6 mermaid diagrams MCP-validated; zero dead links. Lab Cartographer cohort.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 21:41:15 +02:00
dbe32161dc docs(vibe): add factory-provisioning guidebook (Ansible + OpenTofu)
Deep, code-grounded tree-docs guidebook under vibe/guidebooks/factory-provisioning/,
explored from the actual playbooks/roles and tofu code:

- Hub: the two provisioning engines (operator-run Ansible vs CI-applied OpenTofu),
  a green-field bring-up flow, master index, maintenance rule.
- ansible/ sub-tree: ordered pages 01-system .. 06-recover, an inventory & variables
  concept page, and a Tier-1/Tier-2 roles reference (hashicorp_vault, step_ca,
  crowdsec, pihole, deploy_docker_compose + the gitea_* family and helpers).
- opentofu/ sub-tree: factory-iac (Cloudflare/OVH/GCP/Gitea/Vault edge +
  cloudflare_token module), postgres-iac (per-app DB/role/pgbouncer lookup),
  ci-apply-flow (Gitea OIDC-JWT -> Vault -> auto-approve apply).

Cross-linked bidirectionally with the lab-ecosystem guidebook and the safe-env
ADR/PRD (the sandbox rehearses exactly these engines). 14 mermaid diagrams
MCP-validated; zero dead links. Authored by the Lab Cartographer cohort.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 21:11:51 +02:00
b886f06824 docs(vibe): backfill PR #10 crosslink into ADR-0001 + PRD STATUS
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 11:53:39 +02:00
7647a68cdc docs(vibe): bootstrap vibe/ knowledge tree + ecosystem AGENTS.md
Add a root AGENTS.md (ecosystem map of factory/tools/cms + agent operating
rules + the persona cohort & workflow) and a new vibe/ knowledge base for LLM
agents, modeled on tree-docs conventions and the factory house style.

vibe/ folders (each with a README hub + contribution rules):
- ADR/      optimized MADR-lite; canonical home going forward (doc/adr stays historical)
- PRD/      one subfolder per PRD, mandatory STATUS.md, QA strategy for big ones
- investigations/  single INV-NNN-slug.md, or stub + folder w/ notebooks
- guidebooks/      tree-docs maps; lab-ecosystem guidebook of factory+tools+cms
- runbooks/        [AGENT]/[HUMAN] step procedures (EN; doc/runbooks stays FR)
- shareouts/       dated FR handouts (decks/mp4)

Seed content (first ADR + PRD): a safe, production-like environment to rehearse
risky changes and recovery without touching real prod — local-only sandbox
(k3d + arm64 VMs) with a hard prod/sandbox isolation boundary. Includes
INV-001 (prod blast-radius couplings), the ecosystem guidebook, and a FR shareout.

Conventions enforced: no-tombstone rule, breadcrumb spine, bidirectional
cross-links, theme:base mermaid (MCP-validated) + ordered-list-after-diagram.
Built with a Workflow + persona cohort; 24 files, zero dead links.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 11:52:37 +02:00
827af6b392 Merge pull request 'docs(runbooks): runbook « mettre en service une nouvelle application web »' (#9) from claude/new-web-app-runbook into main 2026-05-31 17:25:37 +02:00
8330d82225 docs(runbooks): add "new web app" setup runbook under doc/runbooks/
Document, as a tree-docs tree, the end-to-end procedure to stand up a new
web application on the Arcodange platform — a mechanic spread across the
factory, tools and app repos with non-trivial ordering dependencies.

Covers: Gitea repo creation (org-secret inheritance), Postgres DB + owner
role (factory/postgres/iac), platform Vault declaration (gitea_cicd_<app>
+ policies, tools/hashicorp-vault/iac), the app Helm chart (VSO dynamic
secrets via pgbouncer), the app Terraform (app_roles module), the CI
workflows (tofu apply + image build, incl. the copy-pasted role pitfall),
and ArgoCD registration (factory/argocd/values.yaml). Adds a naming-
conventions concept page and an ordered checklist.

Wires the legacy doc/adr "setup hello world web app" item and the factory
README to the runbook. New docs live under doc/ (singular) per the PR #8
convention.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-31 17:22:30 +02:00
54b3092305 Merge pull request 'fix(docs): place ADR under doc/adr (singular) per convention' (#8) from fix/adr-path-doc-singular into main
Reviewed-on: #8
2026-05-09 15:29:22 +02:00
e0fb337a5f docs: place new ADR under doc/adr (singular) per convention
The 20260509 ADR landed in docs/adr/ (plural) by mistake. Convention
is doc/adr/ (alongside the existing 00_*, 01_*, … docs and the
network-architecture/cicd-architecture ADRs that pre-existed there).

Note : 20260407-*.md files in the typo'd docs/adr/ are still untracked
(never committed) — separate cleanup task.
2026-05-09 14:25:37 +02:00
ea500abe62 Merge pull request 'docs(adr): telegram-gateway auth (Phase 1.5)' (#7) from docs/telegram-gateway-auth-adr into main
Reviewed-on: #7
2026-05-09 14:22:12 +02:00
62673a2d65 docs(adr): telegram-gateway auth (Phase 1.5)
Documents the authentication layer added to telegram-gateway in
Phase 1.5 :

- principal bot @arcodange_factory_bot (handler=auth) gère /auth, /whoami, /logout
- session Redis 24h keyed by Telegram from.id (TTL via AUTH_SESSION_TTL)
- allowlist optionnelle (ALLOWED_USERS) — silent drop avant la gate
- requireAuth secure-by-default (true), opt-out explicite par bot
- handler=auth force requireAuth=false (chicken-and-egg)

Cross-links bidirectionnels avec le code (Gitea URLs vers
arcodange/telegram-gateway), AUTH.md (user-facing) et HOWTO_ADD_BOT.md
(Cas 2 mis à jour). Diagrammes mermaid avec contrastes explicites.
2026-05-09 13:58:27 +02:00
4163b06659 Merge pull request 'argocd: add telegram-gateway application' (#6) from feat/homelab-gateway-app into main
Reviewed-on: #6
2026-05-09 12:41:49 +02:00
3fb7544351 argocd: rename homelab-gateway → telegram-gateway
Aligns with the upstream repo rename
(arcodange/homelab-gateway → arcodange/telegram-gateway) so the name
matches the public URL tg.arcodange.fr and Arcodange's naming
conventions.
2026-05-09 12:35:37 +02:00
5038956332 argocd: add homelab-gateway application
Adds the homelab-gateway Argo CD Application pointing at
arcodange/homelab-gateway (user space, like dance-lessons-coach).

Image Updater watches gitea.arcodange.lab/arcodange/homelab-gateway:latest
with digest strategy.

Phase 1 of the Telegram webhook gateway — a long-running pod that
receives webhooks (no more polling) and routes per-bot to handler
implementations. Initial bot: @arcodange_factory_bot, slug=factory,
echo handler.
2026-05-09 12:25:30 +02:00
6ede249da9 🔒 fix(ansible): gate vault auth disable behind vault_oidc_force_reset (default off) (#5)
Co-authored-by: Gabriel Radureau <arcodange@gmail.com>
Co-committed-by: Gabriel Radureau <arcodange@gmail.com>
2026-05-06 15:03:33 +02:00
9e821e1626 ♻️ refactor(ansible): move gitea secret user-propagation list to inventory (#4)
Co-authored-by: Gabriel Radureau <arcodange@gmail.com>
Co-committed-by: Gabriel Radureau <arcodange@gmail.com>
2026-05-06 14:48:05 +02:00
69b7e9ddcb Merge remote-tracking branch 'origin/main' 2026-05-06 14:38:01 +02:00
069edd72f1 chore(cicd): drop temporary commented-out tasks from 03_cicd.yml
Removes the commented PACKAGES_TOKEN/HOMELAB_CA_CERT blocks and the legacy
"Deploy Argo CD" play that were left behind during the migration to
Helm-based ArgoCD.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 14:37:48 +02:00
a644436746 🔒 fix(ansible): propagate vault_oauth__sh_b64 to user-owned namespaces (arcodange) (#3)
Co-authored-by: Gabriel Radureau <arcodange@gmail.com>
Co-committed-by: Gabriel Radureau <arcodange@gmail.com>
2026-05-06 14:18:06 +02:00
a3526e51f8 Merge remote-tracking branch 'origin/main' 2026-05-06 12:58:01 +02:00
01f0f37691 chore(ansible): add per-collection ansible.cfg + drop trailing whitespace
ansible/arcodange/factory/ansible.cfg sets collections_path so ansible
commands run from inside the collection directory still find user-installed
collections under ~/.ansible/collections.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 12:55:54 +02:00
f114d7e6f0 feat(argocd): allow per-app syncPolicy override in values.yaml
The apps template hardcoded automated{prune,selfHeal} for every app. Some
apps (e.g. tools, where Vault unseal is manual) need a custom syncPolicy
without selfHeal. Read $app_attr.syncPolicy when set, fall back to the
existing automated default otherwise. Use the override on `tools` to keep
the existing behavior explicit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 12:55:49 +02:00
1688fe0dfd fix(crowdsec): clean up Failed pods before Traefik middleware reload
Re-running the role would leave behind crowdsec pods stuck in Failed phase
(typically after a config error on a previous run), which then blocked the
Traefik middleware refresh. Delete them up front so the next reconcile
schedules fresh pods.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 12:55:39 +02:00
499410a160 feat(cicd): persist gitea act-runner cache + isolate on dedicated docker network
Pins the actcache server to a fixed port (43707) and exposes it, then
mounts /mnt/arcodange/gitea-runner-cache and /mnt/arcodange/gitea-runner-act
into the runner so the actions/cache and act image layer cache survive
container restarts. Moves the runner onto a dedicated `gitea_action_network`
so CI job containers can reach the cache server by name without sharing the
host network.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 12:55:34 +02:00
e3e0decd98 docs(adr): extend network-architecture ADR with .lab SSL/TLS deep dive
Replaces the placeholder "Success Metrics" section with a detailed
walkthrough of the internal PKI: Step CA provisioners, cert-manager +
StepClusterIssuer wiring, certificate issuance/renewal sequence diagram,
device-trust installation steps, and troubleshooting playbook for the
common stuck-CertificateRequest / Traefik TLS / device-trust failures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 12:55:27 +02:00
1ae28cb944 docs(longhorn): document 2026-04-13 power-cut recovery + add data-recovery tooling
Captures the post-mortem of the April 13 power-cut: incident timeline,
retrospective, and architecture/role diagrams. Adds an ADR explaining why
Longhorn cannot re-associate orphaned replica directories after a nuclear
reinstall (engine-id naming), plus block-device recovery runbooks and the
`playbooks/recover/longhorn_data.yml` automation that wires `merge-longhorn-layers.py`
to rebuild PVCs from raw `volume-head-*.img` chains.

Also extends the k3s_pvc backup to capture Longhorn `volumes`/`settings` CRDs
(needed for the fast-path restore) and rewrites the restore script with a
fallback dir + English messages.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 12:55:18 +02:00
934b62d922 chore(ansible): use project-local uv venv for ansible runtime deps
Moves the local ansible runtime from a global `uv tool install ansible-core`
(which required remembering `--with kubernetes --with jmespath --with dnspython`)
to a project-managed venv described by `pyproject.toml` + `uv.lock`. Fixes the
"Failed to import the required Python library (kubernetes)" error on localhost.

The localhost inventory entry now derives `ansible_python_interpreter` from
`{{ ansible_playbook_python }}`, so `uv run ansible-playbook` is enough — no
more hardcoded user-specific paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 12:35:28 +02:00
09a270d179 🤖 ci(postgres): declare dance-lessons-coach DB + role + pgbouncer lookup (#2)
Co-authored-by: Gabriel Radureau <arcodange@gmail.com>
Co-committed-by: Gabriel Radureau <arcodange@gmail.com>
2026-05-06 08:18:30 +02:00
0ce004cc6a 🤖 ci(argocd): enroll dance-lessons-coach + per-app org override in apps template (#1)
Co-authored-by: Gabriel Radureau <arcodange@gmail.com>
Co-committed-by: Gabriel Radureau <arcodange@gmail.com>
2026-05-06 08:01:50 +02:00
e6fc24c101 fix(dns): harden DNS resilience after power-cut incident
During the 2026-04-13 power cut recovery, DNS resolution failures blocked
Longhorn reinstall. Root causes:
- CoreDNS forwarded to a single hardcoded Pi-hole IP instead of both HA instances
- CoreDNS main Corefile forwarded to /etc/resolv.conf which pointed to itself on pi3
- Pi-hole lacked explicit upstream DNS, relying on DHCP-provided config
- dnsmasq system service conflicted with pihole-FTL on port 53

Changes:
- k3s_dns: forward CoreDNS to both Pi-hole HA instances (pi1 + pi3) dynamically
- k3s_dns: update main CoreDNS Corefile to forward to Pi-holes instead of resolv.conf
- pihole defaults: add explicit upstream DNS servers (8.8.8.8, 1.1.1.1, 8.8.4.4)
- pihole ha_setup: write /etc/dnsmasq.d/99-upstream.conf with explicit upstreams
- rpi: add dnsmasq user to dip group and disable conflicting dnsmasq service on Pi-hole nodes

See docs/adr/20260414-internal-dns-architecture.md for full rationale.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 10:54:42 +02:00