With the runner CA fix (#11) the iac workflow now runs far enough to apply,
which exposed two provider problems:
cloudflare drift — `cloudflare/cloudflare` floated on `~> 5` with no committed
lock file, so CI pulled v5.21.1 where `cloudflare_account_token.policies[].resources`
is a JSON string, not a map ("Incorrect attribute value type"). Fix:
- pin to `~> 5.21` and commit a multi-platform `.terraform.lock.hcl`
(linux_arm64 for the runner + darwin_arm64 for local);
- `jsonencode(...)` the module's policy resources;
- bind the cloudflare_token module to `cloudflare/cloudflare` explicitly (it was
defaulting to `hashicorp/cloudflare`, pulling a redundant provider);
- stop `.gitignore` from hiding the lock file (the old `.terraform.*` rule did).
gitea provider TLS — it runs inside the dflook/terraform-apply container, which
doesn't trust the homelab CA (only the ubuntu-latest-ca runner does), so it
failed `x509: certificate signed by unknown authority` reaching
gitea.arcodange.lab. Fix: feed it the homelab CA via the provider's `cacert_file`
(TF_VAR_gitea_cacert_file -> the homelab.pem the workflow already materializes).
Validated locally with `tofu validate` + provider-schema inspection (no prod
calls). Complements #11. Out of scope (need a live run / operator): the OVH
consumer-key scope, and the R2 bucket "not found" on refresh (a state reconcile).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Per-session Claude Code checkouts live under .claude/worktrees/<slug>/
on the trunk; keep them out of git so the main checkout stays clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The two factory-provisioning sub-hubs were the only guidebook index pages without
the "alter a documented component -> update its page in the same PR" reminder that
every sibling hub carries. Add a scoped maintenance rule to each, pointing back to
the factory-provisioning maintenance rule and the guidebooks' Rules to contribute,
so no folder hub silently drifts.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Two agent-oriented runbooks under vibe/runbooks/ with [AGENT]/[HUMAN] step
markers, grounded in real diffs:
- new-tool.md : add a platform component to the tools repo so ArgoCD deploys it
into the tools namespace (wrapper Chart.yaml + the tool library + a row in
chart/values.yaml; optional iac/ for secrets). Mirrors the prometheus/crowdsec
additions.
- new-app.md : stand up a brand-new application across THREE repos (app +
factory + tools) with the strict ordering dependency and the TERRAFORM_SSH_KEY
pitfall. Phase-by-phase mapped to the dance-lessons-coach onboarding PRs
(#89/#97/#98/#99/#100), factory #1/#2, tools #1; the FR doc/runbooks/new-web-app
is linked as the detailed companion.
2 mermaid diagrams MCP-validated; zero dead links across the vibe tree.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Tree-docs guidebook under vibe/guidebooks/applications/ documenting the common
app pattern and two contrasting archetypes, drilling into lab-ecosystem/01-factory
(bidirectional):
- README.md : the shared app pattern (repo = Dockerfile + chart + optional iac +
CI; ArgoCD app-of-apps; the <app> join key; .fr vs .lab ingress conventions) +
a two-archetype comparison.
- webapp.md : canonical Go + Postgres exemplar (chart, VaultAuth/Static/Dynamic
CRDs, inline iac vs the shared app_roles module, CI); notes the current nuance
that the live pod still uses the static pgbouncer_auth DATABASE_URL.
- url-shortener.md : Rust + SQLite-on-Longhorn-RWO counterpart (single replica,
no iac/no Vault, CI mirrors the upstream image); the power-cut recovery story.
erp is referenced in prose only (its own guidebook lands next). Sibling-repo code
via full gitea URLs; 2 mermaid diagrams MCP-validated; zero dead links.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Deep, code-grounded tree-docs guidebook under vibe/guidebooks/factory-provisioning/,
explored from the actual playbooks/roles and tofu code:
- Hub: the two provisioning engines (operator-run Ansible vs CI-applied OpenTofu),
a green-field bring-up flow, master index, maintenance rule.
- ansible/ sub-tree: ordered pages 01-system .. 06-recover, an inventory & variables
concept page, and a Tier-1/Tier-2 roles reference (hashicorp_vault, step_ca,
crowdsec, pihole, deploy_docker_compose + the gitea_* family and helpers).
- opentofu/ sub-tree: factory-iac (Cloudflare/OVH/GCP/Gitea/Vault edge +
cloudflare_token module), postgres-iac (per-app DB/role/pgbouncer lookup),
ci-apply-flow (Gitea OIDC-JWT -> Vault -> auto-approve apply).
Cross-linked bidirectionally with the lab-ecosystem guidebook and the safe-env
ADR/PRD (the sandbox rehearses exactly these engines). 14 mermaid diagrams
MCP-validated; zero dead links. Authored by the Lab Cartographer cohort.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add a root AGENTS.md (ecosystem map of factory/tools/cms + agent operating
rules + the persona cohort & workflow) and a new vibe/ knowledge base for LLM
agents, modeled on tree-docs conventions and the factory house style.
vibe/ folders (each with a README hub + contribution rules):
- ADR/ optimized MADR-lite; canonical home going forward (doc/adr stays historical)
- PRD/ one subfolder per PRD, mandatory STATUS.md, QA strategy for big ones
- investigations/ single INV-NNN-slug.md, or stub + folder w/ notebooks
- guidebooks/ tree-docs maps; lab-ecosystem guidebook of factory+tools+cms
- runbooks/ [AGENT]/[HUMAN] step procedures (EN; doc/runbooks stays FR)
- shareouts/ dated FR handouts (decks/mp4)
Seed content (first ADR + PRD): a safe, production-like environment to rehearse
risky changes and recovery without touching real prod — local-only sandbox
(k3d + arm64 VMs) with a hard prod/sandbox isolation boundary. Includes
INV-001 (prod blast-radius couplings), the ecosystem guidebook, and a FR shareout.
Conventions enforced: no-tombstone rule, breadcrumb spine, bidirectional
cross-links, theme:base mermaid (MCP-validated) + ordered-list-after-diagram.
Built with a Workflow + persona cohort; 24 files, zero dead links.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Document, as a tree-docs tree, the end-to-end procedure to stand up a new
web application on the Arcodange platform — a mechanic spread across the
factory, tools and app repos with non-trivial ordering dependencies.
Covers: Gitea repo creation (org-secret inheritance), Postgres DB + owner
role (factory/postgres/iac), platform Vault declaration (gitea_cicd_<app>
+ policies, tools/hashicorp-vault/iac), the app Helm chart (VSO dynamic
secrets via pgbouncer), the app Terraform (app_roles module), the CI
workflows (tofu apply + image build, incl. the copy-pasted role pitfall),
and ArgoCD registration (factory/argocd/values.yaml). Adds a naming-
conventions concept page and an ordered checklist.
Wires the legacy doc/adr "setup hello world web app" item and the factory
README to the runbook. New docs live under doc/ (singular) per the PR #8
convention.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The 20260509 ADR landed in docs/adr/ (plural) by mistake. Convention
is doc/adr/ (alongside the existing 00_*, 01_*, … docs and the
network-architecture/cicd-architecture ADRs that pre-existed there).
Note : 20260407-*.md files in the typo'd docs/adr/ are still untracked
(never committed) — separate cleanup task.
Documents the authentication layer added to telegram-gateway in
Phase 1.5 :
- principal bot @arcodange_factory_bot (handler=auth) gère /auth, /whoami, /logout
- session Redis 24h keyed by Telegram from.id (TTL via AUTH_SESSION_TTL)
- allowlist optionnelle (ALLOWED_USERS) — silent drop avant la gate
- requireAuth secure-by-default (true), opt-out explicite par bot
- handler=auth force requireAuth=false (chicken-and-egg)
Cross-links bidirectionnels avec le code (Gitea URLs vers
arcodange/telegram-gateway), AUTH.md (user-facing) et HOWTO_ADD_BOT.md
(Cas 2 mis à jour). Diagrammes mermaid avec contrastes explicites.
Aligns with the upstream repo rename
(arcodange/homelab-gateway → arcodange/telegram-gateway) so the name
matches the public URL tg.arcodange.fr and Arcodange's naming
conventions.
Adds the homelab-gateway Argo CD Application pointing at
arcodange/homelab-gateway (user space, like dance-lessons-coach).
Image Updater watches gitea.arcodange.lab/arcodange/homelab-gateway:latest
with digest strategy.
Phase 1 of the Telegram webhook gateway — a long-running pod that
receives webhooks (no more polling) and routes per-bot to handler
implementations. Initial bot: @arcodange_factory_bot, slug=factory,
echo handler.
Removes the commented PACKAGES_TOKEN/HOMELAB_CA_CERT blocks and the legacy
"Deploy Argo CD" play that were left behind during the migration to
Helm-based ArgoCD.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ansible/arcodange/factory/ansible.cfg sets collections_path so ansible
commands run from inside the collection directory still find user-installed
collections under ~/.ansible/collections.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The apps template hardcoded automated{prune,selfHeal} for every app. Some
apps (e.g. tools, where Vault unseal is manual) need a custom syncPolicy
without selfHeal. Read $app_attr.syncPolicy when set, fall back to the
existing automated default otherwise. Use the override on `tools` to keep
the existing behavior explicit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Re-running the role would leave behind crowdsec pods stuck in Failed phase
(typically after a config error on a previous run), which then blocked the
Traefik middleware refresh. Delete them up front so the next reconcile
schedules fresh pods.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pins the actcache server to a fixed port (43707) and exposes it, then
mounts /mnt/arcodange/gitea-runner-cache and /mnt/arcodange/gitea-runner-act
into the runner so the actions/cache and act image layer cache survive
container restarts. Moves the runner onto a dedicated `gitea_action_network`
so CI job containers can reach the cache server by name without sharing the
host network.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the placeholder "Success Metrics" section with a detailed
walkthrough of the internal PKI: Step CA provisioners, cert-manager +
StepClusterIssuer wiring, certificate issuance/renewal sequence diagram,
device-trust installation steps, and troubleshooting playbook for the
common stuck-CertificateRequest / Traefik TLS / device-trust failures.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Captures the post-mortem of the April 13 power-cut: incident timeline,
retrospective, and architecture/role diagrams. Adds an ADR explaining why
Longhorn cannot re-associate orphaned replica directories after a nuclear
reinstall (engine-id naming), plus block-device recovery runbooks and the
`playbooks/recover/longhorn_data.yml` automation that wires `merge-longhorn-layers.py`
to rebuild PVCs from raw `volume-head-*.img` chains.
Also extends the k3s_pvc backup to capture Longhorn `volumes`/`settings` CRDs
(needed for the fast-path restore) and rewrites the restore script with a
fallback dir + English messages.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Moves the local ansible runtime from a global `uv tool install ansible-core`
(which required remembering `--with kubernetes --with jmespath --with dnspython`)
to a project-managed venv described by `pyproject.toml` + `uv.lock`. Fixes the
"Failed to import the required Python library (kubernetes)" error on localhost.
The localhost inventory entry now derives `ansible_python_interpreter` from
`{{ ansible_playbook_python }}`, so `uv run ansible-playbook` is enough — no
more hardcoded user-specific paths.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
During the 2026-04-13 power cut recovery, DNS resolution failures blocked
Longhorn reinstall. Root causes:
- CoreDNS forwarded to a single hardcoded Pi-hole IP instead of both HA instances
- CoreDNS main Corefile forwarded to /etc/resolv.conf which pointed to itself on pi3
- Pi-hole lacked explicit upstream DNS, relying on DHCP-provided config
- dnsmasq system service conflicted with pihole-FTL on port 53
Changes:
- k3s_dns: forward CoreDNS to both Pi-hole HA instances (pi1 + pi3) dynamically
- k3s_dns: update main CoreDNS Corefile to forward to Pi-holes instead of resolv.conf
- pihole defaults: add explicit upstream DNS servers (8.8.8.8, 1.1.1.1, 8.8.4.4)
- pihole ha_setup: write /etc/dnsmasq.d/99-upstream.conf with explicit upstreams
- rpi: add dnsmasq user to dip group and disable conflicting dnsmasq service on Pi-hole nodes
See docs/adr/20260414-internal-dns-architecture.md for full rationale.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two bugs caused daemon.json to be overwritten with invalid content:
- Invalid `when` condition using unsupported Ansible inline stat syntax,
causing the existing file read to be silently skipped and docker_config
to always reset to {}
- Folded scalar `>` in set_fact converted the dict to a Python string
representation, which to_nice_json serialized as a JSON string instead
of an object
Fixes identified during 2026-04-13 power cut incident post-mortem.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This commit adds a detailed sequence diagram to the Docker storage optimization ADR, illustrating the workflow for configuring Docker storage, pinning images, and maintaining Longhorn performance.
Generated by Mistral Vibe.
Co-Authored-By: Mistral Vibe <vibe@mistral.ai>
This commit moves Architecture Decision Records (ADRs) from ../../../docs/adr/ to docs/adr/ in the arcodange/factory repository. This centralizes all ADRs in one location for better maintainability and discoverability.
Generated by Mistral Vibe.
Co-Authored-By: Mistral Vibe <vibe@mistral.ai>
This commit updates the README to include a detailed timeline of the playbook execution sequence, organized into sections for system setup, application setup, CI/CD, tools, and backups.
Generated by Mistral Vibe.
Co-Authored-By: Mistral Vibe <vibe@mistral.ai>
This commit uncomments the PostgreSQL backup section in the backup playbook to enable regular backups of the PostgreSQL database.
Generated by Mistral Vibe.
Co-Authored-By: Mistral Vibe <vibe@mistral.ai>
The default kube-rbac-proxy image (gcr.io/kubebuilder/kube-rbac-proxy:v0.15.0) is AMD64-only and fails on pi3 (ARM64). This commit overrides the image to use quay.io/brancz/kube-rbac-proxy:v0.15.0, which supports ARM64.
Note: pi2 (ARMv7) may work with AMD64 images, but pi3 (ARM64) requires an ARM64-compatible image.
Generated by Mistral Vibe.
Co-Authored-By: Mistral Vibe <vibe@mistral.ai>