ADR-0002 Phase B. Makes postgres/iac, argocd, and the conventions docs
multi-environment-capable WITHOUT activating any sandbox yet — every app
stays prod-only, so this change is behaviour-neutral:
- postgres/iac `tofu plan` is a no-op (proven: the elision flatten keys
are bare app names, db=<app>, role=<app>_role — identical addresses)
- the argocd apps.yaml render is byte-identical (181→181 lines, diff
empty) since no app declares `envs`
postgres/iac:
- variables.tf: `applications` becomes set(object({name, envs=optional(["prod"])}))
- main.tf: a `local.app_instances` flatten of applications × envs keyed by the
elided instance id (env=prod → "<app>"); per-app resources iterate it and
reference each.key / each.value.{database,role}. For prod-only apps every
resource address + attribute is unchanged. (main.tf also got a full
`tofu fmt` pass — the pgbouncer function block reindents 4→2 spaces, which
is cosmetic; the correctness gate is the CI tofu plan, not the text diff.)
- terraform.tfvars: string entries → { name = "..." } objects.
argocd/templates/apps.yaml:
- after the prod Application, a `range $app_attr.envs` loop renders one extra
Application per non-prod env: name/namespace `<app>-<env>`, shared repoURL,
helm.valueFiles [values.yaml, values-<env>.yaml], per-env syncPolicy override.
Renders nothing while no app sets `envs` → prod render unchanged.
docs:
- doc/runbooks/new-web-app/conventions.md (FR, authoritative): new section
"Plusieurs environnements pour une même app" — elision rule, suffix rule,
snake-case owner-role exception, erp/erp-sandbox table, ADR-0002 link.
- vibe/guidebooks/lab-ecosystem/naming-conventions.md (EN mirror): the env
coordinate section + a "Two sandbox models" section reconciling the
separate-cluster (ADR-0001, names repeat) vs in-cluster sibling (ADR-0002,
<env> suffix) strategies; Last Updated bumped; ADR-0002 cross-links.
Activation (erp gets envs=["prod","sandbox"] in postgres tfvars + argocd
values + erp/iac) is Phase D, gated by its own plan review.
Refs ADR-0002 (factory#15). Phase A = tools#2 (merged). Phase C = erp#11 (merged).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the placeholder References line with the PR URL so the
ADR↔PR crosslink is bidirectional per the AGENTS.md rule.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Records the decision to extend the <app> join key with a second
coordinate <env>, governed by an elision rule (env=prod elides → every
existing app's derived names are byte-identical and its tofu plan is a
no-op; non-prod envs take the <app>-<env> suffix, with the Postgres
owner role staying snake-case <app>_<env>_role).
Motivated by the ERP's incoming write-capable AI-agent skill: it needs
an in-cluster sandbox instance (erp-sandbox) with a prod-like Dolibarr
API + isolated database to rehearse writes before a human promotes them
to prod. The ADR reconciles this against ADR-0001 honestly — ADR-0001
rejected an in-cluster sandbox for INFRA-change rehearsal (shared
fleet-wide control planes); ADR-0002 operates one layer up where the
agent's only reach is the app's HTTP API against an isolated DB, so the
fleet blast radius is not in scope. The two are complementary; ADR-0002
does not supersede ADR-0001.
Also:
- vibe/ADR/README.md: index row for 0002 + Last Updated 2026-06-25
- PRD safe-prod-like-environment README: bidirectional back-link to
ADR-0002 on the Adjacent line + Last Updated 2026-06-25
Authored via the ADR Scribe persona, validated via the Continuity Warden
checklist (no-tombstone, breadcrumb, MADR-lite sections, dead-link scan,
bidirectional links).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The one-time import block from the previous change reconciled
cloudflare_r2_bucket.arcodange_tf into state (run #29: "Import complete",
"Apply complete! Resources: 1 imported"). It is now a no-op, so remove it.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Run #28 applied cleanly except cloudflare_r2_bucket.arcodange_tf: the bucket
exists in the EU jurisdiction, but its prior state entry lacked the jurisdiction,
so cloudflare provider >=5.20 read it as not-found, removed it from state, and
then failed to recreate it ("already exists"). Add a config-driven import block
with the jurisdiction-qualified id (<account_id>/<bucket_name>/<jurisdiction>) so
the next apply adopts the real bucket. No-op once reconciled; removable after.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
With the runner CA fix (#11) the iac workflow now runs far enough to apply,
which exposed two provider problems:
cloudflare drift — `cloudflare/cloudflare` floated on `~> 5` with no committed
lock file, so CI pulled v5.21.1 where `cloudflare_account_token.policies[].resources`
is a JSON string, not a map ("Incorrect attribute value type"). Fix:
- pin to `~> 5.21` and commit a multi-platform `.terraform.lock.hcl`
(linux_arm64 for the runner + darwin_arm64 for local);
- `jsonencode(...)` the module's policy resources;
- bind the cloudflare_token module to `cloudflare/cloudflare` explicitly (it was
defaulting to `hashicorp/cloudflare`, pulling a redundant provider);
- stop `.gitignore` from hiding the lock file (the old `.terraform.*` rule did).
gitea provider TLS — it runs inside the dflook/terraform-apply container, which
doesn't trust the homelab CA (only the ubuntu-latest-ca runner does), so it
failed `x509: certificate signed by unknown authority` reaching
gitea.arcodange.lab. Fix: feed it the homelab CA via the provider's `cacert_file`
(TF_VAR_gitea_cacert_file -> the homelab.pem the workflow already materializes).
Validated locally with `tofu validate` + provider-schema inspection (no prod
calls). Complements #11. Out of scope (need a live run / operator): the OVH
consumer-key scope, and the R2 bucket "not found" on refresh (a state reconcile).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
After the move to the self-signed internal DNS (gitea.arcodange.lab /
vault.arcodange.lab), the default `ubuntu-latest` runner image does not
trust the homelab CA, so the `uses:` clone of the vault-action over HTTPS
fails TLS verification. webapp's workflows already moved to the
`ubuntu-latest-ca` runner (whose image ships the homelab CA); apply the
same to the factory `iac` and `postgres` tofu workflows.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Per-session Claude Code checkouts live under .claude/worktrees/<slug>/
on the trunk; keep them out of git so the main checkout stays clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The two factory-provisioning sub-hubs were the only guidebook index pages without
the "alter a documented component -> update its page in the same PR" reminder that
every sibling hub carries. Add a scoped maintenance rule to each, pointing back to
the factory-provisioning maintenance rule and the guidebooks' Rules to contribute,
so no folder hub silently drifts.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Two agent-oriented runbooks under vibe/runbooks/ with [AGENT]/[HUMAN] step
markers, grounded in real diffs:
- new-tool.md : add a platform component to the tools repo so ArgoCD deploys it
into the tools namespace (wrapper Chart.yaml + the tool library + a row in
chart/values.yaml; optional iac/ for secrets). Mirrors the prometheus/crowdsec
additions.
- new-app.md : stand up a brand-new application across THREE repos (app +
factory + tools) with the strict ordering dependency and the TERRAFORM_SSH_KEY
pitfall. Phase-by-phase mapped to the dance-lessons-coach onboarding PRs
(#89/#97/#98/#99/#100), factory #1/#2, tools #1; the FR doc/runbooks/new-web-app
is linked as the detailed companion.
2 mermaid diagrams MCP-validated; zero dead links across the vibe tree.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Tree-docs guidebook under vibe/guidebooks/applications/ documenting the common
app pattern and two contrasting archetypes, drilling into lab-ecosystem/01-factory
(bidirectional):
- README.md : the shared app pattern (repo = Dockerfile + chart + optional iac +
CI; ArgoCD app-of-apps; the <app> join key; .fr vs .lab ingress conventions) +
a two-archetype comparison.
- webapp.md : canonical Go + Postgres exemplar (chart, VaultAuth/Static/Dynamic
CRDs, inline iac vs the shared app_roles module, CI); notes the current nuance
that the live pod still uses the static pgbouncer_auth DATABASE_URL.
- url-shortener.md : Rust + SQLite-on-Longhorn-RWO counterpart (single replica,
no iac/no Vault, CI mirrors the upstream image); the power-cut recovery story.
erp is referenced in prose only (its own guidebook lands next). Sibling-repo code
via full gitea URLs; 2 mermaid diagrams MCP-validated; zero dead links.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Deep, code-grounded tree-docs guidebook under vibe/guidebooks/factory-provisioning/,
explored from the actual playbooks/roles and tofu code:
- Hub: the two provisioning engines (operator-run Ansible vs CI-applied OpenTofu),
a green-field bring-up flow, master index, maintenance rule.
- ansible/ sub-tree: ordered pages 01-system .. 06-recover, an inventory & variables
concept page, and a Tier-1/Tier-2 roles reference (hashicorp_vault, step_ca,
crowdsec, pihole, deploy_docker_compose + the gitea_* family and helpers).
- opentofu/ sub-tree: factory-iac (Cloudflare/OVH/GCP/Gitea/Vault edge +
cloudflare_token module), postgres-iac (per-app DB/role/pgbouncer lookup),
ci-apply-flow (Gitea OIDC-JWT -> Vault -> auto-approve apply).
Cross-linked bidirectionally with the lab-ecosystem guidebook and the safe-env
ADR/PRD (the sandbox rehearses exactly these engines). 14 mermaid diagrams
MCP-validated; zero dead links. Authored by the Lab Cartographer cohort.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add a root AGENTS.md (ecosystem map of factory/tools/cms + agent operating
rules + the persona cohort & workflow) and a new vibe/ knowledge base for LLM
agents, modeled on tree-docs conventions and the factory house style.
vibe/ folders (each with a README hub + contribution rules):
- ADR/ optimized MADR-lite; canonical home going forward (doc/adr stays historical)
- PRD/ one subfolder per PRD, mandatory STATUS.md, QA strategy for big ones
- investigations/ single INV-NNN-slug.md, or stub + folder w/ notebooks
- guidebooks/ tree-docs maps; lab-ecosystem guidebook of factory+tools+cms
- runbooks/ [AGENT]/[HUMAN] step procedures (EN; doc/runbooks stays FR)
- shareouts/ dated FR handouts (decks/mp4)
Seed content (first ADR + PRD): a safe, production-like environment to rehearse
risky changes and recovery without touching real prod — local-only sandbox
(k3d + arm64 VMs) with a hard prod/sandbox isolation boundary. Includes
INV-001 (prod blast-radius couplings), the ecosystem guidebook, and a FR shareout.
Conventions enforced: no-tombstone rule, breadcrumb spine, bidirectional
cross-links, theme:base mermaid (MCP-validated) + ordered-list-after-diagram.
Built with a Workflow + persona cohort; 24 files, zero dead links.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Document, as a tree-docs tree, the end-to-end procedure to stand up a new
web application on the Arcodange platform — a mechanic spread across the
factory, tools and app repos with non-trivial ordering dependencies.
Covers: Gitea repo creation (org-secret inheritance), Postgres DB + owner
role (factory/postgres/iac), platform Vault declaration (gitea_cicd_<app>
+ policies, tools/hashicorp-vault/iac), the app Helm chart (VSO dynamic
secrets via pgbouncer), the app Terraform (app_roles module), the CI
workflows (tofu apply + image build, incl. the copy-pasted role pitfall),
and ArgoCD registration (factory/argocd/values.yaml). Adds a naming-
conventions concept page and an ordered checklist.
Wires the legacy doc/adr "setup hello world web app" item and the factory
README to the runbook. New docs live under doc/ (singular) per the PR #8
convention.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The 20260509 ADR landed in docs/adr/ (plural) by mistake. Convention
is doc/adr/ (alongside the existing 00_*, 01_*, … docs and the
network-architecture/cicd-architecture ADRs that pre-existed there).
Note : 20260407-*.md files in the typo'd docs/adr/ are still untracked
(never committed) — separate cleanup task.
Documents the authentication layer added to telegram-gateway in
Phase 1.5 :
- principal bot @arcodange_factory_bot (handler=auth) gère /auth, /whoami, /logout
- session Redis 24h keyed by Telegram from.id (TTL via AUTH_SESSION_TTL)
- allowlist optionnelle (ALLOWED_USERS) — silent drop avant la gate
- requireAuth secure-by-default (true), opt-out explicite par bot
- handler=auth force requireAuth=false (chicken-and-egg)
Cross-links bidirectionnels avec le code (Gitea URLs vers
arcodange/telegram-gateway), AUTH.md (user-facing) et HOWTO_ADD_BOT.md
(Cas 2 mis à jour). Diagrammes mermaid avec contrastes explicites.
Aligns with the upstream repo rename
(arcodange/homelab-gateway → arcodange/telegram-gateway) so the name
matches the public URL tg.arcodange.fr and Arcodange's naming
conventions.
Adds the homelab-gateway Argo CD Application pointing at
arcodange/homelab-gateway (user space, like dance-lessons-coach).
Image Updater watches gitea.arcodange.lab/arcodange/homelab-gateway:latest
with digest strategy.
Phase 1 of the Telegram webhook gateway — a long-running pod that
receives webhooks (no more polling) and routes per-bot to handler
implementations. Initial bot: @arcodange_factory_bot, slug=factory,
echo handler.
Removes the commented PACKAGES_TOKEN/HOMELAB_CA_CERT blocks and the legacy
"Deploy Argo CD" play that were left behind during the migration to
Helm-based ArgoCD.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ansible/arcodange/factory/ansible.cfg sets collections_path so ansible
commands run from inside the collection directory still find user-installed
collections under ~/.ansible/collections.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The apps template hardcoded automated{prune,selfHeal} for every app. Some
apps (e.g. tools, where Vault unseal is manual) need a custom syncPolicy
without selfHeal. Read $app_attr.syncPolicy when set, fall back to the
existing automated default otherwise. Use the override on `tools` to keep
the existing behavior explicit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Re-running the role would leave behind crowdsec pods stuck in Failed phase
(typically after a config error on a previous run), which then blocked the
Traefik middleware refresh. Delete them up front so the next reconcile
schedules fresh pods.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pins the actcache server to a fixed port (43707) and exposes it, then
mounts /mnt/arcodange/gitea-runner-cache and /mnt/arcodange/gitea-runner-act
into the runner so the actions/cache and act image layer cache survive
container restarts. Moves the runner onto a dedicated `gitea_action_network`
so CI job containers can reach the cache server by name without sharing the
host network.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the placeholder "Success Metrics" section with a detailed
walkthrough of the internal PKI: Step CA provisioners, cert-manager +
StepClusterIssuer wiring, certificate issuance/renewal sequence diagram,
device-trust installation steps, and troubleshooting playbook for the
common stuck-CertificateRequest / Traefik TLS / device-trust failures.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Captures the post-mortem of the April 13 power-cut: incident timeline,
retrospective, and architecture/role diagrams. Adds an ADR explaining why
Longhorn cannot re-associate orphaned replica directories after a nuclear
reinstall (engine-id naming), plus block-device recovery runbooks and the
`playbooks/recover/longhorn_data.yml` automation that wires `merge-longhorn-layers.py`
to rebuild PVCs from raw `volume-head-*.img` chains.
Also extends the k3s_pvc backup to capture Longhorn `volumes`/`settings` CRDs
(needed for the fast-path restore) and rewrites the restore script with a
fallback dir + English messages.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Moves the local ansible runtime from a global `uv tool install ansible-core`
(which required remembering `--with kubernetes --with jmespath --with dnspython`)
to a project-managed venv described by `pyproject.toml` + `uv.lock`. Fixes the
"Failed to import the required Python library (kubernetes)" error on localhost.
The localhost inventory entry now derives `ansible_python_interpreter` from
`{{ ansible_playbook_python }}`, so `uv run ansible-playbook` is enough — no
more hardcoded user-specific paths.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
During the 2026-04-13 power cut recovery, DNS resolution failures blocked
Longhorn reinstall. Root causes:
- CoreDNS forwarded to a single hardcoded Pi-hole IP instead of both HA instances
- CoreDNS main Corefile forwarded to /etc/resolv.conf which pointed to itself on pi3
- Pi-hole lacked explicit upstream DNS, relying on DHCP-provided config
- dnsmasq system service conflicted with pihole-FTL on port 53
Changes:
- k3s_dns: forward CoreDNS to both Pi-hole HA instances (pi1 + pi3) dynamically
- k3s_dns: update main CoreDNS Corefile to forward to Pi-holes instead of resolv.conf
- pihole defaults: add explicit upstream DNS servers (8.8.8.8, 1.1.1.1, 8.8.4.4)
- pihole ha_setup: write /etc/dnsmasq.d/99-upstream.conf with explicit upstreams
- rpi: add dnsmasq user to dip group and disable conflicting dnsmasq service on Pi-hole nodes
See docs/adr/20260414-internal-dns-architecture.md for full rationale.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two bugs caused daemon.json to be overwritten with invalid content:
- Invalid `when` condition using unsupported Ansible inline stat syntax,
causing the existing file read to be silently skipped and docker_config
to always reset to {}
- Folded scalar `>` in set_fact converted the dict to a Python string
representation, which to_nice_json serialized as a JSON string instead
of an object
Fixes identified during 2026-04-13 power cut incident post-mortem.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This commit adds a detailed sequence diagram to the Docker storage optimization ADR, illustrating the workflow for configuring Docker storage, pinning images, and maintaining Longhorn performance.
Generated by Mistral Vibe.
Co-Authored-By: Mistral Vibe <vibe@mistral.ai>
This commit moves Architecture Decision Records (ADRs) from ../../../docs/adr/ to docs/adr/ in the arcodange/factory repository. This centralizes all ADRs in one location for better maintainability and discoverability.
Generated by Mistral Vibe.
Co-Authored-By: Mistral Vibe <vibe@mistral.ai>