Commit Graph

67 Commits

Author SHA1 Message Date
a644436746 🔒 fix(ansible): propagate vault_oauth__sh_b64 to user-owned namespaces (arcodange) (#3)
Co-authored-by: Gabriel Radureau <arcodange@gmail.com>
Co-committed-by: Gabriel Radureau <arcodange@gmail.com>
2026-05-06 14:18:06 +02:00
01f0f37691 chore(ansible): add per-collection ansible.cfg + drop trailing whitespace
ansible/arcodange/factory/ansible.cfg sets collections_path so ansible
commands run from inside the collection directory still find user-installed
collections under ~/.ansible/collections.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 12:55:54 +02:00
1688fe0dfd fix(crowdsec): clean up Failed pods before Traefik middleware reload
Re-running the role would leave behind crowdsec pods stuck in Failed phase
(typically after a config error on a previous run), which then blocked the
Traefik middleware refresh. Delete them up front so the next reconcile
schedules fresh pods.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 12:55:39 +02:00
499410a160 feat(cicd): persist gitea act-runner cache + isolate on dedicated docker network
Pins the actcache server to a fixed port (43707) and exposes it, then
mounts /mnt/arcodange/gitea-runner-cache and /mnt/arcodange/gitea-runner-act
into the runner so the actions/cache and act image layer cache survive
container restarts. Moves the runner onto a dedicated `gitea_action_network`
so CI job containers can reach the cache server by name without sharing the
host network.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 12:55:34 +02:00
e3e0decd98 docs(adr): extend network-architecture ADR with .lab SSL/TLS deep dive
Replaces the placeholder "Success Metrics" section with a detailed
walkthrough of the internal PKI: Step CA provisioners, cert-manager +
StepClusterIssuer wiring, certificate issuance/renewal sequence diagram,
device-trust installation steps, and troubleshooting playbook for the
common stuck-CertificateRequest / Traefik TLS / device-trust failures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 12:55:27 +02:00
1ae28cb944 docs(longhorn): document 2026-04-13 power-cut recovery + add data-recovery tooling
Captures the post-mortem of the April 13 power-cut: incident timeline,
retrospective, and architecture/role diagrams. Adds an ADR explaining why
Longhorn cannot re-associate orphaned replica directories after a nuclear
reinstall (engine-id naming), plus block-device recovery runbooks and the
`playbooks/recover/longhorn_data.yml` automation that wires `merge-longhorn-layers.py`
to rebuild PVCs from raw `volume-head-*.img` chains.

Also extends the k3s_pvc backup to capture Longhorn `volumes`/`settings` CRDs
(needed for the fast-path restore) and rewrites the restore script with a
fallback dir + English messages.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 12:55:18 +02:00
934b62d922 chore(ansible): use project-local uv venv for ansible runtime deps
Moves the local ansible runtime from a global `uv tool install ansible-core`
(which required remembering `--with kubernetes --with jmespath --with dnspython`)
to a project-managed venv described by `pyproject.toml` + `uv.lock`. Fixes the
"Failed to import the required Python library (kubernetes)" error on localhost.

The localhost inventory entry now derives `ansible_python_interpreter` from
`{{ ansible_playbook_python }}`, so `uv run ansible-playbook` is enough — no
more hardcoded user-specific paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 12:35:28 +02:00
e6fc24c101 fix(dns): harden DNS resilience after power-cut incident
During the 2026-04-13 power cut recovery, DNS resolution failures blocked
Longhorn reinstall. Root causes:
- CoreDNS forwarded to a single hardcoded Pi-hole IP instead of both HA instances
- CoreDNS main Corefile forwarded to /etc/resolv.conf which pointed to itself on pi3
- Pi-hole lacked explicit upstream DNS, relying on DHCP-provided config
- dnsmasq system service conflicted with pihole-FTL on port 53

Changes:
- k3s_dns: forward CoreDNS to both Pi-hole HA instances (pi1 + pi3) dynamically
- k3s_dns: update main CoreDNS Corefile to forward to Pi-holes instead of resolv.conf
- pihole defaults: add explicit upstream DNS servers (8.8.8.8, 1.1.1.1, 8.8.4.4)
- pihole ha_setup: write /etc/dnsmasq.d/99-upstream.conf with explicit upstreams
- rpi: add dnsmasq user to dip group and disable conflicting dnsmasq service on Pi-hole nodes

See docs/adr/20260414-internal-dns-architecture.md for full rationale.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 10:54:42 +02:00
355ab11c4d fix(system_docker): fix daemon.json corruption on re-run
Two bugs caused daemon.json to be overwritten with invalid content:
- Invalid `when` condition using unsupported Ansible inline stat syntax,
  causing the existing file read to be silently skipped and docker_config
  to always reset to {}
- Folded scalar `>` in set_fact converted the dict to a Python string
  representation, which to_nice_json serialized as a JSON string instead
  of an object

Fixes identified during 2026-04-13 power cut incident post-mortem.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 10:52:27 +02:00
ad70b424cf Add sequence diagram to Docker storage ADR
This commit adds a detailed sequence diagram to the Docker storage optimization ADR, illustrating the workflow for configuring Docker storage, pinning images, and maintaining Longhorn performance.

Generated by Mistral Vibe.
Co-Authored-By: Mistral Vibe <vibe@mistral.ai>
2026-04-08 11:33:03 +02:00
b299469d00 Consolidate ADRs into docs/adr/
This commit moves Architecture Decision Records (ADRs) from ../../../docs/adr/ to docs/adr/ in the arcodange/factory repository. This centralizes all ADRs in one location for better maintainability and discoverability.

Generated by Mistral Vibe.
Co-Authored-By: Mistral Vibe <vibe@mistral.ai>
2026-04-08 11:09:34 +02:00
fc9164f11e Update README with detailed playbook execution sequence
This commit updates the README to include a detailed timeline of the playbook execution sequence, organized into sections for system setup, application setup, CI/CD, tools, and backups.

Generated by Mistral Vibe.
Co-Authored-By: Mistral Vibe <vibe@mistral.ai>
2026-04-08 11:04:11 +02:00
c751b621ba Enable PostgreSQL backup in backup playbook
This commit uncomments the PostgreSQL backup section in the backup playbook to enable regular backups of the PostgreSQL database.

Generated by Mistral Vibe.
Co-Authored-By: Mistral Vibe <vibe@mistral.ai>
2026-04-08 11:04:07 +02:00
07a619b274 Fix step-issuer ARM64 compatibility on pi3
The default kube-rbac-proxy image (gcr.io/kubebuilder/kube-rbac-proxy:v0.15.0) is AMD64-only and fails on pi3 (ARM64). This commit overrides the image to use quay.io/brancz/kube-rbac-proxy:v0.15.0, which supports ARM64.

Note: pi2 (ARMv7) may work with AMD64 images, but pi3 (ARM64) requires an ARM64-compatible image.

Generated by Mistral Vibe.
Co-Authored-By: Mistral Vibe <vibe@mistral.ai>
2026-04-08 11:04:03 +02:00
9931f81998 Update Docker storage configuration and revoke token task 2026-04-07 19:19:03 +02:00
437fd506ed Fix Vault Gitea OIDC setup: remove trailing slash from bound_issuer and pass CA certificate 2026-04-07 19:17:47 +02:00
943915be74 gitea act runner: reuse docker images 2026-04-07 09:20:30 +02:00
8a82d14797 upgrade gitea version to 1.25.5 2026-04-06 10:55:20 +02:00
0285d171ff tweack backup and setup cronjob to fix pg table ownership 2026-03-15 22:14:12 +01:00
55d137132f backup k3s volumes 2026-01-23 18:26:28 +01:00
451dfa5133 restart traefik when editing crowdsec middleware 2026-01-03 20:08:00 +01:00
17e99db641 runner image and setup for gitea workflow with self signed cert 2026-01-03 12:44:27 +01:00
5b3c896a25 use self signed cert for internal domain arcodange.lab 2025-12-31 17:38:04 +01:00
91219c49f1 use exposed webapp.arcodange.fr instead in gitea cicd 2025-12-23 14:23:12 +01:00
1fd47e9d97 install pihole to fix failing duckdns name servers 2025-12-23 14:20:04 +01:00
8d6be311ae argocd: add --enable-helm to kustomize ; enable shell from web ui 2025-12-10 13:48:22 +01:00
2b4aa30a64 use cache redis with crowdsec traefik bouncer 2025-12-06 15:09:36 +01:00
cd3c4d86ff install socat package to enable kubectl port-forward 2025-12-06 15:09:12 +01:00
f4cb04c9c9 configure crowdsec captcha with cloudflare turnstile 2025-12-03 16:45:25 +01:00
17a0f23bbb declare gitea external service 2025-12-01 16:22:44 +01:00
f7bfe2f71d get cloudflared client real ip and fix crowdsec mw 2025-11-29 17:24:51 +01:00
72628f0f0e add crowdsec plugin and middleware for traefik 2025-11-26 14:20:09 +01:00
9b09e6bd86 fixes and set preferred_ip since new interface eth0 2025-10-09 17:27:42 +02:00
68fb29357a add tag to run single arcodange.factory.gitea_sync role 2025-09-09 09:03:51 +02:00
6d3adb5834 setup cron local mail reporting and longhorn recurring backup job 2025-09-08 13:25:02 +02:00
c6807851c5 edit crontab to store backup for postgres and gitea 2025-08-28 19:35:52 +02:00
c5a8d5ef52 fixes 2025-08-28 10:13:16 +02:00
6ec2d299fc fix gitea action registration 2025-08-27 18:11:14 +02:00
3cfc5f2bfd refactor storage and setup shared backup directory 2025-08-27 17:26:05 +02:00
588a6482e9 setup longhorn and prepare nfs server to store backups 2025-08-14 15:42:33 +02:00
b4bde14809 fixes 2025-08-09 17:01:18 +02:00
561331b825 fixes 2025-08-07 15:51:53 +02:00
b8636a6d48 document uv python package manager command for ansible setup - minor fixes in playbook 2025-08-05 12:22:27 +02:00
58aece92b6 disable allowIp middleware while fixing ip filtering - upgrade traefik and fix gitea admin urls by adding prefix 2025-08-04 17:35:11 +02:00
b185999478 add pi3 to inventory + fixes 2024-12-15 22:13:03 +01:00
fa0df6f175 create gitea tofu bot user 2024-11-05 23:31:13 +01:00
1c22b946d6 role management for postgres synergy with vault dynamic credentials 2024-10-30 12:23:14 +01:00
f9a47c8ccf traefik CA pem is a client crt not the Authority (let's encrypt) and is not needed here 2024-10-18 19:27:00 +02:00
50399328dc configure vault oidc login and cicd jwt login 2024-10-07 17:39:27 +02:00
2fd5ee703b gitea_action: fix extra_hosts 2024-09-29 17:11:38 +02:00