docs(vibe): bootstrap vibe/ knowledge tree + ecosystem AGENTS.md

Add a root AGENTS.md (ecosystem map of factory/tools/cms + agent operating rules + the persona cohort & workflow) and a new vibe/ knowledge base for LLM agents, modeled on tree-docs conventions and the factory house style.

vibe/ folders (each with a README hub + contribution rules):
- ADR/ optimized MADR-lite; canonical home going forward (doc/adr stays historical)
- PRD/ one subfolder per PRD, mandatory STATUS.md, QA strategy for big ones
- investigations/ single INV-NNN-slug.md, or stub + folder w/ notebooks
- guidebooks/ tree-docs maps; lab-ecosystem guidebook of factory+tools+cms
- runbooks/ [AGENT]/[HUMAN] step procedures (EN; doc/runbooks stays FR)
- shareouts/ dated FR handouts (decks/mp4)

Seed content (first ADR + PRD): a safe, production-like environment to rehearse risky changes and recovery without touching real prod — local-only sandbox (k3d + arm64 VMs) with a hard prod/sandbox isolation boundary. Includes INV-001 (prod blast-radius couplings), the ecosystem guidebook, and a FR shareout.

Conventions enforced: no-tombstone rule, breadcrumb spine, bidirectional cross-links, theme:base mermaid (MCP-validated) + ordered-list-after-diagram.

Built with a Workflow + persona cohort; 24 files, zero dead links.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
47
vibe/guidebooks/README.md
Normal file
@@ -0,0 +1,47 @@
[vibe](../README.md) > **Guidebooks**

# Guidebooks

> **Status:** Active
> **Last Updated:** 2026-06-23
> **Related:** [vibe runbooks](../runbooks/README.md) · [vibe shareouts](../shareouts/README.md) · canonical docs under [doc/](../../doc/README.md)

## What a guidebook is

A **guidebook** is a *tree-doc reference map* of the lab: a navigable set of linked Markdown pages (a root index, per-folder README hubs, and leaf pages wired with breadcrumbs and bidirectional cross-references) whose job is to **describe how the system is actually wired right now** — components, the conventions that join them, and the data/control flows between them.

Guidebooks are descriptive maps, not procedures. They answer *"how does this fit together?"* For *"how do I execute X step by step?"* see the [runbooks](../runbooks/README.md). For *"why was it built this way?"* see the architecture decision records under [doc/adr](../../doc/adr/README.md).

```mermaid
%%{init: {'theme': 'base'}}%%
flowchart LR
    classDef src fill:#2563eb,stroke:#1e40af,color:#fff
    classDef proc fill:#059669,stroke:#047857,color:#fff
    classDef store fill:#7c3aed,stroke:#6d28d9,color:#fff
    SYS["Lab system<br>(factory + tools + cms)"]:::src --> GB["Guidebook<br>(tree-doc reference map)"]:::proc --> READER["Reader<br>(human or agent)<br>understands the wiring"]:::store
```

1. The lab system spans three repos — `factory`, `tools`, and `cms` — joined by the `<app>` naming convention.
2. A guidebook surveys that system and renders it as a tree-doc reference map: indexed folders, breadcrumb-linked leaves, Mermaid flow diagrams.
3. A reader (a human onboarding, or an agent planning a change) consumes the guidebook to understand how the pieces wire together before touching anything.

## Key maintenance rule

> [!IMPORTANT]
> **If a component documented in a guidebook is altered, the guidebook page describing it MUST be updated in the same change.** A reference map that drifts from reality is worse than no map — it sends readers (and agents) confidently down dead paths. Treat the guidebook edit as part of the diff, not a follow-up: the PR that changes the component is the PR that updates its guidebook page.

## Index

| Guidebook | What it maps | Status |
|---|---|---|
| [Lab ecosystem](lab-ecosystem/README.md) | End-to-end map of `factory` + `tools` + `cms`: repos, the `<app>` join key, secrets via Vault, CI/CD, ArgoCD, and the data/control flows that connect them | ✅ Active |

## Rules to contribute

1. **Use the `tree-docs` skill.** Guidebooks are tree-docs: author and grow them with the skill so breadcrumbs, hubs, and cross-links stay consistent.
2. **Breadcrumb spine on every file.** The first line of each page is its breadcrumb trail: ancestors are relative links, the current page is the bold-unlinked last item, separator is ` > ` (space-gt-space).
3. **README hub per subfolder.** Every folder carries a `README.md` index hub: a table of its children (link + one-line summary + status), sorted by importance/sequence, never alphabetically.
4. **Bidirectional links.** When page A references page B as related, page B references A back. Use descriptive link text — never "here" or "this".
5. **Mermaid preferences.** Begin each diagram with a `%%{init: {'theme': 'base'}}%%` directive, define a `classDef` palette legible on both light and dark backgrounds (dark fills, light text), use HTML `<br>` for line breaks, and follow every diagram immediately with a numbered ordered list restating the same flow in words.
6. **Status legend.** ✅ done · 🟡 beta · 🔴 critical · ⚠️ known issue · ❌ disabled · ⬜ not started.
7. **Honour the maintenance rule above** — update the relevant guidebook page in the same change that alters the component it documents.
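
The breadcrumb rule (rule 2) is mechanical enough to lint. The following is a minimal sketch of such a check, not part of the `tree-docs` skill: the function name `is_valid_breadcrumb` and both regular expressions are hypothetical, and "relative link" is interpreted here as "no URL scheme and no leading slash".

```python
import re

# Hypothetical patterns for the two kinds of breadcrumb item.
LINK = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")   # [text](target)
BOLD = re.compile(r"\*\*[^*]+\*\*")             # **Current page**


def is_valid_breadcrumb(first_line: str) -> bool:
    """Check one line against the breadcrumb convention in rule 2.

    Ancestors must be relative Markdown links, the last item must be
    bold and unlinked, and items must be separated by ' > '.
    """
    parts = first_line.strip().split(" > ")
    *ancestors, current = parts
    if BOLD.fullmatch(current) is None:
        return False
    for item in ancestors:
        match = LINK.fullmatch(item)
        if match is None:
            return False
        target = match.group(2)
        # Assumption: "relative" means no scheme and no leading slash.
        if "://" in target or target.startswith("/"):
            return False
    return True


if __name__ == "__main__":
    print(is_valid_breadcrumb("[vibe](../README.md) > **Guidebooks**"))
```

A check like this could run over the first line of every page under `vibe/`; a separator typed without its surrounding spaces fails because the split no longer isolates the bold item.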
122
vibe/guidebooks/lab-ecosystem/01-factory.md
Normal file
@@ -0,0 +1,122 @@
[vibe](../../README.md) > [Guidebooks](../README.md) > [Lab ecosystem](README.md) > **01 · factory**

# 01 · factory

> **Status:** ✅ Active
> **Last Updated:** 2026-06-23
> **Downstream:** [02 · tools](02-tools.md) · [03 · cms](03-cms.md)
> **Related:** [naming-conventions.md](naming-conventions.md) · [secrets-and-vault.md](secrets-and-vault.md) · [storage-and-recovery.md](storage-and-recovery.md)

`factory` is the **cornerstone admin repo**: it provisions the hosts and the cluster, declares what gets deployed, and owns the platform-level cloud/Gitea/Vault/Postgres state that every app leans on. It has four pillars — **Ansible** (imperative host & cluster setup), **ArgoCD** (declarative app-of-apps), **`iac/`** (OpenTofu for the cloud/Gitea/Vault edge), and **`postgres/iac/`** (per-app PostgreSQL provisioning). The repos `tools` and `cms` are deployed *by* factory's ArgoCD and are mapped in [02 · tools](02-tools.md) and [03 · cms](03-cms.md).

## Pillar 1 — Ansible ([`ansible/`](../../../ansible/))

The collection lives at `ansible/arcodange/factory/`. The inventory groups the three Pis and pins the service placement; numbered playbooks run an ordered narrative from bare OS to backups; `recover/` holds the disaster-recovery playbooks.

### Inventory (`inventory/hosts.yml`)

| Group | Hosts | Purpose |
|---|---|---|
| `raspberries` | `pi1`, `pi2`, `pi3` (`192.168.1.201-203`) | All three Pis; `ansible_user: pi` |
| `postgres` | `pi2` | The PostgreSQL host (docker-compose, outside k3s) |
| `gitea` | children of `postgres` (→ `pi2`) | Gitea co-located with PG on `pi2` |
| `pihole` | `pi1`, `pi3` | Internal DNS resolvers |
| `step_ca` | `pi1`, `pi2`, `pi3` | Step-CA PKI for `*.arcodange.lab` (primary `pi1`, replicas `pi2`/`pi3`) |
| `local` | `localhost` + the Pis | Control-node-local tasks |

### Numbered playbooks (`playbooks/`)

| Playbook | Imports / does | Notes |
|---|---|---|
| `01_system` | `system/system.yml` → rpi base, DNS, SSL, prepare disks, Docker, iSCSI, **k3s install** (`--docker --disable traefik`), CoreDNS, cert-issuer, Longhorn/Traefik config | k3s `v1.34.3+k3s1` via upstream `k3s-ansible`; pi1 server, pi2/pi3 agents |
| `02_setup` | `setup/setup.yml` → PostgreSQL + Gitea docker-compose; optional backup-NFS share | Stands up the two out-of-cluster source-of-truth services on `pi2` |
| `03_cicd` | Gitea **act-runner** docker-compose on `pi1`/`pi3` (`raspberries:&local:!gitea`), plus the ArgoCD/Image-Updater install | See the ArgoCD caveat below |
| `04_tools` | `tools/tools.yml` → `hashicorp_vault.yml`, `crowdsec.yml` | Platform tooling that bootstraps the cluster's Vault + CrowdSec |
| `05_backup` | `backup/backup.yml` → `postgres.yml`, `gitea.yml`, `k3s_pvc.yml` to `/mnt/backups` | Scheduled PG/Gitea/PVC backups; cron-report wiring present |
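
The `03_cicd` host pattern `raspberries:&local:!gitea` combines a union term, an intersection (`:&`), and an exclusion (`:!`). The sketch below shows, as plain set arithmetic, why it lands on `pi1`/`pi3`. It is a simplified illustration rather than Ansible's own resolver: `resolve_pattern` is a hypothetical helper, the group membership is copied from the inventory table on this page, and wildcards, ranges, and IPv6 addresses are not handled.

```python
# Group membership as listed in the inventory table on this page.
INVENTORY = {
    "raspberries": {"pi1", "pi2", "pi3"},
    "postgres": {"pi2"},
    "gitea": {"pi2"},  # children of `postgres`
    "pihole": {"pi1", "pi3"},
    "step_ca": {"pi1", "pi2", "pi3"},
    "local": {"localhost", "pi1", "pi2", "pi3"},
}


def resolve_pattern(pattern, inventory):
    """Resolve a simplified host pattern of the form 'a:b:&c:!d'.

    Plain terms are unioned first, then '&' terms are intersected,
    then '!' terms are excluded.
    """
    union, intersect, exclude = [], [], []
    for term in pattern.split(":"):
        if term.startswith("&"):
            intersect.append(term[1:])
        elif term.startswith("!"):
            exclude.append(term[1:])
        elif term:
            union.append(term)

    def members(name):
        # A name that is not a group is treated as a single host.
        return set(inventory.get(name, {name}))

    hosts = set()
    for name in union:
        hosts |= members(name)
    for name in intersect:
        hosts &= members(name)
    for name in exclude:
        hosts -= members(name)
    return hosts


if __name__ == "__main__":
    print(sorted(resolve_pattern("raspberries:&local:!gitea", INVENTORY)))
```

With the groups above, the intersection with `local` keeps all three Pis and the exclusion of `gitea` drops `pi2`, which matches the runner placement stated in the table.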

### Recovery playbooks (`playbooks/recover/`)

| Playbook | When to use |
|---|---|
| `longhorn.yml` | Recover Longhorn after a power cut when **Volume CRDs still exist** (CSI driver registration loss) |
| `longhorn_data.yml` | Recover app data from **raw replica `.img` files** when Volume CRDs are gone (block-device level) |

The tested power-cut recovery sequence (Longhorn restore → Vault unseal → VSO re-auth → ERP scaled up last) is documented in `CLUSTER_RECOVERY.md` at the lab root (outside this repo) and summarized in [storage-and-recovery.md](storage-and-recovery.md). Background on PVC recovery is in the [Longhorn PVC recovery ADR](../../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md).
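
The playbook choice and the recovery order above can be written down as data plus two small functions. This is only an illustrative sketch: the step labels are paraphrased from this page, and `pick_longhorn_playbook` and `next_step` are hypothetical names, not part of the collection.

```python
from typing import List, Optional

# Recovery order as summarized on this page (labels are paraphrased).
RECOVERY_SEQUENCE = [
    "Longhorn restore",
    "Vault unseal",
    "VSO re-auth",
    "ERP scale-up",
]


def pick_longhorn_playbook(volume_crds_exist: bool) -> str:
    """Choose the recover/ playbook from the state of the Volume CRDs."""
    return "longhorn.yml" if volume_crds_exist else "longhorn_data.yml"


def next_step(completed: List[str]) -> Optional[str]:
    """Return the first step not yet completed, preserving the order."""
    for step in RECOVERY_SEQUENCE:
        if step not in completed:
            return step
    return None


if __name__ == "__main__":
    print(pick_longhorn_playbook(volume_crds_exist=False))
    print(next_step(["Longhorn restore"]))
```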

### Key roles

`deploy_docker_compose` (renders compose stacks), `gitea_repo` / `gitea_token` / `gitea_secret` / `gitea_sync` (Gitea repo/token/secret/mirror management), `traefik_certs`, `playwright`, plus sub-roles `step_ca`, `hashicorp_vault`, `crowdsec`, `pihole`.

## Pillar 2 — ArgoCD app-of-apps ([`argocd/`](../../../argocd/))

A Helm chart whose `templates/apps.yaml` loops over `values.gitea_applications` and emits one `Application` CRD per app. Each Application derives everything from the app name: `repoURL = https://gitea.arcodange.lab/<org>/<app>`, `path = chart`, `namespace = <app>` (`CreateNamespace=true`), with `syncPolicy.automated` `prune: true` + `selfHeal: true` by default.

| App | Org override | Image Updater |
|---|---|---|
| `url-shortener` | — | — |
| `tools` | — | explicit `prune`+`selfHeal` |
| `webapp` | — | ✅ digest strategy |
| `telegram-gateway` | `arcodange` | ✅ digest strategy |
| `erp` | — | — |
| `cms` | — | ✅ digest strategy |
| `dance-lessons-coach` | `arcodange` | ✅ digest strategy |
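
To make the "everything derives from the app name" rule concrete, here is a hedged Python sketch of the manifest such a loop would emit for one app. It is not the chart's actual template. The default org `arcodange-org` is an assumption inferred from the repo URLs quoted elsewhere in this guidebook, `metadata.name` is assumed to equal the app name, and `application_for` is a hypothetical helper.

```python
GITEA_BASE = "https://gitea.arcodange.lab"
DEFAULT_ORG = "arcodange-org"  # assumption: inferred from the tools/cms repo URLs


def application_for(app: str, org: str = DEFAULT_ORG) -> dict:
    """Build an ArgoCD Application manifest (as a dict) from the app name."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        "metadata": {"name": app},  # assumption: named after the app
        "spec": {
            "source": {
                "repoURL": f"{GITEA_BASE}/{org}/{app}",
                "path": "chart",
            },
            "destination": {"namespace": app},
            "syncPolicy": {
                "automated": {"prune": True, "selfHeal": True},
                "syncOptions": ["CreateNamespace=true"],
            },
        },
    }


if __name__ == "__main__":
    # Org override taken from the table on this page.
    manifest = application_for("telegram-gateway", org="arcodange")
    print(manifest["spec"]["source"]["repoURL"])
```

The only per-app inputs are the name and, for two apps in the table, the org override; every other field is fixed or derived.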

> [!NOTE]
> The chart also templates a `longhorn_backup_target` and the ArgoCD Image Updater config (`argocd.arcodange.lab`). **ArgoCD itself is not currently deployed in-cluster** — its install is commented out in `03_cicd`. This page documents the intended steady state; treat ArgoCD as "designed, not live" until that step is enabled.

## Pillar 3 — OpenTofu ([`iac/`](../../../iac/))

Manages the cloud/Gitea/Vault edge. State lives in **GCS** (`backend "gcs"`, bucket `arcodange-tf`, prefix `factory/main`). Tofu authenticates to Vault via **Gitea OIDC JWT** (mount `gitea_jwt`, role `gitea_cicd`).

| Provider | Used for |
|---|---|
| `go-gitea/gitea` (`0.6.0`) | Repos, users, action secrets (e.g. the restricted `tofu_module_reader` CI user, CMS secrets) |
| `vault` (`4.4.0`) | KV secrets + policies + k8s auth roles (e.g. Longhorn GCS-backup creds & policy) |
| `google` (`7.0.1`) | GCS backup bucket + service account + HMAC key for Longhorn |
| `cloudflare/cloudflare` (`~> 5`) | R2 bucket, API tokens, CMS edge wiring (detailed in [03 · cms](03-cms.md)) |
| `ovh/ovh` (`2.8.0`) | OAuth2 client + IAM policy for the `arcodange.fr` domain (registrar = OVH) |

`modules/cloudflare_token` is a reusable scoped-token factory. The `iac/` module as a whole reuses the `<app>` name as its GCS state prefix (`<app>/main`, i.e. `factory/main` for this repo); see [naming-conventions.md](naming-conventions.md).

## Pillar 4 — per-app PostgreSQL ([`postgres/iac/`](../../../postgres/))

OpenTofu using the `cyrilgdn/postgresql` provider against PG on `192.168.1.202` (state prefix `factory/postgres`). It iterates over a `var.applications` set and, **per app**, creates:

| Resource | Name pattern | Purpose |
|---|---|---|
| Database | `<app>` | The app's database (`template0`, owned by the role) |
| Owner role (non-login) | `<app>_role` | Database owner; granted to dynamic users by Vault |
| Editor role (login) | `credentials_editor` | Shared admin role that can grant the per-app roles |
| `user_lookup()` function | per-`<app>` db | `SECURITY DEFINER` lookup for **pgbouncer** auth (granted to `pgbouncer_auth`, revoked from `public`) |

Current `applications` set: `webapp`, `erp`, `crowdsec`, `plausible`, `dance-lessons-coach`. Vault's PostgreSQL secrets engine then issues **dynamic** credentials on top of these roles — see [secrets-and-vault.md](secrets-and-vault.md). The pooler (`pgbouncer`) that consumes `user_lookup()` lives in the `tools` namespace — see [02 · tools](02-tools.md).
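
The naming patterns in the table can be expanded for the current `applications` set as follows. This is a sketch for orientation, not the OpenTofu code: `PgObjects`, `pg_objects_for`, and `quote_ident` are hypothetical names. The quoting helper is included because a name containing a hyphen, such as `dance-lessons-coach`, must be double-quoted when used as a PostgreSQL identifier in hand-written SQL.

```python
from dataclasses import dataclass

# The applications set listed on this page.
APPLICATIONS = ("webapp", "erp", "crowdsec", "plausible", "dance-lessons-coach")


@dataclass(frozen=True)
class PgObjects:
    database: str        # <app>
    owner_role: str      # <app>_role (non-login owner)
    editor_role: str     # shared login role
    lookup_grantee: str  # role allowed to call user_lookup()


def pg_objects_for(app: str) -> PgObjects:
    """Derive the per-app PostgreSQL object names from the app name."""
    return PgObjects(
        database=app,
        owner_role=f"{app}_role",
        editor_role="credentials_editor",
        lookup_grantee="pgbouncer_auth",
    )


def quote_ident(name: str) -> str:
    """Double-quote a PostgreSQL identifier, escaping embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


if __name__ == "__main__":
    for app in APPLICATIONS:
        objects = pg_objects_for(app)
        print(quote_ident(objects.database), quote_ident(objects.owner_role))
```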

## Provisioning order

```mermaid
%%{init: {'theme': 'base'}}%%
flowchart LR
    classDef proc fill:#059669,stroke:#047857,color:#fff
    classDef store fill:#7c3aed,stroke:#6d28d9,color:#fff
    S1["01_system<br>OS + k3s + Longhorn"]:::proc --> S2["02_setup<br>PG + Gitea (pi2)"]:::proc --> S3["03_cicd<br>runners + ArgoCD"]:::proc --> S4["04_tools<br>Vault + CrowdSec"]:::proc --> S5["05_backup<br>PG/Gitea/PVC"]:::proc
    IAC["iac/ + postgres/iac<br>(OpenTofu state in GCS)"]:::store -. "declares cloud/Gitea/Vault/PG" .- S2
```

1. **`01_system`** lays the OS, disks, Docker, and k3s with Longhorn + Traefik onto the three Pis.
2. **`02_setup`** stands up PostgreSQL and Gitea as docker-compose on `pi2` — the out-of-cluster source-of-truth services.
3. **`03_cicd`** registers the Gitea act-runners (and is where ArgoCD would install, currently commented out).
4. **`04_tools`** bootstraps the cluster's Vault and CrowdSec.
5. **`05_backup`** schedules PostgreSQL, Gitea, and k3s-PVC backups to `/mnt/backups`.
6. In parallel, **OpenTofu** (`iac/` and `postgres/iac/`) declares the cloud, Gitea, Vault, and PostgreSQL objects, keeping state in GCS.

## Cross-references

- [Lab ecosystem hub](README.md) — the whole-lab map this page sits under.
- [02 · tools](02-tools.md) — what ArgoCD deploys into the `tools` namespace (incl. pgbouncer that consumes the PG `user_lookup()`).
- [03 · cms](03-cms.md) — the CMS edge that `iac/cloudflare.tf` and `iac/ovh.tf` wire up.
- [naming-conventions.md](naming-conventions.md) — the `<app>` join key these pillars share.
- [secrets-and-vault.md](secrets-and-vault.md) — Gitea OIDC JWT for Tofu/CI and dynamic PG creds.
- [storage-and-recovery.md](storage-and-recovery.md) — Longhorn + GCS backup + power-cut recovery.
- [new-web-app runbook](../../../doc/runbooks/new-web-app/README.md) · [conventions](../../../doc/runbooks/new-web-app/conventions.md) — the step-by-step procedure these pillars support.
- [doc/adr](../../../doc/adr/README.md) — the canonical infrastructure ADRs.
- [Longhorn PVC recovery ADR](../../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md) — recovery background.
76
vibe/guidebooks/lab-ecosystem/02-tools.md
Normal file
@@ -0,0 +1,76 @@
[vibe](../../README.md) > [Guidebooks](../README.md) > [Lab ecosystem](README.md) > **02 · tools**

# 02 · tools

> **Status:** ✅ Active
> **Last Updated:** 2026-06-23
> **Upstream:** [01 · factory](01-factory.md)
> **Related:** [secrets-and-vault.md](secrets-and-vault.md) · [storage-and-recovery.md](storage-and-recovery.md)

The [`tools` repo](https://gitea.arcodange.lab/arcodange-org/tools) is deployed by factory's ArgoCD into the **`tools` namespace**. It is the platform layer that every app namespace depends on: secrets (Vault + VSO), observability (Prometheus + Grafana), edge security (CrowdSec), database pooling (pgbouncer / pgcat), caching (Redis/KeyDB), and analytics (Plausible + ClickHouse). Each component ships its own Helm chart or Kustomize overlay, and most carry an `iac/` directory of OpenTofu that declares the Vault config (roles, policies, dynamic-secret backends) that wires the component to secrets — see [secrets-and-vault.md](secrets-and-vault.md).

## Components in the `tools` namespace

| Component | What it does | How declared | How it gets secrets |
|---|---|---|---|
| **Vault** | Secrets engine: KV v1 + v2, transit, PostgreSQL **dynamic creds**; auth backends `kubernetes` + Gitea **OIDC/JWT** | Helm chart + `iac/` (Vault config of itself + apps) | Is the source of truth; unsealed at boot (1 key, threshold 1) |
| **VSO** (Vault Secrets Operator) | Injects Vault secrets into pods via `VaultAuth` + `VaultDynamicSecret` CRDs | Helm chart | Authenticates to Vault via **Kubernetes auth** (per-`<app>` role) |
| **Prometheus** | Metrics scraping + storage | Helm (community subchart) | — (scrape configs) |
| **Grafana** | Dashboards at `grafana.arcodange.lab`; datasources Prometheus + ClickHouse | Helm | Admin/datasource creds via VSO from Vault |
| **CrowdSec** | Behavioural detection + **Traefik bouncer** for the public edge | Helm + `iac/` | **Dynamic secrets** from Vault (VSO) |
| **pgbouncer** | Connection pooler to the **external** PostgreSQL on `pi2` | Helm | Auth via the per-app `user_lookup()` function (see [01 · factory](01-factory.md)); creds via VSO |
| **pgcat** | Alternative pooler (optional, **not the default**) | Helm | VSO-injected creds when enabled |
| **Redis / KeyDB** | In-memory cache; **KeyDB** master/replica (Redis-compatible) | Helm | VSO-injected auth when set |
| **Plausible** | Privacy-friendly web analytics | **Kustomize** | VSO-injected creds; backed by ClickHouse |
| **ClickHouse** | OLAP column store backing Plausible | **Kustomize** | VSO-injected creds |
| **`tool`** | A Helm **library chart** — shared templates/helpers reused by the other charts (not itself deployable) | Helm library chart | n/a |

## How tools fit together

```mermaid
%%{init: {'theme': 'base'}}%%
flowchart TB
    classDef store fill:#7c3aed,stroke:#6d28d9,color:#fff
    classDef proc fill:#059669,stroke:#047857,color:#fff
    classDef edge fill:#d97706,stroke:#b45309,color:#fff

    VAULT[("Vault<br>single source of truth")]:::store
    VSO["VSO<br>VaultAuth / VaultDynamicSecret"]:::proc
    PG[("External PostgreSQL<br>pi2 · 192.168.1.202")]:::store
    PGB["pgbouncer<br>pooler"]:::proc
    APPS["app pods<br>(webapp, erp, …)"]:::proc
    PROM["Prometheus"]:::proc
    GRAF["Grafana<br>grafana.arcodange.lab"]:::proc
    CH[("ClickHouse")]:::store
    PLA["Plausible"]:::proc
    CS["CrowdSec + Traefik bouncer"]:::edge

    VAULT --> VSO
    VSO -- "inject secrets" --> APPS
    VSO -- "inject secrets" --> PGB
    VSO -- "dynamic secret" --> CS
    APPS --> PGB --> PG
    PROM --> GRAF
    CH --> GRAF
    PLA --> CH
```

1. **Vault** holds every secret; **VSO** is the operator that delivers them into pods.
2. VSO **injects** static and dynamic secrets into the app pods, into **pgbouncer**, and supplies **CrowdSec** its dynamic secret.
3. App pods connect through **pgbouncer**, which pools connections to the **external PostgreSQL** on `pi2` (using the per-app `user_lookup()` function defined in factory's `postgres/iac/`).
4. **Prometheus** scrapes metrics and **ClickHouse** stores analytics; both are wired as **Grafana** datasources.
5. **Plausible** writes its analytics into **ClickHouse**.
6. **CrowdSec** runs as a Traefik bouncer on the public edge, fed dynamic secrets from Vault — the same edge that fronts the CMS in [03 · cms](03-cms.md).
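
One way to read the flow above is as a directed graph and to ask which components sit downstream of a given node, for example what is affected while Vault is sealed. The sketch below encodes only the arrows drawn in the diagram (so it omits the Grafana credentials that the component table says also come through VSO) and walks them breadth-first. `EDGES` and `downstream` are illustrative names, not part of the repo.

```python
from collections import deque

# Edges copied from the diagram on this page (source -> targets).
EDGES = {
    "Vault": ["VSO"],
    "VSO": ["app pods", "pgbouncer", "CrowdSec"],
    "app pods": ["pgbouncer"],
    "pgbouncer": ["PostgreSQL"],
    "Prometheus": ["Grafana"],
    "ClickHouse": ["Grafana"],
    "Plausible": ["ClickHouse"],
}


def downstream(start, edges):
    """Return every node reachable from `start` (breadth-first search)."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for target in edges.get(node, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


if __name__ == "__main__":
    print(sorted(downstream("Vault", EDGES)))
    print(sorted(downstream("Plausible", EDGES)))
```

Reachability here follows the direction of the arrows (secret and data flow), so it is a rough blast-radius view rather than a startup-order dependency graph.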

## Where to look

- Repo: [arcodange-org/tools](https://gitea.arcodange.lab/arcodange-org/tools) — each component is a top-level chart/overlay with its own `iac/`.
- Vault config patterns: [hashicorp-vault/iac/modules](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/hashicorp-vault/iac) (e.g. `app_roles`, `app_policy`) — referenced by the [naming convention](../../../doc/runbooks/new-web-app/conventions.md).

## Cross-references

- [Lab ecosystem hub](README.md) — the whole-lab map.
- [01 · factory](01-factory.md) — the ArgoCD that deploys this namespace, and the `postgres/iac/` roles + `user_lookup()` that pgbouncer consumes.
- [03 · cms](03-cms.md) — the public edge protected by **CrowdSec** (Turnstile → CrowdSec wiring).
- [secrets-and-vault.md](secrets-and-vault.md) — full Vault detail: KV/transit/dynamic engines, Gitea OIDC JWT, VSO injection.
- [storage-and-recovery.md](storage-and-recovery.md) — Longhorn PVCs these stateful tools mount, and the Vault-unseal step in recovery.
82
vibe/guidebooks/lab-ecosystem/03-cms.md
Normal file
@@ -0,0 +1,82 @@
[vibe](../../README.md) > [Guidebooks](../README.md) > [Lab ecosystem](README.md) > **03 · cms**

# 03 · cms

> **Status:** ✅ Active
> **Last Updated:** 2026-06-23
> **Upstream:** [01 · factory](01-factory.md)
> **Related:** [02 · tools](02-tools.md) · [secrets-and-vault.md](secrets-and-vault.md)

The [`cms` repo](https://gitea.arcodange.lab/arcodange-org/cms) is the **public-facing site** of the lab: a Nuxt static site served at **`arcodange.fr`**, plus the OpenTofu that owns its Cloudflare edge and its Zoho email. It is the one app whose primary audience is the open Internet, so it ties together the public-DNS, tunnel, CAPTCHA, and email plumbing.

## The Nuxt site

| Aspect | Detail |
|---|---|
| App | Static **Nuxt** site |
| Chart | [`chart/`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/chart) — Helm chart, deployed as ArgoCD app **`cms`** into the `cms` namespace |
| Image | Built in CI to the Gitea registry; ArgoCD **Image Updater** tracks `gitea.arcodange.lab/arcodange-org/cms:latest` with the **digest** strategy (see [01 · factory](01-factory.md)) |
| Hostname | `arcodange.fr` (public) |

## Cloudflare edge ([`cloudflare/`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/cloudflare))

OpenTofu (state in cloud object storage) manages the `arcodange.fr` zone. The domain is **registered at OVH** (factory's [`iac/ovh.tf`](../../../iac/ovh.tf) grants the CMS an OVH OAuth2 client to edit nameservers) but its **DNS is delegated to Cloudflare**. The Cloudflare API token + account ID are pushed into the CMS Gitea repo as action secrets and mirrored into Vault by factory's [`iac/cloudflare.tf`](../../../iac/cloudflare.tf).

| Cloudflare object | Purpose |
|---|---|
| Zone `arcodange.fr` | Public DNS for the site + email records |
| Cloudflare **Pages** option | Static-hosting alternative for the Nuxt build |
| **Cloudflared** Zero-Trust tunnel | Exposes **internal Traefik** to the Internet without opening home-LAN ports |
| **Turnstile** CAPTCHA | Bot challenge on forms; wired to **CrowdSec** for decisioning |

The Cloudflared tunnel token and Turnstile secret are stored in **Vault** (see [secrets-and-vault.md](secrets-and-vault.md)); the Turnstile → CrowdSec link is the public-edge guard documented in [02 · tools](02-tools.md).

## Zoho email ([`zoho/`](https://gitea.arcodange.lab/arcodange-org/cms/src/branch/main/zoho))

Sets up email for `arcodange.fr`: org/account lookup via the Zoho API + shell scripts, the full DNS authentication record set, and the public aliases.

| DNS record | Role |
|---|---|
| CNAME (verify) | Domain ownership verification |
| **MX** | Mail routing to Zoho |
| **SPF** | Authorized senders |
| **DKIM** | Outbound signing |
| **DMARC** | Alignment + reporting policy |
| **BIMI** | Brand logo in inboxes |

Seven aliases are provisioned: **bonjour, contact, analytics, books, abonnements, helloworld, bureaux**.
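
As a quick cross-check of the record set and alias list above, the sketch below expands the aliases into full addresses and reports which record roles are missing from a given set. It does not query DNS or the Zoho API; `alias_addresses` and `missing_records` are hypothetical helpers, and the record labels are simply the ones used in the table.

```python
DOMAIN = "arcodange.fr"

# Aliases and record roles as listed on this page.
ALIASES = ("bonjour", "contact", "analytics", "books",
           "abonnements", "helloworld", "bureaux")
REQUIRED_RECORDS = ("CNAME (verify)", "MX", "SPF", "DKIM", "DMARC", "BIMI")


def alias_addresses(domain: str = DOMAIN) -> list:
    """Expand each alias into a full address on the given domain."""
    return [f"{alias}@{domain}" for alias in ALIASES]


def missing_records(present) -> list:
    """Return the required record roles absent from `present`, in table order."""
    present = set(present)
    return [role for role in REQUIRED_RECORDS if role not in present]


if __name__ == "__main__":
    print(alias_addresses())
    print(missing_records({"MX", "SPF", "DKIM"}))
```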

## Public request + email path

```mermaid
%%{init: {'theme': 'base'}}%%
flowchart LR
    classDef edge fill:#d97706,stroke:#b45309,color:#fff
    classDef proc fill:#059669,stroke:#047857,color:#fff
    classDef store fill:#7c3aed,stroke:#6d28d9,color:#fff

    USER(["Visitor"]):::edge
    CF["Cloudflare<br>DNS + Turnstile"]:::edge
    TUN["Cloudflared tunnel"]:::edge
    TRAEFIK["internal Traefik"]:::proc
    CS["CrowdSec bouncer"]:::proc
    CMS["cms pod (Nuxt)<br>arcodange.fr"]:::proc
    MAIL(["Sender"]):::edge
    ZOHO["Zoho<br>MX / SPF / DKIM / DMARC / BIMI"]:::store

    USER --> CF -- "Turnstile challenge" --> TUN --> TRAEFIK --> CS --> CMS
    MAIL -- "MX lookup arcodange.fr" --> ZOHO
```

1. A **visitor** resolves `arcodange.fr` through **Cloudflare** DNS; form submissions hit a **Turnstile** challenge.
2. Traffic enters the home LAN through the **Cloudflared** Zero-Trust tunnel — no home-LAN ports are opened.
3. The tunnel lands on **internal Traefik**, which routes through the **CrowdSec** bouncer (fed Turnstile/decision signals) to the **`cms`** Nuxt pod.
4. Separately, **email** to `arcodange.fr` follows the **MX** record to **Zoho**, with **SPF/DKIM/DMARC/BIMI** authenticating and presenting the mail; the seven aliases land there.

## Cross-references

- [Lab ecosystem hub](README.md) — the whole-lab map.
- [01 · factory](01-factory.md) — the ArgoCD app `cms`, and `iac/cloudflare.tf` / `iac/ovh.tf` that grant the CMS its Cloudflare token and OVH nameserver-edit rights.
- [02 · tools](02-tools.md) — **CrowdSec** (the Traefik bouncer the Turnstile challenge feeds).
- [secrets-and-vault.md](secrets-and-vault.md) — the Cloudflared tunnel token and Turnstile/Cloudflare secrets stored in Vault.
- Repo: [arcodange-org/cms](https://gitea.arcodange.lab/arcodange-org/cms).
116
vibe/guidebooks/lab-ecosystem/README.md
Normal file
@@ -0,0 +1,116 @@
[vibe](../../README.md) > [Guidebooks](../README.md) > **Lab ecosystem**

# Lab ecosystem

> **Status:** ✅ Active
> **Last Updated:** 2026-06-23
> **Related:** [ADR-0001 · safe prod-like environment](../../ADR/0001-safe-prod-like-environment.md) · [PRD · safe prod-like environment](../../PRD/safe-prod-like-environment/README.md) · [INV-001 · prod blast-radius couplings](../../investigations/INV-001-prod-blast-radius-couplings.md)

## What this is

This guidebook is the **end-to-end map of the Arcodange home lab** — how the three repos (`factory`, `tools`, `cms`), the three Raspberry Pis, and the cloud edge wire together into one running system. It is a *descriptive reference map*, not a procedure: it answers *"how does this fit together right now?"*. For *"how do I add a new app step by step?"* see the [new-web-app runbook](../../../doc/runbooks/new-web-app/README.md); for *"why was it built this way?"* see the [factory ADRs](../../../doc/adr/README.md).

The lab is run from **one control node** — a MacBook Pro M4 — driving everything via Ansible (imperative host setup) and OpenTofu (declarative cloud/Gitea/Vault/Postgres state). The three Pis (`pi1`/`pi2`/`pi3` = `192.168.1.201-203`) sit behind a home Livebox. `pi1` is the k3s server; `pi2`/`pi3` are agents. Gitea + PostgreSQL run as Docker Compose **outside** k3s on `pi2`'s disk; everything else runs **inside** k3s on Longhorn distributed block storage. The public edge is a Cloudflared Zero-Trust tunnel into the internal Traefik, with Cloudflare DNS and Zoho email fronting `arcodange.fr`.

## The whole lab, end to end

```mermaid
%%{init: {'theme': 'base'}}%%
flowchart TB
    classDef ctrl fill:#2563eb,stroke:#1e40af,color:#fff
    classDef host fill:#0891b2,stroke:#0e7490,color:#fff
    classDef proc fill:#059669,stroke:#047857,color:#fff
    classDef store fill:#7c3aed,stroke:#6d28d9,color:#fff
    classDef edge fill:#d97706,stroke:#b45309,color:#fff
    classDef dead fill:#6b7280,stroke:#4b5563,color:#fff

    MAC["Control node (MacBook Pro M4)<br>Ansible + OpenTofu"]:::ctrl

    subgraph LAN["Home LAN (Livebox) — 192.168.1.0/24"]
        subgraph PI2["pi2 · 192.168.1.202 (docker-compose, outside k3s)"]
            GITEA["Gitea<br>arcodange-org/*"]:::host
            PG[("PostgreSQL")]:::store
        end
        subgraph K3S["k3s cluster — pi1 server, pi2/pi3 agents"]
            ARGO["ArgoCD app-of-apps<br>/argocd"]:::proc
            LH[("Longhorn<br>block storage")]:::store
            VAULT["Vault + VSO<br>secrets"]:::store
            TRAEFIK["Traefik<br>ingress"]:::proc
            TOOLS["tools namespace<br>(Vault, Grafana, CrowdSec, …)"]:::host
            APPS["app namespaces<br>(webapp, erp, cms, …)"]:::host
        end
        OLLAMA["pi3 · ollama"]:::host
    end

    subgraph CLOUD["Cloud edge"]
        CF["Cloudflare DNS<br>+ Cloudflared tunnel"]:::edge
        ZOHO["Zoho<br>email (arcodange.fr)"]:::edge
        GCS[("GCS gs://arcodange-tf<br>OpenTofu state + Longhorn backup")]:::store
    end

    INTERNET(["Internet"]):::edge

    MAC -- "Ansible: provision hosts, k3s, docker-compose" --> PI2
    MAC -- "Ansible: k3s, Longhorn, Traefik" --> K3S
    MAC -- "OpenTofu: Gitea/Vault/PG/Cloudflare/OVH state" --> GITEA
    MAC -- "OpenTofu state" --> GCS

    GITEA -- "repoURL chart/" --> ARGO
    ARGO -- "Application CRDs (prune+selfHeal)" --> TOOLS
    ARGO -- "Application CRDs (prune+selfHeal)" --> APPS
    VAULT -- "VSO injects secrets into pods" --> TOOLS
    VAULT -- "VSO injects secrets into pods" --> APPS
    APPS -- "dynamic creds" --> PG
    LH -. "PVCs" .- TOOLS
    LH -. "PVCs" .- APPS
    LH -- "backup target" --> GCS

    INTERNET --> CF -- "tunnel" --> TRAEFIK --> APPS
    INTERNET --> ZOHO
```

1. The **control node** (MacBook) provisions the three Pis with Ansible (OS, disks, Docker, k3s, Longhorn, Traefik) and manages all SaaS/Gitea/Vault/Postgres state with OpenTofu.
2. On **pi2**, Gitea and PostgreSQL run as Docker Compose *outside* k3s, on the local disk — they are the source-of-truth services the cluster depends on.
3. OpenTofu keeps its **state in GCS** (`gs://arcodange-tf`), and Longhorn pushes volume **backups** to the same GCS project.
4. **Gitea** hosts every app repo; each repo's `chart/` directory is the deployable Helm chart.
5. **ArgoCD's app-of-apps** turns each Gitea repo into an `Application` CRD (automated `prune` + `selfHeal`) that deploys into the `tools` namespace and the per-app namespaces.
6. **Vault** is the single source of truth for secrets; the **Vault Secrets Operator (VSO)** injects them into pods via Kubernetes auth, and apps obtain dynamic PostgreSQL credentials from Vault for the database on `pi2`.
7. **Longhorn** provides the PVCs the in-cluster workloads mount, and backs up to GCS.
8. The **public edge** routes Internet traffic through Cloudflare DNS and a Cloudflared Zero-Trust **tunnel** into the internal **Traefik**, which fronts the app namespaces; **Zoho** handles `arcodange.fr` email.
|
||||
|
||||
> [!NOTE]
|
||||
> The ArgoCD Helm chart under [`argocd/`](../../../argocd/) is defined and templated, but **ArgoCD itself is not currently deployed in-cluster** (its install step is commented out in the `03_cicd` provisioning). The app-of-apps wiring documented here is the intended steady state; see [01 · factory](01-factory.md) for the caveat.
|
||||
|
||||
## Deploy / secrets / DNS flows

- **Deploy flow.** Push to a Gitea repo → CI builds an image into the Gitea registry → ArgoCD (via the app-of-apps and, for some apps, the Image Updater) syncs the `chart/` directory into the matching namespace with `prune` + `selfHeal`. The whole chain keys off one `<app>` identifier; see [naming-conventions.md](naming-conventions.md).
- **Secrets flow.** Vault is the **single source of truth** (no sops/age). CI authenticates to Vault via **Gitea OIDC JWT** (role `gitea_cicd_<app>`); pods receive secrets at runtime via **VSO** (Kubernetes auth + `VaultDynamicSecret` CRDs). Detail in [secrets-and-vault.md](secrets-and-vault.md).
- **DNS / edge flow.** Internal names resolve under `*.arcodange.lab` (Pi-hole + Step-CA-issued TLS). Public traffic for `arcodange.fr` enters through Cloudflare and a Cloudflared tunnel to internal Traefik; public TLS comes from Let's Encrypt via Traefik's DNS challenge (DuckDNS). Email runs through Zoho. Edge detail in [03 · cms](03-cms.md).

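As a rough illustration of the deploy flow, the sketch below builds the kind of `Application` manifest the app-of-apps is described as producing from a single `<app>` slug. The Gitea base URL, the `argocd` namespace, the `default` project, and the `HEAD` revision are assumptions made for the example; the real templates live in the `argocd/` chart and may differ.

```python
import json

# Hypothetical base URL: the real Gitea host is not stated on this page.
GITEA_BASE_URL = "https://gitea.example.lab/arcodange-org"


def argocd_application(app: str) -> dict:
    """Build an ArgoCD Application manifest (as a dict) from the <app> slug alone."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        # The "argocd" namespace is an assumption, not taken from the source.
        "metadata": {"name": app, "namespace": "argocd"},
        "spec": {
            "project": "default",  # assumption
            "source": {
                "repoURL": f"{GITEA_BASE_URL}/{app}",  # repo name derived from <app>
                "path": "chart",                       # each repo's chart/ directory
                "targetRevision": "HEAD",              # assumption
            },
            "destination": {
                "server": "https://kubernetes.default.svc",
                "namespace": app,                      # namespace derived from <app>
            },
            # Automated sync with prune + selfHeal, as described in the deploy flow.
            "syncPolicy": {"automated": {"prune": True, "selfHeal": True}},
        },
    }


if __name__ == "__main__":
    print(json.dumps(argocd_application("erp"), indent=2))
```

Everything app-specific in the manifest comes from the one argument, which is the point of the join key: no per-app value is supplied by hand.
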
## Master index

| Page | What it maps | Status |
|---|---|---|
| [01 · factory](01-factory.md) | The cornerstone admin repo: Ansible host/cluster provisioning, ArgoCD app-of-apps, OpenTofu (`iac/`), and per-app PostgreSQL (`postgres/iac/`) | ✅ Active |
| [02 · tools](02-tools.md) | The `tools` namespace: Vault, VSO, Prometheus, Grafana, CrowdSec, poolers, Redis/KeyDB, Plausible + ClickHouse, the `tool` library chart | ✅ Active |
| [03 · cms](03-cms.md) | The public-facing site: Nuxt static site, Cloudflare zone + tunnel + Turnstile, Zoho email (MX/SPF/DKIM/DMARC/BIMI + aliases) | ✅ Active |
| [naming-conventions.md](naming-conventions.md) | The `<app>` join key: one kebab-case name reused identically across Gitea, PG, Vault, k8s, ArgoCD, GCS, DNS | ✅ Active |
| [secrets-and-vault.md](secrets-and-vault.md) | How Vault is the single source of truth: Gitea OIDC JWT for CI, VSO injection for pods, dynamic PostgreSQL creds | ✅ Active |
| [storage-and-recovery.md](storage-and-recovery.md) | Longhorn block storage, GCS backup target, and the tested power-cut recovery sequence | ✅ Active |

## Status legend

✅ done · 🟡 beta · 🔴 critical · ⚠️ known issue · ❌ disabled · ⬜ not started.

## Maintenance rule

> [!IMPORTANT]
> **If you alter a component documented here, update its page in the same change.** A reference map that drifts from reality sends readers (and agents) confidently down dead paths. The PR that changes the component is the PR that updates its guidebook page; treat the doc edit as part of the diff, not a follow-up.

## Cross-references

- [ADR-0001 · safe prod-like environment](../../ADR/0001-safe-prod-like-environment.md): the decision this map supports.
- [PRD · safe prod-like environment](../../PRD/safe-prod-like-environment/README.md): the product framing of an isolated, prod-like sandbox.
- [INV-001 · prod blast-radius couplings](../../investigations/INV-001-prod-blast-radius-couplings.md): the couplings (the `<app>` join key, shared Vault/PG/Longhorn) that make blast radius real.
- [doc/adr](../../../doc/adr/README.md): the canonical infrastructure ADRs (in French).
- [new-web-app conventions](../../../doc/runbooks/new-web-app/conventions.md): the authoritative source for the `<app>` naming convention.

96
vibe/guidebooks/lab-ecosystem/naming-conventions.md
Normal file
@@ -0,0 +1,96 @@
[vibe](../../README.md) > [Guidebooks](../README.md) > [Lab ecosystem](README.md) > **Naming conventions (the `<app>` join key)**

# Naming conventions: the `<app>` join key

> **Status**: 🟢 Active
> **Last Updated**: 2026-06-23
> **Related**: [Lab ecosystem](README.md) · [Factory brick](01-factory.md) · [Secrets & Vault](secrets-and-vault.md) · [PRD: isolation boundary](../../PRD/safe-prod-like-environment/isolation-boundary.md)
> **Upstream (source of truth)**: [doc/runbooks/new-web-app/conventions.md](../../../doc/runbooks/new-web-app/conventions.md) (French, authoritative)

## TL;DR

Every application on the platform is pinned to **one** kebab-case identifier, `<app>` (e.g. `erp`, `webapp`, `url-shortener`, `dance-lessons-coach`). That single string is reused **character-for-character**, either bare or wrapped in a fixed prefix, suffix, or path, as the name of the app's Gitea repo, its PostgreSQL database and role, its Vault roles and policies, its Kubernetes namespace and ServiceAccount, its ArgoCD Application, its OpenTofu state prefix, and its DNS records. The bricks of the stack do not point at each other through explicit configuration; they **wire together by guessing each other's names from `<app>`**. Pick the name once, get it right, and the whole chain self-assembles. One typo anywhere, and the chain breaks silently.

## What `<app>` is

`<app>` is a **lowercase, kebab-case** slug. It is the join key of the entire platform: the one value that lets a dozen otherwise-independent systems agree on which resources belong to the same application without ever exchanging a config pointer. The canonical, authoritative definition (in French) lives in the runbook: [doc/runbooks/new-web-app/conventions.md](../../../doc/runbooks/new-web-app/conventions.md). This page is the English concept summary inside the ecosystem guidebook.

## The mapping: one name, every system

The table below shows how each system derives its identifier from `<app>`, with the `erp` application as the worked example.

| System | Identifier derived from `<app>` | Example (`erp`) |
| --- | --- | --- |
| Gitea repository | `arcodange-org/<app>` | `arcodange-org/erp` |
| PostgreSQL database | `<app>` | `erp` |
| PostgreSQL owner role (non-login) | `<app>_role` | `erp_role` |
| Vault dynamic DB role | `postgres/creds/<app>` | `postgres/creds/erp` |
| Vault Kubernetes auth role | `<app>` | `erp` |
| Vault runtime policy (pod) | `<app>` | `erp` |
| Vault CI/ops policy | `<app>-ops` | `erp-ops` |
| Vault CI JWT role (Gitea OIDC) | `gitea_cicd_<app>` | `gitea_cicd_erp` |
| Vault KV config path | `kvv2/<app>/config` | `kvv2/erp/config` |
| Kubernetes namespace | `<app>` | `erp` |
| Kubernetes ServiceAccount | `<app>` | `erp` |
| ArgoCD Application | `<app>` | `erp` |
| OpenTofu state prefix (GCS) | `<app>/main` | `erp/main` |
| Internal DNS | `<app>.arcodange.lab` | `erp.arcodange.lab` |
| Public DNS | `<app>.arcodange.fr` | `erp.arcodange.fr` |

> [!NOTE]
> The `_role` suffix (PG owner role) and the `-ops` suffix (Vault CI policy/identity group) are the only two *suffixes* ever appended to `<app>`, and `gitea_cicd_` is the only prefix. Every other identifier uses the bare slug, either alone or as a segment of a path or hostname. Note that the suffix style differs: PostgreSQL uses an underscore (`erp_role`) because hyphens are awkward in SQL identifiers, whereas Vault and Kubernetes use a hyphen (`erp-ops`).

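The mapping table can be read as a pure function of the slug. The sketch below restates it in Python; the function name and dictionary keys are invented for illustration, while the derived values follow the table row by row.

```python
def derived_names(app: str) -> dict[str, str]:
    """Return every identifier the mapping table derives from the <app> slug."""
    return {
        "gitea_repo": f"arcodange-org/{app}",
        "pg_database": app,
        "pg_owner_role": f"{app}_role",           # underscore suffix (SQL identifier)
        "vault_db_role": f"postgres/creds/{app}",
        "vault_k8s_auth_role": app,
        "vault_runtime_policy": app,
        "vault_ops_policy": f"{app}-ops",         # hyphen suffix
        "vault_ci_jwt_role": f"gitea_cicd_{app}", # the only prefix
        "vault_kv_config": f"kvv2/{app}/config",
        "k8s_namespace": app,
        "k8s_service_account": app,
        "argocd_application": app,
        "tofu_state_prefix": f"{app}/main",
        "internal_dns": f"{app}.arcodange.lab",
        "public_dns": f"{app}.arcodange.fr",
    }


if __name__ == "__main__":
    for system, name in derived_names("erp").items():
        print(f"{system:22} {name}")
```

Because the function takes no argument other than the slug, two bricks that call it with the same string necessarily agree on every name, and two bricks that disagree on a single character agree on none.
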
## Why uniformity is structuring

The platform is a set of loosely-coupled bricks (Gitea, Postgres, Vault, k3s/ArgoCD, OpenTofu, DNS). They were deliberately built **not** to hold explicit references to one another. Instead, each brick reconstructs the names it needs from `<app>` at the moment it runs:

```mermaid
%%{init: {'theme':'base'}}%%
flowchart LR
APP["<app><br/>(one kebab-case slug)"]:::src

APP --> GIT["Gitea repo<br/>arcodange-org/<app>"]:::brick
APP --> PG["PostgreSQL<br/>db <app> · role <app>_role"]:::brick
APP --> VAULT["Vault<br/>postgres/creds/<app><br/>policy <app> · gitea_cicd_<app>"]:::brick
APP --> K8S["Kubernetes<br/>namespace + SA <app>"]:::brick
APP --> ARGO["ArgoCD<br/>Application <app>"]:::brick
APP --> GCS["OpenTofu state<br/><app>/main"]:::brick
APP --> DNS["DNS<br/><app>.arcodange.lab / .fr"]:::brick

VAULT -.->|"GRANT <app>_role<br/>assumes PG role name"| PG
K8S -.->|"VaultDynamicSecret reads<br/>postgres/creds/<app>"| VAULT
ARGO -.->|"repoURL=.../<app><br/>namespace=<app>"| GIT

classDef src fill:#2563eb,stroke:#1e40af,color:#fff
classDef brick fill:#059669,stroke:#047857,color:#fff
```

1. The chosen slug `<app>` is the single input.
2. From it, each brick names its own resource: Gitea names the repo `arcodange-org/<app>`; Postgres names the database `<app>` and its owner role `<app>_role`; Vault names the dynamic-creds role `postgres/creds/<app>`, the runtime policy `<app>`, and the CI JWT role `gitea_cicd_<app>`; Kubernetes names the namespace and ServiceAccount `<app>`; ArgoCD names the Application `<app>`; OpenTofu writes state under `<app>/main`; DNS publishes `<app>.arcodange.lab` and `<app>.arcodange.fr`.
3. The dashed arrows are the cross-brick assumptions that make it work: the Vault `app_roles` module issues a dynamic PG user with `GRANT <app>_role TO …`, **assuming** the Postgres owner role is named exactly `<app>_role`; the chart's `VaultDynamicSecret` reads `postgres/creds/<app>`, **assuming** the Vault role is named exactly `<app>`; the ArgoCD Application derives `repoURL=.../<app>` and `namespace=<app>` from the slug alone, **assuming** the Gitea repo and the namespace match.
4. None of these links is configured by hand. They hold purely because every brick was given the same `<app>` to reconstruct from.

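The cross-brick assumptions in step 3 amount to plain string equalities, which makes them easy to check after the fact. The sketch below is a hypothetical checker: the `observed` dictionary and its keys are invented for the example, and in practice the values would have to be collected from Postgres, Vault, Gitea, and the cluster.

```python
def expected_links(app: str) -> dict[str, str]:
    """Names that one brick assumes another brick used (the dashed arrows)."""
    return {
        "pg_owner_role": f"{app}_role",            # Vault's GRANT assumes this role
        "vault_db_role": f"postgres/creds/{app}",  # VaultDynamicSecret reads this path
        "gitea_repo": f"arcodange-org/{app}",      # ArgoCD's repoURL assumes this repo
        "k8s_namespace": app,                      # ArgoCD's destination namespace
    }


def broken_links(app: str, observed: dict[str, str]) -> dict[str, tuple[str, str]]:
    """Return {link: (expected, observed)} for every name that diverges."""
    return {
        key: (want, observed.get(key, "<missing>"))
        for key, want in expected_links(app).items()
        if observed.get(key) != want
    }


if __name__ == "__main__":
    observed = {
        "pg_owner_role": "erp-role",  # hyphen instead of underscore
        "vault_db_role": "postgres/creds/erp",
        "gitea_repo": "arcodange-org/erp",
        "k8s_namespace": "erp",
    }
    print(broken_links("erp", observed))
```

A check like this is not part of the platform as described; it only shows that the wiring is verifiable even though nothing verifies it today.
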
## The failure mode of a typo

Because the wiring is by name and not by explicit reference, **nothing validates the join key end-to-end**. A single divergence (`my_app` vs `my-app`, a stray capital such as `MyApp`, an accidental plural such as `erps`) does not raise an error at creation time. The mismatched brick simply builds a resource under a name no one else looks for:

- A Postgres owner role created as `erp-role` (hyphen) instead of `erp_role` → Vault's `GRANT erp_role` fails or grants nothing → the pod gets a DB user with no privileges.
- A Gitea repo named `erp-app` instead of `erp` → ArgoCD's derived `repoURL=.../erp` 404s → the Application never syncs.
- A namespace typo → the `VaultDynamicSecret` and ServiceAccount land in the wrong place → silent auth failure at pod start.

The symptom is always the same: a brick that *looks* provisioned but never connects, with no single component to blame. This is why the slug must be **short, stable, and correct from the first step**: there is no safety net downstream.

✅ Choose a short, stable, lowercase kebab-case name up front and reuse it character-for-character.
❌ Never introduce variants (case, separators, plurals); nothing will warn you.

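Part of this failure mode can be caught before anything is provisioned by checking the slug's shape. The sketch below is one possible validator, not something the platform ships: the exact grammar (a leading letter, digits allowed, single hyphens between segments) and the 63-character cap, borrowed from the DNS-label limit that Kubernetes namespace names obey, are assumptions.

```python
import re

# Assumed grammar for a lowercase kebab-case slug: letter first,
# then letters/digits, with single hyphens between segments.
SLUG_RE = re.compile(r"[a-z][a-z0-9]*(?:-[a-z0-9]+)*")
MAX_LEN = 63  # DNS label limit, which Kubernetes namespace names must respect


def is_valid_slug(name: str) -> bool:
    """True if `name` is shaped like a lowercase kebab-case <app> slug."""
    return len(name) <= MAX_LEN and SLUG_RE.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["erp", "url-shortener", "my_app", "MyApp", "erp-", "erps"]:
        print(f"{candidate!r:18} {is_valid_slug(candidate)}")
```

A shape check rejects `my_app` and `MyApp` but accepts `erps`, because an accidental plural is still a well-formed slug. Only a comparison against the name the other bricks already use can catch that class of mistake.
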
## Why this makes a sandbox safe

The `<app>` convention is also the reason a **production-like sandbox can reuse the exact same names** without colliding with production. Because every brick derives its resource names from `<app>` and from nothing else, an entire parallel universe of the platform (its own Vault, its own Postgres instance, its own k3s namespace scope) can host an `erp` named identically to the production `erp`, provided the two universes never share a backing store. Identity comes from the *environment boundary*, not from the name; the name is free to repeat. This is what lets QA and recovery drills run against `erp`, `webapp`, etc. with realistic identifiers instead of mangled `erp-staging`-style aliases that would themselves break the name-wiring. See the PRD's [isolation boundary](../../PRD/safe-prod-like-environment/isolation-boundary.md) for how that environment fence is drawn.

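The condition "provided the two universes never share a backing store" can be made concrete with a toy model. In the sketch below, `BackingStore` is a hypothetical stand-in for one Vault, one Postgres instance, or one cluster; it is not a real API, only an illustration of where a name collision can and cannot happen.

```python
class BackingStore:
    """Toy stand-in for one Vault, Postgres instance, or cluster."""

    def __init__(self, label: str) -> None:
        self.label = label
        self._names: set[str] = set()

    def create(self, name: str) -> None:
        """Register a resource name; names are unique only within this store."""
        if name in self._names:
            raise ValueError(f"{name!r} already exists in {self.label}")
        self._names.add(name)


if __name__ == "__main__":
    prod_pg = BackingStore("prod-postgres")
    sandbox_pg = BackingStore("sandbox-postgres")

    prod_pg.create("erp")
    sandbox_pg.create("erp")  # same name, separate store: no collision

    try:
        prod_pg.create("erp")  # same name, same store: collision
    except ValueError as exc:
        print(exc)
```

Uniqueness is scoped to the store, so the name alone never decides whether two resources clash; the environment boundary does.
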
## See also

- [doc/runbooks/new-web-app/conventions.md](../../../doc/runbooks/new-web-app/conventions.md): the authoritative French source, with per-step references into the 8-step "new web app" runbook.
- [Secrets & Vault](secrets-and-vault.md): how `gitea_cicd_<app>` and the `<app>` / `<app>-ops` policies fit the auth model.
- [Factory brick](01-factory.md): where the ArgoCD app-of-apps, the Postgres OpenTofu, and the IaC live.
- [PRD: isolation boundary](../../PRD/safe-prod-like-environment/isolation-boundary.md): why identical names are safe across environments.
- [ADR 0001: Safe, production-like environment](../../ADR/0001-safe-prod-like-environment.md).

110
vibe/guidebooks/lab-ecosystem/secrets-and-vault.md
Normal file
@@ -0,0 +1,110 @@
[vibe](../../README.md) > [Guidebooks](../README.md) > [Lab ecosystem](README.md) > **Secrets & Vault**

# Secrets & Vault

> **Status**: 🟢 Active
> **Last Updated**: 2026-06-23
> **Related**: [Lab ecosystem](README.md) · [Tools brick](02-tools.md) · [Storage & recovery](storage-and-recovery.md) · [Naming conventions](naming-conventions.md)
> **Decision**: [ADR 0001: Safe, production-like environment](../../ADR/0001-safe-prod-like-environment.md)

## TL;DR

**HashiCorp Vault is the single source of truth for every secret in the lab.** There is no sops, no age, no secret files in git: if a credential exists, Vault either stores it or mints it on demand. Two parties consume secrets, and each authenticates a different way: **pods** use the Kubernetes auth backend (via the Vault Secrets Operator), and **CI / OpenTofu** use Gitea OIDC JWT (one role `gitea_cicd_<app>` per app). Vault holds static config in KV and encryption keys in the Transit engine, and it issues **short-lived, dynamic** PostgreSQL credentials so no long-lived DB password is ever written down. The trade-off: Vault is sealed on every restart and must be **manually unsealed** (1 key, threshold 1) before anything that needs a secret can come back.

## Why Vault, and only Vault

The lab made a deliberate choice: **one** secret store, accessed over the network, rather than encrypted secret files scattered through the repos. The consequences shape the whole platform:

- **No secret material in git.** Charts and OpenTofu reference Vault *paths*, never values. A leaked repo leaks no credentials.
- **One revocation point.** Rotating or revoking a credential happens in Vault; consumers pick up the change on their next read or lease renewal.
- **Dynamic over static.** Where a backend supports it (Postgres), Vault issues a fresh, time-boxed credential per consumer instead of a shared static password.

Vault itself runs as the `hashicorp-vault` chart in the **tools** namespace. Its full configuration (engines, auth backends, policies, the per-app role/policy modules) lives in the tools repo; see the [Tools brick](02-tools.md) for the deployment context.

## What Vault mounts

| Mount | Type | Purpose |
| --- | --- | --- |
| `kvv2/` | KV v2 (versioned) | Application static config, e.g. `kvv2/<app>/config`. Versioned so a bad write can be rolled back. |
| KV v1 | KV v1 (unversioned) | Flat secrets that don't need history. |
| `transit/` | Transit | Encryption-as-a-service: encrypt/decrypt and sign without exposing the key. |
| `postgres/` | Database (dynamic) | Issues **short-lived** PostgreSQL credentials on demand: `postgres/creds/<app>` hands out a fresh login user, granted `<app>_role`, with a lease that expires. |

The `<app>` slug threads through the per-app paths (`kvv2/<app>/config`, `postgres/creds/<app>`) exactly as described in [Naming conventions](naming-conventions.md).

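One practical detail when turning these paths into HTTP calls: for a KV v2 mount, Vault's HTTP API inserts a `data/` segment after the mount name, while the database engine serves credentials directly under `creds/`. The sketch below builds both URL paths, assuming the paths in the table are the logical (CLI-style) ones and that the mounts are named `kvv2` and `postgres` as shown; the helper names are invented.

```python
def kv2_api_path(mount: str, secret_path: str) -> str:
    """HTTP API path for reading or writing a KV v2 secret (note the data/ segment)."""
    return f"/v1/{mount.strip('/')}/data/{secret_path.strip('/')}"


def app_config_api_path(app: str) -> str:
    """API path for the app's static config stored at kvv2/<app>/config."""
    return kv2_api_path("kvv2", f"{app}/config")


def db_creds_api_path(app: str) -> str:
    """API path that issues a dynamic PostgreSQL credential for the app."""
    return f"/v1/postgres/creds/{app}"


if __name__ == "__main__":
    print(app_config_api_path("erp"))
    print(db_creds_api_path("erp"))
```

The distinction matters mostly when writing policies or raw API calls by hand; the `vault kv` CLI and the Vault Secrets Operator add the `data/` segment themselves.
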
## The two auth backends

Vault doesn't trust callers by static token. Each class of consumer proves its identity through a backend matched to where it runs:

- **Kubernetes auth**, for **pods**. The Vault Secrets Operator (VSO) and workloads present their Kubernetes ServiceAccount token; Vault validates it against the cluster's API and maps the SA to the Vault role `<app>`, which carries the runtime policy `<app>`.
- **Gitea OIDC / JWT auth**, for **CI and OpenTofu**. A Gitea Actions workflow obtains an OIDC token; Vault validates it and maps it to the JWT role `gitea_cicd_<app>`, which carries the CI/ops policy `<app>-ops`. This is how `tofu apply` in CI reads and writes the secrets it manages without any pre-shared Vault token.

The split matters: pods get only what they need at runtime (the `<app>` policy), while CI gets the broader provisioning rights (`<app>-ops`) needed to *create* the very secrets the pods will later read.

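Both backends share the same login shape in Vault's HTTP API: a `POST` to `auth/<mount>/login` carrying a role name and a JWT. The sketch below only builds the request (it sends nothing) and pairs each consumer with the policy it is expected to receive. The mount names `kubernetes` and `jwt` are Vault's defaults and are an assumption here, since this page does not state where the lab mounts them.

```python
def login_request(consumer: str, app: str, token: str) -> tuple[str, dict[str, str]]:
    """Return (API path, JSON body) for a Vault login by a pod or by CI."""
    if consumer == "pod":
        mount, role = "kubernetes", app            # ServiceAccount token, role <app>
    elif consumer == "ci":
        mount, role = "jwt", f"gitea_cicd_{app}"   # Gitea OIDC token, per-app JWT role
    else:
        raise ValueError(f"unknown consumer: {consumer!r}")
    return f"/v1/auth/{mount}/login", {"role": role, "jwt": token}


def expected_policy(consumer: str, app: str) -> str:
    """Policy each consumer should end up with, per the split described above."""
    if consumer not in ("pod", "ci"):
        raise ValueError(f"unknown consumer: {consumer!r}")
    return app if consumer == "pod" else f"{app}-ops"


if __name__ == "__main__":
    print(login_request("pod", "erp", "<sa-token>"), expected_policy("pod", "erp"))
    print(login_request("ci", "erp", "<oidc-token>"), expected_policy("ci", "erp"))
```

The two paths differ only in mount and role name, and both are derived from `<app>`, so no per-app login configuration has to be written by hand.
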
## How VSO delivers secrets to pods

Inside the cluster, the **Vault Secrets Operator** is the bridge between Vault and Kubernetes. It watches two kinds of CRD:

- **`VaultAuth`** declares *how* to authenticate to Vault (the Kubernetes auth mount + the `<app>` role).
- **`VaultDynamicSecret`** (and `VaultStaticSecret`) declares *what* to fetch (e.g. `postgres/creds/<app>`) and which Kubernetes Secret to materialise it into. For dynamic secrets, VSO also **renews the lease** and rotates the Secret before it expires.

The pod then mounts the resulting Kubernetes Secret as it would any other; it never speaks to Vault directly, and never sees a static DB password.

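Renewing "before it expires" usually means acting once a fixed fraction of the lease has elapsed. The sketch below shows that arithmetic only; the 0.67 default is an illustrative assumption, not a value taken from this lab's configuration, and the operator's actual setting lives on the `VaultDynamicSecret` resource.

```python
from datetime import datetime, timedelta, timezone


def next_renewal(issued_at: datetime, lease_seconds: int,
                 renew_fraction: float = 0.67) -> datetime:
    """Time at which a lease should be renewed: a fixed fraction into its lifetime."""
    if lease_seconds <= 0:
        raise ValueError("lease_seconds must be positive")
    if not 0 < renew_fraction < 1:
        raise ValueError("renew_fraction must be strictly between 0 and 1")
    return issued_at + timedelta(seconds=lease_seconds * renew_fraction)


if __name__ == "__main__":
    issued = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
    # A one-hour lease renewed halfway through its lifetime.
    print(next_renewal(issued, lease_seconds=3600, renew_fraction=0.5))
```

Renewing well ahead of expiry leaves room for a retry if Vault is briefly unavailable, which is why the fraction sits below 1 rather than at the deadline.
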
## The secret flow, end to end

```mermaid
%%{init: {'theme':'base'}}%%
flowchart LR
subgraph CI["CI / Provisioning path"]
GHA["Gitea Actions<br/>workflow"]:::src
TOFU["OpenTofu<br/>tofu apply"]:::proc
end

subgraph RT["Runtime path (in-cluster)"]
VSO["Vault Secrets<br/>Operator (VSO)"]:::proc
POD["App pod<br/>(ServiceAccount <app>)"]:::proc
end

VAULT["Vault<br/>KV v1/v2 · transit · postgres dynamic"]:::store

GHA -->|"OIDC JWT<br/>role gitea_cicd_<app>"| VAULT
VAULT -->|"policy <app>-ops<br/>read/write secrets"| TOFU
TOFU -->|"writes config to<br/>kvv2/<app>/config"| VAULT

VSO -->|"k8s auth<br/>role <app> (SA token)"| VAULT
VAULT -->|"dynamic creds<br/>postgres/creds/<app>"| VSO
VSO -->|"materialises +<br/>renews K8s Secret"| POD

classDef src fill:#2563eb,stroke:#1e40af,color:#fff
classDef proc fill:#059669,stroke:#047857,color:#fff
classDef store fill:#7c3aed,stroke:#6d28d9,color:#fff
```

1. **CI path:** a Gitea Actions workflow requests an OIDC JWT and presents it to Vault under the role `gitea_cicd_<app>`. Vault validates the token and grants the `<app>-ops` policy.
2. With that policy, OpenTofu (`tofu apply`, running in CI) reads the secrets it needs and writes the app's static config back to `kvv2/<app>/config`. No pre-shared Vault token is ever stored; the trust is established per-run via OIDC.
3. **Runtime path:** in the cluster, the Vault Secrets Operator authenticates with the Kubernetes auth backend, presenting the app's ServiceAccount token mapped to the Vault role `<app>`.
4. Vault issues a **short-lived, dynamic** PostgreSQL credential from `postgres/creds/<app>` back to VSO.
5. VSO materialises that credential into a Kubernetes Secret in the app's namespace, then **renews the lease** and rotates the Secret before it expires.
6. The app pod mounts the Kubernetes Secret like any other; it never talks to Vault, and never holds a long-lived database password.

## The unseal model

Vault encrypts its storage with a master key that is **never persisted in usable form**. On every start (a fresh deploy, a pod reschedule, or a full cluster recovery) Vault comes up **sealed** and refuses every request until it is unsealed.

- **Shamir config:** 1 unseal key, threshold 1 (a single-operator lab, so no key-splitting ceremony).
- **Where the key lives:** on the control node (the MacBook), at `~/.arcodange/cluster-keys.json`. It is *not* in git, *not* in Kubernetes, *not* in Vault.
- **Operational consequence:** **nothing that needs a secret recovers until a human unseals Vault.** This is the chokepoint baked into the recovery order: VSO cannot re-auth, dynamic DB creds cannot be issued, and dependent apps cannot start, until the unseal happens. See [Storage & recovery](storage-and-recovery.md) for where unseal sits in the tested startup sequence.

> [!CAUTION]
> If `~/.arcodange/cluster-keys.json` is lost, Vault's data is **unrecoverable**: there is no second copy of the unseal key and no key-recovery path. Treat that file as the most critical secret in the lab.

## Sandbox implications

A production-like sandbox does **not** share the production Vault. It runs its **own** Vault instance with its **own** unseal key and its **own** policies, so that exercising secret flows, rotating credentials, or testing a broken unseal cannot touch production secrets. Because the `<app>` join key is environment-relative (see [Naming conventions](naming-conventions.md)), the sandbox can keep identical role and policy names (`gitea_cicd_<app>`, `<app>`, `<app>-ops`) while remaining fully isolated. The rationale for that separate-Vault, separate-unseal posture is recorded in [ADR 0001: Safe, production-like environment](../../ADR/0001-safe-prod-like-environment.md).

## See also

- [Tools brick](02-tools.md): where the `hashicorp-vault` chart, VSO, and the per-app Vault IaC modules are deployed.
- [Storage & recovery](storage-and-recovery.md): Vault unseal as a step in the tested power-cut recovery order.
- [Naming conventions](naming-conventions.md): how `gitea_cicd_<app>`, `<app>`, and `<app>-ops` derive from the join key.
- [ADR 0001: Safe, production-like environment](../../ADR/0001-safe-prod-like-environment.md): the sandbox's separate-Vault decision.

76
vibe/guidebooks/lab-ecosystem/storage-and-recovery.md
Normal file
@@ -0,0 +1,76 @@
[vibe](../../README.md) > [Guidebooks](../README.md) > [Lab ecosystem](README.md) > **Storage & recovery**

# Storage & recovery

> **Status**: 🟢 Active
> **Last Updated**: 2026-06-23
> **Related**: [Lab ecosystem](README.md) · [Secrets & Vault](secrets-and-vault.md) · [Factory brick](01-factory.md) · [PRD: QA strategy](../../PRD/safe-prod-like-environment/qa-strategy.md)
> **Decision**: [ADR 0001: Safe, production-like environment](../../ADR/0001-safe-prod-like-environment.md)
> **Upstream (incident ADR)**: [Longhorn PVC recovery](../../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md)

## TL;DR

The lab keeps state in **two** places, on purpose. **Longhorn** provides distributed block storage *inside* k3s for everything cluster-native (app PVCs, Traefik's `acme.json`, the backup volume itself). **PostgreSQL and Gitea** deliberately persist on **pi2's local disk, outside k3s**, as plain docker-compose; they are the platform's own foundations and must not depend on the cluster they help run. This split survives a full power cut, but Longhorn has one sharp edge: when its CRDs are wiped and recreated, it assigns **new engine IDs** and cannot automatically re-associate the surviving on-disk replica files with the new volumes. The 2026-04-13 power cut taught us a **fixed startup order** (Longhorn first, then Vault unseal, then VSO re-auth, with **ERP scaled up last**) that brings the cluster back deterministically. That order is now rehearsed as a drill.

## Two storage tiers, on purpose

| Tier | Backing | What lives there | Why here |
| --- | --- | --- | --- |
| **In-cluster** | **Longhorn** (distributed block storage inside k3s, replicated across pi1/pi2/pi3) | App PVCs, Traefik certificates (`acme.json`), the cluster backup volume (`backups-rwx`) | Cluster-native workloads get replicated, snapshot-able volumes that follow the pod. |
| **Outside the cluster** | **docker-compose on pi2's local disk** | PostgreSQL + Gitea | These are *foundations*: Gitea serves the GitOps source and Postgres backs the apps. They must survive, and start, **without** k3s being healthy, so they cannot live inside it. |

This separation is the reason the platform can bootstrap itself: Gitea and Postgres come up on pi2 independently, and only then does the cluster (which pulls its config from Gitea) have something to sync against. See the [Factory brick](01-factory.md) for how the Ansible playbooks and the ArgoCD app-of-apps consume those foundations.

## The Longhorn engine-ID re-association failure mode

Longhorn stores each replica's data on a node in a directory named **`<volume-name>-<engine-id>`**. The raw `volume-head-*.img` files are durable: they survive a power cut on the disk. The danger is in the *metadata*, not the data:

1. A power cut drops the Longhorn CSI driver.
2. Recovering the stuck pods forces a delete of Longhorn's CRDs (Volume / Engine / Replica), because a webhook circular dependency makes a clean shutdown impossible.
3. Reinstalling Longhorn recreates the Volume CRDs, but with **new engine IDs**.
4. Longhorn creates **new, empty** replica directories under the new engine IDs and **does not adopt** the old, data-bearing directories.

The result: the real data sits in an orphaned `…-<old-id>/` directory while Longhorn happily serves an empty `…-<new-id>/`. Worse, a naive directory rename can backfire: Longhorn reconciliation may find a `Dirty: true` orphan alongside a clean empty replica and **silently rebuild from the empty one, destroying the data**. The proven safe path is the automated **block-device injection** (Method D): create a fresh volume, attach it in maintenance mode, and `rsync` the recovered, layer-merged image into the live device, never renaming the orphaned directories. The full method comparison, the `playbooks/recover/longhorn_data.yml` automation, and the prevention work (the backup playbook now captures Longhorn Volume CRDs for fast `kubectl apply` restore) are documented in the [Longhorn PVC recovery ADR](../../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md).

> [!CAUTION]
> Do **not** recover a Longhorn volume by renaming the orphaned replica directory to the new engine ID. Reconciliation can pick the empty replica as the rebuild source and overwrite your data. Use the block-device injection playbook instead.

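Spotting the orphaned directories is a read-only comparison between what is on disk and what the recreated CRDs reference. The sketch below assumes the `<volume-name>-<engine-id>` layout described above and additionally assumes the engine ID itself contains no hyphen; the directory names and the `current_engine` mapping are hypothetical inputs. It only lists candidates and never renames or moves anything, in line with the caution above.

```python
def split_replica_dir(dirname: str) -> tuple[str, str]:
    """Split '<volume-name>-<engine-id>' on its last hyphen."""
    volume, _, engine_id = dirname.rpartition("-")
    if not volume or not engine_id:
        raise ValueError(f"not a replica directory name: {dirname!r}")
    return volume, engine_id


def orphaned_dirs(dirnames: list[str], current_engine: dict[str, str]) -> list[str]:
    """Directories whose engine ID is not the one the current Volume CRD uses."""
    orphans = []
    for dirname in dirnames:
        volume, engine_id = split_replica_dir(dirname)
        if current_engine.get(volume) != engine_id:
            orphans.append(dirname)
    return orphans


if __name__ == "__main__":
    # Hypothetical names: one old data-bearing directory, one new empty one.
    on_disk = ["pvc-erp-data-a1b2c3d4", "pvc-erp-data-e5f6a7b8"]
    current = {"pvc-erp-data": "e5f6a7b8"}
    print(orphaned_dirs(on_disk, current))
```

The output is a list of directories to inspect and feed into the recovery playbook, not a list of directories to rename.
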
## The tested 2026-04-13 power-cut recovery

The April 13, 2026 power cut was recovered end to end and the sequence was distilled into a deterministic startup order. The order is not arbitrary; each step is a **dependency gate** for the next:

```mermaid
%%{init: {'theme':'base'}}%%
flowchart TD
PC["Power cut<br/>(cluster down, disks intact)"]:::dead

PC --> LH["1 · Restore Longhorn<br/>volumes (block-device<br/>injection if engine IDs changed)"]:::store
LH --> VU["2 · Unseal Vault<br/>(1 key, threshold 1,<br/>key on the Mac)"]:::proc
VU --> VSO["3 · VSO re-auth<br/>(k8s auth → fresh<br/>dynamic creds)"]:::proc
VSO --> ERP["4 · Scale up ERP<br/>last (depends on DB +<br/>injected secrets)"]:::src

classDef dead fill:#6b7280,stroke:#4b5563,color:#fff
classDef store fill:#7c3aed,stroke:#6d28d9,color:#fff
classDef proc fill:#059669,stroke:#047857,color:#fff
classDef src fill:#2563eb,stroke:#1e40af,color:#fff
```

1. **Restore Longhorn first.** Persistent volumes must be back and attachable before any stateful workload starts. If the engine IDs changed (the failure mode above), recover the data with the block-device injection playbook before proceeding. Nothing that mounts a PVC can come up until this is done.
2. **Unseal Vault.** Vault restarts **sealed** and serves nothing until a human unseals it with the single key from `~/.arcodange/cluster-keys.json` (threshold 1). This is the secret-flow chokepoint; see [Secrets & Vault](secrets-and-vault.md). No secret consumer recovers before this step.
3. **VSO re-authenticates.** Once Vault is unsealed, the Vault Secrets Operator re-auths over the Kubernetes auth backend and re-issues the dynamic credentials (notably fresh `postgres/creds/<app>` leases) that workloads need. Until VSO has re-populated the Kubernetes Secrets, apps would start with stale or missing credentials.
4. **Scale up ERP last.** ERP is the most dependency-heavy app: it needs both the database (on pi2) reachable and its Vault-injected secrets present. Bringing it up only after steps 1–3 are confirmed avoids a crash-loop against a half-recovered platform.

The source record for this drill (Longhorn restore, Vault unseal, VSO re-auth, ERP scaled up last, plus the 1-key/threshold-1 unseal detail) is CLUSTER_RECOVERY.md, kept at the lab root, outside this repo.

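Treating each step as a dependency gate means the order can be derived rather than memorised. The sketch below encodes the four steps as a small dependency graph and sorts it with the standard library's `graphlib`; the step identifiers are labels invented for the example, and the extra ERP-to-Longhorn edge reflects the statement that ERP also needs its volumes.

```python
from graphlib import TopologicalSorter

# step -> set of steps that must be complete first
DEPENDS_ON: dict[str, set[str]] = {
    "longhorn_restore": set(),
    "vault_unseal": {"longhorn_restore"},
    "vso_reauth": {"vault_unseal"},
    "erp_scale_up": {"vso_reauth", "longhorn_restore"},
}


def startup_order(graph: dict[str, set[str]]) -> list[str]:
    """Return the steps in an order that respects every dependency gate."""
    return list(TopologicalSorter(graph).static_order())


def ready_steps(graph: dict[str, set[str]], done: set[str]) -> list[str]:
    """Steps not yet done whose prerequisites are all complete."""
    return [s for s, deps in graph.items() if s not in done and deps <= done]


if __name__ == "__main__":
    print(startup_order(DEPENDS_ON))
    print(ready_steps(DEPENDS_ON, done={"longhorn_restore"}))
```

Because the graph is a single chain, only one valid order exists, which is what makes the recovery deterministic; `ready_steps` shows the gate view, where exactly one step is unlocked at a time.
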
## Why this is rehearsed in the sandbox

A recovery procedure that has only been run once, under the stress of a real outage, is a liability. The production-like sandbox exists partly so this exact sequence can be **rehearsed deliberately** (kill the cluster, lose the engine IDs on a test volume, force a sealed Vault, and walk the four-step order back to green) without risking production data or a live ERP. That makes the drill a routine QA exercise rather than a one-shot incident memory. The QA approach for these drills is laid out in the PRD's [QA strategy](../../PRD/safe-prod-like-environment/qa-strategy.md), and the overall decision to maintain such an environment is [ADR 0001](../../ADR/0001-safe-prod-like-environment.md).

## See also

- [Longhorn PVC recovery ADR](../../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md): engine-ID failure mode, the five recovery methods, and the block-device injection automation.
- [Secrets & Vault](secrets-and-vault.md): the unseal model and why it gates step 2 of the recovery order.
- [Factory brick](01-factory.md): the Ansible recover/ playbooks, the ArgoCD app-of-apps, and the Postgres-on-pi2 foundation.
- [PRD: QA strategy](../../PRD/safe-prod-like-environment/qa-strategy.md): how recovery drills become routine QA.
- [ADR 0001: Safe, production-like environment](../../ADR/0001-safe-prod-like-environment.md).
- CLUSTER_RECOVERY.md: the tested power-cut recovery record (lab root, outside this repo).
