docs(vibe): add tools/ and cms/ guidebooks
Two code-grounded tree-docs guidebooks under vibe/guidebooks/, drilling into the lab-ecosystem 02-tools and 03-cms pages (bidirectional):

- tools/ : hub + components.md (Vault+VSO, Prometheus, Grafana, CrowdSec, pgbouncer, Redis/KeyDB, Plausible, ClickHouse; pgcat/tool as Tier-2) + secrets-and-vso.md (Vault engines/auth, the app_roles/app_policy modules = the <app> join-key machinery, VSO CRDs, secret-paths inventory).
- cms/ : hub + site.md (Nuxt + dual Pages/k3s deploy) + cloudflare.md (zone via OVH->CF, Pages, cloudflared tunnel, Turnstile, R2 state) + zoho-email.md (OAuth, MX/SPF/DKIM/DMARC/BIMI, the 7 aliases).

Sibling-repo code linked via full gitea URLs; vibe-internal links bidirectional. Reconciled the cloudflared tunnel token path to kvv2 cms/cloudflared (the chart VaultStaticSecret is kv-v2; the kvv1 tofu reference is a commented-out stub).

6 mermaid diagrams MCP-validated; zero dead links. Lab Cartographer cohort.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
114
vibe/guidebooks/tools/README.md
Normal file
@@ -0,0 +1,114 @@
[vibe](../../README.md) > [Guidebooks](../README.md) > **Tools**

# Tools

> **Status:** ✅ Active
> **Last Updated:** 2026-06-23
> **Upstream:** [Guidebooks index](../README.md) · [lab-ecosystem 02 · tools](../lab-ecosystem/02-tools.md)
> **Downstream:** [Components](components.md) · [Secrets & VSO](secrets-and-vso.md)
> **Related:** [secrets-and-vault concept](../lab-ecosystem/secrets-and-vault.md) · [tofu CI apply flow](../factory-provisioning/opentofu/ci-apply-flow.md) · [safe-env ADR](../../ADR/0001-safe-prod-like-environment.md)

The [`tools` repo](https://gitea.arcodange.lab/arcodange-org/tools) is the lab's **platform layer**: the cluster-wide services every app namespace leans on. These are secrets (Vault + VSO), observability (Prometheus + Grafana), edge security (CrowdSec), database pooling (pgbouncer), caching (Redis/KeyDB), and analytics (Plausible + ClickHouse). Everything in this repo lands in the single **`tools` namespace**.

This hub explains the **deploy model** (how one factory-owned ArgoCD Application fans out into one Application per component) and gives a **component inventory**. For per-component internals see [Components](components.md); for how secrets reach the pods see [Secrets & VSO](secrets-and-vso.md).

## Deploy model

The whole repo is wired into the cluster through a single **meta-chart** that factory's ArgoCD points at:

1. Factory's ArgoCD declares **one** Application named `tools` whose source is this repo's [`chart/`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/chart) meta-chart.
2. That meta-chart renders two kinds of object from [`chart/values.yaml`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/chart/values.yaml):
   - an **AppProject** named `tools` ([`chart/templates/project.yaml`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/chart/templates/project.yaml)) that restricts every child Application to the `tools` repo (`sourceRepos`) and the `tools` namespace (`destinations`);
   - one ArgoCD **Application per component** ([`chart/templates/apps.yaml`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/chart/templates/apps.yaml), a `range` over `.Values.tools`), each pointing `path:` at the matching **top-level directory** of the repo (`path: pgbouncer`, `path: grafana`, …).
3. Each child Application targets `namespace: tools`, with `automated` sync (`prune: true`, `selfHeal: true`) and `CreateNamespace=true`.
4. A component directory is **either** a Helm chart (`Chart.yaml` whose `dependencies:` pull the upstream chart + the `tool` library) **or** a Kustomize overlay (`kustomization.yaml` using a `helmCharts:` inflation generator).
5. [`tool/`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/tool) is a Helm **library chart** (`type: library`): it ships shared templates/helpers consumed by the component charts via `dependencies:` and is **not deployable** on its own.
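
The fan-out in steps 2 and 3 can be modelled in a few lines of Python. This is only an illustrative sketch, not the real `apps.yaml` template: the function name `render_applications`, the dict layout, and the example values are assumptions made for the illustration; only the `tools:` key, the `path:` convention, the `tools` project/namespace, and the sync options come from the description above.

```python
def render_applications(values: dict, repo_url: str) -> list:
    """Model of the meta-chart loop: one Application per key under values["tools"]."""
    apps = []
    # Keys are visited in sorted order here purely to keep the output stable.
    for name in sorted(values.get("tools") or {}):
        apps.append({
            "kind": "Application",
            "metadata": {"name": name},
            "spec": {
                "project": "tools",
                # path: points at the top-level directory named after the component
                "source": {"repoURL": repo_url, "path": name},
                "destination": {"namespace": "tools"},
                "syncPolicy": {
                    "automated": {"prune": True, "selfHeal": True},
                    "syncOptions": ["CreateNamespace=true"],
                },
            },
        })
    return apps


# Hypothetical values: a component that is not keyed under tools: is never rendered.
example_values = {"tools": {"pgbouncer": {}, "grafana": {}}}
apps = render_applications(
    example_values, "https://gitea.arcodange.lab/arcodange-org/tools"
)
print([a["spec"]["source"]["path"] for a in apps])  # prints ['grafana', 'pgbouncer']
```

The point of the model is that the set of deployed components is exactly the set of keys under `tools:`; nothing else in the repo layout decides what ArgoCD renders.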

> [!NOTE]
> A component is deployed **only if it appears as a key under `tools:` in [`chart/values.yaml`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/chart/values.yaml)**. `pgcat` is present in the repo but commented out there, so no Application is rendered for it.

## Component inventory

| Component | Declared via (Helm chart + version, or Kustomize) | Ingress host | Persistence | Purpose |
|---|---|---|---|---|
| **hashicorp-vault** | Helm: `hashicorp/vault` `0.28.1` (+ `tool` lib) | `vault.arcodange.lab` (Traefik, Let's Encrypt) | `storage "file"` at `/vault/data` + audit storage (PVC) | Secrets engine: KV, transit, PostgreSQL dynamic creds; auth `kubernetes` + Gitea OIDC/JWT |
| **vault-secrets-operator (VSO)** | Helm: `hashicorp/vault-secrets-operator` `0.9.0`, a dependency of the `hashicorp-vault` chart | n/a | n/a | Delivers Vault secrets to pods (as Kubernetes Secrets) via `VaultAuth` / `VaultDynamicSecret` CRDs; client-cache `direct-encrypted` via transit |
| **prometheus** | Helm: `prometheus-community/prometheus` `28.13.0` (app `v3.10.0`) | none (in-cluster) | `persistentVolume` enabled, `8Gi` | Metrics scraping + TSDB storage |
| **grafana** | Helm: `grafana/grafana` `10.3.0` (+ `tool` lib) | `grafana.arcodange.lab` (Traefik, Let's Encrypt) | `persistence.enabled: false` (ephemeral; dashboards provisioned) | Dashboards; datasources Prometheus + ClickHouse |
| **crowdsec** | Helm: `crowdsecurity/crowdsec` `0.20.1` (+ `tool` lib) | none (Traefik bouncer + AppSec on the edge) | LAPI state in external PostgreSQL (via pgbouncer) | Behavioural detection; agent parses Traefik logs, AppSec virtual-patching |
| **pgbouncer** | Helm: `icoretech/pgbouncer` `2.3.1` (+ `tool` lib) | none (cluster service `pgbouncer.tools`) | stateless (config only) | Connection pooler to the **external** PostgreSQL on `pi2` (`192.168.1.202`), pinned via `kubernetes.io/hostname: pi2` |
| **redis / KeyDB** | Helm: `pascaliske/redis` `2.1.0` (+ `tool` lib) | none (cluster service) | PVC `create: true`, `1Gi` at `/data` | In-memory cache; KeyDB master + replica, Redis-compatible |
| **plausible** | **Kustomize**: inflates `pascaliske/plausible` `2.0.0` | `analytics.arcodange.lab` (Traefik `IngressRoute`, Let's Encrypt) | stateless app; data lives in ClickHouse | Privacy-friendly web analytics; `DB_HOST: pgbouncer.tools` |
| **clickhouse** | **Kustomize**: inflates `pascaliske/clickhouse` `0.4.0` + local `databases` chart | none (cluster service) | PVC `16Gi` (StatefulSet) | OLAP column store backing Plausible |
| **pgcat** *(disabled)* | Helm: `improwised/pgcat` `0.1.0`, **commented out** in `chart/values.yaml` | n/a | n/a | Alternative pooler; not rendered (too constraining: must list every db/user, md5-only auth) |
| **tool** *(library)* | Helm **library chart** (`type: library`), not deployable | n/a | n/a | Shared templates/helpers consumed by the component charts |

## How tools fit together

```mermaid
%%{init: {'theme': 'base'}}%%
flowchart TB
    classDef ext fill:#7c3aed,stroke:#6d28d9,color:#fff
    classDef proc fill:#059669,stroke:#047857,color:#fff
    classDef edge fill:#d97706,stroke:#b45309,color:#fff
    classDef meta fill:#2563eb,stroke:#1e40af,color:#fff

    ARGOCD["factory ArgoCD<br>Application: tools"]:::meta
    META["tools meta-chart<br>chart/ (apps.yaml + project.yaml)"]:::meta
    PROJ["AppProject: tools"]:::meta

    subgraph NS["tools namespace"]
        VAULT[("hashicorp-vault<br>+ VSO")]:::ext
        PROM["prometheus"]:::proc
        GRAF["grafana"]:::proc
        CS["crowdsec<br>Traefik bouncer + AppSec"]:::edge
        PGB["pgbouncer"]:::proc
        REDIS[("redis / KeyDB")]:::ext
        PLA["plausible"]:::proc
        CH[("clickhouse")]:::ext
        PODS["app + tool pods"]:::proc
    end

    PG[("external PostgreSQL<br>pi2 · 192.168.1.202")]:::ext
    TRAEFIK["Traefik ingress<br>vault / grafana / analytics .arcodange.lab"]:::edge

    ARGOCD --> META
    META --> PROJ
    META -- "one Application per component" --> NS
    VAULT -- "inject secrets (VSO)" --> PODS
    PGB -- "pools to" --> PG
    PLA -- "writes analytics" --> CH
    PROM --> GRAF
    CH --> GRAF
    TRAEFIK --> VAULT
    TRAEFIK --> GRAF
    TRAEFIK --> PLA
    CS -- "fronts the edge" --> TRAEFIK
```

1. **Factory's ArgoCD** owns a single Application named `tools` pointed at this repo's `chart/` meta-chart.
2. The **meta-chart** renders the `tools` **AppProject** (which scopes every child to the `tools` repo + `tools` namespace) and **one Application per component** listed under `tools:` in `chart/values.yaml`.
3. Every child Application deploys into the **`tools` namespace**: Vault+VSO, Prometheus, Grafana, CrowdSec, pgbouncer, Redis/KeyDB, Plausible, ClickHouse.
4. **Vault + VSO** deliver secrets to app and tool pods (as Kubernetes Secrets) via the `VaultAuth` / `VaultDynamicSecret` CRDs.
5. **pgbouncer** pools connections out to the **external PostgreSQL** on `pi2` (`192.168.1.202`), the same database CrowdSec's LAPI and Plausible use through it.
6. **Plausible** writes analytics into **ClickHouse**; both **Prometheus** and **ClickHouse** are wired as **Grafana** datasources.
7. **Traefik** publishes `vault.arcodange.lab`, `grafana.arcodange.lab`, and `analytics.arcodange.lab` over Let's Encrypt, with **CrowdSec** running as the bouncer/AppSec layer fronting that edge.

## Pages in this guidebook

| Page | What it covers | Status |
|---|---|---|
| [Components](components.md) | Per-component internals: chart values, ingress, persistence, how each gets its secrets | ✅ Active |
| [Secrets & VSO](secrets-and-vso.md) | How Vault + the Vault Secrets Operator deliver static and dynamic secrets into `tools` pods | ✅ Active |

## Maintenance rule

> [!IMPORTANT]
> **If a component in the `tools` repo changes, update this guidebook in the same change.** Adding or removing a key under `tools:` in `chart/values.yaml`, bumping an upstream chart version, switching a component between Helm and Kustomize, or changing an ingress host or persistence size all alter the inventory above, so keep the table and the diagram in sync as part of the same PR. A reference map that drifts from reality sends readers (and agents) confidently down dead paths.

## Cross-references

- [lab-ecosystem 02 · tools](../lab-ecosystem/02-tools.md): the parent whole-lab view of this namespace.
- [secrets-and-vault concept](../lab-ecosystem/secrets-and-vault.md): the lab-wide Vault model these services depend on.
- [tofu CI apply flow](../factory-provisioning/opentofu/ci-apply-flow.md): how each component's `iac/` (Vault config) is applied.
- [safe-env ADR](../../ADR/0001-safe-prod-like-environment.md): why a safe, prod-like environment shapes how these platform services are run.
218
vibe/guidebooks/tools/components.md
Normal file
@@ -0,0 +1,218 @@
[vibe](../../README.md) > [Guidebooks](../README.md) > [Tools](README.md) > **Components**

# Components

> **Status:** ✅ Active
> **Last Updated:** 2026-06-23
> **Upstream:** [Tools hub](README.md) · [lab-ecosystem 02 · tools](../lab-ecosystem/02-tools.md)
> **Downstream:** [Secrets & VSO](secrets-and-vso.md)
> **Related:** [storage & recovery concept](../lab-ecosystem/storage-and-recovery.md) · [secrets-and-vault concept](../lab-ecosystem/secrets-and-vault.md) · [naming conventions](../lab-ecosystem/naming-conventions.md)

This is the **per-component reference** for the `tools` platform layer: pinned chart/app versions, the values that actually matter (replicas, storage, ports, auth), and the cross-service wiring. Every component lands in the single **`tools` namespace**. For the deploy model (how one ArgoCD Application fans out into one per component) see the [Tools hub](README.md); for how Vault secrets reach the pods see [Secrets & VSO](secrets-and-vso.md).

Components split into two **tiers**:

- **Tier 1**: the load-bearing services, each with its own subsection and value tables below.
- **Tier 2**: supporting / inactive pieces, summarised in a single table.

Severity legend (GitHub alerts): `[!NOTE]` informational · `[!TIP]` good-to-know · `[!WARNING]` operational hazard · `[!CAUTION]` live risk.

---

## Tier 1 — load-bearing services

### hashicorp-vault

[`hashicorp-vault/`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/hashicorp-vault): the lab's secrets brain. The chart bundles **three** dependencies: the upstream `vault` server, the `vault-secrets-operator` (VSO) that delivers Vault secrets to pods as Kubernetes Secrets, and the shared `tool` library chart.

| Key | Value |
|---|---|
| Chart deps | `vault` `0.28.1`, `vault-secrets-operator` `0.9.0`, `tool` `0.1.0` |
| Mode | `standalone` (single instance, **not** HA / raft) |
| Storage | `storage "file"` at `/vault/data` + audit storage enabled |
| Listener | TLS **off** (`tls_disable = 1`) on `[::]:8200`; TLS is terminated at the edge |
| Ingress | `vault.arcodange.lab` (Traefik `websecure`, Let's Encrypt, `localIp@file` middleware) |
| UI | enabled (`ui = true`) |
| Log level | `trace` |

**Mounts (secret engines) exposed:**

| Mount | Type | Purpose |
|---|---|---|
| `kvv1` | KV v1 | Static secrets (legacy / v1 layout) |
| `kvv2` | KV v2 | Versioned static secrets (primary store) |
| `transit` | transit | Encryption-as-a-service; backs VSO client-cache (`vso-client-cache` key) |
| `postgres` | database | Dynamic PostgreSQL credentials (connection via `pgbouncer.tools:5432`) |

**Auth methods enabled:**

| Method | Used by |
|---|---|
| `kubernetes` | In-cluster workloads (VSO, app ServiceAccounts) authenticate by SA token |
| `gitea_jwt` | Gitea Actions / OIDC-JWT pipelines authenticate from CI |

> [!NOTE]
> The full secret-engine layout, VSO `VaultAuth` / `VaultConnection` / `VaultDynamicSecret` wiring, and the `kvv2/data/...` path conventions are documented in [Secrets & VSO](secrets-and-vso.md); this page only inventories what the chart stands up.

The VSO sub-chart ships a `defaultVaultConnection` pointing at `http://hashicorp-vault.tools.svc.cluster.local:8200` and a client cache with `persistenceModel: direct-encrypted`, encrypted through the `transit` mount.

### prometheus

[`prometheus/`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/prometheus): metrics collection and TSDB, via the community `prometheus-community/prometheus` chart.

| Key | Value |
|---|---|
| Chart deps | `prometheus` `28.13.0` (app `v3.10.0`), `tool` `0.1.0` |
| Server replicas | `1` (Deployment, `strategy: Recreate`) |
| Server storage | `persistentVolume` enabled, **8Gi** at `/data` (`ReadWriteOnce`) |
| Retention | `15d` |
| Alertmanager | enabled, persistence **2Gi** (`ReadWriteOnce`) |
| node-exporter | enabled (DaemonSet, `prometheus-node-exporter` sub-chart) |
| kube-state-metrics | enabled |
| pushgateway | enabled (`prometheus.io/probe: pushgateway`) |
| Scrape / eval interval | `1m` (scrape timeout `10s`) |
| Ingress | none (**internal only**) |

**Scrape targets** (default `scrapeConfigs`, all enabled): the Prometheus server itself, the Kubernetes API servers, nodes + kubelet cadvisor, plus **annotation-based** service-endpoint and pod discovery (`prometheus.io/scrape`, `prometheus.io/port`, `prometheus.io/path`, `prometheus.io/scheme`), with `*-slow` (5m) variants for cheaper targets.
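
To make the annotation-based discovery concrete, here is a rough Python model of how the four annotations combine into a scrape URL. It is a simplification and an assumption-laden sketch: in the real chart this is done by Prometheus relabelling rules, not by code like this, and the function name `scrape_url` and the example address are made up for illustration. The `http` and `/metrics` fallbacks are Prometheus's own defaults.

```python
from typing import Optional


def scrape_url(discovered_address: str, annotations: dict) -> Optional[str]:
    """Return the URL a target would be scraped at, or None if it is not opted in.

    discovered_address is assumed to be in "ip:port" form, as service
    discovery would report it.
    """
    # Targets opt in explicitly; anything else is skipped.
    if annotations.get("prometheus.io/scrape") != "true":
        return None
    scheme = annotations.get("prometheus.io/scheme", "http")
    path = annotations.get("prometheus.io/path", "/metrics")
    host, _, discovered_port = discovered_address.rpartition(":")
    # An explicit port annotation overrides the discovered port.
    port = annotations.get("prometheus.io/port", discovered_port)
    return f"{scheme}://{host}:{port}{path}"


# Hypothetical pod exposing metrics on a non-default port.
print(scrape_url(
    "10.42.0.7:8080",
    {"prometheus.io/scrape": "true", "prometheus.io/port": "9102"},
))  # prints http://10.42.0.7:9102/metrics
```

A workload without `prometheus.io/scrape: "true"` is simply ignored by these jobs, which is why new services need the annotation before they show up in Grafana.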

### grafana

[`grafana/`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/grafana): dashboards over Prometheus and ClickHouse.

| Key | Value |
|---|---|
| Chart deps | `grafana` `10.3.0` (app `latest`), `tool` `0.1.0` |
| Replicas | `1` (Deployment, `RollingUpdate`) |
| Persistence | **disabled** (ephemeral; dashboards/datasources are provisioned at boot) |
| Ingress | `grafana.arcodange.lab` (Traefik `websecure`, Let's Encrypt, `localIp@file` middleware) |
| Plugin | `grafana-clickhouse-datasource` |
| Resources | requests `100m` / `128Mi`, limits `100m` / `512Mi` |
| Timezone | `Europe/Paris` |

**Datasources (provisioned):**

| Name | Type | Target | Default |
|---|---|---|---|
| Prometheus | `prometheus` | `http://prometheus-server.tools.svc.cluster.local` | ✅ yes |
| clickhouse | `grafana-clickhouse-datasource` | `clickhouse.tools.svc.cluster.local:9000` (native, `tlsSkipVerify`) | no |

> [!WARNING]
> The Grafana **admin password is static and committed** in [`grafana/values.yaml`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/grafana/values.yaml) (`adminUser: admin`). The provisioned ClickHouse datasource password is committed there too (`secureJsonData.password`). Treat these as lab-only credentials; do not reuse them outside the homelab.

### crowdsec

[`crowdsec/`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/crowdsec): behavioural edge security that feeds a Traefik blocklist.

| Key | Value |
|---|---|
| Chart deps | `crowdsec` `0.20.1`, `tool` `0.1.0` |
| LAPI | Deployment (`RollingUpdate`, `maxUnavailable: 0`); the local API + decision store |
| Agent | DaemonSet pinned to control-plane nodes (`node-role.kubernetes.io/control-plane`) |
| Log source | parses **Traefik** pod logs in `kube-system` (`podName: traefik-*`, `program: traefik`) |
| Collections | `crowdsecurity/traefik`, `crowdsecurity/http-cve` (+ AppSec rules below) |
| AppSec (WAF) | **enabled**: `crowdsecurity/appsec-default` on `0.0.0.0:7422`; collections `appsec-virtual-patching` + `appsec-generic-rules` |
| Database | external PostgreSQL `crowdsec` via **pgbouncer** (`host: pgbouncer.tools:5432`, `type: postgresql`) |
| DB credentials | dynamic, from secret `crowdsec-db-credentials` (`DB_USER` / `DB_PASSWORD`, sourced via VSO) |
| Console | enrolled as instance `homelab` |

The decisions CrowdSec produces are surfaced as a **Traefik middleware blocklist applied at the edge**, so malicious IPs are dropped before they reach app namespaces. `server_reset_query: DEALLOCATE ALL` on pgbouncer (below) exists specifically to keep CrowdSec's prepared statements happy through the pooler. The CAPTCHA challenge CrowdSec serves on remediated requests is a **Cloudflare Turnstile widget minted by the `cms` repo**. See the [CMS Cloudflare page](../cms/cloudflare.md), which produces the sitekey/secret this bouncer consumes from Vault.

### pgbouncer

[`pgbouncer/`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/pgbouncer): the connection pooler in front of the **external** PostgreSQL.

| Key | Value |
|---|---|
| Chart deps | `pgbouncer` `2.3.1` (`icoretech/pgbouncer`), `tool` `0.1.0` |
| Scheduling | `nodeSelector: kubernetes.io/hostname: pi2` (co-located with PostgreSQL) |
| Upstream DB | external PostgreSQL at `192.168.1.202:5432` (the `pi2` host), wildcard database `"*"` |
| Auth type | `scram-sha-256` |
| `auth_query` | `SELECT uname, phash FROM user_lookup($1)` |
| `server_reset_query` | `DEALLOCATE ALL` (clears prepared statements; fixes CrowdSec re-use) |
| `server_idle_timeout` | `7200` (2h) |
| `ignore_startup_parameters` | `extra_float_digits` (unsupported JDBC arg) |
| Exporter | disabled |
| Service | `pgbouncer.tools:5432` (cluster-internal) |

> [!NOTE]
> pgbouncer is the single front door to the lab's PostgreSQL: CrowdSec, Plausible, and Vault's `postgres` dynamic-secret backend all connect through `pgbouncer.tools:5432`, never to `192.168.1.202` directly.

### redis (KeyDB)

[`redis/`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/redis): the in-memory cache / session store. The chart targets **KeyDB** (EqAlpha, Redis-compatible), tuned for the 2× Raspberry Pi 5 nodes.

| Key | Value |
|---|---|
| Chart deps | `redis` `2.1.0` (`pascaliske/redis`), `tool` `0.1.0` |
| Workload | **StatefulSet** (master at index 0, replica running `replicaof` the master) |
| Storage | PVC `create: true`, **1Gi** at `/data` (`ReadWriteOnce`) |
| Tuning | `server-threads 4` (ARM-tuned for the Pi 5 cores) |
| Port | `6379` (`ClusterIP`) |
| Security | `runAsUser/Group/fsGroup: 999`, non-root |
| Timezone | `Europe/Paris` |

> [!NOTE]
> Access the instance for inspection with `kubectl port-forward -n tools svc/redis 6379:6379` and Redis Insights (per the [chart README](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/redis/README.md)).

### plausible

[`plausible/`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/plausible): privacy-friendly web analytics. Deployed via a **Kustomize** overlay that inflates the upstream Helm chart (not a `Chart.yaml` dependency like the Helm-based components above).

| Key | Value |
|---|---|
| Declared via | Kustomize `helmCharts:` inflation generator |
| Chart / version | `plausible` `2.0.0` (`pascaliske/plausible`), image `ghcr.io/plausible/community-edition` |
| Replicas | `1` (Deployment) |
| Ingress | `analytics.arcodange.lab` (Traefik IngressRoute, Let's Encrypt, `localIp@file` middleware) |
| App DB | PostgreSQL via **pgbouncer**; an **init container** assembles `DATABASE_URL` from VSO dynamic creds |
| Event store | **ClickHouse** (see below) |
| GeoIP | MaxMind **GeoLite2** (`GeoLite2-Country` + `GeoLite2-City`), license key from secret `plausible-geoip` |
| Secrets | `SECRET_KEY_BASE` / `TOTP_VAULT_KEY` from existing secret `plausible-config` (VSO-fed) |

Plausible writes analytics events to ClickHouse and stores app/account state in PostgreSQL: two distinct backends, both reached through lab-internal services.
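
The `DATABASE_URL` assembly done by the init container can be pictured with the following Python sketch. It is not the actual init-container script (which this page does not show); the function name `build_database_url`, the example credentials, and the database name are hypothetical. The one point it illustrates is that dynamic credentials should be percent-encoded before being embedded in a connection URL, because generated passwords may contain characters such as `@`, `/` or `:`.

```python
from urllib.parse import quote


def build_database_url(username: str, password: str, database: str,
                       host: str = "pgbouncer.tools", port: int = 5432) -> str:
    """Assemble a postgres:// URL that goes through the pgbouncer service."""
    # safe="" forces every reserved character to be percent-encoded.
    user = quote(username, safe="")
    pwd = quote(password, safe="")
    return f"postgres://{user}:{pwd}@{host}:{port}/{database}"


# Hypothetical short-lived credentials, shaped like what a dynamic secret might hold.
print(build_database_url("v-user", "p@ss/w:rd", "plausible"))
# prints postgres://v-user:p%40ss%2Fw%3Ard@pgbouncer.tools:5432/plausible
```

Because the credentials are dynamic, the URL has to be rebuilt whenever the pod starts with a fresh lease, which is presumably why the assembly lives in an init container rather than in a static value.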

### clickhouse

[`clickhouse/`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/clickhouse): the OLAP column store behind Plausible. Also a **Kustomize** overlay inflating the upstream chart, plus a `databases` sub-chart that runs an init job.

| Key | Value |
|---|---|
| Declared via | Kustomize `helmCharts:` inflation generator (`chartHome: charts`) |
| Chart / version | `clickhouse` `0.4.0` (`pascaliske/clickhouse`), image `clickhouse/clickhouse-server` |
| Workload | **StatefulSet**, `replicas: 1` |
| Storage | PVC **16Gi** at `/var/lib/clickhouse` (`ReadWriteOnce`) |
| Ports | `8123` (HTTP), `9000` (native protocol) |
| Custom user | `arcodange` (full network access, `access_management: 1`) via `custom-users.xml` |
| Security | `runAsUser/Group/fsGroup: 101`, non-root |
| Timezone | `Europe/Paris` |

> [!WARNING]
> The ClickHouse `arcodange` user password is **static and committed** in [`clickhouse/clickhouseValues.yaml`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/clickhouse/clickhouseValues.yaml) (`custom-users.xml`). The same value appears in Grafana's provisioned datasource, so keep the two in sync if you rotate it.

> [!CAUTION]
> ClickHouse carries a `nodeAffinity` that **excludes `pi2`** (`kubernetes.io/hostname NotIn [pi2]`). `pi2` hosts PostgreSQL and pgbouncer; ClickHouse is deliberately kept off it to avoid I/O contention on that node. A cluster where `pi2` is the only schedulable node will leave ClickHouse `Pending`.
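
The effect of the `NotIn [pi2]` rule can be checked with a small Python model of how a node-affinity match expression is evaluated. This is a sketch of the Kubernetes semantics for the `In` and `NotIn` operators only, not scheduler code; the helper names and the node name `pi1` are assumptions for the example (`pi2` is the only node name this page confirms).

```python
def matches_expression(node_labels: dict, expr: dict) -> bool:
    """Evaluate one matchExpression against a node's labels (In / NotIn only)."""
    key, operator = expr["key"], expr["operator"]
    values = expr.get("values", [])
    if operator == "In":
        return node_labels.get(key) in values
    if operator == "NotIn":
        # A node without the label also satisfies NotIn.
        return node_labels.get(key) not in values
    raise ValueError(f"operator not modelled in this sketch: {operator}")


def schedulable_nodes(nodes: dict, expr: dict) -> list:
    """Names of the nodes that satisfy the expression, in input order."""
    return [name for name, labels in nodes.items() if matches_expression(labels, expr)]


not_pi2 = {"key": "kubernetes.io/hostname", "operator": "NotIn", "values": ["pi2"]}

# Hypothetical two-node cluster: only pi1 is eligible for ClickHouse.
cluster = {
    "pi1": {"kubernetes.io/hostname": "pi1"},
    "pi2": {"kubernetes.io/hostname": "pi2"},
}
print(schedulable_nodes(cluster, not_pi2))  # prints ['pi1']

# If pi2 is the only node left, nothing matches and the pod stays Pending.
print(schedulable_nodes({"pi2": cluster["pi2"]}, not_pi2))  # prints []
```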

---

## Tier 2 — supporting & inactive

| Component | Status | Notes |
|---|---|---|
| [`pgcat/`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/pgcat) | ❌ disabled | Alternative Postgres pooler (`pgcat` chart `0.1.0`). Not in service: its sole pool has empty `username`/`password`/`database` placeholders, and it is **not** keyed under `tools:` in [`chart/values.yaml`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/chart/values.yaml), so ArgoCD renders no Application for it. [pgbouncer](#pgbouncer) is the active pooler. |
| [`tool/`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/tool) | ✅ active (library) | Helm **library chart** (`type: library`, version `0.1.0`) consumed by **every** component chart via `dependencies:`. Ships shared templates/helpers; **not deployable** on its own. |

---

## Gotchas

> [!WARNING]
> **No high availability.** Every Tier-1 service runs a **single replica**: Vault (`standalone`), Prometheus (`replicaCount: 1`), Grafana (`replicas: 1`), ClickHouse and Redis/KeyDB StatefulSets (`replicas: 1`), Plausible and the CrowdSec LAPI (single Deployment). Any node drain or pod restart is a brief outage for that service, not a failover.

> [!WARNING]
> **Static, committed passwords.** Grafana admin (+ its ClickHouse datasource), the ClickHouse `arcodange` user, and the pgbouncer admin/auth users all carry plaintext credentials in their `values.yaml`. They are lab-only; rotate before any exposure and never copy them to a real environment.

> [!CAUTION]
> **ClickHouse must avoid `pi2`.** The `NotIn [pi2]` `nodeAffinity` keeps it off the PostgreSQL/pgbouncer host. If `pi2` is the only schedulable node, ClickHouse (and therefore Plausible analytics) stays `Pending`. See the [storage & recovery concept](../lab-ecosystem/storage-and-recovery.md) for how PVC-backed services map onto specific nodes.

> [!CAUTION]
> **Vault is single-instance and starts sealed.** After **any** restart (pod reschedule, node reboot, chart upgrade) Vault comes up **sealed** with no automatic unseal configured, so every VSO sync and dynamic-secret lease blocks until an operator unseals it. This is the first thing to check when secrets stop flowing across the cluster. [Secrets & VSO](secrets-and-vso.md) describes the impact; the unseal procedure and key custody live in the [storage & recovery concept](../lab-ecosystem/storage-and-recovery.md).
234
vibe/guidebooks/tools/secrets-and-vso.md
Normal file
@@ -0,0 +1,234 @@
[vibe](../../README.md) > [Guidebooks](../README.md) > [Tools](README.md) > **Secrets & VSO**

# Tools — Secrets & VSO

> **Status:** ✅ Active
> **Last Updated:** 2026-06-23
> **Upstream:** [Tools](README.md) · [Components](components.md)
> **Downstream:** consumed by every `tools`-namespace pod and by every app's CI/CD
> **Related:** [secrets-and-vault concept](../lab-ecosystem/secrets-and-vault.md) · [naming-conventions concept](../lab-ecosystem/naming-conventions.md) · [storage-and-recovery](../lab-ecosystem/storage-and-recovery.md) · [tofu CI apply flow](../factory-provisioning/opentofu/ci-apply-flow.md) · [postgres IaC](../factory-provisioning/opentofu/postgres-iac.md) · [safe-env ADR](../../ADR/0001-safe-prod-like-environment.md)

This page maps how secrets live in **HashiCorp Vault** (engines, auth backends) and how they reach **Kubernetes pods** via the **Vault Secrets Operator (VSO)**. The keystone is the **`app_policy` + `app_roles` module pair**: the machinery that turns a single `<app>` name into a matched set of Vault policies, roles, and CI identities. This is the same `<app>` join key documented in the [naming-conventions concept](../lab-ecosystem/naming-conventions.md).

Vault itself runs as a component in the `tools` namespace; see the [Components](components.md) page for its deploy shape. The admin/bootstrap layer (the `kvv1` engine, the `gitea_jwt` auth backend, the base `gitea_cicd` role, the Kubernetes auth backend mount) is created **by factory's Ansible-managed Vault Terraform** in [`hashicorp_vault.tf`](https://gitea.arcodange.lab/arcodange-org/factory/src/branch/main/ansible/arcodange/factory/playbooks/tools/roles/hashicorp_vault/files/hashicorp_vault.tf); everything on this page that is *per-app* is created by the IaC under [`hashicorp-vault/iac`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/hashicorp-vault/iac).

> [!CAUTION]
> Vault runs **standalone** with `file` storage (not raft) and starts **sealed** after any restart or node reboot. Until it is unsealed, every VSO read fails and no app can fetch DB creds or config; pods that depend on a `VaultDynamicSecret` will not start. Unseal procedure and key custody live in [storage-and-recovery](../lab-ecosystem/storage-and-recovery.md).

---

## 1) Vault engines & auth backends

All engines below are mounted by [`hashicorp-vault/iac/main.tf`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/hashicorp-vault/iac/main.tf) except `kvv1`, which is bootstrapped by factory's Ansible Vault Terraform.

| Mount | Type | Holds | Defined in |
|---|---|---|---|
| `kvv1` | KV **v1** | Admin / cloud secrets: `kvv1/google/credentials`, `kvv1/gitea/*`, `kvv1/cloudflare/*`, `kvv1/ovh/*`, `kvv1/postgres/credentials`, `kvv1/admin/*` | factory [`hashicorp_vault.tf`](https://gitea.arcodange.lab/arcodange-org/factory/src/branch/main/ansible/arcodange/factory/playbooks/tools/roles/hashicorp_vault/files/hashicorp_vault.tf) |
| `kvv2` | KV **v2** (versioned) | Per-app config secrets under `kvv2/<app>/*` | [`main.tf`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/hashicorp-vault/iac/main.tf) |
| `transit` | transit | The **VSO client-cache encryption key** `vso-client-cache`, which lets VSO persist its client cache encrypted so it survives an operator restart without re-auth storms | [`main.tf`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/hashicorp-vault/iac/main.tf) |
| `postgres` | database | **Dynamic** Postgres creds at `postgres/creds/<app>`; connects to the DB through `pgbouncer.tools:5432` using the `credentials_editor` root account | [`main.tf`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/hashicorp-vault/iac/main.tf) |

The `postgres` connection is configured with `allowed_roles = ["*"]` and a root-rotation statement (`ALTER USER … WITH PASSWORD`); the editor username/password come from the sensitive `POSTGRES_CREDENTIALS_EDITOR_*` variables.

### Auth backends

| Backend | Mount | Who uses it | Role(s) |
|---|---|---|---|
| `kubernetes` | `kubernetes` | VSO controller + every app pod's ServiceAccount | `vault-secret-operator` (VSO itself), `<app>` (one per app), `factory_crowdsec_conf` |
| `gitea_jwt` | `gitea_jwt` | CI/OpenTofu jobs running in Gitea Actions | `gitea_cicd` (base, factory-bootstrapped) + per-app `gitea_cicd_<app>` |

- **`kubernetes`** auth ([`main.tf`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/hashicorp-vault/iac/main.tf)) is configured against `https://kubernetes.default.svc:443`. The VSO role `vault-secret-operator` binds SA `hashicorp-vault-vault-secrets-operator-controller-manager` in ns `tools`, `audience = vault`, and carries the `edit-vso-client-cache` policy (encrypt/decrypt on `transit/.../vso-client-cache`).
- **`gitea_jwt`** is the OIDC/JWT backend for CI. Its backend, `default_role = gitea_cicd`, and the base `gitea_cicd` role are created by factory's Vault bootstrap; the Vault provider in each IaC project logs in via `auth_login_jwt { mount = "gitea_jwt", role = "gitea_cicd[_<app>]" }` using the `TERRAFORM_VAULT_AUTH_JWT` env var. See the [tofu CI apply flow](../factory-provisioning/opentofu/ci-apply-flow.md) for how the token is minted in the pipeline.
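
The provider login described in the `gitea_jwt` bullet might look roughly like this in an app's IaC project. It is a sketch under assumptions: the `app` variable is hypothetical, and the Vault address is assumed to reach the provider through the usual `VAULT_ADDR` environment variable rather than being set here.

```hcl
# Hypothetical variable naming the app whose CI role is used.
variable "app" {
  type    = string
  default = "webapp"
}

provider "vault" {
  # JWT login against the gitea_jwt mount with the per-app CI role.
  auth_login_jwt {
    mount = "gitea_jwt"
    role  = "gitea_cicd_${var.app}"
    # No jwt argument: the token is taken from TERRAFORM_VAULT_AUTH_JWT.
  }
}
```

A project that has no per-app role yet would use the base `gitea_cicd` role instead.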

### Terraform state

Each IaC project keeps its state in the **`arcodange-tf` GCS bucket** under a distinct prefix:

| Project | GCS prefix |
|---|---|
| Vault admin/app machinery | `tools/hashicorp_vault/main` |
| Plausible | `tools/plausible/main` |
| CrowdSec | `tools/crowdsec/main` |
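
For orientation, a backend block matching the Plausible row could look like the sketch below. The bucket and prefix come from the table; how the GCS credentials are handed to the backend in CI is not shown and is not assumed here.

```hcl
terraform {
  backend "gcs" {
    bucket = "arcodange-tf"
    prefix = "tools/plausible/main" # one distinct prefix per IaC project
  }
}
```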

---

## 2) The `app_policy` + `app_roles` modules — the `<app>` join-key machinery

> [!IMPORTANT]
> These two modules are the heart of the secrets layer. Given a single `<app>` name they emit a **matched, name-derived** set of Vault objects so that an app's runtime, its CI, and its database identity all line up on the same key. This is the Vault half of the lab-wide [naming convention](../lab-ecosystem/naming-conventions.md): the same `<app>` string also names the Kubernetes namespace, the ServiceAccount, the Postgres `<app>_role`, and the Gitea repo.

The two modules live on **opposite sides of the trust boundary**:

- [`modules/app_policy`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/hashicorp-vault/iac/modules/app_policy) is declared **once, centrally**, in the Vault admin project ([`main.tf`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/hashicorp-vault/iac/main.tf), `for_each` over `var.applications`). It creates the **policies and the CI identity** — the privileged bits — so the app's own repo never holds them.
- [`modules/app_roles`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/hashicorp-vault/iac/modules/app_roles) is declared **by the subordinate app project** (pulled over SSH as a Git module), running under the `<app>`-ops policy. It creates the **roles** the app needs.

### `app_roles` — runtime roles (declared by the app repo)

For `<app>`, [`app_roles/main.tf`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/hashicorp-vault/iac/modules/app_roles/main.tf) creates:

| Resource | Path | Key settings |
|---|---|---|
| Kubernetes auth role | `auth/kubernetes/role/<app>` | `bound_service_account_names = [<app>] + extras`, `bound_service_account_namespaces = [<app>] + extras`, `token_ttl = 3600` (1h), `token_policies = [default, <app>]`, `audience = vault` |
| Postgres dynamic role | `postgres/roles/<app>` | `db_name = postgres`; creation SQL: `CREATE ROLE "{{name}}" WITH LOGIN PASSWORD … VALID UNTIL …` then `GRANT <app>_role TO "{{name}}"`; revocation: `REASSIGN OWNED BY "{{name}}" TO <app>_role` then `REVOKE ALL ON DATABASE <app> FROM "{{name}}"` |
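
The two rows above map naturally onto two Vault provider resources. The following is a sketch of what the module body might contain, not a copy of `app_roles/main.tf`: the input name `name` is hypothetical, the extra SA/namespace lists are omitted, and the standard Vault placeholders `{{password}}` and `{{expiration}}` are assumed where the table abbreviates the creation SQL.

```hcl
# Hypothetical module input carrying the <app> join key.
variable "name" {
  type = string
}

# auth/kubernetes/role/<app>: SA name AND namespace must both match.
resource "vault_kubernetes_auth_backend_role" "app" {
  backend                          = "kubernetes"
  role_name                        = var.name
  bound_service_account_names      = [var.name]
  bound_service_account_namespaces = [var.name]
  token_ttl                        = 3600
  token_policies                   = ["default", var.name]
  audience                         = "vault"
}

# postgres/roles/<app>: ephemeral login users that inherit <app>_role.
resource "vault_database_secret_backend_role" "app" {
  backend = "postgres"
  name    = var.name
  db_name = "postgres"

  creation_statements = [
    # Placeholders assumed; the source abbreviates this statement.
    "CREATE ROLE \"{{name}}\" WITH LOGIN PASSWORD '{{password}}' VALID UNTIL '{{expiration}}';",
    "GRANT ${var.name}_role TO \"{{name}}\";",
  ]

  revocation_statements = [
    "REASSIGN OWNED BY \"{{name}}\" TO ${var.name}_role;",
    "REVOKE ALL ON DATABASE ${var.name} FROM \"{{name}}\";",
  ]
}
```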

> [!IMPORTANT]
> The Postgres dynamic role's creation SQL does `GRANT <app>_role TO {{name}}` and its revocation does `REASSIGN OWNED BY {{name}} TO <app>_role`. **The non-login `<app>_role` must already exist in Postgres** — it is created by factory's [postgres IaC](../factory-provisioning/opentofu/postgres-iac.md) (`postgresql_role.app_role["<app>"]`, owner of the `<app>` database). If that role is missing, every ephemeral-user creation/revocation fails. This is the ordering dependency between the two repos: **factory postgres/iac before tools app_roles**.

> [!NOTE]
> The Kubernetes auth role binds **both** SA names **and** namespaces — the check is an **AND**. A token presenting SA `<app>` from the wrong namespace (or any other SA from ns `<app>`) is rejected. The default binding is SA `<app>` in ns `<app>`; the `service_account_names` / `service_account_namespaces` inputs widen it (e.g. CrowdSec/Plausible run in ns `tools`, not a namespace named after the app).

The Postgres role can be skipped with `disable_database = true`; the DB name defaults to `<app>` but can be overridden via `database`.
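
From the app project's side, consuming the module could look like the sketch below, using Plausible as the example. The SSH source URL form, the `ref`, and the input name `name` are assumptions; `service_account_namespaces`, `disable_database` and `database` are the inputs named above.

```hcl
module "app_roles" {
  # Git-over-SSH module source; exact URL form and ref are assumptions.
  source = "git::ssh://git@gitea.arcodange.lab/arcodange-org/tools.git//hashicorp-vault/iac/modules/app_roles?ref=main"

  name = "plausible" # hypothetical input name for the <app> key

  # Plausible runs in ns "tools", so the namespace binding is widened.
  service_account_namespaces = ["tools"]

  # Optional switches described above:
  # disable_database = true        # skip the Postgres dynamic role
  # database         = "plausible" # override the default DB name (<app>)
}
```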

### `app_policy` — policies + CI identity (declared centrally)

For `<app>`, [`app_policy/main.tf`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/hashicorp-vault/iac/modules/app_policy/main.tf) creates:

| Resource | Name | Grants |
|---|---|---|
| **App policy** | `<app>` | `read,list` on `kvv2/data/<app>/*`; `read` on `postgres/creds/<app>*` — what the runtime pod can do |
| **Ops policy** | `<app>-ops` | The CI bundle (below) |
| **JWT role** | `gitea_cicd_<app>` (mount `gitea_jwt`) | `token_policies = [default] + <app>'s ops_policies`, `bound_audiences = [gitea_app_id]`, `user_claim = email`, `role_type = jwt` |
| **Identity group** | `<app>-ops` | Internal group carrying the `<app>-ops` policy, so Vault users mapped to their Gitea entity inherit ops rights |
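
Three of the four rows can be sketched as follows. This is an illustration only: the variable names are hypothetical, the `<app>-ops` policy itself is omitted, and `ops_policies` is assumed to already contain `<app>-ops` plus any extras, since the table does not spell out how that list is assembled.

```hcl
# Hypothetical module inputs.
variable "name" {
  type = string
}

variable "gitea_app_id" {
  type = string
}

variable "ops_policies" {
  type    = list(string)
  default = []
}

# Runtime policy <app>: what the pod may read.
resource "vault_policy" "app" {
  name   = var.name
  policy = <<-EOT
    path "kvv2/data/${var.name}/*" {
      capabilities = ["read", "list"]
    }
    path "postgres/creds/${var.name}*" {
      capabilities = ["read"]
    }
  EOT
}

# CI identity: gitea_cicd_<app> on the gitea_jwt mount.
resource "vault_jwt_auth_backend_role" "cicd" {
  backend         = "gitea_jwt"
  role_name       = "gitea_cicd_${var.name}"
  role_type       = "jwt"
  user_claim      = "email"
  bound_audiences = [var.gitea_app_id]
  token_policies  = concat(["default"], var.ops_policies)
}

# Internal identity group carrying the ops policy for human users.
resource "vault_identity_group" "ops" {
  name     = "${var.name}-ops"
  type     = "internal"
  policies = ["${var.name}-ops"]
}
```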

The **`<app>-ops` policy** is the privilege set a CI job needs to *manage* the app's own corner of Vault and the clouds:

- `create/update` on `auth/token/create`; `read` on `sys/mounts/auth/*` (so the Vault provider works);
- full CRUD on `postgres/roles/<app>*` and on `auth/kubernetes/role/<app>*` (so `app_roles` can apply) — the k8s-role rule is **parameter-constrained**: it may only set `bound_service_account_names`/`bound_service_account_namespaces` to the whitelisted `[<app>] + extras` lists and `token_policies` to `["default","<app>"]`, preventing a CI job from minting a role with broader bindings;
- full CRUD on the app's KV-v2 data, delete/undelete/destroy, and `metadata` (`kvv2/data|delete|undelete|destroy|metadata/<app>/*`);
- `read` on `kvv1/google/credentials` (the GCS backend SA) and `kvv1/gitea/tofu_module_reader` (the bot SSH key that lets CI pull the `app_roles` Git module);
- CRUD on `kvv1/cloudflare/<app>*` and `kvv1/ovh/<app>*` (cloud DNS/edge secrets scoped to the app).

> [!NOTE]
> The policy document is post-processed with two `replace()` calls. The Vault provider serializes the whitelisted list parameters as a JSON-encoded string (`"["webapp"]"`); the replaces strip the outer quotes so Vault receives a real list. If you change those `allowed_parameter` blocks, keep the replaces in sync.

### Apps wired in `terraform.tfvars`

[`terraform.tfvars`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/hashicorp-vault/iac/terraform.tfvars) declares the `applications` set the central `app_policy` `for_each` walks:

| `<app>` | Extra SA | Extra ns | Extra ops policy | Notes |
|---|---|---|---|---|
| `webapp` | — | — | — | defaults: SA `webapp` / ns `webapp` |
| `erp` | — | — | — | defaults |
| `cms` | `cloudflared` | — | `factory__cf_r2_arcodange_tf` | extra SA for the Cloudflare tunnel; extra ops policy for the CF R2 Terraform-state bucket |
| `crowdsec` | — | `tools` | — | runs in ns `tools` |
| `plausible` | — | `tools` | — | runs in ns `tools` |

> [!NOTE]
> `terraform.tfvars` uses the key `ops_policies` for the CMS extra policy while `variables.tf` declares the optional attribute as `policies`; the central `main.tf` passes `each.value.policies` into the module's `ops_policies` input. Read these together when adding a new app so the extra-policy list actually lands on the JWT role.
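
To see why the key names matter, consider the sketch below. It assumes `applications` is typed as a map of objects with optional attributes, which the source does not confirm, and the module input `name` is hypothetical. Under that assumption Terraform converts each tfvars entry to the declared object type and discards attributes that are not part of it, so an entry keyed `ops_policies` would leave `policies` at its default.

```hcl
# variables.tf (shape assumed for illustration)
variable "applications" {
  type = map(object({
    service_account_names      = optional(list(string), [])
    service_account_namespaces = optional(list(string), [])
    policies                   = optional(list(string), [])
  }))
}

# terraform.tfvars entry as described in the note:
#
#   applications = {
#     cms = {
#       service_account_names = ["cloudflared"]
#       ops_policies          = ["factory__cf_r2_arcodange_tf"] # not in the type above
#     }
#   }

# Central main.tf: only each.value.policies reaches the module.
module "app_policy" {
  for_each = var.applications
  source   = "./modules/app_policy"

  name         = each.key # hypothetical input name
  ops_policies = each.value.policies
}
```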

---

## 3) VSO CRDs — how a secret becomes a Kubernetes Secret

The [Vault Secrets Operator](https://developer.hashicorp.com/vault/docs/platform/k8s/vso) watches three custom resources and writes plain Kubernetes `Secret` objects that pods consume normally (env / volume). The app repo ships the CRDs; the operator does the Vault round-trips.

| CRD | What it does | Refresh / rotation |
|---|---|---|
| `VaultAuth` | Picks the auth method (`kubernetes`), the `mount`, the Vault `role` (= `<app>`), and the pod **ServiceAccount** (= `<app>`) used to log in; references a `VaultConnection` (here the in-cluster `default` → `http://hashicorp-vault.tools.svc.cluster.local:8200`) | n/a — used by the other two CRDs via `vaultAuthRef` |
| `VaultStaticSecret` | Reads a **KV-v2** path → writes a k8s `Secret` | `refreshAfter` (the lab uses `30s`) |
| `VaultDynamicSecret` | Reads `postgres/creds/<app>` (a **dynamic** lease) → writes a k8s `Secret`; `rolloutRestartTargets` lists Deployments to restart when creds rotate | follows the Vault lease TTL (1h); VSO renews/re-issues and restarts the targets |

### Worked example — Plausible (`tools` namespace)

Files under [`plausible/resources`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/plausible/resources):

1. **`VaultAuth` `plausible`** ([`vaultauth.yaml`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/plausible/resources/vaultauth.yaml)) — `method: kubernetes`, `role: plausible`, `serviceAccount: plausible`, `audiences: [vault]`. This is the Vault role `app_roles` created in [`plausible/iac/main.tf`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/plausible/iac/main.tf).
2. **`VaultStaticSecret` `plausible`** ([`vaultsecret.yaml`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/plausible/resources/vaultsecret.yaml)) — `kvv2` path `plausible/config` → Secret `plausible-config` (`refreshAfter: 30s`). The config payload holds **`SECRET_KEY_BASE`** and **`TOTP_VAULT_KEY`**, both **generated by Terraform** (`random_password`, base64-encoded) and written to `kvv2/plausible/config` via `vault_kv_secret_v2` in the plausible IaC.
3. **`VaultStaticSecret` `plausible-geoip`** ([`geoipsecret.yaml`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/plausible/resources/geoipsecret.yaml)) — `kvv2` path `plausible/geoip` → Secret `plausible-geoip` exposing **`LICENSE_KEY`** (the MaxMind GeoIP license, an admin-seeded value, fed to the `geoipupdate` sidecar via env `GEOIPUPDATE_LICENSE_KEY`).
4. **`VaultDynamicSecret` `plausible-db-credentials`** ([`vaultdynamicsecret.yaml`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/plausible/resources/vaultdynamicsecret.yaml)) — `postgres/creds/plausible` → Secret `plausible-db-credentials`; `rolloutRestartTargets` restarts Deployment `plausible`. An **init container** ([`add-initcontainer.yaml`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/plausible/add-initcontainer.yaml)) reads `username`/`password` from that Secret and writes `DATABASE_URL` (`postgres://${DB_USER}:${DB_PASS}@${DB_HOST}:${DB_PORT}/${DB_NAME}`) into a shared `generated-secrets` volume the app reads.
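
The producer side of item 2 (Terraform generating the two keys and writing them to `kvv2/plausible/config`) might be sketched like this. The resource labels, password lengths and character options are assumptions, not values taken from the Plausible IaC.

```hcl
# Random material for the two Plausible secrets; lengths are assumptions.
resource "random_password" "secret_key_base" {
  length  = 64
  special = false
}

resource "random_password" "totp_vault_key" {
  length  = 32
  special = false
}

# Written to kvv2/plausible/config, which VaultStaticSecret "plausible" reads.
resource "vault_kv_secret_v2" "plausible_config" {
  mount = "kvv2"
  name  = "plausible/config"

  data_json = jsonencode({
    SECRET_KEY_BASE = base64encode(random_password.secret_key_base.result)
    TOTP_VAULT_KEY  = base64encode(random_password.totp_vault_key.result)
  })
}
```

Because VSO refreshes the static secret every `30s`, a change to either value would reach the `plausible-config` Secret shortly after the apply.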

### Worked example — CrowdSec (`tools` namespace)

Templates under [`crowdsec/templates`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/crowdsec/templates):

1. **`VaultAuth` `crowdsec`** ([`vaultauth.yaml`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/crowdsec/templates/vaultauth.yaml)) — `role: crowdsec`, `serviceAccount: crowdsec`.
2. **`VaultDynamicSecret` `crowdsec-db-credentials`** ([`vaultdynamicsecret.yaml`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/crowdsec/templates/vaultdynamicsecret.yaml)) — `postgres/creds/crowdsec` → Secret `crowdsec-db-credentials`; `rolloutRestartTargets` restarts Deployment **`crowdsec-lapi`** (the Local API that owns the DB connection).

### `factory_auth.tf` — the Ansible CrowdSec/Traefik plugin reader

Separately from the per-app machinery, [`factory_auth.tf`](https://gitea.arcodange.lab/arcodange-org/tools/src/branch/main/hashicorp-vault/iac/factory_auth.tf) wires a Kubernetes auth role **`factory_crowdsec_conf`** for SA **`factory-ansible-tool-crowdsec-traefik-plugin`** in ns **`kube-system`** (`token_ttl = 3600`). It carries policy `factory_crowdsec_conf`, which grants `read,list` on **`kvv2/data/cms/factory/*`**. This is how the Ansible-deployed CrowdSec/Traefik bouncer plugin reads the **Turnstile** configuration that the [`cms` repo](https://gitea.arcodange.lab/arcodange-org/cms) writes into `kvv2/cms/factory/*` — a cross-repo handoff entirely through Vault, with no shared file. The producer side (the Turnstile widget and the `vault_kv_secret_v2` write) is documented on the [CMS Cloudflare page](../cms/cloudflare.md).
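
A sketch of what that wiring amounts to, using the names and values quoted above. The resource labels are illustrative, and the real `factory_auth.tf` may be structured differently.

```hcl
# Read-only access to the CMS-produced factory config in KV v2.
resource "vault_policy" "factory_crowdsec_conf" {
  name   = "factory_crowdsec_conf"
  policy = <<-EOT
    path "kvv2/data/cms/factory/*" {
      capabilities = ["read", "list"]
    }
  EOT
}

# Kubernetes auth role for the Ansible-deployed plugin's ServiceAccount.
resource "vault_kubernetes_auth_backend_role" "factory_crowdsec_conf" {
  backend                          = "kubernetes"
  role_name                        = "factory_crowdsec_conf"
  bound_service_account_names      = ["factory-ansible-tool-crowdsec-traefik-plugin"]
  bound_service_account_namespaces = ["kube-system"]
  token_ttl                        = 3600
  token_policies                   = [vault_policy.factory_crowdsec_conf.name]
}
```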

---

## 4) Secret-paths inventory

| Path | Engine | Holds | Producer | Consumer |
|---|---|---|---|---|
| `kvv2/<app>/config` | KV v2 | App runtime config | app CI (KV CRUD via `<app>-ops`) | `VaultStaticSecret` → pod |
| `kvv2/plausible/config` | KV v2 | `SECRET_KEY_BASE`, `TOTP_VAULT_KEY` | Plausible IaC (`random_password` → `vault_kv_secret_v2`) | `VaultStaticSecret plausible` → `plausible-config` |
| `kvv2/plausible/geoip` | KV v2 | `LICENSE_KEY` (MaxMind) | admin-seeded | `VaultStaticSecret plausible-geoip` → `geoipupdate` sidecar |
| `kvv2/cms/factory/turnstile` | KV v2 | Cloudflare Turnstile config | `cms` repo IaC | `factory_crowdsec_conf` k8s role → Ansible CrowdSec/Traefik plugin |
| `postgres/creds/<app>` | database | Ephemeral DB user (`username`/`password`, 1h lease) | Vault on demand (role `<app>`, `GRANT <app>_role`) | `VaultDynamicSecret` → pod (e.g. `plausible-db-credentials`, `crowdsec-db-credentials`) |
| `transit/.../vso-client-cache` | transit | VSO client-cache encryption key | Vault admin IaC | VSO controller (encrypt/decrypt its cache) |
| `kvv1/cloudflare/<app>*` | KV v1 | Cloudflare DNS/edge secrets | admin | app CI (`<app>-ops` CRUD) |
| `kvv1/ovh/<app>*` | KV v1 | OVH secrets | admin | app CI (`<app>-ops` CRUD) |
| `kvv1/gitea/tofu_module_reader` | KV v1 | Bot SSH key to pull the `app_roles` Git module | admin | app CI (`<app>-ops` read) |
| `kvv1/google/credentials` | KV v1 | GCS Terraform-backend SA key | admin | every IaC CI job (read) |

---

## 5) Secrets flow

```mermaid
%%{init: {'theme': 'base'}}%%
flowchart TB
    classDef eng fill:#7c3aed,stroke:#5b21b6,color:#ffffff
    classDef auth fill:#b45309,stroke:#92400e,color:#ffffff
    classDef crd fill:#059669,stroke:#047857,color:#ffffff
    classDef k8s fill:#2563eb,stroke:#1e40af,color:#ffffff
    classDef ci fill:#be123c,stroke:#9f1239,color:#ffffff

    subgraph VAULT["Vault (tools ns)"]
        KV2["kvv2 engine<br>kvv2/<app>/*"]:::eng
        PG["postgres engine<br>postgres/creds/<app>"]:::eng
        TR["transit<br>vso-client-cache"]:::eng
        KKUB["kubernetes auth<br>role <app> (SA AND ns)"]:::auth
        KJWT["gitea_jwt auth<br>gitea_cicd_<app>"]:::auth
    end

    subgraph RUNTIME["Runtime path"]
        VA["VaultAuth<br>role <app>, SA <app>"]:::crd
        VSS["VaultStaticSecret<br>kvv2/<app>/config"]:::crd
        VDS["VaultDynamicSecret<br>postgres/creds/<app>"]:::crd
        SEC["k8s Secret<br><app>-config / -db-credentials"]:::k8s
        POD["App pod<br>(SA <app>)"]:::k8s
    end

    subgraph CICD["CI path"]
        GHA["Gitea Actions<br>OpenTofu job"]:::ci
        TOFU["apply app_roles<br>(under <app>-ops)"]:::ci
    end

    KKUB --> VA
    VA --> VSS
    VA --> VDS
    KV2 --> VSS
    PG --> VDS
    VSS --> SEC
    VDS -- "rolloutRestart on rotation" --> SEC
    SEC --> POD
    TR -. "encrypts client cache" .-> VA

    GHA -- "JWT login" --> KJWT
    KJWT --> TOFU
    TOFU -- "creates" --> KKUB
    TOFU -- "creates" --> PG
```

1. **Vault** mounts the engines (`kvv2`, `postgres`, `transit`) and the two auth backends (`kubernetes`, `gitea_jwt`), all in the `tools` namespace.
2. A pod's `VaultAuth` logs in through the **`kubernetes`** backend with SA `<app>` against role `<app>`; the role accepts only when **both** the SA name **and** its namespace match (AND).
3. `VaultStaticSecret` reads `kvv2/<app>/config` and `VaultDynamicSecret` reads `postgres/creds/<app>` using that auth; VSO writes the values into ordinary k8s `Secret` objects.
4. The pod consumes the Secret (env or volume); on a dynamic-cred **rotation** VSO restarts the `rolloutRestartTargets` Deployment so it picks up the new credentials.
5. The **`transit`** key `vso-client-cache` encrypts VSO's client cache so an operator restart doesn't trigger a re-auth storm.
6. On the CI side, a **Gitea Actions** OpenTofu job logs into the **`gitea_jwt`** backend as `gitea_cicd_<app>` (audience = the Gitea OAuth app id, identity from the `email` claim).
7. Running under the `<app>-ops` policy, that job **applies the `app_roles` module**, creating/updating the Kubernetes auth role and the Postgres dynamic role for `<app>` — closing the loop so the runtime path in steps 2-4 works.

---

## Gotchas

- **Vault must be unsealed after every restart.** Sealed Vault → all VSO reads fail → dynamic-secret consumers won't start. See [storage-and-recovery](../lab-ecosystem/storage-and-recovery.md).
- **The Kubernetes auth role binds SA *and* namespace (AND).** The wrong namespace, or a different SA in the right namespace, is rejected. Apps in ns `tools` (CrowdSec, Plausible) widen the binding via `service_account_namespaces`.
- **The Postgres dynamic role depends on `<app>_role` existing.** `GRANT <app>_role TO {{name}}` (create) and `REASSIGN OWNED BY {{name}} TO <app>_role` (revoke) both fail if factory's [postgres IaC](../factory-provisioning/opentofu/postgres-iac.md) hasn't created the `<app>_role` non-login role first. Order: **factory postgres/iac → tools app_roles**.
- **The `ops_policies` vs `policies` key mismatch** in `terraform.tfvars` / `variables.tf` (see §2) — read both when adding an app's extra ops policy.
- **The sandbox uses a separate Vault.** Per the [safe-env ADR](../../ADR/0001-safe-prod-like-environment.md), the prod-like sandbox stands up its own Vault instance; none of the paths or roles above are shared with it. Don't assume a secret seeded in prod exists in the sandbox.