docs(vibe): bootstrap vibe/ knowledge tree + ecosystem AGENTS.md
Add a root AGENTS.md (ecosystem map of factory/tools/cms + agent operating rules + the persona cohort & workflow) and a new vibe/ knowledge base for LLM agents, modeled on tree-docs conventions and the factory house style.

vibe/ folders (each with a README hub + contribution rules):
- ADR/ optimized MADR-lite; canonical home going forward (doc/adr stays historical)
- PRD/ one subfolder per PRD, mandatory STATUS.md, QA strategy for big ones
- investigations/ single INV-NNN-slug.md, or stub + folder w/ notebooks
- guidebooks/ tree-docs maps; lab-ecosystem guidebook of factory+tools+cms
- runbooks/ [AGENT]/[HUMAN] step procedures (EN; doc/runbooks stays FR)
- shareouts/ dated FR handouts (decks/mp4)

Seed content (first ADR + PRD): a safe, production-like environment to rehearse risky changes and recovery without touching real prod — local-only sandbox (k3d + arm64 VMs) with a hard prod/sandbox isolation boundary. Includes INV-001 (prod blast-radius couplings), the ecosystem guidebook, and a FR shareout.

Conventions enforced: no-tombstone rule, breadcrumb spine, bidirectional cross-links, theme:base mermaid (MCP-validated) + ordered-list-after-diagram.

Built with a Workflow + persona cohort; 24 files, zero dead links.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
vibe/PRD/README.md
@@ -0,0 +1,34 @@
[vibe](../README.md) > **PRD**

# Product Requirement Documents

> **Status**: 🟢 Active
> **Last Updated**: 2026-06-23
> **Related**: [vibe/ADR](../ADR/README.md) · [vibe/Investigations](../investigations/README.md)

`vibe/PRD/` holds the Product Requirement Documents that drive larger pieces of work in the lab. A PRD captures *what* we want and *why it matters*; the matching ADRs capture *how we decided to build it*, and investigations capture *what we learned* along the way.

## Convention

- **One subfolder per PRD**, kebab-case (e.g. `safe-prod-like-environment/`).
- Each subfolder **MUST** contain:
  - `README.md` — the PRD hub: problem, goals/non-goals, requirements, success criteria, and a QA strategy.
  - `STATUS.md` — the implementation tracker. **Update it whenever something ships** (a PR merges, a brick lands, a milestone closes). It is the living view of "where are we" against the PRD.
- A **big PRD uses tree-docs**: the `README.md` stays a hub and detail lives in leaf pages (each with its own breadcrumb and bidirectional cross-links). A tree-sized PRD **MUST** detail an explicit **QA strategy** — how the delivered work will be verified, and what "done and safe" means.
- **PRs cross-link to the PRD**, and the PRD's `STATUS.md` **cross-links back** to the PRs/ADRs/investigations that realised each part. Links are bidirectional.
- **No-tombstone rule** applies: the PRD reads as currently true. Progress lives in `STATUS.md` (which *is* a tracker and may legitimately list shipped items), not as "previously / now" edits scattered through the hub.

## Index

| PRD | Hub | Status |
| --- | --- | --- |
| Safe, production-like environment | [safe-prod-like-environment/README.md](safe-prod-like-environment/README.md) | 🟡 In design |

## Rules to contribute

1. Create a kebab-case subfolder named for the PRD.
2. Add `README.md` (the hub) and `STATUS.md` (the tracker). Both carry a breadcrumb first line and the leaf header blockquote (Status / Last Updated / Related).
3. In the hub, state the problem, goals and non-goals, requirements, success criteria, and the QA strategy. If the PRD is large, split detail into leaf pages and keep the README as a navigable hub.
4. Keep `STATUS.md` current: every time a piece ships, record it there and link the PR/ADR that delivered it.
5. Add a row to the Index table above.
6. Ensure every PR that implements part of the PRD links to the PRD, and that `STATUS.md` links back. Bidirectional links are mandatory.
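
The structural rules above (kebab-case subfolder, mandatory `README.md` and `STATUS.md`) lend themselves to a mechanical check. The following is a minimal sketch, not part of this commit: the function name `check_prd_tree` and the idea of running it as a lint step are assumptions made for illustration.

```python
import re
import tempfile
from pathlib import Path

# Lowercase words or digits joined by single hyphens.
KEBAB_CASE = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")
REQUIRED_FILES = ("README.md", "STATUS.md")


def check_prd_tree(prd_root: Path) -> list[str]:
    """Return human-readable violations for every PRD subfolder of prd_root."""
    problems = []
    for folder in sorted(p for p in prd_root.iterdir() if p.is_dir()):
        if not KEBAB_CASE.match(folder.name):
            problems.append(f"{folder.name}: folder name is not kebab-case")
        for required in REQUIRED_FILES:
            if not (folder / required).is_file():
                problems.append(f"{folder.name}: missing {required}")
    return problems


if __name__ == "__main__":
    # Demo on a throwaway tree so the sketch runs without the real repo.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "safe-prod-like-environment").mkdir()
        (root / "safe-prod-like-environment" / "README.md").write_text("# hub\n")
        for line in check_prd_tree(root):
            print(line)
```

An empty result means every subfolder satisfies the two structural rules; the breadcrumb and cross-link rules would need a separate check.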
vibe/PRD/safe-prod-like-environment/README.md
@@ -0,0 +1,69 @@
[vibe](../../README.md) > [PRD](../README.md) > **Safe, production-like environment**

# Safe, production-like environment

> **Status:** In design
> **Last Updated:** 2026-06-23
> **Design record:** [ADR 0001 — Safe, production-like environment](../../ADR/0001-safe-prod-like-environment.md)
> **Adjacent:** [INV-001 — prod blast-radius couplings](../../investigations/INV-001-prod-blast-radius-couplings.md)
> **Map:** [Lab ecosystem guidebook](../../guidebooks/lab-ecosystem/README.md)

## Problem

The lab doubles as company production. The same three Raspberry Pis that host experiments also run the public CMS ([arcodange.fr](https://gitea.arcodange.lab/arcodange-org/cms)), Zoho-backed email, the Dolibarr ERP (accounting and business records), the url-shortener, the telegram-gateway, and the dance-lessons-coach. A single operator administers all of it from one MacBook that holds the kubeconfig, the Vault root, and the cloud admin tokens.

There is no separation between "where I experiment" and "where the business runs": every risky change is currently tested directly in prod. The 2026-04-13 power-cut proved that recovery is manual, non-trivial, and validated only by real incidents. A wrong Ansible play, Vault policy, Postgres migration, ArgoCD sync, or DNS record can silently break the business for days.

## Users & personas

A **single operator wearing two hats**:

- **The inner-loop developer** — iterating on an app, a Helm chart, a Vault policy, or an ArgoCD app. Wants a fast feedback cycle (seconds, not a fleet round-trip) and zero fear of touching the wrong thing.
- **The game-day / recovery operator** — rehearsing dangerous change-classes and disaster recovery (node-kill, Vault-seal, Longhorn corruption, DB drop, full power-cut). Wants high fidelity to the real topology so a drill actually predicts prod behaviour, and a runbook that gets validated against the sandbox instead of waiting for the next outage.

Both hats belong to the same person on the same laptop. The environment must serve the fast loop and the faithful loop without ever letting either reach into prod.

## Goals & non-goals

**Goals**

- Rehearse **every dangerous change-class** (Ansible, Vault, Postgres, ArgoCD, Cloudflare/DNS, Longhorn, recovery drills, PKI) with **zero prod blast radius**.
- Validate `CLUSTER_RECOVERY.md` against the **sandbox**, not against prod — turn the recovery runbook from incident-validated into drill-validated.
- A **fast inner loop** for app / Vault / ArgoCD iteration, seeded from the same GitOps repos as prod.
- Run on the **existing control node** (the MacBook) at **$0** marginal cost.

**Non-goals**

- No real `arcodange.fr` DNS or email tests. DNS/email modules run **plan-only against a throwaway zone**; real public DNS/ACME end-to-end is out of scope.
- No **physical-node tier** (4th Pi / mini-PC) — explicitly rejected, not deferred.
- No **cloud tier** for disposable real-DNS clusters — explicitly rejected, not deferred.
- Not a **performance-benchmark** environment — fidelity is about topology and the convention chain, not throughput numbers on laptop-class hardware.

## Requirements

Functional:

- **One-command bring-up**, seeded from the same GitOps repos as prod.
- **Sandbox inventory + guards** — a separate `inventory/sandbox/hosts.yml` plus a pre-task guard that aborts on any prod IP (`192.168.1.201-203`) unless `i_mean_prod=true`. See [Isolation boundary](isolation-boundary.md).
- **Parity manifest** — same k3s `v1.34.3+k3s1`, same Longhorn/Vault/VSO, same app-of-apps list, Postgres + Gitea outside k3s, three nodes. Drift = fail.
- **Two modes** seeded from the same repos via the sandbox inventory:
  - **(a) k3d single-node fast inner loop** (~60s bring-up) for app / Vault / ArgoCD iteration.
  - **(b) 3 arm64 VMs** (multipass or Vagrant on the M4) reproducing the 3-node topology — Postgres + Gitea as docker-compose outside k3s on the pi2-equivalent VM, Longhorn across the three VM disks — for Ansible / Longhorn / recovery work.

The **isolation boundary** (the table mapping each prod coupling to its sandbox control) and the **`<app>` naming note** are detailed in [isolation-boundary.md](isolation-boundary.md).

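
The "Drift = fail" rule of the parity manifest can be read as a plain comparison between expected and observed facts. The sketch below illustrates that reading under stated assumptions: the key names and the `parity_drift` helper are hypothetical, and only the k3s version, the node count, and the "outside k3s" placement come from this PRD. The Longhorn / Vault / VSO versions are not listed here, so they are left out rather than guessed.

```python
# Expected facts. Key names are hypothetical; values are taken from the PRD.
EXPECTED = {
    "k3s_version": "v1.34.3+k3s1",
    "node_count": 3,
    "postgres_outside_k3s": True,
    "gitea_outside_k3s": True,
}


def parity_drift(expected: dict, observed: dict) -> dict:
    """Map each drifting key to an (expected, observed) pair.

    A key missing from `observed` counts as drift and is reported as None.
    """
    missing = object()
    drift = {}
    for key, want in expected.items():
        got = observed.get(key, missing)
        if got != want:
            drift[key] = (want, None if got is missing else got)
    return drift


if __name__ == "__main__":
    # Hypothetical observation: the sandbox came up with only two nodes.
    observed = dict(EXPECTED, node_count=2)
    drift = parity_drift(EXPECTED, observed)
    print("parity gate:", "FAIL" if drift else "PASS", drift)
```

Returning the full drift map instead of a boolean keeps the failure message actionable: the operator sees which fact diverged, not just that the gate failed.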
## QA strategy

The strategy combines four fidelity gates (the parity manifest, a `canary` provisioning-parity test that drives the new-web-app runbook end to end, an `ansible-playbook --check --diff` changed=0 idempotence gate, and a `tofu plan` diff gate), chaos drills mapped section-by-section to `CLUSTER_RECOVERY.md`, a recurring monthly game-day, and a promotion gate: no infra / Vault / storage / DNS change reaches prod until it has been applied to the sandbox **and** survived the matching drill. Full detail in [qa-strategy.md](qa-strategy.md).

## Implementation status

Rollout is phased (Phase 0 guardrails → Phase 1 k3d → Phase 2 3-VM → Phase 3 game-day; Phase 4 out of scope). Live tracker: [STATUS.md](STATUS.md).

## Leaves

| Page | Summary | Status |
| --- | --- | --- |
| [Isolation boundary](isolation-boundary.md) | Prod-coupling → sandbox-control table; the `<app>` naming note; the token caution. | 🟡 In design |
| [QA strategy](qa-strategy.md) | Fidelity gates, chaos-drill table, game-day cadence, promotion gate, evidence trail. | 🟡 In design |
| [STATUS](STATUS.md) | Phase tracker (all not-started) and PR log. | ⬜ Not started |
vibe/PRD/safe-prod-like-environment/STATUS.md
@@ -0,0 +1,52 @@
[vibe](../../README.md) > [PRD](../README.md) > [Safe, production-like environment](README.md) > **STATUS**

# STATUS — Safe, production-like environment

> **Last Updated:** 2026-06-23

Legend: ⬜ not started · 🟡 in progress · ✅ done

> [!IMPORTANT]
> This file MUST be updated whenever something ships. Every PR that advances a phase cross-links back here (and the matching checkbox flips), and the [PRs](#prs) table gets a row.

## Phase 0 — Isolation guardrails

*Must land before any sandbox run.*

- [ ] ⬜ Sandbox inventory `inventory/sandbox/hosts.yml` (VM/cloud hosts only)
- [ ] ⬜ Prod-IP abort guard (aborts on `192.168.1.201-203` unless `i_mean_prod=true`)
- [ ] ⬜ Sandbox GCS state prefixes (`sandbox/...`) or `gs://arcodange-tf-sandbox`
- [ ] ⬜ Sandbox Vault unseal-key path (`~/.arcodange/sandbox/cluster-keys.json`)
- [ ] ⬜ Sandbox env profile / plan-only DNS against a throwaway zone

## Phase 1 — Tier-1 k3d fast mode

- [ ] ⬜ One-command bring-up seeded from GitOps
- [ ] ⬜ Parity manifest v1
- [ ] ⬜ Canary provisioning-parity test
- [ ] ⬜ `changed=0` idempotence gate documented

## Phase 2 — Tier-1 3-VM cluster

- [ ] ⬜ Three arm64 VMs (multipass / Vagrant on the M4)
- [ ] ⬜ Same `system_k3s`; Postgres + Gitea outside k3s on the pi2-equivalent VM
- [ ] ⬜ Longhorn across the three VM disks
- [ ] ⬜ Chaos drills: node-kill / Vault-seal / DB-drop
- [ ] ⬜ First full `CLUSTER_RECOVERY` dry-run against the sandbox

## Phase 3 — Game-day operationalization

- [ ] ⬜ Monthly cadence + promotion gate in the PR checklist
- [ ] ⬜ Longhorn engine-ID drill
- [ ] ⬜ ArgoCD bad-sync rollback runbook
- [ ] ⬜ Evidence trail for ≥1 cycle

## Phase 4 — out of scope

Not planned: dedicated physical node (4th Pi / mini-PC) and disposable cloud k3s for real public DNS/ACME. See [ADR 0001](../../ADR/0001-safe-prod-like-environment.md) for the rejected-alternatives rationale.

## PRs

| PR | Scope | Phase | Merged |
| --- | --- | --- | --- |
| _pending_ | Bootstrap the PRD tree (this `vibe/` set) — backfilled on open | — | ⬜ |
vibe/PRD/safe-prod-like-environment/isolation-boundary.md
@@ -0,0 +1,31 @@
[vibe](../../README.md) > [PRD](../README.md) > [Safe, production-like environment](README.md) > **Isolation boundary**

# Isolation boundary

> **Status:** In design
> **Last Updated:** 2026-06-23
> **Upstream:** [Safe, production-like environment](README.md)
> **Related:** [ADR 0001](../../ADR/0001-safe-prod-like-environment.md) · [INV-001 — prod blast-radius couplings](../../investigations/INV-001-prod-blast-radius-couplings.md)

The isolation boundary is the load-bearing part of this PRD: the sandbox must be **unable to mutate real prod even on a wrong command**. Every prod coupling that a sandbox run could touch is mapped below to a concrete control. The boundary is the **cluster + Vault + state + DNS zone** — not the names (see the naming note).

## Prod couplings → sandbox controls

| Prod coupling | What it can break in prod | Sandbox control |
| --- | --- | --- |
| Ansible inventory `hosts.yml` → `192.168.1.201-203` | Wipe disks, reset k3s, corrupt Longhorn on the live Pis. | Separate `inventory/sandbox/hosts.yml` (VM/cloud hosts only) **plus** a pre-task guard that **aborts** if any target IP is in `192.168.1.201-203` unless `i_mean_prod=true` is set explicitly. |
| OpenTofu state in `gs://arcodange-tf` (prefixes) | A sandbox apply rewrites live state and re-plans prod resources. | A sandbox prefix family (`sandbox/factory/main`, `sandbox/tools/...`, `sandbox/factory/postgres`) via a backend-config override, **or** a separate bucket `gs://arcodange-tf-sandbox`. Sandbox runs never touch prod state. |
| Gitea provider `base_url` `gitea.arcodange.lab` + ArgoCD `repoURL` / `targetRevision` | Sandbox commits/pushes into the prod forge; ArgoCD syncs sandbox refs onto the prod cluster. | Sandbox Gitea on the sandbox cluster (or org `arcodange-sandbox`); the sandbox app-of-apps points at a **sandbox branch** so the sandbox cluster syncs only sandbox refs. |
| Vault provider `address` `vault.arcodange.lab` + unseal key `~/.arcodange/cluster-keys.json` | Sandbox writes clobber prod policies/auth/mounts; a botched init overwrites the prod unseal key. | A **separate sandbox Vault**; override the unseal-key path to `~/.arcodange/sandbox/cluster-keys.json` so prod's key can never be overwritten. |
| PostgreSQL provider `host` `192.168.1.202` (superuser) | Drop or alter live DBs — including ERP business records. | Sandbox PG is the docker-compose on the sandbox pi2-equivalent; a guard **refuses apply** if `host == 192.168.1.202` and `workspace != prod`. |
| Cloudflare account / OVH `arcodange.fr` / Zoho live mail | A wrong MX/SPF/DKIM silently breaks `arcodange.fr` mail for days. | DNS/email modules run **plan-only** against a throwaway zone/subdomain with a separate token. The real `arcodange.fr` token is **never** exported into a sandbox shell. Real public DNS/ACME is out of scope. |
| Longhorn backup bucket | A restore drill overwrites prod backups. | Sandbox backup target is a **separate bucket/prefix** so restore drills cannot overwrite prod backups. |

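
Two of the controls in the table are pure decision rules: the prod-IP abort guard and the PostgreSQL apply guard. The sketch below models only that decision logic in Python, as an illustration. The real guards would live in an Ansible pre-task and in the OpenTofu configuration, and the function and exception names here are hypothetical.

```python
import ipaddress

# Prod addresses named in the table: 192.168.1.201 through 192.168.1.203.
PROD_IPS = {ipaddress.ip_address(f"192.168.1.{n}") for n in range(201, 204)}
PROD_PG_HOST = "192.168.1.202"


class ProdTargetError(RuntimeError):
    """Raised when a run would reach a prod address without explicit consent."""


def assert_no_prod_targets(target_ips, i_mean_prod=False):
    """Abort if any target is a prod Pi, unless i_mean_prod is set explicitly."""
    hits = sorted(
        str(ip) for ip in map(ipaddress.ip_address, target_ips) if ip in PROD_IPS
    )
    if hits and not i_mean_prod:
        raise ProdTargetError(f"refusing to run against prod hosts: {hits}")


def assert_pg_apply_allowed(host, workspace):
    """Refuse an apply that points at the prod Postgres from a non-prod workspace."""
    if host == PROD_PG_HOST and workspace != "prod":
        raise ProdTargetError(
            f"workspace {workspace!r} must not apply against {PROD_PG_HOST}"
        )


if __name__ == "__main__":
    # The sandbox address below is a made-up example.
    assert_no_prod_targets(["192.168.64.10"])
    try:
        assert_no_prod_targets(["192.168.64.10", "192.168.1.202"])
    except ProdTargetError as err:
        print("guard tripped:", err)
```

Both guards fail closed: the default path raises, and reaching prod requires an explicit `i_mean_prod=True` or the `prod` workspace.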
## The `<app>` naming note

The `<app>` key threads one kebab-case identifier through the Gitea repo, the PG db + role, the Vault paths/policies, the k8s namespace + SA, the ArgoCD Application, the GCS state prefix, and DNS — see [conventions](../../../doc/runbooks/new-web-app/conventions.md).

Because `<app>` keys everything **within** a cluster / Vault / DB / zone, the sandbox can reuse **identical `<app>` names with no collision**. The isolation boundary is the cluster + Vault + state + DNS zone, not the names. This is deliberate: runbooks read **identically** in both environments, so a drill exercises the exact same convention chain an operator runs in prod.

> [!CAUTION]
> The real `arcodange.fr` Cloudflare token must **never** be exported into a sandbox shell. DNS/email work in the sandbox is plan-only against a throwaway zone with its own separate token. Exporting the prod token into a sandbox session would defeat the entire isolation boundary — a single `tofu apply` could rewrite live public DNS or mail records.
vibe/PRD/safe-prod-like-environment/qa-strategy.md
@@ -0,0 +1,49 @@
[vibe](../../README.md) > [PRD](../README.md) > [Safe, production-like environment](README.md) > **QA strategy**

# QA strategy

> **Status:** In design
> **Last Updated:** 2026-06-23
> **Upstream:** [Safe, production-like environment](README.md)
> **Related:** [ADR 0001](../../ADR/0001-safe-prod-like-environment.md) · [Isolation boundary](isolation-boundary.md) · [INV-001](../../investigations/INV-001-prod-blast-radius-couplings.md)

The sandbox is only useful if a green run there reliably predicts a green run in prod. That requires two things: the sandbox must stay **faithful** to prod (fidelity gates), and dangerous change-classes must be **rehearsed** before they ship (chaos drills + promotion gate).

## Fidelity gates

- **Parity manifest** — the sandbox must match prod on: k3s `v1.34.3+k3s1`, the same Longhorn / Vault / VSO versions, the same app-of-apps list, Postgres + Gitea running **outside** k3s, and three nodes. Any drift from this manifest is a failed gate.
- **Provisioning-parity canary test** — run the [new-web-app runbook](../../../doc/runbooks/new-web-app/conventions.md) for a throwaway `<app>` named `canary` and assert the convention chain resolves end to end: Gitea repo → PG db + role → Vault creds + policies → ArgoCD `Healthy`/`Synced` → VSO injects → pod `Running`. One typo anywhere in the chain fails this test.
- **Idempotence gate (changed=0)** — `ansible-playbook --check --diff` must report `changed=0` on the converged sandbox **before** the same change is promoted to prod. A non-zero diff means the play is not idempotent and is not ready.
- **`tofu plan` diff gate** — `tofu plan` runs against **sandbox** state; for DNS it must assert the plan touches **only** the throwaway zone. A plan that proposes to touch anything else fails the gate.

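
The idempotence gate and the `tofu plan` diff gate above both reduce to parsing tool output and asserting on it. The following is a minimal sketch of how such assertions could look, assuming the Ansible output ends with the usual `PLAY RECAP` host lines and that the plan has been exported as JSON (a top-level `resource_changes` list whose entries carry `change.actions`). Reading the zone from a `zone_id` attribute is an assumption about the DNS resources involved, and both function names are hypothetical.

```python
import re

# Matches one host line of an Ansible PLAY RECAP, for example:
#   vm1 : ok=12 changed=0 unreachable=0 failed=0
RECAP_LINE = re.compile(
    r"^(?P<host>\S+)\s*:\s*ok=\d+\s+changed=(?P<changed>\d+)", re.MULTILINE
)


def idempotence_gate(ansible_output: str) -> bool:
    """Pass only if a recap was found and every host reports changed=0."""
    counts = [int(m["changed"]) for m in RECAP_LINE.finditer(ansible_output)]
    return bool(counts) and all(count == 0 for count in counts)


def plan_touches_only(plan: dict, allowed_zone_id: str) -> bool:
    """Pass only if every real change in the plan targets the allowed zone.

    A changed resource without a matching `zone_id` (assumed attribute name)
    fails the gate, so non-DNS changes are rejected as well.
    """
    for resource in plan.get("resource_changes", []):
        change = resource.get("change", {})
        if set(change.get("actions", [])) <= {"no-op", "read"}:
            continue  # nothing would be modified by this entry
        values = change.get("after") or change.get("before") or {}
        if values.get("zone_id") != allowed_zone_id:
            return False
    return True


if __name__ == "__main__":
    # Hypothetical recap text and plan, for illustration only.
    recap = (
        "PLAY RECAP *********\n"
        "vm1 : ok=12 changed=0 unreachable=0 failed=0\n"
        "vm2 : ok=12 changed=1 unreachable=0 failed=0\n"
    )
    print("idempotence gate passed:", idempotence_gate(recap))
    plan = {
        "resource_changes": [
            {"change": {"actions": ["create"], "after": {"zone_id": "throwaway"}}}
        ]
    }
    print("plan gate passed:", plan_touches_only(plan, "throwaway"))
```

An empty recap fails the idempotence gate on purpose: the absence of evidence should not count as a green run.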
## Chaos drills

Each drill maps to a section of `CLUSTER_RECOVERY.md`. The drill is "passed" only when the acceptance condition is met by following the runbook — any improvised step becomes a runbook PR.

| Drill | Action | Acceptance | Recovery section |
| --- | --- | --- | --- |
| Node-kill | Stop one sandbox VM (agent, then server). | Workloads reschedule; cluster returns to `Ready`/`Healthy`. | Node-loss / reschedule section. |
| Vault-seal | Seal the **sandbox** Vault. | Unseal via the **sandbox** key (`~/.arcodange/sandbox/cluster-keys.json`); VSO re-authenticates and resumes secret injection. | Vault unseal + VSO re-auth section. |
| Longhorn volume corruption | Corrupt/recreate a sandbox volume. | Run `recover/longhorn*.yml`; validate engine-ID re-association per the [Longhorn PVC recovery ADR](../../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md). | Longhorn restore section. |
| DB drop/restore | Drop a sandbox DB. | Restore from the sandbox backup bucket; app reconnects, data intact. | Postgres restore section. |
| Full power-cut simulation | Cold-stop all three sandbox VMs. | Execute `CLUSTER_RECOVERY.md` top-to-bottom against the sandbox to green (Longhorn restore → Vault unseal → VSO re-auth → ERP scaled up last). | Whole runbook. |
| ArgoCD bad-sync | Push a deliberately broken sandbox ref. | Observe `prune`/`selfHeal`; author/validate a rollback runbook section. | ArgoCD rollback (to be authored). |
| Cert/PKI re-issue | Re-init the sandbox Step-CA. | Internal `*.arcodange.lab` (sandbox) certs re-issue and chains validate. | PKI re-issue section. |

1. **Node-kill** stops a sandbox VM and confirms workloads reschedule and the cluster returns to a healthy state.
2. **Vault-seal** seals the sandbox Vault, then unseals it with the **sandbox** key and confirms VSO re-authenticates and resumes injecting secrets.
3. **Longhorn corruption** corrupts a sandbox volume, runs the `recover/longhorn*.yml` playbooks, and validates engine-ID re-association against the Longhorn PVC recovery ADR.
4. **DB drop/restore** drops a sandbox database and restores it from the sandbox backup bucket, confirming the app reconnects with data intact.
5. **Full power-cut** cold-stops all three sandbox VMs and runs `CLUSTER_RECOVERY.md` end to end to green, with the ERP scaled up last.
6. **ArgoCD bad-sync** pushes a broken sandbox ref, observes `prune`/`selfHeal` behaviour, and produces a rollback runbook section.
7. **Cert/PKI re-issue** re-initialises the sandbox Step-CA and confirms internal certificates re-issue and validate.

## Recurring game-day & promotion gate

The operator runs a **recurring monthly game-day**, following **only** the runbook. Any improvised step becomes a runbook PR — this is how `CLUSTER_RECOVERY.md` gets validated against the sandbox instead of waiting for the next real incident.

**Promotion gate:** no infra / Vault / storage / DNS change reaches prod until it has been applied to the sandbox **and** survived the matching drill. This gate belongs in the PR checklist and cross-links to [STATUS.md](STATUS.md).

## Evidence trail

Each drill records **time-to-recover** and **which step failed or was improvised**. The evidence trail accumulates over at least one game-day cycle and is the proof that a change-class is rehearsed and that the recovery runbook is current. Failed/improvised steps feed directly back into runbook PRs.
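
One way to keep the evidence trail machine-readable is a small record per drill run. The sketch below is an illustration only: the `DrillRecord` shape, the field names, and the two helper functions are hypothetical. It treats a drill as passed only when the acceptance condition was met with no failed or improvised step, which is one reading of the rule stated for chaos drills.

```python
from dataclasses import dataclass, field
from datetime import timedelta
from typing import Optional


@dataclass
class DrillRecord:
    drill: str                      # e.g. "Vault-seal"
    time_to_recover: timedelta
    acceptance_met: bool
    failed_steps: list[str] = field(default_factory=list)
    improvised_steps: list[str] = field(default_factory=list)

    @property
    def passed(self) -> bool:
        """Passed means: acceptance met by following the runbook as written."""
        return (
            self.acceptance_met
            and not self.failed_steps
            and not self.improvised_steps
        )


def runbook_followups(records):
    """List (drill, step) pairs that should each become a runbook PR."""
    return [
        (record.drill, step)
        for record in records
        for step in record.failed_steps + record.improvised_steps
    ]


def may_promote(applied_to_sandbox: bool, matching_drill: Optional[DrillRecord]) -> bool:
    """Promotion gate: applied to the sandbox AND survived the matching drill."""
    return applied_to_sandbox and matching_drill is not None and matching_drill.passed


if __name__ == "__main__":
    # Made-up example data.
    record = DrillRecord(
        drill="Vault-seal",
        time_to_recover=timedelta(minutes=7),
        acceptance_met=True,
        improvised_steps=["restarted VSO manually"],
    )
    print(record.passed, runbook_followups([record]), may_promote(True, record))
```

Keeping improvised steps as data, not prose, makes it straightforward to check that each one was turned into a runbook PR before the next game-day.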