docs(vibe): bootstrap vibe/ knowledge tree + ecosystem AGENTS.md

Add a root AGENTS.md (ecosystem map of factory/tools/cms + agent operating rules + the persona cohort & workflow) and a new vibe/ knowledge base for LLM agents, modeled on tree-docs conventions and the factory house style.

vibe/ folders (each with a README hub + contribution rules):
- ADR/ optimized MADR-lite; canonical home going forward (doc/adr stays historical)
- PRD/ one subfolder per PRD, mandatory STATUS.md, QA strategy for big ones
- investigations/ single INV-NNN-slug.md, or stub + folder w/ notebooks
- guidebooks/ tree-docs maps; lab-ecosystem guidebook of factory+tools+cms
- runbooks/ [AGENT]/[HUMAN] step procedures (EN; doc/runbooks stays FR)
- shareouts/ dated FR handouts (decks/mp4)

Seed content (first ADR + PRD): a safe, production-like environment to rehearse risky changes and recovery without touching real prod — local-only sandbox (k3d + arm64 VMs) with a hard prod/sandbox isolation boundary. Includes INV-001 (prod blast-radius couplings), the ecosystem guidebook, and a FR shareout.

Conventions enforced: no-tombstone rule, breadcrumb spine, bidirectional cross-links, theme:base mermaid (MCP-validated) + ordered-list-after-diagram.

Built with a Workflow + persona cohort; 24 files, zero dead links.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

vibe/ADR/0001-safe-prod-like-environment.md
@@ -0,0 +1,84 @@
[vibe](../README.md) > [ADR](README.md) > **0001 · Safe, production-like environment**

# ADR-0001: Safe, production-like environment for the lab

> **Status**: Accepted
> **Date**: 2026-06-23
> **Deciders**: @arcodange

## Context

The Arcodange lab doubles as company production. The same three Raspberry Pis and one MacBook control node run the public CMS (`arcodange.fr`), Zoho-backed business email, the Dolibarr ERP holding accounting and business records, the url-shortener, the telegram-gateway, and the dance-lessons-coach. A single operator administers all of it from one laptop holding the kubeconfig, the Vault root token, and every cloud admin token.

There is no separation between *where I experiment* and *where the business runs*. Every risky change is tested directly in production. The 2026-04-13 power-cut proved recovery is manual, multi-step, and only ever validated by a real incident — never rehearsed.

The danger is concentrated in a handful of change-classes. Each one can cause silent, fleet-wide, or data-losing damage when applied to the live environment:

| Change-class | Blast radius if wrong |
| --- | --- |
| Ansible playbook edits | Can wipe disks, reset k3s, or corrupt Longhorn across the fleet. |
| Vault policy / auth / mount changes | Lock out the Vault Secrets Operator → fleet-wide secret outage; a botched init could overwrite the single unseal key. |
| Postgres migrations / role changes | The superuser provider on `192.168.1.202` can drop or alter live databases → ERP data loss. |
| ArgoCD sync / app-of-apps changes | `prune` + `selfHeal` auto-prunes live resources fleet-wide. |
| Cloudflare / DNS / email changes | A wrong MX/SPF/DKIM silently breaks `arcodange.fr` mail for days. |
| Longhorn / storage ops | Volume recreation orphans replicas via new engine IDs. |
| Recovery drills | The runbook is only validated by real incidents, never rehearsed. |
| Cert / PKI re-init | Rotates the internal CA, invalidating every issued `*.arcodange.lab` cert. |

A change-management process is not enough: the operator needs a place to *make the mistake first*, where the mistake cannot reach production.

## Decision

We will build a **local-only safe environment on the MacBook control node**, seeded from the *same* GitOps repos via a dedicated sandbox inventory, with two modes:

- **(a) k3d single-node "fast inner loop"** (~60s bring-up) for app, Vault, and ArgoCD iteration.
- **(b) Three arm64 VMs** (multipass or Vagrant on the M4) reproducing the three-node topology — Postgres + Gitea as docker-compose *outside* k3s on the "pi2-equivalent" VM, Longhorn across the three VM disks — for Ansible, Longhorn, and recovery work.

The load-bearing requirement is the **isolation boundary**: the sandbox must be *unable* to mutate real production even on a wrong command. Each production coupling maps to a concrete guardrail — a separate sandbox inventory with a prod-IP abort guard, sandbox GCS state prefixes, a separate sandbox Vault with its own unseal-key path, a sandbox Postgres host check, and plan-only DNS against a throwaway zone. The real `arcodange.fr` Cloudflare/Zoho tokens are never exported into a sandbox shell. Because the `<app>` convention keys everything *within* a cluster/Vault/DB, the sandbox reuses identical `<app>` names with no collision — the boundary is the cluster + Vault + state + DNS zone, not the names, so runbooks read identically in both environments.

See the [isolation-boundary leaf](../PRD/safe-prod-like-environment/isolation-boundary.md) for the full coupling→control mapping, and the [PRD](../PRD/safe-prod-like-environment/README.md) for the complete product view.

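As an illustration of the prod-IP abort guard named above, here is a minimal Python sketch. It is not the project's actual guard: the function and exception names are hypothetical, the sandbox host names are made up, and `192.168.1.202` is used only because it is the one production address this ADR mentions. A real denylist would have to cover every production address.

```python
import ipaddress

# Assumption: only 192.168.1.202 (the prod Postgres host) is named in this ADR.
# A real guard would list every production address.
PROD_ADDRESSES = frozenset({ipaddress.ip_address("192.168.1.202")})


class ProdTargetError(RuntimeError):
    """Raised when a sandbox inventory resolves to a production address."""


def assert_sandbox_only(inventory_hosts, prod_addresses=PROD_ADDRESSES):
    """Abort if any inventory host points at a known production IP.

    `inventory_hosts` maps an inventory host name to the IP it connects to.
    Returns None when the inventory is clean, raises ProdTargetError otherwise.
    """
    offenders = sorted(
        name
        for name, addr in inventory_hosts.items()
        if ipaddress.ip_address(addr) in prod_addresses
    )
    if offenders:
        raise ProdTargetError(
            "refusing to run: production address in sandbox inventory: "
            + ", ".join(offenders)
        )
```

Calling such a check before any playbook runs means a copy-pasted production address stops the run instead of reaching the fleet.
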
## Consequences

- **+** $0 inner loop that runs on the existing control node — no new hardware.
- **+** Rehearses every dangerous change-class except real public DNS/email.
- **+** The 2026-04-13 recovery sequence becomes a repeatable drill instead of a once-per-incident gamble.
- **+** Identical `<app>` names mean runbooks are environment-agnostic.
- **−** x86/ARM nuance must be handled (use arm64 VMs/images on the M4).
- **−** New guardrail and parity-manifest maintenance burden.
- **−** Single-laptop resource limits — k3d for speed, VMs only when multi-node fidelity is actually needed.
- **→** Real public DNS/ACME and physical-ARM always-on testing remain unsolved by design; revisit only if recurring game-days demand them.

## Alternatives considered

| Option | Fidelity | Isolation | $ / effort | Verdict |
| --- | --- | --- | --- | --- |
| 1 · Ephemeral local cluster (k3d/kind) | Medium (single-node) | Full (separate cluster) | $0 / low | ✅ **Chosen** as the fast mode. |
| 2 · Three arm64 VMs reproducing the topology | High (3-node, PG+Gitea outside k3s, Longhorn) | Full (separate VMs) | $0 / medium | ✅ **Chosen** for fidelity. |
| 3 · Sandbox namespace on the real cluster | High | None — shared Vault/PG/Longhorn/ArgoCD | $0 / low | ❌ **Rejected**: shared blast radius fails the core isolation requirement. |
| 4 · Dedicated physical node (4th Pi / mini-PC) | High (real ARM, always-on) | Full | $$ hardware / medium | ⛔ **Out of scope**: hardware cost; revisit only for recurring always-on ARM game-days. |
| 5 · Disposable cloud k3s for real public DNS/ACME | High infra, but arch drift | Full | $ recurring / medium | ⛔ **Out of scope**: cost + ARM drift; its only unique value is real DNS/email, which we explicitly do not test. |

## QA & validation

- **Parity manifest** — k3s `v1.34.3+k3s1`, same Longhorn/Vault/VSO, same app-of-apps list, PG + Gitea outside k3s, three nodes. Any drift from this manifest is a failure.
- **Provisioning-parity test** — run the new-web-app runbook for a throwaway `<app>` "canary" and assert the convention chain resolves end to end: Gitea repo → PG db + role → Vault creds + policies → ArgoCD Healthy/Synced → VSO injects → pod Running.
- **Idempotence gate** — `ansible-playbook --check --diff` reports `changed=0` on the converged sandbox *before* any change is promoted to prod.
- **`tofu plan` diff gate** — plan against sandbox state; for DNS, assert it touches *only* the throwaway zone.
- **Chaos drills** (mapped to CLUSTER_RECOVERY.md sections): node-kill; Vault-seal (unseal via the *sandbox* key; VSO re-auths); Longhorn volume corruption (run `recover/longhorn*.yml`, validate engine-ID re-association per the [Longhorn PVC recovery ADR](../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md)); DB drop/restore; full power-cut simulation (execute CLUSTER_RECOVERY.md top-to-bottom against the sandbox to green); ArgoCD bad-sync (observe prune/selfHeal, author a rollback runbook section); cert/PKI re-issue.
- **Monthly game-day** — the operator follows *only* the runbook; any improvised step becomes a runbook PR. This is how CLUSTER_RECOVERY.md gets validated against the sandbox instead of waiting for the next real incident.
- **Promotion gate** — no infra/Vault/storage/DNS change reaches prod until it has been applied to the sandbox *and* survived the matching drill. Each drill records time-to-recover and which step failed or was improvised.

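The idempotence gate in the list above could be automated roughly as follows. This is a sketch under assumptions, not the project's tooling: it assumes the captured output of `ansible-playbook --check --diff` ends with the usual `PLAY RECAP` block of `host : ok=N changed=N ...` lines, and the function names are hypothetical.

```python
import re

# One recap line looks like: "host : ok=12 changed=0 unreachable=0 failed=0 ..."
RECAP_LINE = re.compile(r"^(?P<host>\S+)\s*:\s*(?P<counters>(?:\w+=\d+\s*)+)$")


def parse_recap(output):
    """Return {host: {counter: int}} from the PLAY RECAP block of a run."""
    recap = {}
    in_recap = False
    for line in output.splitlines():
        if line.startswith("PLAY RECAP"):
            in_recap = True
            continue
        if not in_recap:
            continue
        match = RECAP_LINE.match(line.strip())
        if match:
            pairs = (pair.split("=") for pair in match["counters"].split())
            recap[match["host"]] = {key: int(value) for key, value in pairs}
    return recap


def is_idempotent(output):
    """True only if a recap exists and every host reports no change and no error."""
    recap = parse_recap(output)
    return bool(recap) and all(
        counters.get("changed") == 0
        and counters.get("failed") == 0
        and counters.get("unreachable") == 0
        for counters in recap.values()
    )
```

Treating an empty recap as a failure is a deliberate choice in this sketch: a run that produced no recap proves nothing about convergence.
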
See the [QA strategy leaf](../PRD/safe-prod-like-environment/qa-strategy.md) for the detailed drill table and evidence trail.

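The DNS half of the `tofu plan` diff gate can be pictured as a check over the machine-readable plan. The sketch below assumes the plan has been exported with `tofu show -json` and parsed into a dict with a `resource_changes` list. It further assumes DNS records are managed as `cloudflare_record` resources carrying a `zone_id` attribute. Neither the resource type nor the zone identifiers are confirmed by this ADR, and the function name is hypothetical.

```python
# Assumption: DNS records are managed through this resource type.
DNS_RESOURCE_TYPES = {"cloudflare_record"}


def dns_zone_violations(plan, allowed_zone_id):
    """Return addresses of DNS changes that touch any zone other than the allowed one.

    `plan` is the parsed JSON plan. A record whose zone is missing or unknown
    is reported as a violation, so the gate fails closed.
    """
    violations = []
    for resource in plan.get("resource_changes", []):
        if resource.get("type") not in DNS_RESOURCE_TYPES:
            continue
        change = resource.get("change", {})
        if change.get("actions") == ["no-op"]:
            continue
        # "before" is null on create and "after" is null on delete.
        sides = [
            values
            for values in (change.get("before"), change.get("after"))
            if values is not None
        ]
        if any(values.get("zone_id") != allowed_zone_id for values in sides):
            violations.append(resource["address"])
    return violations
```

An empty result means every planned DNS change stays inside the throwaway zone; anything else should block the run.
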
## References

- [PRD · Safe, production-like environment](../PRD/safe-prod-like-environment/README.md) — full product view, requirements, and phased rollout.
- [INV-001 · Prod blast-radius couplings](../investigations/INV-001-prod-blast-radius-couplings.md) — the investigation that mapped every prod coupling.
- [Guidebook · Lab ecosystem](../guidebooks/lab-ecosystem/README.md) — the end-to-end map of the prod topology this sandbox mirrors.
- [Guidebook · Storage and recovery](../guidebooks/lab-ecosystem/storage-and-recovery.md) — how Longhorn and the recovery sequence work today.
- [doc/adr index](../../doc/adr/README.md) — foundational infrastructure ADRs (read-only history).
- [Longhorn PVC recovery ADR](../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md) — engine-ID re-association, exercised by the Longhorn chaos drill.
- [new-web-app conventions](../../doc/runbooks/new-web-app/conventions.md) — the `<app>` convention reused identically across sandbox and prod.
- CLUSTER_RECOVERY.md — the tested power-cut recovery sequence, lives at the lab root (outside this repo); the chaos drills rehearse it section by section.
- PRs: to be backfilled on PR open.

vibe/ADR/README.md
@@ -0,0 +1,45 @@
[vibe](../README.md) > **ADR**

# Architecture Decision Records

> **Status**: 🟢 Active
> **Last Updated**: 2026-06-23
> **Related**: [vibe/PRD](../PRD/README.md) · [vibe/Investigations](../investigations/README.md)
> **Historical**: [doc/adr](../../doc/adr/README.md) (foundational infra) · [ansible/.../docs/adr](../../ansible/arcodange/factory/docs/adr/) (dated infra ADRs)

`vibe/ADR/` is the **canonical home for Architecture Decision Records going forward**. The format is MADR-lite: one short, self-contained Markdown file per decision, focused on the *why* rather than the *how*. Use the [`_template.md`](_template.md) skeleton to start a new one.

## Where ADRs live

There are three ADR locations in this repo. Only the first accepts new records; the other two are read-only history kept for context.

| Location | Role | Accepts new ADRs? |
| --- | --- | --- |
| `vibe/ADR/` (this folder) | Canonical, MADR-lite, going forward | ✅ Yes |
| [`doc/adr/`](../../doc/adr/README.md) | Foundational infrastructure ADRs (DNS, k3s, CI/CD, Vault, telegram-gateway auth) | ❌ Historical |
| [`ansible/arcodange/factory/docs/adr/`](../../ansible/arcodange/factory/docs/adr/) | Dated infra ADRs (network, CI/CD, Longhorn PVC recovery, internal DNS) | ❌ Historical |

When a new decision *supersedes* one of the historical records, write the new ADR here, set the old one's status note to `Superseded by ADR-NNNN`, and cross-link both ways.

## Rules

- **One file per decision**, named `NNNN-kebab-title.md` (zero-padded sequence, e.g. `0001-safe-prod-like-environment.md`).
- **The body is immutable once `Accepted`.** A decision is a historical fact: do not rewrite the Context/Decision/Consequences after acceptance. The *only* mutation allowed on an accepted ADR is its **status** (e.g. flipping `Accepted` → `Superseded`).
- **Statuses**: `Proposed` (under discussion) → `Accepted` (decided, body frozen) → `Superseded` (replaced; points to the successor ADR). A `Proposed` ADR may still be edited freely.
- **No-tombstone rule.** Each file reads as currently true. Never leave "previously X, now Y", changelog lines, or "updated to ..." notes inside an ADR — git history is the audit trail. A superseded ADR keeps its original frozen body; the supersession is recorded only in its status line and the successor's References.
- **PR cross-link both ways.** The ADR References section links the PR that introduced it; the PR description links back to the ADR. Keep links bidirectional.

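The status lifecycle and the frozen-body rule above can be read as a small state machine. The Python sketch below is only an illustration with hypothetical names. It assumes the arrows listed in the rules are the only legal transitions, so it rejects a direct `Proposed` → `Superseded` jump, which the rules do not explicitly address.

```python
# Assumption: only the transitions drawn in the rules are legal.
ALLOWED_TRANSITIONS = {
    "Proposed": {"Accepted"},
    "Accepted": {"Superseded"},
    "Superseded": set(),
}


def can_transition(current, new):
    """True if an ADR may move from status `current` to status `new`."""
    return new in ALLOWED_TRANSITIONS.get(current, set())


def body_is_editable(status):
    """Only a Proposed ADR may have its body edited; later statuses are frozen."""
    return status == "Proposed"
```

A pre-merge check could call `body_is_editable` with the ADR's status on the target branch and reject body changes whenever it returns `False`.
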
## Index

| # | Title | Status | Date |
| --- | --- | --- | --- |
| [0001](0001-safe-prod-like-environment.md) | Safe, production-like environment | 🟢 Accepted | 2026-06-23 |

## Rules to contribute

1. Copy [`_template.md`](_template.md) to `NNNN-kebab-title.md` using the next free sequence number and delete the top HTML-comment note.
2. Fill in the blockquote (Status/Date/Deciders), then Context, Decision, Consequences, Alternatives considered, QA & validation, References.
3. Open the ADR with status `Proposed`. Flip it to `Accepted` once the decision is settled — and from that point treat the body as frozen.
4. Add a row to the Index table above (newest at the bottom to preserve chronological numbering).
5. In the PR that lands the ADR, link to the ADR file; in the ADR's References, link back to the PR. Bidirectional links are mandatory.
6. If this ADR supersedes a historical one in `doc/adr/` or the Ansible ADR folder, update the old record's status note and cross-reference both directions.

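Step 1 relies on the `NNNN-kebab-title.md` naming rule and on picking the next free sequence number. A minimal Python sketch of both follows. The regular expression is one reasonable reading of "kebab-title" (lowercase letters and digits separated by single hyphens), and "next free" is interpreted as highest existing number plus one. Both are assumptions, and the helper names are hypothetical.

```python
import re

# Four digits, then lowercase kebab-case words, then ".md".
ADR_NAME = re.compile(r"^(\d{4})-[a-z0-9]+(?:-[a-z0-9]+)*\.md$")


def is_valid_adr_name(filename):
    """True if `filename` follows the NNNN-kebab-title.md convention."""
    return ADR_NAME.match(filename) is not None


def next_adr_number(filenames):
    """Return the next zero-padded sequence number given the folder listing.

    Files that are not ADRs (README.md, _template.md) are ignored.
    """
    numbers = [int(m.group(1)) for m in map(ADR_NAME.match, filenames) if m]
    return f"{max(numbers, default=0) + 1:04d}"
```

With the current folder contents (`README.md`, `_template.md`, and `0001-safe-prod-like-environment.md`), `next_adr_number` returns `"0002"`.
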
vibe/ADR/_template.md
@@ -0,0 +1,41 @@
[vibe](../README.md) > [ADR](README.md) > **_template**

<!-- Copy this file to NNNN-kebab-title.md, fill in, delete this note. -->

# ADR-NNNN: Title

> **Status**: Proposed | Accepted | Superseded by ADR-NNNN
> **Date**: YYYY-MM-DD
> **Deciders**: name(s)

## Context

What forces are at play? Describe the problem, the constraints, and the situation that makes a decision necessary. State facts, not opinions. Keep it short enough that a future reader understands *why* a decision was needed without prior context.

## Decision

The decision, stated in the active voice: "We will ...". One clear choice. If the decision has sub-parts, use a short bulleted list.

## Consequences

What becomes easier or harder as a result of this decision?

- **+** A positive outcome / something now enabled.
- **−** A trade-off / cost / new constraint accepted.
- **→** A future follow-up this implies (work deferred, a door left open, a re-evaluation trigger).

## Alternatives considered

| Option | Why not |
| --- | --- |
| Alternative A | Reason it was rejected. |
| Alternative B | Reason it was rejected. |

## QA & validation

How was (or will be) this decision validated? Tests, smoke checks, manual verification, rollback plan, or the criteria that would tell us the decision was wrong.

## References

- Link to the PR that introduces this ADR (and ensure the PR links back here).
- Related ADRs, PRDs, investigations, or external docs (descriptive link text, never "here"/"this").