Add a root AGENTS.md (ecosystem map of factory/tools/cms + agent operating rules + the persona cohort & workflow) and a new vibe/ knowledge base for LLM agents, modeled on tree-docs conventions and the factory house style.

vibe/ folders (each with a README hub + contribution rules):

- ADR/ optimized MADR-lite; canonical home going forward (doc/adr stays historical)
- PRD/ one subfolder per PRD, mandatory STATUS.md, QA strategy for big ones
- investigations/ single INV-NNN-slug.md, or stub + folder w/ notebooks
- guidebooks/ tree-docs maps; lab-ecosystem guidebook of factory+tools+cms
- runbooks/ [AGENT]/[HUMAN] step procedures (EN; doc/runbooks stays FR)
- shareouts/ dated FR handouts (decks/mp4)

Seed content (first ADR + PRD): a safe, production-like environment to rehearse risky changes and recovery without touching real prod: a local-only sandbox (k3d + arm64 VMs) with a hard prod/sandbox isolation boundary. Includes INV-001 (prod blast-radius couplings), the ecosystem guidebook, and a FR shareout.

Conventions enforced: no-tombstone rule, breadcrumb spine, bidirectional cross-links, theme:base mermaid (MCP-validated) + ordered-list-after-diagram.

Built with a Workflow + persona cohort; 24 files, zero dead links.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

[vibe](../../README.md) > [PRD](../README.md) > **Safe, production-like environment**

# Safe, production-like environment

> **Status:** In design
> **Last Updated:** 2026-06-23
> **Design record:** [ADR 0001: Safe, production-like environment](../../ADR/0001-safe-prod-like-environment.md)
> **Adjacent:** [INV-001: prod blast-radius couplings](../../investigations/INV-001-prod-blast-radius-couplings.md)
> **Map:** [Lab ecosystem guidebook](../../guidebooks/lab-ecosystem/README.md)

## Problem

The lab doubles as company production. The same three Raspberry Pis that host experiments also run the public CMS ([arcodange.fr](https://gitea.arcodange.lab/arcodange-org/cms)), Zoho-backed email, the Dolibarr ERP (accounting and business records), the url-shortener, the telegram-gateway, and the dance-lessons-coach. A single operator administers all of it from one MacBook that holds the kubeconfig, the Vault root, and the cloud admin tokens.

There is no separation between "where I experiment" and "where the business runs": every risky change is currently tested directly in prod. The 2026-04-13 power-cut proved that recovery is manual, non-trivial, and validated only by real incidents. A wrong Ansible play, Vault policy, Postgres migration, ArgoCD sync, or DNS record can silently break the business for days.

## Users & personas

A **single operator wearing two hats**:

- **The inner-loop developer**: iterating on an app, a Helm chart, a Vault policy, or an ArgoCD app. Wants a fast feedback cycle (seconds, not a fleet round-trip) and zero fear of touching the wrong thing.
- **The game-day / recovery operator**: rehearsing dangerous change-classes and disaster recovery (node-kill, Vault-seal, Longhorn corruption, DB drop, full power-cut). Wants high fidelity to the real topology so a drill actually predicts prod behaviour, and a runbook that gets validated against the sandbox instead of waiting for the next outage.

Both hats belong to the same person on the same laptop. The environment must serve the fast loop and the faithful loop without ever letting either reach into prod.

## Goals & non-goals

**Goals**

- Rehearse **every dangerous change-class** (Ansible, Vault, Postgres, ArgoCD, Cloudflare/DNS, Longhorn, recovery drills, PKI) with **zero prod blast radius**.
- Validate `CLUSTER_RECOVERY.md` against the **sandbox**, not against prod, turning the recovery runbook from incident-validated into drill-validated.
- A **fast inner loop** for app / Vault / ArgoCD iteration, seeded from the same GitOps repos as prod.
- Run on the **existing control node** (the MacBook) at **$0** marginal cost.

**Non-goals**

- No real `arcodange.fr` DNS or email tests. DNS/email modules run **plan-only against a throwaway zone**; real public DNS/ACME end-to-end is out of scope.
- No **physical-node tier** (4th Pi / mini-PC): explicitly rejected, not deferred.
- No **cloud tier** for disposable real-DNS clusters: explicitly rejected, not deferred.
- Not a **performance-benchmark** environment: fidelity is about topology and the convention chain, not throughput numbers on laptop-class hardware.

## Requirements

Functional:

- **One-command bring-up**, seeded from the same GitOps repos as prod.
- **Sandbox inventory + guards**: a separate `inventory/sandbox/hosts.yml` plus a pre-task guard that aborts on any prod IP (`192.168.1.201-203`) unless `i_mean_prod=true`. See [Isolation boundary](isolation-boundary.md).
- **Parity manifest**: same k3s `v1.34.3+k3s1`, same Longhorn/Vault/VSO, same app-of-apps list, Postgres + Gitea outside k3s, three nodes. Any drift from the manifest is a failure.
- **Two modes** seeded from the same repos via the sandbox inventory:
  - **(a) k3d single-node fast inner loop** (~60s bring-up) for app / Vault / ArgoCD iteration.
  - **(b) 3 arm64 VMs** (multipass or Vagrant on the M4) reproducing the 3-node topology (Postgres + Gitea as docker-compose outside k3s on the pi2-equivalent VM, Longhorn across the three VM disks) for Ansible / Longhorn / recovery work.

The **isolation boundary** (the table mapping each prod coupling to its sandbox control) and the **`<app>` naming note** are detailed in [isolation-boundary.md](isolation-boundary.md).
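
The decision logic of the prod-IP guard listed in the requirements can be sketched as follows. In the design the guard is an Ansible pre-task; this Python version is only an illustration of the rule, not the actual implementation. The names `is_prod_ip`, `assert_not_prod`, and `ProdTargetError` are hypothetical; only the prod range `192.168.1.201-203` and the `i_mean_prod` override come from this page.

```python
import ipaddress

# Prod node range stated in this PRD: 192.168.1.201 through 192.168.1.203.
PROD_FIRST = ipaddress.ip_address("192.168.1.201")
PROD_LAST = ipaddress.ip_address("192.168.1.203")


class ProdTargetError(RuntimeError):
    """Raised when a run would touch a prod node without explicit consent."""


def is_prod_ip(host: str) -> bool:
    """Return True if host is an IPv4 literal inside the prod range."""
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal (for example a hostname). This sketch does not
        # resolve names, which a real guard would probably need to do.
        return False
    if not isinstance(addr, ipaddress.IPv4Address):
        return False
    return PROD_FIRST <= addr <= PROD_LAST


def assert_not_prod(hosts, i_mean_prod: bool = False):
    """Abort if any target is a prod IP, unless i_mean_prod is set.

    Returns the list of prod targets found, so a caller that passed the
    override can still log what it is about to touch.
    """
    offenders = [h for h in hosts if is_prod_ip(h)]
    if offenders and not i_mean_prod:
        raise ProdTargetError(
            f"refusing to run against prod IPs {offenders}; "
            "pass i_mean_prod=true to override"
        )
    return offenders
```

Making the override an explicit, awkwardly named flag keeps the safe path as the default: a run against the sandbox inventory needs no extra argument, while a run against prod has to state its intent.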
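
The "drift is a failure" rule of the parity manifest could be checked along these lines. This is a minimal sketch under assumptions: the key names (`k3s_version`, `node_count`), the function names, and the shape of the observed data are hypothetical, and only the k3s version `v1.34.3+k3s1` and the three-node count are taken from the requirements above.

```python
# Expected values taken from the PRD; the key names are illustrative only.
EXPECTED = {
    "k3s_version": "v1.34.3+k3s1",
    "node_count": 3,
}


def parity_drift(expected: dict, observed: dict) -> list:
    """Return (key, expected, observed) for every manifest entry that differs.

    A key missing from the observed data counts as drift, because an
    unverified component cannot be assumed to match prod.
    """
    drift = []
    for key in sorted(expected):
        want = expected[key]
        got = observed.get(key)
        if got != want:
            drift.append((key, want, got))
    return drift


def parity_ok(expected: dict, observed: dict) -> bool:
    """The gate passes only when there is no drift at all."""
    return not parity_drift(expected, observed)
```

For example, `parity_drift(EXPECTED, {"k3s_version": "v1.34.3+k3s1", "node_count": 1})` reports the single-node k3d mode as drifting on `node_count`, which suggests the manifest may need a per-mode variant; the source does not say how that case is handled.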
## QA strategy

The strategy rests on four fidelity gates: the parity manifest, a `canary` provisioning-parity test that drives the new-web-app runbook end to end, an `ansible-playbook --check --diff` idempotence gate that requires `changed=0`, and a `tofu plan` diff gate. On top of these sit chaos drills mapped section by section to `CLUSTER_RECOVERY.md`, a recurring monthly game-day, and a promotion gate: no infra / Vault / storage / DNS change reaches prod until it has been applied to the sandbox **and** survived the matching drill. Full detail in [qa-strategy.md](qa-strategy.md).
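
One way to evaluate the `changed=0` idempotence gate is to read the `PLAY RECAP` section that `ansible-playbook` prints at the end of a run. The sketch below assumes the default recap layout (`host : ok=N changed=N ...`); the function names and the host names in the usage note are hypothetical.

```python
import re

# Matches one recap line such as:
#   sandbox-vm-1 : ok=12 changed=0 unreachable=0 failed=0 ...
RECAP_LINE = re.compile(
    r"^(?P<host>\S+)\s*:\s+.*?\bchanged=(?P<changed>\d+)",
    re.MULTILINE,
)


def changed_by_host(output: str) -> dict:
    """Map each host in the PLAY RECAP section to its changed count."""
    _, marker, recap = output.partition("PLAY RECAP")
    if not marker:
        raise ValueError("no PLAY RECAP section found in the output")
    return {m["host"]: int(m["changed"]) for m in RECAP_LINE.finditer(recap)}


def is_idempotent(output: str) -> bool:
    """Pass only if at least one host is reported and none shows a change."""
    counts = changed_by_host(output)
    return bool(counts) and all(n == 0 for n in counts.values())
```

Feeding it the captured output of a second `--check --diff` run against the sandbox inventory, a recap with `sandbox-vm-1 : ok=5 changed=0 ...` on every host passes, while any host with `changed=2` fails the gate. Requiring at least one host avoids a false pass when the play matched nothing.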
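
The promotion gate described in the QA strategy is a conjunction of two conditions, which the following sketch makes explicit. The `Change` record and its field names are hypothetical; how the two facts are actually recorded (the evidence trail) is covered in qa-strategy.md, not here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    """A candidate infra / Vault / storage / DNS change (illustrative shape)."""

    name: str
    applied_to_sandbox: bool
    survived_matching_drill: bool


def may_promote(change: Change) -> bool:
    """A change may reach prod only if both conditions hold."""
    return change.applied_to_sandbox and change.survived_matching_drill
```

A change that was applied to the sandbox but never drilled, such as `Change("vault-policy-update", True, False)`, is rejected: applying cleanly is not treated as evidence that recovery still works.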
## Implementation status

Rollout is phased (Phase 0 guardrails → Phase 1 k3d → Phase 2 3-VM → Phase 3 game-day; Phase 4 out of scope). Live tracker: [STATUS.md](STATUS.md).

## Leaves

| Page | Summary | Status |
| --- | --- | --- |
| [Isolation boundary](isolation-boundary.md) | Prod-coupling → sandbox-control table; the `<app>` naming note; the token caution. | 🟡 In design |
| [QA strategy](qa-strategy.md) | Fidelity gates, chaos-drill table, game-day cadence, promotion gate, evidence trail. | 🟡 In design |
| [STATUS](STATUS.md) | Phase tracker (all not-started) and PR log. | ⬜ Not started |