vibe > PRD > Safe, production-like environment > QA strategy
QA strategy
Status: In design
Last Updated: 2026-06-23
Upstream: Safe, production-like environment
Related: ADR 0001 · Isolation boundary · INV-001
The sandbox is only useful if a green run there reliably predicts a green run in prod. That requires two things: the sandbox must stay faithful to prod (fidelity gates), and dangerous change-classes must be rehearsed before they ship (chaos drills + promotion gate).
Fidelity gates
- Parity manifest: the sandbox must match prod on k3s `v1.34.3+k3s1`, the same Longhorn / Vault / VSO versions, the same app-of-apps list, Postgres + Gitea running outside k3s, and three nodes. Any drift from this manifest is a failed gate.
- Provisioning-parity canary test: run the new-web-app runbook for a throwaway `<app>` named `canary` and assert the convention chain resolves end to end: Gitea repo → PG db + role → Vault creds + policies → ArgoCD `Healthy/Synced` → VSO injects → pod `Running`. One typo anywhere in the chain fails this test.
- Idempotence gate (changed=0): `ansible-playbook --check --diff` must report `changed=0` on the converged sandbox before the same change is promoted to prod. A non-zero diff means the play is not idempotent and is not ready.
- `tofu plan` diff gate: `tofu plan` runs against sandbox state; for DNS it must assert the plan touches only the throwaway zone. A plan that proposes to touch anything else fails the gate.
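The parity and idempotence gates are both mechanical comparisons, so they lend themselves to a small script. The following is a minimal sketch, not the project's tooling: the manifest keys (`k3s_version`, `node_count`) and both function names are hypothetical, and the recap parser assumes the usual `host : ok=N changed=N ...` layout of an Ansible `PLAY RECAP`.

```python
import re

# Hypothetical manifest keys; only the k3s version and the node count
# are values stated in this document.
PARITY_MANIFEST = {
    "k3s_version": "v1.34.3+k3s1",
    "node_count": 3,
}


def parity_drift(expected: dict, observed: dict) -> dict:
    """Return {key: (expected, observed)} for every key that differs."""
    return {
        key: (want, observed.get(key))
        for key, want in expected.items()
        if observed.get(key) != want
    }


# Matches the per-host counters of a PLAY RECAP line, e.g.
# "sandbox-1 : ok=12 changed=0 unreachable=0 failed=0"
_CHANGED = re.compile(r"^(\S+)\s*:.*?\bchanged=(\d+)", re.MULTILINE)


def hosts_with_changes(recap: str) -> dict:
    """Return {host: changed_count} for hosts reporting changed > 0."""
    return {
        host: int(count)
        for host, count in _CHANGED.findall(recap)
        if int(count) > 0
    }
```

Either gate passes when its function returns an empty dict; any entry names the drifting key or the host whose play is not yet idempotent.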
Chaos drills
Each drill maps to a section of CLUSTER_RECOVERY.md. The drill is "passed" only when the acceptance condition is met by following the runbook — any improvised step becomes a runbook PR.
| Drill | Action | Acceptance | Recovery section |
|---|---|---|---|
| Node-kill | Stop one sandbox VM (agent, then server). | Workloads reschedule; cluster returns to Ready/Healthy. | Node-loss / reschedule section. |
| Vault-seal | Seal the sandbox Vault. | Unseal via the sandbox key (`~/.arcodange/sandbox/cluster-keys.json`); VSO re-authenticates and resumes secret injection. | Vault unseal + VSO re-auth section. |
| Longhorn volume corruption | Corrupt/recreate a sandbox volume. | Run `recover/longhorn*.yml`; validate engine-ID re-association per the Longhorn PVC recovery ADR. | Longhorn restore section. |
| DB drop/restore | Drop a sandbox DB. | Restore from the sandbox backup bucket; app reconnects, data intact. | Postgres restore section. |
| Full power-cut simulation | Cold-stop all three sandbox VMs. | Execute `CLUSTER_RECOVERY.md` top-to-bottom against the sandbox to green (Longhorn restore → Vault unseal → VSO re-auth → ERP scaled up last). | Whole runbook. |
| ArgoCD bad-sync | Push a deliberately broken sandbox ref. | Observe prune/selfHeal; author/validate a rollback runbook section. | ArgoCD rollback (to be authored). |
| Cert/PKI re-issue | Re-init the sandbox Step-CA. | Internal `*.arcodange.lab` (sandbox) certs re-issue and chains validate. | PKI re-issue section. |
- Node-kill stops a sandbox VM and confirms workloads reschedule and the cluster returns to a healthy state.
- Vault-seal seals the sandbox Vault, then unseals it with the sandbox key and confirms VSO re-authenticates and resumes injecting secrets.
- Longhorn corruption corrupts a sandbox volume, runs the `recover/longhorn*.yml` playbooks, and validates engine-ID re-association against the Longhorn PVC recovery ADR.
- DB drop/restore drops a sandbox database and restores it from the sandbox backup bucket, confirming the app reconnects with data intact.
- Full power-cut cold-stops all three sandbox VMs and runs `CLUSTER_RECOVERY.md` end to end to green, with the ERP scaled up last.
- ArgoCD bad-sync pushes a broken sandbox ref, observes `prune/selfHeal` behaviour, and produces a rollback runbook section.
- Cert/PKI re-issue re-initialises the sandbox Step-CA and confirms internal certificates re-issue and validate.
Recurring game-day & promotion gate
A game-day runs monthly, and during it the operator follows only the runbook. Any improvised step becomes a runbook PR; this is how CLUSTER_RECOVERY.md gets validated against the sandbox instead of waiting for the next real incident.
Promotion gate: no infra / Vault / storage / DNS change reaches prod until it has been applied to the sandbox and survived the matching drill. This gate belongs in the PR checklist and crosslinks to STATUS.md.
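The promotion gate reduces to a predicate over two facts: the change was applied to the sandbox, and its matching drill passed. The sketch below is illustrative only. The document does not say which drill matches which change-class, so the `MATCHING_DRILL` pairings, the `Change` type, and `may_promote` are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical pairings; the document only says "the matching drill".
MATCHING_DRILL = {
    "vault": "Vault-seal",
    "storage": "Longhorn volume corruption",
}


@dataclass(frozen=True)
class Change:
    change_class: str          # e.g. "infra", "vault", "storage", "dns"
    applied_to_sandbox: bool   # has this change been applied to the sandbox?


def may_promote(change: Change, passed_drills: set) -> bool:
    """Allow promotion only if the change ran in the sandbox and its
    matching drill passed. Unmapped change-classes fail closed."""
    drill = MATCHING_DRILL.get(change.change_class)
    if drill is None:
        return False
    return change.applied_to_sandbox and drill in passed_drills
```

Failing closed on an unmapped change-class is a deliberate choice in this sketch: a change with no agreed drill is blocked until someone decides which drill covers it.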
Evidence trail
Each drill records time-to-recover and which step failed or was improvised. The evidence trail accumulates over at least one game-day cycle and is the proof that a change-class is rehearsed and that the recovery runbook is current. Failed/improvised steps feed directly back into runbook PRs.
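One possible shape for an evidence record is sketched below. The `DrillEvidence` class and its field names are assumptions, not an existing schema. The `passed` rule encodes one reading of the drill rule above: a drill counts as passed only when no step failed and none was improvised.

```python
from dataclasses import dataclass, field
from datetime import timedelta


@dataclass
class DrillEvidence:
    drill: str                   # drill name as listed in the drill table
    time_to_recover: timedelta   # measured from fault injection to green
    failed_steps: list = field(default_factory=list)
    improvised_steps: list = field(default_factory=list)

    @property
    def passed(self) -> bool:
        """True only if the runbook was followed with no failed or improvised step."""
        return not self.failed_steps and not self.improvised_steps

    def runbook_followups(self) -> list:
        """One entry per failed or improvised step, each a candidate runbook PR."""
        return [
            f"{self.drill}: {step}"
            for step in self.failed_steps + self.improvised_steps
        ]
```

A record with an empty `runbook_followups()` list is the evidence that the runbook section held up; a non-empty list is the backlog of runbook PRs for that game-day.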