factory/vibe/PRD/safe-prod-like-environment/qa-strategy.md
Gabriel Radureau 7647a68cdc docs(vibe): bootstrap vibe/ knowledge tree + ecosystem AGENTS.md
Add a root AGENTS.md (ecosystem map of factory/tools/cms + agent operating
rules + the persona cohort & workflow) and a new vibe/ knowledge base for LLM
agents, modeled on tree-docs conventions and the factory house style.

vibe/ folders (each with a README hub + contribution rules):
- ADR/      optimized MADR-lite; canonical home going forward (doc/adr stays historical)
- PRD/      one subfolder per PRD, mandatory STATUS.md, QA strategy for big ones
- investigations/  single INV-NNN-slug.md, or stub + folder w/ notebooks
- guidebooks/      tree-docs maps; lab-ecosystem guidebook of factory+tools+cms
- runbooks/        [AGENT]/[HUMAN] step procedures (EN; doc/runbooks stays FR)
- shareouts/       dated FR handouts (decks/mp4)

Seed content (first ADR + PRD): a safe, production-like environment to rehearse
risky changes and recovery without touching real prod — local-only sandbox
(k3d + arm64 VMs) with a hard prod/sandbox isolation boundary. Includes
INV-001 (prod blast-radius couplings), the ecosystem guidebook, and a FR shareout.

Conventions enforced: no-tombstone rule, breadcrumb spine, bidirectional
cross-links, theme:base mermaid (MCP-validated) + ordered-list-after-diagram.
Built with a Workflow + persona cohort; 24 files, zero dead links.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 11:52:37 +02:00


vibe > PRD > Safe, production-like environment > QA strategy

QA strategy

Status: In design
Last Updated: 2026-06-23
Upstream: Safe, production-like environment
Related: ADR 0001 · Isolation boundary · INV-001

The sandbox is only useful if a green run there reliably predicts a green run in prod. That requires two things: the sandbox must stay faithful to prod (fidelity gates), and dangerous change-classes must be rehearsed before they ship (chaos drills + promotion gate).

Fidelity gates

  • Parity manifest: the sandbox must match prod on k3s v1.34.3+k3s1, the same Longhorn / Vault / VSO versions, the same app-of-apps list, Postgres + Gitea running outside k3s, and three nodes. Any drift from this manifest is a failed gate.
  • Provisioning-parity canary test: run the new-web-app runbook for a throwaway <app> named canary and assert the convention chain resolves end to end: Gitea repo → PG db + role → Vault creds + policies → ArgoCD Healthy/Synced → VSO injects → pod Running. One typo anywhere in the chain fails this test.
  • Idempotence gate (changed=0): ansible-playbook --check --diff must report changed=0 on the converged sandbox before the same change is promoted to prod. A non-zero diff means the play is not idempotent and is not ready.
  • tofu plan diff gate: tofu plan runs against sandbox state; for DNS it must assert the plan touches only the throwaway zone. A plan that proposes to touch anything else fails the gate.
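
The three machine-checkable gates above could be sketched as follows. This is a minimal illustration, not the project's tooling: the function names, the manifest keys, and the `zone` attribute name are assumptions (the real attribute depends on the DNS provider in use). It assumes the Ansible output ends with a standard PLAY RECAP and that the plan was exported as JSON (for example with `tofu show -json` on a saved plan file).

```python
import json
import re


def parity_drift(expected: dict, observed: dict) -> dict:
    """Return {key: (expected, observed)} for every manifest key that differs."""
    return {
        key: (want, observed.get(key))
        for key, want in expected.items()
        if observed.get(key) != want
    }


# Matches host lines of an Ansible PLAY RECAP, e.g. "node1 : ok=12 changed=0 ..."
_RECAP_LINE = re.compile(r"^(\S+)[ \t]*:[ \t]*ok=\d+[ \t]+changed=(\d+)", re.MULTILINE)


def changed_per_host(recap_text: str) -> dict:
    """Extract the changed=N counter for each host in the recap."""
    return {host: int(n) for host, n in _RECAP_LINE.findall(recap_text)}


def idempotence_gate(recap_text: str) -> bool:
    """Pass only if at least one host was parsed and every host has changed=0."""
    counts = changed_per_host(recap_text)
    return bool(counts) and all(n == 0 for n in counts.values())


def plan_escapes_zone(plan_json: str, allowed_zone: str, zone_attr: str = "zone") -> list:
    """Addresses of planned changes not confined to allowed_zone.

    zone_attr is a hypothetical attribute name; adjust it to the provider schema.
    """
    plan = json.loads(plan_json)
    offenders = []
    for rc in plan.get("resource_changes", []):
        change = rc.get("change", {})
        if change.get("actions") == ["no-op"]:
            continue
        # Inspect both sides so deletes (after is null) and zone moves are caught.
        zones = {
            change[side].get(zone_attr)
            for side in ("before", "after")
            if isinstance(change.get(side), dict)
        }
        if zones != {allowed_zone}:
            offenders.append(rc.get("address"))
    return offenders
```

An empty result from `parity_drift` and `plan_escapes_zone`, together with `idempotence_gate` returning `True`, would correspond to green gates. An empty recap is treated as a failure here so that a run that produced no output cannot pass by accident.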

Chaos drills

Each drill maps to a section of CLUSTER_RECOVERY.md. The drill is "passed" only when the acceptance condition is met by following the runbook — any improvised step becomes a runbook PR.

| Drill | Action | Acceptance | Recovery section |
| --- | --- | --- | --- |
| Node-kill | Stop one sandbox VM (agent, then server). | Workloads reschedule; cluster returns to Ready/Healthy. | Node-loss / reschedule section. |
| Vault-seal | Seal the sandbox Vault. | Unseal via the sandbox key (~/.arcodange/sandbox/cluster-keys.json); VSO re-authenticates and resumes secret injection. | Vault unseal + VSO re-auth section. |
| Longhorn volume corruption | Corrupt/recreate a sandbox volume. | Run recover/longhorn*.yml; validate engine-ID re-association per the Longhorn PVC recovery ADR. | Longhorn restore section. |
| DB drop/restore | Drop a sandbox DB. | Restore from the sandbox backup bucket; app reconnects, data intact. | Postgres restore section. |
| Full power-cut simulation | Cold-stop all three sandbox VMs. | Execute CLUSTER_RECOVERY.md top-to-bottom against the sandbox to green (Longhorn restore → Vault unseal → VSO re-auth → ERP scaled up last). | Whole runbook. |
| ArgoCD bad-sync | Push a deliberately broken sandbox ref. | Observe prune/selfHeal; author/validate a rollback runbook section. | ArgoCD rollback (to be authored). |
| Cert/PKI re-issue | Re-init the sandbox Step-CA. | Internal *.arcodange.lab (sandbox) certs re-issue and chains validate. | PKI re-issue section. |

  1. Node-kill stops a sandbox VM and confirms workloads reschedule and the cluster returns to a healthy state.
  2. Vault-seal seals the sandbox Vault, then unseals it with the sandbox key and confirms VSO re-authenticates and resumes injecting secrets.
  3. Longhorn corruption corrupts a sandbox volume, runs the recover/longhorn*.yml playbooks, and validates engine-ID re-association against the Longhorn PVC recovery ADR.
  4. DB drop/restore drops a sandbox database and restores it from the sandbox backup bucket, confirming the app reconnects with data intact.
  5. Full power-cut cold-stops all three sandbox VMs and runs CLUSTER_RECOVERY.md end to end to green, with the ERP scaled up last.
  6. ArgoCD bad-sync pushes a broken sandbox ref, observes prune/selfHeal behaviour, and produces a rollback runbook section.
  7. Cert/PKI re-issue re-initialises the sandbox Step-CA and confirms internal certificates re-issue and validate.
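
The pass rule for drills (acceptance met by following the runbook, with every improvised step turned into a runbook PR) can be expressed as a small data structure. The sketch below is illustrative only: `DrillRun`, `RECOVERY_SECTION`, and the PR title format are hypothetical names, while the drill-to-section mapping is copied from the list of drills above.

```python
from dataclasses import dataclass, field

# Drill name -> CLUSTER_RECOVERY.md section, as listed for each drill.
RECOVERY_SECTION = {
    "Node-kill": "Node-loss / reschedule",
    "Vault-seal": "Vault unseal + VSO re-auth",
    "Longhorn volume corruption": "Longhorn restore",
    "DB drop/restore": "Postgres restore",
    "Full power-cut simulation": "Whole runbook",
    "ArgoCD bad-sync": "ArgoCD rollback (to be authored)",
    "Cert/PKI re-issue": "PKI re-issue",
}


@dataclass
class DrillRun:
    """One execution of a chaos drill against the sandbox (hypothetical model)."""

    drill: str
    acceptance_met: bool
    improvised_steps: list = field(default_factory=list)

    def __post_init__(self):
        if self.drill not in RECOVERY_SECTION:
            raise ValueError(f"unknown drill: {self.drill}")

    @property
    def passed(self) -> bool:
        # Passed only if the acceptance condition held AND nothing was improvised.
        return self.acceptance_met and not self.improvised_steps

    def runbook_prs(self) -> list:
        """One suggested runbook PR title per improvised step."""
        section = RECOVERY_SECTION[self.drill]
        return [f"{section}: {step}" for step in self.improvised_steps]
```

Under this model a drill that recovered the cluster but needed an undocumented step is still recorded as not passed, which matches the rule that the runbook, not the operator's memory, is what is being tested.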

Recurring game-day & promotion gate

The game-day recurs monthly, and the operator follows only the runbook. Any improvised step becomes a runbook PR; this is how CLUSTER_RECOVERY.md gets validated against the sandbox instead of waiting for the next real incident.

Promotion gate: no infra / Vault / storage / DNS change reaches prod until it has been applied to the sandbox and survived the matching drill. This gate belongs in the PR checklist and cross-links to STATUS.md.
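
As a rough sketch, the promotion gate reduces to a two-part check: the change was applied to the sandbox, and every drill matching its change-class has passed. The mapping from change-class to drills below is a hypothetical example, since this document does not define which drill matches which class; `may_promote` and `MATCHING_DRILLS` are illustrative names.

```python
# HYPOTHETICAL mapping: the source does not state which drill matches which
# change-class, so these pairings are placeholders to be replaced.
MATCHING_DRILLS = {
    "vault": {"Vault-seal"},
    "storage": {"Longhorn volume corruption"},
}


def may_promote(change_class, applied_to_sandbox, passed_drills, matching=MATCHING_DRILLS):
    """Return True only if the change was rehearsed in the sandbox.

    change_class: e.g. "vault" or "storage" (keys of the mapping).
    applied_to_sandbox: whether the change has been applied to the sandbox.
    passed_drills: names of drills that passed with this change applied.
    """
    required = matching.get(change_class)
    if required is None:
        # Fail closed: a class with no defined drill cannot be promoted.
        return False
    return bool(applied_to_sandbox) and required <= set(passed_drills)
```

Failing closed on an unknown change-class is a design choice of this sketch: it forces someone to decide which drill covers a new kind of change before that change can reach prod.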

Evidence trail

Each drill records time-to-recover and which step failed or was improvised. The evidence trail accumulates over at least one game-day cycle and is the proof that a change-class is rehearsed and that the recovery runbook is current. Failed/improvised steps feed directly back into runbook PRs.
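
One possible shape for an evidence record, shown as a sketch under assumptions: the field names, the use of seconds as the unit, and the choice of the median as the summary statistic are all illustrative and not specified by this document.

```python
from dataclasses import dataclass
from statistics import median
from typing import Optional


@dataclass(frozen=True)
class Evidence:
    """One drill outcome from one game-day (hypothetical record layout)."""

    drill: str
    game_day: str                        # ISO date of the game-day
    seconds_to_recover: float
    problem_step: Optional[str] = None   # step that failed or was improvised
    problem_kind: Optional[str] = None   # "failed" or "improvised"


def median_recovery(trail, drill) -> Optional[float]:
    """Median time-to-recover for one drill, or None if it was never run."""
    times = [e.seconds_to_recover for e in trail if e.drill == drill]
    return median(times) if times else None


def runbook_feedback(trail) -> list:
    """Steps that should feed back into runbook PRs, in recorded order."""
    return [
        (e.drill, e.problem_kind, e.problem_step)
        for e in trail
        if e.problem_step is not None
    ]
```

A drill with no entries in the trail returns `None` from `median_recovery`, which makes an unrehearsed change-class visible rather than reporting it as a zero recovery time.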