[vibe](../../README.md) > [PRD](../README.md) > [Safe, production-like environment](README.md) > **QA strategy**

# QA strategy

> **Status:** In design
> **Last Updated:** 2026-06-23
> **Upstream:** [Safe, production-like environment](README.md)
> **Related:** [ADR 0001](../../ADR/0001-safe-prod-like-environment.md) · [Isolation boundary](isolation-boundary.md) · [INV-001](../../investigations/INV-001-prod-blast-radius-couplings.md)

The sandbox is only useful if a green run there reliably predicts a green run in prod. That requires two things: the sandbox must stay **faithful** to prod (fidelity gates), and dangerous change-classes must be **rehearsed** before they ship (chaos drills + promotion gate).

## Fidelity gates

- **Parity manifest** — the sandbox must match prod on: k3s `v1.34.3+k3s1`, the same Longhorn / Vault / VSO versions, the same app-of-apps list, Postgres + Gitea running **outside** k3s, and three nodes. Any drift from this manifest is a failed gate.
- **Provisioning-parity canary test** — run the [new-web-app runbook](../../../doc/runbooks/new-web-app/conventions.md) for a throwaway `<app>` named `canary` and assert the convention chain resolves end to end: Gitea repo → PG db + role → Vault creds + policies → ArgoCD `Healthy`/`Synced` → VSO injects → pod `Running`. One typo anywhere in the chain fails this test.
- **Idempotence gate (changed=0)** — `ansible-playbook --check --diff` must report `changed=0` on the converged sandbox **before** the same change is promoted to prod. A non-zero diff means the play is not idempotent and is not ready.
- **`tofu plan` diff gate** — `tofu plan` runs against **sandbox** state; for DNS it must assert the plan touches **only** the throwaway zone. A plan that proposes to touch anything else fails the gate.

## Chaos drills

Each drill maps to a section of `CLUSTER_RECOVERY.md`. The drill is "passed" only when the acceptance condition is met by following the runbook — any improvised step becomes a runbook PR.

| Drill | Action | Acceptance | Recovery section |
| --- | --- | --- | --- |
| Node-kill | Stop one sandbox VM (agent, then server). | Workloads reschedule; cluster returns to `Ready`/`Healthy`. | Node-loss / reschedule section. |
| Vault-seal | Seal the **sandbox** Vault. | Unseal via the **sandbox** key (`~/.arcodange/sandbox/cluster-keys.json`); VSO re-authenticates and resumes secret injection. | Vault unseal + VSO re-auth section. |
| Longhorn volume corruption | Corrupt/recreate a sandbox volume. | Run `recover/longhorn*.yml`; validate engine-ID re-association per the [Longhorn PVC recovery ADR](../../../ansible/arcodange/factory/docs/adr/20260414-longhorn-pvc-recovery.md). | Longhorn restore section. |
| DB drop/restore | Drop a sandbox DB. | Restore from the sandbox backup bucket; app reconnects, data intact. | Postgres restore section. |
| Full power-cut simulation | Cold-stop all three sandbox VMs. | Execute `CLUSTER_RECOVERY.md` top-to-bottom against the sandbox to green (Longhorn restore → Vault unseal → VSO re-auth → ERP scaled up last). | Whole runbook. |
| ArgoCD bad-sync | Push a deliberately broken sandbox ref. | Observe `prune`/`selfHeal`; author/validate a rollback runbook section. | ArgoCD rollback (to be authored). |
| Cert/PKI re-issue | Re-init the sandbox Step-CA. | Internal `*.arcodange.lab` (sandbox) certs re-issue and chains validate. | PKI re-issue section. |

1. **Node-kill** stops a sandbox VM and confirms workloads reschedule and the cluster returns to a healthy state.
2. **Vault-seal** seals the sandbox Vault, then unseals it with the **sandbox** key and confirms VSO re-authenticates and resumes injecting secrets.
3. **Longhorn corruption** corrupts a sandbox volume, runs the `recover/longhorn*.yml` playbooks, and validates engine-ID re-association against the Longhorn PVC recovery ADR.
4. **DB drop/restore** drops a sandbox database and restores it from the sandbox backup bucket, confirming the app reconnects with data intact.
5. **Full power-cut** cold-stops all three sandbox VMs and runs `CLUSTER_RECOVERY.md` end to end to green, with the ERP scaled up last.
6. **ArgoCD bad-sync** pushes a broken sandbox ref, observes `prune`/`selfHeal` behaviour, and produces a rollback runbook section.
7. **Cert/PKI re-issue** re-initialises the sandbox Step-CA and confirms internal certificates re-issue and validate.

## Recurring game-day & promotion gate

A **recurring monthly game-day** where the operator follows **only** the runbook. Any improvised step becomes a runbook PR — this is how `CLUSTER_RECOVERY.md` gets validated against the sandbox instead of waiting for the next real incident.

**Promotion gate:** no infra / Vault / storage / DNS change reaches prod until it has been applied to the sandbox **and** survived the matching drill. This gate belongs in the PR checklist and crosslinks to [STATUS.md](STATUS.md).

## Evidence trail

Each drill records **time-to-recover** and **which step failed or was improvised**. The evidence trail accumulates over at least one game-day cycle and is the proof that a change-class is rehearsed and that the recovery runbook is current. Failed/improvised steps feed directly back into runbook PRs.