diff --git a/ansible/arcodange/factory/docs/adr/20260407-network-architecture.md b/ansible/arcodange/factory/docs/adr/20260407-network-architecture.md
index b40d097..a6e9446 100644
--- a/ansible/arcodange/factory/docs/adr/20260407-network-architecture.md
+++ b/ansible/arcodange/factory/docs/adr/20260407-network-architecture.md
@@ -326,9 +326,251 @@ sequenceDiagram
 end
 ```
 
-## Success Metrics
-- Pi-hole blocks >50% of ads and trackers.
-- Step CA issues certificates without downtime.
-- Traefik routes 100% of external traffic via Cloudflare.
-- CrowdSec bans >10 malicious IPs per day.
-- Cloudflare Turnstile blocks >90% of bot traffic.
+## Deep Dive: `.lab` Domain SSL/TLS Architecture
+
+### Overview
+The `.lab` domain relies on a **zero-trust internal PKI** (Public Key Infrastructure) powered by **Step CA** and integrated with **k3s**, **Traefik**, and **cert-manager**. This section details the components, how they interact, and the operational workflows around them.
+
+### Core Components
+
+#### 1. **Step CA (Certificate Authority)**
+- **Host**: `pi1` (primary), with standby nodes for resilience.
+- **Ports**: `8443` (HTTPS API). step-ca listens on a single HTTPS address, so ACME (if an ACME provisioner is configured) is served on that same listener under `/acme/<provisioner-name>/directory`; a separate `443` ACME endpoint only exists if a proxy forwards that port to the CA.
+- **Provisioners**:
+  - `cert-manager`: Dedicated to k3s workloads.
+  - `admin`: For manual certificate issuance.
+- **Certificate Lifecycle**:
+  - **Short-lived certificates** (step-ca default: 24h).
+  - **Automatic renewal** via cert-manager.
+  - **Passive revocation**: step-ca does not implement OCSP (so there is no OCSP stapling); a revoked certificate is simply refused renewal and expires at the end of its short lifetime.
+
+#### 2. **cert-manager**
+- **Namespace**: `cert-manager`.
+- **CRDs**:
+  - `Certificate`: Declares a desired certificate (DNS names, issuer, target `Secret`).
+  - `CertificateRequest`: One signing request (CSR) per issuance, fulfilled by the referenced issuer.
+  - `ClusterIssuer`/`Issuer`: cert-manager's built-in issuer kinds; they could reach Step CA through its ACME endpoint, but `.lab` certificates here use `StepClusterIssuer`.
+  - `StepClusterIssuer`: Not part of cert-manager itself; this CRD is provided by the external `step-issuer` controller, which connects cert-manager to Step CA.
+
+#### 3. **StepClusterIssuer**
+- **Purpose**: Bridges cert-manager and Step CA.
+- **Configuration** (`StepClusterIssuer` is cluster-scoped, so it takes no `metadata.namespace`; the namespace of the password `Secret` goes in `passwordRef` instead):
+  ```yaml
+  apiVersion: certmanager.step.sm/v1beta1
+  kind: StepClusterIssuer
+  metadata:
+    name: step-issuer
+  spec:
+    url: "https://ssl-ca.arcodange.lab:8443"
+    # Base64-encoded PEM of the Step CA root certificate
+    caBundle: ""
+    provisioner:
+      name: cert-manager
+      # Key ID of the `cert-manager` JWK provisioner (see `step ca provisioner list`)
+      kid: ""
+      passwordRef:
+        name: step-jwk-password
+        namespace: cert-manager
+        key: password
+  ```
+- **Workflow**:
+  1. cert-manager creates a `CertificateRequest` containing a CSR for the `Certificate`'s private key.
+  2. The step-issuer controller, acting for the `StepClusterIssuer`, authenticates to Step CA with a one-time token signed by the JWK provisioner key and submits the CSR.
+  3. Step CA signs the certificate, and step-issuer writes it back to the `CertificateRequest` status.
+  4. cert-manager stores the certificate and its private key in a Kubernetes `Secret`.
+
+#### 4. **Traefik Ingress Controller**
+- **Namespace**: `kube-system`.
+- **TLS Configuration**:
+  - **EntryPoints**: `websecure` (HTTPS), `web` (HTTP, redirected to HTTPS).
+  - **Certificate Resolvers**:
+    - `letsencrypt`: For `.arcodange.fr` (public).
+    - `step-ca`: For `.lab` (internal).
+  - **Middlewares**:
+    - `localIp@file`: IP allowlisting.
+    - `crowdsec-bouncer`: Intrusion prevention.
+
+#### 5. **Certificate and CertificateRequest**
+- **Example `Certificate` for `.lab`**:
+  ```yaml
+  apiVersion: cert-manager.io/v1
+  kind: Certificate
+  metadata:
+    name: wildcard-arcodange-lab
+    namespace: kube-system
+  spec:
+    secretName: wildcard-arcodange-lab-tls
+    issuerRef:
+      name: step-issuer
+      kind: StepClusterIssuer
+      group: certmanager.step.sm
+    dnsNames:
+      - "*.arcodange.lab"
+      - "arcodange.lab"
+  ```
+- **Generated `CertificateRequest`**:
+  - Created automatically by cert-manager for each issuance and renewal.
+  - References the `StepClusterIssuer` through the same `issuerRef`.
+  - Conditions progress from `Approved=True` (set by cert-manager's approver) to `Ready=True` with reason `Issued`; while waiting on the CA, it reports `Ready=False` with reason `Pending`.
+
+#### 6. **k3s Cluster Integration**
+- **Nodes**: `pi1` (control plane), `pi2`, `pi3` (workers).
+- **Storage**: Longhorn for persistent volumes.
+- **Networking**:
+  - **CNI**: Flannel.
+  - **Ingress**: Traefik.
+  - **Service Mesh**: Linkerd (optional).
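+
+#### 7. **Serving the Wildcard Certificate from Traefik (sketch)**
+The components above do not pin down how Traefik picks up `wildcard-arcodange-lab-tls`. One option, sketched here rather than taken from the deployed manifests, is Traefik's `default` `TLSStore`: Traefik serves its `defaultCertificate` whenever no other certificate matches the requested server name, and the referenced secret must live in the store's namespace (`kube-system`, like the `Certificate` above). The sketch assumes a Traefik release that serves the `traefik.io/v1alpha1` CRD group (v2.10 or later); the `gitea` namespace, `gitea-lab` route, `gitea-http` service, port `3000`, and `gitea.arcodange.lab` host are hypothetical placeholders.
+
+```yaml
+# TLSStore "default": Traefik only uses the TLS store with this exact name
+apiVersion: traefik.io/v1alpha1
+kind: TLSStore
+metadata:
+  name: default
+  namespace: kube-system   # same namespace as the wildcard Secret
+spec:
+  defaultCertificate:
+    secretName: wildcard-arcodange-lab-tls
+---
+# Hypothetical route for an internal app. `tls: {}` enables TLS without a
+# per-route secret, so the default certificate above is served.
+apiVersion: traefik.io/v1alpha1
+kind: IngressRoute
+metadata:
+  name: gitea-lab          # hypothetical
+  namespace: gitea         # hypothetical
+spec:
+  entryPoints:
+    - websecure
+  routes:
+    - match: Host(`gitea.arcodange.lab`)
+      kind: Rule
+      middlewares:
+        - name: localIp@file   # IP allowlist from Traefik's file provider
+      services:
+        - name: gitea-http     # hypothetical Service
+          port: 3000
+  tls: {}
+```
+
+Referencing the secret per route via `tls.secretName` would also work, but that requires a copy of the secret in each route's namespace; the `default` store keeps a single wildcard secret in one place.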
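+
+### Workflow: Certificate Issuance and Renewal
+
+```mermaid
+sequenceDiagram
+    participant App as Application (e.g., Gitea)
+    participant Cert as Certificate
+    participant CR as CertificateRequest
+    participant SCI as StepClusterIssuer
+    participant StepCA as Step CA
+    participant Secret as Kubernetes Secret
+    participant Traefik as Traefik
+
+    App->>Cert: Declare desired certificate
+    Cert->>CR: cert-manager creates CertificateRequest (CSR)
+    CR->>SCI: Handled by step-issuer
+    SCI->>StepCA: Sign CSR (JWK provisioner token)
+    StepCA-->>SCI: Return signed certificate
+    SCI-->>CR: Write certificate to request status
+    CR-->>Cert: Mark request Ready
+    Cert->>Secret: cert-manager stores certificate/key
+    Secret-->>Traefik: Read via Kubernetes API (TLSStore or tls.secretName)
+    Traefik->>App: Terminate TLS and proxy traffic
+
+    loop At 2/3 of certificate lifetime (default renewBefore = 1/3)
+        Cert->>CR: Create new CertificateRequest
+        CR->>SCI: Handled by step-issuer
+        SCI->>StepCA: Request new certificate
+        StepCA-->>SCI: Return signed certificate
+        SCI-->>CR: Write certificate to request status
+        Cert->>Secret: Update Secret
+    end
+```
+
+### Device Trust: Adding the `.lab` CA to External Devices
+
+#### **Manual Trust Installation**
+1. **Export Root CA**:
+   ```bash
+   scp pi1:/home/step/.step/certs/root_ca.crt ./arcodange-lab-ca.crt
+   ```
+2. **Install on Devices**:
+   - **macOS**:
+     ```bash
+     sudo security add-trusted-cert -d -r trustRoot -k /Library/Keychains/System.keychain ./arcodange-lab-ca.crt
+     ```
+   - **Linux (Debian/Ubuntu)** (the file must keep its `.crt` extension to be picked up):
+     ```bash
+     sudo cp arcodange-lab-ca.crt /usr/local/share/ca-certificates/
+     sudo update-ca-certificates
+     ```
+   - **Windows**:
+     - Import via `certmgr.msc` (current user) or `certlm.msc` (local machine) → **Trusted Root Certification Authorities**.
+   - **Android/iOS**:
+     - Transfer the `.crt` to the device (for example by email) and install it from the device settings.
+     - iOS: after installing the profile, enable full trust under **Settings → General → About → Certificate Trust Settings**.
+     - Android: Chrome trusts user-installed CAs, but apps targeting Android 7 or later ignore them unless they opt in through their network security configuration.
+   - **Raspberry Pi (Raspberry Pi OS, Debian-based)**: use the same location as Debian; files copied directly into `/etc/ssl/certs/` are not added to the `/etc/ssl/certs/ca-certificates.crt` bundle that most tools read.
+     ```bash
+     sudo cp arcodange-lab-ca.crt /usr/local/share/ca-certificates/
+     sudo update-ca-certificates
+     ```
+
+#### **Automated Trust via Ansible**
+- **Playbook**: `playbooks/system/trust_ca.yml`
+- **Role**: `arcodange.factory.trust_ca`
+- **Targets**: All nodes in the `raspberries` group.
+
+### Troubleshooting Common Issues
+
+#### 1. **Certificate Not Issued**
+- **Symptoms**: `CertificateRequest` stuck in `Pending`.
+- **Causes**:
+  - Step CA unreachable from the cluster.
+  - Incorrect `caBundle` or provisioner `kid`, or a wrong password in the `passwordRef` secret.
+  - Request never approved: cert-manager's approver needs RBAC permission to approve requests that reference the `certmanager.step.sm` issuer kinds.
+  - Network policies blocking egress to Step CA.
+- **Fixes**:
+  ```bash
+  # Inspect the request's conditions (Approved / Ready) and events
+  kubectl -n kube-system describe certificaterequest
+
+  # Check StepClusterIssuer status (cluster-scoped, so no namespace flag)
+  kubectl describe stepclusterissuer step-issuer
+
+  # Check the step-issuer controller logs for Step CA connectivity errors
+  kubectl -n cert-manager logs -l app.kubernetes.io/name=step-issuer
+
+  # Test Step CA manually with the `admin` provisioner
+  step ca certificate --ca-url https://ssl-ca.arcodange.lab:8443 \
+    --root /home/step/.step/certs/root_ca.crt \
+    --provisioner admin \
+    test.lab test.crt test.key
+  ```
+
+#### 2. **Traefik TLS Errors**
+- **Symptoms**: TLS handshake failures, or clients receiving Traefik's self-signed `TRAEFIK DEFAULT CERT` instead of the `.lab` certificate. A `502 Bad Gateway` usually points at the backend connection rather than at the client-facing certificate.
+- **Causes**:
+  - Missing or empty certificate `Secret`, or a `Secret` outside the namespace the route can reference.
+  - Incorrect SNI routing (the requested host matches no router or no certificate).
+  - Expired certificates (for example, after renewal failed).
+- **Fixes**:
+  ```bash
+  # Check Traefik logs
+  kubectl -n kube-system logs -l app.kubernetes.io/name=traefik
+
+  # Decode the certificate in the secret and show its names and expiry
+  kubectl -n kube-system get secret wildcard-arcodange-lab-tls -o jsonpath='{.data.tls\.crt}' \
+    | base64 -d | openssl x509 -noout -subject -enddate -ext subjectAltName
+
+  # Show which certificate Traefik actually serves for a given SNI (fill in the placeholders)
+  openssl s_client -connect <traefik-ip>:443 -servername <app>.arcodange.lab </dev/null \
+    | openssl x509 -noout -subject -issuer -enddate
+
+  # Restart Traefik
+  kubectl -n kube-system rollout restart deployment/traefik
+  ```
+
+#### 3. **Device Trust Issues**
+- **Symptoms**: Browser warnings (`NET::ERR_CERT_AUTHORITY_INVALID` in Chromium-based browsers, `SEC_ERROR_UNKNOWN_ISSUER` in Firefox).
+- **Causes**:
+  - CA not installed in the device trust store.
+  - Firefox keeps its own trust store; on Windows and macOS it only uses OS-installed roots when `security.enterprise_roots.enabled` is `true`.
+  - Clock skew: with 24h certificates, a clock that is off by hours can fall outside the validity window.
+- **Fixes**:
+  - Reinstall the CA certificate.
+  - Sync the device clock with NTP:
+    ```bash
+    # systemd-based hosts: enable NTP sync, then confirm "System clock synchronized: yes"
+    sudo timedatectl set-ntp true
+    timedatectl status
+    # Hosts with ntpdate installed
+    sudo ntpdate pool.ntp.org
+    ```
+
+### Security Considerations
+
+#### 1. **Provisioner Security**
+- **JWK Provisioner**: Its private key is stored encrypted (`encryptedKey`) in Step CA's provisioner configuration; the password that decrypts it lives in the Kubernetes `Secret` referenced by `passwordRef` (`step-jwk-password`).
+- **Password Rotation**:
+  ```bash
+  # Rotate JWK password via Ansible
+  ansible-playbook playbooks/ssl/rotate_jwk_password.yml
+  ```
+
+#### 2. **Certificate Revocation**
+- **Passive revocation**: step-ca does not implement OCSP. Revoking a certificate makes step-ca refuse to renew it, so with 24h lifetimes a revoked certificate stays usable for at most its remaining validity. Recent step-ca releases can additionally publish a CRL when enabled.
+- **Manual Revocation** (by serial number, as shown by `step certificate inspect`):
+  ```bash
+  step ca revoke <serial-number> --reason superseded
+  ```
+
+#### 3. **Network Isolation**
+- **Step CA Access**: Restricted to k3s cluster IPs via firewall rules.
+- **Traefik Middlewares**: Enforce IP allowlisting for internal services.
+
+### Future Enhancements
+
+1. **Automated Device Onboarding**:
+   - MDM (Mobile Device Management) integration for CA trust.
+   - Ansible playbook for bulk device enrollment.
+
+2. **Step CA High Availability**:
+   - Multiple Step CA instances sharing a MySQL or PostgreSQL database behind a load balancer (step-ca has no built-in Raft replication, and its embedded databases do not support multiple instances).
+   - Automatic failover between CA instances.
+
+3. **Certificate Transparency**:
+   - Log all `.lab` certificates to a private CT log.
+
+4. **Shorter Certificate Lifetimes**:
+   - Reduce the default TTL from 24h to 1h for critical services.
+
+### References
+
+- [Step CA Documentation](https://smallstep.com/docs/step-ca/)
+- [cert-manager Step Issuer](https://smallstep.com/docs/step-certificates/kubernetes/)
+- [Traefik TLS Configuration](https://doc.traefik.io/traefik/https/tls/)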