docs(adr): extend network-architecture ADR with .lab SSL/TLS deep dive

Replaces the placeholder "Success Metrics" section with a detailed
walkthrough of the internal PKI: Step CA provisioners, cert-manager +
StepClusterIssuer wiring, certificate issuance/renewal sequence diagram,
device-trust installation steps, and troubleshooting playbook for the
common stuck-CertificateRequest / Traefik TLS / device-trust failures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 12:55:27 +02:00
parent 1ae28cb944
commit e3e0decd98


@@ -326,9 +326,251 @@ sequenceDiagram
end
```
-## Success Metrics
-- Pi-hole blocks >50% of ads and trackers.
-- Step CA issues certificates without downtime.
-- Traefik routes 100% of external traffic via Cloudflare.
-- CrowdSec bans >10 malicious IPs per day.
-- Cloudflare Turnstile blocks >90% of bot traffic.
## Deep Dive: `.lab` Domain SSL/TLS Architecture
### Overview
The `.lab` domain relies on a **zero-trust internal PKI** (Public Key Infrastructure) powered by **Step CA**, integrated with **k3s**, **Traefik**, and **cert-manager**. This section details the components, their interactions, and the operational workflows.
### Core Components
#### 1. **Step CA (Certificate Authority)**
- **Host**: `pi1` (primary), with standby nodes for resilience.
- **Ports**: `8443` (HTTPS), `443` (ACME).
- **Provisioners**:
  - `cert-manager`: Dedicated to k3s workloads.
  - `admin`: For manual certificate issuance.
- **Certificate Lifecycle**:
  - **Short-lived certificates** (default: 24h).
  - **Automatic renewal** via cert-manager.
  - **Revocation**: passive by default. A revoked certificate can no longer be renewed and expires within its short lifetime; Step CA does not provide an OCSP responder, so there is nothing to staple.
#### 2. **cert-manager**
- **Namespace**: `cert-manager`.
- **CRDs**:
  - `Certificate`: Declares a desired certificate.
  - `CertificateRequest`: Carries the CSR that Step CA signs.
  - `ClusterIssuer`/`Issuer`: References Step CA.
  - `StepClusterIssuer`: Custom resource for Step CA integration, installed by step-issuer (API group `certmanager.step.sm`) rather than by cert-manager itself.
#### 3. **StepClusterIssuer**
- **Purpose**: Bridges cert-manager with Step CA.
- **Configuration**:
  ```yaml
  apiVersion: certmanager.step.sm/v1beta1
  kind: StepClusterIssuer
  metadata:
    name: step-issuer  # cluster-scoped resource: no metadata.namespace
  spec:
    url: "https://ssl-ca.arcodange.lab:8443"
    caBundle: "<base64-encoded-root-ca>"
    provisioner:
      name: cert-manager
      kid: "<key-id>"
      passwordRef:
        name: step-jwk-password
        key: password
        namespace: cert-manager  # where the Secret lives; needed because the issuer has no namespace
  ```
- **Workflow**:
  1. cert-manager creates a `CertificateRequest`.
  2. The step-issuer controller picks up requests that reference the `StepClusterIssuer` and forwards them to Step CA.
  3. Step CA signs the certificate and returns it; the controller writes it to the `CertificateRequest` status.
  4. cert-manager stores the certificate in a Kubernetes `Secret`.
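The `caBundle` and `passwordRef` values in the manifest above have to be produced somewhere. The following is a minimal sketch of one way to do it, not the repository's actual provisioning code: the helper names `ca_bundle_b64` and `create_jwk_password_secret` and the file paths in the usage comments are assumptions for illustration.

```bash
# Hypothetical helpers; function names and file paths are assumptions.

# Print a PEM file as single-line base64, the form expected by spec.caBundle.
ca_bundle_b64() {
  base64 < "$1" | tr -d '\n'
}

# Create the Secret referenced by passwordRef (name and key come from the manifest).
# The password is read from a file so it does not land in shell history.
# Write that file without a trailing newline (e.g. with printf), since the
# file content is stored byte for byte.
create_jwk_password_secret() {
  local password_file="$1"
  kubectl -n cert-manager create secret generic step-jwk-password \
    --from-file=password="$password_file"
}

# Usage (assumed paths):
#   ca_bundle_b64 ./arcodange-lab-ca.crt
#   create_jwk_password_secret ./jwk-password.txt
```

Keeping the two steps as separate functions makes it possible to regenerate the bundle after a root rotation without touching the provisioner password.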
#### 4. **Traefik Ingress Controller**
- **Namespace**: `kube-system`.
- **TLS Configuration**:
  - **EntryPoints**: `websecure` (HTTPS), `web` (HTTP, redirected to HTTPS).
  - **Certificate Resolvers**:
    - `letsencrypt`: For `.arcodange.fr` (public).
    - `step-ca`: For `.lab` (internal).
- **Middlewares**:
  - `localIp@file`: IP allowlisting.
  - `crowdsec-bouncer`: Intrusion prevention.
#### 5. **Certificate and CertificateRequest**
- **Example `Certificate` for `.lab`**:
  ```yaml
  apiVersion: cert-manager.io/v1
  kind: Certificate
  metadata:
    name: wildcard-arcodange-lab
    namespace: kube-system
  spec:
    secretName: wildcard-arcodange-lab-tls
    issuerRef:
      name: step-issuer
      kind: StepClusterIssuer
      group: certmanager.step.sm
    dnsNames:
      - "*.arcodange.lab"
      - "arcodange.lab"
  ```
- **Generated `CertificateRequest`**:
  - Automatically created by cert-manager.
  - References the `StepClusterIssuer`.
  - Condition flow: `Approved` is set first, then `Ready` moves from `False` (reason `Pending`) to `True` (reason `Issued`) once Step CA returns the signed certificate.
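To follow a request through its conditions, something along these lines can help. It is a sketch under assumptions: the helper names are hypothetical, and the resource names in the usage comments are taken from the `Certificate` example above.

```bash
# Hypothetical helpers; only standard kubectl subcommands are used.

# Block until cert-manager reports the Certificate as Ready, or the timeout expires.
wait_for_certificate() {
  local namespace="$1" name="$2" timeout="${3:-120s}"
  kubectl -n "$namespace" wait --for=condition=Ready \
    "certificate/$name" --timeout="$timeout"
}

# Print one line per CertificateRequest: name, then type=status(reason) per condition.
list_request_conditions() {
  local namespace="$1"
  kubectl -n "$namespace" get certificaterequest \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{range .status.conditions[*]}{.type}={.status}({.reason}) {end}{"\n"}{end}'
}

# Usage:
#   wait_for_certificate kube-system wildcard-arcodange-lab
#   list_request_conditions kube-system
```

`kubectl wait` exits non-zero on timeout, so the first helper can gate a deployment script.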
#### 6. **k3s Cluster Integration**
- **Nodes**: `pi1` (control plane), `pi2`, `pi3` (workers).
- **Storage**: Longhorn for persistent volumes.
- **Networking**:
  - **CNI**: Flannel.
  - **Ingress**: Traefik.
  - **Service Mesh**: Linkerd (optional).
### Workflow: Certificate Issuance and Renewal
```mermaid
sequenceDiagram
    participant App as Application (e.g., Gitea)
    participant Cert as Certificate
    participant CR as CertificateRequest
    participant SCI as StepClusterIssuer
    participant StepCA as Step CA
    participant Secret as Kubernetes Secret
    participant Traefik as Traefik
    App->>Cert: Declare desired certificate
    Cert->>CR: Create CertificateRequest
    CR->>SCI: Forward to StepClusterIssuer
    SCI->>StepCA: Sign CSR (via JWK provisioner)
    StepCA-->>SCI: Return signed certificate
    SCI-->>CR: Write signed certificate to status
    CR->>Secret: cert-manager stores certificate/key
    Secret-->>Traefik: Load as TLS secret
    Traefik->>App: Route traffic with TLS
    loop After 2/3 of certificate lifetime
        Cert->>CR: Trigger renewal
        CR->>SCI: Forward new CSR
        SCI->>StepCA: Request new certificate
        StepCA-->>SCI: Return signed certificate
        SCI-->>CR: Write signed certificate to status
        CR->>Secret: cert-manager updates secret
    end
```
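The renewal point in the diagram is simple arithmetic on the certificate lifetime. The sketch below works it out for the 24h default, assuming renewal fires once two thirds of the lifetime have elapsed; the function names are hypothetical.

```bash
# Hypothetical helpers: integer arithmetic only, all times in seconds.

# Offset after issuance at which renewal is expected, given the lifetime.
renewal_offset_seconds() {
  local lifetime="$1"
  echo $(( lifetime * 2 / 3 ))
}

# Absolute renewal time (Unix epoch) from the notBefore/notAfter epochs.
renewal_epoch() {
  local not_before="$1" not_after="$2"
  echo $(( not_before + (not_after - not_before) * 2 / 3 ))
}

renewal_offset_seconds 86400   # 24h certificate -> 57600 (16h after issuance)
```

With a 24h certificate this leaves an 8h window in which a failed renewal can be retried before the old certificate expires.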
### Device Trust: Adding `.lab` CA to External Devices
#### **Manual Trust Installation**
1. **Export Root CA**:
   ```bash
   scp pi1:/home/step/.step/certs/root_ca.crt ./arcodange-lab-ca.crt
   ```
2. **Install on Devices**:
   - **macOS**:
     ```bash
     sudo security add-trusted-cert -d -r trustRoot -k /Library/Keychains/System.keychain ./arcodange-lab-ca.crt
     ```
   - **Linux (Debian/Ubuntu)**:
     ```bash
     sudo cp arcodange-lab-ca.crt /usr/local/share/ca-certificates/
     sudo update-ca-certificates
     ```
   - **Windows**:
     - Import via `certmgr.msc` → **Trusted Root Certification Authorities**.
   - **Android/iOS**:
     - Email the `.crt` and install via device settings. On iOS, also enable full trust for the root under **Settings → General → About → Certificate Trust Settings**.
   - **Raspberry Pi**:
     ```bash
     # Raspberry Pi OS is Debian-based: update-ca-certificates reads from
     # /usr/local/share/ca-certificates/, not from /etc/ssl/certs/
     sudo cp arcodange-lab-ca.crt /usr/local/share/ca-certificates/
     sudo update-ca-certificates
     ```
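Before and after installing the root, two quick checks are useful: confirm the copied file is the expected CA, and confirm that a `.lab` endpoint actually chains to it. The following is a sketch using standard `openssl` subcommands; the function names are hypothetical, and `gitea.arcodange.lab` in the usage comment is an assumed host name.

```bash
# Hypothetical helpers built on openssl.

# Print the SHA-256 fingerprint of a PEM certificate, to compare by eye with
# the fingerprint of root_ca.crt computed on pi1.
ca_fingerprint() {
  openssl x509 -in "$1" -noout -fingerprint -sha256
}

# Return 0 if the certificate served by host:port validates against the CA file.
served_cert_trusted_by() {
  local ca_file="$1" host="$2" port="${3:-443}"
  openssl s_client -connect "$host:$port" -servername "$host" \
    -CAfile "$ca_file" -verify_return_error </dev/null >/dev/null 2>&1
}

# Usage (assumed host name):
#   ca_fingerprint ./arcodange-lab-ca.crt
#   served_cert_trusted_by ./arcodange-lab-ca.crt gitea.arcodange.lab && echo trusted
```

The second check passes the CA file explicitly, so it works even on a machine where the root has not been installed yet.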
#### **Automated Trust via Ansible**
- **Playbook**: `playbooks/system/trust_ca.yml`
- **Role**: `arcodange.factory.trust_ca`
- **Targets**: All nodes in `raspberries` group.
### Troubleshooting Common Issues
#### 1. **Certificate Not Issued**
- **Symptoms**: `CertificateRequest` never becomes `Ready` (reason stays `Pending`).
- **Causes**:
  - Step CA unreachable.
  - Incorrect `caBundle` or provisioner `kid`.
  - Network policies blocking egress to Step CA.
- **Fixes**:
  ```bash
  # Check StepClusterIssuer status (cluster-scoped, so no namespace flag)
  kubectl describe stepclusterissuer step-issuer
  # Inspect step-issuer controller logs for connection or authentication errors
  kubectl -n cert-manager logs -l app.kubernetes.io/name=step-issuer
  # Test issuance against Step CA directly (the --root path is the one on pi1)
  step ca certificate --ca-url https://ssl-ca.arcodange.lab:8443 \
    --root /home/step/.step/certs/root_ca.crt \
    test.lab test.crt test.key
  ```
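The first two causes can be told apart quickly by probing the CA directly and reading the request's `Ready` reason. This is a sketch, assuming Step CA's `/health` endpoint is reachable from where the script runs; the function names are hypothetical.

```bash
# Hypothetical helpers; curl and kubectl are the only external tools.

# Query Step CA's health endpoint; succeed only if it reports status "ok".
# Validating against the root CA file also catches a wrong or outdated root.
step_ca_healthy() {
  local ca_url="$1" root_ca="$2" body
  body=$(curl -fsS --cacert "$root_ca" "$ca_url/health") || return 1
  case "$body" in
    *'"status":"ok"'*) return 0 ;;
    *) return 1 ;;
  esac
}

# Print the reason attached to the Ready condition of one CertificateRequest.
request_ready_reason() {
  local namespace="$1" name="$2"
  kubectl -n "$namespace" get certificaterequest "$name" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].reason}'
}

# Usage:
#   step_ca_healthy https://ssl-ca.arcodange.lab:8443 ./arcodange-lab-ca.crt && echo "CA up"
#   request_ready_reason kube-system <certificaterequest-name>
```

If the health check succeeds but the reason stays `Pending`, the provisioner `kid` or password is the more likely culprit than connectivity.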
#### 2. **Traefik TLS Errors**
- **Symptoms**: `502 Bad Gateway` or TLS handshake failures.
- **Causes**:
  - Missing certificate in `Secret`.
  - Incorrect SNI routing.
  - Expired certificates.
- **Fixes**:
  ```bash
  # Check Traefik logs
  kubectl -n kube-system logs -l app.kubernetes.io/name=traefik
  # Verify certificate secret
  kubectl -n kube-system get secret wildcard-arcodange-lab-tls -o yaml
  # Restart Traefik
  kubectl -n kube-system rollout restart deployment/traefik
  ```
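The raw `-o yaml` output only shows base64 data, so decoding the stored certificate is usually the faster way to spot an expired or wrong certificate. This is a sketch with hypothetical function names; the Secret name in the usage comment is the one from the `Certificate` example.

```bash
# Hypothetical helpers; kubectl, base64 and openssl are assumed to be installed.

# Decode the certificate in a TLS Secret and print subject, issuer and validity dates.
secret_cert_info() {
  local namespace="$1" secret="$2"
  kubectl -n "$namespace" get secret "$secret" -o jsonpath='{.data.tls\.crt}' \
    | base64 -d \
    | openssl x509 -noout -subject -issuer -dates
}

# Return 0 if the stored certificate is still valid N seconds from now.
secret_cert_valid_for() {
  local namespace="$1" secret="$2" seconds="$3"
  kubectl -n "$namespace" get secret "$secret" -o jsonpath='{.data.tls\.crt}' \
    | base64 -d \
    | openssl x509 -noout -checkend "$seconds" >/dev/null
}

# Usage:
#   secret_cert_info kube-system wildcard-arcodange-lab-tls
#   secret_cert_valid_for kube-system wildcard-arcodange-lab-tls 3600 || echo "expires within 1h"
```

If the Secret holds a valid certificate but clients still see errors, the problem is more likely SNI routing than issuance.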
#### 3. **Device Trust Issues**
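- **Symptoms**: Browser warnings (`NET::ERR_CERT_AUTHORITY_INVALID`).
- **Causes**:
  - CA not installed in device trust store.
  - Clock skew: with 24h certificates, a device clock that is off can fall outside the certificate validity window.
- **Fixes**:
  - Reinstall CA certificate.
  - Sync device clock with NTP:
    ```bash
    sudo ntpdate pool.ntp.org
    # On systemd hosts without ntpdate: sudo timedatectl set-ntp true
    ```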
### Security Considerations
#### 1. **Provisioner Security**
- **JWK Provisioner**: Its private key is encrypted with a password stored in a Kubernetes `Secret`.
- **Password Rotation**:
  ```bash
  # Rotate JWK password via Ansible
  ansible-playbook playbooks/ssl/rotate_jwk_password.yml
  ```
#### 2. **Certificate Revocation**
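- **Passive revocation**: Step CA marks the certificate as revoked and refuses to renew it; with 24h lifetimes it then expires quickly. Step CA does not ship an OCSP (Online Certificate Status Protocol) responder, so clients cannot query revocation status online.
- **Manual Revocation**:
  ```bash
  step ca revoke <serial> --reason superseded
  ```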
#### 3. **Network Isolation**
- **Step CA Access**: Restricted to k3s cluster IPs via firewall rules.
- **Traefik Middlewares**: Enforce IP allowlisting for internal services.
### Future Enhancements
1. **Automated Device Onboarding**:
   - MDM (Mobile Device Management) integration for CA trust.
   - Ansible playbook for bulk device enrollment.
2. **Step CA High Availability**:
   - Multi-node Step CA with Raft consensus.
   - Automatic failover for provisioners.
3. **Certificate Transparency**:
   - Log all `.lab` certificates to a private CT log.
4. **Short-Lived Certificates**:
   - Reduce default TTL to 1h for critical services.
### References
- [Step CA Documentation](https://smallstep.com/docs/step-ca/)
- [cert-manager Step Issuer](https://smallstep.com/docs/step-certificates/kubernetes/)
- [Traefik TLS Configuration](https://doc.traefik.io/traefik/https/tls/)