docs(adr): extend network-architecture ADR with .lab SSL/TLS deep dive
Replaces the placeholder "Success Metrics" section with a detailed walkthrough of the internal PKI: Step CA provisioners, cert-manager + StepClusterIssuer wiring, certificate issuance/renewal sequence diagram, device-trust installation steps, and troubleshooting playbook for the common stuck-CertificateRequest / Traefik TLS / device-trust failures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -326,9 +326,251 @@ sequenceDiagram
    end
```

## Success Metrics
- Pi-hole blocks >50% of ads and trackers.
- Step CA issues certificates without downtime.
- Traefik routes 100% of external traffic via Cloudflare.
- CrowdSec bans >10 malicious IPs per day.
- Cloudflare Turnstile blocks >90% of bot traffic.
## Deep Dive: `.lab` Domain SSL/TLS Architecture

### Overview
The `.lab` domain relies on a **zero-trust internal PKI** (Public Key Infrastructure) powered by **Step CA**, integrated with **k3s**, **Traefik**, and **cert-manager**. This section details the components, interactions, and operational workflows.

### Core Components

#### 1. **Step CA (Certificate Authority)**
- **Host**: `pi1` (primary), with standby nodes for resilience.
- **Ports**: `8443` (HTTPS), `443` (ACME).
- **Provisioners**:
  - `cert-manager`: Dedicated for k3s workloads.
  - `admin`: For manual certificate issuance.
- **Certificate Lifecycle**:
  - **Short-lived certificates** (default: 24h).
  - **Automatic renewal** via cert-manager.
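  - **Revocation**: passive by default. A revoked certificate can no longer be renewed and expires within its short lifetime; open-source Step CA does not provide an OCSP responder.
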
#### 2. **cert-manager**
- **Namespace**: `cert-manager`.
- **CRDs**:
  - `Certificate`: Defines desired certificates.
  - `CertificateRequest`: A signing request created by cert-manager and signed by Step CA.
  - `ClusterIssuer`/`Issuer`: References Step CA.
  - `StepClusterIssuer`: Custom resource for Step CA integration (API group `certmanager.step.sm`, installed with step-issuer rather than with cert-manager itself).

#### 3. **StepClusterIssuer**
- **Purpose**: Bridges cert-manager with Step CA.
- **Configuration**:
  ```yaml
  apiVersion: certmanager.step.sm/v1beta1
  kind: StepClusterIssuer
  metadata:
    name: step-issuer
    namespace: cert-manager
  spec:
    url: "https://ssl-ca.arcodange.lab:8443"
    caBundle: "<base64-encoded-root-ca>"
    provisioner:
      name: cert-manager
      kid: "<key-id>"
      passwordRef:
        name: step-jwk-password
        key: password
  ```
- **Workflow**:
  1. cert-manager creates a `CertificateRequest`.
  2. `StepClusterIssuer` forwards the request to Step CA.
  3. Step CA signs the certificate and returns it to cert-manager.
  4. cert-manager stores the certificate in a Kubernetes `Secret`.

#### 4. **Traefik Ingress Controller**
- **Namespace**: `kube-system`.
- **TLS Configuration**:
  - **EntryPoints**: `websecure` (HTTPS), `web` (HTTP → redirect).
  - **Certificate Resolvers**:
    - `letsencrypt`: For `.arcodange.fr` (public).
    - `step-ca`: For `.lab` (internal).
- **Middlewares**:
  - `localIp@file`: IP allowlisting.
  - `crowdsec-bouncer`: Intrusion prevention.

#### 5. **Certificate and CertificateRequest**
- **Example `Certificate` for `.lab`**:
  ```yaml
  apiVersion: cert-manager.io/v1
  kind: Certificate
  metadata:
    name: wildcard-arcodange-lab
    namespace: kube-system
  spec:
    secretName: wildcard-arcodange-lab-tls
    issuerRef:
      name: step-issuer
      kind: StepClusterIssuer
      group: certmanager.step.sm
    dnsNames:
      - "*.arcodange.lab"
      - "arcodange.lab"
  ```
- **Generated `CertificateRequest`**:
  - Automatically created by cert-manager.
  - References the `StepClusterIssuer`.
  - Status transitions: `Pending` → `Approved` → `Ready`.
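
The status progression above can be checked from the command line. The following is a minimal sketch rather than part of the repository: `cr_phase` and `cr_condition` are hypothetical helper names, and the mapping assumes that cert-manager reports progress through `Approved` and `Ready` conditions on the `CertificateRequest`.

```bash
# Hypothetical helper: map a CertificateRequest's Approved/Ready condition
# statuses ("True", "False", or empty) to the labels used in this section.
cr_phase() {
  approved="$1"
  ready="$2"
  if [ "$ready" = "True" ]; then
    echo "Ready"
  elif [ "$approved" = "True" ]; then
    echo "Approved"
  else
    echo "Pending"
  fi
}

# Hypothetical helper: read one condition status from the cluster.
# $1 = namespace, $2 = CertificateRequest name, $3 = condition type
cr_condition() {
  kubectl -n "$1" get certificaterequest "$2" \
    -o jsonpath="{.status.conditions[?(@.type==\"$3\")].status}"
}

# Example (not run here; <request-name> is a placeholder):
#   cr_phase "$(cr_condition kube-system <request-name> Approved)" \
#            "$(cr_condition kube-system <request-name> Ready)"
```

A request that stays at `Pending` or `Approved` for more than a few minutes is the case covered under "Certificate Not Issued" in the troubleshooting section.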

#### 6. **k3s Cluster Integration**
- **Nodes**: `pi1` (control plane), `pi2`, `pi3` (workers).
- **Storage**: Longhorn for persistent volumes.
- **Networking**:
  - **CNI**: Flannel.
  - **Ingress**: Traefik.
  - **Service Mesh**: Linkerd (optional).

### Workflow: Certificate Issuance and Renewal

```mermaid
sequenceDiagram
    participant App as Application (e.g., Gitea)
    participant Cert as Certificate
    participant CR as CertificateRequest
    participant SCI as StepClusterIssuer
    participant StepCA as Step CA
    participant Secret as Kubernetes Secret
    participant Traefik as Traefik

    App->>Cert: Declare desired certificate
    Cert->>CR: Create CertificateRequest
    CR->>SCI: Forward to StepClusterIssuer
    SCI->>StepCA: Sign CSR (via JWK provisioner)
    StepCA-->>SCI: Return signed certificate
    SCI->>Secret: Store certificate/key
    Secret-->>Traefik: Mount as TLS secret
    Traefik->>App: Route traffic with TLS

    loop Every 2/3 of certificate lifetime
        Cert->>CR: Trigger renewal
        CR->>SCI: Re-sign CSR
        SCI->>StepCA: Request new certificate
        StepCA-->>SCI: Return signed certificate
        SCI->>Secret: Update secret
    end
```
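
To make the renewal loop concrete, the arithmetic for the 24h default lifetime can be worked through in a few lines of shell. This is only an illustrative sketch: the function names are hypothetical, and it assumes renewal is triggered at exactly two thirds of the lifetime, as the diagram states.

```bash
# renew_after_seconds LIFETIME_SECONDS
# Prints the certificate age (in seconds) at which renewal starts when it
# is triggered at two thirds of the lifetime.
renew_after_seconds() {
  echo $(( $1 * 2 / 3 ))
}

# renew_before_seconds LIFETIME_SECONDS
# Prints the validity that remains at that moment (the last third).
renew_before_seconds() {
  echo $(( $1 - $1 * 2 / 3 ))
}

lifetime=$(( 24 * 60 * 60 ))   # 24h default from the Step CA section
echo "renew after:  $(renew_after_seconds "$lifetime")s"    # 57600s = 16h
echo "renew before: $(renew_before_seconds "$lifetime")s"   # 28800s = 8h
```

With a 24h certificate, renewal starts after 16h and leaves an 8h window to recover from a failed attempt. For the 1h TTL mentioned under Future Enhancements, the same calculation gives renewal after 40 minutes and a 20-minute recovery window.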

### Device Trust: Adding `.lab` CA to External Devices

#### **Manual Trust Installation**
1. **Export Root CA**:
   ```bash
   scp pi1:/home/step/.step/certs/root_ca.crt ./arcodange-lab-ca.crt
   ```
2. **Install on Devices**:
   - **macOS**:
     ```bash
     sudo security add-trusted-cert -d -r trustRoot -k /Library/Keychains/System.keychain ./arcodange-lab-ca.crt
     ```
   - **Linux (Debian/Ubuntu)**:
     ```bash
     sudo cp arcodange-lab-ca.crt /usr/local/share/ca-certificates/
     sudo update-ca-certificates
     ```
   - **Windows**:
     - Import via `certmgr.msc` → **Trusted Root Certification Authorities**.
   - **Android/iOS**:
     - Email the `.crt` and install via device settings.
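   - **Raspberry Pi** (Raspberry Pi OS is Debian-based, so the Debian procedure applies; `update-ca-certificates` does not pick up files copied straight into `/etc/ssl/certs/`):
     ```bash
     sudo cp arcodange-lab-ca.crt /usr/local/share/ca-certificates/
     sudo update-ca-certificates
     ```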
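
Before trusting a root certificate that was copied or emailed to a device, it is worth confirming that it is the same certificate held on `pi1`. The sketch below is one way to do that, not an established project script: `normalize_fp` and `same_fp` are hypothetical helpers, and it assumes that `step certificate fingerprint` prints lowercase hex while `openssl x509 -fingerprint -sha256` prints a `Fingerprint=`-prefixed, colon-separated, uppercase value.

```bash
# normalize_fp FINGERPRINT
# Strip an optional "...Fingerprint=" prefix and any colons, then lowercase
# the hex digits so output from different tools can be compared.
normalize_fp() {
  printf '%s' "$1" | sed 's/^.*=//' | tr -d ':' | tr 'A-F' 'a-f'
}

# same_fp A B
# Succeeds when both fingerprints denote the same digest.
same_fp() {
  [ "$(normalize_fp "$1")" = "$(normalize_fp "$2")" ]
}

# Example (not run here):
#   on pi1:     step certificate fingerprint /home/step/.step/certs/root_ca.crt
#   on device:  openssl x509 -noout -fingerprint -sha256 -in arcodange-lab-ca.crt
#   same_fp "<output from pi1>" "<output from device>" && echo "CA matches"
```

Comparing fingerprints matters most for the Android/iOS path, where the file travels over email rather than `scp`.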

#### **Automated Trust via Ansible**
- **Playbook**: `playbooks/system/trust_ca.yml`
- **Role**: `arcodange.factory.trust_ca`
- **Targets**: All nodes in `raspberries` group.

### Troubleshooting Common Issues

#### 1. **Certificate Not Issued**
- **Symptoms**: `CertificateRequest` stuck in `Pending`.
- **Causes**:
  - Step CA unreachable.
  - Incorrect `caBundle` or provisioner `kid`.
  - Network policies blocking egress to Step CA.
- **Fixes**:
  ```bash
  # Check StepClusterIssuer status
  kubectl -n cert-manager describe stepclusterissuer step-issuer

  # Inspect the step-issuer logs for Step CA connection or signing errors
  kubectl -n cert-manager logs -l app.kubernetes.io/name=step-issuer

  # Test Step CA manually
  step ca certificate --ca-url https://ssl-ca.arcodange.lab:8443 \
    --root /home/step/.step/certs/root_ca.crt \
    test.lab test.crt test.key
  ```
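
A quicker first check than issuing a test certificate is to query the CA's health endpoint. The sketch below assumes that Step CA serves `GET /health` with a JSON body of the form `{"status":"ok"}`; the helper names and the local path `./arcodange-lab-ca.crt` are illustrative only.

```bash
# ca_health_ok BODY
# Succeeds when a /health response body reports status "ok".
ca_health_ok() {
  printf '%s' "$1" | grep -q '"status"[[:space:]]*:[[:space:]]*"ok"'
}

# check_ca URL ROOT_CERT
# Fetch URL/health while trusting only ROOT_CERT. On a transport or TLS
# failure the curl exit code is returned unchanged.
check_ca() {
  body=$(curl --silent --show-error --max-time 5 --cacert "$2" "$1/health") || return $?
  ca_health_ok "$body"
}

# Example (not run here):
#   check_ca https://ssl-ca.arcodange.lab:8443 ./arcodange-lab-ca.crt \
#     && echo "Step CA healthy" || echo "Step CA check failed (exit $?)"
```

The curl exit code helps separate the listed causes: 7 (could not connect) or 28 (timeout) points to reachability or network policy, while 60 (certificate could not be verified) points to a wrong root certificate, the same class of problem as an incorrect `caBundle`.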

#### 2. **Traefik TLS Errors**
- **Symptoms**: `502 Bad Gateway` or TLS handshake failures.
- **Causes**:
  - Missing certificate in `Secret`.
  - Incorrect SNI routing.
  - Expired certificates.
- **Fixes**:
  ```bash
  # Check Traefik logs
  kubectl -n kube-system logs -l app.kubernetes.io/name=traefik

  # Verify certificate secret
  kubectl -n kube-system get secret wildcard-arcodange-lab-tls -o yaml

  # Restart Traefik
  kubectl -n kube-system rollout restart deployment/traefik
  ```
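
Dumping the secret as YAML shows only base64 data and also prints the private key. To see whether the stored certificate is missing or expired, it can be decoded and summarized instead. This is a sketch under assumptions: the helper names are hypothetical, `base64 -d` is the GNU/BusyBox spelling, and the secret is assumed to use the standard `tls.crt` key.

```bash
# decode_b64: decode base64 from stdin (GNU coreutils / BusyBox syntax).
decode_b64() {
  base64 -d
}

# secret_cert NAMESPACE SECRET
# Print only the PEM certificate of a TLS secret; the private key
# (tls.key) is never written to the terminal.
secret_cert() {
  kubectl -n "$1" get secret "$2" -o jsonpath='{.data.tls\.crt}' | decode_b64
}

# cert_summary: read a PEM certificate on stdin and print its subject,
# issuer, and validity window.
cert_summary() {
  openssl x509 -noout -subject -issuer -dates
}

# Example (not run here):
#   secret_cert kube-system wildcard-arcodange-lab-tls | cert_summary
```

An empty result means the secret holds no certificate; a `notAfter` in the past means renewal has stalled, which leads back to the "Certificate Not Issued" checks.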

#### 3. **Device Trust Issues**
- **Symptoms**: Browser warnings (`NET::ERR_CERT_AUTHORITY_INVALID`).
- **Causes**:
  - CA not installed in device trust store.
  - Clock skew: with 24h certificates, a device clock that is off by several hours can fall outside the validity window.
- **Fixes**:
  - Reinstall CA certificate.
  - Sync device clock with NTP:
    ```bash
    sudo ntpdate pool.ntp.org
    ```

### Security Considerations

#### 1. **Provisioner Security**
- **JWK Provisioner**: Encrypted with a password stored in Kubernetes `Secret`.
- **Password Rotation**:
  ```bash
  # Rotate JWK password via Ansible
  ansible-playbook playbooks/ssl/rotate_jwk_password.yml
  ```

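#### 2. **Certificate Revocation**
- **Revocation model**: Step CA revokes passively by default: a revoked certificate can no longer be renewed and lapses at the end of its short lifetime. The open-source server does not provide an OCSP (Online Certificate Status Protocol) responder, so clients cannot query revocation status online.
- **Manual Revocation**:
  ```bash
  step ca revoke <serial> --reason superseded
  ```
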
#### 3. **Network Isolation**
- **Step CA Access**: Restricted to k3s cluster IPs via firewall rules.
- **Traefik Middlewares**: Enforce IP allowlisting for internal services.

### Future Enhancements

1. **Automated Device Onboarding**:
   - MDM (Mobile Device Management) integration for CA trust.
   - Ansible playbook for bulk device enrollment.

2. **Step CA High Availability**:
   - Multi-node Step CA with Raft consensus.
   - Automatic failover for provisioners.

3. **Certificate Transparency**:
   - Log all `.lab` certificates to a private CT log.

4. **Short-Lived Certificates**:
   - Reduce default TTL to 1h for critical services.

### References

- [Step CA Documentation](https://smallstep.com/docs/step-ca/)
- [cert-manager Step Issuer](https://smallstep.com/docs/step-certificates/kubernetes/)
- [Traefik TLS Configuration](https://doc.traefik.io/traefik/https/tls/)