Consolidate ADRs into docs/adr/

This commit moves Architecture Decision Records (ADRs) from ../../../docs/adr/ to docs/adr/ in the arcodange/factory repository. This centralizes all ADRs in one location for better maintainability and discoverability.

Generated by Mistral Vibe.
Co-Authored-By: Mistral Vibe <vibe@mistral.ai>
2026-04-08 11:09:34 +02:00
parent fc9164f11e
commit b299469d00
3 changed files with 624 additions and 0 deletions


@@ -0,0 +1,160 @@
# ADR 20260407: CI/CD Architecture with ArgoCD, Gitea, and Vault
## Status
Proposed
## Context
The home lab requires a secure and automated CI/CD pipeline to deploy applications to the k3s cluster. The pipeline must integrate with:
- **Gitea**: For Git repository management and CI runners.
- **ArgoCD**: For GitOps-based continuous deployment.
- **Vault**: For secrets management and OIDC authentication.
- **Gitea Act Runner**: For executing CI jobs.
## Decision
We will implement a **GitOps-driven CI/CD pipeline** with the following components:
### 1. Gitea OIDC Authentication with Vault
- Gitea is registered as an OIDC application in Vault.
- Vault issues short-lived tokens for Gitea users.
- The `gitea_oidc_auth.yml` playbook automates this setup using Playwright and OpenTofu.
- **OIDC Workflow**:
  1. The `oidc_jwt_token.sh` script (base64-encoded in `secrets.vault_oauth__sh_b64`) handles the OIDC flow.
  2. Gitea Act Runner executes the script to obtain an ID token from Gitea.
  3. The ID token is used to authenticate with Vault and retrieve secrets.
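
The three workflow steps above could look roughly like the following Gitea Actions workflow. This is a minimal sketch, not the repository's actual `vault.yaml`. The Vault hostname, the `jwt` auth mount path, the `gitea-ci` role, and the assumption that `oidc_jwt_token.sh` prints the ID token on stdout are all hypothetical. The runner image is also assumed to ship `curl`, `jq`, and `tofu`.

```yaml
# .gitea/workflows/vault.yaml (illustrative sketch, values are assumptions)
name: vault
on:
  push:
    branches: [main]
jobs:
  tofu:
    runs-on: ubuntu-latest-ca               # custom runner image that trusts the local CA
    env:
      VAULT_ADDR: https://vault.arcodange.lab   # assumed Vault hostname
    steps:
      - uses: actions/checkout@v4
      - name: Authenticate to Vault with a Gitea ID token
        run: |
          # Decode the helper script stored base64-encoded in the secret.
          echo "${{ secrets.vault_oauth__sh_b64 }}" | base64 -d > oidc_jwt_token.sh
          # Assumption: the script prints the ID token on stdout.
          ID_TOKEN="$(bash oidc_jwt_token.sh)"
          # Exchange the ID token for a Vault token (JWT auth mounted at "jwt", role "gitea-ci").
          VAULT_TOKEN="$(curl --fail --silent --request POST \
            --data "{\"role\": \"gitea-ci\", \"jwt\": \"${ID_TOKEN}\"}" \
            "${VAULT_ADDR}/v1/auth/jwt/login" | jq -r '.auth.client_token')"
          # Make the token available to later steps of this job.
          echo "VAULT_TOKEN=${VAULT_TOKEN}" >> "$GITHUB_ENV"
      - name: Run OpenTofu
        run: tofu plan
```

Keeping the login in a single step means the ID token itself never leaves that step; only the short-lived Vault token is passed on.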
### 2. Gitea Act Runner
- Deployed on `pi1` and `pi3` (not on the Gitea host, which is `pi2`).
- Uses Docker-in-Docker for job execution.
- **Custom Runner Image (`ubuntu-latest-ca`)**: Required because the `.lab` domain uses certificates signed by a local CA that stock images do not trust. The custom image includes the local CA certificate so that jobs trust the Gitea instance (`gitea.arcodange.lab`).
- Managed via Docker Compose (`03_cicd.yml`).
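
A runner service in the Compose file might be declared as below. This is a hedged sketch, not the content of `03_cicd.yml`: the runner name, the registration token variable, and the data path are hypothetical. The host Docker socket is mounted here for brevity; a Docker-in-Docker setup would replace that mount.

```yaml
# Docker Compose sketch for one runner host (values are illustrative)
services:
  act_runner:
    image: gitea/act_runner:latest
    restart: unless-stopped
    environment:
      GITEA_INSTANCE_URL: https://gitea.arcodange.lab
      # Hypothetical variable holding the registration token issued by Gitea.
      GITEA_RUNNER_REGISTRATION_TOKEN: ${GITEA_RUNNER_REGISTRATION_TOKEN}
      GITEA_RUNNER_NAME: pi1-runner        # hypothetical name
      # Maps the workflow label "ubuntu-latest-ca" to the custom image.
      GITEA_RUNNER_LABELS: "ubuntu-latest-ca:docker://gitea.arcodange.lab/arcodange-org/runner-images:ubuntu-latest-ca"
    volumes:
      - ./data:/data                       # assumed location for runner state
      - /var/run/docker.sock:/var/run/docker.sock
```

The label mapping is what lets a workflow select the CA-enabled image with `runs-on: ubuntu-latest-ca`.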
### 3. ArgoCD
- Deployed on the k3s cluster (via HelmChart in `/var/lib/rancher/k3s/server/manifests/argocd.yaml`).
- Uses Gitea as the source of truth for GitOps.
- Synchronizes the `factory` repository to deploy applications.
- Configured with Traefik for TLS termination.
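
As an illustration of the GitOps sync described above, an Argo CD `Application` pointing at the `factory` repository could look like this. The application name, repository URL, path, and namespaces are assumptions made for the sketch, not values taken from the repository.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: webapp                 # hypothetical application name
  namespace: argocd            # assumed Argo CD namespace
spec:
  project: default
  source:
    repoURL: https://gitea.arcodange.lab/arcodange/factory.git  # assumed clone URL
    targetRevision: main       # assumed branch
    path: apps/webapp          # hypothetical path inside the repository
  destination:
    server: https://kubernetes.default.svc   # the cluster Argo CD runs in
    namespace: webapp          # hypothetical target namespace
  syncPolicy:
    automated:
      prune: true              # delete resources that were removed from Git
      selfHeal: true           # revert manual changes made in the cluster
    syncOptions:
      - CreateNamespace=true
```

With `automated` sync enabled, a merge to the tracked branch is enough to trigger a deployment.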
### 4. Vault Secrets Operator
- Deployed in the `tools` namespace.
- Manages secrets for applications deployed via ArgoCD.
- Integrates with Gitea OIDC for authentication.
- **Helm Chart Integration**:
  - `VaultAuth`: Authenticates with Vault using Kubernetes service accounts.
  - `VaultStaticSecret`: Retrieves static secrets (e.g., `kvv2/webapp/config`).
  - `VaultDynamicSecret`: Generates dynamic secrets (e.g., PostgreSQL credentials).
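
The three custom resources listed above could be templated in an application's Helm chart roughly as follows. This is a sketch under assumptions: the `webapp` namespace, service account, and Vault role, the `kubernetes` auth mount, and the `postgres` secrets mount are hypothetical. The secret paths and destination names follow the examples used in this ADR.

```yaml
apiVersion: secrets.hashicorp.com/v1beta1
kind: VaultAuth
metadata:
  name: webapp-auth
  namespace: webapp              # hypothetical namespace
spec:
  method: kubernetes
  mount: kubernetes              # assumed auth mount path in Vault
  kubernetes:
    role: webapp                 # hypothetical Vault role
    serviceAccount: webapp       # hypothetical service account
    audiences:
      - vault
---
apiVersion: secrets.hashicorp.com/v1beta1
kind: VaultStaticSecret
metadata:
  name: webapp-config
  namespace: webapp
spec:
  vaultAuthRef: webapp-auth
  type: kv-v2
  mount: kvv2
  path: webapp/config
  refreshAfter: 30s              # how often the operator re-reads the secret
  destination:
    name: secretkv               # Kubernetes Secret created by the operator
    create: true
---
apiVersion: secrets.hashicorp.com/v1beta1
kind: VaultDynamicSecret
metadata:
  name: webapp-db
  namespace: webapp
spec:
  vaultAuthRef: webapp-auth
  mount: postgres                # assumed database secrets engine mount
  path: creds/webapp
  destination:
    name: vso-db-credentials
    create: true
  rolloutRestartTargets:         # restart the workload when credentials rotate
    - kind: Deployment
      name: webapp               # hypothetical Deployment name
```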
### 5. Security
- **TLS**: Traefik terminates TLS, using Let's Encrypt for public hostnames; internal `.lab` hostnames rely on the local CA.
- **OIDC**: Gitea authentication via Vault.
- **Secrets**: Stored in Vault, injected via the Vault Secrets Operator.
## Architecture Diagram
```mermaid
%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#333333', 'edgeLabelBackground':'#f0f0f0', 'tertiaryColor': '#e67e22'}}}%%
graph TD
%% Styles
classDef gitea fill:#ffcc99,stroke:#cc9966,color:#333;
classDef argocd fill:#99ffcc,stroke:#66cc99,color:#333;
classDef vault fill:#ccccff,stroke:#6666cc,color:#333;
classDef k3s fill:#ff9999,stroke:#cc0000,color:#333;
classDef runner fill:#ffff99,stroke:#cccc00,color:#333;
%% Components
Gitea["Gitea (pi2)"]:::gitea
ArgoCD["ArgoCD (k3s)"]:::argocd
Vault["Vault (k3s/tools)"]:::vault
Runner1["Gitea Act Runner (pi1)"]:::runner
Runner2["Gitea Act Runner (pi3)"]:::runner
VaultOperator["Vault Secrets Operator (k3s/tools)"]:::vault
k3s["k3s Cluster"]:::k3s
%% Workflow
Gitea -->|OIDC Auth| Vault
Gitea -->|Trigger CI| Runner1
Gitea -->|Trigger CI| Runner2
Runner1 -->|Deploy to| k3s
Runner2 -->|Deploy to| k3s
ArgoCD -->|GitOps Sync| Gitea
ArgoCD -->|Deploy Apps| k3s
VaultOperator -->|Inject Secrets| k3s
Vault -->|Secrets| VaultOperator
%% Annotations
linkStyle 0,1,2,3,4,5,6,7,8 stroke:#999,stroke-width:1px;
```
## Consequences
### Positive
- **Automated Deployments**: ArgoCD ensures the cluster state matches Git.
- **Secure Secrets**: Vault centralizes secret management.
- **Scalable CI**: Gitea Act Runners can be added to any host.
- **OIDC Integration**: Secure authentication via Vault.
### Negative
- **Complexity**: Multiple moving parts (Gitea, ArgoCD, Vault).
- **Dependency on Vault**: If Vault fails, CI/CD may be disrupted.
- **Learning Curve**: Requires familiarity with GitOps and Vault.
## Alternatives Considered
### Alternative 1: GitHub Actions
- **Rejected**: Self-hosted Gitea aligns better with the home lab's privacy goals.
### Alternative 2: Jenkins
- **Rejected**: ArgoCD + Gitea Act Runner is lighter and more GitOps-native.
### Alternative 3: No CI/CD
- **Rejected**: Manual deployments are error-prone and unscalable.
## Sequence Diagrams
### 1. CI/CD Workflow for OpenTofu/Terraform
```mermaid
sequenceDiagram
participant Gitea
participant Runner as Gitea Act Runner (pi1/pi3)
participant Vault
participant WebApp as WebApp (k3s)
Gitea->>Runner: Trigger vault.yaml workflow
Runner->>Gitea: Execute vault_oauth__sh_b64 (OIDC)
Gitea-->>Runner: Return ID Token
Runner->>Vault: Authenticate with ID Token
Vault-->>Runner: Return Vault Token
Runner->>Runner: Run OpenTofu/Terraform
Runner->>Vault: Fetch Secrets (via Vault Action)
Vault-->>Runner: Return Secrets
Runner->>WebApp: Deploy Changes
```
### 2. Vault Secrets Operator Workflow
```mermaid
sequenceDiagram
participant ArgoCD
participant WebApp as WebApp (k3s)
participant VaultOperator as Vault Secrets Operator
participant Vault
ArgoCD->>WebApp: Deploy Helm Chart
WebApp->>VaultOperator: Create VaultAuth (K8s Auth)
VaultOperator->>Vault: Authenticate (K8s Service Account)
Vault-->>VaultOperator: Return Vault Token
WebApp->>VaultOperator: Create VaultStaticSecret (kvv2/webapp/config)
VaultOperator->>Vault: Fetch Static Secret
Vault-->>VaultOperator: Return Secret
VaultOperator->>WebApp: Inject Secret (secretkv)
WebApp->>VaultOperator: Create VaultDynamicSecret (postgres/creds/webapp)
VaultOperator->>Vault: Generate Dynamic Secret
Vault-->>VaultOperator: Return Credentials
VaultOperator->>WebApp: Inject Credentials (vso-db-credentials)
WebApp->>WebApp: Restart Pods (Rollout)
```
## Success Metrics
- Gitea Act Runners successfully execute CI jobs.
- ArgoCD synchronizes the `factory` repository without errors.
- Vault Secrets Operator injects secrets into deployed applications.


@@ -0,0 +1,130 @@
# ADR 20260407: Docker Storage Optimization for Gitea Act Runner
## Status
Proposed
## Context
The `pi3` machine (Raspberry Pi) is running both Docker and k3s, with the following storage constraints:
- Root filesystem (`/dev/mmcblk0p2`): 58G total, 89% used (6.4G free)
- External disk (`/dev/sda1`): 458G total, 22G used (413G free)
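Gitea Act Runner images (`ubuntu-latest` and `ubuntu-latest-ca`) are frequently deleted, which disrupts CI/CD pipelines. The likely cause is image garbage collection triggered by low disk space: Docker does not prune images on its own, but the kubelet that k3s runs against Docker starts removing unused images once disk usage passes its high threshold (85% by default), and the root filesystem is at 89%.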
### Current Setup
- Docker is configured via Ansible (`system_docker.yml`) using the `geerlingguy.docker` role.
- k3s is configured to use Docker as the container runtime (`--docker` flag).
- Longhorn is used for persistent storage in k3s, and we want to preserve its performance.
## Decision
We will implement a **hybrid storage strategy** to prevent Gitea Act Runner image deletion while maintaining Longhorn performance:
### 1. Pin Critical Images
Use a dummy container to pin the Gitea Act Runner images:
```yaml
# Add to system_docker.yml or a new playbook
- name: Pin Gitea Act Runner images
  community.docker.docker_container:
    name: pin-gitea-runner-ubuntu-latest-ca
    image: gitea.arcodange.lab/arcodange-org/runner-images:ubuntu-latest-ca
    state: present
    command: ["sh", "-c", "sleep infinity"]
    auto_remove: false
    restart_policy: unless-stopped
```
### 2. Configure Docker Storage with Overlay on External Disk
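Modify `/etc/docker/daemon.json` so that `data-root` points to the external disk. This relocates all of Docker's data (images, containers, volumes, and their metadata) off the root filesystem. The `overlay2.override_kernel_check` option is deprecated in recent Docker Engine releases, so confirm that the installed version still accepts it before applying: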
```json
{
  "data-root": "/mnt/arcodange/docker",
  "storage-driver": "overlay2",
  "storage-opts": ["overlay2.override_kernel_check=true"]
}
```
### 3. Ansible Implementation
Update `system_docker.yml` to:
1. Create `/mnt/arcodange/docker` if it doesn't exist.
2. Configure Docker to use the external disk.
3. Pin critical images post-installation.
```yaml
# Add to system_docker.yml tasks
- name: Ensure Docker storage directory exists on external disk
  ansible.builtin.file:
    path: /mnt/arcodange/docker
    state: directory
    mode: '0755'
    owner: root
    group: docker

- name: Configure Docker to use external storage
  ansible.builtin.copy:
    dest: /etc/docker/daemon.json
    content: |
      {
        "data-root": "/mnt/arcodange/docker",
        "storage-driver": "overlay2",
        "storage-opts": ["overlay2.override_kernel_check=true"],
        "log-driver": "json-file",
        "log-opts": {
          "max-size": "10m",
          "max-file": "5"
        }
      }
    mode: '0644'
  notify: Redémarrer Docker

# A created container references its image, so the image cannot be removed.
- name: Pin Gitea Act Runner images
  community.docker.docker_container:
    name: "{{ item.name }}"
    image: "{{ item.image }}"
    state: present
    command: ["sh", "-c", "sleep infinity"]
    auto_remove: false
    restart_policy: unless-stopped
  loop:
    - { name: "pin-gitea-runner-ubuntu-latest", image: "gitea/runner-images:ubuntu-latest" }
    - { name: "pin-gitea-runner-ubuntu-latest-ca", image: "gitea.arcodange.lab/arcodange-org/runner-images:ubuntu-latest-ca" }
```
## Consequences
### Positive
- **Prevents Image Deletion**: Critical images are pinned and won't be garbage-collected.
- **Preserves Longhorn Performance**: Longhorn continues to use the root filesystem for its operations, maintaining performance.
- **Scalable Storage**: Docker images are stored on the external disk (413G free), preventing root filesystem exhaustion.
- **No k3s Changes Required**: k3s continues to use Docker as the runtime without modification.
### Negative
- **Migration Effort**: Existing Docker data must be migrated to the external disk (one-time operation).
- **Dependency on External Disk**: If `/dev/sda1` fails, Docker will not function until the disk is remounted or the configuration is reverted.
- **Slight Performance Overhead**: Accessing images from the external disk may be slightly slower than from the root filesystem; the actual impact depends on whether the disk is an SSD or an HDD.
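
One way to soften the dependency on the external disk is to make the Docker service wait for the mount, so that Docker never starts against an empty mount point. The tasks below are a sketch, assuming `/mnt/arcodange` is the mount point of `/dev/sda1` and is declared in `/etc/fstab` or a mount unit. The drop-in file name is hypothetical.

```yaml
# Possible addition to system_docker.yml (illustrative)
- name: Ensure the Docker systemd drop-in directory exists
  ansible.builtin.file:
    path: /etc/systemd/system/docker.service.d
    state: directory
    mode: '0755'

- name: Make Docker require the external disk mount
  ansible.builtin.copy:
    dest: /etc/systemd/system/docker.service.d/10-require-external-disk.conf  # hypothetical file name
    content: |
      [Unit]
      RequiresMountsFor=/mnt/arcodange
    mode: '0644'
  notify: Redémarrer Docker

- name: Reload systemd so that the drop-in is taken into account
  ansible.builtin.systemd:
    daemon_reload: true
```

With `RequiresMountsFor=`, systemd orders the Docker unit after the mount and refuses to start it when the mount is missing, which makes a disk failure visible as a failed service instead of an empty image store.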
## Alternatives Considered
### Alternative 1: Increase Root Filesystem Size
- **Rejected**: The SD card is already at capacity, and expanding it is not feasible.
### Alternative 2: Disable Image Garbage Collection
- **Rejected**: This would risk filling the root filesystem completely, causing system instability.
### Alternative 3: Use k3s Image Garbage Collection
- **Rejected**: k3s does not provide fine-grained control over image retention for non-k8s workloads (e.g., Gitea Act Runner).
### Alternative 4: Save/Load Images Manually
- **Rejected**: Manual intervention is not scalable and does not address the root cause.
## Migration Plan
1. **Backup**: Save critical images to `/mnt/arcodange`:
   ```bash
   docker save gitea.arcodange.lab/arcodange-org/runner-images:ubuntu-latest-ca -o /mnt/arcodange/gitea-runner-backup.tar
   ```
2. **Update Ansible**: Apply the changes to `system_docker.yml`.
3. **Run Playbook**: Execute the playbook to reconfigure Docker.
4. **Verify**: Ensure Gitea Act Runner functions correctly post-migration.
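
The verification step could be partly automated with a few Ansible tasks. This is a sketch that only checks the Docker side (data root, storage driver, and image presence); it does not run a CI job.

```yaml
# Post-migration checks (illustrative)
- name: Read Docker daemon information
  community.docker.docker_host_info:
  register: docker_info

- name: Check that Docker stores its data on the external disk
  ansible.builtin.assert:
    that:
      - docker_info.host_info.DockerRootDir == '/mnt/arcodange/docker'
      - docker_info.host_info.Driver == 'overlay2'

- name: Look up the pinned runner images
  community.docker.docker_image_info:
    name:
      - gitea/runner-images:ubuntu-latest
      - gitea.arcodange.lab/arcodange-org/runner-images:ubuntu-latest-ca
  register: runner_images

- name: Check that both runner images are present
  ansible.builtin.assert:
    that:
      - runner_images.images | length == 2
```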
## Success Metrics
- Gitea Act Runner images are no longer deleted between runs.
- Root filesystem usage drops below 80%.
- CI/CD pipelines complete without image pull errors.


@@ -0,0 +1,334 @@
# ADR 20260407: Network Architecture
## Status
Proposed
## Context
The home lab requires a secure and resilient network architecture to support:
- Internal services (`.lab` domain).
- External services (`.arcodange.fr` domain).
- DNS resolution and ad-blocking (Pi-hole).
- TLS certificate management (Step CA).
- Ingress routing (Traefik).
- CDN and DDoS protection (Cloudflare).
## Decision
We will implement a **multi-layered network architecture** with the following components:
### 1. External Layer (Internet)
- **Cloudflare**: CDN, DDoS protection, and DNS for `.arcodange.fr`.
- **DuckDNS**: Dynamic DNS for external access.
- **Livebox**: ISP-provided gateway (NAT, DHCP, firewall).
### 2. Internal Layer (Home Lab)
- **Pi-hole (pi1, pi3)**: DNS sinkhole for ad-blocking and internal DNS resolution.
- **Step CA (pi1)**: Internal certificate authority for `.lab` domain.
- **Traefik (k3s)**: Ingress controller with TLS termination.
- **k3s Cluster**: Hosts internal services with Longhorn storage.
### 3. DNS Architecture
- **Pi-hole**: Primary DNS for internal clients.
  - Forwards `.lab` queries to Step CA.
  - Forwards external queries to Cloudflare (1.1.1.1).
- **Step CA**: Issues certificates for `.lab` services.
- **Cloudflare**: Manages `.arcodange.fr` DNS records.
### 4. Ingress and TLS
- **Traefik**: Terminates TLS for both `.lab` and `.arcodange.fr` domains.
  - Uses Let's Encrypt for `.arcodange.fr`.
  - Uses Step CA for `.lab`.
- **Helm Chart Annotations**:
  - `traefik.ingress.kubernetes.io/router.entrypoints: websecure`
  - `traefik.ingress.kubernetes.io/router.tls.certresolver: letsencrypt`
  - `traefik.ingress.kubernetes.io/router.middlewares: localIp@file`
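
Applied to a Kubernetes `Ingress`, the annotations above would sit as follows. This is a minimal sketch: the `webapp` name, namespace, service, and port are hypothetical, and the host is borrowed from the ingress flow diagram in this ADR.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: webapp                   # hypothetical
  namespace: webapp              # hypothetical
  annotations:
    traefik.ingress.kubernetes.io/router.entrypoints: websecure
    traefik.ingress.kubernetes.io/router.tls.certresolver: letsencrypt
    traefik.ingress.kubernetes.io/router.middlewares: localIp@file
spec:
  rules:
    - host: webapp.arcodange.fr
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: webapp     # hypothetical Service name
                port:
                  number: 80     # assumed Service port
```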
### 5. Security
- **Cloudflare Tunnel**: Securely exposes internal services without port forwarding.
- **CrowdSec**: Intrusion detection and banning.
- **Traefik Middlewares**: IP filtering, rate limiting, and authentication.
- **Cloudflare Turnstile**: CAPTCHA protection for public-facing services.
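
The `localIp@file` middleware referenced in the annotations is one example of IP filtering. A definition in Traefik's file provider could look like the sketch below, assuming the middleware is meant to allow only the internal network (192.168.1.0/24) shown in the diagrams of this ADR.

```yaml
# Traefik file-provider dynamic configuration (illustrative)
http:
  middlewares:
    localIp:                     # referenced from Ingress annotations as localIp@file
      ipAllowList:               # older Traefik v2 releases name this option ipWhiteList
        sourceRange:
          - 192.168.1.0/24       # assumed: only internal clients are allowed
```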
## Architecture Diagrams
### 0. High-Level Network Architecture (Architecture Beta)
```mermaid
%%{init: {'theme': 'neutral', 'themeVariables': {
'primaryColor': '#f0f0f0',
'primaryBorderColor': '#333333',
'primaryTextColor': '#333333',
'lineColor': '#333333',
'tertiaryColor': '#e67e22'
}}}%%
architectureBeta
%% External Layer
box "Internet" #f9f9f9
component Cloudflare["Cloudflare\n(CDN/DNS)"] #f9f9f9
component DuckDNS["DuckDNS\n(DDNS)"] #f9f9f9
end
%% External Gateway
box "External Gateway" #e6e6e6
component Livebox["Livebox\n(NAT/Firewall)"] #e6e6e6
end
%% Internal Layer
box "Internal Network\n(192.168.1.0/24)" #d4d4d4
%% DNS Layer
box "DNS" #ffff99
component PiHole1["Pi-hole\n(pi1)"] #ffff99
component PiHole3["Pi-hole\n(pi3)"] #ffff99
component StepCA["Step CA\n(pi1)"] #ccccff
end
%% k3s Layer
box "k3s Cluster" #ff9999
component Traefik["Traefik\n(Ingress)"] #ff9999
component CrowdSec["CrowdSec\n(Security)"] #ff9999
component Gitea["Gitea\n(pi2)"] #ffcc99
component Vault["Vault\n(Secrets)"] #ccccff
end
end
%% Connections
Cloudflare --> Livebox : "DNS"
DuckDNS --> Livebox : "DDNS"
Livebox --> PiHole1 : "NAT"
Livebox --> PiHole3 : "NAT"
Livebox --> Traefik : "NAT"
PiHole1 --> StepCA : "Forward .lab"
PiHole1 --> Cloudflare : "Forward External"
PiHole3 --> StepCA : "Forward .lab"
PiHole3 --> Cloudflare : "Forward External"
Traefik --> Cloudflare : "TLS (Let's Encrypt)"
Traefik --> StepCA : "TLS (Step CA)"
CrowdSec --> Traefik : "Ban IPs"
Traefik --> Gitea : "Route"
Traefik --> Vault : "Route"
```
### 1. High-Level Network Architecture
```mermaid
%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#333333', 'edgeLabelBackground':'#f0f0f0', 'tertiaryColor': '#f89136'}}}%%
graph TD
%% Styles
classDef internet fill:#f9f9f9,stroke:#999,color:#333;
classDef external fill:#e6e6e6,stroke:#555,color:#333;
classDef internal fill:#d4d4d4,stroke:#777,color:#333;
classDef security fill:#ff9999,stroke:#cc0000,color:#333;
classDef dns fill:#ffff99,stroke:#cccc00,color:#333;
classDef ca fill:#ccccff,stroke:#6666cc,color:#333;
%% Internet
subgraph "Internet"
Cloudflare["Cloudflare (CDN/DNS)"]:::internet
DuckDNS["DuckDNS (DDNS)"]:::internet
end
%% External Gateway
subgraph "External Gateway"
Livebox["Livebox (NAT/Firewall)"]:::external
end
%% Internal Network
subgraph "Internal Network (192.168.1.0/24)"
%% Pi-hole DNS
PiHole1["Pi-hole (pi1)"]:::dns
PiHole3["Pi-hole (pi3)"]:::dns
%% Step CA
StepCA["Step CA (pi1)"]:::ca
%% k3s Cluster
k3s["k3s Cluster"]:::internal
Traefik["Traefik (k3s)"]:::internal
CrowdSec["CrowdSec (k3s)"]:::security
%% Services
Gitea["Gitea (pi2)"]:::internal
Vault["Vault (k3s)"]:::internal
end
%% Connections
Cloudflare -->|DNS| Livebox
DuckDNS -->|DDNS| Livebox
Livebox -->|NAT| PiHole1
Livebox -->|NAT| PiHole3
Livebox -->|NAT| k3s
%% Internal DNS
PiHole1 -->|Forward .lab| StepCA
PiHole1 -->|Forward External| Cloudflare
PiHole3 -->|Forward .lab| StepCA
PiHole3 -->|Forward External| Cloudflare
%% Ingress
Traefik -->|"TLS (Let's Encrypt)"| Cloudflare
Traefik -->|"TLS (Step CA)"| StepCA
CrowdSec -->|Ban IPs| Traefik
%% Service Access
Traefik -->|Route| Gitea
Traefik -->|Route| Vault
```
### 2. DNS Resolution Flow
```mermaid
sequenceDiagram
participant Client
participant PiHole
participant StepCA
participant Cloudflare
participant ExternalDNS
Client->>PiHole: Query example.lab
PiHole->>StepCA: Forward .lab query
StepCA-->>PiHole: Return A record
PiHole-->>Client: Return response
Client->>PiHole: Query example.com
PiHole->>Cloudflare: Forward to 1.1.1.1
Cloudflare->>ExternalDNS: Resolve externally
ExternalDNS-->>Cloudflare: Return response
Cloudflare-->>PiHole: Return response
PiHole-->>Client: Return response
```
### 3. Ingress and TLS Flow
```mermaid
sequenceDiagram
participant User
participant Cloudflare
participant Traefik
participant LE as Let's Encrypt
participant StepCA
participant Service
User->>Cloudflare: HTTPS Request (webapp.arcodange.fr)
Cloudflare->>Traefik: Forward to internal IP
Traefik->>LE: Request Certificate
LE-->>Traefik: Issue Certificate
Traefik->>Service: Route request
Service-->>Traefik: Return response
Traefik-->>Cloudflare: Return HTTPS response
Cloudflare-->>User: Return response
User->>Traefik: HTTPS Request (webapp.arcodange.lab)
Traefik->>StepCA: Request Certificate
StepCA-->>Traefik: Issue Certificate
Traefik->>Service: Route request
Service-->>Traefik: Return response
Traefik-->>User: Return HTTPS response
```
### 4. Security Flow (CrowdSec + Traefik)
```mermaid
sequenceDiagram
participant Attacker
participant Traefik
participant CrowdSec
participant BannedIPs
Attacker->>Traefik: Malicious Request
Traefik->>CrowdSec: Log suspicious activity
CrowdSec->>BannedIPs: Add IP to ban list
BannedIPs-->>Traefik: Update middleware
Traefik-->>Attacker: Block request (403)
```
## Playbook and Role Analysis
### 1. Pi-hole Deployment
- **Playbook**: `playbooks/system/pihole.yml`
- **Role**: `arcodange.factory.pihole`
- **Configuration**:
  - Upstream DNS: Cloudflare (1.1.1.1) and Step CA for `.lab`.
  - Blocklists: Ad-blocking and malware domains.
### 2. Step CA Deployment
- **Playbook**: `playbooks/ssl/ssl.yml`
- **Role**: `step_ca`
- **Configuration**:
  - Internal CA for `.lab` domain.
  - Short-lived certificates (default: 24h).
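
Since this ADR has Traefik use Let's Encrypt for `.arcodange.fr` and Step CA for `.lab`, the two sources could be declared as separate ACME certificate resolvers in Traefik's static configuration. The sketch below assumes that Step CA exposes an ACME provisioner named `acme` at a hypothetical `ca.arcodange.lab` host. The e-mail address, storage paths, challenge type, and the `stepca` resolver name are also assumptions.

```yaml
# Traefik static configuration (illustrative)
certificatesResolvers:
  letsencrypt:                   # resolver name used in the annotations of this ADR
    acme:
      email: admin@arcodange.fr            # hypothetical contact address
      storage: /data/acme.json             # assumed storage path
      tlsChallenge: {}                     # assumed challenge type
  stepca:                        # hypothetical resolver name for .lab hosts
    acme:
      caServer: https://ca.arcodange.lab/acme/acme/directory  # assumed host and provisioner
      email: admin@arcodange.fr            # hypothetical contact address
      storage: /data/acme-stepca.json      # assumed storage path
      certificatesDuration: 24             # in hours, matching the 24h default lifetime
      tlsChallenge: {}
```

Traefik would also need to trust the Step CA root certificate to reach the ACME endpoint, for example through the `LEGO_CA_CERTIFICATES` environment variable.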
### 3. Traefik Deployment
- **Playbook**: `playbooks/system/system_k3s.yml` (via k3s)
- **Helm Chart**: `traefik` (installed via k3s)
- **Key Annotations**:
  ```yaml
  traefik.ingress.kubernetes.io/router.entrypoints: websecure
  traefik.ingress.kubernetes.io/router.tls.certresolver: letsencrypt
  traefik.ingress.kubernetes.io/router.middlewares: localIp@file
  ```
### 4. CrowdSec Deployment
- **Playbook**: `playbooks/tools/crowdsec.yml`
- **Role**: `arcodange.factory.crowdsec`
- **Configuration**:
  - Bouncer integration with Traefik.
  - Custom scenarios for brute-force and bot detection.
## Consequences
### Positive
- **Resilient DNS**: Pi-hole runs on two hosts (pi1 and pi3) and provides ad-blocking and internal DNS resolution.
- **Secure TLS**: Step CA for internal services, Let's Encrypt for external.
- **DDoS Protection**: Cloudflare absorbs external attacks.
- **Intrusion Detection**: CrowdSec bans malicious IPs automatically.
### Negative
- **Complexity**: Multiple layers require careful configuration.
- **Single Point of Failure**: Internal DNS depends entirely on Pi-hole; if both instances are unavailable, clients cannot resolve `.lab` names.
- **Certificate Management**: Step CA requires maintenance for `.lab` domain.
## Alternatives Considered
### Alternative 1: Public DNS for `.lab`
- **Rejected**: Exposing internal domains is a security risk.
### Alternative 2: No Ad-Blocking
- **Rejected**: Pi-hole provides essential security and privacy.
### Alternative 3: Self-Signed Certificates
- **Rejected**: Step CA provides better usability with short-lived certs.
## Architecture Diagrams (continued)
### 5. Cloudflare Turnstile + CrowdSec Flow
```mermaid
sequenceDiagram
participant User
participant Cloudflare
participant Turnstile
participant Traefik
participant CrowdSec
participant BannedIPs
User->>Cloudflare: Request protected endpoint
Cloudflare->>Turnstile: Challenge (CAPTCHA)
Turnstile-->>Cloudflare: Return token
Cloudflare->>Traefik: Forward request with token
alt Valid Token
Traefik->>Service: Route request
Service-->>Traefik: Return response
Traefik-->>Cloudflare: Return response
Cloudflare-->>User: Return success
else Invalid Token
Traefik->>CrowdSec: Log suspicious activity
CrowdSec->>BannedIPs: Add IP to ban list
BannedIPs-->>Traefik: Update middleware
Traefik-->>Cloudflare: Block request (403)
Cloudflare-->>User: Return "Access Denied"
end
```
## Success Metrics
- Pi-hole blocks >50% of ads and trackers.
- Step CA issues certificates without downtime.
- 100% of external traffic reaches Traefik through Cloudflare.
- CrowdSec bans >10 malicious IPs per day.
- Cloudflare Turnstile blocks >90% of bot traffic.