fix(dns): harden DNS resilience after power-cut incident

During the 2026-04-13 power cut recovery, DNS resolution failures blocked
Longhorn reinstall. Root causes:
- CoreDNS forwarded to a single hardcoded Pi-hole IP instead of both HA instances
- CoreDNS main Corefile forwarded to /etc/resolv.conf which pointed to itself on pi3
- Pi-hole lacked explicit upstream DNS, relying on DHCP-provided config
- dnsmasq system service conflicted with pihole-FTL on port 53

Changes:
- k3s_dns: forward CoreDNS to both Pi-hole HA instances (pi1 + pi3) dynamically
- k3s_dns: update main CoreDNS Corefile to forward to Pi-holes instead of resolv.conf
- pihole defaults: add explicit upstream DNS servers (8.8.8.8, 1.1.1.1, 8.8.4.4)
- pihole ha_setup: write /etc/dnsmasq.d/99-upstream.conf with explicit upstreams
- rpi: add dnsmasq user to dip group and disable conflicting dnsmasq service on Pi-hole nodes

See docs/adr/20260414-internal-dns-architecture.md for full rationale.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Date:    2026-04-14 10:54:42 +02:00
Parent:  355ab11c4d
Commit:  e6fc24c101
Changed: 5 files, 191 insertions(+), 3 deletions(-)

# ADR 20260414: Internal DNS Architecture
## Status
Accepted
## Context
During the 2026-04-13 power cut incident, cluster recovery was blocked by DNS resolution failures. The investigation revealed:
1. **CoreDNS forwarding loop**: CoreDNS was configured to forward queries to `/etc/resolv.conf`, which on the node (pi3) pointed back to the node itself (`192.168.1.203`), a host that was not running a DNS service at the time
2. **Pi-hole HA misconfiguration**: Both pi1 and pi3 run Pi-hole (pihole-FTL) but:
- pi1's `dnsmasq` service was in a **failed state** due to missing `dip` group membership
- pi3's Pi-hole was running but CoreDNS couldn't reach it due to the forwarding configuration
3. **No explicit upstream DNS**: Pi-hole instances lacked explicitly configured upstream DNS servers
The cluster's HelmChart controller requires external DNS resolution to fetch charts from `charts.longhorn.io`, making DNS a critical dependency for storage provisioning and thus the entire cluster recovery process.
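For reference, the failure in point 1 came from the stock forward stanza of the k3s-bundled CoreDNS Corefile (reconstructed here, not a verbatim capture from the incident):

```coredns
# Stock k3s behavior: delegate upstream resolution to the node's resolv.conf.
# On pi3, /etc/resolv.conf named 192.168.1.203 (pi3 itself), where nothing was
# listening on port 53, so every external lookup timed out.
forward . /etc/resolv.conf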
## Decision
### 1. DNS Service Hierarchy
```
┌─────────────────┐     ┌─────────────────┐     ┌──────────────┐
│  CoreDNS Pod    │────▶│  Pi-hole (pi1)  │────▶│   8.8.8.8    │
│  (kube-system)  │     │  Pi-hole (pi3)  │     │   1.1.1.1    │
└─────────────────┘     └─────────────────┘     │   8.8.4.4    │
                                                └──────────────┘
```
### 2. CoreDNS Configuration
CoreDNS will forward **all non-cluster DNS queries** to **both Pi-hole instances** in HA configuration:
```coredns
.:53 {
    errors
    health
    ready
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    hosts /etc/coredns/NodeHosts {
        ttl 60
        reload 15s
        fallthrough
    }
    prometheus :9153
    cache 30
    loop
    reload
    import /etc/coredns/custom/*.override
    import /etc/coredns/custom/*.server
    forward . 192.168.1.201:53 192.168.1.203:53
}
```
### 3. Pi-hole HA Configuration
- **Primary**: pi1 (192.168.1.201)
- **Secondary**: pi3 (192.168.1.203)
- **Synchronization**: Gravity Sync for configuration consistency
- **Upstream DNS**: Explicitly configured to Cloudflare (1.1.1.1) and Google (8.8.8.8, 8.8.4.4)
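The commit also writes an explicit-upstream drop-in via the pihole `ha_setup` role. A minimal sketch of what `/etc/dnsmasq.d/99-upstream.conf` might contain follows; the exact file contents are an assumption, though `server=` and `no-resolv` are standard dnsmasq directives that pihole-FTL honors:

```ini
# /etc/dnsmasq.d/99-upstream.conf -- hypothetical sketch
# no-resolv: ignore /etc/resolv.conf so upstreams are always explicit,
# never DHCP-provided
no-resolv
server=1.1.1.1
server=8.8.8.8
server=8.8.4.4
```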
### 4. Pi-hole DNS Service Fix
The `dnsmasq` user must be a member of the `dip` group to bind to privileged port 53:
```bash
usermod -aG dip dnsmasq
```
This is managed via Ansible in `playbooks/system/rpi.yml`.
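A minimal sketch of what those `rpi.yml` tasks could look like (task names and structure are illustrative, not copied from the actual playbook):

```yaml
# Hypothetical Ansible tasks -- see playbooks/system/rpi.yml for the real ones
- name: Add dnsmasq user to the dip group
  ansible.builtin.user:
    name: dnsmasq
    groups: dip
    append: true

- name: Disable the conflicting system dnsmasq service on Pi-hole nodes
  ansible.builtin.systemd:
    name: dnsmasq
    enabled: false
    state: stopped
```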
## Consequences
### Positive
- **Resilience**: DNS resolution continues if one Pi-hole node fails
- **Consistency**: Both Pi-hole instances maintain synchronized configuration via Gravity Sync
- **Recovery**: Cluster can recover from power failures without manual DNS intervention
- **Explicit configuration**: Upstream DNS servers are explicitly defined, avoiding reliance on DHCP-provided config
### Negative
- **Complexity**: Additional Ansible tasks required to maintain DNS infrastructure
- **Dependency**: Cluster recovery depends on Pi-hole availability (mitigated by HA)
## Implementation
See related changes in:
- `playbooks/system/rpi.yml` - dnsmasq group membership fix
- `playbooks/dns/k3s_dns.yml` - CoreDNS forwarding to HA Pi-hole instances
- `playbooks/dns/roles/pihole/defaults/main.yml` - Explicit upstream DNS configuration
## Post-Implementation Notes
### Issue Encountered: dnsmasq vs pihole-FTL Port Conflict
During execution, we discovered that **dnsmasq** and **pihole-FTL** both attempt to bind to port 53. On pi1:
- pihole-FTL was running and handling DNS on port 53
- dnsmasq service was failing because port 53 was already in use
**Resolution**: The dnsmasq service on Pi-hole nodes is **not needed** when pihole-FTL is running, as pihole-FTL includes its own DNS server (dnsmasq) internally. The system dnsmasq service should remain **disabled** on Pi-hole nodes to avoid conflicts.
### Verification Commands
Check DNS resolution from cluster:
```bash
kubectl run dns-test --image=busybox:1.28 -it --rm --restart=Never -- \
nslookup charts.longhorn.io 192.168.1.201
# Check CoreDNS forward to both Pi-holes
kubectl get cm -n kube-system coredns -o yaml
# Check Pi-hole instances
ssh pi1 "dig @127.0.0.1 google.com +short"
ssh pi3 "dig @127.0.0.1 google.com +short"
```
## Related Incidents
- [2026-04-13-power-cut](../incidents/2026-04-13-power-cut/README.md) - Power cut caused DNS resolution failure, blocking Longhorn reinstall and Traefik recovery