dance-lessons-coach/adr/0021-jwt-secret-retention-policy.md
Gabriel Radureau 5eec64e5e8
🧪 test: add JWT secret rotation BDD scenarios and step implementations (#12)
 merge: implement JWT secret rotation with BDD scenario isolation

- Implement JWT secret rotation mechanism (closes #8)
- Add per-scenario state isolation for BDD tests (closes #14)
- Validate password reset workflow via BDD tests (closes #7)
- Fix port conflicts in test validation
- Add state tracer for debugging test execution
- Document BDD isolation strategies in ADR 0025
- Fix PostgreSQL configuration environment variables

Generated by Mistral Vibe.
Co-Authored-By: Mistral Vibe <vibe@mistral.ai>
Co-authored-by: Gabriel Radureau <arcodange@gmail.com>
Co-committed-by: Gabriel Radureau <arcodange@gmail.com>
2026-04-11 17:56:45 +02:00

# 21. JWT Secret Retention Policy
## Status
**Proposed** 🟡
## Context
The dance-lessons-coach application requires a robust JWT secret management system that balances security and user experience. As implemented in [ADR-0009](0009-hybrid-testing-approach.md), the system supports multiple JWT secrets for graceful rotation. However, the current implementation lacks a clear policy for secret retention and cleanup.
### Current State
- ✅ Multiple JWT secrets supported
- ✅ Graceful rotation implemented
- ✅ Backward compatibility maintained
- ❌ No automatic cleanup of old secrets
- ❌ No configurable retention periods
- ❌ No expiration-based secret management
### Problem Statement
Without a retention policy:
1. **Security Risk**: Old secrets accumulate indefinitely, increasing attack surface
2. **Memory Bloat**: Unbounded growth of secret storage
3. **Operational Overhead**: Manual cleanup required
4. **Compliance Issues**: May violate security policies requiring regular key rotation
### Requirements
1. **Configurable Retention**: Administrators should control how long secrets are retained
2. **Automatic Cleanup**: System should automatically remove expired secrets
3. **Backward Compatibility**: Existing tokens should continue working during retention period
4. **Sensible Defaults**: Should work out-of-the-box with secure defaults
5. **Performance**: Cleanup should not impact runtime performance
## Decision
### JWT Secret Retention Policy
Implement a configurable retention policy based on JWT TTL (Time-To-Live) with the following components:
#### 1. Configuration Structure
```yaml
jwt:
  # Token time-to-live (default: 24h)
  ttl: 24h
  # Secret retention configuration
  secret_retention:
    # Retention factor multiplier (default: 2.0)
    # Retention period = JWT TTL × retention_factor
    retention_factor: 2.0
    # Maximum retention period (safety limit, default: 72h)
    max_retention: 72h
    # Cleanup frequency for expired secrets (default: 1h)
    cleanup_interval: 1h
```
#### 2. Retention Period Calculation
```
retention_period = min(JWT_TTL × retention_factor, max_retention)
```
**Examples:**
- Default (24h TTL, 2.0 factor): `min(48h, 72h) = 48h`
- Short-lived tokens (1h TTL, 3.0 factor): `min(3h, 72h) = 3h`
- Long-lived tokens (72h TTL, 2.0 factor): `min(144h, 72h) = 72h`
#### 3. Secret Lifecycle
```mermaid
graph LR
    A[Secret Created] --> B[Active Period]
    B --> C{Retention Period}
    C -->|Expired| D[Marked for Cleanup]
    C -->|Valid| B
    D --> E[Automatic Removal]
```
#### 4. Cleanup Process
- **Frequency**: Configurable interval (default: 1 hour)
- **Scope**: Remove secrets older than retention period
- **Safety**: Never remove current primary secret
- **Logging**: Audit trail of cleanup operations
### Implementation Strategy
#### Phase 1: Configuration Framework
1. **Extend Config Package** (`pkg/config/config.go`)
   - Add JWT TTL configuration
   - Add secret retention parameters
   - Implement validation
2. **Environment Variables**
```bash
# JWT Token TTL
DLC_JWT_TTL=24h
# Secret Retention
DLC_JWT_SECRET_RETENTION_FACTOR=2.0
DLC_JWT_SECRET_MAX_RETENTION=72h
DLC_JWT_SECRET_CLEANUP_INTERVAL=1h
```
#### Phase 2: Secret Manager Enhancement
1. **Enhance JWTSecret Struct**
```go
type JWTSecret struct {
	Secret          string
	IsPrimary       bool
	CreatedAt       time.Time
	ExpiresAt       *time.Time // Now properly calculated
	RetentionPeriod time.Duration
}
```
2. **Add Expiration Logic**
```go
func (m *JWTSecretManager) AddSecret(secret string, isPrimary bool, expiresIn time.Duration) {
	// Take a single timestamp so CreatedAt and ExpiresAt stay consistent
	now := time.Now()
	// Calculate retention period based on config
	retentionPeriod := m.calculateRetentionPeriod()
	expiresAt := now.Add(expiresIn)
	m.secrets = append(m.secrets, JWTSecret{
		Secret:          secret,
		IsPrimary:       isPrimary,
		CreatedAt:       now,
		ExpiresAt:       &expiresAt,
		RetentionPeriod: retentionPeriod,
	})
}
```
#### Phase 3: Automatic Cleanup
1. **Background Cleanup Job**
```go
func (m *JWTSecretManager) StartCleanupJob(ctx context.Context, interval time.Duration) {
	ticker := time.NewTicker(interval)
	go func() {
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				m.CleanupExpiredSecrets()
			case <-ctx.Done():
				return
			}
		}
	}()
}
```
2. **Cleanup Implementation**
```go
func (m *JWTSecretManager) CleanupExpiredSecrets() {
	now := time.Now()
	var activeSecrets []JWTSecret
	for _, secret := range m.secrets {
		if secret.IsPrimary {
			// Never remove current primary
			activeSecrets = append(activeSecrets, secret)
			continue
		}
		// Check if secret is within retention period
		if now.Sub(secret.CreatedAt) <= secret.RetentionPeriod {
			activeSecrets = append(activeSecrets, secret)
		} else {
			// Log a masked value only; never write the raw secret to logs
			log.Info().
				Str("secret", maskSecret(secret.Secret)).
				Msg("Removed expired JWT secret")
		}
	}
	m.secrets = activeSecrets
}
```
#### Phase 4: Integration
1. **Server Initialization**
```go
func (s *Server) InitializeJWT() error {
	// Load config
	jwtConfig := s.config.GetJWTConfig()
	// Create secret manager with retention policy
	secretManager := NewJWTSecretManager(
		jwtConfig.Secret,
		WithRetentionFactor(jwtConfig.RetentionFactor),
		WithMaxRetention(jwtConfig.MaxRetention),
	)
	// Start cleanup job
	secretManager.StartCleanupJob(s.ctx, jwtConfig.CleanupInterval)
	return nil
}
```
### Validation
#### 1. Configuration Validation
```go
func (c *Config) ValidateJWTConfig() error {
	if c.JWT.TTL <= 0 {
		return fmt.Errorf("jwt.ttl must be positive")
	}
	if c.JWT.SecretRetention.RetentionFactor < 1.0 {
		return fmt.Errorf("jwt.secret_retention.retention_factor must be ≥ 1.0")
	}
	if c.JWT.SecretRetention.MaxRetention <= 0 {
		return fmt.Errorf("jwt.secret_retention.max_retention must be positive")
	}
	if c.JWT.SecretRetention.CleanupInterval <= 0 {
		return fmt.Errorf("jwt.secret_retention.cleanup_interval must be positive")
	}
	// Ensure max retention is reasonable (720h = 30 days)
	if c.JWT.SecretRetention.MaxRetention > 720*time.Hour {
		return fmt.Errorf("jwt.secret_retention.max_retention exceeds maximum of 720h")
	}
	return nil
}
```
#### 2. Runtime Validation
```go
func (m *JWTSecretManager) ValidateSecret(secret string) error {
	// Check minimum length
	if len(secret) < 16 {
		return fmt.Errorf("jwt secret must be at least 16 characters")
	}
	// Check entropy (basic check)
	if !hasSufficientEntropy(secret) {
		return fmt.Errorf("jwt secret must have sufficient entropy")
	}
	return nil
}
```
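`hasSufficientEntropy` is referenced above but not defined in this ADR. One possible basic check, offered purely as an assumption, estimates Shannon entropy from byte frequencies. The 3.0 bits-per-byte threshold is an arbitrary illustrative value, and a frequency check only catches obviously repetitive secrets. It cannot prove that a secret was randomly generated.

```go
package main

import "math"

// shannonBitsPerByte estimates entropy from the byte frequencies in s.
func shannonBitsPerByte(s string) float64 {
	if len(s) == 0 {
		return 0
	}
	var counts [256]int
	for i := 0; i < len(s); i++ {
		counts[s[i]]++
	}
	n := float64(len(s))
	var h float64
	for _, c := range counts {
		if c == 0 {
			continue
		}
		p := float64(c) / n
		h -= p * math.Log2(p)
	}
	return h
}

// hasSufficientEntropy is a hypothetical implementation; the threshold
// of 3.0 bits per byte is an assumption chosen for illustration.
func hasSufficientEntropy(secret string) bool {
	return shannonBitsPerByte(secret) >= 3.0
}
```

Under this sketch a secret made of one repeated character scores 0 bits per byte and is rejected, while a 16-byte secret with 16 distinct bytes scores 4 bits per byte and passes.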
### Monitoring and Observability
#### 1. Metrics
```go
// Prometheus metrics
var (
	jwtSecretsActive = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "jwt_secrets_active_count",
		Help: "Number of active JWT secrets",
	})
	jwtSecretsExpired = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "jwt_secrets_expired_total",
		Help: "Total number of expired JWT secrets removed",
	})
	jwtSecretRetentionDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "jwt_secret_retention_duration_seconds",
		Help:    "Duration of JWT secret retention periods",
		Buckets: prometheus.ExponentialBuckets(3600, 2, 6), // 1h to 32h
	})
)
```
#### 2. Logging
```go
func (m *JWTSecretManager) logSecretEvent(secret string, event string, details ...interface{}) {
	log.Info().
		Str("secret", maskSecret(secret)).
		Str("event", event).
		Interface("details", details).
		Msg("JWT secret event")
}

func maskSecret(secret string) string {
	// With 8 bytes or fewer, the 4-byte prefix and suffix would overlap
	// and reveal the whole secret, so mask it entirely.
	if len(secret) <= 8 {
		return "****"
	}
	return secret[:4] + "****" + secret[len(secret)-4:]
}
```
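Masking still exposes eight characters of each secret in the logs. An alternative that reveals none of the secret's characters is to log a short hash-based fingerprint, sketched here with the standard library. The `secretFingerprint` name is hypothetical, and the approach assumes secrets are high-entropy, since a truncated unsalted hash of a guessable secret could be matched by brute force.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
)

// secretFingerprint returns the first 8 hex characters of SHA-256(secret).
// It identifies a secret across log lines without printing any part of it.
func secretFingerprint(secret string) string {
	sum := sha256.Sum256([]byte(secret))
	return hex.EncodeToString(sum[:4])
}
```

The fingerprint is stable for a given secret, so operators can still correlate "added" and "removed" events for the same key.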
## Consequences
### Positive
1. **Enhanced Security**: Automatic cleanup reduces attack surface
2. **Reduced Memory Usage**: Prevents unbounded growth of secret storage
3. **Operational Efficiency**: No manual cleanup required
4. **Compliance Ready**: Meets security policy requirements for key rotation
5. **Flexibility**: Configurable to meet different security requirements
### Negative
1. **Complexity**: Adds configuration and cleanup logic
2. **Performance Overhead**: Background cleanup job (minimal impact)
3. **Migration**: Existing deployments need configuration updates
4. **Debugging**: More moving parts to troubleshoot
### Neutral
1. **Backward Compatibility**: Existing tokens continue to work
2. **Learning Curve**: New configuration options to understand
3. **Monitoring**: Additional metrics to track
## Alternatives Considered
### Alternative 1: Fixed Retention Period
**Proposal**: Use fixed retention period (e.g., 48 hours) instead of TTL-based calculation
**Rejected Because**:
- Less flexible for different use cases
- Doesn't scale with JWT TTL changes
- May be too short for long-lived tokens or too long for short-lived ones
### Alternative 2: Manual Cleanup Only
**Proposal**: Require administrators to manually clean up old secrets
**Rejected Because**:
- Operational overhead
- Security risk if cleanup is forgotten
- Doesn't scale for frequent rotations
### Alternative 3: No Retention (Current State)
**Proposal**: Keep current behavior with no automatic cleanup
**Rejected Because**:
- Security concerns with accumulating secrets
- Memory management issues
- Compliance violations
## Success Metrics
1. **Security**: No old secrets remain beyond retention period
2. **Reliability**: 99.9% of valid tokens continue to work during rotation
3. **Performance**: Cleanup job completes in <100ms with <1000 secrets
4. **Adoption**: Configuration used in 100% of deployments within 3 months
## Migration Plan
### Phase 1: Preparation (1 week)
- ✅ Create this ADR
- ✅ Update documentation
- ✅ Add configuration to config package
- ✅ Implement basic retention logic
### Phase 2: Testing (2 weeks)
- ✅ Write BDD scenarios for retention
- ✅ Add unit tests for secret manager
- ✅ Test with various TTL/factor combinations
- ✅ Performance testing with large secret counts
### Phase 3: Rollout (1 week)
- ✅ Update default configuration
- ✅ Add feature flag for gradual rollout
- ✅ Monitor metrics in staging
- ✅ Gradual production rollout
### Phase 4: Optimization (Ongoing)
- ✅ Monitor cleanup performance
- ✅ Adjust defaults based on real-world usage
- ✅ Add alerts for cleanup failures
- ✅ Document troubleshooting guide
## References
- [ADR-0009: Hybrid Testing Approach](0009-hybrid-testing-approach.md)
- [ADR-0008: BDD Testing](0008-bdd-testing.md)
- [RFC 7519: JSON Web Tokens](https://tools.ietf.org/html/rfc7519)
- [OWASP Key Management Cheat Sheet](https://cheatsheetseries.owasp.org/cheatsheets/Key_Management_Cheat_Sheet.html)
## Appendix
### Configuration Examples
**Development Environment** (short retention for testing):
```yaml
jwt:
  ttl: 1h
  secret_retention:
    retention_factor: 1.5
    max_retention: 3h
    cleanup_interval: 30m
```
**Production Environment** (secure defaults):
```yaml
jwt:
  ttl: 24h
  secret_retention:
    retention_factor: 2.0
    max_retention: 72h
    cleanup_interval: 1h
```
**High-Security Environment** (aggressive rotation):
```yaml
jwt:
  ttl: 8h
  secret_retention:
    retention_factor: 1.5
    max_retention: 24h
    cleanup_interval: 30m
```
### Troubleshooting
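**Issue**: Secrets being removed too quickly
- **Check**: Retention factor, JWT TTL, and `max_retention` settings (the computed period is capped at `max_retention`)
- **Fix**: Increase `retention_factor` or JWT TTL, and raise `max_retention` if the cap is the limiting value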
**Issue**: Too many old secrets accumulating
- **Check**: Cleanup job logs and interval
- **Fix**: Decrease cleanup_interval or retention_factor
**Issue**: Performance degradation during cleanup
- **Check**: Number of secrets and cleanup frequency
- **Fix**: Optimize cleanup algorithm or increase interval
### FAQ
**Q: What happens to tokens signed with expired secrets?**
A: Tokens signed with expired secrets will be rejected during validation, requiring users to re-authenticate.
**Q: Can I disable automatic cleanup?**
A: Yes, set `cleanup_interval` to a very high value (e.g., `8760h` for 1 year).
**Q: How does this affect existing deployments?**
A: Existing deployments will use sensible defaults. The feature is backward compatible.
**Q: What's the recommended retention factor?**
A: Start with 2.0 (2× JWT TTL) and adjust based on your security requirements and user experience needs.
**Q: How often should cleanup run?**
A: For most deployments, every 1 hour is sufficient. High-volume systems may need more frequent cleanup.
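To illustrate the first answer above, the sketch below shows why a token signed with a removed secret is rejected: validation only tries the secrets that are still retained. It is not a full JWT parser. It assumes HS256-style signing (HMAC-SHA256 over the `header.payload` signing input), and the function names are hypothetical.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
)

// sign computes the HMAC-SHA256 of signingInput, as HS256 does.
func sign(signingInput, secret string) []byte {
	mac := hmac.New(sha256.New, []byte(secret))
	mac.Write([]byte(signingInput))
	return mac.Sum(nil)
}

// verifyWithAny reports whether sig is valid for signingInput under any
// retained secret. hmac.Equal compares in constant time.
func verifyWithAny(signingInput string, sig []byte, retained []string) bool {
	for _, secret := range retained {
		if hmac.Equal(sig, sign(signingInput, secret)) {
			return true
		}
	}
	return false
}
```

While the old secret is within its retention period it remains in the retained list and the token verifies. Once cleanup removes it, the same signature no longer matches any candidate and the user must re-authenticate.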
## Decision Record
**Approved By**:
**Approved Date**:
**Implemented By**:
**Implementation Date**:
---
*Generated by Mistral Vibe*
*Co-Authored-By: Mistral Vibe <vibe@mistral.ai>*