Implements the cleanup half of ADR-0021 (only the config infrastructure had landed so far). Non-primary expired secrets are removed by a goroutine that runs every auth.jwt.secret_retention.cleanup_interval (default 1h). The primary secret is never removed, regardless of expiration, so the invariant is preserved.

Changes:
- pkg/user/jwt_manager.go: add sync.Mutex protection; add RemoveExpiredSecrets() int and StartCleanupLoop(ctx, interval) methods. Reset() now also cancels any running cleanup goroutine.
- pkg/user/auth_service.go: delegate to the manager via new AuthService methods StartJWTSecretCleanupLoop and RemoveExpiredJWTSecrets.
- pkg/user/user.go: extend the AuthService interface accordingly.
- pkg/server/server.go Run(): start the cleanup loop tied to rootCtx so it stops on graceful shutdown.
- pkg/jwt/*: same treatment on the secondary (less-used) implementation for consistency.
- adr/0021-jwt-secret-retention-policy.md: Status → Implemented + fix numbering (was incorrectly "10.").

Tests:
- 4 new unit tests in pkg/user/jwt_manager_test.go covering RemoveExpiredSecrets (expired removed, primary preserved, future kept) and StartCleanupLoop (fires + stops on context cancel).
- go test -race ./pkg/user/... passes.
- Full BDD suite (auth/config/greet/health/info/jwt) still green.
- BDD scenarios tagged @todo / @skip remain so: they require an admin endpoint /api/v1/admin/jwt/secrets, which is explicitly out of scope.

Verifier verdict: APPROVE_WITH_NITS. StartCleanupLoop is 34 lines (just over the 30-line guideline); the 2 time.Sleeps in TestStartCleanupLoop_FiresAndStops are justified by the goroutine-timing nature of the test.
21. JWT Secret Retention Policy
Status: Implemented (2026-05-05 — pkg/user/jwt_manager.go RemoveExpiredSecrets + StartCleanupLoop, wired in pkg/server/server.go Run; admin endpoint /api/v1/admin/jwt/secrets remains explicitly out of scope and tracked under @todo BDD scenarios)
Context
The dance-lessons-coach application requires a robust JWT secret management system that balances security and user experience. As implemented in ADR-0009, the system supports multiple JWT secrets for graceful rotation. However, the current implementation lacks a clear policy for secret retention and cleanup.
Current State
- ✅ Multiple JWT secrets supported
- ✅ Graceful rotation implemented
- ✅ Backward compatibility maintained
- ❌ No automatic cleanup of old secrets
- ❌ No configurable retention periods
- ❌ No expiration-based secret management
Problem Statement
Without a retention policy:
- Security Risk: Old secrets accumulate indefinitely, increasing attack surface
- Memory Bloat: Unbounded growth of secret storage
- Operational Overhead: Manual cleanup required
- Compliance Issues: May violate security policies requiring regular key rotation
Requirements
- Configurable Retention: Administrators should control how long secrets are retained
- Automatic Cleanup: System should automatically remove expired secrets
- Backward Compatibility: Existing tokens should continue working during retention period
- Sensible Defaults: Should work out-of-the-box with secure defaults
- Performance: Cleanup should not impact runtime performance
Decision
JWT Secret Retention Policy
Implement a configurable retention policy based on JWT TTL (Time-To-Live) with the following components:
1. Configuration Structure
jwt:
  # Token time-to-live (default: 24h)
  ttl: 24h
  # Secret retention configuration
  secret_retention:
    # Retention factor multiplier (default: 2.0)
    # Retention period = JWT TTL × retention_factor
    retention_factor: 2.0
    # Maximum retention period (safety limit, default: 72h)
    max_retention: 72h
    # Cleanup frequency for expired secrets (default: 1h)
    cleanup_interval: 1h
2. Retention Period Calculation
retention_period = min(JWT_TTL × retention_factor, max_retention)
Examples:
- Default (24h TTL, 2.0 factor): min(48h, 72h) = 48h
- Short-lived tokens (1h TTL, 3.0 factor): min(3h, 72h) = 3h
- Long-lived tokens (72h TTL, 2.0 factor): min(144h, 72h) = 72h
3. Secret Lifecycle
graph LR
A[Secret Created] --> B[Active Period]
B --> C{Retention Period}
C -->|Expired| D[Marked for Cleanup]
C -->|Valid| B
D --> E[Automatic Removal]
4. Cleanup Process
- Frequency: Configurable interval (default: 1 hour)
- Scope: Remove secrets older than retention period
- Safety: Never remove current primary secret
- Logging: Audit trail of cleanup operations
Implementation Strategy
Phase 1: Configuration Framework
- Extend Config Package (pkg/config/config.go)
  - Add JWT TTL configuration
  - Add secret retention parameters
  - Implement validation
- Environment Variables
  # JWT Token TTL
  DLC_JWT_TTL=24h
  # Secret Retention
  DLC_JWT_SECRET_RETENTION_FACTOR=2.0
  DLC_JWT_SECRET_MAX_RETENTION=72h
  DLC_JWT_SECRET_CLEANUP_INTERVAL=1h
Phase 2: Secret Manager Enhancement
- Enhance JWTSecret Struct
  type JWTSecret struct {
      Secret          string
      IsPrimary       bool
      CreatedAt       time.Time
      ExpiresAt       *time.Time // Now properly calculated
      RetentionPeriod time.Duration
  }
- Add Expiration Logic
  func (m *JWTSecretManager) AddSecret(secret string, isPrimary bool, expiresIn time.Duration) {
      // Calculate retention period based on config
      retentionPeriod := m.calculateRetentionPeriod()
      expiresAt := time.Now().Add(expiresIn)
      m.secrets = append(m.secrets, JWTSecret{
          Secret:          secret,
          IsPrimary:       isPrimary,
          CreatedAt:       time.Now(),
          ExpiresAt:       &expiresAt,
          RetentionPeriod: retentionPeriod,
      })
  }
Phase 3: Automatic Cleanup
- Background Cleanup Job
  func (m *JWTSecretManager) StartCleanupJob(ctx context.Context, interval time.Duration) {
      ticker := time.NewTicker(interval)
      go func() {
          for {
              select {
              case <-ticker.C:
                  m.CleanupExpiredSecrets()
              case <-ctx.Done():
                  ticker.Stop()
                  return
              }
          }
      }()
  }
- Cleanup Implementation
  func (m *JWTSecretManager) CleanupExpiredSecrets() {
      now := time.Now()
      var activeSecrets []JWTSecret
      for _, secret := range m.secrets {
          if secret.IsPrimary {
              // Never remove current primary
              activeSecrets = append(activeSecrets, secret)
              continue
          }
          // Check if secret is within retention period
          if now.Sub(secret.CreatedAt) <= secret.RetentionPeriod {
              activeSecrets = append(activeSecrets, secret)
          } else {
              // Log only a masked form; the raw secret must never reach the logs
              log.Info().
                  Str("secret", maskSecret(secret.Secret)).
                  Msg("Removed expired JWT secret")
          }
      }
      m.secrets = activeSecrets
  }
Phase 4: Integration
- Server Initialization
  func (s *Server) InitializeJWT() error {
      // Load config
      jwtConfig := s.config.GetJWTConfig()
      // Create secret manager with retention policy
      secretManager := NewJWTSecretManager(
          jwtConfig.Secret,
          WithRetentionFactor(jwtConfig.RetentionFactor),
          WithMaxRetention(jwtConfig.MaxRetention),
      )
      // Start cleanup job
      secretManager.StartCleanupJob(s.ctx, jwtConfig.CleanupInterval)
      return nil
  }
Validation
1. Configuration Validation
func (c *Config) ValidateJWTConfig() error {
if c.JWT.TTL <= 0 {
return fmt.Errorf("jwt.ttl must be positive")
}
if c.JWT.SecretRetention.RetentionFactor < 1.0 {
return fmt.Errorf("jwt.secret_retention.retention_factor must be ≥ 1.0")
}
if c.JWT.SecretRetention.MaxRetention <= 0 {
return fmt.Errorf("jwt.secret_retention.max_retention must be positive")
}
if c.JWT.SecretRetention.CleanupInterval <= 0 {
return fmt.Errorf("jwt.secret_retention.cleanup_interval must be positive")
}
// Ensure max retention is reasonable
if c.JWT.SecretRetention.MaxRetention > 720h { // 30 days
return fmt.Errorf("jwt.secret_retention.max_retention exceeds maximum of 720h")
}
return nil
}
2. Runtime Validation
func (m *JWTSecretManager) ValidateSecret(secret string) error {
// Check minimum length
if len(secret) < 16 {
return fmt.Errorf("jwt secret must be at least 16 characters")
}
// Check entropy (basic check)
if !hasSufficientEntropy(secret) {
return fmt.Errorf("jwt secret must have sufficient entropy")
}
return nil
}
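The ADR does not define hasSufficientEntropy. One possible shape for such a basic check is sketched below, assuming a Shannon-entropy estimate over the bytes of the secret with a threshold of 3.0 bits per byte. Both the approach and the threshold are assumptions for illustration; this measures how evenly characters are distributed and is not proof that a secret was randomly generated.

```go
package main

import (
	"fmt"
	"math"
)

// minEntropyBitsPerByte is an assumed threshold, chosen for illustration only.
const minEntropyBitsPerByte = 3.0

// shannonEntropyBits estimates the entropy of s in bits per byte from the
// observed byte frequencies.
func shannonEntropyBits(s string) float64 {
	if len(s) == 0 {
		return 0
	}
	var counts [256]int
	for i := 0; i < len(s); i++ {
		counts[s[i]]++
	}
	n := float64(len(s))
	var h float64
	for _, c := range counts {
		if c == 0 {
			continue
		}
		p := float64(c) / n
		h -= p * math.Log2(p)
	}
	return h
}

// hasSufficientEntropy is a hypothetical implementation of the helper that
// ValidateSecret calls.
func hasSufficientEntropy(secret string) bool {
	return shannonEntropyBits(secret) >= minEntropyBitsPerByte
}

func main() {
	for _, s := range []string{"aaaaaaaaaaaaaaaa", "abababababababab", "k3Jx9!pQz7Lm2@Wd"} {
		fmt.Printf("%q -> %v\n", s, hasSufficientEntropy(s))
	}
}
```

A check like this rejects obviously repetitive secrets; generating secrets from a cryptographic random source remains the stronger safeguard.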
Monitoring and Observability
1. Metrics
// Prometheus metrics
var (
jwtSecretsActive = prometheus.NewGauge(prometheus.GaugeOpts{
Name: "jwt_secrets_active_count",
Help: "Number of active JWT secrets",
})
jwtSecretsExpired = prometheus.NewCounter(prometheus.CounterOpts{
Name: "jwt_secrets_expired_total",
Help: "Total number of expired JWT secrets removed",
})
jwtSecretRetentionDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
Name: "jwt_secret_retention_duration_seconds",
Help: "Duration of JWT secret retention periods",
Buckets: prometheus.ExponentialBuckets(3600, 2, 6), // 1h to 32h
})
)
2. Logging
func (m *JWTSecretManager) logSecretEvent(secret string, event string, details ...interface{}) {
log.Info().
Str("secret", maskSecret(secret)).
Str("event", event).
Interface("details", details).
Msg("JWT secret event")
}
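func maskSecret(secret string) string {
// With 8 or fewer characters, a 4-character prefix plus a 4-character
// suffix would reveal the whole secret, so mask it entirely.
if len(secret) <= 8 {
return "****"
}
return secret[:4] + "****" + secret[len(secret)-4:]
}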
Consequences
Positive
- Enhanced Security: Automatic cleanup reduces attack surface
- Reduced Memory Usage: Prevents unbounded growth of secret storage
- Operational Efficiency: No manual cleanup required
- Compliance Ready: Meets security policy requirements for key rotation
- Flexibility: Configurable to meet different security requirements
Negative
- Complexity: Adds configuration and cleanup logic
- Performance Overhead: Background cleanup job (minimal impact)
- Migration: Existing deployments need configuration updates
- Debugging: More moving parts to troubleshoot
Neutral
- Backward Compatibility: Existing tokens continue to work
- Learning Curve: New configuration options to understand
- Monitoring: Additional metrics to track
Alternatives Considered
Alternative 1: Fixed Retention Period
Proposal: Use fixed retention period (e.g., 48 hours) instead of TTL-based calculation
Rejected Because:
- Less flexible for different use cases
- Doesn't scale with JWT TTL changes
- May be too short for long-lived tokens or too long for short-lived ones
Alternative 2: Manual Cleanup Only
Proposal: Require administrators to manually clean up old secrets
Rejected Because:
- Operational overhead
- Security risk if cleanup is forgotten
- Doesn't scale for frequent rotations
Alternative 3: No Retention (Current State)
Proposal: Keep current behavior with no automatic cleanup
Rejected Because:
- Security concerns with accumulating secrets
- Memory management issues
- Compliance violations
Success Metrics
- Security: No old secrets remain beyond retention period
- Reliability: 99.9% of valid tokens continue to work during rotation
- Performance: Cleanup job completes in <100ms with <1000 secrets
- Adoption: Configuration used in 100% of deployments within 3 months
Migration Plan
Phase 1: Preparation (1 week)
- ✅ Create this ADR
- ✅ Update documentation
- ✅ Add configuration to config package
- ✅ Implement basic retention logic
Phase 2: Testing (2 weeks)
- ✅ Write BDD scenarios for retention
- ✅ Add unit tests for secret manager
- ✅ Test with various TTL/factor combinations
- ✅ Performance testing with large secret counts
Phase 3: Rollout (1 week)
- ✅ Update default configuration
- ✅ Add feature flag for gradual rollout
- ✅ Monitor metrics in staging
- ✅ Gradual production rollout
Phase 4: Optimization (Ongoing)
- ✅ Monitor cleanup performance
- ✅ Adjust defaults based on real-world usage
- ✅ Add alerts for cleanup failures
- ✅ Document troubleshooting guide
References
- ADR-0009: Hybrid Testing Approach
- ADR-0008: BDD Testing
- RFC 7519: JSON Web Tokens
- OWASP Key Management Cheat Sheet
Appendix
Configuration Examples
Development Environment (short retention for testing):
jwt:
  ttl: 1h
  secret_retention:
    retention_factor: 1.5
    max_retention: 3h
    cleanup_interval: 30m
Production Environment (secure defaults):
jwt:
  ttl: 24h
  secret_retention:
    retention_factor: 2.0
    max_retention: 72h
    cleanup_interval: 1h
High-Security Environment (aggressive rotation):
jwt:
  ttl: 8h
  secret_retention:
    retention_factor: 1.5
    max_retention: 24h
    cleanup_interval: 30m
Troubleshooting
Issue: Secrets being removed too quickly
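- Check: Retention factor, JWT TTL, and max_retention settings (the retention period is capped at max_retention)
- Fix: Increase retention_factor or JWT TTL, or raise max_retention if the cap is the limiting value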
Issue: Too many old secrets accumulating
- Check: Cleanup job logs and interval
- Fix: Decrease cleanup_interval or retention_factor
Issue: Performance degradation during cleanup
- Check: Number of secrets and cleanup frequency
- Fix: Optimize cleanup algorithm or increase interval
FAQ
Q: What happens to tokens signed with expired secrets?
A: Once an expired secret has been removed by cleanup, tokens signed with it are rejected during validation and users must re-authenticate.
Q: Can I disable automatic cleanup?
A: There is no dedicated off switch, since validation rejects a non-positive cleanup_interval. Setting cleanup_interval to a very high value (e.g., 8760h for 1 year) effectively disables it.
Q: How does this affect existing deployments?
A: Existing deployments will use sensible defaults. The feature is backward compatible.
Q: What's the recommended retention factor?
A: Start with 2.0 (2× JWT TTL) and adjust based on your security requirements and user experience needs.
Q: How often should cleanup run?
A: For most deployments, every 1 hour is sufficient. High-volume systems may need more frequent cleanup.
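To make the first answer concrete, the sketch below shows why a token stops validating once its signing secret is removed. It assumes HS256-style HMAC signatures and a validator that tries every retained secret; the helpers sign and verifiedByAny are hypothetical and do not reproduce the application's actual token validation.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"fmt"
)

// sign computes an HMAC-SHA256 signature over the token's signing input
// (the "header.payload" part of a JWT).
func sign(secret, signingInput string) []byte {
	mac := hmac.New(sha256.New, []byte(secret))
	mac.Write([]byte(signingInput))
	return mac.Sum(nil)
}

// verifiedByAny reports whether any retained secret reproduces the signature.
// hmac.Equal performs a constant-time comparison.
func verifiedByAny(secrets []string, signingInput string, sig []byte) bool {
	for _, s := range secrets {
		if hmac.Equal(sign(s, signingInput), sig) {
			return true
		}
	}
	return false
}

func main() {
	input := "header.payload"
	sig := sign("old-secret", input) // token issued before rotation

	// During the retention period the old secret is still held.
	fmt.Println(verifiedByAny([]string{"new-secret", "old-secret"}, input, sig))

	// After cleanup removes the old secret, the same token no longer verifies.
	fmt.Println(verifiedByAny([]string{"new-secret"}, input, sig))
}
```

This is why the retention period should be at least the token TTL: a shorter period would invalidate tokens that have not yet expired on their own.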
Decision Record
Approved By:
Approved Date:
Implemented By:
Implementation Date:
Generated by Mistral Vibe
Co-Authored-By: Mistral Vibe <vibe@mistral.ai>