# 10. JWT Secret Retention Policy **Status:** Proposed ## Context The dance-lessons-coach application requires a robust JWT secret management system that balances security and user experience. As implemented in [ADR-0009](0009-hybrid-testing-approach.md), the system supports multiple JWT secrets for graceful rotation. However, the current implementation lacks a clear policy for secret retention and cleanup. ### Current State - ✅ Multiple JWT secrets supported - ✅ Graceful rotation implemented - ✅ Backward compatibility maintained - ❌ No automatic cleanup of old secrets - ❌ No configurable retention periods - ❌ No expiration-based secret management ### Problem Statement Without a retention policy: 1. **Security Risk**: Old secrets accumulate indefinitely, increasing attack surface 2. **Memory Bloat**: Unbounded growth of secret storage 3. **Operational Overhead**: Manual cleanup required 4. **Compliance Issues**: May violate security policies requiring regular key rotation ### Requirements 1. **Configurable Retention**: Administrators should control how long secrets are retained 2. **Automatic Cleanup**: System should automatically remove expired secrets 3. **Backward Compatibility**: Existing tokens should continue working during retention period 4. **Sensible Defaults**: Should work out-of-the-box with secure defaults 5. **Performance**: Cleanup should not impact runtime performance ## Decision ### JWT Secret Retention Policy Implement a configurable retention policy based on JWT TTL (Time-To-Live) with the following components: #### 1. Configuration Structure ```yaml jwt: # Token time-to-live (default: 24h) ttl: 24h # Secret retention configuration secret_retention: # Retention factor multiplier (default: 2.0) # Retention period = JWT TTL × retention_factor retention_factor: 2.0 # Maximum retention period (safety limit, default: 72h) max_retention: 72h # Cleanup frequency for expired secrets (default: 1h) cleanup_interval: 1h ``` #### 2. Retention Period Calculation ``` retention_period = min(JWT_TTL × retention_factor, max_retention) ``` **Examples:** - Default (24h TTL, 2.0 factor): `min(48h, 72h) = 48h` - Short-lived tokens (1h TTL, 3.0 factor): `min(3h, 72h) = 3h` - Long-lived tokens (72h TTL, 2.0 factor): `min(144h, 72h) = 72h` #### 3. Secret Lifecycle ```mermaid graph LR A[Secret Created] --> B[Active Period] B --> C{Retention Period} C -->|Expired| D[Marked for Cleanup] C -->|Valid| B D --> E[Automatic Removal] ``` #### 4. Cleanup Process - **Frequency**: Configurable interval (default: 1 hour) - **Scope**: Remove secrets older than retention period - **Safety**: Never remove current primary secret - **Logging**: Audit trail of cleanup operations ### Implementation Strategy #### Phase 1: Configuration Framework 1. **Extend Config Package** (`pkg/config/config.go`) - Add JWT TTL configuration - Add secret retention parameters - Implement validation 2. **Environment Variables** ```bash # JWT Token TTL DLC_JWT_TTL=24h # Secret Retention DLC_JWT_SECRET_RETENTION_FACTOR=2.0 DLC_JWT_SECRET_MAX_RETENTION=72h DLC_JWT_SECRET_CLEANUP_INTERVAL=1h ``` #### Phase 2: Secret Manager Enhancement 1. **Enhance JWTSecret Struct** ```go type JWTSecret struct { Secret string IsPrimary bool CreatedAt time.Time ExpiresAt *time.Time // Now properly calculated RetentionPeriod time.Duration } ``` 2. **Add Expiration Logic** ```go func (m *JWTSecretManager) AddSecret(secret string, isPrimary bool, expiresIn time.Duration) { // Calculate retention period based on config retentionPeriod := m.calculateRetentionPeriod() expiresAt := time.Now().Add(expiresIn) m.secrets = append(m.secrets, JWTSecret{ Secret: secret, IsPrimary: isPrimary, CreatedAt: time.Now(), ExpiresAt: &expiresAt, RetentionPeriod: retentionPeriod, }) } ``` #### Phase 3: Automatic Cleanup 1. **Background Cleanup Job** ```go func (m *JWTSecretManager) StartCleanupJob(ctx context.Context, interval time.Duration) { ticker := time.NewTicker(interval) go func() { for { select { case <-ticker.C: m.CleanupExpiredSecrets() case <-ctx.Done(): ticker.Stop() return } } }() } ``` 2. **Cleanup Implementation** ```go func (m *JWTSecretManager) CleanupExpiredSecrets() { now := time.Now() var activeSecrets []JWTSecret for _, secret := range m.secrets { if secret.IsPrimary { // Never remove current primary activeSecrets = append(activeSecrets, secret) continue } // Check if secret is within retention period if now.Sub(secret.CreatedAt) <= secret.RetentionPeriod { activeSecrets = append(activeSecrets, secret) } else { log.Info(). Str("secret", secret.Secret). Msg("Removed expired JWT secret") } } m.secrets = activeSecrets } ``` #### Phase 4: Integration 1. **Server Initialization** ```go func (s *Server) InitializeJWT() error { // Load config jwtConfig := s.config.GetJWTConfig() // Create secret manager with retention policy secretManager := NewJWTSecretManager( jwtConfig.Secret, WithRetentionFactor(jwtConfig.RetentionFactor), WithMaxRetention(jwtConfig.MaxRetention), ) // Start cleanup job secretManager.StartCleanupJob(s.ctx, jwtConfig.CleanupInterval) return nil } ``` ### Validation #### 1. Configuration Validation ```go func (c *Config) ValidateJWTConfig() error { if c.JWT.TTL <= 0 { return fmt.Errorf("jwt.ttl must be positive") } if c.JWT.SecretRetention.RetentionFactor < 1.0 { return fmt.Errorf("jwt.secret_retention.retention_factor must be ≥ 1.0") } if c.JWT.SecretRetention.MaxRetention <= 0 { return fmt.Errorf("jwt.secret_retention.max_retention must be positive") } if c.JWT.SecretRetention.CleanupInterval <= 0 { return fmt.Errorf("jwt.secret_retention.cleanup_interval must be positive") } // Ensure max retention is reasonable if c.JWT.SecretRetention.MaxRetention > 720h { // 30 days return fmt.Errorf("jwt.secret_retention.max_retention exceeds maximum of 720h") } return nil } ``` #### 2. Runtime Validation ```go func (m *JWTSecretManager) ValidateSecret(secret string) error { // Check minimum length if len(secret) < 16 { return fmt.Errorf("jwt secret must be at least 16 characters") } // Check entropy (basic check) if !hasSufficientEntropy(secret) { return fmt.Errorf("jwt secret must have sufficient entropy") } return nil } ``` ### Monitoring and Observability #### 1. Metrics ```go // Prometheus metrics var ( jwtSecretsActive = prometheus.NewGauge(prometheus.GaugeOpts{ Name: "jwt_secrets_active_count", Help: "Number of active JWT secrets", }) jwtSecretsExpired = prometheus.NewCounter(prometheus.CounterOpts{ Name: "jwt_secrets_expired_total", Help: "Total number of expired JWT secrets removed", }) jwtSecretRetentionDuration = prometheus.NewHistogram(prometheus.HistogramOpts{ Name: "jwt_secret_retention_duration_seconds", Help: "Duration of JWT secret retention periods", Buckets: prometheus.ExponentialBuckets(3600, 2, 6), // 1h to 32h }) ) ``` #### 2. Logging ```go func (m *JWTSecretManager) logSecretEvent(secret string, event string, details ...interface{}) { log.Info(). Str("secret", maskSecret(secret)). Str("event", event). Interface("details", details). Msg("JWT secret event") } func maskSecret(secret string) string { if len(secret) <= 4 { return "****" } return secret[:4] + "****" + secret[len(secret)-4:] } ``` ## Consequences ### Positive 1. **Enhanced Security**: Automatic cleanup reduces attack surface 2. **Reduced Memory Usage**: Prevents unbounded growth of secret storage 3. **Operational Efficiency**: No manual cleanup required 4. **Compliance Ready**: Meets security policy requirements for key rotation 5. **Flexibility**: Configurable to meet different security requirements ### Negative 1. **Complexity**: Adds configuration and cleanup logic 2. **Performance Overhead**: Background cleanup job (minimal impact) 3. **Migration**: Existing deployments need configuration updates 4. **Debugging**: More moving parts to troubleshoot ### Neutral 1. **Backward Compatibility**: Existing tokens continue to work 2. **Learning Curve**: New configuration options to understand 3. **Monitoring**: Additional metrics to track ## Alternatives Considered ### Alternative 1: Fixed Retention Period **Proposal**: Use fixed retention period (e.g., 48 hours) instead of TTL-based calculation **Rejected Because**: - Less flexible for different use cases - Doesn't scale with JWT TTL changes - May be too short for long-lived tokens or too long for short-lived ones ### Alternative 2: Manual Cleanup Only **Proposal**: Require administrators to manually clean up old secrets **Rejected Because**: - Operational overhead - Security risk if cleanup is forgotten - Doesn't scale for frequent rotations ### Alternative 3: No Retention (Current State) **Proposal**: Keep current behavior with no automatic cleanup **Rejected Because**: - Security concerns with accumulating secrets - Memory management issues - Compliance violations ## Success Metrics 1. **Security**: No old secrets remain beyond retention period 2. **Reliability**: 99.9% of valid tokens continue to work during rotation 3. **Performance**: Cleanup job completes in <100ms with <1000 secrets 4. **Adoption**: Configuration used in 100% of deployments within 3 months ## Migration Plan ### Phase 1: Preparation (1 week) - ✅ Create this ADR - ✅ Update documentation - ✅ Add configuration to config package - ✅ Implement basic retention logic ### Phase 2: Testing (2 weeks) - ✅ Write BDD scenarios for retention - ✅ Add unit tests for secret manager - ✅ Test with various TTL/factor combinations - ✅ Performance testing with large secret counts ### Phase 3: Rollout (1 week) - ✅ Update default configuration - ✅ Add feature flag for gradual rollout - ✅ Monitor metrics in staging - ✅ Gradual production rollout ### Phase 4: Optimization (Ongoing) - ✅ Monitor cleanup performance - ✅ Adjust defaults based on real-world usage - ✅ Add alerts for cleanup failures - ✅ Document troubleshooting guide ## References - [ADR-0009: Hybrid Testing Approach](0009-hybrid-testing-approach.md) - [ADR-0008: BDD Testing](0008-bdd-testing.md) - [RFC 7519: JSON Web Tokens](https://tools.ietf.org/html/rfc7519) - [OWASP Key Management Cheat Sheet](https://cheatsheetseries.owasp.org/cheatsheets/Key_Management_Cheat_Sheet.html) ## Appendix ### Configuration Examples **Development Environment** (short retention for testing): ```yaml jwt: ttl: 1h secret_retention: retention_factor: 1.5 max_retention: 3h cleanup_interval: 30m ``` **Production Environment** (secure defaults): ```yaml jwt: ttl: 24h secret_retention: retention_factor: 2.0 max_retention: 72h cleanup_interval: 1h ``` **High-Security Environment** (aggressive rotation): ```yaml jwt: ttl: 8h secret_retention: retention_factor: 1.5 max_retention: 24h cleanup_interval: 30m ``` ### Troubleshooting **Issue**: Secrets being removed too quickly - **Check**: Retention factor and JWT TTL settings - **Fix**: Increase retention_factor or JWT TTL **Issue**: Too many old secrets accumulating - **Check**: Cleanup job logs and interval - **Fix**: Decrease cleanup_interval or retention_factor **Issue**: Performance degradation during cleanup - **Check**: Number of secrets and cleanup frequency - **Fix**: Optimize cleanup algorithm or increase interval ### FAQ **Q: What happens to tokens signed with expired secrets?** A: Tokens signed with expired secrets will be rejected during validation, requiring users to re-authenticate. **Q: Can I disable automatic cleanup?** A: Yes, set `cleanup_interval` to a very high value (e.g., `8760h` for 1 year). **Q: How does this affect existing deployments?** A: Existing deployments will use sensible defaults. The feature is backward compatible. **Q: What's the recommended retention factor?** A: Start with 2.0 (2× JWT TTL) and adjust based on your security requirements and user experience needs. **Q: How often should cleanup run?** A: For most deployments, every 1 hour is sufficient. High-volume systems may need more frequent cleanup. ## Decision Record **Approved By**: **Approved Date**: **Implemented By**: **Implementation Date**: --- *Generated by Mistral Vibe* *Co-Authored-By: Mistral Vibe *