Files
dance-lessons-coach/adr/0021-jwt-secret-retention-policy.md
Gabriel Radureau 5eec64e5e8
All checks were successful
CI/CD Pipeline / Build Docker Cache (push) Successful in 9s
CI/CD Pipeline / CI Pipeline (push) Successful in 4m15s
CI/CD Pipeline / Trigger Docker Push (push) Has been skipped
🧪 test: add JWT secret rotation BDD scenarios and step implementations (#12)
 merge: implement JWT secret rotation with BDD scenario isolation

- Implement JWT secret rotation mechanism (closes #8)
- Add per-scenario state isolation for BDD tests (closes #14)
- Validate password reset workflow via BDD tests (closes #7)
- Fix port conflicts in test validation
- Add state tracer for debugging test execution
- Document BDD isolation strategies in ADR 0025
- Fix PostgreSQL configuration environment variables

Generated by Mistral Vibe.
Co-Authored-By: Mistral Vibe <vibe@mistral.ai>
Co-authored-by: Gabriel Radureau <arcodange@gmail.com>
Co-committed-by: Gabriel Radureau <arcodange@gmail.com>
2026-04-11 17:56:45 +02:00

13 KiB
Raw Blame History

10. JWT Secret Retention Policy

Status

Proposed 🟡

Context

The dance-lessons-coach application requires a robust JWT secret management system that balances security and user experience. As implemented in ADR-0009, the system supports multiple JWT secrets for graceful rotation. However, the current implementation lacks a clear policy for secret retention and cleanup.

Current State

  • Multiple JWT secrets supported
  • Graceful rotation implemented
  • Backward compatibility maintained
  • No automatic cleanup of old secrets
  • No configurable retention periods
  • No expiration-based secret management

Problem Statement

Without a retention policy:

  1. Security Risk: Old secrets accumulate indefinitely, increasing attack surface
  2. Memory Bloat: Unbounded growth of secret storage
  3. Operational Overhead: Manual cleanup required
  4. Compliance Issues: May violate security policies requiring regular key rotation

Requirements

  1. Configurable Retention: Administrators should control how long secrets are retained
  2. Automatic Cleanup: System should automatically remove expired secrets
  3. Backward Compatibility: Existing tokens should continue working during retention period
  4. Sensible Defaults: Should work out-of-the-box with secure defaults
  5. Performance: Cleanup should not impact runtime performance

Decision

JWT Secret Retention Policy

Implement a configurable retention policy based on JWT TTL (Time-To-Live) with the following components:

1. Configuration Structure

jwt:
  # Token time-to-live (default: 24h)
  ttl: 24h

  # Secret retention configuration
  secret_retention:
    # Retention factor multiplier (default: 2.0)
    # Retention period = JWT TTL × retention_factor
    retention_factor: 2.0

    # Maximum retention period (safety limit, default: 72h)
    max_retention: 72h

    # Cleanup frequency for expired secrets (default: 1h)
    cleanup_interval: 1h

2. Retention Period Calculation

retention_period = min(JWT_TTL × retention_factor, max_retention)

Examples:

  • Default (24h TTL, 2.0 factor): min(48h, 72h) = 48h
  • Short-lived tokens (1h TTL, 3.0 factor): min(3h, 72h) = 3h
  • Long-lived tokens (72h TTL, 2.0 factor): min(144h, 72h) = 72h

3. Secret Lifecycle

graph LR
    A[Secret Created] --> B[Active Period]
    B --> C{Retention Period}
    C -->|Expired| D[Marked for Cleanup]
    C -->|Valid| B
    D --> E[Automatic Removal]

4. Cleanup Process

  • Frequency: Configurable interval (default: 1 hour)
  • Scope: Remove secrets older than retention period
  • Safety: Never remove current primary secret
  • Logging: Audit trail of cleanup operations

Implementation Strategy

Phase 1: Configuration Framework

  1. Extend Config Package (pkg/config/config.go)

    • Add JWT TTL configuration
    • Add secret retention parameters
    • Implement validation
  2. Environment Variables

    # JWT Token TTL
    DLC_JWT_TTL=24h
    
    # Secret Retention
    DLC_JWT_SECRET_RETENTION_FACTOR=2.0
    DLC_JWT_SECRET_MAX_RETENTION=72h
    DLC_JWT_SECRET_CLEANUP_INTERVAL=1h
    

Phase 2: Secret Manager Enhancement

  1. Enhance JWTSecret Struct

    type JWTSecret struct {
        Secret     string
        IsPrimary  bool
        CreatedAt  time.Time
        ExpiresAt  *time.Time  // Now properly calculated
        RetentionPeriod time.Duration
    }
    
  2. Add Expiration Logic

    func (m *JWTSecretManager) AddSecret(secret string, isPrimary bool, expiresIn time.Duration) {
        // Calculate retention period based on config
        retentionPeriod := m.calculateRetentionPeriod()
        expiresAt := time.Now().Add(expiresIn)
    
        m.secrets = append(m.secrets, JWTSecret{
            Secret:        secret,
            IsPrimary:     isPrimary,
            CreatedAt:     time.Now(),
            ExpiresAt:     &expiresAt,
            RetentionPeriod: retentionPeriod,
        })
    }
    

Phase 3: Automatic Cleanup

  1. Background Cleanup Job

    func (m *JWTSecretManager) StartCleanupJob(ctx context.Context, interval time.Duration) {
        ticker := time.NewTicker(interval)
        go func() {
            for {
                select {
                case <-ticker.C:
                    m.CleanupExpiredSecrets()
                case <-ctx.Done():
                    ticker.Stop()
                    return
                }
            }
        }()
    }
    
  2. Cleanup Implementation

    func (m *JWTSecretManager) CleanupExpiredSecrets() {
        now := time.Now()
        var activeSecrets []JWTSecret
    
        for _, secret := range m.secrets {
            if secret.IsPrimary {
                // Never remove current primary
                activeSecrets = append(activeSecrets, secret)
                continue
            }
    
            // Check if secret is within retention period
            if now.Sub(secret.CreatedAt) <= secret.RetentionPeriod {
                activeSecrets = append(activeSecrets, secret)
            } else {
                log.Info().
                    Str("secret", secret.Secret).
                    Msg("Removed expired JWT secret")
            }
        }
    
        m.secrets = activeSecrets
    }
    

Phase 4: Integration

  1. Server Initialization
    func (s *Server) InitializeJWT() error {
        // Load config
        jwtConfig := s.config.GetJWTConfig()
    
        // Create secret manager with retention policy
        secretManager := NewJWTSecretManager(
            jwtConfig.Secret,
            WithRetentionFactor(jwtConfig.RetentionFactor),
            WithMaxRetention(jwtConfig.MaxRetention),
        )
    
        // Start cleanup job
        secretManager.StartCleanupJob(s.ctx, jwtConfig.CleanupInterval)
    
        return nil
    }
    

Validation

1. Configuration Validation

func (c *Config) ValidateJWTConfig() error {
    if c.JWT.TTL <= 0 {
        return fmt.Errorf("jwt.ttl must be positive")
    }
    
    if c.JWT.SecretRetention.RetentionFactor < 1.0 {
        return fmt.Errorf("jwt.secret_retention.retention_factor must be ≥ 1.0")
    }
    
    if c.JWT.SecretRetention.MaxRetention <= 0 {
        return fmt.Errorf("jwt.secret_retention.max_retention must be positive")
    }
    
    if c.JWT.SecretRetention.CleanupInterval <= 0 {
        return fmt.Errorf("jwt.secret_retention.cleanup_interval must be positive")
    }
    
    // Ensure max retention is reasonable
    if c.JWT.SecretRetention.MaxRetention > 720h { // 30 days
        return fmt.Errorf("jwt.secret_retention.max_retention exceeds maximum of 720h")
    }
    
    return nil
}

2. Runtime Validation

func (m *JWTSecretManager) ValidateSecret(secret string) error {
    // Check minimum length
    if len(secret) < 16 {
        return fmt.Errorf("jwt secret must be at least 16 characters")
    }
    
    // Check entropy (basic check)
    if !hasSufficientEntropy(secret) {
        return fmt.Errorf("jwt secret must have sufficient entropy")
    }
    
    return nil
}

Monitoring and Observability

1. Metrics

// Prometheus metrics
var (
    jwtSecretsActive = prometheus.NewGauge(prometheus.GaugeOpts{
        Name: "jwt_secrets_active_count",
        Help: "Number of active JWT secrets",
    })
    
    jwtSecretsExpired = prometheus.NewCounter(prometheus.CounterOpts{
        Name: "jwt_secrets_expired_total",
        Help: "Total number of expired JWT secrets removed",
    })
    
    jwtSecretRetentionDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
        Name: "jwt_secret_retention_duration_seconds",
        Help: "Duration of JWT secret retention periods",
        Buckets: prometheus.ExponentialBuckets(3600, 2, 6), // 1h to 32h
    })
)

2. Logging

func (m *JWTSecretManager) logSecretEvent(secret string, event string, details ...interface{}) {
    log.Info().
        Str("secret", maskSecret(secret)).
        Str("event", event).
        Interface("details", details).
        Msg("JWT secret event")
}

func maskSecret(secret string) string {
    if len(secret) <= 4 {
        return "****"
    }
    return secret[:4] + "****" + secret[len(secret)-4:]
}

Consequences

Positive

  1. Enhanced Security: Automatic cleanup reduces attack surface
  2. Reduced Memory Usage: Prevents unbounded growth of secret storage
  3. Operational Efficiency: No manual cleanup required
  4. Compliance Ready: Meets security policy requirements for key rotation
  5. Flexibility: Configurable to meet different security requirements

Negative

  1. Complexity: Adds configuration and cleanup logic
  2. Performance Overhead: Background cleanup job (minimal impact)
  3. Migration: Existing deployments need configuration updates
  4. Debugging: More moving parts to troubleshoot

Neutral

  1. Backward Compatibility: Existing tokens continue to work
  2. Learning Curve: New configuration options to understand
  3. Monitoring: Additional metrics to track

Alternatives Considered

Alternative 1: Fixed Retention Period

Proposal: Use fixed retention period (e.g., 48 hours) instead of TTL-based calculation

Rejected Because:

  • Less flexible for different use cases
  • Doesn't scale with JWT TTL changes
  • May be too short for long-lived tokens or too long for short-lived ones

Alternative 2: Manual Cleanup Only

Proposal: Require administrators to manually clean up old secrets

Rejected Because:

  • Operational overhead
  • Security risk if cleanup is forgotten
  • Doesn't scale for frequent rotations

Alternative 3: No Retention (Current State)

Proposal: Keep current behavior with no automatic cleanup

Rejected Because:

  • Security concerns with accumulating secrets
  • Memory management issues
  • Compliance violations

Success Metrics

  1. Security: No old secrets remain beyond retention period
  2. Reliability: 99.9% of valid tokens continue to work during rotation
  3. Performance: Cleanup job completes in <100ms with <1000 secrets
  4. Adoption: Configuration used in 100% of deployments within 3 months

Migration Plan

Phase 1: Preparation (1 week)

  • Create this ADR
  • Update documentation
  • Add configuration to config package
  • Implement basic retention logic

Phase 2: Testing (2 weeks)

  • Write BDD scenarios for retention
  • Add unit tests for secret manager
  • Test with various TTL/factor combinations
  • Performance testing with large secret counts

Phase 3: Rollout (1 week)

  • Update default configuration
  • Add feature flag for gradual rollout
  • Monitor metrics in staging
  • Gradual production rollout

Phase 4: Optimization (Ongoing)

  • Monitor cleanup performance
  • Adjust defaults based on real-world usage
  • Add alerts for cleanup failures
  • Document troubleshooting guide

References

Appendix

Configuration Examples

Development Environment (short retention for testing):

jwt:
  ttl: 1h
  secret_retention:
    retention_factor: 1.5
    max_retention: 3h
    cleanup_interval: 30m

Production Environment (secure defaults):

jwt:
  ttl: 24h
  secret_retention:
    retention_factor: 2.0
    max_retention: 72h
    cleanup_interval: 1h

High-Security Environment (aggressive rotation):

jwt:
  ttl: 8h
  secret_retention:
    retention_factor: 1.5
    max_retention: 24h
    cleanup_interval: 30m

Troubleshooting

Issue: Secrets being removed too quickly

  • Check: Retention factor and JWT TTL settings
  • Fix: Increase retention_factor or JWT TTL

Issue: Too many old secrets accumulating

  • Check: Cleanup job logs and interval
  • Fix: Decrease cleanup_interval or retention_factor

Issue: Performance degradation during cleanup

  • Check: Number of secrets and cleanup frequency
  • Fix: Optimize cleanup algorithm or increase interval

FAQ

Q: What happens to tokens signed with expired secrets? A: Tokens signed with expired secrets will be rejected during validation, requiring users to re-authenticate.

Q: Can I disable automatic cleanup? A: Yes, set cleanup_interval to a very high value (e.g., 8760h for 1 year).

Q: How does this affect existing deployments? A: Existing deployments will use sensible defaults. The feature is backward compatible.

Q: What's the recommended retention factor? A: Start with 2.0 (2× JWT TTL) and adjust based on your security requirements and user experience needs.

Q: How often should cleanup run? A: For most deployments, every 1 hour is sufficient. High-volume systems may need more frequent cleanup.

Decision Record

Approved By: Approved Date: Implemented By: Implementation Date:


Generated by Mistral Vibe Co-Authored-By: Mistral Vibe vibe@mistral.ai