dance-lessons-coach/adr/0021-jwt-secret-retention-policy.md

# 21. JWT Secret Retention Policy

**Status:** Proposed

## Context

The dance-lessons-coach application requires a robust JWT secret management system that balances security and user experience. As implemented in [ADR-0009](0009-hybrid-testing-approach.md), the system supports multiple JWT secrets for graceful rotation. However, the current implementation lacks a clear policy for secret retention and cleanup.

### Current State

- ✅ Multiple JWT secrets supported
- ✅ Graceful rotation implemented
- ✅ Backward compatibility maintained
- ❌ No automatic cleanup of old secrets
- ❌ No configurable retention periods
- ❌ No expiration-based secret management

### Problem Statement

Without a retention policy:
1. **Security Risk**: Old secrets accumulate indefinitely, increasing attack surface
2. **Memory Bloat**: Unbounded growth of secret storage
3. **Operational Overhead**: Manual cleanup required
4. **Compliance Issues**: May violate security policies requiring regular key rotation

### Requirements

1. **Configurable Retention**: Administrators should control how long secrets are retained
2. **Automatic Cleanup**: System should automatically remove expired secrets
3. **Backward Compatibility**: Existing tokens should continue working during retention period
4. **Sensible Defaults**: Should work out-of-the-box with secure defaults
5. **Performance**: Cleanup should run in the background and add no measurable latency to request handling

## Decision

### JWT Secret Retention Policy

Implement a configurable retention policy based on JWT TTL (Time-To-Live) with the following components:

#### 1. Configuration Structure

```yaml
jwt:
  # Token time-to-live (default: 24h)
  ttl: 24h

  # Secret retention configuration
  secret_retention:
    # Retention factor multiplier (default: 2.0)
    # Retention period = JWT TTL × retention_factor
    retention_factor: 2.0

    # Maximum retention period (safety limit, default: 72h)
    max_retention: 72h

    # Cleanup frequency for expired secrets (default: 1h)
    cleanup_interval: 1h
```

#### 2. Retention Period Calculation

```
retention_period = min(JWT_TTL × retention_factor, max_retention)
```

**Examples:**
- Default (24h TTL, 2.0 factor): `min(48h, 72h) = 48h`
- Short-lived tokens (1h TTL, 3.0 factor): `min(3h, 72h) = 3h`
- Long-lived tokens (72h TTL, 2.0 factor): `min(144h, 72h) = 72h`
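
The formula and the three examples above can be checked with a short Go sketch. The names `retentionPeriod` and `retentionHours` are illustrative assumptions, not part of the project's API.

```go
package main

import "time"

// retentionPeriod returns min(ttl × factor, maxRetention), the formula above.
// The name and signature are assumptions made for this sketch.
func retentionPeriod(ttl time.Duration, factor float64, maxRetention time.Duration) time.Duration {
	// Multiply in floating point so fractional factors such as 1.5 work.
	period := time.Duration(float64(ttl) * factor)
	if period > maxRetention {
		return maxRetention
	}
	return period
}

// retentionHours wraps retentionPeriod for whole-hour inputs, which makes
// the worked examples easy to verify.
func retentionHours(ttlHours int, factor float64, maxHours int) float64 {
	ttl := time.Duration(ttlHours) * time.Hour
	limit := time.Duration(maxHours) * time.Hour
	return retentionPeriod(ttl, factor, limit).Hours()
}
```

Calling `retentionHours(24, 2.0, 72)` corresponds to the default example, and `retentionHours(72, 2.0, 72)` to the capped long-lived case.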

#### 3. Secret Lifecycle

```mermaid
graph LR
    A[Secret Created] --> B[Active Period]
    B --> C{Retention Period}
    C -->|Expired| D[Marked for Cleanup]
    C -->|Valid| B
    D --> E[Automatic Removal]
```

#### 4. Cleanup Process

- **Frequency**: Configurable interval (default: 1 hour)
- **Scope**: Remove secrets older than retention period
- **Safety**: Never remove current primary secret
- **Logging**: Audit trail of cleanup operations

### Implementation Strategy

#### Phase 1: Configuration Framework

1. **Extend Config Package** (`pkg/config/config.go`)
   - Add JWT TTL configuration
   - Add secret retention parameters
   - Implement validation

2. **Environment Variables**
   ```bash
   # JWT Token TTL
   DLC_JWT_TTL=24h

   # Secret Retention
   DLC_JWT_SECRET_RETENTION_FACTOR=2.0
   DLC_JWT_SECRET_MAX_RETENTION=72h
   DLC_JWT_SECRET_CLEANUP_INTERVAL=1h
   ```
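
One possible way to read these variables is sketched below. It assumes a plain `getenv` lookup function (pass `os.Getenv` in production) instead of whatever loader `pkg/config` actually uses; the `retentionSettings` type and `loadRetentionSettings` function are hypothetical names.

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

// retentionSettings is a hypothetical holder for the four values above.
type retentionSettings struct {
	TTL             time.Duration
	RetentionFactor float64
	MaxRetention    time.Duration
	CleanupInterval time.Duration
}

// loadRetentionSettings starts from the documented defaults and overrides
// them with any DLC_JWT_* variable that getenv reports as non-empty.
func loadRetentionSettings(getenv func(string) string) (retentionSettings, error) {
	s := retentionSettings{
		TTL:             24 * time.Hour,
		RetentionFactor: 2.0,
		MaxRetention:    72 * time.Hour,
		CleanupInterval: time.Hour,
	}

	durations := []struct {
		key string
		dst *time.Duration
	}{
		{"DLC_JWT_TTL", &s.TTL},
		{"DLC_JWT_SECRET_MAX_RETENTION", &s.MaxRetention},
		{"DLC_JWT_SECRET_CLEANUP_INTERVAL", &s.CleanupInterval},
	}
	for _, d := range durations {
		raw := getenv(d.key)
		if raw == "" {
			continue // keep the default
		}
		v, err := time.ParseDuration(raw)
		if err != nil {
			return retentionSettings{}, fmt.Errorf("%s: %w", d.key, err)
		}
		*d.dst = v
	}

	if raw := getenv("DLC_JWT_SECRET_RETENTION_FACTOR"); raw != "" {
		v, err := strconv.ParseFloat(raw, 64)
		if err != nil {
			return retentionSettings{}, fmt.Errorf("DLC_JWT_SECRET_RETENTION_FACTOR: %w", err)
		}
		s.RetentionFactor = v
	}
	return s, nil
}
```

Injecting the lookup function keeps the parser testable without touching the process environment.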

#### Phase 2: Secret Manager Enhancement

1. **Enhance JWTSecret Struct**
   ```go
   type JWTSecret struct {
       Secret          string
       IsPrimary       bool
       CreatedAt       time.Time
       ExpiresAt       *time.Time    // Now properly calculated
       RetentionPeriod time.Duration // min(JWT TTL × retention_factor, max_retention)
   }
   ```

2. **Add Expiration Logic**
   ```go
   func (m *JWTSecretManager) AddSecret(secret string, isPrimary bool, expiresIn time.Duration) {
       // Capture the time once so CreatedAt and ExpiresAt share the same base
       now := time.Now()
       expiresAt := now.Add(expiresIn)

       m.secrets = append(m.secrets, JWTSecret{
           Secret:    secret,
           IsPrimary: isPrimary,
           CreatedAt: now,
           ExpiresAt: &expiresAt,
           // Retention period derived from config: min(TTL × factor, max retention)
           RetentionPeriod: m.calculateRetentionPeriod(),
       })
   }
   ```
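
The snippets in this ADR measure a secret's age from `CreatedAt`. If a secret stays primary for longer than its retention period, it becomes eligible for removal as soon as it is rotated out, while tokens it signed may still be within their TTL. One possible variation, sketched below under that assumption, records the demotion time and counts retention from there. The `rotatingSecret` type, its `DemotedAt` field, and the `rotate` and `expired` helpers are hypothetical and not part of the current design.

```go
package main

import "time"

// rotatingSecret is a hypothetical variant of JWTSecret that also records
// when the secret stopped being primary.
type rotatingSecret struct {
	Value     string
	IsPrimary bool
	CreatedAt time.Time
	DemotedAt time.Time // zero value while the secret is still primary
}

// rotate demotes the current primary in place (stamping DemotedAt) and
// appends value as the new primary. It returns the updated slice.
func rotate(secrets []rotatingSecret, value string) []rotatingSecret {
	now := time.Now()
	for i := range secrets {
		if secrets[i].IsPrimary {
			secrets[i].IsPrimary = false
			secrets[i].DemotedAt = now
		}
	}
	return append(secrets, rotatingSecret{Value: value, IsPrimary: true, CreatedAt: now})
}

// expired reports whether a demoted secret has outlived the retention
// period, counted from demotion rather than from creation.
func (s rotatingSecret) expired(retention time.Duration) bool {
	if s.IsPrimary {
		return false // the primary is never eligible for cleanup
	}
	return time.Since(s.DemotedAt) > retention
}
```

With this variation, a token issued just before rotation keeps a matching secret for the full retention period, regardless of how long the secret was primary.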

#### Phase 3: Automatic Cleanup

1. **Background Cleanup Job**
   ```go
   func (m *JWTSecretManager) StartCleanupJob(ctx context.Context, interval time.Duration) {
       ticker := time.NewTicker(interval)
       go func() {
           // Release the ticker whenever the goroutine exits
           defer ticker.Stop()
           for {
               select {
               case <-ticker.C:
                   m.CleanupExpiredSecrets()
               case <-ctx.Done():
                   // Context cancelled (e.g. server shutdown): stop the job
                   return
               }
           }
       }()
   }
   ```

2. **Cleanup Implementation**
   ```go
   func (m *JWTSecretManager) CleanupExpiredSecrets() {
       now := time.Now()
       var activeSecrets []JWTSecret

       for _, secret := range m.secrets {
           if secret.IsPrimary {
               // Never remove current primary
               activeSecrets = append(activeSecrets, secret)
               continue
           }

           // Check if secret is within retention period
           if now.Sub(secret.CreatedAt) <= secret.RetentionPeriod {
               activeSecrets = append(activeSecrets, secret)
           } else {
               // Never log the raw secret value; log a masked form instead
               log.Info().
                   Str("secret", maskSecret(secret.Secret)).
                   Msg("Removed expired JWT secret")
           }
       }

       m.secrets = activeSecrets
   }
   ```
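
Because the cleanup job runs in its own goroutine while request handlers read the secret list, access to the list needs synchronization. The sketch below shows one way to do that with `sync.RWMutex`. The `lockedSecretStore` and `storedSecret` types are hypothetical stand-ins for `JWTSecretManager` and `JWTSecret`, and the ADR does not specify a locking strategy.

```go
package main

import (
	"sync"
	"time"
)

// storedSecret is a simplified, hypothetical stand-in for JWTSecret.
type storedSecret struct {
	Value     string
	IsPrimary bool
	CreatedAt time.Time
	Retention time.Duration
}

// lockedSecretStore guards its slice with a read/write mutex so that the
// cleanup goroutine and request handlers do not race.
type lockedSecretStore struct {
	mu      sync.RWMutex
	secrets []storedSecret
}

// add appends a secret created now with the given retention period.
func (s *lockedSecretStore) add(value string, isPrimary bool, retention time.Duration) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.secrets = append(s.secrets, storedSecret{
		Value: value, IsPrimary: isPrimary, CreatedAt: time.Now(), Retention: retention,
	})
}

// cleanup drops non-primary secrets older than their retention period and
// returns how many were removed.
func (s *lockedSecretStore) cleanup() int {
	s.mu.Lock()
	defer s.mu.Unlock()

	now := time.Now()
	// Build a fresh slice so removed values do not linger in the old backing array.
	kept := make([]storedSecret, 0, len(s.secrets))
	for _, e := range s.secrets {
		if e.IsPrimary || now.Sub(e.CreatedAt) <= e.Retention {
			kept = append(kept, e)
		}
	}
	removed := len(s.secrets) - len(kept)
	s.secrets = kept
	return removed
}

// count returns the number of stored secrets under a read lock.
func (s *lockedSecretStore) count() int {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return len(s.secrets)
}
```

Returning the number of removed secrets also gives the caller a value to feed into a counter metric.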

#### Phase 4: Integration

1. **Server Initialization**
   ```go
   func (s *Server) InitializeJWT() error {
       // Load config
       jwtConfig := s.config.GetJWTConfig()

       // Create secret manager with retention policy
       secretManager := NewJWTSecretManager(
           jwtConfig.Secret,
           WithRetentionFactor(jwtConfig.RetentionFactor),
           WithMaxRetention(jwtConfig.MaxRetention),
       )

       // Start cleanup job
       secretManager.StartCleanupJob(s.ctx, jwtConfig.CleanupInterval)

       return nil
   }
   ```

### Validation

#### 1. Configuration Validation

```go
func (c *Config) ValidateJWTConfig() error {
    if c.JWT.TTL <= 0 {
        return fmt.Errorf("jwt.ttl must be positive")
    }

    if c.JWT.SecretRetention.RetentionFactor < 1.0 {
        return fmt.Errorf("jwt.secret_retention.retention_factor must be ≥ 1.0")
    }

    if c.JWT.SecretRetention.MaxRetention <= 0 {
        return fmt.Errorf("jwt.secret_retention.max_retention must be positive")
    }

    if c.JWT.SecretRetention.CleanupInterval <= 0 {
        return fmt.Errorf("jwt.secret_retention.cleanup_interval must be positive")
    }

    // Ensure max retention is reasonable (hard cap: 30 days)
    if c.JWT.SecretRetention.MaxRetention > 720*time.Hour {
        return fmt.Errorf("jwt.secret_retention.max_retention exceeds maximum of 720h")
    }

    return nil
}
```

#### 2. Runtime Validation

```go
func (m *JWTSecretManager) ValidateSecret(secret string) error {
    // Check minimum length (len counts bytes, not characters)
    if len(secret) < 16 {
        return fmt.Errorf("jwt secret must be at least 16 bytes")
    }

    // Check entropy (basic check)
    if !hasSufficientEntropy(secret) {
        return fmt.Errorf("jwt secret must have sufficient entropy")
    }

    return nil
}
```
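
`hasSufficientEntropy` is referenced above but not defined in this ADR. A minimal sketch of what such a basic check could look like follows, using Shannon entropy over byte frequencies. The 3.0 bits-per-byte threshold is an assumption chosen for illustration, and the heuristic only catches highly repetitive secrets; it does not prove a secret is unpredictable, so generating secrets from a cryptographic random source remains the safer approach.

```go
package main

import "math"

// shannonBitsPerByte estimates the entropy of s in bits per byte from the
// frequency of each byte value.
func shannonBitsPerByte(s string) float64 {
	if len(s) == 0 {
		return 0
	}
	var counts [256]int
	for i := 0; i < len(s); i++ {
		counts[s[i]]++
	}
	n := float64(len(s))
	var h float64
	for _, c := range counts {
		if c == 0 {
			continue
		}
		p := float64(c) / n
		h -= p * math.Log2(p)
	}
	return h
}

// hasSufficientEntropy is a hypothetical implementation of the helper used
// by ValidateSecret. The 3.0 threshold is an assumption, not a project value.
func hasSufficientEntropy(secret string) bool {
	return shannonBitsPerByte(secret) >= 3.0
}
```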

### Monitoring and Observability

#### 1. Metrics

```go
// Prometheus metrics
var (
    jwtSecretsActive = prometheus.NewGauge(prometheus.GaugeOpts{
        Name: "jwt_secrets_active_count",
        Help: "Number of active JWT secrets",
    })

    jwtSecretsExpired = prometheus.NewCounter(prometheus.CounterOpts{
        Name: "jwt_secrets_expired_total",
        Help: "Total number of expired JWT secrets removed",
    })

    jwtSecretRetentionDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
        Name: "jwt_secret_retention_duration_seconds",
        Help: "Duration of JWT secret retention periods",
        Buckets: prometheus.ExponentialBuckets(3600, 2, 6), // 1h to 32h
    })
)
```

#### 2. Logging

```go
func (m *JWTSecretManager) logSecretEvent(secret string, event string, details ...interface{}) {
    log.Info().
        Str("secret", maskSecret(secret)).
        Str("event", event).
        Interface("details", details).
        Msg("JWT secret event")
}

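func maskSecret(secret string) string {
    // Fully mask short secrets: showing 4 leading and 4 trailing bytes
    // of a secret of 8 bytes or fewer would reveal all of it
    if len(secret) <= 8 {
        return "****"
    }
    return secret[:4] + "****" + secret[len(secret)-4:]
}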
```
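
An alternative to partial masking, sketched here as an assumption and not as the project's chosen approach, is to log a short fingerprint derived from a SHA-256 digest. This lets operators correlate log lines for the same secret without printing any of its bytes. `secretFingerprint` is a hypothetical helper name.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
)

// secretFingerprint returns the first 8 hex characters of the SHA-256
// digest of secret. It is intended for log correlation only.
func secretFingerprint(secret string) string {
	sum := sha256.Sum256([]byte(secret))
	return hex.EncodeToString(sum[:4])
}
```

A fingerprint of a low-entropy secret could still be guessed by brute force, so this relies on secrets being randomly generated.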

## Consequences

### Positive

1. **Enhanced Security**: Automatic cleanup reduces attack surface
2. **Reduced Memory Usage**: Prevents unbounded growth of secret storage
3. **Operational Efficiency**: No manual cleanup required
4. **Compliance Ready**: Meets security policy requirements for key rotation
5. **Flexibility**: Configurable to meet different security requirements

### Negative

1. **Complexity**: Adds configuration and cleanup logic
2. **Performance Overhead**: Background cleanup job (minimal impact)
3. **Migration**: Existing deployments need configuration updates
4. **Debugging**: More moving parts to troubleshoot

### Neutral

1. **Backward Compatibility**: Existing tokens continue to work
2. **Learning Curve**: New configuration options to understand
3. **Monitoring**: Additional metrics to track

## Alternatives Considered

### Alternative 1: Fixed Retention Period

**Proposal**: Use fixed retention period (e.g., 48 hours) instead of TTL-based calculation

**Rejected Because**:
- Less flexible for different use cases
- Doesn't scale with JWT TTL changes
- May be too short for long-lived tokens or too long for short-lived ones

### Alternative 2: Manual Cleanup Only

**Proposal**: Require administrators to manually clean up old secrets

**Rejected Because**:
- Operational overhead
- Security risk if cleanup is forgotten
- Doesn't scale for frequent rotations

### Alternative 3: No Retention (Current State)

**Proposal**: Keep current behavior with no automatic cleanup

**Rejected Because**:
- Security concerns with accumulating secrets
- Memory management issues
- Compliance violations

## Success Metrics

1. **Security**: No old secrets remain beyond retention period
2. **Reliability**: 99.9% of valid tokens continue to work during rotation
3. **Performance**: Cleanup job completes in <100ms with <1000 secrets
4. **Adoption**: Configuration used in 100% of deployments within 3 months

## Migration Plan

### Phase 1: Preparation (1 week)
- ✅ Create this ADR
- ✅ Update documentation
- ✅ Add configuration to config package
- ✅ Implement basic retention logic

### Phase 2: Testing (2 weeks)
- ✅ Write BDD scenarios for retention
- ✅ Add unit tests for secret manager
- ✅ Test with various TTL/factor combinations
- ✅ Performance testing with large secret counts

### Phase 3: Rollout (1 week)
- ✅ Update default configuration
- ✅ Add feature flag for gradual rollout
- ✅ Monitor metrics in staging
- ✅ Gradual production rollout

### Phase 4: Optimization (Ongoing)
- ✅ Monitor cleanup performance
- ✅ Adjust defaults based on real-world usage
- ✅ Add alerts for cleanup failures
- ✅ Document troubleshooting guide

## References

- [ADR-0009: Hybrid Testing Approach](0009-hybrid-testing-approach.md)
- [ADR-0008: BDD Testing](0008-bdd-testing.md)
- [RFC 7519: JSON Web Tokens](https://tools.ietf.org/html/rfc7519)
- [OWASP Key Management Cheat Sheet](https://cheatsheetseries.owasp.org/cheatsheets/Key_Management_Cheat_Sheet.html)

## Appendix

### Configuration Examples

**Development Environment** (short retention for testing):
```yaml
jwt:
  ttl: 1h
  secret_retention:
    retention_factor: 1.5
    max_retention: 3h
    cleanup_interval: 30m
```

**Production Environment** (secure defaults):
```yaml
jwt:
  ttl: 24h
  secret_retention:
    retention_factor: 2.0
    max_retention: 72h
    cleanup_interval: 1h
```

**High-Security Environment** (aggressive rotation):
```yaml
jwt:
  ttl: 8h
  secret_retention:
    retention_factor: 1.5
    max_retention: 24h
    cleanup_interval: 30m
```

### Troubleshooting

**Issue**: Secrets being removed too quickly
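- **Check**: Retention factor, JWT TTL, and `max_retention` (which caps the computed period)
- **Fix**: Increase `retention_factor` or JWT TTL, or raise `max_retention` if the cap is the limiting value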

**Issue**: Too many old secrets accumulating
- **Check**: Cleanup job logs and interval
- **Fix**: Decrease `cleanup_interval` or `retention_factor` (validation requires a factor of at least 1.0)

**Issue**: Performance degradation during cleanup
- **Check**: Number of secrets and cleanup frequency
- **Fix**: Optimize cleanup algorithm or increase interval

### FAQ

**Q: What happens to tokens signed with expired secrets?**
A: Once a secret has been removed by cleanup, tokens signed with it fail validation and users must re-authenticate.
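
The behavior described in this answer can be illustrated with a small sketch that checks an HMAC-SHA256 signature against every retained secret. It uses only the standard library and raw signatures instead of a JWT library; `sign` and `verifyWithAny` are hypothetical helpers, and the project's actual validation path may differ.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
)

// sign computes an HMAC-SHA256 signature of payload with secret.
func sign(secret string, payload []byte) []byte {
	mac := hmac.New(sha256.New, []byte(secret))
	mac.Write(payload)
	return mac.Sum(nil)
}

// verifyWithAny reports whether signature matches payload under any of the
// retained secrets. Once a secret is removed from the list, signatures
// made with it no longer verify.
func verifyWithAny(secrets []string, payload, signature []byte) bool {
	for _, s := range secrets {
		// hmac.Equal compares in constant time.
		if hmac.Equal(sign(s, payload), signature) {
			return true
		}
	}
	return false
}
```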

**Q: Can I disable automatic cleanup?**
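A: There is no dedicated switch, and validation rejects a non-positive `cleanup_interval`. As a workaround, set `cleanup_interval` to a very high value (e.g., `8760h` for 1 year).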

**Q: How does this affect existing deployments?**
A: Existing deployments will use sensible defaults. The feature is backward compatible.

**Q: What's the recommended retention factor?**
A: Start with 2.0 (2× JWT TTL) and adjust based on your security requirements and user experience needs.

**Q: How often should cleanup run?**
A: For most deployments, every 1 hour is sufficient. High-volume systems may need more frequent cleanup.

## Decision Record

**Approved By**:
**Approved Date**:
**Implemented By**:
**Implementation Date**:

---

*Generated by Mistral Vibe*
*Co-Authored-By: Mistral Vibe <vibe@mistral.ai>*