Files
dance-lessons-coach/adr/0021-jwt-secret-retention-policy.md
Gabriel Radureau a24b4fdb3b 📝 docs(adr): homogenize 23 ADRs + rewrite README (Tâche 7 migration) (#18)
## Summary

Homogenize all 23 ADRs to a single canonical header format, and rewrite `adr/README.md` to match the actual state of the corpus.

This is **Tâche 7** of the ARCODANGE Phase 1 migration (Claude Code → Mistral Vibe). Independent from PR #17 (Tâche 6 — restructure AGENTS.md) — both can merge in any order. No code changes; only documentation.

## Changes

### 1. Homogenize 21 ADR headers (commit `db09d0a`)

The audit (Tâche 6 Phase A, Mistral intent-router agent, 2026-05-02) had identified **3 inconsistent header formats** :

- **F1** — list bullets (`* Status:` / `* Date:` / `* Deciders:`) : 11 ADRs (0001-0008, 0011, 0014, 0023)
- **F2** — bold fields (`**Status:**` / `**Date:**` / `**Authors:**`) : 9 ADRs (0009, 0010, 0012, 0013, 0015, 0016, 0017, 0018, 0019)
- **F3** — dedicated section (`## Status\n**Value** `) : 5 ADRs (0020, 0021, 0022, 0024, 0025)

Plus mixed metadata names (Authors / Deciders / Decision Date / Implementation Date / Implementation Status / Last Updated) and decorative emojis on status values made the corpus hard to scan or template against.

**Canonical format adopted** (see `adr/README.md` for full template) :

```markdown
# NN. Title

**Status:** <Proposed | Accepted | Implemented | Partially Implemented | Approved | Rejected | Deferred | Deprecated | Superseded by ADR-NNNN>
**Date:** YYYY-MM-DD
**Authors:** Name(s)

[optional **Field:** ... lines]

## Context...
```

**Transformations applied** (via `/tmp/homogenize-adrs.py` script, 23 files scanned, 21 modified — 0010 and 0012 were already conform) :

- F1 list bullets → bold fields
- F2 cleanup : `**Deciders:**` → `**Authors:**`, strip status emojis
- F3 sections : `## Status\n**Value** ` → `**Status:** Value` (single line)
- Strip decorative emojis from `**Status:**` and `**Implementation Status:**`
- Convert `* Last Updated:` / `* Implementation Status:` / `* Decision Drivers:` / `* Decision Date:` to bold
- Date typo fix : `2024-04-XX` → `2026-04-XX` for ADRs 0018, 0019 (off-by-2-years in original)
- Normalize multiple blank lines after header (max 1)

**ADR body content is preserved unchanged.** Only headers transformed.

### 2. Rewrite `adr/README.md` (commit `d64ab02`)

Previous README had multiple inconsistencies :

- Index table listed wrong titles for ADRs 0010-0021 (looked like an aspirational forecast that never matched reality — e.g. "0011 = Trunk-Based Development" but real 0011 is absent and Trunk-Based Development is actually 0017)
- Listed entries for ADRs 0011 (validation library) and 0014 (gRPC) but **these files do not exist** in the repo
- 0024 (BDD Test Organization) was missing from the detail list
- Template still showed the obsolete F1 format (`* Status:`)
- Decorative emojis on every status entry

Rewrite :

- Index table **regenerated from actual file contents** (title from H1, status from `**Status:**` line) — emoji-free, accurate
- Notes that 0011 / 0014 are not currently in use (reserved)
- Updated template block matches the canonical format
- Status Legend extended with `Approved`, `Partially Implemented`, `Deferred`
- Added note that 0026 is the next free number for new ADRs

## Test plan

- [x] All 23 ADRs follow `**Status:**` / `**Date:**` / `**Authors:**` (verified via grep)
- [x] No more occurrences of `* Status:` (F1) or `## Status` (F3) in any ADR header
- [x] No more emojis on `**Status:**` lines
- [x] `adr/README.md` index links resolve to existing files (no more 0011 / 0014 dead links)
- [x] Pre-commit hooks pass (`go mod tidy`, `go fmt`, `swag fmt`)

## Migration context

Part of Phase 1 of the ARCODANGE migration from Claude Code to Mistral Vibe. Tâche 7 of the curriculum.

Independent from PR #17 (which restructures `AGENTS.md`). The two PRs touch disjoint files — no merge conflict expected when both are merged.

🤖 Generated with [Claude Code](https://claude.com/claude-code) (Opus 4.7, 1M context). Mistral Vibe (intent-router agent / mistral-medium-3.5) did the original audit identifying the 3 formats during Tâche 6 Phase A.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Mistral Vibe (devstral-2 / mistral-medium-3.5)
Reviewed-on: #18
Co-authored-by: Gabriel Radureau <arcodange@gmail.com>
Co-committed-by: Gabriel Radureau <arcodange@gmail.com>
2026-05-03 11:01:13 +02:00

13 KiB
Raw Permalink Blame History

10. JWT Secret Retention Policy

Status: Proposed

Context

The dance-lessons-coach application requires a robust JWT secret management system that balances security and user experience. As implemented in ADR-0009, the system supports multiple JWT secrets for graceful rotation. However, the current implementation lacks a clear policy for secret retention and cleanup.

Current State

  • Multiple JWT secrets supported
  • Graceful rotation implemented
  • Backward compatibility maintained
  • No automatic cleanup of old secrets
  • No configurable retention periods
  • No expiration-based secret management

Problem Statement

Without a retention policy:

  1. Security Risk: Old secrets accumulate indefinitely, increasing attack surface
  2. Memory Bloat: Unbounded growth of secret storage
  3. Operational Overhead: Manual cleanup required
  4. Compliance Issues: May violate security policies requiring regular key rotation

Requirements

  1. Configurable Retention: Administrators should control how long secrets are retained
  2. Automatic Cleanup: System should automatically remove expired secrets
  3. Backward Compatibility: Existing tokens should continue working during retention period
  4. Sensible Defaults: Should work out-of-the-box with secure defaults
  5. Performance: Cleanup should not impact runtime performance

Decision

JWT Secret Retention Policy

Implement a configurable retention policy based on JWT TTL (Time-To-Live) with the following components:

1. Configuration Structure

jwt:
  # Token time-to-live (default: 24h)
  ttl: 24h

  # Secret retention configuration
  secret_retention:
    # Retention factor multiplier (default: 2.0)
    # Retention period = JWT TTL × retention_factor
    retention_factor: 2.0

    # Maximum retention period (safety limit, default: 72h)
    max_retention: 72h

    # Cleanup frequency for expired secrets (default: 1h)
    cleanup_interval: 1h

2. Retention Period Calculation

retention_period = min(JWT_TTL × retention_factor, max_retention)

Examples:

  • Default (24h TTL, 2.0 factor): min(48h, 72h) = 48h
  • Short-lived tokens (1h TTL, 3.0 factor): min(3h, 72h) = 3h
  • Long-lived tokens (72h TTL, 2.0 factor): min(144h, 72h) = 72h

3. Secret Lifecycle

graph LR
    A[Secret Created] --> B[Active Period]
    B --> C{Retention Period}
    C -->|Expired| D[Marked for Cleanup]
    C -->|Valid| B
    D --> E[Automatic Removal]

4. Cleanup Process

  • Frequency: Configurable interval (default: 1 hour)
  • Scope: Remove secrets older than retention period
  • Safety: Never remove current primary secret
  • Logging: Audit trail of cleanup operations

Implementation Strategy

Phase 1: Configuration Framework

  1. Extend Config Package (pkg/config/config.go)

    • Add JWT TTL configuration
    • Add secret retention parameters
    • Implement validation
  2. Environment Variables

    # JWT Token TTL
    DLC_JWT_TTL=24h
    
    # Secret Retention
    DLC_JWT_SECRET_RETENTION_FACTOR=2.0
    DLC_JWT_SECRET_MAX_RETENTION=72h
    DLC_JWT_SECRET_CLEANUP_INTERVAL=1h
    

Phase 2: Secret Manager Enhancement

  1. Enhance JWTSecret Struct

    type JWTSecret struct {
        Secret     string
        IsPrimary  bool
        CreatedAt  time.Time
        ExpiresAt  *time.Time  // Now properly calculated
        RetentionPeriod time.Duration
    }
    
  2. Add Expiration Logic

    func (m *JWTSecretManager) AddSecret(secret string, isPrimary bool, expiresIn time.Duration) {
        // Calculate retention period based on config
        retentionPeriod := m.calculateRetentionPeriod()
        expiresAt := time.Now().Add(expiresIn)
    
        m.secrets = append(m.secrets, JWTSecret{
            Secret:        secret,
            IsPrimary:     isPrimary,
            CreatedAt:     time.Now(),
            ExpiresAt:     &expiresAt,
            RetentionPeriod: retentionPeriod,
        })
    }
    

Phase 3: Automatic Cleanup

  1. Background Cleanup Job

    func (m *JWTSecretManager) StartCleanupJob(ctx context.Context, interval time.Duration) {
        ticker := time.NewTicker(interval)
        go func() {
            for {
                select {
                case <-ticker.C:
                    m.CleanupExpiredSecrets()
                case <-ctx.Done():
                    ticker.Stop()
                    return
                }
            }
        }()
    }
    
  2. Cleanup Implementation

    func (m *JWTSecretManager) CleanupExpiredSecrets() {
        now := time.Now()
        var activeSecrets []JWTSecret
    
        for _, secret := range m.secrets {
            if secret.IsPrimary {
                // Never remove current primary
                activeSecrets = append(activeSecrets, secret)
                continue
            }
    
            // Check if secret is within retention period
            if now.Sub(secret.CreatedAt) <= secret.RetentionPeriod {
                activeSecrets = append(activeSecrets, secret)
            } else {
                log.Info().
                    Str("secret", secret.Secret).
                    Msg("Removed expired JWT secret")
            }
        }
    
        m.secrets = activeSecrets
    }
    

Phase 4: Integration

  1. Server Initialization
    func (s *Server) InitializeJWT() error {
        // Load config
        jwtConfig := s.config.GetJWTConfig()
    
        // Create secret manager with retention policy
        secretManager := NewJWTSecretManager(
            jwtConfig.Secret,
            WithRetentionFactor(jwtConfig.RetentionFactor),
            WithMaxRetention(jwtConfig.MaxRetention),
        )
    
        // Start cleanup job
        secretManager.StartCleanupJob(s.ctx, jwtConfig.CleanupInterval)
    
        return nil
    }
    

Validation

1. Configuration Validation

func (c *Config) ValidateJWTConfig() error {
    if c.JWT.TTL <= 0 {
        return fmt.Errorf("jwt.ttl must be positive")
    }
    
    if c.JWT.SecretRetention.RetentionFactor < 1.0 {
        return fmt.Errorf("jwt.secret_retention.retention_factor must be ≥ 1.0")
    }
    
    if c.JWT.SecretRetention.MaxRetention <= 0 {
        return fmt.Errorf("jwt.secret_retention.max_retention must be positive")
    }
    
    if c.JWT.SecretRetention.CleanupInterval <= 0 {
        return fmt.Errorf("jwt.secret_retention.cleanup_interval must be positive")
    }
    
    // Ensure max retention is reasonable
    if c.JWT.SecretRetention.MaxRetention > 720h { // 30 days
        return fmt.Errorf("jwt.secret_retention.max_retention exceeds maximum of 720h")
    }
    
    return nil
}

2. Runtime Validation

func (m *JWTSecretManager) ValidateSecret(secret string) error {
    // Check minimum length
    if len(secret) < 16 {
        return fmt.Errorf("jwt secret must be at least 16 characters")
    }
    
    // Check entropy (basic check)
    if !hasSufficientEntropy(secret) {
        return fmt.Errorf("jwt secret must have sufficient entropy")
    }
    
    return nil
}

Monitoring and Observability

1. Metrics

// Prometheus metrics
var (
    jwtSecretsActive = prometheus.NewGauge(prometheus.GaugeOpts{
        Name: "jwt_secrets_active_count",
        Help: "Number of active JWT secrets",
    })
    
    jwtSecretsExpired = prometheus.NewCounter(prometheus.CounterOpts{
        Name: "jwt_secrets_expired_total",
        Help: "Total number of expired JWT secrets removed",
    })
    
    jwtSecretRetentionDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
        Name: "jwt_secret_retention_duration_seconds",
        Help: "Duration of JWT secret retention periods",
        Buckets: prometheus.ExponentialBuckets(3600, 2, 6), // 1h to 32h
    })
)

2. Logging

func (m *JWTSecretManager) logSecretEvent(secret string, event string, details ...interface{}) {
    log.Info().
        Str("secret", maskSecret(secret)).
        Str("event", event).
        Interface("details", details).
        Msg("JWT secret event")
}

func maskSecret(secret string) string {
    if len(secret) <= 4 {
        return "****"
    }
    return secret[:4] + "****" + secret[len(secret)-4:]
}

Consequences

Positive

  1. Enhanced Security: Automatic cleanup reduces attack surface
  2. Reduced Memory Usage: Prevents unbounded growth of secret storage
  3. Operational Efficiency: No manual cleanup required
  4. Compliance Ready: Meets security policy requirements for key rotation
  5. Flexibility: Configurable to meet different security requirements

Negative

  1. Complexity: Adds configuration and cleanup logic
  2. Performance Overhead: Background cleanup job (minimal impact)
  3. Migration: Existing deployments need configuration updates
  4. Debugging: More moving parts to troubleshoot

Neutral

  1. Backward Compatibility: Existing tokens continue to work
  2. Learning Curve: New configuration options to understand
  3. Monitoring: Additional metrics to track

Alternatives Considered

Alternative 1: Fixed Retention Period

Proposal: Use fixed retention period (e.g., 48 hours) instead of TTL-based calculation

Rejected Because:

  • Less flexible for different use cases
  • Doesn't scale with JWT TTL changes
  • May be too short for long-lived tokens or too long for short-lived ones

Alternative 2: Manual Cleanup Only

Proposal: Require administrators to manually clean up old secrets

Rejected Because:

  • Operational overhead
  • Security risk if cleanup is forgotten
  • Doesn't scale for frequent rotations

Alternative 3: No Retention (Current State)

Proposal: Keep current behavior with no automatic cleanup

Rejected Because:

  • Security concerns with accumulating secrets
  • Memory management issues
  • Compliance violations

Success Metrics

  1. Security: No old secrets remain beyond retention period
  2. Reliability: 99.9% of valid tokens continue to work during rotation
  3. Performance: Cleanup job completes in <100ms with <1000 secrets
  4. Adoption: Configuration used in 100% of deployments within 3 months

Migration Plan

Phase 1: Preparation (1 week)

  • Create this ADR
  • Update documentation
  • Add configuration to config package
  • Implement basic retention logic

Phase 2: Testing (2 weeks)

  • Write BDD scenarios for retention
  • Add unit tests for secret manager
  • Test with various TTL/factor combinations
  • Performance testing with large secret counts

Phase 3: Rollout (1 week)

  • Update default configuration
  • Add feature flag for gradual rollout
  • Monitor metrics in staging
  • Gradual production rollout

Phase 4: Optimization (Ongoing)

  • Monitor cleanup performance
  • Adjust defaults based on real-world usage
  • Add alerts for cleanup failures
  • Document troubleshooting guide

References

Appendix

Configuration Examples

Development Environment (short retention for testing):

jwt:
  ttl: 1h
  secret_retention:
    retention_factor: 1.5
    max_retention: 3h
    cleanup_interval: 30m

Production Environment (secure defaults):

jwt:
  ttl: 24h
  secret_retention:
    retention_factor: 2.0
    max_retention: 72h
    cleanup_interval: 1h

High-Security Environment (aggressive rotation):

jwt:
  ttl: 8h
  secret_retention:
    retention_factor: 1.5
    max_retention: 24h
    cleanup_interval: 30m

Troubleshooting

Issue: Secrets being removed too quickly

  • Check: Retention factor and JWT TTL settings
  • Fix: Increase retention_factor or JWT TTL

Issue: Too many old secrets accumulating

  • Check: Cleanup job logs and interval
  • Fix: Decrease cleanup_interval or retention_factor

Issue: Performance degradation during cleanup

  • Check: Number of secrets and cleanup frequency
  • Fix: Optimize cleanup algorithm or increase interval

FAQ

Q: What happens to tokens signed with expired secrets? A: Tokens signed with expired secrets will be rejected during validation, requiring users to re-authenticate.

Q: Can I disable automatic cleanup? A: Yes, set cleanup_interval to a very high value (e.g., 8760h for 1 year).

Q: How does this affect existing deployments? A: Existing deployments will use sensible defaults. The feature is backward compatible.

Q: What's the recommended retention factor? A: Start with 2.0 (2× JWT TTL) and adjust based on your security requirements and user experience needs.

Q: How often should cleanup run? A: For most deployments, every 1 hour is sufficient. High-volume systems may need more frequent cleanup.

Decision Record

Approved By: Approved Date: Implemented By: Implementation Date:


Generated by Mistral Vibe Co-Authored-By: Mistral Vibe vibe@mistral.ai