# ADR 0022: Rate Limiting and Cache Strategy

**Status:** Implemented (Phase 1); Phase 2 still Proposed
## Context
As the dance-lessons-coach application grows and potentially serves multiple users simultaneously, we need to implement rate limiting to:
- Prevent abuse of API endpoints
- Protect against DDoS attacks
- Ensure fair usage across all users
- Maintain system stability under load
- Provide consistent performance
Additionally, we need a caching strategy to:
- Reduce database load for frequently accessed data
- Improve response times for common requests
- Support horizontal scaling with shared cache
- Handle cache invalidation properly
## Decision
We will implement a multi-phase caching and rate limiting strategy with the following components:
### Phase 1: In-Memory Cache with TTL Support

**Library Selection:** We will use `github.com/patrickmn/go-cache` for in-memory caching because:
✅ Pros:
- Simple, lightweight, and well-maintained
- Built-in TTL (Time-To-Live) support
- Thread-safe by default
- No external dependencies
- Good performance for single-instance applications
- Supports automatic expiration
❌ Cons:
- Not shared between multiple instances
- Memory-bound (not persistent)
- Limited advanced features
**Implementation Plan:**

```go
type CacheService interface {
	Set(key string, value interface{}, expiration time.Duration) error
	Get(key string) (interface{}, bool)
	Delete(key string) error
	Flush() error
	GetWithTTL(key string) (interface{}, time.Duration, bool)
}

type InMemoryCacheService struct {
	cache           *cache.Cache
	defaultTTL      time.Duration
	cleanupInterval time.Duration
}
```
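A minimal sketch of how `InMemoryCacheService` could satisfy this interface on top of `go-cache` (the constructor and method bodies below are illustrative, not the final implementation):

```go
package cache

import (
	"time"

	"github.com/patrickmn/go-cache"
)

// NewInMemoryCacheService wires go-cache with the configured TTL and cleanup interval.
func NewInMemoryCacheService(defaultTTL, cleanupInterval time.Duration) *InMemoryCacheService {
	return &InMemoryCacheService{
		cache:           cache.New(defaultTTL, cleanupInterval),
		defaultTTL:      defaultTTL,
		cleanupInterval: cleanupInterval,
	}
}

// Set stores a value; go-cache treats a zero duration as "use the default TTL",
// and its Set never fails, so the error is always nil here.
func (s *InMemoryCacheService) Set(key string, value interface{}, expiration time.Duration) error {
	s.cache.Set(key, value, expiration)
	return nil
}

// Get returns the value and whether it was present (and not expired).
func (s *InMemoryCacheService) Get(key string) (interface{}, bool) {
	return s.cache.Get(key)
}

// GetWithTTL converts go-cache's absolute expiration time into a remaining TTL.
func (s *InMemoryCacheService) GetWithTTL(key string) (interface{}, time.Duration, bool) {
	value, expiresAt, found := s.cache.GetWithExpiration(key)
	if !found {
		return nil, 0, false
	}
	if expiresAt.IsZero() {
		// Entry was stored with no expiration.
		return value, -1, true
	}
	return value, time.Until(expiresAt), true
}

func (s *InMemoryCacheService) Delete(key string) error { s.cache.Delete(key); return nil }

func (s *InMemoryCacheService) Flush() error { s.cache.Flush(); return nil }
```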
Use Cases:
- JWT token validation results
- User session data
- Frequently accessed greet messages
- API response caching for idempotent endpoints
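As one example, JWT validation results can be memoized under a hashed token key, so the raw token never becomes a cache key (see the security considerations below); `validateWithCache` and its `validate` callback are hypothetical names:

```go
import (
	"crypto/sha256"
	"encoding/hex"
	"time"
)

// jwtValidationKey hashes the raw token so the cache never stores the token itself.
func jwtValidationKey(token string) string {
	sum := sha256.Sum256([]byte(token))
	return "jwt:validation:" + hex.EncodeToString(sum[:])
}

// validateWithCache checks the cache first, falls back to full validation,
// then memoizes the result with a short TTL to limit staleness.
func validateWithCache(c CacheService, token string, validate func(string) (bool, error)) (bool, error) {
	key := jwtValidationKey(token)
	if v, ok := c.Get(key); ok {
		if valid, isBool := v.(bool); isBool {
			return valid, nil
		}
	}
	valid, err := validate(token)
	if err != nil {
		return false, err
	}
	_ = c.Set(key, valid, 1*time.Minute)
	return valid, nil
}
```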
### Phase 2: Redis-Compatible Shared Cache

**Library Selection:** We will use `github.com/redis/go-redis/v9` with a Redis-compatible alternative server:

**Primary Choice: Dragonfly** (https://www.dragonflydb.io/)
- Redis-compatible
- Source-available (Business Source License 1.1)
- Written in C++ with multi-threaded architecture
- Up to 25x the throughput of Redis, per vendor benchmarks
- Lower latency
- Drop-in Redis replacement

**Fallback Choice: KeyDB** (https://keydb.dev/)
- Multi-threaded Redis fork
- Open-source (BSD-3-Clause license)
- Better performance than Redis
- Full Redis API compatibility
**Implementation Plan:**

```go
type RedisCacheService struct {
	client     *redis.Client
	defaultTTL time.Duration
	prefix     string
}

func NewRedisCacheService(config *config.CacheConfig) (*RedisCacheService, error) {
	client := redis.NewClient(&redis.Options{
		Addr:     config.Host + ":" + strconv.Itoa(config.Port),
		Password: config.Password,
		DB:       config.Database,
		PoolSize: config.PoolSize,
	})

	// Test connection
	if _, err := client.Ping(context.Background()).Result(); err != nil {
		return nil, fmt.Errorf("failed to connect to Redis: %w", err)
	}

	return &RedisCacheService{
		client:     client,
		defaultTTL: config.DefaultTTL,
		prefix:     config.Prefix,
	}, nil
}
```
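A sketch of matching `Get`/`Set` methods with go-redis v9 follows; values are JSON-encoded for illustration, and the context-aware signatures deliberately differ from the Phase 1 interface (reconciling the two is an open design point):

```go
import (
	"context"
	"encoding/json"
	"errors"
	"time"

	"github.com/redis/go-redis/v9"
)

// Set JSON-encodes the value and writes it with a TTL under the configured prefix.
func (s *RedisCacheService) Set(ctx context.Context, key string, value interface{}, ttl time.Duration) error {
	if ttl == 0 {
		ttl = s.defaultTTL
	}
	data, err := json.Marshal(value)
	if err != nil {
		return err
	}
	return s.client.Set(ctx, s.prefix+key, data, ttl).Err()
}

// Get decodes the stored value into dest; the bool reports whether the key existed.
func (s *RedisCacheService) Get(ctx context.Context, key string, dest interface{}) (bool, error) {
	data, err := s.client.Get(ctx, s.prefix+key).Bytes()
	if errors.Is(err, redis.Nil) {
		return false, nil // a miss, not a failure
	}
	if err != nil {
		return false, err
	}
	return true, json.Unmarshal(data, dest)
}
```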
**Configuration:**

```yaml
cache:
  # In-memory cache configuration
  in_memory:
    enabled: true
    default_ttl: 5m
    cleanup_interval: 10m
    max_items: 10000

  # Redis-compatible cache configuration
  redis:
    enabled: false
    host: "localhost"
    port: 6379
    password: ""
    database: 0
    pool_size: 10
    default_ttl: 5m
    prefix: "dlc:"
    use_dragonfly: true # Set to false to use KeyDB
```
### Phase 3: Rate Limiting Implementation

**Library Selection:** We will use `github.com/ulule/limiter/v3` because:
✅ Pros:
- Multiple storage backends (in-memory, Redis, etc.)
- Sliding window algorithm
- Distributed rate limiting support
- Configurable rate limits
- Middleware support for Chi router
- Good performance
Implementation Plan:
// Rate limit configuration
type RateLimitConfig struct {
Enabled bool `mapstructure:"enabled"`
RequestsPerHour int `mapstructure:"requests_per_hour"`
BurstLimit int `mapstructure:"burst_limit"`
IPWhitelist []string `mapstructure:"ip_whitelist"`
EndpointSpecific map[string]struct {
RequestsPerHour int `mapstructure:"requests_per_hour"`
BurstLimit int `mapstructure:"burst_limit"`
} `mapstructure:"endpoint_specific"`
}
// Rate limiter service
type RateLimiterService struct {
limiter *limiter.Limiter
store limiter.Store
config *RateLimitConfig
}
func NewRateLimiterService(config *RateLimitConfig) (*RateLimiterService, error) {
var store limiter.Store
// Use Redis if available, otherwise use in-memory
if config.UseRedis {
// Initialize Redis store
store, err = limiter.NewStoreRedisWithOptions(&limiter.StoreOptions{
Prefix: config.RedisPrefix,
// ... other Redis options
})
} else {
// Use in-memory store
store = limiter.NewStoreMemory()
}
if err != nil {
return nil, fmt.Errorf("failed to create rate limiter store: %w", err)
}
// Create rate limiter
rate := limiter.Rate{
Period: time.Hour,
Limit: int64(config.RequestsPerHour),
}
return &RateLimiterService{
limiter: limiter.New(store, rate),
store: store,
config: config,
}, nil
}
**Chi Middleware:**

```go
func RateLimitMiddleware(rl *RateLimiterService) func(http.Handler) http.Handler {
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			// Resolve the client IP. X-Real-IP must only be trusted when set by
			// our own reverse proxy; r.RemoteAddr includes the port, so strip it.
			clientIP := r.Header.Get("X-Real-IP")
			if clientIP == "" {
				if host, _, err := net.SplitHostPort(r.RemoteAddr); err == nil {
					clientIP = host
				} else {
					clientIP = r.RemoteAddr
				}
			}

			// Skip rate limiting for whitelisted IPs
			for _, allowedIP := range rl.config.IPWhitelist {
				if clientIP == allowedIP {
					next.ServeHTTP(w, r)
					return
				}
			}

			// Get rate limit context
			limiterCtx, err := rl.limiter.Get(r.Context(), clientIP)
			if err != nil {
				log.Error().Err(err).Str("ip", clientIP).Msg("Rate limit error")
				http.Error(w, "Internal server error", http.StatusInternalServerError)
				return
			}

			// Set rate limit headers on every response
			w.Header().Set("X-RateLimit-Limit", strconv.FormatInt(limiterCtx.Limit, 10))
			w.Header().Set("X-RateLimit-Remaining", strconv.FormatInt(limiterCtx.Remaining, 10))
			w.Header().Set("X-RateLimit-Reset", strconv.FormatInt(limiterCtx.Reset, 10))

			// Reject the request once the limit has been reached
			if limiterCtx.Reached {
				http.Error(w, "Too many requests", http.StatusTooManyRequests)
				return
			}

			next.ServeHTTP(w, r)
		})
	}
}
```
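Wiring this into the Chi router might then look like the following; `cfg`, `loginHandler`, and `registerHandler` are hypothetical, and the endpoint-specific limits from the configuration below are not shown:

```go
r := chi.NewRouter()

// nil Redis client: the service falls back to the in-memory store.
rl, err := NewRateLimiterService(cfg.RateLimiting, nil)
if err != nil {
	log.Fatal().Err(err).Msg("failed to initialize rate limiter")
}

r.Use(RateLimitMiddleware(rl))
r.Post("/api/v1/auth/login", loginHandler)
r.Post("/api/v1/auth/register", registerHandler)
```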
### Phase 4: Cache Invalidation Strategy

**Approach:** Hybrid cache invalidation with multiple strategies:

1. **Time-Based Expiration (TTL)**
   - All cache entries have a TTL
   - Automatic expiration prevents stale data
   - Default TTL: 5 minutes for most data

2. **Event-Based Invalidation**
   - Cache keys are invalidated on specific events
   - Example: User data cache invalidated on user update
   - Uses pub/sub pattern for distributed invalidation

3. **Versioned Cache Keys**
   - Cache keys include data version
   - When data changes, version increments
   - Old cache entries naturally expire

4. **Write-Through Caching**
   - Data written to database and cache simultaneously
   - Ensures cache is always up-to-date
   - Used for critical data that must be consistent
**Cache Key Strategy:**

```go
func GetCacheKey(prefix, entityType, entityID string) string {
	return fmt.Sprintf("%s:%s:%s", prefix, entityType, entityID)
}

// Example: "dlc:user:123"
// Example: "dlc:jwt:validation:token_hash"
```
## Implementation Phases

### Phase 1: In-Memory Cache (Current Sprint)

- ✅ Research and select in-memory cache library
- ✅ Implement cache interface and in-memory service
- ✅ Add cache configuration to config package
- ✅ Implement basic cache operations (set, get, delete)
- ✅ Add TTL support and automatic cleanup
- ✅ Cache JWT validation results
- ✅ Add cache metrics and monitoring

### Phase 2: Redis-Compatible Cache (Next Sprint)

- [ ] Set up Dragonfly/KeyDB in development environment
- [ ] Implement Redis cache service
- [ ] Add configuration for Redis connection
- [ ] Implement cache fallback strategy (Redis → in-memory)
- [ ] Add health checks for Redis connection
- [ ] Implement distributed cache invalidation

### Phase 3: Rate Limiting (Following Sprint)

- ✅ Research and select rate limiting library
- ✅ Implement rate limiter service
- ✅ Add rate limit configuration
- ✅ Implement Chi middleware for rate limiting
- ✅ Add rate limit headers to responses
- ✅ Implement IP whitelisting
- ✅ Add endpoint-specific rate limits

### Phase 4: Advanced Features (Future)

- [ ] Cache warming for critical data
- [ ] Two-level caching (Redis + in-memory)
- [ ] Cache compression for large objects
- [ ] Rate limit exemptions for admin users
- [ ] Dynamic rate limit adjustment
- [ ] Cache analytics and usage patterns
## Configuration

```yaml
# Cache configuration
cache:
  in_memory:
    enabled: true
    default_ttl: "5m"
    cleanup_interval: "10m"
    max_items: 10000
  redis:
    enabled: false
    host: "localhost"
    port: 6379
    password: ""
    database: 0
    pool_size: 10
    default_ttl: "5m"
    prefix: "dlc:"
    use_dragonfly: true

# Rate limiting configuration
rate_limiting:
  enabled: true
  requests_per_hour: 1000
  burst_limit: 100
  ip_whitelist:
    - "127.0.0.1"
    - "::1"
  endpoint_specific:
    "/api/v1/auth/login":
      requests_per_hour: 100
      burst_limit: 10
    "/api/v1/auth/register":
      requests_per_hour: 50
      burst_limit: 5
```
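The matching cache config structs could look like this, assuming Viper-style `mapstructure` decoding as already used in `RateLimitConfig` above (Viper's default decode hooks parse strings like "5m" into `time.Duration`). With this nested shape, `NewRedisCacheService` would take the `RedisConfig` part; this is a sketch, not the final config package:

```go
import "time"

type CacheConfig struct {
	InMemory InMemoryConfig `mapstructure:"in_memory"`
	Redis    RedisConfig    `mapstructure:"redis"`
}

type InMemoryConfig struct {
	Enabled         bool          `mapstructure:"enabled"`
	DefaultTTL      time.Duration `mapstructure:"default_ttl"`
	CleanupInterval time.Duration `mapstructure:"cleanup_interval"`
	MaxItems        int           `mapstructure:"max_items"`
}

type RedisConfig struct {
	Enabled      bool          `mapstructure:"enabled"`
	Host         string        `mapstructure:"host"`
	Port         int           `mapstructure:"port"`
	Password     string        `mapstructure:"password"`
	Database     int           `mapstructure:"database"`
	PoolSize     int           `mapstructure:"pool_size"`
	DefaultTTL   time.Duration `mapstructure:"default_ttl"`
	Prefix       string        `mapstructure:"prefix"`
	UseDragonfly bool          `mapstructure:"use_dragonfly"`
}
```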
## Monitoring and Metrics
Cache Metrics:
- Cache hit/miss ratio
- Average cache latency
- Cache size and memory usage
- Eviction rate
- TTL distribution
Rate Limit Metrics:
- Requests allowed vs rejected
- Rate limit exceeded events
- Top limited IPs
- Endpoint-specific rate limit usage
**Prometheus Metrics:**

```go
var (
	cacheHits = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "cache_hits_total",
		Help: "Number of cache hits",
	}, []string{"cache_type", "entity_type"})

	cacheMisses = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "cache_misses_total",
		Help: "Number of cache misses",
	}, []string{"cache_type", "entity_type"})

	// Note: a per-IP label is high-cardinality; consider bucketing or dropping it.
	rateLimitExceeded = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "rate_limit_exceeded_total",
		Help: "Number of rate limit exceeded events",
	}, []string{"endpoint", "ip"})
)

func init() {
	// Collectors must be registered before they can be scraped.
	prometheus.MustRegister(cacheHits, cacheMisses, rateLimitExceeded)
}
```
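These counters would be incremented at the cache boundary; a thin hypothetical wrapper shows the idea:

```go
// instrumentedGet wraps CacheService.Get and records hit/miss metrics.
func instrumentedGet(c CacheService, cacheType, entityType, key string) (interface{}, bool) {
	value, ok := c.Get(key)
	if ok {
		cacheHits.WithLabelValues(cacheType, entityType).Inc()
	} else {
		cacheMisses.WithLabelValues(cacheType, entityType).Inc()
	}
	return value, ok
}
```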
## Security Considerations

1. **Cache Security:**
   - Never cache sensitive user data (passwords, tokens)
   - Use separate cache prefixes for different data types
   - Implement cache key hashing for sensitive data
   - Set appropriate TTLs to limit exposure

2. **Rate Limit Security:**
   - Prevent rate limit bypass attacks
   - Trust the X-Real-IP header only when it is set by our own reverse proxy
   - Implement rate limits for authentication endpoints
   - Log rate limit violations for security monitoring

3. **Redis Security:**
   - Use authentication if enabled
   - Use TLS for Redis connections
   - Use separate database numbers for different environments
   - Limit Redis commands to prevent abuse
## Performance Considerations

1. **Cache Performance:**
   - Benchmark cache operations (see the sketch after this list)
   - Monitor cache latency
   - Optimize cache key size
   - Use appropriate data structures

2. **Rate Limit Performance:**
   - Use an efficient rate limiting algorithm
   - Minimize middleware overhead
   - Cache rate limit decisions
   - Batch rate limit checks where possible

3. **Memory Management:**
   - Set reasonable cache size limits
   - Monitor memory usage
   - Implement cache eviction policies
   - Use memory-efficient data structures
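A minimal Go benchmark sketch for the cache-benchmarking point above, using the `NewInMemoryCacheService` constructor sketched in Phase 1:

```go
package cache

import (
	"strconv"
	"testing"
	"time"
)

// BenchmarkInMemorySet measures raw Set throughput of the in-memory service.
func BenchmarkInMemorySet(b *testing.B) {
	svc := NewInMemoryCacheService(5*time.Minute, 10*time.Minute)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		_ = svc.Set("bench:"+strconv.Itoa(i), i, 0)
	}
}

// BenchmarkInMemoryGet measures Get latency on a hot key.
func BenchmarkInMemoryGet(b *testing.B) {
	svc := NewInMemoryCacheService(5*time.Minute, 10*time.Minute)
	_ = svc.Set("bench:hot", 42, 0)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		svc.Get("bench:hot")
	}
}
```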
## Migration Strategy

### From No Cache to In-Memory Cache
- Implement cache interface and in-memory service
- Add cache configuration with sensible defaults
- Gradually add caching to critical endpoints
- Monitor cache performance and hit ratios
- Adjust TTLs based on usage patterns
### From In-Memory to Redis Cache
- Set up Dragonfly/KeyDB in development
- Implement Redis cache service
- Add fallback logic (Redis → in-memory)
- Test with both caches enabled
- Gradually migrate to Redis-only
- Monitor distributed cache performance
### From No Rate Limiting to Rate Limiting
- Implement rate limiter with generous limits
- Add monitoring for rate limit events
- Gradually tighten limits based on usage
- Add IP whitelist for critical services
- Implement endpoint-specific limits
- Monitor and adjust as needed
## Alternatives Considered

### Cache Libraries

- `github.com/bluele/gcache` - More features but more complex
- `github.com/allegro/bigcache` - High performance but no per-entry TTL
- `github.com/coocood/freecache` - Very fast but limited API

### Redis Alternatives

- Redis Enterprise - Commercial, not open-source
- Memcached - No persistence, simpler protocol
- Couchbase - More complex, document-oriented

### Rate Limiting Libraries

- `golang.org/x/time/rate` - Simple but no distributed support
- `github.com/juju/ratelimit` - Good but limited features
- Custom implementation - Too much development effort
## Success Metrics

1. **Cache Effectiveness:**
   - Cache hit ratio > 80%
   - Average cache latency < 1ms
   - Memory usage within limits

2. **Rate Limiting Effectiveness:**
   - < 1% of legitimate requests blocked
   - Effective protection against abuse
   - No impact on normal usage patterns

3. **System Stability:**
   - Database load reduced by 50%
   - Consistent response times under load
   - No cache-related outages
## Risks and Mitigations
| Risk | Mitigation |
|---|---|
| Cache stampede | Implement cache warming and fallback logic |
| Memory exhaustion | Set reasonable cache size limits and monitor usage |
| Redis failure | Implement fallback to in-memory cache |
| Rate limit false positives | Start with generous limits and monitor |
| Performance degradation | Benchmark before and after implementation |
| Cache inconsistency | Use appropriate invalidation strategies |
## Future Enhancements
- Cache Pre-warming - Load frequently used data at startup
- Two-Level Caching - Local cache + distributed cache
- Cache Compression - For large cache objects
- Dynamic Rate Limits - Adjust based on system load
- User-Specific Rate Limits - Different limits for different user tiers
- Cache Analytics - Detailed usage patterns and optimization
## References

- go-cache documentation (github.com/patrickmn/go-cache)
- Dragonfly documentation (https://www.dragonflydb.io/)
- KeyDB documentation (https://keydb.dev/)
- limiter/v3 documentation (github.com/ulule/limiter/v3)
- Chi middleware documentation (github.com/go-chi/chi)
## Decision Drivers
- Simplicity - Easy to implement and maintain
- Performance - Minimal impact on response times
- Scalability - Support for horizontal scaling
- Reliability - Graceful degradation on failures
- Open Source - Preference for open-source solutions
- Community - Active development and support
## Conclusion
This ADR proposes a comprehensive caching and rate limiting strategy that will significantly improve the performance, scalability, and reliability of the dance-lessons-coach application. The phased approach allows for gradual implementation and testing, minimizing risk while delivering value at each stage.
The combination of in-memory caching for single-instance deployments and Redis-compatible caching for distributed environments provides flexibility for different deployment scenarios. The rate limiting implementation will protect the application from abuse while maintaining a good user experience.
This strategy aligns with our architectural principles of simplicity, performance, and scalability while using well-established open-source technologies with strong community support.