diff --git a/adr/0022-rate-limiting-cache-strategy.md b/adr/0022-rate-limiting-cache-strategy.md new file mode 100644 index 0000000..c37d9ab --- /dev/null +++ b/adr/0022-rate-limiting-cache-strategy.md @@ -0,0 +1,536 @@ +# ADR 0022: Rate Limiting and Cache Strategy + +## Status +**Proposed** 🟡 + +## Context + +As the dance-lessons-coach application grows and potentially serves multiple users simultaneously, we need to implement rate limiting to: + +1. **Prevent abuse** of API endpoints +2. **Protect against DDoS attacks** +3. **Ensure fair usage** across all users +4. **Maintain system stability** under load +5. **Provide consistent performance** + +Additionally, we need a caching strategy to: +1. **Reduce database load** for frequently accessed data +2. **Improve response times** for common requests +3. **Support horizontal scaling** with shared cache +4. **Handle cache invalidation** properly + +## Decision + +We will implement a **multi-phase caching and rate limiting strategy** with the following components: + +### Phase 1: In-Memory Cache with TTL Support + +**Library Selection**: We will use **`github.com/patrickmn/go-cache`** for in-memory caching because: + +✅ **Pros:** +- Simple, lightweight, and well-maintained +- Built-in TTL (Time-To-Live) support +- Thread-safe by default +- No external dependencies +- Good performance for single-instance applications +- Supports automatic expiration + +❌ **Cons:** +- Not shared between multiple instances +- Memory-bound (not persistent) +- Limited advanced features + +**Implementation Plan:** +```go +type CacheService interface { + Set(key string, value interface{}, expiration time.Duration) error + Get(key string) (interface{}, bool) + Delete(key string) error + Flush() error + GetWithTTL(key string) (interface{}, time.Duration, bool) +} + +type InMemoryCacheService struct { + cache *cache.Cache + defaultTTL time.Duration + cleanupInterval time.Duration +} +``` + +**Use Cases:** +- JWT token validation results +- 
User session data
+- Frequently accessed greet messages
+- API response caching for idempotent endpoints
+
+### Phase 2: Redis-Compatible Shared Cache
+
+**Library Selection**: We will use **`github.com/redis/go-redis/v9`** as the client, pointed at a **Redis-compatible server**:
+
+**Primary Choice**: **Dragonfly** (https://www.dragonflydb.io/)
+- Redis-compatible
+- Source-available (Business Source License 1.1 — not an OSI-approved open-source license)
+- Written in C++ with a multi-threaded architecture
+- Up to 25x the throughput of Redis, per the project's own benchmarks
+- Lower latency
+- Drop-in Redis replacement
+
+**Fallback Choice**: **KeyDB** (https://keydb.dev/)
+- Multi-threaded Redis fork
+- Open-source (BSD-3-Clause license)
+- Higher multi-core throughput than stock Redis
+- Full Redis API compatibility
+
+Note that only KeyDB satisfies our "open source" decision driver outright; adopting Dragonfly means accepting a source-available license.
+
+**Implementation Plan:**
+```go
+type RedisCacheService struct {
+    client     *redis.Client
+    defaultTTL time.Duration
+    prefix     string
+}
+
+func NewRedisCacheService(config *config.CacheConfig) (*RedisCacheService, error) {
+    client := redis.NewClient(&redis.Options{
+        Addr:     config.Host + ":" + strconv.Itoa(config.Port),
+        Password: config.Password,
+        DB:       config.Database,
+        PoolSize: config.PoolSize,
+    })
+
+    // Test the connection before returning the service.
+    _, err := client.Ping(context.Background()).Result()
+    if err != nil {
+        return nil, fmt.Errorf("failed to connect to Redis: %w", err)
+    }
+
+    return &RedisCacheService{
+        client:     client,
+        defaultTTL: config.DefaultTTL,
+        prefix:     config.Prefix,
+    }, nil
+}
+```
+
+**Configuration:**
+```yaml
+cache:
+  # In-memory cache configuration
+  in_memory:
+    enabled: true
+    default_ttl: 5m
+    cleanup_interval: 10m
+    max_items: 10000
+
+  # Redis-compatible cache configuration
+  redis:
+    enabled: false
+    host: "localhost"
+    port: 6379
+    password: ""
+    database: 0
+    pool_size: 10
+    default_ttl: 5m
+    prefix: "dlc:"
+    use_dragonfly: true # Set to false to use KeyDB
+```
+
+### Phase 3: Rate Limiting Implementation
+
+**Library Selection**: We will use **`github.com/ulule/limiter/v3`** because:
+
+✅ **Pros:**
+- Multiple storage backends (in-memory, Redis, 
etc.)
+- Sliding window algorithm
+- Distributed rate limiting support
+- Configurable rate limits
+- Middleware support for Chi router
+- Good performance
+
+**Implementation Plan:**
+```go
+// Rate limit configuration
+type RateLimitConfig struct {
+    Enabled         bool     `mapstructure:"enabled"`
+    RequestsPerHour int      `mapstructure:"requests_per_hour"`
+    BurstLimit      int      `mapstructure:"burst_limit"`
+    IPWhitelist     []string `mapstructure:"ip_whitelist"`
+    UseRedis        bool     `mapstructure:"use_redis"`
+    RedisPrefix     string   `mapstructure:"redis_prefix"`
+    EndpointSpecific map[string]struct {
+        RequestsPerHour int `mapstructure:"requests_per_hour"`
+        BurstLimit      int `mapstructure:"burst_limit"`
+    } `mapstructure:"endpoint_specific"`
+}
+
+// Rate limiter service
+type RateLimiterService struct {
+    limiter *limiter.Limiter
+    store   limiter.Store
+    config  *RateLimitConfig
+}
+
+// The Redis-backed store needs an existing client, so we accept one here;
+// it may be nil when UseRedis is false.
+func NewRateLimiterService(config *RateLimitConfig, redisClient *redis.Client) (*RateLimiterService, error) {
+    var (
+        store limiter.Store
+        err   error
+    )
+
+    // Use Redis if configured, otherwise fall back to the in-memory store.
+    if config.UseRedis {
+        // sredis is github.com/ulule/limiter/v3/drivers/store/redis
+        store, err = sredis.NewStoreWithOptions(redisClient, limiter.StoreOptions{
+            Prefix: config.RedisPrefix,
+        })
+        if err != nil {
+            return nil, fmt.Errorf("failed to create rate limiter store: %w", err)
+        }
+    } else {
+        // memory is github.com/ulule/limiter/v3/drivers/store/memory
+        store = memory.NewStore()
+    }
+
+    // Create the rate limiter
+    rate := limiter.Rate{
+        Period: time.Hour,
+        Limit:  int64(config.RequestsPerHour),
+    }
+
+    return &RateLimiterService{
+        limiter: limiter.New(store, rate),
+        store:   store,
+        config:  config,
+    }, nil
+}
+```
+
+**Chi Middleware:**
+```go
+func RateLimitMiddleware(rl *RateLimiterService) func(http.Handler) http.Handler {
+    return func(next http.Handler) http.Handler {
+        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
+            // Resolve the client IP. X-Real-IP must only be trusted when it is
+            // set by our own reverse proxy; otherwise fall back to RemoteAddr.
+            clientIP := r.Header.Get("X-Real-IP")
+            if clientIP == "" {
+                // RemoteAddr is "host:port"; strip the port before comparing.
+                if host, _, err := net.SplitHostPort(r.RemoteAddr); err == nil {
+                    clientIP = host
+                } else {
+                    clientIP = r.RemoteAddr
+                }
+            }
+
+            // Skip rate limiting for whitelisted IPs.
+            for _, allowedIP := range rl.config.IPWhitelist {
+                if clientIP == allowedIP {
+                    next.ServeHTTP(w, r)
+                    return
+                }
+            }
+
+            // Consume one request from this client's quota.
+            limiterCtx, err := rl.limiter.Get(r.Context(), clientIP)
+            if err != nil {
+                log.Error().Err(err).Str("ip", clientIP).Msg("Rate limit error")
+                http.Error(w, "Internal server error", http.StatusInternalServerError)
+                return
+            }
+
+            // Always expose the rate limit headers.
+            w.Header().Set("X-RateLimit-Limit", strconv.FormatInt(limiterCtx.Limit, 10))
+            w.Header().Set("X-RateLimit-Remaining", strconv.FormatInt(limiterCtx.Remaining, 10))
+            w.Header().Set("X-RateLimit-Reset", strconv.FormatInt(limiterCtx.Reset, 10))
+
+            // In limiter/v3, Reached is a bool: true once the limit is exhausted.
+            if limiterCtx.Reached {
+                http.Error(w, "Too many requests", http.StatusTooManyRequests)
+                return
+            }
+
+            next.ServeHTTP(w, r)
+        })
+    }
+}
+```
+
+### Phase 4: Cache Invalidation Strategy
+
+**Approach**: Hybrid cache invalidation with multiple strategies:
+
+1. **Time-Based Expiration (TTL)**
+   - All cache entries have a TTL
+   - Automatic expiration prevents stale data
+   - Default TTL: 5 minutes for most data
+
+2. **Event-Based Invalidation**
+   - Cache keys are invalidated on specific events
+   - Example: User data cache invalidated on user update
+   - Uses pub/sub pattern for distributed invalidation
+
+3. **Versioned Cache Keys**
+   - Cache keys include data version
+   - When data changes, version increments
+   - Old cache entries naturally expire
+
+4. **Write-Through Caching**
+   - Data written to database and cache simultaneously
+   - Ensures cache is always up-to-date
+   - Used for critical data that must be consistent
+
+**Cache Key Strategy:**
+```go
+func GetCacheKey(prefix, entityType, entityID string) string {
+    return fmt.Sprintf("%s:%s:%s", prefix, entityType, entityID)
+}
+
+// Example: "dlc:user:123"
+// Example: "dlc:jwt:validation:token_hash"
+```
+
+## Implementation Phases
+
+Since this ADR is still **Proposed**, all tasks below are open checklist items, not completed work.
+
+### Phase 1: In-Memory Cache (Current Sprint)
+- [ ] Research and select in-memory cache library
+- [ ] Implement cache interface and in-memory service
+- [ ] Add cache configuration to config package
+- [ ] Implement basic cache operations (set, get, delete)
+- [ ] Add TTL support and automatic cleanup
+- [ ] Cache JWT validation results
+- [ ] Add cache metrics and monitoring
+
+### Phase 2: Redis-Compatible Cache (Next Sprint)
+- [ ] Set up Dragonfly/KeyDB in development environment
+- [ ] Implement Redis cache service
+- [ ] Add configuration for Redis connection
+- [ ] Implement cache fallback strategy (Redis → in-memory)
+- [ ] Add health checks for Redis connection
+- [ ] Implement distributed cache invalidation
+
+### Phase 3: Rate Limiting (Following Sprint)
+- [ ] Research and select rate limiting library
+- [ ] Implement rate limiter service
+- [ ] Add rate limit configuration
+- [ ] Implement Chi middleware for rate limiting
+- [ ] Add rate limit headers to responses
+- [ ] Implement IP whitelisting
+- [ ] Add endpoint-specific rate limits
+
+### Phase 4: Advanced Features (Future)
+- [ ] Cache warming for critical data
+- [ ] Two-level caching (Redis + in-memory)
+- [ ] Cache compression for large objects
+- [ ] Rate limit exemptions for admin users
+- [ ] Dynamic rate limit adjustment
+- [ ] Cache analytics and usage patterns
+
+## Configuration
+
+```yaml
+# Cache configuration
+cache:
+  in_memory:
+    enabled: true
+    default_ttl: "5m"
+    cleanup_interval: "10m"
+    max_items: 10000
+
+  redis:
+    enabled: false
+    host: "localhost"
+    port: 6379
+    password: ""
+    database: 0
+    pool_size: 10
+    default_ttl: "5m"
+    prefix: "dlc:"
+    use_dragonfly: true
+
+# Rate limiting configuration
+rate_limiting:
+  enabled: true
+  requests_per_hour: 1000
+  burst_limit: 100
+  ip_whitelist:
+    - "127.0.0.1"
+    - "::1"
+  endpoint_specific:
+    "/api/v1/auth/login":
+      requests_per_hour: 100
+      burst_limit: 10
+    "/api/v1/auth/register":
+      requests_per_hour: 50
+      burst_limit: 5
+```
+
+## Monitoring and Metrics
+
+**Cache Metrics:**
+- Cache hit/miss ratio
+- Average cache latency
+- Cache size and memory usage
+- Eviction rate
+- TTL distribution
+
+**Rate Limit Metrics:**
+- Requests allowed vs rejected
+- Rate limit exceeded events
+- Top limited IPs
+- Endpoint-specific rate limit usage
+
+**Prometheus Metrics:**
+```go
+var (
+    cacheHits = prometheus.NewCounterVec(prometheus.CounterOpts{
+        Name: "cache_hits_total",
+        Help: "Number of cache hits",
+    }, []string{"cache_type", "entity_type"})
+
+    cacheMisses = prometheus.NewCounterVec(prometheus.CounterOpts{
+        Name: "cache_misses_total",
+        Help: "Number of cache misses",
+    }, []string{"cache_type", "entity_type"})
+
+    // Note: a per-IP label is high-cardinality; consider dropping it or
+    // bucketing IPs if this metric grows unbounded.
+    rateLimitExceeded = prometheus.NewCounterVec(prometheus.CounterOpts{
+        Name: "rate_limit_exceeded_total",
+        Help: "Number of rate limit exceeded events",
+    }, []string{"endpoint", "ip"})
+)
+```
+
+## Security Considerations
+
+1. 
**Cache Security:**
+   - Never cache sensitive user data (passwords, tokens)
+   - Use separate cache prefixes for different data types
+   - Implement cache key hashing for sensitive data
+   - Set appropriate TTLs to limit exposure
+
+2. **Rate Limit Security:**
+   - Guard against rate limit bypass (header spoofing, IP rotation)
+   - Trust `X-Real-IP`/`X-Forwarded-For` only when set by our own reverse proxy; otherwise use the connection's remote address
+   - Apply stricter rate limits to authentication endpoints
+   - Log rate limit violations for security monitoring
+
+3. **Redis Security:**
+   - Enable authentication (`requirepass` or ACLs) where supported
+   - Implement TLS for Redis connections
+   - Use separate database numbers for different environments
+   - Limit Redis commands to prevent abuse
+
+## Performance Considerations
+
+1. **Cache Performance:**
+   - Benchmark cache operations
+   - Monitor cache latency
+   - Optimize cache key size
+   - Use appropriate data structures
+
+2. **Rate Limit Performance:**
+   - Use an efficient rate limiting algorithm (sliding window)
+   - Minimize middleware overhead
+   - Cache rate limit decisions
+   - Batch rate limit checks where possible
+
+3. **Memory Management:**
+   - Set reasonable cache size limits
+   - Monitor memory usage
+   - Implement cache eviction policies
+   - Use memory-efficient data structures
+
+## Migration Strategy
+
+### From No Cache to In-Memory Cache
+1. Implement cache interface and in-memory service
+2. Add cache configuration with sensible defaults
+3. Gradually add caching to critical endpoints
+4. Monitor cache performance and hit ratios
+5. Adjust TTLs based on usage patterns
+
+### From In-Memory to Redis Cache
+1. Set up Dragonfly/KeyDB in development
+2. Implement Redis cache service
+3. Add fallback logic (Redis → in-memory)
+4. Test with both caches enabled
+5. Gradually migrate to Redis-only
+6. Monitor distributed cache performance
+
+### From No Rate Limiting to Rate Limiting
+1. Implement rate limiter with generous limits
+2. Add monitoring for rate limit events
+3. Gradually tighten limits based on usage
+4. Add IP whitelist for critical services
+5. 
Implement endpoint-specific limits +6. Monitor and adjust as needed + +## Alternatives Considered + +### Cache Libraries +1. **`github.com/bluele/gcache`** - More features but more complex +2. **`github.com/allegro/bigcache`** - High performance but no TTL +3. **`github.com/coocood/freecache`** - Very fast but limited API + +### Redis Alternatives +1. **Redis Enterprise** - Commercial, not open-source +2. **Memcached** - No persistence, simpler protocol +3. **Couchbase** - More complex, document-oriented + +### Rate Limiting Libraries +1. **`golang.org/x/time/rate`** - Simple but no distributed support +2. **`github.com/juju/ratelimit`** - Good but limited features +3. **Custom implementation** - Too much development effort + +## Success Metrics + +1. **Cache Effectiveness:** + - Cache hit ratio > 80% + - Average cache latency < 1ms + - Memory usage within limits + +2. **Rate Limiting Effectiveness:** + - < 1% of legitimate requests blocked + - Effective protection against abuse + - No impact on normal usage patterns + +3. **System Stability:** + - Reduced database load by 50% + - Consistent response times under load + - No cache-related outages + +## Risks and Mitigations + +| Risk | Mitigation | +|------|------------| +| Cache stampede | Implement cache warming and fallback logic | +| Memory exhaustion | Set reasonable cache size limits and monitor usage | +| Redis failure | Implement fallback to in-memory cache | +| Rate limit false positives | Start with generous limits and monitor | +| Performance degradation | Benchmark before and after implementation | +| Cache inconsistency | Use appropriate invalidation strategies | + +## Future Enhancements + +1. **Cache Pre-warming** - Load frequently used data at startup +2. **Two-Level Caching** - Local cache + distributed cache +3. **Cache Compression** - For large cache objects +4. **Dynamic Rate Limits** - Adjust based on system load +5. **User-Specific Rate Limits** - Different limits for different user tiers +6. 
**Cache Analytics** - Detailed usage patterns and optimization + +## References + +- [go-cache documentation](https://github.com/patrickmn/go-cache) +- [Dragonfly documentation](https://www.dragonflydb.io/docs) +- [KeyDB documentation](https://keydb.dev/) +- [limiter/v3 documentation](https://github.com/ulule/limiter) +- [Chi middleware documentation](https://github.com/go-chi/chi) + +## Decision Drivers + +1. **Simplicity** - Easy to implement and maintain +2. **Performance** - Minimal impact on response times +3. **Scalability** - Support for horizontal scaling +4. **Reliability** - Graceful degradation on failures +5. **Open Source** - Preference for open-source solutions +6. **Community** - Active development and support + +## Conclusion + +This ADR proposes a comprehensive caching and rate limiting strategy that will significantly improve the performance, scalability, and reliability of the dance-lessons-coach application. The phased approach allows for gradual implementation and testing, minimizing risk while delivering value at each stage. + +The combination of in-memory caching for single-instance deployments and Redis-compatible caching for distributed environments provides flexibility for different deployment scenarios. The rate limiting implementation will protect the application from abuse while maintaining a good user experience. + +This strategy aligns with our architectural principles of simplicity, performance, and scalability while using well-established open-source technologies with strong community support. 
\ No newline at end of file diff --git a/adr/README.md b/adr/README.md index 1282e0e..9f0b55f 100644 --- a/adr/README.md +++ b/adr/README.md @@ -79,6 +79,8 @@ Chosen option: "[Option 1]" because [justification] * [0018-user-management-auth-system.md](0018-user-management-auth-system.md) - User management and authentication system * [0019-postgresql-integration.md](0019-postgresql-integration.md) - PostgreSQL database integration * [0020-docker-build-strategy.md](0020-docker-build-strategy.md) - Docker Build Strategy: Traditional vs Buildx +* [0021-jwt-secret-retention-policy.md](0021-jwt-secret-retention-policy.md) - JWT Secret Retention Policy with Configurable TTL and Retention +* [0022-rate-limiting-cache-strategy.md](0022-rate-limiting-cache-strategy.md) - Rate Limiting and Cache Strategy with Multi-Phase Implementation ## How to Add a New ADR