# ADR 0022: Rate Limiting and Cache Strategy

**Status:** Implemented (Phase 1); Phase 2 still Proposed
## Context
As the dance-lessons-coach application grows and potentially serves multiple users simultaneously, we need to implement rate limiting to:
- Prevent abuse of API endpoints
- Protect against DDoS attacks
- Ensure fair usage across all users
- Maintain system stability under load
- Provide consistent performance
Additionally, we need a caching strategy to:
- Reduce database load for frequently accessed data
- Improve response times for common requests
- Support horizontal scaling with shared cache
- Handle cache invalidation properly
## Decision
We will implement a multi-phase caching and rate limiting strategy with the following components:
### Phase 1: In-Memory Cache with TTL Support

**Library Selection:** We will use `github.com/patrickmn/go-cache` for in-memory caching because:
✅ Pros:
- Simple, lightweight, and well-maintained
- Built-in TTL (Time-To-Live) support
- Thread-safe by default
- No external dependencies
- Good performance for single-instance applications
- Supports automatic expiration
❌ Cons:
- Not shared between multiple instances
- Memory-bound (not persistent)
- Limited advanced features
**Implementation Plan:**

```go
type CacheService interface {
	Set(key string, value interface{}, expiration time.Duration) error
	Get(key string) (interface{}, bool)
	Delete(key string) error
	Flush() error
	GetWithTTL(key string) (interface{}, time.Duration, bool)
}

type InMemoryCacheService struct {
	cache           *cache.Cache
	defaultTTL      time.Duration
	cleanupInterval time.Duration
}
```
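A minimal sketch of how `InMemoryCacheService` could satisfy this interface on top of `go-cache` (the constructor and method bodies below are illustrative, not the final implementation):

```go
package cache

import (
	"time"

	"github.com/patrickmn/go-cache"
)

// NewInMemoryCacheService wires go-cache with the configured TTL and cleanup interval.
func NewInMemoryCacheService(defaultTTL, cleanupInterval time.Duration) *InMemoryCacheService {
	return &InMemoryCacheService{
		cache:           cache.New(defaultTTL, cleanupInterval),
		defaultTTL:      defaultTTL,
		cleanupInterval: cleanupInterval,
	}
}

// Set stores a value; go-cache treats a zero duration as "use the default TTL",
// and its Set never fails, so the error is always nil here.
func (s *InMemoryCacheService) Set(key string, value interface{}, expiration time.Duration) error {
	s.cache.Set(key, value, expiration)
	return nil
}

// Get returns the value and whether it was present (and not expired).
func (s *InMemoryCacheService) Get(key string) (interface{}, bool) {
	return s.cache.Get(key)
}

// GetWithTTL converts go-cache's absolute expiration time into a remaining TTL.
func (s *InMemoryCacheService) GetWithTTL(key string) (interface{}, time.Duration, bool) {
	value, expiresAt, found := s.cache.GetWithExpiration(key)
	if !found {
		return nil, 0, false
	}
	if expiresAt.IsZero() {
		// Entry was stored with no expiration.
		return value, -1, true
	}
	return value, time.Until(expiresAt), true
}

func (s *InMemoryCacheService) Delete(key string) error { s.cache.Delete(key); return nil }

func (s *InMemoryCacheService) Flush() error { s.cache.Flush(); return nil }
```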
Use Cases:
- JWT token validation results
- User session data
- Frequently accessed greet messages
- API response caching for idempotent endpoints
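As one example, JWT validation results can be memoized under a hashed token key, so the raw token never becomes a cache key (see the security considerations below); `validateWithCache` and its `validate` callback are hypothetical names:

```go
import (
	"crypto/sha256"
	"encoding/hex"
	"time"
)

// jwtValidationKey hashes the raw token so the cache never stores the token itself.
func jwtValidationKey(token string) string {
	sum := sha256.Sum256([]byte(token))
	return "jwt:validation:" + hex.EncodeToString(sum[:])
}

// validateWithCache checks the cache first, falls back to full validation,
// then memoizes the result with a short TTL to limit staleness.
func validateWithCache(c CacheService, token string, validate func(string) (bool, error)) (bool, error) {
	key := jwtValidationKey(token)
	if v, ok := c.Get(key); ok {
		if valid, isBool := v.(bool); isBool {
			return valid, nil
		}
	}
	valid, err := validate(token)
	if err != nil {
		return false, err
	}
	_ = c.Set(key, valid, 1*time.Minute)
	return valid, nil
}
```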
### Phase 2: Redis-Compatible Shared Cache

**Library Selection:** We will use `github.com/redis/go-redis/v9` with a Redis-compatible alternative server:

**Primary Choice: Dragonfly** (https://www.dragonflydb.io/)
- Redis-compatible
- Source-available (Business Source License 1.1)
- Written in C++ with multi-threaded architecture
- Up to 25x the throughput of Redis, per vendor benchmarks
- Lower latency
- Drop-in Redis replacement

**Fallback Choice: KeyDB** (https://keydb.dev/)
- Multi-threaded Redis fork
- Open-source (BSD-3-Clause license)
- Better performance than Redis
- Full Redis API compatibility
**Implementation Plan:**

```go
type RedisCacheService struct {
	client     *redis.Client
	defaultTTL time.Duration
	prefix     string
}

func NewRedisCacheService(config *config.CacheConfig) (*RedisCacheService, error) {
	client := redis.NewClient(&redis.Options{
		Addr:     config.Host + ":" + strconv.Itoa(config.Port),
		Password: config.Password,
		DB:       config.Database,
		PoolSize: config.PoolSize,
	})

	// Test connection
	if _, err := client.Ping(context.Background()).Result(); err != nil {
		return nil, fmt.Errorf("failed to connect to Redis: %w", err)
	}

	return &RedisCacheService{
		client:     client,
		defaultTTL: config.DefaultTTL,
		prefix:     config.Prefix,
	}, nil
}
```
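A sketch of matching `Get`/`Set` methods with go-redis v9 follows; values are JSON-encoded for illustration, and the context-aware signatures deliberately differ from the Phase 1 interface (reconciling the two is an open design point):

```go
import (
	"context"
	"encoding/json"
	"errors"
	"time"

	"github.com/redis/go-redis/v9"
)

// Set JSON-encodes the value and writes it with a TTL under the configured prefix.
func (s *RedisCacheService) Set(ctx context.Context, key string, value interface{}, ttl time.Duration) error {
	if ttl == 0 {
		ttl = s.defaultTTL
	}
	data, err := json.Marshal(value)
	if err != nil {
		return err
	}
	return s.client.Set(ctx, s.prefix+key, data, ttl).Err()
}

// Get decodes the stored value into dest; the bool reports whether the key existed.
func (s *RedisCacheService) Get(ctx context.Context, key string, dest interface{}) (bool, error) {
	data, err := s.client.Get(ctx, s.prefix+key).Bytes()
	if errors.Is(err, redis.Nil) {
		return false, nil // a miss, not a failure
	}
	if err != nil {
		return false, err
	}
	return true, json.Unmarshal(data, dest)
}
```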
**Configuration:**

```yaml
cache:
  # In-memory cache configuration
  in_memory:
    enabled: true
    default_ttl: 5m
    cleanup_interval: 10m
    max_items: 10000

  # Redis-compatible cache configuration
  redis:
    enabled: false
    host: "localhost"
    port: 6379
    password: ""
    database: 0
    pool_size: 10
    default_ttl: 5m
    prefix: "dlc:"
    use_dragonfly: true # Set to false to use KeyDB
```
### Phase 3: Rate Limiting Implementation

**Library Selection:** We will use `github.com/ulule/limiter/v3` because:
✅ Pros:
- Multiple storage backends (in-memory, Redis, etc.)
- Sliding window algorithm
- Distributed rate limiting support
- Configurable rate limits
- Middleware support for Chi router
- Good performance
Implementation Plan:
// Rate limit configuration
type RateLimitConfig struct {
Enabled bool `mapstructure:"enabled"`
RequestsPerHour int `mapstructure:"requests_per_hour"`
BurstLimit int `mapstructure:"burst_limit"`
IPWhitelist []string `mapstructure:"ip_whitelist"`
EndpointSpecific map[string]struct {
RequestsPerHour int `mapstructure:"requests_per_hour"`
BurstLimit int `mapstructure:"burst_limit"`
} `mapstructure:"endpoint_specific"`
}
// Rate limiter service
type RateLimiterService struct {
limiter *limiter.Limiter
store limiter.Store
config *RateLimitConfig
}
func NewRateLimiterService(config *RateLimitConfig) (*RateLimiterService, error) {
var store limiter.Store
// Use Redis if available, otherwise use in-memory
if config.UseRedis {
// Initialize Redis store
store, err = limiter.NewStoreRedisWithOptions(&limiter.StoreOptions{
Prefix: config.RedisPrefix,
// ... other Redis options
})
} else {
// Use in-memory store
store = limiter.NewStoreMemory()
}
if err != nil {
return nil, fmt.Errorf("failed to create rate limiter store: %w", err)
}
// Create rate limiter
rate := limiter.Rate{
Period: time.Hour,
Limit: int64(config.RequestsPerHour),
}
return &RateLimiterService{
limiter: limiter.New(store, rate),
store: store,
config: config,
}, nil
}
**Chi Middleware:**

```go
func RateLimitMiddleware(rl *RateLimiterService) func(http.Handler) http.Handler {
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			// Resolve the client IP. X-Real-IP must only be trusted when set by
			// our own reverse proxy; r.RemoteAddr includes the port, so strip it.
			clientIP := r.Header.Get("X-Real-IP")
			if clientIP == "" {
				if host, _, err := net.SplitHostPort(r.RemoteAddr); err == nil {
					clientIP = host
				} else {
					clientIP = r.RemoteAddr
				}
			}

			// Skip rate limiting for whitelisted IPs
			for _, allowedIP := range rl.config.IPWhitelist {
				if clientIP == allowedIP {
					next.ServeHTTP(w, r)
					return
				}
			}

			// Get rate limit context
			limiterCtx, err := rl.limiter.Get(r.Context(), clientIP)
			if err != nil {
				log.Error().Err(err).Str("ip", clientIP).Msg("Rate limit error")
				http.Error(w, "Internal server error", http.StatusInternalServerError)
				return
			}

			// Set rate limit headers on every response
			w.Header().Set("X-RateLimit-Limit", strconv.FormatInt(limiterCtx.Limit, 10))
			w.Header().Set("X-RateLimit-Remaining", strconv.FormatInt(limiterCtx.Remaining, 10))
			w.Header().Set("X-RateLimit-Reset", strconv.FormatInt(limiterCtx.Reset, 10))

			// Reject the request once the limit has been reached
			if limiterCtx.Reached {
				http.Error(w, "Too many requests", http.StatusTooManyRequests)
				return
			}

			next.ServeHTTP(w, r)
		})
	}
}
```
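Wiring this into the Chi router might then look like the following; `cfg`, `loginHandler`, and `registerHandler` are hypothetical, and the endpoint-specific limits from the configuration below are not shown:

```go
r := chi.NewRouter()

// nil Redis client: the service falls back to the in-memory store.
rl, err := NewRateLimiterService(cfg.RateLimiting, nil)
if err != nil {
	log.Fatal().Err(err).Msg("failed to initialize rate limiter")
}

r.Use(RateLimitMiddleware(rl))
r.Post("/api/v1/auth/login", loginHandler)
r.Post("/api/v1/auth/register", registerHandler)
```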
### Phase 4: Cache Invalidation Strategy

**Approach:** Hybrid cache invalidation with multiple strategies:

1. **Time-Based Expiration (TTL)**
   - All cache entries have a TTL
   - Automatic expiration prevents stale data
   - Default TTL: 5 minutes for most data

2. **Event-Based Invalidation**
   - Cache keys are invalidated on specific events
   - Example: User data cache invalidated on user update
   - Uses pub/sub pattern for distributed invalidation

3. **Versioned Cache Keys**
   - Cache keys include data version
   - When data changes, version increments
   - Old cache entries naturally expire

4. **Write-Through Caching**
   - Data written to database and cache simultaneously
   - Ensures cache is always up-to-date
   - Used for critical data that must be consistent
**Cache Key Strategy:**

```go
func GetCacheKey(prefix, entityType, entityID string) string {
	return fmt.Sprintf("%s:%s:%s", prefix, entityType, entityID)
}

// Example: "dlc:user:123"
// Example: "dlc:jwt:validation:token_hash"
```
## Implementation Phases

### Phase 1: In-Memory Cache (Current Sprint)

- ✅ Research and select in-memory cache library
- ✅ Implement cache interface and in-memory service
- ✅ Add cache configuration to config package
- ✅ Implement basic cache operations (set, get, delete)
- ✅ Add TTL support and automatic cleanup
- ✅ Cache JWT validation results
- ✅ Add cache metrics and monitoring

### Phase 2: Redis-Compatible Cache (Next Sprint)

- [ ] Set up Dragonfly/KeyDB in development environment
- [ ] Implement Redis cache service
- [ ] Add configuration for Redis connection
- [ ] Implement cache fallback strategy (Redis → in-memory)
- [ ] Add health checks for Redis connection
- [ ] Implement distributed cache invalidation

### Phase 3: Rate Limiting (Following Sprint)

- ✅ Research and select rate limiting library
- ✅ Implement rate limiter service
- ✅ Add rate limit configuration
- ✅ Implement Chi middleware for rate limiting
- ✅ Add rate limit headers to responses
- ✅ Implement IP whitelisting
- ✅ Add endpoint-specific rate limits

### Phase 4: Advanced Features (Future)

- [ ] Cache warming for critical data
- [ ] Two-level caching (Redis + in-memory)
- [ ] Cache compression for large objects
- [ ] Rate limit exemptions for admin users
- [ ] Dynamic rate limit adjustment
- [ ] Cache analytics and usage patterns
## Configuration

```yaml
# Cache configuration
cache:
  in_memory:
    enabled: true
    default_ttl: "5m"
    cleanup_interval: "10m"
    max_items: 10000
  redis:
    enabled: false
    host: "localhost"
    port: 6379
    password: ""
    database: 0
    pool_size: 10
    default_ttl: "5m"
    prefix: "dlc:"
    use_dragonfly: true

# Rate limiting configuration
rate_limiting:
  enabled: true
  requests_per_hour: 1000
  burst_limit: 100
  ip_whitelist:
    - "127.0.0.1"
    - "::1"
  endpoint_specific:
    "/api/v1/auth/login":
      requests_per_hour: 100
      burst_limit: 10
    "/api/v1/auth/register":
      requests_per_hour: 50
      burst_limit: 5
```
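The matching cache config structs could look like this, assuming Viper-style `mapstructure` decoding as already used in `RateLimitConfig` above (Viper's default decode hooks parse strings like "5m" into `time.Duration`). With this nested shape, `NewRedisCacheService` would take the `RedisConfig` part; this is a sketch, not the final config package:

```go
import "time"

type CacheConfig struct {
	InMemory InMemoryConfig `mapstructure:"in_memory"`
	Redis    RedisConfig    `mapstructure:"redis"`
}

type InMemoryConfig struct {
	Enabled         bool          `mapstructure:"enabled"`
	DefaultTTL      time.Duration `mapstructure:"default_ttl"`
	CleanupInterval time.Duration `mapstructure:"cleanup_interval"`
	MaxItems        int           `mapstructure:"max_items"`
}

type RedisConfig struct {
	Enabled      bool          `mapstructure:"enabled"`
	Host         string        `mapstructure:"host"`
	Port         int           `mapstructure:"port"`
	Password     string        `mapstructure:"password"`
	Database     int           `mapstructure:"database"`
	PoolSize     int           `mapstructure:"pool_size"`
	DefaultTTL   time.Duration `mapstructure:"default_ttl"`
	Prefix       string        `mapstructure:"prefix"`
	UseDragonfly bool          `mapstructure:"use_dragonfly"`
}
```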
## Monitoring and Metrics
Cache Metrics:
- Cache hit/miss ratio
- Average cache latency
- Cache size and memory usage
- Eviction rate
- TTL distribution
Rate Limit Metrics:
- Requests allowed vs rejected
- Rate limit exceeded events
- Top limited IPs
- Endpoint-specific rate limit usage
**Prometheus Metrics:**

```go
var (
	cacheHits = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "cache_hits_total",
		Help: "Number of cache hits",
	}, []string{"cache_type", "entity_type"})

	cacheMisses = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "cache_misses_total",
		Help: "Number of cache misses",
	}, []string{"cache_type", "entity_type"})

	// Note: a per-IP label is high-cardinality; consider bucketing or dropping it.
	rateLimitExceeded = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "rate_limit_exceeded_total",
		Help: "Number of rate limit exceeded events",
	}, []string{"endpoint", "ip"})
)

func init() {
	// Collectors must be registered before they can be scraped.
	prometheus.MustRegister(cacheHits, cacheMisses, rateLimitExceeded)
}
```
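These counters would be incremented at the cache boundary; a thin hypothetical wrapper shows the idea:

```go
// instrumentedGet wraps CacheService.Get and records hit/miss metrics.
func instrumentedGet(c CacheService, cacheType, entityType, key string) (interface{}, bool) {
	value, ok := c.Get(key)
	if ok {
		cacheHits.WithLabelValues(cacheType, entityType).Inc()
	} else {
		cacheMisses.WithLabelValues(cacheType, entityType).Inc()
	}
	return value, ok
}
```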
## Security Considerations

1. **Cache Security:**
   - Never cache sensitive user data (passwords, tokens)
   - Use separate cache prefixes for different data types
   - Implement cache key hashing for sensitive data
   - Set appropriate TTLs to limit exposure

2. **Rate Limit Security:**
   - Prevent rate limit bypass attacks
   - Trust the X-Real-IP header only when it is set by our own reverse proxy
   - Implement rate limits for authentication endpoints
   - Log rate limit violations for security monitoring

3. **Redis Security:**
   - Use authentication if enabled
   - Use TLS for Redis connections
   - Use separate database numbers for different environments
   - Limit Redis commands to prevent abuse
## Performance Considerations

1. **Cache Performance:**
   - Benchmark cache operations (see the sketch after this list)
   - Monitor cache latency
   - Optimize cache key size
   - Use appropriate data structures

2. **Rate Limit Performance:**
   - Use an efficient rate limiting algorithm
   - Minimize middleware overhead
   - Cache rate limit decisions
   - Batch rate limit checks where possible

3. **Memory Management:**
   - Set reasonable cache size limits
   - Monitor memory usage
   - Implement cache eviction policies
   - Use memory-efficient data structures
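A minimal Go benchmark sketch for the cache-benchmarking point above, using the `NewInMemoryCacheService` constructor sketched in Phase 1:

```go
package cache

import (
	"strconv"
	"testing"
	"time"
)

// BenchmarkInMemorySet measures raw Set throughput of the in-memory service.
func BenchmarkInMemorySet(b *testing.B) {
	svc := NewInMemoryCacheService(5*time.Minute, 10*time.Minute)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		_ = svc.Set("bench:"+strconv.Itoa(i), i, 0)
	}
}

// BenchmarkInMemoryGet measures Get latency on a hot key.
func BenchmarkInMemoryGet(b *testing.B) {
	svc := NewInMemoryCacheService(5*time.Minute, 10*time.Minute)
	_ = svc.Set("bench:hot", 42, 0)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		svc.Get("bench:hot")
	}
}
```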
## Migration Strategy

### From No Cache to In-Memory Cache
- Implement cache interface and in-memory service
- Add cache configuration with sensible defaults
- Gradually add caching to critical endpoints
- Monitor cache performance and hit ratios
- Adjust TTLs based on usage patterns
### From In-Memory to Redis Cache
- Set up Dragonfly/KeyDB in development
- Implement Redis cache service
- Add fallback logic (Redis → in-memory)
- Test with both caches enabled
- Gradually migrate to Redis-only
- Monitor distributed cache performance
### From No Rate Limiting to Rate Limiting
- Implement rate limiter with generous limits
- Add monitoring for rate limit events
- Gradually tighten limits based on usage
- Add IP whitelist for critical services
- Implement endpoint-specific limits
- Monitor and adjust as needed
## Alternatives Considered

### Cache Libraries

- `github.com/bluele/gcache` - More features but more complex
- `github.com/allegro/bigcache` - High performance but no per-entry TTL
- `github.com/coocood/freecache` - Very fast but limited API

### Redis Alternatives

- Redis Enterprise - Commercial, not open-source
- Memcached - No persistence, simpler protocol
- Couchbase - More complex, document-oriented

### Rate Limiting Libraries

- `golang.org/x/time/rate` - Simple but no distributed support
- `github.com/juju/ratelimit` - Good but limited features
- Custom implementation - Too much development effort
## Success Metrics

1. **Cache Effectiveness:**
   - Cache hit ratio > 80%
   - Average cache latency < 1ms
   - Memory usage within limits

2. **Rate Limiting Effectiveness:**
   - < 1% of legitimate requests blocked
   - Effective protection against abuse
   - No impact on normal usage patterns

3. **System Stability:**
   - Database load reduced by 50%
   - Consistent response times under load
   - No cache-related outages
## Risks and Mitigations
| Risk | Mitigation |
|---|---|
| Cache stampede | Implement cache warming and fallback logic |
| Memory exhaustion | Set reasonable cache size limits and monitor usage |
| Redis failure | Implement fallback to in-memory cache |
| Rate limit false positives | Start with generous limits and monitor |
| Performance degradation | Benchmark before and after implementation |
| Cache inconsistency | Use appropriate invalidation strategies |
## Future Enhancements
- Cache Pre-warming - Load frequently used data at startup
- Two-Level Caching - Local cache + distributed cache
- Cache Compression - For large cache objects
- Dynamic Rate Limits - Adjust based on system load
- User-Specific Rate Limits - Different limits for different user tiers
- Cache Analytics - Detailed usage patterns and optimization
## References

- go-cache documentation (github.com/patrickmn/go-cache)
- Dragonfly documentation (https://www.dragonflydb.io/)
- KeyDB documentation (https://keydb.dev/)
- limiter/v3 documentation (github.com/ulule/limiter/v3)
- Chi middleware documentation (github.com/go-chi/chi)
## Decision Drivers
- Simplicity - Easy to implement and maintain
- Performance - Minimal impact on response times
- Scalability - Support for horizontal scaling
- Reliability - Graceful degradation on failures
- Open Source - Preference for open-source solutions
- Community - Active development and support
## Conclusion
This ADR proposes a comprehensive caching and rate limiting strategy that will significantly improve the performance, scalability, and reliability of the dance-lessons-coach application. The phased approach allows for gradual implementation and testing, minimizing risk while delivering value at each stage.
The combination of in-memory caching for single-instance deployments and Redis-compatible caching for distributed environments provides flexibility for different deployment scenarios. The rate limiting implementation will protect the application from abuse while maintaining a good user experience.
This strategy aligns with our architectural principles of simplicity, performance, and scalability while using well-established open-source technologies with strong community support.