# ADR 0022: Rate Limiting and Cache Strategy

## Status

**Proposed** 🟡

## Context

As the dance-lessons-coach application grows and potentially serves multiple users simultaneously, we need to implement rate limiting to:

1. **Prevent abuse** of API endpoints
2. **Protect against DDoS attacks**
3. **Ensure fair usage** across all users
4. **Maintain system stability** under load
5. **Provide consistent performance**

Additionally, we need a caching strategy to:

1. **Reduce database load** for frequently accessed data
2. **Improve response times** for common requests
3. **Support horizontal scaling** with a shared cache
4. **Handle cache invalidation** properly

## Decision

We will implement a **multi-phase caching and rate limiting strategy** with the following components:

### Phase 1: In-Memory Cache with TTL Support

**Library Selection**: We will use **`github.com/patrickmn/go-cache`** for in-memory caching because:

✅ **Pros:**

- Simple, lightweight, and well-maintained
- Built-in TTL (Time-To-Live) support
- Thread-safe by default
- No external dependencies
- Good performance for single-instance applications
- Supports automatic expiration

❌ **Cons:**

- Not shared between multiple instances
- Memory-bound (not persistent)
- Limited advanced features

**Implementation Plan:**

```go
type CacheService interface {
	Set(key string, value interface{}, expiration time.Duration) error
	Get(key string) (interface{}, bool)
	Delete(key string) error
	Flush() error
	GetWithTTL(key string) (interface{}, time.Duration, bool)
}

type InMemoryCacheService struct {
	cache           *cache.Cache
	defaultTTL      time.Duration
	cleanupInterval time.Duration
}
```

**Use Cases:**

- JWT token validation results
- User session data
- Frequently accessed greet messages
- API response caching for idempotent endpoints
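
To make the plan concrete, here is a minimal sketch of the in-memory service backed by go-cache. The constructor and method bodies are assumptions layered on the interface above; only the go-cache calls (`New`, `Set`, `Get`, `Delete`, `Flush`, `GetWithExpiration`) are the library's actual API:

```go
package service // hypothetical package holding the interface and struct above

import (
	"time"

	"github.com/patrickmn/go-cache" // package name is "cache"
)

// NewInMemoryCacheService is a hypothetical constructor wiring go-cache
// into the CacheService interface defined above.
func NewInMemoryCacheService(defaultTTL, cleanupInterval time.Duration) *InMemoryCacheService {
	return &InMemoryCacheService{
		cache:           cache.New(defaultTTL, cleanupInterval),
		defaultTTL:      defaultTTL,
		cleanupInterval: cleanupInterval,
	}
}

// Set stores a value, falling back to the default TTL when none is given.
func (s *InMemoryCacheService) Set(key string, value interface{}, expiration time.Duration) error {
	if expiration == 0 {
		expiration = s.defaultTTL
	}
	s.cache.Set(key, value, expiration) // go-cache's Set cannot fail
	return nil
}

func (s *InMemoryCacheService) Get(key string) (interface{}, bool) {
	return s.cache.Get(key)
}

func (s *InMemoryCacheService) Delete(key string) error {
	s.cache.Delete(key)
	return nil
}

func (s *InMemoryCacheService) Flush() error {
	s.cache.Flush()
	return nil
}

// GetWithTTL converts go-cache's absolute expiration time into a remaining TTL.
func (s *InMemoryCacheService) GetWithTTL(key string) (interface{}, time.Duration, bool) {
	value, expiresAt, found := s.cache.GetWithExpiration(key)
	if !found {
		return nil, 0, false
	}
	if expiresAt.IsZero() {
		return value, 0, true // entry has no expiration
	}
	return value, time.Until(expiresAt), true
}
```

The `error` returns are no-ops here but keep the interface compatible with the Redis-backed service planned for Phase 2.
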
### Phase 2: Redis-Compatible Shared Cache

**Library Selection**: We will use **`github.com/redis/go-redis/v9`** with a **Redis-compatible alternative**:

**Primary Choice**: **Dragonfly** (https://www.dragonflydb.io/)

- Redis-compatible
- Source-available (Business Source License 1.1)
- Written in C++ with a multi-threaded architecture
- Claims up to 25x higher throughput than Redis (vendor benchmarks)
- Lower latency
- Drop-in Redis replacement

**Fallback Choice**: **KeyDB** (https://keydb.dev/)

- Multi-threaded Redis fork
- Open-source (BSD-3-Clause license)
- Better performance than Redis on multi-core hardware
- Full Redis API compatibility

**Implementation Plan:**

```go
type RedisCacheService struct {
	client     *redis.Client
	defaultTTL time.Duration
	prefix     string
}

func NewRedisCacheService(config *config.CacheConfig) (*RedisCacheService, error) {
	client := redis.NewClient(&redis.Options{
		Addr:     config.Host + ":" + strconv.Itoa(config.Port),
		Password: config.Password,
		DB:       config.Database,
		PoolSize: config.PoolSize,
	})

	// Verify connectivity before handing the service to callers.
	if _, err := client.Ping(context.Background()).Result(); err != nil {
		return nil, fmt.Errorf("failed to connect to Redis: %w", err)
	}

	return &RedisCacheService{
		client:     client,
		defaultTTL: config.DefaultTTL,
		prefix:     config.Prefix,
	}, nil
}
```
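
The plan above only constructs the client. As a hedged sketch of the read/write path (the JSON encoding and method shapes are assumptions, while `Set(...).Err()`, `Get(...).Bytes()`, and the `redis.Nil` miss sentinel are go-redis v9's actual API), `Set` and `Get` with key prefixing could look like:

```go
// Assumes imports: context, encoding/json, fmt, github.com/redis/go-redis/v9.

// Set marshals the value to JSON and stores it under the service prefix.
func (s *RedisCacheService) Set(ctx context.Context, key string, value interface{}, expiration time.Duration) error {
	data, err := json.Marshal(value)
	if err != nil {
		return fmt.Errorf("failed to marshal cache value: %w", err)
	}
	if expiration == 0 {
		expiration = s.defaultTTL
	}
	return s.client.Set(ctx, s.prefix+key, data, expiration).Err()
}

// Get loads and unmarshals a value into dest.
func (s *RedisCacheService) Get(ctx context.Context, key string, dest interface{}) error {
	data, err := s.client.Get(ctx, s.prefix+key).Bytes()
	if err != nil {
		return err // errors.Is(err, redis.Nil) signals a cache miss
	}
	return json.Unmarshal(data, dest)
}
```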

**Configuration:**

```yaml
cache:
  # In-memory cache configuration
  in_memory:
    enabled: true
    default_ttl: 5m
    cleanup_interval: 10m
    max_items: 10000

  # Redis-compatible cache configuration
  redis:
    enabled: false
    host: "localhost"
    port: 6379
    password: ""
    database: 0
    pool_size: 10
    default_ttl: 5m
    prefix: "dlc:"
    use_dragonfly: true # Set to false to use KeyDB
```
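
The `config.CacheConfig` referenced by `NewRedisCacheService` is not defined in this ADR; a plausible shape mirroring the `cache.redis` YAML block above (field names and mapstructure tags are assumptions) would be:

```go
package config

import "time"

// CacheConfig is a hypothetical struct mirroring the cache.redis YAML block.
type CacheConfig struct {
	Enabled      bool          `mapstructure:"enabled"`
	Host         string        `mapstructure:"host"`
	Port         int           `mapstructure:"port"`
	Password     string        `mapstructure:"password"`
	Database     int           `mapstructure:"database"`
	PoolSize     int           `mapstructure:"pool_size"`
	DefaultTTL   time.Duration `mapstructure:"default_ttl"`
	Prefix       string        `mapstructure:"prefix"`
	UseDragonfly bool          `mapstructure:"use_dragonfly"`
}
```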

### Phase 3: Rate Limiting Implementation

**Library Selection**: We will use **`github.com/ulule/limiter/v3`** because:

✅ **Pros:**

- Multiple storage backends (in-memory, Redis, etc.)
- Sliding window algorithm
- Distributed rate limiting support
- Configurable rate limits
- Middleware support for the Chi router
- Good performance

**Implementation Plan:**

```go
import (
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
	"github.com/ulule/limiter/v3"
	"github.com/ulule/limiter/v3/drivers/store/memory"
	sredis "github.com/ulule/limiter/v3/drivers/store/redis"
)

// Rate limit configuration
type RateLimitConfig struct {
	Enabled          bool     `mapstructure:"enabled"`
	RequestsPerHour  int      `mapstructure:"requests_per_hour"`
	BurstLimit       int      `mapstructure:"burst_limit"`
	IPWhitelist      []string `mapstructure:"ip_whitelist"`
	UseRedis         bool     `mapstructure:"use_redis"`
	RedisPrefix      string   `mapstructure:"redis_prefix"`
	EndpointSpecific map[string]struct {
		RequestsPerHour int `mapstructure:"requests_per_hour"`
		BurstLimit      int `mapstructure:"burst_limit"`
	} `mapstructure:"endpoint_specific"`
}

// Rate limiter service
type RateLimiterService struct {
	limiter *limiter.Limiter
	store   limiter.Store
	config  *RateLimitConfig
}

func NewRateLimiterService(config *RateLimitConfig, redisClient *redis.Client) (*RateLimiterService, error) {
	var store limiter.Store
	var err error

	// Use Redis if configured, otherwise fall back to the in-memory store.
	// The store constructors live in limiter/v3's driver subpackages.
	if config.UseRedis {
		store, err = sredis.NewStoreWithOptions(redisClient, limiter.StoreOptions{
			Prefix: config.RedisPrefix,
		})
		if err != nil {
			return nil, fmt.Errorf("failed to create rate limiter store: %w", err)
		}
	} else {
		store = memory.NewStore()
	}

	// Create the shared rate; endpoint-specific overrides come from EndpointSpecific.
	rate := limiter.Rate{
		Period: time.Hour,
		Limit:  int64(config.RequestsPerHour),
	}

	return &RateLimiterService{
		limiter: limiter.New(store, rate),
		store:   store,
		config:  config,
	}, nil
}
```

**Chi Middleware:**

```go
func RateLimitMiddleware(rl *RateLimiterService) func(http.Handler) http.Handler {
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			// Prefer the proxy-supplied client IP; fall back to the socket address.
			clientIP := r.Header.Get("X-Real-IP")
			if clientIP == "" {
				if host, _, err := net.SplitHostPort(r.RemoteAddr); err == nil {
					clientIP = host
				} else {
					clientIP = r.RemoteAddr
				}
			}

			// Skip rate limiting for whitelisted IPs.
			for _, allowedIP := range rl.config.IPWhitelist {
				if clientIP == allowedIP {
					next.ServeHTTP(w, r)
					return
				}
			}

			// Consume one request from this client's allowance.
			limiterCtx, err := rl.limiter.Get(r.Context(), clientIP)
			if err != nil {
				log.Error().Err(err).Str("ip", clientIP).Msg("Rate limit error")
				http.Error(w, "Internal server error", http.StatusInternalServerError)
				return
			}

			// Always expose the rate limit state to clients.
			w.Header().Set("X-RateLimit-Limit", strconv.FormatInt(limiterCtx.Limit, 10))
			w.Header().Set("X-RateLimit-Remaining", strconv.FormatInt(limiterCtx.Remaining, 10))
			w.Header().Set("X-RateLimit-Reset", strconv.FormatInt(limiterCtx.Reset, 10))

			// In limiter/v3, Context.Reached is a bool set once the limit is exceeded.
			if limiterCtx.Reached {
				http.Error(w, "Too many requests", http.StatusTooManyRequests)
				return
			}

			next.ServeHTTP(w, r)
		})
	}
}
```
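
Wiring this into the router is then a one-liner. A hypothetical setup (the chi/v5 and zerolog imports, and `greetHandler`, are assumptions about the project's dependencies):

```go
package main

import (
	"net/http"

	"github.com/go-chi/chi/v5"
	"github.com/rs/zerolog/log"
)

func main() {
	cfg := &RateLimitConfig{Enabled: true, RequestsPerHour: 1000}

	rl, err := NewRateLimiterService(cfg, nil) // nil Redis client → in-memory store
	if err != nil {
		log.Fatal().Err(err).Msg("failed to initialise rate limiter")
	}

	r := chi.NewRouter()
	r.Use(RateLimitMiddleware(rl))

	// greetHandler is a placeholder for an existing handler.
	r.Get("/api/v1/greet", greetHandler)

	log.Fatal().Err(http.ListenAndServe(":8080", r)).Msg("server stopped")
}
```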

### Phase 4: Cache Invalidation Strategy

**Approach**: Hybrid cache invalidation with multiple strategies:

1. **Time-Based Expiration (TTL)**
   - All cache entries have a TTL
   - Automatic expiration prevents stale data
   - Default TTL: 5 minutes for most data

2. **Event-Based Invalidation**
   - Cache keys are invalidated on specific events
   - Example: user data cache invalidated on user update
   - Uses a pub/sub pattern for distributed invalidation

3. **Versioned Cache Keys** (see the sketch after the key helper below)
   - Cache keys include the data version
   - When data changes, the version increments
   - Old cache entries naturally expire

4. **Write-Through Caching**
   - Data is written to the database and cache simultaneously
   - Ensures the cache is always up to date
   - Used for critical data that must be consistent

**Cache Key Strategy:**

```go
func GetCacheKey(prefix, entityType, entityID string) string {
	return fmt.Sprintf("%s:%s:%s", prefix, entityType, entityID)
}

// Example: "dlc:user:123"
// Example: "dlc:jwt:validation:token_hash"
```
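
Building on `GetCacheKey`, a hypothetical versioned-key helper (where the version lives — in the cache itself or on the database row — is an assumption) shows how strategy 3 avoids explicit deletes:

```go
// GetVersionedCacheKey embeds a version so that bumping the version
// invalidates every entry for an entity without deleting keys one by one.
func GetVersionedCacheKey(prefix, entityType, entityID string, version int) string {
	return fmt.Sprintf("%s:%s:v%d:%s", prefix, entityType, version, entityID)
}

// Example: "dlc:user:v7:123" — after an update bumps the version to 8,
// lookups use "dlc:user:v8:123" and the stale v7 entry expires via TTL.
```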

## Implementation Phases

### Phase 1: In-Memory Cache (Current Sprint)

- [ ] Research and select in-memory cache library
- [ ] Implement cache interface and in-memory service
- [ ] Add cache configuration to config package
- [ ] Implement basic cache operations (set, get, delete)
- [ ] Add TTL support and automatic cleanup
- [ ] Cache JWT validation results
- [ ] Add cache metrics and monitoring

### Phase 2: Redis-Compatible Cache (Next Sprint)

- [ ] Set up Dragonfly/KeyDB in development environment
- [ ] Implement Redis cache service
- [ ] Add configuration for Redis connection
- [ ] Implement cache fallback strategy (Redis → in-memory)
- [ ] Add health checks for Redis connection
- [ ] Implement distributed cache invalidation

### Phase 3: Rate Limiting (Following Sprint)

- [ ] Research and select rate limiting library
- [ ] Implement rate limiter service
- [ ] Add rate limit configuration
- [ ] Implement Chi middleware for rate limiting
- [ ] Add rate limit headers to responses
- [ ] Implement IP whitelisting
- [ ] Add endpoint-specific rate limits

### Phase 4: Advanced Features (Future)

- [ ] Cache warming for critical data
- [ ] Two-level caching (Redis + in-memory)
- [ ] Cache compression for large objects
- [ ] Rate limit exemptions for admin users
- [ ] Dynamic rate limit adjustment
- [ ] Cache analytics and usage patterns

## Configuration

```yaml
# Cache configuration
cache:
  in_memory:
    enabled: true
    default_ttl: "5m"
    cleanup_interval: "10m"
    max_items: 10000

  redis:
    enabled: false
    host: "localhost"
    port: 6379
    password: ""
    database: 0
    pool_size: 10
    default_ttl: "5m"
    prefix: "dlc:"
    use_dragonfly: true

# Rate limiting configuration
rate_limiting:
  enabled: true
  requests_per_hour: 1000
  burst_limit: 100
  ip_whitelist:
    - "127.0.0.1"
    - "::1"
  endpoint_specific:
    "/api/v1/auth/login":
      requests_per_hour: 100
      burst_limit: 10
    "/api/v1/auth/register":
      requests_per_hour: 50
      burst_limit: 5
```

## Monitoring and Metrics

**Cache Metrics:**

- Cache hit/miss ratio
- Average cache latency
- Cache size and memory usage
- Eviction rate
- TTL distribution

**Rate Limit Metrics:**

- Requests allowed vs. rejected
- Rate limit exceeded events
- Top rate-limited IPs
- Endpoint-specific rate limit usage

**Prometheus Metrics:**

```go
var (
	cacheHits = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "cache_hits_total",
		Help: "Number of cache hits",
	}, []string{"cache_type", "entity_type"})

	cacheMisses = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "cache_misses_total",
		Help: "Number of cache misses",
	}, []string{"cache_type", "entity_type"})

	rateLimitExceeded = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "rate_limit_exceeded_total",
		Help: "Number of rate limit exceeded events",
	}, []string{"endpoint", "ip"})
)
```
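
These collectors still need to be registered before they export anything; a minimal sketch follows (`recordCacheHit` is a hypothetical helper). Note that the unbounded `ip` label can blow up metric cardinality under attack, so bucketing or dropping it is worth considering:

```go
// Register the collectors once at startup.
func init() {
	prometheus.MustRegister(cacheHits, cacheMisses, rateLimitExceeded)
}

// recordCacheHit shows how the counters are used,
// e.g. recordCacheHit("in_memory", "user") on a user-cache hit.
func recordCacheHit(cacheType, entityType string) {
	cacheHits.WithLabelValues(cacheType, entityType).Inc()
}
```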

## Security Considerations

1. **Cache Security:**
   - Never cache sensitive user data (passwords, tokens)
   - Use separate cache prefixes for different data types
   - Implement cache key hashing for sensitive data (see the sketch below)
   - Set appropriate TTLs to limit exposure

2. **Rate Limit Security:**
   - Prevent rate limit bypass attacks
   - Use the X-Real-IP header for proper IP detection behind a proxy
   - Implement rate limiting for authentication endpoints
   - Log rate limit violations for security monitoring

3. **Redis Security:**
   - Use authentication if enabled
   - Use TLS for Redis connections
   - Use separate database numbers for different environments
   - Limit Redis commands to prevent abuse

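As an illustration of the key-hashing point, a hypothetical helper (name and shape are assumptions) that caches JWT validation results under a token digest rather than the raw token, matching the `"dlc:jwt:validation:token_hash"` example from Phase 4:

```go
import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// JWTCacheKey derives a cache key from a token digest so the raw JWT
// never appears in cache keys, logs, or monitoring output.
func JWTCacheKey(prefix, token string) string {
	sum := sha256.Sum256([]byte(token))
	return fmt.Sprintf("%sjwt:validation:%s", prefix, hex.EncodeToString(sum[:]))
}

// Example: JWTCacheKey("dlc:", rawToken) → "dlc:jwt:validation:9f86d0…"
```
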
## Performance Considerations

1. **Cache Performance:**
   - Benchmark cache operations (see the sketch below)
   - Monitor cache latency
   - Optimize cache key size
   - Use appropriate data structures

2. **Rate Limit Performance:**
   - Use an efficient rate limiting algorithm
   - Minimize middleware overhead
   - Cache rate limit decisions
   - Batch rate limit checks where possible

3. **Memory Management:**
   - Set reasonable cache size limits
   - Monitor memory usage
   - Implement cache eviction policies
   - Use memory-efficient data structures
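
For the benchmarking point, a hypothetical Go micro-benchmark against the Phase 1 in-memory sketch (run with `go test -bench=.`):

```go
import (
	"testing"
	"time"
)

// BenchmarkInMemoryCacheGet measures hot-path read latency for a cached key.
func BenchmarkInMemoryCacheGet(b *testing.B) {
	svc := NewInMemoryCacheService(5*time.Minute, 10*time.Minute)
	_ = svc.Set("user:123", "cached-value", 0)

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		if _, ok := svc.Get("user:123"); !ok {
			b.Fatal("expected cache hit")
		}
	}
}
```
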
## Migration Strategy

### From No Cache to In-Memory Cache

1. Implement cache interface and in-memory service
2. Add cache configuration with sensible defaults
3. Gradually add caching to critical endpoints
4. Monitor cache performance and hit ratios
5. Adjust TTLs based on usage patterns

### From In-Memory to Redis Cache

1. Set up Dragonfly/KeyDB in development
2. Implement Redis cache service
3. Add fallback logic (Redis → in-memory)
4. Test with both caches enabled
5. Gradually migrate to Redis-only
6. Monitor distributed cache performance

### From No Rate Limiting to Rate Limiting

1. Implement rate limiter with generous limits
2. Add monitoring for rate limit events
3. Gradually tighten limits based on usage
4. Add IP whitelist for critical services
5. Implement endpoint-specific limits
6. Monitor and adjust as needed

## Alternatives Considered

### Cache Libraries

1. **`github.com/bluele/gcache`** - More features but more complex
2. **`github.com/allegro/bigcache`** - High performance but no per-entry TTL
3. **`github.com/coocood/freecache`** - Very fast but limited API

### Redis Alternatives

1. **Redis Enterprise** - Commercial, not open-source
2. **Memcached** - No persistence, simpler protocol
3. **Couchbase** - More complex, document-oriented

### Rate Limiting Libraries

1. **`golang.org/x/time/rate`** - Simple but no distributed support
2. **`github.com/juju/ratelimit`** - Good but limited features
3. **Custom implementation** - Too much development effort

## Success Metrics

1. **Cache Effectiveness:**
   - Cache hit ratio > 80%
   - Average cache latency < 1 ms
   - Memory usage within limits

2. **Rate Limiting Effectiveness:**
   - < 1% of legitimate requests blocked
   - Effective protection against abuse
   - No impact on normal usage patterns

3. **System Stability:**
   - Database load reduced by 50%
   - Consistent response times under load
   - No cache-related outages

## Risks and Mitigations

| Risk | Mitigation |
|------|------------|
| Cache stampede | Implement cache warming and fallback logic |
| Memory exhaustion | Set reasonable cache size limits and monitor usage |
| Redis failure | Implement fallback to in-memory cache |
| Rate limit false positives | Start with generous limits and monitor |
| Performance degradation | Benchmark before and after implementation |
| Cache inconsistency | Use appropriate invalidation strategies |

## Future Enhancements

1. **Cache Pre-warming** - Load frequently used data at startup
2. **Two-Level Caching** - Local cache + distributed cache
3. **Cache Compression** - For large cache objects
4. **Dynamic Rate Limits** - Adjust based on system load
5. **User-Specific Rate Limits** - Different limits for different user tiers
6. **Cache Analytics** - Detailed usage patterns and optimization

## References

- [go-cache documentation](https://github.com/patrickmn/go-cache)
- [Dragonfly documentation](https://www.dragonflydb.io/docs)
- [KeyDB documentation](https://keydb.dev/)
- [limiter/v3 documentation](https://github.com/ulule/limiter)
- [Chi middleware documentation](https://github.com/go-chi/chi)

## Decision Drivers

1. **Simplicity** - Easy to implement and maintain
2. **Performance** - Minimal impact on response times
3. **Scalability** - Support for horizontal scaling
4. **Reliability** - Graceful degradation on failures
5. **Open Source** - Preference for open-source solutions
6. **Community** - Active development and support

## Conclusion

This ADR proposes a comprehensive caching and rate limiting strategy that will significantly improve the performance, scalability, and reliability of the dance-lessons-coach application. The phased approach allows for gradual implementation and testing, minimizing risk while delivering value at each stage.

The combination of in-memory caching for single-instance deployments and Redis-compatible caching for distributed environments provides flexibility for different deployment scenarios. The rate limiting implementation will protect the application from abuse while maintaining a good user experience.

This strategy aligns with our architectural principles of simplicity, performance, and scalability while using well-established open-source technologies with strong community support.