🧪 test: add JWT secret rotation BDD scenarios and step implementations (#12)
All checks were successful
CI/CD Pipeline / Build Docker Cache (push) Successful in 9s
CI/CD Pipeline / CI Pipeline (push) Successful in 4m15s
CI/CD Pipeline / Trigger Docker Push (push) Has been skipped

 merge: implement JWT secret rotation with BDD scenario isolation

- Implement JWT secret rotation mechanism (closes #8)
- Add per-scenario state isolation for BDD tests (closes #14)
- Validate password reset workflow via BDD tests (closes #7)
- Fix port conflicts in test validation
- Add state tracer for debugging test execution
- Document BDD isolation strategies in ADR 0025
- Fix PostgreSQL configuration environment variables

Generated by Mistral Vibe.
Co-Authored-By: Mistral Vibe <vibe@mistral.ai>
Co-authored-by: Gabriel Radureau <arcodange@gmail.com>
Co-committed-by: Gabriel Radureau <arcodange@gmail.com>
This commit was merged in pull request #12.
This commit is contained in:
2026-04-11 17:56:45 +02:00
committed by arcodange
parent 5de703468f
commit 5eec64e5e8
66 changed files with 10025 additions and 701 deletions

View File

@@ -0,0 +1,469 @@
# 10. JWT Secret Retention Policy
## Status
**Proposed** 🟡
## Context
The dance-lessons-coach application requires a robust JWT secret management system that balances security and user experience. As implemented in [ADR-0009](0009-hybrid-testing-approach.md), the system supports multiple JWT secrets for graceful rotation. However, the current implementation lacks a clear policy for secret retention and cleanup.
### Current State
- ✅ Multiple JWT secrets supported
- ✅ Graceful rotation implemented
- ✅ Backward compatibility maintained
- ❌ No automatic cleanup of old secrets
- ❌ No configurable retention periods
- ❌ No expiration-based secret management
### Problem Statement
Without a retention policy:
1. **Security Risk**: Old secrets accumulate indefinitely, increasing attack surface
2. **Memory Bloat**: Unbounded growth of secret storage
3. **Operational Overhead**: Manual cleanup required
4. **Compliance Issues**: May violate security policies requiring regular key rotation
### Requirements
1. **Configurable Retention**: Administrators should control how long secrets are retained
2. **Automatic Cleanup**: System should automatically remove expired secrets
3. **Backward Compatibility**: Existing tokens should continue working during retention period
4. **Sensible Defaults**: Should work out-of-the-box with secure defaults
5. **Performance**: Cleanup should not impact runtime performance
## Decision
### JWT Secret Retention Policy
Implement a configurable retention policy based on JWT TTL (Time-To-Live) with the following components:
#### 1. Configuration Structure
```yaml
jwt:
# Token time-to-live (default: 24h)
ttl: 24h
# Secret retention configuration
secret_retention:
# Retention factor multiplier (default: 2.0)
# Retention period = JWT TTL × retention_factor
retention_factor: 2.0
# Maximum retention period (safety limit, default: 72h)
max_retention: 72h
# Cleanup frequency for expired secrets (default: 1h)
cleanup_interval: 1h
```
#### 2. Retention Period Calculation
```
retention_period = min(JWT_TTL × retention_factor, max_retention)
```
**Examples:**
- Default (24h TTL, 2.0 factor): `min(48h, 72h) = 48h`
- Short-lived tokens (1h TTL, 3.0 factor): `min(3h, 72h) = 3h`
- Long-lived tokens (72h TTL, 2.0 factor): `min(144h, 72h) = 72h`
#### 3. Secret Lifecycle
```mermaid
graph LR
A[Secret Created] --> B[Active Period]
B --> C{Retention Period}
C -->|Expired| D[Marked for Cleanup]
C -->|Valid| B
D --> E[Automatic Removal]
```
#### 4. Cleanup Process
- **Frequency**: Configurable interval (default: 1 hour)
- **Scope**: Remove secrets older than retention period
- **Safety**: Never remove current primary secret
- **Logging**: Audit trail of cleanup operations
### Implementation Strategy
#### Phase 1: Configuration Framework
1. **Extend Config Package** (`pkg/config/config.go`)
- Add JWT TTL configuration
- Add secret retention parameters
- Implement validation
2. **Environment Variables**
```bash
# JWT Token TTL
DLC_JWT_TTL=24h
# Secret Retention
DLC_JWT_SECRET_RETENTION_FACTOR=2.0
DLC_JWT_SECRET_MAX_RETENTION=72h
DLC_JWT_SECRET_CLEANUP_INTERVAL=1h
```
#### Phase 2: Secret Manager Enhancement
1. **Enhance JWTSecret Struct**
```go
type JWTSecret struct {
Secret string
IsPrimary bool
CreatedAt time.Time
ExpiresAt *time.Time // Now properly calculated
RetentionPeriod time.Duration
}
```
2. **Add Expiration Logic**
```go
func (m *JWTSecretManager) AddSecret(secret string, isPrimary bool, expiresIn time.Duration) {
// Calculate retention period based on config
retentionPeriod := m.calculateRetentionPeriod()
expiresAt := time.Now().Add(expiresIn)
m.secrets = append(m.secrets, JWTSecret{
Secret: secret,
IsPrimary: isPrimary,
CreatedAt: time.Now(),
ExpiresAt: &expiresAt,
RetentionPeriod: retentionPeriod,
})
}
```
#### Phase 3: Automatic Cleanup
1. **Background Cleanup Job**
```go
func (m *JWTSecretManager) StartCleanupJob(ctx context.Context, interval time.Duration) {
ticker := time.NewTicker(interval)
go func() {
for {
select {
case <-ticker.C:
m.CleanupExpiredSecrets()
case <-ctx.Done():
ticker.Stop()
return
}
}
}()
}
```
2. **Cleanup Implementation**
```go
func (m *JWTSecretManager) CleanupExpiredSecrets() {
now := time.Now()
var activeSecrets []JWTSecret
for _, secret := range m.secrets {
if secret.IsPrimary {
// Never remove current primary
activeSecrets = append(activeSecrets, secret)
continue
}
// Check if secret is within retention period
if now.Sub(secret.CreatedAt) <= secret.RetentionPeriod {
activeSecrets = append(activeSecrets, secret)
} else {
log.Info().
Str("secret", secret.Secret).
Msg("Removed expired JWT secret")
}
}
m.secrets = activeSecrets
}
```
#### Phase 4: Integration
1. **Server Initialization**
```go
func (s *Server) InitializeJWT() error {
// Load config
jwtConfig := s.config.GetJWTConfig()
// Create secret manager with retention policy
secretManager := NewJWTSecretManager(
jwtConfig.Secret,
WithRetentionFactor(jwtConfig.RetentionFactor),
WithMaxRetention(jwtConfig.MaxRetention),
)
// Start cleanup job
secretManager.StartCleanupJob(s.ctx, jwtConfig.CleanupInterval)
return nil
}
```
### Validation
#### 1. Configuration Validation
```go
func (c *Config) ValidateJWTConfig() error {
if c.JWT.TTL <= 0 {
return fmt.Errorf("jwt.ttl must be positive")
}
if c.JWT.SecretRetention.RetentionFactor < 1.0 {
return fmt.Errorf("jwt.secret_retention.retention_factor must be ≥ 1.0")
}
if c.JWT.SecretRetention.MaxRetention <= 0 {
return fmt.Errorf("jwt.secret_retention.max_retention must be positive")
}
if c.JWT.SecretRetention.CleanupInterval <= 0 {
return fmt.Errorf("jwt.secret_retention.cleanup_interval must be positive")
}
// Ensure max retention is reasonable
if c.JWT.SecretRetention.MaxRetention > 720h { // 30 days
return fmt.Errorf("jwt.secret_retention.max_retention exceeds maximum of 720h")
}
return nil
}
```
#### 2. Runtime Validation
```go
func (m *JWTSecretManager) ValidateSecret(secret string) error {
// Check minimum length
if len(secret) < 16 {
return fmt.Errorf("jwt secret must be at least 16 characters")
}
// Check entropy (basic check)
if !hasSufficientEntropy(secret) {
return fmt.Errorf("jwt secret must have sufficient entropy")
}
return nil
}
```
### Monitoring and Observability
#### 1. Metrics
```go
// Prometheus metrics
var (
jwtSecretsActive = prometheus.NewGauge(prometheus.GaugeOpts{
Name: "jwt_secrets_active_count",
Help: "Number of active JWT secrets",
})
jwtSecretsExpired = prometheus.NewCounter(prometheus.CounterOpts{
Name: "jwt_secrets_expired_total",
Help: "Total number of expired JWT secrets removed",
})
jwtSecretRetentionDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
Name: "jwt_secret_retention_duration_seconds",
Help: "Duration of JWT secret retention periods",
Buckets: prometheus.ExponentialBuckets(3600, 2, 6), // 1h to 32h
})
)
```
#### 2. Logging
```go
func (m *JWTSecretManager) logSecretEvent(secret string, event string, details ...interface{}) {
log.Info().
Str("secret", maskSecret(secret)).
Str("event", event).
Interface("details", details).
Msg("JWT secret event")
}
func maskSecret(secret string) string {
if len(secret) <= 4 {
return "****"
}
return secret[:4] + "****" + secret[len(secret)-4:]
}
```
## Consequences
### Positive
1. **Enhanced Security**: Automatic cleanup reduces attack surface
2. **Reduced Memory Usage**: Prevents unbounded growth of secret storage
3. **Operational Efficiency**: No manual cleanup required
4. **Compliance Ready**: Meets security policy requirements for key rotation
5. **Flexibility**: Configurable to meet different security requirements
### Negative
1. **Complexity**: Adds configuration and cleanup logic
2. **Performance Overhead**: Background cleanup job (minimal impact)
3. **Migration**: Existing deployments need configuration updates
4. **Debugging**: More moving parts to troubleshoot
### Neutral
1. **Backward Compatibility**: Existing tokens continue to work
2. **Learning Curve**: New configuration options to understand
3. **Monitoring**: Additional metrics to track
## Alternatives Considered
### Alternative 1: Fixed Retention Period
**Proposal**: Use fixed retention period (e.g., 48 hours) instead of TTL-based calculation
**Rejected Because**:
- Less flexible for different use cases
- Doesn't scale with JWT TTL changes
- May be too short for long-lived tokens or too long for short-lived ones
### Alternative 2: Manual Cleanup Only
**Proposal**: Require administrators to manually clean up old secrets
**Rejected Because**:
- Operational overhead
- Security risk if cleanup is forgotten
- Doesn't scale for frequent rotations
### Alternative 3: No Retention (Current State)
**Proposal**: Keep current behavior with no automatic cleanup
**Rejected Because**:
- Security concerns with accumulating secrets
- Memory management issues
- Compliance violations
## Success Metrics
1. **Security**: No old secrets remain beyond retention period
2. **Reliability**: 99.9% of valid tokens continue to work during rotation
3. **Performance**: Cleanup job completes in <100ms with <1000 secrets
4. **Adoption**: Configuration used in 100% of deployments within 3 months
## Migration Plan
### Phase 1: Preparation (1 week)
- ✅ Create this ADR
- ✅ Update documentation
- ✅ Add configuration to config package
- ✅ Implement basic retention logic
### Phase 2: Testing (2 weeks)
- ✅ Write BDD scenarios for retention
- ✅ Add unit tests for secret manager
- ✅ Test with various TTL/factor combinations
- ✅ Performance testing with large secret counts
### Phase 3: Rollout (1 week)
- ✅ Update default configuration
- ✅ Add feature flag for gradual rollout
- ✅ Monitor metrics in staging
- ✅ Gradual production rollout
### Phase 4: Optimization (Ongoing)
- ✅ Monitor cleanup performance
- ✅ Adjust defaults based on real-world usage
- ✅ Add alerts for cleanup failures
- ✅ Document troubleshooting guide
## References
- [ADR-0009: Hybrid Testing Approach](0009-hybrid-testing-approach.md)
- [ADR-0008: BDD Testing](0008-bdd-testing.md)
- [RFC 7519: JSON Web Tokens](https://tools.ietf.org/html/rfc7519)
- [OWASP Key Management Cheat Sheet](https://cheatsheetseries.owasp.org/cheatsheets/Key_Management_Cheat_Sheet.html)
## Appendix
### Configuration Examples
**Development Environment** (short retention for testing):
```yaml
jwt:
ttl: 1h
secret_retention:
retention_factor: 1.5
max_retention: 3h
cleanup_interval: 30m
```
**Production Environment** (secure defaults):
```yaml
jwt:
ttl: 24h
secret_retention:
retention_factor: 2.0
max_retention: 72h
cleanup_interval: 1h
```
**High-Security Environment** (aggressive rotation):
```yaml
jwt:
ttl: 8h
secret_retention:
retention_factor: 1.5
max_retention: 24h
cleanup_interval: 30m
```
### Troubleshooting
**Issue**: Secrets being removed too quickly
- **Check**: Retention factor and JWT TTL settings
- **Fix**: Increase retention_factor or JWT TTL
**Issue**: Too many old secrets accumulating
- **Check**: Cleanup job logs and interval
- **Fix**: Decrease cleanup_interval or retention_factor
**Issue**: Performance degradation during cleanup
- **Check**: Number of secrets and cleanup frequency
- **Fix**: Optimize cleanup algorithm or increase interval
### FAQ
**Q: What happens to tokens signed with expired secrets?**
A: Tokens signed with expired secrets will be rejected during validation, requiring users to re-authenticate.
**Q: Can I disable automatic cleanup?**
A: Yes, set `cleanup_interval` to a very high value (e.g., `8760h` for 1 year).
**Q: How does this affect existing deployments?**
A: Existing deployments will use sensible defaults. The feature is backward compatible.
**Q: What's the recommended retention factor?**
A: Start with 2.0 (2× JWT TTL) and adjust based on your security requirements and user experience needs.
**Q: How often should cleanup run?**
A: For most deployments, every 1 hour is sufficient. High-volume systems may need more frequent cleanup.
## Decision Record
**Approved By**:
**Approved Date**:
**Implemented By**:
**Implementation Date**:
---
*Generated by Mistral Vibe*
*Co-Authored-By: Mistral Vibe <vibe@mistral.ai>*

View File

@@ -0,0 +1,536 @@
# ADR 0022: Rate Limiting and Cache Strategy
## Status
**Proposed** 🟡
## Context
As the dance-lessons-coach application grows and potentially serves multiple users simultaneously, we need to implement rate limiting to:
1. **Prevent abuse** of API endpoints
2. **Protect against DDoS attacks**
3. **Ensure fair usage** across all users
4. **Maintain system stability** under load
5. **Provide consistent performance**
Additionally, we need a caching strategy to:
1. **Reduce database load** for frequently accessed data
2. **Improve response times** for common requests
3. **Support horizontal scaling** with shared cache
4. **Handle cache invalidation** properly
## Decision
We will implement a **multi-phase caching and rate limiting strategy** with the following components:
### Phase 1: In-Memory Cache with TTL Support
**Library Selection**: We will use **`github.com/patrickmn/go-cache`** for in-memory caching because:
**Pros:**
- Simple, lightweight, and well-maintained
- Built-in TTL (Time-To-Live) support
- Thread-safe by default
- No external dependencies
- Good performance for single-instance applications
- Supports automatic expiration
**Cons:**
- Not shared between multiple instances
- Memory-bound (not persistent)
- Limited advanced features
**Implementation Plan:**
```go
type CacheService interface {
Set(key string, value interface{}, expiration time.Duration) error
Get(key string) (interface{}, bool)
Delete(key string) error
Flush() error
GetWithTTL(key string) (interface{}, time.Duration, bool)
}
type InMemoryCacheService struct {
cache *cache.Cache
defaultTTL time.Duration
cleanupInterval time.Duration
}
```
**Use Cases:**
- JWT token validation results
- User session data
- Frequently accessed greet messages
- API response caching for idempotent endpoints
### Phase 2: Redis-Compatible Shared Cache
**Library Selection**: We will use **`github.com/redis/go-redis/v9`** with a **Redis-compatible open-source alternative**:
**Primary Choice**: **Dragonfly** (https://www.dragonflydb.io/)
- Redis-compatible
- Open-source (Apache 2.0 license)
- Written in C++ with multi-threaded architecture
- 25x higher throughput than Redis
- Lower latency
- Drop-in Redis replacement
**Fallback Choice**: **KeyDB** (https://keydb.dev/)
- Multi-threaded Redis fork
- Open-source (GPL license)
- Better performance than Redis
- Full Redis API compatibility
**Implementation Plan:**
```go
type RedisCacheService struct {
client *redis.Client
defaultTTL time.Duration
prefix string
}
func NewRedisCacheService(config *config.CacheConfig) (*RedisCacheService, error) {
client := redis.NewClient(&redis.Options{
Addr: config.Host + ":" + strconv.Itoa(config.Port),
Password: config.Password,
DB: config.Database,
PoolSize: config.PoolSize,
})
// Test connection
_, err := client.Ping(context.Background()).Result()
if err != nil {
return nil, fmt.Errorf("failed to connect to Redis: %w", err)
}
return &RedisCacheService{
client: client,
defaultTTL: config.DefaultTTL,
prefix: config.Prefix,
}, nil
}
```
**Configuration:**
```yaml
cache:
# In-memory cache configuration
in_memory:
enabled: true
default_ttl: 5m
cleanup_interval: 10m
max_items: 10000
# Redis-compatible cache configuration
redis:
enabled: false
host: "localhost"
port: 6379
password: ""
database: 0
pool_size: 10
default_ttl: 5m
prefix: "dlc:"
use_dragonfly: true # Set to false to use KeyDB
```
### Phase 3: Rate Limiting Implementation
**Library Selection**: We will use **`github.com/ulule/limiter/v3`** because:
**Pros:**
- Multiple storage backends (in-memory, Redis, etc.)
- Sliding window algorithm
- Distributed rate limiting support
- Configurable rate limits
- Middleware support for Chi router
- Good performance
**Implementation Plan:**
```go
// Rate limit configuration
type RateLimitConfig struct {
Enabled bool `mapstructure:"enabled"`
RequestsPerHour int `mapstructure:"requests_per_hour"`
BurstLimit int `mapstructure:"burst_limit"`
IPWhitelist []string `mapstructure:"ip_whitelist"`
EndpointSpecific map[string]struct {
RequestsPerHour int `mapstructure:"requests_per_hour"`
BurstLimit int `mapstructure:"burst_limit"`
} `mapstructure:"endpoint_specific"`
}
// Rate limiter service
type RateLimiterService struct {
limiter *limiter.Limiter
store limiter.Store
config *RateLimitConfig
}
func NewRateLimiterService(config *RateLimitConfig) (*RateLimiterService, error) {
var store limiter.Store
// Use Redis if available, otherwise use in-memory
if config.UseRedis {
// Initialize Redis store
store, err = limiter.NewStoreRedisWithOptions(&limiter.StoreOptions{
Prefix: config.RedisPrefix,
// ... other Redis options
})
} else {
// Use in-memory store
store = limiter.NewStoreMemory()
}
if err != nil {
return nil, fmt.Errorf("failed to create rate limiter store: %w", err)
}
// Create rate limiter
rate := limiter.Rate{
Period: time.Hour,
Limit: int64(config.RequestsPerHour),
}
return &RateLimiterService{
limiter: limiter.New(store, rate),
store: store,
config: config,
}, nil
}
```
**Chi Middleware:**
```go
func RateLimitMiddleware(limiter *RateLimiterService) func(http.Handler) http.Handler {
return func(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
// Skip rate limiting for whitelisted IPs
clientIP := r.Header.Get("X-Real-IP")
if clientIP == "" {
clientIP = r.RemoteAddr
}
for _, allowedIP := range limiter.config.IPWhitelist {
if clientIP == allowedIP {
next.ServeHTTP(w, r)
return
}
}
// Get rate limit context
context, err := limiter.limiter.Get(r.Context(), clientIP)
if err != nil {
log.Error().Err(err).Str("ip", clientIP).Msg("Rate limit error")
http.Error(w, "Internal server error", http.StatusInternalServerError)
return
}
// Check if rate limit is exceeded
if context.Reached > 0 {
w.Header().Set("X-RateLimit-Limit", strconv.Itoa(limiter.config.RequestsPerHour))
w.Header().Set("X-RateLimit-Remaining", "0")
w.Header().Set("X-RateLimit-Reset", strconv.Itoa(int(context.Reset)))
http.Error(w, "Too many requests", http.StatusTooManyRequests)
return
}
// Set rate limit headers
w.Header().Set("X-RateLimit-Limit", strconv.Itoa(limiter.config.RequestsPerHour))
w.Header().Set("X-RateLimit-Remaining", strconv.Itoa(limiter.config.RequestsPerHour-int(context.Reached)))
w.Header().Set("X-RateLimit-Reset", strconv.Itoa(int(context.Reset)))
next.ServeHTTP(w, r)
})
}
}
```
### Phase 4: Cache Invalidation Strategy
**Approach**: Hybrid cache invalidation with multiple strategies:
1. **Time-Based Expiration (TTL)**
- All cache entries have a TTL
- Automatic expiration prevents stale data
- Default TTL: 5 minutes for most data
2. **Event-Based Invalidation**
- Cache keys are invalidated on specific events
- Example: User data cache invalidated on user update
- Uses pub/sub pattern for distributed invalidation
3. **Versioned Cache Keys**
- Cache keys include data version
- When data changes, version increments
- Old cache entries naturally expire
4. **Write-Through Caching**
- Data written to database and cache simultaneously
- Ensures cache is always up-to-date
- Used for critical data that must be consistent
**Cache Key Strategy:**
```go
func GetCacheKey(prefix, entityType, entityID string) string {
return fmt.Sprintf("%s:%s:%s", prefix, entityType, entityID)
}
// Example: "dlc:user:123"
// Example: "dlc:jwt:validation:token_hash"
```
## Implementation Phases
### Phase 1: In-Memory Cache (Current Sprint)
- ✅ Research and select in-memory cache library
- ✅ Implement cache interface and in-memory service
- ✅ Add cache configuration to config package
- ✅ Implement basic cache operations (set, get, delete)
- ✅ Add TTL support and automatic cleanup
- ✅ Cache JWT validation results
- ✅ Add cache metrics and monitoring
### Phase 2: Redis-Compatible Cache (Next Sprint)
- ✅ Set up Dragonfly/KeyDB in development environment
- ✅ Implement Redis cache service
- ✅ Add configuration for Redis connection
- ✅ Implement cache fallback strategy (Redis → in-memory)
- ✅ Add health checks for Redis connection
- ✅ Implement distributed cache invalidation
### Phase 3: Rate Limiting (Following Sprint)
- ✅ Research and select rate limiting library
- ✅ Implement rate limiter service
- ✅ Add rate limit configuration
- ✅ Implement Chi middleware for rate limiting
- ✅ Add rate limit headers to responses
- ✅ Implement IP whitelisting
- ✅ Add endpoint-specific rate limits
### Phase 4: Advanced Features (Future)
- ✅ Cache warming for critical data
- ✅ Two-level caching (Redis + in-memory)
- ✅ Cache compression for large objects
- ✅ Rate limit exemptions for admin users
- ✅ Dynamic rate limit adjustment
- ✅ Cache analytics and usage patterns
## Configuration
```yaml
# Cache configuration
cache:
in_memory:
enabled: true
default_ttl: "5m"
cleanup_interval: "10m"
max_items: 10000
redis:
enabled: false
host: "localhost"
port: 6379
password: ""
database: 0
pool_size: 10
default_ttl: "5m"
prefix: "dlc:"
use_dragonfly: true
# Rate limiting configuration
rate_limiting:
enabled: true
requests_per_hour: 1000
burst_limit: 100
ip_whitelist:
- "127.0.0.1"
- "::1"
endpoint_specific:
"/api/v1/auth/login":
requests_per_hour: 100
burst_limit: 10
"/api/v1/auth/register":
requests_per_hour: 50
burst_limit: 5
```
## Monitoring and Metrics
**Cache Metrics:**
- Cache hit/miss ratio
- Average cache latency
- Cache size and memory usage
- Eviction rate
- TTL distribution
**Rate Limit Metrics:**
- Requests allowed vs rejected
- Rate limit exceeded events
- Top limited IPs
- Endpoint-specific rate limit usage
**Prometheus Metrics:**
```go
var (
cacheHits = prometheus.NewCounterVec(prometheus.CounterOpts{
Name: "cache_hits_total",
Help: "Number of cache hits",
}, []string{"cache_type", "entity_type"})
cacheMisses = prometheus.NewCounterVec(prometheus.CounterOpts{
Name: "cache_misses_total",
Help: "Number of cache misses",
}, []string{"cache_type", "entity_type"})
rateLimitExceeded = prometheus.NewCounterVec(prometheus.CounterOpts{
Name: "rate_limit_exceeded_total",
Help: "Number of rate limit exceeded events",
}, []string{"endpoint", "ip"})
)
```
## Security Considerations
1. **Cache Security:**
- Never cache sensitive user data (passwords, tokens)
- Use separate cache prefixes for different data types
- Implement cache key hashing for sensitive data
- Set appropriate TTLs to limit exposure
2. **Rate Limit Security:**
- Prevent rate limit bypass attacks
- Use X-Real-IP header for proper IP detection
- Implement rate limit for authentication endpoints
- Log rate limit violations for security monitoring
3. **Redis Security:**
- Use authentication if enabled
- Implement TLS for Redis connections
- Use separate database numbers for different environments
- Limit Redis commands to prevent abuse
## Performance Considerations
1. **Cache Performance:**
- Benchmark cache operations
- Monitor cache latency
- Optimize cache key size
- Use appropriate data structures
2. **Rate Limit Performance:**
- Use efficient rate limiting algorithm
- Minimize middleware overhead
- Cache rate limit decisions
- Batch rate limit checks where possible
3. **Memory Management:**
- Set reasonable cache size limits
- Monitor memory usage
- Implement cache eviction policies
- Use memory-efficient data structures
## Migration Strategy
### From No Cache to In-Memory Cache
1. Implement cache interface and in-memory service
2. Add cache configuration with sensible defaults
3. Gradually add caching to critical endpoints
4. Monitor cache performance and hit ratios
5. Adjust TTLs based on usage patterns
### From In-Memory to Redis Cache
1. Set up Dragonfly/KeyDB in development
2. Implement Redis cache service
3. Add fallback logic (Redis → in-memory)
4. Test with both caches enabled
5. Gradually migrate to Redis-only
6. Monitor distributed cache performance
### From No Rate Limiting to Rate Limiting
1. Implement rate limiter with generous limits
2. Add monitoring for rate limit events
3. Gradually tighten limits based on usage
4. Add IP whitelist for critical services
5. Implement endpoint-specific limits
6. Monitor and adjust as needed
## Alternatives Considered
### Cache Libraries
1. **`github.com/bluele/gcache`** - More features but more complex
2. **`github.com/allegro/bigcache`** - High performance but no TTL
3. **`github.com/coocood/freecache`** - Very fast but limited API
### Redis Alternatives
1. **Redis Enterprise** - Commercial, not open-source
2. **Memcached** - No persistence, simpler protocol
3. **Couchbase** - More complex, document-oriented
### Rate Limiting Libraries
1. **`golang.org/x/time/rate`** - Simple but no distributed support
2. **`github.com/juju/ratelimit`** - Good but limited features
3. **Custom implementation** - Too much development effort
## Success Metrics
1. **Cache Effectiveness:**
- Cache hit ratio > 80%
- Average cache latency < 1ms
- Memory usage within limits
2. **Rate Limiting Effectiveness:**
- < 1% of legitimate requests blocked
- Effective protection against abuse
- No impact on normal usage patterns
3. **System Stability:**
- Reduced database load by 50%
- Consistent response times under load
- No cache-related outages
## Risks and Mitigations
| Risk | Mitigation |
|------|------------|
| Cache stampede | Implement cache warming and fallback logic |
| Memory exhaustion | Set reasonable cache size limits and monitor usage |
| Redis failure | Implement fallback to in-memory cache |
| Rate limit false positives | Start with generous limits and monitor |
| Performance degradation | Benchmark before and after implementation |
| Cache inconsistency | Use appropriate invalidation strategies |
## Future Enhancements
1. **Cache Pre-warming** - Load frequently used data at startup
2. **Two-Level Caching** - Local cache + distributed cache
3. **Cache Compression** - For large cache objects
4. **Dynamic Rate Limits** - Adjust based on system load
5. **User-Specific Rate Limits** - Different limits for different user tiers
6. **Cache Analytics** - Detailed usage patterns and optimization
## References
- [go-cache documentation](https://github.com/patrickmn/go-cache)
- [Dragonfly documentation](https://www.dragonflydb.io/docs)
- [KeyDB documentation](https://keydb.dev/)
- [limiter/v3 documentation](https://github.com/ulule/limiter)
- [Chi middleware documentation](https://github.com/go-chi/chi)
## Decision Drivers
1. **Simplicity** - Easy to implement and maintain
2. **Performance** - Minimal impact on response times
3. **Scalability** - Support for horizontal scaling
4. **Reliability** - Graceful degradation on failures
5. **Open Source** - Preference for open-source solutions
6. **Community** - Active development and support
## Conclusion
This ADR proposes a comprehensive caching and rate limiting strategy that will significantly improve the performance, scalability, and reliability of the dance-lessons-coach application. The phased approach allows for gradual implementation and testing, minimizing risk while delivering value at each stage.
The combination of in-memory caching for single-instance deployments and Redis-compatible caching for distributed environments provides flexibility for different deployment scenarios. The rate limiting implementation will protect the application from abuse while maintaining a good user experience.
This strategy aligns with our architectural principles of simplicity, performance, and scalability while using well-established open-source technologies with strong community support.

View File

@@ -0,0 +1,264 @@
# Config Hot Reloading Strategy
* Status: Proposed
* Deciders: Gabriel Radureau, AI Agent
* Date: 2026-04-05
## Context and Problem Statement
The dance-lessons-coach application currently loads configuration once at startup using Viper, which supports file-based configuration, environment variables, and defaults. However, the current implementation does not support runtime configuration changes without restarting the application.
We need to determine whether and how to implement config hot reloading - the ability to detect changes to the optional `config.yaml` file and apply those changes without requiring a full application restart.
## Decision Drivers
* **Development convenience**: Hot reloading would allow developers to change configuration without restarting the server during development
* **Production flexibility**: Ability to adjust certain configuration parameters without downtime
* **Complexity**: Hot reloading adds significant complexity to the codebase
* **Safety**: Some configuration changes require careful handling to avoid runtime errors
* **Viper capabilities**: Viper already supports file watching through `viper.WatchConfig()`
* **Configuration scope**: Not all configuration parameters can or should be hot-reloaded
## Considered Options
### Option 1: Full Hot Reloading with Viper WatchConfig
Implement comprehensive hot reloading using Viper's built-in `WatchConfig()` functionality to monitor the config file and automatically reload when changes are detected.
### Option 2: Selective Hot Reloading
Only allow hot reloading for specific configuration sections that are safe to change at runtime (e.g., logging level, feature flags) while requiring restart for others (e.g., server host/port, database credentials).
### Option 3: Manual Reload Endpoint
Add an admin endpoint (e.g., `POST /api/admin/reload-config`) that triggers configuration reload when called, giving explicit control over when reloading happens.
### Option 4: No Hot Reloading
Maintain the current approach of loading configuration only at startup, requiring application restart for any configuration changes.
## Decision Outcome
Chosen option: **"Selective Hot Reloading"** because it provides the benefits of runtime configuration changes while maintaining safety and control. This approach:
* Allows safe configuration changes without restart
* Prevents dangerous runtime changes to critical parameters
* Leverages Viper's existing capabilities
* Provides a clear boundary between hot-reloadable and non-hot-reloadable settings
## Implementation Strategy
### Hot-Reloadable Configuration
The following configuration parameters will support hot reloading:
* **Logging level** (`logging.level`)
* **Feature flags** (`api.v2_enabled`)
* **Telemetry sampling** (`telemetry.sampler.type`, `telemetry.sampler.ratio`)
* **JWT TTL** (`auth.jwt.ttl`)
### Non-Hot-Reloadable Configuration
These parameters will require application restart:
* **Server settings** (`server.host`, `server.port`)
* **Database credentials** (`database.*`)
* **JWT secret** (`auth.jwt_secret`)
* **Admin credentials** (`auth.admin_master_password`)
### Implementation Plan
```go
// Add to config package
type ConfigManager struct {
config *Config
viper *viper.Viper
changeChan chan struct{}
stopChan chan struct{}
}
func NewConfigManager() (*ConfigManager, error) {
// Initialize Viper and load initial config
// Start file watcher if config file exists
}
func (cm *ConfigManager) StartWatching() {
if cm.viper != nil {
cm.viper.WatchConfig()
cm.viper.OnConfigChange(func(e fsnotify.Event) {
cm.handleConfigChange()
})
}
}
func (cm *ConfigManager) handleConfigChange() {
// Reload only safe configuration sections
// Update logging level if changed
// Update feature flags if changed
// Notify other components of changes
log.Info().Msg("Configuration reloaded (partial)")
}
// Safe getter methods that work with hot reloading
func (cm *ConfigManager) GetLogLevel() string {
// Return current value, potentially updated via hot reload
}
```
### Configuration File Monitoring
```go
// In main application setup
func main() {
configManager, err := config.NewConfigManager()
if err != nil {
log.Fatal().Err(err).Msg("Failed to initialize config")
}
// Start watching for config changes
configManager.StartWatching()
// Use configManager throughout application instead of direct config access
}
```
## Pros and Cons of the Options
### Option 1: Full Hot Reloading with Viper WatchConfig
* **Good**: Maximum flexibility for configuration changes
* **Good**: Leverages Viper's built-in capabilities
* **Good**: Good for development workflow
* **Bad**: High risk of runtime errors from unsafe changes
* **Bad**: Complex to implement safely
* **Bad**: Hard to debug configuration-related issues
### Option 2: Selective Hot Reloading (Chosen)
* **Good**: Safe approach with clear boundaries
* **Good**: Balances flexibility and stability
* **Good**: Easier to implement and maintain
* **Good**: Clear documentation of what can be changed
* **Bad**: More complex than no hot reloading
* **Bad**: Requires careful design of config access patterns
### Option 3: Manual Reload Endpoint
* **Good**: Explicit control over when reloading happens
* **Good**: Can be secured with authentication
* **Good**: Good for production environments
* **Bad**: Less convenient for development
* **Bad**: Requires additional API endpoint management
* **Bad**: Still needs same safety considerations as automatic reloading
### Option 4: No Hot Reloading
* **Good**: Simplest approach
* **Good**: No risk of runtime configuration errors
* **Good**: Easier to reason about application state
* **Bad**: Requires restart for any configuration change
* **Bad**: Less flexible for production adjustments
* **Bad**: Slower development iteration
## Configuration Change Handling
### Safe Change Pattern
```go
// Example: Logging level change
func (cm *ConfigManager) handleConfigChange() {
// Get new config values
newConfig := &Config{}
if err := cm.viper.Unmarshal(newConfig); err != nil {
log.Error().Err(err).Msg("Failed to unmarshal new config")
return
}
// Apply safe changes
if newConfig.Logging.Level != cm.config.Logging.Level {
if err := cm.applyLogLevelChange(newConfig.Logging.Level); err != nil {
log.Error().Err(err).Msg("Failed to apply log level change")
}
}
// Update other safe parameters...
}
func (cm *ConfigManager) applyLogLevelChange(newLevel string) error {
// Validate new level
level := parseLogLevel(newLevel)
// Apply change
zerolog.SetGlobalLevel(level)
cm.config.Logging.Level = newLevel
log.Info().Str("new_level", newLevel).Msg("Log level updated")
return nil
}
```
### Error Handling
* Invalid configuration changes are logged but don't crash the application
* Failed changes revert to previous known-good values
* Critical errors during reload trigger application shutdown
* All changes are logged for audit purposes
## Links
* [Viper WatchConfig Documentation](https://github.com/spf13/viper#watching-and-re-reading-config-files)
* [Viper OnConfigChange](https://github.com/spf13/viper#example-of-watching-a-config-file)
* [ADR-0006: Configuration Management](0006-configuration-management.md)
## Configuration File Example with Hot-Reloadable Settings
```yaml
# config.yaml - These settings can be hot-reloaded
server:
host: "0.0.0.0"
port: 8080
logging:
level: "info" # Can be changed without restart
json: false
output: ""
api:
v2_enabled: false # Can be changed without restart
telemetry:
enabled: false
sampler:
type: "parentbased_always_on" # Can be changed without restart
ratio: 1.0
```
## Migration Plan
1. **Phase 1**: Implement ConfigManager wrapper around existing config
2. **Phase 2**: Add selective hot reloading for logging level
3. **Phase 3**: Extend to feature flags and telemetry settings
4. **Phase 4**: Add documentation and examples
5. **Phase 5**: Update all components to use ConfigManager instead of direct config access
## Monitoring and Observability
* Log all configuration changes with timestamps
* Include previous and new values in change logs
* Add metrics for configuration reload events
* Provide admin endpoint to view current configuration
## Security Considerations
* Config file permissions should be restrictive
* Hot reloading should be disabled in production by default
* Configuration changes should be audited
* Sensitive parameters should never be hot-reloadable
## Future Enhancements
* Configuration change webhooks
* Configuration versioning and rollback
* Configuration validation before applying changes
* Multi-file configuration support

View File

@@ -0,0 +1,358 @@
# ADR 0024: BDD Test Organization and Isolation Strategy
## Status
**Proposed** 🟡
## Context
As the dance-lessons-coach project grows, our BDD test suite has encountered several challenges. While we initially followed basic Godog patterns, we need to evolve our organization to handle complex scenarios like config hot reloading while maintaining test reliability.
### Current Issues
1. **Test Interdependence**: Tests affect each other through shared state (config files, database)
2. **Timing Issues**: Config reloading and server restarts cause race conditions
3. **Cognitive Load**: Large test files with many scenarios are hard to maintain
4. **Flaky Tests**: Tests pass individually but fail when run together
5. **Edge Case Handling**: Special setup/teardown requirements for certain tests
### Godog Best Practices Alignment
According to [Godog documentation](https://github.com/cucumber/godog) and community best practices, our current organization partially follows recommendations but needs improvement in:
- **Feature Granularity**: Some files contain multiple unrelated features
- **Step Organization**: Steps could be better grouped by domain
- **Context Management**: Need better state isolation between scenarios
- **Tagging Strategy**: Currently missing tag-based test selection
## Decision
Adopt a **modular, isolated test suite architecture** with the following principles:
### 1. Test Organization by Feature (Godog-Aligned)
Following [Godog best practices](https://github.com/cucumber/godog), we organize tests by business domain with proper feature granularity:
```
features/
├── auth/ # Business domain
│ ├── authentication.feature # Single feature per file
│ ├── password_reset.feature # Single feature per file
│ └── user_management.feature # Single feature per file
├── config/ # Business domain
│ ├── hot_reloading.feature # Single feature per file
│ └── validation.feature # Single feature per file
├── greet/ # Business domain
│ ├── v1_greeting.feature # Single feature per file
│ └── v2_greeting.feature # Single feature per file
├── health/ # Business domain
│ └── health_check.feature # Single feature per file
└── jwt/ # Business domain
├── secret_rotation.feature # Single feature per file
└── retention_policy.feature # Single feature per file
```
**Key Improvements over current structure:**
-**Single responsibility**: One feature per file
-**Business alignment**: Grouped by domain, not technical concerns
-**Scalability**: Easy to add new features without bloating files
### 2. Isolation Strategies
#### A. Config File Isolation
- Each feature directory has its own config file pattern
- Config files are cleaned up after each feature test run
- Example: `features/auth/auth-test-config.yaml`
#### B. Database Isolation
- Use separate database schemas or suffixes per feature
- Example: `dance_lessons_coach_auth_test`, `dance_lessons_coach_greet_test`
#### C. Server Port Isolation
- Assign different ports to different test groups
- Prevents port conflicts during parallel testing
### 3. Test Execution Strategy
#### Option 1: Sequential Feature Testing (Recommended)
```bash
# Run tests by feature group
./scripts/test-feature.sh auth
./scripts/test-feature.sh config
./scripts/test-feature.sh greet
```
#### Option 2: Parallel Feature Testing (Advanced)
```bash
# Run features in parallel with isolation
./scripts/test-all-features-parallel.sh
```
### 4. Test Synchronization (Godog Best Practices)
#### A. Explicit Waits with Timeouts
Following Godog's [arrange-act-assert pattern](https://alicegg.tech/2019/03/09/gobdd.html):
```go
// Instead of fixed sleep times
func waitForServerReady(maxAttempts int, delay time.Duration) error {
for i := 0; i < maxAttempts; i++ {
if serverIsReady() {
return nil
}
time.Sleep(delay)
}
return fmt.Errorf("server not ready after %d attempts", maxAttempts)
}
```
#### B. Godog Context Management
Implement proper context structs as recommended by Godog:
```go
// Feature-specific context for isolation
type AuthContext struct {
client *testserver.Client
db *sql.DB
users map[string]UserData
}
func InitializeAuthContext() *AuthContext {
return &AuthContext{
client: testserver.NewClient(),
db: connectToFeatureDB("auth"),
users: make(map[string]UserData),
}
}
func CleanupAuthContext(ctx *AuthContext) {
// Cleanup resources
ctx.db.Close()
}
```
#### C. Tag-Based Test Selection
Add Godog tag support for selective test execution:
```go
// In feature files
@smoke @auth
Scenario: Successful user authentication
Given the server is running
When I authenticate with valid credentials
Then the authentication should be successful
// Run specific tags
go test ./features/... -tags=smoke
godog --tags=@auth features/
```
#### B. Event-Based Synchronization
```go
// Use server lifecycle events
func waitForConfigReload() error {
return waitForEvent("config_reloaded", 30*time.Second)
}
```
#### C. Test Hooks with Timeouts
```go
// In test setup
ctx.Step("^I wait for v2 API to be enabled$", func() error {
return waitForCondition(30*time.Second, func() bool {
return v2EndpointAvailable()
})
})
```
### 5. Test Lifecycle Management
#### Before Suite (Feature Level)
```go
func InitializeFeatureSuite(featureName string) {
// Setup feature-specific resources
initDatabaseForFeature(featureName)
createFeatureConfigFile(featureName)
startIsolatedServer(featureName)
}
```
#### After Suite (Feature Level)
```go
func CleanupFeatureSuite(featureName string) {
// Cleanup feature-specific resources
cleanupDatabaseForFeature(featureName)
removeFeatureConfigFile(featureName)
stopIsolatedServer(featureName)
}
```
### 6. Shell Script Integration
Create feature-specific test scripts:
```bash
# scripts/test-feature.sh
#!/bin/bash
FEATURE=$1
DATABASE="dance_lessons_coach_${FEATURE}_test"
CONFIG="features/${FEATURE}/${FEATURE}-test-config.yaml"
# Setup
setup_feature_environment() {
echo "🧪 Setting up ${FEATURE} feature tests..."
create_database ${DATABASE}
generate_config ${CONFIG}
}
# Run tests
run_feature_tests() {
echo "🚀 Running ${FEATURE} feature tests..."
DLC_DATABASE_NAME=${DATABASE} \
DLC_CONFIG_FILE=${CONFIG} \
go test ./features/${FEATURE}/... -v
}
# Teardown
cleanup_feature_environment() {
echo "🧹 Cleaning up ${FEATURE} feature tests..."
drop_database ${DATABASE}
remove_config ${CONFIG}
}
# Main execution
setup_feature_environment
run_feature_tests
cleanup_feature_environment
```
### 7. Configuration Management
#### Feature-Specific Config Files
```yaml
# features/auth/auth-test-config.yaml
server:
host: "127.0.0.1"
port: 9192 # Feature-specific port
database:
name: "dance_lessons_coach_auth_test" # Feature-specific database
api:
v2_enabled: true # Feature-specific settings
auth:
jwt:
ttl: 1h
```
### 8. Test Data Management
#### A. Feature-Scoped Data
- Each feature gets its own data namespace
- Example: `auth_user_*`, `greet_message_*` prefixes
#### B. Automatic Cleanup
```go
func CleanupFeatureData(featureName string) {
// Remove all data created by this feature
db.Exec(fmt.Sprintf("DELETE FROM %s_* WHERE feature = '%s'", featureName, featureName))
}
```
## Consequences
### Positive
1. **Improved Test Reliability**: Tests don't interfere with each other
2. **Better Maintainability**: Smaller, focused test files
3. **Faster Development**: Run only relevant tests during feature development
4. **Easier Debugging**: Isolate issues to specific features
5. **Parallel Testing**: Enable safe parallel execution
6. **SOLID Compliance**: Single responsibility for test files
### Negative
1. **Increased Complexity**: More moving parts in test infrastructure
2. **Resource Usage**: Multiple databases/servers consume more resources
3. **Setup Time**: Initial test runs may be slower due to setup
4. **Learning Curve**: Team needs to understand the isolation patterns
### Neutral
1. **Test Execution Time**: May increase or decrease depending on parallelization
2. **CI/CD Changes**: Pipeline needs adaptation for new test organization
## Implementation Plan
### Phase 1: Refactor Current Tests (1-2 weeks)
1. Split monolithic feature files into feature directories
2. Create feature-specific test scripts
3. Implement basic isolation (config files, database names)
### Phase 2: Enhance Test Infrastructure (2-3 weeks)
1. Add synchronization helpers to test framework
2. Implement server lifecycle management
3. Create comprehensive cleanup routines
### Phase 3: Parallel Testing (Optional)
1. Add parallel test execution capability
2. Implement port management for parallel runs
3. Add resource monitoring
## Alternatives Considered
### 1. Single Test Suite with Better Cleanup
**Rejected because**: Doesn't solve fundamental interdependence issues
### 2. Docker-Based Isolation
**Rejected because**: Too heavyweight for local development
### 3. Test Virtualization
**Rejected because**: Overkill for current project size
## Success Metrics
1. **Test Reliability**: >95% pass rate in CI/CD
2. **Test Isolation**: Ability to run any single feature test independently
3. **Developer Experience**: Feature tests run in <30 seconds locally
4. **Maintainability**: New team members can understand test structure in <1 hour
## References
### Godog Official Resources
- [Godog GitHub Repository](https://github.com/cucumber/godog)
- [Godog Documentation](https://pkg.go.dev/github.com/cucumber/godog)
### BDD Best Practices
- [BDD Best Practices](references/BDD_BEST_PRACTICES.md)
- [Alice GG • BDD in Golang](https://alicegg.tech/2019/03/09/gobdd.html)
- [Scrap Your TDD for BDD: Part II](https://medium.com/the-godev-corner/scrap-your-tdd-for-bdd-part-ii-heres-how-to-start-d2468dd46dda)
### Test Organization Patterns
- [Test Server Implementation](references/TEST_SERVER.md)
- [Optimizing Godog Test Execution](https://www.reddit.com/r/golang/comments/1llnlp2/optimizing_godog_bdd_test_execution_in_go_how_to/)
## Revision History
- **2026-04-09**: Initial draft based on BDD test challenges
- **2026-04-09**: Added implementation details and examples
## Decision Makers
- **Approved by**: Gabriel Radureau
- **Consulted**: AI Agent (Mistral Vibe)
- **Informed**: Development Team
## Future Considerations
1. **Test Impact Analysis**: Track which tests are affected by code changes
2. **Flaky Test Detection**: Automatically identify and quarantine flaky tests
3. **Performance Benchmarking**: Monitor test execution times over time
4. **Test Coverage Visualization**: Feature-level coverage reports
---
**Status**: 🟡 Proposed → Ready for team review and implementation
**Note**: This ADR complements ADR 0023 (Config Hot Reloading) by addressing the test organization aspects of hot reloading functionality.

View File

@@ -0,0 +1,341 @@
# ADR 0025: BDD Scenario Isolation Strategies
## Status
**Proposed** 🟡
## Context
As our BDD test suite grows, we're encountering **test pollution** issues where scenarios interfere with each other through shared state. This is particularly problematic for:
1. **Database state**: Scenarios create users, JWT secrets, config entries that persist across scenarios
2. **JWT secret rotation**: Multiple secrets accumulate, affecting subsequent scenario authentication
3. **Config file modifications**: Feature flag changes persist between tests
4. **Gherkin Background steps**: Data set up in Background is visible to all scenarios in the feature
Our current approach clears database tables after each scenario, but this has **race condition vulnerabilities** with concurrent scenario execution.
### Gherkin Background Consideration
Crucially, Gherkin's `Background` section runs **before each scenario** in a feature, not once before all scenarios. This means:
```gherkin
Feature: User registration
Background:
Given the database is empty
And a default admin user exists
Scenario: Register new user
When I register user "alice"
Then user "alice" should exist
Scenario: Register duplicate user
When I register user "alice"
Then I should see error "user already exists"
```
The second scenario fails because Background creates data that persists, and the first scenario's data isn't cleaned up. Background steps are re-executed before each scenario.
## Decision Drivers
* **Isolation**: Each scenario must start with a clean slate
* **Performance**: Cleanup must be fast enough for CI/CD pipelines
* **Concurrency**: Must work with parallel scenario execution
* **Compatibility**: Must work with Gherkin Background steps
* **Maintainability**: Solution should be simple to understand and debug
## Considered Options
### Option 1: Transaction Rollback (Rejected ❌)
Wrap each scenario in a database transaction, rollback at the end.
```go
BeforeScenario: BEGIN;
AfterScenario: ROLLBACK;
```
**Pros:**
- Simple implementation
- Fast - transaction rollback is nearly instant
- No data cleanup needed
**Cons:**
-**Fails if scenario commits**: Nested transaction problem - `COMMIT` inside scenario releases the transaction, parent `ROLLBACK` has no effect
- Cannot handle non-database state (JWT secrets in memory, config files)
- Doesn't solve JWT secret pollution
**Verdict: Not viable** - Too many scenarios use database transactions internally.
---
### Option 2: Clear Tables in Public Schema (Current ✅/⚠️)
Delete all rows from all tables after each scenario.
```go
AfterScenario: DELETE FROM table1; DELETE FROM table2; ...
```
**Pros:**
- Currently implemented
- Works with any scenario code
- Handles database state
**Cons:**
- ⚠️ **Race conditions**: Concurrent scenarios can interleave - Scenario A deletes data while Scenario B is still using it
- ⚠️ **Slow**: Must delete from all tables, reset sequences
-**Misses in-memory state**: JWT secrets, config changes persist
-**Doesn't handle Background**: Background data is shared across scenarios
**Verdict: Partially adequate** - Works for sequential execution but has parallel execution issues.
---
### Option 3: Schema-per-Scenario (Recommended ✅)
Create a unique PostgreSQL schema for each scenario, drop it after.
```go
BeforeScenario:
schema := "test_" + sha256(scenario.Name)[:8]
CREATE SCHEMA schema;
SET search_path = schema, public;
AfterScenario:
DROP SCHEMA schema CASCADE;
```
**Pros:**
-**True isolation**: Each scenario has its own database namespace
-**Works with transactions**: Scenario can commit freely - entire schema is dropped
-**Works with Background**: Background runs in scenario's schema, data is isolated
-**Fast**: Schema drop is instant (just metadata deletion)
-**Handles concurrent scenarios**: Different schemas = no conflicts
**Cons:**
- Requires `CREATE/DROP SCHEMA` database privileges in test environment
- Some ORMs may hardcode `public` schema - need to use `SET search_path` carefully
- Test DB must allow many schemas (typically fine for PostgreSQL)
- We need to handle `search_path` in connection pooling (each scenario needs its own connection)
**Implementation notes:**
- Use `Luego` (PostgreSQL schema prefix) approach: `test_{hash}`
- Hash: `sha256(feature_name + scenario_name)[:8]` for consistency across runs
- Execute Background steps in the scenario's schema context
- Set `search_path` at the connection level, not globally
---
### Option 4: Database-per-Feature ⚠️
Create a separate database for each feature file.
```go
BeforeFeature: CREATE DATABASE feature_auth;
AfterFeature: DROP DATABASE feature_auth;
```
**Pros:**
- Strong isolation between features
- Simple implementation
**Cons:**
-**Doesn't isolate scenarios within a feature** - Background data shared across scenarios
- Database creation is slower than schema creation
- Harder to manage in CI (more databases to create/cleanup)
- Still need table clearing between scenarios within a feature
**Verdict: Insufficient** - Doesn't solve intra-feature pollution.
---
### Option 5: Schema-per-Feature + Table Clearing per Scenario ⚠️
Create one schema per feature, clear tables between scenarios.
```go
BeforeFeature: CREATE SCHEMA feature_auth;
AfterFeature: DROP SCHEMA feature_auth;
AfterScenario: DELETE FROM all_tables;
```
**Pros:**
- Isolates features from each other
- Simpler than per-scenario schemas
**Cons:**
-**Scenarios within a feature share state** - Background data persists
- Still has race conditions with concurrent scenarios in same feature
- Requires table clearing overhead
**Verdict: Better than current but still has issues**.
---
## Decision Outcome
**Chosen option: Schema-per-Scenario + In-Memory State Reset + Per-Scenario Step State (Option 3 Enhanced)**
We will implement schema-per-scenario because it:
1. Provides **true isolation** for all database state
2. **Works with Gherkin Background** - Background runs in each scenario's schema
3. **Handles concurrent execution** - No race conditions
4. **Works with scenario transactions** - Scenarios can commit freely
5. Is **fast** - Schema operations are cheap
**However, we discovered a critical limitation:** PostgreSQL schemas only isolate **database tables**. In-memory state (application-level caches, user stores, JWT secret managers) **persists across scenarios** because they're stored in the shared `sharedServer` Go instance. Schema isolation does NOT solve this.
### Enhanced Strategy: Multi-Layer Isolation
To achieve **complete scenario isolation**, we need a **3-layer approach:**
| Layer | Component | Strategy | Status |
|-------|-----------|----------|--------|
| DB | PostgreSQL tables | Schema-per-scenario | ✅ Implemented |
| Memory | Server-level state (JWT secrets) | Reset to initial state | ✅ Implemented |
| Memory | Step-level state (tokens, user IDs) | Per-scenario state map | ✅ Implemented |
| Memory | User store | Reset/clear between scenarios | ⚠️ TODO |
| Memory | Auth cache | Reset/clear between scenarios | ⚠️ TODO |
| Cache | Redis/Memcached | Key prefix with schema hash | ⚠️ TODO |
### Layer 3: Per-Scenario Step State Isolation
**New insight from test failures:** Step definition structs (AuthSteps, GreetSteps, etc.) maintain state in their fields:
- `lastToken`, `firstToken` in AuthSteps
- `lastUserID` in AuthSteps
This state **spills across scenarios** even with schema isolation, because struct fields are shared across all scenarios in a test process.
**Solution:** Create a `ScenarioState` manager with per-scenario isolation:
```go
type ScenarioState struct {
LastToken string
FirstToken string
LastUserID uint
}
type scenarioStateManager struct {
mu sync.RWMutex
states map[string]*ScenarioState // keyed by scenario hash
}
// Usage in step definitions:
func (s *AuthSteps) iShouldReceiveAValidJWTToken() error {
state := steps.GetScenarioState(s.scenarioName)
state.LastToken = extractedToken
// ...
}
```
**Benefits:**
- ✅ Zero code changes to step definitions (with helper functions)
- ✅ Thread-safe (sync.RWMutex)
- ✅ Consistent state per scenario
- ✅ Automatic cleanup via BeforeScenario/AfterScenario hooks
- ✅ Works with random test order
**Status:** Implemented in `pkg/bdd/steps/scenario_state.go`
### Key Insight: Cache and In-Memory Store Isolation
**For caches (Redis, Memcached, in-process):**
- Use **schema hash as key prefix/suffix**: `cache_key_{schema_hash}` or `{schema_hash}_cache_key`
- This ensures each scenario gets isolated cache namespace
- Works even with external cache services
- Consistent with schema isolation philosophy
**For in-memory stores (user repository, etc.):**
- Add `Reset()` methods that clear all state
- Call in `AfterScenario` alongside schema teardown
- Or use schema-prefix approach for shared stores
### Alternative Approach: Background Explicit State Setup
**Considered but rejected:** Adding explicit "Given no user X exists" steps or heavy Background sections.
**Pros:** More readable, explicit about state
**Cons:**
- Error-prone (must remember for every entity)
- Verbose (many Given steps)
- Doesn't scale with many entities
- Still has race conditions with concurrent scenarios
**Verdict:** Automated cleanup (schema drop + memory reset) is more reliable than manual Background setup.
### Implementation Plan
**Phase 1: Foundation (✅ Complete)**
- Add scenario-aware schema management to test server
- Implement schema creation/drop in BeforeScenario/AfterScenario hooks
- Handle `search_path` configuration for each scenario's database connection
**Phase 2: In-Memory State Reset (🟡 TODO)**
- Add `ResetUsers()` method to clear in-memory user store
- Add `ResetCache()` method for auth/rateLimiting caches
- Call these in AfterScenario alongside JWT secret reset
- **Cache key strategy**: `key_{schema_hash}` for all cache operations
**Phase 3: Connection Pooling**
- Configure connection pool to respect per-scenario `search_path`
- Each scenario gets isolated connections
**Phase 4: Validation**
- Run full test suite to verify complete isolation
- Fix any hardcoded `public` schema references
### Schema Naming Convention
```
Schema name: test_{sha256(feature:scenario)[:8]}
Cache key prefix: {sha256(feature:scenario)[:8]}_
```
Example:
- Feature: `auth`, Scenario: `Successful user authentication`
- Hash: `sha256("auth:Successful user authentication")[:8]` = `a3f7b2c1`
- Schema: `test_a3f7b2c1`
- Cache key: `a3f7b2c1_user:newuser` instead of just `user:newuser`
Benefits:
- Unique per scenario
- Consistent across test runs (same scenario = same hash)
- Short (8 chars) - efficient for cache keys
- Identifiable for debugging
### Schema Naming Convention
```
Schema name: test_{sha256(feature + scenario)[:8]}
```
Example:
- Feature: `auth`, Scenario: `Successful user authentication`
- Hash: `sha256("auth_Successful user authentication")[:8]` = `a3f7b2c1`
- Schema: `test_a3f7b2c1`
Benefits:
- Unique per scenario
- Consistent across test runs (same scenario = same schema)
- Short (8 chars + prefix = 14 chars max)
- Identifiable for debugging
## Pros and Cons Summary
| Aspect | Schema-per-Scenario | Current (Clear Tables) | Transaction Rollback |
|--------|---------------------|----------------------|-------------------|
| Isolation | ✅ Strong | ⚠️ Medium | ❌ Weak |
| Works with Background | ✅ Yes | ⚠️ Partial | ❌ No |
| Concurrency safe | ✅ Yes | ❌ No | ❌ No |
| Works with TX | ✅ Yes | ✅ Yes | ❌ No |
| Speed | ✅ Fast | ⚠️ Slow | ✅ Fast |
| DB privileges | ⚠️ Needs CREATE | ✅ None | ✅ None |
| Complexity | ⚠️ Medium | ✅ Low | ✅ Low |
## Links
* [ADR 0008: BDD Testing](adr/0008-bdd-testing.md) - Original BDD adoption decision
* [ADR 0024: BDD Test Organization and Isolation](adr/0024-bdd-test-organization-and-isolation.md) - Feature isolation strategy
* [Godog Documentation](https://github.com/cucumber/godog) - BDD framework specifics
* [PostgreSQL Schemas](https://www.postgresql.org/docs/current/ddl-schemas.html) - Schema management

View File

@@ -2,6 +2,36 @@
This directory contains Architecture Decision Records (ADRs) for the dance-lessons-coach project.
## Index of ADRs
| Number | Title | Status |
|--------|-------|--------|
| 0001 | Go 1.26.1 Standard | ✅ Accepted |
| 0002 | Chi Router | ✅ Accepted |
| 0003 | Zerolog Logging | ✅ Accepted |
| 0004 | Interface-Based Design | ✅ Accepted |
| 0005 | Graceful Shutdown | ✅ Accepted |
| 0006 | Configuration Management | ✅ Accepted |
| 0007 | OpenTelemetry Integration | ✅ Accepted |
| 0008 | BDD Testing | ✅ Accepted |
| 0009 | Hybrid Testing Approach | ✅ Accepted |
| 0010 | CI/CD Pipeline Design | ✅ Accepted |
| 0011 | Trunk-Based Development | ✅ Accepted |
| 0012 | Commit Message Conventions | ✅ Accepted |
| 0013 | Version Management Lifecycle | ✅ Accepted |
| 0014 | Swagger Documentation | ✅ Accepted |
| 0015 | Rate Limiting Strategy | ✅ Accepted |
| 0016 | Cache Invalidation Strategy | ✅ Accepted |
| 0017 | JWT Secret Rotation | ✅ Accepted |
| 0018 | Configuration Hot Reloading | ✅ Accepted |
| 0019 | BDD Feature Structure | ✅ Accepted |
| 0020 | Database Migration Strategy | ✅ Accepted |
| 0021 | API Versioning Strategy | ✅ Accepted |
| 0022 | Rate Limiting and Cache Strategy | ✅ Accepted |
| 0023 | Config Hot Reloading | 🟡 Proposed |
| 0024 | BDD Test Organization and Isolation | 🟡 Proposed |
| 0025 | BDD Scenario Isolation Strategies | 🟡 Proposed |
## What is an ADR?
An ADR is a document that captures an important architectural decision made along with its context and consequences.
@@ -79,6 +109,10 @@ Chosen option: "[Option 1]" because [justification]
* [0018-user-management-auth-system.md](0018-user-management-auth-system.md) - User management and authentication system
* [0019-postgresql-integration.md](0019-postgresql-integration.md) - PostgreSQL database integration
* [0020-docker-build-strategy.md](0020-docker-build-strategy.md) - Docker Build Strategy: Traditional vs Buildx
* [0021-jwt-secret-retention-policy.md](0021-jwt-secret-retention-policy.md) - JWT Secret Retention Policy with Configurable TTL and Retention
* [0022-rate-limiting-cache-strategy.md](0022-rate-limiting-cache-strategy.md) - Rate Limiting and Cache Strategy with Multi-Phase Implementation
* [0023-config-hot-reloading.md](0023-config-hot-reloading.md) - Config Hot Reloading Strategy
* [0025-bdd-scenario-isolation-strategies.md](0025-bdd-scenario-isolation-strategies.md) - Schema-per-scenario isolation for BDD tests
## How to Add a New ADR