Implement graceful shutdown with readiness endpoints

Status: Accepted
Deciders: Gabriel Radureau, AI Agent
Date: 2026-04-03

Context and Problem Statement

We needed to implement a shutdown mechanism for dance-lessons-coach that provides:

Clean resource cleanup
Proper handling of in-flight requests
Kubernetes/service mesh compatibility
Minimal downtime for users
Proper orchestration signaling

Decision Drivers

Need for zero-data-loss shutdowns
Desire for Kubernetes compatibility
Requirement for proper resource cleanup
Need for minimal user impact
Desire for proper orchestration integration

Considered Options

Graceful shutdown with readiness endpoints - Kubernetes-style shutdown
Immediate shutdown - Simple but disruptive
Delayed shutdown with queue draining - Complex but thorough
Signal-based shutdown only - Basic graceful shutdown

Decision Outcome

Chosen option: "Graceful shutdown with readiness endpoints", because it offers the best combination of Kubernetes compatibility, proper resource cleanup, and minimal user impact, and because it follows established practice for containerized services.

Pros and Cons of the Options

Graceful shutdown with readiness endpoints

Good, because Kubernetes/service mesh compatible
Good, because minimal user impact
Good, because proper resource cleanup
Good, because follows industry best practices
Good, because allows proper orchestration
Bad, because more complex to implement
Bad, because requires additional endpoints

Immediate shutdown

Good, because simplest to implement
Bad, because disruptive to users
Bad, because can lose in-flight requests
Bad, because no resource cleanup

Delayed shutdown with queue draining

Good, because very thorough
Good, because minimal data loss
Bad, because very complex
Bad, because overkill for simple services

Signal-based shutdown only

Good, because better than immediate shutdown
Good, because allows some cleanup
Bad, because not Kubernetes-compatible
Bad, because still somewhat disruptive

Implementation Details

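// Readiness context management (set up when the server is constructed).
// The context and its cancel function are stored on the Server so that the
// readiness handler and the shutdown sequence share the same state.
s.readyCtx, s.readyCancel = context.WithCancel(context.Background())

// Readiness endpoint handler
func (s *Server) handleReadiness(w http.ResponseWriter, r *http.Request) {
    w.Header().Set("Content-Type", "application/json")
    select {
    case <-s.readyCtx.Done():
        // Shutdown has started: report 503 Service Unavailable
        w.WriteHeader(http.StatusServiceUnavailable)
        w.Write([]byte(`{"ready":false}`))
    default:
        // Still ready: the first Write sends an implicit 200 OK
        w.Write([]byte(`{"ready":true}`))
    }
}

// Shutdown sequence
func (s *Server) shutdown() {
    // Cancel readiness so the readiness endpoint starts reporting not ready
    s.readyCancel()

    // Bound the graceful shutdown to 30 seconds
    shutdownCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()

    // Graceful server shutdown: stop listening, then wait for in-flight requests
    if err := s.server.Shutdown(shutdownCtx); err != nil {
        log.Printf("graceful shutdown did not complete: %v", err)
    }
}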
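
The snippets above leave out how the shutdown sequence is triggered. The following is a minimal, self-contained sketch of one way to wire it to SIGINT/SIGTERM, not the project's actual implementation. The `Server` fields, `NewServer`, the exported `Shutdown` method, and the `drainDelay` parameter are all assumptions made for illustration. The drain delay in particular is not part of the recorded decision: it is included because `http.Server.Shutdown` closes the listeners as soon as it is called, so the server has to keep serving for a short while after readiness is cancelled if probes are to observe `{"ready":false}`.

```go
package main

import (
	"context"
	"errors"
	"log"
	"net/http"
	"os/signal"
	"syscall"
	"time"
)

// Server is a hypothetical, trimmed-down stand-in for the real server type.
type Server struct {
	server      *http.Server
	readyCtx    context.Context
	readyCancel context.CancelFunc
}

// NewServer (assumed name) creates the readiness context and registers the endpoint.
func NewServer(addr string) *Server {
	s := &Server{}
	s.readyCtx, s.readyCancel = context.WithCancel(context.Background())
	mux := http.NewServeMux()
	mux.HandleFunc("/api/ready", s.handleReadiness)
	s.server = &http.Server{Addr: addr, Handler: mux}
	return s
}

// handleReadiness reports 200 while ready and 503 once readiness is cancelled.
func (s *Server) handleReadiness(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "application/json")
	select {
	case <-s.readyCtx.Done():
		w.WriteHeader(http.StatusServiceUnavailable)
		w.Write([]byte(`{"ready":false}`))
	default:
		w.Write([]byte(`{"ready":true}`))
	}
}

// Shutdown cancels readiness, keeps serving for drainDelay (an assumption,
// see the text above), then stops the HTTP server within timeout.
func (s *Server) Shutdown(drainDelay, timeout time.Duration) error {
	s.readyCancel()
	time.Sleep(drainDelay)
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return s.server.Shutdown(ctx)
}

func main() {
	s := NewServer(":8080")

	// sigCtx is cancelled when the process receives SIGINT or SIGTERM.
	sigCtx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
	defer stop()

	errCh := make(chan error, 1)
	go func() { errCh <- s.server.ListenAndServe() }()

	select {
	case err := <-errCh:
		log.Fatalf("server failed: %v", err)
	case <-sigCtx.Done():
	}

	// The 5-second drain delay is illustrative; 30 seconds matches the timeout above.
	if err := s.Shutdown(5*time.Second, 30*time.Second); err != nil {
		log.Printf("graceful shutdown did not complete: %v", err)
	}
	// ListenAndServe returns http.ErrServerClosed once Shutdown has been called.
	if err := <-errCh; err != nil && !errors.Is(err, http.ErrServerClosed) {
		log.Printf("server error: %v", err)
	}
}
```

In a real deployment the drain delay would be chosen to be at least as long as the interval at which the orchestrator polls the readiness endpoint.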

Monitoring and Verification

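# Check readiness during shutdown
# (-sS hides the progress meter but still prints connection errors; jq -c keeps each response on one line)
while true; do curl -sS http://localhost:8080/api/ready | jq -c .; sleep 1; done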

# Expected output during shutdown:
# {"ready":true}
# {"ready":true}
# {"ready":false}  # When shutdown starts
# {"ready":false}
# ... (connection refused)  # When server fully stopped
