dance-lessons-coach/adr/0007-opentelemetry-integration.md
# Integrate OpenTelemetry for distributed tracing
**Status:** Accepted
**Authors:** Gabriel Radureau, AI Agent
**Date:** 2026-04-04
## Context and Problem Statement
We needed an observability solution for dance-lessons-coach that provides:
- Distributed tracing capabilities
- Performance monitoring
- Request flow visualization
- Integration with existing monitoring systems
- Minimal impact on application performance
## Decision Drivers
* Need for distributed tracing in microservices architecture
* Desire for performance monitoring
* Requirement for request flow visualization
* Need for integration with monitoring tools
* Desire for minimal performance impact
## Considered Options
* OpenTelemetry - CNCF standard for observability
* Jaeger client - Direct Jaeger integration
* Zipkin - Alternative tracing system
* Custom solution - Build our own tracing
## Decision Outcome
Chosen option: "OpenTelemetry", because it is the vendor-neutral CNCF standard for observability, performs well, can export to multiple backends (Jaeger, Zipkin, and others), and is becoming the default choice for distributed tracing.
## Pros and Cons of the Options
### OpenTelemetry
* Good, because CNCF standard with broad industry adoption
* Good, because supports multiple tracing backends (Jaeger, Zipkin, etc.)
* Good, because good performance characteristics
* Good, because active development and community
* Good, because vendor-neutral
* Bad, because more complex setup
* Bad, because larger dependency footprint
### Jaeger client
* Good, because direct integration with Jaeger
* Good, because simpler setup
* Bad, because vendor-locked to Jaeger
* Bad, because less flexible for future changes
### Zipkin
* Good, because established tracing system
* Good, because good ecosystem
* Bad, because less feature-rich than OpenTelemetry
* Bad, because declining popularity
### Custom solution
* Good, because tailored to our needs
* Good, because no external dependencies
* Bad, because time-consuming to develop
* Bad, because need to maintain ourselves
* Bad, because likely less feature-rich
## Implementation Approach
### Middleware-only approach
We chose a middleware-only approach using `otelhttp.NewHandler` rather than manual instrumentation:
```go
// In pkg/server/server.go
func (s *Server) getAllMiddlewares() []func(http.Handler) http.Handler {
	middlewares := []func(http.Handler) http.Handler{
		middleware.StripSlashes,
		middleware.Recoverer,
	}
	if s.withOTEL {
		middlewares = append(middlewares, func(next http.Handler) http.Handler {
			return otelhttp.NewHandler(next, "")
		})
	}
	return middlewares
}
```
### Benefits of middleware approach
* **Clean separation**: Tracing logic separate from business logic
* **Consistent instrumentation**: All endpoints automatically traced
* **Easy to enable/disable**: Single configuration flag
* **Maintainable**: No tracing boilerplate in service code
* **Upgradable**: Easy to change tracing implementation
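The enable/disable behaviour described above can be sketched with the standard library alone. This is a minimal illustration, not the project's code: `traceStub` is a hypothetical stand-in for `otelhttp.NewHandler` that only sets a marker header, and `buildMiddlewares`, `wrap`, and `tracedHeader` are names invented for this example.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// middleware is the func(http.Handler) http.Handler shape used in this ADR.
type middleware func(http.Handler) http.Handler

// traceStub is a hypothetical stand-in for otelhttp.NewHandler. It only marks
// the response so the effect of the flag is visible without a tracing backend.
func traceStub(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("X-Traced", "true")
		next.ServeHTTP(w, r)
	})
}

// buildMiddlewares appends the tracing middleware only when withOTEL is set.
func buildMiddlewares(withOTEL bool) []middleware {
	var mws []middleware
	if withOTEL {
		mws = append(mws, traceStub)
	}
	return mws
}

// wrap applies the middlewares so that the first one in the slice is outermost.
func wrap(h http.Handler, mws []middleware) http.Handler {
	for i := len(mws) - 1; i >= 0; i-- {
		h = mws[i](h)
	}
	return h
}

// tracedHeader serves one request through the chain and returns the marker header.
func tracedHeader(withOTEL bool) string {
	hello := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "hello")
	})
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/", nil)
	wrap(hello, buildMiddlewares(withOTEL)).ServeHTTP(rec, req)
	return rec.Header().Get("X-Traced")
}

func main() {
	fmt.Printf("enabled=%q disabled=%q\n", tracedHeader(true), tracedHeader(false))
}
```

The handler itself contains no tracing code. Whether a request is traced depends only on the flag passed when the chain is built, which is the separation the list above describes.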
## Configuration
```yaml
# config.yaml
telemetry:
  enabled: true
  otlp_endpoint: "localhost:4317"
  service_name: "dance-lessons-coach"
  insecure: true
  sampler:
    type: "parentbased_always_on"
    ratio: 1.0
```
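One way to guard against misconfiguration is to validate these settings at startup. The sketch below is an assumption about how that could look, not the project's actual configuration code: the `TelemetryConfig` and `SamplerConfig` types, their field names, and the `Validate` method are hypothetical, and the accepted sampler names are the six listed in this ADR.

```go
package main

import (
	"errors"
	"fmt"
)

// SamplerConfig mirrors the `sampler` block of the YAML above (hypothetical type).
type SamplerConfig struct {
	Type  string
	Ratio float64
}

// TelemetryConfig mirrors the `telemetry` block of the YAML above (hypothetical type).
type TelemetryConfig struct {
	Enabled      bool
	OTLPEndpoint string
	ServiceName  string
	Insecure     bool
	Sampler      SamplerConfig
}

// supportedSamplers holds the sampler names listed in this ADR.
var supportedSamplers = map[string]bool{
	"always_on":                true,
	"always_off":               true,
	"traceidratio":             true,
	"parentbased_always_on":    true,
	"parentbased_always_off":   true,
	"parentbased_traceidratio": true,
}

// Validate reports the first problem found. A disabled config is always valid.
func (c TelemetryConfig) Validate() error {
	if !c.Enabled {
		return nil
	}
	if c.OTLPEndpoint == "" {
		return errors.New("telemetry.otlp_endpoint must be set when telemetry is enabled")
	}
	if c.ServiceName == "" {
		return errors.New("telemetry.service_name must be set when telemetry is enabled")
	}
	if !supportedSamplers[c.Sampler.Type] {
		return fmt.Errorf("unsupported sampler type %q", c.Sampler.Type)
	}
	if c.Sampler.Ratio < 0 || c.Sampler.Ratio > 1 {
		return fmt.Errorf("sampler ratio %v is outside [0, 1]", c.Sampler.Ratio)
	}
	return nil
}

func main() {
	cfg := TelemetryConfig{
		Enabled:      true,
		OTLPEndpoint: "localhost:4317",
		ServiceName:  "dance-lessons-coach",
		Insecure:     true,
		Sampler:      SamplerConfig{Type: "parentbased_always_on", Ratio: 1.0},
	}
	fmt.Println("valid:", cfg.Validate() == nil)
}
```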
## Jaeger Integration
```bash
# Start Jaeger with OTLP support
docker run -d --name jaeger \
  -e COLLECTOR_OTLP_ENABLED=true \
  -p 16686:16686 \
  -p 4317:4317 \
  jaegertracing/all-in-one:latest

# Start server with OpenTelemetry
DLC_TELEMETRY_ENABLED=true ./scripts/start-server.sh start

# View traces at http://localhost:16686
```
## Links
* [OpenTelemetry GitHub](https://github.com/open-telemetry/opentelemetry-go)
* [OpenTelemetry Documentation](https://opentelemetry.io/docs/instrumentation/go/)
* [Jaeger Documentation](https://www.jaegertracing.io/docs/)
* [OTLP Specification](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/protocol/otlp.md)
## Sampler Types Supported
* `always_on` - Sample all traces
* `always_off` - Sample no traces
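* `traceidratio` - Sample a fixed fraction of traces, decided deterministically from the trace ID (fraction set by `ratio`)
* `parentbased_always_on` - Follow the parent span's sampling decision; sample every trace that has no parent
* `parentbased_always_off` - Follow the parent span's sampling decision; sample no trace that has no parent
* `parentbased_traceidratio` - Follow the parent span's sampling decision; apply `traceidratio` to traces that have no parent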
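The decision rules behind these six names can be sketched in plain Go. This is a simplified model for illustration, not the OpenTelemetry SDK API: `parent`, `ratioSampled`, and `shouldSample` are hypothetical names, the ratio check follows the general idea of comparing part of the trace ID against a threshold, and the parent-based variants assume the default behaviour of following the parent's sampled flag.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"strings"
)

// parent describes the incoming span context, if any (hypothetical type).
type parent struct {
	exists  bool // true when the request carried a valid parent span context
	sampled bool // the parent's sampled flag
}

// ratioSampled decides deterministically from the trace ID: the last 8 bytes,
// reduced to 63 bits, are compared against ratio * 2^63.
func ratioSampled(traceID [16]byte, ratio float64) bool {
	if ratio >= 1 {
		return true
	}
	if ratio <= 0 {
		return false
	}
	bound := uint64(ratio * (1 << 63))
	x := binary.BigEndian.Uint64(traceID[8:16]) >> 1
	return x < bound
}

// shouldSample resolves one of the six sampler names to a decision.
func shouldSample(samplerType string, ratio float64, p parent, traceID [16]byte) (bool, error) {
	switch samplerType {
	case "always_on":
		return true, nil
	case "always_off":
		return false, nil
	case "traceidratio":
		return ratioSampled(traceID, ratio), nil
	case "parentbased_always_on", "parentbased_always_off", "parentbased_traceidratio":
		if p.exists {
			return p.sampled, nil // follow the parent's decision
		}
		// No parent: fall back to the root sampler named after the prefix.
		root := strings.TrimPrefix(samplerType, "parentbased_")
		return shouldSample(root, ratio, parent{}, traceID)
	default:
		return false, fmt.Errorf("unknown sampler type %q", samplerType)
	}
}

func main() {
	var low, high [16]byte // low ends in zero bytes, high ends in 0xff bytes
	for i := 8; i < 16; i++ {
		high[i] = 0xff
	}
	for _, id := range [][16]byte{low, high} {
		got, _ := shouldSample("parentbased_traceidratio", 0.5, parent{}, id)
		fmt.Println(got)
	}
}
```

Because the ratio decision depends only on the trace ID, every service that sees the same trace reaches the same verdict, which keeps traces complete rather than partially recorded.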
## Performance Considerations
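* When telemetry is disabled, the `otelhttp` middleware is not added to the chain, so requests incur no tracing overhead
* In production, a ratio-based sampler (`traceidratio` or `parentbased_traceidratio`) reduces overhead by recording only a fraction of traces
* Trace data is exported asynchronously, so request handling does not wait on the tracing backend
* Context propagation relies on Go's standard `context` package and adds little per-request cost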
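The asynchronous export mentioned above can be pictured as a bounded queue drained by a background goroutine. The sketch below is a deliberately simplified, hypothetical model (`asyncExporter` and its methods are invented for this example and are not the OpenTelemetry SDK's processor): the request path only enqueues, and a full queue drops the span instead of blocking.

```go
package main

import (
	"fmt"
	"sync"
)

// asyncExporter is a hypothetical, simplified stand-in for a batching span processor.
type asyncExporter struct {
	queue    chan string
	wg       sync.WaitGroup
	mu       sync.Mutex
	exported []string
}

func newAsyncExporter(capacity int) *asyncExporter {
	return &asyncExporter{queue: make(chan string, capacity)}
}

// Enqueue never blocks the caller: when the queue is full the span is dropped
// and false is returned. It must not be called after Shutdown.
func (e *asyncExporter) Enqueue(span string) bool {
	select {
	case e.queue <- span:
		return true
	default:
		return false
	}
}

// Start launches the background worker that drains the queue.
func (e *asyncExporter) Start() {
	e.wg.Add(1)
	go func() {
		defer e.wg.Done()
		for span := range e.queue {
			e.mu.Lock()
			e.exported = append(e.exported, span) // a real exporter would send over the network here
			e.mu.Unlock()
		}
	}()
}

// Shutdown closes the queue, waits for the worker to drain it, and returns
// a copy of everything that was exported.
func (e *asyncExporter) Shutdown() []string {
	close(e.queue)
	e.wg.Wait()
	e.mu.Lock()
	defer e.mu.Unlock()
	return append([]string(nil), e.exported...)
}

func main() {
	exp := newAsyncExporter(2)
	for _, s := range []string{"GET /a", "GET /b", "GET /c"} {
		fmt.Println(s, "accepted:", exp.Enqueue(s))
	}
	exp.Start()
	fmt.Println("exported:", exp.Shutdown())
}
```

Dropping on a full queue trades completeness of trace data for a bounded cost on the request path, which matches the goal of minimal performance impact stated in the decision drivers.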