Distributed Tracing

A technique that follows a single request as it travels through multiple services, recording each hop as a timed span, so engineers can see exactly where latency and errors occur in a distributed system.

In depth

In a microservices architecture, one user action can touch dozens of services, and no single log file shows the full journey. Distributed tracing solves this by assigning each request a unique trace ID that is propagated through every service call in request headers. Each unit of work (an HTTP call, a database query, a queue publish) is recorded as a span with a start time, a duration, attributes, and a reference to its parent span. Assembled together, the spans form a waterfall diagram that shows the entire request path and exactly where time was spent or errors occurred. Standards such as W3C Trace Context, together with instrumentation via OpenTelemetry, make propagation work across languages and frameworks, while backends such as Jaeger, Tempo, or Zipkin store and visualize traces. Because tracing every request can be expensive at scale, teams use sampling strategies. Head-based sampling decides whether to keep a trace when the request starts; tail-based sampling decides after the trace is complete, so it can keep the interesting traces (the slow and failed ones) while discarding routine traffic.
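The mechanics above can be illustrated with a minimal sketch in plain Python. It is not the OpenTelemetry API. The `Span` class and the helpers `start_span`, `outgoing_headers`, and `keep_trace` are hypothetical names invented for this illustration, and the 1-second slowness threshold is an assumed value. The header layout follows the W3C Trace Context `traceparent` format (version, trace ID, parent span ID, flags).

```python
import re
import secrets
import time
from dataclasses import dataclass, field
from typing import List, Optional

# W3C Trace Context "traceparent" header: version-traceid-parentid-flags
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


@dataclass
class Span:
    """Hypothetical span record: one timed unit of work inside a trace."""
    name: str
    trace_id: str             # shared by every span in the request
    span_id: str              # unique to this unit of work
    parent_id: Optional[str]  # None for the root span
    start: float = field(default_factory=time.monotonic)
    duration: Optional[float] = None
    attributes: dict = field(default_factory=dict)

    def end(self) -> None:
        self.duration = time.monotonic() - self.start


def start_span(name: str, headers: dict) -> Span:
    """Continue the trace found in incoming headers, or start a new one."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if match:
        _, trace_id, parent_id, _ = match.groups()
    else:
        trace_id, parent_id = secrets.token_hex(16), None
    return Span(name, trace_id, secrets.token_hex(8), parent_id)


def outgoing_headers(span: Span) -> dict:
    """Headers to attach to a downstream call so the callee joins the trace."""
    return {"traceparent": f"00-{span.trace_id}-{span.span_id}-01"}


def keep_trace(spans: List[Span], slow_threshold_s: float = 1.0) -> bool:
    """Tail-based sampling decision, made after the whole trace has finished.

    Keeps the trace if any span is marked as an error or the root span was slow.
    The threshold is an assumed value for illustration.
    """
    failed = any(s.attributes.get("error") for s in spans)
    slow = any(
        s.parent_id is None and (s.duration or 0.0) > slow_threshold_s
        for s in spans
    )
    return failed or slow


# Service A receives a request with no trace context and calls service B.
root = start_span("GET /checkout", headers={})
child = start_span("GET /inventory", headers=outgoing_headers(root))
child.end()
root.end()
```

Both spans share one trace ID, and the child's `parent_id` equals the root's `span_id`, which is what lets a backend rebuild the waterfall. In practice an OpenTelemetry SDK handles ID generation, header injection, and export, so application code rarely builds these headers by hand.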

Why it matters

Without tracing, debugging cross-service latency means correlating timestamps across log systems by hand. Tracing answers 'which service is slow?' in seconds, exposes hidden dependencies and N+1 call patterns, and is essential once an architecture grows past a handful of services.

Real-world example

An API's p99 latency doubles after a release. A trace waterfall reveals that the new code calls the user-profile service inside a loop, generating 50 sequential calls per request. The team batches the calls into one, and p99 latency returns to its pre-release level within an hour of the first trace being examined.
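The pattern in this example can be sketched as follows. `ProfileClient` is a hypothetical in-memory stand-in for the user-profile service that only counts remote calls, and the `get_many` batch endpoint is an assumption, since the example does not say how the real service exposes batching.

```python
from typing import Dict, List


class ProfileClient:
    """Hypothetical stand-in for the user-profile service; counts remote calls."""

    def __init__(self) -> None:
        self.calls = 0

    def get(self, user_id: int) -> Dict:
        self.calls += 1  # one network round trip per user
        return {"id": user_id}

    def get_many(self, user_ids: List[int]) -> List[Dict]:
        self.calls += 1  # one round trip for the whole batch (assumed endpoint)
        return [{"id": u} for u in user_ids]


def load_profiles_in_loop(client: ProfileClient, user_ids: List[int]) -> List[Dict]:
    # The regression: each iteration would appear as its own span in the trace.
    return [client.get(u) for u in user_ids]


def load_profiles_batched(client: ProfileClient, user_ids: List[int]) -> List[Dict]:
    # The fix: a single call, and so a single span, per request.
    return client.get_many(user_ids)


user_ids = list(range(50))
looped, batched = ProfileClient(), ProfileClient()
same_result = load_profiles_in_loop(looped, user_ids) == load_profiles_batched(batched, user_ids)
print(looped.calls, batched.calls)  # prints "50 1"
```

In a waterfall view the looped version shows up as 50 short spans laid end to end under one parent, which is why this kind of regression is usually obvious from a single trace.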

Tools related to Distributed Tracing

OpenTelemetry, Jaeger, Grafana Tempo, Zipkin, Datadog APM, AWS X-Ray

Interview questions

Explain trace, span, and context propagation.
How does a trace ID get passed between services?
What is the difference between head-based and tail-based sampling?
How would you add tracing to an existing microservices system with minimal code change?
How do traces, metrics, and logs complement each other during an incident?
What are span attributes and how do they help debugging?