Observability

The ability to understand a system's internal state from its external outputs — metrics, logs, and traces — so you can debug novel problems and ask new questions without shipping new code first.

In depth

Observability goes beyond traditional monitoring. Monitoring tells you when a known condition occurs, such as CPU above 90%; observability lets you investigate problems you never anticipated by exploring rich telemetry. The practice is commonly described through three pillars: metrics (numeric time series such as request rates and latencies), logs (timestamped event records), and traces (the path of a single request across many services). Modern systems add a fourth dimension: high-cardinality events that can be sliced by any attribute, such as customer ID or app version, to isolate exactly who is affected. Vendor-neutral instrumentation standards like OpenTelemetry let teams instrument their code once and export the telemetry to any backend that supports the standard. In a world of distributed microservices, where no single log file tells the whole story, observability is what makes debugging tractable: you can follow one slow checkout request across ten services and find the one misbehaving database call.
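
Slicing high-cardinality events by an arbitrary attribute can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a description of how any particular backend works; the event fields (customer_id, app_version, status, duration_ms) and the helper error_rate_by are hypothetical names invented for the example.

```python
from collections import defaultdict

# Hypothetical "wide" events: one record per request, carrying arbitrary attributes.
EVENTS = [
    {"customer_id": "c1", "app_version": "2.3.0", "status": 200, "duration_ms": 120},
    {"customer_id": "c2", "app_version": "2.4.0", "status": 500, "duration_ms": 950},
    {"customer_id": "c3", "app_version": "2.4.0", "status": 500, "duration_ms": 870},
    {"customer_id": "c1", "app_version": "2.3.0", "status": 200, "duration_ms": 110},
    {"customer_id": "c2", "app_version": "2.4.0", "status": 200, "duration_ms": 140},
]


def error_rate_by(events, attribute):
    """Group events by one attribute and return the error rate for each value."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for event in events:
        key = event.get(attribute, "<missing>")
        totals[key] += 1
        if event["status"] >= 500:  # assumption: 5xx status codes count as errors
            errors[key] += 1
    return {key: errors[key] / totals[key] for key in totals}


if __name__ == "__main__":
    # The same raw events answer questions nobody planned for in advance.
    for attribute in ("app_version", "customer_id"):
        print(attribute, error_rate_by(EVENTS, attribute))
```

Because the grouping key is chosen at query time rather than at instrumentation time, the same events can be re-sliced by app version, customer, or any other recorded attribute without shipping new code.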

Why it matters

Distributed systems fail in ways no dashboard author predicted. Observability can cut mean time to resolution, in favorable cases from hours to minutes, by letting engineers interrogate production telemetry directly, and it underpins SLOs, error budgets, and effective incident response. Companies with poor observability pay for it in long outages and burned-out on-call engineers.

Real-world example

Users report sporadic checkout failures that no alert caught. An engineer queries traces filtered by error status and discovers all failures share one attribute: a specific payment-provider region. The trace waterfall shows a 10-second timeout in one downstream call, and the team fails over that region within twenty minutes of starting the investigation.
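
The investigation above amounts to two steps: filter traces by error status and look for an attribute that every failure shares, then find the slowest span inside a failing trace. A rough sketch in plain Python follows. The trace layout, the attribute names such as provider_region and card_type, the region labels, and both helper functions are assumptions made up for illustration; a real tracing backend would expose this through its own query interface.

```python
# Hypothetical trace records; a real backend stores far more detail per span.
TRACES = [
    {"trace_id": "t1", "status": "ok",
     "attributes": {"provider_region": "region-a", "card_type": "visa"},
     "spans": [{"name": "cart-service", "duration_ms": 35},
               {"name": "payment-provider call", "duration_ms": 180}]},
    {"trace_id": "t2", "status": "error",
     "attributes": {"provider_region": "region-b", "card_type": "visa"},
     "spans": [{"name": "cart-service", "duration_ms": 40},
               {"name": "payment-provider call", "duration_ms": 10000}]},
    {"trace_id": "t3", "status": "error",
     "attributes": {"provider_region": "region-b", "card_type": "amex"},
     "spans": [{"name": "cart-service", "duration_ms": 38},
               {"name": "payment-provider call", "duration_ms": 10000}]},
    {"trace_id": "t4", "status": "ok",
     "attributes": {"provider_region": "region-a", "card_type": "amex"},
     "spans": [{"name": "cart-service", "duration_ms": 33},
               {"name": "payment-provider call", "duration_ms": 210}]},
]


def suspicious_attributes(traces):
    """Return attribute pairs present on every failed trace and on no successful one."""
    failed = [t for t in traces if t["status"] == "error"]
    succeeded = [t for t in traces if t["status"] != "error"]
    if not failed:
        return {}
    # Intersect the (key, value) pairs of all failed traces.
    common = set(failed[0]["attributes"].items())
    for trace in failed[1:]:
        common &= set(trace["attributes"].items())
    # Drop pairs that healthy traces also carry; they do not explain the failures.
    seen_in_success = set()
    for trace in succeeded:
        seen_in_success |= set(trace["attributes"].items())
    return dict(common - seen_in_success)


def slowest_span(trace):
    """Return the span with the longest duration, as a trace waterfall would highlight."""
    return max(trace["spans"], key=lambda span: span["duration_ms"])


if __name__ == "__main__":
    print(suspicious_attributes(TRACES))   # -> {'provider_region': 'region-b'}
    print(slowest_span(TRACES[1])["name"])  # -> payment-provider call
```

Subtracting the attributes seen on successful requests is a deliberate simplification: it keeps the sketch short, but on real traffic a ranking by how strongly each attribute correlates with failure would usually be more robust than an exact set difference.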

Tools related to Observability

OpenTelemetry, Grafana, Prometheus, Jaeger, Datadog, Honeycomb

Interview questions

How does observability differ from monitoring?
Explain the three pillars of observability and when you would reach for each.
What is high-cardinality data and why does it matter for debugging?
How does OpenTelemetry work and why has it become the standard?
How would you design the observability stack for a new microservices platform?
How do you control observability costs as telemetry volume grows?