SLI (Service Level Indicator)

A quantitative measurement of some aspect of a service's behavior that users care about, such as request success rate, latency, or freshness. SLIs are the raw metrics on which SLOs and error budgets are built.

In depth

A Service Level Indicator is a carefully chosen metric that reflects the health of a service from the user's perspective. Common SLIs include availability (the fraction of requests that succeed), latency (the fraction of requests served faster than a threshold), throughput, durability for storage systems, and freshness for data pipelines. A good SLI is usually expressed as a ratio of good events to total events, which makes it easy to convert into an SLO target and an error budget. The art of SLIs lies in measuring what users experience rather than what is convenient: CPU utilization is not an SLI because users do not feel CPU, but the percentage of page loads completing within two seconds is. SLIs should be measured as close to the user as possible, for example at the load balancer or via client-side telemetry, since internal metrics can look healthy while users suffer.
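The good-events-over-total-events form can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the Request record, the rule that any non-5xx response counts as good, the choice to treat an empty window as healthy, and the error_budget_remaining helper are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    status: int        # HTTP status code returned to the client (hypothetical record)
    latency_ms: float  # time taken to serve the request, in milliseconds


def ratio_sli(good: int, total: int) -> float:
    """Return good / total. An empty window is treated as healthy (an assumption)."""
    return 1.0 if total == 0 else good / total


def availability_sli(requests: list[Request]) -> float:
    # Assumption: any response that is not a 5xx counts as a good event.
    good = sum(1 for r in requests if r.status < 500)
    return ratio_sli(good, len(requests))


def latency_sli(requests: list[Request], threshold_ms: float) -> float:
    # Good events are requests served faster than the threshold.
    good = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return ratio_sli(good, len(requests))


def error_budget_remaining(sli: float, slo_target: float) -> float:
    """Fraction of the error budget left: 1.0 is untouched, 0.0 is spent, negative is overspent."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_bad = 1.0 - slo_target  # share of events the SLO permits to be bad
    actual_bad = 1.0 - sli          # share of events that were bad
    return 1.0 - actual_bad / allowed_bad


window = [
    Request(status=200, latency_ms=120.0),
    Request(status=200, latency_ms=340.0),
    Request(status=503, latency_ms=50.0),
    Request(status=200, latency_ms=2500.0),
]

availability = availability_sli(window)                # 3 of 4 requests succeeded
fast_enough = latency_sli(window, threshold_ms=2000)   # 3 of 4 finished within 2 s
```

Because both indicators are ratios between 0 and 1, an SLO target such as 0.999 can be compared against them directly, and the error budget is simply the share of bad events the target still allows.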

Why it matters

Everything in SLO-based reliability engineering rests on choosing good SLIs. Pick the wrong indicator and your dashboards will glow green during a real outage. Well-chosen SLIs focus monitoring, alerting, and engineering effort on what actually affects users.

Real-world example

A video platform initially monitors server CPU and memory, yet users complain about buffering that never triggers alerts. The team replaces those metrics with two SLIs measured at the edge: the ratio of successful video-start events to attempted starts, and the fraction of playback sessions with a rebuffering ratio under 1%. The next regression is caught within minutes.
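The two edge SLIs in this example could be computed roughly as follows. This is a sketch under stated assumptions: the PlaybackSession fields are hypothetical, "rebuffering under 1%" is read as stalled time divided by total watch time, and sessions that never started are excluded from the rebuffering SLI because the video-start SLI already counts them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlaybackSession:
    started: bool            # did the video begin playing? (hypothetical field)
    watch_seconds: float     # total session time, including stalls
    rebuffer_seconds: float  # time spent stalled while rebuffering


def video_start_sli(sessions: list[PlaybackSession]) -> float:
    """Successful video starts divided by attempted starts."""
    if not sessions:
        return 1.0  # assumption: no traffic means nothing failed
    return sum(1 for s in sessions if s.started) / len(sessions)


def smooth_playback_sli(
    sessions: list[PlaybackSession], max_rebuffer_ratio: float = 0.01
) -> float:
    """Fraction of started sessions whose rebuffering ratio is under the limit."""
    played = [s for s in sessions if s.started and s.watch_seconds > 0]
    if not played:
        return 1.0  # assumption: no playback means nothing to penalize
    good = sum(
        1 for s in played
        if s.rebuffer_seconds / s.watch_seconds < max_rebuffer_ratio
    )
    return good / len(played)


sessions = [
    PlaybackSession(started=True, watch_seconds=600.0, rebuffer_seconds=2.0),
    PlaybackSession(started=True, watch_seconds=300.0, rebuffer_seconds=9.0),
    PlaybackSession(started=False, watch_seconds=0.0, rebuffer_seconds=0.0),
    PlaybackSession(started=True, watch_seconds=1000.0, rebuffer_seconds=5.0),
]

start_sli = video_start_sli(sessions)       # 3 of 4 attempts started
smooth_sli = smooth_playback_sli(sessions)  # 2 of 3 started sessions stayed under 1%
```

Neither function looks at CPU or memory; both count events that a viewer would actually notice, which is why they can catch a buffering regression that host metrics miss.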

Tools related to SLI (Service Level Indicator)

Prometheus
OpenTelemetry
Datadog
Grafana
Cloudflare Analytics

Interview questions

What makes a good SLI versus a vanity metric?
Why should SLIs be measured as close to the user as possible?
Give example SLIs for an API, a batch pipeline, and a storage system.
How would you express latency as a ratio-based SLI?
Why is CPU utilization a poor SLI?
How many SLIs should a single service have, and why?