SRE (Site Reliability Engineering)

An engineering discipline, pioneered at Google, that applies software engineering to operations problems, using SLOs, error budgets, and automation to keep services reliable while still enabling fast change.

In depth

Site Reliability Engineering treats operations as a software problem. Instead of repeating the same work by hand, SRE teams write code to automate toil such as manual deployments, capacity management, and routine incident response. The discipline is built around measurable reliability: teams define Service Level Indicators (SLIs) that capture user-facing health, set Service Level Objectives (SLOs) as targets for those indicators, and treat the gap between the SLO and 100% as an error budget. How much of that budget remains decides when to prioritize reliability work over new features. SREs also own incident management, on-call rotations, capacity planning, and postmortem culture. A common rule is that SREs should spend no more than half their time on operational toil, with the rest devoted to engineering that removes that toil. Where DevOps is a broad culture, SRE is often described as a concrete implementation of DevOps principles with specific practices and roles.
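The relationship between an SLI, an SLO, and an error budget can be sketched in a few lines of Python. This is a minimal illustration, not a standard API: the `SloReport` type, the `evaluate_slo` function, and the sample traffic numbers are all assumptions made for the example, and it assumes a simple request-based availability SLI (good events divided by total events).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloReport:
    sli: float               # observed fraction of good events
    slo: float               # target fraction of good events
    budget_remaining: float  # share of the error budget still unspent (can go negative)


def evaluate_slo(good_events: int, total_events: int, slo: float) -> SloReport:
    """Compare a request-based availability SLI against its SLO (hypothetical helper)."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")

    sli = good_events / total_events
    allowed_bad = (1.0 - slo) * total_events   # the error budget, counted in events
    actual_bad = total_events - good_events
    budget_remaining = 1.0 - actual_bad / allowed_bad
    return SloReport(sli=sli, slo=slo, budget_remaining=budget_remaining)


# Made-up numbers: 1,000,000 requests, 800 failures, 99.9% SLO.
report = evaluate_slo(good_events=999_200, total_events=1_000_000, slo=0.999)
print(f"SLI {report.sli:.4%}, budget remaining {report.budget_remaining:.0%}")
```

With a 99.9% target, one million requests allow roughly 1,000 failures, so 800 failures leave about a fifth of the budget unspent. A negative `budget_remaining` would mean the SLO has already been missed for the window.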

Why it matters

SRE gives teams an objective, data-driven way to balance release velocity against stability: the remaining error budget, rather than opinion, decides when to slow down. It reduces burnout by capping toil and turning repetitive work into software. SRE roles are also frequently reported to be among the highest-paid infrastructure positions.

Real-world example

An SRE team at a payments company defines an SLO of 99.95% successful API requests over 30 days, which leaves an error budget of 0.05% of requests. When a string of bad deploys burns most of that budget, the team's error budget policy automatically pauses feature releases, and the product team agrees to spend the sprint on reliability fixes. Launches resume once the budget has recovered.
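The release pause in this example can be expressed as a small policy check. The sketch below is one possible shape for it, under stated assumptions: the 80% freeze threshold, the traffic volume, and the function names `budget_consumed` and `releases_allowed` are hypothetical and do not come from the scenario, which only says that "most" of the budget was burned. `Fraction` is used so the 99.95% arithmetic stays exact.

```python
from fractions import Fraction

SLO_TARGET = Fraction(9995, 10000)   # 99.95% successful requests, from the example
FREEZE_THRESHOLD = Fraction(4, 5)    # assumption: pause releases at 80% budget spent


def budget_consumed(total_requests: int, failed_requests: int) -> Fraction:
    """Share of the window's error budget already used (may exceed 1)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1 - SLO_TARGET) * total_requests
    return Fraction(failed_requests) / allowed_failures


def releases_allowed(total_requests: int, failed_requests: int) -> bool:
    """Return False when this hypothetical policy says to pause feature releases."""
    return budget_consumed(total_requests, failed_requests) < FREEZE_THRESHOLD


# Hypothetical 30-day traffic of 200 million requests: 100,000 failures are allowed.
print(releases_allowed(200_000_000, 40_000))   # 40% of the budget spent
print(releases_allowed(200_000_000, 92_000))   # 92% of the budget spent
```

Under these assumed numbers the first call permits releases and the second blocks them. In practice the threshold and the consequences of crossing it are whatever the team has written into its error budget policy.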

Tools related to SRE (Site Reliability Engineering)

Prometheus, Grafana, PagerDuty, Datadog, Terraform, Kubernetes

Interview questions

How does SRE differ from traditional operations and from DevOps?
Define SLI, SLO, and SLA and explain how they relate.
What is an error budget and how would you enforce an error budget policy?
What is toil, and how do you reduce it?
Walk me through how you would handle a major production incident as on-call SRE.
How do you decide what reliability target a service should have?