Error Budget

The amount of unreliability a service is allowed within its SLO period. If the SLO is 99.9% availability, the error budget is the remaining 0.1%; teams can spend it on risky changes and must slow down when it runs out.

In depth

An error budget is the complement of a Service Level Objective: budget = 100% minus the SLO target. If a service promises 99.9% successful requests over 30 days, then 0.1% of requests are allowed to fail in that window; that allowance is the error budget. The concept, popularized by Google SRE, turns reliability from a vague aspiration into a spendable resource. As long as budget remains, teams are free to deploy frequently, run experiments, and take calculated risks. When incidents or bad releases consume the budget, an error budget policy kicks in: feature releases may freeze, and engineering effort shifts to reliability work until the budget recovers. This creates a self-correcting feedback loop and defuses the classic tug-of-war between developers who want to ship and operators who want stability. Error budgets also make explicit that 100% reliability is the wrong target: each extra nine costs far more than the last, while users stop noticing the improvement.
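
The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the request counts, and the simple ship/freeze rule are assumptions made for the example, and a real error budget policy is usually more nuanced.

```python
def error_budget_fraction(slo_target: float) -> float:
    """Fraction of requests allowed to fail, e.g. 0.999 -> about 0.001."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return 1.0 - slo_target


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Share of the budget still unspent; zero or below means it is exhausted."""
    allowed_failures = error_budget_fraction(slo_target) * total_requests
    if allowed_failures <= 0:
        raise ValueError("total_requests must be positive")
    return 1.0 - failed_requests / allowed_failures


def release_decision(remaining: float) -> str:
    """Hypothetical policy: keep shipping while any budget is left."""
    return "ship" if remaining > 0 else "freeze"


# Assumed numbers: 10 million requests in the 30-day window, 4,000 failures.
remaining = budget_remaining(0.999, 10_000_000, 4_000)
decision = release_decision(remaining)
print(f"{remaining:.0%} of the budget left, decision: {decision}")
```

With a 99.9% target, 10 million requests allow about 10,000 failures, so 4,000 failures leave roughly 60% of the budget and the sketch keeps releases open.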

Why it matters

Error budgets replace opinion-based arguments about release risk with a shared, measurable contract between product and engineering. They give teams explicit permission to move fast when things are healthy and a clear trigger to invest in reliability when they are not.

Real-world example

A streaming service has a 99.95% availability SLO, which allows roughly 21 minutes of downtime per 30-day window. A botched database migration consumes 18 of those minutes in week one. Per the error budget policy, the team freezes feature deploys, adds migration canary checks, and resumes normal releases once the 30-day window rolls past the incident and the budget is restored.
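
The arithmetic behind this example can be checked with a short Python sketch. It assumes a rolling 30-day window and treats availability as wall-clock uptime; the function name is illustrative only.

```python
MINUTES_PER_DAY = 24 * 60


def downtime_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of downtime the SLO tolerates over the window."""
    return (1.0 - slo_target) * window_days * MINUTES_PER_DAY


budget = downtime_budget_minutes(0.9995)  # 0.05% of 43,200 minutes
spent = 18.0                              # minutes lost to the migration
left = budget - spent

print(f"budget {budget:.1f} min, spent {spent:.1f} min, left {left:.1f} min")
```

The budget comes out at about 21.6 minutes, so the incident leaves under 4 minutes of allowed downtime for the rest of the window, which is why the policy in the example triggers a freeze.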

Tools related to Error Budget

Prometheus, Grafana, Datadog, Nobl9, Google Cloud Monitoring

Interview questions

How do you calculate an error budget from an SLO?
What should happen when a team exhausts its error budget?
Why is targeting 100% reliability considered a mistake?
How would you implement error budget burn-rate alerting?
How do error budgets resolve conflict between feature velocity and stability?
What is a multi-window, multi-burn-rate alert and why use one?