SLA (Service Level Agreement)

A formal contract between a provider and its customers that defines promised service levels, such as 99.9% uptime, along with consequences like service credits or penalties if the promise is broken.

In depth

A Service Level Agreement is the external, legally binding counterpart to internal SLOs. It specifies measurable commitments a provider makes to customers, most commonly availability percentages, support response times, or data durability, and spells out the remedies when those commitments are missed, typically service credits or refunds. Because breaching an SLA costs money and reputation, providers usually set SLAs looser than their internal SLOs; a team might operate to an internal 99.95% SLO while contractually promising only 99.9%, which leaves a safety margin. An SLA also defines precisely how the metric is measured, what counts as downtime, which exclusions apply (such as scheduled maintenance), and how customers file a claim. For engineers, the practical implication is that the SLO is the operational target and the SLA is the line you must never cross. Cloud providers publish SLAs for nearly every service, and understanding them is essential both when designing architectures and when making commitments to your own customers.
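To make the safety margin concrete, the sketch below converts an availability percentage into a monthly downtime budget. It assumes a 30-day month and that every minute counts toward the total; the function name is illustrative and not taken from any particular SLA.

```python
def allowed_downtime_minutes(target_percent: float, days: int = 30) -> float:
    """Minutes of downtime permitted in the period at the given availability target."""
    total_minutes = days * 24 * 60  # 43,200 minutes in an assumed 30-day month
    return total_minutes * (100 - target_percent) / 100


sla_budget = allowed_downtime_minutes(99.9)    # contractual promise
slo_budget = allowed_downtime_minutes(99.95)   # internal target

print(round(sla_budget, 1))               # 43.2
print(round(slo_budget, 1))               # 21.6
print(round(sla_budget - slo_budget, 1))  # 21.6 minutes of margin
```

Under these assumptions, a team that stays within its 99.95% SLO has used at most about half of the downtime the 99.9% SLA allows.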

Why it matters

SLAs translate reliability into business and legal terms, shaping pricing, contracts, and architecture decisions. Engineers who understand the gap between SLA, SLO, and actual performance can design systems with appropriate redundancy and avoid promising customers levels of reliability the platform beneath them cannot support.
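One way to check whether a promise is supportable is to estimate the availability implied by the services you depend on. The sketch below uses the standard approximation that a request needing every dependency succeeds only if all of them are up, and that redundant replicas fail only if all replicas are down. It assumes failures are independent, which real systems often violate, so treat the results as a rough upper bound; the function names and figures are illustrative.

```python
import math
from typing import Iterable


def serial_availability(availabilities: Iterable[float]) -> float:
    """Availability when every dependency must be up (fractions, e.g. 0.999)."""
    return math.prod(availabilities)


def redundant_availability(availability: float, replicas: int) -> float:
    """Availability when any one of `replicas` independent copies is enough."""
    return 1 - (1 - availability) ** replicas


# Three hard dependencies, each with a hypothetical 99.9% SLA.
chain = serial_availability([0.999, 0.999, 0.999])
print(round(chain * 100, 3))  # about 99.7, below a 99.9% promise

# Two independent replicas of a hypothetical 99% component.
pair = redundant_availability(0.99, 2)
print(round(pair * 100, 4))   # about 99.99
```

The first result shows why stacking several 99.9% dependencies in series cannot by itself back a 99.9% commitment, and the second shows how redundancy can raise the figure again.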

Real-world example

A SaaS company signs enterprise contracts promising 99.9% monthly uptime, with a 10% service credit per breach. Internally, the SRE team targets a 99.95% SLO so that alerts fire and releases freeze well before contractual penalties are at risk. When a single AWS Availability Zone goes down, the architecture's multi-AZ failover keeps the company inside its SLA.
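The scenario above can be sketched as a monthly check. The 99.9% SLA, 99.95% SLO, and 10% credit come from the example; the 30-day month, the 1,000 monthly fee, the treatment of excluded maintenance minutes, and all names are assumptions made for illustration, since a real contract defines each of these precisely.

```python
from dataclasses import dataclass

MINUTES_PER_MONTH = 30 * 24 * 60  # assumed 30-day billing month


@dataclass
class MonthlyReport:
    uptime_percent: float
    slo_met: bool
    sla_met: bool
    credit: float


def evaluate_month(
    downtime_minutes: float,
    excluded_minutes: float = 0.0,  # e.g. scheduled maintenance (assumed excluded)
    monthly_fee: float = 1000.0,    # hypothetical contract value
    slo_percent: float = 99.95,
    sla_percent: float = 99.9,
    credit_rate: float = 0.10,
) -> MonthlyReport:
    """Compare measured uptime against the internal SLO and the contractual SLA."""
    counted = max(downtime_minutes - excluded_minutes, 0.0)
    uptime = 100.0 * (MINUTES_PER_MONTH - counted) / MINUTES_PER_MONTH
    sla_met = uptime >= sla_percent
    credit = 0.0 if sla_met else credit_rate * monthly_fee
    return MonthlyReport(uptime, uptime >= slo_percent, sla_met, credit)


# 30 minutes of downtime: the SLO is missed but the SLA still holds.
print(evaluate_month(30))

# 50 minutes of downtime: the SLA is breached and a credit is owed.
print(evaluate_month(50))
```

The gap between the two thresholds is the window in which the team can react, for example by freezing releases, before any credit becomes payable.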

Tools related to SLA (Service Level Agreement)

Datadog
PagerDuty
Statuspage
Better Stack
Pingdom

Interview questions

How does an SLA differ from an SLO, and which should be stricter?
What typically happens when a provider breaches an SLA?
How would you design a system on AWS to meet a 99.9% uptime SLA?
Why do SLAs usually exclude scheduled maintenance windows?
How do dependencies' SLAs affect the SLA you can offer your customers?
What should an engineering team review before sales commits to an SLA?