Reliability & Operations | 90 days | 2-3 hours/day | updated 2026-06-01

SRE 90-Day Learning Path

Master SRE in 90 days: SLIs, SLOs, error budgets, toil reduction, and blameless postmortems. Build the reliability engineering skills that Google pioneered and that every modern ops team needs.

What SRE means

Site Reliability Engineering (SRE) applies software engineering principles to infrastructure and operations problems. SREs define service level indicators and objectives, manage error budgets to balance reliability with velocity, and systematically eliminate toil through automation. The practice emphasizes blameless culture, data-driven decision-making, and treating operational work as a software problem to be solved.

Who should follow this path

Operations engineers moving to software-driven reliability
Software engineers taking on ops responsibilities
DevOps engineers formalizing reliability practices
Engineering managers building SRE teams

Prerequisites

Solid Linux and systems administration experience
Scripting proficiency in Python or Go
Familiarity with monitoring and alerting tools
Understanding of distributed systems fundamentals

The 90-day plan

Daily study recommendation: 2-3 hours/day, six days a week. Consistency beats intensity, so block the time in your calendar like a meeting.

Days 1–15: Foundation

SRE principles from the Google SRE book
Service Level Indicators (SLIs) definition and selection
Service Level Objectives (SLOs) and how they relate to SLAs
Error budget concept, calculation, and policy design
Toil identification, classification, and prioritization

Outcome: Define SLIs and SLOs for a real service and calculate its error budget.
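
As a starting point for the outcome above, the error budget arithmetic can be sketched in a few lines of Python. This is a minimal illustration, not a standard library or tool: the function names, the 30-day window, and the 99.9% target are assumptions chosen for the example.

```python
def error_budget(slo_target: float, window_days: int = 30) -> dict:
    """Return the allowed unavailability for an availability SLO.

    slo_target is a fraction such as 0.999 (99.9%); window_days is an
    assumed rolling window, here defaulting to 30 days.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    budget_fraction = 1.0 - slo_target
    window_minutes = window_days * 24 * 60
    return {
        "budget_fraction": budget_fraction,
        "allowed_downtime_minutes": budget_fraction * window_minutes,
    }


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of a request-based error budget still unspent (negative if overspent)."""
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures == 0:
        raise ValueError("no requests in the window, so there is no budget to measure")
    return 1.0 - failed_requests / allowed_failures


budget = error_budget(0.999, window_days=30)
print(round(budget["allowed_downtime_minutes"], 1))        # 43.2
print(round(budget_remaining(0.999, 1_000_000, 250), 2))   # 0.75
```

A 99.9% target over 30 days leaves roughly 43 minutes of full downtime, or 1,000 failed requests per million. An error budget policy then decides what happens as that remainder approaches zero.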

Days 16–30: Core concepts

Monitoring strategy: RED and USE methods
Prometheus metrics design and labeling best practices
Grafana dashboard design for SRE workflows
Alerting philosophy: symptom-based vs cause-based
On-call rotation design and escalation policies

Outcome: Instrument a service with SLO-aligned alerts using Prometheus and Grafana.
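
The RED method listed above (Rate, Errors, Duration) can be explored offline before touching Prometheus. The sketch below computes the three signals from an in-memory list of request samples. The `Request` type, the rule that only 5xx responses count as errors, and the nearest-rank p95 are assumptions for illustration; a real service would export counters and histograms rather than keep raw samples.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_s: float  # how long the request took, in seconds
    status: int        # HTTP status code returned to the caller


def red_summary(requests: list[Request], window_s: float) -> dict:
    """Summarize Rate, Errors, and Duration for one observation window."""
    if not requests:
        return {"rate_rps": 0.0, "error_ratio": 0.0, "p95_duration_s": 0.0}

    # Assumption: only server-side failures (5xx) spend error budget.
    errors = sum(1 for r in requests if r.status >= 500)
    durations = sorted(r.duration_s for r in requests)

    # Nearest-rank 95th percentile, using integer math to avoid float rounding.
    rank = max(1, (95 * len(durations) + 99) // 100)

    return {
        "rate_rps": len(requests) / window_s,
        "error_ratio": errors / len(requests),
        "p95_duration_s": durations[rank - 1],
    }


samples = [Request(0.05, 200)] * 18 + [Request(0.4, 200), Request(1.2, 500)]
print(red_summary(samples, window_s=10.0))
```

The error ratio and the latency percentile are the usual candidates for SLIs, which is why symptom-based alerts tend to be built on them rather than on CPU or memory.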

Days 31–45: Tools and workflows

Incident management lifecycle (detect, respond, resolve)
Incident commander and communications roles
Runbook and playbook authoring for common failures
Blameless postmortem process and template design
Five whys and contributing factor analysis

Outcome: Facilitate a blameless postmortem and produce an actionable improvement plan.
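
A postmortem timeline usually starts from a handful of timestamps matching the detect, respond, resolve lifecycle. The following sketch derives the common durations from them. The class and field names are hypothetical, and the timestamps in the usage example are invented purely to show the arithmetic.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentTimeline:
    started_at: datetime    # when user impact began
    detected_at: datetime   # when an alert fired or a human noticed
    responded_at: datetime  # when a responder acknowledged and began work
    resolved_at: datetime   # when user impact ended

    def durations(self) -> dict[str, timedelta]:
        """Return time to detect, respond, and resolve, all measured from impact start."""
        stamps = [self.started_at, self.detected_at, self.responded_at, self.resolved_at]
        if stamps != sorted(stamps):
            raise ValueError("timeline events are out of order")
        return {
            "time_to_detect": self.detected_at - self.started_at,
            "time_to_respond": self.responded_at - self.started_at,
            "time_to_resolve": self.resolved_at - self.started_at,
        }


# Invented example timestamps.
timeline = IncidentTimeline(
    started_at=datetime(2026, 1, 10, 14, 0),
    detected_at=datetime(2026, 1, 10, 14, 7),
    responded_at=datetime(2026, 1, 10, 14, 12),
    resolved_at=datetime(2026, 1, 10, 15, 5),
)
for name, value in timeline.durations().items():
    print(name, value)
```

Keeping the timeline as data rather than prose makes it easier to compare incidents and keeps the discussion focused on the system's behavior instead of on individuals.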

Days 46–60: Hands-on projects

Toil reduction through automation in Python and Go
Capacity planning and demand forecasting
Load testing with k6 or Locust
Reliability patterns: circuit breakers, retry, and bulkhead
Chaos engineering basics for SRE validation

Outcome: Automate a toil-heavy operational task and conduct a structured load test.
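
Of the reliability patterns named above, the circuit breaker is the one most often misunderstood, so a small sketch may help. This is a simplified, single-threaded illustration and not a production library: the class name, the default thresholds, and the injectable `clock` parameter (included so the behavior can be tested without sleeping) are all assumptions.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout_s:
                # Fail fast instead of piling load onto a struggling dependency.
                raise CircuitOpenError("circuit open; call rejected")
            # Timeout elapsed: half-open, so let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each downstream request, for example `breaker.call(fetch_profile, user_id)` where `fetch_profile` is a hypothetical client function, and treat `CircuitOpenError` as a signal to serve a fallback.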

Days 61–75: Advanced practices

Error budget policies and release gates
SLO burn rate alerting with multi-window rules
Distributed tracing for reliability investigation
Service dependency mapping and blast radius analysis
Error budget reporting cadence and stakeholder communication

Outcome: Implement burn rate alerting and use distributed tracing to investigate reliability issues.
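
Burn rate is the observed error ratio divided by the error ratio the SLO allows, so a burn rate of 1 spends the budget exactly over the SLO window. The sketch below shows the multi-window idea in plain Python: page only when both a long and a short window are burning fast. The function names are hypothetical. The 14.4 threshold is the commonly cited value for a 1-hour long window paired with a 5-minute short window, and corresponds to spending 2% of a 30-day budget in one hour; treat it as an assumption to tune, not a fixed rule.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    allowed_error_ratio = 1.0 - slo_target
    if allowed_error_ratio <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / allowed_error_ratio


def should_page(long_window_error_ratio: float,
                short_window_error_ratio: float,
                slo_target: float,
                threshold: float = 14.4) -> bool:
    """Multi-window rule: both windows must exceed the burn-rate threshold.

    The long window shows the burn is significant; the short window shows
    it is still happening, so the alert clears soon after recovery.
    """
    return (burn_rate(long_window_error_ratio, slo_target) >= threshold
            and burn_rate(short_window_error_ratio, slo_target) >= threshold)


# Ongoing incident: 2% errors over the last hour, 3% over the last 5 minutes.
print(should_page(0.02, 0.03, slo_target=0.999))    # True
# Already recovered: the last hour looks bad, the last 5 minutes look healthy.
print(should_page(0.02, 0.0005, slo_target=0.999))  # False
```

A plain threshold alert on error ratio ignores the SLO entirely, whereas this rule ties paging directly to how quickly the error budget is being consumed.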

Days 76–90: Portfolio, interview & certification prep

Portfolio: SRE playbook for a production service
SRE interview question preparation
Certification paths relevant to SRE, such as Google Professional Cloud DevOps Engineer
Presenting error budget reports to stakeholders
Contributing to OpenSLO and SLO tooling

Outcome: Present a complete SRE runbook portfolio and pass SRE technical interviews with confidence.

Phase outcomes at a glance

Phase	Outcome
Days 1–15	Define SLIs and SLOs for a real service and calculate its error budget.
Days 16–30	Instrument a service with SLO-aligned alerts using Prometheus and Grafana.
Days 31–45	Facilitate a blameless postmortem and produce an actionable improvement plan.
Days 46–60	Automate a toil-heavy operational task and conduct a structured load test.
Days 61–75	Implement burn rate alerting and use distributed tracing to investigate reliability issues.
Days 76–90	Present a complete SRE runbook portfolio and pass SRE technical interviews with confidence.

Tools to learn

Prometheus
Grafana
PagerDuty
OpsGenie
OpenTelemetry
Datadog
k6
Chaos Toolkit
Python
Go

Labs to practice

Mini projects

Define and implement SLOs for a three-tier web application with error budget reporting
Build a blameless postmortem system with automated incident timeline generation
Automate a recurring operational task and measure toil reduction in hours saved
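
For the third mini project, "hours saved" can be estimated with simple arithmetic. The sketch below is one possible way to frame it; the function names and the example figures (40 runs a month, 30 minutes by hand, 3 minutes once automated, 36 hours to build) are invented for illustration.

```python
def monthly_hours_saved(runs_per_month: int,
                        manual_minutes: float,
                        automated_minutes: float) -> float:
    """Engineer hours recovered per month after automating a recurring task."""
    return runs_per_month * (manual_minutes - automated_minutes) / 60.0


def payback_months(build_hours: float, hours_saved_per_month: float) -> float:
    """Months until the time spent building the automation is recovered."""
    if hours_saved_per_month <= 0:
        return float("inf")  # the automation never pays for itself
    return build_hours / hours_saved_per_month


saved = monthly_hours_saved(runs_per_month=40, manual_minutes=30, automated_minutes=3)
print(saved)                        # 18.0
print(payback_months(36.0, saved))  # 2.0
```

Recording the manual duration before automating is the step most often skipped, and without it the reduction cannot be measured afterwards.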

Interview questions to prepare

Explain the difference between an SLI, SLO, and SLA.
How do you set an error budget policy that balances reliability and feature velocity?
What is toil and how do you prioritize toil reduction?
Describe a blameless postmortem you ran and what the most valuable outcome was.
How does a burn rate alert differ from a threshold alert?
What reliability patterns would you apply to a microservice that calls ten downstream services?
How would you approach capacity planning for a service with seasonal traffic spikes?
What is the SRE engagement model and when does a service get SRE support?

Certification suggestions

Google Professional Cloud DevOps Engineer — Google Cloud
AWS Certified DevOps Engineer – Professional — AWS
Certified Kubernetes Administrator (CKA) — CNCF

