Reliability & Operations | 90 days | 2-3 hours/day | Updated 2026-06-01
SRE 90-Day Learning Path
Learn SRE in 90 days: SLIs, SLOs, error budgets, toil reduction, and blameless postmortems. Build the reliability engineering skills that Google pioneered and that modern operations teams rely on.
What SRE means
Site Reliability Engineering (SRE) applies software engineering principles to infrastructure and operations problems. SREs define service level indicators and objectives, manage error budgets to balance reliability with velocity, and systematically eliminate toil through automation. The practice emphasizes blameless culture, data-driven decision-making, and treating operational work as a software problem to be solved.
Who should follow this path
- Operations engineers moving to software-driven reliability
- Software engineers taking on ops responsibilities
- DevOps engineers formalizing reliability practices
- Engineering managers building SRE teams
Prerequisites
- Solid Linux and systems administration experience
- Scripting proficiency in Python or Go
- Familiarity with monitoring and alerting tools
- Understanding of distributed systems fundamentals
The 90-day plan
Daily study recommendation: 2-3 hours/day, six days a week. Consistency beats intensity, so block the time in your calendar like a meeting.
Days 1–15: Foundation
- SRE principles from the Google SRE book
- Service Level Indicators (SLIs) definition and selection
- Service Level Objectives (SLOs) and SLA relationships
- Error budget concept, calculation, and policy design
- Toil identification, classification, and prioritization
Outcome: Define SLIs and SLOs for a real service and calculate its error budget.
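The error budget calculation in this phase can be sketched in a few lines of Python. This is a minimal illustration, assuming a request-based SLO measured over a single window; the function names `error_budget` and `allowed_downtime_minutes` and the sample numbers are hypothetical, not part of any SRE tool.

```python
def error_budget(slo_target: float, total_events: int, bad_events: int) -> dict:
    """Summarize the error budget for a request-based SLO over one window.

    slo_target is a fraction such as 0.999 (i.e. "99.9% of requests succeed").
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events <= 0:
        raise ValueError("total_events must be positive")

    # The budget is the share of events the SLO allows to fail.
    allowed_bad = (1.0 - slo_target) * total_events
    consumed = bad_events / allowed_bad
    return {
        "sli": 1.0 - bad_events / total_events,  # measured good-event ratio
        "allowed_bad_events": allowed_bad,
        "budget_consumed": consumed,             # 1.0 means the budget is spent
        "budget_remaining": 1.0 - consumed,
    }


def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Time-based view: minutes of full outage the SLO tolerates per window."""
    return (1.0 - slo_target) * window_days * 24 * 60


if __name__ == "__main__":
    # Hypothetical month: one million requests, 250 of them failed.
    print(error_budget(0.999, 1_000_000, 250))
    print(allowed_downtime_minutes(0.999))
```

With a 99.9% target, one million requests leave a budget of about 1,000 failed requests, so 250 failures consume roughly a quarter of it. The same target over 30 days tolerates about 43.2 minutes of full outage.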
Days 16–30: Core concepts
- Monitoring strategy: RED and USE methods
- Prometheus metrics design and labeling best practices
- Grafana dashboard design for SRE workflows
- Alerting philosophy: symptom-based vs cause-based
- On-call rotation design and escalation policies
Outcome: Instrument a service with SLO-aligned alerts using Prometheus and Grafana.
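To make the RED method and symptom-based alerting concrete, the sketch below derives rate, error ratio, and a latency SLI from raw request records in plain Python. It is an illustration only: in a real setup Prometheus would compute these from counters and histograms, and the `Request` type, the `red_summary` function, and the 300 ms latency threshold are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Request:
    duration_s: float  # how long the request took
    status: int        # HTTP status code


def red_summary(
    requests: Sequence[Request],
    window_s: float,
    latency_threshold_s: float = 0.3,  # assumed threshold for a "fast" request
) -> Optional[dict]:
    """Compute RED-style signals (Rate, Errors, Duration) for one window."""
    total = len(requests)
    if total == 0:
        return None  # no traffic, so the SLIs are undefined for this window

    # Server-side failures are the user-visible symptom we count as errors.
    errors = sum(1 for r in requests if r.status >= 500)
    # A request is "good" for the latency SLI if it succeeded and was fast.
    fast = sum(
        1 for r in requests
        if r.status < 500 and r.duration_s <= latency_threshold_s
    )
    return {
        "rate_rps": total / window_s,
        "error_ratio": errors / total,
        "availability_sli": (total - errors) / total,
        "latency_sli": fast / total,
    }


def violates_slo(sli: float, slo_target: float) -> bool:
    """Symptom-based check: alert on the user-facing SLI, not on a cause."""
    return sli < slo_target
```

The alert condition looks only at what users experience (failed or slow requests). Cause-based signals such as CPU or disk usage are better suited to dashboards and debugging than to paging.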
Days 31–45: Tools and workflows
- Incident management lifecycle (detect, respond, resolve)
- Incident commander and communications roles
- Runbook and playbook authoring for common failures
- Blameless postmortem process and template design
- Five whys and contributing factor analysis
Outcome: Facilitate a blameless postmortem and produce an actionable improvement plan.
Days 46–60: Hands-on projects
- Toil reduction through automation in Python and Go
- Capacity planning and demand forecasting
- Load testing with k6 or Locust
- Reliability patterns: circuit breakers, retry, and bulkhead
- Chaos engineering basics for SRE validation
Outcome: Automate a toil-heavy operational task and conduct a structured load test.
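Of the reliability patterns listed in this phase, the circuit breaker is the easiest to misread, so a minimal sketch may help. The class below is an illustrative single-threaded implementation, not the API of any particular library; `CircuitBreaker`, `CircuitOpenError`, and the default thresholds are hypothetical names and values, and the injectable `clock` exists only to make the behavior testable.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0       # consecutive failures seen while closed
        self._opened_at = None   # timestamp of the last trip, None if closed

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout_s:
            return "half-open"   # timeout elapsed: allow a trial call
        return "open"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            # Fail fast instead of piling load onto a struggling dependency.
            raise CircuitOpenError("circuit is open; failing fast")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if (self._opened_at is not None
                    or self._failures >= self.failure_threshold):
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each downstream request in `breaker.call(...)` and treat `CircuitOpenError` as a signal to serve a fallback. A production version would also need locking for concurrent use and would usually be combined with bounded retries and backoff.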
Days 61–75: Advanced practices
- Error budget policies and release gates
- SLO burn rate alerting with multi-window rules
- Distributed tracing for reliability investigation
- Service dependency mapping and blast radius analysis
- Error budget reporting cadence and stakeholder communication
Outcome: Implement burn rate alerting and use distributed tracing to investigate reliability issues.
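The multi-window burn rate rule from this phase can be sketched as follows. The burn rate is the observed error ratio divided by the error ratio the SLO allows, so a value of 1 spends the budget exactly over the SLO window. The 14.4 threshold follows the example in the Google SRE Workbook (2% of a 30-day budget consumed in one hour); the function names and the sample error ratios are assumptions for illustration.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return error_ratio / (1.0 - slo_target)


def should_page(
    long_window_error_ratio: float,   # e.g. error ratio over the last 1 hour
    short_window_error_ratio: float,  # e.g. error ratio over the last 5 minutes
    slo_target: float,
    threshold: float = 14.4,          # assumed: 2% of a 30-day budget in 1 hour
) -> bool:
    """Page only if BOTH windows exceed the burn rate threshold.

    The long window shows the burn is significant; the short window shows it
    is still happening, so the alert resets quickly once the issue is fixed.
    """
    return (
        burn_rate(long_window_error_ratio, slo_target) >= threshold
        and burn_rate(short_window_error_ratio, slo_target) >= threshold
    )


if __name__ == "__main__":
    slo = 0.999
    # Sustained 2% errors in both windows: a burn rate of roughly 20.
    print(should_page(0.02, 0.02, slo))
    # The hour looks bad, but the last 5 minutes have recovered.
    print(should_page(0.02, 0.001, slo))
```

Unlike a fixed threshold alert on the error ratio, this rule scales with the SLO: the same 2% error ratio would be well within budget for a 95% target but burns a 99.9% budget about twenty times too fast.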
Days 76–90: Portfolio, interview & certification prep
- Portfolio: SRE playbook for a production service
- SRE interview question preparation
- Certification paths covering SRE practices, such as Google Professional Cloud DevOps Engineer (listed under Certification suggestions)
- Presenting error budget reports to stakeholders
- Contributing to OpenSLO and SLO tooling
Outcome: Present a complete SRE playbook portfolio and approach SRE technical interviews with confidence.
Phase outcomes at a glance
| Phase | Outcome |
|---|---|
| Days 1–15 | Define SLIs and SLOs for a real service and calculate its error budget. |
| Days 16–30 | Instrument a service with SLO-aligned alerts using Prometheus and Grafana. |
| Days 31–45 | Facilitate a blameless postmortem and produce an actionable improvement plan. |
| Days 46–60 | Automate a toil-heavy operational task and conduct a structured load test. |
| Days 61–75 | Implement burn rate alerting and use distributed tracing to investigate reliability issues. |
| Days 76–90 | Present a complete SRE playbook portfolio and approach SRE technical interviews with confidence. |
Tools to learn
- Prometheus
- Grafana
- PagerDuty
- OpsGenie
- OpenTelemetry
- Datadog
- k6
- Chaos Toolkit
- Python
- Go
Labs to practice
- Prometheus and Grafana Monitoring
- OpenTelemetry Tracing
- Chaos Engineering Experiment
- CI/CD with GitHub Actions
Mini projects
- Define and implement SLOs for a three-tier web application with error budget reporting
- Build a blameless postmortem system with automated incident timeline generation
- Automate a recurring operational task and measure toil reduction in hours saved
Interview questions to prepare
- Explain the difference between an SLI, SLO, and SLA.
- How do you set an error budget policy that balances reliability and feature velocity?
- What is toil and how do you prioritize toil reduction?
- Describe a blameless postmortem you ran and what the most valuable outcome was.
- How does a burn rate alert differ from a threshold alert?
- What reliability patterns would you apply to a microservice that calls ten downstream services?
- How would you approach capacity planning for a service with seasonal traffic spikes?
- What is the SRE engagement model and when does a service get SRE support?
Certification suggestions
- Google Professional Cloud DevOps Engineer — Google Cloud
- AWS Certified DevOps Engineer – Professional — AWS
- Certified Kubernetes Administrator (CKA) — CNCF
Free resources
- Google SRE Book (free online)
- Google SRE Workbook (free online)
- OpenSLO Specification
- roadmap.sh SRE Roadmap
- Prometheus Documentation
Related tool categories
- SRE Tools
- Monitoring Tools
- Alerting Tools
- Incident Management Tools
- Observability Tools
- Chaos Engineering Tools