roadmap updated 2026-06-01
SRE Engineer Roadmap
Learn SLOs, error budgets, incident management, and reliability engineering principles. This roadmap guides you from operations fundamentals to advanced production reliability practices used at scale.
Phase 1 — Beginner
Understand reliability concepts, on-call practices, and the foundational metrics (SLIs, SLOs, SLAs) that underpin reliability engineering.
Prometheus, Grafana, PagerDuty, Git, Linux
Phase 2 — Intermediate
Define and track SLOs in production, lead incident response, and build automation to eliminate toil.
Datadog, OpenTelemetry, Jaeger, Ansible, Terraform
Phase 3 — Advanced
Architect reliability programs across engineering organizations, drive SLO culture, and lead multi-team incident management frameworks.
Argo CD, Istio, Grafana SLO, Honeycomb, Gremlin
The path: Beginner → Intermediate → Advanced
Beginner
Focus: Understand reliability concepts, on-call practices, and the foundational metrics (SLIs, SLOs, SLAs) that underpin reliability engineering.
Skills to build
- Understanding SLI, SLO, and SLA definitions and differences
- Linux systems administration and troubleshooting
- Basic monitoring: metric collection and dashboarding
- Incident response fundamentals and on-call hygiene
- Writing runbooks and post-mortems
- TCP/IP networking and distributed systems basics
- Git and infrastructure version control
- Introduction to Kubernetes operations
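The difference between an SLI, an SLO, and an SLA from the first skill above can be made concrete with a small sketch. The function names, the 99.9% SLO, and the 99.5% SLA are illustrative assumptions, not values from any particular service.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: the measured fraction of requests that succeeded."""
    if total_requests == 0:
        return 1.0  # no traffic: treated as fully available (an assumption)
    return good_requests / total_requests


def classify(sli: float, slo: float, sla: float) -> str:
    """Compare a measured SLI with the internal SLO and the contractual SLA."""
    if sli < sla:
        return "SLA breached"
    if sli < slo:
        return "SLO missed, SLA still met"
    return "SLO met"


SLO_TARGET = 0.999  # internal objective (hypothetical value)
SLA_TARGET = 0.995  # contractual promise, looser than the SLO (hypothetical value)

sli = availability_sli(good_requests=99_850, total_requests=100_000)
print(f"SLI = {sli:.4%}: {classify(sli, SLO_TARGET, SLA_TARGET)}")
```

The SLI is what you measure, the SLO is the target you hold yourself to, and the SLA is the weaker promise made to customers, which is why the SLA threshold in this sketch sits below the SLO.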
Tools to learn
- Prometheus
- Grafana
- PagerDuty
- Git
- Linux
- kubectl
Intermediate
Focus: Define and track SLOs in production, lead incident response, and build automation to eliminate toil.
Skills to build
- Defining meaningful SLIs and calculating error budgets
- Designing alerting strategies that reduce alert fatigue
- Incident command and blameless post-mortem facilitation
- Toil identification and elimination through automation
- Distributed tracing and log correlation across services
- Capacity planning and traffic modeling
- Chaos engineering experiments and failure injection
- Service level agreement negotiation with stakeholders
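One common way to connect error budgets to alerting without adding noise is a burn-rate check over two windows. The following is a minimal sketch, assuming you already have error ratios for a long and a short window; the function names are hypothetical, and the 14.4 threshold follows the example in Google's SRE Workbook (2% of a 30-day budget spent in one hour).

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    budget = 1.0 - slo
    return error_ratio / budget


def should_page(long_window_errors: float, short_window_errors: float,
                slo: float, threshold: float = 14.4) -> bool:
    """Page only if both windows are burning budget faster than the threshold."""
    return (burn_rate(long_window_errors, slo) >= threshold
            and burn_rate(short_window_errors, slo) >= threshold)


# Hypothetical readings for a 99.9% SLO.
print(should_page(0.02, 0.02, slo=0.999))    # sustained 2% errors in both windows
print(should_page(0.02, 0.0005, slo=0.999))  # short window has already recovered
```

Requiring the short window to agree with the long one means the alert stops firing soon after the problem ends, which is one of the ways this pattern reduces alert fatigue.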
Tools to learn
- Datadog
- OpenTelemetry
- Jaeger
- Ansible
- Terraform
- Chaos Toolkit
- Slack
Advanced
Focus: Architect reliability programs across engineering organizations, drive SLO culture, and lead multi-team incident management frameworks.
Skills to build
- Multi-service SLO architecture and dependency mapping
- Production readiness reviews and launch checklists
- Designing resilient distributed systems with circuit breakers
- Advanced capacity planning and traffic load testing at scale
- Engineering-wide reliability culture change management
- Cross-team incident command for major outages
- Platform reliability engineering and self-healing systems
- SRE team building, hiring, and maturity model definition
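The circuit breaker pattern mentioned in the skills above can be sketched in a few lines. This is a simplified, single-threaded illustration; the class name, parameters, and defaults are assumptions, and production systems usually rely on a service mesh or a maintained library instead.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable so the breaker can be tested
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("dependency calls are suspended")
            # Timeout elapsed: half-open, so let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping a dependency call as `breaker.call(fetch_profile, user_id)` (where `fetch_profile` is a hypothetical function) makes callers fail fast while the dependency is unhealthy instead of piling up timeouts.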
Tools to learn
- Argo CD
- Istio
- Grafana SLO
- Honeycomb
- Gremlin
- Backstage
- Splunk On-Call (formerly VictorOps)
Interview questions to prepare
- How do you calculate an error budget for a service with a 99.95% SLO?
- Describe the steps you take when paged for a P0 incident at 3am.
- What is the difference between an SLI and an SLO, and how do they relate to SLAs?
- How do you identify and quantify toil in your team’s workload?
- Walk me through a blameless post-mortem you conducted — what was the outcome?
- How would you introduce SLOs to an engineering team that has never used them?
- What does a healthy on-call rotation look like and how do you prevent burnout?
- How do you decide when to halt feature development to address reliability debt?
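For the first question, the arithmetic is worth practising by hand. A minimal sketch, assuming a time-based availability SLO and a 30-day window (both are assumptions; request-based SLOs count failed requests instead of minutes):

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed full-outage minutes in the window for a time-based SLO."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the budget still unspent (negative means overspent)."""
    budget = error_budget_minutes(slo, window_days)
    return 1.0 - downtime_minutes / budget


print(round(error_budget_minutes(0.9995), 1))       # 21.6 minutes per 30 days
print(round(error_budget_minutes(0.9995, 365), 1))  # 262.8 minutes per year
```

In other words, a 99.95% SLO allows 0.05% unavailability, which is about 21.6 minutes of full downtime in a 30-day window.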
Certification suggestions
- Google Professional Cloud DevOps Engineer — Google Cloud
- Certified Kubernetes Administrator (CKA) — CNCF
- AWS Certified SysOps Administrator – Associate — Amazon Web Services
- Datadog Fundamentals Certification — Datadog
- ITIL 4 Foundation — PeopleCert
See exam formats, costs and official links in the certification registry.
Free resources
- Site Reliability Engineering Book — Google
- SRE Workbook — Google
- Prometheus Getting Started Guide
- OpenTelemetry Documentation
- Awesome SRE — GitHub
Portfolio project ideas
- Implement SLO tracking for a sample microservices app using Prometheus recording rules and Grafana SLO dashboards
- Build an automated incident response playbook triggered by PagerDuty alerts with runbook links and Slack notifications
- Design and run a chaos engineering experiment on a Kubernetes service using Chaos Toolkit and document the blast radius
- Create a toil-elimination automation that reduces a repetitive on-call task from 30 minutes to under 2 minutes
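For the toil-elimination project, it helps to quantify the payoff before building anything. The sketch below uses the 30-minute and 2-minute figures from the idea above; the run frequency and the build effort are hypothetical inputs you would replace with your own numbers.

```python
def annual_hours_saved(minutes_before: float, minutes_after: float,
                       runs_per_week: float, weeks_per_year: int = 52) -> float:
    """Engineer hours saved per year by shortening a repetitive task."""
    saved_minutes = (minutes_before - minutes_after) * runs_per_week * weeks_per_year
    return saved_minutes / 60


def payback_weeks(build_hours: float, minutes_before: float,
                  minutes_after: float, runs_per_week: float) -> float:
    """Weeks until the time saved equals the time spent building the automation."""
    weekly_hours_saved = (minutes_before - minutes_after) * runs_per_week / 60
    return build_hours / weekly_hours_saved


# Assumed inputs: the task runs 5 times a week and takes 40 hours to automate.
print(round(annual_hours_saved(30, 2, runs_per_week=5), 1))
print(round(payback_weeks(40, 30, 2, runs_per_week=5), 1))
```

Recording numbers like these alongside the project gives you a concrete answer when asked how you identify and quantify toil.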
Mistakes to avoid
- Setting SLOs tighter than the system needs (for example, 99.999%), which leaves so little error budget that planned maintenance alone can exhaust it
- Alerting on internal causes (such as high CPU) rather than user-visible symptoms, causing alert fatigue and missed real incidents
- Writing post-mortems that assign blame instead of identifying systemic causes and preventive actions
- Underestimating toil — not tracking and reducing it leads to SRE burnout and attrition
- Confusing availability with reliability — a service can be ‘up’ but still unreliable from a user perspective
Keep going
- Follow the structured SRE 90-Day Learning Path
- Explore Monitoring Tools
- Explore Observability Tools
- Explore Incident Management Tools
- Explore SRE Tools
- Explore Alerting Tools
- Want guided, instructor-led training? See DevOpsSchool.com courses (paid).