roadmap updated 2026-06-01

SRE Engineer Roadmap

Learn SLOs, error budgets, incident management, and reliability engineering principles. This roadmap guides you from operations fundamentals to advanced production reliability practices used at scale.

Phase 1 — Beginner

Understand reliability concepts, on-call practices, and the foundational metrics (SLIs, SLOs, SLAs) that underpin reliability engineering.

Prometheus, Grafana, PagerDuty, Git, Linux

Phase 2 — Intermediate

Define and track SLOs in production, lead incident response, and build automation to eliminate toil.

Datadog, OpenTelemetry, Jaeger, Ansible, Terraform

Phase 3 — Advanced

Architect reliability programs across engineering organizations, drive SLO culture, and lead multi-team incident management frameworks.

Argo CD, Istio, Grafana SLO, Honeycomb, Gremlin

The path: Beginner → Intermediate → Advanced

Beginner

Focus: Understand reliability concepts, on-call practices, and the foundational metrics (SLIs, SLOs, SLAs) that underpin reliability engineering.

Skills to build

Understanding SLI, SLO, and SLA definitions and differences
Linux systems administration and troubleshooting
Basic monitoring: metric collection and dashboarding
Incident response fundamentals and on-call hygiene
Writing runbooks and post-mortems
TCP/IP networking and distributed systems basics
Git and infrastructure version control
Introduction to Kubernetes operations
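
The SLI and SLO definitions in the list above can be made concrete with a small calculation. The following is a minimal sketch, assuming a hypothetical list of request records and a hypothetical 99.9% availability target; the names `Request`, `availability_sli`, and `meets_slo` are illustrative and do not come from any monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code returned to the caller
    latency_ms: float  # time taken to serve the request


def availability_sli(requests):
    """SLI: fraction of requests that did not fail with a server error (5xx)."""
    if not requests:
        return 1.0  # assumption: a window with no traffic counts as fully good
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def meets_slo(sli, slo_target):
    """SLO: the internal target that the measured SLI is compared against."""
    return sli >= slo_target


# Hypothetical sample window: 997 successful requests and 3 server errors.
requests = [Request(200, 120.0)] * 997 + [Request(503, 30.0)] * 3
sli = availability_sli(requests)
print(f"SLI = {sli:.4f}, meets 99.9% SLO: {meets_slo(sli, 0.999)}")
```

In this sketch the SLI is the measurement and the SLO is the target applied to it. An SLA would be a separate contractual commitment, usually set looser than the SLO.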

Tools to learn

Prometheus
Grafana
PagerDuty
Git
Linux
kubectl

Intermediate

Focus: Define and track SLOs in production, lead incident response, and build automation to eliminate toil.

Skills to build

Defining meaningful SLIs and calculating error budgets
Designing alerting strategies that reduce alert fatigue
Incident command and blameless post-mortem facilitation
Toil identification and elimination through automation
Distributed tracing and log correlation across services
Capacity planning and traffic modeling
Chaos engineering experiments and failure injection
Service level agreement negotiation with stakeholders
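
One way to connect error budgets to alerting that reduces fatigue is to page on burn rate, meaning how fast the budget is being consumed relative to plan. The sketch below assumes a two-window check (a long and a short window must both exceed the threshold) and a default threshold of 14.4, which corresponds to spending 2% of a 30-day budget in one hour. The function names and sample error ratios are hypothetical.

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than 'exactly on budget' errors are occurring."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("SLO target must be below 1.0")
    return error_ratio / budget


def should_page(long_window_error_ratio, short_window_error_ratio,
                slo_target, threshold=14.4):
    """Page only if both windows burn fast.

    The long window shows the problem is significant; the short window
    shows it is still happening, so a recovered incident stops paging.
    """
    return (burn_rate(long_window_error_ratio, slo_target) >= threshold
            and burn_rate(short_window_error_ratio, slo_target) >= threshold)


# Hypothetical readings for a service with a 99.9% SLO.
print(should_page(0.02, 0.03, 0.999))    # both windows burning fast
print(should_page(0.02, 0.0005, 0.999))  # short window has recovered
```

The thresholds and window lengths are tuning choices, not fixed rules; teams typically adjust them to the SLO window and to how quickly they need to respond.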

Tools to learn

Datadog
OpenTelemetry
Jaeger
Ansible
Terraform
Chaos Toolkit
Slack

Advanced

Focus: Architect reliability programs across engineering organizations, drive SLO culture, and lead multi-team incident management frameworks.

Skills to build

Multi-service SLO architecture and dependency mapping
Production readiness reviews and launch checklists
Designing resilient distributed systems with circuit breakers
Advanced capacity planning and traffic load testing at scale
Engineering-wide reliability culture change management
Cross-team incident command for major outages
Platform reliability engineering and self-healing systems
SRE team building, hiring, and maturity model definition
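
The circuit breaker pattern named in the list above can be sketched in a few lines. This is a simplified illustration, not a production implementation: the class name, the default threshold and timeout, and the injectable `clock` parameter are all assumptions made for the example, and it is not thread-safe.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout_s:
            return "half-open"  # timeout elapsed: allow one trial call
        return "open"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            raise CircuitOpenError("circuit open; failing fast")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if (self._opened_at is not None
                    or self._failures >= self.failure_threshold):
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each request to a dependency in `breaker.call(...)` and treat `CircuitOpenError` as a signal to fall back or shed load instead of waiting on a dependency that is known to be failing.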

Tools to learn

Argo CD
Istio
Grafana SLO
Honeycomb
Gremlin
Backstage
VictorOps (now Splunk On-Call)

Interview questions to prepare

How do you calculate an error budget for a service with a 99.95% SLO?
Describe the steps you take when paged for a P0 incident at 3am.
What is the difference between an SLI and an SLO, and how do they relate to SLAs?
How do you identify and quantify toil in your team’s workload?
Walk me through a blameless post-mortem you conducted — what was the outcome?
How would you introduce SLOs to an engineering team that has never used them?
What does a healthy on-call rotation look like and how do you prevent burnout?
How do you decide when to halt feature development to address reliability debt?
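
For the first question above, the arithmetic can be worked through directly. The sketch below assumes a time-based budget over a 30-day window (the question does not state a window, and request-based budgets are equally common); the function names and the 9-minute outage are hypothetical.

```python
def error_budget_minutes(slo_target, window_days=30):
    """Allowed 'bad' minutes in the window for a time-based SLO."""
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining_fraction(slo_target, downtime_minutes, window_days=30):
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget


budget = error_budget_minutes(0.9995)
print(f"Budget for 99.95% over 30 days: {budget:.1f} minutes")

# Hypothetical: one 9-minute outage so far in the window.
remaining = budget_remaining_fraction(0.9995, 9.0)
print(f"Budget remaining: {remaining:.1%}")
```

A 99.95% target leaves 0.05% of the window as budget: 0.0005 × 43,200 minutes = 21.6 minutes over 30 days.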

Certification suggestions

Google Professional Cloud DevOps Engineer — Google Cloud
Certified Kubernetes Administrator (CKA) — CNCF
AWS Certified SysOps Administrator – Associate — Amazon Web Services
Datadog Fundamentals Certification — Datadog
ITIL 4 Foundation — DevOps School

See exam formats, costs and official links in the certification registry.

Portfolio project ideas

Implement SLO tracking for a sample microservices app using Prometheus recording rules and Grafana SLO dashboards
Build an automated incident response playbook triggered by PagerDuty alerts with runbook links and Slack notifications
Design and run a chaos engineering experiment on a Kubernetes service using Chaos Toolkit and document the blast radius
Create a toil-elimination automation that reduces a repetitive on-call task from 30 minutes to under 2 minutes

Mistakes to avoid

Setting SLOs tighter than the system needs (for example, 99.999%), which leaves so little error budget that planned maintenance alone can exhaust it
Alerting on internal causes (such as high CPU on a single host) rather than user-visible symptoms, causing alert fatigue and missed real incidents
Writing post-mortems that assign blame instead of identifying systemic causes and preventive actions
Underestimating toil — not tracking and reducing it leads to SRE burnout and attrition
Confusing availability with reliability — a service can be ‘up’ but still unreliable from a user perspective
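
The last point can be illustrated by comparing a health-check uptime figure with a latency-based SLI over the same period. This is a minimal sketch with made-up data: the probe results, the latencies, and the 500 ms threshold are all hypothetical.

```python
def uptime_ratio(probe_results):
    """Fraction of health-check probes that passed."""
    return sum(probe_results) / len(probe_results)


def latency_sli(latencies_ms, threshold_ms):
    """Fraction of requests served within the latency threshold."""
    fast = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return fast / len(latencies_ms)


# Hypothetical hour: every probe passed, but 10% of requests took 4 seconds.
probes = [True] * 60
latencies = [150.0] * 90 + [4000.0] * 10

print(f"Uptime by probe: {uptime_ratio(probes):.0%}")
print(f"Requests under 500 ms: {latency_sli(latencies, 500.0):.0%}")
```

Here the service reports full uptime while one request in ten is slow enough that users would likely consider it broken, which is why SLIs are usually defined on user-facing requests rather than on probe status alone.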

Keep going

Follow the structured SRE 90-Day Learning Path
Explore Monitoring Tools
Explore Observability Tools
Explore Incident Management Tools
Explore SRE Tools
Explore Alerting Tools