roadmap updated 2026-06-01

SRE Engineer Roadmap

Learn SLOs, error budgets, incident management, and reliability engineering principles. This roadmap guides you from operations fundamentals to advanced production reliability practices used at scale.

Phase 1 — Beginner

Understand reliability concepts, on-call practices, and the foundational metrics (SLIs, SLOs, SLAs) that underpin reliability engineering.

Prometheus, Grafana, PagerDuty, Git, Linux

Phase 2 — Intermediate

Define and track SLOs in production, lead incident response, and build automation to eliminate toil.

Datadog, OpenTelemetry, Jaeger, Ansible, Terraform

Phase 3 — Advanced

Architect reliability programs across engineering organizations, drive SLO culture, and lead multi-team incident management frameworks.

Argo CD, Istio, Grafana SLO, Honeycomb, Gremlin

The path: Beginner → Intermediate → Advanced

Beginner

Focus: Understand reliability concepts, on-call practices, and the foundational metrics (SLIs, SLOs, SLAs) that underpin reliability engineering.

Skills to build

  • Understanding SLI, SLO, and SLA definitions and differences
  • Linux systems administration and troubleshooting
  • Basic monitoring: metric collection and dashboarding
  • Incident response fundamentals and on-call hygiene
  • Writing runbooks and post-mortems
  • TCP/IP networking and distributed systems basics
  • Git and infrastructure version control
  • Introduction to Kubernetes operations
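
To make the first skill above concrete, here is a minimal Python sketch of how an SLI and an SLO relate, assuming a request-based availability SLI. The `RequestWindow` type, the request counts, and the targets are hypothetical and chosen only for illustration. The SLI is the measured ratio and the SLO is the internal target it is compared against; an SLA would be the contractual commitment made to customers, usually set looser than the SLO.

```python
from dataclasses import dataclass


@dataclass
class RequestWindow:
    """Request counts over a measurement window (hypothetical data shape)."""
    total: int
    good: int  # e.g. non-5xx responses served within a latency threshold


def availability_sli(window: RequestWindow) -> float:
    """SLI: the fraction of requests that were good, in [0, 1]."""
    if window.total == 0:
        return 1.0  # assumption: a window with no traffic counts as healthy
    return window.good / window.total


def meets_slo(sli: float, slo_target: float) -> bool:
    """SLO check: is the measured SLI at or above the target?"""
    return sli >= slo_target


# Illustrative numbers only.
window = RequestWindow(total=1_000_000, good=999_400)
sli = availability_sli(window)
print(f"SLI = {sli:.2%}")
print("meets 99.9% SLO:", meets_slo(sli, 0.999))
print("meets 99.95% SLO:", meets_slo(sli, 0.9995))
```

With these sample counts the SLI is 99.94%, which satisfies a 99.9% target but misses a 99.95% one.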

Tools to learn

  • Prometheus
  • Grafana
  • PagerDuty
  • Git
  • Linux
  • kubectl

Intermediate

Focus: Define and track SLOs in production, lead incident response, and build automation to eliminate toil.

Skills to build

  • Defining meaningful SLIs and calculating error budgets
  • Designing alerting strategies that reduce alert fatigue
  • Incident command and blameless post-mortem facilitation
  • Toil identification and elimination through automation
  • Distributed tracing and log correlation across services
  • Capacity planning and traffic modeling
  • Chaos engineering experiments and failure injection
  • Service level agreement negotiation with stakeholders
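
One way to connect error budgets to alerting is the burn rate: how fast the budget is being spent relative to the pace that would use it up exactly at the end of the SLO period. The sketch below is a simplified illustration, assuming a 30-day SLO period and a single alerting window. The function names and the paging threshold of 14.4 are assumptions for this example; that threshold corresponds to spending 2% of a 30-day budget in one hour.

```python
SLO_PERIOD_HOURS = 30 * 24  # assumption: 30-day rolling SLO period


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    budget = 1.0 - slo_target
    if budget <= 0.0:
        raise ValueError("an SLO of 100% leaves no error budget to burn")
    return error_ratio / budget


def budget_fraction_consumed(rate: float, window_hours: float,
                             period_hours: float = SLO_PERIOD_HOURS) -> float:
    """Share of the whole period's budget spent at `rate` over the window."""
    return rate * window_hours / period_hours


def should_page(error_ratio: float, slo_target: float,
                threshold: float = 14.4) -> bool:
    """Page only when the budget is burning fast (threshold is illustrative)."""
    return burn_rate(error_ratio, slo_target) >= threshold


# Illustrative numbers: 2% of requests failing against a 99.9% SLO.
rate = burn_rate(error_ratio=0.02, slo_target=0.999)
print(f"burn rate: {rate:.1f}x")
print(f"budget spent in 1h: {budget_fraction_consumed(rate, 1.0):.2%}")
print("page on-call:", should_page(0.02, 0.999))
```

Alerting on burn rate rather than on raw error counts ties pages to how quickly users are being affected, which is one common approach to reducing alert fatigue.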

Tools to learn

  • Datadog
  • OpenTelemetry
  • Jaeger
  • Ansible
  • Terraform
  • Chaos Toolkit
  • Slack

Advanced

Focus: Architect reliability programs across engineering organizations, drive SLO culture, and lead multi-team incident management frameworks.

Skills to build

  • Multi-service SLO architecture and dependency mapping
  • Production readiness reviews and launch checklists
  • Designing resilient distributed systems with circuit breakers
  • Advanced capacity planning and traffic load testing at scale
  • Engineering-wide reliability culture change management
  • Cross-team incident command for major outages
  • Platform reliability engineering and self-healing systems
  • SRE team building, hiring, and maturity model definition
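
The circuit breaker pattern named above can be sketched in a few lines. This is a minimal, single-threaded illustration and not a production implementation: the class name, parameters, and injectable `clock` are assumptions made for this example, and real systems usually rely on a library or a service mesh for this behaviour. The breaker opens after a run of consecutive failures, rejects calls while open, and lets one trial call through once a reset timeout has passed.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock  # injectable so the example can use a fake clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout_s:
            return "half-open"  # timeout elapsed: allow a trial call
        return "open"

    def call(self, fn: Callable[[], T]) -> T:
        if self.state == "open":
            raise CircuitOpenError("call rejected; circuit is open")
        try:
            result = fn()
        except Exception:
            self._failures += 1
            # Open on reaching the threshold, or re-open if a trial call failed.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result


now = [0.0]  # fake clock so the demo does not need to sleep
breaker = CircuitBreaker(failure_threshold=2, reset_timeout_s=10.0,
                         clock=lambda: now[0])


def flaky() -> str:
    raise ConnectionError("backend unavailable")


for _ in range(2):
    try:
        breaker.call(flaky)
    except ConnectionError:
        pass
print(breaker.state)               # two consecutive failures open the circuit
now[0] = 10.0                      # advance the fake clock past the timeout
print(breaker.state)               # a trial call is now allowed
print(breaker.call(lambda: "ok"))  # the trial call succeeds
print(breaker.state)               # success closes the circuit again
```

Failing fast while the circuit is open protects both the caller, which stops waiting on timeouts, and the struggling dependency, which gets room to recover.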

Tools to learn

  • Argo CD
  • Istio
  • Grafana SLO
  • Honeycomb
  • Gremlin
  • Backstage
  • Splunk On-Call (formerly VictorOps)

Interview questions to prepare

  1. How do you calculate an error budget for a service with a 99.95% SLO?
  2. Describe the steps you take when paged for a P0 incident at 3am.
  3. What is the difference between an SLI and an SLO, and how do they relate to SLAs?
  4. How do you identify and quantify toil in your team’s workload?
  5. Walk me through a blameless post-mortem you conducted — what was the outcome?
  6. How would you introduce SLOs to an engineering team that has never used them?
  7. What does a healthy on-call rotation look like and how do you prevent burnout?
  8. How do you decide when to halt feature development to address reliability debt?
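
For question 1, the arithmetic can be worked through in a short Python sketch. It assumes a 30-day window and shows both a time-based and a request-based view of the budget; the function names and the traffic figures are hypothetical.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Time-based budget: minutes of full unavailability the SLO tolerates."""
    return (1.0 - slo_target) * window_days * 24 * 60


def allowed_failed_requests(slo_target: float, total_requests: int) -> int:
    """Request-based budget: failed requests the SLO tolerates."""
    return round((1.0 - slo_target) * total_requests)


def budget_remaining(slo_target: float, total_requests: int,
                     failed_requests: int) -> float:
    """Fraction of the request-based budget still unspent (negative if overspent)."""
    allowed = allowed_failed_requests(slo_target, total_requests)
    if allowed == 0:
        return 0.0  # assumption: with no budget at all, report nothing remaining
    return 1.0 - failed_requests / allowed


# Illustrative numbers: 99.95% SLO, 10 million requests, 1,250 failures so far.
print(f"{error_budget_minutes(0.9995):.1f} minutes per 30 days")
print(allowed_failed_requests(0.9995, 10_000_000), "failed requests allowed")
print(f"{budget_remaining(0.9995, 10_000_000, 1_250):.0%} of budget remaining")
```

A 99.95% SLO leaves 0.05% of the window as budget, which over 30 days (43,200 minutes) comes to 21.6 minutes, or 5,000 failed requests out of 10 million.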

Certification suggestions

  • Google Professional Cloud DevOps Engineer — Google Cloud
  • Certified Kubernetes Administrator (CKA) — CNCF
  • AWS Certified SysOps Administrator – Associate — Amazon Web Services
  • Datadog Fundamentals Certification — Datadog
  • ITIL 4 Foundation — PeopleCert

See exam formats, costs and official links in the certification registry.

Portfolio project ideas

  • Implement SLO tracking for a sample microservices app using Prometheus recording rules and Grafana SLO dashboards
  • Build an automated incident response playbook triggered by PagerDuty alerts with runbook links and Slack notifications
  • Design and run a chaos engineering experiment on a Kubernetes service using Chaos Toolkit and document the blast radius
  • Create a toil-elimination automation that reduces a repetitive on-call task from 30 minutes to under 2 minutes

Mistakes to avoid

  • Setting SLOs tighter than the system needs (for example 99.999%), so that planned maintenance alone can exhaust the error budget
  • Alerting on internal causes (such as high CPU) rather than user-visible symptoms, causing alert fatigue and missed real incidents
  • Writing post-mortems that assign blame instead of identifying systemic causes and preventive actions
  • Underestimating toil — not tracking and reducing it leads to SRE burnout and attrition
  • Confusing availability with reliability — a service can be ‘up’ but still unreliable from a user perspective
