Reliability & Operations | 90 days | 2-3 hours/day | Updated 2026-06-01

IncidentOps 90-Day Learning Path

Master IncidentOps in 90 days — incident lifecycle management, on-call practices, blameless postmortems, and incident automation. Reduce MTTR and build a learning culture from failures.

What IncidentOps means

IncidentOps encompasses the processes, tools, and culture for detecting, managing, and learning from service disruptions. It spans the full incident lifecycle: alert triage, incident declaration, response coordination, stakeholder communication, resolution, and postmortem analysis. Effective IncidentOps reduces mean time to recovery (MTTR) and converts every incident into an organizational learning opportunity.

Who should follow this path

SREs and operations engineers handling production incidents
Engineering managers designing on-call programs
Platform engineers building incident tooling
DevOps engineers integrating alerting into delivery pipelines

Prerequisites

Experience with production on-call responsibilities
Familiarity with monitoring and alerting tools
Basic understanding of SLOs and service reliability
Exposure to incident management tools like PagerDuty or OpsGenie

The 90-day plan

Daily study recommendation: 2-3 hours/day, six days a week. Consistency beats intensity — block the time in your calendar like a meeting.

Days 1–15: Foundation

Incident lifecycle: detect, triage, escalate, resolve, review
Severity classification and incident priority frameworks
On-call rotation design and escalation policies
Alert quality: signal-to-noise ratio improvement
Incident declaration criteria and war room setup

Outcome: Design a severity classification framework and on-call rotation policy for a team.
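
One way to make a severity framework and rotation policy concrete is to write them down as code. The sketch below is only an illustration: the four-level scale, the impact fields, the percentage thresholds, and the weekly shift length are all assumptions, not values prescribed by this path.

```python
from dataclasses import dataclass
from datetime import date
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical four-level scale; a lower number means more urgent."""
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


@dataclass(frozen=True)
class Impact:
    users_affected_pct: float  # share of users seeing errors, 0-100
    core_flow_down: bool       # e.g. login or checkout unavailable
    data_loss_risk: bool       # possible loss or corruption of data


def classify(impact: Impact) -> Severity:
    """Map an impact assessment to a severity using illustrative thresholds."""
    if impact.data_loss_risk or (
        impact.core_flow_down and impact.users_affected_pct >= 50
    ):
        return Severity.SEV1
    if impact.core_flow_down or impact.users_affected_pct >= 25:
        return Severity.SEV2
    if impact.users_affected_pct >= 5:
        return Severity.SEV3
    return Severity.SEV4


def should_page(severity: Severity) -> bool:
    """Assumed policy: only SEV1 and SEV2 page the on-call engineer."""
    return severity <= Severity.SEV2


def on_call_for(day: date, roster: list[str], rotation_start: date,
                shift_days: int = 7) -> str:
    """Return who is on call on `day` for a simple round-robin rotation."""
    shifts_elapsed = (day - rotation_start).days // shift_days
    return roster[shifts_elapsed % len(roster)]
```

For example, `classify(Impact(60, True, False))` yields `Severity.SEV1`, and with a roster of three names starting on 2026-06-01 the second person takes over on 2026-06-08. A real policy would also need overrides, holidays, and secondary escalation, which this sketch leaves out.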

Days 16–30: Core concepts

PagerDuty and OpsGenie configuration and routing
Alertmanager routing trees and inhibition rules
Runbook creation and maintenance workflows
Incident channel management with Slack workflows
Status page management with Statuspage.io

Outcome: Configure end-to-end alert routing from Alertmanager through PagerDuty to Slack.
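
Before writing a real Alertmanager configuration, it can help to reason about how a routing tree and an inhibition rule behave. The following is a simplified Python model, not Alertmanager itself: it supports equality matchers only, and the receiver names (`slack-default`, `pagerduty-critical`, and so on) are hypothetical. It follows the documented behavior that an alert descends into the first matching child route, and keeps checking siblings only when that child sets `continue`.

```python
from dataclasses import dataclass, field

Labels = dict[str, str]


@dataclass
class Route:
    receiver: str
    match: Labels = field(default_factory=dict)  # equality matchers only
    routes: list["Route"] = field(default_factory=list)
    continue_: bool = False  # keep evaluating sibling routes after a match


@dataclass
class InhibitRule:
    source_match: Labels  # the alert that mutes others
    target_match: Labels  # the alerts that get muted
    equal: list[str]      # labels that must have equal values on both


def _matches(match: Labels, labels: Labels) -> bool:
    return all(labels.get(k) == v for k, v in match.items())


def resolve_receivers(route: Route, labels: Labels) -> list[str]:
    """Return the receivers an alert reaches, preferring the deepest match."""
    if not _matches(route.match, labels):
        return []
    receivers: list[str] = []
    for child in route.routes:
        found = resolve_receivers(child, labels)
        if found:
            receivers.extend(found)
            if not child.continue_:
                break
    # If no child matched, the current node handles the alert.
    return receivers or [route.receiver]


def is_inhibited(alert: Labels, firing: list[Labels],
                 rules: list[InhibitRule]) -> bool:
    """True if some other firing alert mutes `alert` under any rule."""
    for rule in rules:
        if not _matches(rule.target_match, alert):
            continue
        for source in firing:
            if source is alert:
                continue
            same = all(source.get(k) == alert.get(k) for k in rule.equal)
            if _matches(rule.source_match, source) and same:
                return True
    return False


tree = Route(
    receiver="slack-default",
    routes=[
        Route("pagerduty-critical", {"severity": "critical"}, continue_=True),
        Route("slack-incidents", {"severity": "critical"}),
        Route("slack-db", {"team": "db"}),
    ],
)
```

With this tree, a critical alert reaches both `pagerduty-critical` and `slack-incidents`, a warning from the `db` team reaches only `slack-db`, and anything else falls back to `slack-default`. Real Alertmanager adds regex matchers, grouping, and timing options that this model ignores.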

Days 31–45: Tools and workflows

Incident commander role and responsibilities
Communications templates for internal and external stakeholders
Customer impact assessment methodologies
Incident timeline reconstruction techniques
Real-time incident coordination tools

Outcome: Lead a simulated production incident as incident commander with effective communications.
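
Timeline reconstruction usually means merging events from several sources, each with its own clock and time zone, into one ordered list. The sketch below shows one possible approach, assuming every source provides ISO 8601 timestamps with an explicit UTC offset; the `TimelineEvent` type and the source names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class TimelineEvent:
    at: datetime  # timezone-aware, normalized to UTC
    source: str   # e.g. "alertmanager", "slack", "deploy-log"
    text: str


def parse_utc(stamp: str) -> datetime:
    """Parse an ISO 8601 timestamp with an explicit offset and convert to UTC."""
    parsed = datetime.fromisoformat(stamp)
    if parsed.tzinfo is None:
        # Refuse naive timestamps instead of guessing their time zone.
        raise ValueError(f"timestamp has no UTC offset: {stamp}")
    return parsed.astimezone(timezone.utc)


def build_timeline(*feeds: list[TimelineEvent]) -> list[TimelineEvent]:
    """Merge event feeds into a single chronologically ordered timeline."""
    merged = [event for feed in feeds for event in feed]
    return sorted(merged, key=lambda event: event.at)


def render(timeline: list[TimelineEvent]) -> list[str]:
    """Format each event as one line suitable for a postmortem document."""
    return [f"{e.at:%H:%M:%S}Z [{e.source}] {e.text}" for e in timeline]


alerts = [TimelineEvent(parse_utc("2026-06-01T10:05:00+00:00"),
                        "alertmanager", "HighErrorRate firing")]
chat = [TimelineEvent(parse_utc("2026-06-01T12:02:00+02:00"),
                      "slack", "deploy started")]
timeline = build_timeline(alerts, chat)
```

Here the Slack message, written at 12:02 local time with a +02:00 offset, sorts ahead of the 10:05 UTC alert. Rejecting timestamps without an offset is a deliberate choice, since a silently mis-zoned event can reverse cause and effect in the timeline.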

Days 46–60: Hands-on projects

Blameless postmortem principles and psychological safety
Postmortem template design and facilitation
Action item tracking and follow-through systems
Root cause vs contributing factor analysis
Postmortem metrics: time-to-postmortem and action completion rate

Outcome: Facilitate a blameless postmortem and drive action item completion within SLA.
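
The two postmortem metrics named above can be computed from very little data. This is a minimal sketch under assumed definitions: time-to-postmortem is counted in calendar days from resolution to publication, and completion rate is the share of action items with a completion date. The `ActionItem` fields are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    title: str
    due: date
    completed_on: Optional[date] = None  # None while the item is still open


def time_to_postmortem_days(resolved_on: date, published_on: date) -> int:
    """Calendar days between incident resolution and postmortem publication."""
    return (published_on - resolved_on).days


def completion_rate(items: list[ActionItem]) -> float:
    """Fraction of action items that have been completed (0.0 if none exist)."""
    if not items:
        return 0.0
    return sum(item.completed_on is not None for item in items) / len(items)


def overdue(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Open items whose due date has already passed."""
    return [i for i in items if i.completed_on is None and i.due < today]
```

Teams that count business days, or that start the clock at incident declaration rather than resolution, would adjust `time_to_postmortem_days` accordingly.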

Days 61–75: Advanced practices

Incident automation: auto-remediation workflows
Machine learning for anomaly detection and alerting
Incident analytics: MTTR, mean time to detect (MTTD), and recurrence rates
Game day and fire drill design
Incident review program governance

Outcome: Implement auto-remediation for a class of incidents and measure MTTR improvement.
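
To measure MTTR improvement you first need a fixed definition of each metric. The sketch below uses one set of assumptions: MTTD runs from the start of impact to detection, MTTR runs from the start of impact to resolution, and an incident counts as a recurrence when an earlier incident shares its `cause_tag`. Organizations define these differently, so treat the field names and definitions as illustrative.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass(frozen=True)
class Incident:
    started: datetime   # when customer impact began
    detected: datetime  # first alert or human report
    resolved: datetime  # when impact ended
    cause_tag: str      # hypothetical label assigned during the postmortem


def _minutes(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 60


def mttd_minutes(incidents: list[Incident]) -> float:
    """Mean time to detect: impact start to detection."""
    return mean(_minutes(i.started, i.detected) for i in incidents)


def mttr_minutes(incidents: list[Incident]) -> float:
    """Mean time to recovery: impact start to resolution (assumed definition)."""
    return mean(_minutes(i.started, i.resolved) for i in incidents)


def recurrence_rate(incidents: list[Incident]) -> float:
    """Share of incidents whose cause tag already appeared in an earlier one."""
    seen: set[str] = set()
    repeats = 0
    for incident in sorted(incidents, key=lambda i: i.started):
        if incident.cause_tag in seen:
            repeats += 1
        seen.add(incident.cause_tag)
    return repeats / len(incidents) if incidents else 0.0
```

Comparing `mttr_minutes` for the incident class before and after an auto-remediation rollout gives a rough improvement figure, though with few incidents the mean is easily skewed by a single outlier.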

Days 76–90: Portfolio, interview & certification prep

Portfolio: incident management program design
IncidentOps interview preparation
PagerDuty Certified Associate exam prep
ITIL 4 Specialist Create Deliver Support exam prep
Building an incident review culture presentation

Outcome: Design a complete incident management program and present it to an engineering leadership audience.

Phase outcomes at a glance

Phase	Outcome
Days 1–15	Design a severity classification framework and on-call rotation policy for a team.
Days 16–30	Configure end-to-end alert routing from Alertmanager through PagerDuty to Slack.
Days 31–45	Lead a simulated production incident as incident commander with effective communications.
Days 46–60	Facilitate a blameless postmortem and drive action item completion within SLA.
Days 61–75	Implement auto-remediation for a class of incidents and measure MTTR improvement.
Days 76–90	Design a complete incident management program and present it to an engineering leadership audience.

Tools to learn

PagerDuty
OpsGenie
Slack
Prometheus Alertmanager
Grafana
Statuspage
Jira
Confluence
Datadog
FireHydrant

Labs to practice

Mini projects

Build an end-to-end incident response platform with automated channel creation and runbook lookup
Design a game day exercise program with pre and post metrics
Create a postmortem tracking system with action item SLA monitoring
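
As a starting point for the first mini project, the two pure-logic pieces (naming the incident channel and finding a runbook) can be prototyped without any chat or paging API. Everything named here is hypothetical: the `inc-` prefix, the alert names, and the `runbooks.example.com` URLs. The channel slug is restricted to lowercase letters, digits, and hyphens and capped at 80 characters, which is intended to stay within Slack's channel naming rules.

```python
import re

# Hypothetical mapping from alert name to runbook location.
RUNBOOKS = {
    "HighErrorRate": "https://runbooks.example.com/high-error-rate",
    "DiskAlmostFull": "https://runbooks.example.com/disk-almost-full",
}
DEFAULT_RUNBOOK = "https://runbooks.example.com/generic-triage"


def channel_name(incident_id: int, title: str, max_len: int = 80) -> str:
    """Build a chat-safe channel name such as 'inc-142-checkout-api-5xx-spike'."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    # Truncate first, then drop any hyphen left dangling at the end.
    return f"inc-{incident_id}-{slug}"[:max_len].rstrip("-")


def runbook_for(alert_name: str) -> str:
    """Look up the runbook for an alert, falling back to generic triage steps."""
    return RUNBOOKS.get(alert_name, DEFAULT_RUNBOOK)
```

Falling back to a generic triage runbook, rather than returning nothing, means responders always get a first step even for alerts nobody has documented yet.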

Interview questions to prepare

How do you define incident severity levels and who has authority to escalate?
What makes a postmortem blameless and why does it matter?
How do you reduce alert fatigue without missing critical signals?
Describe a complex incident you managed — what went well and what would you change?
What is the incident commander role and when is it activated?
How do you measure the effectiveness of your incident response program?
What is a game day and how do you design one?

Certification suggestions

PagerDuty Certified Associate — PagerDuty
ITIL 4 Foundation — PeopleCert
Google Professional Cloud DevOps Engineer — Google Cloud

Browse the full certification registry for exam details and official links.


Prefer live, guided training with mentors and certification support? DevOpsSchool.com runs paid instructor-led programs that pair well with this free path.
