Reliability & Operations | 90 days | 2-3 hours/day | updated 2026-06-01
IncidentOps 90-Day Learning Path
Master IncidentOps in 90 days — incident lifecycle management, on-call practices, blameless postmortems, and incident automation. Reduce MTTR and build a learning culture from failures.
What IncidentOps means
IncidentOps encompasses the processes, tools, and culture for detecting, managing, and learning from service disruptions. It spans the full incident lifecycle: alert triage, incident declaration, response coordination, stakeholder communication, resolution, and postmortem analysis. Effective IncidentOps reduces mean time to recovery (MTTR) and converts every incident into an organizational learning opportunity.
Who should follow this path
- SREs and operations engineers handling production incidents
- Engineering managers designing on-call programs
- Platform engineers building incident tooling
- DevOps engineers integrating alerting into delivery pipelines
Prerequisites
- Experience with production on-call responsibilities
- Familiarity with monitoring and alerting tools
- Basic understanding of SLOs and service reliability
- Exposure to incident management tools like PagerDuty or OpsGenie
The 90-day plan
Daily study recommendation: 2-3 hours/day, six days a week. Consistency beats intensity — block the time in your calendar like a meeting.
Days 1–15: Foundation
- Incident lifecycle: detect, triage, escalate, resolve, review
- Severity classification and incident priority frameworks
- On-call rotation design and escalation policies
- Alert quality: signal-to-noise ratio improvement
- Incident declaration criteria and war room setup
Outcome: Design a severity classification framework and on-call rotation policy for a team.
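A severity framework is easier to review when its rules are written down as code. The sketch below is one possible shape, not a standard: the four-level scale, the `Impact` fields, the percentage thresholds, and the paging policy are all illustrative assumptions that a team would replace with its own criteria.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Lower number means more severe (hypothetical four-level scale)."""
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


@dataclass(frozen=True)
class Impact:
    users_affected_pct: float  # share of users affected, 0-100
    revenue_path_down: bool    # e.g. checkout or payments unavailable
    data_loss: bool            # confirmed or suspected data loss


def classify(impact: Impact) -> Severity:
    """Map observed impact to a severity level using illustrative thresholds."""
    if impact.data_loss or (impact.revenue_path_down and impact.users_affected_pct >= 50):
        return Severity.SEV1
    if impact.revenue_path_down or impact.users_affected_pct >= 25:
        return Severity.SEV2
    if impact.users_affected_pct >= 5:
        return Severity.SEV3
    return Severity.SEV4


def pages_on_call(severity: Severity) -> bool:
    """Assumed policy: SEV1 and SEV2 page immediately; the rest become tickets."""
    return severity <= Severity.SEV2
```

For example, `classify(Impact(users_affected_pct=10, revenue_path_down=True, data_loss=False))` returns `Severity.SEV2` under these thresholds, and `pages_on_call` would page for it. Keeping the rules in one function makes disagreements about thresholds explicit and testable.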
Days 16–30: Core concepts
- PagerDuty and OpsGenie configuration and routing
- Alertmanager routing trees and inhibition rules
- Runbook creation and maintenance workflows
- Incident channel management with Slack workflows
- Status page management with Statuspage.io
Outcome: Configure end-to-end alert routing from Alertmanager through PagerDuty to Slack.
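Before writing a real routing configuration, it can help to reason about how a routing tree picks receivers. The sketch below is a simplified Python model of that behavior, not Alertmanager's implementation or its configuration syntax: it supports only equality matchers, and the `Route` class, the `resolve` function, and the receiver names are hypothetical. It mirrors three documented rules: children are checked in order, the first matching child wins unless it sets `continue`, and a parent handles the alert when no child matches.

```python
from dataclasses import dataclass, field


@dataclass
class Route:
    receiver: str
    match: dict[str, str] = field(default_factory=dict)  # equality matchers only
    continue_: bool = False  # models the `continue` flag on a route
    routes: list["Route"] = field(default_factory=list)

    def matches(self, labels: dict[str, str]) -> bool:
        # An empty matcher set matches every alert, as the root route does.
        return all(labels.get(k) == v for k, v in self.match.items())


def resolve(route: Route, labels: dict[str, str]) -> list[str]:
    """Return the receivers for an alert, walking the tree depth-first."""
    if not route.matches(labels):
        return []
    receivers: list[str] = []
    for child in route.routes:
        found = resolve(child, labels)
        receivers.extend(found)
        # Stop at the first matching child unless it asks to continue.
        if found and not child.continue_:
            break
    # If no child claimed the alert, the current node handles it.
    return receivers or [route.receiver]


# Hypothetical tree: critical alerts page, database alerts also go to a team channel.
root = Route(
    receiver="slack-default",
    routes=[
        Route(receiver="pagerduty-critical", match={"severity": "critical"}, continue_=True),
        Route(receiver="slack-db-team", match={"team": "db"}),
    ],
)
```

With this tree, a critical database alert resolves to both `pagerduty-critical` and `slack-db-team`, while a warning from another team falls back to `slack-default`. Walking a few alerts through a model like this is a cheap way to catch routing gaps before they surface during an incident.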
Days 31–45: Tools and workflows
- Incident commander role and responsibilities
- Communications templates for internal and external stakeholders
- Customer impact assessment methodologies
- Incident timeline reconstruction techniques
- Real-time incident coordination tools
Outcome: Lead a simulated production incident as incident commander with effective communications.
Days 46–60: Hands-on projects
- Blameless postmortem principles and psychological safety
- Postmortem template design and facilitation
- Action item tracking and follow-through systems
- Root cause vs contributing factor analysis
- Postmortem metrics: time-to-postmortem and action completion rate
Outcome: Facilitate a blameless postmortem and drive action item completion within SLA.
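The two postmortem metrics named above can be computed from a handful of timestamps. The following is a minimal sketch under stated assumptions: the `ActionItem` shape is hypothetical, time-to-postmortem is measured from incident resolution to postmortem publication, and an action item counts as complete only if it was finished on or before its due date.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class ActionItem:
    title: str
    due: datetime
    completed_at: Optional[datetime] = None  # None while the item is still open


def time_to_postmortem(resolved_at: datetime, published_at: datetime) -> timedelta:
    """Elapsed time between incident resolution and postmortem publication."""
    return published_at - resolved_at


def completion_rate(items: list[ActionItem]) -> float:
    """Fraction of action items completed on or before their due date."""
    if not items:
        return 0.0
    on_time = sum(
        1 for item in items
        if item.completed_at is not None and item.completed_at <= item.due
    )
    return on_time / len(items)


def overdue(items: list[ActionItem], now: datetime) -> list[ActionItem]:
    """Open items whose due date has passed, i.e. candidates for follow-up."""
    return [item for item in items if item.completed_at is None and now > item.due]
```

Whether late completions should count toward the rate is a policy choice; this sketch excludes them so the number reflects follow-through within the agreed SLA.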
Days 61–75: Advanced practices
- Incident automation: auto-remediation workflows
- Machine learning for anomaly detection and alerting
- Incident analytics: MTTR, MTTD, and recurrence rates
- Game day and fire drill design
- Incident review program governance
Outcome: Implement auto-remediation for a class of incidents and measure MTTR improvement.
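Measuring MTTR improvement requires a fixed definition applied before and after the change. The sketch below assumes one common convention, which teams do vary: MTTD is the mean time from incident start to detection, and MTTR is the mean time from incident start to resolution. The `Incident` fields and function names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    started_at: datetime
    detected_at: datetime
    resolved_at: datetime


def mttd(incidents: list[Incident]) -> timedelta:
    """Mean time to detect: average of (detected_at - started_at)."""
    seconds = [(i.detected_at - i.started_at).total_seconds() for i in incidents]
    return timedelta(seconds=mean(seconds))


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean time to recovery: average of (resolved_at - started_at)."""
    seconds = [(i.resolved_at - i.started_at).total_seconds() for i in incidents]
    return timedelta(seconds=mean(seconds))


def improvement_pct(before: timedelta, after: timedelta) -> float:
    """Percentage reduction from `before` to `after`; positive means faster."""
    return (before - after).total_seconds() / before.total_seconds() * 100
```

Comparing `mttr` over the incidents in a class before auto-remediation with the same class afterwards, via `improvement_pct`, gives a single number to report. Means are sensitive to one long outage, so it is worth also looking at the individual durations before drawing conclusions.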
Days 76–90: Portfolio, interview & certification prep
- Portfolio: incident management program design
- IncidentOps interview preparation
- PagerDuty Certified Associate exam prep
- ITIL 4 Specialist: Create, Deliver and Support exam prep
- Building an incident review culture presentation
Outcome: Design a complete incident management program and present it to an engineering leadership audience.
Phase outcomes at a glance
| Phase | Outcome |
|---|---|
| Days 1–15 | Design a severity classification framework and on-call rotation policy for a team. |
| Days 16–30 | Configure end-to-end alert routing from Alertmanager through PagerDuty to Slack. |
| Days 31–45 | Lead a simulated production incident as incident commander with effective communications. |
| Days 46–60 | Facilitate a blameless postmortem and drive action item completion within SLA. |
| Days 61–75 | Implement auto-remediation for a class of incidents and measure MTTR improvement. |
| Days 76–90 | Design a complete incident management program and present it to an engineering leadership audience. |
Tools to learn
- PagerDuty
- OpsGenie
- Slack
- Prometheus Alertmanager
- Grafana
- Statuspage
- Jira
- Confluence
- Datadog
- FireHydrant
Labs to practice
Mini projects
- Build an end-to-end incident response platform with automated channel creation and runbook lookup
- Design a game day exercise program with pre and post metrics
- Create a postmortem tracking system with action item SLA monitoring
Interview questions to prepare
- How do you define incident severity levels and who has authority to escalate?
- What makes a postmortem blameless and why does it matter?
- How do you reduce alert fatigue without missing critical signals?
- Describe a complex incident you managed — what went well and what would you change?
- What is the incident commander role and when is it activated?
- How do you measure the effectiveness of your incident response program?
- What is a game day and how do you design one?
Certification suggestions
- PagerDuty Certified Associate — PagerDuty
- ITIL 4 Foundation — PeopleCert
- Google Professional Cloud DevOps Engineer — Google Cloud
Free resources
- PagerDuty Incident Response Guide
- Google SRE Book: Incident Management
- Atlassian Incident Management Handbook
- Blameless Postmortem Guide
- FireHydrant Documentation