Incident Management

The structured process for detecting, responding to, resolving, and learning from service disruptions — covering on-call, alerting, severity levels, coordinated response roles, and blameless postmortems.

In depth

Incident management is how organizations turn chaos into a repeatable process when production breaks. It begins with detection: monitoring and SLO-based alerts page an on-call engineer through a tool like PagerDuty. Incidents are classified by severity so that a full-site outage mobilizes more people than a minor degradation. During response, clear roles keep coordination sane: typically an incident commander who directs the response, a communications lead who updates stakeholders and status pages, and subject-matter experts who debug. Responders prioritize mitigation (rolling back or failing over to stop user impact) before hunting for root causes. Afterward, a blameless postmortem reconstructs the timeline, identifies contributing factors, and produces action items that are tracked to completion. Mature organizations measure mean time to acknowledge (MTTA) and mean time to resolve (MTTR), rehearse with game days, and treat every incident as fuel for improving both the system and the process.
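To make the two metrics concrete, here is a minimal sketch of how MTTA and MTTR might be computed from incident timestamps. The `Incident` record, its field names, and the sample data are hypothetical, and the sketch assumes MTTR is measured from the moment the alert fired to the moment user impact ended; teams define the start and end points differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record; field names are illustrative only."""
    triggered_at: datetime     # when the alert fired
    acknowledged_at: datetime  # when the on-call engineer acknowledged the page
    resolved_at: datetime      # when user impact ended


def mtta(incidents: list[Incident]) -> timedelta:
    """Mean time from alert to acknowledgement."""
    seconds = mean(
        (i.acknowledged_at - i.triggered_at).total_seconds() for i in incidents
    )
    return timedelta(seconds=seconds)


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean time from alert to resolution (assumed definition, see lead-in)."""
    seconds = mean(
        (i.resolved_at - i.triggered_at).total_seconds() for i in incidents
    )
    return timedelta(seconds=seconds)


# Made-up sample data for illustration.
incidents = [
    Incident(
        triggered_at=datetime(2024, 3, 1, 2, 0),
        acknowledged_at=datetime(2024, 3, 1, 2, 4),
        resolved_at=datetime(2024, 3, 1, 2, 15),
    ),
    Incident(
        triggered_at=datetime(2024, 3, 9, 14, 0),
        acknowledged_at=datetime(2024, 3, 9, 14, 2),
        resolved_at=datetime(2024, 3, 9, 14, 45),
    ),
]

print("MTTA:", mtta(incidents))
print("MTTR:", mttr(incidents))
```

MTTA mostly reflects alert routing and on-call responsiveness, while MTTR also depends on how quickly responders can mitigate, so the two usually call for different improvements.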

Why it matters

Outages are inevitable; prolonged, chaotic outages are not. Good incident management directly reduces downtime cost and customer churn, while a blameless culture encourages engineers to surface problems instead of hiding them. It is also a competency that interviewers commonly probe for in SRE and DevOps roles.

Real-world example

At 2 a.m. a burn-rate alert pages the on-call engineer: checkout errors are spiking. She declares a SEV-1 and pages an incident commander, and the team rolls back the evening's deploy within 15 minutes while the comms lead updates the status page. The next day's postmortem finds a missing integration test and adds a canary stage to the pipeline.
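The burn-rate alert in this scenario can be sketched as follows. Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target), so a value of 1 means the budget would last exactly the SLO period. The function names, the 99.9% SLO, and the sample error ratios are assumptions for illustration; the 14.4 threshold is a commonly cited choice (it spends about 2% of a 30-day budget in one hour), not something the scenario specifies.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than sustainable the error budget is being spent."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(
    long_window_error_ratio: float,
    short_window_error_ratio: float,
    slo_target: float,
    threshold: float = 14.4,  # assumed threshold, see lead-in
) -> bool:
    """Page only if both windows burn fast.

    The long window shows the problem is significant; the short window
    shows it is still happening, which avoids paging for an issue that
    has already recovered.
    """
    return (
        burn_rate(long_window_error_ratio, slo_target) >= threshold
        and burn_rate(short_window_error_ratio, slo_target) >= threshold
    )


# Hypothetical checkout SLO of 99.9% successful requests.
SLO_TARGET = 0.999

# 2% errors over the last hour and 3% over the last five minutes: page.
print(should_page(0.02, 0.03, SLO_TARGET))

# 2% over the last hour but 0.05% over the last five minutes: already recovering.
print(should_page(0.02, 0.0005, SLO_TARGET))
```

Alerting on burn rate rather than on a raw error count ties the page to user-facing impact relative to the SLO, which is why it suits a SEV-1 trigger like the one described.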

Tools related to Incident Management

PagerDuty, Opsgenie, incident.io, Slack, Statuspage, FireHydrant

Interview questions

Walk me through your first ten minutes after being paged for a major outage.
What does an incident commander do, and why separate that role from debugging?
What makes a postmortem blameless, and why does that matter?
How do you define incident severity levels?
What is the difference between MTTA and MTTR, and how would you improve each?
Why should mitigation usually come before root-cause analysis?