

Incident response.

Manage an active incident from detection to resolution.

Manage an active production incident through five phases: detection, triage, mitigation, communication, and resolution. Structured roles keep the response coordinated, a severity rubric sets the level of response, and a communication cadence keeps every audience informed.

The governing principle is to stop user impact first and analyze cause second. When in doubt, act: a wrong action that can be rolled back beats inaction while users are suffering.

Audience: engineering and ops teams handling an active outage or production issue, or building the incident-response procedures and on-call rotations before one happens.

The framework

Five phases from alert to all-clear.

An incident moves through five phases. Severity, set in triage, decides how heavy the response is at each phase.

  1. Detection: the incident becomes known through an alert, a customer report, or internal observation. Acknowledge within the target time, assess severity, page the on-call, and open the incident channel.
  2. Triage: establish severity and impact against the rubric, from SEV-1 (critical, all-hands) to SEV-4 (low, tracked as a bug). Re-evaluate as more information emerges.
  3. Mitigation: stop the bleeding before fixing the cause. Roll back, flip a feature flag off, fail over, scale up, throttle, degrade gracefully, or use maintenance mode as a last resort.
  4. Communication: update the internal team every 15 minutes, stakeholders every 30 to 60 minutes, and the status page every 30 minutes. Acknowledge before you have answers, and never speculate on cause.
  5. Resolution: the fix is verified, customers are restored, a final status update is posted, and the incident is closed, with an after-action report scheduled within one to two weeks.
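
The cadence in the communication phase can be sketched as a small scheduler. This is an illustration only, not part of the skill: the audience keys, the `due_updates` helper, and the choice of 30 minutes for stakeholders (the tight end of the 30-to-60 range) are assumptions.

```python
from datetime import datetime, timedelta

# Update intervals in minutes, taken from the communication phase above.
# Stakeholders are "every 30 to 60"; this sketch assumes the tight end.
CADENCE_MINUTES = {
    "internal_team": 15,
    "stakeholders": 30,
    "status_page": 30,
}


def due_updates(last_sent: dict, now: datetime) -> list:
    """Return the audiences whose next update is due at or before `now`."""
    due = []
    for audience, minutes in CADENCE_MINUTES.items():
        last = last_sent.get(audience)
        # An audience that has never been updated is due immediately:
        # acknowledge before you have answers.
        if last is None or now - last >= timedelta(minutes=minutes):
            due.append(audience)
    return due


start = datetime(2024, 1, 1, 12, 0)
last_sent = {"internal_team": start, "stakeholders": start, "status_page": start}
print(due_updates(last_sent, start + timedelta(minutes=20)))  # ['internal_team']
```

An update is due on the clock whether or not there is news, which matches the rule of communicating on schedule even when nothing has changed.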

Who does what

Five roles, even if one person holds several.

Structured roles prevent the parallel-debugging chaos that sets in when no one clearly owns the response. On a small team one person can hold several roles, but each role's responsibility stays explicit.

  1. Incident commander (IC)

    Owns the response, calls decisions, and assigns work. Not necessarily the most technical person; the job is coordination.

  2. Communications lead

    Owns internal and external messaging, taking the communication burden off the incident commander.

  3. Operations lead

    Drives the technical investigation and the mitigation, often the most senior on-call engineer.

  4. Scribe

    Captures the timeline as the incident unfolds, which the after-action report depends on.

  5. Subject matter experts

    Service, database, or security experts pulled in as needed, rather than standing members of the response.
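
One way to keep each role explicit, even when one person holds several, is to check the roster when the incident channel opens. The sketch below is hypothetical: the role keys, the `unassigned_roles` helper, and the example names are assumptions. Subject matter experts are left out because they are pulled in as needed rather than standing members.

```python
# Standing roles from the list above (subject matter experts join as needed).
ROLES = (
    "incident_commander",
    "communications_lead",
    "operations_lead",
    "scribe",
)


def unassigned_roles(roster: dict) -> list:
    """Return the standing roles that have no named owner in `roster`."""
    return [role for role in ROLES if not roster.get(role, "").strip()]


# On a small team one person may hold several roles.
roster = {
    "incident_commander": "alex",
    "communications_lead": "alex",
    "operations_lead": "sam",
}
print(unassigned_roles(roster))  # ['scribe']
```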

How to run it

Mitigate first, one commander, communicate on schedule.

Stop user impact first and analyze cause second. The fastest mitigations (a rollback, a feature flag flipped off, a failover, throttling, graceful degradation) often beat a full fix, and debating root cause while users are actively impacted is a failure mode. When in doubt, act, because a wrong action that can be rolled back beats inaction while users suffer.
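
As a rough sketch, the mitigation options can be held in a preference list and the first available one chosen. Only "maintenance mode as a last resort" comes from this page; the rest of the ordering, the option keys, and the `pick_mitigation` helper are assumptions for illustration.

```python
from typing import Optional

# Mitigation options named on this page. Only "maintenance mode last" is
# stated in the source; the ordering of the other entries is an assumption.
MITIGATIONS = [
    "rollback",
    "feature_flag_off",
    "failover",
    "scale_up",
    "throttle",
    "graceful_degradation",
    "maintenance_mode",
]


def pick_mitigation(available: set) -> Optional[str]:
    """Return the first listed mitigation that is available, else None."""
    for option in MITIGATIONS:
        if option in available:
            return option
    return None


print(pick_mitigation({"maintenance_mode", "throttle"}))  # throttle
print(pick_mitigation(set()))  # None
```

In practice the right order depends on the system and on what changed most recently, so a team would tune the list rather than adopt it as written.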

One incident commander owns the response and calls the decisions, which avoids both the parallel-debugging chaos of no clear owner and the death-by-committee of too many owners. The IC is not necessarily the most technical person; the role is coordination, with the communications lead and the scribe taking the messaging and the timeline off the IC's plate.

Communicate on a schedule, even when there is nothing new to report. Acknowledge before you have answers, never speculate publicly about cause, and confirm resolution only after verifying the actual user flow rather than trusting a dashboard. Keep the response blameless: a blame culture makes people hide mistakes, which makes incidents take longer to resolve.

Reference files

The reference that goes alongside the SKILL.md.

  • references/incident-playbook.md

    Severity definitions, roles, status-page templates, and the decision rubrics for an active incident.

Browse all reference files on GitHub

Bridges to other skills

Before, around, and after the incident.

This skill is for the active event. These cover the detection that triggers it, the planned launches and pre-launch QA that sit outside it, and the retro that follows.

  • Detection

    monitoring-and-alerting

    The paging alerts that surface an incident come from monitoring. That skill decides what wakes the on-call; this skill is the response once they are awake.

  • After resolution

    after-action-report

    Once the incident closes, the retrospective takes over: a blameless timeline, root cause, and owned action items, scheduled within one to two weeks.

  • Planned changes

    launch-runbook

    A planned launch is forward-looking work, not an incident. The runbook plans the go-live and its rollback; this skill handles the unplanned break.

  • Pre-launch triage

    qa-testing

    Catching issues before they ship is QA's job. Incident response is for what reaches production despite it.

Open source under MIT

Read the SKILL.md on GitHub.

The skill source lives in the rampstackco/claude-skills repository alongside dozens of other skills covering the full lifecycle of brand and product work. This page is a structured overview; the SKILL.md is the source. MIT licensed.

Frequently asked questions.

What are the five phases?
Detection (the incident becomes known and is acknowledged), triage (severity and impact are assessed against the rubric), mitigation (user impact is stopped), communication (every audience is kept informed on a cadence), and resolution (a verified fix, customers restored, the incident closed, and an after-action report scheduled). Severity, set in triage, scales the response: a SEV-1 is all-hands with an incident commander and public communication, while a SEV-4 is tracked as a bug in the normal queue.
Should I mitigate or find the root cause first?
Mitigate first. Stop user impact before cause analysis, because users keep suffering while engineers debug. The fastest mitigations are often quicker than a full fix: a rollback of a recent deploy, flipping a feature flag off, failing over to a healthy replica, scaling up, throttling traffic, degrading gracefully, or maintenance mode as a last resort. Discussing root cause while users are actively impacted is a recognized failure mode; the cause analysis belongs in the after-action report.
What are the severity levels?
SEV-1 (critical): major customer-facing functionality broken, data integrity at risk, or a security breach, calling for all-hands, an incident commander, and public communication. SEV-2 (major): significant degradation affecting some customers, with an IC assigned and active response. SEV-3 (minor): limited impact with a workaround, handled by a single owner on standard on-call. SEV-4 (low): cosmetic or edge-case, tracked as a bug in the normal queue. Severity can change as more information emerges, so re-evaluate it during the incident.
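
The rubric above can be sketched as a classifier that checks the most severe signals first. This is a simplification under stated assumptions: the boolean fields on `Impact` and the `classify` function are hypothetical, and the full definitions live in references/incident-playbook.md.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Impact:
    """Hypothetical yes/no signals gathered during triage."""
    major_functionality_broken: bool = False
    data_integrity_at_risk: bool = False
    security_breach: bool = False
    significant_degradation: bool = False
    limited_impact_with_workaround: bool = False


def classify(impact: Impact) -> str:
    """Map triage signals to a severity, worst signal first."""
    if (impact.major_functionality_broken
            or impact.data_integrity_at_risk
            or impact.security_breach):
        return "SEV-1"
    if impact.significant_degradation:
        return "SEV-2"
    if impact.limited_impact_with_workaround:
        return "SEV-3"
    # Cosmetic or edge-case: tracked as a bug in the normal queue.
    return "SEV-4"


print(classify(Impact(significant_degradation=True)))  # SEV-2
print(classify(Impact(security_breach=True, limited_impact_with_workaround=True)))  # SEV-1
```

Because severity can change as information emerges, a function like this would be re-run whenever the triage signals are updated.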
Who does what during an incident?
An incident commander owns the response and calls the decisions, and is not necessarily the most technical person, because the job is coordination. A communications lead owns internal and external messaging, an operations lead drives the technical investigation and mitigation, and a scribe captures the timeline that the after-action report will need, with subject matter experts pulled in as required. On a small team or a low-severity incident one person can hold several roles, but each role's responsibilities should still be explicit so nothing falls through.
How should I communicate during an incident?
Acknowledge before you have answers ('we are aware and investigating'), update on schedule even when there is no new information, never speculate publicly about the cause, and confirm resolution only after verifying the user flow rather than trusting a dashboard. Status-page updates use plain language, name the affected scope, and carry a time commitment for the next update. The patterns to avoid are vague language ('experiencing some issues'), a missing scope, and 'should be resolved soon' before verification.
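
The patterns to avoid can be checked mechanically before a status-page update goes out. The sketch below is an assumption-laden illustration: the `lint_status_update` helper, the phrase list, and the "next update ... HH:MM" convention for a time commitment are all hypothetical.

```python
import re

# Vague phrasings quoted in the answer above.
VAGUE_PHRASES = ("experiencing some issues", "should be resolved soon")


def lint_status_update(text: str, affected_scope: str) -> list:
    """Return a list of problems found in a draft status-page update."""
    problems = []
    lowered = text.lower()
    for phrase in VAGUE_PHRASES:
        if phrase in lowered:
            problems.append(f"vague phrase: {phrase!r}")
    if affected_scope.lower() not in lowered:
        problems.append("missing affected scope")
    # Assumed convention: a time commitment reads "next update ... HH:MM".
    if not re.search(r"next update\b.*\b\d{1,2}:\d{2}\b", lowered):
        problems.append("no time commitment for the next update")
    return problems


good = ("Checkout is failing for EU customers. We are aware and "
        "investigating. Next update by 14:30 UTC.")
bad = "We are experiencing some issues. It should be resolved soon."
print(lint_status_update(good, "checkout"))  # []
print(len(lint_status_update(bad, "checkout")))  # 4
```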