Incident war room

Convenes within minutes of a P0/P1 fire. Three roles — commander, communicator, investigator — that never combine, even on a one-person on-call. Contain before diagnose. The session that decides how much harm is done.

When

  • P0 — within 15 minutes of detection. War room open within 30 minutes. All other work stops.
  • P1 — within 1 hour. War room open if unresolved within 2 hours.
  • P2/P3 — no war room. Normal flow.
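The thresholds above can be written down as data so that tooling and people read the same numbers. This is a minimal sketch, not an existing tool: the names `SeverityPolicy`, `POLICIES` and `war_room_must_be_open` are hypothetical, and it assumes "within 15 minutes" and "within 1 hour" are response deadlines counted from detection.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class SeverityPolicy:
    respond_within: Optional[timedelta]  # None: normal flow, no incident deadline
    war_room_after: Optional[timedelta]  # None: no war room at this severity
    only_if_unresolved: bool             # P1 opens a war room only if still unresolved


# Values transcribed from the "When" list; field names are assumptions.
POLICIES = {
    "P0": SeverityPolicy(timedelta(minutes=15), timedelta(minutes=30), False),
    "P1": SeverityPolicy(timedelta(hours=1), timedelta(hours=2), True),
    "P2": SeverityPolicy(None, None, False),
    "P3": SeverityPolicy(None, None, False),
}


def war_room_must_be_open(severity: str, elapsed: timedelta, resolved: bool) -> bool:
    """True once the war-room deadline for this severity has passed."""
    policy = POLICIES[severity]
    if policy.war_room_after is None or elapsed < policy.war_room_after:
        return False
    # P1 only needs a war room if the incident is still unresolved.
    return not (policy.only_if_unresolved and resolved)
```

For example, `war_room_must_be_open("P1", timedelta(hours=3), resolved=False)` is true, while the same call for a P2 is always false.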

Who

  • Incident commander — decides what gets done next. Holds the timeline. Does not investigate.
  • Communicator — runs the status page, client comms, internal channel. Updates every 30 minutes minimum.
  • Investigator — looks at the data. Reports findings to the commander.

On a one-person on-call, the hats switch in time: "Right now I am the commander. I will not investigate for the next ten minutes." The roles never collapse.
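One way to make the hat switch concrete is to declare each hat with a time box and refuse to change it early. A sketch under assumptions: `SoloOnCall` and `wear` are illustrative names, and the ten-minute hold comes from the quote above, not from any mandated value.

```python
from datetime import datetime, timedelta
from enum import Enum
from typing import List, Optional, Tuple


class Role(Enum):
    COMMANDER = "commander"
    COMMUNICATOR = "communicator"
    INVESTIGATOR = "investigator"


class SoloOnCall:
    """One person, one hat at a time, each hat held for a declared time box."""

    def __init__(self) -> None:
        self.role: Optional[Role] = None
        self.held_until: Optional[datetime] = None
        self.log: List[Tuple[datetime, Role]] = []

    def wear(self, role: Role, now: datetime, hold_for: timedelta) -> None:
        # Refuse to change hats before the declared time box has elapsed.
        if (
            self.held_until is not None
            and now < self.held_until
            and role is not self.role
        ):
            raise RuntimeError(
                f"still {self.role.value} until {self.held_until:%H:%M}"
            )
        self.role = role
        self.held_until = now + hold_for
        self.log.append((now, role))
```

The log doubles as a record of who was deciding and who was investigating at each point, even when both are the same person.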

Time-box

The incident runs as long as it runs — the war room session is the duration of containment + resolution. Status updates every 30 minutes; de-escalation explicit.
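The 30-minute cadence is simple enough to check mechanically. A minimal sketch with hypothetical helper names; only the interval comes from the text.

```python
from datetime import datetime, timedelta

UPDATE_INTERVAL = timedelta(minutes=30)  # "every 30 minutes" from the text


def next_update_due(last_update: datetime) -> datetime:
    """Latest moment the next status update may go out."""
    return last_update + UPDATE_INTERVAL


def update_overdue(last_update: datetime, now: datetime) -> bool:
    """True if the communicator has gone quiet for longer than the interval."""
    return now > next_update_due(last_update)
```

A reminder bot or a timer in the war room channel could call `update_overdue` each minute and nudge the communicator.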

Inputs

  • The alert that fired.
  • The relevant runbook(s).
  • The dashboards.
  • The on-call rotation (so escalation is to a known name).
  • The pre-written client comm template (for >15 min incidents).

Agenda

The war room is a structured response, not a free-form meeting.

Phase | Time | Action
Detect | t=0 | Alert fires. Page arrives.
Contain | t+0 to t+5 min | Disable the flag. Roll the deploy. Roll back the migration. Four levers in order. Contain before you diagnose.
Communicate | t+15 min if not resolved | Pre-written template to client. Internal channel updated. Status page automatic on P0 / manual on P1.
Diagnose | After containment | Investigator works the runbook; commander tracks the timeline; communicator keeps updates flowing.
Resolve | When fix is deployed | Root cause identified, fix tested in staging, deployed through normal pipeline. Flag re-enabled only after staging confirms.
De-escalate | When monitoring confirms recovery | Explicit stand-down. Status page resolved. War room archived. Check on the people who took the page.
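The agenda's one hard ordering rule, contain before diagnose, can be enforced by whatever records the timeline. A sketch, assuming a hypothetical `Timeline` class; the phase names mirror the agenda, everything else is illustrative.

```python
from datetime import datetime
from enum import Enum
from typing import List, Tuple


class Phase(Enum):
    DETECT = "detect"
    CONTAIN = "contain"
    COMMUNICATE = "communicate"
    DIAGNOSE = "diagnose"
    RESOLVE = "resolve"
    DE_ESCALATE = "de-escalate"


class Timeline:
    """Append-only incident log, written during the incident."""

    def __init__(self) -> None:
        self.entries: List[Tuple[datetime, Phase, str]] = []

    def record(self, at: datetime, phase: Phase, note: str) -> None:
        seen = {p for _, p, _ in self.entries}
        # Diagnosis is rejected until a containment action has been logged.
        if phase is Phase.DIAGNOSE and Phase.CONTAIN not in seen:
            raise ValueError("contain before you diagnose")
        self.entries.append((at, phase, note))
```

Only the containment rule is checked here; communication is left unordered because it can overlap with diagnosis.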

Outputs

  • Timeline — recorded during the incident, not reconstructed later.
  • Status page entries — the public record.
  • Client communication trail — what was sent, when.
  • Resolution time — measured from detect to resolved.
  • A scheduled Postmortem — within 48h.
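Two of these outputs are plain date arithmetic. A minimal sketch with hypothetical function names; it assumes the 48-hour postmortem window is counted from resolution, which the text does not state.

```python
from datetime import datetime, timedelta

POSTMORTEM_WINDOW = timedelta(hours=48)  # "within 48h"; start point is an assumption


def resolution_time(detected_at: datetime, resolved_at: datetime) -> timedelta:
    """Resolution time, measured from detect to resolved."""
    if resolved_at < detected_at:
        raise ValueError("resolved_at precedes detected_at")
    return resolved_at - detected_at


def postmortem_deadline(resolved_at: datetime) -> datetime:
    """Latest time the postmortem should be held, counted from resolution."""
    return resolved_at + POSTMORTEM_WINDOW
```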

What good looks like

The JWT outage's response: 4 min to detect, 8 min to identify root cause, 5 min to revert. 44 minutes total including the soak. The speed was good. The prevention — the missing token-compatibility test — was the gap. That's a war room functioning.

Communications flow before the client asks. Within 15 minutes of detection, the client knows there's an incident, what's known, and when the next update will land. The "no surprises" rule is held.
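The three things the client should learn (there is an incident, what is known, when the next update lands) map onto a small fill-in function. This is a sketch only: the wording is a placeholder, not the team's actual pre-written template, and `client_update` is a hypothetical name.

```python
from datetime import datetime, timedelta


def client_update(
    sent_at: datetime,
    known: str,
    next_in: timedelta = timedelta(minutes=30),
) -> str:
    """Render a client message that always commits to a next-update time."""
    next_at = sent_at + next_in
    return (
        f"[{sent_at:%H:%M}] We are responding to an incident. "
        f"What we know: {known} "
        f"Next update by {next_at:%H:%M}."
    )
```

Making the next-update time a required part of the message means an update cannot go out without a commitment to the following one.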

Anti-pattern

The commander investigates. They drift into the data while the timeline goes unwritten. The communicator wonders what to send because the commander isn't surfacing decisions. Fix: the three roles are real even on a one-person on-call. Hat switches happen in time, not in attention.

A second anti-pattern: silence to the client during the incident. The team is "still investigating" — for two hours — without updates. The client learns from their own users that something is wrong. Fix: an honest "we're investigating, will update in 30 min" beats silence. Always.

A third: de-escalation by drift. The incident slowly stops being one; the war room channel goes quiet; no one declares it over. The on-call stays anxious into the next cycle. Fix: the commander stands the team down explicitly, archives the channel, schedules the postmortem, checks on the people who took the page.
