
Incident war room

Convenes within minutes of a P0/P1 fire. Three roles (commander, communicator, investigator) that never combine, even on a one-person on-call. Contain before you diagnose. This is the session that decides how much harm is done.

When

P0: response starts within 15 minutes of detection. War room open within 30 minutes. All other work stops.
P1: response starts within 1 hour of detection. War room opens if the incident is still unresolved at 2 hours.
P2/P3 — no war room. Normal flow.
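
The thresholds above can be written down as a small policy table, so paging tooling and people read the same numbers. A minimal sketch in Python: `SeverityPolicy`, `POLICIES`, and `war_room_due` are hypothetical names, and treating "within 15 minutes" as a response deadline is an assumption, not something the playbook spells out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class SeverityPolicy:
    respond_within: Optional[timedelta]    # None: normal flow, no incident deadline
    war_room_within: Optional[timedelta]   # None: no war room at this severity
    war_room_only_if_unresolved: bool      # P1 opens a war room only if still unresolved
    stops_other_work: bool                 # stated for P0 only; False elsewhere is assumed


POLICIES = {
    "P0": SeverityPolicy(
        respond_within=timedelta(minutes=15),
        war_room_within=timedelta(minutes=30),
        war_room_only_if_unresolved=False,
        stops_other_work=True,
    ),
    "P1": SeverityPolicy(
        respond_within=timedelta(hours=1),
        war_room_within=timedelta(hours=2),
        war_room_only_if_unresolved=True,
        stops_other_work=False,
    ),
    "P2": SeverityPolicy(None, None, False, False),
    "P3": SeverityPolicy(None, None, False, False),
}


def war_room_due(severity: str, detected_at: datetime, now: datetime, resolved: bool) -> bool:
    """Return True if, by `now`, a war room should already be open."""
    policy = POLICIES[severity]
    if policy.war_room_within is None:
        return False
    if policy.war_room_only_if_unresolved and resolved:
        return False
    return now - detected_at >= policy.war_room_within
```

Keeping the numbers in one table means a change to the thresholds is a one-line diff rather than a hunt through alert rules.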

Who

Incident commander — decides what gets done next. Holds the timeline. Does not investigate.
Communicator: runs the status page, client comms, and internal channel. Posts an update at least every 30 minutes.
Investigator — looks at the data. Reports findings to the commander.

On a one-person on-call: the hats switch in time — "Right now I am the commander. I will not investigate for the next ten minutes." The roles never collapse.

Time-box

The incident runs as long as it runs — the war room session is the duration of containment + resolution. Status updates every 30 minutes; de-escalation explicit.
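
The 30-minute cadence is easy to lose track of mid-incident. As a rough sketch, a communicator's tooling could compute when the next update is due and flag an overdue one. `UPDATE_INTERVAL`, `next_update_due`, and `update_overdue` are hypothetical names, not part of any existing tool.

```python
from datetime import datetime, timedelta

# From the text: status updates every 30 minutes.
UPDATE_INTERVAL = timedelta(minutes=30)


def next_update_due(last_update_at: datetime) -> datetime:
    """Time by which the next status update should be posted."""
    return last_update_at + UPDATE_INTERVAL


def update_overdue(last_update_at: datetime, now: datetime) -> bool:
    """True once the cadence has been missed."""
    return now > next_update_due(last_update_at)
```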

Inputs

The alert that fired.
The relevant runbook(s).
The dashboards.
The on-call rotation (so escalation is to a known name).
The pre-written client comm template (for >15 min incidents).

Agenda

The war room is a structured response, not a free-form meeting.

Phase	Time	Action
Detect	t=0	Alert fires. Page arrives.
Contain	t+0 to t+5 min	Disable the flag. Roll the deploy. Roll back the migration. Four levers in order. Contain before you diagnose.
Communicate	t+15 min if not resolved	Pre-written template to client. Internal channel updated. Status page automatic on P0 / manual on P1.
Diagnose	After containment	Investigator works the runbook; commander tracks the timeline; communicator keeps updates flowing.
Resolve	When fix is deployed	Root cause identified, fix tested in staging, deployed through normal pipeline. Flag re-enabled only after staging confirms.
De-escalate	When monitoring confirms recovery	Explicit stand-down. Status page resolved. War room archived. Check on the people who took the page.
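
The ordering in the table can be expressed as prerequisites between phases, which makes "contain before you diagnose" checkable rather than aspirational. A minimal sketch, assuming Communicate is driven by the clock and so depends only on detection. `Phase`, `PREREQUISITES`, and `can_start` are hypothetical names.

```python
from enum import Enum


class Phase(Enum):
    DETECT = "detect"
    CONTAIN = "contain"
    COMMUNICATE = "communicate"
    DIAGNOSE = "diagnose"
    RESOLVE = "resolve"
    DE_ESCALATE = "de-escalate"


# Phases that must be complete before a given phase may start.
# Communicate depending only on Detect is an assumption of this sketch.
PREREQUISITES = {
    Phase.DETECT: set(),
    Phase.CONTAIN: {Phase.DETECT},
    Phase.COMMUNICATE: {Phase.DETECT},
    Phase.DIAGNOSE: {Phase.CONTAIN},
    Phase.RESOLVE: {Phase.DIAGNOSE},
    Phase.DE_ESCALATE: {Phase.RESOLVE},
}


def can_start(phase: Phase, completed: set) -> bool:
    """True if every prerequisite of `phase` is in `completed`."""
    return PREREQUISITES[phase] <= set(completed)
```

With this shape, an attempt to start Diagnose with only Detect complete is refused, which is the commander's rule stated as data.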

Outputs

Timeline — recorded during the incident, not reconstructed later.
Status page entries — the public record.
Client communication trail — what was sent, when.
Resolution time — measured from detect to resolved.
A scheduled postmortem, held within 48 hours.
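
A timeline recorded as it happens can be as simple as an append-only list of timestamped entries, from which resolution time falls out directly. The sketch below is illustrative only: `Timeline` and `TimelineEntry` are hypothetical names, and counting the 48-hour postmortem window from resolution is an assumption.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class TimelineEntry:
    at: datetime
    role: str   # "commander", "communicator", or "investigator"
    note: str


@dataclass
class Timeline:
    detected_at: datetime
    entries: List[TimelineEntry] = field(default_factory=list)
    resolved_at: Optional[datetime] = None

    def record(self, at: datetime, role: str, note: str) -> None:
        """Append an entry at the moment it happens; nothing is edited later."""
        self.entries.append(TimelineEntry(at, role, note))

    def resolve(self, at: datetime) -> None:
        self.resolved_at = at
        self.record(at, "commander", "resolved")

    def resolution_time(self) -> Optional[timedelta]:
        """Detect-to-resolved duration, or None while the incident is open."""
        if self.resolved_at is None:
            return None
        return self.resolved_at - self.detected_at

    def postmortem_deadline(self) -> Optional[datetime]:
        """Assumed here to be 48 hours after resolution."""
        if self.resolved_at is None:
            return None
        return self.resolved_at + timedelta(hours=48)
```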

What good looks like

The JWT outage's response: 4 min to detect, 8 min to identify root cause, 5 min to revert. 44 minutes total including the soak. The speed was good. The prevention — the missing token-compatibility test — was the gap. That's a war room functioning.

Communications flow before the client asks. Within 15 minutes of detection, the client knows there's an incident, what's known, and when the next update will land. The "no surprises" rule is held.

Anti-pattern

The commander investigates. They drift into the data while the timeline goes unwritten. The communicator wonders what to send because the commander isn't surfacing decisions. Fix: the three roles are real even on a one-person on-call. Hat switches happen in time, not in attention.

A second anti-pattern: silence to the client during the incident. The team is "still investigating" — for two hours — without updates. The client learns from their own users that something is wrong. Fix: an honest "we're investigating, will update in 30 min" beats silence. Always.

A third: de-escalation by drift. The incident slowly stops being one; the war room channel goes quiet; no one declares it over. The on-call stays anxious into the next cycle. Fix: the commander stands the team down explicitly, archives the channel, schedules the postmortem, checks on the people who took the page.
