Skip to content

Postmortem

Within 48 hours of incident resolution. Blameless in tone, structural in output. The single question: which chain level was missing? The session that turns an incident into a structural fix — or fails to, and produces a feeling that evaporates.

When

  • P0: within 24 hours.
  • P1: within 48 hours.
  • P2: postmortem optional (but recommended if recurring class).
  • P3: no postmortem.

Calendar-locked at war-room close. Not negotiated later.

Who

  • Incident commander — leads.
  • On-call who responded — names what they saw, what they did, what was confusing.
  • Investigator(s) — name the root cause(s) and the chain levels traced.
  • PO — owns the structural-fix follow-through.
  • Leadership reviewer (for P0) — reads the output; signs.

Time-box

60 minutes. Long enough for the five-Whys walk; short enough to force structural fixes over feelings.

Inputs

  • The war-room timeline (recorded during, not reconstructed).
  • Runbooks used (or not used) — and why.
  • The relevant ADRs.
  • The customer comms record.
  • Monitoring/logs/traces from the incident window.

Agenda

TimeWhat
0–5 minRe-state the incident. Severity, duration, users affected. Read the timeline aloud.
5–15 minDetection and containment. When did the system know? When did a person know? Were the runbooks adequate? Detection gap and containment efficacy surface here.
15–35 minRoot cause via 5 Whys — anchored to chain levels, not individuals. The question is not why did the developer not see this but why did the chain not catch it before this developer? Walk to a structural cause.
35–45 minWhich chain level was missing? Strategy / Discovery / Scope / Execution / Operation. The chain-aware label is named here. scenario-gap? adr-drift? integration-gap? observation-mismatch? configuration?
45–55 minStructural fix — one. Owned. Dated. Tracked. Not "we'll be more careful." A specific change to a checklist, runbook, test, brief template, CI step, or ADR.
55–60 minSign. TL, on-call, PO; leadership for P0. The postmortem is filed; the structural fix is in the backlog with a date.

Outputs

  • The postmortem document (template) — filed in the repo alongside runbooks.
  • One structural fix — owned, dated, tracked. Test: zero recurrences of this class in the next release.
  • Updated artefact(s) — runbook gains a step, release-gate checklist gains an item, ADR updated, etc.

What good looks like

The postmortem produces a named change to a chain artefact. "The amigos template had no prompt for configuration discoverability. We are adding one. Owner: Maya. Check: zero scenario-gap bugs of this class in the next release." Specific. Testable. Dated.

The fix is at a level, not on a person. "The developer should have been more careful" has failed; it stops the trace at the human and misses the structural gap they fell into. "The pipeline did not validate runtime behaviour" keeps tracing.

Anti-pattern

The postmortem produces a feeling. "We need to be more careful with configuration changes going forward and improve communication." That feeling evaporates before the next incident. If the same class of incident happens twice, the postmortem did not fix the system — it documented it. Documentation is not repair.

A second anti-pattern: blame creeps in. Names get used; the team gets defensive; structural causes aren't surfaced. Fix: the chain-aware root-cause labels (scenario-gap, adr-drift, etc.) exist precisely so the trace stops at a level, not a person.

A third: three structural fixes proposed. The team noticed many things. Fix: one. The rest go to retro or to the next postmortem if recurring. Three fixes implemented zero is feelings; one implemented is repair.

See also

200apps · How We Work · NWIRE