session · incidents

Postmortem

Within 48 hours of incident resolution (24 hours for P0). Blameless in tone, structural in output. The single question: which chain level was missing? This is the session that turns an incident into a structural fix, or fails to and produces a feeling that evaporates.

When

P0: within 24 hours.
P1: within 48 hours.
P2: postmortem optional, but recommended if the incident class is recurring.
P3: no postmortem.

Calendar-locked at war-room close. Not negotiated later.
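The severity rules above can be turned into a deadline that is computed once, at war-room close, and put straight into the calendar. This is a minimal sketch: the function name `postmortem_deadline`, the `DEADLINE_HOURS` table and the sample timestamp are illustrative, and it assumes the clock starts at incident resolution.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hours allowed between incident resolution and the postmortem, per severity.
# None means no deadline is enforced: P2 is optional, P3 has no postmortem.
DEADLINE_HOURS = {"P0": 24, "P1": 48, "P2": None, "P3": None}


def postmortem_deadline(severity: str, resolved_at: datetime) -> Optional[datetime]:
    """Return the latest time the postmortem may be held, or None if none is required."""
    if severity not in DEADLINE_HOURS:
        raise ValueError(f"unknown severity: {severity!r}")
    hours = DEADLINE_HOURS[severity]
    if hours is None:
        return None
    return resolved_at + timedelta(hours=hours)


resolved = datetime(2024, 3, 4, 15, 30, tzinfo=timezone.utc)  # made-up timestamp
print(postmortem_deadline("P0", resolved))  # 2024-03-05 15:30:00+00:00
print(postmortem_deadline("P3", resolved))  # None
```

A team that chooses to hold a P2 postmortem would need its own deadline; the sketch deliberately does not invent one.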

Who

Incident commander — leads.
On-call who responded — names what they saw, what they did, what was confusing.
Investigator(s) — name the root cause(s) and the chain levels traced.
PO — owns the structural-fix follow-through.
Leadership reviewer (for P0) — reads the output; signs.

Time-box

60 minutes. Long enough for the 5 Whys walk; short enough to force structural fixes over feelings.

Inputs

The war-room timeline (recorded during, not reconstructed).
Runbooks used (or not used) — and why.
The relevant ADRs.
The customer comms record.
Monitoring/logs/traces from the incident window.

Agenda

Time	What
0–5 min	Re-state the incident. Severity, duration, users affected. Read the timeline aloud.
5–15 min	Detection and containment. When did the system know? When did a person know? Were the runbooks adequate? Detection gap and containment efficacy surface here.
15–35 min	Root cause via 5 Whys, anchored to chain levels, not individuals. The question is not "why did the developer not see this?" but "why did the chain not catch it before it reached this developer?" Walk to a structural cause.
35–45 min	Which chain level was missing? Strategy / Discovery / Scope / Execution / Operation. The chain-aware label is named here. `scenario-gap`? `adr-drift`? `integration-gap`? `observation-mismatch`? `configuration`?
45–55 min	Structural fix — one. Owned. Dated. Tracked. Not "we'll be more careful." A specific change to a checklist, runbook, test, brief template, CI step, or ADR.
55–60 min	Sign. TL, on-call, PO; leadership for P0. The postmortem is filed; the structural fix is in the backlog with a date.
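One way to keep the 15–45 minute portion of the agenda honest is to record its result as data that can only name a chain level and a label, never a person. The sketch below is an assumption, not part of the postmortem template: the `Finding` type is hypothetical, the label set contains only the five labels named in the agenda (the real list may be longer), and the pairing of `scenario-gap` with Discovery in the example is purely illustrative.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class ChainLevel(Enum):
    STRATEGY = "Strategy"
    DISCOVERY = "Discovery"
    SCOPE = "Scope"
    EXECUTION = "Execution"
    OPERATION = "Operation"


# Labels named in the agenda; treat this set as possibly incomplete.
ROOT_CAUSE_LABELS = {
    "scenario-gap",
    "adr-drift",
    "integration-gap",
    "observation-mismatch",
    "configuration",
}


@dataclass(frozen=True)
class Finding:
    missing_level: ChainLevel   # which chain level was missing
    label: str                  # chain-aware root-cause label
    whys: Tuple[str, ...]       # the Whys walk, in the order it was asked

    def __post_init__(self) -> None:
        if self.label not in ROOT_CAUSE_LABELS:
            raise ValueError(f"not a chain-aware label: {self.label!r}")
        if not self.whys:
            raise ValueError("record the Whys walk that led to this level")


# Illustrative finding; the level/label pairing is made up for the example.
finding = Finding(
    missing_level=ChainLevel.DISCOVERY,
    label="scenario-gap",
    whys=(
        "Why did the change break production? The config key was unknown to the service.",
        "Why was it unknown? No scenario covered configuration discoverability.",
    ),
)
print(finding.missing_level.value, finding.label)  # Discovery scenario-gap
```

Because the record has no field for a name, a trace that ends at "the developer" has nowhere to go and has to keep walking.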

Outputs

The postmortem document (template) — filed in the repo alongside runbooks.
One structural fix — owned, dated, tracked. Test: zero recurrences of this class in the next release.
Updated artefact(s) — runbook gains a step, release-gate checklist gains an item, ADR updated, etc.
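"Owned, dated, tracked" can be checked mechanically before the postmortem is signed. The following is a minimal sketch under stated assumptions: `StructuralFix`, its field names and the `tracking_ref` value are hypothetical, the artefact kinds are the ones listed in the agenda, and the example change and date are made up.

```python
from dataclasses import dataclass
from datetime import date

# Artefact kinds named in the agenda as valid targets for a structural fix.
ARTEFACT_KINDS = {"checklist", "runbook", "test", "brief template", "CI step", "ADR"}


@dataclass(frozen=True)
class StructuralFix:
    change: str          # the specific change, not an intention
    artefact_kind: str   # what kind of chain artefact is being changed
    owner: str           # owned
    due: date            # dated
    tracking_ref: str    # tracked: hypothetical backlog identifier

    def __post_init__(self) -> None:
        if self.artefact_kind not in ARTEFACT_KINDS:
            raise ValueError(f"fix must change a chain artefact, got {self.artefact_kind!r}")
        for name in ("change", "owner", "tracking_ref"):
            if not getattr(self, name).strip():
                raise ValueError(f"{name} must not be empty")


# Made-up example values.
fix = StructuralFix(
    change="Add a configuration-validation step to the deploy runbook",
    artefact_kind="runbook",
    owner="Maya",
    due=date(2024, 3, 18),
    tracking_ref="BACKLOG-123",
)
print(f"{fix.owner} owns: {fix.change} (due {fix.due.isoformat()})")
```

The check cannot tell a specific change from a vague one; it only guarantees that the fix names an artefact, an owner, a date and a backlog entry, which is the part that usually goes missing.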

What good looks like

The postmortem produces a named change to a chain artefact. "The amigos template had no prompt for configuration discoverability. We are adding one. Owner: Maya. Check: zero scenario-gap bugs of this class in the next release." Specific. Testable. Dated.

The fix is at a level, not on a person. "The developer should have been more careful" is a failed trace: it stops at the human and misses the structural gap they fell into. "The pipeline did not validate runtime behaviour" keeps tracing.

Anti-pattern

The postmortem produces a feeling. "We need to be more careful with configuration changes going forward and improve communication." That feeling evaporates before the next incident. If the same class of incident happens twice, the postmortem did not fix the system; it only documented the failure. Documentation is not repair.

A second anti-pattern: blame creeps in. Names get used; the team gets defensive; structural causes aren't surfaced. Fix: the chain-aware root-cause labels (scenario-gap, adr-drift, etc.) exist precisely so the trace stops at a level, not a person.

A third: three structural fixes proposed. The team noticed many things. Fix: one. The rest go to retro, or to the next postmortem if they recur. Three fixes proposed and none implemented is feelings; one implemented is repair.
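The recurrence test from the first anti-pattern ("the same class of incident happens twice") is easy to run over an incident log. This is a rough sketch: it assumes each incident carries a free-form class string chosen by the team, and the incident identifiers and class names below are invented.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def recurring_classes(incidents: Iterable[Tuple[str, str]]) -> List[str]:
    """Given (incident_id, incident_class) pairs, return classes seen more than once."""
    counts = Counter(incident_class for _, incident_class in incidents)
    return sorted(cls for cls, n in counts.items() if n > 1)


# Invented incident log.
history = [
    ("INC-101", "config-not-discoverable"),
    ("INC-117", "adr-out-of-date"),
    ("INC-130", "config-not-discoverable"),
]
print(recurring_classes(history))  # ['config-not-discoverable']
```

Any class this returns points at a postmortem whose structural fix either was never implemented or was aimed at the wrong level.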
