Postmortem

The act of taking an incident and finding the level it traces to. Blameless. Bound to a specific incident. Lands one chain-level fix — owned, dated, testable. More monitoring is rarely the right level.

TL;DR

A postmortem is bound to one incident, driven by the TL (or the on-call who held the bridge), and produces one chain-level fix — owned, dated, testable. Written within 72 hours. The timeline was written during the incident, not reconstructed after. The chain-level fix is at the level that produced the incident, not just at L5 (more monitoring).

What it is

A postmortem is named in After We Build · Incidents & Postmortems. It is the corpus's discipline for converting an incident into structural learning. The postmortem is blameless — names of people appear only in the timeline; the fix is always at a chain level.

Distinguish from

Retrospective — bound to a cycle. Postmortem — bound to an incident. Blame meeting — the failure shape; the corpus does not run these. RCA doc — root cause analysis can be a technique used inside a postmortem; the postmortem is the artefact. See Confusable with at the foot.

Why it matters

Without the postmortem discipline:

The same incident recurs. The fix was at the symptom level; the chain-level cause stayed.
Blame circulates invisibly. Without an explicit chain-level frame, fingers point at people.
Operational knowledge dies with the on-call. No runbook update, no ADR, no model update.
Leadership learns from rumour. "No surprises" breaks in both directions.

The corpus rule from Principles · Chain-level thinking: every defect traces to a level. Structural fixes, not patches.

How to do it

Step 1 — Schedule within 72 hours

A postmortem held within 72 hours of the incident produces a fix in 85% of cases. After a week, that drops to 50%. Within 72 hours, memory is still fresh, emotion has settled, and the timeline still resolves cleanly.
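As a minimal sketch of the scheduling rule, the 72-hour window can be computed from a timestamp and checked before the meeting is booked. The function names are hypothetical, and counting the window from the moment the incident resolved (rather than from the first alert) is an assumption; the text above only says "within 72 hours of the incident".

```python
from datetime import datetime, timedelta, timezone

# The 72-hour window comes from the rule above.
POSTMORTEM_WINDOW = timedelta(hours=72)


def postmortem_deadline(incident_resolved_at: datetime) -> datetime:
    """Latest time the postmortem should be held.

    Assumption: the window is counted from resolution, not from first alert.
    """
    return incident_resolved_at + POSTMORTEM_WINDOW


def is_within_window(incident_resolved_at: datetime, scheduled_for: datetime) -> bool:
    """True if the proposed meeting time falls inside the 72-hour window."""
    return scheduled_for <= postmortem_deadline(incident_resolved_at)


# Hypothetical resolution timestamp, used only for illustration.
resolved = datetime(2026, 5, 18, 14, 8, tzinfo=timezone.utc)
deadline = postmortem_deadline(resolved)
```

Using timezone-aware timestamps avoids an ambiguous deadline when the on-call and the rest of the team sit in different time zones.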

Step 2 — Open with the timeline

The timeline was written during the incident by the on-call. The postmortem starts by reading it. Do not reconstruct the timeline now; if it is missing, that itself is the chain-level signal.

Timeline (from on-call's during-the-incident notes):

T+0:00   alert: queue.render.p95 >2s for 5 min (auto-paged on-call)
T+0:02   the senior dev acked. Opens runbook RB-021 (queue performance).
T+0:04   Diagnose: cold cache, not DB.
T+0:06   Mitigation per RB-021: scale up read replicas.
         Action took 90 sec.
T+0:08   p95 back to <500ms. Alert clears.
T+0:09   Internal note in #engineering. PO notified.
T+0:14   PO writes client thread: "operational issue
         resolved; updates if anything changes."
T+0:30   No recurrence. Watch closed at T+1:00.

Severity: SEV-2.
Duration: 8 minutes.
User impact: ~120 graders saw queue >2s during the
window. No incomplete grade submissions.
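Because the timeline is written during the incident, it can also be read by a script. The sketch below parses notes in the `T+H:MM` notation used above and derives the duration from the first and last entries. The parser and its names are hypothetical, and it assumes the offsets are hours and minutes since the page and that indented lines without an offset continue the previous entry.

```python
import re
from datetime import timedelta

# Matches offsets written as "T+H:MM", followed by the note text.
OFFSET = re.compile(r"^T\+(\d+):(\d{2})\s+(.*)$")


def parse_timeline(lines):
    """Return (offset, note) pairs from during-the-incident notes.

    Lines without a T+ prefix are treated as continuations of the
    previous entry (an assumption about the note format).
    """
    entries = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        match = OFFSET.match(line)
        if match:
            hours, minutes = int(match.group(1)), int(match.group(2))
            entries.append([timedelta(hours=hours, minutes=minutes), match.group(3)])
        elif entries:
            entries[-1][1] += " " + line
    return [(offset, note) for offset, note in entries]


notes = [
    "T+0:00   alert: queue.render.p95 >2s for 5 min (auto-paged on-call)",
    "T+0:06   Mitigation per RB-021: scale up read replicas.",
    "         Action took 90 sec.",
    "T+0:08   p95 back to <500ms. Alert clears.",
]
timeline = parse_timeline(notes)
duration = timeline[-1][0] - timeline[0][0]  # alert to clear
```

For this excerpt the computed duration is 8 minutes, matching the figure recorded above. An empty result from the parser is the "timeline is missing" signal the step describes.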

Step 3 — Walk the chain levels

The team asks for each chain level: did this level produce or fail to catch the incident?

L1 Strategy:
  Did the strategic bet contribute? No — this was an
  operational signal, not a strategic miss.

L2 Discovery:
  Did the brief miss a known assumption? No — the brief
  predicted exactly this kind of cold-cache spike under
  high concurrent load.

L3 Scope:
  Was a story missing? Possibly. The "cold cache warming"
  story was sized as 2 days and slipped to 4. The warming
  job was not yet in production at the time of the
  incident.

L4 Execution:
  Did the code, test, or pipeline fail? No — the change
  itself worked correctly. The cold cache is a known
  edge case.

L5 Operation:
  Did the runbook, alerting, or on-call respond well?
  Yes. RB-021 was used. Mitigation began 6 minutes after
  the page; the alert cleared at T+0:08.

The chain-level conversation surfaces the structural cause. In this incident, the cause is at L3 — a story slipped, and the warming job was not yet live. The fix is at L3, not at L5 (more monitoring) or L4 (faster cache).
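The walk can be recorded as data so that no level is skipped and the conclusion is explicit. This is a sketch only: the `Level`, `Finding`, and `causal_levels` names are hypothetical, and treating the L3 "Possibly" as a contributing level reflects the team's conclusion stated above, not an automatic rule.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    STRATEGY = 1
    DISCOVERY = 2
    SCOPE = 3
    EXECUTION = 4
    OPERATION = 5


@dataclass(frozen=True)
class Finding:
    level: Level
    contributed: bool  # did this level produce or fail to catch the incident?
    note: str


def causal_levels(walk):
    """Levels marked as contributing, highest in the chain first.

    Raises ValueError if the walk skipped any of the five levels.
    """
    missing = set(Level) - {finding.level for finding in walk}
    if missing:
        names = sorted(level.name for level in missing)
        raise ValueError(f"walk skipped levels: {names}")
    return sorted(finding.level for finding in walk if finding.contributed)


walk = [
    Finding(Level.STRATEGY, False, "operational signal, not a strategic miss"),
    Finding(Level.DISCOVERY, False, "brief predicted the cold-cache spike"),
    Finding(Level.SCOPE, True, "warming story slipped; job not yet in production"),
    Finding(Level.EXECUTION, False, "the change itself worked correctly"),
    Finding(Level.OPERATION, False, "RB-021 was used; mitigation worked"),
]
```

Refusing an incomplete walk is a deliberate choice in this sketch: it keeps the conversation from stopping at the first level that looks plausible.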

Step 4 — One chain-level fix

The corpus rule: one fix, owned, dated, testable.

Chain-level fix (L3):
  The cold-cache warming job ships in the current cycle,
  not the next. Story re-prioritised to top of sprint.
  Owner:    the senior dev.
  Dated:    Cold-cache warming live in production by
            2026-05-30.
  Testable: queue.render.p95 stays <500ms during the next
            scheduled cold-start event (Sunday maintenance
            window).

A "more monitoring" fix would have read: "we will add an alert at 1.5s instead of 2s." That is L5: a faster page after the same incident. It does not prevent the next one.
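The "owned, dated, testable" rule can be checked mechanically before the postmortem closes. The record shape and the `problems` function below are hypothetical, and the checks are a minimal reading of the rule: an owner is present, the date is not already past, and a test is stated.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ChainLevelFix:
    level: int        # 1..5, the chain level the fix lands at
    description: str
    owner: str        # a role, in keeping with the blameless rule
    due: date
    test: str         # an observable check, stated in advance


def problems(fix: ChainLevelFix, today: date) -> list[str]:
    """Return the reasons a fix is not yet owned, dated, and testable."""
    issues = []
    if not 1 <= fix.level <= 5:
        issues.append("level must be between 1 and 5")
    if not fix.owner.strip():
        issues.append("fix has no owner")
    if fix.due < today:
        issues.append("due date is in the past")
    if not fix.test.strip():
        issues.append("fix has no test")
    return issues


fix = ChainLevelFix(
    level=3,
    description="Cold-cache warming job ships in the current cycle",
    owner="the senior dev",
    due=date(2026, 5, 30),
    test="queue.render.p95 stays <500ms during the next scheduled cold-start event",
)
review_date = date(2026, 5, 18)  # hypothetical date of the postmortem
```

An empty list from `problems(fix, review_date)` means the fix meets the three conditions; it says nothing about whether the fix is at the right level, which remains the team's judgement.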

Step 5 — Update the runbook and the ADRs

Before the postmortem closes, two specific updates:

The runbook used — what worked, what didn't, what would be faster next time. Edit RB-021 today.
The ADRs implicated — if the incident exposes a decision that should be re-opened, name the ADR and schedule the conversation.

Step 6 — Leadership reads, not summary

The postmortem is read by leadership directly. Not summarised in an all-hands. The artefact is the medium. If leadership cannot read postmortems directly, the chain is being filtered.

Step 7 — Publish

The postmortem lives next to the runbook and is referenced from the ADRs. Searchable by date, by service, by chain level.
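One way to make postmortems searchable by date, service, and chain level is to keep a small index record per postmortem. This is a sketch under assumptions: the record fields, the `search` function, and every service name, path, and date in the sample data other than RB-021 are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class PostmortemRecord:
    held_on: date
    service: str
    fix_level: int   # chain level of the fix, 1..5
    runbook: str     # the runbook the postmortem lives next to
    path: str        # where the artefact is stored


def search(records, *, service=None, fix_level=None, since=None):
    """Filter by any combination of service, chain level, and date; newest first."""
    hits = [
        record
        for record in records
        if (service is None or record.service == service)
        and (fix_level is None or record.fix_level == fix_level)
        and (since is None or record.held_on >= since)
    ]
    return sorted(hits, key=lambda record: record.held_on, reverse=True)


# Hypothetical index entries, for illustration only.
records = [
    PostmortemRecord(date(2026, 3, 2), "queue-render", 5, "RB-021", "postmortems/2026-03-02.txt"),
    PostmortemRecord(date(2026, 4, 11), "grade-export", 4, "RB-007", "postmortems/2026-04-11.txt"),
    PostmortemRecord(date(2026, 5, 19), "queue-render", 3, "RB-021", "postmortems/2026-05-19.txt"),
]
```

Searching by service then shows at a glance whether earlier fixes for the same service landed at L5, which is the recurrence pattern the walk is meant to catch.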

A complete postmortem

See the template for the copy-paste skeleton.

Evidence

Across our incidents, postmortems that produced compounding fixes shared three properties.

The timeline was written during the incident, not after. Postmortems with during-the-incident timelines produced fixes that stuck in 90% of cases; postmortems with reconstructed timelines produced sticking fixes in 50%.
The chain level was named explicitly. Postmortems that walked the five levels caught the level above the symptom in 1 of 3 cases. Postmortems without the walk fixed at L5 ("more monitoring") in 70% of cases.
One fix, not three. Postmortems with one chain-level fix tested it in the next cycle 90% of the time. Postmortems with three fixes tested none in 40% of cases.

Anti-patterns

Pattern	What it looks like	Where to fix
Postmortem produced a feeling	"We need to be more careful with releases."	Clinic — A postmortem that produced a feeling
Fix is more monitoring	The L5 reflex	Walk the levels. The level above is usually where the fix belongs.
Timeline reconstructed	The during-write was skipped	This is itself a chain-level signal — runbook + on-call discipline
Names of people in the fix	"Maya will be more careful next time"	Blameless. Names appear only in the timeline. The fix is at a level, not a person.
Three or five fixes	The team noticed many things	One. The rest go to retro or to next-postmortem if recurring.
Leadership summarised, not read	The PO retells the postmortem in a meeting	Hand them the artefact; the meeting is the failure shape

Confusable with

This	Not this	Difference
Postmortem	Retrospective	Postmortem = bound to one incident. Retro = bound to one cycle.
Postmortem	RCA doc	RCA is a technique (5 whys, fishbone) used inside a postmortem. The postmortem is the artefact.
Postmortem	Blame meeting	Blameless. Names appear in the timeline only.
Chain-level fix	Symptom fix	Symptom fix patches the L4/L5 surface; chain-level fix addresses the cause at the level that produced it.
