Postmortem
The act of taking an incident and finding the level it traces to. Blameless. Bound to a specific incident. Lands one chain-level fix — owned, dated, testable. More monitoring is rarely the right level.
TL;DR
A postmortem is bound to one incident, driven by the tech lead (or the on-call who held the bridge), and produces one chain-level fix: owned, dated, testable. It is written within 72 hours. The timeline is written during the incident, not reconstructed after. The fix lands at the level that produced the incident, not by default at L5 (more monitoring).
What it is
A postmortem is named in After We Build · Incidents & Postmortems. It is the corpus's discipline for converting an incident into structural learning. The postmortem is blameless — names of people appear only in the timeline; the fix is always at a chain level.
Distinguish from
- Retrospective: bound to a cycle. A postmortem is bound to an incident.
- Blame meeting: the failure shape; the corpus does not run these.
- RCA doc: root cause analysis is a technique that can be used inside a postmortem; the postmortem is the artefact.
See "Confusable with" at the foot of this page.
Why it matters
Without the postmortem discipline:
- The same incident recurs. The fix was at the symptom level; the chain-level cause stayed.
- Blame circulates invisibly. Without an explicit chain-level frame, fingers point at people.
- Operational knowledge dies with the on-call. No runbook update, no ADR, no model update.
- Leadership learns from rumour. The "no surprises" rule breaks in both directions.
The corpus rule from Principles · Chain-level thinking: every defect traces to a level. Structural fixes, not patches.
How to do it
Step 1 — Schedule within 72 hours
A postmortem held within 72 hours of the incident produces a fix in 85% of cases. After a week, that drops to 50%. Inside the 72-hour window, memory is still fresh, emotion has settled, and the timeline still resolves cleanly.
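The 72-hour rule is easy to check mechanically. The sketch below is illustrative only: it assumes the window is measured from the moment the incident was closed (the source says only "within 72 hours of the incident"), and the names `postmortem_deadline` and `is_overdue` are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Window from Step 1. Assumption: counted from the time the incident closed.
POSTMORTEM_WINDOW = timedelta(hours=72)


def postmortem_deadline(incident_closed_at: datetime) -> datetime:
    """Return the latest time the postmortem should be held."""
    return incident_closed_at + POSTMORTEM_WINDOW


def is_overdue(incident_closed_at: datetime, now: datetime) -> bool:
    """True once the 72-hour window has passed without a postmortem."""
    return now > postmortem_deadline(incident_closed_at)


# Illustrative timestamp, not taken from the incident above.
closed = datetime(2026, 5, 1, 12, 0, tzinfo=timezone.utc)
deadline = postmortem_deadline(closed)
```

A check like this could sit in whatever tool tracks open incidents, so that a postmortem that has not been scheduled shows up before the window closes rather than after.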
Step 2 — Open with the timeline
The timeline was written during the incident by the on-call. The postmortem starts by reading it. Do not reconstruct the timeline now; if it is missing, that itself is the chain-level signal.
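Writing the timeline during the incident needs very little tooling. As a minimal sketch, assuming an append-only list of timestamped notes (the `Timeline` class and its methods are hypothetical, not part of any corpus tool), the on-call records entries as they happen and renders them as `T+h:mm` offsets afterwards:

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import List, Optional, Tuple


@dataclass
class Timeline:
    """Append-only incident notes, rendered as offsets from the first alert."""
    started_at: datetime
    entries: List[Tuple[datetime, str]] = field(default_factory=list)

    def note(self, text: str, at: Optional[datetime] = None) -> None:
        # Default to "now" so the entry is stamped when it is written,
        # which is the point of a during-the-incident timeline.
        self.entries.append((at or datetime.now(timezone.utc), text))

    def render(self) -> List[str]:
        lines = []
        for at, text in self.entries:
            total_minutes = int((at - self.started_at).total_seconds() // 60)
            hours, minutes = divmod(total_minutes, 60)
            lines.append(f"T+{hours}:{minutes:02d} {text}")
        return lines


# Illustrative start time; explicit `at` values are used here only for the demo.
start = datetime(2026, 5, 1, 9, 0, tzinfo=timezone.utc)
timeline = Timeline(started_at=start)
timeline.note("alert: queue.render.p95 >2s for 5 min", at=start)
timeline.note("acked, opened RB-021", at=start + timedelta(minutes=2))
```

The output format mirrors the on-call notes in the example that follows.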
Timeline (from on-call's during-the-incident notes):
T+0:00 alert: queue.render.p95 >2s for 5 min (auto-paged on-call)
T+0:02 the senior dev acked. Opens runbook RB-021 (queue performance).
T+0:04 Diagnose: cold cache, not DB.
T+0:06 Mitigation per RB-021: scale up read replicas.
Action took 90 sec.
T+0:08 p95 back to <500ms. Alert clears.
T+0:09 Internal note in #engineering. PO notified.
T+0:14 PO writes client thread: "operational issue
resolved; updates if anything changes."
T+0:30 No recurrence. Watch closed at T+1:00.
Severity: SEV-2.
Duration: 8 minutes.
User impact: ~120 graders saw queue >2s during the
window. No incomplete grade submissions.
Step 3 — Walk the chain levels
The team asks for each chain level: did this level produce or fail to catch the incident?
L1 Strategy:
Did the strategic bet contribute? No — this was an
operational signal, not a strategic miss.
L2 Discovery:
Did the brief miss a known assumption? No — the brief
predicted exactly this kind of cold-cache spike under
high concurrent load.
L3 Scope:
Was a story missing? Possibly. The "cold cache warming"
story was sized as 2 days and slipped to 4. The warming
job was not yet in production at the time of the
incident.
L4 Execution:
Did the code, test, or pipeline fail? No — the change
itself worked correctly. The cold cache is a known
edge case.
L5 Operation:
Did the runbook, alerting, or on-call respond well?
Yes. RB-021 was used. Mitigation in 6 minutes from page.
The chain-level conversation surfaces the structural cause. In this incident, the cause is at L3: a story slipped, and the warming job was not yet live. The fix is at L3, not at L5 (more monitoring) or L4 (faster cache).
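The walk can also be recorded as data, which makes a skipped level visible. This is a sketch under assumptions: the `Verdict` values and the `candidate_levels` helper are hypothetical, and each verdict answers "did this level produce or fail to catch the incident?", so L5 is recorded as `NO` here because the runbook and on-call response worked.

```python
from enum import Enum
from typing import Dict, List


class Verdict(Enum):
    NO = "no"
    POSSIBLY = "possibly"
    YES = "yes"


CHAIN_LEVELS = ["L1 Strategy", "L2 Discovery", "L3 Scope",
                "L4 Execution", "L5 Operation"]


def candidate_levels(walk: Dict[str, Verdict]) -> List[str]:
    """Return the levels that may have produced or missed the incident.

    Raises ValueError if any of the five levels was not walked.
    """
    missing = [level for level in CHAIN_LEVELS if level not in walk]
    if missing:
        raise ValueError(f"walk skipped levels: {missing}")
    return [level for level in CHAIN_LEVELS if walk[level] is not Verdict.NO]


# The walk from the incident above, transcribed as verdicts.
walk = {
    "L1 Strategy": Verdict.NO,
    "L2 Discovery": Verdict.NO,
    "L3 Scope": Verdict.POSSIBLY,   # warming story slipped, job not yet live
    "L4 Execution": Verdict.NO,
    "L5 Operation": Verdict.NO,     # RB-021 worked as intended
}
print(candidate_levels(walk))  # ['L3 Scope']
```

The helper only narrows the conversation; deciding which candidate level carries the fix remains a judgment the team makes in the room.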
Step 4 — One chain-level fix
The corpus rule: one fix, owned, dated, testable.
Chain-level fix (L3):
The cold-cache warming job ships in the current cycle,
not the next. Story re-prioritised to top of sprint.
Owner: the senior dev.
Dated: Cold-cache warming live in production by
2026-05-30.
Testable: queue.render.p95 stays <500ms during the next
scheduled cold-start event (Sunday maintenance
window).
A "more monitoring" fix would have read: "we will add an alert at 1.5s instead of 2s." That is L5: a faster page after the same incident. It does not prevent the next one.
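The "one fix, owned, dated, testable" rule lends itself to a small structural check before the postmortem closes. The sketch below is one possible shape, not a corpus tool: `ChainLevelFix` and `review_fixes` are hypothetical names, and the check only verifies that the fields are present, not that the test is a good one.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class ChainLevelFix:
    level: str        # chain level the fix lands at, e.g. "L3 Scope"
    description: str
    owner: str        # a role, per the blameless rule
    due: date
    test: str         # the observable condition that proves the fix worked


def review_fixes(fixes: List[ChainLevelFix]) -> List[str]:
    """Return a list of problems; an empty list means the rule is satisfied."""
    problems = []
    if len(fixes) != 1:
        problems.append(f"expected exactly one fix, got {len(fixes)}")
    for fix in fixes:
        if not fix.owner.strip():
            problems.append("fix has no owner")
        if not fix.test.strip():
            problems.append("fix has no test")
    return problems


# The fix from the incident above, transcribed into the record.
fix = ChainLevelFix(
    level="L3 Scope",
    description="Cold-cache warming job ships in the current cycle",
    owner="the senior dev",
    due=date(2026, 5, 30),
    test="queue.render.p95 stays <500ms during the next scheduled cold-start event",
)
```

Requiring `due` to be a real date, rather than free text, is the design choice that keeps "dated" from degrading into "soon".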
Step 5 — Update the runbook and the ADRs
Before the postmortem closes, two specific updates:
- The runbook used — what worked, what didn't, what would be faster next time. Edit RB-021 today.
- The ADRs implicated — if the incident exposes a decision that should be re-opened, name the ADR and schedule the conversation.
Step 6 — Leadership reads, not summary
The postmortem is read by leadership directly. Not summarised in an all-hands. The artefact is the medium. If leadership cannot read postmortems directly, the chain is being filtered.
Step 7 — Publish
The postmortem lives next to the runbook and is referenced from the ADRs. Searchable by date, by service, by chain level.
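"Searchable by date, by service, by chain level" implies that each published postmortem carries at least those three fields. As a rough sketch, assuming a simple in-memory index (the `PostmortemEntry` fields and the `find` helper are hypothetical), the lookup could be as small as:

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class PostmortemEntry:
    incident_date: date
    service: str
    chain_level: str             # level the fix landed at, e.g. "L3"
    runbook: str                 # runbook the postmortem lives next to
    adrs: Tuple[str, ...] = ()   # ADRs that reference it, if any


def find(entries: Iterable[PostmortemEntry], *,
         service: Optional[str] = None,
         chain_level: Optional[str] = None,
         since: Optional[date] = None) -> List[PostmortemEntry]:
    """Filter by any combination of service, chain level, and date; newest first."""
    matches = [
        entry for entry in entries
        if (service is None or entry.service == service)
        and (chain_level is None or entry.chain_level == chain_level)
        and (since is None or entry.incident_date >= since)
    ]
    return sorted(matches, key=lambda entry: entry.incident_date, reverse=True)
```

A query such as `find(index, chain_level="L3")` would then surface every incident whose fix landed at scope, which is one way to spot a level that keeps producing incidents.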
A complete postmortem
See the template for the copy-paste skeleton.
Evidence
Across our incidents, postmortems that produced compounding fixes shared three properties.
- The timeline was written during the incident, not after. Postmortems with during-the-incident timelines produced fixes that stuck in 90% of cases; postmortems with reconstructed timelines produced sticking fixes in 50%.
- The chain level was named explicitly. Postmortems that walked the five levels caught the level above the symptom in 1 of 3 cases. Postmortems without the walk fixed at L5 ("more monitoring") in 70% of cases.
- One fix, not three. Postmortems with one chain-level fix tested it in the next cycle 90% of the time. Postmortems with three fixes tested none in 40% of cases.
Anti-patterns
| Pattern | What it looks like | Where to fix |
|---|---|---|
| Postmortem produced a feeling | "We need to be more careful with releases." | Clinic — A postmortem that produced a feeling |
| Fix is more monitoring | The L5 reflex | Walk the levels. The level above is usually where the fix belongs. |
| Timeline reconstructed | The during-write was skipped | This is itself a chain-level signal — runbook + on-call discipline |
| Names of people in the fix | "Maya will be more careful next time" | Blameless. Names appear only in the timeline. The fix is at a level, not a person. |
| Three or five fixes | The team noticed many things | One. The rest go to retro or to next-postmortem if recurring. |
| Leadership summarised, not read | The PO retells the postmortem in a meeting | Hand them the artefact; the meeting is the failure shape |
Confusable with
| This | Not this | Difference |
|---|---|---|
| Postmortem | Retrospective | Postmortem = bound to one incident. Retro = bound to one cycle. |
| Postmortem | RCA doc | RCA is a technique (5 whys, fishbone) used inside a postmortem. The postmortem is the artefact. |
| Postmortem | Blame meeting | Blameless. Names appear in the timeline only. |
| Chain-level fix | Symptom fix | Symptom fix patches the L4/L5 surface; chain-level fix addresses the cause at the level that produced it. |
Further reading
- Canon — After We Build · Incidents & Postmortems · Bugs and Their Roots
- Template — Postmortem
- Checklist — Postmortem · timeline-during-not-after
- Clinic — A postmortem that produced a feeling
- Practice — First 48 hours watch — the watch hands off to the bridge, the bridge hands off to the postmortem
- Skill path — On-call foundations · Step 8 · Tech Lead foundations · Step 8
- Principle — Chain-level thinking
- Reference — Area · Postmortem