template · postmortem
Postmortem template
Copy-paste skeleton. Bound to one incident. Blameless. Walks the chain levels. Lands one fix — owned, dated, testable.
How to use
The timeline below is the on-call's during-the-incident notes — pasted, not rewritten. The chain-levels walk is the discipline. More monitoring is rarely the right level; push at least one level up.
text
# Postmortem — [one-line title naming the symptom]
Incident ID: [INC-NNN]
Severity: SEV-1 | SEV-2 | SEV-3
Date: YYYY-MM-DD
Duration: [start → end]
Postmortem date: YYYY-MM-DD (within 72h of incident)
Driver: [Name (role) — usually TL]
Attendees: [Trio + on-call who held the bridge + relevant
developer]
## Summary
[Two sentences. What broke, who saw it, what was done.]
## User impact
[Specific. Who, how many, what they couldn't do, what they
lost. If no user impact, name that explicitly.]
## Timeline (from on-call's during-the-incident notes)
T+00:00 [Alert / first signal]
T+00:02 [Acknowledge]
T+00:04 [Diagnose action]
T+00:06 [Mitigation action]
T+00:08 [Recovery signal observed]
T+00:09 [Internal comm]
T+00:14 [Client comm if relevant]
T+00:30 [Watch closes]
## Chain-level walk
For each level, did this level produce or fail to catch the
incident? Be honest. *No* is a valid answer.
L1 Strategy: [yes/no, with one-line reason]
L2 Discovery: [yes/no]
L3 Scope: [yes/no]
L4 Execution: [yes/no]
L5 Operation: [yes/no — usually yes; the system responded]
## Structural cause
[The level that most produced the incident, in one paragraph.
Specific. Not "we need to be more careful". Trace.]
## One chain-level fix
Fix: [Specific, structural, at the level the cause
traces to. Not "more monitoring".]
Owner: [Named person]
Dated: [Specific date when this is in production / merged]
Testable: [Observable that confirms the fix landed]
## Runbook update
[The runbook(s) used during the incident. What worked. What
didn't. Edit them today; don't wait for a follow-up cycle.]
- RB-NNN: [edits made / scheduled]
## ADRs implicated
[Any ADRs the incident touched. Are any to be re-opened?]
- ADR-NNN: [status — still standing / drifting / re-open]
## What we are NOT doing
[Pre-empt the "we should also…" thread. Name the symptom
fixes the team will NOT do because they don't address
the chain level.]
## Sign-off
TL: [Name] · [date]
On-call: [Name] · [date]
PO: [Name] · [date]
Leadership reviewed: [Name] · [date]Worked example — the JWT outage postmortem
markdown
# Postmortem · 2026-04-22 · JWT outage
Severity: P0
Duration: 09:42 → 10:26 (44 min)
Users affected: 12,400 (all API consumers)
Author: Esti (TL)
## Timeline (recorded during)
09:38 Gateway policy deployed to all envs simultaneously
09:42 First 403 errors observed
09:43 PagerDuty fires — Esti commands, Maya investigates, Alex comms
09:50 Root cause identified — policy requires claim tokens don't have
09:56 Revert initiated
10:01 Revert complete, error rate falling
10:26 Status page resolved; war room archived
## What was missed (chain levels)
L1 Strategy: No.
L2 Discovery: No.
L3 Scope: No — change shaped trivially.
L4 Execution: YES — pipeline validated XML syntax not runtime behaviour.
L5 Operation: YES — deploy went to all envs at once; no staging soak.
## 5 Whys
Why all calls fail? Policy required a claim tokens didn't have.
Why wasn't this caught? Pipeline validated XML syntax, not runtime.
Why no staging first? Pipeline deploys to all envs in one step.
Why? Gateway policies were historically low-risk.
Systemic cause: changes that can cause production outages need
the same deployment rigour as application code.
## Structural fix (ONE)
Owner: Esti (TL)
Date: 2026-05-06 (release-gate update)
Test: Token-compatibility smoke test in pipeline; zero
gateway-policy bypasses in next release.
## Labels
severity/critical · impact/blocker · P0 · type/security ·
area/auth · root-cause/configuration · Chain: integration-gapWhere this lives in your project
Postmortems live in the repo, alongside runbooks — docs/postmortems/ or similar. Numbered sequentially. Searchable by date, severity, service, and chain level. Linked from the runbook used and from the ADRs implicated.
What to do if a section resists
| Resistance | What it means | Where to go |
|---|---|---|
| Timeline is being reconstructed | The during-the-incident write was skipped | This is itself the chain-level finding |
| All five levels say yes | The team is over-attributing | Be honest. Usually 1–2 levels actually produced the incident. |
| Fix is "more monitoring" | The L5 reflex | Push one level up. What at L3 or L4 would prevent the next incident? |
| Three or five fixes | The team noticed many things | One. The rest go to retro or next-postmortem if recurring. |
| Names of people in the fix | Drifting into blame | The fix is at a level. Rewrite. |
See also
- Practice — Postmortem
- Checklist — Postmortem · timeline-during-not-after
- Canon — After We Build · Incidents & Postmortems
- Clinic — A postmortem that produced a feeling
- Template — Runbook