template · postmortem

Postmortem template

Copy-paste skeleton. Bound to one incident. Blameless. Walks the chain levels. Lands one fix — owned, dated, testable.

How to use

The timeline below is the on-call's during-the-incident notes — pasted, not rewritten. The chain-levels walk is the discipline. More monitoring is rarely the right level; push at least one level up.

text

# Postmortem — [one-line title naming the symptom]

Incident ID:     [INC-NNN]
Severity:        SEV-1 | SEV-2 | SEV-3
Date:            YYYY-MM-DD
Duration:        [start → end]
Postmortem date: YYYY-MM-DD (within 72h of incident)
Driver:          [Name (role) — usually TL]
Attendees:       [Trio + on-call who held the bridge + relevant
                  developer]

## Summary
[Two sentences. What broke, who saw it, what was done.]

## User impact
[Specific. Who, how many, what they couldn't do, what they
 lost. If no user impact, name that explicitly.]

## Timeline   (from on-call's during-the-incident notes)
T+00:00  [Alert / first signal]
T+00:02  [Acknowledge]
T+00:04  [Diagnose action]
T+00:06  [Mitigation action]
T+00:08  [Recovery signal observed]
T+00:09  [Internal comm]
T+00:14  [Client comm if relevant]
T+00:30  [Watch closes]

## Chain-level walk
For each level, did this level produce or fail to catch the
incident? Be honest. *No* is a valid answer.

L1 Strategy:    [yes/no, with one-line reason]
L2 Discovery:   [yes/no]
L3 Scope:       [yes/no]
L4 Execution:   [yes/no]
L5 Operation:   [yes/no — usually yes; the system responded]

## Structural cause
[The level that most produced the incident, in one paragraph.
 Specific. Not "we need to be more careful". Trace.]

## One chain-level fix
Fix:        [Specific, structural, at the level the cause
             traces to. Not "more monitoring".]
Owner:      [Named person]
Dated:      [Specific date when this is in production / merged]
Testable:   [Observable that confirms the fix landed]

## Runbook update
[The runbook(s) used during the incident. What worked. What
 didn't. Edit them today; don't wait for a follow-up cycle.]
- RB-NNN: [edits made / scheduled]

## ADRs implicated
[Any ADRs the incident touched. Are any to be re-opened?]
- ADR-NNN: [status — still standing / drifting / re-open]

## What we are NOT doing
[Pre-empt the "we should also…" thread. Name the symptom
 fixes the team will NOT do because they don't address
 the chain level.]

## Sign-off
TL:        [Name] · [date]
On-call:   [Name] · [date]
PO:        [Name] · [date]
Leadership reviewed: [Name] · [date]

Worked example — the JWT outage postmortem

markdown

# Postmortem · 2026-04-22 · JWT outage

Severity:     P0
Duration:     09:42 → 10:26 (44 min)
Users affected: 12,400 (all API consumers)
Author:       Esti (TL)

## Timeline (recorded during)
09:38  Gateway policy deployed to all envs simultaneously
09:42  First 403 errors observed
09:43  PagerDuty fires — Esti commands, Maya investigates, Alex comms
09:50  Root cause identified — policy requires claim tokens don't have
09:56  Revert initiated
10:01  Revert complete, error rate falling
10:26  Status page resolved; war room archived

## What was missed (chain levels)
L1 Strategy:    No.
L2 Discovery:   No.
L3 Scope:       No — change shaped trivially.
L4 Execution:   YES — pipeline validated XML syntax not runtime behaviour.
L5 Operation:   YES — deploy went to all envs at once; no staging soak.

## 5 Whys
Why all calls fail?            Policy required a claim tokens didn't have.
Why wasn't this caught?        Pipeline validated XML syntax, not runtime.
Why no staging first?          Pipeline deploys to all envs in one step.
Why?                           Gateway policies were historically low-risk.

Systemic cause: changes that can cause production outages need
the same deployment rigour as application code.

## Structural fix (ONE)
Owner:     Esti (TL)
Date:      2026-05-06 (release-gate update)
Test:      Token-compatibility smoke test in pipeline; zero
           gateway-policy bypasses in next release.

## Labels
severity/critical · impact/blocker · P0 · type/security ·
area/auth · root-cause/configuration · Chain: integration-gap

Where this lives in your project

Postmortems live in the repo, alongside runbooks — docs/postmortems/ or similar. Numbered sequentially. Searchable by date, severity, service, and chain level. Linked from the runbook used and from the ADRs implicated.

What to do if a section resists

Resistance	What it means	Where to go
Timeline is being reconstructed	The during-the-incident write was skipped	This is itself the chain-level finding
All five levels say yes	The team is over-attributing	Be honest. Usually 1–2 levels actually produced the incident.
Fix is "more monitoring"	The L5 reflex	Push one level up. What at L3 or L4 would prevent the next incident?
Three or five fixes	The team noticed many things	One. The rest go to retro or next-postmortem if recurring.
Names of people in the fix	Drifting into blame	The fix is at a level. Rewrite.

Postmortem template ​

Worked example — the JWT outage postmortem ​

Where this lives in your project ​

What to do if a section resists ​

See also ​

Postmortem template

Worked example — the JWT outage postmortem

Where this lives in your project

What to do if a section resists

See also