Skip to content

A postmortem that produced a feeling

The incident was SEV-1. The team met. People shared what happened. Everyone left feeling closer. Two weeks later, the same incident fired. The postmortem produced a feeling, not a fix.

The artefact

Excerpt — Postmortem, INC-058 "Grader queue down for 22 minutes", April 2026

Summary: On Monday morning, the grader queue was unavailable for 22 minutes. We responded quickly and restored service.

What went well:

  • Strong team response.
  • Good communication during the incident.
  • Customer was understanding.

What didn't go well:

  • We should have caught this earlier.
  • Communication could have been faster.
  • We need to be more careful with releases.

Action items:

  • We will be more vigilant.
  • We will improve our release process.
  • Leadership will review on-call coverage.
  • The team will discuss this in the next retro.

Lessons learned:

  • Be careful with release timing.
  • Communicate proactively.
  • Trust the on-call rotation.

The postmortem ran 90 minutes. Eight people attended. The notes are honest, in their way. Everyone left feeling that the team had grown closer.

Two weeks later, the same scenario fired again. Same root cause. Same 22 minutes of downtime. The on-call paged through to the same runbook. The same actions worked. The same feelings of closer team followed.

A month after that, leadership asked: what changed after the first postmortem? Nobody could name a specific change. Nothing landed because nothing was named at a chain level.

What's wrong?

Stop. Find three things wrong before reading the diagnosis.

Diagnosis (open when ready)

1. The timeline is missing

The postmortem has Summary, What went well, What didn't go well — but no timeline. The corpus rule from Practice · Postmortem · Step 2: the timeline was written during the incident by the on-call, and the postmortem starts by reading it.

Without the timeline:

  • The chain-level walk has no anchor. The team can't say at T+4 we noticed X — they can only say somebody noticed something.
  • The "what didn't go well" drifts to feelings. We should have caught this earlier — when, specifically? In which minute of the timeline was the signal available?
  • The fix becomes generic. We need to be more careful is the L5 reflex disguised as commitment.

If the timeline is missing, that itself is the postmortem's most important finding: the on-call did not write during the incident. The first chain-level fix is the during-the-incident write discipline.

2. There is no chain-level walk

The postmortem has what went well and what didn't go well — but no L1/L2/L3/L4/L5 walk. The team did not ask, for each level, did this level produce or fail to catch the incident?

Walking the levels for this incident would have surfaced:

  • L1 (Strategy): no — operational signal.
  • L2 (Discovery): no — the cycle didn't fail to anticipate this; it's a known failure mode.
  • L3 (Scope): yes. The cycle that introduced the cold-cache risk did not include a warming story. The warming story was deprioritised three cycles ago. Each cycle since has paid interest in this risk.
  • L4 (Execution): no — the code is correct.
  • L5 (Operation): yes, the response. The mitigation worked. But the structural cause is upstream.

The chain-level walk turns we need to be more careful into the warming story has been deprioritised three cycles in a row; the L3 fix is to commit the warming story this cycle. That is testable. That can land.

3. The fixes are feelings dressed as action items

We will be more vigilant.We will improve our release process.Leadership will review on-call coverage.The team will discuss this in the next retro.

None of these is owned, dated, or testable. They are the language of feeling, not of structural change. Compare to a corpus-compliant fix:

text
Chain-level fix (L3):
  Cold-cache warming job ships in the current cycle.
  Story re-prioritised to top of sprint.
  Owner:    the senior dev.
  Dated:    Live in production by 2026-05-30.
  Testable: queue.render.p95 <500ms during the next
            scheduled cold-start event (Sunday window).

One fix. Owned. Dated. Testable. The next incident is unlikely to recur because the structural cause was addressed.

The four action items in the artefact also fail because there are four of them. The corpus's discipline: one chain-level fix. The rest is either notes for the retro, or unfocused energy that produces no change.

The fix

The same conversation, run with the corpus's discipline:

text
# Postmortem — INC-058 "Grader queue down for 22 minutes"

Severity:        SEV-1
Date:            2026-04-10
Duration:        22 minutes (09:00–09:22 UTC)
Postmortem date: 2026-04-11 (within 24h)
Driver:          the TL

## Summary
Cold-cache cascade on Monday morning load spike. Queue
unavailable for 22 minutes. Mitigation: scaled read
replicas; queue recovered.

## User impact
~140 graders (full grading population) saw queue errors
during a 22-minute window 09:00-09:22 UTC. 23 incomplete
grade submissions retried successfully after recovery.
No data loss.

## Timeline
T+00:00  Page: queue.error_rate >50% for 60s.
T+00:01  the senior dev acks. Opens RB-021.
T+00:03  RB-021 step 1 not applicable (this is cold-cache,
         not DB).
T+00:08  the senior dev consults the TL. Decides on read-replica scale.
T+00:11  Scale-up runs. Takes 4 min.
T+00:15  Queue recovery starts.
T+00:22  p95 back to normal. Alert clears.

## Chain-level walk
L1 Strategy:    no
L2 Discovery:   no — known failure mode
L3 Scope:       YES. Cold-cache warming story has been
                deprioritised in cycles Q1, Q2, Q3.
                Each cycle increased the risk window.
L4 Execution:   no
L5 Operation:   yes — RB-021 needed an "if cold cache"
                branch.

## Structural cause
The cold-cache warming story has been deprioritised in
favour of feature work for three consecutive cycles.
The risk is known; it was named in the relevant TDB.
The current incident is the materialisation of accepted
risk — but the acceptance has never been re-evaluated.

## One chain-level fix
Fix:      Cold-cache warming job ships this cycle. Story
          re-prioritised to top of current sprint.
Owner:    the senior dev
Dated:    Live in production by 2026-05-30.
Testable: queue.render.p95 <500ms during the next
          scheduled cold-start event (Sunday window).

## Runbook update
RB-021: added "Cold-cache cascade" branch. Mitigation
sequence updated; clear if-this-then-that.

## ADRs implicated
ADR-018 (read-replica auto-scaling): still standing.
Implicit assumption "we will eventually ship warming" is
no longer implicit; warming is named in current cycle.

## What we are NOT doing
- Adding tighter alert thresholds (this would page
  earlier on the same incident; L5 patch, doesn't
  prevent recurrence).
- Adding a "release timing best practices" doc (L5 reflex
  disguised as L1).

## Sign-off
TL:        the TL · 2026-04-11
On-call:   the senior dev · 2026-04-11
PO:        Alex · 2026-04-11
Leadership reviewed: Leadership · 2026-04-12

The postmortem ran 60 minutes. The fix landed on schedule. The next cold-start event tested it; p95 stayed under 500ms. The structural cause was addressed.

Where this comes from in the chain

This failure traces to Operation (Level 5) — the postmortem itself is an operational practice. When the practice degrades to feelings, every other level pays interest.

The structural fix is in the postmortem template + the chain-level walk discipline. The postmortem checklist is the gate. The timeline-during-not-after line is the deepest gate.

See also

200apps · How We Work · NWIRE