Skip to content

A runbook that doesn't run

The page fired at 02:14. The on-call opened the runbook. By 02:21, they were running a different mitigation than the one written. The runbook had been authored a year ago against a system that had drifted; it described actions for buttons that no longer existed and metrics that were never wired. The on-call solved the incident on instinct and the runbook stayed in the docs, unchanged, looking helpful.

The artefact

Excerpt — RB-018, "Grader queue 5xx spike", authored 2025-03-12

Severity: SEV-2 Owner: queue-platform team Last tested: 2025-04-01

When to use this runbook: Grader queue returning 5xx errors at >5% for >2 minutes.

Diagnose (≤2 minutes):

  1. Open the Grafana board: grader-platform / queue-health.
  2. Check the queue.upstream.latency panel.
    • If >800ms: this is an upstream issue. Go to RB-019.
    • If <800ms: this is a queue-internal issue. Continue.

Mitigate (≤5 minutes):

  1. Open the queue admin UI at /admin/queue/scale.
  2. Click "Scale up replicas — emergency".
  3. Verify the green banner appears in the admin UI.
  4. Watch the queue.error_rate panel return to <1% within 90 seconds.

Communicate:

  • Internal: post in #engineering.
  • Client (if SEV-1): not applicable.

The runbook looks complete. Severity is marked. Owner is named. Steps are numbered. The on-call who paged in at 02:14 opened it.

By 02:18 they had found:

  • The Grafana board grader-platform / queue-health was deprecated six months ago when the team migrated dashboards. There is now a different board, named differently, with the panel queue.upstream.latency renamed to queue.dependency.latency_p95 — but the runbook still names the old one.
  • The queue admin UI at /admin/queue/scale no longer has a "Scale up replicas — emergency" button. After the 2025-Q4 refactor, scaling moved to a Terraform-driven config repo. The admin UI now shows a read-only state.
  • The queue.error_rate panel exists but reads from a different metric source than the alert that fired — the alert is on queue.error_rate.5m_rate and the panel is on queue.error_rate.30s_rate. They diverge in the first 90 seconds of any incident.

The on-call ran a Terraform apply by memory of the prior incident. The queue recovered. The postmortem was clean. The runbook stayed unchanged in the docs.

Three months later, a different on-call paged on the same alert at 03:40 and opened RB-018 cold. They followed the steps, hit the read-only admin UI, lost five minutes, then escalated to a TL who happened to be on shift across timezones. The incident ran 23 minutes instead of 8.

What's wrong?

Stop. Find three things wrong before reading the diagnosis.

Diagnosis (open when ready)

1. The last tested date is older than the system

Last tested: 2025-04-01

The runbook was authored 2025-03-12 and last tested 2025-04-01 — that's the entry where the discipline should have died and didn't. Twelve months since last test. During those twelve months: the Grafana migration happened, the admin UI was refactored, and the metric alerting changed.

The corpus rule from Template · Runbook: if a runbook has not been used in 6 months, either delete it or run a game day to verify it still works. RB-018 fell into neither bucket — it was kept "in case", and nobody ran a game day.

A senior on-call walks the runbook library quarterly, looking at last tested dates. Any runbook >6 months untested is either tested, killed, or merged with a current one. The discipline is the same as flag cleanup: operational artefacts that outlive their context become hidden traps.

2. The runbook describes a dashboard that no longer exists by that name

Open the Grafana board: grader-platform / queue-health.

This is the most-common silent failure. When the platform team migrated dashboards in 2025-Q3, they updated the current dashboards in the documentation. They did not walk the runbook library checking which runbooks named which old dashboards.

The fix is structural, not editorial. A runbook should reference dashboards by URL (or stable identifier), not by name. When the dashboard moves, the URL update is one PR. When the dashboard is referenced by name in a dozen runbooks, the rename is a dozen forgettable edits.

Better yet: link to a dashboards.md index in the same docs folder. Renaming a dashboard touches one file; all runbooks resolve correctly.

3. The runbook describes a button that no longer exists

Click "Scale up replicas — emergency" in the admin UI.

This is the worst-case failure. The mitigation step references a UI affordance that was removed. The on-call at 02:18 has no graceful path — the runbook tells them to click a button that isn't there. The only path is to abandon the runbook.

When the admin UI was refactored in 2025-Q4, the team that did the refactor:

  • Updated the internal architecture docs.
  • Updated the deploy playbook.
  • Did not update the runbooks that referenced the old UI.

There is a structural fix: any UI or CLI change that removes a documented operational affordance triggers a runbook audit before merge. The PR template includes the question: does this remove or change an operational entry point that a runbook references? The answer determines whether the PR ships alone or with runbook updates.

In RB-018's case, the 2025-Q4 refactor PR would have surfaced the dependency and either updated RB-018 in the same PR — or killed it as no-longer-applicable.

4. (Bonus) The alert and the panel reference different metrics

Watch the queue.error_rate panel return to <1% within 90 seconds.

The alert fires on queue.error_rate.5m_rate. The runbook tells the on-call to watch queue.error_rate.30s_rate. In the first 90 seconds of any incident, these will disagree. The on-call sees the 30s rate drop quickly and believes the runbook — clearing the alert mentally before the 5m alert clears in fact. They close the incident; the alert keeps firing in monitoring.

Runbooks should reference the same metric the alert references. If the alert is error_rate.5m_rate, the runbook watches error_rate.5m_rate. The discipline is about closing the same loop the alert opened.

The fix

text
# RB-018 — Grader queue 5xx spike

Severity:       SEV-2
Owner:          queue-platform team (current: Itai TL,
                rotation primary: see /ops/oncall-roster)
Last tested:    2026-04-22 (game day, SEV-2 simulation)
Linked ADRs:    ADR-029 (scaling moved to Terraform), ADR-018
                (read-replica auto-scaling)

## When to use this runbook
queue.error_rate.5m_rate >5% for >2 minutes. (This is the
alert; this runbook responds to it.)

## Diagnose (≤2 minutes)
1. Open dashboard: [Queue Health 2025-Q3](https://grafana.../d/abc123).
   (Linked by URL, not by name.)
2. Check panel "Dependency latency p95".
   - If >800ms: upstream issue. Open [RB-019](/runbooks/RB-019).
   - If <800ms: queue-internal. Continue.

## Mitigate (≤5 minutes)
1. Apply Terraform scale-up override:
   - Repo: 200apps/infra-prod
   - File: services/grader-queue/scale.tf
   - Change `desired_count` from current to `desired_count * 2`
   - PR: tagged @oncall, auto-merge after CI green (~90s)
2. Verify on the SAME panel the alert reads:
   queue.error_rate.5m_rate returns to <1%.
   (NOT the 30s_rate panel — that drops faster but the alert
   reads the 5m_rate.)

If mitigation does not work in 5 minutes, declare incident
and follow [Incident response runbook](/runbooks/incident).

## Communicate
- Internal: #engineering with "RB-018 opened: [reason]"
- Client: not applicable for SEV-2 below 30m duration
- On-call handoff: if shift change, note current scale_count
  in handover

## Resolve (after mitigation)
1. Watch for 30 minutes that error_rate stays <1%.
2. After 30 minutes clean: revert Terraform scale-up
   (separate PR, normal review timing — not emergency).

## Postmortem trigger
SEV-2 lasting >30m, or repeat within 14 days.

## See also
- ADR-029 (Terraform scaling decision)
- ADR-018 (read-replica auto-scaling)
- Dashboards index: /ops/dashboards.md

Three structural changes:

  1. Dashboards referenced by URL. Renames don't break the runbook.
  2. The Terraform path replaces the dead UI button. Mitigation is currently-true.
  3. The runbook watches the same metric the alert reads. No first-90-seconds divergence.

And the operational change: Last tested updated and a game day run. When it next fires, the runbook works.

Where this comes from in the chain

This failure traces to Operation (Level 5). The runbook is an operational artefact; its decay is an operational failure. The structural fix is also operational:

  • Runbook audit cadence — quarterly, the on-call lead walks the runbook library against last tested dates.
  • PR template questiondoes this remove or change an operational entry point a runbook references?
  • Game day discipline — every runbook is exercised at least every 6 months, in staging or by simulation.

A senior on-call catches a decayed runbook in 60 seconds — the last tested date is the first read. Older than 6 months: don't trust until verified.

The deepest learning from this clinic: a runbook that says it's helpful but isn't is more dangerous than no runbook at all. No runbook forces the on-call to think structurally from the alert; a decayed runbook leads them down a path that doesn't exist. Kill decayed runbooks if you cannot afford to maintain them — silence is safer than a misleading map.

See also

200apps · How We Work · NWIRE