
Runbook template

Copy-paste skeleton. Reads at 2am, on adrenaline, in 30 seconds. The reader can act before they finish reading.

How to use

Runbooks are not documentation. They are operational scripts for the worst moment a service can have. Cut every word that does not help the on-call act. If a teammate cannot read this cold and act, rewrite. If a runbook has not been used in 6 months, either delete it or run a game day to verify it still works.
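The six-month rule is easy to automate. The following is a minimal sketch, not an established tool: it assumes each runbook carries a `Last tested:  YYYY-MM-DD` header line as in the template, and the names `last_tested`, `is_stale`, and the 183-day cutoff are illustrative choices.

```python
import re
from datetime import date
from typing import Optional

# Assumption: the header line looks like "Last tested:  2026-05-10 (...)".
LAST_TESTED = re.compile(r"^Last tested:\s*(\d{4}-\d{2}-\d{2})", re.MULTILINE)
MAX_AGE_DAYS = 183  # roughly 6 months; the exact cutoff is an assumption


def last_tested(runbook_text: str) -> Optional[date]:
    """Return the Last tested date, or None if missing, placeholder, or invalid."""
    match = LAST_TESTED.search(runbook_text)
    if match is None:
        return None
    try:
        return date.fromisoformat(match.group(1))
    except ValueError:
        return None


def is_stale(runbook_text: str, today: date) -> bool:
    """True when the runbook is due for a game day or deletion."""
    tested = last_tested(runbook_text)
    return tested is None or (today - tested).days > MAX_AGE_DAYS
```

A check like this could run in CI and fail the build for any runbook whose date is missing or too old, which forces the "delete or game day" decision instead of letting it drift.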

text

# Runbook — [Symptom that triggers this runbook]

Owner:        [Team / rotation]
Severity:     SEV-1 | SEV-2 | SEV-3
Last tested:  YYYY-MM-DD (game day or real incident)
Linked ADRs:  [ADR-NNN, ADR-MMM]

## When to use this runbook
[The exact symptom or alert that means "open this runbook now".
 One sentence. Specific enough that an on-call paging in at 2am
 knows this is the right file. Example: "name.display.fallback
 rate >5% for 5 minutes."]

## Diagnose (≤2 minutes)
1. Open dashboard: [link]
2. Check: [first thing to verify]
   - If [condition]: go to *Mitigate*.
   - If [other condition]: this is a different runbook — open
     [link to other runbook].

## Mitigate (≤5 minutes)
The action that reduces user impact NOW. Often a flag flip,
a config change, or scaling.

1. [Action 1 — exact command or UI step]
2. [Action 2]
3. Verify: [observable that confirms mitigation worked]

If mitigation does not work in 5 minutes, declare incident and
follow [Incident response runbook](/runbooks/incident).

## Communicate
- Internal: [where to post status · which channel]
- Client (if SEV-1): [contact · timing · who writes]
- On-call handoff (if shift change): [what to hand over]

## Resolve (after mitigation)
[Steps to fully resolve, not just mitigate. Often takes longer
 and is less time-pressured.]

## Postmortem trigger
This runbook firing produces a postmortem if any of:
  - SEV-1, regardless of duration.
  - SEV-2 lasting >30 minutes.
  - Repeat within 30 days of the previous firing.

Postmortem template: [/templates/postmortem]

## See also
- Related runbook: [link]
- Linked ADR: [ADR-NNN]
- Dashboard:   [link]
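
The postmortem trigger rules in the template are mechanical enough to encode. This is a hedged sketch only: the function name and parameters are hypothetical, and reading "within 30 days" as "30 days or fewer" is an assumption.

```python
from typing import Optional


def needs_postmortem(
    severity: str,
    duration_minutes: float,
    days_since_previous_firing: Optional[int] = None,
) -> bool:
    """Apply the three postmortem triggers from the template."""
    # SEV-1 always produces a postmortem, regardless of duration.
    if severity == "SEV-1":
        return True
    # SEV-2 produces one only when it lasted longer than 30 minutes.
    if severity == "SEV-2" and duration_minutes > 30:
        return True
    # Any severity: a repeat within 30 days of the previous firing.
    # None means this runbook has not fired before.
    return (
        days_since_previous_firing is not None
        and days_since_previous_firing <= 30
    )
```

For example, `needs_postmortem("SEV-3", 5, days_since_previous_firing=12)` is true because of the repeat rule, even though the incident itself was minor.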

Worked example — the runbook on-call actually reads

markdown

# Runbook · Grading submission p95 > 800ms for 5 min

Trigger:        Grading-API p95 latency > 800ms sustained 5 min
Severity:       P1
Linked SLO:     Grading.submit p95 < 800ms (99%)
Last tested:    2026-05-10 (Esti, game day)

## Diagnose (≤ 2 min)
1. Open dashboard:    /d/grading-overview
2. Check three plots: latency-by-region · DB-write-time · queue-depth
3. Decide:
   - If DB-write spikes → DB issue → step A
   - If queue-depth climbs → consumer slow → step B
   - If region-specific  → network → step C

## Mitigate (≤ 5 min)
A · DB issue
   - `kubectl exec ... -- psql ... -c 'select * from pg_stat_activity'`
   - If long-running query: cancel; if locks: alert DB owner
B · Consumer slow
   - Scale grading-worker deployment: `kubectl scale deployment/grading-worker --replicas=10`
C · Network
   - Page network on-call; route around the affected region
   - Set flag `grading.failover-region: true`

## Resolve
- Confirm p95 back below 800ms for 10 min
- Update status page to "monitoring"
- File bug if root cause is code/data
- 24h later: set status page to "resolved"; write a postmortem if the incident exceeded 30 min

## See also
- Related runbook: ./grading-error-rate-high.md
- Linked ADR: ADR-014 (idempotency keys)
- Dashboard: https://grafana/d/grading-overview
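
Both the trigger ("p95 > 800ms sustained 5 min") and the resolve check ("back below 800ms for 10 min") in the worked example are sustained-threshold conditions. A minimal sketch of that logic follows. It assumes one p95 sample per minute, ordered oldest first; the function names are illustrative and do not reflect any particular alerting system's API.

```python
from typing import Sequence

THRESHOLD_MS = 800  # from the worked example's SLO


def sustained_above(samples: Sequence[float], minutes: int,
                    threshold_ms: float = THRESHOLD_MS) -> bool:
    """True if the most recent `minutes` samples all exceed the threshold."""
    if minutes <= 0 or len(samples) < minutes:
        return False  # not enough data to call it sustained
    return all(value > threshold_ms for value in samples[-minutes:])


def sustained_below(samples: Sequence[float], minutes: int,
                    threshold_ms: float = THRESHOLD_MS) -> bool:
    """True if the most recent `minutes` samples are all under the threshold."""
    if minutes <= 0 or len(samples) < minutes:
        return False
    return all(value < threshold_ms for value in samples[-minutes:])


def should_page(p95_per_minute: Sequence[float]) -> bool:
    """Trigger: p95 above 800ms for 5 consecutive minutes."""
    return sustained_above(p95_per_minute, minutes=5)


def is_recovered(p95_per_minute: Sequence[float]) -> bool:
    """Resolve check: p95 below 800ms for 10 consecutive minutes."""
    return sustained_below(p95_per_minute, minutes=10)
```

Requiring a full window in both directions is the design point: a single good minute does not count as recovery, and a single spike does not page anyone.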

Where this lives in your project

Runbooks live in the repo alongside the service code, typically runbooks/ or ops/runbooks/. Each runbook is one file. No two runbooks should cover the same scenario — if two do, merge or kill. The on-call's home page lists the runbooks alphabetically by triggering symptom.
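
The alphabetical index and the "no two runbooks for the same scenario" rule can both be derived from the files themselves. The sketch below makes several assumptions: runbooks are `.md` files, each starts with a `# Runbook` heading followed by a separator and the triggering symptom (as in the template and the worked example), and identical headings are treated as a crude proxy for "same scenario". All function names are hypothetical.

```python
import re
from collections import defaultdict
from pathlib import Path
from typing import Dict, List, Optional, Tuple

# Matches "# Runbook <sep> <symptom>"; separator is an em dash, middle dot,
# colon, or hyphen.
HEADING = re.compile(r"^#\s*Runbook\s*[\u2014\u00b7:-]\s*(.+)$", re.MULTILINE)


def trigger_of(text: str) -> Optional[str]:
    """Extract the triggering symptom from the runbook heading, if present."""
    match = HEADING.search(text)
    return match.group(1).strip() if match else None


def build_index(runbooks: Dict[str, str]) -> List[Tuple[str, str]]:
    """(trigger, filename) pairs sorted alphabetically by trigger."""
    entries = []
    for name, text in runbooks.items():
        trigger = trigger_of(text)
        if trigger:
            entries.append((trigger, name))
    return sorted(entries, key=lambda entry: entry[0].casefold())


def duplicate_triggers(runbooks: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each trigger claimed by more than one file to those filenames."""
    by_trigger = defaultdict(list)
    for name, text in runbooks.items():
        trigger = trigger_of(text)
        if trigger:
            by_trigger[trigger.casefold()].append(name)
    return {t: sorted(names) for t, names in by_trigger.items() if len(names) > 1}


def load_runbooks(directory: str) -> Dict[str, str]:
    """Read every .md file in a directory such as runbooks/ into memory."""
    paths = sorted(Path(directory).glob("*.md"))
    return {path.name: path.read_text(encoding="utf-8") for path in paths}
```

Calling `duplicate_triggers(load_runbooks("runbooks"))` in CI would surface candidates to merge or kill; it will not catch two runbooks that describe the same scenario in different words.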

What to do if a section resists

Resistance	What it means	Where to go
Cannot name the triggering symptom precisely	The alert is too generic	Re-tune the alert before writing the runbook
Diagnose takes longer than 2 minutes	The runbook is doing too much	Split it
Mitigate requires running a script that doesn't exist	The mitigation is not actually available	Write the script first
Last-tested date is unknown	The runbook is a fiction	Run a game day or delete
Two runbooks cover the same trigger	Operations have drifted	Consolidate
The runbook is 5 pages	It does not read at 2am	Cut
