Runbook template
Copy-paste skeleton. Reads at 2am, in adrenaline, in 30 seconds. The reader can act before they finish reading.
How to use
Runbooks are not documentation. They are operational scripts for the worst moment a service can have. Cut every word that does not help the on-call act. If a teammate cannot read this cold and act, rewrite. If a runbook has not been used in 6 months, either delete it or run a game day to verify it still works.
text
# Runbook — [Symptom that triggers this runbook]
Owner: [Team / rotation]
Severity: SEV-1 | SEV-2 | SEV-3
Last tested: YYYY-MM-DD (game day or real incident)
Linked ADRs: [ADR-NNN, ADR-MMM]
## When to use this runbook
[The exact symptom or alert that means "open this runbook now".
One sentence. Specific enough that an on-call paging in at 2am
knows this is the right file. Example: "name.display.fallback
rate >5% for 5 minutes."]
## Diagnose (≤2 minutes)
1. Open dashboard: [link]
2. Check: [first thing to verify]
- If [condition]: go to *Mitigate*.
- If [other condition]: this is a different runbook — open
[link to other runbook].
## Mitigate (≤5 minutes)
The action that reduces user impact NOW. Often a flag flip,
a config change, or scaling.
1. [Action 1 — exact command or UI step]
2. [Action 2]
3. Verify: [observable that confirms mitigation worked]
If mitigation does not work in 5 minutes, declare incident and
follow [Incident response runbook](/runbooks/incident).
## Communicate
- Internal: [where to post status · which channel]
- Client (if SEV-1): [contact · timing · who writes]
- On-call handoff (if shift change): [what to hand over]
## Resolve (after mitigation)
[Steps to fully resolve, not just mitigate. Often takes longer
and is less time-pressured.]
## Postmortem trigger
This runbook firing produces a postmortem if any of:
- SEV-1, regardless of duration.
- SEV-2 lasting >30 minutes.
- Repeat within 30 days of the previous firing.
Postmortem template: [/templates/postmortem]
## See also
- Related runbook: [link]
- Linked ADR: [ADR-NNN]
- Dashboard: [link]
Worked example — the runbook on-call actually reads
markdown
# Runbook · Grading submission p95 > 800ms for 5 min
Trigger: Grading-API p95 latency > 800ms sustained 5 min
Severity: P1
Linked SLO: Grading.submit p95 < 800ms (99%)
Last tested: 2026-05-10 (Esti, game day)
## Diagnose (≤ 2 min)
1. Open dashboard: /d/grading-overview
2. Check three plots: latency-by-region · DB-write-time · queue-depth
3. Decide:
- If DB-write spikes → DB issue → step A
- If queue-depth climbs → consumer slow → step B
- If region-specific → network → step C
## Mitigate (≤ 5 min)
A · DB issue
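- `kubectl exec ... -- psql ... -c 'select pid, state, wait_event_type, query_start, query from pg_stat_activity order by query_start'`
- If long-running query: cancel it with `select pg_cancel_backend(<pid>)`; if locks: alert DB owner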
B · Consumer slow
- Scale grading-worker deployment: `kubectl scale deployment/grading-worker --replicas=10`
C · Network
- Page network on-call; route around the affected region
- Set flag `grading.failover-region: true`
## Resolve
- Confirm p95 back below 800ms for 10 min
- Update status page to "monitoring"
- File bug if root cause is code/data
- 24h later: status page resolved + postmortem if it exceeded 30 min
## See also
- Related runbook: ./grading-error-rate-high.md
- Linked ADR: ADR-014 (idempotency keys)
- Dashboard: https://grafana/d/grading-overview
Where this lives in your project
Runbooks live in the repo alongside the service code, typically runbooks/ or ops/runbooks/. Each runbook is one file. No two runbooks should cover the same scenario — if two do, merge or kill. The on-call's home page lists the runbooks alphabetically by triggering symptom.
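The one-scenario-per-runbook rule and the alphabetical home page can both be derived from the files themselves. The sketch below is one possible approach, not an established tool: it assumes every runbook starts with a `# Runbook` title line followed by a separator and the triggering symptom, as in the skeleton and the worked example. `trigger_of` and `build_index` are hypothetical helper names.

```python
from collections import defaultdict
import re

# Assumed title convention: "# Runbook", a separator, then the symptom.
# The skeleton separates with a dash, the worked example with a middle dot.
TITLE = re.compile(r"^#\s*Runbook\s*[—·:-]\s*(.+?)\s*$")


def trigger_of(text):
    """Return the triggering symptom from the first title line, or None."""
    for line in text.splitlines():
        match = TITLE.match(line)
        if match:
            return match.group(1)
    return None


def build_index(runbooks):
    """Take {path: file text}; return (alphabetical index, duplicate triggers)."""
    by_trigger = defaultdict(list)
    for path, text in runbooks.items():
        trigger = trigger_of(text) or f"(no title) {path}"
        # Compare case-insensitively so near-identical titles collide.
        by_trigger[trigger.casefold()].append((trigger, path))

    index = sorted(
        (entry for entries in by_trigger.values() for entry in entries),
        key=lambda entry: entry[0].casefold(),
    )
    duplicates = {
        key: sorted(path for _, path in entries)
        for key, entries in by_trigger.items()
        if len(entries) > 1
    }
    return index, duplicates
```

To run it against a checkout, build the input with something like `{str(p): p.read_text() for p in Path("runbooks").glob("*.md")}` (with `Path` from `pathlib`). Any key in `duplicates` is a candidate for merge or kill. Matching on exact titles only catches identical wording; two runbooks that describe the same alert differently still need a human read.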
What to do if a section resists
| Resistance | What it means | Where to go |
|---|---|---|
| Cannot name the triggering symptom precisely | The alert is too generic — re-tune the alert before writing the runbook | |
| Diagnose takes longer than 2 minutes | The runbook is doing too much — split it | |
| Mitigate requires running a script that doesn't exist | The mitigation is not actually available — write the script first | |
| Last-tested date is unknown | The runbook is a fiction — run a game day or delete | |
| Two runbooks cover the same trigger | Operations have drifted; consolidate | |
| The runbook is 5 pages | It does not read at 2am. Cut. | |
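The "Last-tested date is unknown" row is the easiest one to check mechanically. Here is a minimal sketch, assuming each runbook carries the `Last tested: YYYY-MM-DD` line from the skeleton and treating the 6-month rule as roughly 183 days. `last_tested_status` and `MAX_AGE_DAYS` are hypothetical names, and the threshold is an assumption to tune.

```python
from datetime import date
import re

MAX_AGE_DAYS = 183  # assumption: "6 months" rounded to days
LAST_TESTED = re.compile(r"^Last tested:\s*(\d{4}-\d{2}-\d{2})", re.MULTILINE)


def last_tested_status(text, today):
    """Classify a runbook as 'unknown', 'stale', or 'ok' by its Last tested line."""
    match = LAST_TESTED.search(text)
    if match is None:
        # Line is missing, or still holds the unfilled YYYY-MM-DD placeholder.
        return "unknown"
    try:
        tested = date.fromisoformat(match.group(1))
    except ValueError:
        # Digits in the right shape, but not a real calendar date.
        return "unknown"
    return "stale" if (today - tested).days > MAX_AGE_DAYS else "ok"
```

Calling `last_tested_status(text, date.today())` for every file gives a list to act on: "unknown" and "stale" both mean run a game day or delete. A recent date only shows the runbook was exercised, not that it passed, so the check complements a game day rather than replacing one.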
See also
- Canon — As We Build · Runbooks & Rollback
- Practice — First 48 hours watch — what calls a runbook
- Skill path — On-call foundations · Step 9
- Clinic — A flag that never got cleaned up — a runbook-adjacent failure