Skip to content

On-call foundations

A six-week, ten-step path. By the end of it you have shadowed a shift, walked every runbook, held a bridge, written a timeline during not after, and driven one postmortem to a chain-level fix. You are not a senior on-call. You are an on-call whose first shift left a record the next shift can read.

TIP

A skill is read → practice → check → reflect, anchored to your real cycle. Real shifts where possible; game days where not.

Mastery looks like

When you finish this path, you can:

  • Write an incident timeline during, not after.
  • Hold a bridge for two hours without the team's clarity decaying.
  • Use or write a runbook every shift.
  • Communicate to engineering, leadership, and client without anyone learning from anyone else first.
  • Hand a postmortem material it can read cold.

Self-rating before you start

1 — Never3 — Sometimes5 — Default
I write the timeline during the incident
I use or update a runbook every shift
Comms threads are honest and orienting
The postmortem can read my notes cold
I have stopped a bridge cleanly

Step 1 — Orient in the chain

Read: As We Build · Runbooks & Rollback · After We Build · The First 48 Hours · The Map.

Practice prompt: in three sentences, name the difference between first-contact noise and signal worth following up on.


Step 2 — Shadow a shift

Practice prompt: sit on-call alongside an experienced rotation for at least one shift. Do not act. Watch the timeline, the comms, the runbook usage.

Output: a one-page note on what surprised you.


Step 3 — Walk every runbook

Read: As We Build · Runbooks & Rollback.

Practice prompt: open every runbook the team has. For each: when was it last used, does it still work, who would understand it cold?

Output: a list of runbooks to rewrite, kill, or merge.


Step 4 — Read the observability surfaces

Read: As We Build · Observability.

Practice prompt: walk the dashboards. For each chart, ask: what level does this catch? What would I do if it spiked at 2am?

Mini-check: if a chart can't answer "what would I do", it isn't an on-call dashboard — it's analytics.


Step 5 — Hold a first-48h watch

Read: After We Build · The First 48 Hours.

Practice prompt: drive the watch on the next release. Watch dashboards (not tickets). Note signals; do not act on first-hour noise.

Output: a first-48h note distinguishing noise from signal.


Step 6 — Handle a real incident (or game day)

Read: After We Build · Incidents & Postmortems.

Practice prompt: when an incident fires, run the bridge. Timeline during, not after. Comms threads honest. Follow the runbook or write down why you didn't.

Pair task: if no real incident, hold a game day. Simulate a Sev 2.


Step 7 — Write the during-the-incident timeline

Practice prompt: during the next incident, write the timeline as it happens. Minute-by-minute. Resist the urge to wait until "things settle".

Mini-check: does another engineer, reading your timeline cold, know what happened? If not, rewrite while it's still fresh.


Step 8 — Drive the postmortem

Read: After We Build · Incidents & Postmortems.

Practice prompt: drive the postmortem for the incident or game day. Walk the timeline. Identify the chain level. Land one change — owned, dated, testable.

Mini-check: if the fix is "more monitoring", you stopped at L5. Push one level up.


Step 9 — Update or write a runbook

Practice prompt: rewrite the runbook your incident used (or wrote when it didn't exist). It should read at 2am, in adrenaline, in 30 seconds.

Pair task: have a teammate read the runbook cold. If they can't act on it without you, rewrite again.


Step 10 — Teach back, contribute back

Practice prompt: write a one-page guide for the next new on-call joining your team. What I wish I'd known before my first shift. One page.

Authoring contribution: open a PR — sharpen one thing in the corpus about incident response or timeline writing.

Self-rating after: rate yourself again. Bring the gap to your next cycle.


After this path

  • On-call · Practitioner (coming) — multi-service rotations, escalation design.
  • On-call · Advanced (coming) — SRE-level program leadership.

Stuck?

If you got stuck atRead
Step 5 — first-hour noise looked like a fireThe First 48 Hours — note, don't act
Step 7 — timeline kept getting written after the factThe during-write is the discipline; protect it
Step 8 — fix was "more monitoring"Push one level up — was there a missing ADR?
Step 9 — runbook didn't read at 2amCut it shorter; assume adrenaline

200apps · How We Work · NWIRE