how we build · part eight
The Release
Runbooks rehearsed before the incident. The release gate — every item checked, no exceptions. Rollback discipline at four levels. The release moment. Client and CS communication before the feature is live.
Events in this phase. Runbook rehearsal — scheduled in staging before each release. Release gate review — short meeting, checklist walked through, go/no-go decided. Both at the slice boundary, not during regular flow.
Runbooks — written before the incident
This is where meaning meets operational reality. A runbook is written before any feature that touches a critical path goes live: any flow where a failure has significant user or financial consequences needs a runbook before the release gate passes. Each runbook has four parts:
- Trigger — the exact monitoring condition. Not "error rate is high" — "the exam-submit error rate exceeds 5% for 5 consecutive minutes."
- Steps — numbered, specific, timed. Not "investigate" — "check the sync error rate dashboard; if above threshold, proceed to step 2."
- Rollback — almost always: disable the feature flag. The flag name, who has access, confirmed rollback time from rehearsal.
- Communication template — pre-written message for the client if the incident exceeds 15 minutes.
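To make "the exact monitoring condition" concrete, here is a minimal sketch of the trigger from the first bullet expressed as data plus a check. The names `Trigger`, `trigger_fired`, and the metric name `exam_submit_error_rate` are illustrative assumptions, not part of any real monitoring tool, and the sketch assumes the monitor supplies one error-rate sample per minute.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trigger:
    metric: str               # hypothetical metric name
    threshold: float          # error rate as a fraction: 0.05 means 5%
    consecutive_minutes: int  # how long the rate must stay above the threshold


def trigger_fired(trigger: Trigger, per_minute_rates: list[float]) -> bool:
    """True when the most recent N one-minute samples all exceed the threshold."""
    n = trigger.consecutive_minutes
    if len(per_minute_rates) < n:
        return False  # not enough history to decide
    return all(rate > trigger.threshold for rate in per_minute_rates[-n:])


# The example from the runbook: above 5% for 5 consecutive minutes.
exam_submit = Trigger("exam_submit_error_rate", threshold=0.05, consecutive_minutes=5)
```

A single dip below the threshold resets the window, which is the point of writing "5 consecutive minutes" instead of "error rate is high": the condition is either met or not, and nobody has to judge it during the incident.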
Runbooks live in the repository, versioned alongside the code. Before every release, the runbook is rehearsed in staging — someone runs through the steps, the rollback is executed, the time is recorded.
Rollback rehearsed: confirmed 6 minutes. Run by: Maya + Ran. Date: 14 April. Flag disabled, staging back to baseline.
This note is a release gate condition. "Rollback possible" is not enough to pass the gate. A confirmed time is.
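One way to enforce that condition is to treat the rehearsal note as structured data and have the gate reject anything incomplete. The sketch below is an assumption about how such a check could look: `RehearsalNote` and `gate_failures` are hypothetical names, and treating every field of the note as mandatory is our reading of the example, not a stated rule.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RehearsalNote:
    rollback_minutes: Optional[float]  # measured during rehearsal, None if never timed
    run_by: tuple[str, ...]            # who executed the rehearsal
    date: str                          # kept as free text, e.g. "14 April"
    staging_back_to_baseline: bool     # was staging restored afterwards?


def gate_failures(note: Optional[RehearsalNote]) -> list[str]:
    """Return the reasons the gate should fail; an empty list means pass."""
    if note is None:
        return ["no rehearsal note recorded"]
    failures = []
    if note.rollback_minutes is None or note.rollback_minutes <= 0:
        failures.append("no confirmed rollback time")
    if not note.run_by:
        failures.append("no named person ran the rehearsal")
    if not note.date.strip():
        failures.append("no rehearsal date")
    if not note.staging_back_to_baseline:
        failures.append("staging was not returned to baseline")
    return failures


# The note from the example above passes: gate_failures(note) is empty.
note = RehearsalNote(6, ("Maya", "Ran"), "14 April", True)
```

A note that only says "rollback possible" has no measured time, so it would come back with "no confirmed rollback time" and the gate would stay closed.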
Rollback discipline — four levels
- Flag rollback — disable the flag. Seconds. Users return to the previous behaviour. No code change needed. This is the primary rollback for flagged features.
- Deploy rollback — revert to the previous deployment. Minutes. The pipeline redeploys the last known-good build. Used when the issue is not flag-specific.
- Migration rollback — reverse a schema change. Hours. Only possible if the migration was designed to be reversible. This is why backward-compatible migrations matter.
- Data rollback — restore data to a previous state. May not be possible if writes have occurred. Plan for this before it happens — the plan is either "we have point-in-time recovery" or "we accept this risk and here's why."
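The four levels can be read as an escalation: start with the cheapest rollback that addresses the problem and go deeper only when needed. The sketch below is one possible encoding of that reading, not a prescribed procedure. `Incident`, its fields, and `rollback_plan` are hypothetical names, and the inputs are judgments a responder would have to make.

```python
from dataclasses import dataclass
from enum import Enum


class RollbackLevel(Enum):
    FLAG = "disable the feature flag (seconds)"
    DEPLOY = "redeploy the last known-good build (minutes)"
    MIGRATION = "reverse the schema change (hours)"
    DATA = "restore data to a previous state"


@dataclass(frozen=True)
class Incident:
    flag_specific: bool               # does disabling the flag remove the problem?
    schema_must_revert: bool          # does the schema change itself need undoing?
    migration_reversible: bool        # was the migration designed to be reversible?
    data_affected: bool               # has bad data already been written?
    has_point_in_time_recovery: bool  # is there a restore point to go back to?


def rollback_plan(incident: Incident) -> tuple[list[RollbackLevel], list[str]]:
    """Return (steps, blockers): levels to execute, and levels that are unavailable."""
    steps = [RollbackLevel.FLAG if incident.flag_specific else RollbackLevel.DEPLOY]
    blockers = []
    if incident.schema_must_revert:
        if incident.migration_reversible:
            steps.append(RollbackLevel.MIGRATION)
        else:
            blockers.append("migration was not designed to be reversible")
    if incident.data_affected:
        if incident.has_point_in_time_recovery:
            steps.append(RollbackLevel.DATA)
        else:
            blockers.append("no point-in-time recovery; this risk had to be accepted up front")
    return steps, blockers
```

The blockers are the cases this section warns about: they cannot be fixed during the incident, only decided before the release.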