Worked example — the JWT outage
A 6-line XML policy change to the API gateway. Deployed to all environments simultaneously. Every API call returns 403 Forbidden. 12,400 users locked out. 44 minutes to resolution. 147 support tickets. Two enterprise SLA breaches. Zero data loss — but the trust cost is real and unmeasured.
The timeline that matters
| Time | Event |
|---|---|
| 09:38 | Policy deployed to all environments at once. |
| 09:42 | First 403 errors. |
| 09:43 | PagerDuty fires. |
| 09:50 | Root cause identified — gateway requires a JWT claim tokens don't have. |
| 09:56 | Revert initiated. |
| 10:01 | Revert complete. |
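Detection: 4 minutes from deploy to the first 403 errors. Root cause: 8 minutes from the first errors to identification. Revert: 5 minutes from initiation to completion. The speed was good. The prevention was absent.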
The 5 Whys — where the chain failed
- Why did all calls fail? The policy required a claim tokens didn't have.
- Why wasn't this caught? The pipeline validated XML syntax, not runtime behaviour.
- Why no staging first? The pipeline deploys to all environments in one step.
- Why a single step? Gateway policies were historically treated as low-risk.
The systemic cause: gateway policy changes were exempt from the deployment rigour applied to application code. Any change that can cause a production outage deserves that rigour.
The classification and structural fixes
Labels
severity/critical · impact/blocker · P0 · type/security · area/auth · root-cause/configuration · Chain: integration-gap
Fixes
- Environment-gated deployments with 30-minute staging soak.
- Token compatibility smoke test in pipeline.
- Auto status-page update on P0.
- All gateway changes through change advisory board.
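A token compatibility smoke test could look roughly like the sketch below. It decodes the payload of a sample token and reports any claim the new policy requires but the token lacks. The function names and the `tenant_id` claim are hypothetical, since the write-up does not name the missing claim. The signature is deliberately not verified, because the check is only about which claims are present.

```python
import base64
import json


def b64url_decode(segment: str) -> bytes:
    """Decode a base64url segment, restoring the padding that JWTs strip."""
    padding = "=" * (-len(segment) % 4)
    return base64.urlsafe_b64decode(segment + padding)


def jwt_claims(token: str) -> dict:
    """Return the payload claims of a JWT WITHOUT verifying its signature."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected header.payload.signature")
    return json.loads(b64url_decode(parts[1]))


def missing_claims(token: str, required: set) -> set:
    """Claims the policy requires that the token does not carry."""
    return set(required) - set(jwt_claims(token))


def make_sample_token(claims: dict) -> str:
    """Build an unsigned token for the example only (hypothetical helper)."""
    def enc(obj: dict) -> str:
        raw = json.dumps(obj).encode()
        return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()
    return f"{enc({'alg': 'none', 'typ': 'JWT'})}.{enc(claims)}.sig"


if __name__ == "__main__":
    # 'tenant_id' is an assumed claim name, used only for illustration.
    token = make_sample_token({"sub": "user-1", "iss": "https://idp.example"})
    gaps = missing_claims(token, {"sub", "tenant_id"})
    if gaps:
        raise SystemExit(f"policy requires claims the token lacks: {sorted(gaps)}")
```

In a real pipeline the sample token would presumably come from the staging identity provider rather than a helper, and a non-empty result would fail the build before the policy reaches any environment.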
Configuration changes are code changes. If a 6-line XML change can lock out 12,400 users, it deserves the same pipeline, the same gates, and the same review as application code.
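As one illustration of "the same gates", here is a minimal sketch of environment-gated promotion with a soak period. It assumes the pipeline can supply a `deploy` callable and a `healthy` check per environment. Both are hypothetical placeholders, as are the function name and the one-minute polling interval. Only the 30-minute soak comes from the fix list.

```python
import time
from typing import Callable, Sequence


def promote(
    environments: Sequence[str],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
    soak_seconds: float = 30 * 60,   # 30-minute soak, per the fix list
    poll_seconds: float = 60,        # assumed polling interval
    sleep: Callable[[float], None] = time.sleep,
) -> list:
    """Deploy to environments in order, soaking each before the next.

    Returns the environments that were deployed to. Promotion stops at the
    first environment that reports unhealthy during its soak, so later
    environments are never touched.
    """
    deployed = []
    for index, env in enumerate(environments):
        deploy(env)
        deployed.append(env)
        if index == len(environments) - 1:
            break  # last environment: nothing left to promote to
        waited = 0.0
        while waited < soak_seconds:
            if not healthy(env):
                return deployed
            sleep(poll_seconds)
            waited += poll_seconds
        if not healthy(env):  # final check at the end of the soak
            return deployed
    return deployed
```

Called as `promote(["staging", "production"], deploy, healthy)`, a policy that breaks staging would stop there, and production would keep serving traffic on the previous policy.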