after we build · part four · worked example

Worked example — the JWT outage

A 6-line XML policy change to the API gateway. Deployed to all environments simultaneously. Every API call returns 403 Forbidden. 12,400 users locked out. 44 minutes to resolution. 147 support tickets. Two enterprise SLA breaches. Zero data loss — but the trust cost is real and unmeasured.

The timeline that matters

Time	Event
09:38	Policy deployed to all environments at once.
09:42	First 403 errors.
09:43	PagerDuty fires.
09:50	Root cause identified: the gateway requires a JWT claim that the tokens don't have.
09:56	Revert initiated.
10:01	Revert complete.

Detection: 4 minutes from deploy to the first errors. Root cause: 8 minutes after the first errors. Revert: 5 minutes from initiation to completion. The speed was good. The prevention was absent.
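The three intervals can be reproduced from the table. This is a small sketch that assumes detection is measured from deploy to the first 403, root cause from the first 403 to identification, and revert from initiation to completion; the dictionary keys are illustrative names, not terms from the incident record.

```python
from datetime import datetime

# Timestamps copied from the timeline table (same day, 24-hour clock).
# The key names are illustrative.
timeline = {
    "deployed": "09:38",
    "first_403": "09:42",
    "paged": "09:43",
    "root_cause": "09:50",
    "revert_started": "09:56",
    "revert_done": "10:01",
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes from one timeline event to another."""
    fmt = "%H:%M"
    delta = datetime.strptime(timeline[end], fmt) - datetime.strptime(timeline[start], fmt)
    return int(delta.total_seconds() // 60)


detection = minutes_between("deployed", "first_403")
diagnosis = minutes_between("first_403", "root_cause")
revert = minutes_between("revert_started", "revert_done")

print(detection, diagnosis, revert)
```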

The 5 Whys — where the chain failed

Why did all calls fail? The policy required a claim that the tokens didn't have.
Why wasn't this caught? The pipeline validated XML syntax, not runtime behaviour.
Why no staging first? The pipeline deploys to all environments in one step.
Why is there only one step? Gateway policies were historically low-risk.
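The gap between syntax validation and runtime behaviour can be illustrated with a minimal sketch. The policy fragment, its element names, and the `tenant_id` claim are all hypothetical, since the source does not name the gateway product or the claim involved. The token decoding skips signature verification on purpose: the check only asks whether the claims a policy requires are present in a representative token.

```python
import base64
import json
import xml.etree.ElementTree as ET

# Hypothetical policy fragment: element and claim names are illustrative only.
POLICY_XML = """
<validate-token>
  <required-claims>
    <claim name="tenant_id" />
  </required-claims>
</validate-token>
""".strip()


def syntax_ok(policy_xml: str) -> bool:
    """What a syntax-only pipeline checks: is the XML well-formed?"""
    try:
        ET.fromstring(policy_xml)
        return True
    except ET.ParseError:
        return False


def required_claims(policy_xml: str) -> set[str]:
    """Claim names the (hypothetical) policy demands from every token."""
    root = ET.fromstring(policy_xml)
    return {claim.attrib["name"] for claim in root.iter("claim")}


def jwt_claims(token: str) -> dict:
    """Decode the payload segment of a JWT without verifying the signature."""
    payload = token.split(".")[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64url padding
    return json.loads(base64.urlsafe_b64decode(payload))


def missing_claims(policy_xml: str, token: str) -> set[str]:
    """Claims the policy requires that the sample token does not carry."""
    return required_claims(policy_xml) - set(jwt_claims(token))


def make_sample_token(claims: dict) -> str:
    """Build an unsigned, JWT-shaped token for testing only."""
    def b64(obj: dict) -> str:
        raw = json.dumps(obj).encode()
        return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()
    return f"{b64({'alg': 'none'})}.{b64(claims)}."
```

Under these assumptions, a token carrying only `sub` passes `syntax_ok` on the policy but yields a non-empty result from `missing_claims`, which is the kind of failure a syntax-only check cannot see.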

The systemic cause: gateway policy changes did not get the deployment rigour applied to application code. Any change that can cause a production outage deserves that rigour.

The classification and structural fixes

Labels

severity/critical · impact/blocker · P0 · type/security ·
area/auth · root-cause/configuration · Chain: integration-gap

Fixes

Environment-gated deployments with 30-minute staging soak.
Token compatibility smoke test in pipeline.
Auto status-page update on P0.
All gateway changes through change advisory board.
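The first fix, environment-gated deployment with a soak period, can be sketched as a loop that promotes a change one environment at a time and stops at the first unhealthy one. This is an outline under stated assumptions, not the team's pipeline: `gated_rollout`, `deploy`, and `is_healthy` are hypothetical names, and for simplicity the sketch soaks in every environment rather than only in staging.

```python
import time
from typing import Callable, Sequence


def gated_rollout(
    environments: Sequence[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    soak_seconds: float = 30 * 60,
    sleep: Callable[[float], None] = time.sleep,
) -> list[str]:
    """Deploy to one environment at a time; stop at the first unhealthy one.

    Returns the environments that were deployed and passed their soak.
    """
    promoted = []
    for env in environments:
        deploy(env)
        sleep(soak_seconds)  # give smoke tests or real traffic time to exercise the change
        if not is_healthy(env):
            break  # later environments are never touched
        promoted.append(env)
    return promoted
```

A health check such as a token compatibility smoke test would plug in as `is_healthy`. Passing `sleep` as a parameter keeps the function testable without waiting out a real soak.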

Configuration changes are code changes. If a 6-line XML change can lock out 12,400 users, it deserves the same pipeline, the same gates, and the same review as application code.
