Worked example — the JWT outage

A 6-line XML policy change to the API gateway. Deployed to all environments simultaneously. Every API call returns 403 Forbidden. 12,400 users locked out. 44 minutes to resolution. 147 support tickets. Two enterprise SLA breaches. Zero data loss — but the trust cost is real and unmeasured.

The timeline that matters

Time   Event
09:38  Policy deployed to all environments at once.
09:42  First 403 errors.
09:43  PagerDuty fires.
09:50  Root cause identified: the gateway requires a JWT claim that the tokens don't have.
09:56  Revert initiated.
10:01  Revert complete.

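Detection: 4 minutes from deployment to the first errors. Root cause: 8 minutes from the first errors to diagnosis. Revert: 5 minutes from initiation to completion. The speed was good. The prevention was absent.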
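The three intervals can be checked directly against the timeline. The sketch below is only an illustration of that arithmetic; the dictionary keys and the helper name `minutes_between` are made up for this example, and the timestamps are copied from the table.

```python
from datetime import datetime

# Timestamps copied from the timeline (same day, 24-hour clock).
# The key names are illustrative labels, not terms from the incident tooling.
TIMELINE = {
    "deployed": "09:38",
    "first_errors": "09:42",
    "paged": "09:43",
    "root_cause": "09:50",
    "revert_started": "09:56",
    "revert_complete": "10:01",
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two HH:MM timestamps on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


detection = minutes_between(TIMELINE["deployed"], TIMELINE["first_errors"])
diagnosis = minutes_between(TIMELINE["first_errors"], TIMELINE["root_cause"])
revert = minutes_between(TIMELINE["revert_started"], TIMELINE["revert_complete"])

print(detection, diagnosis, revert)  # 4 8 5
```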

The 5 Whys — where the chain failed

  1. Why did all calls fail? The policy required a claim tokens didn't have.
  2. Why wasn't this caught? The pipeline validated XML syntax, not runtime behaviour.
  3. Why no staging first? The pipeline deploys to all environments in one step.
  4. Why does the pipeline deploy everywhere in one step? Gateway policies were historically treated as low-risk.

The systemic cause: gateway configuration was exempt from the deployment rigour applied to application code. Any change that can cause a production outage deserves that same rigour.
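The gap named in the second why, syntax validation without a behavioural check, can be sketched as follows. This is a minimal illustration, not the real policy or pipeline: the XML element names, the `tenant_id` claim, and the sample token claims are all hypothetical, since the actual 6-line change is not shown here.

```python
import xml.etree.ElementTree as ET

# Hypothetical policy fragment. Element and attribute names are invented
# for illustration and do not reflect any specific gateway's schema.
POLICY_XML = """<policy>
  <validate-jwt>
    <required-claim name="tenant_id" />
  </validate-jwt>
</policy>"""

# Hypothetical claims carried by a token that clients already hold.
ISSUED_TOKEN_CLAIMS = {
    "sub": "user-123",
    "iss": "https://idp.example",
    "exp": 1900000000,
}


def syntax_ok(policy_xml: str) -> bool:
    """The check the pipeline had: is the XML well-formed?"""
    try:
        ET.fromstring(policy_xml)
        return True
    except ET.ParseError:
        return False


def missing_claims(policy_xml: str, claims: dict) -> list:
    """The check the pipeline lacked: which required claims are absent?"""
    root = ET.fromstring(policy_xml)
    required = [el.get("name") for el in root.iter("required-claim")]
    return [name for name in required if name not in claims]


print(syntax_ok(POLICY_XML))                            # True
print(missing_claims(POLICY_XML, ISSUED_TOKEN_CLAIMS))  # ['tenant_id']
```

The policy passes the syntax check and would still reject every existing token, which is the failure mode the timeline describes.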

The classification and structural fixes

Labels

severity/critical · impact/blocker · P0 · type/security ·
area/auth · root-cause/configuration · Chain: integration-gap

Fixes

  • Environment-gated deployments with 30-minute staging soak.
  • Token compatibility smoke test in pipeline.
  • Auto status-page update on P0.
  • All gateway changes through change advisory board.
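The first two fixes can be combined into one gate. The sketch below assumes a pipeline that can deploy, smoke-test, and roll back one environment at a time; `gated_rollout` and its callback parameters are hypothetical names for this illustration, not part of any real deployment tool.

```python
import time
from typing import Callable, List, Sequence

SOAK_SECONDS = 30 * 60  # the 30-minute staging soak from the fixes list


def gated_rollout(
    environments: Sequence[str],
    deploy: Callable[[str], None],
    smoke_test: Callable[[str], bool],
    rollback: Callable[[str], None],
    soak_seconds: float = SOAK_SECONDS,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Deploy one environment at a time and stop at the first failed gate.

    Returns the environments that passed every check.
    """
    passed: List[str] = []
    for index, env in enumerate(environments):
        deploy(env)
        healthy = smoke_test(env)
        is_last = index == len(environments) - 1
        if healthy and not is_last:
            # Let the change soak, then re-check before promoting further.
            sleep(soak_seconds)
            healthy = smoke_test(env)
        if not healthy:
            rollback(env)
            break  # later environments are never touched
        passed.append(env)
    return passed


# Demo: a smoke test that always fails, standing in for the missing-claim 403s.
touched: List[str] = []
reverted: List[str] = []
result = gated_rollout(
    ["staging", "production"],
    deploy=touched.append,
    smoke_test=lambda env: False,
    rollback=reverted.append,
    sleep=lambda seconds: None,  # skip the real wait in this demo
)
print(result, touched, reverted)  # [] ['staging'] ['staging']
```

With a gate like this, a policy that rejects existing tokens would fail in staging and production would not receive the change.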

Configuration changes are code changes. If a 6-line XML change can lock out 12,400 users, it deserves the same pipeline, the same gates, and the same review as application code.

200apps · How We Work · NWIRE