
First-48-hours watch

Runs for the 48 hours after flag-on. On-call is active. The PO watches dashboards loose-then-sharp: reading first-hour noise loosely, then more sharply as patterns settle. Dashboards, not tickets, because tickets lag reality by hours. Act on three conditions; log everything else.

When

  • Begins at flag enablement — that was a release-gate condition.
  • Continues for 48 hours — first hour noisiest, settles by hour 24, last 24 hours are normal-cadence.

Who

  • On-call — primary. Owns the alerts; pages if needed.
  • PO — watches the leading-signal dashboards (adoption, completion, error encounter).
  • Tech Lead — available; responds to on-call escalation.

Time-box

The watch is continuous in calendar, not in seat-time. The PO checks the dashboard hourly for the first 4 hours, then every 4 hours until hour 24, then twice daily to hour 48.
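
The cadence above can be sketched as a list of hour offsets from flag-on. This is a minimal illustration, not part of the process itself: the function names are hypothetical, and "twice daily" is assumed to mean every 12 hours (checks at hours 36 and 48).

```python
from datetime import datetime, timedelta


def watch_check_offsets():
    """Hour offsets from flag-on at which the PO checks the dashboard."""
    hourly = list(range(1, 5))              # hourly for the first 4 hours
    four_hourly = list(range(8, 25, 4))     # every 4 hours until hour 24
    twice_daily = list(range(36, 49, 12))   # assumption: every 12 hours to hour 48
    return hourly + four_hourly + twice_daily


def watch_schedule(flag_on):
    """Concrete check times for a given flag-on timestamp."""
    return [flag_on + timedelta(hours=h) for h in watch_check_offsets()]


if __name__ == "__main__":
    # Hypothetical flag-on time, for illustration only.
    for check in watch_schedule(datetime(2024, 1, 8, 9, 0)):
        print(check.isoformat(timespec="minutes"))
```

Under these assumptions the watch comes to eleven checks over two days, which is why it is continuous in calendar but light in seat-time.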

Inputs

  • The release brief (so the watcher knows what was promised).
  • The named SLIs and SLO thresholds.
  • The runbook for each named alert.
  • The leading-signals dashboard (adoption, completion, error encounter, time-on-state).

Agenda

Not a meeting — a discipline. What the PO watches each check:

  1. Are dashboards within SLO? If yes, log.
  2. Is error rate above SLO threshold and trending up? If yes, open the Incident war room.
  3. Are leading signals telling a story? Adoption stalling? Completion rate weird? Error encounter at unexpected state? Note for the Signal reading session.
  4. Are helpdesk tickets correlating with anything visible? Pattern → flag to Helpdesk reading for the week.
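
Check 2 combines two tests: the error rate is above the SLO threshold, and it is trending up. A rough sketch of that combination follows. The source does not define "trending up", so the trend rule here (the later half of a window averaging higher than the earlier half) and the names `should_escalate` and `error_rates` are assumptions for illustration.

```python
from statistics import mean


def should_escalate(error_rates, slo_threshold):
    """True when the latest error rate exceeds the SLO threshold AND the
    window is trending up.

    error_rates: recent samples, oldest first (hypothetical input shape).
    Trend rule is an assumption: mean of the later half > mean of the earlier half.
    """
    if len(error_rates) < 4:
        return False  # too few samples to judge a trend
    above_threshold = error_rates[-1] > slo_threshold
    half = len(error_rates) // 2
    trending_up = mean(error_rates[half:]) > mean(error_rates[:half])
    return above_threshold and trending_up
```

A rate that is above threshold but falling does not escalate under this rule; it still gets logged, and the separate >5 min threshold condition applies regardless of trend.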

Three conditions warrant immediate action:

  • SLO threshold crossed for >5 min → open runbook, start from step one.
  • Any data integrity concern → disable the flag, investigate in staging.
  • Any security-relevant behaviour → disable the flag, full stop.

Everything else: log, prioritise via the bug taxonomy, address in normal flow.
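
The three triggers and the log-by-default rule can be written as one small decision function. This is a sketch under stated assumptions: the `Observation` fields and `Action` names are hypothetical, and the precedence (security and data integrity checked before the SLO condition) is an assumption, since the source lists the conditions without ordering them.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    DISABLE_FLAG = "disable the flag"
    OPEN_RUNBOOK = "open runbook, start from step one"
    LOG = "log and triage via the bug taxonomy"


@dataclass
class Observation:
    minutes_over_slo: float = 0.0         # how long the SLO threshold has been crossed
    data_integrity_concern: bool = False
    security_relevant: bool = False


def decide(obs):
    """Map one observation to the single action the watch allows."""
    # Assumed precedence: flag-disabling conditions win over the runbook path.
    if obs.security_relevant or obs.data_integrity_concern:
        return Action.DISABLE_FLAG
    if obs.minutes_over_slo > 5:
        return Action.OPEN_RUNBOOK
    return Action.LOG  # everything else is logged, not acted on
```

Keeping the default branch as `LOG` mirrors the rule that only the three named conditions trigger action.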

Outputs

  • A 48-hour watch note — what was observed, what was acted on, what was logged. Filed alongside the release brief.
  • The baseline data that the Signal reading session will draw on.
  • Early signals of unexpected patterns that feed the next cycle's brief.

What good looks like

The PO does not act on first-hour noise. People click things in unexpected orders, and errors that are not bugs appear; the discipline is to hold back from reacting to them. Acting early is not a sign of control; acting correctly is.

By hour 48, the noisy first-contact patterns settle. The team has a first honest picture — not yet the prediction check, but the data the check will draw from.

Anti-pattern

Watching tickets, not dashboards. Tickets are a lagging, narrative-shaped signal; they tell the team what users complained about, hours after the metric showed the same thing. Fix: dashboards first; tickets second; the helpdesk reading later in the week interprets the ticket layer.

A second anti-pattern: acting on every signal. The team panics at first-hour noise, disables the flag, re-enables, disables again. Fix: the three named conditions are the only triggers for action. Everything else is logged for triage.

A third: the watch ends quietly because nothing happened. No baseline captured, no first-impression note, no input to the signal reading. Fix: "nothing happened" is itself the observation — capture it. The signal reading will reference the baseline.
