DEBUG GUIDE · INCIDENTS · SRE PLAYBOOK

Incident Triage & Cascading Failures.

Tags: incident, sre, reliability, oncall

In an incident: restore service first, root-cause later. Scope the blast radius, stop the bleeding (rollback / shed load / failover), communicate, then investigate. Cascades (retry storms, thundering herds) turn small faults into outages — recognize them.

The first 5 minutes

  1. Confirm & scope. Real? What % of users? Which service/region?
  2. Declare if user-facing — assign an incident commander, open a channel.
  3. What changed? As a rule of thumb, about 80% of incidents follow a deploy, config, or traffic change. Check the timeline.
  4. Stop the bleeding before diagnosing: rollback, disable the flag, fail over, shed load.

Mitigate before you diagnose: don't root-cause a live outage. If a deploy lines up with the start of the incident, roll back now and investigate after.
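
The "what changed?" step can be sketched as a filter over a change log. This is a minimal illustration, not a real tool's API: the `Change` record, the `suspects` helper, and the 30-minute lookback are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Change:
    """One change-log entry (hypothetical shape, not a real tool's schema)."""
    at: datetime
    kind: str      # e.g. "deploy", "flag", "config"
    target: str    # service or flag name


def suspects(changes, incident_start, lookback=timedelta(minutes=30)):
    """Return changes that landed within `lookback` before the incident, newest first."""
    window_start = incident_start - lookback
    hits = [c for c in changes if window_start <= c.at <= incident_start]
    return sorted(hits, key=lambda c: c.at, reverse=True)


# Example data: the incident starts at 12:00.
start = datetime(2024, 1, 1, 12, 0)
log = [
    Change(datetime(2024, 1, 1, 9, 0), "deploy", "billing"),
    Change(datetime(2024, 1, 1, 11, 40), "config", "gateway"),
    Change(datetime(2024, 1, 1, 11, 55), "deploy", "checkout"),
]
candidates = suspects(log, start)
for c in candidates:
    print(c.at.isoformat(), c.kind, c.target)
```

The newest change comes first because it is usually the cheapest rollback candidate to try. The lookback window is a judgment call: slow-burn faults such as leaks may need a much wider one.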

Scope the blast radius

  • One service or many? Is the fault upstream or downstream in the dependency graph?
  • One region/tenant or global?
  • Correlate errors, latency, saturation with the change timeline.
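
One way to answer the scoping questions above is to slice the same request sample by each dimension and see which slice separates healthy from unhealthy. The sketch below assumes request records are available as dicts with `service`, `region`, and `status` keys; that shape and the `error_rate_by` helper are hypothetical.

```python
from collections import defaultdict


def error_rate_by(records, key):
    """Group request records by `key` and return {value: fraction of 5xx responses}."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for r in records:
        totals[r[key]] += 1
        if r["status"] >= 500:
            errors[r[key]] += 1
    return {k: errors[k] / totals[k] for k in totals}


# Example sample (made-up data for illustration).
records = [
    {"service": "checkout", "region": "eu-west", "status": 500},
    {"service": "checkout", "region": "eu-west", "status": 503},
    {"service": "checkout", "region": "us-east", "status": 200},
    {"service": "search", "region": "eu-west", "status": 500},
    {"service": "search", "region": "us-east", "status": 200},
    {"service": "search", "region": "us-east", "status": 200},
]
by_region = error_rate_by(records, "region")
by_service = error_rate_by(records, "service")
print(by_region)
print(by_service)
```

In this sample every service is partly failing, but the region slice splits cleanly: all errors are in one region. That points at a regional fault rather than a bad service, which in turn suggests failover over rollback.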

Cascading failure patterns

Pattern              What happens                              Mitigation
Retry storm          Failures → retries → more load            Backoff + jitter, retry budgets, breakers
Thundering herd      Cache expiry → everyone hammers at once   Jittered TTLs, request coalescing
Resource exhaustion  One slow dep ties up all threads          Timeouts, bulkheads, concurrency limits
Death spiral         Restarts under load never catch up        Shed load, scale, slow-start
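
Three of the mitigations above are small enough to sketch: exponential backoff with full jitter, a retry budget, and jittered cache TTLs. This is an illustration under stated assumptions, not a production client. The names `RetryBudget`, `backoff_delay`, `call_with_retries`, and `jittered_ttl` are hypothetical, and the 10% budget ratio and delay constants are arbitrary defaults.

```python
import random
import time


class RetryBudget:
    """Allow retries only while they stay under `ratio` of the requests seen so far."""

    def __init__(self, ratio=0.1):
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def try_spend(self):
        """Consume one retry if the budget allows it; return whether it did."""
        if self.retries + 1 > self.requests * self.ratio:
            return False
        self.retries += 1
        return True


def backoff_delay(attempt, base=0.1, cap=5.0, rng=random):
    """Full jitter: a uniform delay in [0, min(cap, base * 2**attempt)] seconds."""
    return rng.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_retries(fn, budget, max_attempts=4, sleep=time.sleep):
    """Call fn(); retry failures with jittered backoff while the budget permits."""
    budget.record_request()
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            is_last = attempt == max_attempts - 1
            if is_last or not budget.try_spend():
                raise  # out of attempts or out of budget: fail fast
            sleep(backoff_delay(attempt))


def jittered_ttl(base_ttl, spread=0.1, rng=random):
    """Spread expiries across base_ttl * (1 ± spread) so keys do not expire together."""
    return base_ttl * (1.0 + rng.uniform(-spread, spread))
```

The budget is shared across calls on purpose. During a widespread failure it runs out quickly and callers fail fast instead of multiplying load, which is how a retry storm gets cut off. A fresh budget with no recorded traffic permits no retries.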

Rollback vs forward-fix

Default to rollback — fastest known-good. Forward-fix only when rollback is impossible (irreversible migration) or the fix is trivial and certain.

Communication

  • Status updates on a cadence — silence breeds escalation.
  • One source of truth; IC coordinates, others execute.
  • Record the timeline as you go; it becomes the skeleton of the postmortem.
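
Recording the timeline as you go can be as light as appending timestamped notes and rendering them in order later. The `Timeline` class below is a hypothetical sketch; in practice an incident channel or an incident-management tool often plays this role.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Timeline:
    """Append-only incident log; entries may arrive out of order."""
    entries: list = field(default_factory=list)

    def note(self, text, at=None):
        """Record an event, stamped with the current UTC time unless `at` is given."""
        at = at or datetime.now(timezone.utc)
        self.entries.append((at, text))

    def render(self):
        """Return the entries sorted by time, one per line, as a postmortem skeleton."""
        ordered = sorted(self.entries, key=lambda e: e[0])
        return "\n".join(f"{at:%H:%M}Z  {text}" for at, text in ordered)


# Example: events noted out of order are rendered chronologically.
timeline = Timeline()
timeline.note("rolled back checkout", at=datetime(2024, 1, 1, 12, 10, tzinfo=timezone.utc))
timeline.note("alert fired", at=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc))
print(timeline.render())
```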

Blameless postmortem

Timeline, impact, root cause (5 whys), what went well/poorly, action items with owners. Target systems and gaps, not people.

Triage checklist

# 1. confirm + scope (errors, latency, saturation)
# 2. what changed? (deploys, flags, config, traffic)
# 3. mitigate: rollback / disable flag / failover / shed load
# 4. communicate on a cadence
# 5. verify recovery, then root-cause + postmortem
© cvam — written in plaintext, served warm