Incident Triage & Cascading Failures

In an incident, restore service first and root-cause later. Scope the blast radius, stop the bleeding (roll back, shed load, or fail over), communicate, then investigate. Cascades such as retry storms and thundering herds turn small faults into outages; learn to recognize them.

The first 5 minutes

Confirm & scope. Real? What % of users? Which service/region?
Declare an incident if it is user-facing: assign an incident commander (IC), open a channel.
What changed? Most incidents (a common rule of thumb puts it around 80%) follow a deploy, config, or traffic change. Check the change timeline.
Stop the bleeding before diagnosing: rollback, disable the flag, fail over, shed load.

Mitigate before diagnosing: don't root-cause a live outage. If a deploy lines up, roll back now, investigate after.
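The "What changed?" step can be sketched as a lookup of recent changes against the incident start time. This is a minimal illustration, not a real tool: the `Change` record, the `suspects` function, the one-hour lookback, and the sample events are all hypothetical, and in practice the events would come from your deploy, flag, and config systems.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Change(NamedTuple):
    """Hypothetical change record; field names are assumptions."""
    kind: str      # "deploy", "flag", "config", ...
    target: str    # service or flag name
    at: datetime   # when the change took effect


def suspects(changes, incident_start, lookback=timedelta(hours=1)):
    """Return changes that landed within `lookback` before the incident, newest first."""
    window_start = incident_start - lookback
    hits = [c for c in changes if window_start <= c.at <= incident_start]
    return sorted(hits, key=lambda c: c.at, reverse=True)


# Made-up sample data for illustration only.
changes = [
    Change("deploy", "checkout-api", datetime(2024, 1, 1, 11, 20)),
    Change("flag", "new-pricing", datetime(2024, 1, 1, 11, 52)),
    Change("deploy", "search", datetime(2024, 1, 1, 9, 15)),
]
incident_start = datetime(2024, 1, 1, 12, 0)

for c in suspects(changes, incident_start):
    # Print each candidate with how long before the incident it landed.
    print(c.kind, c.target, incident_start - c.at)
```

The newest change is listed first because it is usually the cheapest one to revert and test as a hypothesis.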

Scope the blast radius

One service or many? Upstream or downstream in the dependency graph?
One region/tenant or global?
Correlate errors, latency, saturation with the change timeline.
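As a rough sketch of the scoping questions above, error counts can be grouped by service and region to see whether the fault is local or global. The sample tuples, the 5% threshold, and the names `error_rates` and `blast_radius` are assumptions for illustration; real numbers would come from your metrics system.

```python
from collections import defaultdict


def error_rates(samples):
    """samples: iterable of (service, region, requests, errors).
    Returns {(service, region): error rate}."""
    totals = defaultdict(lambda: [0, 0])
    for service, region, requests, errors in samples:
        totals[(service, region)][0] += requests
        totals[(service, region)][1] += errors
    return {key: (e / r if r else 0.0) for key, (r, e) in totals.items()}


def blast_radius(samples, threshold=0.05):
    """Summarize which services and regions exceed the error-rate threshold."""
    rates = error_rates(samples)
    bad = {key for key, rate in rates.items() if rate >= threshold}
    regions = {region for _, region in bad}
    all_regions = {region for _, region in rates}
    return {
        "services": sorted({service for service, _ in bad}),
        "regions": sorted(regions),
        # Crude heuristic: "global" means every observed region has an affected service.
        "global": bool(bad) and regions == all_regions,
    }


# Made-up sample data for illustration only.
samples = [
    ("checkout", "us-east", 1000, 120),
    ("checkout", "eu-west", 1000, 4),
    ("search", "us-east", 1000, 3),
    ("search", "eu-west", 1000, 2),
]
print(blast_radius(samples))
```

A single service in a single region points toward a local rollback or regional failover; many services across all regions points toward a shared dependency or a global config change.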

Cascading failure patterns

Pattern	What happens	Mitigation
Retry storm	Failures → retries → more load	Backoff + jitter, retry budgets, breakers
Thundering herd	Cache expiry → everyone hammers at once	Jittered TTLs, request coalescing
Resource exhaustion	One slow dep ties up all threads	Timeouts, bulkheads, concurrency limits
Death spiral	Restarts under load never catch up	Shed load, scale, slow-start
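Two of the mitigations in the table can be sketched briefly: exponential backoff with full jitter plus a retry budget (retry storm), and jittered TTLs (thundering herd). This is a simplified illustration under stated assumptions, not a production client: the names `backoff_delay`, `RetryBudget`, `call_with_retries`, and `jittered_ttl`, the 10% budget ratio, and the default delays are all hypothetical.

```python
import random
import time


def backoff_delay(attempt, base=0.1, cap=10.0, rng=random):
    """Full jitter: a random delay in [0, min(cap, base * 2**attempt)] seconds."""
    return rng.uniform(0.0, min(cap, base * (2 ** attempt)))


class RetryBudget:
    """Permit retries only while they stay within `ratio` of requests seen."""

    def __init__(self, ratio=0.1):
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def try_spend(self):
        # Refuse the retry if it would push retries above the allowed share.
        if self.retries + 1 > self.ratio * self.requests:
            return False
        self.retries += 1
        return True


def call_with_retries(fn, budget, max_attempts=4, sleep=time.sleep):
    """Call fn(), retrying with jittered backoff while the budget allows."""
    budget.record_request()
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            # Simplification: real code should catch only retryable errors.
            is_last = attempt == max_attempts - 1
            if is_last or not budget.try_spend():
                raise
            sleep(backoff_delay(attempt))


def jittered_ttl(ttl, spread=0.1, rng=random):
    """Vary the TTL by up to +/- `spread` so keys written together expire apart."""
    return ttl * (1.0 + rng.uniform(-spread, spread))
```

Sharing one `RetryBudget` across all calls to a dependency is the point of the design: when that dependency is failing broadly, the budget runs out and clients stop amplifying the load, whereas a per-call attempt limit alone still multiplies traffic by `max_attempts`.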

Rollback vs forward-fix

Default to rollback: it is the fastest route back to a known-good state. Forward-fix only when rollback is impossible (for example, after an irreversible migration) or the fix is trivial and certain.

Communication

Status updates on a cadence — silence breeds escalation.
One source of truth; IC coordinates, others execute.
Record the timeline as you go; it becomes the postmortem skeleton.

Blameless postmortem

Timeline, impact, root cause (5 whys), what went well/poorly, action items with owners. Target systems and gaps, not people.

Triage checklist

# 1. confirm + scope (errors, latency, saturation)
# 2. what changed? (deploys, flags, config, traffic)
# 3. mitigate: rollback / disable flag / failover / shed load
# 4. communicate on a cadence
# 5. verify recovery, then root-cause + postmortem