In an incident: restore service first, root-cause later. Scope the blast radius, stop the bleeding (rollback / shed load / failover), communicate, then investigate. Cascades (retry storms, thundering herds) turn small faults into outages — recognize them.
The first 5 minutes
- Confirm & scope. Real? What % of users? Which service/region?
- Declare if user-facing — assign an incident commander, open a channel.
- What changed? Roughly 80% of incidents follow a deploy, config, or traffic change. Check the change timeline against the onset.
- Stop the bleeding before diagnosing: rollback, disable the flag, fail over, shed load.
Mitigate before you diagnose
Don't root-cause a live outage. If a deploy lines up with the onset, roll back now and investigate after.
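The "what changed?" check can be sketched as a filter over a change log. This is a minimal illustration, not a real tool: the `Change` record, the `suspect_changes` helper, the 30-minute lookback, and the sample entries are all assumptions made up for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Change:
    kind: str      # hypothetical labels: "deploy", "flag", "config"
    target: str    # service or flag the change touched
    at: datetime   # when the change landed


def suspect_changes(changes, incident_start, lookback=timedelta(minutes=30)):
    """Return changes that landed in the lookback window before the incident, newest first."""
    window_start = incident_start - lookback
    hits = [c for c in changes if window_start <= c.at <= incident_start]
    return sorted(hits, key=lambda c: c.at, reverse=True)


# Made-up change log for illustration only.
start = datetime(2024, 1, 1, 12, 0)
log = [
    Change("deploy", "checkout", datetime(2024, 1, 1, 11, 52)),
    Change("flag", "new-pricing", datetime(2024, 1, 1, 9, 15)),
    Change("config", "lb-timeouts", datetime(2024, 1, 1, 11, 40)),
]
suspects = suspect_changes(log, start)
for c in suspects:
    print(c.kind, c.target, c.at.isoformat())
```

The newest change in the window is listed first because it is usually the cheapest rollback candidate to try.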
Scope the blast radius
- One service or many? Is the fault in its dependencies, or in the services that call it?
- One region/tenant or global?
- Correlate errors, latency, saturation with the change timeline.
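One way to answer the scoping questions above is to slice the error rate by one dimension at a time. The sketch below assumes request records are plain dicts with `service`, `region`, and `status` keys and treats any 5xx status as an error; the field names and sample data are hypothetical.

```python
from collections import defaultdict


def error_rate_by(requests, dimension):
    """Map each value of `dimension` to its error rate (5xx responses / total)."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for r in requests:
        key = r[dimension]
        totals[key] += 1
        if r["status"] >= 500:
            errors[key] += 1
    return {k: errors[k] / totals[k] for k in totals}


# Made-up sample: errors appear in both services, but only in one region.
sample = [
    {"service": "checkout", "region": "eu-west", "status": 500},
    {"service": "search", "region": "eu-west", "status": 503},
    {"service": "checkout", "region": "eu-west", "status": 200},
    {"service": "checkout", "region": "us-east", "status": 200},
    {"service": "search", "region": "us-east", "status": 200},
]
by_region = error_rate_by(sample, "region")
by_service = error_rate_by(sample, "service")
print(by_region)
print(by_service)
```

If one slice concentrates the errors (here, a single region across both services), the blast radius is probably bounded by that dimension.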
Cascading failure patterns
| Pattern | What happens | Mitigation |
|---|---|---|
| Retry storm | Failures → retries → more load | Backoff + jitter, retry budgets, breakers |
| Thundering herd | Cache expiry → everyone hammers at once | Jittered TTLs, request coalescing |
| Resource exhaustion | One slow dep ties up all threads | Timeouts, bulkheads, concurrency limits |
| Death spiral | Restarts under load never catch up | Shed load, scale, slow-start |
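Three of the mitigations in the table can be sketched in a few lines: exponential backoff with full jitter, jittered cache TTLs, and a retry budget. The function names, defaults, and the cumulative (non-windowed) budget are assumptions chosen for brevity; a production budget would normally track a sliding time window.

```python
import random


def backoff_delay(attempt, base=0.1, cap=10.0, rng=random):
    """Full-jitter backoff: uniform in [0, min(cap, base * 2**attempt)] seconds."""
    return rng.uniform(0.0, min(cap, base * (2 ** attempt)))


def jittered_ttl(ttl, spread=0.1, rng=random):
    """Scale a TTL by a random factor in [1 - spread, 1 + spread] so keys expire at different times."""
    return ttl * (1.0 + rng.uniform(-spread, spread))


class RetryBudget:
    """Permit retries only while they stay at or under `ratio` of requests seen.

    Simplification: counters are cumulative, not windowed.
    """

    def __init__(self, ratio=0.1):
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def try_retry(self):
        if self.retries + 1 > self.ratio * self.requests:
            return False  # budget spent: fail fast instead of adding load
        self.retries += 1
        return True


budget = RetryBudget(ratio=0.1)
for _ in range(100):
    budget.record_request()
granted = sum(budget.try_retry() for _ in range(50))
print(granted)
```

The budget is what breaks the retry-storm feedback loop: once retries reach the configured share of traffic, further failures are surfaced to the caller rather than amplified.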
Rollback vs forward-fix
Default to rollback — fastest known-good. Forward-fix only when rollback is impossible (irreversible migration) or the fix is trivial and certain.
Communication
- Status updates on a cadence — silence breeds escalation.
- One source of truth; IC coordinates, others execute.
- Record the timeline as you go; it becomes the skeleton of the postmortem.
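Recording the timeline can be as simple as appending timestamped notes. The `Timeline` class below is a hypothetical sketch, assuming UTC timestamps and an injectable clock so the output is reproducible in tests; a shared document or chat bot serves the same purpose in practice.

```python
from datetime import datetime, timezone


class Timeline:
    """Append-only incident log; entries are (UTC timestamp, text) pairs."""

    def __init__(self, clock=lambda: datetime.now(timezone.utc)):
        self._clock = clock
        self.entries = []

    def note(self, text):
        self.entries.append((self._clock(), text))

    def render(self):
        """One line per entry, ready to paste into a postmortem draft."""
        return "\n".join(f"{ts:%H:%M:%SZ} {text}" for ts, text in self.entries)


timeline = Timeline()
timeline.note("Declared incident, IC assigned")
timeline.note("Rolled back latest deploy")
print(timeline.render())
```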
Blameless postmortem
Timeline, impact, root cause (5 whys), what went well/poorly, action items with owners. Target systems and gaps, not people.
Triage checklist
1. Confirm + scope (errors, latency, saturation)
2. What changed? (deploys, flags, config, traffic)
3. Mitigate: rollback / disable flag / failover / shed load
4. Communicate on a cadence
5. Verify recovery, then root-cause + postmortem
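The "verify recovery" step is easier to trust when it is an explicit check rather than a glance at a dashboard. A minimal sketch, assuming error rates are sampled at a fixed interval; the `recovered` helper, the 1% threshold, and the five-sample window are illustrative defaults, not recommendations from the source.

```python
def recovered(error_rates, threshold=0.01, consecutive=5):
    """True when the last `consecutive` samples are all at or below `threshold`."""
    if len(error_rates) < consecutive:
        return False  # not enough data to call it yet
    return all(r <= threshold for r in error_rates[-consecutive:])


# Made-up samples: the rate drops after mitigation and stays low.
samples = [0.20, 0.10, 0.005, 0.004, 0.003, 0.002, 0.001]
print(recovered(samples))
```

Requiring several consecutive good samples guards against declaring recovery on a single lucky reading while the system is still flapping.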