Symptom → likely cause → fix. Step-by-step playbooks for the failures DevOps and SRE engineers actually hit in production — reproduce, isolate, diagnose, resolve. 19 playbooks — a master guide, platform and server playbooks, failure-mode runbooks, plus a GPU / inference / training track.
Cluster-level: events, NotReady nodes, control plane, service/DNS/ingress, PVC pending, RBAC, evictions, debug toolkit.
dockerEngine-level: daemon down, disk/log bloat, build failures, networking, volumes/perms, image pull, inspection.
postgresDeep Postgres: MVCC + bloat/autovacuum, EXPLAIN, indexes, locks/deadlocks, pools, XID wraparound, replication, cache.
awsCan't SSH to EC2, IAM AccessDenied, VPC no internet, ELB unhealthy, RDS connect, Lambda errors — SG/IAM first.
linuxCan't SSH, service won't start (systemd/journalctl), boot failures, resource pressure, netplan, apt, where the logs are.
windowsEvent Viewer, services (sc/PowerShell), RDP, high CPU/mem, networking, IIS, updates — PowerShell diagnostics.
Saturation vs throttling, cgroup limits, run-queue, load average, flame graphs, top/pidstat workflow.
memoryRSS vs cache, OOM killer, heap growth, dmesg evidence, limits vs requests, leak isolation.
diskNo space left, inode exhaustion, hidden deleted-but-open files, iostat/iotop, log rotation gone wrong.
Consumer lag, rebalancing storms, under-replicated partitions, ISR shrink, broker down, offset resets. Debugging on Docker & K8s.
redisOOM & eviction, slow commands, blocking ops, persistence (RDB/AOF), replication lag, keyspace hot keys. Docker & K8s.
rabbitmqQueue buildup, unacked messages, memory/disk alarms, flow control, dead-letter loops, connection churn. Docker & K8s.
CUDA OOM & fragmentation, GPU underutilization/starvation, clock throttling, NCCL hangs & collective mismatch, driver/arch errors, Xid hardware faults.
inferenceSlow TTFT vs TPOT, low throughput under load, KV-cache OOM, garbled output (chat template/sampling/stop tokens), "same prompt different answer".
trainingLoss NaN/diverging/flat, OOM, low MFU, distributed hangs (collective mismatch / dead rank), reproduce & resume — with grad-norm as the smoke alarm.