
// DEBUG GUIDES live

Debug Guides.

Symptom → likely cause → fix. Step-by-step playbooks for the failures DevOps and SRE engineers actually hit in production — reproduce, isolate, diagnose, resolve. 19 playbooks — a master guide, platform and server playbooks, failure-mode runbooks, plus a GPU / inference / training track.


START HERE

🧭master guide

How to Debug Anything live

The method behind every guide below — the universal debugging loop (reproduce → observe → bisect → verify) and the layer map for application, system, server, and network. Read this first.

#methodology #sre #problem-solving
master playbook· any layer
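The bisect step of the loop above can be sketched in a few lines. This is a minimal illustration, not the master guide's own code: `first_bad` and the `is_bad` check are hypothetical names, and it assumes the candidates (releases, commits, config revisions) are ordered so that everything before the regression is good and everything from it onward is bad.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def first_bad(candidates: Sequence[T], is_bad: Callable[[T], bool]) -> T:
    """Binary-search for the first candidate where is_bad() is true.

    Assumption: candidates are ordered good..good, bad..bad, and the
    last one is known to be bad (that is the failure being debugged).
    """
    if not candidates:
        raise ValueError("no candidates to bisect")
    lo, hi = 0, len(candidates) - 1
    if not is_bad(candidates[hi]):
        raise ValueError("last candidate is not bad; nothing to bisect")
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(candidates[mid]):
            hi = mid       # regression is at mid or earlier
        else:
            lo = mid + 1   # regression is after mid
    return candidates[lo]
```

In practice `is_bad` would deploy or check out the candidate and run the reproduction from the first step of the loop; each call halves the remaining search space.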

PLATFORMS & SERVERS

kubernetes

Kubernetes live

Cluster-level: events, NotReady nodes, control plane, service/DNS/ingress, PVC pending, RBAC, evictions, debug toolkit.

cluster playbook· symptom → fix
🐳docker

Docker live

Engine-level: daemon down, disk/log bloat, build failures, networking, volumes/perms, image pull, inspection.

engine playbook· symptom → fix
🐘postgres

PostgreSQL live

Deep Postgres: MVCC + bloat/autovacuum, EXPLAIN, indexes, locks/deadlocks, pools, XID wraparound, replication, cache.

deep playbook· 14 sections
aws

AWS live

Can't SSH to EC2, IAM AccessDenied, VPC no internet, ELB unhealthy, RDS connect, Lambda errors — SG/IAM first.

cloud playbook· symptom → fix
🐧linux

Ubuntu Server live

Can't SSH, service won't start (systemd/journalctl), boot failures, resource pressure, netplan, apt, where the logs are.

server playbook· symptom → fix
🪟windows

Windows Server live

Event Viewer, services (sc/PowerShell), RDP, high CPU/mem, networking, IIS, updates — PowerShell diagnostics.

server playbook· symptom → fix

KUBERNETES & CONTAINERS

kubernetes

Pod Failures live

CrashLoopBackOff, ImagePullBackOff, OOMKilled, Pending, Init errors — decode each status and the fix path.

7 statuses· triage workflow

SYSTEM & PERFORMANCE

🔥cpu

High CPU / Throttling live

Saturation vs throttling, cgroup limits, run-queue, load average, flame graphs, top/pidstat workflow.

playbook· symptom → fix
🧠memory

Memory Leaks & OOM live

RSS vs cache, OOM killer, heap growth, dmesg evidence, limits vs requests, leak isolation.

playbook· symptom → fix
💾disk

Disk Full & I/O Bottlenecks live

No space left, inode exhaustion, hidden deleted-but-open files, iostat/iotop, log rotation gone wrong.

playbook· symptom → fix

NETWORK & CONNECTIVITY

🔌network

Connection Timeouts & Refused live

Refused vs timeout vs reset, firewall/SG, bind address (127.0.0.1 vs 0.0.0.0), MTU, conntrack, port exhaustion.

playbook· symptom → fix
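The refused / timeout / reset distinction can be observed from a client with the standard library alone. The sketch below is an illustration under stated assumptions, not the playbook's own tooling: `classify_connect` is a hypothetical helper and the returned labels are arbitrary. Roughly, "refused" means a host answered but nothing is listening on that port, while "timeout" usually means packets are being dropped on the way (firewall, security group, routing).

```python
import socket


def classify_connect(host: str, port: int, timeout: float = 3.0) -> str:
    """Attempt one TCP connect and label the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host replied with RST: reachable, but no listener on this port.
        return "refused"
    except ConnectionResetError:
        # Rare at connect time; resets more often surface on a later read/write.
        return "reset"
    except (socket.timeout, TimeoutError):
        # No reply within the timeout: likely dropped by a firewall/SG or routing.
        return "timeout"
    except OSError as exc:
        # DNS failure, unreachable network, and similar.
        return f"error: {exc}"
```

Running it from two vantage points (on the server itself against 127.0.0.1, and from a remote client) helps separate a bind-address problem from a firewall problem.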


CI/CD, CLOUD & INCIDENTS

🆘incident

Incident Triage & Cascading Failures live

First 5 minutes, blast-radius scoping, retry storms, thundering herd, rollback vs forward-fix.

playbook· symptom → fix

MESSAGING & DATA STORES

🪵kafka

Kafka live

Consumer lag, rebalancing storms, under-replicated partitions, ISR shrink, broker down, offset resets. Debugging on Docker & K8s.

6 failure modes· Docker & K8s
🧱redis

Redis live

OOM & eviction, slow commands, blocking ops, persistence (RDB/AOF), replication lag, hot keys. Docker & K8s.

7 failure modes· Docker & K8s
🐰rabbitmq

RabbitMQ live

Queue buildup, unacked messages, memory/disk alarms, flow control, dead-letter loops, connection churn. Docker & K8s.

6 failure modes· Docker & K8s

GPU, INFERENCE & TRAINING

🎛gpu

GPU & CUDA live

CUDA OOM & fragmentation, GPU underutilization/starvation, clock throttling, NCCL hangs & collective mismatch, driver/arch errors, Xid hardware faults.

playbook· symptom → fix
inference

LLM Inference live

Slow TTFT vs TPOT, low throughput under load, KV-cache OOM, garbled output (chat template/sampling/stop tokens), "same prompt different answer".

playbook· symptom → fix
🏋training

LLM Training live

Loss NaN/diverging/flat, OOM, low MFU, distributed hangs (collective mismatch / dead rank), reproduce & resume — with grad-norm as the smoke alarm.

playbook· symptom → fix