Debugging is not luck or talent — it's a method. The specific bug doesn't matter: application, system, server, or network, the loop is the same. Reproduce → observe → form one hypothesis → isolate by halving → test one change → verify → understand why. This guide is that loop, plus the layer-by-layer map so you always know where to look.
The mindset
- The computer is not lying. It's doing exactly what it was told. The bug is a wrong assumption — yours. Find the assumption.
- Don't guess — know. Replace "I think it's X" with a measurement that proves or kills X. Every step should reduce uncertainty.
- It's always doing something. "Nothing happens" is data: nothing reached the log, the request never arrived, the process isn't running. Narrow where the chain breaks.
- Change one thing at a time. Two changes at once and you can't tell which fixed (or broke) it.
The universal debugging loop
- Reproduce. A bug you can trigger on demand is half-solved. Find the minimal steps. Intermittent? Find what correlates (load, time, a specific input, one host).
- Observe. Read the actual error — all of it, the first one, not the last. Logs, metrics, traces, exit codes. Look, don't assume.
- Hypothesize. One testable theory: "the request never reaches the backend."
- Isolate (bisect). Halve the search space with one test. Is it client or server? Cut the system in the middle and check which side is wrong.
- Test one change. Make the smallest change that proves/disproves the hypothesis. Revert if it didn't help.
- Verify. Confirm the fix actually fixes it — and didn't just move the symptom. Reproduce the original trigger.
- Understand why. If you don't know why the fix works, you haven't fixed it — you've hidden it. It'll be back.
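For intermittent bugs, the "Reproduce" step often comes down to running the trigger repeatedly until it fails. A minimal sketch in POSIX shell, assuming the trigger is a command whose non-zero exit status means "the bug happened"; `repro_until_fail` is a hypothetical helper name, not a standard tool:

```sh
# repro_until_fail MAX CMD...: run CMD up to MAX times, stop at the first failure.
# Prints which attempt failed, or a summary if no attempt failed.
repro_until_fail() {
    max=$1
    shift
    i=1
    while [ "$i" -le "$max" ]; do
        if ! "$@"; then
            echo "failed on attempt $i of $max"
            return 1
        fi
        i=$((i + 1))
    done
    echo "no failure in $max attempts"
    return 0
}
```

A call might look like `repro_until_fail 200 curl -fsS http://localhost:8080/health`, where the URL is only a placeholder. Knowing roughly how often it fails also tells you how many clean runs you need before trusting a fix.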
Bisection — the most powerful move
Every hard bug shrinks fast if you keep halving the space:
- In the stack: client → DNS → LB → app → DB. Test the midpoint, eliminate half.
- In time: "worked yesterday." `git bisect` finds the exact commit in log(n) steps.
- In data: bad input? Binary-search the dataset for the row that triggers it.
- In config: works in dev, not prod? Diff them; flip settings one at a time.
```sh
# git bisect: find the commit that introduced a bug
git bisect start
git bisect bad              # current is broken
git bisect good v1.2.0      # this tag was fine
# git checks out the midpoint; test, then mark:
git bisect good             # or: git bisect bad
# repeat until it names the culprit commit, then:
git bisect reset
```
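The same halving works on data. The sketch below binary-searches a line-oriented file for the one row that triggers a failure. It assumes exactly one bad row, and that a check command fails whenever that row is present in the subset it is given. `bisect_rows` and the check-command convention are illustrative, not an existing tool:

```sh
# bisect_rows FILE CHECK...: find the single line of FILE that makes CHECK fail.
# CHECK is run as: CHECK <path-to-subset>; exit 0 = fine, non-zero = bug triggered.
bisect_rows() {
    file=$1
    shift
    lo=1
    hi=$(wc -l < "$file")
    hi=$((hi + 0))            # normalize padding that some wc versions emit
    subset=$(mktemp)
    while [ "$lo" -lt "$hi" ]; do
        mid=$(( (lo + hi) / 2 ))
        # Test only the lower half of the current range: lines lo..mid.
        sed -n "${lo},${mid}p" "$file" > "$subset"
        if "$@" "$subset"; then
            lo=$((mid + 1))   # lower half is clean, so the bad line is above mid
        else
            hi=$mid           # lower half fails, so the bad line is inside it
        fi
    done
    rm -f "$subset"
    echo "line $lo: $(sed -n "${lo}p" "$file")"
}
```

If the failure needs two rows together, or more than one row is bad, this converges on a wrong or partial answer, so confirm the result by running the check against that single row.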
Diagnostic frameworks — USE, RED, Golden Signals
When you don't know where to look, these give you a checklist so you cover the system methodically instead of poking randomly. Each fits a different question: USE for resources, RED/Golden Signals for services.
USE — for resources (is the machine the problem?)
Brendan Gregg's method. For every resource (CPU, memory, disk, network, and their sub-resources) check three things:
| U — Utilization | S — Saturation | E — Errors |
|---|---|---|
| % time the resource was busy | Queued/waiting work it couldn't service yet | Error counts on the resource |
| mpstat, iostat %util | run-queue, iowait, swap, retransmits | dmesg, NIC errors, disk errors |
Saturation is the one people miss: a disk at 100% utilization with a deep I/O queue is the bottleneck even though CPU looks fine. Walk every resource through U/S/E and the starved one falls out.
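As one concrete saturation check, load average can be compared with the CPU count. This is a rough sketch: the 1.0-per-CPU threshold is a rule of thumb, and on Linux the load average also counts tasks in uninterruptible I/O wait, so a high value is a prompt to look closer rather than proof of CPU saturation. `load_per_cpu` is a hypothetical helper name:

```sh
# load_per_cpu LOAD NCPU: print load per CPU and a rough verdict.
# A sustained value above 1.0 suggests runnable work is queuing.
load_per_cpu() {
    awk -v load="$1" -v ncpu="$2" 'BEGIN {
        ratio = load / ncpu
        verdict = (ratio > 1.0) ? "saturated" : "ok"
        printf "%.2f %s\n", ratio, verdict
    }'
}
```

On a Linux host it could be fed live values with `load_per_cpu "$(cut -d ' ' -f 1 /proc/loadavg)" "$(nproc)"`.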
RED — for request-driven services
For each service, watch three signals:
- R — Rate: requests per second.
- E — Errors: failed requests per second (and ratio).
- D — Duration: latency distribution — always percentiles (p50/p95/p99), never just the mean.
A spike in any one localizes the incident: errors up = something broke; duration up = something slow/saturated; rate cliff = traffic isn't arriving (upstream/DNS/LB). Per-service RED dashboards tell you which service to open first.
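To see why Duration should be read as percentiles, here is a small sketch that computes a nearest-rank percentile and the mean from one latency value per line. The input format and the helper names `percentile` and `mean` are assumptions for illustration; a real metrics system would compute these for you:

```sh
# percentile P: read one latency per line on stdin, print the P-th percentile
# using the nearest-rank method (rank = ceil(P * N / 100)).
percentile() {
    sort -n | awk -v p="$1" '
        { v[NR] = $1 }
        END {
            if (NR == 0) exit 1
            rank = p * NR / 100
            idx = int(rank)
            if (idx < rank) idx++      # round up to the next whole rank
            if (idx < 1) idx = 1
            print v[idx]
        }'
}

# mean: arithmetic mean of the same input, for comparison.
mean() {
    awk '{ s += $1 } END { if (NR > 0) printf "%.1f\n", s / NR }'
}
```

With nine requests at 10 ms and one at 1000 ms, `mean` prints 109.0, a latency no request actually had, while `percentile 50` prints 10 and `percentile 99` prints 1000.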
Four Golden Signals (Google SRE)
RED plus one — for user-facing systems:
- Latency (split successful vs failed — a fast 500 shouldn't look healthy).
- Traffic (demand on the system).
- Errors (rate of failed requests).
- Saturation (how "full" the service is — the leading indicator of trouble).
And the reasoning frameworks
- Scientific method: the debugging loop itself — hypothesis → experiment → observe → refine. One variable per experiment.
- 5 Whys: keep asking "why" past the symptom until you reach the root cause (and the process gap that let it happen). The fix lives at the last why, not the first.
The layer map — where to look
Any request crosses layers. Bugs live at one. Walk them outside-in or follow the request:
| Layer | Ask | Tools |
|---|---|---|
| Application | Logic, exceptions, bad state, deps? | logs, debugger, stack traces, profilers, tracing |
| Runtime/process | Crashing, OOM, GC, threads, fds? | exit codes, dmesg, heap/CPU profiles, strace |
| System/OS | CPU, memory, disk, inodes, limits? | top, vmstat, iostat, df, ulimit |
| Server/host | Up? Right config? Time correct? | ssh, systemd, journalctl, NTP |
| Network | Reachable? DNS? Firewall? TLS? | curl -v, dig, nc, traceroute, ss, tcpdump |
| Data | Query slow? Locks? Stale replica? | EXPLAIN, DB stat views, slow logs |
Debugging applications
- Reproduce with the smallest input. Add logging at the boundary where you think it's still correct, then move it until the value goes wrong.
- Read the exception type + message literally. `NullPointer`/`undefined` = something you assumed existed didn't.
- Rubber-duck it: explain the code line by line out loud; the wrong assumption surfaces.
- Diff against the last working version. What changed — code, deps, data, config?
- Use a real debugger over scattered prints when state is complex; set a breakpoint where it's still right and step until it breaks.
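For the "what changed in config?" question, an order-insensitive diff of two settings files is often enough to start. The sketch below assumes simple `KEY=VALUE` files with `#` comments; `diff_config` is a hypothetical helper, and structured formats such as YAML or JSON would need a proper parser instead:

```sh
# diff_config A B: compare two KEY=VALUE files regardless of line order.
# Lines starting with "<" exist only in A, lines starting with ">" only in B.
# Blank lines and "#" comments are ignored.
diff_config() {
    a=$(mktemp)
    b=$(mktemp)
    grep -v -e '^[[:space:]]*$' -e '^[[:space:]]*#' "$1" | sort > "$a"
    grep -v -e '^[[:space:]]*$' -e '^[[:space:]]*#' "$2" | sort > "$b"
    # Keep only the differing lines; say so explicitly when there are none.
    diff "$a" "$b" | grep '^[<>]' || echo "no differences"
    rm -f "$a" "$b"
}
```

Each reported line is a candidate to flip one at a time, for example with `diff_config dev.env prod.env` (file names are placeholders).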
Debugging systems & servers
- Start with resources — the USE method: for each resource check Utilization, Saturation, Errors. CPU, memory, disk, network.
- "Slow" → is it CPU saturation, CPU throttling, memory pressure/swap, disk I/O wait, or waiting on a dependency? Each has a distinct signal.
- Is the process even running? Right user, right config file, right env? Check `journalctl -u svc` and exit codes.
- Check the clock (NTP): skew breaks TLS, auth, and logs.
```sh
uptime; vmstat 1; mpstat -P ALL 1        # CPU
free -m; cat /proc/meminfo               # memory / swap
df -h; df -i; iostat -xz 1               # disk: space, inodes, I/O
ss -s; ss -ltnp                          # sockets / listeners
journalctl -u svc --since "10 min ago"   # svc is a placeholder for the unit name
strace -p PID                            # what syscalls is it stuck on? (PID is a placeholder)
```
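Full disks and exhausted inodes are easy to miss when skimming `df` output by eye. A small sketch that filters the POSIX-format output (`df -P`) for filesystems at or above a threshold; `full_filesystems` is a hypothetical helper, and it assumes mount points without spaces:

```sh
# full_filesystems THRESHOLD: read `df -P` output on stdin and print every
# mount point whose use% is at or above THRESHOLD.
full_filesystems() {
    awk -v limit="$1" '
        NR == 1 { next }                 # skip the header row
        {
            pct = $5
            sub(/%$/, "", pct)           # "91%" -> "91"
            if (pct + 0 >= limit + 0) print $6, $5
        }'
}
```

Typical use would be `df -P | full_filesystems 90`. On GNU `df`, the same filter should also work for inodes via `df -Pi`, since the column layout matches, though that is worth confirming on your platform.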
Debugging networks
Follow the packet's journey and test each hop:
- Resolve: `dig name`. Does DNS return the right IP?
- Reach: `ping`/`traceroute`. Is the host reachable, and where does it die?
- Connect: `nc -vz host port`. Refused (no listener) vs timeout (firewall)?
- Speak: `curl -v`. Does the app respond, what status, TLS ok?
- Inspect: `tcpdump` when you must see the actual packets.
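The hop-by-hop walk can be scripted so it stops at the first layer that fails. This is a sketch under a few assumptions: `dig`, `nc`, and `curl` are installed, the service speaks HTTP(S), and the 3- and 5-second timeouts are arbitrary. It skips the ping step because ICMP is often filtered even when the service is reachable. `check_hops` is a hypothetical helper name:

```sh
# check_hops HOST PORT URL: test resolve -> connect -> speak, stop at first failure.
check_hops() {
    host=$1 port=$2 url=$3

    # Resolve: take the first answer DNS returns (may be a CNAME target).
    ip=$(dig +short "$host" | head -n 1)
    if [ -z "$ip" ]; then
        echo "resolve: FAIL (no DNS answer for $host)"
        return 1
    fi
    echo "resolve: ok ($ip)"

    # Connect: TCP handshake only, no data sent.
    if ! nc -z -w 3 "$host" "$port"; then
        echo "connect: FAIL (refused or timed out on $host:$port)"
        return 1
    fi
    echo "connect: ok"

    # Speak: fetch the URL and report only the HTTP status code.
    code=$(curl -sS -o /dev/null -w '%{http_code}' --max-time 5 "$url") || {
        echo "speak: FAIL (no HTTP response from $url)"
        return 1
    }
    echo "speak: HTTP $code"
}
```

The first FAIL line names the layer to dig into. Rerunning that one step by hand with the verbose flags (`nc -vz`, `curl -v`) then shows whether it was a refusal, a timeout, or a TLS problem.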
When you're truly stuck
- Re-read the error — slowly, every word. The answer is often right there.
- Question the assumption you're most sure of. The bug hides behind "that part definitely works."
- Make it smaller. Strip the system to a minimal reproduction; the bug gets nowhere to hide.
- Take a break. Stepping away genuinely surfaces answers — well-documented, not a cliché.
- Explain it to someone (or a duck). Forcing words exposes the gap.
- Check the dumb stuff: right environment? saved the file? deployed the build? correct cluster/branch? typo in the name?
Biases that waste hours
| Trap | Antidote |
|---|---|
| Fixing the symptom, not the cause | Ask "why" until you hit the root (5 whys). |
| Assuming, not measuring | Prove each step with a tool. |
| Changing many things at once | One variable per test. |
| "It can't be that" | Test it anyway — it often is. |
| Tunnel vision on familiar code | The bug may be in config/data/env, not code. |
| Not reading the whole error | First error, full message. |
Go deeper — specific playbooks
This is the method. For the failure you're actually staring at, jump to the specific guide:
- Kubernetes: Pod Failures · Kubernetes (cluster)
- System: High CPU · Memory Leaks & OOM · Disk & I/O
- Network: Connection Timeouts & Refused
- Reliability: Database · Incident Triage
- Messaging/stores: Kafka · Redis · RabbitMQ