
DEBUG GUIDE · MASTER · METHODOLOGY

How to Debug Anything — The Master Guide.

Debugging is not luck or talent — it's a method. The specific bug doesn't matter: application, system, server, or network, the loop is the same. Reproduce → observe → form one hypothesis → isolate by halving → test one change → verify → understand why. This guide is that loop, plus the layer-by-layer map so you always know where to look.

The mindset

  • The computer is not lying. It's doing exactly what it was told. The bug is a wrong assumption — yours. Find the assumption.
  • Don't guess — know. Replace "I think it's X" with a measurement that proves or kills X. Every step should reduce uncertainty.
  • It's always doing something. "Nothing happens" is data: nothing reached the log, the request never arrived, the process isn't running. Narrow where the chain breaks.
  • Change one thing at a time. Two changes at once and you can't tell which fixed (or broke) it.

The universal debugging loop

  1. Reproduce. A bug you can trigger on demand is half-solved. Find the minimal steps. Intermittent? Find what correlates (load, time, a specific input, one host).
  2. Observe. Read the actual error in full, starting from the first one, not the last. Check logs, metrics, traces, exit codes. Look, don't assume.
  3. Hypothesize. One testable theory: "the request never reaches the backend."
  4. Isolate (bisect). Halve the search space with one test. Is it client or server? Cut the system in the middle and check which side is wrong.
  5. Test one change. Make the smallest change that proves/disproves the hypothesis. Revert if it didn't help.
  6. Verify. Confirm the fix actually fixes it — and didn't just move the symptom. Reproduce the original trigger.
  7. Understand why. If you don't know why the fix works, you haven't fixed it — you've hidden it. It'll be back.
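
For an intermittent bug, step 1 often starts with measuring how often it fails. A minimal sketch, assuming the failure shows up as a non-zero exit status; `repro_rate` is a hypothetical helper name, not a standard tool:

```shell
# repro_rate RUNS CMD...: run CMD RUNS times and report how often it fails.
# Hypothetical helper for this sketch, not a standard tool.
repro_rate() {
  runs=$1
  shift
  fails=0
  i=0
  while [ "$i" -lt "$runs" ]; do
    # Discard output; only the exit status matters here.
    "$@" >/dev/null 2>&1 || fails=$((fails + 1))
    i=$((i + 1))
  done
  echo "$fails/$runs runs failed"
}
```

Something like `repro_rate 50 ./run_test.sh` (where `run_test.sh` stands in for your own trigger) gives a baseline failure rate. Re-run it after each change: a fix for a 1-in-10 failure is only believable after many clean runs.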
Read the first error. Stack traces and CI logs cascade, so the last error is often a downstream effect. Scroll to the first failure; that's usually the real cause.
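
Finding the first failure in a long log can be sketched with `grep`. The function name and the match pattern are assumptions; adjust the pattern to whatever your logs actually print:

```shell
# first_error FILE: print the first line that looks like an error,
# prefixed with its line number. The pattern is an assumption.
first_error() {
  grep -n -m 1 -iE 'error|fatal|exception|panic' "$1"
}
```

`-m 1` stops at the first match and `-n` prints its line number, so you can jump straight to that line and read downward from there.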

Bisection — the most powerful move

Every hard bug shrinks fast if you keep halving the space:

  • In the stack: client → DNS → LB → app → DB. Test the midpoint, eliminate half.
  • In time: "worked yesterday." git bisect finds the first bad commit in about log2(n) steps, where n is the number of commits between the good and bad revisions.
  • In data: bad input? Binary-search the dataset for the row that triggers it.
  • In config: works in dev, not prod? Diff them; flip settings one at a time.
# git bisect: find the commit that introduced a bug
git bisect start
git bisect bad                 # current is broken
git bisect good v1.2.0         # this tag was fine
# git checks out the midpoint; test, then mark:
git bisect good   # or: git bisect bad
# repeat until it names the culprit commit, then:
git bisect reset
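
The manual good/bad marking can be automated with `git bisect run`, which runs a command at each midpoint and treats exit 0 as good and exit 1 to 127 (except 125, which means "skip") as bad. The sketch below is a self-contained demo under stated assumptions: it builds a throwaway repository of 8 commits in a temp directory, with the "bug" (a file saying `fail`) introduced at commit 6. The repository, file names, and `bisect_demo` are all hypothetical:

```shell
# bisect_demo: create a throwaway repo, break it at commit 6,
# and let `git bisect run` find the breaking commit on its own.
bisect_demo() (
  set -e
  repo=$(mktemp -d)
  cd "$repo"
  git init -q .
  # Supply an identity inline so the demo works without global git config.
  g() { git -c user.name=demo -c user.email=demo@example.com "$@"; }
  for i in 1 2 3 4 5 6 7 8; do
    # Commits 1-5 record "pass"; from commit 6 on the file says "fail".
    if [ "$i" -ge 6 ]; then echo fail > status.txt; else echo pass > status.txt; fi
    echo "$i" > counter.txt
    g add .
    g commit -q -m "commit $i"
  done
  first=$(git rev-list --max-parents=0 HEAD)
  # Bad revision first (HEAD), then the known-good one (the first commit).
  git bisect start HEAD "$first" >/dev/null
  # The test command: exit 0 means good, exit 1 means bad.
  git bisect run sh -c 'grep -qx pass status.txt' >/dev/null 2>&1
  # After a successful run, refs/bisect/bad points at the first bad commit.
  subject=$(git log -1 --format=%s refs/bisect/bad)
  git bisect reset >/dev/null 2>&1
  cd /
  rm -rf "$repo"
  echo "$subject"
)
```

Calling `bisect_demo` prints `commit 6`. In a real repository the equivalent would be `git bisect start HEAD v1.2.0` followed by `git bisect run ./test.sh`, where `test.sh` is your own script that exits non-zero when the bug is present.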

Diagnostic frameworks — USE, RED, Golden Signals

When you don't know where to look, these give you a checklist so you cover the system methodically instead of poking randomly. Each fits a different question: USE for resources, RED/Golden Signals for services.

USE — for resources (is the machine the problem?)

Brendan Gregg's method. For every resource (CPU, memory, disk, network, and their sub-resources) check three things:

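U: Utilization               | S: Saturation                               | E: Errors
-----------------------------+---------------------------------------------+-------------------------------
% time the resource was busy | Queued/waiting work it couldn't service yet | Error counts on the resource
mpstat, iostat %util         | run-queue, iowait, swap, retransmits        | dmesg, NIC errors, disk errors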

Saturation is the one people miss: a disk at 100% utilization with a deep I/O queue is the bottleneck even though CPU looks fine. Walk every resource through U/S/E and the starved one falls out.
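
As one concrete saturation check, CPU queueing can be approximated by comparing the load average with the CPU count. This is a rough sketch: `cpu_saturation` is a hypothetical helper, and on Linux the load average also counts tasks blocked in uninterruptible I/O, so treat the result as a hint to look closer (for example at the run-queue column of `vmstat 1`), not as proof.

```shell
# cpu_saturation LOAD NCPU: compare a load average with the CPU count.
# A ratio above 1.0 suggests more runnable work than CPUs, i.e. queueing.
cpu_saturation() {
  awk -v load="$1" -v ncpu="$2" 'BEGIN {
    ratio = load / ncpu
    printf "load/cpu = %.2f (%s)\n", ratio, (ratio > 1.0 ? "saturated" : "ok")
  }'
}
```

On Linux this could be fed live values with `cpu_saturation "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"`. A load of 6.0 on 4 CPUs prints `load/cpu = 1.50 (saturated)`.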

RED — for request-driven services

For each service, watch three signals:

  • R — Rate: requests per second.
  • E — Errors: failed requests per second (and ratio).
  • D — Duration: latency distribution — always percentiles (p50/p95/p99), never just the mean.

A spike in any one localizes the incident: errors up = something broke; duration up = something slow/saturated; rate cliff = traffic isn't arriving (upstream/DNS/LB). Per-service RED dashboards tell you which service to open first.
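
When no dashboard exists yet, the three RED numbers can be pulled from a request log by hand. A minimal sketch, assuming a hypothetical log format of `<epoch_seconds> <http_status> <duration_ms>` per line and counting only 5xx responses as errors; percentiles use the nearest-rank method:

```shell
# red_summary FILE: print rate, error ratio, and latency percentiles.
# Assumed input: one request per line, "<epoch_seconds> <http_status> <duration_ms>".
red_summary() {
  # Sort by duration so percentiles can be read off by position.
  sort -n -k3,3 "$1" | awk '
    {
      dur[NR] = $3
      if ($2 >= 500) errors++
      if (min == "" || $1 < min) min = $1
      if ($1 > max) max = $1
    }
    # Nearest-rank percentile: index = ceil(p/100 * N).
    function pct(p,   idx) {
      idx = int((p * NR + 99) / 100)
      if (idx < 1) idx = 1
      return dur[idx]
    }
    END {
      if (NR == 0) { print "no requests"; exit }
      span = max - min + 1   # seconds covered by the log
      printf "rate=%.2f/s errors=%.1f%% p50=%d p95=%d p99=%d\n",
        NR / span, 100 * errors / NR, pct(50), pct(95), pct(99)
    }'
}
```

For ten requests over five seconds with durations 10 to 90 ms plus one 1000 ms outlier, this reports p50=50 but p95=1000, while the mean (145 ms) would describe none of the requests. That gap is why the list above asks for percentiles.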

Four Golden Signals (Google SRE)

RED plus one — for user-facing systems:

  • Latency (split successful vs failed — a fast 500 shouldn't look healthy).
  • Traffic (demand on the system).
  • Errors (rate of failed requests).
  • Saturation (how "full" the service is — the leading indicator of trouble).
Which framework when? USE = "is a resource the bottleneck?" (host/infra view). RED = "is a service misbehaving?" (request view). Golden Signals = RED + saturation for user-facing SLOs. Use them together: RED finds the sick service, USE finds the starved resource underneath it.

And the reasoning frameworks

  • Scientific method: the debugging loop itself — hypothesis → experiment → observe → refine. One variable per experiment.
  • 5 Whys: keep asking "why" past the symptom until you reach the root cause (and the process gap that let it happen). The fix lives at the last why, not the first.

The layer map — where to look

Any request crosses layers. Bugs live at one. Walk them outside-in or follow the request:

Layer           | Ask                                 | Tools
----------------+-------------------------------------+-------------------------------------------------
Application     | Logic, exceptions, bad state, deps? | logs, debugger, stack traces, profilers, tracing
Runtime/process | Crashing, OOM, GC, threads, fds?    | exit codes, dmesg, heap/CPU profiles, strace
System/OS       | CPU, memory, disk, inodes, limits?  | top, vmstat, iostat, df, ulimit
Server/host     | Up? Right config? Time correct?     | ssh, systemd, journalctl, NTP
Network         | Reachable? DNS? Firewall? TLS?      | curl -v, dig, nc, traceroute, ss, tcpdump
Data            | Query slow? Locks? Stale replica?   | EXPLAIN, DB stat views, slow logs

Debugging applications

  • Reproduce with the smallest input. Add logging at the boundary where you think it's still correct, then move it until the value goes wrong.
  • Read the exception type + message literally. NullPointer/undefined = something you assumed existed didn't.
  • Rubber-duck it: explain the code line by line out loud; the wrong assumption surfaces.
  • Diff against the last working version. What changed — code, deps, data, config?
  • Use a real debugger over scattered prints when state is complex; set a breakpoint where it's still right and step until it breaks.
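
The "what changed" question is often quickest to answer for config. A minimal sketch for comparing two `KEY=VALUE` files regardless of line order; `config_diff` and the file names in the usage note are hypothetical:

```shell
# config_diff A B: show lines that differ between two KEY=VALUE files.
# Lines only in A are prefixed "<", lines only in B are prefixed ">".
config_diff() {
  a=$(mktemp)
  b=$(mktemp)
  # Sort first so a different key order is not reported as a difference.
  sort "$1" > "$a"
  sort "$2" > "$b"
  # `|| true`: no differences is a valid result, not a failure.
  diff "$a" "$b" | grep '^[<>]' || true
  rm -f "$a" "$b"
}
```

`config_diff dev.env prod.env` prints only the settings that differ, which gives you the short list to flip one at a time. For code, `git diff <last-good-tag> HEAD --stat` plays the same role.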

Debugging systems & servers

  • Start with resources — the USE method: for each resource check Utilization, Saturation, Errors. CPU, memory, disk, network.
  • "Slow" → is it CPU saturation, CPU throttling, memory pressure/swap, disk I/O wait, or waiting on a dependency? Each has a distinct signal.
  • Is the process even running? Right user, right config file, right env? Check journalctl -u svc and exit codes.
  • Check the clock (NTP) — skew breaks TLS, auth, and logs.
uptime; vmstat 1; mpstat -P ALL 1     # CPU
free -m; cat /proc/meminfo            # memory / swap
df -h; df -i; iostat -xz 1            # disk: space, inodes, I/O
ss -s; ss -ltnp                       # sockets / listeners
journalctl -u <svc> --since "10 min ago"   # recent logs for the unit
strace -p <pid>                       # what syscalls is it stuck on?
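
Reading `df` output by eye gets tedious across many hosts. A small sketch that flags nearly full filesystems, assuming the POSIX `df -P` column layout (use% in column 5, mount point in column 6, no spaces in mount paths); `full_filesystems` is a hypothetical helper:

```shell
# full_filesystems LIMIT: read `df -P` output on stdin and print every
# mount point whose use% is at or above LIMIT.
full_filesystems() {
  awk -v limit="$1" 'NR > 1 {
    use = $5
    sub(/%/, "", use)          # "91%" -> "91"
    if (use + 0 >= limit + 0) print $6 " " $5
  }'
}
```

Typical use is `df -P | full_filesystems 90`. On GNU df, `df -i` puts IUse% in the same column, so `df -i | full_filesystems 90` should catch inode exhaustion the same way.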

Debugging networks

Follow the packet's journey and test each hop:

  1. Resolve: dig name — does DNS return the right IP?
  2. Reach: ping/traceroute — is the host reachable, where does it die?
  3. Connect: nc -vz host port — refused (no listener) vs timeout (firewall)?
  4. Speak: curl -v — does the app respond, what status, TLS ok?
  5. Inspect: tcpdump when you must see the actual packets.
Refused ≠ timeout. Connection refused = you reached the host, nothing's listening (app down / wrong port). Timeout = packets vanished (firewall, wrong IP, routing). Which word you see picks your next step.
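
The refused-versus-timeout distinction can be scripted for step 3. This sketch assumes bash (for its `/dev/tcp` pseudo-device) and coreutils `timeout`, which exits with status 124 when the time limit expires; both function names are hypothetical. A fast failure is most often "connection refused", but an unresolvable name or a missing route also fails fast, so the wording is deliberately hedged:

```shell
# classify_connect STATUS: translate the exit status of a timed connect attempt.
classify_connect() {
  case "$1" in
    0)   echo "open: something is listening" ;;
    124) echo "timeout: packets are vanishing (firewall, wrong IP, routing)" ;;
    *)   echo "failed fast: most often connection refused (no listener)" ;;
  esac
}

# check_port HOST PORT: attempt a TCP connect with a 3-second limit.
check_port() {
  status=0
  # Inside bash -c, $0 is HOST and $1 is PORT.
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null || status=$?
  classify_connect "$status"
}
```

`check_port db.internal 5432` (a made-up host) then tells you whether to look at the service or at the network path.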

When you're truly stuck

  • Re-read the error — slowly, every word. The answer is often right there.
  • Question the assumption you're most sure of. The bug hides behind "that part definitely works."
  • Make it smaller. Strip the system to a minimal reproduction; the bug gets nowhere to hide.
  • Take a break. Stepping away genuinely surfaces answers — well-documented, not a cliché.
  • Explain it to someone (or a duck). Forcing words exposes the gap.
  • Check the dumb stuff: right environment? saved the file? deployed the build? correct cluster/branch? typo in the name?

Biases that waste hours

Trap                              | Antidote
----------------------------------+---------------------------------------------
Fixing the symptom, not the cause | Ask "why" until you hit the root (5 whys).
Assuming, not measuring           | Prove each step with a tool.
Changing many things at once      | One variable per test.
"It can't be that"                | Test it anyway; it often is.
Tunnel vision on familiar code    | The bug may be in config/data/env, not code.
Not reading the whole error       | First error, full message.

Go deeper — specific playbooks

This is the method. For the failure you're actually staring at, jump to the specific guide:

© cvam — written in plaintext, served warm