
DEBUG GUIDE · PERFORMANCE · SRE PLAYBOOK

Debugging High CPU & CPU Throttling.

Two problems, one mask. Saturation = work needs more CPU than exists (run queue grows). Throttling = a cgroup CPU limit caps you even while the node is idle. Same latency symptom, opposite fix. Tell them apart first.

Saturation vs throttling

uptime ; vmstat 1            # load avg, 'r' run-queue = saturation
mpstat -P ALL 1             # per-core busy
cat /sys/fs/cgroup/cpu.stat # run inside the container: nr_throttled, throttled_usec (cgroup v2)
kubectl top pod <pod>       # near CPU limit?

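nr_throttled rising while the node is idle → throttling. Load average above the core count with every core busy → saturation. Linux load average also counts tasks in uninterruptible I/O wait, so confirm with the vmstat 'r' column.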
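"Rising" is easier to judge as a rate. A minimal sketch, assuming cgroup v2 field names (nr_periods, nr_throttled) and two saved copies of cpu.stat; stat_field and throttle_pct are hypothetical helper names, not standard tools:

```shell
# Sketch: share of CFS periods that were throttled between two cpu.stat snapshots.
# Assumes cgroup v2 counters; missing counters are treated as 0.

# Print one counter from a cpu.stat-formatted file.
stat_field() {  # usage: stat_field <file> <key>
  awk -v key="$2" '$1 == key { print $2 }' "$1"
}

# Print the percentage of periods throttled between snapshot files $1 and $2.
throttle_pct() (
  p0=$(stat_field "$1" nr_periods);   p1=$(stat_field "$2" nr_periods)
  t0=$(stat_field "$1" nr_throttled); t1=$(stat_field "$2" nr_throttled)
  awk -v dp="$(( ${p1:-0} - ${p0:-0} ))" -v dt="$(( ${t1:-0} - ${t0:-0} ))" \
    'BEGIN { if (dp > 0) printf "%.1f\n", 100 * dt / dp; else print "0.0" }'
)

# Example, run inside the container:
#   cp /sys/fs/cgroup/cpu.stat /tmp/a ; sleep 10 ; cp /sys/fs/cgroup/cpu.stat /tmp/b
#   throttle_pct /tmp/a /tmp/b
```

A non-zero percentage while mpstat shows idle cores points at the limit, not the node.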

CPU limits throttle, not kill. Memory over limit = OOMKill. CPU over limit = throttled (slowed). Latency spike + calm node + rising nr_throttled = limit too tight.

CPU throttling

Cause. CPU limit too low for bursty work; the quota runs out before each 100ms CFS period (the default) ends, and every thread in the cgroup waits for the next period.
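The arithmetic behind the stall, as a rough sketch. It assumes a simplified model: every busy thread is fully CPU-bound and runs in parallel on its own core. stall_ms is a hypothetical helper name.

```shell
# Sketch: how long a cgroup sits throttled per period once its quota is spent.
# args: <limit_millicores> <busy_threads> [period_ms, default 100]
stall_ms() {
  awk -v m="$1" -v n="$2" -v p="${3:-100}" 'BEGIN {
    quota = m * p / 1000                 # CPU-ms allowed per period
    burn  = quota / n                    # wall-clock ms until n busy threads spend it
    stall = p - burn                     # remainder of the period spent throttled
    if (stall < 0) stall = 0
    printf "quota=%.1fms burn=%.1fms stall=%.1fms\n", quota, burn, stall
  }'
}

stall_ms 500 4    # prints: quota=50.0ms burn=12.5ms stall=87.5ms
```

Under this model a 500m limit with 4 busy threads runs for 12.5ms and then waits 87.5ms, which is why average CPU usage can look low while latency spikes.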

Fix. Raise/remove the CPU limit (keep requests); set GOMAXPROCS / thread pools to the limit, not node cores; cut per-request CPU.
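One way to size GOMAXPROCS or a thread pool from the limit rather than the node. A minimal sketch, assuming cgroup v2, where cpu.max holds "<quota> <period>" in microseconds or "max <period>" when unlimited; limit_cores is a hypothetical helper, and rounding down (minimum 1) is a choice made here to stay under the quota:

```shell
# Sketch: whole-core count implied by a cgroup v2 cpu.max file.
# Prints nothing when the cgroup is unlimited or the file is missing.
limit_cores() {  # usage: limit_cores [file]
  awk '$1 != "max" && $2 > 0 {
    c = int($1 / $2)            # round down: a 1500m limit -> 1 core
    print (c < 1 ? 1 : c)       # never go below 1
  }' "${1:-/sys/fs/cgroup/cpu.max}" 2>/dev/null
}

# Example, in a container entrypoint before starting a Go binary:
#   cores=$(limit_cores) ; [ -n "$cores" ] && export GOMAXPROCS="$cores"
```

Rounding up instead gives more parallelism at the cost of some throttling; either way the point is to stop the runtime from sizing itself to the node's core count.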

Genuine saturation

top -H -p <pid> ; pidstat -t 1 -p <pid>   # hot thread
perf top -p <pid>                          # hot functions

Fix. Profile and optimize the hot path; scale out; cache; offload heavy work async.

CPU steal (noisy neighbour)

Steal ('st' in top, %steal in mpstat) = time your vCPU was runnable but the hypervisor ran something else. Fix: dedicated/larger instances.
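Where top or mpstat is not installed, the same figure can be read from /proc/stat. A minimal sketch, assuming the usual column order on the aggregate "cpu" line (user nice system idle iowait irq softirq steal guest guest_nice); steal_pct is a hypothetical helper name:

```shell
# Sketch: steal share between two samples of the aggregate "cpu" line in /proc/stat.
# Sums user..steal only, because guest time is already counted inside user.
steal_pct() {  # usage: steal_pct "<cpu line 1>" "<cpu line 2>"
  printf '%s\n%s\n' "$1" "$2" | awk '
    { total = 0
      for (i = 2; i <= 9; i++) total += $i   # fields 2..9 = user..steal
      t[NR] = total; s[NR] = $9 }            # field 9 = steal
    END { d = t[2] - t[1]
          if (d > 0) printf "%.1f\n", 100 * (s[2] - s[1]) / d; else print "0.0" }'
}

# Example:
#   a=$(grep '^cpu ' /proc/stat) ; sleep 5 ; b=$(grep '^cpu ' /proc/stat)
#   steal_pct "$a" "$b"
```

Sustained steal of more than a few percent on a latency-sensitive service is usually worth raising with the instance type, not the code.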

Quick reference

uptime ; vmstat 1 ; mpstat -P ALL 1
top -H -p <pid> ; perf top -p <pid>
cat /sys/fs/cgroup/cpu.stat ; kubectl top pod <pod> ; kubectl top node
© cvam — written in plaintext, served warm