Debugging Memory Leaks & OOM

Memory problems take two shapes: OOMKilled (the process hit a hard limit and the kernel killed it, exit code 137) and a slow leak (RSS creeps up until it hits that limit). Tell a real leak from normal cache growth, find what holds the memory, then cap it or fix it.

OOMKilled

dmesg | grep -i oom                                     # run on the node: kernel OOM-killer log
kubectl describe pod <pod> | grep -i -A2 "Last State"   # OOMKilled / 137
kubectl top pod <pod>                                   # current usage (needs metrics-server)

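Fix. Raise the limit if the workload legitimately needs the memory; otherwise treat it as a leak. Set the runtime's heap ceiling below the container limit (JVM -Xmx, Node --max-old-space-size) so the runtime garbage-collects before the kernel kills the process.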
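A minimal sketch of sizing the heap ceiling from the container limit. The helper name `heap_cap_mib` and the 25% headroom are assumptions for illustration, not a recommendation from this runbook; the right headroom depends on how much non-heap memory (threads, native buffers, metaspace) the process uses.

```python
def heap_cap_mib(limit_bytes: int, headroom: float = 0.25) -> int:
    """Return a heap ceiling in MiB, leaving `headroom` of the limit for non-heap memory."""
    if limit_bytes <= 0:
        raise ValueError("limit_bytes must be positive")
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    return int(limit_bytes * (1 - headroom)) // (1024 * 1024)


# Example: a container limited to 512 MiB (hypothetical value).
limit = 512 * 1024 * 1024
cap = heap_cap_mib(limit)

# Both flags take a size in MiB.
print(f"JVM:  -Xmx{cap}m")
print(f"Node: --max-old-space-size={cap}")
```

With the default headroom, a 512 MiB limit gives a 384 MiB heap ceiling.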

RSS vs cache. Container memory often includes page cache, not just heap. Check the runtime's own heap metrics to confirm a real leak.
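One way to see the split is to parse the cgroup's memory.stat. The sketch below assumes the cgroup v2 field names `anon` and `file`, falling back to the v1 names `rss` and `cache`. The function name and the sample numbers are hypothetical; on a real host the text would come from a file such as /sys/fs/cgroup/memory.stat, whose path varies by setup.

```python
def split_memory_stat(text: str) -> dict:
    """Split cgroup memory.stat text into anonymous (heap-like) and page-cache bytes."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        # Each line is "<name> <integer>"; skip anything else.
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    anon = stats.get("anon", stats.get("rss", 0))      # v2 name, then v1 name
    cache = stats.get("file", stats.get("cache", 0))   # v2 name, then v1 name
    return {"anon_bytes": anon, "cache_bytes": cache}


# Hypothetical sample in cgroup v2 format.
sample = """anon 104857600
file 419430400
kernel_stack 1048576
"""
result = split_memory_stat(sample)
print(result)
```

In this sample most of the usage is page cache, which the kernel can reclaim under pressure, so it would not by itself indicate a leak.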

Slow leak

Symptoms: RSS rises steadily, never plateaus, and a restart only "fixes" it temporarily. Graph RSS over time: normal cache growth eventually levels off, a leak does not. Then heap-profile:

# Go:    go tool pprof http://app/debug/pprof/heap
# Java:  jmap -histo:live <pid>  (or heap dump + MAT)
# Node:  --inspect + Chrome heap snapshot
# Python: tracemalloc
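For the Python case, a minimal sketch of snapshot comparison with tracemalloc. The `leaky` list and `handle_request` function are hypothetical stand-ins for an unbounded cache in a real service.

```python
import tracemalloc

leaky = []  # hypothetical stand-in for an unbounded cache


def handle_request(i: int) -> None:
    # Each call retains ~10 KB that is never released.
    leaky.append(bytes(10_000))


tracemalloc.start()
before = tracemalloc.take_snapshot()

for i in range(1000):
    handle_request(i)

after = tracemalloc.take_snapshot()

# Entries are sorted with the largest size difference first.
top = after.compare_to(before, "lineno")
for stat in top[:3]:
    print(stat)

tracemalloc.stop()
```

The first entry should point at the line that allocates the retained bytes. In a real service, take the two snapshots minutes or hours apart rather than around a loop.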

Fix. Find the growing allocation (unbounded cache/map, unclosed resources, listener accumulation). Bound caches (TTL/size); close handles.
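A minimal sketch of a cache bounded by both size and TTL, assuming a single-threaded caller. The class name `BoundedTTLCache` and its defaults are illustrative, not part of any library.

```python
import time
from collections import OrderedDict


class BoundedTTLCache:
    """LRU cache with a hard entry limit and a per-entry time-to-live."""

    def __init__(self, max_size=1024, ttl_seconds=300.0, clock=time.monotonic):
        self._data = OrderedDict()  # key -> (expires_at, value), oldest first
        self._max_size = max_size
        self._ttl = ttl_seconds
        self._clock = clock

    def get(self, key, default=None):
        item = self._data.get(key)
        if item is None:
            return default
        expires_at, value = item
        if self._clock() >= expires_at:
            del self._data[key]  # expired: drop on read
            return default
        self._data.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key, value):
        self._data[key] = (self._clock() + self._ttl, value)
        self._data.move_to_end(key)
        while len(self._data) > self._max_size:
            self._data.popitem(last=False)  # evict least recently used

    def __len__(self):
        return len(self._data)


cache = BoundedTTLCache(max_size=2, ttl_seconds=60)
cache.put("a", 1)
cache.put("b", 2)
cache.put("c", 3)  # evicts "a", the least recently used entry
print(len(cache), cache.get("a"), cache.get("c"))
```

Expired entries are only dropped when read, but the size limit still caps total memory. For plain function memoization, the standard library's functools.lru_cache(maxsize=...) gives a size bound without a TTL.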

Thread / goroutine / fd leaks

ls /proc/<pid>/fd | wc -l        # climbing = fd/conn leak
curl 'http://app/debug/pprof/goroutine?debug=1' | head   # Go; quote the URL so the shell does not glob the "?"

Fix. Ensure spawned workers exit; use contexts/timeouts; pool + close connections.
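The fd count above can also be checked from inside a process. This sketch is Linux-only because it reads /proc; the helper names `open_fd_count` and `read_first_byte` are hypothetical. It shows the close-on-exit pattern that keeps the count flat.

```python
import os


def open_fd_count(pid="self"):
    """Count open file descriptors for a process via /proc (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def read_first_byte(path):
    # 'with' closes the handle even if read() raises.
    with open(path, "rb") as fh:
        return fh.read(1)


before = open_fd_count()
for _ in range(100):
    read_first_byte(os.devnull)
after = open_fd_count()

print(f"fd delta after 100 opens: {after - before}")
```

If the handles are closed properly the delta should stay at zero. A delta that grows with the request count points at a missing close.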

Fragmentation

RSS far above the live-object size reported by the runtime points to allocator fragmentation. Try jemalloc or tcmalloc, or tune MALLOC_ARENA_MAX (glibc).
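A rough sketch of putting a number on "RSS far above live objects": read VmRSS from /proc/<pid>/status text and divide by the live-heap size the runtime reports. The function names and sample values are hypothetical, and what ratio counts as "too high" is left to judgment since it depends on the allocator and workload.

```python
def rss_bytes_from_status(status_text: str) -> int:
    """Extract VmRSS (reported in kB) from /proc/<pid>/status text, as bytes."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1]) * 1024
    raise ValueError("VmRSS not found")


def rss_to_live_ratio(rss_bytes: int, live_bytes: int) -> float:
    """Ratio of resident memory to live heap; values near 1 mean little overhead."""
    if live_bytes <= 0:
        raise ValueError("live_bytes must be positive")
    return rss_bytes / live_bytes


# Hypothetical sample: 2 GiB resident, 512 MiB live heap from runtime metrics.
sample_status = "Name:\tapp\nVmRSS:\t 2097152 kB\n"
rss = rss_bytes_from_status(sample_status)
ratio = rss_to_live_ratio(rss, 512 * 1024 * 1024)
print(f"RSS is {ratio:.1f}x the live heap")
```

Track the ratio over time rather than reading it once: a ratio that keeps climbing while the live heap stays flat is the pattern described above.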

Quick reference

dmesg | grep -i oom
kubectl describe pod <pod> | grep -A2 "Last State"
ls /proc/<pid>/fd | wc -l
# heap: pprof / jmap+MAT / heap snapshot / tracemalloc
