Memory problems take two shapes: OOMKilled (the container hit its hard limit and the kernel killed it, exit code 137) and a slow leak (RSS creeps up until the process hits that limit). Tell a real leak from normal cache growth, find what holds the memory, then cap or fix it.
OOMKilled
    dmesg | grep -i oom                                      # run on the node: kernel OOM-killer log
    kubectl describe pod <pod> | grep -i -A2 "Last State"    # OOMKilled / 137
    kubectl top pod <pod>                                    # current usage (needs metrics-server)
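Fix. Raise the limit if the usage is legitimate; otherwise treat it as a leak. Set the runtime's
maximum heap below the container limit (JVM -Xmx, Node --max-old-space-size) so the runtime
garbage-collects before the kernel kills the process. Leave headroom for non-heap memory
(thread stacks, native buffers, the runtime itself).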
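One way to keep the heap below the limit is to derive it from the cgroup limit at startup. The sketch below is a minimal illustration, assuming cgroup v2 (limit exposed at /sys/fs/cgroup/memory.max) and a 75% heap fraction. The fraction and the helper names read_memory_limit and heap_cap_mib are assumptions for illustration, not recommendations from any runtime's documentation.

```python
from pathlib import Path

MIB = 1024 * 1024


def read_memory_limit(path="/sys/fs/cgroup/memory.max"):
    """Return the cgroup v2 memory limit in bytes, or None if unlimited or unreadable."""
    try:
        raw = Path(path).read_text().strip()
    except OSError:
        return None  # not in a cgroup v2 container, or the file is not accessible
    return None if raw == "max" else int(raw)


def heap_cap_mib(limit_bytes, fraction=0.75):
    """Heap cap in MiB, leaving (1 - fraction) of the limit for non-heap memory."""
    return int(limit_bytes * fraction) // MIB


# Example with a 1 GiB limit (assumed value, not read from the host).
cap = heap_cap_mib(1024 * MIB)
print(f"JVM:  -Xmx{cap}m")
print(f"Node: --max-old-space-size={cap}")
```

The right fraction depends on how much non-heap memory the process uses; treat 0.75 as a starting point to adjust against observed RSS.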
RSS vs cache
Container memory figures often include page cache, not just heap. Page cache is reclaimable, so growth there is not a leak.
Check the runtime's own heap metrics to confirm a real leak before chasing one.
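To see how much of a container's memory is cache, the anon and file counters can be split out of memory.stat. This is a rough sketch, assuming cgroup v2, where memory.stat has "anon" and "file" lines in bytes. The sample text and the helper names split_memory_stat and cache_share are made up for illustration.

```python
def split_memory_stat(text):
    """Parse cgroup v2 memory.stat text into anonymous vs page-cache bytes."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return {"anon": stats.get("anon", 0), "file": stats.get("file", 0)}


def cache_share(text):
    """Fraction of anon + file memory that is page cache (0.0 if both are zero)."""
    split = split_memory_stat(text)
    total = split["anon"] + split["file"]
    return split["file"] / total if total else 0.0


# Illustrative sample; on a real host read /sys/fs/cgroup/memory.stat instead.
sample = """anon 419430400
file 629145600
kernel_stack 1048576
"""
print(split_memory_stat(sample))
print(f"cache share: {cache_share(sample):.0%}")
```

A high cache share with a flat anon figure points at normal cache growth; a steadily rising anon figure is the one worth profiling.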
Slow leak
Symptoms: RSS rises steadily, never plateaus, and a restart "fixes" it only until it climbs again. Graph RSS over a long window: normal warm-up levels off, a leak does not. Then take a heap profile:
    # Go:     go tool pprof http://app/debug/pprof/heap
    # Java:   jmap -histo:live <pid>   (or heap dump + MAT)
    # Node:   --inspect + Chrome heap snapshot
    # Python: tracemalloc
Fix. Find the growing allocation (unbounded cache/map, unclosed resources, listener accumulation). Bound caches (TTL/size); close handles.
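A bounded cache can be sketched in a few lines. The example below is a minimal illustration, not a production cache: it caps the entry count with LRU eviction and expires entries after a TTL. The class name BoundedTTLCache, its defaults, and the injectable clock are assumptions for the example.

```python
import time
from collections import OrderedDict


class BoundedTTLCache:
    """LRU cache capped at max_size entries; entries expire after ttl seconds."""

    def __init__(self, max_size=1024, ttl=300.0, clock=time.monotonic):
        self._max_size = max_size
        self._ttl = ttl
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._data = OrderedDict()  # key -> (expires_at, value), least recent first

    def get(self, key, default=None):
        item = self._data.get(key)
        if item is None:
            return default
        expires_at, value = item
        if self._clock() >= expires_at:
            del self._data[key]  # expired: drop it so it cannot pin memory
            return default
        self._data.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key, value):
        self._data[key] = (self._clock() + self._ttl, value)
        self._data.move_to_end(key)
        while len(self._data) > self._max_size:
            self._data.popitem(last=False)  # evict the least recently used entry

    def __len__(self):
        return len(self._data)


cache = BoundedTTLCache(max_size=2, ttl=60.0)
cache.put("a", 1)
cache.put("b", 2)
cache.put("c", 3)  # exceeds max_size, so "a" is evicted
print(len(cache), cache.get("a"), cache.get("c"))
```

Expired entries that are never read again stay in place until LRU eviction removes them, so it is the size cap, not the TTL, that bounds memory here.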
Thread / goroutine / fd leaks
    ls /proc/<pid>/fd | wc -l                                        # climbing = fd/conn leak
    curl -s 'http://app/debug/pprof/goroutine?debug=1' | head        # Go (URL quoted so the shell does not glob the ?)
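The same fd count can be sampled from inside a process and checked for a climbing trend. This is a Linux-only sketch that reads /proc, as the ls command above does. The helper names count_open_fds and is_climbing are hypothetical, and "every sample higher than the last" is a deliberately crude heuristic.

```python
import os


def count_open_fds(pid="self"):
    """Count open file descriptors via /proc/<pid>/fd; None if /proc is unavailable."""
    try:
        return len(os.listdir(f"/proc/{pid}/fd"))
    except OSError:
        return None  # non-Linux host, or no permission to inspect that pid


def is_climbing(samples, min_increase=1):
    """True if every sample exceeds the previous one by at least min_increase."""
    return len(samples) >= 2 and all(
        b - a >= min_increase for a, b in zip(samples, samples[1:])
    )


print("open fds now:", count_open_fds())
print("climbing:", is_climbing([120, 134, 151, 170]))  # made-up sample values
```

Sample at a fixed interval under steady load; a healthy process fluctuates around a level, while a leak keeps ratcheting upward.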
Fix. Ensure spawned workers exit; use contexts/timeouts; pool + close connections.
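"Ensure spawned workers exit" can be sketched with a stop flag and timeouts, which is roughly the thread equivalent of context cancellation in Go. This is an illustration under those assumptions. The names worker and run_pool, the pool size, and the timeouts are made up for the example.

```python
import queue
import threading


def worker(jobs, stop, results):
    """Process jobs until stop is set; the get() timeout prevents blocking forever."""
    while not stop.is_set():
        try:
            job = jobs.get(timeout=0.1)
        except queue.Empty:
            continue  # wake up periodically to re-check the stop flag
        results.append(job * 2)
        jobs.task_done()


def run_pool(items, num_workers=4, join_timeout=5.0):
    """Run items through a fixed pool; return (results, threads still alive)."""
    jobs, stop, results = queue.Queue(), threading.Event(), []
    threads = [
        threading.Thread(target=worker, args=(jobs, stop, results), daemon=True)
        for _ in range(num_workers)
    ]
    for t in threads:
        t.start()
    for item in items:
        jobs.put(item)
    jobs.join()  # wait until every queued job has been processed
    stop.set()   # tell workers to exit instead of abandoning them
    for t in threads:
        t.join(timeout=join_timeout)
    leaked = sum(t.is_alive() for t in threads)
    return results, leaked


results, leaked = run_pool(range(20))
print(len(results), "results,", leaked, "leaked threads")
```

The fixed pool size bounds concurrency, and the leaked count gives a cheap signal to log or alert on if a worker ever fails to stop.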
Fragmentation
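If RSS is far larger than the live heap the runtime reports, suspect allocator fragmentation rather than a leak.
Try an alternative allocator (jemalloc or tcmalloc), or tune MALLOC_ARENA_MAX to cap the number of glibc malloc arenas.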
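A rough way to quantify "RSS much larger than live objects", and to prepare the environment for a child process with a capped arena count, is sketched below. The helper names fragmentation_ratio and allocator_env and the default arena_max=2 are assumptions. Any LD_PRELOAD path for jemalloc or tcmalloc is site-specific and must be supplied by the caller.

```python
import os


def fragmentation_ratio(rss_bytes, live_bytes):
    """RSS divided by live heap bytes; far above 1 suggests fragmentation or non-heap use."""
    return rss_bytes / live_bytes if live_bytes else float("inf")


def allocator_env(arena_max=2, preload=None, base=None):
    """Copy of the environment with a glibc arena cap and an optional LD_PRELOAD."""
    env = dict(os.environ if base is None else base)
    env["MALLOC_ARENA_MAX"] = str(arena_max)  # read by glibc malloc at process start
    if preload is not None:
        env["LD_PRELOAD"] = preload  # caller-supplied path to an allocator library
    return env


# Made-up figures: 3 GiB RSS against 1 GiB of live heap.
print(fragmentation_ratio(3 * 1024**3, 1024**3))
print(allocator_env(arena_max=2, base={})["MALLOC_ARENA_MAX"])
```

The returned dictionary can be passed as env= to subprocess.run or subprocess.Popen; the variable only takes effect in processes started with it, not in the current one.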
Quick reference
    dmesg | grep -i oom
    kubectl describe pod <pod> | grep -A2 "Last State"
    ls /proc/<pid>/fd | wc -l
    # heap: pprof / jmap+MAT / heap snapshot / tracemalloc