
DEBUG GUIDE · KUBERNETES · SRE PLAYBOOK

Debugging Pod Failures — Decode Every Status.

The pod status column is a diagnosis, not a mystery. Each value points at a specific layer: scheduling, image pull, container start, the app itself, or probes. Workflow is always the same — kubectl get pod → read status → kubectl describe pod (Events) → kubectl logs (incl. --previous). This guide maps each status to its cause and fix.

The 30-second triage

kubectl get pod <pod> -o wide                 # STATUS + RESTARTS + node
kubectl describe pod <pod>                    # scroll to Events (the why)
kubectl logs <pod>                            # app output
kubectl logs <pod> --previous                 # crashed container's last words
kubectl get events --sort-by=.lastTimestamp -n <namespace>

Status tells you which section below to jump to.

Status                            Layer
Pending                           Scheduling: no node fits yet
ImagePullBackOff / ErrImagePull   Image pull
CreateContainerConfigError        Missing ConfigMap/Secret
CrashLoopBackOff                  App starts then dies, repeatedly
OOMKilled                         Hit memory limit
Running but 0/1 READY             Readiness probe failing
Init:…                            An init container is stuck/failing

Pending — won't schedule

Cause. The scheduler found no node that fits. The Events in kubectl describe spell it out: Insufficient cpu/memory, node(s) had taint …, didn't match node affinity, or unbound PersistentVolumeClaim.

Diagnose & fix.

kubectl describe pod <pod> | grep -A10 Events
kubectl get nodes -o wide
kubectl describe node <node> | grep -A5 Allocated   # requests already reserved on the node
  • Insufficient resources: lower the pod's requests, free capacity, or scale the cluster (autoscaler).
  • Taints: add a matching toleration, or schedule elsewhere.
  • Affinity/nodeSelector: no node has the required label — fix the rule or label a node.
  • Unbound PVC: no PV / StorageClass can satisfy the claim — see the PVC events.
Requests, not limits, drive scheduling. A pod requesting 4Gi stays Pending if no node has 4Gi unreserved (allocatable minus the requests of pods already placed there), even if it would only use 200Mi. Right-size requests; they reserve capacity.
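To see what a pod actually reserves, it helps to print each container's requests next to its limits. A minimal sketch, assuming kubectl is already pointed at the cluster; the function name show_resources is hypothetical, not part of kubectl.

```shell
# Hypothetical helper: print name, requests, and limits for each container.
# Usage: show_resources <pod> [namespace]   (namespace defaults to "default")
show_resources() {
  pod="$1"
  ns="${2:-default}"
  kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources.requests}{"\t"}{.resources.limits}{"\n"}{end}'
}
```

An empty requests column means the container reserves nothing, so the scheduler will place it anywhere regardless of what it later uses.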

ImagePullBackOff / ErrImagePull

Cause. Kubelet can't pull the image.

  • Wrong image name/tag (typo, tag doesn't exist).
  • Private registry, no/invalid imagePullSecret.
  • Registry unreachable / rate-limited (Docker Hub anonymous pull limits).
  • Wrong architecture (arm64 image on amd64 node).
kubectl describe pod <pod> | grep -A5 Events    # exact pull error
kubectl get secret <secret> -o yaml             # exists? right registry?
# test the pull manually on a node / locally:
docker pull <image>:<tag>

Fix. Correct the tag; attach a valid imagePullSecret to the pod or ServiceAccount; use a mirror/authenticated pull to dodge rate limits; match the node arch.
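Attaching a pull secret to a ServiceAccount can be sketched as below. The function name and its argument order are hypothetical; the two kubectl subcommands (create secret docker-registry, patch serviceaccount) are standard. Assumes the secret does not exist yet.

```shell
# Hypothetical helper: create a registry credential, then attach it to a ServiceAccount.
# Usage: attach_pull_secret <secret> <registry> <user> <password> [serviceaccount] [namespace]
attach_pull_secret() {
  secret="$1"; server="$2"; user="$3"; pass="$4"
  sa="${5:-default}"; ns="${6:-default}"
  # Step 1: store the registry credentials as a docker-registry Secret.
  kubectl create secret docker-registry "$secret" -n "$ns" \
    --docker-server="$server" \
    --docker-username="$user" \
    --docker-password="$pass" || return 1
  # Step 2: reference the Secret from the ServiceAccount the pods run as.
  kubectl patch serviceaccount "$sa" -n "$ns" \
    -p "{\"imagePullSecrets\": [{\"name\": \"$secret\"}]}"
}
```

Pods created after the patch pick the secret up; existing pods have to be recreated. Passing the password as an argument leaves it in shell history, so prefer reading it from an environment variable in real use.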

CreateContainerConfigError

Cause. The pod references a ConfigMap or Secret (env or volume) that doesn't exist or lacks the key.

kubectl describe pod <pod> | grep -A5 Events   # "configmap X not found"
kubectl get configmap,secret -n <namespace>

Fix. Create the missing ConfigMap/Secret (right name, right namespace, right key), or fix the reference in the pod spec.

CrashLoopBackOff

Cause. The container starts, exits, and Kubernetes restarts it on a backoff — repeatedly. The crash is your app's, not K8s'. The why is in the logs.

kubectl logs <pod> --previous          # the crash output, start here
kubectl describe pod <pod>             # exit code + reason
#   exit 1   = app error (read the log)
#   exit 137 = SIGKILL (often OOM — check OOMKilled)
#   exit 143 = SIGTERM (shutdown)
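The exit codes above follow the usual shell convention: anything over 128 is 128 plus the signal number. A small sketch that decodes them; the function name explain_exit is hypothetical.

```shell
# Hypothetical helper: turn a container exit code into a short explanation.
# Codes above 128 mean "terminated by signal (code - 128)".
explain_exit() {
  code="$1"
  if [ "$code" -eq 0 ]; then
    echo "clean exit"
  elif [ "$code" -gt 128 ]; then
    sig=$((code - 128))
    case "$sig" in
      9)  echo "killed by signal 9 (SIGKILL)" ;;
      15) echo "killed by signal 15 (SIGTERM)" ;;
      *)  echo "killed by signal $sig" ;;
    esac
  else
    echo "application error (exit $code)"
  fi
}

explain_exit 137   # killed by signal 9 (SIGKILL)
```

The code for the last crash is in the pod status at .status.containerStatuses[0].lastState.terminated.exitCode (index 0 assumes a single-container pod).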

Common causes & fix.

  • Bad config / missing env var / can't reach a dependency at boot → fix config; don't crash on a transient dep, retry.
  • Failing migration or panic on startup → fix the app; gate with an init container.
  • Liveness probe killing a slow starter → add a startupProbe (see below).
  • Wrong command/entrypoint → container exits immediately; verify the cmd.
--previous is the key. kubectl logs <pod> shows the current (just-started) container, which is often empty. --previous shows the one that just crashed; that's where the error is.

OOMKilled

Cause. The container exceeded its memory limit; the kernel killed it (exit 137). It is then restarted, often ending up in CrashLoopBackOff.

kubectl describe pod <pod> | grep -i -A2 "Last State"   # OOMKilled, exit 137
kubectl top pod <pod>                                   # current usage
# node-side: dmesg shows the OOM kill

Fix.

  • Raise the memory limit if the workload legitimately needs it.
  • Fix the leak / cap heap (JVM -Xmx below the limit; some runtimes, older JVMs in particular, ignore cgroup limits unless told).
  • Set requests = typical, limit = peak with headroom.
Runtimes don't always see limits. A JVM/Node process may size its heap from node RAM, not the pod limit (JVMs older than 8u191/10 have no container support), then get OOMKilled. Set -Xmx / --max-old-space-size below the container limit, leaving room for non-heap memory.
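Sizing the heap below the limit is simple arithmetic. A sketch, assuming the limit is known in MiB; the function name heap_flag and the 75% default are assumptions, not a recommendation from any runtime.

```shell
# Hypothetical helper: derive a JVM -Xmx flag from a container memory limit (MiB).
# The 75% default is an assumption; the rest is left for metaspace, threads, buffers.
# Usage: heap_flag <limit_mib> [percent]
heap_flag() {
  limit_mib="$1"
  percent="${2:-75}"
  echo "-Xmx$((limit_mib * percent / 100))m"
}

heap_flag 1024   # -Xmx768m
```

On JVMs with container support, -XX:MaxRAMPercentage expresses the same idea without hard-coding a number.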

Running but 0/1 READY

Cause. Container is up but the readiness probe fails, so it's kept out of Service endpoints (no traffic). Not restarted — readiness ≠ liveness.

kubectl describe pod <pod> | grep -A3 Readiness   # probe config + failures
kubectl get endpoints <service>                   # empty = no ready pods
kubectl exec <pod> -- curl -s localhost:<port>/healthz

Fix. Make sure the probe path/port match a real health endpoint; give a sufficient initialDelaySeconds or use a startupProbe; ensure the app actually binds and the dependency it health-checks is reachable.

Init / stuck containers

Cause. An init container hasn't completed (waiting on a dependency, failing repeatedly), so the main containers never start. Status shows Init:0/1 or Init:CrashLoopBackOff.

kubectl logs <pod> -c <init-container>
kubectl describe pod <pod>     # which init container, what it's waiting on

Fix. Debug the init container like any container (logs, exit code). Common: "wait-for-db" loops forever because the DB Service name/port is wrong or the DB isn't up.
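A wait loop that gives up is easier to debug than one that spins forever: the init container exits non-zero and the failure shows up in its logs and exit code. A minimal sketch of a bounded retry; the function name wait_for is hypothetical.

```shell
# Hypothetical helper: retry a check once per second, give up after N attempts.
# Usage: wait_for <attempts> <command> [args...]
wait_for() {
  attempts="$1"
  shift
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0                     # dependency is reachable
    fi
    echo "attempt $i/$attempts failed: $*" >&2
    i=$((i + 1))
    sleep 1
  done
  return 1                         # still failing: let the init container fail loudly
}
```

In an init container this might be called as wait_for 60 nc -z db 5432, where db and 5432 stand in for your DB Service name and port, and nc -z is assumed to be available in the image.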

Probe tuning — the recurring root cause

A large share of "weird pod" tickets are probe misconfig. Three probes, three jobs:

Probe       On fail              Use for
liveness    restart container    detect a wedged process
readiness   remove from Service  "can I serve traffic right now"
startup     hold off liveness    slow-booting apps
Don't point liveness at dependencies. If liveness checks the DB and the DB blips, K8s restarts your healthy app for no reason, turning a small blip into a crash loop. Liveness = "is the process stuck"; readiness = "should I get traffic"; startup = "still booting."
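For slow-booting apps, the time a startupProbe grants is roughly failureThreshold × periodSeconds. A sketch of that arithmetic; the function name startup_budget is hypothetical, and initialDelaySeconds is ignored here.

```shell
# Hypothetical helper: seconds a startupProbe allows before the container is restarted.
# Usage: startup_budget <failureThreshold> <periodSeconds>
startup_budget() {
  failure_threshold="$1"
  period_seconds="$2"
  echo $((failure_threshold * period_seconds))
}

startup_budget 30 10   # 300
```

Pick values so the budget comfortably exceeds the slowest observed boot; liveness only takes over once the startup probe has passed.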

Quick reference

kubectl get pod <pod> -o wide
kubectl describe pod <pod>                  # Events = the why
kubectl logs <pod> [-c container] [--previous] [-f]
kubectl get events --sort-by=.lastTimestamp -n <namespace>
kubectl top pod <pod>                       # needs metrics-server
kubectl debug -it <pod> --image=busybox --target=<container>   # ephemeral debug
kubectl get pod <pod> -o jsonpath='{.status.containerStatuses[*].state}'
© cvam — written in plaintext, served warm