Debugging Pod Failures

The pod status column is a diagnosis, not a mystery. Each value points at a specific layer: scheduling, image pull, container start, the app itself, or probes. Workflow is always the same — kubectl get pod → read status → kubectl describe pod (Events) → kubectl logs (incl. --previous). This guide maps each status to its cause and fix.

The 30-second triage

kubectl get pod <pod> -o wide                 # STATUS + RESTARTS + node
kubectl describe pod <pod>                     # scroll to Events (the why)
kubectl logs <pod>                             # app output
kubectl logs <pod> --previous                  # crashed container's last words
kubectl get events --sort-by=.lastTimestamp -n <ns>
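
The triage commands above can be bundled into one helper. This is a minimal sketch, assuming a POSIX shell: the function name triage_pod, the default namespace, and the --tail=50 cutoff are illustrative choices, while the kubectl subcommands and flags are standard ones.

```shell
# triage_pod <pod> [namespace]: run the triage commands in order.
# Hypothetical helper; only the kubectl subcommands and flags are real.
triage_pod() {
  pod="$1"
  ns="${2:-default}"

  kubectl get pod "$pod" -n "$ns" -o wide          # STATUS + RESTARTS + node
  kubectl describe pod "$pod" -n "$ns"             # Events at the bottom
  kubectl logs "$pod" -n "$ns" --tail=50           # current container
  # --previous errors out when the container has never restarted; ignore that.
  kubectl logs "$pod" -n "$ns" --previous --tail=50 || true
  # Only the events that involve this pod, sorted by last timestamp.
  kubectl get events -n "$ns" \
    --field-selector "involvedObject.name=$pod" \
    --sort-by=.lastTimestamp
}
```

Filtering events by involvedObject.name keeps the output focused on one pod in a busy namespace.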

Status tells you which section below to jump to.

Status	Layer
`Pending`	Scheduling — no node fits yet
`ImagePullBackOff` / `ErrImagePull`	Image pull
`CreateContainerConfigError`	Missing ConfigMap/Secret
`CrashLoopBackOff`	App starts then dies, repeatedly
`OOMKilled`	Hit memory limit
`Running` but `0/1 READY`	Readiness probe failing
`Init:…`	An init container is stuck/failing

Pending — won't schedule

Cause. Scheduler found no node that fits. describe Events spell it out: Insufficient cpu/memory, node(s) had taint …, didn't match node affinity, or unbound PersistentVolumeClaim.

Diagnose & fix.

kubectl describe pod <pod> | grep -A10 Events
kubectl get nodes -o wide
kubectl describe node <node> | grep -A5 Allocated   # requests already reserved on the node

Insufficient resources: lower the pod's requests, free capacity, or scale the cluster (autoscaler).
Taints: add a matching toleration, or schedule elsewhere.
Affinity/nodeSelector: no node has the required label — fix the rule or label a node.
Unbound PVC: no PV / StorageClass can satisfy the claim — see the PVC events.

Requests, not limits, drive scheduling. A pod requesting 4Gi stays Pending if no node has 4Gi free, even if it would only use 200Mi. Right-size requests; they reserve capacity.
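
The arithmetic behind that rule can be sketched as follows. This is a simplified model of the resource check for a single resource (memory, in Mi); the function name and the node figures are made up for illustration, and the real scheduler also weighs CPU, taints, affinity, and volumes.

```shell
# fits_on_node <allocatable_mi> <already_requested_mi> <pod_request_mi>
# Returns 0 if the pod's memory request fits in what is still unreserved.
# Actual usage never appears: only requests count toward the reservation.
fits_on_node() {
  allocatable="$1"
  requested="$2"
  pod_request="$3"
  unreserved=$((allocatable - requested))
  [ "$pod_request" -le "$unreserved" ]
}

# Hypothetical node: 8192Mi allocatable, 5120Mi already requested by other pods.
if fits_on_node 8192 5120 4096; then
  echo "4Gi request: schedulable"
else
  echo "4Gi request: Pending (only 3072Mi unreserved)"
fi
```

With these numbers the pod stays Pending; dropping its request to 512Mi would make the same check pass.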

ImagePullBackOff / ErrImagePull

Cause. Kubelet can't pull the image.

Wrong image name/tag (typo, tag doesn't exist).
Private registry, no/invalid imagePullSecret.
Registry unreachable / rate-limited (Docker Hub anonymous pull limits).
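Wrong architecture (arm64 image on amd64 node): the pull fails with "no matching manifest" when the tag has no build for the node's platform; a single-arch image may instead pull and then crash with "exec format error".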

kubectl describe pod <pod> | grep -A5 Events    # exact pull error
kubectl get secret <pullsecret> -o yaml          # exists? right registry?
# test the pull manually on a node / locally:
docker pull <image>:<tag>

Fix. Correct the tag; attach a valid imagePullSecret to the pod or ServiceAccount; use a mirror/authenticated pull to dodge rate limits; match the node arch.
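
One way to attach a pull secret, sketched with standard kubectl commands. The secret name regcred, the registry host, and the REG_USER / REG_PASS environment variables are placeholders, and patching the default ServiceAccount assumes the pod runs under it.

```shell
# attach_pull_secret <namespace> <registry-host>
# Expects REG_USER and REG_PASS in the environment (hypothetical variable names).
attach_pull_secret() {
  ns="$1"
  registry="$2"

  # Create a docker-registry type secret holding the registry credentials.
  kubectl create secret docker-registry regcred \
    --docker-server="$registry" \
    --docker-username="$REG_USER" \
    --docker-password="$REG_PASS" \
    -n "$ns" || return 1

  # Pods created under this ServiceAccount from now on pull with the secret.
  kubectl patch serviceaccount default -n "$ns" \
    -p '{"imagePullSecrets":[{"name":"regcred"}]}'
}
```

Pods that already exist keep their old spec, so delete them and let the controller recreate them to pick up the secret.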

CreateContainerConfigError

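Cause. The pod references a ConfigMap or Secret in env/envFrom that doesn't exist or lacks the key. (When the missing object is mounted as a volume, the pod usually sits in ContainerCreating with FailedMount events instead.)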

kubectl describe pod <pod> | grep -A5 Events   # "configmap X not found"
kubectl get configmap,secret -n <ns>

Fix. Create the missing ConfigMap/Secret (right name, right namespace, right key), or fix the reference in the pod spec.
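
To confirm that a specific key exists before touching the pod spec, a small check like this can help. It is a sketch: the function name configmap_has_key is hypothetical, and it relies on kubectl's go-template output to list the keys under .data.

```shell
# configmap_has_key <configmap> <key> [namespace]
# Returns 0 if the key exists, 1 if it is missing, 2 if the lookup itself failed.
configmap_has_key() {
  name="$1"
  key="$2"
  ns="${3:-default}"

  # Print one key per line from the ConfigMap's .data map.
  keys=$(kubectl get configmap "$name" -n "$ns" \
    -o go-template='{{range $k, $v := .data}}{{$k}}{{"\n"}}{{end}}') || return 2

  # -F fixed string, -x whole line, -q quiet: exact key match only.
  printf '%s\n' "$keys" | grep -Fxq -- "$key"
}
```

For example, `configmap_has_key app-config DB_HOST prod` (names are placeholders) succeeds only when that exact key is present. Swapping `configmap` for `secret` in the kubectl call gives the same check for Secrets.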

CrashLoopBackOff

Cause. The container starts, exits, and Kubernetes restarts it on a backoff — repeatedly. The crash is your app's, not K8s'. The why is in the logs.

kubectl logs <pod> --previous          # the crash output — start here
kubectl describe pod <pod>             # exit code + reason
#   exit 1   = app error (read the log)
#   exit 137 = SIGKILL (often OOM — check OOMKilled)
#   exit 143 = SIGTERM (shutdown)
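
The exit codes above follow a simple convention: anything over 128 is 128 plus the number of the signal that killed the process. A minimal decoder, assuming Linux signal numbers (the function name is made up for illustration):

```shell
# explain_exit_code <code>: decode a container exit code.
# Codes above 128 mean "killed by signal (code - 128)".
explain_exit_code() {
  code="$1"
  if [ "$code" -eq 0 ]; then
    echo "exit 0 = clean exit"
  elif [ "$code" -gt 128 ]; then
    sig=$((code - 128))
    case "$sig" in
      6)  name="SIGABRT" ;;
      9)  name="SIGKILL" ;;
      11) name="SIGSEGV" ;;
      15) name="SIGTERM" ;;
      *)  name="signal $sig" ;;
    esac
    echo "exit $code = 128 + $sig ($name)"
  else
    echo "exit $code = application error (read the log)"
  fi
}

explain_exit_code 137   # exit 137 = 128 + 9 (SIGKILL)
explain_exit_code 143   # exit 143 = 128 + 15 (SIGTERM)
```

The code itself can be read from the pod with `kubectl get pod <pod> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'`.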

Common causes & fix.

Bad config / missing env var / can't reach a dependency at boot → fix config; don't crash on a transient dep, retry.
Failing migration or panic on startup → fix the app; gate with an init container.
Liveness probe killing a slow starter → add a startupProbe (see below).
Wrong command/entrypoint → container exits immediately; verify the cmd.

--previous is the key. kubectl logs <pod> shows the current (just-started) container, often empty. --previous shows the one that just crashed, and that's where the error is.

OOMKilled

Cause. The container exceeded its memory limit; the kernel killed it (exit 137). Restarts, often into CrashLoopBackOff.

kubectl describe pod <pod> | grep -i -A2 "Last State"   # OOMKilled, exit 137
kubectl top pod <pod>                                    # current usage
# node-side: dmesg shows the OOM kill

Fix.

Raise the memory limit if the workload legitimately needs it.
Fix the leak / cap heap (JVM -Xmx below the limit; many runtimes ignore cgroup limits unless told).
Set requests = typical, limit = peak with headroom.

Runtimes don't always see limits. A JVM or Node process (especially an older version without container support) may size its heap from node RAM rather than the pod limit, then get OOMKilled. Set -Xmx / --max-old-space-size below the container limit.
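
A quick way to derive a heap flag from the limit, as a sketch. The 75% share is an assumed rule of thumb that leaves room for metaspace, thread stacks, and other off-heap memory; tune it for the workload. The function name is hypothetical.

```shell
# heap_flag_for_limit <limit_mi> [percent]
# Prints a JVM -Xmx flag sized to a share of the container memory limit.
# The 75% default is an assumption, not a JVM or Kubernetes rule.
heap_flag_for_limit() {
  limit_mi="$1"
  percent="${2:-75}"
  echo "-Xmx$((limit_mi * percent / 100))m"
}

heap_flag_for_limit 1024      # -Xmx768m
heap_flag_for_limit 2048 50   # -Xmx1024m
```

On container-aware JVMs, -XX:MaxRAMPercentage=75.0 expresses the same idea without hard-coding a number; for Node, pass the computed MiB value to --max-old-space-size.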

Running but 0/1 READY

Cause. Container is up but the readiness probe fails, so it's kept out of Service endpoints (no traffic). Not restarted — readiness ≠ liveness.

kubectl describe pod <pod> | grep -A3 Readiness   # probe config + failures
kubectl get endpoints <svc>                        # empty = no ready pods
kubectl exec <pod> -- curl -s localhost:<port>/healthz

Fix. Make sure the probe path/port match a real health endpoint; give a sufficient initialDelaySeconds or use a startupProbe; ensure the app actually binds and the dependency it health-checks is reachable.

Init / stuck containers

Cause. An init container hasn't completed (waiting on a dependency, failing repeatedly), so the main containers never start. Status shows Init:0/1 or Init:CrashLoopBackOff.

kubectl logs <pod> -c <init-container-name>
kubectl describe pod <pod>     # which init container, what it's waiting on

Fix. Debug the init container like any container (logs, exit code). Common: "wait-for-db" loops forever because the DB Service name/port is wrong or the DB isn't up.
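
A wait loop that gives up after a deadline turns a silent hang into a visible failure with a log line and a non-zero exit code. A minimal sketch, assuming a POSIX shell inside the init container; the function name wait_for is made up for illustration.

```shell
# wait_for <timeout_seconds> <command...>
# Retries the command once per second until it succeeds or the timeout passes.
wait_for() {
  timeout="$1"
  shift
  elapsed=0
  until "$@"; do
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "timed out after ${timeout}s waiting for: $*" >&2
      return 1
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
}
```

In an init container this might be called as `wait_for 120 nc -z db 5432`, where the Service name, port, and the presence of nc in the image are all assumptions. When it times out, the message shows up in kubectl logs <pod> -c <init-container-name>.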

Probe tuning — the recurring root cause

Half of "weird pod" tickets are probe misconfig. Three probes, three jobs:

Probe	On fail	Use for
liveness	restart container	detect a wedged process
readiness	remove from Service	"can I serve traffic right now"
startup	restart container (liveness and readiness wait until it passes)	slow-booting apps

Don't point liveness at dependencies. If liveness checks the DB and the DB blips, K8s restarts your healthy app for no reason, turning a small blip into a crash loop. Liveness = "is the process stuck"; readiness = "should I get traffic"; startup = "still booting."
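
The three probes can be combined as in the manifest below, printed by a small shell function so it can be piped straight into kubectl. Everything specific here is an assumption for illustration: the pod name, image, port 8080, the /ready path, and the timing values.

```shell
# probe_manifest: print an example Pod spec with all three probes.
# Names, image, paths, and timings are placeholders, not recommendations.
probe_manifest() {
  cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: probe-demo
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0
      ports:
        - containerPort: 8080
      # Startup: up to 30 x 5s = 150s to boot before liveness takes over.
      startupProbe:
        httpGet:
          path: /healthz
          port: 8080
        periodSeconds: 5
        failureThreshold: 30
      # Liveness: process-local check only, no dependency calls.
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        periodSeconds: 10
        failureThreshold: 3
      # Readiness: may include dependencies; failure only removes traffic.
      readinessProbe:
        httpGet:
          path: /ready
          port: 8080
        periodSeconds: 5
        failureThreshold: 3
EOF
}
```

To validate the manifest without creating anything, run `probe_manifest | kubectl apply --dry-run=client -f -`.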

Quick reference

kubectl get pod <pod> -o wide
kubectl describe pod <pod>                 # Events = the why
kubectl logs <pod> [-c container] [--previous] [-f]
kubectl get events --sort-by=.lastTimestamp -n <ns>
kubectl top pod <pod>                      # needs metrics-server
kubectl debug -it <pod> --image=busybox --target=<container>   # ephemeral debug
kubectl get pod <pod> -o jsonpath='{.status.containerStatuses[*].state}'
