DEBUG GUIDE · KUBERNETES · SRE PLAYBOOK

Debugging Kubernetes — Cluster-Level Playbook.

When the cluster (not one pod) is sick, work top-down: events → nodes → control plane → networking → storage → RBAC. kubectl get events and kubectl describe answer most of it. For a single pod's status, see the Pod Failures guide.

Always start with events

kubectl get events -A --sort-by=.lastTimestamp | tail -40
kubectl get pods -A | grep -vE 'Running|Completed'    # what's not healthy
kubectl get nodes -o wide                              # node health

Events are the cluster's narration — scheduling failures, image pulls, evictions, probe failures all land here.
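On a noisy cluster the raw event stream is long, so it can help to collapse it by reason first. A minimal sketch, assuming a working kubectl context; the function name warning_reasons is made up for illustration:

```shell
# Count Warning events per reason across all namespaces, most frequent first.
# warning_reasons is a hypothetical helper name, not a kubectl feature.
warning_reasons() {
  kubectl get events -A --field-selector type=Warning \
    -o custom-columns=REASON:.reason --no-headers |
    sort | uniq -c | sort -rn
}
```

Run warning_reasons and start with the top line: a cluster-level problem usually shows up as one reason (FailedScheduling, Evicted, FailedMount) repeated across many namespaces.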

Node NotReady

Symptom. A node shows NotReady; its pods get evicted/stuck.

kubectl describe node <node>     # Conditions: MemoryPressure, DiskPressure, PIDPressure
kubectl get node <node> -o yaml | grep -A20 conditions
# on the node:
systemctl status kubelet ; journalctl -u kubelet -n 100

Causes & fix. kubelet down/crashed; node out of disk (DiskPressure → image GC, free space); out of memory; container runtime down; network plugin unhealthy; cloud instance unreachable. Restart kubelet/runtime, free resources, or cordon+drain and replace.
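The check and the cordon+drain step above can be sketched as two small helpers. The names not_ready_nodes and retire_node are hypothetical; the drain flags shown are the usual ones, but adjust them to your workloads:

```shell
# Print nodes whose Ready condition is anything other than "True"
# (False = kubelet reports not ready, Unknown = kubelet stopped reporting).
not_ready_nodes() {
  kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' |
    awk -F'\t' '$2 != "True" { print $1 "\t" $2 }'
}

# Cordon a node, then drain it, before replacing it.
# usage: retire_node <node>
# Note: --delete-emptydir-data discards emptyDir contents of the evicted pods.
retire_node() {
  kubectl cordon "$1" &&
    kubectl drain "$1" --ignore-daemonsets --delete-emptydir-data
}
```

A status of Unknown rather than False usually points at the kubelet or the node's network, since the control plane has stopped hearing from it at all.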

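DiskPressure evicts pods. When a node runs low on disk, the kubelet sets the DiskPressure condition, the node is tainted node.kubernetes.io/disk-pressure, and the kubelet starts evicting pods. The cause is often unbounded container logs or image buildup; set log rotation and image GC thresholds.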
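Before changing log rotation or image GC thresholds, it is worth checking what the kubelet is actually running with. A rough sketch that reads the kubelet's live configuration through the API server's node proxy; it assumes your user may access nodes/proxy, and kubelet_gc_settings is a hypothetical helper name:

```shell
# Show the log-rotation and image-GC settings the kubelet on a node is using.
# usage: kubelet_gc_settings <node>
# Prints one "field":value pair per line; fields absent from the response are skipped.
kubelet_gc_settings() {
  kubectl get --raw "/api/v1/nodes/$1/proxy/configz" |
    grep -oE '"(containerLogMaxSize|containerLogMaxFiles|imageGCHighThresholdPercent|imageGCLowThresholdPercent)":[^,}]*'
}
```

The four fields are KubeletConfiguration settings: the first two bound per-container log size and file count, the last two set the disk-usage percentages at which image garbage collection starts and stops.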

Control plane health

kubectl get componentstatuses          # deprecated since v1.19; mainly useful on older clusters
kubectl -n kube-system get pods         # apiserver, etcd, scheduler, controller-manager, coredns
kubectl cluster-info
# API slow/unreachable? check etcd health + apiserver logs

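If the API server is slow, suspect etcd (disk latency, database size) or apiserver overload (too many watches or unbounded list-all requests). Leader election is the rough equivalent of an ActiveControllerCount check: exactly one kube-scheduler and one kube-controller-manager instance should hold the leader lease at any time.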
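The single-leader check can be sketched by reading the leader-election Lease objects. This assumes the default lease names (kube-scheduler, kube-controller-manager) in kube-system, which a customized or managed control plane may not expose; leaderless_components is a hypothetical helper name:

```shell
# Print control-plane components whose leader lease currently has no holder.
# kubectl prints <none> in a custom column when the field is unset.
leaderless_components() {
  kubectl -n kube-system get lease kube-scheduler kube-controller-manager \
    --no-headers \
    -o custom-columns=NAME:.metadata.name,HOLDER:.spec.holderIdentity,RENEWED:.spec.renewTime |
    awk '$2 == "<none>" { print $1 }'
}
```

Empty output means both leases have a holder. A holder with a stale renew time is also suspicious, so it is worth running the kubectl part without the awk filter and comparing the RENEWED column against the current time.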

Service / DNS / Ingress not reachable

kubectl get endpoints <svc>            # empty = selector matches no ready pods
kubectl get svc <svc> -o wide
kubectl run -it --rm dns --restart=Never --image=tutum/dnsutils -- nslookup <svc>.<ns>
kubectl -n kube-system get pods -l k8s-app=kube-dns   # CoreDNS

Causes & fix. Selector/label mismatch (empty endpoints): align the Service selector with the pod labels. CoreDNS down; NetworkPolicy dropping traffic; Ingress controller misconfigured or pointing at no backend. Details in the Timeouts guide.
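For the empty-endpoints case, one way to confirm a selector/label mismatch is to turn the Service's selector into a label query and see which pods it matches. A minimal sketch; svc_selector and svc_backends are hypothetical helper names:

```shell
# Print a Service's selector as a kubectl label query, e.g. app=web,tier=api.
# usage: svc_selector <svc> <ns>
svc_selector() {
  kubectl -n "$2" get svc "$1" \
    -o go-template='{{range $k, $v := .spec.selector}}{{$k}}={{$v}},{{end}}' |
    sed 's/,$//'
}

# List the pods that the Service's selector actually matches.
# usage: svc_backends <svc> <ns>
svc_backends() {
  sel=$(svc_selector "$1" "$2")
  [ -n "$sel" ] || { echo "service $1 has no selector" >&2; return 1; }
  kubectl -n "$2" get pods -l "$sel" -o wide
}
```

If svc_backends lists no pods, the labels do not match. If it lists pods but the endpoints are still empty, the pods exist but are not Ready, so look at their readiness probes instead.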

PVC stuck Pending

kubectl describe pvc <pvc>     # Events: provisioning failure, no matching PV
kubectl get storageclass

Causes & fix. No StorageClass or a wrong storageClassName; CSI driver not installed; zone mismatch (volume in a different AZ than the pod's node); storage quota exhausted. A pod stays Pending while its PVC is unbound.
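To see every unbound claim at once, along with the class it asked for, something like the following may help. It is a sketch with hypothetical helper names (pending_pvcs, storageclass_modes):

```shell
# List Pending PVCs as <namespace>/<name> with the StorageClass they request.
pending_pvcs() {
  kubectl get pvc -A --no-headers \
    -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,PHASE:.status.phase,SC:.spec.storageClassName |
    awk '$3 == "Pending" { print $1 "/" $2 " storageClass=" $4 }'
}

# Show each StorageClass with its provisioner and volume binding mode.
storageclass_modes() {
  kubectl get storageclass \
    -o custom-columns=NAME:.metadata.name,PROVISIONER:.provisioner,BINDING:.volumeBindingMode
}
```

A class with binding mode WaitForFirstConsumer keeps its PVCs Pending until a pod that uses them is scheduled, so a Pending claim there is not an error by itself. That mode also avoids the zone mismatch, because the volume is provisioned where the pod lands.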

RBAC "Forbidden"

kubectl auth can-i create deployments --as=system:serviceaccount:<ns>:<sa> -n <ns>
kubectl describe rolebinding,clusterrolebinding -n <ns> | grep -B9 <sa>   # subjects are listed last, so look at the lines before the match

Fix. Grant the missing verb/resource via a Role + RoleBinding (namespaced) or ClusterRole + ClusterRoleBinding (cluster-wide). Keep to least privilege. RBAC is purely additive: there are no deny rules, so Forbidden means that no binding grants the request.
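The namespaced grant can be sketched with kubectl's imperative create commands. The helpers below only print manifests (client-side dry run) so they can be reviewed before anything is applied; the names sa_can and grant_sa_preview, and the generated role name, are made up for illustration:

```shell
# Ask the API server whether a ServiceAccount may perform a verb on a resource.
# usage: sa_can <ns> <sa> <verb> <resource>
sa_can() {
  kubectl auth can-i "$3" "$4" --as="system:serviceaccount:$1:$2" -n "$1"
}

# Print (without applying) a Role and RoleBinding granting one verb on one resource.
# usage: grant_sa_preview <ns> <sa> <verb> <resource>
grant_sa_preview() {
  name="$2-$3-$4"   # hypothetical naming scheme, e.g. builder-create-deployments
  kubectl -n "$1" create role "$name" --verb="$3" --resource="$4" \
    --dry-run=client -o yaml
  echo '---'
  kubectl -n "$1" create rolebinding "$name" --role="$name" \
    --serviceaccount="$1:$2" --dry-run=client -o yaml
}
```

A possible flow: sa_can demo builder create deployments answers no, grant_sa_preview demo builder create deployments | kubectl apply -f - creates the grant, and a second sa_can should then answer yes.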

Evictions & resource pressure

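Pods are Evicted or OOMKilled when nodes run hot. Compare pod requests with node allocatable capacity and set requests/limits; see the Memory and High CPU guides. QoS class (Guaranteed > Burstable > BestEffort, most to least protected) drives eviction order under node pressure: BestEffort pods go first, then Burstable pods using more than they requested, and Guaranteed pods last.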
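To see which pods are most exposed, and which have already been evicted, a sketch along these lines can help. The helper names pods_by_qos and evicted_pods are hypothetical:

```shell
# List pods in a given QoS class as <namespace>/<name>.
# usage: pods_by_qos BestEffort
pods_by_qos() {
  kubectl get pods -A --no-headers \
    -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,QOS:.status.qosClass |
    awk -v q="$1" '$3 == q { print $1 "/" $2 }'
}

# List Failed pods whose status reason is Evicted.
evicted_pods() {
  kubectl get pods -A --no-headers --field-selector status.phase=Failed \
    -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,REASON:.status.reason |
    awk '$3 == "Evicted" { print $1 "/" $2 }'
}
```

pods_by_qos BestEffort lists the pods that set no requests or limits at all, which are the first candidates for eviction when a node comes under pressure.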

Debug toolkit

kubectl get events -A --sort-by=.lastTimestamp
kubectl describe <kind> <name>
kubectl logs <pod> [-c c] [--previous]
kubectl exec -it <pod> -- sh
kubectl debug -it <pod> --image=busybox --target=<c>   # ephemeral container
kubectl debug node/<node> -it --image=ubuntu              # node shell
kubectl top nodes ; kubectl top pods -A
kubectl get --raw '/readyz?verbose'                     # per-check API server health; /healthz is deprecated since v1.16
© cvam — written in plaintext, served warm