When the cluster (not one pod) is sick, work top-down: events → nodes → control plane → networking → storage → RBAC. kubectl get events and kubectl describe answer most of it. For a single pod's status, see the Pod Failures guide.
Always start with events
kubectl get events -A --sort-by=.lastTimestamp | tail -40
kubectl get pods -A | grep -vE 'Running|Completed'   # what's not healthy
kubectl get nodes -o wide                            # node health
Events are the cluster's narration — scheduling failures, image pulls, evictions, probe failures all land here.
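The grep filter above passes over pods that report Running while some of their containers are not ready (for example READY 1/2). A minimal sketch of a stricter filter follows. It assumes the default `kubectl get pods -A` column layout (NAMESPACE NAME READY STATUS ...), and the function name `unhealthy_pods` is hypothetical.

```shell
# Reads `kubectl get pods -A` output on stdin and prints pods that are
# either not Running/Completed or Running with fewer ready containers
# than expected. Assumes a header row is present (no --no-headers).
unhealthy_pods() {
  awk '
    NR == 1 { next }                       # skip the header row
    $4 == "Completed" { next }             # finished Job pods are fine
    {
      split($3, ready, "/")                # READY column is "ready/total"
      if ($4 != "Running" || ready[1] != ready[2])
        print $1 "/" $2, $3, $4
    }'
}

# Usage (needs cluster access):
#   kubectl get pods -A | unhealthy_pods
#   kubectl get events -A --field-selector type=Warning --sort-by=.lastTimestamp
```

The second usage line narrows events to warnings, which is often enough to spot the failing component.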
Node NotReady
Symptom. A node shows NotReady; its pods get evicted/stuck.
kubectl describe node <node>                             # Conditions: MemoryPressure, DiskPressure, PIDPressure
kubectl get node <node> -o yaml | grep -A20 conditions
# on the node:
systemctl status kubelet ; journalctl -u kubelet -n 100
Causes & fix. kubelet down/crashed; node out of disk (DiskPressure → image GC, free space); out of memory; container runtime down; network plugin unhealthy; cloud instance unreachable. Restart kubelet/runtime, free resources, or cordon+drain and replace.
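To see at a glance which node condition is off, the conditions can be printed as Type=Status pairs and filtered. This is a sketch under the assumption that a healthy node has Ready=True and every other condition False; the function name `bad_conditions` is hypothetical.

```shell
# Reads "Type=Status" lines on stdin and prints the ones in a bad state:
# Ready must be True, every other condition (MemoryPressure, DiskPressure,
# PIDPressure, NetworkUnavailable, ...) must be False.
bad_conditions() {
  awk -F= 'NF && (($1 == "Ready" && $2 != "True") ||
                  ($1 != "Ready" && $2 != "False"))'
}

# Usage (needs cluster access):
#   kubectl get node <node> \
#     -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}' \
#     | bad_conditions
```

An empty result means the API server sees the node as healthy, so the problem is more likely in the workload than in the node.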
A node that runs low on disk reports DiskPressure and evicts pods. The cause is often unbounded container logs or image buildup; set log rotation and image GC thresholds.
Control plane health
kubectl get componentstatuses        # older clusters only; deprecated since v1.19
kubectl -n kube-system get pods      # apiserver, etcd, scheduler, controller-manager, coredns
kubectl cluster-info
# API slow/unreachable? check etcd health + apiserver logs
If the API server is slow, suspect etcd (disk latency, database size) or apiserver overload (too many watches or unpaginated list-all requests). As an ActiveControllerCount equivalent, exactly one scheduler leader and one controller-manager leader should be active at any time.
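The API server's verbose readiness endpoint lists each internal check on its own line, prefixed with [+] for passing and [-] for failing. A small sketch that keeps only the failures follows. The function name `failed_checks` is hypothetical, and the curl usage assumes the endpoint is reachable from where you run it and that unauthenticated health checks are allowed.

```shell
# Reads `/readyz?verbose` (or `/livez?verbose`) output on stdin and prints
# only the failing checks, with the "[-]" prefix stripped.
failed_checks() {
  sed -n 's/^\[-\]//p'
}

# Usage (needs network access to the API server):
#   curl -sk "https://<apiserver>:6443/readyz?verbose" | failed_checks
# Leader check, on clusters that use Lease-based leader election
# (the HOLDER column names the current leader):
#   kubectl -n kube-system get lease kube-scheduler kube-controller-manager
```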
Service / DNS / Ingress not reachable
kubectl get endpoints <svc>                              # empty = selector matches no ready pods
kubectl get svc <svc> -o wide
kubectl run -it --rm dns --image=tutum/dnsutils --restart=Never -- nslookup <svc>.<ns>
kubectl -n kube-system get pods -l k8s-app=kube-dns      # CoreDNS
Fix. Selector/label mismatch (empty endpoints); CoreDNS down; NetworkPolicy dropping traffic; Ingress controller misconfigured or no backend. Details in the Timeouts guide.
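Empty endpoints are the most common of these causes, so it can help to scan every namespace for them at once. The sketch below assumes the default `kubectl get endpoints -A` table layout (NAMESPACE NAME ENDPOINTS AGE), where a service with no ready backends shows `<none>`; the function name `empty_endpoints` is hypothetical.

```shell
# Reads `kubectl get endpoints -A` output on stdin and prints
# namespace/name for every entry with no backing addresses.
empty_endpoints() {
  awk 'NR > 1 && $3 == "<none>" { print $1 "/" $2 }'
}

# Usage (needs cluster access):
#   kubectl get endpoints -A | empty_endpoints
# Then compare the service selector with the pod labels:
#   kubectl get svc <svc> -o jsonpath='{.spec.selector}'
#   kubectl get pods --show-labels
```

Services defined without a selector also show up here, so treat a hit as a prompt to check the selector rather than as proof of a fault.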
PVC stuck Pending
kubectl describe pvc <pvc>     # Events: provisioning failure, no matching PV
kubectl get storageclass
Fix. No StorageClass / wrong name; CSI driver not installed; zone mismatch (volume in a different AZ than the pod's node); quota. A pod stays Pending while its PVC is unbound.
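A wrong StorageClass name is quick to rule out by checking the name the PVC asks for against the classes that exist. This is a sketch that assumes `kubectl get storageclass -o name` prints lines of the form `storageclass.storage.k8s.io/<name>`; the function name `has_storageclass` is hypothetical.

```shell
# has_storageclass NAME
# Reads `kubectl get storageclass -o name` output on stdin and succeeds
# only if a class with exactly that NAME is listed.
has_storageclass() {
  grep -qxF "storageclass.storage.k8s.io/$1"
}

# Usage (needs cluster access):
#   sc=$(kubectl get pvc <pvc> -o jsonpath='{.spec.storageClassName}')
#   kubectl get storageclass -o name | has_storageclass "$sc" \
#     || echo "StorageClass '$sc' not found"
```

If the PVC sets no storageClassName, `$sc` is empty and the cluster's default class applies, so the check above does not cover that case.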
RBAC "Forbidden"
kubectl auth can-i create deployments --as=system:serviceaccount:ns:sa -n ns
kubectl describe rolebinding,clusterrolebinding -n ns | grep -B9 <sa>   # -B shows the binding name and role above the matching subject
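Fix. Grant the missing verb/resource via a Role + RoleBinding (namespaced) or ClusterRole + ClusterRoleBinding (cluster-wide). Follow least privilege. Kubernetes RBAC is purely additive: there are no deny rules, so the only way to revoke access is to remove the grant.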
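For the namespaced case, the grant can be made with imperative kubectl commands and verified right away. This is a sketch only: the function name `grant_verb`, the role naming scheme, and the `KUBECTL` override variable are hypothetical conveniences, not kubectl features.

```shell
# grant_verb NS SA VERB RESOURCE
# Creates a Role and RoleBinding granting one verb on one resource to a
# service account, then re-checks access with `auth can-i`.
# Set KUBECTL=echo to preview the commands instead of running them.
grant_verb() (
  ns=$1 sa=$2 verb=$3 resource=$4
  k=${KUBECTL:-kubectl}
  name="$sa-$verb-$resource"      # hypothetical naming scheme
  $k create role "$name" --verb="$verb" --resource="$resource" -n "$ns" &&
  $k create rolebinding "$name" --role="$name" --serviceaccount="$ns:$sa" -n "$ns" &&
  $k auth can-i "$verb" "$resource" --as="system:serviceaccount:$ns:$sa" -n "$ns"
)

# Usage:
#   KUBECTL=echo grant_verb ns sa create deployments   # preview only
#   grant_verb ns sa create deployments                # apply (needs cluster access)
```

Granting one verb on one resource per Role keeps each grant easy to audit and to remove later.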
Evictions & resource pressure
Pods are Evicted or OOMKilled when nodes run hot. Check requests against node capacity and set requests/limits; see the Memory and High CPU guides. QoS class (Guaranteed > Burstable > BestEffort) decides eviction order: BestEffort pods are evicted first, Guaranteed pods last.
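The QoS class follows from how requests and limits are set. The sketch below applies the rule to a single-container pod only (real pods need the rule to hold for every container) and compares quantities as plain strings, so `500m` and `0.5` count as different; the function name `qos_class` is hypothetical.

```shell
# qos_class REQ_CPU REQ_MEM LIM_CPU LIM_MEM   (pass "" for unset values)
# Prints the QoS class a single-container pod would receive.
qos_class() (
  # A request that is omitted defaults to the matching limit.
  req_cpu=${1:-$3} req_mem=${2:-$4} lim_cpu=$3 lim_mem=$4
  if [ -z "$req_cpu$req_mem$lim_cpu$lim_mem" ]; then
    echo BestEffort                 # nothing set at all
  elif [ -n "$lim_cpu" ] && [ -n "$lim_mem" ] &&
       [ "$req_cpu" = "$lim_cpu" ] && [ "$req_mem" = "$lim_mem" ]; then
    echo Guaranteed                 # both limits set, requests equal limits
  else
    echo Burstable                  # anything in between
  fi
)

# Usage:
#   qos_class 250m 256Mi 500m 256Mi
# Read the class the cluster actually assigned (needs cluster access):
#   kubectl get pod <pod> -o jsonpath='{.status.qosClass}'
```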
Debug toolkit
kubectl get events -A --sort-by=.lastTimestamp
kubectl describe <kind> <name>
kubectl logs <pod> [-c c] [--previous]
kubectl exec -it <pod> -- sh
kubectl debug -it <pod> --image=busybox --target=<c>   # ephemeral container
kubectl debug node/<node> -it --image=ubuntu           # node shell
kubectl top nodes ; kubectl top pods -A
kubectl get --raw /healthz