The one-liner: Kubernetes is a declarative control loop. You declare desired
state (YAML); controllers continuously reconcile actual → desired. You never "start a
container": you tell K8s what you want, it makes reality match, then keeps it matched.
1. Architecture
Control plane (the brain):
| Component | Job |
| kube-apiserver | Front door. All reads/writes go through it (REST). The only thing that talks to etcd. AuthN/AuthZ/admission. |
| etcd | Consistent key-value store = the single source of truth for all cluster state. |
| kube-scheduler | Assigns Pods to Nodes (filtering + scoring) based on resource requests, affinity, taints, and topology spread. |
| controller-manager | Runs reconcile loops (Deployment, ReplicaSet, Node, Job, endpoints…). |
| cloud-controller-manager | Integrates with the cloud provider (load balancers, routes, node lifecycle). |
Worker node (the muscle):
| Component | Job |
| kubelet | Node agent. Watches the API for Pods assigned to it; ensures their containers run & stay healthy; runs probes. |
| kube-proxy | Programs iptables/IPVS so Service VIPs load-balance to Pod IPs. |
| container runtime | Runs containers via the CRI (containerd, CRI-O). |
Flow of a kubectl apply: CLI → apiserver (validate, admission) → etcd
→ controller creates Pods → scheduler binds them to nodes → kubelet on that node
pulls images and starts containers → kube-proxy wires Service routing.
2. Core objects
| Object | What it does |
| Pod | Smallest deployable unit. One+ containers sharing network (same IP) + storage. Usually not created directly. |
| ReplicaSet | Keeps N identical Pods running. Managed by Deployment. |
| Deployment | Declarative updates for Pods/ReplicaSets — rolling updates + rollback. The stateless workhorse. |
| StatefulSet | Stable identity (web-0, web-1) + stable per-Pod storage + ordered ops. Databases. |
| DaemonSet | One Pod per node (log/metrics/CNI agents). |
| Job / CronJob | Run-to-completion / scheduled tasks. |
| Service | Stable endpoint + load balancing across a set of Pods. |
| Ingress | L7 HTTP(S) routing (host/path → Service). |
| ConfigMap / Secret | Inject config / sensitive data. |
| Namespace | Virtual cluster for isolation, quotas, RBAC scoping. |
| PV / PVC | Cluster storage / a Pod's claim on it. |
| ServiceAccount | Identity for Pods talking to the API. |
3. kubectl essentials
kubectl get pods -A -o wide # all namespaces, with node/IP
kubectl get all -n app # pods/svc/deploy/rs in a ns
kubectl describe pod <pod> # events + state (debug gold)
kubectl logs -f <pod> [-c container] [--previous]
kubectl exec -it <pod> -- sh
kubectl apply -f manifest.yaml # declarative create/update
kubectl diff -f manifest.yaml # preview the change
kubectl delete -f manifest.yaml
kubectl rollout status/history/undo deploy/web
kubectl scale deploy/web --replicas=5
kubectl set image deploy/web web=nginx:1.28
kubectl port-forward svc/web 8080:80 # local access
kubectl get events --sort-by=.lastTimestamp
kubectl top pod / node # needs metrics-server
kubectl explain pod.spec.containers # field docs
kubectl debug -it <pod> --image=busybox --target=<container> # ephemeral debug container
kubectl config get-contexts / use-context
kubectl label/annotate <resource> <name> <key>=<value> ; kubectl cordon/drain/uncordon <node>
kubectl get pod <pod> -o jsonpath='{.status.podIP}'
4. Pod spec anatomy
apiVersion: v1
kind: Pod
metadata:
  name: web
  labels: { app: web }
spec:
  serviceAccountName: web-sa
  initContainers:
    - name: wait-db
      image: busybox
      command: ["sh","-c","until nc -z db 5432; do sleep 1; done"]
  containers:
    - name: web
      image: nginx:1.27
      ports: [{ containerPort: 80 }]
      env:
        - name: LOG_LEVEL
          value: info
        - name: DB_PASS
          valueFrom: { secretKeyRef: { name: db, key: pass } }
      resources:
        requests: { cpu: "100m", memory: "128Mi" }   # scheduler reserves
        limits: { cpu: "500m", memory: "256Mi" }     # hard cap (OOMKill over mem)
      readinessProbe:
        httpGet: { path: /healthz, port: 80 }
        initialDelaySeconds: 5
      livenessProbe:
        httpGet: { path: /healthz, port: 80 }
      securityContext:
        runAsNonRoot: true              # image must run as a non-root user (stock nginx runs as root)
        readOnlyRootFilesystem: true
        allowPrivilegeEscalation: false
      volumeMounts: [{ name: cache, mountPath: /tmp }]
  volumes: [{ name: cache, emptyDir: {} }]
requests vs limits
requests = what the scheduler reserves (drives placement + QoS).
limits = hard ceiling. Over memory limit → OOMKilled; over CPU
→ throttled (not killed). No requests → poor scheduling + first to be evicted.
5. Choosing a workload
| Need | Use |
| Stateless app, N replicas, rolling updates | Deployment |
| Stable name + storage (DB, Kafka, Zookeeper) | StatefulSet |
| One Pod on every node (agent, CNI, logging) | DaemonSet |
| Run once to completion (migration, batch) | Job |
| Scheduled recurring task | CronJob |
apiVersion: apps/v1
kind: Deployment
metadata: { name: web }
spec:
  replicas: 3
  selector: { matchLabels: { app: web } }
  strategy:
    type: RollingUpdate
    rollingUpdate: { maxSurge: 1, maxUnavailable: 0 }   # zero-downtime
  template:
    metadata: { labels: { app: web } }
    spec:
      containers:
        - { name: web, image: nginx:1.27, ports: [{ containerPort: 80 }] }
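For contrast with the Deployment, here is a minimal StatefulSet sketch showing the pieces the table attributes to it: stable Pod names, per-Pod storage, and a headless Service for discovery. The names (db, db-headless), the postgres:16 image, and the 10Gi size are illustrative assumptions, not recommendations.

```yaml
# Headless Service: gives each Pod a stable DNS name
# (db-0.db-headless.<ns>.svc.cluster.local, db-1..., ...)
apiVersion: v1
kind: Service
metadata: { name: db-headless }
spec:
  clusterIP: None
  selector: { app: db }
  ports: [{ port: 5432 }]
---
apiVersion: apps/v1
kind: StatefulSet
metadata: { name: db }
spec:
  serviceName: db-headless          # references the headless Service above
  replicas: 3                       # Pods are created in order: db-0, db-1, db-2
  selector: { matchLabels: { app: db } }
  template:
    metadata: { labels: { app: db } }
    spec:
      containers:
        - name: db
          image: postgres:16        # hypothetical image; a real database also needs credentials and tuning
          ports: [{ containerPort: 5432 }]
          volumeMounts: [{ name: data, mountPath: /var/lib/postgresql/data }]
  volumeClaimTemplates:             # one PVC per Pod: data-db-0, data-db-1, ...
    - metadata: { name: data }
      spec:
        accessModes: ["ReadWriteOnce"]
        resources: { requests: { storage: 10Gi } }
```

Deleting or rescheduling db-1 brings back a Pod with the same name bound to the same PVC, which is the property a Deployment does not give you.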
6. Services & networking
| Type | Exposure |
| ClusterIP | Default. Internal-only virtual IP. Pod-to-pod. |
| NodePort | Opens a port on every node (30000–32767). Basic external. |
| LoadBalancer | Provisions a cloud LB → Service. Standard external entry. |
| ExternalName | CNAME to an external DNS name. |
| Headless (clusterIP: None) | No VIP — DNS returns Pod IPs directly. StatefulSet discovery. |
apiVersion: v1
kind: Service
metadata: { name: web }
spec:
  selector: { app: web }   # matches Pod labels
  ports: [{ port: 80, targetPort: 80 }]
  type: ClusterIP
- DNS: <svc>.<ns>.svc.cluster.local. Service → Pods via label selector → an Endpoints/EndpointSlice list.
- CNI (Calico, Cilium, Flannel) gives every Pod a routable IP — flat network.
- Ingress = L7 router (needs an ingress controller); Gateway API is its successor.
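- NetworkPolicy = firewall for Pods (default-allow until a policy selects a Pod; from then on only traffic that some policy allows gets through, in the directions that policy covers). Needs a CNI that enforces it.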
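The "default-allow until selected" behavior of NetworkPolicy is easiest to see with a pair of policies. This is a sketch only: the namespace app and the role=frontend label are assumed names, and enforcement depends on the CNI in use.

```yaml
# 1) Select every Pod in the namespace and allow no ingress: default-deny
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata: { name: default-deny-ingress, namespace: app }
spec:
  podSelector: {}              # empty selector = all Pods in this namespace
  policyTypes: ["Ingress"]     # no ingress rules listed, so nothing is allowed in
---
# 2) Re-open one path: Pods labeled role=frontend may reach app=web on TCP 80
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata: { name: allow-frontend-to-web, namespace: app }
spec:
  podSelector: { matchLabels: { app: web } }
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - podSelector: { matchLabels: { role: frontend } }
      ports:
        - { protocol: TCP, port: 80 }
```

Policies are additive: a connection is allowed if any policy selecting the target Pod allows it. Egress stays open here because neither policy lists Egress in policyTypes.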
empty endpoints
A Service whose selector matches no ready Pod has empty Endpoints → "connection refused"
with no obvious error. Check kubectl get endpoints <svc> first.
7. Config & Secrets
kubectl create configmap appcfg --from-literal=LOG_LEVEL=info --from-file=app.conf
kubectl create secret generic dbcreds --from-literal=password=s3cr3t
# container level
envFrom: [{ configMapRef: { name: appcfg } }]
env:
  - name: DB_PASSWORD
    valueFrom: { secretKeyRef: { name: dbcreds, key: password } }
# pod level
volumes:
  - name: cfg
    configMap: { name: appcfg }   # or mount as files
Secrets aren't encrypted by default
They're only base64-encoded in etcd. Enable encryption-at-rest + tight RBAC; consider an
external secrets store (Vault, External Secrets Operator).
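On a self-managed control plane, encryption at rest is switched on with an EncryptionConfiguration file handed to kube-apiserver through its --encryption-provider-config flag. A minimal sketch follows; the key name and the key placeholder are assumptions, and managed offerings usually expose this as a provider setting (often KMS-backed) instead.

```yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources: ["secrets"]
    providers:
      - aescbc:                    # first provider in the list is used for new writes
          keys:
            - name: key1
              secret: <base64-encoded 32-byte key>   # placeholder, generate your own
      - identity: {}               # fallback so Secrets written before encryption stay readable
```

Secrets already stored are only encrypted once they are rewritten, so existing ones need to be re-saved after the API server restarts with this config.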
8. Storage
| Object | Role |
| PV | A piece of cluster storage (admin/provisioner side). |
| PVC | A Pod's request for storage; binds to a PV. |
| StorageClass | Template for dynamic provisioning (gp3, ssd, …). |
Access modes: RWO (one node RW), ROX (many nodes RO),
RWX (many nodes RW). Most block storage is RWO; shared filesystems do RWX.
Reclaim policy: Delete vs Retain when the PVC goes away.
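A short sketch of the claim-then-mount flow: the PVC asks for storage, dynamic provisioning creates a matching PV, and the Pod references the claim by name. The StorageClass name gp3 is taken from the table as an example and is assumed to exist in the cluster; the Pod and claim names are hypothetical.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata: { name: data }
spec:
  accessModes: ["ReadWriteOnce"]   # RWO: mounted read-write by a single node
  storageClassName: gp3            # assumption: a StorageClass with this name exists
  resources: { requests: { storage: 5Gi } }
---
apiVersion: v1
kind: Pod
metadata: { name: app }
spec:
  containers:
    - name: app
      image: busybox
      command: ["sh", "-c", "echo hello > /data/hello && sleep 3600"]
      volumeMounts: [{ name: data, mountPath: /data }]
  volumes:
    - name: data
      persistentVolumeClaim: { claimName: data }   # binds the Pod to the PVC above
```

If the claim stays Pending, kubectl describe pvc data usually shows why (no such StorageClass, no capacity, or a class that waits for the first consumer Pod).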
9. Scheduling controls
| Mechanism | Effect |
| nodeSelector | Simple "only nodes with this label". |
| Affinity / anti-affinity | Rich rules — co-locate or spread (e.g. replicas across zones). |
| Taints & tolerations | Taint a node to repel; only Pods with a matching toleration land there (GPU/spot nodes). |
| Topology spread | Even distribution across zones/nodes. |
| PriorityClass | Higher-priority Pods can preempt lower ones. |
| requests | Drive bin-packing: a Pod fits only on a node whose unreserved allocatable capacity covers its requests. |
Mental model: taint = node repels, toleration = pod allowed,
affinity = pod attracted.
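The mechanisms in the table combine in a single Pod spec. The sketch below assumes a node label accelerator=gpu and a taint gpu=true:NoSchedule; both are made-up names for illustration.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: trainer
  labels: { app: trainer }
spec:
  nodeSelector: { accelerator: gpu }          # hypothetical node label: attracts the Pod to GPU nodes
  tolerations:                                # permits (does not force) landing on tainted nodes
    - { key: "gpu", operator: "Equal", value: "true", effect: "NoSchedule" }
  topologySpreadConstraints:                  # keep app=trainer Pods balanced across zones
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector: { matchLabels: { app: trainer } }
  affinity:
    podAntiAffinity:                          # soft rule: avoid sharing a node with another trainer
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            topologyKey: kubernetes.io/hostname
            labelSelector: { matchLabels: { app: trainer } }
  containers:
    - { name: trainer, image: busybox, command: ["sleep", "3600"] }
```

The matching taint would be applied with kubectl taint nodes <node> gpu=true:NoSchedule. Note that the toleration alone does not steer the Pod to GPU nodes; the nodeSelector does that.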
10. Health probes
| Probe | On fail | For |
| liveness | restart the container | detect a wedged process |
| readiness | remove from Service endpoints (no restart) | "can I serve traffic now" |
| startup | hold off liveness until booted | slow-starting apps |
probe trap
Don't point liveness at a dependency (DB). If the DB blips, liveness fails and K8s restarts your
healthy app pointlessly. Liveness = "is the process stuck"; readiness = "should I get traffic".
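A container-level sketch of the three probes working together. The /healthz and /ready paths are assumed application endpoints: /healthz should only check the process itself, while /ready may also check dependencies such as the DB.

```yaml
containers:
  - name: web
    image: nginx:1.27
    startupProbe:                  # liveness and readiness are held off until this succeeds
      httpGet: { path: /healthz, port: 80 }
      periodSeconds: 5
      failureThreshold: 30         # allows up to 5s x 30 = 150s to boot
    livenessProbe:                 # local check only: "is this process stuck?"
      httpGet: { path: /healthz, port: 80 }
      periodSeconds: 10
      failureThreshold: 3
    readinessProbe:                # "should I get traffic right now?"
      httpGet: { path: /ready, port: 80 }
      periodSeconds: 5
```

With this split, a DB outage takes the Pod out of the Service endpoints through readiness but does not trigger restarts through liveness.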
11. Rollouts & QoS
kubectl set image deploy/web web=nginx:1.28
kubectl rollout status deploy/web
kubectl rollout history deploy/web
kubectl rollout undo deploy/web --to-revision=2
RollingUpdate scales a new ReplicaSet up while the old scales down, bounded by
maxSurge/maxUnavailable. QoS classes (eviction order, last evicted first):
Guaranteed (CPU and memory requests = limits on every container) > Burstable >
BestEffort (no requests or limits, evicted first).
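The QoS class is derived from the resources stanza alone, so it can be read straight off a manifest. The three container-level fragments below are illustrative values, one per class.

```yaml
# Guaranteed: every container sets cpu and memory, with requests == limits
resources:
  requests: { cpu: "500m", memory: "256Mi" }
  limits:   { cpu: "500m", memory: "256Mi" }
---
# Burstable: some request or limit is set, but the Guaranteed rule is not met
resources:
  requests: { cpu: "100m", memory: "128Mi" }
  limits:   { cpu: "500m", memory: "256Mi" }
---
# BestEffort: no requests or limits on any container in the Pod
resources: {}
```

The class Kubernetes assigned to a running Pod can be checked with kubectl get pod <pod> -o jsonpath='{.status.qosClass}'.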
12. RBAC
| Object | Meaning |
| Role / ClusterRole | A set of permissions (verbs on resources). Role = namespaced; ClusterRole = cluster-wide. |
| RoleBinding / ClusterRoleBinding | Grants a Role to a user/group/ServiceAccount. |
| ServiceAccount | Identity for Pods to call the API. |
kubectl auth can-i create deploy --as=system:serviceaccount:ns:sa -n ns
Formula: Subject + Role + Binding = access. Least privilege; avoid cluster-admin.
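The Subject + Role + Binding formula maps onto three small objects. This sketch grants read-only access to ConfigMaps in one namespace; the names (app, web-sa, configmap-reader) are assumptions for illustration.

```yaml
# Subject: the identity the Pod runs as
apiVersion: v1
kind: ServiceAccount
metadata: { name: web-sa, namespace: app }
---
# Role: verbs on resources, scoped to the namespace
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata: { name: configmap-reader, namespace: app }
rules:
  - apiGroups: [""]                  # "" = core API group
    resources: ["configmaps"]
    verbs: ["get", "list", "watch"]
---
# Binding: attaches the Role to the subject
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata: { name: web-reads-configmaps, namespace: app }
subjects:
  - kind: ServiceAccount
    name: web-sa
    namespace: app
roleRef:
  kind: Role
  name: configmap-reader
  apiGroup: rbac.authorization.k8s.io
```

After applying, kubectl auth can-i list configmaps --as=system:serviceaccount:app:web-sa -n app should answer yes, while the same check for delete should answer no.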
13. Autoscaling
kubectl autoscale deploy/web --min=2 --max=10 --cpu-percent=70
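- HPA: scales replicas on CPU/mem/custom metrics (resource metrics need metrics-server). Utilization targets are a percentage of requests, so no requests = no utilization-based HPA.
- VPA: adjusts a Pod's requests/limits.
- Cluster Autoscaler: adds nodes when Pods can't schedule and removes underused ones.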
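The kubectl autoscale one-liner corresponds roughly to the following autoscaling/v2 manifest, which is easier to keep in version control. It assumes the web Deployment exists and that its containers set CPU requests.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: { name: web }
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70     # percent of the Pods' CPU requests
# Scaling rule used by the HPA controller:
#   desired = ceil(current_replicas * current_metric / target_metric)
# Example: 4 replicas at 105% average utilization, target 70%
#   -> ceil(4 * 105 / 70) = 6 replicas
```

The result is then clamped to the minReplicas/maxReplicas range.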
14. Security context & namespaces
- securityContext: runAsNonRoot, readOnlyRootFilesystem, drop capabilities, allowPrivilegeEscalation: false, seccompProfile.
- Pod Security Admission (privileged / baseline / restricted) replaces PodSecurityPolicy.
- ResourceQuota + LimitRange per namespace cap usage and set defaults.
- Use dedicated ServiceAccounts per workload, least-privilege RBAC, NetworkPolicies.
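Namespace-level guardrails from the list above can be declared together. In this sketch the namespace name app and all numeric values are assumptions chosen for illustration.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: app
  labels:
    pod-security.kubernetes.io/enforce: restricted   # Pod Security Admission rejects non-compliant Pods
    pod-security.kubernetes.io/warn: restricted      # and warns at apply time
---
apiVersion: v1
kind: ResourceQuota
metadata: { name: app-quota, namespace: app }
spec:
  hard:                              # caps on the namespace total
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
    pods: "20"
---
apiVersion: v1
kind: LimitRange
metadata: { name: app-defaults, namespace: app }
spec:
  limits:
    - type: Container
      defaultRequest: { cpu: 100m, memory: 128Mi }   # used when a container sets no requests
      default: { cpu: 500m, memory: 256Mi }          # used when a container sets no limits
```

The LimitRange matters once the quota is in place: with a CPU/memory quota active, Pods that specify no requests or limits are rejected unless defaults fill them in.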
15. Debugging playbook
kubectl get pods # STATUS column tells the story
kubectl describe pod <pod> # Events = why it's stuck
kubectl logs <pod> --previous # crashed container's last words
kubectl get events --sort-by=.lastTimestamp
kubectl get endpoints <svc> # is the Service wired to Pods?
| Status | Meaning → fix |
| CrashLoopBackOff | Starts then crashes — logs --previous; bad config/cmd, failing liveness, missing dep. |
| ImagePullBackOff | Can't pull — wrong name/tag, private registry without imagePullSecret, rate limit. |
| Pending | No node fits (requests exceed free node capacity, untolerated taint, unbound PVC). |
| OOMKilled | Hit memory limit — raise limit / fix leak. |
| CreateContainerConfigError | Missing ConfigMap/Secret. |
| 0/1 Ready but Running | Readiness probe failing — app up, not serving. |
16. Rapid-fire interview Q&A
- What is a Pod? Smallest deployable unit: one+ containers sharing a network namespace (same IP) and storage, always co-scheduled.
- Deployment vs StatefulSet? Deployment = interchangeable stateless replicas, random names. StatefulSet = stable identity (web-0), stable per-pod storage, ordered rollout. DBs.
- How does a Service find its Pods? Label selector → Endpoints/EndpointSlice; kube-proxy programs iptables/IPVS to load-balance to those Pod IPs.
- ClusterIP vs NodePort vs LoadBalancer? Internal-only → node port on every node → cloud LB. They layer.
- Liveness vs readiness vs startup? Liveness fail → restart. Readiness fail → pull from endpoints. Startup → guard slow boots.
- requests vs limits? requests = guaranteed/scheduled. limits = hard cap. Over mem = OOMKilled; over CPU = throttled.
- What's in the control plane? apiserver (front door), etcd (state), scheduler (placement), controller-manager (reconcile). Nodes run kubelet + kube-proxy + runtime.
- How does a rolling update work? New ReplicaSet scales up while old scales down per maxSurge/maxUnavailable. Rollback = re-point to the previous ReplicaSet.
- ConfigMap vs Secret? Same idea; Secret is for sensitive data (base64 in etcd, not encrypted by default). Mount as env or files.
- Taint vs toleration vs affinity? Taint repels pods from a node; toleration lets a pod ignore it; affinity attracts pods to nodes/pods.
- Why is my Pod Pending? No fit: no node with enough free CPU/mem for its requests, a taint with no toleration, or an unbound PVC.
- What does kubelet do? Node agent: watches the API for Pods on its node and keeps their containers running and healthy.
- QoS classes? Guaranteed (requests=limits) > Burstable > BestEffort. BestEffort evicted first under pressure.
- Headless service? clusterIP: None, so DNS returns Pod IPs directly (no VIP). Used by StatefulSets for stable per-pod DNS.
- How is config declarative? You apply desired state; controllers reconcile actual → desired continuously. No imperative "start this".