DevOps Interview Questions

Real DevOps-engineer questions (CI/CD, containers, Kubernetes, IaC, and the culture/process bits), graded easy → hard with full answers. Pair with the Kubernetes / Docker / Terraform cheatsheets.

easy: fundamentals / screening · medium: applied (most loops) · hard: senior / design & debug

Easy — fundamentals

What is DevOps, really? easy

A culture and set of practices that shorten the path from code to production by breaking down the wall between development and operations. Core ideas: automate everything (build, test, deploy, infra), small frequent releases, shared ownership of reliability, fast feedback (monitoring), and treating infrastructure as code. It's measured by the DORA metrics: deployment frequency, lead time for changes, change-failure rate, and mean time to restore (MTTR).

CI vs CD vs CD — continuous integration, delivery, deployment? easy

Continuous Integration: developers merge to a shared branch frequently; every push triggers automated build + tests to catch integration problems early. Continuous Delivery: every change that passes CI is automatically prepared and kept releasable — deploying to prod is a one-click/manual approval. Continuous Deployment: goes one step further — every change that passes the pipeline is deployed to production automatically, no human gate.

What is a container, and how is it different from a VM? easy

A container packages an app with its dependencies and runs as an isolated process on the host kernel, using Linux namespaces (isolation) and cgroups (resource limits). A VM virtualizes hardware and runs a full guest OS with its own kernel via a hypervisor. Containers are lighter (no guest OS), typically start in about a second or less, and pack more densely onto a host; VMs give stronger isolation and can run different OS kernels. Because containers share the host kernel, a kernel-level vulnerability can affect every container on that host.

What is Infrastructure as Code (IaC)? easy

Managing infrastructure (servers, networks, load balancers, DNS) through declarative configuration files kept in version control, instead of manual clicks. Tools like Terraform or CloudFormation read the desired state and converge the real infra to match. Benefits: reproducibility, review/audit via PRs, rollback, drift detection, and the ability to spin up identical environments. Declarative (describe the end state) is the norm; the tool computes the diff.

What is a Docker image vs a container? easy

An image is an immutable, layered template (filesystem + metadata) built from a Dockerfile. A container is a running (or stopped) instance of an image with a writable top layer. Images are built once and shared via a registry; you can run many containers from one image. Layers are cached and shared across images, which is why ordering Dockerfile instructions for cache reuse matters.

Trunk-based vs GitFlow branching? easy

Trunk-based: short-lived branches merged to main often behind flags (suits CI/CD). GitFlow: long-lived develop/release/feature branches (heavier). Most modern teams use trunk-based.

What is Docker layer caching? easy

Each Dockerfile instruction is a cached layer; order rarely-changing steps (deps) before code so rebuilds reuse cache.
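
The effect of instruction order can be sketched with a toy model. This is a simplified illustration, not Docker's actual cache implementation: it assumes each layer's cache key depends on the parent key, the instruction text, and (for COPY) a hash of the copied files. The names layer_keys, reused_layers, DEPS_FIRST, and CODE_FIRST are hypothetical.

```python
import hashlib

def layer_keys(instructions, file_hashes):
    """Return one cache key per instruction; each key chains the parent key."""
    keys = []
    parent = ""
    for instr in instructions:
        # In this model, COPY layers also depend on the content they copy.
        content = ""
        if instr.startswith("COPY "):
            src = instr.split()[1]
            content = file_hashes.get(src, "")
        key = hashlib.sha256(f"{parent}|{instr}|{content}".encode()).hexdigest()
        keys.append(key)
        parent = key
    return keys

def reused_layers(old_keys, new_keys):
    """Count leading layers with matching keys; reuse stops at the first miss."""
    hits = 0
    for old, new in zip(old_keys, new_keys):
        if old != new:
            break
        hits += 1
    return hits

DEPS_FIRST = [
    "FROM python:3.12-slim",
    "COPY requirements.txt .",
    "RUN pip install -r requirements.txt",
    "COPY src ./src",
]
CODE_FIRST = [
    "FROM python:3.12-slim",
    "COPY src ./src",
    "COPY requirements.txt .",
    "RUN pip install -r requirements.txt",
]

before = {"requirements.txt": "lock-v1", "src": "code-v1"}
after = {"requirements.txt": "lock-v1", "src": "code-v2"}  # only app code changed

for name, steps in (("deps first", DEPS_FIRST), ("code first", CODE_FIRST)):
    hits = reused_layers(layer_keys(steps, before), layer_keys(steps, after))
    print(f"{name}: {hits} of {len(steps)} layers reused")
```

With dependencies installed before the source is copied, a code-only change leaves the install layer cached; with the order reversed, the install step reruns on every code change.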

What is a container registry? easy

A store for built images (ECR/GHCR/Docker Hub) that CI pushes to and clusters pull from, with each image tagged by an immutable identifier such as the commit SHA.

What is idempotency in IaC? easy

Re-applying the same config yields the same state with no extra changes — reruns are safe.
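
A minimal sketch of what idempotent apply means, assuming infrastructure is modeled as plain dictionaries (the apply function and resource names are hypothetical, not any tool's API):

```python
def apply(desired, actual):
    """Converge `actual` toward `desired`; return the list of changes made."""
    changes = []
    for name, spec in desired.items():
        if name not in actual:
            actual[name] = dict(spec)
            changes.append(("create", name))
        elif actual[name] != spec:
            actual[name] = dict(spec)
            changes.append(("update", name))
    # Anything present but no longer declared is removed.
    for name in list(actual):
        if name not in desired:
            del actual[name]
            changes.append(("delete", name))
    return changes

desired = {"web": {"size": "small"}, "db": {"size": "large"}}
actual = {"web": {"size": "medium"}, "old-cache": {"size": "small"}}

first = apply(desired, actual)   # makes changes
second = apply(desired, actual)  # same config again: nothing left to do
print(first, second)
```

The second run returning an empty change list is the property that makes reruns safe.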

What is a health probe? easy

An endpoint the platform polls to decide whether an instance is alive (restart it if not) and ready (route traffic to it only if so).

What is blue-green deployment? easy

Two identical envs; deploy to the idle one, switch the LB once verified, keep the old for instant rollback.

What is a canary deployment? easy

Send a small % of traffic to the new version, watch metrics, then ramp — limits blast radius with auto-rollback on SLO breach.

What are the DORA metrics? easy

Deployment frequency, lead time, change-failure rate, time to restore — standard delivery-performance measures.
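
As a rough sketch, the four metrics can be computed from deploy records. The record shape (commit_at, deployed_at, failed, restored_at) is an assumption made for this example; real tooling pulls these from the VCS, the CD system, and the incident tracker.

```python
from datetime import datetime, timedelta
from statistics import median

def dora_metrics(deploys, window_days):
    """deploys: list of dicts with commit_at, deployed_at, failed, restored_at."""
    if not deploys:
        raise ValueError("need at least one deploy in the window")
    lead_times = [d["deployed_at"] - d["commit_at"] for d in deploys]
    failures = [d for d in deploys if d["failed"]]
    restore_times = [
        d["restored_at"] - d["deployed_at"] for d in failures if d["restored_at"]
    ]
    mttr = None
    if restore_times:
        mttr = sum(restore_times, timedelta()) / len(restore_times)
    return {
        "deploys_per_day": len(deploys) / window_days,
        "median_lead_time": median(lead_times),
        "change_failure_rate": len(failures) / len(deploys),
        "mean_time_to_restore": mttr,
    }

t0 = datetime(2024, 1, 1, 9, 0)
sample = [
    {"commit_at": t0, "deployed_at": t0 + timedelta(hours=1),
     "failed": False, "restored_at": None},
    {"commit_at": t0, "deployed_at": t0 + timedelta(hours=2),
     "failed": True, "restored_at": t0 + timedelta(hours=2, minutes=30)},
    {"commit_at": t0, "deployed_at": t0 + timedelta(hours=3),
     "failed": False, "restored_at": None},
    {"commit_at": t0, "deployed_at": t0 + timedelta(hours=4),
     "failed": False, "restored_at": None},
]
print(dora_metrics(sample, window_days=2))
```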

What is artifact promotion? easy

Build once and promote the same image dev→staging→prod, rather than rebuilding per environment.

Rollback vs roll-forward? easy

Rollback reverts to the last good version (usually fastest to restore service); roll-forward ships a new fix.

Medium — applied

Walk through what happens when you run kubectl apply for a Deployment. medium

kubectl sends the manifest to the API server, which authenticates/authorizes, validates, and persists the desired state in etcd. The Deployment controller sees the new/changed Deployment and creates/updates a ReplicaSet; the ReplicaSet controller creates the required Pods. The scheduler assigns each Pod to a node based on resources/affinity/taints. The kubelet on that node pulls the image and starts the containers via the container runtime, and reports status back. kube-proxy / the CNI wire up networking; Services route to the Pods via endpoints. It's a continuous reconciliation loop — controllers drive actual state toward desired state.

How do you make a Docker image small and build fast? medium

Multi-stage builds: compile in a builder stage, copy only the artifact into a slim runtime stage. Use a minimal base (distroless/alpine/slim). Order layers so rarely-changing steps (deps install) come before frequently-changing ones (app code) to maximize cache hits; copy lockfiles and install deps before copying source. Use a .dockerignore. Combine RUN steps to reduce layers, clean package caches in the same layer. Pin versions for reproducibility. Result: smaller attack surface, faster pulls, faster CI.

Compare deployment strategies: rolling, blue-green, canary. medium

Rolling: replace instances gradually (k8s default) — no extra infra, but mixed versions briefly and slower rollback. Blue-green: stand up a full new (green) environment, switch traffic at the LB once verified, keep blue for instant rollback — fast/clean but doubles infra cost. Canary: route a small % of traffic to the new version, watch metrics, then ramp up — best risk control, needs good traffic-splitting + observability. Choose by risk tolerance, cost, and tooling; canary + automated rollback on SLO breach is the gold standard.

What's the difference between liveness and readiness probes in Kubernetes? medium

Readiness probe decides whether a Pod should receive traffic — fail it and the Pod is removed from Service endpoints but not restarted (used during startup/warmup or when a dependency is down). Liveness probe decides whether the container is healthy — fail it and the kubelet restarts the container (used to recover from deadlocks/hangs). There's also a startup probe to protect slow-starting apps from liveness killing them early. Misconfiguring liveness (too aggressive) causes restart loops; that's a classic outage.
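
The different consequences of the two probes can be sketched with a small simulation. This is a simplified model, not the kubelet's actual logic: it assumes one probe result per period, a shared consecutive-failure threshold, and a success threshold of one. The function simulate_probes is hypothetical.

```python
def simulate_probes(results, failure_threshold=3):
    """results: list of (liveness_ok, readiness_ok), one pair per probe period.

    Returns a list of (in_service_endpoints, restarted) per period.
    """
    live_fails = 0
    ready_fails = 0
    ready = False  # a container starts out not ready
    timeline = []
    for live_ok, ready_ok in results:
        restarted = False
        # Readiness only gates traffic; it never restarts the container.
        if ready_ok:
            ready_fails = 0
            ready = True
        else:
            ready_fails += 1
            if ready_fails >= failure_threshold:
                ready = False
        # Liveness restarts the container after enough consecutive failures.
        if live_ok:
            live_fails = 0
        else:
            live_fails += 1
            if live_fails >= failure_threshold:
                restarted = True
                live_fails = 0
                ready_fails = 0
                ready = False
        timeline.append((ready, restarted))
    return timeline

# Dependency outage: the app is alive but reports not ready for three periods.
dependency_down = [(True, True)] + [(True, False)] * 3 + [(True, True)]
# Deadlock: both probes fail until the liveness threshold triggers a restart.
hung = [(True, True)] + [(False, False)] * 3

print(simulate_probes(dependency_down))
print(simulate_probes(hung))
```

In the first scenario the Pod leaves the endpoints and comes back without a restart; in the second the container is restarted.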

How does Terraform state work, and why is it sensitive? medium

Terraform records the mapping between your config and real-world resources in a state file. On plan/apply it diffs desired config vs state vs real infra to compute changes. State is sensitive because it can contain secrets and is the source of truth — if it's lost or corrupted, Terraform loses track of resources. So store it remotely (e.g. S3 + DynamoDB lock, or Terraform Cloud) with locking (prevent concurrent applies), versioning, and encryption. Never edit it by hand; use terraform state commands and import for adopting existing resources.

How does Kubernetes reconciliation work? medium

Controllers loop comparing desired state (etcd via API server) to actual and act to converge — declarative, self-healing.
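
A minimal sketch of one reconciliation pass, modeled loosely on a ReplicaSet-style controller. Pods are just names in a set, and reconcile and the naming scheme are assumptions for illustration; a real controller watches the API server rather than being called by hand.

```python
import itertools

_ids = itertools.count(1)

def reconcile(desired_replicas, pods):
    """One pass: create or delete pods until the count matches desired state."""
    actions = []
    while len(pods) < desired_replicas:
        name = f"pod-{next(_ids)}"
        pods.add(name)
        actions.append(("create", name))
    while len(pods) > desired_replicas:
        name = sorted(pods)[-1]
        pods.remove(name)
        actions.append(("delete", name))
    return actions

pods = set()
print(reconcile(3, pods))  # initial rollout creates three pods
pods.discard("pod-2")      # simulate a pod dying
print(reconcile(3, pods))  # self-healing: one replacement is created
print(reconcile(3, pods))  # converged: no actions
```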

How do you design CI caching? medium

Cache dependency dirs + Docker layers keyed by lockfile hashes, restore per run, scope per branch — the biggest CI speedup.
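
One way to derive such keys, as a sketch: hash the lockfile contents and scope the key by branch, with fallback prefixes for partial hits. The key layout and the names cache_key and restore_keys are assumptions; each CI system has its own cache-key syntax.

```python
import hashlib

def cache_key(prefix, lockfile_bytes, branch):
    """Exact-match key: changes only when the lockfile content changes."""
    digest = hashlib.sha256(lockfile_bytes).hexdigest()[:16]
    return f"{prefix}-{branch}-{digest}"

def restore_keys(prefix, branch, default_branch="main"):
    """Fallback prefixes, most specific first, for a partial cache hit."""
    return [f"{prefix}-{branch}-", f"{prefix}-{default_branch}-"]

lock_v1 = b"requests==2.31.0\n"
lock_v2 = b"requests==2.32.0\n"

print(cache_key("deps", lock_v1, "feature-x"))
print(cache_key("deps", lock_v2, "feature-x"))  # new lockfile, new key
print(restore_keys("deps", "feature-x"))
```

A code-only commit keeps the same key and gets a full cache hit; a lockfile change misses the exact key but can still start from the nearest older cache.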

Explain GitOps and benefits. medium

Git is source of truth; an in-cluster agent (Argo CD/Flux) reconciles to the repo — audit trail, easy revert, drift heal, no cluster creds in CI.

How do you achieve zero-downtime deploys? medium

Rolling/blue-green/canary + readiness probes + connection draining, plus backward-compatible (expand-contract) changes so versions coexist.

How do you manage secrets across envs? medium

A secrets manager injected at runtime, workload identity (IRSA) over static keys, per-env isolation, rotation, sealed-secrets/SOPS for GitOps.

Liveness vs readiness vs startup probes? medium

Readiness gates traffic (no restart), liveness restarts a hung container, startup shields slow-booting apps from liveness.

CI/CD for a microservices monorepo? medium

Build/test only what changed (affected graph), per-PR preview envs, build-once-promote, progressive prod rollout with auto-rollback, heavy caching.

How do you handle DB migrations in CI/CD? medium

Backward-compatible expand-contract, decoupled from code, idempotent, batched backfills, with a tested rollback.

What is observability (logs/metrics/traces)? medium

Metrics (trends/alerts), logs (events), traces (request flow across services) — together let you ask arbitrary questions about behavior.

What is infrastructure drift and how do you handle it? medium

When real infra diverges from IaC (manual changes); detect via plan/drift detection, reconcile by re-applying, and lock down manual access to prevent it.

Hard — senior & design

A deploy went out and error rates spiked. Walk me through your response. hard

Stabilize first, diagnose second. (1) Confirm + scope: check dashboards/alerts — which service, which endpoints, blast radius, correlate the spike with the deploy time. (2) Mitigate: if it's clearly the deploy, roll back (or shift canary traffic back) immediately — restoring service beats root-causing. Feature-flag off if it's a flagged change. (3) Communicate: declare an incident, assign roles, update stakeholders. (4) Diagnose once stable: logs/traces around the change, compare config/diff, check dependencies. (5) Postmortem: blameless, identify the gap (missing test, missing canary/alert), add the guardrail. The principles: MTTR over RCA in the moment, rollback is a first-class option, and every incident hardens the pipeline.

How do you design a CI/CD pipeline for a microservices monorepo? hard

Key concerns: only build what changed (path-based triggers / affected-graph tools like Nx/Bazel/Turborepo) so a one-line change doesn't rebuild everything. Stages: lint + unit tests → build artifact/image (tagged with immutable SHA) → security scans (SAST, deps, image) → push to registry → deploy to staging → integration/e2e tests → progressive prod rollout (canary) with automated rollback on SLO breach. Use ephemeral preview environments per PR. Cache aggressively (deps, layers). Promote the same artifact through environments (build once, deploy many). Secrets from a vault, never in the repo. Enforce branch protection + required checks. GitOps (Argo CD/Flux) for declarative, auditable deploys.
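
The "only build what changed" step can be sketched as a walk over a dependency graph. The directory layout, service names, and the affected_services function are hypothetical; tools like Nx, Bazel, or Turborepo derive this graph from the workspace instead of a hand-written dict.

```python
def affected_services(changed_paths, service_dirs, depends_on):
    """Services whose files changed, plus everything that depends on them."""
    directly = {
        svc
        for svc, directory in service_dirs.items()
        if any(p.startswith(directory.rstrip("/") + "/") for p in changed_paths)
    }
    affected = set(directly)
    # Propagate to dependents until no new service is added.
    grew = True
    while grew:
        grew = False
        for svc, deps in depends_on.items():
            if svc not in affected and affected & set(deps):
                affected.add(svc)
                grew = True
    return affected

SERVICE_DIRS = {
    "auth": "libs/auth",
    "api": "services/api",
    "web": "services/web",
    "billing": "services/billing",
}
DEPENDS_ON = {"api": ["auth"], "web": ["api"], "billing": []}

print(sorted(affected_services(["libs/auth/token.py"], SERVICE_DIRS, DEPENDS_ON)))
print(sorted(affected_services(["services/billing/invoice.py"], SERVICE_DIRS, DEPENDS_ON)))
```

A change to the shared auth library rebuilds its dependents; a change inside billing rebuilds only billing.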

What is GitOps and what does it buy you? hard

GitOps makes a Git repo the single source of truth for declarative infrastructure/app state, and an in-cluster agent (Argo CD / Flux) continuously reconciles the live cluster to match the repo. Deploys happen by merging a PR; the agent pulls and applies. Benefits: full audit trail and review on every change, trivial rollback (revert the commit), drift detection/auto-heal (manual changes get reverted to match Git), and no cluster credentials handed to CI (the agent pulls, rather than CI pushing). It cleanly separates CI (build/test → produce artifact + update manifests) from CD (cluster reconciles to Git).

A pod is stuck in CrashLoopBackOff. How do you debug it? hard

CrashLoopBackOff means the container keeps starting and exiting, so the kubelet backs off between restarts. Workflow: kubectl describe pod (events: image pull? OOMKilled? failed mount? probe failure?), then kubectl logs pod --previous (the crashed instance's logs, which usually hold the actual error). Common causes: app throws on startup (bad config/missing env/secret, can't reach a dependency), OOMKilled (exit code 137; raise the memory limit or fix the leak), a failing liveness probe killing a healthy-but-slow app (add a startup probe), missing ConfigMap/Secret/volume, or a wrong command/entrypoint. Reproduce locally with the same image/env; if the entrypoint exits immediately, override the command to sleep and exec in to inspect.
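
Exit codes above 128 conventionally mean the process was killed by a signal (128 + signal number), which is why 137 points at SIGKILL and often at the OOM killer. A small helper to decode them, as a sketch; the function name and the hint texts are this example's own.

```python
SIGNALS = {9: "SIGKILL", 11: "SIGSEGV", 15: "SIGTERM"}

def explain_exit_code(code):
    """Map a container exit code to a first debugging hint."""
    if code == 0:
        return "exited cleanly; check whether the command is meant to be long-running"
    if code > 128:
        signum = code - 128
        name = SIGNALS.get(signum, f"signal {signum}")
        if signum == 9:
            return f"{name}: killed; check for OOMKilled in describe output"
        return f"{name}: terminated by signal"
    return "application error; read the logs of the previous container instance"

for code in (137, 143, 1, 0):
    print(code, "->", explain_exit_code(code))
```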

How would you manage secrets across environments? hard

Never in Git or images. Use a dedicated secrets manager — HashiCorp Vault, AWS Secrets Manager / SSM, or cloud KMS-backed stores — and inject at deploy/runtime. In Kubernetes, prefer External Secrets Operator or CSI secret store driver to sync from the manager (k8s Secrets are only base64, so enable encryption-at-rest in etcd and tight RBAC). Practices: least privilege per service (workload identity / IRSA so apps assume roles instead of holding static keys), short-lived/dynamic credentials, rotation, audit logging, and per-environment isolation. For GitOps, use sealed-secrets or SOPS-encrypted values so only the cluster can decrypt. The goal: secrets are centralized, access-controlled, rotated, and never persisted in plaintext.

How do you secure a CI/CD supply chain? hard

Least-privilege runners + OIDC (no static creds), pinned/verified deps + SBOM, SAST/SCA/secret/image scans, signed artifacts (Sigstore/SLSA), policy-as-code on deploys.

Design an autoscaling strategy for a bursty service. hard

HPA/ASG on the real load metric (queue depth/RPS, not always CPU), pre-baked images/warm pools for fast scale-up, cooldowns to avoid flapping, and downstream limits (DB) accounted for.
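
The core scaling calculation can be sketched as follows. It follows the ratio rule described in the Kubernetes HPA documentation (desired = ceil(current × metric / target), skipped when the ratio is within a tolerance) but leaves out stabilization windows and scaling policies. The parameter names are this example's own.

```python
import math

def desired_replicas(current, metric, target, min_r, max_r, tolerance=0.1):
    """Replica count for an average per-pod metric (e.g. queue depth per pod)."""
    ratio = metric / target
    # Within tolerance of the target: do nothing, which avoids flapping.
    if abs(ratio - 1.0) <= tolerance:
        return current
    return max(min_r, min(max_r, math.ceil(current * ratio)))

print(desired_replicas(4, metric=200, target=100, min_r=2, max_r=20))   # burst: scale out
print(desired_replicas(4, metric=105, target=100, min_r=2, max_r=20))   # within tolerance
print(desired_replicas(4, metric=1000, target=100, min_r=2, max_r=20))  # capped at max
```

The max_r cap is where downstream limits such as database connections come in: it should be set from what the dependencies can absorb, not only from cost.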

How do you run progressive delivery with automated rollback? hard

Canary/blue-green via Argo Rollouts/Flagger, analyze SLO metrics (error rate/latency) during ramp, auto-abort+rollback on breach, with a kill switch.
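
The analysis step can be sketched as a pure decision function. The thresholds, the step schedule, and the metric field names are assumptions for illustration; Argo Rollouts and Flagger express the same idea declaratively with their own resources.

```python
def next_step(current_weight, canary, baseline, steps=(5, 25, 50, 100),
              max_error_rate=0.01, max_latency_ratio=1.2):
    """Return (decision, new_weight) for one analysis interval."""
    requests = canary["requests"]
    error_rate = canary["errors"] / requests if requests else 0.0
    latency_ratio = canary["p99_ms"] / baseline["p99_ms"]
    # Any SLO breach aborts the rollout and sends all traffic back.
    if error_rate > max_error_rate or latency_ratio > max_latency_ratio:
        return ("rollback", 0)
    higher = [s for s in steps if s > current_weight]
    if not higher:
        return ("done", current_weight)
    return ("promote", higher[0])

baseline = {"p99_ms": 200}
healthy = {"requests": 1000, "errors": 2, "p99_ms": 210}
failing = {"requests": 1000, "errors": 50, "p99_ms": 210}

print(next_step(5, healthy, baseline))
print(next_step(25, failing, baseline))
```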

How do you design multi-region active-passive failover? hard

Async data replication, health-checked DNS/global LB failover, warm standby scaled on demand, tested promotion runbook, and config/secrets present in both regions.

What is a service mesh and when worth it? hard

A sidecar-based (Istio/Linkerd) or eBPF-based (e.g. Cilium) layer for mTLS, traffic shaping, retries, and observability without app changes. Worth it at scale; the overhead and complexity hurt small setups.

How do you cut MTTR systematically? hard

Fast rollback as first-class, good alerting on SLOs, runbooks, observability (traces), feature flags/kill switches, blameless postmortems that add guardrails.

How do you do capacity planning? hard

Load-test to find per-instance limits and the first bottleneck (often DB connections), model peak demand + headroom, autoscale, and monitor saturation (USE).
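
The sizing arithmetic, as a back-of-the-envelope sketch. The headroom factor, the minimum instance count, and the reserved-connection figure are illustrative assumptions, not recommendations.

```python
import math

def instances_needed(peak_rps, per_instance_rps, headroom=0.3, min_instances=2):
    """Instances required to serve peak demand plus headroom."""
    return max(min_instances, math.ceil(peak_rps * (1 + headroom) / per_instance_rps))

def db_connections_ok(instances, pool_size, db_max_connections, reserved=10):
    """Check that the fleet's connection pools fit within the database limit."""
    return instances * pool_size <= db_max_connections - reserved

n = instances_needed(peak_rps=1000, per_instance_rps=120)
print(n)                                                            # 11
print(db_connections_ok(n, pool_size=20, db_max_connections=200))   # False
print(db_connections_ok(n, pool_size=10, db_max_connections=200))   # True
```

Here compute scales fine but a pool of 20 connections per instance would exceed the database limit, which is the kind of first bottleneck a load test should surface.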

How do you manage Terraform at scale across teams? hard

Remote state with locking + versioning, state split by blast radius, reusable modules, per-env workspaces, plan-in-CI + reviewed apply, and drift detection.

How do you design effective SLOs and error budgets? hard

Pick user-centric SLIs (availability/latency), set SLO targets, derive an error budget; spend it on velocity, freeze releases when exhausted — aligns reliability with shipping.
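
The error-budget arithmetic for a request-based availability SLO can be sketched as below. The error_budget function and its field names are hypothetical, and the freeze rule is the simple "budget exhausted" policy from the answer above.

```python
def error_budget(slo, total_requests, failed_requests):
    """Budget figures for a request-based availability SLO over one window."""
    allowed = (1 - slo) * total_requests
    remaining = allowed - failed_requests
    return {
        "allowed_failures": allowed,
        "remaining": remaining,
        "consumed_fraction": failed_requests / allowed if allowed else float("inf"),
        "freeze_releases": remaining <= 0,
    }

# A 99.9% SLO over 1,000,000 requests allows roughly 1,000 failures.
print(error_budget(0.999, 1_000_000, 250))
print(error_budget(0.999, 1_000_000, 1200))
```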

How do you secure container workloads at runtime? hard

Non-root + dropped caps + read-only FS, Pod Security Admission (restricted), network policies, image scanning + signed images, and runtime detection (Falco).

Scenario-based

A deploy went out and error rates are spiking right now. What do you do? hard

Stabilize before diagnosing. Confirm + scope (which service, correlate with deploy time), then roll back / shift canary traffic back immediately or flag-off the change — restoring service beats root-causing. Declare an incident, assign roles, communicate. Once stable, diagnose via logs/traces/diff, then write a blameless postmortem and add the missing guardrail (test, canary, alert). MTTR over RCA in the moment.

Your CI pipeline takes 45 minutes and devs are blocked. How do you speed it up? medium

Profile the stages first. Then: cache dependencies and Docker layers, parallelize independent jobs, run only affected tests/builds (path-based / monorepo affected-graph), use faster runners and bigger caches, split slow test suites, and move non-blocking work (heavy e2e, security scans) to async/post-merge. Build artifacts once and promote them. Target the longest pole each iteration.

A pod is in CrashLoopBackOff. Walk me through debugging it. medium

kubectl describe pod (events: image pull, OOMKilled, mount/probe failure) and kubectl logs --previous (the crashed instance's real error). Common causes: bad config / missing env/secret, can't reach a dependency on startup, OOMKilled (exit 137 — raise limit/fix leak), aggressive liveness probe killing a slow starter (add startup probe), missing ConfigMap/Secret/volume, or wrong entrypoint. Reproduce locally with the same image/env.

Secrets were committed to the git repo. What's your response? hard

Treat the secret as compromised the moment it's pushed. Rotate it immediately (the leaked value is burned — purging history is not enough). Then remove from history (filter-repo/BFG) and force-push, and revoke any access it granted. Move secrets to a manager (Vault / cloud secrets) injected at runtime, add pre-commit + CI secret scanning (gitleaks/trufflehog) to prevent recurrence, and audit logs for misuse during the exposure window.
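
To illustrate what a pre-commit or CI secret scan does, here is a deliberately tiny sketch with two patterns: the common AKIA-prefixed AWS access key ID shape and a private-key header. Real scanners such as gitleaks or trufflehog ship far larger rule sets plus entropy checks; the scan function and PATTERNS table are this example's own.

```python
import re

PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_header": re.compile(
        r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"
    ),
}

def scan(text):
    """Return (line_number, rule_name) for every line matching a pattern."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for rule, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, rule))
    return findings

# AKIAIOSFODNN7EXAMPLE is the placeholder key used in AWS documentation.
staged = 'region = "us-east-1"\naccess_key = "AKIAIOSFODNN7EXAMPLE"\n'
print(scan(staged))
```

A hook would fail the commit when scan returns any findings; it prevents recurrence but does not replace rotating a secret that has already leaked.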

Flaky tests keep blocking the pipeline. How do you handle it? medium

Don't just add blind retries — that hides real bugs. Quarantine known-flaky tests out of the blocking path so the pipeline isn't held hostage, then fix root causes (timing/async races, shared state, test ordering, external deps — use deterministic clocks, isolation, mocks/fakes). Track flake rate per test, and only allow narrow, logged retries as a last resort. A flaky suite that everyone ignores is worse than a slow one.
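
Tracking flake rate per test can be sketched like this. It assumes a test counts as flaky on a commit when it both passed and failed on that same commit; the 5% quarantine threshold and the function names are illustrative.

```python
from collections import defaultdict

def flake_rates(runs):
    """runs: iterable of (test_name, commit_sha, passed)."""
    outcomes_by_run = defaultdict(set)
    for test, sha, passed in runs:
        outcomes_by_run[(test, sha)].add(passed)
    totals = defaultdict(int)
    flaky = defaultdict(int)
    for (test, _sha), outcomes in outcomes_by_run.items():
        totals[test] += 1
        # Mixed results on identical code indicate flakiness, not a real bug.
        if outcomes == {True, False}:
            flaky[test] += 1
    return {test: flaky[test] / totals[test] for test in totals}

def quarantine(rates, threshold=0.05):
    """Tests to move out of the blocking path until fixed."""
    return sorted(test for test, rate in rates.items() if rate > threshold)

history = [
    ("test_checkout", "abc", True),
    ("test_checkout", "abc", False),  # same commit, different result: flaky
    ("test_checkout", "def", True),
    ("test_login", "abc", False),
    ("test_login", "abc", False),     # fails consistently: a real failure
]
rates = flake_rates(history)
print(rates)
print(quarantine(rates))
```

A consistently failing test is not quarantined by this rule, which keeps real regressions in the blocking path.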

You need a zero-downtime database migration on a live service. How? hard

Use the expand-contract (parallel-change) pattern. Expand: add the new column/table, make it nullable/backward-compatible, deploy code that writes to both old and new (dual-write) and can read either. Backfill existing rows in batches. Switch reads to the new schema once backfilled and verified. Contract: stop writing the old, then drop it in a later deploy. Each step is independently deployable and reversible — never a single breaking migration.
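
The expand, dual-write, and backfill steps can be sketched against an in-memory SQLite database. The users table, the fullname and display_name columns, and the batch size are assumptions for illustration; the contract step (dropping the old column) is left out because it belongs in a later deploy.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, fullname TEXT)")
conn.executemany(
    "INSERT INTO users (fullname) VALUES (?)",
    [(f"user {i}",) for i in range(10)],
)

# Expand: add the new column as nullable so old code keeps working.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")

def write_user(conn, name):
    """Dual-write: new code populates both the old and the new column."""
    conn.execute(
        "INSERT INTO users (fullname, display_name) VALUES (?, ?)", (name, name)
    )

def backfill(conn, batch_size=4):
    """Copy old to new in small batches; returns the number of batches run."""
    batches = 0
    while True:
        cur = conn.execute(
            "UPDATE users SET display_name = fullname WHERE id IN "
            "(SELECT id FROM users WHERE display_name IS NULL LIMIT ?)",
            (batch_size,),
        )
        conn.commit()  # short transactions keep lock time low
        if cur.rowcount == 0:
            return batches
        batches += 1

def read_name(conn, user_id):
    """Read either schema: prefer the new column, fall back to the old one."""
    row = conn.execute(
        "SELECT COALESCE(display_name, fullname) FROM users WHERE id = ?",
        (user_id,),
    ).fetchone()
    return row[0]

write_user(conn, "new user")
batches_run = backfill(conn)
print(batches_run, read_name(conn, 1))
```

Because the backfill only touches rows where the new column is still NULL, it can be interrupted and rerun safely.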

CPU pegged at 100% in prod. First steps? medium

Confirm scope/correlate with a deploy, scale out to relieve, then profile the hot path (flame graph/APM) and fix or roll back.

Works in staging, fails in prod. Likely causes? medium

Config/secret/env diffs, data scale, missing flag/dependency, resource limits — diff environments, check prod-only config.

2GB image, slow pulls. Slim it. medium

Multi-stage build, minimal/distroless base, .dockerignore, combine+clean RUN layers, copy only the artifact into runtime.

Disk fills on a node, pods evicted. Respond. medium

Free space (rotate logs, prune images), find the culprit, set ephemeral-storage limits + log rotation, add disk-pressure alerts.

Ongoing incident, stakeholders keep asking. Handle? medium

Assign incident commander + comms role, post status on a cadence, keep responders on mitigation, single source of truth.

Releases keep breaking prod despite passing tests. Gaps? hard

Add canary + auto-rollback on SLO breach, integration/e2e/contract tests, prod-like staging, flags, better observability — tests miss integration/scale.

Rollback fails due to a DB migration. Lesson? hard

Migrations must be backward-compatible (expand-contract) and decoupled from code so rollback doesn't revert schema; never ship breaking migrations with the deploy.

Build times ballooned with repo growth. Strategy? medium

Affected-only builds/tests, parallelization, dependency+layer caching, faster runners, slow e2e to post-merge.

On-call drowning in noisy alerts. Fix? medium

Alert on symptoms/SLOs not causes, tune thresholds, group/dedupe, add runbooks, delete unactioned alerts.

Roll a risky change to 10M users safely. Plan? hard

Feature-flag (ship dark), canary a small % with SLO monitoring, ramp gradually with auto-rollback, kill switch — decouple deploy from release.

What industry actually asks

DevOps loops lean heavily on scenario debugging ("pod won't start," "deploy broke prod," "pipeline is slow") and design ("design CI/CD for X," "how do you do zero-downtime deploys," "how do you manage secrets/state"). Expect deep Kubernetes (probes, scheduling, networking, RBAC) and a hands-on take-home or live kubectl/Terraform task. Senior loops add reliability (SLOs, incident response, DORA) and cost. Always answer debugging with a method (describe → logs → events) and design with trade-offs, not a single "right" tool.
