Infrastructure Engineer Interview Questions

Real infrastructure-engineer questions covering provisioning, IaC, networking, load balancing, scaling, capacity, storage, HA/DR, and the design judgement that separates "wires it up" from "owns the platform", graded easy → hard with full answers. Pair with the Terraform / AWS / Networking cheatsheets.

Difficulty levels: easy (fundamentals / screening), medium (applied, most loops), hard (senior / design & debug).

Easy — fundamentals

What does an infrastructure engineer actually own? easy

The foundation everything else runs on: compute (VMs, containers, bare metal), networking (VPCs, subnets, load balancers, DNS, firewalls), storage (block, object, file), and the provisioning + lifecycle of all of it. Increasingly that means infrastructure as code, the CI/CD and deployment platform other teams build on, observability plumbing, and the reliability/cost/security posture of the whole estate. It overlaps DevOps/SRE/platform — the distinguishing focus is the platform layer: making infra reproducible, scalable, available, and cheap, then exposing it as a paved road to product teams.

Declarative vs imperative infrastructure — what's the difference? easy

Imperative: you specify the steps ("create this VM, then attach this disk, then open this port") — like a shell script. Declarative: you specify the desired end state ("I want 3 VMs with these disks and these ports") and the tool computes the diff and converges reality to match. Terraform, CloudFormation, and Kubernetes manifests are declarative; the tool is idempotent — re-running yields the same result. Declarative wins for infra because it's reproducible, diffable in PRs, and self-documents the target state.
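
The "compute the diff and converge" idea can be sketched in a few lines. This is a minimal illustration, not how Terraform or Kubernetes is implemented; the resource names, the `plan`/`apply` helpers, and the dict-based "state" are all hypothetical.

```python
def plan(desired: dict, actual: dict) -> dict:
    """Diff desired state against actual state and list the actions needed."""
    return {
        "create": sorted(desired.keys() - actual.keys()),
        "update": sorted(
            name for name in desired.keys() & actual.keys()
            if desired[name] != actual[name]
        ),
        "delete": sorted(actual.keys() - desired.keys()),
    }


def apply(desired: dict, actual: dict) -> dict:
    """Return the state that results from executing the plan."""
    changes = plan(desired, actual)
    new_state = dict(actual)
    for name in changes["create"] + changes["update"]:
        new_state[name] = dict(desired[name])
    for name in changes["delete"]:
        del new_state[name]
    return new_state


# Hypothetical example: three VMs wanted, reality has drifted.
desired = {
    "vm-1": {"size": "small"},
    "vm-2": {"size": "small"},
    "vm-3": {"size": "large"},
}
actual = {
    "vm-1": {"size": "small"},
    "vm-3": {"size": "small"},
    "vm-old": {"size": "small"},
}

first_plan = plan(desired, actual)    # create vm-2, update vm-3, delete vm-old
converged = apply(desired, actual)
second_plan = plan(desired, converged)  # nothing left to do
```

Running `plan` a second time against the converged state yields no actions, which is the idempotence property in practice.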

What is a load balancer, and why use one? easy

It distributes incoming traffic across multiple backend instances so no single one is overwhelmed. Buys you scale (spread load horizontally), availability (route around unhealthy backends via health checks), and a stable single entry point that decouples clients from individual servers. L4 (transport) load balancers route on IP/port — fast, protocol-agnostic. L7 (application) load balancers understand HTTP — path/host routing, TLS termination, header-based routing, sticky sessions. Common algorithms: round-robin, least-connections, hashing.
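
Two of the algorithms named above, round-robin and least-connections, can be sketched as follows. This is a simplified illustration with hypothetical class names; a real load balancer would also remove backends that fail health checks.

```python
import itertools


class RoundRobin:
    """Hand out backends in a fixed rotation."""

    def __init__(self, backends):
        self._cycle = itertools.cycle(list(backends))

    def pick(self):
        return next(self._cycle)


class LeastConnections:
    """Send each new connection to the backend with the fewest active ones."""

    def __init__(self, backends):
        self.active = {backend: 0 for backend in backends}

    def pick(self):
        # Ties resolve to the earliest backend in insertion order.
        backend = min(self.active, key=self.active.get)
        self.active[backend] += 1
        return backend

    def release(self, backend):
        # Call when a connection to `backend` closes.
        self.active[backend] -= 1


rr = RoundRobin(["a", "b", "c"])
rr_picks = [rr.pick() for _ in range(4)]

lc = LeastConnections(["a", "b", "c"])
lc_picks = [lc.pick() for _ in range(3)]
lc.release("b")          # b finishes its request first
next_pick = lc.pick()    # b now has the fewest active connections
```

Round-robin assumes requests cost roughly the same; least-connections adapts when some requests are long-lived.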

Vertical vs horizontal scaling? easy

Vertical (scale up): give one machine more CPU/RAM/disk. Simple, no app changes, but has a hard ceiling and is a single point of failure; resizing often means downtime. Horizontal (scale out): add more machines and balance load across them. Near-limitless and fault-tolerant, but requires the workload to be stateless (or state externalized) and adds coordination/networking complexity. Default to horizontal for stateless services; vertical is the quick lever for stateful systems (databases) until you must shard.

What's the difference between block, object, and file storage? easy

Block (EBS, persistent disks): raw volumes, typically attached to one instance at a time and formatted with a filesystem; low latency, good for databases and boot disks. File (NFS, EFS): a shared filesystem mountable by many instances at once; good for shared app data. Object (S3, GCS): a flat namespace of blobs accessed over an HTTP API, where objects are replaced whole rather than edited in place. It scales to effectively unlimited capacity, is cheap per GB, and is highly durable; good for backups, media, logs, static assets, data lakes. Trade-off axis: latency + mutability (block) vs scale + cost + durability (object).

What is DNS and what happens in a lookup? easy

DNS maps names to addresses. A resolver checks its cache, then walks the hierarchy: root servers → TLD servers (.com) → the domain's authoritative nameservers, which return the record. Key record types: A/AAAA (name→IP), CNAME (alias), MX (mail), TXT (verification/SPF), NS (delegation). TTL controls cache duration — lower it before a planned migration so changes propagate fast. For infra, DNS is also a load-balancing and failover tool (weighted/latency/geo routing, health-checked failover).
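
To make the TTL point concrete, here is a rough sketch of a resolver-style cache that honors a record's TTL. The `TtlCache` class, the injected clock, and the example name and address are assumptions for illustration; real resolvers are considerably more involved.

```python
import time


class TtlCache:
    """Cache name -> address until the record's TTL runs out."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # name -> (address, expires_at)

    def put(self, name, address, ttl_seconds):
        self._entries[name] = (address, self._clock() + ttl_seconds)

    def get(self, name):
        entry = self._entries.get(name)
        if entry is None:
            return None
        address, expires_at = entry
        if self._clock() >= expires_at:
            # Expired: the caller must ask the authoritative servers again.
            del self._entries[name]
            return None
        return address


cache = TtlCache()
cache.put("app.example.com", "192.0.2.10", ttl_seconds=300)
cached = cache.get("app.example.com")  # served from cache for up to 300 s
```

A client holding this entry keeps using the old address for up to 300 seconds after the record changes, which is why the TTL is lowered ahead of a migration rather than during it.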

What is infrastructure as code? easy

Managing infra via version-controlled declarative config (Terraform/CloudFormation) instead of manual clicks — reproducible, reviewable, rollback-able.

What is a load balancer's job? easy

Distribute traffic across backends for scale + availability via health checks; L4 routes on IP/port, L7 on HTTP.

Stateful vs stateless service? easy

Stateless holds no per-client state between requests (easy to scale/replace); stateful keeps data locally (harder — externalize state to scale).

What is a CDN? easy

Edge caches near users serving static/cacheable content, cutting latency and origin load.

What is a reverse proxy? easy

A server fronting backends that handles TLS, routing, caching, and load balancing on their behalf (nginx/Envoy).

What is high availability? easy

Designing so the system keeps serving despite component failures — redundancy + no single point of failure.

What is a VPC/subnet? easy

An isolated virtual network divided into subnets (public/private) with route tables and gateways controlling traffic.

What is autoscaling? easy

Automatically adding/removing capacity to match load between min/max bounds, replacing unhealthy instances.

What is a golden image? easy

A pre-baked machine/container image with OS+deps+config, deployed immutably so every instance is identical.

What is a CMDB / infra inventory? easy

A record of what infrastructure exists (resources, ownership, config) — the basis for change management and cost/security attribution.

Medium — applied

How do you keep Terraform state safe in a team? medium

State is the source of truth mapping config to real resources, and it can hold secrets, so never keep it local or in Git. Use a remote backend (S3 + DynamoDB lock table, GCS, or Terraform Cloud) with state locking so two applies can't race, versioning for rollback, and encryption at rest. Split state by blast radius: keep separate state per environment/component (networking vs app) so a mistake or lock contention doesn't affect everything. Run terraform plan in CI on every PR, apply only after review, and never hand-edit state; use state mv/import/rm instead. Isolate environments with workspaces or a directory per environment.

Mutable vs immutable infrastructure — and why immutable wins? medium

Mutable: you SSH in and patch/upgrade servers in place (config management converging long-lived hosts). Over time servers drift into unique "snowflake" states that are impossible to reproduce. Immutable: you never modify a running server — you bake a new image (Packer/AMI/container) and replace the instance wholesale. Benefits: no config drift, trivial rollback (redeploy the previous image), identical staging/prod, and deploys become "swap the fleet." It pairs with golden images + autoscaling groups + blue-green. Cost: longer build step and you must externalize all state. Modern infra strongly favors immutable.

Walk through designing a VPC for a 3-tier app. medium

One VPC, spread across multiple AZs for HA. Subnet tiers: public subnets (internet-facing LB + NAT gateway), private subnets for app servers, and a separate isolated tier for databases. Route tables: public subnet → internet gateway; private → NAT gateway for outbound only; DB subnet → no internet route. Security groups as the stateful per-instance firewall (LB SG allows 443 from world; app SG allows traffic only from the LB SG; DB SG allows only from the app SG); reference SGs, not CIDRs. NACLs as a coarse, stateless subnet-level backstop. Use VPC endpoints to reach AWS services privately. Choose CIDR blocks that don't overlap with other networks so you can peer/VPN later.
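
The subnet layout above can be planned with the standard-library `ipaddress` module. This is a sketch under assumptions: the `10.0.0.0/16` VPC range, the `/20` subnet size, the AZ labels, and the `plan_subnets` helper are illustrative choices, not a recommendation.

```python
import ipaddress


def plan_subnets(vpc_cidr, azs, tiers, new_prefix=20):
    """Assign one non-overlapping subnet per (tier, AZ) pair."""
    vpc = ipaddress.ip_network(vpc_cidr)
    blocks = list(vpc.subnets(new_prefix=new_prefix))
    needed = len(azs) * len(tiers)
    if needed > len(blocks):
        raise ValueError(f"need {needed} subnets, VPC only fits {len(blocks)}")
    pairs = [(tier, az) for tier in tiers for az in azs]
    return dict(zip(pairs, blocks))


layout = plan_subnets(
    "10.0.0.0/16",
    azs=["az-a", "az-b", "az-c"],
    tiers=["public", "app", "db"],
)

for (tier, az), cidr in layout.items():
    print(f"{tier:<6} {az}  {cidr}")
```

Carving every subnet from one parent block guarantees the tiers never overlap, and the unused `/20` blocks stay free for future tiers or AZs.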

How does autoscaling work, and what do you scale on? medium

An autoscaling group maintains a target number of instances between min/max bounds, replacing unhealthy ones and adding/removing capacity by policy. Target tracking (keep avg CPU at 60%) is the simplest and usually best; step scaling reacts to alarm thresholds; scheduled handles known traffic patterns; predictive uses history. Scale on the metric that actually reflects load — often not CPU: queue depth, request latency, RPS, or concurrency for I/O-bound services. Watch out: slow instance warm-up (use warm pools / pre-baked images), scale-in flapping (cooldowns), and downstream limits (DB connections) that don't scale with you. For containers it's HPA on k8s; for serverless it's automatic per-request concurrency.
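
Target tracking boils down to proportional control. The sketch below uses the ratio formula documented for the Kubernetes HPA; cloud target-tracking policies behave similarly but differ in detail (cooldowns, warm-up, alarm evaluation), so treat the `desired_capacity` helper as an approximation.

```python
import math


def desired_capacity(current, metric_value, target_value, min_size, max_size):
    """Scale the fleet so the per-instance metric moves back toward target."""
    if current <= 0 or target_value <= 0:
        raise ValueError("current and target_value must be positive")
    raw = math.ceil(current * metric_value / target_value)
    # Never leave the configured min/max bounds.
    return max(min_size, min(max_size, raw))


# Hypothetical fleet of 4 instances, target 60% average CPU, bounds 2..10.
scale_out = desired_capacity(4, metric_value=90, target_value=60, min_size=2, max_size=10)
scale_in = desired_capacity(4, metric_value=30, target_value=60, min_size=2, max_size=10)
capped = desired_capacity(8, metric_value=150, target_value=60, min_size=2, max_size=10)
```

The same function works for queue depth per worker or requests per instance; only the metric and target change.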

Config management (Ansible/Chef/Puppet) vs provisioning (Terraform) — where does each fit? medium

Different jobs. Provisioning tools (Terraform, CloudFormation) create and manage cloud resources — VPCs, instances, load balancers, DBs — declaratively. Config management (Ansible, Chef, Puppet) configures what's inside a server — packages, files, services, users. With immutable infra the config-management role shrinks: you use Packer/Ansible to bake an image once, then Terraform provisions instances from it, and you don't reconfigure live hosts. With mutable infra, config management runs continuously to converge long-lived servers. A common split: Terraform for infra topology, Ansible for image build / one-off ops tasks, app config via env/secrets at runtime.

How do you add caching and a CDN to reduce load and latency? medium

Cache at the layer closest to the bottleneck. CDN (CloudFront/Fastly) caches static assets and cacheable responses at edge PoPs near users — cuts latency and offloads origin; control with Cache-Control headers, cache keys, and TTLs, invalidate on deploy. Application cache (Redis/Memcached) for hot DB reads, sessions, computed results — pick an eviction policy (LRU) and a strategy (cache-aside is most common). DB-level: read replicas, query/result cache. Key concerns: invalidation (the hard problem — TTL vs explicit purge vs versioned keys), the thundering herd / cache stampede on expiry (use request coalescing / staggered TTLs), and avoiding caching of personalized/auth'd responses.
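
A minimal sketch of cache-aside with staggered TTLs follows. An in-process dict stands in for Redis/Memcached, and the `CacheAside` class, `load_user` loader, and 20% jitter are assumptions for illustration. Request coalescing is not shown.

```python
import random
import time


class CacheAside:
    """Read-through on miss, with jittered TTLs so keys don't expire together."""

    def __init__(self, loader, base_ttl=60.0, jitter=0.2,
                 clock=time.monotonic, rng=random.random):
        self._loader = loader      # fetches from the source of truth (e.g. the DB)
        self._base_ttl = base_ttl
        self._jitter = jitter      # 0.2 means TTL varies by +/- 20%
        self._clock = clock
        self._rng = rng
        self._store = {}           # key -> (value, expires_at)

    def get(self, key):
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and now < entry[1]:
            return entry[0]                      # cache hit
        value = self._loader(key)                # miss: go to the source
        spread = self._jitter * (2.0 * self._rng() - 1.0)
        self._store[key] = (value, now + self._base_ttl * (1.0 + spread))
        return value

    def invalidate(self, key):
        # Explicit purge, e.g. after a write to the underlying record.
        self._store.pop(key, None)


db_reads = []


def load_user(user_id):
    db_reads.append(user_id)   # stands in for a real database query
    return {"id": user_id}


cache = CacheAside(load_user)
cache.get("u1")
cache.get("u1")   # second call is a hit; the loader ran only once
```

Jitter spreads expiry times so a burst of keys written together does not all miss at once; calling `invalidate` after writes covers the explicit-purge strategy.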

What are RTO and RPO, and how do they drive your backup/DR design? medium

RPO (Recovery Point Objective) = how much data loss you can tolerate → sets backup/replication frequency (RPO of 5 min means replicate at least every 5 min). RTO (Recovery Time Objective) = how long you can be down → sets your recovery architecture. Tighter targets cost more. The DR-strategy ladder, cheap→expensive: backup & restore (hours, cheapest), pilot light (core minimal stack always on), warm standby (scaled-down full copy running), active-active / multi-region (near-zero RTO/RPO, costliest). Test restores regularly — an untested backup is a hope, not a plan. Match the strategy to the business cost of downtime per tier.

How do you keep Terraform state safe in a team? medium

Remote backend (S3+DynamoDB lock / TF Cloud) with locking, versioning, encryption; split state by blast radius; plan-in-CI, reviewed apply; never hand-edit.

Mutable vs immutable infrastructure? medium

Mutable patches live servers (drift, snowflakes); immutable replaces instances from new images (no drift, easy rollback, identical envs) — modern default.

Config management vs provisioning? medium

Provisioning (Terraform) creates cloud resources; config management (Ansible/Chef) configures inside servers. With immutable infra, config mgmt mostly bakes images.

How does autoscaling pick a metric? medium

Target tracking on the metric that reflects load — often queue depth/RPS/latency, not CPU for I/O-bound services; watch warm-up and downstream limits.

What are RTO and RPO? medium

RTO = max tolerable downtime (sets recovery architecture); RPO = max tolerable data loss (sets backup/replication frequency).

How do caching and CDN reduce load? medium

CDN caches static/edge content; app cache (Redis) absorbs hot DB reads; key concerns are invalidation and cache stampede (coalescing/staggered TTLs).

What is blast radius and how do you limit it? medium

The scope of damage from a failure/change; limit via isolation (accounts/namespaces/state splitting), progressive rollout, and quotas.

What is a service mesh? medium

A sidecar/eBPF layer (Istio/Linkerd) adding mTLS, traffic control, retries, and observability to service-to-service traffic without app changes.

How do you design for graceful degradation? medium

Shed non-critical features, serve cached/stale data, rate-limit, and use circuit breakers so overload degrades instead of collapsing.

What is the expand-contract migration pattern? medium

Add the new (backward-compatible) schema, dual-write/backfill, switch reads, then drop the old — each step independently deployable and reversible.

Hard — senior & design

Design infrastructure for a service that must handle 10x traffic spikes. hard

Make everything elastic and decoupled. Stateless app tier behind an L7 load balancer in an autoscaling group / HPA, scaling on the real load metric with pre-baked images or warm pools so scale-up isn't gated by boot time. Absorb spikes with a queue (SQS/Kafka) — turn synchronous bursts into async work the consumers drain at their own pace, so the DB never sees the spike directly. Protect the data tier: read replicas + a cache (Redis) in front to absorb reads; the DB is usually the part that can't 10x instantly, so shield it. CDN for static/edge-cacheable content. Backpressure + rate limiting + graceful degradation (shed non-critical features) so overload degrades instead of collapsing. Pre-provision/headroom for known events; load-test to find the real ceiling and the first thing to break (usually DB connections or a downstream limit). Add circuit breakers so a slow dependency doesn't cascade.
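
Rate limiting is one of the pieces above that is easy to show in code. Here is a rough token-bucket sketch; the `TokenBucket` class and the rate/capacity numbers are hypothetical, and a production limiter would usually live in the proxy or a shared store rather than in one process.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, then a steady `rate_per_sec`."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)   # start full so an initial burst passes
        self._last = clock()

    def allow(self, cost=1.0):
        now = self._clock()
        # Refill in proportion to elapsed time, capped at capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False                     # caller should shed or queue the request


limiter = TokenBucket(rate_per_sec=5, capacity=10)
burst = [limiter.allow() for _ in range(12)]   # requests beyond the burst size are rejected
```

Rejected requests are where graceful degradation kicks in: return a cached response, queue the work, or fail fast with a retry hint instead of passing the spike to the database.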

You inherit infra built by hand in the console — no IaC. How do you bring it under control? hard

Goal: get to reproducible IaC without an outage. (1) Inventory + freeze: discover what exists (cloud config tools, tag audit) and stop new manual changes — announce a change freeze on console edits. (2) Import incrementally: use terraform import (or terraformer to bulk-generate) starting with the lowest-risk, least-coupled resources; write config to match, run plan until it shows no changes (proving state matches reality). (3) Work blast-radius-out: networking and IAM last since they're highest-risk. (4) Lock the door: tighten IAM so humans can't make manual prod changes, enforce changes only via CI/PR, add drift detection. (5) Refactor into modules and environments once everything's imported. Patience over big-bang — a botched import can desync state from reality.

How do you design for high availability across failure domains? hard

Eliminate single points of failure at every layer and match redundancy to the failure domain you're protecting against. Within a region: spread across at least 2, ideally 3, AZs (independent power/network/cooling), with the LB spanning AZs, app instances in each, and the DB with a standby in another AZ (sync replication for zero RPO). Region failure: multi-region with async replication + DNS/global LB failover (accept higher RPO and cost). Make components stateless so any instance can serve any request; externalize state to replicated stores. Health checks + automatic failover everywhere. Beware correlated failures (a bad deploy or config push hits all AZs at once, so AZ redundancy won't save you; you need progressive rollout) and shared dependencies (one overloaded DB behind "redundant" app tiers). Quantify: HA is about removing SPOFs and reducing MTTR, and the architecture follows from the SLA target (99.9 vs 99.99 implies very different spend).
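
The SLA arithmetic is worth being able to do on a whiteboard. The sketch below assumes failures are independent, which is exactly the assumption that correlated failures break; the 0.99 and 0.999 component figures are hypothetical.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_minutes_per_year(availability):
    """Downtime budget implied by an availability target (0..1)."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def serial(*availabilities):
    """All components must be up (e.g. LB -> app -> DB)."""
    return math.prod(availabilities)


def parallel(*availabilities):
    """Any one replica being up is enough (e.g. two AZs)."""
    return 1.0 - math.prod(1.0 - a for a in availabilities)


three_nines = downtime_minutes_per_year(0.999)    # about 525.6 minutes (~8.8 hours)
four_nines = downtime_minutes_per_year(0.9999)    # about 52.6 minutes

two_az = parallel(0.99, 0.99)      # two independent 99% AZ deployments -> about 99.99%
chain = serial(two_az, 0.999)      # then a 99.9% dependency drags the total below 99.9%
```

The last line shows why a shared dependency matters: redundancy in one tier cannot lift the whole chain above its weakest serial component.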

The platform bill doubled this quarter with flat traffic. How do you find and fix it? hard

Method, not guessing. (1) Attribute: cost-explorer breakdown by service, then by tag/team/environment — find what grew. Enforce a tagging policy if attribution is impossible (that's often the real bug). (2) Usual suspects: idle/over-provisioned instances and unattached EBS volumes, old snapshots, orphaned load balancers/IPs, forgotten non-prod environments running 24/7, data-transfer/egress (cross-AZ and NAT-gateway traffic is a classic silent cost), over-retained logs, and storage-class mismanagement (hot data that should be cold). (3) Fix structurally: rightsizing, autoscaling/scale-to-zero for non-prod, scheduled shutdown of dev, savings plans / reserved / spot for steady or fault-tolerant workloads, S3 lifecycle policies to tier/expire, VPC endpoints to cut NAT egress, log sampling/retention. (4) Prevent: budgets + anomaly alerts, cost as a CI check on IaC (Infracost), and showback so teams see their own spend. Cost is an engineering metric, not a finance afterthought.
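
Step (1), attribution, is essentially a group-by over billing line items. A minimal sketch, assuming each line item is a dict with a `cost` and a `tags` mapping (a hypothetical shape, not any provider's export format):

```python
from collections import defaultdict


def cost_growth_by_tag(previous, current, tag="team"):
    """Rank tag values by how much their spend grew between two periods."""
    totals = defaultdict(lambda: [0.0, 0.0])   # owner -> [previous, current]
    for period, items in enumerate((previous, current)):
        for item in items:
            owner = item.get("tags", {}).get(tag, "untagged")
            totals[owner][period] += item["cost"]
    growth = {owner: cur - prev for owner, (prev, cur) in totals.items()}
    return sorted(growth.items(), key=lambda pair: pair[1], reverse=True)


# Hypothetical line items for two quarters.
last_quarter = [
    {"cost": 100.0, "tags": {"team": "search"}},
    {"cost": 50.0, "tags": {}},
]
this_quarter = [
    {"cost": 110.0, "tags": {"team": "search"}},
    {"cost": 200.0, "tags": {}},
]

ranking = cost_growth_by_tag(last_quarter, this_quarter)
```

When "untagged" tops the ranking, the tagging policy is the first thing to fix, since nothing else can be attributed until it is.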

A region-wide cloud outage takes down your primary. Walk through failover. hard

Depends on the DR posture you pre-built (you can't architect this mid-incident). Assuming warm standby / active-passive multi-region: (1) Confirm + declare: verify it's a region failure (provider status + your own health checks), declare incident, assign roles. (2) Promote data: promote the secondary-region DB replica to primary (this is the step with real RPO/data-loss risk — know your replication lag). (3) Shift traffic: fail DNS / global LB over to the standby region (health-check-based failover ideally automates this); scale the standby fleet up from its reduced footprint. (4) Verify: smoke-test critical paths, watch error rates and capacity. (5) Communicate status throughout. (6) Fail back carefully once primary recovers — re-sync data the other direction before switching back, off-peak. Pitfalls: DNS TTLs too high (slow cutover), standby capacity that can't actually take full load, and configs/secrets that were only in the dead region. The whole thing hinges on having tested this game-day beforehand.

Design infra for a 10x traffic spike. hard

Stateless autoscaled app tier (warm pools), a queue to absorb bursts, read replicas + cache to protect the DB, CDN, and backpressure/rate-limiting/graceful degradation; load-test the ceiling.

Design HA across failure domains. hard

Multi-AZ (independent power/net) within a region, multi-region for region failure (async replication + DNS failover), stateless components, no SPOF, and beware correlated failures (bad deploy hits all AZs).

How do you design multi-region DR for tight RTO/RPO? hard

Warm standby (or active-active) with continuous async replication, health-checked DNS/global LB failover, automated DB promotion + scale-up, and a game-dayed runbook.

How do you bring hand-built infra under IaC? hard

Freeze manual changes, import incrementally (terraform import/terraformer) until plan shows no changes, work blast-radius-out (IAM/networking last), then lock down manual access + drift detection.

How do you do cost engineering at scale? hard

Tagging + attribution, rightsizing, autoscale/scale-to-zero non-prod, savings/reserved/spot, lifecycle/tiering, cut cross-AZ/NAT egress, budgets + anomaly alerts, cost-as-CI-check.

How do you design a self-service internal platform (paved road)? hard

Golden paths with sane defaults + guardrails (templates, policy-as-code), self-service provisioning within limits, built-in observability/security, and escape hatches for edge cases.

How do you choose AZ vs region redundancy for an SLA? hard

AZ redundancy handles datacenter failure cheaply; region redundancy handles region outages at higher cost/complexity (data replication, RPO). Derive from the SLA target and cost of downtime.

How do you manage Terraform across many teams/environments? hard

Reusable modules, state split by component/env with remote locking+versioning, workspaces, plan-in-CI + reviewed apply, drift detection, and policy-as-code (OPA/Sentinel).

How do you load-test and capacity-plan a system? hard

Ramp load to find per-component limits and the first bottleneck (often DB connections), model peak + headroom, validate autoscaling, and monitor saturation (USE).

How do you handle stateful workloads at scale? hard

Externalize/replicate state (replicated DBs, distributed stores), shard for horizontal scale, use leader election + consensus where needed, and plan backups/restore + failover explicitly.

Scenario-based

You inherit infrastructure built entirely by hand in the console — no IaC. How do you bring it under control? hard

Get to reproducible IaC without an outage. Inventory + freeze manual changes first. Import incrementally with terraform import (or terraformer to bulk-generate), starting with low-risk, loosely-coupled resources; write config until plan shows no changes (proving state matches reality). Work blast-radius outward — networking/IAM last. Then lock down IAM so humans can't make manual prod changes, enforce changes via CI/PR, add drift detection. Patience over big-bang.

Design infra for a service that must absorb 10x traffic spikes. hard

Make it elastic and decoupled. Stateless app tier in an autoscaling group/HPA scaling on the real load metric, with pre-baked images/warm pools so scale-up isn't gated by boot time. Queue (SQS/Kafka) to turn synchronous bursts into async work so the DB never sees the spike. Protect the data tier with read replicas + cache. CDN for static/edge content. Backpressure, rate limiting, graceful degradation so overload degrades instead of collapsing. Load-test to find the real ceiling (usually DB connections).

A region-wide cloud outage takes down your primary. Walk through failover. hard

Assuming a pre-built warm-standby multi-region setup (can't architect mid-incident). Confirm + declare the region failure, assign roles. Promote the secondary-region DB replica to primary (the real data-loss-risk step — know your replication lag). Shift traffic via DNS/global LB failover and scale up the standby fleet. Verify critical paths, communicate. Fail back carefully after recovery (re-sync the other direction, off-peak). Pitfalls: high DNS TTLs, standby that can't take full load, configs/secrets only in the dead region.

The platform bill doubled with flat traffic. How do you find and fix it? hard

Attribute via cost tooling by service then tag/team/env (enforce tagging if you can't). Usual suspects: idle/oversized compute, unattached disks + old snapshots, orphaned LBs/IPs, non-prod running 24/7, data-transfer/NAT egress, over-retained logs, wrong storage class. Fix structurally: rightsize, autoscale/scale-to-zero non-prod, savings/reserved/spot, lifecycle policies, VPC endpoints. Prevent: budgets + anomaly alerts, cost-as-CI-check (Infracost), showback so teams see their spend.

Terraform state is locked / corrupted and applies are failing. What do you do? medium

If locked by a dead run, verify no apply is actually in progress, then run terraform force-unlock <id> carefully (releasing a lock that a live run still holds lets two applies race and can corrupt state). If corrupted, restore from the versioned remote backend (S3 versioning / TF Cloud history); that's why state lives remote with versioning + locking. Never hand-edit; use state subcommands and import to reconcile drift. Prevent: remote backend with locking, versioning, encryption, and split state by blast radius.

You need multi-region DR with a 15-minute RTO. How do you design it? hard

RTO 15min + low RPO rules out plain backup/restore. Use warm standby (a scaled-down full stack always running in region B) with continuous async DB replication (sets your RPO from replication lag). Pre-provision networking, configs, and secrets in both regions. Health-checked DNS/global LB failover to flip traffic, plus automation to promote the DB replica and scale the standby up. Write and game-day the runbook regularly — an untested DR plan won't hit 15min. Cost scales with how warm the standby is.

Inherit no-IaC infra. Bring it under control. hard

Freeze manual changes, terraform import incrementally until plan is clean, work blast-radius-out, then enforce changes via CI/PR + drift detection.

Design for a 10x spike. Outline. hard

Autoscaled stateless tier + warm pools, queue to absorb bursts, replicas+cache to shield DB, CDN, backpressure/graceful degradation; load-test the ceiling.

Region outage hits primary. Failover steps? hard

Confirm + declare, promote secondary-region DB replica (mind lag/RPO), shift DNS/global LB, scale the standby, verify, then fail back carefully (re-sync, off-peak).

Bill doubled with flat traffic. Find/fix. hard

Attribute by service+tag, find idle/oversized + unattached disks/snapshots + orphaned LBs + 24/7 non-prod + NAT egress; rightsize, schedule, savings plans, lifecycle, budgets.

Terraform state locked/corrupted, applies fail. Do what? medium

If a dead lock, verify nothing's running then force-unlock carefully; if corrupted, restore from versioned remote backend; reconcile drift with state cmds/import, never hand-edit.

Need multi-region DR with 15-min RTO. Design. hard

Warm standby + continuous async replication, health-checked DNS failover, automation to promote DB + scale up, configs/secrets in both regions, game-dayed runbook.

One service's failure keeps cascading. Fix? hard

Add circuit breakers, timeouts, bulkheads, and retries-with-backoff to isolate the failing dependency; degrade gracefully instead of blocking the whole system.
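
A circuit breaker can be sketched in a few dozen lines. This is a simplified illustration with hypothetical names (`CircuitBreaker`, `CircuitOpenError`) and thresholds; production libraries add per-call timeouts, metrics, and thread safety.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self._failure_threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None   # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_timeout:
                # Open: fail fast instead of piling onto a sick dependency.
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self._failure_threshold:
                self._opened_at = self._clock()   # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result


breaker = CircuitBreaker(failure_threshold=3, reset_timeout=30.0)
status = breaker.call(lambda: "ok")   # healthy dependency: call passes straight through
```

Callers catch `CircuitOpenError` and serve a fallback (cached data, a default, or a reduced feature), which is the graceful-degradation half of the answer.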

Provisioning is slow and error-prone manually. Improve. medium

Move to IaC + modules + CI pipeline (plan/apply), golden images, and self-service templates so provisioning is fast, repeatable, and reviewed.

A single DB is the bottleneck under growth. Options? hard

Read replicas + caching first; then vertical scale; then shard/partition or move hot data to a fit-for-purpose store; introduce a queue to smooth writes.

Need zero-downtime infra change touching networking. Approach? hard

Stage in a non-prod copy, change incrementally with the smallest blast radius, use canary/parallel resources where possible (build the new path alongside the old and shift traffic over), have rollback ready, and schedule it for a low-traffic window with close monitoring.

What industry actually asks

Infrastructure loops center on design trade-offs ("design infra for X," "make this scale/HA," "how do you do DR") and operational judgement ("inherited messy infra," "bill spiked," "region down") — there's rarely one right answer, they want your reasoning and awareness of cost/blast-radius. Expect deep IaC (Terraform state, modules, drift, import), networking (VPC/subnets/SG vs NACL/DNS/LB), scaling & capacity (autoscaling on the right metric, queues, caching), and reliability (AZ vs region, RTO/RPO, SPOFs). Senior loops add cost engineering and platform/paved-road thinking (self-service, golden paths, guardrails). Often a hands-on Terraform or whiteboard architecture round. Always answer with trade-offs and a method, never a single tool.
