AWS Cloud Engineer Interview Questions

AWS cloud-engineer questions covering IAM, VPC, compute, storage, databases, and architecture/cost, graded easy → hard with full answers. Pair with the AWS cheatsheet.

Difficulty tags: easy (fundamentals / cert-level), medium (applied), hard (senior / architecture)

Easy — fundamentals

What are Regions and Availability Zones? easy

A Region is a geographic area (e.g. us-east-1) containing multiple isolated Availability Zones. Each AZ is one or more physically separate data centers with independent power/cooling/network, connected to the other AZs by low-latency links. You design for HA by spreading across AZs (an AZ can fail without taking your app down). Regions are isolated from each other for data residency and blast-radius; some services are global (IAM, Route 53, CloudFront). Choosing a region: latency to users, data residency/compliance, service availability, and cost.

What is IAM, and what are users, roles, and policies? easy

IAM controls who can do what. A policy is a JSON document granting/denying actions on resources (with conditions). A user is a long-lived identity (person/service) with credentials. A role is an identity with policies but no permanent credentials — it's assumed temporarily, vending short-lived credentials. Best practice: humans via SSO, workloads via roles (EC2 instance profiles, EKS IRSA), never long-lived access keys. Least privilege, and remember an explicit Deny always wins.
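The "explicit Deny always wins" rule can be sketched as a toy evaluator. This is a simplified model, not the real IAM engine: it ignores conditions, principals, boundaries, and SCPs, and the bucket name and statement layout are assumptions for illustration.

```python
from fnmatch import fnmatchcase

def is_allowed(statements, action, resource):
    """Simplified IAM evaluation: explicit Deny > Allow > implicit deny."""
    allowed = False
    for stmt in statements:
        matches = (
            any(fnmatchcase(action, pat) for pat in stmt["Action"])
            and any(fnmatchcase(resource, pat) for pat in stmt["Resource"])
        )
        if not matches:
            continue
        if stmt["Effect"] == "Deny":
            return False  # an explicit Deny overrides any Allow
        allowed = True
    return allowed  # still False here means implicit deny

# Hypothetical policy: allow all S3 actions on one bucket, but never deletes.
policy = [
    {"Effect": "Allow", "Action": ["s3:*"], "Resource": ["arn:aws:s3:::app-bucket/*"]},
    {"Effect": "Deny", "Action": ["s3:DeleteObject"], "Resource": ["arn:aws:s3:::app-bucket/*"]},
]
```

With this policy a read on an object in the bucket is allowed, a delete is refused by the explicit Deny, and anything outside S3 falls through to the implicit deny.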

S3 storage classes — when use each? easy

S3 Standard: frequent access, low latency. Standard-IA / One Zone-IA: infrequent access, cheaper storage plus a retrieval fee (One Zone keeps data in a single AZ, so it is cheaper but does not survive the loss of that AZ). Intelligent-Tiering: auto-moves objects between tiers based on access; great when patterns are unknown. Glacier Instant Retrieval / Flexible Retrieval / Deep Archive: archival, cheapest storage, retrieval from milliseconds to hours. Use lifecycle policies to transition objects automatically as they age. All classes are designed for 11 nines of durability; One Zone-IA achieves that within one AZ only and has lower availability.
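A lifecycle policy of the kind described above might look like the following sketch. The dictionary follows the rule shape accepted by the S3 lifecycle configuration API; the prefix, day thresholds, and rule ID are assumptions chosen for illustration.

```python
def lifecycle_rule(prefix, ia_days=30, glacier_days=90, expire_days=365):
    """Build one lifecycle rule: Standard -> Standard-IA -> Glacier -> delete."""
    if not ia_days < glacier_days < expire_days:
        raise ValueError("transition days must increase with object age")
    return {
        "ID": f"age-out-{prefix.strip('/')}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            {"Days": ia_days, "StorageClass": "STANDARD_IA"},
            {"Days": glacier_days, "StorageClass": "GLACIER"},  # Flexible Retrieval
        ],
        "Expiration": {"Days": expire_days},
    }

# Hypothetical configuration for objects under the logs/ prefix.
config = {"Rules": [lifecycle_rule("logs/")]}
```

The guard on the day thresholds is there because transitions only make sense in order of increasing age; the actual values should come from how the data is really accessed.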

EC2 vs Lambda vs containers (ECS/EKS) — when each? easy

EC2 — full VMs, max control, for long-running/stateful or specialized workloads; you manage the OS. Lambda — serverless functions, event-driven, scale-to-zero, pay-per-invocation; great for spiky/glue/event workloads but with limits (duration, size, cold starts). Containers — ECS (simpler, AWS-native) or EKS (Kubernetes, portable, complex) for microservices; run on EC2 or Fargate (serverless containers, no nodes to manage). Pick by control vs operational overhead, traffic pattern, and team familiarity.

What is the AWS shared responsibility model? easy

AWS secures the cloud ("of" the cloud): hardware, the global infrastructure, the hypervisor, managed-service internals. You secure what's "in" the cloud: your data, IAM/permissions, OS patching (on EC2), network config (security groups, NACLs), encryption choices, and application security. The line shifts by service — with managed/serverless (Lambda, S3, RDS) AWS handles more of the stack, but data classification, access control, and encryption are always yours.

What are Regions and Availability Zones? easy

A Region is a geographic area; an AZ is one or more isolated datacenters within it. Spread across AZs for HA, across Regions for disaster/latency.

IAM user vs role? easy

A user has long-lived credentials for a person/app; a role is assumed temporarily for short-lived creds — prefer roles (no static keys).

What are S3 storage classes? easy

Standard (hot), Infrequent Access, Glacier/Deep Archive (cold/cheap, slow retrieval), Intelligent-Tiering (auto-moves by access pattern).

EC2 vs Lambda vs containers? easy

EC2: full VMs you manage. Lambda: event-driven functions, no servers, per-invocation. Containers (ECS/EKS/Fargate): packaged apps, more control than Lambda.

What is a VPC? easy

A logically isolated virtual network where you define subnets, route tables, and gateways for your resources.

Security Group vs NACL? easy

SG: stateful, instance-level, allow-only. NACL: stateless, subnet-level, allow+deny, needs both directions + ephemeral ports.

What is an S3 bucket policy? easy

A resource-based policy on a bucket granting/denying access by principal/action/condition — combined with IAM and Block Public Access.

RDS vs DynamoDB? easy

RDS: managed relational (SQL, joins, transactions). DynamoDB: managed NoSQL key-value/document, single-digit-ms at scale, designed around access patterns.

What is an Auto Scaling Group? easy

Maintains a target number of EC2 instances between min/max, replacing unhealthy ones and scaling on policy.

What is CloudWatch? easy

AWS monitoring: metrics, logs, alarms, and dashboards for resources and apps.

Medium — applied

Design a basic VPC for a 3-tier web app. medium

One VPC with a CIDR (e.g. 10.0.0.0/16), spread across ≥2 AZs for HA. Public subnets (route to an Internet Gateway) hold the ALB and NAT gateways. Private app subnets hold the EC2/containers; they reach the internet for updates via the NAT gateway (egress only). Private DB subnets hold RDS, no internet route. Security: security groups chain the tiers (ALB SG → app SG → db SG, referencing SGs not CIDRs), NACLs as a coarse subnet guardrail. Use VPC endpoints for S3/DynamoDB to keep traffic off the internet. Each tier in each AZ for redundancy. This is the canonical public-ALB / private-compute / isolated-DB layout.
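The subnet layout above can be sketched with Python's standard `ipaddress` module. The /24 subnet size and the AZ names are assumptions; the point is only that each tier gets one non-overlapping block per AZ inside the 10.0.0.0/16 VPC.

```python
import ipaddress

vpc = ipaddress.ip_network("10.0.0.0/16")
azs = ["us-east-1a", "us-east-1b"]   # assumed AZ names
tiers = ["public", "app", "db"]

# Carve consecutive /24 blocks out of the VPC range, one per (tier, AZ) pair.
blocks = vpc.subnets(new_prefix=24)
plan = {(tier, az): next(blocks) for tier in tiers for az in azs}

for (tier, az), cidr in plan.items():
    print(f"{tier:<6} {az}  {cidr}")
```

In practice teams often leave gaps between tiers or use larger private subnets; the consecutive allocation here just keeps the sketch short.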

Security Group vs NACL — what's the difference? medium

Security Group: instance/ENI-level, stateful (return traffic is auto-allowed), allow rules only (implicit deny), evaluated as a whole. The primary firewall; you can reference other SGs as sources. NACL: subnet-level, stateless (must allow both inbound and the return ephemeral ports explicitly), supports allow and deny, evaluated by numbered rule order. Use SGs for normal app-tier control; use NACLs for coarse subnet-wide guardrails or explicit blocks (e.g. block an IP range). A common bug: a stateless NACL blocking return traffic on ephemeral ports.
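The ephemeral-port bug is easier to see in a small model of NACL evaluation. This is a sketch that matches on port only (real rules also match protocol and CIDR); the rule numbers and the client port are made up for illustration.

```python
def nacl_allows(rules, port):
    """Evaluate NACL rules lowest number first; first match wins, default deny."""
    for number, low, high, action in sorted(rules):
        if low <= port <= high:
            return action == "allow"
    return False  # the implicit final rule denies everything else

# Rules are (rule number, from port, to port, action).
inbound = [(100, 443, 443, "allow")]

# Buggy outbound set: responses to inbound requests leave on the client's
# ephemeral port, which nothing here allows.
outbound_buggy = [(100, 443, 443, "allow")]
outbound_fixed = outbound_buggy + [(110, 1024, 65535, "allow")]

client_port = 50123  # hypothetical ephemeral port picked by a client
```

A security group would not need the extra rule, because it tracks the connection and lets the response out automatically.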

How do you give an application running on EC2/EKS access to AWS APIs securely? medium

Never hardcode access keys. On EC2, attach an instance profile (IAM role); the SDK auto-fetches temporary credentials from the instance metadata service (use IMDSv2). On EKS, use IRSA (IAM Roles for Service Accounts) or EKS Pod Identity so each pod's service account maps to a scoped IAM role — pods get short-lived creds, least privilege per workload, no shared node role. On Lambda, the function's execution role. The principle: workload identity → assume a least-privilege role → short-lived rotating credentials, never static keys in code/env.

RDS vs DynamoDB — how do you choose? medium

RDS = managed relational (Postgres/MySQL/etc): SQL, joins, transactions, strong consistency, flexible queries — pick it for relational data, complex queries, and when you need ACID across rows. It scales vertically (and read replicas/Aurora horizontally for reads). DynamoDB = managed NoSQL key-value/document: single-digit-ms latency at any scale, serverless, pay-per-use — pick it for high-scale, well-known access patterns, where you design the table around your queries (partition key design is everything). DynamoDB trades query flexibility and joins for scale and operational simplicity. Choose by data model + access patterns + scale, not hype.

How do you autoscale and load-balance a web tier on AWS? medium

Put an Application Load Balancer (L7, path/host routing, TLS termination, health checks) in public subnets in front of an Auto Scaling Group of instances (or use ECS/EKS with target tracking). The ASG spans multiple AZs and scales on a policy — target tracking (e.g. keep CPU ~60% or requests-per-target steady) is the modern default; step scaling and scheduled scaling for known patterns; predictive scaling for cyclical load. The ALB only routes to healthy targets (health checks), and the ASG replaces unhealthy instances. Use a launch template, immutable AMIs/containers, and connection draining for graceful deploys.
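As a rough mental model of target tracking, the capacity needed to bring a metric back to its target is proportional to how far the metric has drifted. The function below is a simplified sketch of that arithmetic, not the service's actual algorithm, which also applies cooldowns, warm-up, and conservative scale-in.

```python
import math

def desired_capacity(current, metric, target, min_size, max_size):
    """Proportional estimate of the instance count that restores the target."""
    wanted = math.ceil(current * metric / target)
    return max(min_size, min(max_size, wanted))  # clamp to the ASG bounds

# 4 instances at 90% CPU with a 60% target: scale out.
scale_out = desired_capacity(current=4, metric=90, target=60, min_size=2, max_size=10)
# 4 instances at 30% CPU: scale in, but never below the minimum.
scale_in = desired_capacity(current=4, metric=30, target=60, min_size=2, max_size=10)
```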

What is IRSA / workload identity? medium

IAM Roles for Service Accounts: EKS pods assume IAM roles via OIDC, getting scoped temporary creds instead of node-wide or static keys.

How do you design a multi-AZ VPC? medium

Public subnets (LB + NAT) and private subnets (app, DB) per AZ; route public→IGW, private→NAT egress-only, DB tier no internet; SGs referencing SGs.

ALB vs NLB vs CLB? medium

ALB: L7 HTTP routing/host-path/TLS. NLB: L4, ultra-low latency, static IP, high throughput. CLB: legacy — use ALB/NLB.

What is S3 cross-region replication / versioning? medium

Versioning keeps object history (protects against overwrite/delete); CRR async-replicates to another region for DR/compliance/latency.

How does autoscaling choose a metric? medium

Target tracking on the metric reflecting load (often not CPU — use RPS, queue depth, latency); step/scheduled/predictive for other patterns.

What is a NAT gateway and its cost gotcha? medium

Provides outbound internet for private subnets; you pay per hour + per GB processed, so high egress through NAT is a common silent cost — use VPC endpoints.
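The two billing dimensions can be put into a quick back-of-the-envelope calculation. The rates below are illustrative placeholders, not quoted prices; check current pricing for your region before relying on any number.

```python
def nat_monthly_cost(gb_processed, hourly_rate=0.045, per_gb_rate=0.045, hours=730):
    """Estimate one NAT gateway's monthly cost from its two billing dimensions."""
    return hours * hourly_rate + gb_processed * per_gb_rate

idle = nat_monthly_cost(0)          # fixed hourly charge only
busy = nat_monthly_cost(10_000)     # 10 TB pushed through the gateway
print(f"idle: ${idle:.2f}/month, busy: ${busy:.2f}/month")
```

Under these assumed rates the data-processing charge dominates once traffic is heavy, which is why routing S3 and DynamoDB traffic through gateway endpoints is usually the first fix.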

What are VPC endpoints? medium

Private connectivity to AWS services (S3/DynamoDB gateway endpoints, interface endpoints for others) without traversing the internet/NAT — cheaper, more secure.

How does Route 53 routing work? medium

Policies: simple, weighted, latency-based, geolocation, failover (health-checked) — for traffic distribution and HA/DR.

What is the shared responsibility model? medium

AWS secures the cloud (hardware, infra, managed-service internals); you secure in the cloud (data, IAM, config, patching your instances).

RDS Multi-AZ vs Read Replicas? medium

Multi-AZ: synchronous standby for HA/failover (not for read scaling). Read replicas: async copies for read scaling (can lag, manual promotion).

Hard — senior & architecture

Design a highly-available, fault-tolerant architecture. What are the principles? hard

Eliminate single points of failure and degrade gracefully. Go Multi-AZ at every tier (RDS Multi-AZ, ASG across AZs, ALB with cross-zone load balancing) so an AZ loss is survivable; go multi-Region only for DR or very high availability (active-passive with Route 53 failover + cross-region replication, or active-active for global low latency, which is far more complex and costly). Keep compute stateless behind a load balancer so instances are disposable and horizontally scalable; push state to managed stores (RDS, DynamoDB, S3) and caches (ElastiCache). Decouple with queues (SQS) / events (EventBridge) so components fail independently and absorb spikes. Health checks + auto-replacement, retries with backoff + idempotency, circuit breakers. Backups + tested restores, infrastructure as code for reproducibility. Match the cost/complexity to the real RTO/RPO: don't build multi-region if Multi-AZ meets the SLA.
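"Retries with backoff" can be sketched as below, using capped exponential backoff with full jitter. The exception type, attempt count, and delays are assumptions; the wrapped operation must be idempotent, or a retry may repeat a side effect.

```python
import random
import time

def call_with_retries(operation, attempts=5, base=0.1, cap=5.0, sleep=time.sleep):
    """Retry operation() with capped exponential backoff and full jitter."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # Full jitter: wait a random time up to the capped exponential step.
            delay = random.uniform(0, min(cap, base * 2 ** attempt))
            sleep(delay)
```

The `sleep` parameter is injectable only so the behaviour can be exercised without real waiting; jitter spreads retries out so that many clients do not hammer a recovering dependency at the same instant.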

An EC2 instance can't reach the internet (or vice versa). How do you debug it systematically? hard

Walk the path. Outbound from a private instance: is it in a private subnet whose route table sends 0.0.0.0/0 to a NAT gateway (in a public subnet)? Public subnet needs a route to the Internet Gateway. Inbound: instance needs a public IP/EIP, a subnet route to the IGW (=public subnet), and a security group allowing the port. Check in order: (1) Security group (stateful, allow rules) — most common. (2) NACL (stateless — did you allow return ephemeral ports?). (3) Route table (IGW for public, NAT for private). (4) Public IP present? (5) OS firewall / app listening on 0.0.0.0. (6) DNS. Use VPC Reachability Analyzer and VPC Flow Logs (ACCEPT/REJECT) to see exactly where packets are dropped. Method: SG → NACL → route → public IP → OS.

How do you approach cost optimization on AWS? hard

Measure first: Cost Explorer/CUR + tagging to attribute spend, find the top line items. Then: right-size over-provisioned instances/volumes (Compute Optimizer); buy commitments for steady baseline (Savings Plans / Reserved Instances) and use Spot for fault-tolerant/batch (huge discount). Scale to demand (autoscaling, scale-to-zero with Lambda/Fargate, stop dev envs off-hours). Storage — S3 lifecycle to IA/Glacier, delete orphaned EBS volumes/snapshots/unattached EIPs, gp3 over gp2. Data transfer — keep traffic in-AZ/in-region, use VPC endpoints and CloudFront to cut egress (often a hidden cost). Managed/serverless where it lowers ops + idle cost. Set budgets/alerts and make cost a continuous practice (FinOps), not a one-off. The biggest wins are usually idle/over-provisioned resources and data-transfer surprises.

How would you build a secure, scalable serverless API? hard

API Gateway (or ALB/Lambda URL) → Lambda → DynamoDB is the classic pattern. Auth at the edge: Cognito / a Lambda authorizer / JWT, plus WAF for common attacks and throttling/usage plans for rate limiting. Lambda with least-privilege execution roles per function; secrets from Secrets Manager/SSM, not env vars in plaintext. DynamoDB with on-demand capacity (or autoscaling) and a partition-key design matching access patterns; add DAX/caching if hot. Decouple slow work to SQS/Step Functions so the API stays fast. Concerns to call out: cold starts (provisioned concurrency for latency-sensitive paths), Lambda concurrency limits + downstream throttling, idempotency for retries, and observability (X-Ray tracing, structured logs, CloudWatch alarms). It scales automatically and costs close to nothing at idle (stored data and any provisioned concurrency are still billed), at the price of cold starts and per-request limits.

Multi-account strategy — why and how? hard

Separate AWS accounts give hard isolation boundaries — for blast radius, security, billing, and environment separation (prod vs dev vs sandbox, or per team/product). Manage them with AWS Organizations: a management account, organizational units (OUs), and Service Control Policies (SCPs) that set guardrails on what member accounts can do (e.g. deny disabling CloudTrail, restrict regions). Use Control Tower to provision accounts with a compliant baseline (landing zone). Centralize identity via IAM Identity Center (SSO) with permission sets (assume roles cross-account), centralize logging (CloudTrail/Config to a log-archive account), and consolidate billing. Networking via Transit Gateway / shared VPCs. The principle: isolate by account for security/blast-radius, govern centrally with Organizations + SCPs + SSO.
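The two guardrail examples mentioned above (protect CloudTrail, restrict regions) could be expressed as an SCP roughly like this. The exempted global-service list is an illustrative assumption and would need tuning for a real organization.

```python
import json

def guardrail_scp(allowed_regions):
    """Build an SCP document with two Deny guardrails (SCPs have no Principal)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ProtectCloudTrail",
                "Effect": "Deny",
                "Action": ["cloudtrail:StopLogging", "cloudtrail:DeleteTrail"],
                "Resource": "*",
            },
            {
                "Sid": "RestrictRegions",
                "Effect": "Deny",
                # Global services are exempted so they keep working (assumed list).
                "NotAction": ["iam:*", "organizations:*", "route53:*",
                              "cloudfront:*", "sts:*", "support:*"],
                "Resource": "*",
                "Condition": {
                    "StringNotEquals": {"aws:RequestedRegion": allowed_regions}
                },
            },
        ],
    }

scp = guardrail_scp(["us-east-1", "eu-west-1"])
print(json.dumps(scp, indent=2))
```

SCPs only set the maximum permissions for member accounts; identities in those accounts still need their own IAM policies to be allowed anything.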

Design a highly available architecture on AWS. hard

ALB across ≥2 AZs → ASG app tier per AZ → RDS Multi-AZ/Aurora; static on S3+CloudFront; Route 53 health-checked; stateless app, secrets in Secrets Manager, IAM roles, autoscale on real metric, no SPOF.

How do you design multi-account governance? hard

AWS Organizations + SCPs for guardrails, separate accounts per env/team (blast radius + billing), centralized logging/security accounts, Control Tower, and cross-account roles.

How do you secure data at rest and in transit? hard

KMS-backed encryption (S3/EBS/RDS) with key policies + rotation, TLS everywhere, enforce encryption via SCP/bucket policy, and least-privilege key access.
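Enforcing encryption in transit through a bucket policy is commonly done with a Deny on requests that are not made over TLS. A minimal sketch, with a hypothetical bucket name:

```python
def tls_only_policy(bucket):
    """Bucket policy that denies any S3 request not sent over TLS."""
    arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyInsecureTransport",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [arn, f"{arn}/*"],  # the bucket itself and its objects
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        }],
    }

policy = tls_only_policy("example-sensitive-bucket")  # hypothetical bucket name
```

Because it is an explicit Deny, this statement applies even to principals that are otherwise allowed by their IAM policies.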

How do you optimize a large AWS bill structurally? hard

Tagging + Cost Explorer attribution, rightsizing, autoscale/scale-to-zero non-prod, savings plans/reserved/spot, S3 lifecycle, VPC endpoints to cut NAT egress, budgets + anomaly alerts.

How does cross-account access work securely? hard

Define a role in the target account trusting the source account/principal; the source assumes it via STS for scoped temporary creds — with ExternalId for third parties to prevent confused-deputy.
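The trust relationship described above might look like the following sketch of a role trust policy. The account ID and external ID values are placeholders; the third party would pass the same external ID when calling STS AssumeRole.

```python
def third_party_trust_policy(source_account_id, external_id):
    """Trust policy for a role in the target account, gated on an ExternalId."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": f"arn:aws:iam::{source_account_id}:root"},
            "Action": "sts:AssumeRole",
            # Without the matching ExternalId the assume-role call is refused,
            # which is what blocks the confused-deputy scenario.
            "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
        }],
    }

# Placeholder values, not real identifiers.
trust = third_party_trust_policy("111122223333", "example-external-id")
```

The trust policy only controls who may assume the role; what the role can do once assumed is set by the permissions policies attached to it.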

Design a serverless event-driven pipeline. hard

API Gateway/EventBridge → Lambda → SQS/SNS for decoupling + retries/DLQ → DynamoDB/S3; idempotent handlers, concurrency limits, and async fan-out — scales to zero.

How do you do disaster recovery on AWS? hard

Pick by RTO/RPO: backup-restore (cheap, slow), pilot light, warm standby, or active-active multi-region; replicate data (cross-region), automate failover (Route 53), and test.

How do you handle Lambda at high scale? hard

Mind account concurrency limits + reserved/provisioned concurrency, downstream throttling (RDS connections: use RDS Proxy), DLQs for failures, and VPC networking (ENI setup historically added cold-start latency; size subnets for the ENIs and IPs you need).

How do you enforce least privilege in IAM at scale? hard

Permission boundaries, SCPs, Access Analyzer to find unused/excessive perms, generate policies from CloudTrail, scoped roles per workload, and regular access reviews.

How do you architect for cost-efficient high throughput? hard

Spot for fault-tolerant compute, autoscaling, caching (ElastiCache/CloudFront), right storage class, batch/async via SQS, and avoid cross-AZ/NAT data-transfer charges.

Scenario-based

An EC2 instance can't reach the internet. How do you debug? medium

Trace the path. Public subnet: route table has 0.0.0.0/0 → Internet Gateway? Instance has a public IP / EIP? Private subnet: route to a NAT gateway for outbound? Then Security Group egress (allows outbound — SGs are stateful) and NACL (both directions, stateless — needs ephemeral return ports). Also DNS enabled on the VPC. Check in that order: subnet route → IGW/NAT → SG → NACL → DNS.

An S3 bucket was accidentally made public. What's your response? hard

Enable Block Public Access (account + bucket level) immediately to cut exposure, then fix the bucket policy / ACLs to least privilege. Audit what was exposed and for how long (CloudTrail / S3 access logs / Athena); assume data was accessed if sensitive, and follow breach process if so. Rotate anything secret that was in there. Prevent recurrence: account-level Block Public Access locked in place by an SCP that denies changing it, Config rules/Access Analyzer to flag public buckets, default encryption, and least-privilege IAM.

A Lambda is timing out / has bad cold starts. How do you fix it? medium

Cold starts: minimize package size and init work, use Provisioned Concurrency (or SnapStart) for latency-sensitive paths, and avoid heavy SDK/connection setup in the handler (move to init, reuse across invocations). If it's in a VPC, ENI setup historically added latency — ensure modern VPC networking and right-sized subnets. Timeouts: raise memory (also scales CPU), increase the timeout, check a slow downstream (DB/API), and add retries/idempotency.
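"Move to init, reuse across invocations" can be shown with a small sketch. The `_connect` function is a hypothetical stand-in for expensive setup such as building an SDK client or opening a database connection.

```python
import time

INIT_COUNT = 0

def _connect():
    """Stand-in for expensive setup (SDK client, DB connection, config load)."""
    global INIT_COUNT
    INIT_COUNT += 1
    return {"connected_at": time.time()}

# Module-level code runs once per execution environment, during the cold start.
CLIENT = _connect()

def handler(event, context):
    # Warm invocations reuse CLIENT instead of paying the setup cost again.
    return {"statusCode": 200, "connected_at": CLIENT["connected_at"]}
```

Calling the handler repeatedly leaves `INIT_COUNT` at 1, which is the behaviour you want; putting `_connect()` inside the handler would redo the work on every request.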

RDS is hitting max connections. How do you address it? medium

Apps (esp. serverless/many containers) open too many connections. Add connection pooling — RDS Proxy or an app-side pooler (pgbouncer) — so many clients share few DB connections. Reduce idle/leaked connections, tune pool sizes, and cache hot reads. Short-term, scale the instance (more max_connections) or add read replicas for read load. Root cause is usually unbounded per-instance pools × autoscaled fleet — fix the pooling.
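The "unbounded per-instance pools × autoscaled fleet" root cause is simple arithmetic, sketched below. The fleet size, pool size, connection limit, and reserved headroom are made-up numbers for illustration.

```python
def peak_connections(instances, pool_size_per_instance):
    """Worst-case DB connections when every instance fills its pool."""
    return instances * pool_size_per_instance

def max_safe_pool_size(max_connections, max_instances, reserved=10):
    """Largest per-instance pool that stays under the DB limit at full scale-out."""
    return (max_connections - reserved) // max_instances

# Hypothetical fleet: autoscaling up to 40 instances, each with a 20-connection pool.
demand = peak_connections(40, 20)        # 800 connections at peak
safe_pool = max_safe_pool_size(500, 40)  # what a 500-connection DB can tolerate
```

If the safe pool size comes out too small for the application, that is the signal to put a shared pooler such as RDS Proxy or pgbouncer in front of the database rather than to keep raising the limit.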

Design a highly available architecture on AWS for a web app. hard

Multi-AZ everything: ALB across ≥2 AZs → app in an Auto Scaling Group spanning AZs → RDS Multi-AZ (sync standby) or Aurora. Static assets on S3 + CloudFront. Route 53 with health checks (and multi-region active-passive if you need region resilience). Stateless app tier (state in RDS/DynamoDB/ElastiCache). SGs per tier, secrets in Secrets Manager, IAM roles (IRSA/instance roles) not static keys. Autoscale on the real load metric; design for AZ loss with no SPOF.

The AWS bill spiked with flat traffic. How do you find and fix it? hard

Cost Explorer grouped by service then tag/account to find what grew (enforce tagging if you can't attribute). Usual suspects: idle/oversized EC2, unattached EBS + old snapshots, orphaned LBs/EIPs, forgotten non-prod left running 24/7, data transfer / NAT-gateway egress (a silent classic), over-retained logs, wrong S3 storage class. Fix: rightsize, schedule/scale-to-zero non-prod, savings plans/reserved/spot, S3 lifecycle, VPC endpoints to cut NAT egress. Add budgets + anomaly alerts to prevent repeats.

EC2 can't reach the internet. Debug. medium

Public: route 0.0.0.0/0→IGW + public IP? Private: route→NAT? Then SG egress (stateful), NACL (both directions), VPC DNS — check in order.

S3 bucket accidentally public. Response? hard

Enable Block Public Access now, fix policy/ACLs to least privilege, audit access (CloudTrail/logs), assume exposure if sensitive, rotate secrets, add Config rules/SCP to prevent recurrence.

Lambda timing out / bad cold starts. Fix? medium

Reduce package + init work, provisioned concurrency, check VPC ENI/networking, raise memory (=CPU) and timeout, check slow downstream.

RDS hitting max connections. Address? medium

Connection pooling (RDS Proxy/pgbouncer), reduce idle/leaked connections, tune pool size, cache reads, add read replicas — root cause is unbounded pools × fleet.

Design HA architecture for a web app. Outline. hard

Multi-AZ ALB → ASG → RDS Multi-AZ, S3+CloudFront static, Route 53 health checks, stateless app, secrets in Secrets Manager, IAM roles, autoscale, no SPOF.

AWS bill spiked with flat traffic. Find/fix. hard

Cost Explorer by service then tag, find idle/oversized, unattached EBS+old snapshots, orphaned LBs/EIPs, 24/7 non-prod, NAT egress; rightsize, schedule, savings plans, lifecycle, endpoints, budgets.

App needs to access AWS services without static keys. How? medium

Instance profile (EC2) or IRSA (EKS) so the workload assumes an IAM role for scoped temporary creds — never embed access keys.

Cross-region failover needed for RTO 15min. Design. hard

Warm standby in region B with cross-region data replication, Route 53 health-checked failover, automation to promote DB + scale the standby, and a game-dayed runbook.

Sensitive data must never leave the VPC. How? hard

VPC endpoints (S3/DynamoDB gateway, interface for others), no IGW/NAT for those subnets, endpoint policies, and encryption — keeps traffic on the AWS backbone.

A deployment needs blue-green with instant rollback on AWS. How? hard

Two target groups behind an ALB (or CodeDeploy blue-green), shift listener/weighted traffic to green after health checks, keep blue warm to roll back by flipping traffic.
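The weighted shift between two target groups can be sketched as a listener forward action. The dictionary follows the weighted `ForwardConfig` shape used by ALB listener actions; the ARNs are hypothetical placeholders and the 10% canary step is an assumption.

```python
def weighted_forward(blue_arn, green_arn, green_percent):
    """Build a forward action that splits traffic between two target groups."""
    if not 0 <= green_percent <= 100:
        raise ValueError("green_percent must be between 0 and 100")
    return {
        "Type": "forward",
        "ForwardConfig": {
            "TargetGroups": [
                {"TargetGroupArn": blue_arn, "Weight": 100 - green_percent},
                {"TargetGroupArn": green_arn, "Weight": green_percent},
            ]
        },
    }

# Hypothetical placeholder ARNs, not real resources.
BLUE = "arn:aws:elasticloadbalancing:us-east-1:111122223333:targetgroup/blue/0000000000000000"
GREEN = "arn:aws:elasticloadbalancing:us-east-1:111122223333:targetgroup/green/1111111111111111"

canary = weighted_forward(BLUE, GREEN, 10)    # send 10% to green first
cutover = weighted_forward(BLUE, GREEN, 100)  # all traffic on green
rollback = weighted_forward(BLUE, GREEN, 0)   # flip back to blue
```

Rollback is just another weight change, which is why keeping the blue target group registered and warm makes it near-instant.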

What industry actually asks

AWS/cloud-engineer loops mix fundamentals (Regions/AZs, IAM roles vs users, S3 classes, EC2 vs Lambda, shared responsibility — often cert-aligned, SAA/SysOps) with design ("design a 3-tier / HA / serverless architecture," VPC layout) and scenario debugging ("instance can't reach the internet," "why is the bill high," "this isn't scaling"). The senior signal is reasoning about trade-offs (Multi-AZ vs multi-Region, RDS vs DynamoDB, cost vs resilience) and security defaults (least-privilege roles, no static keys, SG vs NACL). Networking (VPC/SG/NACL/route tables) and IAM come up in almost every loop.
