AWS cloud-engineer questions on IAM, VPC, compute, storage, databases, and architecture/cost, graded easy → hard with full answers. Pair with the AWS cheatsheet.
Easy — fundamentals
What are Regions and Availability Zones? easy
A Region is a geographic area (e.g. us-east-1) containing multiple isolated Availability Zones. Each AZ is one or more physically separate data centers with independent power/cooling/network, connected to the other AZs by low-latency links. You design for HA by spreading across AZs (an AZ can fail without taking your app down). Regions are isolated from each other for data residency and blast-radius; some services are global (IAM, Route 53, CloudFront). Choosing a region: latency to users, data residency/compliance, service availability, and cost.
What is IAM, and what are users, roles, and policies? easy
IAM controls who can do what. A policy is a JSON document granting/denying actions on resources (with conditions). A user is a long-lived identity (person/service) with credentials. A role is an identity with policies but no permanent credentials — it's assumed temporarily, vending short-lived credentials. Best practice: humans via SSO, workloads via roles (EC2 instance profiles, EKS IRSA), never long-lived access keys. Least privilege, and remember an explicit Deny always wins.
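The "explicit Deny always wins" rule can be sketched as a toy evaluator. This is a deliberately simplified model, not the real IAM engine: it ignores conditions, principals, permission boundaries, and SCPs, and the `evaluate` function and `POLICIES` list are hypothetical names used only for illustration.

```python
from fnmatch import fnmatchcase


def _as_list(value):
    """IAM lets Action/Resource be a string or a list; normalize to a list."""
    return value if isinstance(value, list) else [value]


def _matches(patterns, value, case_sensitive=True):
    """True if value matches any wildcard pattern ('*' and '?')."""
    for pattern in _as_list(patterns):
        if not case_sensitive:
            pattern, value = pattern.lower(), value.lower()
        if fnmatchcase(value, pattern):
            return True
    return False


def evaluate(policies, action, resource):
    """Return "Deny", "Allow", or "ImplicitDeny" for a single request."""
    allowed = False
    for policy in policies:
        for stmt in _as_list(policy["Statement"]):
            # Action names match case-insensitively; resources as written.
            if not _matches(stmt["Action"], action, case_sensitive=False):
                continue
            if not _matches(stmt["Resource"], resource):
                continue
            if stmt["Effect"] == "Deny":
                return "Deny"  # an explicit Deny short-circuits everything
            allowed = True
    # Nothing matched at all means the default: implicit deny.
    return "Allow" if allowed else "ImplicitDeny"


# Hypothetical policies attached to one role.
POLICIES = [
    {"Statement": [{"Effect": "Allow", "Action": "s3:*",
                    "Resource": "arn:aws:s3:::reports/*"}]},
    {"Statement": [{"Effect": "Deny", "Action": "s3:DeleteObject",
                    "Resource": "*"}]},
]
```

Here `s3:GetObject` on an object under `reports/` is allowed, `s3:DeleteObject` on the same object is denied even though `s3:*` allows it, and an unrelated action such as `ec2:RunInstances` falls through to the implicit deny.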
S3 storage classes — when use each? easy
S3 Standard: frequent access, low latency. Standard-IA / One Zone-IA: infrequent access, cheaper storage plus a retrieval fee (One Zone-IA keeps data in a single AZ, so it is cheaper but less resilient). Intelligent-Tiering: auto-moves objects between tiers based on access; great when patterns are unknown. Glacier Instant Retrieval / Flexible Retrieval / Deep Archive: archival, cheapest storage, retrieval from milliseconds to hours. Use lifecycle policies to transition objects automatically as they age. All classes are designed for 11 nines of durability; One Zone-IA has lower availability and loses data if its single AZ is destroyed.
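A lifecycle policy for aging objects might look like the sketch below. The dictionary follows the general shape of an S3 lifecycle configuration; the rule ID, the `logs/` prefix, and the day thresholds are illustrative assumptions, and `storage_class_at` is a hypothetical helper that only models the rule locally (it makes no AWS calls).

```python
# Illustrative lifecycle configuration; thresholds are assumptions.
LIFECYCLE = {
    "Rules": [
        {
            "ID": "age-out-logs",
            "Status": "Enabled",
            "Filter": {"Prefix": "logs/"},
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
                {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
            ],
            "Expiration": {"Days": 2555},  # roughly seven years
        }
    ]
}


def storage_class_at(rule, age_days):
    """Storage class an object of the given age would be in, or None if expired."""
    if age_days >= rule["Expiration"]["Days"]:
        return None
    current = "STANDARD"
    for transition in sorted(rule["Transitions"], key=lambda t: t["Days"]):
        if age_days >= transition["Days"]:
            current = transition["StorageClass"]
    return current
```

With these assumed thresholds, a 45-day-old log sits in Standard-IA and a 400-day-old one in Deep Archive.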
EC2 vs Lambda vs containers (ECS/EKS) — when each? easy
EC2 — full VMs, max control, for long-running/stateful or specialized workloads; you manage the OS. Lambda — serverless functions, event-driven, scale-to-zero, pay-per-invocation; great for spiky/glue/event workloads but with limits (duration, size, cold starts). Containers — ECS (simpler, AWS-native) or EKS (Kubernetes, portable, complex) for microservices; run on EC2 or Fargate (serverless containers, no nodes to manage). Pick by control vs operational overhead, traffic pattern, and team familiarity.
What is the AWS shared responsibility model? easy
AWS is responsible for security "of" the cloud: hardware, the global infrastructure, the hypervisor, managed-service internals. You are responsible for security "in" the cloud: your data, IAM/permissions, OS patching (on EC2), network config (security groups, NACLs), encryption choices, and application security. The line shifts by service: with managed/serverless services (Lambda, S3, RDS) AWS handles more of the stack, but data classification, access control, and encryption are always yours.
What are Regions and Availability Zones? easy
A Region is a geographic area; an AZ is one or more isolated datacenters within it. Spread across AZs for HA, across Regions for disaster/latency.
IAM user vs role? easy
A user has long-lived credentials for a person/app; a role is assumed temporarily for short-lived creds — prefer roles (no static keys).
What are S3 storage classes? easy
Standard (hot), Infrequent Access, Glacier/Deep Archive (cold/cheap, slow retrieval), Intelligent-Tiering (auto-moves by access pattern).
EC2 vs Lambda vs containers? easy
EC2: full VMs you manage. Lambda: event-driven functions, no servers, per-invocation. Containers (ECS/EKS/Fargate): packaged apps, more control than Lambda.
What is a VPC? easy
A logically isolated virtual network where you define subnets, route tables, and gateways for your resources.
Security Group vs NACL? easy
SG: stateful, instance-level, allow-only. NACL: stateless, subnet-level, allow+deny, needs both directions + ephemeral ports.
What is an S3 bucket policy? easy
A resource-based policy on a bucket granting/denying access by principal/action/condition — combined with IAM and Block Public Access.
RDS vs DynamoDB? easy
RDS: managed relational (SQL, joins, transactions). DynamoDB: managed NoSQL key-value/document, single-digit-ms at scale, designed around access patterns.
What is an Auto Scaling Group? easy
Maintains a target number of EC2 instances between min/max, replacing unhealthy ones and scaling on policy.
What is CloudWatch? easy
AWS monitoring: metrics, logs, alarms, and dashboards for resources and apps.
Medium — applied
Design a basic VPC for a 3-tier web app. medium
One VPC with a CIDR (e.g. 10.0.0.0/16), spread across ≥2 AZs for HA. Public subnets (route to an Internet Gateway) hold the ALB and NAT gateways. Private app subnets hold the EC2/containers; they reach the internet for updates via the NAT gateway (egress only). Private DB subnets hold RDS, no internet route. Security: security groups chain the tiers (ALB SG → app SG → db SG, referencing SGs not CIDRs), NACLs as a coarse subnet guardrail. Use VPC endpoints for S3/DynamoDB to keep traffic off the internet. Each tier in each AZ for redundancy. This is the canonical public-ALB / private-compute / isolated-DB layout.
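The subnet layout above can be planned with the standard `ipaddress` module. This is a sketch under stated assumptions: the /20 subnet size, the tier names, and the two AZ names are illustrative choices, and `plan_subnets` is a hypothetical helper, not an AWS API.

```python
import ipaddress


def plan_subnets(vpc_cidr, azs, tiers=("public", "app", "db"), new_prefix=20):
    """Carve one non-overlapping subnet per (tier, AZ) out of the VPC CIDR."""
    vpc = ipaddress.ip_network(vpc_cidr)
    blocks = vpc.subnets(new_prefix=new_prefix)  # generator of equal-size blocks
    plan = {}
    for tier in tiers:
        for az in azs:
            plan[(tier, az)] = next(blocks)
    return plan


PLAN = plan_subnets("10.0.0.0/16", ["us-east-1a", "us-east-1b"])

for (tier, az), cidr in PLAN.items():
    print(f"{tier:<6} {az}  {cidr}")
```

A /16 split into /20 blocks yields 16 subnets, so the six used here leave room for more tiers or a third AZ later.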
Security Group vs NACL — what's the difference? medium
Security Group: instance/ENI-level, stateful (return traffic is auto-allowed), allow rules only (implicit deny), evaluated as a whole. The primary firewall; you can reference other SGs as sources. NACL: subnet-level, stateless (must allow both inbound and the return ephemeral ports explicitly), supports allow and deny, evaluated by numbered rule order. Use SGs for normal app-tier control; use NACLs for coarse subnet-wide guardrails or explicit blocks (e.g. block an IP range). A common bug: a stateless NACL blocking return traffic on ephemeral ports.
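The ephemeral-port bug is easy to see in a small model of stateless rule evaluation. This is a simplified sketch (ports only, no protocol or CIDR matching); `nacl_decision` and the two rule lists are hypothetical.

```python
def nacl_decision(rules, port):
    """Evaluate NACL rules in ascending rule-number order; first match wins."""
    for _number, low, high, action in sorted(rules):
        if low <= port <= high:
            return action
    return "deny"  # the implicit final '*' rule


# Each rule: (rule number, first port, last port, action).
# Outbound rules for a subnet hosting a web server on 443.
OUTBOUND_BROKEN = [(100, 443, 443, "allow")]
OUTBOUND_FIXED = [
    (100, 443, 443, "allow"),
    (110, 1024, 65535, "allow"),  # return traffic to client ephemeral ports
]

# A client connected from source port 50000; the reply leaves the subnet
# addressed to port 50000, and the NACL has no memory of the inbound flow.
print(nacl_decision(OUTBOUND_BROKEN, 50000))  # deny
print(nacl_decision(OUTBOUND_FIXED, 50000))   # allow
```

A security group would allow the reply automatically because it tracks the connection; the NACL needs the explicit ephemeral-range rule.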
How do you give an application running on EC2/EKS access to AWS APIs securely? medium
Never hardcode access keys. On EC2, attach an instance profile (IAM role); the SDK auto-fetches temporary credentials from the instance metadata service (use IMDSv2). On EKS, use IRSA (IAM Roles for Service Accounts) or EKS Pod Identity so each pod's service account maps to a scoped IAM role — pods get short-lived creds, least privilege per workload, no shared node role. On Lambda, the function's execution role. The principle: workload identity → assume a least-privilege role → short-lived rotating credentials, never static keys in code/env.
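What makes a role assumable by a workload is its trust policy. The sketch below builds the two common shapes as plain dictionaries: one for an EC2 instance profile and one for IRSA. The account ID, OIDC provider path, namespace, and service account name in the usage lines are placeholders, and the helper function names are hypothetical.

```python
import json


def ec2_trust_policy():
    """Trust policy letting the EC2 service assume the role (instance profile)."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "ec2.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }


def irsa_trust_policy(account_id, oidc_provider, namespace, service_account):
    """Trust policy scoping a role to one Kubernetes service account via OIDC."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {
                "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{oidc_provider}"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {"StringEquals": {
                # Only this namespace/service account may assume the role.
                f"{oidc_provider}:sub":
                    f"system:serviceaccount:{namespace}:{service_account}",
                f"{oidc_provider}:aud": "sts.amazonaws.com",
            }},
        }],
    }


# Placeholder values for illustration only.
policy = irsa_trust_policy(
    "111122223333", "oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE",
    "payments", "api",
)
print(json.dumps(policy, indent=2))
```

The `sub` condition is what gives per-workload least privilege: without it, any service account in the cluster could assume the role.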
RDS vs DynamoDB — how do you choose? medium
RDS = managed relational (Postgres/MySQL/etc): SQL, joins, transactions, strong consistency, flexible queries — pick it for relational data, complex queries, and when you need ACID across rows. It scales vertically (and read replicas/Aurora horizontally for reads). DynamoDB = managed NoSQL key-value/document: single-digit-ms latency at any scale, serverless, pay-per-use — pick it for high-scale, well-known access patterns, where you design the table around your queries (partition key design is everything). DynamoDB trades query flexibility and joins for scale and operational simplicity. Choose by data model + access patterns + scale, not hype.
How do you autoscale and load-balance a web tier on AWS? medium
Put an Application Load Balancer (L7, path/host routing, TLS termination, health checks) in public subnets in front of an Auto Scaling Group of instances (or use ECS/EKS with target tracking). The ASG spans multiple AZs and scales on a policy — target tracking (e.g. keep CPU ~60% or requests-per-target steady) is the modern default; step scaling and scheduled scaling for known patterns; predictive scaling for cyclical load. The ALB only routes to healthy targets (health checks), and the ASG replaces unhealthy instances. Use a launch template, immutable AMIs/containers, and connection draining for graceful deploys.
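The intuition behind target tracking is proportional: if the metric is 50% over target, you need roughly 50% more capacity. The function below is a simplified model of that idea, not the service's actual algorithm (which works through CloudWatch alarms, cooldowns, and conservative scale-in); `desired_capacity` is a hypothetical name.

```python
import math


def desired_capacity(current, metric_value, target_value, min_size, max_size):
    """Proportional estimate of the capacity needed to bring the metric to target."""
    raw = math.ceil(current * metric_value / target_value)
    return max(min_size, min(max_size, raw))  # clamp to the ASG's min/max


# 4 instances at 90% CPU with a 60% target: scale out to 6.
print(desired_capacity(4, 90, 60, min_size=2, max_size=10))
# 4 instances at 30% CPU: scale in to the minimum of 2.
print(desired_capacity(4, 30, 60, min_size=2, max_size=10))
```

The clamp is why min/max matter in an interview answer: they bound both the blast radius of a bad metric and the bill.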
What is IRSA / workload identity? medium
IAM Roles for Service Accounts: EKS pods assume IAM roles via OIDC, getting scoped temporary creds instead of node-wide or static keys.
How do you design a multi-AZ VPC? medium
Public subnets (LB + NAT) and private subnets (app, DB) per AZ; route public→IGW, private→NAT egress-only, DB tier no internet; SGs referencing SGs.
ALB vs NLB vs CLB? medium
ALB: L7 HTTP routing/host-path/TLS. NLB: L4, ultra-low latency, static IP, high throughput. CLB: legacy — use ALB/NLB.
What is S3 cross-region replication / versioning? medium
Versioning keeps object history (protects against overwrite/delete); CRR async-replicates to another region for DR/compliance/latency.
How does autoscaling choose a metric? medium
Target tracking on the metric reflecting load (often not CPU — use RPS, queue depth, latency); step/scheduled/predictive for other patterns.
What is a NAT gateway and its cost gotcha? medium
Provides outbound internet for private subnets; you pay per hour + per GB processed, so high egress through NAT is a common silent cost — use VPC endpoints.
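A quick back-of-the-envelope calculation shows why the per-GB charge dominates. The rates below are illustrative assumptions only (check current pricing for your region), and `nat_monthly_cost` is a hypothetical helper.

```python
def nat_monthly_cost(gb_processed, hourly_rate=0.045, per_gb_rate=0.045, hours=730):
    """Monthly cost of one NAT gateway: fixed hourly charge plus data processing.

    The default rates are assumed example values, not quoted prices.
    """
    return round(hours * hourly_rate + gb_processed * per_gb_rate, 2)


idle = nat_monthly_cost(0)         # hourly charge only
busy = nat_monthly_cost(10_000)    # 10 TB pushed through the NAT
print(f"idle: ${idle}, with 10 TB processed: ${busy}")
```

Under these assumed rates, data processing is more than ten times the fixed charge at 10 TB a month, which is why routing S3 and DynamoDB traffic through gateway endpoints instead is a common first fix.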
What are VPC endpoints? medium
Private connectivity to AWS services (S3/DynamoDB gateway endpoints, interface endpoints for others) without traversing the internet/NAT — cheaper, more secure.
How does Route 53 routing work? medium
Policies: simple, weighted, latency-based, geolocation, failover (health-checked) — for traffic distribution and HA/DR.
What is the shared responsibility model? medium
AWS secures the cloud (hardware, infra, managed-service internals); you secure in the cloud (data, IAM, config, patching your instances).
RDS Multi-AZ vs Read Replicas? medium
Multi-AZ: synchronous standby for HA/failover (in a Multi-AZ instance deployment the standby serves no reads, so it does not scale reads). Read replicas: asynchronous copies for read scaling (can lag; promotion is manual).
Hard — senior & architecture
Design a highly-available, fault-tolerant architecture. What are the principles? hard
Eliminate single points of failure and degrade gracefully. Run every tier across multiple AZs (ASG across AZs, ALB with cross-zone load balancing, RDS Multi-AZ for the stateful layer) so an AZ loss is survivable; go multi-Region only for DR or very high availability (active-passive with Route 53 failover + cross-region replication, or active-active for global low latency, which is far more complex and costly). Keep compute stateless behind a load balancer so instances are disposable and horizontally scalable; push state to managed stores (RDS, DynamoDB, S3) and caches (ElastiCache). Decouple with queues (SQS) / events (EventBridge) so components fail independently and absorb spikes. Add health checks + auto-replacement, retries with backoff + idempotency, and circuit breakers. Take backups and test restores; use infrastructure as code for reproducibility. Match the cost/complexity to the real RTO/RPO: don't build multi-region if Multi-AZ meets the SLA.
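"Retries with backoff" usually means exponential backoff with jitter, so that many clients retrying at once do not hammer a recovering dependency in lockstep. A minimal sketch, assuming a hypothetical `TransientError` that marks retryable failures; the `sleep` and `rng` parameters exist only so the behavior can be tested without real waiting.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def call_with_backoff(fn, max_attempts=5, base_delay=0.1, max_delay=5.0,
                      sleep=time.sleep, rng=random.random):
    """Call fn, retrying TransientError with capped exponential backoff + full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(rng() * cap)  # full jitter: wait a random time in [0, cap)
```

Retries are only safe when the operation is idempotent; otherwise a retry after a timeout can apply the same change twice.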
An EC2 instance can't reach the internet (or vice versa). How do you debug it systematically? hard
Walk the path. Outbound from a private instance: is it in a private subnet whose route table sends 0.0.0.0/0 to a NAT gateway (in a public subnet)? Public subnet needs a route to the Internet Gateway. Inbound: instance needs a public IP/EIP, a subnet route to the IGW (=public subnet), and a security group allowing the port. Check in order: (1) Security group (stateful, allow rules) — most common. (2) NACL (stateless — did you allow return ephemeral ports?). (3) Route table (IGW for public, NAT for private). (4) Public IP present? (5) OS firewall / app listening on 0.0.0.0. (6) DNS. Use VPC Reachability Analyzer and VPC Flow Logs (ACCEPT/REJECT) to see exactly where packets are dropped. Method: SG → NACL → route → public IP → OS.
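The ordered checklist above can be written down as a tiny runbook helper that reports the first failing hop. This is only a sketch of the method: the fact keys are hypothetical and would be filled in by hand or from tools such as Reachability Analyzer and Flow Logs.

```python
# (description, fact key) in the order the checks should be made.
CHECKS = [
    ("security group allows the traffic", "sg_allows"),
    ("NACL allows both directions, including ephemeral ports", "nacl_allows"),
    ("route table has a default route to the IGW or NAT", "default_route"),
    ("instance has a public IP/EIP where one is required", "public_ip_ok"),
    ("OS firewall is open and the app is listening", "os_ok"),
    ("DNS resolves", "dns_ok"),
]


def first_blocker(facts):
    """Return the first failed check in path order, or None if all pass."""
    for description, key in CHECKS:
        if not facts.get(key, False):  # unknown facts count as not yet verified
            return description
    return None


facts = {"sg_allows": True, "nacl_allows": True, "default_route": False}
print(first_blocker(facts))
```

Treating an unverified fact as a failure forces the walk to proceed hop by hop instead of jumping to a favorite suspect.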
How do you approach cost optimization on AWS? hard
Measure first: Cost Explorer/CUR + tagging to attribute spend, find the top line items. Then: right-size over-provisioned instances/volumes (Compute Optimizer); buy commitments for steady baseline (Savings Plans / Reserved Instances) and use Spot for fault-tolerant/batch (huge discount). Scale to demand (autoscaling, scale-to-zero with Lambda/Fargate, stop dev envs off-hours). Storage — S3 lifecycle to IA/Glacier, delete orphaned EBS volumes/snapshots/unattached EIPs, gp3 over gp2. Data transfer — keep traffic in-AZ/in-region, use VPC endpoints and CloudFront to cut egress (often a hidden cost). Managed/serverless where it lowers ops + idle cost. Set budgets/alerts and make cost a continuous practice (FinOps), not a one-off. The biggest wins are usually idle/over-provisioned resources and data-transfer surprises.
How would you build a secure, scalable serverless API? hard
API Gateway (or ALB/Lambda URL) → Lambda → DynamoDB is the classic pattern. Auth at the edge: Cognito / a Lambda authorizer / JWT, plus WAF for common attacks and throttling/usage plans for rate limiting. Lambda with least-privilege execution roles per function; secrets from Secrets Manager/SSM, not plaintext env vars. DynamoDB with on-demand capacity (or autoscaling) and a partition-key design matching access patterns; add DAX/caching if hot. Decouple slow work to SQS/Step Functions so the API stays fast. Concerns to call out: cold starts (provisioned concurrency for latency-sensitive paths), Lambda concurrency limits + downstream throttling, idempotency for retries, and observability (X-Ray tracing, structured logs, CloudWatch alarms). It scales automatically and costs close to nothing at idle (provisioned concurrency, WAF, and stored secrets are billed regardless), at the price of cold starts and per-request limits.
Multi-account strategy — why and how? hard
Separate AWS accounts give hard isolation boundaries — for blast radius, security, billing, and environment separation (prod vs dev vs sandbox, or per team/product). Manage them with AWS Organizations: a management account, organizational units (OUs), and Service Control Policies (SCPs) that set guardrails on what member accounts can do (e.g. deny disabling CloudTrail, restrict regions). Use Control Tower to provision accounts with a compliant baseline (landing zone). Centralize identity via IAM Identity Center (SSO) with permission sets (assume roles cross-account), centralize logging (CloudTrail/Config to a log-archive account), and consolidate billing. Networking via Transit Gateway / shared VPCs. The principle: isolate by account for security/blast-radius, govern centrally with Organizations + SCPs + SSO.
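The two guardrails named above (deny disabling CloudTrail, restrict regions) might be expressed as an SCP like the sketch below. The allowed regions and the list of exempted global services are illustrative assumptions that would need tailoring to a real organization.

```python
import json

# Illustrative SCP; region list and exempted services are assumptions.
SCP_GUARDRAILS = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ProtectCloudTrail",
            "Effect": "Deny",
            "Action": ["cloudtrail:StopLogging", "cloudtrail:DeleteTrail"],
            "Resource": "*",
        },
        {
            "Sid": "RestrictRegions",
            "Effect": "Deny",
            # Global services are exempted so they keep working everywhere.
            "NotAction": ["iam:*", "organizations:*", "route53:*",
                          "cloudfront:*", "sts:*", "support:*"],
            "Resource": "*",
            "Condition": {"StringNotEquals": {
                "aws:RequestedRegion": ["us-east-1", "eu-west-1"]
            }},
        },
    ],
}

print(json.dumps(SCP_GUARDRAILS, indent=2))
```

Note that an SCP grants nothing by itself: it only caps what IAM policies in member accounts can allow, which is why both statements here are Deny.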
Design a highly available architecture on AWS. hard
ALB across ≥2 AZs → ASG app tier per AZ → RDS Multi-AZ/Aurora; static on S3+CloudFront; Route 53 health-checked; stateless app, secrets in Secrets Manager, IAM roles, autoscale on real metric, no SPOF.
How do you design multi-account governance? hard
AWS Organizations + SCPs for guardrails, separate accounts per env/team (blast radius + billing), centralized logging/security accounts, Control Tower, and cross-account roles.
How do you secure data at rest and in transit? hard
KMS-backed encryption (S3/EBS/RDS) with key policies + rotation, TLS everywhere, enforce encryption via SCP/bucket policy, and least-privilege key access.
How do you optimize a large AWS bill structurally? hard
Tagging + Cost Explorer attribution, rightsizing, autoscale/scale-to-zero non-prod, savings plans/reserved/spot, S3 lifecycle, VPC endpoints to cut NAT egress, budgets + anomaly alerts.
How does cross-account access work securely? hard
Define a role in the target account trusting the source account/principal; the source assumes it via STS for scoped temporary creds — with ExternalId for third parties to prevent confused-deputy.
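The role's trust policy is where ExternalId is enforced. A minimal sketch as a plain dictionary; the account ID and external ID in the usage lines are placeholders, and `third_party_trust_policy` is a hypothetical helper.

```python
import json


def third_party_trust_policy(trusted_account_id, external_id):
    """Trust policy for a role a third party may assume only with the agreed ExternalId."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": f"arn:aws:iam::{trusted_account_id}:root"},
            "Action": "sts:AssumeRole",
            # Without the matching ExternalId the AssumeRole call is rejected,
            # which blocks another customer of the vendor from borrowing access.
            "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
        }],
    }


# Placeholder values for illustration only.
print(json.dumps(third_party_trust_policy("999988887777", "example-external-id"),
                 indent=2))
```

The permissions policy attached to the role still decides what the assumed session can do; the trust policy only decides who may assume it.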
Design a serverless event-driven pipeline. hard
API Gateway/EventBridge → Lambda → SQS/SNS for decoupling + retries/DLQ → DynamoDB/S3; idempotent handlers, concurrency limits, and async fan-out — scales to zero.
How do you do disaster recovery on AWS? hard
Pick by RTO/RPO: backup-restore (cheap, slow), pilot light, warm standby, or active-active multi-region; replicate data (cross-region), automate failover (Route 53), and test.
How do you handle Lambda at high scale? hard
Mind account concurrency limits + reserved/provisioned concurrency, downstream throttling (RDS connections — use RDS Proxy), DLQs for failures, and avoid VPC ENI cold-start pitfalls.
How do you enforce least privilege in IAM at scale? hard
Permission boundaries, SCPs, Access Analyzer to find unused/excessive perms, generate policies from CloudTrail, scoped roles per workload, and regular access reviews.
How do you architect for cost-efficient high throughput? hard
Spot for fault-tolerant compute, autoscaling, caching (ElastiCache/CloudFront), right storage class, batch/async via SQS, and avoid cross-AZ/NAT data-transfer charges.
Scenario-based
An EC2 instance can't reach the internet. How do you debug? medium
Trace the path. Public subnet: route table has 0.0.0.0/0 → Internet Gateway? Instance has a public IP / EIP? Private subnet: route to a NAT gateway for outbound? Then Security Group egress (allows outbound — SGs are stateful) and NACL (both directions, stateless — needs ephemeral return ports). Also DNS enabled on the VPC. Check in that order: subnet route → IGW/NAT → SG → NACL → DNS.
An S3 bucket was accidentally made public. What's your response? hard
Enable Block Public Access (account + bucket level) immediately to cut exposure, then fix the bucket policy / ACLs to least privilege. Audit what was exposed and for how long (CloudTrail data events if enabled / S3 server access logs, queried with Athena); if the data was sensitive, assume it was accessed and follow the breach process. Rotate any secrets that were stored there. Prevent recurrence: account-level Block Public Access locked in place by an SCP that denies changing it, Config rules/Access Analyzer to flag public buckets, default encryption, and least-privilege IAM.
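For the "fix the bucket policy" step, a rough first pass is to flag statements that allow a wildcard principal with no condition. The check below is a simplified heuristic, not the logic AWS itself uses to classify a bucket as public; `public_statements` and `EXAMPLE_POLICY` are hypothetical, and the four Block Public Access settings are shown as the target state.

```python
# The four Block Public Access settings, all on.
BLOCK_PUBLIC_ACCESS = {
    "BlockPublicAcls": True,
    "IgnorePublicAcls": True,
    "BlockPublicPolicy": True,
    "RestrictPublicBuckets": True,
}


def public_statements(bucket_policy):
    """Return Sids of Allow statements open to everyone with no Condition (heuristic)."""
    hits = []
    for stmt in bucket_policy.get("Statement", []):
        principal = stmt.get("Principal")
        wildcard = principal == "*" or (
            isinstance(principal, dict) and principal.get("AWS") == "*"
        )
        if stmt.get("Effect") == "Allow" and wildcard and not stmt.get("Condition"):
            hits.append(stmt.get("Sid", "<no Sid>"))
    return hits


# Hypothetical bucket policy with one offending statement.
EXAMPLE_POLICY = {"Statement": [
    {"Sid": "PublicRead", "Effect": "Allow", "Principal": "*",
     "Action": "s3:GetObject", "Resource": "arn:aws:s3:::example-bucket/*"},
    {"Sid": "TeamRead", "Effect": "Allow",
     "Principal": {"AWS": "arn:aws:iam::111122223333:role/reader"},
     "Action": "s3:GetObject", "Resource": "arn:aws:s3:::example-bucket/*"},
]}

print(public_statements(EXAMPLE_POLICY))
```

In practice Access Analyzer gives the authoritative answer; a local check like this is mainly useful in CI to stop an obviously public policy before it is applied.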
A Lambda is timing out / has bad cold starts. How do you fix it? medium
Cold starts: minimize package size and init work, use Provisioned Concurrency (or SnapStart) for latency-sensitive paths, and avoid heavy SDK/connection setup in the handler (move to init, reuse across invocations). If it's in a VPC, ENI setup historically added latency — ensure modern VPC networking and right-sized subnets. Timeouts: raise memory (also scales CPU), increase the timeout, check a slow downstream (DB/API), and add retries/idempotency.
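"Move to init, reuse across invocations" looks like this in a Python handler: expensive setup lives at module level, which runs once per execution environment, and the handler only uses the result. The `_create_client` function here is a hypothetical stand-in for building an SDK client or database connection.

```python
import time

INIT_COUNT = 0  # counts how often the expensive setup ran


def _create_client():
    """Stand-in for expensive setup such as an SDK client or DB connection."""
    global INIT_COUNT
    INIT_COUNT += 1
    return {"created_at": time.time()}


# Module scope: runs once during the cold start, then is reused while warm.
CLIENT = _create_client()


def warm_handler(event, context=None):
    """Per-invocation work only; no client construction in here."""
    return {"statusCode": 200, "client_id": id(CLIENT)}


first = warm_handler({})
second = warm_handler({})
print(INIT_COUNT, first["client_id"] == second["client_id"])
```

The trade-off is that module-level work lengthens the cold start itself, so keep it to what every invocation actually needs.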
RDS is hitting max connections. How do you address it? medium
Apps (esp. serverless/many containers) open too many connections. Add connection pooling — RDS Proxy or an app-side pooler (pgbouncer) — so many clients share few DB connections. Reduce idle/leaked connections, tune pool sizes, and cache hot reads. Short-term, scale the instance (more max_connections) or add read replicas for read load. Root cause is usually unbounded per-instance pools × autoscaled fleet — fix the pooling.
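The "unbounded pools × fleet" root cause is plain arithmetic. The numbers below are illustrative assumptions, and both helper functions are hypothetical.

```python
def peak_connections(instances, pool_size_per_instance):
    """Worst-case DB connections if every instance fills its pool."""
    return instances * pool_size_per_instance


def max_safe_pool_size(max_connections, max_instances, reserved=10):
    """Largest per-instance pool that still fits at full scale-out.

    `reserved` keeps a few connections free for admin and monitoring.
    """
    return max(0, (max_connections - reserved) // max_instances)


# Assumed example: the fleet autoscales to 40 instances, each with a pool of 20,
# against a database that accepts 500 connections.
print(peak_connections(40, 20))      # 800, well over the limit
print(max_safe_pool_size(500, 40))   # 12 per instance fits
```

If the safe pool size comes out too small for the workload, that is the signal to put RDS Proxy or pgbouncer in front rather than keep raising the instance size.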
Design a highly available architecture on AWS for a web app. hard
Multi-AZ everything: ALB across ≥2 AZs → app in an Auto Scaling Group spanning AZs → RDS Multi-AZ (sync standby) or Aurora. Static assets on S3 + CloudFront. Route 53 with health checks (and multi-region active-passive if you need region resilience). Stateless app tier (state in RDS/DynamoDB/ElastiCache). SGs per tier, secrets in Secrets Manager, IAM roles (IRSA/instance roles) not static keys. Autoscale on the real load metric; design for AZ loss with no SPOF.
The AWS bill spiked with flat traffic. How do you find and fix it? hard
Cost Explorer grouped by service then tag/account to find what grew (enforce tagging if you can't attribute). Usual suspects: idle/oversized EC2, unattached EBS + old snapshots, orphaned LBs/EIPs, forgotten non-prod left running 24/7, data transfer / NAT-gateway egress (a silent classic), over-retained logs, wrong S3 storage class. Fix: rightsize, schedule/scale-to-zero non-prod, savings plans/reserved/spot, S3 lifecycle, VPC endpoints to cut NAT egress. Add budgets + anomaly alerts to prevent repeats.
EC2 can't reach the internet. Debug. medium
Public: route 0.0.0.0/0→IGW + public IP? Private: route→NAT? Then SG egress (stateful), NACL (both directions), VPC DNS — check in order.
S3 bucket accidentally public. Response? hard
Enable Block Public Access now, fix policy/ACLs to least privilege, audit access (CloudTrail/logs), assume exposure if sensitive, rotate secrets, add Config rules/SCP to prevent recurrence.
Lambda timing out / bad cold starts. Fix? medium
Reduce package + init work, provisioned concurrency, check VPC ENI/networking, raise memory (=CPU) and timeout, check slow downstream.
RDS hitting max connections. Address? medium
Connection pooling (RDS Proxy/pgbouncer), reduce idle/leaked connections, tune pool size, cache reads, add read replicas — root cause is unbounded pools × fleet.
Design HA architecture for a web app. Outline. hard
Multi-AZ ALB → ASG → RDS Multi-AZ, S3+CloudFront static, Route 53 health checks, stateless app, secrets in Secrets Manager, IAM roles, autoscale, no SPOF.
AWS bill spiked with flat traffic. Find/fix. hard
Cost Explorer by service then tag, find idle/oversized, unattached EBS+old snapshots, orphaned LBs/EIPs, 24/7 non-prod, NAT egress; rightsize, schedule, savings plans, lifecycle, endpoints, budgets.
App needs to access AWS services without static keys. How? medium
Instance profile (EC2) or IRSA (EKS) so the workload assumes an IAM role for scoped temporary creds — never embed access keys.
Cross-region failover needed for RTO 15min. Design. hard
Warm standby in region B with cross-region data replication, Route 53 health-checked failover, automation to promote DB + scale the standby, and a game-dayed runbook.
Sensitive data must never leave the VPC. How? hard
VPC endpoints (S3/DynamoDB gateway, interface for others), no IGW/NAT for those subnets, endpoint policies, and encryption — keeps traffic on the AWS backbone.
A deployment needs blue-green with instant rollback on AWS. How? hard
Two target groups behind an ALB (or CodeDeploy blue-green), shift listener/weighted traffic to green after health checks, keep blue warm to roll back by flipping traffic.
AWS/cloud-engineer loops mix fundamentals (Regions/AZs, IAM roles vs users, S3 classes, EC2 vs Lambda, shared responsibility — often cert-aligned, SAA/SysOps) with design ("design a 3-tier / HA / serverless architecture," VPC layout) and scenario debugging ("instance can't reach the internet," "why is the bill high," "this isn't scaling"). The senior signal is reasoning about trade-offs (Multi-AZ vs multi-Region, RDS vs DynamoDB, cost vs resilience) and security defaults (least-privilege roles, no static keys, SG vs NACL). Networking (VPC/SG/NACL/route tables) and IAM come up in almost every loop.