
DEBUG GUIDE · AWS · SRE PLAYBOOK

Debugging AWS — Connectivity, IAM, and Services.

Most AWS tickets come down to networking (security groups, routes, subnets) or IAM ("not authorized"). When something "can't connect" or is "denied", check those two first; as a rule of thumb they account for roughly 80% of cases.

Can't SSH/connect to EC2

Checklist, in order:

  • Security group — inbound rule for port 22 (or your app port) from your IP?
  • Public IP / subnet — instance in a public subnet with a public IP, and the subnet routes 0.0.0.0/0 to an Internet Gateway?
  • NACL — subnet NACL allows the port both directions (stateless)?
  • Key / user — right .pem (chmod 400), right user (ec2-user/ubuntu/admin)?
  • Instance health — status checks passing? Or use SSM Session Manager (no SSH needed).

SG stateful, NACL stateless. Security groups automatically allow return traffic; NACLs do not, so you must allow the ephemeral port range (1024–65535) outbound for replies. A connection that "works then hangs" often means the NACL is missing the return range.
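
The checklist above can be walked from the CLI. A minimal sketch, assuming the AWS CLI is configured for the right account and region; `ec2_connect_facts` and `subnet_network_facts` are hypothetical helper names, and both functions only read state:

```shell
# Hypothetical helpers (names are illustrative, not part of any tool).

# Public IP, subnet, security groups, and status checks for one instance.
ec2_connect_facts() {
  instance_id="$1"
  aws ec2 describe-instances --instance-ids "$instance_id" \
    --query 'Reservations[].Instances[].[PublicIpAddress,SubnetId,SecurityGroups[].GroupId]' \
    --output json
  aws ec2 describe-instance-status --instance-ids "$instance_id" \
    --query 'InstanceStatuses[].[SystemStatus.Status,InstanceStatus.Status]' \
    --output text
}

# Routes and NACL entries for a subnet. A subnet with no explicit route table
# association uses the VPC's main route table, so an empty routes result is
# a cue to look there instead.
subnet_network_facts() {
  subnet_id="$1"
  aws ec2 describe-route-tables \
    --filters "Name=association.subnet-id,Values=$subnet_id" \
    --query 'RouteTables[].Routes[].[DestinationCidrBlock,GatewayId,NatGatewayId]' \
    --output text
  aws ec2 describe-network-acls \
    --filters "Name=association.subnet-id,Values=$subnet_id" \
    --query 'NetworkAcls[].Entries[].[RuleNumber,Egress,RuleAction,CidrBlock,PortRange.From,PortRange.To]' \
    --output text
}
```

Usage might look like `ec2_connect_facts i-0123456789abcdef0`, followed by `subnet_network_facts` with the subnet ID it prints. In the NACL output, rows with Egress set to True are the outbound rules that need to cover 1024–65535.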

"not authorized to perform" / AccessDenied

aws sts get-caller-identity        # who am I, really?
# simulate the call against the principal's policies:
aws iam simulate-principal-policy --policy-source-arn <principal-arn> \
  --action-names s3:GetObject --resource-arns <resource-arn>

Causes. A missing IAM permission; an explicit Deny (which always wins); an S3 bucket policy, SCP, or permissions boundary that blocks the action; the wrong role assumed; or a resource in another account. Read the error: it names the action and resource. Use CloudTrail to see the exact denied call.
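
To turn the simulator's answer into a next step, something like the following may help. It is a sketch: `simulate_action` and `explain_decision` are hypothetical names, and the advice strings are a reading of the causes listed above rather than official guidance. The three decision values are the ones `simulate-principal-policy` returns in `EvalDecision`.

```shell
# Hypothetical helper: print only the decision for one action on one resource.
simulate_action() {
  aws iam simulate-principal-policy \
    --policy-source-arn "$1" --action-names "$2" --resource-arns "$3" \
    --query 'EvaluationResults[0].EvalDecision' --output text
}

# Map the simulator's decision to where to look next.
explain_decision() {
  case "$1" in
    allowed)      echo "simulated policies allow it: check resource policy, SCP, or which role is really in use" ;;
    explicitDeny) echo "a Deny statement matches: an Allow will not override it" ;;
    implicitDeny) echo "no Allow matches: the permission is missing" ;;
    *)            echo "unknown decision: $1" ;;
  esac
}
```

A call such as `explain_decision "$(simulate_action <principal-arn> s3:GetObject <resource-arn>)"` chains the two.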

No internet / can't reach a service

Need | Requires
Public subnet → internet | route 0.0.0.0/0 → Internet Gateway + public IP
Private subnet → internet (outbound) | route → NAT Gateway (in a public subnet)
Reach AWS APIs privately | VPC Endpoint (no NAT needed)
VPC ↔ VPC | peering / Transit Gateway + routes both sides
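
Whether a subnet is "public" or "private" in the sense of the table comes down to where its default route points. A rough sketch, with hypothetical helper names; it assumes the subnet has an explicit route table association (otherwise the VPC's main route table applies and the query returns nothing):

```shell
# Hypothetical helper: print the targets of the subnet's 0.0.0.0/0 route.
default_route_target() {
  aws ec2 describe-route-tables \
    --filters "Name=association.subnet-id,Values=$1" \
    --query "RouteTables[].Routes[?DestinationCidrBlock=='0.0.0.0/0'][].[GatewayId,NatGatewayId]" \
    --output text
}

# Classify by the resource ID prefix of the route target.
classify_route() {
  case "$1" in
    *igw-*) echo "public: default route goes to an Internet Gateway" ;;
    *nat-*) echo "private: outbound internet via a NAT Gateway" ;;
    *)      echo "no default route to an IGW or NAT: no internet path" ;;
  esac
}
```

For example, `classify_route "$(default_route_target subnet-0abc)"`, where the subnet ID is a placeholder.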

ELB 5xx / unhealthy targets

# Target group health = the usual culprit
# check: health check path returns 200? SG allows LB → target port?
# 503 from ALB = no healthy targets ; 504 = target too slow

Fix. Health-check path/port correct and returning 200; target SG allows the LB's SG on the app port; targets registered and passing.
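
Target health can be read directly rather than inferred from 5xx counts. A small sketch, assuming you have the target group ARN; the function names are hypothetical:

```shell
# Hypothetical helper: one line per target with id, port, state, and reason.
target_health() {
  aws elbv2 describe-target-health --target-group-arn "$1" \
    --query 'TargetHealthDescriptions[].[Target.Id,Target.Port,TargetHealth.State,TargetHealth.Reason]' \
    --output text
}

# Keep only targets whose state (third column) is not "healthy".
unhealthy_targets() {
  target_health "$1" | awk '$3 != "healthy"'
}
```

The reason column usually points at the fix: a response-code mismatch suggests the health-check path, a timeout suggests the security group or a slow target.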

Can't connect to RDS

  • RDS security group must allow your source SG/IP on the DB port (5432/3306).
  • Same VPC or peered + routes; publicly accessible flag if connecting from outside.
  • Hitting max_connections? Use RDS Proxy / a pooler.
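
The first two checks above can be sketched from a shell on the client side. This assumes `nc` (netcat) is installed on the machine you are connecting from; the function names are hypothetical:

```shell
# Hypothetical helper: endpoint host, port, and the publicly-accessible flag.
rds_endpoint() {
  aws rds describe-db-instances --db-instance-identifier "$1" \
    --query 'DBInstances[0].[Endpoint.Address,Endpoint.Port,PubliclyAccessible]' \
    --output text
}

# TCP-level test only: -z checks that the port accepts a connection,
# -w 5 gives up after 5 seconds. It says nothing about DB credentials.
rds_port_open() {
  if nc -z -w 5 "$1" "$2"; then
    echo "open"
  else
    echo "blocked or unreachable"
  fi
}
```

If `rds_port_open` reports "blocked or unreachable" from a host that should have access, the security group and routing are the likely suspects; if it reports "open" but the client still fails, look at credentials or max_connections instead.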

Lambda errors / timeouts

# logs are in CloudWatch Logs /aws/lambda/<fn>
# Task timed out  -> raise timeout or fix slow downstream
# permission errors -> the function's execution ROLE lacks the action
# in a VPC + needs internet -> route via NAT (Lambda in private subnets)
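
A sketch of pulling the facts those notes rely on. It assumes AWS CLI v2 (`aws logs tail` is not available in v1) and the default log group naming; the function names are hypothetical:

```shell
# Default log group name for a function.
lambda_log_group() {
  echo "/aws/lambda/$1"
}

# Last 15 minutes of logs (assumption: AWS CLI v2 is installed).
lambda_recent_logs() {
  aws logs tail "$(lambda_log_group "$1")" --since 15m
}

# Timeout (seconds), execution role ARN, and VPC subnets, if any.
lambda_config() {
  aws lambda get-function-configuration --function-name "$1" \
    --query '[Timeout,Role,VpcConfig.SubnetIds]' --output json
}
```

The role ARN from `lambda_config` is the principal to inspect for permission errors, and a non-empty subnet list means the NAT note applies.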

Where to look

  • CloudWatch Logs/Metrics — app + service logs, alarms.
  • CloudTrail — API calls (who did what, and the error code when denied); data events such as S3 object reads must be enabled separately.
  • VPC Flow Logs — accepted/rejected packets (prove a SG/NACL drop).
  • VPC Reachability Analyzer — path test between two resources.
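
A Reachability Analyzer path test can also be driven from the CLI. The following is a sketch with a hypothetical function name; the source and destination are resource IDs you supply (for example an instance ID or network interface ID), and the analysis runs asynchronously, so the final call may need repeating:

```shell
# Hypothetical helper: create a path, start an analysis, print its status.
reachability_check() {
  src="$1"; dst="$2"; port="$3"
  path_id="$(aws ec2 create-network-insights-path \
    --source "$src" --destination "$dst" --protocol tcp --destination-port "$port" \
    --query 'NetworkInsightsPath.NetworkInsightsPathId' --output text)"
  analysis_id="$(aws ec2 start-network-insights-analysis \
    --network-insights-path-id "$path_id" \
    --query 'NetworkInsightsAnalysis.NetworkInsightsAnalysisId' --output text)"
  # Status may still be "running" here; re-run this describe call until it settles.
  aws ec2 describe-network-insights-analyses \
    --network-insights-analysis-ids "$analysis_id" \
    --query 'NetworkInsightsAnalyses[0].[Status,NetworkPathFound]' --output text
}
```

When the path is not found, the full analysis output (without the `--query` filter) lists the component that blocked it, which is often quicker than reading flow logs.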
© cvam — written in plaintext, served warm