
DEBUG GUIDE · DATA STORE · SRE PLAYBOOK

Debugging Redis — Memory, Latency, and Replication.

Redis mental model: single-threaded command execution (one slow command blocks everything), all data in RAM, optional persistence (RDB snapshots / AOF log), optional replication (async to replicas). Most Redis incidents are: ran out of RAM, one command blocked the loop, or a replica fell behind. Diagnose with INFO, SLOWLOG, LATENCY, and --bigkeys.

OOM & eviction

Symptom. Errors reading "OOM command not allowed when used memory > 'maxmemory'", writes failing, or keys silently vanishing (evicted).

Likely causes.

  • Dataset outgrew maxmemory.
  • maxmemory-policy noeviction → writes rejected once full.
  • No TTLs — keys accumulate forever.
  • Memory fragmentation inflating RSS beyond logical dataset size.

Diagnose.

redis-cli INFO memory
#  used_memory_human, maxmemory_human, maxmemory_policy
#  mem_fragmentation_ratio  (>1.5 = fragmented; <1 = swapping, bad)
redis-cli --bigkeys              # find the memory hogs
redis-cli DBSIZE                 # key count
redis-cli INFO keyspace          # per db: keys, expires (keys with a TTL), avg_ttl
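
The fragmentation thresholds in the comments above can be checked mechanically. A minimal sketch, assuming the output of redis-cli INFO memory is piped in on stdin; the function name frag_verdict is made up for illustration, and the 1.5 and 1.0 cutoffs are the rules of thumb quoted above, not limits Redis enforces.

```shell
# Hypothetical helper: classify mem_fragmentation_ratio from INFO memory text.
frag_verdict() {
  # INFO lines end in \r\n, so strip the carriage returns before parsing.
  tr -d '\r' | awk -F: '
    $1 == "mem_fragmentation_ratio" {
      ratio = $2 + 0
      if (ratio > 1.5)    verdict = "fragmented"
      else if (ratio < 1) verdict = "swapping"
      else                verdict = "ok"
      print verdict, $2
      found = 1
    }
    END { if (!found) print "unknown" }'
}
```

Usage would look like: redis-cli INFO memory | frag_verdict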

Fix.

  • Raise maxmemory (within node RAM) or shard.
  • Set an eviction policy that matches use: allkeys-lru/allkeys-lfu for a cache; keep noeviction only for a durable store.
  • Add TTLs to cache keys (EXPIRE) so they self-clean.
  • Kill the offending big keys; restructure (hash fields, smaller values).
Cache vs store policy. Using Redis as a cache with noeviction means it rejects writes when full instead of dropping cold keys, which turns a full cache into an outage. Pick allkeys-lru.
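
Before adding TTLs it helps to know how many keys lack one. A rough sketch, assuming redis-cli INFO keyspace output on stdin in the usual dbN:keys=...,expires=...,avg_ttl=... shape; keys_without_ttl is a hypothetical name, and the result is simply keys minus expires per database.

```shell
# Hypothetical helper: per database, print how many keys have no TTL.
keys_without_ttl() {
  tr -d '\r' | awk -F'[:,]' '
    /^db[0-9]+:/ {
      keys = 0; expires = 0
      # Fields after the db name look like name=value; pick the two we need.
      for (i = 2; i <= NF; i++) {
        split($i, kv, "=")
        if (kv[1] == "keys")    keys = kv[2]
        if (kv[1] == "expires") expires = kv[2]
      }
      print $1, keys - expires
    }'
}
```

Usage would look like: redis-cli INFO keyspace | keys_without_ttl. A large number on a database used purely as a cache suggests writers are not setting expiry.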

Slow commands & latency spikes

Symptom. p99 latency spikes; clients time out; everything stalls briefly.

Likely causes (single-threaded — one slow cmd blocks all).

  • O(N) commands on big collections: KEYS *, SMEMBERS, HGETALL, LRANGE 0 -1 on huge keys.
  • FLUSHALL/DEL on a giant key (sync free of memory).
  • Lua scripts that run long.
  • RDB fork / AOF rewrite stalling (see persistence).

Diagnose.

redis-cli SLOWLOG GET 20         # slowest recent commands + args
redis-cli LATENCY DOCTOR         # human-readable latency analysis (needs latency-monitor-threshold > 0)
redis-cli LATENCY HISTORY command  # recorded samples for the "command" event
redis-cli INFO commandstats      # calls + usec_per_call per command
redis-cli --latency              # live round-trip latency
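
INFO commandstats gets long on a busy instance, so ranking it helps. A small sketch, assuming the cmdstat_<name>:calls=...,usec=...,usec_per_call=... line format on stdin; top_commands is an illustrative name, and it ranks by total usec (cumulative time spent), not by per-call cost.

```shell
# Hypothetical helper: rank commands by total microseconds spent.
# $1 = how many rows to show (default 5).
top_commands() {
  limit="${1:-5}"
  tr -d '\r' | awk -F'[:,=]' '
    /^cmdstat_/ {
      name = $1
      sub(/^cmdstat_/, "", name)
      # Fields alternate key, value after the name; find the "usec" pair.
      for (i = 2; i < NF; i += 2)
        if ($i == "usec") print $(i + 1), name
    }' | sort -rn | head -n "$limit"
}
```

Usage would look like: redis-cli INFO commandstats | top_commands 10. Each output row is the total usec followed by the command name.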

Fix.

  • Never run KEYS in prod; use SCAN (cursor-based, so each call does a small, bounded amount of work).
  • Replace HGETALL/SMEMBERS on huge keys with HSCAN/SSCAN or smaller keys.
  • Use UNLINK instead of DEL for big keys (frees async).
  • Keep Lua scripts short; offload heavy work to the app.
KEYS * is a footgun. KEYS * scans the whole keyspace and blocks the single thread; on millions of keys that is a multi-second freeze for every client. Always SCAN.
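
The SCAN and UNLINK advice combines into a pattern delete that avoids both KEYS and DEL. A minimal sketch, assuming redis-cli is on PATH and that key names contain no newlines; unlink_matching and the REDIS_CLI override are names invented here. It issues one UNLINK per key, which is simple but slow for very large matches.

```shell
REDIS_CLI="${REDIS_CLI:-redis-cli}"   # assumption: override to point at a wrapper

# Hypothetical helper: delete every key matching a glob pattern.
# --scan walks the keyspace with SCAN cursors instead of KEYS.
unlink_matching() {
  pattern="$1"
  "$REDIS_CLI" --scan --pattern "$pattern" |
    while IFS= read -r key; do
      # stdin is redirected so the inner call cannot consume the key list.
      "$REDIS_CLI" UNLINK "$key" </dev/null >/dev/null
    done
}
```

Usage would look like: unlink_matching 'tmp:*'. Quote the pattern so the shell does not expand it.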

Blocking operations

Symptom. Connections pile up; latency cliff while one operation runs.

Causes & fix. Big-key deletes (use UNLINK), synchronous SAVE (use BGSAVE), long blocking pops with large timeouts hogging connections, and MONITOR left running (it streams every command — never leave on in prod). Watch blocked_clients in INFO clients.

Persistence (RDB / AOF) issues

Symptom. Periodic latency spikes, write stalls, or "MISCONF Redis is configured to save RDB snapshots, but is currently not able to persist."

Likely causes.

  • RDB fork: BGSAVE forks the process; under heavy writes, copy-on-write can push memory use toward double the dataset size → spike or OOM. The fork call itself also pauses Redis on huge datasets.
  • AOF rewrite: CPU/disk heavy; if appendfsync always, every write waits on disk.
  • Disk full or slow → snapshot fails → Redis may stop accepting writes (stop-writes-on-bgsave-error).

Diagnose.

redis-cli INFO persistence
#  rdb_last_bgsave_status, aof_last_write_status,
#  rdb_changes_since_last_save, aof_rewrite_in_progress
df -h                            # is the disk full?
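
The two status fields above are the quickest signal that persistence is broken. A sketch of an automated check, assuming redis-cli INFO persistence output on stdin; persistence_failures is an illustrative name. It prints any of the two fields that is not "ok" and returns non-zero in that case, so it can gate an alert.

```shell
# Hypothetical helper: report failed RDB/AOF status fields.
persistence_failures() {
  tr -d '\r' | awk -F: '
    ($1 == "rdb_last_bgsave_status" || $1 == "aof_last_write_status") && $2 != "ok" {
      print $1 "=" $2
      bad = 1
    }
    END { exit bad + 0 }'   # exit status 1 if anything was reported
}
```

Usage would look like: redis-cli INFO persistence | persistence_failures || echo "persistence is failing"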

Fix.

  • Free disk; ensure RAM headroom for fork COW (on a write-heavy primary, keep used_memory well under 50% of the node's RAM).
  • Use appendfsync everysec (not always) for balanced durability/latency.
  • Schedule snapshots off-peak; consider persistence on a replica, not the primary.

Replication lag

Symptom. Replica reads return stale data; the gap between master_repl_offset and the replica's offset grows; a failover that promotes a lagging replica loses the writes it had not yet received.

Diagnose.

redis-cli INFO replication
#  master:   connected_slaves, slaveN:...,offset=...,lag=...
#  replica:  master_link_status:up|down, master_last_io_seconds_ago
#  compare master_repl_offset vs slave_repl_offset

Likely causes & fix. Network saturation between primary/replica; replica underpowered (CPU/disk); a big write burst; full resync storms (replica reconnects → full RDB transfer). Fix: give the replica bandwidth/CPU, raise repl-backlog-size so brief disconnects do partial (not full) resync, avoid flapping links.
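
The offset comparison can be done in one pass over the primary's output. A rough sketch, assuming redis-cli INFO replication from the primary on stdin, with slaveN:ip=...,offset=... lines and a master_repl_offset line; replica_byte_lag is a made-up name. The number printed is the gap in bytes of replication stream, not seconds.

```shell
# Hypothetical helper: print "<slaveN> <ip> <bytes behind>" per replica.
replica_byte_lag() {
  tr -d '\r' | awk -F: '
    $1 == "master_repl_offset" { master = $2 }
    /^slave[0-9]+:/ {
      # Take everything after the first colon, then split into name=value pairs.
      rest = substr($0, index($0, ":") + 1)
      n = split(rest, parts, ",")
      for (i = 1; i <= n; i++) {
        split(parts[i], kv, "=")
        if (kv[1] == "ip")     ip[$1] = kv[2]
        if (kv[1] == "offset") off[$1] = kv[2]
      }
      order[++count] = $1
    }
    # master_repl_offset is listed after the replica lines, so compute at the end.
    END {
      for (j = 1; j <= count; j++) {
        id = order[j]
        print id, ip[id], master - off[id]
      }
    }'
}
```

Usage would look like: redis-cli INFO replication | replica_byte_lag. A gap that keeps growing across samples matters more than any single reading.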

Hot keys

Symptom. One node/shard saturated while others idle; uneven latency; single-key throughput ceiling.

Diagnose.

redis-cli --hotkeys                       # needs an LFU maxmemory-policy (allkeys-lfu or volatile-lfu)
redis-cli INFO commandstats               # which commands dominate
# in Cluster: one slot/shard far hotter than others

Fix.

  • Add a client-side / local cache in front of the hot key (read-through).
  • Shard the value (split a counter into N sub-keys, sum on read).
  • Replicate reads to replicas for read-hot keys.
  • Fix key design — a single global key is a single bottleneck.
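
The counter-sharding fix can be sketched as follows, assuming redis-cli on PATH and /dev/urandom for picking a shard; incr_sharded, sum_sharded, SHARDS, and the key layout <base>:<n> are all illustrative choices. Reads use one GET per sub-key rather than MGET so the sketch also works when the sub-keys land in different cluster slots.

```shell
REDIS_CLI="${REDIS_CLI:-redis-cli}"   # assumption: override to point at a wrapper
SHARDS="${SHARDS:-8}"                 # number of sub-keys per counter

# Increment one randomly chosen sub-key of the counter named $1.
incr_sharded() {
  shard=$(( $(od -An -N2 -tu2 /dev/urandom) % SHARDS ))
  "$REDIS_CLI" INCR "$1:$shard"
}

# Read the counter named $1 by summing all of its sub-keys.
sum_sharded() {
  total=0
  i=0
  while [ "$i" -lt "$SHARDS" ]; do
    value=$("$REDIS_CLI" GET "$1:$i" </dev/null)
    total=$(( total + ${value:-0} ))   # a missing sub-key counts as 0
    i=$(( i + 1 ))
  done
  echo "$total"
}
```

Writes spread across SHARDS keys, at the cost of SHARDS round trips per read, so this suits counters that are written far more often than they are read.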

Docker & Kubernetes specifics

  • maxmemory must be set below the container limit. If Redis has no maxmemory and the pod limit is hit first, the kernel OOMKills the whole process (data loss) instead of Redis evicting gracefully. Set maxmemory ~70–80% of the container limit.
  • Disable THP (transparent huge pages) on the node; with THP on, copy-on-write after a fork copies much larger pages, which causes latency spikes and extra memory use. Redis warns about it at startup.
  • Persistence needs a PVC on K8s; emptyDir = data gone on reschedule.
  • Set vm.overcommit_memory=1 on the host; otherwise the BGSAVE fork can fail under memory pressure.
  • Liveness probe via redis-cli PING, but keep timeouts generous so a brief fork pause doesn't trigger a restart loop.
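
The maxmemory sizing rule is simple arithmetic. A minimal sketch, assuming the container limit is known in bytes; suggest_maxmemory is a hypothetical helper, and the 75% default is just the midpoint of the 70–80% range above.

```shell
# Hypothetical helper: print a maxmemory value in bytes.
# $1 = container memory limit in bytes, $2 = percent to use (default 75).
suggest_maxmemory() {
  limit_bytes="$1"
  percent="${2:-75}"
  echo $(( limit_bytes * percent / 100 ))
}
```

For a 4 GiB limit, suggest_maxmemory 4294967296 prints 3221225472, which can be passed to redis-cli CONFIG SET maxmemory. On cgroup v2 hosts the limit is typically readable from /sys/fs/cgroup/memory.max inside the container, but that file contains the word "max" when no limit is set, so check it before doing the arithmetic.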

Quick command reference

redis-cli INFO [memory|persistence|replication|clients|stats]
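redis-cli SLOWLOG GET 20           # slowest recent commands
redis-cli SLOWLOG RESET            # clear the slow log
redis-cli LATENCY DOCTOR           # latency analysis
redis-cli LATENCY RESET            # clear latency samples
redis-cli --bigkeys                # also: --hotkeys, --latency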
redis-cli SCAN 0 COUNT 100         # safe iteration (never KEYS *)
redis-cli UNLINK bigkey            # async delete
redis-cli MEMORY USAGE key         # bytes for one key
redis-cli CLIENT LIST              # who's connected / blocked
© cvam — written in plaintext, served warm