Redis mental model: single-threaded command execution (one slow command blocks everything), all data in RAM, optional persistence (RDB snapshots / AOF log), optional replication (async to replicas). Most Redis incidents are: ran out of RAM, one command blocked the loop, or a replica fell behind. Diagnose with INFO, SLOWLOG, LATENCY, and --bigkeys.
OOM & eviction
Symptom. "OOM command not allowed when used memory > 'maxmemory'", writes failing, or keys silently vanishing.
Likely causes.
- Dataset outgrew maxmemory.
- maxmemory-policy noeviction → writes rejected once full.
- No TTLs: keys accumulate forever.
- Memory fragmentation inflating RSS beyond logical dataset size.
Diagnose.
    redis-cli INFO memory      # used_memory_human, maxmemory_human, maxmemory_policy
                               # mem_fragmentation_ratio (>1.5 = fragmented; <1 = swapping, bad)
    redis-cli --bigkeys        # find the memory hogs
    redis-cli DBSIZE           # key count
    redis-cli INFO keyspace    # keys with/without TTL per db
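The fragmentation thresholds in the comments above can be sketched as a small filter. This is a minimal sketch, not part of Redis: frag_status is a hypothetical helper name, and the sample input is made up. It reads INFO memory text on stdin, so it can be tried without a server.

```shell
# frag_status: classify mem_fragmentation_ratio from `INFO memory` text on stdin.
# Hypothetical helper; thresholds (>1.5, <1) are the rules of thumb from this runbook.
frag_status() {
  # INFO replies use CRLF line endings, so strip the carriage returns first.
  tr -d '\r' | awk -F: '
    $1 == "mem_fragmentation_ratio" {
      if ($2 + 0 > 1.5)      print "fragmented"
      else if ($2 + 0 < 1.0) print "swapping"
      else                   print "ok"
      found = 1
    }
    END { if (!found) print "unknown" }'
}

# Made-up sample input; against a live server: redis-cli INFO memory | frag_status
printf 'used_memory:1000\r\nmem_fragmentation_ratio:1.82\r\n' | frag_status
```

Against a real instance the same filter would sit at the end of a pipe from redis-cli INFO memory.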
Fix.
- Raise maxmemory (within node RAM) or shard.
- Set an eviction policy that matches use: allkeys-lru / allkeys-lfu for a cache; keep noeviction only for a durable store.
- Add TTLs to cache keys (EXPIRE) so they self-clean.
- Kill the offending big keys; restructure (hash fields, smaller values).
With noeviction, Redis rejects writes once full instead of dropping cold keys, which turns a cache into an outage. For a cache, pick allkeys-lru.
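To check whether missing TTLs are the problem, the keys and expires counters from INFO keyspace can be subtracted per database. A minimal sketch, assuming the usual dbN:keys=...,expires=... line format; no_ttl_per_db is a hypothetical helper name and the sample line is made up.

```shell
# no_ttl_per_db: print "<db> <keys without a TTL>" from `INFO keyspace` text on stdin.
# Hypothetical helper; assumes lines shaped like db0:keys=1500,expires=300,avg_ttl=...
no_ttl_per_db() {
  tr -d '\r' | awk -F'[:=,]' '
    /^db[0-9]+:/ {
      # After splitting on ":", "=" and ",": $1=db name, $3=keys, $5=expires.
      printf "%s %d\n", $1, $3 - $5
    }'
}

# Made-up sample input; against a live server: redis-cli INFO keyspace | no_ttl_per_db
printf '# Keyspace\r\ndb0:keys=1500,expires=300,avg_ttl=86400000\r\n' | no_ttl_per_db
```

A count that keeps growing between runs suggests keys are being written without EXPIRE.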
Slow commands & latency spikes
Symptom. p99 latency spikes; clients time out; everything stalls briefly.
Likely causes (single-threaded — one slow cmd blocks all).
- O(N) commands on big collections: KEYS *, SMEMBERS, HGETALL, LRANGE 0 -1 on huge keys.
- FLUSHALL / DEL on a giant key (sync free of memory).
- Lua scripts that run long.
- RDB fork / AOF rewrite stalling (see persistence).
Diagnose.
    redis-cli SLOWLOG GET 20            # slowest recent commands + args
    redis-cli LATENCY DOCTOR            # human-readable latency analysis
    redis-cli LATENCY HISTORY command
    redis-cli INFO commandstats         # calls + usec_per_call per command
    redis-cli --latency                 # live round-trip latency
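The commandstats output gets long on a busy server, so it can help to rank it. The sketch below sorts commands by usec_per_call; slowest_commands is a hypothetical helper name, and the sample lines are invented for illustration.

```shell
# slowest_commands [N]: rank commands by usec_per_call from `INFO commandstats` on stdin.
# Hypothetical helper; prints "<usec_per_call> <command>" for the top N (default 5).
slowest_commands() {
  tr -d '\r' |
    awk -F'[:=,]' '
      /^cmdstat_/ {
        name = $1
        sub(/^cmdstat_/, "", name)
        # Fields after the name alternate key, value; look the key up by name.
        for (i = 2; i < NF; i += 2)
          if ($i == "usec_per_call") print $(i + 1), name
      }' |
    LC_ALL=C sort -rn |
    head -n "${1:-5}"
}

# Made-up sample input; against a live server: redis-cli INFO commandstats | slowest_commands
printf '%s\r\n' \
  'cmdstat_get:calls=100,usec=2000,usec_per_call=20.00' \
  'cmdstat_keys:calls=4,usec=6001,usec_per_call=1500.25' \
  'cmdstat_hgetall:calls=10,usec=3105,usec_per_call=310.50' |
  slowest_commands 2
```

A high usec_per_call on an O(N) command is usually the first place to look; a cheap command with a very high call count is a different problem.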
Fix.
- Never KEYS in prod: use SCAN (cursor-based and incremental, so each call blocks only briefly).
- Replace HGETALL / SMEMBERS on huge keys with HSCAN / SSCAN or smaller keys.
- Use UNLINK instead of DEL for big keys (frees async).
- Keep Lua scripts short; offload heavy work to the app.
KEYS * scans the whole keyspace and blocks the single thread — on millions of
keys that's a multi-second freeze for every client. Always SCAN.
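A SCAN loop that replaces KEYS might look like the sketch below. It assumes redis-cli's non-interactive output (next cursor on the first line, then one key per line) and keys that contain no newlines. scan_keys and the REDIS_CLI variable are hypothetical names introduced here, not part of redis-cli.

```shell
# Assumption: REDIS_CLI names the client command; override it to stub the client in tests.
: "${REDIS_CLI:=redis-cli}"

# scan_keys PATTERN: print every key matching PATTERN, one per line (hypothetical helper).
scan_keys() {
  pattern=${1:-*}
  cursor=0
  while :; do
    # Each SCAN call does a bounded amount of work, so other clients run in between.
    reply=$($REDIS_CLI SCAN "$cursor" MATCH "$pattern" COUNT 100) || return 1
    cursor=$(printf '%s\n' "$reply" | sed -n '1p')
    [ -n "$cursor" ] || return 1
    printf '%s\n' "$reply" | sed '1d'
    # A returned cursor of 0 means the iteration is complete.
    if [ "$cursor" = "0" ]; then break; fi
  done
}

# Usage (needs a reachable server):
#   scan_keys 'session:*' | wc -l
```

SCAN may return the same key more than once, so pipe through sort -u if duplicates matter.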
Blocking operations
Symptom. Connections pile up; latency cliff while one operation runs.
Causes & fix. Big-key deletes (use UNLINK), synchronous
SAVE (use BGSAVE), long blocking pops with large timeouts hogging
connections, and MONITOR left running (it streams every command — never leave on in
prod). Watch blocked_clients in INFO clients.
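To find which connections are responsible, the flags field of CLIENT LIST can be filtered: O marks a client in MONITOR mode and b a client waiting in a blocking operation. A minimal sketch; suspect_clients is a hypothetical helper name and the sample lines are invented.

```shell
# suspect_clients: from `CLIENT LIST` text on stdin, print "<id> <addr> <reason>"
# for clients in MONITOR mode (flag O) or waiting in a blocking call (flag b).
suspect_clients() {
  tr -d '\r' | awk '{
    id = ""; addr = ""; flags = ""
    # Each line is a run of space-separated key=value pairs.
    for (i = 1; i <= NF; i++) {
      split($i, kv, "=")
      if (kv[1] == "id") id = kv[2]
      else if (kv[1] == "addr") addr = kv[2]
      else if (kv[1] == "flags") flags = kv[2]
    }
    if (flags ~ /O/) print id, addr, "monitor"
    else if (flags ~ /b/) print id, addr, "blocked"
  }'
}

# Made-up sample input; against a live server: redis-cli CLIENT LIST | suspect_clients
printf '%s\n' \
  'id=7 addr=10.0.0.5:4242 name= age=100 idle=0 flags=O db=0 cmd=monitor' \
  'id=8 addr=10.0.0.6:4243 name= age=5 idle=5 flags=b db=0 cmd=blpop' \
  'id=9 addr=10.0.0.7:4244 name= age=1 idle=0 flags=N db=0 cmd=get' |
  suspect_clients
```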
Persistence (RDB / AOF) issues
Symptom. Periodic latency spikes, write stalls, or "MISCONF Redis is configured to save RDB snapshots, but is currently not able to persist."
Likely causes.
- RDB fork: BGSAVE forks the process; under heavy writes copy-on-write can up to double memory use → spike or OOM. The fork itself pauses Redis on huge datasets.
- AOF rewrite: CPU/disk heavy; with appendfsync always, every write waits on disk.
- Disk full or slow → snapshot fails → Redis may stop accepting writes (stop-writes-on-bgsave-error).
Diagnose.
    redis-cli INFO persistence   # rdb_last_bgsave_status, aof_last_write_status,
                                 # rdb_changes_since_last_save, aof_rewrite_in_progress
    df -h                        # is the disk full?
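For monitoring, the status fields can be reduced to "anything that is not ok". The sketch below does that on INFO persistence text from stdin; persistence_problems is a hypothetical helper name and the sample input is made up.

```shell
# persistence_problems: print every persistence status field whose value is not "ok".
# Hypothetical helper; prints nothing when all checked fields report ok.
persistence_problems() {
  tr -d '\r' | awk -F: '
    ($1 == "rdb_last_bgsave_status" ||
     $1 == "aof_last_bgrewrite_status" ||
     $1 == "aof_last_write_status") && $2 != "ok" {
      print $1 "=" $2
    }'
}

# Made-up sample input; against a live server: redis-cli INFO persistence | persistence_problems
printf 'rdb_last_bgsave_status:err\r\naof_last_write_status:ok\r\n' | persistence_problems
```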
Fix.
- Free disk; ensure RAM headroom for fork COW (keep used_memory well under 50% of available RAM if write-heavy).
- Use appendfsync everysec (not always) for balanced durability/latency.
- Schedule snapshots off-peak; consider persistence on a replica, not the primary.
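The 50% rule of thumb is just worst-case copy-on-write arithmetic: if every page is rewritten during the snapshot, the fork needs roughly twice used_memory. A minimal sketch of that check; fork_headroom is a hypothetical helper name and the byte counts are illustrative.

```shell
# fork_headroom USED_BYTES TOTAL_BYTES: apply the "under 50%" rule of thumb.
# Hypothetical helper; prints "ok" if 2x used_memory fits in RAM, else "tight".
fork_headroom() {
  used=$1
  total=$2
  # Worst case, copy-on-write duplicates every page touched during the snapshot.
  if [ $(( used * 2 )) -le "$total" ]; then
    echo "ok"
  else
    echo "tight"
  fi
}

# Illustrative numbers: 6 GiB used on a 16 GiB node.
fork_headroom 6442450944 17179869184
```

Read-heavy workloads touch far fewer pages during a snapshot, so "tight" is a prompt to look closer, not proof of an imminent OOM.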
Replication lag
Symptom. Replica reads return stale data; the gap between master_repl_offset and the replica's offset keeps growing; a failover that promotes a lagging replica loses the writes it had not yet received.
Diagnose.
    redis-cli INFO replication
    # master:  connected_slaves, slaveN:...,offset=...,lag=...
    # replica: master_link_status:up|down, master_last_io_seconds_ago
    # compare master_repl_offset vs slave_repl_offset
Likely causes & fix. Network saturation between primary/replica; replica
underpowered (CPU/disk); a big write burst; full resync storms (replica reconnects → full RDB
transfer). Fix: give the replica bandwidth/CPU, raise repl-backlog-size so brief
disconnects do partial (not full) resync, avoid flapping links.
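The offset comparison can be done in one pass over the primary's INFO replication output. A minimal sketch, assuming the slaveN:ip=...,offset=...,lag=... line format shown above; replica_lag_bytes is a hypothetical helper name and the sample input is made up.

```shell
# replica_lag_bytes: print "<replica> <bytes behind>" from the primary's
# `INFO replication` text on stdin. Hypothetical helper.
replica_lag_bytes() {
  tr -d '\r' | awk '
    /^master_repl_offset:/ { split($0, m, ":"); master = m[2] }
    /^slave[0-9]+:/ {
      name = $0; sub(/:.*/, "", name)
      off = $0;  sub(/.*offset=/, "", off); sub(/,.*/, "", off)
      replicas[name] = off
    }
    # master_repl_offset is listed after the replica lines, so subtract at the end.
    END { for (r in replicas) printf "%s %d\n", r, master - replicas[r] }' |
    sort
}

# Made-up sample input; against a live primary: redis-cli INFO replication | replica_lag_bytes
printf '%s\r\n' \
  'role:master' \
  'connected_slaves:1' \
  'slave0:ip=10.0.0.2,port=6379,state=online,offset=123000,lag=1' \
  'master_repl_offset:123456' |
  replica_lag_bytes
```

A gap that stays near repl-backlog-size is the warning sign: a disconnect at that point forces a full resync.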
Hot keys
Symptom. One node/shard saturated while others idle; uneven latency; single-key throughput ceiling.
Diagnose.
    redis-cli --hotkeys            # needs an LFU maxmemory-policy (allkeys-lfu / volatile-lfu)
    redis-cli INFO commandstats    # which commands dominate
    # in Cluster: one slot/shard far hotter than others
Fix.
- Add a client-side / local cache in front of the hot key (read-through).
- Shard the value (split a counter into N sub-keys, sum on read).
- Replicate reads to replicas for read-hot keys.
- Fix key design — a single global key is a single bottleneck.
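The "split a counter into N sub-keys" fix can be sketched with three small helpers. All names here (SHARDS, shard_key, all_shard_keys, sum_lines, the pageviews key) are hypothetical, and the writer id is assumed to be any integer the application already has, such as a worker number.

```shell
SHARDS=4   # assumed shard count; choose it once and keep it stable

# shard_key NAME WRITER_ID: the sub-key a given writer increments.
shard_key() { printf '%s:%d\n' "$1" $(( $2 % SHARDS )); }

# all_shard_keys NAME: every sub-key, for the read path.
all_shard_keys() {
  i=0
  while [ "$i" -lt "$SHARDS" ]; do
    printf '%s:%d\n' "$1" "$i"
    i=$(( i + 1 ))
  done
}

# sum_lines: add up one number per line on stdin; empty lines count as 0.
sum_lines() { awk '{ total += $1 } END { print total + 0 }'; }

# Usage (needs a reachable server; WORKER_ID is an assumed integer):
#   redis-cli INCR "$(shard_key pageviews "$WORKER_ID")"
#   for k in $(all_shard_keys pageviews); do redis-cli GET "$k"; done | sum_lines
```

Reads cost N round trips instead of one, which is the trade for spreading writes; in Cluster the sub-keys also land on different slots, so each is fetched with its own GET.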
Docker & Kubernetes specifics
- maxmemory must be set below the container limit. If Redis has no maxmemory and the pod limit is hit first, the kernel OOMKills the whole process (data loss) instead of Redis evicting gracefully. Set maxmemory ~70–80% of the container limit.
- Disable THP (transparent huge pages) on the node: it causes latency spikes during fork. Redis warns about it at startup.
- Persistence needs a PVC on K8s; emptyDir = data gone on reschedule.
- vm.overcommit_memory=1 on the host or BGSAVE fork can fail under memory pressure.
- Liveness probe via redis-cli PING, but keep timeouts generous so a brief fork pause doesn't trigger a restart loop.
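The 70–80% guideline is simple arithmetic on the container limit. A minimal sketch; maxmemory_for_limit is a hypothetical helper name, and 75% is just a midpoint of the range above.

```shell
# maxmemory_for_limit LIMIT_BYTES [PERCENT]: maxmemory value for a container limit.
# Hypothetical helper; PERCENT defaults to 75, inside the 70-80% guideline.
maxmemory_for_limit() {
  limit=$1
  pct=${2:-75}
  # Multiply first so integer division does not throw away precision.
  echo $(( limit * pct / 100 ))
}

# Illustrative: a 1 GiB container limit.
maxmemory_for_limit 1073741824

# Usage inside a cgroup v2 container (memory.max reads "max" when no limit is set,
# so check for that before doing arithmetic):
#   redis-cli CONFIG SET maxmemory "$(maxmemory_for_limit "$(cat /sys/fs/cgroup/memory.max)")"
```

The remaining 20–30% covers replication and client buffers, fragmentation, and fork copy-on-write.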
Quick command reference
    redis-cli INFO [memory|persistence|replication|clients|stats]
    redis-cli SLOWLOG GET 20 / SLOWLOG RESET
    redis-cli LATENCY DOCTOR / LATENCY RESET
    redis-cli --bigkeys / --hotkeys / --latency
    redis-cli SCAN 0 COUNT 100      # safe iteration (never KEYS *)
    redis-cli UNLINK bigkey         # async delete
    redis-cli MEMORY USAGE key      # bytes for one key
    redis-cli CLIENT LIST           # who's connected / blocked