RabbitMQ mental model: producers publish to an exchange, which routes to queues by binding rules; consumers subscribe and ack each message when done. Unacked messages stay "in flight": the broker holds them until they are acked, and redelivers them only if the consumer nacks/rejects with requeue or its channel closes. When memory use or free disk crosses a threshold, RabbitMQ raises an alarm and blocks publishers. Most incidents come down to one of three things: a queue grows faster than it drains, acks never come, or an alarm froze publishing.
Queue buildup
Symptom. messages_ready climbing; consumers can't keep up; memory rising toward an alarm.
Likely causes.
- Publish rate > consume rate (too few/slow consumers).
- Prefetch set too low, leaving consumers idle between round-trips, or no consumers at all.
- Slow per-message processing (downstream I/O).
- A poison message stuck at the head with manual ack blocking progress.
Diagnose.
rabbitmqctl list_queues name messages messages_ready messages_unacknowledged consumers
# growing messages_ready + low consumers = under-consumed
# high messages_unacknowledged = consumers holding, not acking (see below)
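The reading of that output can be automated. The following is a minimal sketch, assuming the command's default tab-separated output is passed in as text; the QueueStats type, the classify function and the backlog threshold of 1000 are hypothetical names and values chosen for illustration, not part of RabbitMQ.

```python
from typing import NamedTuple


class QueueStats(NamedTuple):
    name: str
    messages: int
    ready: int
    unacked: int
    consumers: int


def parse_list_queues(output: str) -> list[QueueStats]:
    """Parse rows of: name messages messages_ready messages_unacknowledged consumers."""
    rows = []
    for line in output.splitlines():
        fields = line.split("\t")
        # Skip banners and the header row: keep only lines with five
        # tab-separated fields whose last four are plain integers.
        if len(fields) != 5 or not all(f.isdigit() for f in fields[1:]):
            continue
        rows.append(QueueStats(fields[0], *map(int, fields[1:])))
    return rows


def classify(q: QueueStats, backlog: int = 1000) -> str:
    """Rough triage; the backlog threshold is an arbitrary illustrative value."""
    if q.consumers == 0 and q.messages > 0:
        return "no consumers"
    if q.unacked >= backlog:
        return "consumers holding unacked messages"
    if q.ready >= backlog:
        return "under-consumed"
    return "ok"
```

Feeding it the captured command output and printing classify(q) per queue gives a quick first pass; tune the threshold to the queue's normal depth.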
Fix.
- Add/scale consumers; raise consumer prefetch (basic.qos) so each fetches a batch instead of one-at-a-time.
- Speed up processing or make it async.
- Cap queues with a max length / TTL or use quorum queues for predictable behaviour under load.
- Set a dead-letter target so poison messages move aside instead of blocking.
Tip: set basic.qos(prefetch) to a sane batch (e.g. 10–100) for balanced, fair dispatch.
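The fixes above (bounded queue, dead-letter target, explicit prefetch) might be wired together roughly as follows. This is a sketch, assuming a pika-style channel object that exposes queue_declare, basic_qos and basic_consume; the function names, the default prefetch of 50 and the example values are assumptions for illustration.

```python
def bounded_queue_arguments(max_length, message_ttl_ms, dead_letter_exchange, quorum=True):
    """Build queue arguments that cap growth and give poison messages an exit."""
    args = {
        "x-max-length": max_length,                      # cap on queued messages
        "x-message-ttl": message_ttl_ms,                 # per-message TTL in milliseconds
        "x-dead-letter-exchange": dead_letter_exchange,  # where rejected/expired messages go
    }
    if quorum:
        args["x-queue-type"] = "quorum"
    return args


def declare_and_consume(channel, queue, on_message, arguments, prefetch=50):
    """Declare the queue, set prefetch, and start a manual-ack consumer."""
    channel.queue_declare(queue=queue, durable=True, arguments=arguments)
    # Limit how many unacked deliveries this consumer may hold at once.
    channel.basic_qos(prefetch_count=prefetch)
    channel.basic_consume(queue=queue, on_message_callback=on_message, auto_ack=False)
```

Note that redeclaring an existing queue with different arguments is rejected by the broker, so settings like these are usually applied to new queues or through policies.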
Unacked messages piling up
Symptom. High messages_unacknowledged; messages "stuck" in flight; redelivery on consumer restart; throughput stalls.
Likely causes.
- Consumer in manual-ack mode that crashes/hangs before acking.
- Bug: forgot to ack, or acking the wrong delivery tag.
- Prefetch too high — consumer holds thousands unacked while processing slowly.
- Long processing exceeding a consumer/delivery timeout → channel closed, messages requeued.
Diagnose.
rabbitmqctl list_queues name messages_unacknowledged consumers
rabbitmqctl list_consumers # which channel holds them, prefetch
rabbitmqctl list_channels name messages_unacknowledged prefetch_count
Fix.
- Ensure every path acks (success) or nacks/rejects (failure) — wrap in try/finally.
- Lower prefetch so a stuck consumer holds fewer messages.
- Use basic.nack with requeue/dead-letter for failures rather than silently dropping.
- Stay aware of the consumer delivery timeout (the broker's consumer_timeout setting) so long jobs don't get force-requeued mid-process.
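The "every path acks or nacks" rule can be enforced in one place by wrapping the message handler. Below is a minimal sketch, assuming a pika-style callback signature (channel, method, properties, body); make_handler and process are hypothetical names.

```python
import logging

log = logging.getLogger("consumer")


def make_handler(process):
    """Wrap process(body) so every delivery is acked or nacked exactly once."""

    def on_message(channel, method, properties, body):
        try:
            process(body)
        except Exception:
            log.exception("processing failed, delivery_tag=%s", method.delivery_tag)
            # requeue=False sends the message to the dead-letter exchange if one
            # is configured; without one the broker discards it.
            channel.basic_nack(delivery_tag=method.delivery_tag, requeue=False)
        else:
            # Ack only after processing finished without raising.
            channel.basic_ack(delivery_tag=method.delivery_tag)

    return on_message
```

The handler always uses the delivery tag of the message it was called with, which also guards against acking the wrong tag.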
Memory & disk alarms (flow control)
Symptom. Publishers suddenly blocked / hung; connections in blocking/blocked state; "memory resource limit alarm set".
Likely causes.
- Memory use crossed vm_memory_high_watermark (default ~40% of RAM) → publishers blocked until it drains.
- Free disk dropped below disk_free_limit → publishers blocked to protect persistence.
- Root cause is usually queue buildup or huge unacked sets eating memory.
Diagnose.
rabbitmqctl status | grep -A5 alarms
rabbitmqctl status | grep -A10 memory # what's using it (queues, conns, binaries)
rabbitmqctl list_connections name state # 'blocked' = hit an alarm
df -h # disk_free_limit?
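To pick the affected publishers out of a long connection list, the name/state output can be filtered. This is a sketch only, assuming the tab-separated output of rabbitmqctl list_connections name state is supplied as text; blocked_connections is a hypothetical helper name.

```python
def blocked_connections(output: str) -> list[tuple[str, str]]:
    """Return (name, state) for connections reported as blocking or blocked."""
    hits = []
    for line in output.splitlines():
        # The state is the last tab-separated column; the name may contain spaces.
        name, sep, state = line.rpartition("\t")
        if sep and state.strip() in {"blocking", "blocked"}:
            hits.append((name, state.strip()))
    return hits
```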
Fix.
- Drain the queues (scale consumers) — the alarm clears when usage drops below the watermark.
- Free disk; raise disk_free_limit only if the value is mis-set for the node size.
- Reduce memory pressure: shorter queues, lower prefetch, lazy/quorum queues that page to disk.
- Don't just raise vm_memory_high_watermark to 0.8 blindly: you trade the safety alarm for an OOMKill.
Tip: check rabbitmqctl status for alarms first.
Dead-letter loops
Symptom. A message bounces endlessly between a queue and its dead-letter queue; CPU/throughput burned on redelivery; DLQ never drains.
Likely causes.
- DLQ dead-letters back to the original queue (or a cycle of exchanges) with no exit condition.
- Consumer nacks with requeue=true on a permanently-bad message → infinite immediate redelivery.
- Message TTL on the DLQ routes it straight back, forever.
Diagnose. Inspect the x-death header (counts each dead-letter hop); a large count = looping. Map the DLX bindings to find the cycle.
Fix.
- DLQ should be a terminal sink (no dead-letter back to source). Drain it manually / with a separate repair consumer.
- For poison messages, nack with requeue=false so they dead-letter once, not loop.
- Add a retry cap: count x-death and route to a parking queue after N attempts.
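The retry cap can be sketched as a pure function over the message headers. This assumes the headers arrive as a dict whose x-death entry is a list of tables, each carrying queue, reason and count fields, as AMQP 0-9-1 clients typically expose it; death_count, next_hop, the "parking"/"retry" labels and the default cap of 5 are hypothetical.

```python
def death_count(headers, queue=None):
    """Sum 'count' over x-death entries, optionally only those for one queue."""
    entries = (headers or {}).get("x-death") or []
    return sum(
        int(entry.get("count", 0))
        for entry in entries
        if queue is None or entry.get("queue") == queue
    )


def next_hop(headers, max_attempts=5, queue=None):
    """Decide whether a failed message gets another retry or is parked."""
    return "parking" if death_count(headers, queue) >= max_attempts else "retry"
```

A consumer would call next_hop on failure and publish to the parking queue (then ack) once the cap is reached, instead of dead-lettering again.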
Connection / channel churn
Symptom. High connection/channel create-close rate; CPU on the node climbs; too_many_channels/file-descriptor errors.
Causes & fix. App opening a connection (or channel) per message/request instead of pooling. Connections and channels are meant to be long-lived: open once, reuse, one channel per thread. Watch the rabbitmqctl list_connections count and FD/socket limits. Fix the client to pool, not reconnect per operation.
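One way to get "open once, reuse, one channel per thread" is a small thread-local holder. The sketch below assumes the caller supplies a factory that opens a channel (with pika, where connections are not thread-safe, the factory would typically open a per-thread connection too); PerThreadChannel and open_channel are hypothetical names.

```python
import threading


class PerThreadChannel:
    """Lazily open one channel per thread and reuse it on later calls."""

    def __init__(self, open_channel):
        self._open_channel = open_channel  # hypothetical factory returning a channel
        self._local = threading.local()

    def get(self):
        channel = getattr(self._local, "channel", None)
        # Reopen only if this thread has no channel yet or it reports itself closed.
        if channel is None or not getattr(channel, "is_open", True):
            channel = self._open_channel()
            self._local.channel = channel
        return channel
```

Publishing code then calls get() per operation instead of opening a new channel, so the create-close rate drops to roughly one channel per worker thread.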
Docker & Kubernetes specifics
- Watermark is a fraction of detected RAM. In a container, RabbitMQ may see host RAM, not the cgroup limit → it thinks it has more memory than the pod allows → pod OOMKilled before the alarm fires. Set an absolute vm_memory_high_watermark.absolute (e.g. 80% of the pod limit) instead of the default fraction.
- Persistent storage: StatefulSet + PVC. emptyDir loses durable queues and the Mnesia/quorum state on reschedule.
- Clustering needs stable identities: use the peer-discovery-k8s plugin and a headless Service; nodes find each other by stable DNS. Prefer quorum queues over classic mirrored queues for HA.
- File descriptors: raise the FD ulimit; RabbitMQ uses one+ per connection.
- Liveness probe: rabbitmq-diagnostics check_running / ping, with generous timeouts so a busy node isn't killed mid-load.
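The absolute watermark from the first bullet is simple arithmetic on the pod's memory limit. Here is a minimal sketch that renders the corresponding rabbitmq.conf line in bytes; the helper name is hypothetical and the 0.8 default simply mirrors the 80% example above, not a recommendation for every workload.

```python
def absolute_watermark_line(pod_limit_bytes: int, fraction: float = 0.8) -> str:
    """Render a rabbitmq.conf line setting the watermark to a share of the pod limit."""
    if not 0 < fraction < 1:
        raise ValueError("fraction must be between 0 and 1")
    # Truncate to whole bytes; the setting accepts a plain byte count.
    return f"vm_memory_high_watermark.absolute = {int(pod_limit_bytes * fraction)}"
```

For a 2 GiB pod limit, absolute_watermark_line(2 * 1024**3) yields a value of 1717986918 bytes, leaving the remaining headroom for the Erlang VM and the OS before the cgroup limit is hit.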
Quick command reference
rabbitmqctl list_queues name messages messages_ready messages_unacknowledged consumers
rabbitmqctl list_consumers
rabbitmqctl list_connections name state
rabbitmqctl list_channels name prefetch_count messages_unacknowledged
rabbitmqctl status # alarms, memory breakdown, disk
rabbitmq-diagnostics observer # live top-style view
rabbitmq-diagnostics memory_breakdown
rabbitmqctl purge_queue NAME # drain a queue (careful)