Debugging RabbitMQ — cvam.sight

RabbitMQ mental model: producers publish to an exchange, which routes to queues by binding rules; consumers subscribe and ack each message when done. Unacked messages stay "in flight": the broker holds them and redelivers only after a nack/reject with requeue or when the consumer's channel closes. When memory use or free disk crosses a threshold, RabbitMQ raises an alarm and blocks publishers. Most incidents come down to one of three things: a queue growing faster than it drains, acks that never arrive, or an alarm that froze publishing.

Queue buildup

Symptom. messages_ready climbing; consumers can't keep up; memory rising toward an alarm.

Likely causes.

Publish rate > consume rate (too few/slow consumers).
Low prefetch leaving consumers idle between round-trips, or no consumers at all.
Slow per-message processing (downstream I/O).
A poison message stuck at the head of the queue: under manual ack it keeps failing and being requeued, blocking everything behind it.

Diagnose.

rabbitmqctl list_queues name messages messages_ready messages_unacknowledged consumers
# growing messages_ready + low consumers = under-consumed
# high messages_unacknowledged = consumers holding, not acking (see below)
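
The two rules of thumb in the comments above can be sketched as a small classifier. This is a minimal sketch, assuming the tab-separated output of `rabbitmqctl list_queues name messages_ready messages_unacknowledged consumers`; the `QueueStats` type, the function names, and the backlog threshold of 1000 are hypothetical and should be adapted to your traffic.

```python
from typing import List, NamedTuple


class QueueStats(NamedTuple):
    name: str
    ready: int
    unacked: int
    consumers: int


def parse_list_queues(text: str) -> List[QueueStats]:
    """Parse rows of: name, messages_ready, messages_unacknowledged, consumers."""
    rows = []
    for line in text.splitlines():
        parts = line.split("\t")
        # Skip banners, header rows, and blank lines: keep only rows
        # whose last three columns are integers.
        if len(parts) != 4 or not all(p.strip().isdigit() for p in parts[1:]):
            continue
        rows.append(QueueStats(parts[0], *(int(p) for p in parts[1:])))
    return rows


def classify(q: QueueStats, backlog: int = 1000) -> str:
    """Apply the rules of thumb; `backlog` is an assumed threshold."""
    if q.ready >= backlog and q.consumers == 0:
        return "no consumers"
    if q.unacked >= backlog:
        return "consumers holding unacked"
    if q.ready >= backlog:
        return "under-consumed"
    return "ok"
```

A single snapshot only shows depth, not growth; run it twice a minute apart and compare `ready` to see whether the queue is actually climbing.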

Fix.

Add/scale consumers; raise consumer prefetch (basic.qos) so each fetches a batch instead of one-at-a-time.
Speed up processing or make it async.
Cap queues with a max length / TTL or use quorum queues for predictable behaviour under load.
Set a dead-letter target so poison messages move aside instead of blocking.
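
Length caps, TTLs, and dead-letter targets are set through queue arguments at declaration time. The sketch below builds such an argument table; the `x-` keys are RabbitMQ's optional queue arguments, while the function name and its parameters are hypothetical.

```python
from typing import Dict, Optional, Union


def bounded_queue_arguments(
    max_length: int,
    message_ttl_ms: Optional[int] = None,
    dead_letter_exchange: Optional[str] = None,
    quorum: bool = False,
) -> Dict[str, Union[int, str]]:
    """Build the optional-arguments table for a capped queue."""
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    args: Dict[str, Union[int, str]] = {"x-max-length": max_length}
    if message_ttl_ms is not None:
        args["x-message-ttl"] = message_ttl_ms  # milliseconds
    if dead_letter_exchange:
        # Messages dropped by the cap or expired by the TTL are routed here.
        args["x-dead-letter-exchange"] = dead_letter_exchange
    if quorum:
        args["x-queue-type"] = "quorum"
    return args


# With the pika client this would be passed as, for example:
#   channel.queue_declare(queue="orders", durable=True, arguments=args)
```

Arguments are fixed once a queue exists; redeclaring with different values fails, so changing a cap on a live queue is usually done with a policy instead.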

prefetch = throughput knob. Default unlimited prefetch can dump the whole queue into one greedy consumer; a prefetch of 1 starves throughput because every message waits on an ack round-trip. Tune basic.qos(prefetch) to a sane batch (e.g. 10–100) for balanced, fair dispatch.
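
As a rough illustration of setting the knob, the sketch below applies a prefetch before consuming. It assumes a channel object with pika-style `basic_qos` and `basic_consume` methods; `start_consumer` and the default of 50 are hypothetical.

```python
def start_consumer(channel, queue, on_message, prefetch=50):
    """Set a bounded prefetch, then subscribe in manual-ack mode.

    `channel` is assumed to expose pika-style basic_qos/basic_consume.
    """
    if prefetch < 1:
        # 0 means "unlimited" in AMQP, which is what we are trying to avoid.
        raise ValueError("prefetch must be at least 1")
    channel.basic_qos(prefetch_count=prefetch)
    return channel.basic_consume(
        queue=queue,
        on_message_callback=on_message,
        auto_ack=False,  # prefetch only limits delivery when acks are manual
    )
```

The qos call comes first on purpose: a consumer registered before the limit is set can already have been sent a large batch.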

Unacked messages piling up

Symptom. High messages_unacknowledged; messages "stuck" in flight; redelivery on consumer restart; throughput stalls.

Likely causes.

Consumer in manual-ack mode that crashes/hangs before acking.
Bug: forgot to ack, or acking the wrong delivery tag.
Prefetch too high — consumer holds thousands unacked while processing slowly.
Long processing exceeding a consumer/delivery timeout → channel closed, messages requeued.

Diagnose.

rabbitmqctl list_queues name messages_unacknowledged consumers
rabbitmqctl list_consumers           # which channel holds them, prefetch
rabbitmqctl list_channels name messages_unacknowledged prefetch_count

Fix.

Ensure every path acks (success) or nacks/rejects (failure) — wrap in try/finally.
Lower prefetch so a stuck consumer holds fewer messages.
Use basic.nack with requeue/dead-letter for failures rather than silently dropping.
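Keep long jobs under the broker's delivery acknowledgement timeout (consumer_timeout, 30 minutes by default in recent releases), or raise it deliberately, so they don't get force-requeued mid-process.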
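
The "every path acks or nacks" rule can be sketched as a wrapper around the handler. This is a minimal sketch, assuming a channel with pika-style `basic_ack` and `basic_nack`; `PermanentError` and `handle_delivery` are hypothetical names, and the transient/permanent split is one possible policy rather than the only one.

```python
class PermanentError(Exception):
    """Hypothetical marker for failures that a retry cannot fix."""


def handle_delivery(channel, delivery_tag, body, process):
    """Run `process(body)` and settle the delivery on every code path."""
    try:
        process(body)
    except PermanentError:
        # Bad message: do not requeue, let the dead-letter target take it.
        channel.basic_nack(delivery_tag=delivery_tag, requeue=False)
        return "dead-lettered"
    except Exception:
        # Assumed transient (e.g. downstream outage): give it another try.
        channel.basic_nack(delivery_tag=delivery_tag, requeue=True)
        return "requeued"
    channel.basic_ack(delivery_tag=delivery_tag)
    return "acked"
```

In a pika callback the tag comes from `method.delivery_tag`. Acking with a tag from a different channel is the "wrong delivery tag" bug listed above and closes the channel.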

unacked ≠ delivered-and-done. Messages move to unacked the moment they're dispatched, before processing. If the consumer dies without acking, they're redelivered (at-least-once), so design idempotent handlers.
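
One simple way to make a handler idempotent is to remember which message IDs have already been processed. The sketch below is illustrative only: it assumes publishers set a unique `message_id` property, `IdempotentHandler` is a hypothetical name, and the in-memory set stands in for a durable store (database, Redis) that a real consumer would need to survive restarts.

```python
class IdempotentHandler:
    """Skip messages whose ID was already processed successfully."""

    def __init__(self, process):
        self._process = process
        self._seen = set()  # stand-in for a durable store

    def __call__(self, message_id, body):
        if message_id in self._seen:
            return False  # redelivery of finished work: safe to ack and skip
        self._process(body)
        # Record the ID only after success, so a crash mid-processing
        # still allows the redelivered copy to be handled.
        self._seen.add(message_id)
        return True
```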

Memory & disk alarms (flow control)

Symptom. Publishers suddenly blocked / hung; connections in blocking/blocked state; "memory resource limit alarm set".

Likely causes.

Memory use crossed vm_memory_high_watermark (default 0.4, about 40% of RAM, on 3.x; check the default for your release) → publishers blocked until it drains.
Free disk dropped below disk_free_limit → publishers blocked to protect persistence.
Root cause is usually queue buildup or huge unacked sets eating memory.

Diagnose.

rabbitmqctl status | grep -i -A5 alarms         # -i: the section heading is capitalised in newer releases
rabbitmqctl status | grep -i -A10 memory        # what's using it (queues, conns, binaries)
rabbitmqctl list_connections name state         # 'blocking'/'blocked' = hit an alarm
df -h                                           # free disk vs. disk_free_limit
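
The same checks can be run against the management HTTP API. The sketch below takes one decoded node object as returned by GET /api/nodes (management plugin enabled) and summarises any active alarm; the field names are those the API reports, while `alarm_report` is a hypothetical helper and fetching the JSON is left out.

```python
from typing import List


def alarm_report(node: dict) -> List[str]:
    """Summarise active alarms for one entry of the /api/nodes response."""
    problems = []
    if node.get("mem_alarm"):
        problems.append(
            "memory alarm: %d bytes used, limit %d"
            % (node.get("mem_used", 0), node.get("mem_limit", 0))
        )
    if node.get("disk_free_alarm"):
        problems.append(
            "disk alarm: %d bytes free, limit %d"
            % (node.get("disk_free", 0), node.get("disk_free_limit", 0))
        )
    return problems
```

An empty list means publishers are not being blocked by that node; a non-empty one points to which resource to drain or free first.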

Fix.

Drain the queues (scale consumers) — the alarm clears when usage drops below the watermark.
Free disk; raise disk_free_limit only if the value is mis-set for the node size.
Reduce memory pressure: shorter queues, lower prefetch, lazy/quorum queues that page to disk.
Don't just raise vm_memory_high_watermark to 0.8 blindly — you trade the safety alarm for an OOMKill.

"RabbitMQ hung" = usually an alarm. Publishers "freezing" with no error is almost always flow control from a memory/disk alarm, not a crash. Check the alarms section of rabbitmqctl status first.

Dead-letter loops

Symptom. A message bounces endlessly between a queue and its dead-letter queue; CPU/throughput burned on redelivery; DLQ never drains.

Likely causes.

DLQ dead-letters back to the original queue (or a cycle of exchanges) with no exit condition.
Consumer nacks with requeue=true on a permanently-bad message → infinite immediate redelivery.
Message TTL on the DLQ routes it straight back, forever.

Diagnose. Inspect the x-death header, which records a count per queue and reason for each dead-letter event; a large or ever-growing count means the message is looping. Map the DLX bindings to find the cycle.

Fix.

DLQ should be a terminal sink (no dead-letter back to source). Drain it manually / with a separate repair consumer.
For poison messages, nack with requeue=false so they dead-letter once, not loop.
Add a retry cap: count x-death and route to a parking queue after N attempts.
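
The retry cap can be sketched as a small routing decision on the x-death header. The header layout (a list of entries with `queue`, `reason`, and `count` fields) is what RabbitMQ writes; the function names, the exchange names `retry` and `parking`, and the limit of 5 are hypothetical.

```python
def death_count(headers, queue):
    """Total times the message was dead-lettered from `queue`."""
    total = 0
    # `headers` can be None when the message has never been dead-lettered.
    for entry in (headers or {}).get("x-death", []):
        if entry.get("queue") == queue:
            total += int(entry.get("count", 0))
    return total


def next_exchange(headers, queue, max_attempts=5,
                  retry_exchange="retry", parking_exchange="parking"):
    """Pick where a failed message goes: another retry, or the parking queue."""
    if death_count(headers, queue) >= max_attempts:
        return parking_exchange
    return retry_exchange
```

The parking queue should have no dead-letter target of its own, which is what makes it a terminal sink.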

Connection / channel churn

Symptom. High connection/channel create-close rate; CPU on the node climbs; too_many_channels/file-descriptor errors.

Causes & fix. The app opens a connection (or channel) per message/request instead of pooling. Connections and channels are meant to be long-lived: open once, reuse, one channel per thread. Watch the number of connections reported by rabbitmqctl list_connections and the node's FD/socket limits. Fix the client to pool, not reconnect per operation.

connect-per-message kills RabbitMQ. TLS + connection setup per message can cost more than the message itself and exhausts file descriptors. Reuse one connection, a channel per worker thread.
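
The "one connection, a channel per worker thread" pattern can be sketched as follows. `ChannelPool` and the `connect` factory are hypothetical; the sketch assumes the client's connection object is safe to share across threads and exposes a `channel()` method. Some clients are not thread-safe at the connection level (pika's BlockingConnection, for one), in which case keep one long-lived connection per thread instead.

```python
import threading


class ChannelPool:
    """One shared, lazily opened connection; one cached channel per thread."""

    def __init__(self, connect):
        self._connect = connect  # callable that opens a new connection
        self._connection = None
        self._lock = threading.Lock()
        self._local = threading.local()

    def connection(self):
        with self._lock:
            if self._connection is None:
                self._connection = self._connect()  # opened exactly once
            return self._connection

    def channel(self):
        ch = getattr(self._local, "channel", None)
        if ch is None:
            ch = self.connection().channel()
            self._local.channel = ch  # reused by later calls on this thread
        return ch
```

Reconnect handling is deliberately left out; a production pool would also need to drop the cached objects when the connection fails.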

Docker & Kubernetes specifics

Watermark is a fraction of detected RAM. In a container, RabbitMQ may see host RAM, not the cgroup limit → it thinks it has more memory than the pod allows → pod OOMKilled before the alarm fires. Set vm_memory_high_watermark.absolute to an explicit value (e.g. 80% of the pod limit) instead of relying on the default fraction.
Persistent storage: StatefulSet + PVC. emptyDir loses durable queues and the Mnesia/quorum state on reschedule.
Clustering needs stable identities: use the rabbitmq_peer_discovery_k8s plugin and a headless Service so nodes find each other by stable DNS. Prefer quorum queues over classic mirrored queues for HA (classic mirroring was removed in RabbitMQ 4.0).
File descriptors: raise the FD ulimit; RabbitMQ uses at least one per connection.
Liveness probe: rabbitmq-diagnostics check_running / ping, with generous timeouts so a busy node isn't killed mid-load.
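
For the watermark item above, the arithmetic is simple enough to script when templating rabbitmq.conf. A minimal sketch, assuming the pod memory limit is known in bytes; the function names are hypothetical and 80% is the example figure from the text, not a recommendation for every workload.

```python
def absolute_watermark_bytes(pod_limit_bytes: int, percent: int = 80) -> int:
    """Watermark as a whole number of bytes, a percentage of the pod limit."""
    if pod_limit_bytes <= 0:
        raise ValueError("pod_limit_bytes must be positive")
    if not 0 < percent < 100:
        raise ValueError("percent must be between 1 and 99")
    return pod_limit_bytes * percent // 100  # integer math, no rounding drift


def watermark_config_line(pod_limit_bytes: int, percent: int = 80) -> str:
    """Render the rabbitmq.conf line for the computed watermark."""
    return "vm_memory_high_watermark.absolute = %d" % absolute_watermark_bytes(
        pod_limit_bytes, percent
    )


# Example: a pod limited to 2 GiB
print(watermark_config_line(2 * 1024**3))
# prints: vm_memory_high_watermark.absolute = 1717986918
```

The remaining headroom covers memory the broker does not count precisely at the moment the alarm fires, so it should not be squeezed to zero.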

Quick command reference

rabbitmqctl list_queues name messages messages_ready messages_unacknowledged consumers
rabbitmqctl list_consumers
rabbitmqctl list_connections name state
rabbitmqctl list_channels name prefetch_count messages_unacknowledged
rabbitmqctl status                       # alarms, memory breakdown, disk
rabbitmq-diagnostics observer            # live top-style view
rabbitmq-diagnostics memory_breakdown
rabbitmqctl purge_queue NAME             # delete all ready messages in a queue (irreversible)
