Debugging Kafka — cvam.sight

Kafka mental model: a topic is split into partitions; each partition is an append-only log replicated to N brokers. One replica is the leader (it handles all writes and, by default, all reads); the rest are followers. The ISR (in-sync replica) set is the leader plus the followers that are caught up to it. Consumers join a group; each partition is owned by exactly one consumer in the group, and the group's progress on it is tracked by a committed offset. Lag = log-end-offset − committed-offset. Almost every Kafka incident is one of these moving parts going wrong.
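
The lag formula can be sketched in a few lines. This is only an illustration: the offsets are made-up numbers, and treating a partition with no committed offset as starting from 0 is an assumption (in practice that depends on auto.offset.reset).

```python
def partition_lag(log_end_offsets, committed_offsets):
    """Return {partition: lag}, where lag = log-end-offset - committed-offset."""
    lag = {}
    for partition, log_end in log_end_offsets.items():
        # Assumption: no committed offset yet means the group starts from 0.
        committed = committed_offsets.get(partition, 0)
        lag[partition] = max(log_end - committed, 0)
    return lag


# Hypothetical offsets for a three-partition topic.
log_end = {0: 1500, 1: 1480, 2: 9000}
committed = {0: 1495, 1: 1480, 2: 4000}

lags = partition_lag(log_end, committed)
total_lag = sum(lags.values())
print(lags, total_lag)
```

Partition 2 carries almost all of the lag here, which is the kind of pattern the sections below try to explain.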

Consumer lag climbing

Symptom. Lag grows without bound; downstream data is stale; alerts on records-lag-max.

Likely causes.

Consumers too slow / too few — produce rate > consume rate.
Slow processing per message (blocking I/O, big batches, GC pauses).
Partition count caps parallelism — group can't have more active consumers than partitions.
A stuck consumer holding partitions but not progressing.
Frequent rebalances resetting progress (see next section).

Diagnose.

# lag per partition for a group
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group my-group
# columns: GROUP  TOPIC  PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID  HOST  CLIENT-ID

# is one partition hot? look for skew in LAG column
# does a partition show "-" under CONSUMER-ID? → no active member owns it
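
One way to spot skew in the LAG column programmatically is to compare each partition against the median. The sketch below assumes you have already collected lag per partition into a dict; the function name and the factor of 3 are arbitrary choices, not Kafka conventions.

```python
from statistics import median


def hot_partitions(lag_by_partition, factor=3.0):
    """Return partitions whose lag exceeds `factor` times the median lag."""
    typical = median(lag_by_partition.values())
    return sorted(
        p for p, lag in lag_by_partition.items()
        if lag > 0 and lag > factor * typical
    )


# Hypothetical lag snapshot: partition 3 is far behind the others.
snapshot = {0: 10, 1: 12, 2: 11, 3: 500}
print(hot_partitions(snapshot))
```

If one partition stands out like this, suspect the partition key or a stuck consumer before adding capacity.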

Fix.

Add consumers up to the partition count; beyond that, add partitions (carefully: existing keys can map to a different partition afterwards, which breaks per-key ordering across the change).
Speed up processing: batch writes, async downstream, raise max.poll.records only if you can keep up.
Parallelize work inside the consumer if ordering allows.
Check for skew — a bad partition key funnels traffic to one partition.
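
Why adding partitions disturbs key ordering can be shown with a toy partitioner. This sketch uses CRC32 as a stand-in hash (the real default partitioner uses murmur2), and the key names are invented, so only the modulo effect carries over.

```python
import zlib


def partition_for(key, num_partitions):
    # Stand-in hash (CRC32); Kafka's default partitioner uses murmur2.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def moved_keys(keys, old_count, new_count):
    """Keys that land on a different partition after the count changes."""
    return [
        k for k in keys
        if partition_for(k, old_count) != partition_for(k, new_count)
    ]


keys = [f"user-{i}" for i in range(1000)]  # hypothetical keys
moved = moved_keys(keys, old_count=6, new_count=8)
print(f"{len(moved)} of {len(keys)} keys changed partition")
```

Any key that moves has its older records in one partition and its newer records in another, so a consumer can no longer rely on seeing them in order.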

Partitions cap consumers. A group with 20 consumers and a topic with 6 partitions runs only 6 active consumers; the other 14 sit idle. Scaling pods does nothing past the partition count. Repartition or rethink keys.
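
The 20-consumer, 6-partition case can be reproduced with a simplified round-robin assignment. This is not one of Kafka's actual assignors, just a sketch of the one-owner-per-partition rule; the consumer names are hypothetical.

```python
def assign_round_robin(consumers, num_partitions):
    """Give each partition exactly one owner, cycling through consumers."""
    assignment = {c: [] for c in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


consumers = [f"consumer-{i}" for i in range(20)]
assignment = assign_round_robin(consumers, num_partitions=6)

active = [c for c, parts in assignment.items() if parts]
idle = [c for c, parts in assignment.items() if not parts]
print(len(active), "active,", len(idle), "idle")
```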

Rebalancing storms

Symptom. The group rebalances constantly; consumers repeatedly lose partition ownership; lag sawtooths; throughput collapses.

Likely causes.

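max.poll.interval.ms exceeded: processing a batch took longer than the allowed gap between poll() calls, so the consumer is considered failed and leaves the group, and its partitions are reassigned.
Session timeout / heartbeat misconfig: session.timeout.ms too low relative to GC pauses or network jitter.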
Pods churning (K8s OOMKills, rolling deploys, autoscaling) — every join/leave triggers a rebalance.
Using eager assignment — every rebalance stops the world for all consumers.

Diagnose. Consumer logs show Attempt to heartbeat failed / Revoking previously assigned partitions on a loop. Correlate with pod restarts and GC logs.

Fix.

Raise max.poll.interval.ms above worst-case batch processing time, or lower max.poll.records so a batch finishes faster.
Set session.timeout.ms / heartbeat.interval.ms to sane values (e.g. 30s / 3s); keep the heartbeat interval at no more than a third of the session timeout.
Switch to cooperative-sticky assignor — incremental rebalances, no stop-the-world.
Stabilize pods: fix OOMKills, use group.instance.id for static membership so rolling restarts don't trigger full rebalances.
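
Taken together, the fixes above might look like the following consumer settings, written here as a plain Python dict keyed by the Java client property names. The group.instance.id value and the max.poll.records value are illustrative assumptions, and config_warnings is a hypothetical helper, not part of any Kafka client.

```python
consumer_config = {
    "group.id": "my-group",
    # Incremental (cooperative) rebalancing instead of eager stop-the-world.
    "partition.assignment.strategy":
        "org.apache.kafka.clients.consumer.CooperativeStickyAssignor",
    # Static membership: hypothetical value, must be unique and stable per pod.
    "group.instance.id": "my-group-pod-0",
    "session.timeout.ms": 30_000,
    "heartbeat.interval.ms": 3_000,
    "max.poll.interval.ms": 300_000,
    "max.poll.records": 100,  # illustrative; size it from measured batch time
}


def config_warnings(cfg):
    """Hypothetical sanity checks for the settings discussed above."""
    warnings = []
    if cfg["heartbeat.interval.ms"] * 3 > cfg["session.timeout.ms"]:
        warnings.append(
            "heartbeat.interval.ms should be at most 1/3 of session.timeout.ms"
        )
    if not cfg.get("group.instance.id"):
        warnings.append("no group.instance.id: restarts trigger full rebalances")
    return warnings


print(config_warnings(consumer_config))
```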

Slow processing = phantom death. The #1 rebalance cause isn't the network; it's a poll loop where processing one batch exceeds max.poll.interval.ms. To the group, a consumer that is too slow to call poll() again looks the same as a dead one. Time your batch.
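
"Time your batch" reduces to simple arithmetic once you know the worst-case time per record. The sketch below assumes a measured worst case of 800 ms per record and a 50% headroom factor; both numbers and both function names are illustrative.

```python
def batch_time_ms(records, per_record_ms):
    """Worst-case time to process one poll() batch."""
    return records * per_record_ms


def max_safe_poll_records(per_record_ms, max_poll_interval_ms, headroom=0.5):
    """Largest batch that fits within `headroom` of max.poll.interval.ms."""
    budget_ms = max_poll_interval_ms * headroom
    return max(int(budget_ms // per_record_ms), 1)


MAX_POLL_INTERVAL_MS = 300_000  # 5 minutes
PER_RECORD_MS = 800             # assumed worst case, measure your own

worst_case = batch_time_ms(500, PER_RECORD_MS)
print("500 records take", worst_case, "ms; limit is", MAX_POLL_INTERVAL_MS)
print("safe max.poll.records:", max_safe_poll_records(PER_RECORD_MS, MAX_POLL_INTERVAL_MS))
```

With these numbers a 500-record batch overruns the interval, while a batch of 187 stays within half of it.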

Under-replicated partitions

Symptom. UnderReplicatedPartitions > 0; durability at risk; a broker may be lagging or down.

Likely causes.

A broker is down or slow (disk full, GC, network).
Followers can't keep up with leader throughput (under-provisioned disk/network).
Network partition between brokers.

Diagnose.

# list under-replicated partitions cluster-wide
kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --under-replicated-partitions

# broker metrics to check:
#   kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions
#   disk usage, network throughput, GC time per broker
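
If you want to post-process kafka-topics.sh --describe output rather than eyeball it, a small parser can compare Replicas against Isr per partition. The sample text below is hand-written and assumes the usual tab-separated "Topic: / Partition: / Leader: / Replicas: / Isr:" layout; field order and extra fields vary between Kafka versions, so treat the regex as a starting point.

```python
import re

# Assumed line layout; summary lines (PartitionCount etc.) do not match.
LINE = re.compile(
    r"Topic:\s*(\S+)\s+Partition:\s*(\d+).*?Replicas:\s*([\d,]+)\s+Isr:\s*([\d,]*)"
)


def under_replicated(describe_output):
    """Return (topic, partition, missing_broker_ids) where ISR < replicas."""
    result = []
    for line in describe_output.splitlines():
        match = LINE.search(line)
        if not match:
            continue
        topic, partition, replicas, isr = match.groups()
        replica_ids = [r for r in replicas.split(",") if r]
        isr_ids = [r for r in isr.split(",") if r]
        if len(isr_ids) < len(replica_ids):
            missing = sorted(set(replica_ids) - set(isr_ids))
            result.append((topic, int(partition), missing))
    return result


# Hand-written sample, not real cluster output.
sample = (
    "Topic: orders\tPartitionCount: 2\tReplicationFactor: 3\n"
    "\tTopic: orders\tPartition: 0\tLeader: 1\tReplicas: 1,2,3\tIsr: 1,2,3\n"
    "\tTopic: orders\tPartition: 1\tLeader: 2\tReplicas: 2,3,1\tIsr: 2,1\n"
)
print(under_replicated(sample))
```

A broker id that keeps showing up in the missing list across partitions is the one to investigate first.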

Fix.

Recover the lagging broker — free disk, restart, fix network.
Throttle replication if catch-up traffic saturates the network (leader.replication.throttled.rate / follower.replication.throttled.rate).
If a broker is permanently gone, reassign partitions to healthy brokers.
Right-size disk/network; followers need bandwidth to keep up with peak produce.

ISR shrink

Symptom. The ISR shrinks below the replication factor; IsrShrinksPerSec spikes; producers with acks=all slow down or get errors.

Likely causes.

A follower fell behind by more than replica.lag.time.max.ms → kicked from ISR.
Broker GC pause or disk stall makes a replica momentarily unresponsive.
Sudden produce spike outruns follower fetch.

Diagnose. Watch IsrShrinksPerSec/IsrExpandsPerSec — flapping = a chronically struggling broker. kafka-topics.sh --describe shows the current Isr list per partition vs Replicas.
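
Finding the flapping broker amounts to counting how often each broker drops out of the ISR over time. The sketch below assumes you sample the Isr list for one partition periodically; the snapshots are invented and isr_drop_counts is a hypothetical helper.

```python
from collections import Counter


def isr_drop_counts(replicas, isr_snapshots):
    """Count how many times each broker left the ISR across snapshots."""
    drops = Counter()
    previous = set(replicas)
    for snapshot in isr_snapshots:
        current = set(snapshot)
        for broker in previous - current:
            drops[broker] += 1
        previous = current
    return drops


# Hypothetical periodic samples of one partition's Isr list.
snapshots = [
    [1, 2, 3], [1, 2], [1, 2, 3], [1, 2], [1, 2, 3], [1, 3], [1, 2, 3],
]
drops = isr_drop_counts(replicas=[1, 2, 3], isr_snapshots=snapshots)
print(drops.most_common())
```

Here broker 3 drops out twice and broker 2 once, so broker 3 is the first candidate for a disk, GC, or network problem.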

Fix.

Find the flapping broker (it appears/disappears from ISR) and fix its disk/GC/network.
Tune GC (G1, smaller heap) to cut pause times.
If min.insync.replicas + acks=all blocks producers during shrink, you've correctly traded availability for durability — fix the broker, don't lower the floor blindly.

acks=all + min.insync.replicas. With acks=all and min.insync.replicas=2, if the ISR shrinks to 1 the broker rejects writes with NotEnoughReplicas; the producer retries and, if the ISR does not recover in time, the send fails. That's durability working as designed; the fix is restoring the replica, not dropping the min.
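
The rule in this callout is small enough to write down as a function. This is a simplified model of the broker-side check, with a hypothetical function name; it ignores retries and timeouts on the producer side.

```python
def write_outcome(acks, isr_size, min_insync_replicas):
    """Simplified model: min.insync.replicas is only enforced for acks=all."""
    if acks == "all" and isr_size < min_insync_replicas:
        return "NotEnoughReplicas"
    return "accepted"


print(write_outcome("all", isr_size=1, min_insync_replicas=2))
print(write_outcome("all", isr_size=2, min_insync_replicas=2))
print(write_outcome("1", isr_size=1, min_insync_replicas=2))
```

The last case shows the trade: acks=1 keeps accepting writes with a single in-sync replica, at the cost of the durability guarantee.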

Broker down / leader unavailable

Symptom. LeaderNotAvailable, produce/consume timeouts, OfflinePartitionsCount > 0.

Diagnose & fix.

Check broker liveness and controller: OfflinePartitionsCount, ActiveControllerCount (must be exactly 1 cluster-wide).
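If a partition has no leader and no surviving ISR member, you face the unclean leader election choice (unclean.leader.election.enable): keep the partition offline until an in-sync replica returns, or elect an out-of-sync replica and accept data loss. Decide deliberately.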
Verify the metadata quorum (KRaft) or ZooKeeper ensemble is healthy — no controller, no leader elections.
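
The availability versus data-loss choice can be modelled as a toy election function. This is a deliberately simplified sketch of the decision, not the controller's actual algorithm; broker ids and the function name are made up.

```python
def elect_leader(replicas, isr, live_brokers, unclean_enabled=False):
    """Return (leader, data_loss_possible); leader is None if offline."""
    # Clean election: first live replica that is still in the ISR.
    for broker in replicas:
        if broker in isr and broker in live_brokers:
            return broker, False
    # Unclean election: any live replica, even if it was out of sync.
    if unclean_enabled:
        for broker in replicas:
            if broker in live_brokers:
                return broker, True
    return None, False


# Only broker 1 was in sync, and it is down.
print(elect_leader([1, 2, 3], isr={1}, live_brokers={2, 3}))
print(elect_leader([1, 2, 3], isr={1}, live_brokers={2, 3}, unclean_enabled=True))
```

With unclean election disabled the partition stays offline; with it enabled, broker 2 takes over but any records it never fetched are gone.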

Docker & Kubernetes specifics

advertised.listeners — the #1 Docker/K8s Kafka bug. Clients connect to the bootstrap, then get redirected to whatever advertised.listeners says. If it advertises an internal hostname clients can't resolve → connection refused/timeout. Set it to an address reachable from the client's network.
Storage must be persistent. On K8s, run Kafka as a StatefulSet with PVCs. An emptyDir is wiped when the pod is deleted or rescheduled = data loss + long re-replication.
OOMKills trigger rebalance storms. Set memory limits above heap + page cache headroom; Kafka leans on the OS page cache, so don't starve the node.
Don't cap CPU too tight — GC under a low CPU limit lengthens pauses → ISR flapping.
Liveness probes that restart a briefly-busy broker create churn; prefer generous probes.
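
The advertised.listeners failure mode is easier to see as a two-step lookup: the client reaches the bootstrap address, receives the advertised broker addresses, and must then reach each of those itself. The sketch below models only that reachability check; every hostname in it is hypothetical.

```python
def unreachable_brokers(advertised, client_can_reach):
    """Broker ids whose advertised address the client cannot reach."""
    return sorted(
        broker for broker, addr in advertised.items()
        if addr not in client_can_reach
    )


# Hypothetical addresses returned in metadata after the bootstrap step.
advertised = {
    0: "kafka-0.kafka-headless.svc:9092",
    1: "kafka-1.kafka-headless.svc:9092",
}

# A client outside the cluster can reach the bootstrap address only.
outside_client = {"kafka.example.com:9094"}
# A client inside the cluster can resolve the internal names.
inside_client = set(advertised.values())

print(unreachable_brokers(advertised, outside_client))
print(unreachable_brokers(advertised, inside_client))
```

The outside client connects to the bootstrap address fine and then fails on every broker, which is why this bug often looks like a timeout rather than a configuration error.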

Quick command reference

kafka-consumer-groups.sh --bootstrap-server B --describe --group G   # lag
kafka-consumer-groups.sh --bootstrap-server B --list                 # all groups
kafka-topics.sh --bootstrap-server B --describe --topic T            # leaders/ISR/replicas
kafka-topics.sh --bootstrap-server B --describe --under-replicated-partitions   # URP
kafka-topics.sh --bootstrap-server B --describe --unavailable-partitions        # no leader
kafka-reassign-partitions.sh ...                                     # move/rebalance replicas
# reset a group's offsets (stop consumers first!)
kafka-consumer-groups.sh --bootstrap-server B --group G \
  --reset-offsets --to-earliest --topic T --execute
