
DEBUG GUIDE · MESSAGING · SRE PLAYBOOK

Debugging Kafka — Lag, Rebalances, and Replication.

Kafka mental model: a topic is split into partitions; each partition is an append-only log replicated to N brokers. One replica is the leader (by default it handles all reads and writes); the rest are followers. The ISR (in-sync replica) set is the leader plus the followers that are caught up to it. Consumers join a group; within a group, each partition is owned by exactly one consumer, and progress is tracked by a committed offset. Lag = log-end-offset − committed-offset. Almost every Kafka incident comes down to one of these pieces misbehaving.
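
The lag formula above can be sketched in a few lines. This is only an illustration of the arithmetic; the offsets are made-up values, and partition_lag is a hypothetical helper, not part of any Kafka client.

```python
def partition_lag(log_end_offsets, committed_offsets):
    """Return {partition: lag} using lag = log-end-offset - committed-offset."""
    return {
        partition: max(log_end - committed_offsets[partition], 0)
        # Clamp at 0: the two offsets are usually read at slightly different times.
        for partition, log_end in log_end_offsets.items()
    }

# Hypothetical offsets for a three-partition topic.
log_end = {0: 1500, 1: 1200, 2: 900}
committed = {0: 1500, 1: 1100, 2: 400}

lag = partition_lag(log_end, committed)
print(lag)                # {0: 0, 1: 100, 2: 500}
print(sum(lag.values()))  # 600
```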

Consumer lag climbing

Symptom. Lag grows without bound; downstream data is stale; alerts on records-lag-max.

Likely causes.

  • Consumers too slow / too few — produce rate > consume rate.
  • Slow processing per message (blocking I/O, big batches, GC pauses).
  • Partition count caps parallelism — group can't have more active consumers than partitions.
  • A stuck consumer holding partitions but not progressing.
  • Frequent rebalances resetting progress (see next section).

Diagnose.

# lag per partition for a group
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group my-group
# key columns: CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID  HOST

# is one partition hot? look for skew in the LAG column
# is a partition listed with CONSUMER-ID "-"? → no active member owns it
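
Both checks can be automated once the --describe output is captured as text. A minimal sketch, assuming whitespace-separated columns under a header row and numeric LAG values; the SAMPLE text, the helper names, and the "5x the median" threshold are all illustrative assumptions.

```python
# Illustrative output only; real column widths and IDs will differ.
SAMPLE = """\
GROUP     TOPIC   PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG   CONSUMER-ID  HOST       CLIENT-ID
my-group  orders  0          1500            1510            10    c1-a1b2      /10.0.0.5  c1
my-group  orders  1          1100            1120            20    c2-c3d4      /10.0.0.6  c2
my-group  orders  2          400             9400            9000  -            -          -
"""

def parse_describe(text):
    """Parse whitespace-separated rows into dicts keyed by the header names."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    return [dict(zip(header, ln.split())) for ln in lines[1:]]

def hot_partitions(rows, factor=5):
    """Partitions whose lag exceeds `factor` times the median lag (arbitrary threshold)."""
    lags = sorted(int(r["LAG"]) for r in rows)
    median = lags[len(lags) // 2]
    return [int(r["PARTITION"]) for r in rows if int(r["LAG"]) > factor * max(median, 1)]

def unowned_partitions(rows):
    """Partitions with no active member: CONSUMER-ID is printed as '-'."""
    return [int(r["PARTITION"]) for r in rows if r["CONSUMER-ID"] == "-"]

rows = parse_describe(SAMPLE)
print(hot_partitions(rows))      # [2]
print(unowned_partitions(rows))  # [2]
```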

Fix.

  • Add consumers up to the partition count; beyond that, add partitions (carefully: new partitions change the key→partition mapping, so per-key ordering is not preserved across the change).
  • Speed up processing: batch writes, async downstream, raise max.poll.records only if you can keep up.
  • Parallelize work inside the consumer if ordering allows.
  • Check for skew — a bad partition key funnels traffic to one partition.

Partitions cap consumers. A group with 20 consumers and a topic with 6 partitions runs only 6 active consumers; the other 14 idle. Scaling pods does nothing past the partition count. Repartition or rethink keys.
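
The arithmetic behind that cap, as a tiny sketch. It assumes a group subscribed to a single topic; active_and_idle is a hypothetical helper.

```python
def active_and_idle(consumers, partitions):
    """Each partition has one owner, so active consumers = min(consumers, partitions)."""
    active = min(consumers, partitions)
    return active, consumers - active

print(active_and_idle(20, 6))  # (6, 14)
print(active_and_idle(4, 6))   # (4, 0)
```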

Rebalancing storms

Symptom. Group constantly rebalances; consumers repeatedly "lost partition ownership"; lag sawtooths; throughput collapses.

Likely causes.

  • max.poll.interval.ms exceeded: processing a batch took longer than allowed, so the client drops out of the group and the coordinator reassigns its partitions as if the consumer had died.
  • Session timeout / heartbeat misconfig — session.timeout.ms too low vs GC/network.
  • Pods churning (K8s OOMKills, rolling deploys, autoscaling) — every join/leave triggers a rebalance.
  • Using eager assignment — every rebalance stops the world for all consumers.

Diagnose. Consumer logs show "Attempt to heartbeat failed" and "Revoking previously assigned partitions" on a loop. Correlate the timestamps with pod restarts and GC logs.
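
To see whether those messages really come in a loop, count them per minute. A rough sketch, assuming each log line starts with a "YYYY-MM-DD HH:MM:SS" timestamp; the LOG lines are made up and your logger's format may differ.

```python
import re
from collections import Counter

REBALANCE_MARKERS = (
    "Attempt to heartbeat failed",
    "Revoking previously assigned partitions",
)
MINUTE = re.compile(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2})")

def rebalance_events_per_minute(log_lines):
    """Count lines containing a rebalance marker, bucketed by minute."""
    counts = Counter()
    for line in log_lines:
        if any(marker in line for marker in REBALANCE_MARKERS):
            match = MINUTE.match(line)
            if match:
                counts[match.group(1)] += 1
    return dict(counts)

# Hypothetical consumer log excerpt.
LOG = [
    "2024-05-01 10:15:02 WARN  Attempt to heartbeat failed since group is rebalancing",
    "2024-05-01 10:15:03 INFO  Revoking previously assigned partitions [orders-0]",
    "2024-05-01 10:15:40 INFO  Committed offsets for orders-0",
    "2024-05-01 10:16:10 WARN  Attempt to heartbeat failed since group is rebalancing",
]
print(rebalance_events_per_minute(LOG))
# {'2024-05-01 10:15': 2, '2024-05-01 10:16': 1}
```

A steady count every minute points at a storm; a single burst that lines up with a deploy is usually just a rolling restart.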

Fix.

  • Raise max.poll.interval.ms above worst-case batch processing time, or lower max.poll.records so a batch finishes faster.
  • Set session.timeout.ms / heartbeat.interval.ms to sane values (e.g. 30s / 3s); keep the heartbeat interval at no more than a third of the session timeout.
  • Switch to cooperative-sticky assignor — incremental rebalances, no stop-the-world.
  • Stabilize pods: fix OOMKills, use group.instance.id for static membership so rolling restarts don't trigger full rebalances.
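
The fixes above can be turned into a quick sanity check over a consumer's settings. A minimal sketch: the keys are the standard consumer property names, but check_consumer_config, the example values, and the group.instance.id are assumptions for illustration.

```python
def check_consumer_config(cfg, worst_case_batch_ms):
    """Return a list of warnings for settings that tend to cause rebalance storms."""
    warnings = []
    if cfg["max.poll.interval.ms"] <= worst_case_batch_ms:
        warnings.append("max.poll.interval.ms is not above worst-case batch time")
    if cfg["heartbeat.interval.ms"] * 3 > cfg["session.timeout.ms"]:
        warnings.append("heartbeat.interval.ms is more than 1/3 of session.timeout.ms")
    # Matches both "cooperative-sticky" and the CooperativeStickyAssignor class name.
    if "cooperative" not in cfg.get("partition.assignment.strategy", "").lower():
        warnings.append("assignor is not cooperative-sticky")
    if not cfg.get("group.instance.id"):
        warnings.append("no group.instance.id (no static membership)")
    return warnings

# Hypothetical settings using the 30s / 3s example from the list above.
cfg = {
    "max.poll.interval.ms": 300_000,
    "session.timeout.ms": 30_000,
    "heartbeat.interval.ms": 3_000,
    "partition.assignment.strategy": "cooperative-sticky",
    "group.instance.id": "orders-consumer-0",
}
print(check_consumer_config(cfg, worst_case_batch_ms=120_000))  # []
```
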

Slow processing = phantom death. The #1 rebalance cause isn't the network; it's a poll loop where processing one batch exceeds max.poll.interval.ms. The group treats a consumer that stops polling exactly like a dead one. Time your batch.
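
Once you have timed a batch, the budget is simple arithmetic. A sketch under stated assumptions: per_record_ms is a worst-case figure you measured yourself, and the 0.5 safety factor is an arbitrary choice, not a Kafka recommendation.

```python
import math

def batch_time_ms(records, per_record_ms):
    """Worst-case time to process one poll() batch."""
    return records * per_record_ms

def max_safe_poll_records(max_poll_interval_ms, per_record_ms, safety=0.5):
    """Largest batch whose worst-case time stays within `safety` of the interval."""
    return math.floor(max_poll_interval_ms * safety / per_record_ms)

# 500 records at a measured 700 ms each against a 300 s interval.
print(batch_time_ms(500, 700) > 300_000)    # True: this batch would blow the interval
print(max_safe_poll_records(300_000, 700))  # 214
print(max_safe_poll_records(300_000, 400))  # 375
```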

Under-replicated partitions

Symptom. UnderReplicatedPartitions > 0; durability at risk; a broker may be lagging or down.

Likely causes.

  • A broker is down or slow (disk full, GC, network).
  • Followers can't keep up with leader throughput (under-provisioned disk/network).
  • Network partition between brokers.

Diagnose.

# list under-replicated partitions cluster-wide
kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --under-replicated-partitions

# broker metrics to check:
#   kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions
#   disk usage, network throughput, GC time per broker

Fix.

  • Recover the lagging broker — free disk, restart, fix network.
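  • Throttle replication if catch-up traffic saturates the network (leader.replication.throttled.rate / follower.replication.throttled.rate, or --throttle on kafka-reassign-partitions.sh).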
  • If a broker is permanently gone, reassign partitions to healthy brokers.
  • Right-size disk/network; followers need bandwidth to keep up with peak produce.

ISR shrink

Symptom. ISR size drops below the replication factor; IsrShrinksPerSec spikes; producers with acks=all slow down or error.

Likely causes.

  • A follower fell behind by more than replica.lag.time.max.ms → kicked from ISR.
  • Broker GC pause or disk stall makes a replica momentarily unresponsive.
  • Sudden produce spike outruns follower fetch.

Diagnose. Watch IsrShrinksPerSec/IsrExpandsPerSec — flapping = a chronically struggling broker. kafka-topics.sh --describe shows the current Isr list per partition vs Replicas.
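
Comparing Isr against Replicas by eye gets tedious on wide topics. A rough sketch that counts, per broker, how many partitions it is missing from; the DESCRIBE text is an illustrative sample, and the regex assumes the usual "Partition: … Replicas: … Isr: …" line layout.

```python
import re
from collections import Counter

LINE = re.compile(r"Partition:\s*(\d+).*?Replicas:\s*([\d,]+)\s+Isr:\s*([\d,]*)")

def out_of_sync_by_broker(describe_text):
    """Return {broker_id: number of partitions where it is a replica but not in the ISR}."""
    missing = Counter()
    for line in describe_text.splitlines():
        m = LINE.search(line)
        if not m:
            continue  # topic summary lines and blanks
        replicas = set(m.group(2).split(","))
        isr = set(filter(None, m.group(3).split(",")))
        for broker in replicas - isr:
            missing[int(broker)] += 1
    return dict(missing)

# Illustrative output; real lines may carry extra fields.
DESCRIBE = """\
Topic: orders  Partition: 0  Leader: 1  Replicas: 1,2,3  Isr: 1,2
Topic: orders  Partition: 1  Leader: 2  Replicas: 2,3,1  Isr: 2,1
Topic: orders  Partition: 2  Leader: 1  Replicas: 3,1,2  Isr: 1,2,3
"""
print(out_of_sync_by_broker(DESCRIBE))  # {3: 2}
```

A broker that dominates this count across repeated runs is the one to inspect first.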

Fix.

  • Find the flapping broker (it appears/disappears from ISR) and fix its disk/GC/network.
  • Tune GC (G1, smaller heap) to cut pause times.
  • If min.insync.replicas + acks=all blocks producers during shrink, you've correctly traded availability for durability — fix the broker, don't lower the floor blindly.

acks=all + min.insync.replicas. With acks=all and min.insync.replicas=2, if the ISR shrinks to 1 the broker rejects writes with NotEnoughReplicas; the producer retries and, if the ISR does not recover within delivery.timeout.ms, the send fails. That's durability working as designed: the fix is restoring the replica, not dropping the min.
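
The rule can be written down as a small predicate. This is a simplified model of the broker-side check, not Kafka's actual code; write_accepted and tolerated_replica_losses are hypothetical names.

```python
def write_accepted(acks, isr_size, min_insync_replicas):
    """min.insync.replicas is only enforced for acks=all; otherwise a live leader is enough."""
    if acks == "all":
        return isr_size >= min_insync_replicas
    return isr_size >= 1

def tolerated_replica_losses(replication_factor, min_insync_replicas):
    """How many replicas can drop out before acks=all writes are rejected."""
    return replication_factor - min_insync_replicas

print(write_accepted("all", isr_size=1, min_insync_replicas=2))  # False
print(write_accepted("1", isr_size=1, min_insync_replicas=2))    # True
print(tolerated_replica_losses(3, 2))                            # 1
```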

Broker down / leader unavailable

Symptom. LeaderNotAvailable, produce/consume timeouts, OfflinePartitionsCount > 0.

Diagnose & fix.

  • Check broker liveness and controller: OfflinePartitionsCount, ActiveControllerCount (must be exactly 1 cluster-wide).
  • If a partition has no leader and no surviving ISR, you may face unclean-leader-election choices (availability vs data loss). Decide deliberately.
  • Verify the metadata quorum (KRaft) or ZooKeeper ensemble is healthy — no controller, no leader elections.
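
The "exactly 1" rule for ActiveControllerCount is a sum over nodes. A minimal sketch, assuming you have already scraped the per-node metric into a dict; the node IDs and controller_status are illustrative.

```python
def controller_status(active_controller_counts):
    """Classify the cluster by the sum of per-node ActiveControllerCount values."""
    total = sum(active_controller_counts.values())
    if total == 1:
        return "ok"
    if total == 0:
        return "no active controller"
    return "multiple active controllers"

# Hypothetical scrape: node id -> ActiveControllerCount.
print(controller_status({1: 0, 2: 1, 3: 0}))  # ok
print(controller_status({1: 0, 2: 0, 3: 0}))  # no active controller
```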

Docker & Kubernetes specifics

  • advertised.listeners — the #1 Docker/K8s Kafka bug. Clients connect to the bootstrap, then get redirected to whatever advertised.listeners says. If it advertises an internal hostname clients can't resolve → connection refused/timeout. Set it to an address reachable from the client's network.
  • Storage must be persistent. On K8s, run Kafka as a StatefulSet with PVCs. An emptyDir loses the log on restart = data loss + long re-replication.
  • OOMKills trigger rebalance storms. Set memory limits above heap + page cache headroom; Kafka leans on the OS page cache, so don't starve the node.
  • Don't cap CPU too tight — GC under a low CPU limit lengthens pauses → ISR flapping.
  • Liveness probes that restart a briefly-busy broker create churn; prefer generous probes.
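
For the advertised.listeners bug, it helps to check what address a client will actually be handed. A sketch, assuming the usual NAME://host:port comma-separated format; the listener names and hostnames below are made up, and can_resolve should be run from the client's network.

```python
import socket

def parse_advertised_listeners(value):
    """Split 'NAME://host:port,...' into (name, host, port) tuples."""
    listeners = []
    for entry in value.split(","):
        name, _, address = entry.strip().partition("://")
        host, _, port = address.rpartition(":")
        listeners.append((name, host, int(port)))
    return listeners

def advertised_address(value, listener_name):
    """Address a client receives after bootstrapping on the given listener."""
    for name, host, port in parse_advertised_listeners(value):
        if name == listener_name:
            return host, port
    return None

def can_resolve(host):
    """True if this machine can resolve the advertised hostname."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False

# Hypothetical broker setting.
ADVERTISED = "INTERNAL://kafka-0.kafka-headless:9092,EXTERNAL://broker0.example.com:9094"
print(advertised_address(ADVERTISED, "EXTERNAL"))  # ('broker0.example.com', 9094)
```

If can_resolve returns False for the host your listener advertises, clients will bootstrap fine and then time out on the redirect.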

Quick command reference

kafka-consumer-groups.sh --bootstrap-server B --describe --group G   # lag
kafka-consumer-groups.sh --bootstrap-server B --list                 # all groups
kafka-topics.sh --bootstrap-server B --describe --topic T            # leaders/ISR/replicas
kafka-topics.sh --bootstrap-server B --describe --under-replicated-partitions  # URP
kafka-topics.sh --bootstrap-server B --describe --unavailable-partitions       # no leader
kafka-reassign-partitions.sh ...                                     # move/rebalance replicas
# reset a group's offsets (stop consumers first!)
kafka-consumer-groups.sh --bootstrap-server B --group G \
  --reset-offsets --to-earliest --topic T --execute
© cvam — written in plaintext, served warm