Kafka mental model: a topic is split into partitions; each partition is an append-only log replicated to N brokers. One replica is the leader (handles reads/writes); the rest are followers. The ISR (in-sync replicas) set is the leader plus the followers caught up to it. Consumers join a group; each partition is owned by exactly one consumer in the group, and the group's progress on it is tracked by a committed offset. Lag = log-end-offset − committed-offset. Almost every Kafka incident is one of these moving parts going wrong.
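As a minimal sketch of the lag formula above (the offsets are made-up numbers and `partition_lag` is a hypothetical helper, not a Kafka API):

```python
from typing import Dict

def partition_lag(log_end: Dict[int, int], committed: Dict[int, int]) -> Dict[int, int]:
    """Lag per partition = log-end-offset - committed-offset."""
    lag = {}
    for partition, end in log_end.items():
        # Assumption: a partition with no committed offset counts as fully
        # behind; real behavior depends on auto.offset.reset.
        done = committed.get(partition, 0)
        lag[partition] = max(end - done, 0)
    return lag

# Illustrative offsets for a three-partition topic.
log_end = {0: 1500, 1: 1480, 2: 9000}
committed = {0: 1495, 1: 1480, 2: 4000}

lag = partition_lag(log_end, committed)
total = sum(lag.values())
print(lag, total)
```

Here partition 2 carries almost all of the lag, which is the kind of skew the sections below look for.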
Consumer lag climbing
Symptom. Lag grows without bound; downstream data is stale; alerts on records-lag-max.
Likely causes.
- Consumers too slow / too few — produce rate > consume rate.
- Slow processing per message (blocking I/O, big batches, GC pauses).
- Partition count caps parallelism — group can't have more active consumers than partitions.
- A stuck consumer holding partitions but not progressing.
- Frequent rebalances resetting progress (see next section).
Diagnose.
    # lag per partition for a group
    kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
      --describe --group my-group
    # columns: CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST
    # is one partition hot? look for skew in LAG column
    # is a consumer assigned but CONSUMER-ID empty? → no active member
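The two checks in those comments can be automated. The sketch below parses text shaped like the `--describe` output; the sample rows, the `-` placeholder for a missing consumer, and the "10x the median" skew threshold are all assumptions for illustration:

```python
from statistics import median

# Made-up sample in the shape of `kafka-consumer-groups.sh --describe` output.
SAMPLE = """\
GROUP     TOPIC   PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG   CONSUMER-ID  HOST       CLIENT-ID
my-group  orders  0          1495            1500            5     c-1          /10.0.0.5  app
my-group  orders  1          1480            1480            0     c-2          /10.0.0.6  app
my-group  orders  2          4000            9000            5000  -            -          -
"""

def parse_describe(text):
    """Turn the whitespace-separated table into a list of dicts keyed by header."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    rows = []
    for ln in lines[1:]:
        fields = ln.split()
        if len(fields) == len(header):  # skip anything that is not a data row
            rows.append(dict(zip(header, fields)))
    return rows

def hot_partitions(rows, factor=10):
    """Partitions whose lag exceeds `factor` times the median lag (a heuristic)."""
    lags = {int(r["PARTITION"]): int(r["LAG"]) for r in rows if r["LAG"].isdigit()}
    threshold = factor * max(median(lags.values()), 1)
    return sorted(p for p, lag in lags.items() if lag > threshold)

def unowned_partitions(rows):
    """Partitions with no active member (CONSUMER-ID shown as '-')."""
    return sorted(int(r["PARTITION"]) for r in rows if r["CONSUMER-ID"] == "-")

rows = parse_describe(SAMPLE)
hot = hot_partitions(rows)
unowned = unowned_partitions(rows)
print("hot:", hot, "unowned:", unowned)
```

In practice you would feed it the captured stdout of the command instead of `SAMPLE`; the column layout can differ between Kafka versions, which is why the parser keys on header names rather than positions.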
Fix.
- Add consumers up to the partition count; beyond that, add partitions (carefully — breaks key→partition ordering).
- Speed up processing: batch writes, async downstream, raise max.poll.records only if you can keep up.
- Parallelize work inside the consumer if ordering allows.
- Check for skew — a bad partition key funnels traffic to one partition.
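Why adding partitions breaks key→partition ordering: the partition for a keyed record is derived from a hash of the key modulo the partition count, so changing the count remaps most keys. A rough sketch, using `zlib.crc32` as a stand-in hash (an assumption made to stay stdlib-only; Kafka's default partitioner uses murmur2) and hypothetical keys:

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    # Stand-in hash (assumption): crc32 instead of Kafka's murmur2.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

keys = [f"user-{i}" for i in range(1000)]  # hypothetical record keys

before = {k: partition_for(k, 6) for k in keys}  # topic with 6 partitions
after = {k: partition_for(k, 8) for k in keys}   # same topic grown to 8

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys changed partition")
```

Records already written stay where they are, so for a moved key the old and new records sit in different partitions and per-key ordering across the change is no longer guaranteed.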
Rebalancing storms
Symptom. Group constantly rebalances; consumers repeatedly "lost partition ownership"; lag sawtooths; throughput collapses.
Likely causes.
- max.poll.interval.ms exceeded: processing a batch took longer than allowed, so the consumer is considered failed and removed from the group.
- Session timeout / heartbeat misconfig: session.timeout.ms too low vs GC/network.
- Pods churning (K8s OOMKills, rolling deploys, autoscaling): every join/leave triggers a rebalance.
- Using eager assignment — every rebalance stops the world for all consumers.
Diagnose. Consumer logs show "Attempt to heartbeat failed" / "Revoking previously assigned partitions" on a loop. Correlate with pod restarts and GC logs.
Fix.
- Raise max.poll.interval.ms above worst-case batch processing time, or lower max.poll.records so a batch finishes faster.
- Set session.timeout.ms / heartbeat.interval.ms sane (e.g. 30s / 3s).
- Switch to cooperative-sticky assignor: incremental rebalances, no stop-the-world.
- Stabilize pods: fix OOMKills, use group.instance.id for static membership so rolling restarts don't trigger full rebalances.
Mind max.poll.interval.ms: the broker can't tell "slow" from "dead." Time your batch.
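One way to turn "time your batch" into numbers is sketched below. The 800 ms worst-case per-record time is an invented measurement, the 50% safety factor is an arbitrary choice, and the helper names are hypothetical; 300000 ms and 500 are the stock defaults for max.poll.interval.ms and max.poll.records.

```python
def batch_fits(max_poll_records: int, worst_case_ms_per_record: float,
               max_poll_interval_ms: int) -> bool:
    """True if a full batch at worst-case speed finishes before the poll deadline."""
    return max_poll_records * worst_case_ms_per_record < max_poll_interval_ms

def max_safe_poll_records(max_poll_interval_ms: int, worst_case_ms_per_record: float,
                          safety_factor: float = 0.5) -> int:
    """Largest batch size that uses only `safety_factor` of the poll interval."""
    budget_ms = max_poll_interval_ms * safety_factor
    return max(int(budget_ms // worst_case_ms_per_record), 1)

# Defaults: max.poll.interval.ms=300000, max.poll.records=500.
# Assumed measurement: the slowest records take about 800 ms each.
fits = batch_fits(500, 800, 300_000)
safe = max_safe_poll_records(300_000, 800)
print(fits, safe)
```

With these numbers a full default batch can take 400 s against a 300 s deadline, so either max.poll.records drops to roughly the computed value or max.poll.interval.ms goes up.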
Under-replicated partitions
Symptom. UnderReplicatedPartitions > 0; durability at risk; a broker may be lagging or down.
Likely causes.
- A broker is down or slow (disk full, GC, network).
- Followers can't keep up with leader throughput (under-provisioned disk/network).
- Network partition between brokers.
Diagnose.
    # list under-replicated partitions cluster-wide
    kafka-topics.sh --bootstrap-server localhost:9092 \
      --describe --under-replicated-partitions
    # broker metrics to check:
    #   kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions
    #   disk usage, network throughput, GC time per broker
Fix.
- Recover the lagging broker — free disk, restart, fix network.
- Throttle replication if catch-up traffic saturates the network (replication quotas, e.g. leader.replication.throttled.rate / follower.replication.throttled.rate).
- If a broker is permanently gone, reassign partitions to healthy brokers.
- Right-size disk/network; followers need bandwidth to keep up with peak produce.
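A back-of-the-envelope model for sizing and throttling, assuming steady rates (all figures are invented and the function names are hypothetical): replication adds (replication factor − 1) copies of the produce traffic, and a recovering follower only catches up if its fetch rate exceeds what keeps arriving for its partitions.

```python
import math

def replication_traffic_mb_s(produce_mb_s: float, replication_factor: int) -> float:
    """Cluster-wide follower fetch traffic generated by a given produce rate."""
    return produce_mb_s * (replication_factor - 1)

def catch_up_seconds(backlog_mb: float, fetch_mb_s: float, ingest_mb_s: float) -> float:
    """Time for a lagging follower to clear its backlog while new data keeps arriving."""
    spare = fetch_mb_s - ingest_mb_s
    if spare <= 0:
        return math.inf  # the follower never catches up at these rates
    return backlog_mb / spare

# Invented figures: 150 MB/s peak produce, replication factor 3.
extra = replication_traffic_mb_s(150, 3)

# A restarted broker is 120 GB behind, can fetch 100 MB/s (or the throttle
# allows that much), and its partitions keep receiving 40 MB/s.
eta = catch_up_seconds(120_000, 100, 40)
print(extra, eta)
```

The same arithmetic shows the risk of an overly tight throttle: if the allowed fetch rate is at or below the ingest rate, the partition stays under-replicated indefinitely.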
ISR shrink
Symptom. ISR set drops below replication factor; IsrShrinksPerSec spikes; producers with acks=all slow or error.
Likely causes.
- A follower fell behind by more than replica.lag.time.max.ms → kicked from ISR.
- Broker GC pause or disk stall makes a replica momentarily unresponsive.
- Sudden produce spike outruns follower fetch.
Diagnose. Watch IsrShrinksPerSec/IsrExpandsPerSec: flapping = a chronically struggling broker. kafka-topics.sh --describe shows the current Isr list per partition vs Replicas.
Fix.
- Find the flapping broker (it appears/disappears from ISR) and fix its disk/GC/network.
- Tune GC (G1, smaller heap) to cut pause times.
- If min.insync.replicas + acks=all blocks producers during shrink, you've correctly traded availability for durability; fix the broker, don't lower the floor blindly.
With acks=all and min.insync.replicas=2, if ISR shrinks to 1 the producer gets NotEnoughReplicas and blocks. That's durability working as designed: the fix is restoring the replica, not dropping the min.
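The interplay of replica.lag.time.max.ms, ISR size, and min.insync.replicas can be modeled in a few lines. This is a simplified sketch, not broker code: the function names are hypothetical, the lag figures are invented, and 30000 ms is assumed as the replica.lag.time.max.ms default.

```python
def in_isr(ms_since_caught_up: int, replica_lag_time_max_ms: int = 30_000) -> bool:
    """A follower stays in the ISR while it caught up to the leader recently enough."""
    return ms_since_caught_up <= replica_lag_time_max_ms

def write_accepted(acks: str, isr_size: int, min_insync_replicas: int) -> bool:
    """min.insync.replicas is only enforced for acks=all (also written as -1)."""
    if acks in ("all", "-1"):
        return isr_size >= min_insync_replicas
    return isr_size >= 1  # acks=0 / acks=1 only need a live leader

def tolerated_replica_losses(replication_factor: int, min_insync_replicas: int) -> int:
    """How many replicas can drop out before acks=all writes are rejected."""
    return replication_factor - min_insync_replicas

# Replication factor 3: leader is broker 0, followers are brokers 1 and 2.
# Invented state: broker 2 last caught up 45 s ago (e.g. a disk stall).
ms_since_caught_up = {1: 200, 2: 45_000}
isr = [0] + [b for b, ms in ms_since_caught_up.items() if in_isr(ms)]

ok_now = write_accepted("all", len(isr), min_insync_replicas=2)
ok_if_one_more_lost = write_accepted("all", len(isr) - 1, min_insync_replicas=2)
print(isr, ok_now, ok_if_one_more_lost)
```

With replication factor 3 and min.insync.replicas=2 the topic tolerates exactly one missing replica; the second loss is where producers start seeing NotEnoughReplicas.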
Broker down / leader unavailable
Symptom. LeaderNotAvailable, produce/consume timeouts, OfflinePartitionsCount > 0.
Diagnose & fix.
- Check broker liveness and controller: OfflinePartitionsCount, ActiveControllerCount (must be exactly 1 cluster-wide).
- If a partition has no leader and no surviving ISR, you may face unclean-leader-election choices (availability vs data loss). Decide deliberately.
- Verify the metadata quorum (KRaft) or ZooKeeper ensemble is healthy — no controller, no leader elections.
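"Exactly 1 cluster-wide" means summing the ActiveControllerCount metric across all nodes that report it. A small sketch of that check, assuming you have already scraped the per-node values (node names and values are made up; `controller_status` is a hypothetical helper):

```python
from typing import Dict

def controller_status(active_controller_counts: Dict[str, int]) -> str:
    """Classify the cluster from per-node ActiveControllerCount values."""
    total = sum(active_controller_counts.values())
    if total == 1:
        return "ok"
    if total == 0:
        return "no-controller"  # nothing can run leader elections
    return "multiple-controllers"  # more than one node believes it is active

# Made-up scrape results keyed by node name.
healthy = controller_status({"broker-0": 0, "broker-1": 1, "broker-2": 0})
headless = controller_status({"broker-0": 0, "broker-1": 0, "broker-2": 0})
print(healthy, headless)
```

Alerting on any value other than "ok" that lasts longer than a brief election window catches both failure modes.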
Docker & Kubernetes specifics
- advertised.listeners: the #1 Docker/K8s Kafka bug. Clients connect to the bootstrap, then get redirected to whatever advertised.listeners says. If it advertises an internal hostname clients can't resolve → connection refused/timeout. Set it to an address reachable from the client's network.
- Storage must be persistent. On K8s, run Kafka as a StatefulSet with PVCs. An emptyDir loses the log on restart = data loss + long re-replication.
- OOMKills trigger rebalance storms. Set memory limits above heap + page cache headroom; Kafka leans on the OS page cache, so don't starve the node.
- Don't cap CPU too tight — GC under a low CPU limit lengthens pauses → ISR flapping.
- Liveness probes that restart a briefly-busy broker create churn; prefer generous probes.
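To sanity-check an advertised.listeners value before clients hit it, you can parse it and flag entries a given client could not resolve. The sketch below only does string checks; the listener names, hostnames, and the "resolvable suffix" rule are assumptions standing in for your real DNS setup.

```python
from typing import Dict, Iterable, List, Tuple

def parse_listeners(value: str) -> Dict[str, Tuple[str, int]]:
    """Parse 'NAME://host:port,NAME2://host2:port2' into {name: (host, port)}."""
    listeners = {}
    for entry in value.split(","):
        name, _, address = entry.strip().partition("://")
        host, _, port = address.rpartition(":")
        listeners[name] = (host, int(port))
    return listeners

def unresolvable_for_client(listeners: Dict[str, Tuple[str, int]],
                            resolvable_suffixes: Iterable[str]) -> List[str]:
    """Listener names whose host falls outside the domains the client can resolve."""
    suffixes = tuple(resolvable_suffixes)
    return sorted(name for name, (host, _) in listeners.items()
                  if not host.endswith(suffixes))

# Hypothetical config: one in-cluster listener, one for outside clients.
advertised = ("INTERNAL://kafka-0.kafka.svc.cluster.local:9092,"
              "EXTERNAL://broker0.example.com:9094")
listeners = parse_listeners(advertised)

# Assumption: a client outside the cluster can only resolve *.example.com.
bad = unresolvable_for_client(listeners, [".example.com"])
print(listeners, bad)
```

An outside client bootstrapping against the INTERNAL listener would be handed an address it cannot resolve, which is exactly the timeout described in the first bullet.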
Quick command reference
    kafka-consumer-groups.sh --bootstrap-server B --describe --group G              # lag
    kafka-consumer-groups.sh --bootstrap-server B --list                            # all groups
    kafka-topics.sh --bootstrap-server B --describe --topic T                       # leaders/ISR/replicas
    kafka-topics.sh --bootstrap-server B --describe --under-replicated-partitions   # URP
    kafka-topics.sh --bootstrap-server B --describe --unavailable-partitions        # no leader
    kafka-reassign-partitions.sh ...                                                # move/rebalance replicas
    # reset a group's offsets (stop consumers first!)
    kafka-consumer-groups.sh --bootstrap-server B --group G \
      --reset-offsets --to-earliest --topic T --execute