Real Kafka-administrator questions (brokers, partitions, replication, consumer groups, delivery guarantees, retention, and the cluster ops & debugging admins actually face), graded easy → hard with full answers. Pair with the Kafka cheatsheet.
Easy — fundamentals
What is Kafka and what problem does it solve? easy
A distributed, durable, append-only commit log for streaming data — a high-throughput pub/sub that decouples producers from consumers. Producers write events to topics; consumers read at their own pace, independently. Unlike a traditional queue, Kafka retains messages on disk for a configured time/size, so many consumers can read the same data and you can replay history. It's the backbone for event-driven architectures, log/metric pipelines, stream processing, and data integration — built for horizontal scale and ordering within a partition.
Explain topics, partitions, and offsets. easy
A topic is a named stream, split into partitions; each partition is an ordered, immutable, append-only log living on a broker. Order is guaranteed only within a partition, not across the topic. Each message in a partition has a monotonically increasing offset (its position). Partitions are the unit of parallelism (more partitions = more consumer parallelism) and the unit of distribution (spread across brokers). A keyed message's partition is chosen by hashing its key (same key → same partition → ordering for that key). Keyless messages are spread across partitions: round-robin in older clients, sticky per-batch assignment by default since Kafka 2.4.
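As a rough illustration of key-based placement, the sketch below maps keys to partitions with a hash modulo the partition count. It uses zlib.crc32 as a stand-in hash (Kafka's Java client uses murmur2), and partition_for is a hypothetical helper name, not a Kafka API.

```python
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    """Pick a partition from a key hash (crc32 is a stand-in, not Kafka's murmur2)."""
    return zlib.crc32(key) % num_partitions

# Records with the same key always land on the same partition,
# which is what provides per-key ordering.
events = [(b"user-1", "login"), (b"user-2", "login"), (b"user-1", "logout")]
placed = [(partition_for(key, 6), key, value) for key, value in events]
for partition, key, value in placed:
    print(partition, key, value)
```

Both user-1 events print the same partition number; user-2 may or may not share it.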
What is a consumer group? easy
A set of consumers that cooperate to consume a topic. Each partition is assigned to exactly one consumer within the group, so the group splits the work and you get parallel, scaled consumption. Add consumers up to the partition count to scale out (beyond that, extras sit idle — partition count caps parallelism). If a consumer dies, its partitions are rebalanced to the others. Different groups each get their own copy of all messages (independent offsets) — that's how you fan out the same stream to multiple applications.
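A minimal sketch of how a group splits partitions, assuming a simple round-robin policy. The function assign_round_robin is hypothetical and much simpler than Kafka's real assignors, but it shows why consumers beyond the partition count sit idle.

```python
def assign_round_robin(partitions, consumers):
    """Give each partition exactly one owner, cycling through the consumers."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment

four_partitions = [0, 1, 2, 3]
two_consumers = assign_round_robin(four_partitions, ["c1", "c2"])
six_consumers = assign_round_robin(four_partitions, ["c1", "c2", "c3", "c4", "c5", "c6"])
print(two_consumers)   # {'c1': [0, 2], 'c2': [1, 3]}
print(six_consumers)   # c5 and c6 end up with empty lists
```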
What is replication and the replication factor? easy
Each partition is replicated across multiple brokers for fault tolerance. Replication factor = total copies (3 is standard). One replica is the leader and handles all writes and, by default, all reads; the others are followers that replicate from it. If the leader's broker dies, a follower is promoted. The set of replicas currently caught up is the ISR (in-sync replicas). RF=3 means the partition's data survives the loss of up to 2 brokers, although with min.insync.replicas=2 acks=all writes are rejected once only one replica remains in sync. Spread replicas across racks/AZs (rack awareness) so a zone failure doesn't take all copies.
What is the role of the controller (and what replaced ZooKeeper)? easy
One node acts as the active controller: it manages cluster metadata such as partition leadership, broker membership, and leader elections when a broker fails. Historically this metadata lived in ZooKeeper. Modern Kafka uses KRaft (KIP-500), in which Kafka manages its own metadata via a Raft quorum of controller nodes, removing the ZooKeeper dependency. ZooKeeper mode was deprecated in the 3.x line and removed in Kafka 4.0, so KRaft is now the only mode; it simplifies ops and is designed to scale metadata to millions of partitions.
What is a Kafka topic? easy
A named, partitioned, append-only log of events that producers write to and consumers read from independently.
What is a partition? easy
An ordered, immutable, append-only log that holds one shard of a topic on a broker; the unit of parallelism and ordering (order is only guaranteed within a partition).
What is an offset? easy
A monotonically increasing position of a message within a partition; consumers track committed offsets to know where they are.
What is a producer vs consumer? easy
A producer writes events to topics; a consumer reads them, tracking offsets and (in a group) sharing partitions.
What is a broker? easy
A Kafka server hosting partitions, serving reads/writes, and replicating data; a cluster is multiple brokers.
What is replication factor? easy
Number of copies of each partition across brokers (3 standard); one leader serves traffic, followers replicate for fault tolerance.
What is a consumer group? easy
Consumers cooperating to consume a topic — each partition goes to one consumer in the group; different groups get independent copies.
What is a message key? easy
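An optional key whose hash picks the partition: same key → same partition → ordering for that key. With no key, records are spread across partitions (round-robin in older clients, sticky per-batch assignment since Kafka 2.4).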
What is the ISR? easy
In-Sync Replicas — the set of replicas caught up to the leader; only ISR members are eligible to become leader (with unclean election off).
What replaced ZooKeeper in Kafka? easy
KRaft (KIP-500) — Kafka manages its own metadata via a Raft controller quorum, removing the ZooKeeper dependency.
Medium — applied
How do you choose the number of partitions for a topic? medium
Partitions set the ceiling on consumer parallelism and on throughput, so size for peak target throughput ÷ per-partition throughput, and ensure partition count ≥ max consumers you'll ever want in a group. But more is not free: every partition adds open file handles, memory, replication overhead, and longer leader-election/recovery times; tens of thousands of partitions per broker hurt. You can increase partitions later but not decrease — and increasing breaks key-based ordering (the hash→partition mapping changes), so over-provision modestly rather than reshuffling. Rule of thumb: start with a reasonable number (e.g. 2–3× consumer count), leave headroom, measure.
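The sizing rule above can be sketched as a small calculation. The function name and the example numbers are illustrative assumptions; per-partition throughput in particular has to be measured on your own hardware.

```python
import math

def min_partitions(target_mb_s, per_partition_mb_s, max_consumers):
    """Partitions needed to cover peak throughput and the largest consumer group."""
    by_throughput = math.ceil(target_mb_s / per_partition_mb_s)
    return max(by_throughput, max_consumers)

# Throughput-bound: 200 MB/s peak at an assumed 10 MB/s per partition.
print(min_partitions(200, 10, max_consumers=12))  # 20
# Consumer-bound: throughput needs only 5, but 12 consumers need 12.
print(min_partitions(50, 10, max_consumers=12))   # 12
```

Treat the result as a floor and add modest headroom, since the count cannot be reduced later.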
Explain acks=0/1/all and how it relates to durability. medium
Producer acks controls when a write is acknowledged. acks=0: fire-and-forget, no ack; fastest, can silently lose data. acks=1: leader acks after writing locally; loses data if the leader dies before followers replicate. acks=all (-1): leader acks only after all in-sync replicas have the record; strongest durability. Pair acks=all with the topic/broker setting min.insync.replicas=2 (and RF=3): the producer write fails if fewer than 2 replicas are in sync, so you never ack a write that exists on only one replica. That trio (RF=3, acks=all, min.insync.replicas=2) is the standard no-data-loss config, at the cost of latency and availability (writes fail if the ISR is too small).
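To make the interaction between acks and min.insync.replicas concrete, here is a simplified model of what the producer sees for one write. The function write_outcome and its return strings are hypothetical; real brokers return error codes rather than these messages.

```python
def write_outcome(acks: str, isr_size: int, min_insync_replicas: int) -> str:
    """Simplified model of one produce request under each acks setting."""
    if acks == "0":
        return "no ack requested"
    if acks == "1":
        return "acked by leader only"
    if acks == "all":
        # min.insync.replicas is only enforced for acks=all writes.
        if isr_size < min_insync_replicas:
            return "rejected: not enough in-sync replicas"
        return "acked after all in-sync replicas have the record"
    raise ValueError(f"unknown acks value: {acks!r}")

print(write_outcome("all", isr_size=3, min_insync_replicas=2))
print(write_outcome("all", isr_size=1, min_insync_replicas=2))
print(write_outcome("1", isr_size=1, min_insync_replicas=2))
```

The last call shows why acks=1 is weaker: it is acknowledged even when only the leader holds the record.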
At-most-once vs at-least-once vs exactly-once — how does Kafka deliver each? medium
Depends on offset-commit timing and idempotence. At-most-once: commit offset before processing — a crash loses the in-flight message (no dupes, possible loss). At-least-once: process then commit — a crash reprocesses (no loss, possible dupes); the common default, so make consumers idempotent. Exactly-once (EOS): enable the idempotent producer (dedupes retries via sequence numbers) plus transactions (atomically write messages + commit consumer offsets), and read with isolation.level=read_committed. EOS works cleanly for Kafka-to-Kafka (e.g. Kafka Streams); end-to-end to an external sink still needs idempotent writes there.
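The effect of commit timing can be seen in a toy simulation: a consumer crashes while handling one message, restarts, and resumes from its committed offset. Everything here (the run function, the in-memory list standing in for a partition) is an illustrative assumption, not client code.

```python
def run(messages, commit_first, crash_at):
    """Consume with one crash while handling index crash_at, then restart.

    Returns the messages whose processing completed, in order.
    """
    processed = []
    committed = 0  # next offset to read after a restart

    # First run: crashes while handling the message at crash_at.
    for offset in range(committed, len(messages)):
        if commit_first:
            committed = offset + 1           # commit before processing
            if offset == crash_at:
                break                        # crash: message never processed
            processed.append(messages[offset])
        else:
            processed.append(messages[offset])  # process first
            if offset == crash_at:
                break                        # crash before the commit
            committed = offset + 1

    # Restart: resume from the committed offset, no crash this time.
    for offset in range(committed, len(messages)):
        processed.append(messages[offset])
    return processed

msgs = ["a", "b", "c"]
at_most_once = run(msgs, commit_first=True, crash_at=1)
at_least_once = run(msgs, commit_first=False, crash_at=1)
print(at_most_once)   # ['a', 'c']
print(at_least_once)  # ['a', 'b', 'b', 'c']
```

Commit-first drops "b"; process-first handles "b" twice, which is why at-least-once consumers need idempotent processing.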
What is consumer lag and how do you monitor it? medium
Lag = log-end-offset (latest produced) − consumer's committed offset, per partition: how far behind the consumer is. Sustained or growing lag means consumers can't keep up (slow processing, too few consumers, a stuck/rebalancing consumer, or a hot partition). Monitor with kafka-consumer-groups.sh --describe, or tools like Burrow / Kafka Exporter → Prometheus/Grafana, and alert on lag trend (not just absolute value). Fixes: add consumers (up to partition count), speed up processing / batch, repartition to fix skew, or scale brokers. Lag is the #1 health signal admins watch.
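A small sketch of the lag arithmetic and a naive trend check. The offsets are made-up sample data, and treating a partition with no committed offset as starting from 0 is an assumption of this sketch.

```python
def partition_lag(log_end_offsets, committed_offsets):
    """Lag per partition = log-end offset minus the group's committed offset."""
    return {
        partition: end - committed_offsets.get(partition, 0)
        for partition, end in log_end_offsets.items()
    }

def lag_is_growing(total_lag_samples):
    """True if total lag rose on every consecutive sample (a simple trend check)."""
    if len(total_lag_samples) < 2:
        return False
    pairs = zip(total_lag_samples, total_lag_samples[1:])
    return all(later > earlier for earlier, later in pairs)

end_offsets = {0: 1500, 1: 900, 2: 1200}
committed = {0: 1480, 1: 900, 2: 700}
lag = partition_lag(end_offsets, committed)
print(lag)                               # {0: 20, 1: 0, 2: 500}
print(lag_is_growing([520, 800, 1400]))  # True
```

Partition 2 carries nearly all the lag here, which points at skew or a stuck consumer rather than a group that is uniformly too slow.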
How do retention and log compaction differ? medium
Two cleanup policies. delete (time/size retention): segments older than retention.ms or beyond retention.bytes are deleted — for event streams where old data ages out. compact: Kafka keeps at least the latest value per key and garbage-collects older values for the same key — turning the topic into a changelog / "current state per key" (used for things like a database of latest records, Kafka Streams state, __consumer_offsets). You can combine compact,delete. Compaction is async (a background cleaner) so duplicates/old values may linger briefly. Pick delete for time-series events, compact for keyed state you want to retain indefinitely.
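Compaction's end result can be sketched in a few lines: keep the newest value per key, in offset order. This is a simplified model with a hypothetical compact function; it drops tombstoned keys immediately, whereas Kafka keeps the tombstone itself for a retention period before removing it.

```python
def compact(log):
    """Keep only the latest record per key.

    log is a list of (key, value) pairs in offset order; a value of None
    is a tombstone that deletes the key.
    """
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)  # later offsets overwrite earlier ones
    survivors = sorted(
        (offset, key, value)
        for key, (offset, value) in latest.items()
        if value is not None
    )
    return [(key, value) for _, key, value in survivors]

log = [("u1", "a"), ("u2", "b"), ("u1", "c"), ("u3", "d"), ("u2", None)]
compacted = compact(log)
print(compacted)  # [('u1', 'c'), ('u3', 'd')]
```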
What causes consumer-group rebalancing and why is it painful? medium
A rebalance reassigns partitions across the group; triggered when a consumer joins/leaves, crashes, misses heartbeats (session.timeout.ms), or takes too long between polls (max.poll.interval.ms — slow processing looks like a dead consumer). It's painful because the classic "stop-the-world" rebalance pauses all consumption while partitions are redistributed, and unprocessed work may be reprocessed. Mitigations: cooperative/incremental rebalancing (only moves affected partitions, no full stop), static group membership (group.instance.id avoids reshuffles on quick restarts), and tuning poll/heartbeat timeouts and batch sizes so processing fits within max.poll.interval.ms.
How do you choose partition count? medium
Size for peak throughput ÷ per-partition throughput and ≥ max consumers; but more partitions add overhead and can't be reduced — over-provision modestly, increasing breaks key ordering.
Explain acks=0/1/all. medium
0: no ack (fast, can lose). 1: leader-only ack (loses if leader dies pre-replication). all: ack after ISR has it — pair with min.insync.replicas=2 + RF=3 for no loss.
At-most/at-least/exactly-once delivery? medium
Depends on commit timing + idempotence: commit-before-process (at-most), process-then-commit (at-least, make idempotent), idempotent producer + transactions + read_committed (exactly-once).
What is consumer lag and how to monitor? medium
log-end-offset − committed offset per partition; monitor via kafka-consumer-groups/Burrow/exporter→Prometheus and alert on trend; fix by adding consumers or speeding processing.
Retention vs log compaction? medium
delete: drop segments past time/size (event streams). compact: keep latest value per key (changelog/state). Can combine compact,delete.
What causes consumer-group rebalancing? medium
Members join/leave/crash, missed heartbeats, or exceeding max.poll.interval.ms (slow processing looks dead); cooperative rebalancing + static membership reduce the pain.
What is the role of the controller? medium
One broker manages cluster metadata: partition leadership, broker membership, and leader elections on failure (via KRaft's Raft quorum).
How does the idempotent producer work? medium
Each producer gets a producer ID (PID) and per-partition sequence numbers, so the broker can detect and drop retried sends: exactly-once, in-order writes per partition within a single producer session.
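The dedupe idea can be modelled with a tiny in-memory log that remembers the last sequence number per producer. PartitionLog is a hypothetical class for illustration; the real broker logic is more involved than this single counter.

```python
class PartitionLog:
    """Toy partition that dedupes retries using (producer id, sequence number)."""

    def __init__(self):
        self.records = []
        self.last_seq = {}  # producer id -> last sequence number appended

    def append(self, producer_id, seq, value):
        last = self.last_seq.get(producer_id, -1)
        if seq <= last:
            return "duplicate"      # retry of something already written
        if seq != last + 1:
            return "out_of_order"   # gap in the sequence: reject
        self.records.append(value)
        self.last_seq[producer_id] = seq
        return "appended"

log = PartitionLog()
results = [
    log.append(producer_id=7, seq=0, value="order-1"),
    log.append(producer_id=7, seq=1, value="order-2"),
    log.append(producer_id=7, seq=1, value="order-2"),  # retried send
]
print(results)      # ['appended', 'appended', 'duplicate']
print(log.records)  # ['order-1', 'order-2']
```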
What are Kafka transactions? medium
Atomically write to multiple partitions and commit consumer offsets together; with read_committed consumers this enables exactly-once stream processing (consume-process-produce).
What is rack awareness? medium
Placing replicas across racks/AZs so a rack/zone failure doesn't take all copies of a partition — configured via broker.rack.
Hard — senior & debug
A broker is down. Walk through what happens and what you do. hard
When a broker dies, the controller detects it (session loss) and triggers leader election: for every partition that broker led, a follower from the ISR is promoted, so reads/writes continue (assuming RF≥2 and surviving ISR). Under-replicated partitions appear (replicas now missing). Your steps: (1) confirm via UnderReplicatedPartitions / OfflinePartitions metrics — Offline (no leader available) is the real emergency. (2) Diagnose the broker (disk, OOM, network, hung). (3) Bring it back — on rejoin it catches up from leaders and re-enters ISR. (4) If the node is permanently lost, replace it and reassign partitions to rebuild replicas. Watch out for unclean leader election (electing an out-of-sync replica) — keep it disabled in prod to avoid data loss, accepting that a partition with no in-sync replica goes offline rather than losing committed data.
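The election rule described above can be sketched for a single partition. The function elect_leader and its arguments are hypothetical; it only models the choice between an in-sync replica, an unclean election, and going offline.

```python
def elect_leader(replicas, isr, alive, unclean_election=False):
    """Pick a new leader for one partition after a broker failure.

    replicas: assigned replica broker ids in preference order
    isr: broker ids that were in sync before the failure
    alive: broker ids currently up
    Returns the new leader id, or None if the partition goes offline.
    """
    for broker in replicas:
        if broker in alive and broker in isr:
            return broker
    if unclean_election:
        for broker in replicas:
            if broker in alive:
                return broker  # out-of-sync replica: committed data may be lost
    return None

# Broker 1 (the leader) died; brokers 2 and 3 were in sync.
print(elect_leader([1, 2, 3], isr={1, 2, 3}, alive={2, 3}))   # 2
# Broker 1 was the only in-sync replica: the partition goes offline.
print(elect_leader([1, 2, 3], isr={1}, alive={2, 3}))          # None
# Same case with unclean election enabled: available again, at the risk of data loss.
print(elect_leader([1, 2, 3], isr={1}, alive={2, 3}, unclean_election=True))  # 2
```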
Throughput is high but tail latency is spiking. How do you diagnose? hard
Bisect producer → broker → consumer. Broker disk: Kafka is sequential-IO bound; check disk utilization/IO wait, page-cache pressure, and segment flushing — a slow/full disk or fsync stall spikes latency. Replication: lagging followers (slow ISR) delay acks=all writes; check ReplicaFetcher and network between brokers. Partition skew / hot partition: a bad key concentrates load on one leader/broker — fix the keying or repartition. Request queue: check request-handler idle ratio and queue time — too few num.io.threads/num.network.threads. GC pauses on the broker JVM. Producer: batching (linger.ms/batch.size) and compression trade latency vs throughput; too-small batches hammer brokers. Consumer: rebalances and slow processing. Use end-to-end latency metrics and per-broker breakdown to localize, then address the specific bottleneck.
How do you design a no-data-loss, highly available Kafka topic? hard
Durability + availability come from layering settings: RF=3 with rack/AZ awareness (replicas in different failure domains). min.insync.replicas=2 so a write needs ≥2 copies. Producer acks=all + idempotent producer + retries (so retried sends don't dupe). unclean.leader.election.enable=false (never promote an out-of-sync replica → no committed-data loss; accept brief unavailability instead). Consumer: at-least-once with idempotent processing or full EOS via transactions + read_committed. Operationally: monitor under-replicated/offline partitions, spread leadership (preferred-leader election), and for region failure add MirrorMaker 2 replication to a DR cluster (async → non-zero RPO). The core trade-off: RF=3 + min.insync.replicas=2 + acks=all means a write fails when fewer than 2 replicas are in sync; you choose consistency over availability for that write.
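One way to keep these settings honest is a small audit that compares a topic's and producer's configuration against the list above. The helper durability_gaps is hypothetical; the config keys are the standard Kafka names, and treating an unset enable.idempotence as "not enabled" is a conservative assumption of this sketch (recent clients enable it by default).

```python
def durability_gaps(topic_cfg, producer_cfg, replication_factor):
    """Return human-readable gaps versus the no-data-loss settings."""
    gaps = []
    min_isr = int(topic_cfg.get("min.insync.replicas", 1))
    if replication_factor < 3:
        gaps.append("replication factor below 3")
    if min_isr < 2:
        gaps.append("min.insync.replicas below 2")
    if min_isr >= replication_factor:
        gaps.append("min.insync.replicas >= RF: one broker down blocks acks=all writes")
    if str(producer_cfg.get("acks")) not in ("all", "-1"):
        gaps.append("producer acks is not all")
    if str(producer_cfg.get("enable.idempotence", "false")).lower() != "true":
        gaps.append("idempotent producer not enabled")
    if str(topic_cfg.get("unclean.leader.election.enable", "false")).lower() != "false":
        gaps.append("unclean leader election enabled")
    return gaps

good = durability_gaps(
    {"min.insync.replicas": "2", "unclean.leader.election.enable": "false"},
    {"acks": "all", "enable.idempotence": "true"},
    replication_factor=3,
)
risky = durability_gaps({}, {"acks": "1"}, replication_factor=2)
print(good)   # []
print(risky)  # four gaps: RF, min.insync.replicas, acks, idempotence
```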
How do you scale a cluster and rebalance data without downtime? hard
Adding brokers does not automatically move existing partitions — new brokers only get new partitions until you act. Use partition reassignment (kafka-reassign-partitions.sh, or Cruise Control to automate) to redistribute replicas onto the new brokers for balance. Critically, throttle the reassignment (replication bandwidth limit) so the data movement doesn't saturate the network/disk and starve live traffic — unthrottled rebalances are a classic self-inflicted outage. Do it incrementally, watch under-replicated partitions and latency, and rebalance leadership afterward so load is even. Cruise Control is the standard for continuous, goal-based balancing (disk, leader count, rack) and self-healing. Removing a broker is the reverse: drain its replicas off first, then decommission.
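As a sketch of what a reassignment involves, the code below builds a plan in the JSON shape that kafka-reassign-partitions.sh accepts as a reassignment file, and estimates how long the move takes under a throttle. The round-robin placement, the topic name, and the numbers are illustrative assumptions; Cruise Control or the tool's own plan generation would normally compute the placement.

```python
import json

def reassignment_plan(topic, num_partitions, brokers, replication_factor):
    """Spread replicas round-robin over all brokers, including new ones."""
    partitions = []
    for p in range(num_partitions):
        replicas = [brokers[(p + r) % len(brokers)] for r in range(replication_factor)]
        partitions.append({"topic": topic, "partition": p, "replicas": replicas})
    return {"version": 1, "partitions": partitions}

def transfer_seconds(bytes_to_move, throttle_bytes_per_sec):
    """Lower bound on reassignment duration under a replication throttle."""
    return bytes_to_move / throttle_bytes_per_sec

# Broker 4 was just added to a cluster of brokers 1-3.
plan = reassignment_plan("orders", num_partitions=4, brokers=[1, 2, 3, 4], replication_factor=3)
print(json.dumps(plan, indent=2))

# Moving 500 GB at a 50 MB/s throttle takes at least this many seconds.
estimate = transfer_seconds(500 * 10**9, 50 * 10**6)
print(estimate)  # 10000.0
```

The estimate is useful for picking a throttle: too low and the move runs for days, too high and it competes with live traffic.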
Consumers are stuck — lag growing, but the app looks alive. What's happening? hard
Most likely a rebalance loop or a poll-interval timeout. If processing a batch takes longer than max.poll.interval.ms, the consumer is considered failed and leaves the group, and the group coordinator triggers a rebalance. The consumer then rejoins and times out again, so lag grows while the app "runs." Check consumer logs for repeated rebalance/"leaving group" messages. Fixes: reduce max.poll.records so a batch finishes in time, raise max.poll.interval.ms, or move heavy work off the poll thread; static membership helps with restarts but does not cure a poll timeout. Other culprits: a poison message the consumer retries forever (add error handling / dead-letter topic), a hot/skewed partition overloading one consumer, a slow downstream dependency (DB/API) that stalls processing, or offsets not committing (auto-commit disabled and never committed manually). Diagnose via kafka-consumer-groups.sh --describe (which partitions lag, is the group stable or perpetually rebalancing).
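A quick way to sanity-check the poll settings is to compute how many records fit in the poll interval at the observed per-record processing time. The helper name and the 50% safety factor are assumptions of this sketch; 300000 ms and 500 records are the commonly documented client defaults.

```python
def max_safe_poll_records(max_poll_interval_ms, per_record_ms, safety_factor=0.5):
    """Largest batch that should finish within a fraction of max.poll.interval.ms."""
    budget_ms = max_poll_interval_ms * safety_factor
    return max(1, int(budget_ms // per_record_ms))

# With a 5-minute poll interval and 800 ms per record, a 500-record batch
# needs 400 seconds and would blow the interval.
batch_time_ms = 500 * 800
print(batch_time_ms > 300_000)               # True
print(max_safe_poll_records(300_000, 800))   # 187
```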
A broker dies — what happens and what do you do? hard
Controller triggers leader election (ISR follower promoted); under-replicated partitions appear (offline = emergency). Diagnose disk/OOM/net, bring it back to rejoin ISR, or replace + reassign; keep unclean election off.
Configure a no-data-loss topic. hard
RF=3 (rack-aware), min.insync.replicas=2, producer acks=all + idempotent + retries, unclean.leader.election.enable=false; consumers at-least-once idempotent or EOS. Writes fail if too few ISR (consistency over availability).
How do you scale and rebalance without downtime? hard
Adding brokers doesn't move data — run partition reassignment (or Cruise Control) with a throttle so it doesn't saturate network/disk; do it incrementally and watch under-replicated partitions.
Why disable unclean leader election? hard
Electing an out-of-sync replica as leader would lose committed messages; disabling it keeps a partition offline instead — choosing durability over availability.
How do you achieve exactly-once stream processing? hard
Idempotent producer + transactions to atomically produce + commit offsets, read_committed consumers, and Kafka Streams' EOS — clean for Kafka-to-Kafka; external sinks still need idempotency.
How does ISR shrink/expand and affect availability? hard
Replicas falling behind (replica.lag.time.max.ms) leave ISR; if ISR < min.insync.replicas, acks=all writes fail. Slow followers (disk/net) shrink ISR — monitor and fix replication lag.
How do you design for disaster recovery across regions? hard
MirrorMaker 2 async replication to a DR cluster (non-zero RPO), offset translation, and a failover/promotion plan — Kafka itself is single-region for sync durability.
How do you tune for high throughput vs low latency? hard
Throughput: larger batch.size and linger.ms, compression, more partitions. Latency: small linger.ms and fewer in-flight requests; balance acks (all vs 1) and check broker disk/IO and ISR replication.
How does log compaction actually work internally? hard
A background cleaner rewrites segments keeping the latest record per key (and tombstones for deletes for a retention period); duplicates/old values may linger briefly until cleaned.
How do you size a Kafka cluster? hard
From target throughput, retention (disk = rate × retention × RF), partition count (parallelism vs overhead), and replication network; add headroom for spikes, recovery, and rebalancing.
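The disk part of that sizing is simple arithmetic. A sketch, with a hypothetical helper and example numbers; the 30% headroom is an assumed figure, not a recommendation from the source.

```python
def disk_bytes_needed(ingest_bytes_per_sec, retention_seconds, replication_factor, headroom=0.3):
    """Cluster-wide disk = ingest rate x retention x RF, plus headroom."""
    raw = ingest_bytes_per_sec * retention_seconds * replication_factor
    return raw * (1 + headroom)

seven_days = 7 * 24 * 3600
raw_bytes = disk_bytes_needed(20 * 10**6, seven_days, 3, headroom=0)
with_headroom = disk_bytes_needed(20 * 10**6, seven_days, 3)
print(raw_bytes / 10**12)      # 36.288 TB before headroom
print(with_headroom / 10**12)  # roughly 47 TB with 30% headroom
```

Divide the result by the broker count to get per-broker disk, then check it against what a single broker can re-replicate in an acceptable recovery window.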
Scenario-based
Consumer lag is growing. How do you diagnose and fix it? medium
Lag = consumers behind producers. Check the CLI (kafka-consumer-groups.sh --describe): which partitions lag, is the group stable or perpetually rebalancing? Causes + fixes: too few consumers (add up to partition count), slow processing (batch/optimize, move heavy work off the poll thread), a poison message retried forever (DLQ), a hot/skewed partition (fix keying/repartition), or a rebalance loop from max.poll.interval.ms timeouts. Beyond partition count you can't add parallelism; that's the ceiling.
A broker goes down. What happens and what do you do? hard
The controller detects it and triggers leader election: for each partition that broker led, an in-sync follower is promoted, so traffic continues (if RF≥2 and ISR survives). Under-replicated partitions appear; watch UnderReplicatedPartitions and especially OfflinePartitions (no leader = real emergency). Diagnose the broker (disk/OOM/network/hang) and bring it back — it catches up from leaders and rejoins ISR. If permanently lost, replace + reassign partitions. Keep unclean leader election disabled so you never promote an out-of-sync replica (data loss).
You must guarantee no data loss on a topic. How do you configure it? hard
Layer the settings: RF=3 with rack/AZ awareness, min.insync.replicas=2, producer acks=all + idempotent producer + retries, and unclean.leader.election.enable=false. Together these mean a write is only acked once ≥2 replicas have it, and a partition goes offline rather than losing committed data. Consumers: at-least-once with idempotent processing (or full EOS via transactions + read_committed). Trade-off: writes fail when too few replicas are in sync, so you chose consistency over availability for that write.
You added brokers but the load didn't move to them. Why? medium
Adding brokers does not auto-rebalance existing partitions — new brokers only get new partitions. You must run partition reassignment (kafka-reassign-partitions.sh, or let Cruise Control do it) to move replicas onto the new brokers. Critically, throttle the reassignment (replication bandwidth limit) so the data movement doesn't saturate network/disk and starve live traffic — unthrottled rebalances are a classic self-inflicted outage. Do it incrementally and watch under-replicated partitions + latency.
Consumers are stuck in a rebalance loop. What's the cause and fix? hard
Processing a batch takes longer than max.poll.interval.ms, so the consumer is considered failed and leaves the group, the group coordinator triggers a rebalance, the consumer rejoins, and it times out again: lag grows while the app "runs." Confirm via repeated "leaving group"/rebalance log lines. Fixes: lower max.poll.records so a batch finishes in time, raise max.poll.interval.ms, and move heavy work off the poll thread. Cooperative rebalancing and static group membership (group.instance.id) reduce the cost of each rebalance and avoid reshuffles on quick restarts.
Throughput is high but tail latency is spiking. Where do you look? hard
Bisect producer→broker→consumer. Broker disk (Kafka is sequential-IO bound — IO wait, page-cache pressure, fsync stalls), lagging ISR followers delaying acks=all writes, a hot/skewed partition concentrating load on one leader, request-handler saturation (idle ratio, too few io/network threads), and JVM GC pauses. Producer side: tune linger.ms/batch.size/compression (too-small batches hammer brokers). Use per-broker + end-to-end latency metrics to localize, then fix the specific bottleneck.
Consumer lag is growing. Diagnose and fix. medium
Check kafka-consumer-groups.sh --describe: too few consumers (add up to partition count), slow processing, poison message, hot partition, or rebalance loop (max.poll.interval.ms).
A broker goes down. Walk through it. hard
Leader election promotes ISR followers; watch under-replicated/offline partitions; diagnose the broker, bring it back to rejoin ISR or replace + reassign; unclean election stays off.
Guarantee no data loss on a topic. Configure. hard
RF=3 rack-aware, min.insync.replicas=2, acks=all + idempotent producer + retries, unclean election off; consumers idempotent/EOS.
Added brokers but load didn't move. Why? medium
New brokers only get new partitions; run throttled partition reassignment (or Cruise Control) to move replicas onto them.
Consumers stuck in a rebalance loop. Cause/fix? hard
Processing exceeds max.poll.interval.ms → evicted → rebalance → repeat; lower max.poll.records, raise the interval, move heavy work off the poll thread, use static membership + cooperative rebalancing.
Throughput high but tail latency spikes. Look where? hard
Broker disk IO/page-cache/fsync, lagging ISR followers delaying acks=all, hot/skewed partition, request-handler saturation, JVM GC; localize via per-broker + end-to-end metrics.
A poison message keeps failing a consumer. Handle? medium
Catch + route to a dead-letter topic after N retries (track attempts), so one bad message doesn't block the partition; alert + inspect the DLQ.
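The retry-then-dead-letter pattern can be sketched without a broker: plain lists stand in for the input partition and the dead-letter topic, and consume_with_dlq is a hypothetical helper. In a real consumer the dead-letter append would be a produce to a separate topic.

```python
def consume_with_dlq(records, handler, max_attempts=3):
    """Process records in order; after max_attempts failures, dead-letter the record."""
    processed, dead_letters = [], []
    for record in records:
        for attempt in range(1, max_attempts + 1):
            try:
                processed.append(handler(record))
                break
            except Exception as exc:
                if attempt == max_attempts:
                    # Give up on this record so the rest of the partition can proceed.
                    dead_letters.append(
                        {"record": record, "error": str(exc), "attempts": attempt}
                    )
    return processed, dead_letters

# int() fails on "x", which plays the poison message here.
processed, dead_letters = consume_with_dlq(["1", "x", "3"], handler=int)
print(processed)                                   # [1, 3]
print([d["record"] for d in dead_letters])         # ['x']
```

The record after the poison message is still processed, which is the point: one bad record no longer blocks the partition.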
Disk filling on brokers. Respond. medium
Check retention settings vs ingest rate, reduce retention/size or add storage, ensure compaction working, and watch for an unconsumed topic ballooning; size disk = rate × retention × RF.
Ordering broke after increasing partitions. Why? hard
Increasing partitions changes the key→partition hash mapping, so a key's messages can land on a new partition — ordering per key isn't preserved across the change; avoid by over-provisioning upfront.
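A short experiment shows how many keys change partition when the count goes from 6 to 8. It uses zlib.crc32 as a stand-in hash (Kafka's Java client uses murmur2) and made-up key names, so the exact count is illustrative only.

```python
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    """Hash modulo partition count (crc32 is a stand-in for Kafka's murmur2)."""
    return zlib.crc32(key) % num_partitions

keys = [f"customer-{i}".encode() for i in range(1000)]
moved = sum(1 for key in keys if partition_for(key, 6) != partition_for(key, 8))
print(f"{moved} of {len(keys)} keys map to a different partition after 6 -> 8")
```

For a uniform hash, about three quarters of keys move on a 6 to 8 change, so new records for those keys land on a different partition than their older records.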
Need to migrate a topic to a new cluster with no loss. Approach? hard
MirrorMaker 2 to replicate data + offsets, dual-consume to verify, cut producers over, drain consumers on the old cluster, then switch consumers — validate offset translation.
Kafka-admin loops drill the durability/ordering model hard: expect to whiteboard RF + acks + min.insync.replicas and explain exactly-once vs at-least-once. The most common scenario questions: consumer lag ("lag is growing, debug it"), broker failure (leader election, under-replicated/offline partitions, unclean leader election), and scaling/rebalancing (partition reassignment + throttling, Cruise Control). They probe partition-count trade-offs, retention vs compaction, and rebalancing pain. Senior loops add DR (MirrorMaker 2, RPO), capacity sizing, and KRaft vs ZooKeeper. Answer with the specific config knobs and a debugging method, not hand-waving.