Kafka is a distributed, partitioned, replicated commit log. Producers append records to partitions; consumers read at their own offset; the broker just stores an immutable, ordered log per partition and serves reads from the page cache. Everything — ordering, parallelism, delivery guarantees, throughput — falls out of "partition = the unit of everything". Senior interviews probe the why: replication, exactly-once, rebalancing, and the trade-offs.
1. The mental model
| Concept | What it is |
|---|---|
| Topic | A named stream, split into partitions. Logical; not a unit of parallelism on its own. |
| Partition | An ordered, immutable, append-only log. The unit of ordering, parallelism, replication, and retention. Ordering is guaranteed only within a partition. |
| Offset | Monotonic position of a record in a partition. Consumers track committed offsets; brokers don't track per-consumer position. |
| Broker | A server holding partition replicas, serving produce/fetch. |
| Replica / Leader / Follower | Each partition has R replicas; one leader (all reads/writes), the rest followers (replicate). |
| ISR | In-Sync Replicas — followers caught up within replica.lag.time.max.ms. Only ISR members are eligible to become leader (with unclean election off). |
| Consumer group | Consumers sharing a group.id; partitions are divided among members — each partition to exactly one member. |
| Consumer lag | log-end-offset − committed-offset. The key health metric. |
ordering is per-partition only
There is no global order across a topic. If you need ordering for an entity, route it to one
partition via a stable key. More partitions = more parallelism but weaker ordering scope.
2. Storage internals
- A partition is a directory of segment files (
.log) + indexes (.indexoffset→position,.timeindex). The active segment is appended; older segments roll atsegment.bytes/segment.ms. - Writes are sequential appends to the active segment — why Kafka is fast on spinning disk and SSD alike.
- Reads are served from the OS page cache; zero-copy (
sendfile) ships bytes from cache straight to the socket — no userspace copy. - Retention deletes whole segments past
retention.ms/retention.bytes— not individual records. - Records are stored in batches (compressed as a unit) — compression + batching are the throughput levers.
3. Producer internals & config
acks=all # leader waits for all ISR before ack (durability)
enable.idempotence=true # dedupe retries -> exactly-once *per partition* on produce
# (implies acks=all, retries>0, max.in.flight<=5)
linger.ms=10 # wait to batch (throughput vs latency)
batch.size=64KB # max bytes per partition batch
compression.type=zstd # lz4/snappy/zstd/gzip — big throughput win
max.in.flight.requests.per.connection=5 # >1 + idempotence keeps order
retries=Integer.MAX ; delivery.timeout.ms # bound total time, not retry count
- acks:
0fire-and-forget (lose on failure),1leader-only (lose if leader dies before replication),all+min.insync.replicas=2= no acknowledged-then-lost writes. - Idempotent producer stamps a producer ID + sequence number so a retried batch isn't written twice — exactly-once on the produce path, per partition.
- Partitioner: keyed →
hash(key) % partitions(sticky ordering per key); null key → sticky partitioning (batch to one partition then rotate).
acks=all is not "no data loss" alone
Pair
acks=all with min.insync.replicas=2 and RF=3. With RF=3/min.isr=2 you
survive one broker loss with no acknowledged-write loss; if ISR shrinks below min.isr, producers
get NotEnoughReplicas and block — durability over availability, by design.4. Consumer internals & config
group.id=orders-svc enable.auto.commit=false # commit manually AFTER processing (at-least-once) auto.offset.reset=earliest|latest # when no committed offset exists max.poll.records=500 ; max.poll.interval.ms=300000 # processing budget session.timeout.ms=45000 ; heartbeat.interval.ms=3000 partition.assignment.strategy=CooperativeStickyAssignor # incremental rebalance isolation.level=read_committed # skip aborted txn records (EOS reads)
- Offset commit: commit after processing for at-least-once; commit before = at-most-once. Offsets live in the internal
__consumer_offsetstopic. - Group coordinator (a broker) manages membership; a rebalance reassigns partitions when members join/leave or the topic changes.
- Heartbeats on a background thread; but exceeding
max.poll.interval.ms(slow processing) ejects the member → rebalance. This, not network, is the #1 rebalance cause. - Static membership (
group.instance.id) avoids rebalances on rolling restarts.
5. Rebalance protocols
| Protocol | Behaviour |
|---|---|
| Eager (range/round-robin) | Stop-the-world: every consumer revokes all partitions, then reassigns. Simple, disruptive. |
| Cooperative sticky | Incremental: only the partitions that must move are revoked; others keep processing. Modern default. |
| Static membership | Members keep identity across restarts (group.instance.id + longer session timeout) — no rebalance on a quick bounce. |
6. Replication & leader election
- Replication factor (RF) = copies per partition. RF=3 is standard for prod.
- ISR shrinks when a follower lags >
replica.lag.time.max.ms; expands when it catches up. - Leader election: on leader failure, the controller promotes an ISR member.
unclean.leader.election.enable=false(default) means it will not promote an out-of-sync replica — choosing consistency (no data loss) over availability. - Controller: one broker coordinates leadership/metadata. In KRaft mode the controllers form a Raft quorum (no ZooKeeper); ZooKeeper is removed in modern Kafka.
7. Delivery semantics & EOS
| Guarantee | How |
|---|---|
| At-most-once | Commit offset before processing — crash loses the record. |
| At-least-once | Commit after processing — crash redelivers. Default target; handlers must be idempotent. |
| Exactly-once (EOS) | Idempotent producer + transactions (transactional.id, initTransactions/commitTransaction) + consumer read_committed. Atomic "consume-process-produce". |
EOS is end-to-end only inside Kafka
Transactions give exactly-once for read-from-Kafka → write-to-Kafka (+ offset commit) atomically.
A side effect to an external system (DB, email) is not covered — make those idempotent or use the
outbox/transactional-outbox pattern.
8. Partitioning & keys
- Partition count caps consumer parallelism (active consumers ≤ partitions). Pick for peak throughput + headroom; you can add partitions but it breaks key→partition mapping (and ordering) for existing keys.
- Key choice = ordering scope + load distribution. A hot key (e.g. one big customer) creates a skewed/hot partition.
- Rule of thumb: target partitions so each handles a manageable MB/s; more partitions = more open files, more rebalance cost, longer leader-election.
9. Retention & compaction
- Delete (default): drop segments past
retention.ms/retention.bytes. - Log compaction (
cleanup.policy=compact): keep the latest value per key forever; a tombstone (null value) deletes a key. Powers changelog/state topics,__consumer_offsets, KTables. - Can combine
compact,delete.
10. Ecosystem
- Kafka Connect — declarative source/sink connectors (DB CDC via Debezium, S3, ES) — no code, scales as workers.
- Kafka Streams — JVM stream-processing library (KStream/KTable, joins, windows, exactly-once) running in your app.
- Schema Registry — Avro/Protobuf/JSON schemas + compatibility (backward/forward/full) so producers/consumers evolve safely.
- ksqlDB — SQL over streams.
11. CLI & ops
kafka-topics.sh --bootstrap-server B --create --topic t --partitions 12 --replication-factor 3 kafka-topics.sh --bootstrap-server B --describe --topic t # leaders, ISR, replicas kafka-topics.sh --bootstrap-server B --under-replicated-partitions kafka-console-producer.sh --bootstrap-server B --topic t kafka-console-consumer.sh --bootstrap-server B --topic t --from-beginning --group g kafka-consumer-groups.sh --bootstrap-server B --describe --group g # LAG per partition kafka-consumer-groups.sh --bootstrap-server B --group g --reset-offsets --to-earliest --topic t --execute kafka-configs.sh --bootstrap-server B --alter --entity-type topics --entity-name t \ --add-config retention.ms=604800000 kafka-reassign-partitions.sh ... # move replicas / rebalance
Key metrics: UnderReplicatedPartitions (must be 0), OfflinePartitionsCount,
ActiveControllerCount (=1), IsrShrinksPerSec, request handler idle ratio,
consumer records-lag-max.
12. Senior interview Q&A
- How does Kafka guarantee ordering?Only within a partition. Route records that must be ordered to the same partition via a stable key. No cross-partition/topic order.
- acks=all vs min.insync.replicas?acks=all = leader waits for all current ISR. min.insync.replicas = the floor of ISR required to accept a write; below it, produce fails. Together (RF3/min.isr2) = survive one broker loss without acknowledged-write loss.
- How does exactly-once work?Idempotent producer (PID+seq dedup) + transactions (atomic produce + offset commit) + consumer read_committed. Exactly-once within Kafka; external side effects need idempotency/outbox.
- What triggers a rebalance and how to avoid storms?Member join/leave, topic change, or exceeding max.poll.interval.ms (slow processing). Use cooperative-sticky assignor + static membership; size max.poll.records to your processing time.
- What is the ISR and why does it matter?In-sync replicas caught up to the leader. Only ISR members can be elected leader (clean election); shrinking ISR risks availability/durability trade-offs.
- Unclean leader election?Promoting an out-of-sync replica when no ISR survives — restores availability but loses data. Off by default (consistency first).
- Log compaction vs retention?Retention deletes old segments by time/size. Compaction keeps the latest value per key forever (tombstone deletes) — for changelog/state topics.
- Why is Kafka fast?Sequential append-only writes, OS page cache reads, zero-copy (sendfile), batching + compression, partition parallelism.
- How do you scale consumer throughput?Add consumers up to partition count; beyond that add partitions (mind key remapping); parallelize processing; raise fetch/poll sizes if you can keep up.
- Can you add partitions safely?You can increase (not decrease) — but it changes hash(key)%n, so existing keys may move partitions, breaking per-key ordering. Plan capacity up front.
- KRaft vs ZooKeeper?KRaft replaces ZooKeeper with a Raft-based controller quorum inside Kafka — simpler ops, faster metadata, no separate ZK cluster. ZK is deprecated/removed.
- How do you handle a poison message?Catch, route to a dead-letter topic, commit past it. Don't block the partition forever; bound retries.
- Kafka vs RabbitMQ?Kafka = durable replayable log, high-throughput streaming, pull, ordered per partition. RabbitMQ = smart broker, flexible routing, per-message ack/requeue, push, lower-throughput task queues.
- How to debug rising consumer lag?Check consumer-groups LAG per partition (skew?), processing time vs max.poll.interval, rebalance loops, partition count cap, and broker URP/ISR health.