Deep dive 03 of the KubeCon Mumbai 2026 series. Mary Vinothini S and Roopadharsini K of Fidelity Investments tackled a problem every team running Kafka on Kubernetes eventually hits: your dashboards say the cluster is "up," but you still can't answer where a problem started, how it propagated, or which component caused it. Their fix is a vendor-agnostic OpenTelemetry pipeline running natively on Kubernetes that unifies metrics, logs, and traces for Kafka, paired with a pragmatic, SRE-and-compliance-first modernisation journey for getting there without drowning in metric cardinality.
Kafka is the central nervous system of a modern data platform, and at a firm like Fidelity it carries real-time, money-moving events. Moving it onto Kubernetes buys scalability and portability — but it also multiplies the number of moving parts you have to see. This talk is the honest account of making that visibility actually work, told by practitioners who've clearly been burned by "monitoring" that monitors the wrong things.
Why real-time, and why Kubernetes
The framing started with the value of event streaming: capture what just happened, stream it instantly, act on it now. Kafka is the "central nervous system for data in motion" — highly available with reliable replication, high-throughput and low-latency, and scalable to absorb massive data growth.
So why run it on Kubernetes at all? The talk's six-word answer: scalability, orchestration, standardization, portability, governance, efficiency. Kubernetes turns Kafka from a hand-tended pet cluster into a declaratively-managed, portable workload that fits the same platform every other service uses. But — and this is the whole point of the talk — that orchestration layer adds abstraction, and abstraction without observability is a fog.
The guessing game — monitoring vs observability
Their sharp phrase for the trap was "seeing the system, missing the signal." Baseline monitors typically cover a minimal set of metrics (CPU, memory, throughput) that tell you a broker is up but nothing about how it's behaving. The internals that actually predict a Kafka incident are invisible to that kind of monitoring: event flow, consumer lag, ISR (in-sync replica) health, and the effect of tuning the producer's linger.ms setting or adding broker replicas. You can be green across the board and still be minutes from a cascading failure.
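To make one of those internals concrete: consumer lag is, per partition, the log end offset minus the consumer group's committed offset. A minimal Python sketch, assuming the offsets have already been fetched from the cluster; the `partition_lag` helper, the topic name, and the numbers are hypothetical and not from the talk:

```python
from typing import Dict, Tuple

TopicPartition = Tuple[str, int]  # (topic name, partition number)


def partition_lag(
    end_offsets: Dict[TopicPartition, int],
    committed: Dict[TopicPartition, int],
) -> Dict[TopicPartition, int]:
    """Lag per partition = log end offset - committed offset, floored at 0."""
    lag = {}
    for tp, end in end_offsets.items():
        # Simplification: a partition with no committed offset is counted
        # as if the group had read nothing from offset 0.
        lag[tp] = max(end - committed.get(tp, 0), 0)
    return lag


# Illustrative offsets only.
end_offsets = {("payments", 0): 1_200, ("payments", 1): 950}
committed = {("payments", 0): 1_180, ("payments", 1): 400}

lag = partition_lag(end_offsets, committed)
total_lag = sum(lag.values())
```

A CPU graph would look identical for both partitions here, while the lag numbers show that partition 1 is falling far behind partition 0.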
The three foundations of observability
The talk grounded the solution in the classic three signals, with a one-line mnemonic worth memorising (metrics tell you that, logs tell you why, traces tell you where):
| Signal | What it gives you | It answers… |
|---|---|---|
| Metrics | health and trends over time | that something is wrong |
| Logs | events and details | why it's wrong |
| Traces | the end-to-end request path | where it's wrong |
The insight isn't that you need all three (everyone says that); it's that they're only powerful when correlated. A metric spike that you can pivot straight to the logs behind it and the trace it sits on is what turns a 2am guessing game into a five-minute diagnosis.
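Mechanically, that pivot depends on a shared key. The sketch below assumes every log record and span carries a trace ID and that the spiking metric exposes one through an exemplar; the `Signal` type, the IDs, and the messages are hypothetical illustrations rather than the speakers' implementation:

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import DefaultDict, Iterable, List


@dataclass(frozen=True)
class Signal:
    """One log record or span; trace_id is the shared correlation key."""
    kind: str  # "log" or "span"
    trace_id: str
    detail: str


def index_by_trace(signals: Iterable[Signal]) -> DefaultDict[str, List[Signal]]:
    """Group logs and spans by trace ID so one lookup returns both."""
    grouped: DefaultDict[str, List[Signal]] = defaultdict(list)
    for signal in signals:
        grouped[signal.trace_id].append(signal)
    return grouped


signals = [
    Signal("span", "t-1", "produce payments (12 ms)"),
    Signal("log", "t-2", "Error while fetching metadata"),
    Signal("span", "t-2", "consume payments (4300 ms)"),
]
by_trace = index_by_trace(signals)

# Hypothetical: the latency spike's exemplar points at trace "t-2".
spike_exemplar_trace_id = "t-2"
evidence = by_trace[spike_exemplar_trace_id]
why = [s.detail for s in evidence if s.kind == "log"]
where = [s.detail for s in evidence if s.kind == "span"]
```

Without the shared ID, the same three records are just three unrelated rows in three different tools.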
The OpenTelemetry pipeline — vendor-agnostic, Kubernetes-native
The architecture is a vendor-agnostic pipeline deployed natively on Kubernetes: Kafka and infrastructure emit telemetry into an OpenTelemetry pipeline, which serves the stakeholders (SREs, platform, app teams). The deliberate design choice is OpenTelemetry as the collection and shaping layer, decoupled from any particular backend.
Fig 1 — OTEL collects and shapes; backends are swappable. No vendor owns your telemetry.
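A minimal sketch of why that decoupling matters, written in plain Python rather than real Collector configuration. Everything here (`Exporter`, `InMemoryBackend`, `run_pipeline`, the `kafka-pilot` label) is hypothetical and only models the shape of the design: telemetry is shaped once, then fanned out to whichever backends are plugged in.

```python
from typing import Callable, Dict, List, Protocol

Record = Dict[str, object]


class Exporter(Protocol):
    """Anything that can receive a batch of shaped telemetry."""

    def export(self, batch: List[Record]) -> None: ...


class InMemoryBackend:
    """Hypothetical stand-in for a storage or visualization backend."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.received: List[Record] = []

    def export(self, batch: List[Record]) -> None:
        self.received.extend(batch)


def add_cluster_label(record: Record) -> Record:
    """Example shaping step: attach a cluster label to every record."""
    return {**record, "cluster": "kafka-pilot"}


def run_pipeline(
    batch: List[Record],
    processors: List[Callable[[Record], Record]],
    exporters: List[Exporter],
) -> None:
    """Apply each processor in order, then send the result to every exporter."""
    for process in processors:
        batch = [process(record) for record in batch]
    for exporter in exporters:
        exporter.export(batch)


backend_a = InMemoryBackend("backend-a")
backend_b = InMemoryBackend("backend-b")
run_pipeline(
    [{"metric": "under_replicated_partitions", "value": 2}],
    processors=[add_cluster_label],
    exporters=[backend_a, backend_b],
)
```

Swapping a backend in this model means changing the `exporters` list; the emitting side and the shaping step stay untouched.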
The "legacy monitoring vs OTEL" contrast was the business case:
| Legacy monitoring | OpenTelemetry |
|---|---|
| Vendor lock-in | Vendor-neutral |
| Cost explosion | Open source |
| Data silos | Universal, correlated data |
Reading Kafka's signals — the patterns that matter
This was the most practically useful slide for anyone operating Kafka. The talk paired log patterns with the metrics they explain, with OpenSearch surfacing the log signal behind every metric:
| Healthy patterns | Bad patterns (act now) |
|---|---|
| "Expanding ISR" | "Shrinking ISR" |
| "Preferred replica leader election completed" | java.lang.OutOfMemoryError |
| "SSL handshake completed successfully" | "Error while fetching metadata" |
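The table above translates directly into a first-pass log classifier. This is a sketch under assumptions: the patterns are the six from the slide, while the `classify` helper and the sample log lines are illustrative and not captured from a real broker.

```python
import re

# Patterns taken from the table; matching is a plain substring-style search.
HEALTHY_PATTERNS = [
    re.compile(r"Expanding ISR"),
    re.compile(r"Preferred replica leader election completed"),
    re.compile(r"SSL handshake completed successfully"),
]
BAD_PATTERNS = [
    re.compile(r"Shrinking ISR"),
    re.compile(r"java\.lang\.OutOfMemoryError"),
    re.compile(r"Error while fetching metadata"),
]


def classify(line: str) -> str:
    """Return 'bad', 'healthy', or 'unclassified' for one log line."""
    # Check bad patterns first so a line matching both is never hidden.
    if any(p.search(line) for p in BAD_PATTERNS):
        return "bad"
    if any(p.search(line) for p in HEALTHY_PATTERNS):
        return "healthy"
    return "unclassified"


# Illustrative lines, not real broker output.
sample_lines = [
    "Shrinking ISR from 1,2,3 to 1,2",
    "Expanding ISR from 1,2 to 1,2,3",
    "Started socket server acceptors",
]
results = [classify(line) for line in sample_lines]
```

In practice the "bad" bucket would feed an alert or a saved search in the log backend, so the matching log line is one click away from the metric it explains.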
The talk also covered infra and client observability (broker-side health and producer/consumer-side behaviour like lag) and codifying SLOs/SLIs in OTEL — turning "is it healthy?" into measurable, alertable objectives rather than gut feel.
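To show what "measurable, alertable" can mean, here is a small sketch of one possible SLI and its error budget. The talk did not publish its SLO definitions, so the lag threshold, the target, and both helper functions are assumptions, and the arithmetic is plain Python rather than anything OTEL-specific.

```python
from typing import Sequence


def lag_sli(lag_samples: Sequence[int], max_lag: int) -> float:
    """SLI: fraction of samples where consumer lag stayed within max_lag."""
    if not lag_samples:
        raise ValueError("need at least one lag sample")
    good = sum(1 for lag in lag_samples if lag <= max_lag)
    return good / len(lag_samples)


def error_budget_remaining(sli: float, slo_target: float) -> float:
    """Fraction of the error budget still unspent; negative means overspent."""
    budget = 1.0 - slo_target  # allowed share of bad samples
    spent = 1.0 - sli          # observed share of bad samples
    return 1.0 - spent / budget


# Illustrative values: ten lag samples, one of them far over the threshold.
samples = [10, 20, 5000, 15, 30, 12, 8, 25, 40, 18]
sli = lag_sli(samples, max_lag=1000)
remaining = error_budget_remaining(sli, slo_target=0.8)
```

With these made-up numbers, nine of ten samples are good, so roughly half of the budget allowed by an 80% target is already spent. Alerting on the remaining budget, rather than on a single bad sample, is what replaces gut feel.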
Critical constraints — the Kafka-on-K8s observability traps
The honesty here is what made the talk valuable. Instrumenting Kafka at scale runs into very specific walls:
- "353 topics × per-partition labels = metric overload." Per-partition metrics multiplied across hundreds of topics is a cardinality bomb that can cost more than the cluster it monitors. You must aggregate deliberately.
- "JMX names ≠ OTEL semantic conventions." Kafka exposes metrics via JMX with its own naming; mapping those onto OpenTelemetry's semantic conventions is real, fiddly work — not automatic.
- "Multiple collector agents — each competing with JVM heap." Stacking collection agents on the broker steals memory from the JVM that runs Kafka itself, so the observability can degrade the thing it observes.
- "Your metric names don't migrate with you" — the lock-in problem OTEL is meant to solve, stated as the motivating pain.
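The second constraint, the naming mismatch, can be sketched as an explicit translation table. The MBean names on the left are standard Kafka broker MBeans; the dotted names on the right are illustrative targets chosen for this example and should be checked against the current semantic conventions before use. `parse_object_name` and `to_otel` are hypothetical helpers.

```python
from typing import Dict, Optional, Tuple

# (MBean type, MBean name) -> target metric name.
# Target names are assumptions for illustration, not an authoritative mapping.
NAME_MAP: Dict[Tuple[str, str], str] = {
    ("ReplicaManager", "UnderReplicatedPartitions"): "kafka.partition.under_replicated",
    ("BrokerTopicMetrics", "MessagesInPerSec"): "kafka.message.count",
}


def parse_object_name(object_name: str) -> Tuple[str, Dict[str, str]]:
    """Split 'domain:k1=v1,k2=v2' into the domain and its key properties.

    Simplification: quoted values containing commas are not handled.
    """
    domain, _, props = object_name.partition(":")
    pairs = dict(p.split("=", 1) for p in props.split(","))
    return domain, pairs


def to_otel(object_name: str) -> Optional[Tuple[str, Dict[str, str]]]:
    """Return (metric name, attributes), or None when no mapping exists."""
    _, props = parse_object_name(object_name)
    target = NAME_MAP.get((props.get("type", ""), props.get("name", "")))
    if target is None:
        return None  # surface the gap instead of guessing a name
    # Remaining key properties (for example topic) become attributes.
    attributes = {k: v for k, v in props.items() if k not in ("type", "name")}
    return target, attributes


mapped = to_otel("kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec,topic=payments")
unmapped = to_otel("kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec")
```

Returning `None` for unmapped names is deliberate: the fiddly part of the migration is exactly the list of metrics nobody has mapped yet, so it helps to make that list visible.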
The modernisation journey
The closing playbook is a five-step rollout that any team can copy, with the soft lessons that make or break it:
| Step | Move |
|---|---|
| 1 · Assess | Audit current tools and pain points — know what you're replacing and why. |
| 2 · Pilot | Run the OTel Collector on one cluster. Prove value before fleet-wide commitment. |
| 3 · Instrument | Auto-instrumentation first, plus the key metrics that matter. Customise after you have data. |
| 4 · Visualize | Grafana dashboards + alerts — quality over quantity. Build dashboards that matter. |
| 5 · Scale | Roll out to all clusters once the pattern is proven. |
FAQ
Why OpenTelemetry instead of a turnkey vendor agent for Kafka?
Vendor neutrality. OTEL decouples instrumentation from the backend, so you avoid lock-in, cost explosions, and the "metric names don't migrate" trap. You can change visualization or storage backends later without re-instrumenting Kafka — a big deal for a long-lived platform.
What's the single most important Kafka signal to alert on?
ISR health. A shrinking in-sync replica (ISR) set is the leading indicator of durability and availability trouble, and it's invisible to CPU/memory dashboards. Pair it with consumer lag and correlate with the logs (java.lang.OutOfMemoryError, "Error while fetching metadata") for the why.
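As a rough sketch of such an alert, the check below compares each partition's ISR with its full replica list and with the topic's min.insync.replicas setting. The `PartitionState` type, the severity labels, and the sample data are assumptions for illustration; how the replica state is fetched is left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PartitionState:
    topic: str
    partition: int
    replicas: List[int]  # broker IDs assigned to the partition
    isr: List[int]       # broker IDs currently in sync


def severity(state: PartitionState, min_insync_replicas: int) -> str:
    """Grade one partition: 'critical', 'warning', or 'ok'."""
    if len(state.isr) < min_insync_replicas:
        # Producers using acks=all can no longer write to this partition.
        return "critical"
    if len(state.isr) < len(state.replicas):
        # Under-replicated: still writable, but redundancy is reduced.
        return "warning"
    return "ok"


# Illustrative cluster state with replication factor 3.
states = [
    PartitionState("payments", 0, replicas=[1, 2, 3], isr=[1, 2, 3]),
    PartitionState("payments", 1, replicas=[1, 2, 3], isr=[1, 2]),
    PartitionState("payments", 2, replicas=[1, 2, 3], isr=[3]),
]
grades = [severity(s, min_insync_replicas=2) for s in states]
```

The three partitions grade as ok, warning, and critical respectively, even though every broker hosting them could still report healthy CPU and memory.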
How do I avoid the metric-cardinality explosion?
Don't blindly export per-partition labels across hundreds of topics. Decide which dimensions you truly need, aggregate the rest at the collector, and treat cardinality as a managed budget. Also avoid stacking collection agents that compete with the broker's JVM for memory where you can.
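The arithmetic behind that budget is easy to sketch. The 353 topics come from the talk; the 12 partitions per topic, the label names, and the `aggregate` helper are assumptions made for illustration. The example counts series for a single metric before and after dropping the partition label.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Labels = Dict[str, str]
Sample = Tuple[Labels, float]
SeriesKey = Tuple[Tuple[str, str], ...]

TOPICS = 353               # figure quoted in the talk
PARTITIONS_PER_TOPIC = 12  # assumed for illustration

# One sample per (topic, partition) pair for a single additive metric.
raw: List[Sample] = [
    ({"topic": f"topic-{t}", "partition": str(p)}, 1.0)
    for t in range(TOPICS)
    for p in range(PARTITIONS_PER_TOPIC)
]


def aggregate(samples: List[Sample], drop: Set[str]) -> Dict[SeriesKey, float]:
    """Sum samples after removing the labels named in drop."""
    totals: Dict[SeriesKey, float] = defaultdict(float)
    for labels, value in samples:
        key = tuple(sorted((k, v) for k, v in labels.items() if k not in drop))
        totals[key] += value
    return dict(totals)


per_topic = aggregate(raw, drop={"partition"})
series_before = len(raw)
series_after = len(per_topic)
```

Under these assumptions one metric shrinks from 4,236 series to 353, and the same factor applies to every per-partition metric exported. Summing only makes sense for additive values such as bytes or lag; gauges like a maximum offset would need a different aggregation.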
Where should I start if my Kafka monitoring is "green but blind"?
Pilot an OTel Collector on one cluster, auto-instrument, and build a small set of dashboards around the signals that predict incidents (ISR, lag, leader elections) rather than just liveness. Prove value, then scale — exactly the five-step journey above.
Takeaways
- "Up" is not "healthy." Baseline monitoring sees liveness; observability sees behaviour — event flow, lag, ISR.
- Metrics/logs/traces only pay off when correlated: that → why → where, pivotable in one place.
- OTEL is the vendor-neutral waist: instrument once, swap backends freely, escape lock-in and cost explosions.
- Watch ISR, not just CPU. Shrinking ISR is the Kafka incident you'll otherwise miss.
- Cardinality is a budget. Per-partition × hundreds of topics is a cost bomb — aggregate deliberately.
- Roll out small, SRE-and-compliance-first. Pilot on one cluster, auto-instrument, prove value, then scale.
Next in the series — Deep dive 04: KubeVela, which moves from observing the platform to delivering applications onto it with the Open Application Model.
References
- KubeCon Mumbai 2026 — Day 1 index · the rest of the series
- OpenTelemetry · the vendor-neutral telemetry standard
- Apache Kafka — monitoring · JMX metrics, ISR, and what to watch
- Deep dive 14 — The Lean Observability Stack · the cardinality-cost theme, expanded