KubeCon India 2026 (Mumbai) — Day 1 Deep Dives

03 · When Kafka Goes Cloud Native — Observability That Actually Works

Deep dive 3 of 17 · Platform engineering & app delivery

Jun 18, 2026 · conferences · 20 min read · 4500 words · intermediate

Deep dive 3 of the KubeCon Mumbai 2026 series. Mary Vinothini S and Roopadharsini K of Fidelity Investments tackled a problem every team running Kafka on Kubernetes eventually hits: your dashboards say the cluster is "up," but you still can't answer where a problem started, how it propagated, or which component caused it. Their fix is a vendor-agnostic OpenTelemetry pipeline running natively on Kubernetes that unifies metrics, logs, and traces for Kafka — and a pragmatic, SRE-and-compliance-first modernisation journey to get there without drowning in metric cardinality.

Kafka is the central nervous system of a modern data platform, and at a firm like Fidelity it carries real-time, money-moving events. Moving it onto Kubernetes buys scalability and portability — but it also multiplies the number of moving parts you have to see. This talk is the honest account of making that visibility actually work, told by practitioners who've clearly been burned by "monitoring" that monitors the wrong things.

Why real-time, and why Kubernetes

The framing started with the value of event streaming: capture what just happened, stream it instantly, act on it now. Kafka is the "central nervous system for data in motion" — highly available with reliable replication, high-throughput and low-latency, and scalable to absorb massive data growth.

So why run it on Kubernetes at all? The talk's six-word answer: scalability, orchestration, standardization, portability, governance, efficiency. Kubernetes turns Kafka from a hand-tended pet cluster into a declaratively-managed, portable workload that fits the same platform every other service uses. But — and this is the whole point of the talk — that orchestration layer adds abstraction, and abstraction without observability is a fog.

The guessing game — monitoring vs observability

The distinction that drives everything. Monitoring tells you whether predefined things are within thresholds. Observability is the ability to understand the internal state of a system from the outside — including states you didn't anticipate. Running clusters without it, the speakers said, is like telling a doctor "something is wrong" with no diagnostics. You cannot confidently answer: where did the problem start? how did it propagate? which component caused it?

Their sharp phrase for the trap was "seeing the system, missing the signal." Baseline monitors typically cover a minimal set of metrics (CPU, memory, throughput) that tell you a broker is up but nothing about how it is behaving. The internals that actually predict a Kafka incident are invisible to that kind of monitoring: event flow, consumer lag, ISR health, and the effect of tuning the producer's linger.ms setting or adding broker replicas. You can be green across the board and still be minutes from a cascading failure.
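
Consumer lag is a good example of a signal that liveness checks never see. Here is a minimal sketch of how it is derived, assuming you already have the log-end offset and the group's committed offset for each partition. The PartitionOffsets shape, the topic names, and the numbers are all made up for illustration; they are not from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartitionOffsets:
    """Offsets for one partition as seen by one consumer group (hypothetical shape)."""
    topic: str
    partition: int
    log_end_offset: int    # next offset the partition leader will assign
    committed_offset: int  # next offset the consumer group will read


def partition_lag(p: PartitionOffsets) -> int:
    """Records the group still has to read on this partition; never negative."""
    return max(p.log_end_offset - p.committed_offset, 0)


def lag_by_topic(offsets: list[PartitionOffsets]) -> dict[str, int]:
    """Sum per-partition lag into one number per topic."""
    totals: dict[str, int] = {}
    for p in offsets:
        totals[p.topic] = totals.get(p.topic, 0) + partition_lag(p)
    return totals


# Illustrative data only.
sample = [
    PartitionOffsets("payments", 0, log_end_offset=1_200, committed_offset=1_150),
    PartitionOffsets("payments", 1, log_end_offset=900, committed_offset=900),
    PartitionOffsets("audit", 0, log_end_offset=50, committed_offset=10),
]
print(lag_by_topic(sample))  # {'payments': 50, 'audit': 40}
```

A broker can report healthy CPU and memory while these numbers climb, which is the gap the speakers were pointing at.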

The three foundations of observability

The talk grounded the solution in the classic three signals, with a one-line mnemonic worth memorising:

Signal  | What it gives you            | It answers…
Metrics | health and trends over time  | that something is wrong
Logs    | events and details           | why it's wrong
Traces  | the end-to-end request path  | where it's wrong

The insight isn't that you need all three (everyone says that); it's that they're only powerful when correlated. A metric spike that you can pivot straight to the logs behind it and the trace it sits on is what turns a 2am guessing game into a five-minute diagnosis.
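
To make the pivot concrete, here is a rough sketch of correlating a metric spike with nearby logs and the traces they reference. It assumes log records carry a broker name, a timestamp, and optionally a trace id; the LogRecord type, the 60-second window, and the sample data are hypothetical, not something the speakers showed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LogRecord:
    """One parsed log line (hypothetical shape)."""
    timestamp: float               # seconds since epoch
    broker: str
    message: str
    trace_id: Optional[str] = None


def logs_near_spike(logs, broker, spike_at, window_s=60.0):
    """Logs from the same broker within window_s seconds of the metric spike."""
    return [
        r for r in logs
        if r.broker == broker and abs(r.timestamp - spike_at) <= window_s
    ]


def trace_ids(records):
    """Distinct trace ids referenced by the given log records."""
    return sorted({r.trace_id for r in records if r.trace_id})


# Illustrative data only.
logs = [
    LogRecord(1000.0, "broker-1", "Shrinking ISR", trace_id="trace-a"),
    LogRecord(1030.0, "broker-1", "java.lang.OutOfMemoryError"),
    LogRecord(1010.0, "broker-2", "Expanding ISR", trace_id="trace-b"),
    LogRecord(2000.0, "broker-1", "Expanding ISR", trace_id="trace-c"),
]
nearby = logs_near_spike(logs, broker="broker-1", spike_at=1020.0)
print([r.message for r in nearby])  # ['Shrinking ISR', 'java.lang.OutOfMemoryError']
print(trace_ids(nearby))            # ['trace-a']
```

In a real stack the backend does this join for you. The point is only that the join needs shared keys (time, broker, trace id) on all three signals.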

The OpenTelemetry pipeline — vendor-agnostic, Kubernetes-native

The architecture is a vendor-agnostic pipeline deployed natively on Kubernetes: Kafka and infrastructure emit telemetry into an OpenTelemetry pipeline, which serves the stakeholders (SREs, platform, app teams). The deliberate design choice is OpenTelemetry as the collection and shaping layer, decoupled from any particular backend.

[Figure: vendor-agnostic OTEL pipeline on Kubernetes. Kafka + Infra (JMX, logs) → OTel Collector (metrics · logs · traces) → Grafana (dashboards), OpenSearch (logs), SLO/SLI alerting]

Fig 1 — OTEL collects and shapes; backends are swappable. No vendor owns your telemetry.

The "legacy monitoring vs OTEL" contrast was the business case:

Legacy monitoring | OpenTelemetry
Vendor lock-in    | Vendor-neutral
Cost explosion    | Open source
Data silos        | Universal, correlated data

The strategic win of OTEL. Because instrumentation is decoupled from the backend, "your metric names don't migrate with you" stops being true: you can change the visualization or storage backend without re-instrumenting Kafka. This is the same lesson Lumenore reached in deep dive 01 and a thread through the whole conference: OTEL is the vendor-neutral waist of the observability hourglass.
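
The decoupling is easier to see in miniature. The sketch below is a toy model in plain Python, not the OpenTelemetry SDK: the Exporter, RecordingExporter, and Pipeline names and the metric name are invented for illustration. The instrumentation function only knows the pipeline, so swapping the backend does not touch it.

```python
from typing import Protocol


class Exporter(Protocol):
    """Anything that can receive a measurement (toy interface)."""
    def export(self, name: str, value: float, attributes: dict) -> None:
        ...


class RecordingExporter:
    """Stand-in for a real backend; it just remembers what it was sent."""
    def __init__(self, label: str) -> None:
        self.label = label
        self.received: list = []

    def export(self, name: str, value: float, attributes: dict) -> None:
        self.received.append((name, value, dict(attributes)))


class Pipeline:
    """Fans each measurement out to whichever exporters are configured."""
    def __init__(self, exporters: list) -> None:
        self._exporters = list(exporters)

    def record(self, name: str, value: float, **attributes: str) -> None:
        for exporter in self._exporters:
            exporter.export(name, value, attributes)


def report_broker_health(pipeline: Pipeline) -> None:
    """Instrumentation code: knows metric names, knows nothing about backends."""
    pipeline.record("demo.kafka.under_replicated_partitions", 2, broker="broker-1")


# Same instrumentation, two different backends.
dashboards = RecordingExporter("dashboards")
report_broker_health(Pipeline([dashboards]))

log_store = RecordingExporter("log-store")
report_broker_health(Pipeline([log_store]))

print(dashboards.received == log_store.received)  # True
```

In the real pipeline the Collector plays the Pipeline role, and the swap happens in its configuration rather than in code.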

Reading Kafka's signals — the patterns that matter

This was the most practically useful slide for anyone operating Kafka. The talk paired log patterns with the metrics they explain, with OpenSearch surfacing the log signal behind every metric:

Healthy patterns                              | Bad patterns (act now)
"Expanding ISR"                               | "Shrinking ISR"
"Preferred replica leader election completed" | java.lang.OutOfMemoryError
"SSL handshake completed successfully"        | "Error while fetching metadata"

Why ISR is the signal to watch. The In-Sync Replica set is the list of replicas fully caught up with the partition leader. Expanding ISR means replicas are healthy and rejoining; shrinking ISR means replicas are falling behind, which is the leading indicator of impending data-durability and availability problems. A dashboard that only shows CPU will miss a shrinking ISR entirely; that's exactly the "seeing the system, missing the signal" failure.
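
The pattern table translates almost directly into a log classifier. This is a minimal sketch that uses only the six patterns from the slide; the sample log lines, including their timestamps and layout, are made up, and a real deployment would more likely express this as a saved query or alert rule in the log backend.

```python
# Patterns taken from the table above.
HEALTHY_PATTERNS = (
    "Expanding ISR",
    "Preferred replica leader election completed",
    "SSL handshake completed successfully",
)
BAD_PATTERNS = (
    "Shrinking ISR",
    "java.lang.OutOfMemoryError",
    "Error while fetching metadata",
)


def classify(line: str) -> str:
    """Return 'bad', 'healthy', or 'unknown' for one log line (case-insensitive)."""
    lowered = line.lower()
    # Check bad patterns first so a mixed line is never reported as healthy.
    if any(pattern.lower() in lowered for pattern in BAD_PATTERNS):
        return "bad"
    if any(pattern.lower() in lowered for pattern in HEALTHY_PATTERNS):
        return "healthy"
    return "unknown"


# Hypothetical log lines for illustration.
lines = [
    "[2026-06-18 02:14:07] INFO Shrinking ISR from 1,2,3 to 1,2",
    "[2026-06-18 02:15:40] INFO Expanding ISR from 1,2 to 1,2,3",
    "[2026-06-18 02:16:02] INFO Rolled new log segment",
]
print([classify(line) for line in lines])  # ['bad', 'healthy', 'unknown']
```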

The talk also covered infra and client observability (broker-side health and producer/consumer-side behaviour like lag) and codifying SLOs/SLIs in OTEL — turning "is it healthy?" into measurable, alertable objectives rather than gut feel.
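
The talk did not spell out specific objectives, so the following is only a sketch of what "measurable, alertable" can mean in practice. It uses the common SLI and burn-rate definitions; the choice of SLI (produce requests acknowledged within a latency threshold), the 99.9% target, and the event counts are assumptions for illustration.

```python
def sli(good_events: int, total_events: int) -> float:
    """Fraction of events that met the objective; an empty window counts as perfect."""
    if total_events == 0:
        return 1.0
    return good_events / total_events


def burn_rate(sli_value: float, slo_target: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return (1.0 - sli_value) / budget


# Assumed numbers: 10,000 produce requests, 9,950 acknowledged in time, 99.9% SLO.
current = sli(good_events=9_950, total_events=10_000)
print(f"SLI: {current:.3f}")                           # SLI: 0.995
print(f"burn rate: {burn_rate(current, 0.999):.1f}")   # burn rate: 5.0
```

A burn rate well above 1 over a short window is the kind of condition worth paging on, where a raw threshold on CPU is not.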

Critical constraints — the Kafka-on-K8s observability traps

The honesty here is what made the talk valuable. Instrumenting Kafka at scale runs into very specific walls:

  • "353 topics × per-partition labels = metric overload." Per-partition metrics multiplied across hundreds of topics is a cardinality bomb that can cost more than the cluster it monitors. You must aggregate deliberately.
  • "JMX names ≠ OTEL semantic conventions." Kafka exposes metrics via JMX with its own naming; mapping those onto OpenTelemetry's semantic conventions is real, fiddly work — not automatic.
  • "Multiple collector agents — each competing with JVM heap." Stacking collection agents on the broker steals memory from the JVM that runs Kafka itself, so the observability can degrade the thing it observes.
  • "Your metric names don't migrate with you" — the lock-in problem OTEL is meant to solve, stated as the motivating pain.
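
For the JMX naming wall, the fiddly part is usually an explicit rename table. Here is a rough sketch, assuming simple unquoted JMX object names. The two MBean names are standard Kafka broker MBeans, but the target metric names in RENAMES are illustrative only and should be checked against the conventions your collector version actually uses.

```python
from typing import Optional

# (domain, type, name) -> target metric name. Targets are illustrative, not authoritative.
RENAMES = {
    ("kafka.server", "ReplicaManager", "UnderReplicatedPartitions"):
        "kafka.partition.under_replicated",
    ("kafka.server", "BrokerTopicMetrics", "MessagesInPerSec"):
        "kafka.message.count",
}


def parse_object_name(object_name: str) -> tuple[str, dict[str, str]]:
    """Split 'domain:key=value,key=value' into its parts.

    Simplification: quoted values that contain commas are not handled.
    """
    domain, _, props = object_name.partition(":")
    pairs = dict(item.split("=", 1) for item in props.split(",") if item)
    return domain, pairs


def translate(object_name: str) -> Optional[tuple[str, dict[str, str]]]:
    """Return (metric name, attributes), or None when no mapping is defined."""
    domain, props = parse_object_name(object_name)
    key = (domain, props.pop("type", ""), props.pop("name", ""))
    target = RENAMES.get(key)
    if target is None:
        return None
    # Remaining key properties (for example topic) become metric attributes.
    return target, props


print(translate("kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec,topic=payments"))
# ('kafka.message.count', {'topic': 'payments'})
```

Returning None for unmapped names keeps the gaps visible, so you find out which MBeans still need a decision instead of silently exporting raw JMX names.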

Cardinality is the silent killer. The single most important operational lesson here: per-partition labels across hundreds of topics will blow up your metrics bill and your storage. Decide up front which dimensions you actually need to slice by, aggregate the rest at the collector, and treat metric cardinality as a budget you manage, not a default you accept. This is the same theme the Lean Observability deep dive builds an entire talk around.
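
Treating cardinality as a budget starts with back-of-envelope arithmetic. In the sketch below only the 353 topics come from the talk; the 12 partitions per topic and 10 per-partition metrics are assumptions picked to show the shape of the calculation.

```python
def series_count(topics: int, partitions_per_topic: int, metrics_per_partition: int) -> int:
    """Upper bound on time series if every metric carries topic and partition labels."""
    return topics * partitions_per_topic * metrics_per_partition


TOPICS = 353                 # figure quoted in the talk
PARTITIONS_PER_TOPIC = 12    # assumption for illustration
METRICS_PER_PARTITION = 10   # assumption for illustration

per_partition = series_count(TOPICS, PARTITIONS_PER_TOPIC, METRICS_PER_PARTITION)
# Aggregating away the partition label is the same as one "partition" per topic.
per_topic = series_count(TOPICS, 1, METRICS_PER_PARTITION)

print(per_partition)               # 42360
print(per_topic)                   # 3530
print(per_partition // per_topic)  # 12
```

The reduction factor equals the partition count, so the savings grow with exactly the topics that are largest and busiest.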

The modernisation journey

The closing playbook is a five-step rollout that any team can copy, with the soft lessons that make or break it:

Step           | Move
1 · Assess     | Audit current tools and pain points; know what you're replacing and why.
2 · Pilot      | Run the OTel Collector on one cluster. Prove value before fleet-wide commitment.
3 · Instrument | Auto-instrumentation first, plus the key metrics that matter. Customise after you have data.
4 · Visualize  | Grafana dashboards + alerts: quality over quantity. Build dashboards that matter.
5 · Scale      | Roll out to all clusters once the pattern is proven.

The two soft lessons that decide success. First: involve SREs and Compliance as early as possible. In a regulated financial firm, observability data is itself governed, and retrofitting compliance is painful. Second: auto-instrument, then customise after you have data; don't hand-craft bespoke instrumentation before you know what questions you'll actually ask. Start small, prove value, expand.

FAQ

Why OpenTelemetry instead of a turnkey vendor agent for Kafka?

Vendor neutrality. OTEL decouples instrumentation from the backend, so you avoid lock-in, cost explosions, and the "metric names don't migrate" trap. You can change visualization or storage backends later without re-instrumenting Kafka — a big deal for a long-lived platform.

What's the single most important Kafka signal to alert on?

ISR health. A shrinking In-Sync Replica set is the leading indicator of durability/availability trouble, and it's invisible to CPU/memory dashboards. Pair it with consumer lag and correlate to logs (OutOfMemoryError, "error while fetching metadata") for the why.
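
As a sketch of what an ISR-based alert checks, assume you can read each partition's replica list and current ISR. The PartitionState type, the severity labels, and the sample data are hypothetical. The two conditions themselves are standard Kafka semantics: a partition is under-replicated when its ISR is smaller than its replica set, and producers using acks=all are rejected once the ISR drops below min.insync.replicas.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartitionState:
    """Replica assignment and current ISR for one partition (hypothetical shape)."""
    topic: str
    partition: int
    replicas: tuple  # broker ids assigned to the partition
    isr: tuple       # broker ids currently in sync


def is_under_replicated(p: PartitionState) -> bool:
    return len(p.isr) < len(p.replicas)


def is_below_min_isr(p: PartitionState, min_insync_replicas: int) -> bool:
    return len(p.isr) < min_insync_replicas


def alerts(states, min_insync_replicas=2):
    """Return (severity, partition) pairs; the severity labels are illustrative."""
    found = []
    for p in states:
        name = f"{p.topic}-{p.partition}"
        if is_below_min_isr(p, min_insync_replicas):
            found.append(("critical", name))
        elif is_under_replicated(p):
            found.append(("warning", name))
    return found


# Illustrative data only.
states = [
    PartitionState("payments", 0, replicas=(1, 2, 3), isr=(1, 2, 3)),
    PartitionState("payments", 1, replicas=(1, 2, 3), isr=(1, 2)),
    PartitionState("payments", 2, replicas=(1, 2, 3), isr=(3,)),
]
print(alerts(states))  # [('warning', 'payments-1'), ('critical', 'payments-2')]
```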

How do I avoid the metric-cardinality explosion?

Don't blindly export per-partition labels across hundreds of topics. Decide which dimensions you truly need, aggregate the rest at the collector, and treat cardinality as a managed budget. Also keep collector agents off the broker's JVM heap where you can.
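
"Aggregate the rest at the collector" boils down to dropping a label and combining the series that become identical. Here is a minimal sketch of that operation on plain tuples; the metric name and data points are invented. Summing suits additive values such as lag or counts, while other metrics may need max or average instead.

```python
from collections import defaultdict


def drop_labels(points, drop=("partition",)):
    """Sum the values of points that share a name and labels once `drop` is removed."""
    totals = defaultdict(float)
    for name, labels, value in points:
        kept = tuple(sorted((k, v) for k, v in labels.items() if k not in drop))
        totals[(name, kept)] += value
    return dict(totals)


# Illustrative data only: (metric name, labels, value).
points = [
    ("demo.consumer_lag", {"topic": "payments", "partition": "0"}, 50),
    ("demo.consumer_lag", {"topic": "payments", "partition": "1"}, 0),
    ("demo.consumer_lag", {"topic": "audit", "partition": "0"}, 40),
]
aggregated = drop_labels(points)
print(len(points), "->", len(aggregated))  # 3 -> 2
print(aggregated[("demo.consumer_lag", (("topic", "payments"),))])  # 50.0
```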

Where should I start if my Kafka monitoring is "green but blind"?

Pilot an OTel Collector on one cluster, auto-instrument, and build a small set of dashboards around the signals that predict incidents (ISR, lag, leader elections) rather than just liveness. Prove value, then scale — exactly the five-step journey above.

Takeaways

  • "Up" is not "healthy." Baseline monitoring sees liveness; observability sees behaviour — event flow, lag, ISR.
  • Metrics/logs/traces only pay off when correlated: that → why → where, pivotable in one place.
  • OTEL is the vendor-neutral waist: instrument once, swap backends freely, escape lock-in and cost explosions.
  • Watch ISR, not just CPU. Shrinking ISR is the Kafka incident you'll otherwise miss.
  • Cardinality is a budget. Per-partition × hundreds of topics is a cost bomb — aggregate deliberately.
  • Roll out small, SRE-and-compliance-first. Pilot on one cluster, auto-instrument, prove value, then scale.

Next in the series — Deep dive 04: KubeVela, which moves from observing the platform to delivering applications onto it with the Open Application Model.

© cvam — written in plaintext, served warm