KubeCon India 2026 (Mumbai) — Day 1 Deep Dives

14 · The Lean Observability Stack — Native Telemetry for Service Mesh

Deep dive 14 of 17 · Observability, edge & storage

Jun 18, 2026 · conferences · 20 min read · 4600 words · intermediate

The lean observability stack — quick, native telemetry for a service mesh.

conferences kubecon istio observability service-mesh

Deep dive 14 opens the observability/edge/storage block. Arpitha Malavalli (Google) showed how to build a complete observability stack (golden signals, alerts, SLOs) from what Istio's sidecars already emit, with no application changes and no custom instrumentation. The key insight: the "Istio Stats Filter" is a natively compiled C++ extension inside every pod's Envoy, so telemetry is produced in-process, fully distributed, with no extra network hop on the request path. You get rich, labelled metrics out of the box; the real work is taming them: managing cardinality and exposing the extra Envoy stats you actually need.

This is the "lean" counterweight to the heavier observability talks: where DD03 and DD11 added OpenTelemetry pipelines, this one says — if you're already running a mesh, most of your observability is already there. Just harvest it.

The premise — telemetry under constraints

The setting: isolated or air-gapped cloud environments with tight infrastructure and connectivity constraints. The requirement: make the most of native mesh telemetry to build a reliable observability stack quickly, without standing up a heavy custom pipeline. The topology is Istio's primary-remote model: a Primary cluster hosts the Istiod control plane, and the User/System clusters are remotes. Same-cluster workloads talk directly over mTLS, cross-cluster traffic routes through an east-west gateway, and Istiod watches every cluster's API server for unified service discovery.

Where the metrics come from — the Istio Stats Filter

The architectural key. The "Istio Stats Filter" isn't a separate service; it's a natively compiled C++ extension running inside every pod's Envoy sidecar. Metrics are computed in-process as each request passes through the proxy, so there is no central bottleneck and no extra network hop: telemetry is fully distributed and (in the speaker's phrase) "zero-compromise." There's no metrics aggregator to scale or fall over. Each sidecar produces its own metrics, and Prometheus just scrapes them.

The filter turns raw Envoy counters into clean, aggregated Istio metrics:

Istio aggregated metric | Raw Envoy input
istio_requests_total | downstream_rq_total, cluster upstream_rq_xx
istio_request_duration_milliseconds | downstream_rq_time
istio_request_bytes / istio_response_bytes | bytesReceived() / bytesSent()
istio_tcp_connections_opened_total | downstream/upstream_cx_total
istio_tcp_sent_bytes_total / istio_tcp_received_bytes_total | downstream_cx_tx/rx_bytes_total

Istio's filters add metadata labels via xDS config (source workload, destination service, response code, protocol), and Prometheus/OTel scrapes the sidecar's :15020 port to feed dashboards, alerts, and SLOs.
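A minimal sketch of the scrape side, assuming Prometheus runs in-cluster and uses Kubernetes pod discovery. The job name is hypothetical, and the relabeling presumes the sidecar container is named istio-proxy and serves merged metrics at /stats/prometheus on port 15020; adjust both if your mesh is configured differently.

```yaml
scrape_configs:
  - job_name: istio-sidecars          # hypothetical job name
    metrics_path: /stats/prometheus   # path served on the sidecar's 15020 port
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only targets that belong to the sidecar container.
      - source_labels: [__meta_kubernetes_pod_container_name]
        action: keep
        regex: istio-proxy
      # Point every kept target at the merged-telemetry port.
      - source_labels: [__meta_kubernetes_pod_ip]
        action: replace
        regex: (.+)
        replacement: $1:15020
        target_label: __address__
      # Carry namespace and pod name over as ordinary labels.
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
```

Pod discovery yields one target per declared container port, so several targets can collapse onto the same address after the rewrite; Prometheus deduplicates targets whose final label sets are identical.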

Every request tells a story — the labels

The power is in the labels, present on all aggregated metrics. Every request carries a full identity on both ends:

Group | Labels
Source | source_workload, source_namespace, source_principal (SPIFFE), source_app, source_version, source_canonical_service, source_cluster
Destination | destination_workload, destination_namespace, destination_principal, destination_app, destination_version, destination_service, destination_canonical_service, destination_cluster
Request/response | reporter (source or destination), request_protocol (http/grpc), response_code (200/404/503), response_flags (-, UH, DPE), connection_security_policy (mutual_tls)

Why this gives you golden signals for free. Those labels let you slice latency, traffic, errors, and saturation by any dimension in Prometheus (per service, per version, per cluster, per protocol) without changing a line of application code. The four golden signals fall straight out of istio_requests_total and istio_request_duration_milliseconds grouped by these labels. The mesh already knows who called whom, with what result, over mTLS or not.
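As an illustration of slicing by label, here is a sketch of Prometheus recording rules for traffic, error ratio, and P99 latency per destination service. The group and record names are hypothetical; the metric and label names are the standard Istio ones listed above.

```yaml
groups:
  - name: mesh-golden-signals   # hypothetical group name
    rules:
      # Traffic: requests per second, per service and version.
      - record: service:istio_requests:rate5m
        expr: |
          sum by (destination_service_name, destination_version) (
            rate(istio_requests_total{reporter="destination"}[5m])
          )
      # Errors: share of requests answered with a 5xx.
      - record: service:istio_request_errors:ratio_rate5m
        expr: |
          sum by (destination_service_name) (
            rate(istio_requests_total{reporter="destination", response_code=~"5.."}[5m])
          )
          /
          sum by (destination_service_name) (
            rate(istio_requests_total{reporter="destination"}[5m])
          )
      # Latency: P99 in milliseconds from the duration histogram.
      - record: service:istio_request_duration_ms:p99_5m
        expr: |
          histogram_quantile(0.99,
            sum by (destination_service_name, le) (
              rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m])
            )
          )
```

Filtering on reporter="destination" counts each request once, from the receiving sidecar's point of view; swapping in other labels in the `by (...)` clause changes the slice without touching the application.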

Out of the box — alerts, dashboards, SLOs

Because the metrics and labels are standardized, the alerting and SLI definitions become templates you point at any service. A 5xx error-rate alert:

# Fires when more than 1% of requests to my-service have returned 5xx for 5 minutes.
- alert: MyService_HighErrorRate_5xx
  expr: |
    sum(rate(istio_requests_total{reporter="destination",
        destination_service_name="my-service", response_code=~"5.*"}[5m]))
    / sum(rate(istio_requests_total{reporter="destination",
        destination_service_name="my-service"}[5m])) > 0.01
  for: 5m
  labels: { severity: critical }

SLOs for success rate and latency are defined declaratively against the same metrics (e.g. P99 latency under 250ms via the istio_request_duration_milliseconds_bucket histogram). The properties that make this scale: the approach is universally applicable (change the label value to target another service), needs no application changes, and is templateable with rich filtering (slice by source_workload, request_protocol, etc.). Grafana dashboards for latency, traffic, and errors come entirely from native Istio/Envoy metrics, with no custom code.
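The latency objective can be sketched the same way as the error-rate alert. This is an assumed rule, not one shown in the talk: the alert name, group name, 10-minute hold, and severity are hypothetical, and only the 250ms P99 target and the histogram name come from the text above.

```yaml
groups:
  - name: my-service-slo   # hypothetical group name
    rules:
      - alert: MyService_HighLatency_P99
        expr: |
          histogram_quantile(0.99,
            sum by (le) (
              rate(istio_request_duration_milliseconds_bucket{reporter="destination",
                  destination_service_name="my-service"}[5m])
            )
          ) > 250
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: "P99 latency for my-service is above 250ms"
```

Pointing the rule at another service means changing only the destination_service_name value, which is what makes it usable as a template.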

Want more labels, or a more granular view?

By default Istio doesn't expose every granular envoy_* metric to Prometheus — deliberately, to reduce CPU/memory overhead in large meshes. When you need deeper detail, you opt in:

  • Telemetry API customization: add or remove labels, or derive a label from a request header.
  • Per-workload via annotations: sidecar.istio.io/statsInclusionPrefixes and sidecar.istio.io/statsInclusionRegexps.
  • Mesh-wide via IstioOperator: meshConfig.defaultConfig.proxyStatsMatcher (inclusionPrefixes or inclusionRegexps).
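The three opt-in routes might look like the sketch below. Resource names, the x-tenant-id header, and the chosen stat prefixes are hypothetical. The mesh-wide part assumes the setting lives under meshConfig.defaultConfig.proxyStatsMatcher, as Istio's ProxyConfig reference documents it, and the Telemetry resource assumes a release that serves the telemetry.istio.io/v1 API (older releases use v1alpha1).

```yaml
# 1) Telemetry API: add a label derived from a request header, remove another.
apiVersion: telemetry.istio.io/v1
kind: Telemetry
metadata:
  name: request-label-tweaks     # hypothetical name
  namespace: istio-system        # root namespace, so it applies mesh-wide
spec:
  metrics:
    - providers:
        - name: prometheus
      overrides:
        - match:
            metric: REQUEST_COUNT
            mode: CLIENT_AND_SERVER
          tagOverrides:
            tenant:
              operation: UPSERT
              value: "request.headers['x-tenant-id']"   # hypothetical header
            source_principal:
              operation: REMOVE
---
# 2) Per-workload: annotations on the pod template (fragment of a Deployment).
spec:
  template:
    metadata:
      annotations:
        sidecar.istio.io/statsInclusionPrefixes: "cluster.outbound,listener"
        sidecar.istio.io/statsInclusionRegexps: ".*upstream_rq_retry.*"
---
# 3) Mesh-wide: proxy stats matcher in the IstioOperator mesh config.
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    defaultConfig:
      proxyStatsMatcher:
        inclusionPrefixes:
          - cluster.outbound
        inclusionRegexps:
          - ".*circuit_breakers.*"
```

A header-derived label is only as bounded as the header's values, so a label like tenant deserves the same cardinality scrutiny as any other.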

Taming cardinality — the real work

The lean stack's central discipline. Rich labels are a double-edged sword: high-cardinality labels (like pod_ip) multiply time series and blow up Prometheus memory and cost. This is exactly the trap the Kafka observability talk warned about — and the mesh makes it easy to fall into because the labels are so generous. The "lean" in lean observability is mostly about not keeping everything.

The control point is Prometheus metric_relabel_configs in the scrape config:

Strategy | What it does
action: keep | whitelist specific Istio/Envoy metrics (e.g. istio_.*) via regex
action: drop | remove noisy or redundant metrics to save memory
action: labeldrop | prune high-cardinality labels (like pod_ip) from otherwise useful metrics
value rewriting | aggregate unique identifiers (IDs/IPs) into generic values
filtering | exclude specific time series based on label values
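One way the five strategies could be combined in a single scrape job is sketched below. The job name, the specific metrics kept or dropped, and the scratch namespace are illustrative assumptions, not recommendations from the talk.

```yaml
scrape_configs:
  - job_name: istio-sidecars   # hypothetical job name
    metric_relabel_configs:
      # keep: allow only Istio metrics plus one family of Envoy stats.
      - source_labels: [__name__]
        action: keep
        regex: istio_.*|envoy_cluster_upstream_rq_retry.*
      # drop: discard histograms nobody queries (assumed unused here).
      - source_labels: [__name__]
        action: drop
        regex: istio_(request|response)_bytes_bucket
      # labeldrop: strip a high-cardinality label from every remaining series.
      - action: labeldrop
        regex: pod_ip
      # value rewriting: collapse exact status codes into classes (503 -> 5xx).
      - source_labels: [response_code]
        action: replace
        regex: (\d)\d\d
        replacement: ${1}xx
        target_label: response_code
      # filtering: exclude series for one namespace by label value.
      - source_labels: [destination_namespace]
        action: drop
        regex: scratch   # hypothetical namespace
```

Rules run in order, so the keep rule decides what the later rules ever see. Dropping a label is only safe when the remaining labels still identify each series uniquely; otherwise samples from different series collide.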

Beyond metrics — proxy logs via the OTel Collector

The Envoy sidecar also generates access-log entries as it processes requests. Instead of writing to stdout, Envoy can use an OTLP exporter to send structured logs over gRPC to a dedicated OpenTelemetry Collector, which buffers, processes, and exports them to a log backend such as Loki (Jaeger, which the talk also named, is a trace store and would take traces rather than logs). You configure it mesh-wide in IstioOperator meshConfig by registering an otel extension provider and enabling it for access logging, or per-workload via the Telemetry API.
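A sketch of the mesh-side wiring, assuming a Collector Service named opentelemetry-collector in an observability namespace listening for OTLP/gRPC on port 4317. Those names, and the Telemetry resource name, are hypothetical; the envoyOtelAls provider type is the one Istio's mesh config defines for OpenTelemetry access logging.

```yaml
# Register the Collector as an access-log provider in the mesh config.
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    extensionProviders:
      - name: otel
        envoyOtelAls:
          service: opentelemetry-collector.observability.svc.cluster.local   # hypothetical
          port: 4317
---
# Turn the provider on. In the root namespace this applies mesh-wide;
# in a workload namespace, optionally with a selector, it applies more narrowly.
apiVersion: telemetry.istio.io/v1
kind: Telemetry
metadata:
  name: mesh-access-logs   # hypothetical name
  namespace: istio-system
spec:
  accessLogging:
    - providers:
        - name: otel
```

The Collector still needs its own pipeline (an OTLP receiver and an exporter for the log backend), which is outside what the mesh configures.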

The complete lean stack. Metrics from the in-Envoy stats filter (scraped by Prometheus) plus access logs sent over OTLP to an OTel Collector (and on to Loki) cover the metrics and logs pillars almost entirely from the mesh, with the application untouched; the same Collector is where traces bound for Jaeger would land. The cost discipline (cardinality control) and the opt-in granularity are the only real engineering. That's the "lean" promise: maximum signal, minimum bespoke pipeline.

FAQ

Do I need to instrument my app to get these metrics?

No. The Istio Stats Filter runs inside the Envoy sidecar, so istio_requests_total, duration, bytes, and TCP metrics — fully labelled with source/destination identity — appear without any application code change. That's the whole point.

Won't all those labels blow up Prometheus?

They can — high-cardinality labels like pod_ip are the danger. Use metric_relabel_configs: keep the metrics you want, labeldrop the high-cardinality labels, and rewrite unique IDs into generic buckets. Managing cardinality is the main job in a lean stack.

How do I get an Envoy metric that isn't showing up?

Istio hides many granular envoy_* stats by default to save overhead. Opt in per-workload with the sidecar.istio.io/statsInclusionPrefixes and sidecar.istio.io/statsInclusionRegexps annotations, or mesh-wide via the proxyStatsMatcher inclusion lists under IstioOperator's meshConfig.defaultConfig.

Can I reuse the same SLO across services?

Yes — that's the templating win. Because every service emits the same metrics with the same label schema, you define a success-rate or latency SLO once and point it at any service by changing destination_service_name. No per-service instrumentation.

Takeaways

  • The mesh is already an observability source. Istio's in-Envoy C++ stats filter emits labelled metrics with zero app changes and no central bottleneck.
  • Golden signals for free — latency, traffic, errors, saturation, sliced by source/destination/version/protocol from istio_requests_total + duration.
  • Alerts and SLOs are templates — define once, point at any service via label values.
  • Cardinality is the discipline. Use metric_relabel_configs (keep/drop/labeldrop/rewrite) to keep Prometheus lean.
  • Opt into granularity only where needed via Telemetry API / annotations / IstioOperator.
  • Add logs cheaply. Envoy access logs go over OTLP to an OTel Collector, which is also where traces would land to complete the three pillars.

Next in the series — Deep dive 15: KubeEdge Deep Dive, extending Kubernetes out to the edge.

© cvam — written in plaintext, served warm