Deep dive 14 opens the observability/edge/storage block. Arpitha Malavalli (Google) showed how to build a complete observability stack (golden signals, alerts, SLOs) for free from what Istio's sidecars already emit, with no application changes and no custom instrumentation. The key insight: the "Istio Stats Filter" is a natively compiled C++ extension inside every pod's Envoy, so telemetry is computed in-process at each proxy, fully distributed and with no extra network hop on the request path. You get rich, labelled metrics out of the box; the real work is taming them by managing cardinality and exposing the extra Envoy stats you actually need.
This is the "lean" counterweight to the heavier observability talks: where DD03 and DD11 added OpenTelemetry pipelines, this one says — if you're already running a mesh, most of your observability is already there. Just harvest it.
The premise — telemetry under constraints
The setting: isolated or air-gapped cloud environments with tight infrastructure and connectivity constraints. The requirement: get the most out of native mesh telemetry to build a reliable observability stack quickly, without standing up a heavy custom pipeline. The topology is Istio's primary-remote model: a Primary cluster hosts the Istiod control plane, and User/System clusters are remote. Same-cluster workloads talk directly over mTLS, cross-cluster traffic routes through an east-west gateway, and Istiod watches all clusters' API servers for unified service discovery.
Where the metrics come from — the Istio Stats Filter
The filter turns raw Envoy counters into clean, aggregated Istio metrics:
| Istio aggregated metric | Raw Envoy input |
|---|---|
| istio_requests_total | downstream_rq_total, cluster upstream_rq_xx |
| istio_request_duration_milliseconds | downstream_rq_time |
| istio_request_bytes / istio_response_bytes | bytesReceived() / bytesSent() |
| istio_tcp_connections_opened_total | downstream/upstream_cx_total |
| istio_tcp_sent_bytes_total / istio_tcp_received_bytes_total | downstream_cx_tx/rx_bytes_total |
Istio's filters add metadata labels via xDS config (source workload, destination service, response code, protocol), and Prometheus/OTel scrapes the sidecar's :15020 port to feed dashboards, alerts, and SLOs.
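As a rough illustration of the scrape step, here is a minimal Prometheus scrape job, assuming Istio's default metrics merging is on, so that injected pods carry the prometheus.io/scrape, prometheus.io/port (15020) and prometheus.io/path (/stats/prometheus) annotations. The job name and the extra namespace/pod labels are illustrative choices, not something the talk prescribed.

```yaml
# prometheus.yml (fragment): discover pods and scrape the ones that opt in
# through prometheus.io/* annotations, which Istio adds to injected pods.
scrape_configs:
  - job_name: kubernetes-pods          # hypothetical job name
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape: "true".
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Use the annotated metrics path (assumed /stats/prometheus on sidecar pods).
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        regex: (.+)
        target_label: __metrics_path__
      # Point the scrape address at the annotated port (assumed 15020).
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: '([^:]+)(?::\d+)?;(\d+)'
        replacement: '$1:$2'
        target_label: __address__
      # Carry namespace and pod name onto every scraped series.
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
```

The pod label added here is itself a per-pod dimension, so whether to keep it is part of the cardinality trade-off discussed later.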
Every request tells a story — the labels
The power is in the labels, present on all aggregated metrics. Every request carries a full identity on both ends:
| Group | Labels |
|---|---|
| Source | source_workload, source_namespace, source_principal (SPIFFE), source_app, source_version, source_canonical_service, source_cluster |
| Destination | destination_workload, destination_namespace, destination_principal, destination_app, destination_version, destination_service, destination_canonical_service, destination_cluster |
| Request/response | reporter (source or destination), request_protocol (http/grpc), response_code (200/404/503), response_flags (-, UH, DPE), connection_security_policy (mutual_tls) |
Both istio_requests_total and istio_request_duration_milliseconds can be grouped by any of these labels. The mesh already knows who called whom, with what result, and whether the call went over mTLS.
Out of the box: alerts, dashboards, SLOs
Because the metrics and labels are standardized, the alerting and SLI definitions become templates you point at any service. A 5xx error-rate alert:
    - alert: MyService_HighErrorRate_5xx
      expr: |
        sum(rate(istio_requests_total{reporter="destination",
            destination_service_name="my-service", response_code=~"5.*"}[5m]))
        / sum(rate(istio_requests_total{reporter="destination",
            destination_service_name="my-service"}[5m])) > 0.01
      for: 5m
      labels: { severity: critical }
SLOs for success rate and latency are defined declaratively against the same metrics (e.g. P99 latency under 250 ms via the istio_request_duration_milliseconds_bucket histogram). The properties that make this scale: universally applicable (change the label value to target another service), no application changes, and templateable with rich filtering (slice by source_workload, request_protocol, etc.). Grafana dashboards for latency, traffic, and errors come entirely from native Istio/Envoy metrics, with no custom code.
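To make the SLO idea concrete, here is a sketch using plain Prometheus recording and alerting rules. This is a swapped-in technique: the talk described declarative SLO definitions but the recap does not show their format. The group name, rule names, the my-service target and the 10-minute hold are all illustrative assumptions.

```yaml
# Prometheus rule file (sketch): success-rate and P99 latency SLIs for one service.
groups:
  - name: my-service-slo                     # hypothetical group name
    rules:
      # Availability SLI: share of requests that did not end in a 5xx.
      - record: service:istio_requests:success_ratio_5m          # hypothetical name
        expr: |
          sum(rate(istio_requests_total{reporter="destination",
              destination_service_name="my-service", response_code!~"5.."}[5m]))
          / sum(rate(istio_requests_total{reporter="destination",
              destination_service_name="my-service"}[5m]))
      # Latency SLI: P99 estimated from the request-duration histogram buckets.
      - record: service:istio_request_duration_milliseconds:p99_5m   # hypothetical name
        expr: |
          histogram_quantile(0.99,
            sum by (le) (rate(istio_request_duration_milliseconds_bucket{
              reporter="destination", destination_service_name="my-service"}[5m])))
      # Fire when the P99 stays above the 250 ms objective.
      - alert: MyService_P99LatencyAbove250ms
        expr: service:istio_request_duration_milliseconds:p99_5m > 250
        for: 10m                             # assumed hold time
        labels: { severity: warning }
```

Retargeting the rules at another service should only require changing the destination_service_name value, which is the templating property described above.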
Want more labels, or a more granular view?
By default Istio doesn't expose every granular envoy_* metric to Prometheus — deliberately, to reduce CPU/memory overhead in large meshes. When you need deeper detail, you opt in:
- Telemetry API customization: add/remove labels, or derive a label from a request header.
- Per-workload via annotations: sidecar.istio.io/statsInclusionPrefixes and sidecar.istio.io/statsInclusionRegexps.
- Mesh-wide via IstioOperator: meshConfig.defaultConfig.proxyStatsMatcher.inclusionPrefixes (or inclusionRegexps).
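The three opt-in routes might look roughly like the following. Treat this as a sketch under assumptions: the tenant label, the x-tenant-id header, the resource names and the chosen stat prefixes are hypothetical. The mesh-wide document uses meshConfig.defaultConfig.proxyStatsMatcher, which is the ProxyConfig field for stats inclusion lists as far as I know, and older Istio releases serve the Telemetry API as telemetry.istio.io/v1alpha1.

```yaml
# 1) Telemetry API: derive a label from a request header and remove another.
apiVersion: telemetry.istio.io/v1
kind: Telemetry
metadata:
  name: custom-labels                  # hypothetical name
  namespace: istio-system              # root namespace, so it applies mesh-wide
spec:
  metrics:
    - providers:
        - name: prometheus
      overrides:
        - match:
            metric: REQUEST_COUNT      # i.e. istio_requests_total
            mode: CLIENT_AND_SERVER
          tagOverrides:
            tenant:                    # hypothetical new label
              value: "request.headers['x-tenant-id']"   # hypothetical header
            destination_canonical_revision:
              operation: REMOVE
---
# 2) Per-workload: annotations on the pod template (fragment of a Deployment's
#    spec.template.metadata); the prefixes and regex are illustrative.
metadata:
  annotations:
    sidecar.istio.io/statsInclusionPrefixes: "cluster.outbound,listener"
    sidecar.istio.io/statsInclusionRegexps: ".*upstream_rq_retry.*"
---
# 3) Mesh-wide: the same inclusion lists in the default proxy config.
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    defaultConfig:
      proxyStatsMatcher:
        inclusionPrefixes:
          - cluster.outbound
        inclusionRegexps:
          - ".*upstream_rq_retry.*"
```

A header-derived label is only as bounded as the header's values, so a label like tenant is safe only when the set of tenants is small and known.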
Taming cardinality — the real work
High-cardinality labels (such as pod_ip) multiply time series and blow up Prometheus memory and cost. This is exactly the trap the Kafka observability talk warned about, and the mesh makes it easy to fall into because the labels are so generous. The "lean" in lean observability is mostly about not keeping everything.
The control point is Prometheus metric_relabel_configs in the scrape config:
| Strategy | What it does |
|---|---|
| action: keep | whitelist specific Istio/Envoy metrics (e.g. istio_.*) via regex |
| action: drop | remove noisy or redundant metrics to save memory |
| action: labeldrop | prune high-cardinality labels (like pod_ip) from otherwise useful metrics |
| value rewriting | aggregate unique identifiers (IDs/IPs) into generic values |
| filtering | exclude specific time series based on label values |
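The five strategies in the table could be combined in one scrape job along these lines. This is a sketch, not the speaker's configuration: the job name, the retry-stat regex, the request_path label and its /users/ pattern are hypothetical, and pod_ip is used only because the text names it.

```yaml
# prometheus.yml (fragment): rules run top to bottom on every scraped sample.
scrape_configs:
  - job_name: kubernetes-pods              # hypothetical job name
    kubernetes_sd_configs:
      - role: pod
    metric_relabel_configs:
      # keep: allowlist Istio metrics plus one family of Envoy stats.
      - source_labels: [__name__]
        action: keep
        regex: 'istio_.*|envoy_cluster_upstream_rq_retry.*'
      # drop: remove histograms you do not chart or alert on.
      - source_labels: [__name__]
        action: drop
        regex: 'istio_request_bytes_bucket|istio_response_bytes_bucket'
      # labeldrop: the regex matches label NAMES, not values.
      - action: labeldrop
        regex: pod_ip
      # value rewriting: collapse unique IDs in a (hypothetical) path label.
      - source_labels: [request_path]
        action: replace
        regex: '/users/[^/]+(/.*)?'
        replacement: '/users/:id${1}'
        target_label: request_path
      # filtering: drop series by label value, here the client-side reports.
      - source_labels: [reporter]
        action: drop
        regex: source
```

One caveat worth checking before rolling this out: dropping or rewriting a label can make two series from the same target identical, and Prometheus rejects duplicate samples within a scrape. Where that happens, aggregating with recording rules may be the safer way to reduce dimensions.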
Beyond metrics — proxy logs via the OTel Collector
The Envoy sidecar also generates access-log entries as it processes requests. Instead of writing to stdout, Envoy can use an OTLP exporter to send structured logs over gRPC to a dedicated OpenTelemetry Collector, which buffers, processes, and exports them to a log backend such as Loki. You configure it mesh-wide via IstioOperator meshConfig, by registering an OpenTelemetry extension provider and enabling it for access logging, or per-workload via the Telemetry API.
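A minimal sketch of that wiring, assuming an OpenTelemetry Collector is already reachable inside the mesh. The provider name, the collector's service address and the Telemetry resource name are hypothetical; 4317 is the conventional OTLP gRPC port.

```yaml
# 1) Register the collector as an access-log extension provider.
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    extensionProviders:
      - name: otel-logs                    # hypothetical provider name
        envoyOtelAls:
          service: otel-collector.observability.svc.cluster.local   # hypothetical address
          port: 4317
---
# 2) Enable the provider. In the root namespace this applies mesh-wide;
#    in a workload namespace (optionally with a selector) it is scoped down.
apiVersion: telemetry.istio.io/v1
kind: Telemetry
metadata:
  name: mesh-access-logs                   # hypothetical name
  namespace: istio-system
spec:
  accessLogging:
    - providers:
        - name: otel-logs
```

The collector side (an OTLP receiver plus an exporter for the chosen log backend) is configured separately and is not shown here.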
FAQ
Do I need to instrument my app to get these metrics?
No. The Istio Stats Filter runs inside the Envoy sidecar, so istio_requests_total, duration, bytes, and TCP metrics — fully labelled with source/destination identity — appear without any application code change. That's the whole point.
Won't all those labels blow up Prometheus?
They can — high-cardinality labels like pod_ip are the danger. Use metric_relabel_configs: keep the metrics you want, labeldrop the high-cardinality labels, and rewrite unique IDs into generic buckets. Managing cardinality is the main job in a lean stack.
How do I get an Envoy metric that isn't showing up?
Istio hides many granular envoy_* stats by default to save overhead. Opt in per-workload with the sidecar.istio.io/statsInclusionPrefixes and sidecar.istio.io/statsInclusionRegexps annotations, or mesh-wide via the IstioOperator meshConfig stats inclusion lists (proxyStatsMatcher).
Can I reuse the same SLO across services?
Yes — that's the templating win. Because every service emits the same metrics with the same label schema, you define a success-rate or latency SLO once and point it at any service by changing destination_service_name. No per-service instrumentation.
Takeaways
- The mesh is already an observability source. Istio's in-Envoy C++ stats filter emits labelled metrics with zero app changes and no central bottleneck.
- Golden signals for free: latency, traffic, errors, saturation, sliced by source/destination/version/protocol from istio_requests_total + duration.
- Alerts and SLOs are templates: define once, point at any service via label values.
- Cardinality is the discipline. Use metric_relabel_configs (keep/drop/labeldrop/rewrite) to keep Prometheus lean.
- Opt into granularity only where needed via Telemetry API / annotations / IstioOperator.
- Add logs cheaply — Envoy access logs over OTLP to an OTel Collector complete the three pillars.
Next in the series — Deep dive 15: KubeEdge Deep Dive, extending Kubernetes out to the edge.
References
- KubeCon Mumbai 2026 — Day 1 index · the rest of the series
- Istio — standard metrics & labels · the aggregated metrics and their dimensions
- Istio Telemetry API · label customization & access logging
- Prometheus metric_relabel_configs · cardinality control