Deep dive 11 of the KubeCon Mumbai 2026 series. Ishan Jain (Grafana Labs) tackled the question that haunts everyone shipping the systems from deep dive 10: when an agent does something wrong, expensive, or weird — what actually happened? LLMs are black boxes, stochastic, and costly, so you can't reason about them from the outside. The answer is observability built on the three pillars (metrics, traces, logs), with the trace as the hero: a multi-turn waterfall of every LLM call and tool invocation in an agentic run. And the connective tissue is OpenTelemetry's GenAI semantic conventions — "the USB-C of observability."
This is the operational other-half of the agentic story. DD10 taught you to build agents; this one teaches you to see inside them in production — which, as that talk noted, is where token usage and agent traces become your new golden signals.
Why observability — three reasons
The talk opened on the infamous Google AI Overview that told people to add "about ⅛ cup of non-toxic glue" to keep cheese on pizza. Funny — and a perfect illustration of three reasons agents need observability:
| Problem | Why it demands observability |
|---|---|
| LLMs are black boxes | you can't read the logic; you can only observe input → output behaviour. |
| Cost & performance | tokens cost real money and calls are slow; you need to see where. |
| Hallucinations & trust | wrong answers erode trust; you must be able to explain what went wrong. |
For the glue answer, the diagnostic question is: what went wrong? Wrong context? An LLM hallucination? Bad chunking? A bad system prompt? You can't tell from the output alone. The conclusion, scrawled across the slide: "we need to see the trace."
And LLMs cost a lot
The cost slide showed on-demand usage running to $1,680 of a $3,000 cap. The questions you can only answer with good telemetry: Which steps cost the most? How much does a given model cost? Did my last change increase cost? Which users cost the most? Cost is a first-class observability signal for agents, not an afterthought.
The pillars — and why the trace is the hero
The three pillars are the familiar metrics, traces, logs. But for agents, the trace does the heavy lifting.
What is a trace?
A trace is a single request's journey, identified by a trace ID, visualized as a waterfall. The example: invoke_workflow crew running 46.37s, broken into environment context → crew created → invoke_agent (Travel Research Specialist) → task execution → chat gpt-4o → execute_tool "Search the internet with Serper" (many times) → POST …. You can see the agent's whole life: which sub-agents ran, which tools it called, how long each took.
What is a span?
A span is a single operation within a trace, carrying attributes — key-value metadata about that operation. For an LLM call, attributes include deployment.environment, gen_ai.client.token.usage (e.g. 634), and the actual system/user message content. Spans are where the "what did it think and what did it cost" lives.
Fig 1 — the trace exposes the agent's reasoning loop: repeated tool calls jump out as a bottleneck.
Building with AI is different — and intentional
The talk drew the now-familiar spectrum: Software (deterministic code) → LLM apps (a model call) → Agents (LLM → tool → loop). The key word over the agent box was "intentional" — the loop is by design, the agent chooses to call tools and iterate. That's exactly why two observability primitives matter:
| Primitive | What it captures | Used for |
|---|---|---|
| Single turn (LLM obs.) | one LLM request with input/outputs — prompt metadata, model params, response tokens | debugging individual model performance |
| Multi turn (agent obs.) | a complete agentic run — all LLM requests + tool usage, sequential and parallel | identifying bottlenecks and runaway loops in complex workflows |
The role of OpenTelemetry
The glue is OpenTelemetry — the second-fastest-growing CNCF project, the de facto standard across traces/metrics/logs, open and vendor-neutral. The speaker's line: "OTel is the USB-C of observability — one open standard for every signal."
Crucially, OTel now has semantic conventions for generative AI systems: standardized attribute and span names for GenAI inputs/outputs (events), operations (metrics), model operations (model spans), and agent operations (agent spans) — plus technology-specific conventions for Anthropic, Azure AI Inference, AWS Bedrock, OpenAI, and MCP. (They're still evolving; you opt into the latest with OTEL_SEMCONV_STABILITY_OPT_IN=gen_ai_latest_experimental.)
gen_ai.client.token.usage means the same thing everywhere — so your cost and latency views work regardless of which model or framework an agent uses. This is the same vendor-neutral-waist argument from the Kafka observability talk, now applied to agents.Adding observability to agents — from basic to zero-code
The talk walked a ladder of instrumentation effort:
Basic — auto-instrumentation flags
export OTEL_PYTHON_LOGGING_AUTO_INSTRUMENTATION_ENABLED=true opentelemetry-instrument \ --traces_exporter console --metrics_exporter console --logs_exporter console \ --service_name dice-server \ python my_ai_agent.py
Slow — manual SDK instrumentation
Hand-rolling spans with the OTel SDK (set up a TracerProvider, wrap each LLM call in a span, set token-usage attributes). Maximum control, maximum boilerplate — labelled "SLOW" for a reason.
Better — GenAI-aware libraries
Purpose-built instrumentation that understands LLMs/agents out of the box: OpenLIT (import openlit; openlit.init() or openlit-instrument --service-name … --environment production python my_ai_agent.py), plus OpenInference and OpenLLMetry. Two lines and you get GenAI-convention spans with token usage, model params, and prompts.
2026 — zero-code via eBPF
The newest option: OpenTelemetry eBPF Instrumentation (OBI) — auto-instrumentation that uses eBPF to inspect application executables and the OS networking layer, capturing trace spans and RED (Rate/Errors/Duration) metrics for HTTP/gRPC with no code changes at all. For compiled languages where manual tracepoints are painful, OBI gets you spans for free.
Fig 2 — instrument with OBI + OpenLIT, ship traces & metrics to Tempo/Prometheus, visualize in Grafana.
Agent evaluations — testing reasoning, not code paths
The final idea reframed quality assurance for agents: "We are testing reasoning, not code paths." Traditional tests assert deterministic outputs; agents need evaluations of judgment:
- Single-turn eval: did the agent respond correctly according to the user prompt?
- Multi-turn eval: did the agent maintain context across a conversation and perform well end-to-end?
FAQ
Isn't normal APM enough for an LLM app?
For a single LLM call, almost — you need request/response, latency, and token usage. For an agent, no: an agent is a loop of many calls and tool invocations, so you need the multi-turn trace to see the whole reasoning chain and catch loops, runaway tool use, and per-step cost.
What should I instrument first?
Token usage and the multi-turn trace. Token usage answers the cost questions (which step/model/user is expensive); the trace answers "what did it do." Start with a GenAI-aware library (OpenLIT/OpenInference/OpenLLMetry) for two-line setup, or OBI for zero-code.
Why do the OTel GenAI semantic conventions matter?
They standardize attribute/span names (e.g. gen_ai.client.token.usage) so dashboards and alerts are portable across models and frameworks. Without them every tool emits different field names and you can't compare or reuse anything.
How do I test something stochastic?
With evaluations, not equality assertions. Score reasoning quality on single-turn (right answer to the prompt) and multi-turn (held context, good e2e behaviour), typically via rubrics or LLM-as-judge. Pair evals with traces and cost metrics for full accountability.
Takeaways
- You don't know what your agent will do until users use it — LLMs are black-box, stochastic, and costly, so observe behaviour.
- The trace is the hero. A multi-turn waterfall of LLM calls and tool invocations shows what the agent actually did.
- Single-turn ≠ enough. Agents are loops; only multi-turn traces reveal runaway loops and per-step bottlenecks.
- Cost is a first-class signal — instrument token usage to answer which step/model/user is expensive.
- OTel GenAI semantic conventions make telemetry portable; instrument with OpenLIT/OpenInference/OpenLLMetry, or zero-code via OBI (eBPF).
- Evaluate reasoning, not code paths — single- and multi-turn evals complete the accountability story alongside traces and metrics.
Next in the series — Deep dive 12: Zero GPU, moving from observing AI to paying for it.
References
- KubeCon Mumbai 2026 — Day 1 index · the rest of the series
- OTel — GenAI semantic conventions · agent spans, model spans, token metrics
- OpenLIT · OpenInference · OpenLLMetry · GenAI-aware instrumentation
- OpenTelemetry eBPF Instrumentation (OBI) · Grafana Tempo · zero-code tracing & the backend