KubeCon India 2026 (Mumbai) — Day 1 Deep Dives

11 · What Did My Agent Do? — Observability for AI Agents

Deep dive 11 of 17 · AI, agents, GPU & serving

Jun 18, 2026 · conferences · 21 min read · 4800 words · intermediate

Deep dive 11 of the KubeCon Mumbai 2026 series. Ishan Jain (Grafana Labs) tackled the question that haunts everyone shipping the systems from deep dive 10: when an agent does something wrong, expensive, or weird, what actually happened? LLMs are black boxes, stochastic, and costly, so you can't predict their behaviour by reading code; you have to observe it. The answer is observability built on the three pillars (metrics, traces, logs), with the trace as the hero: a multi-turn waterfall of every LLM call and tool invocation in an agentic run. The connective tissue is OpenTelemetry and its GenAI semantic conventions, with OTel billed as "the USB-C of observability."

This is the operational other half of the agentic story. Deep dive 10 taught you to build agents; this one teaches you to see inside them in production, which, as that talk noted, is where token usage and agent traces become your new golden signals.

Why observability — three reasons

The talk opened on the infamous Google AI Overview that told people to add "about ⅛ cup of non-toxic glue" to keep cheese on pizza. Funny — and a perfect illustration of three reasons agents need observability:

Problem | Why it demands observability
LLMs are black boxes | You can't read the logic; you can only observe input → output behaviour.
Cost & performance | Tokens cost real money and calls are slow; you need to see where.
Hallucinations & trust | Wrong answers erode trust; you must be able to explain what went wrong.

For the glue answer, the diagnostic question is: what went wrong? Wrong context? An LLM hallucination? Bad chunking? A bad system prompt? You can't tell from the output alone. The conclusion, scrawled across the slide: "we need to see the trace."

The line that frames everything: "You don't know what your agent will do until your users use it." In staging you test "how do I build an alert?"; in production a user asks "how do I build an alert based on XYZ datasource" — an input you never anticipated. LLMs are stochastic, so you can't enumerate behaviours in advance. Observability is how you learn what your agent actually does in the wild.

And LLMs cost a lot

The cost slide showed on-demand usage running to $1,680 of a $3,000 cap. The questions you can only answer with good telemetry: Which steps cost the most? How much does a given model cost? Did my last change increase cost? Which users cost the most? Cost is a first-class observability signal for agents, not an afterthought.
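
The cost questions above reduce to grouping per-call token counts by a dimension and multiplying by a price. A minimal sketch, assuming you have already exported one record per LLM call; the prices, model names, and record fields below are all hypothetical and were not part of the talk:

```python
from collections import defaultdict

# Hypothetical USD prices per 1M tokens; real prices vary by provider and date.
PRICE_PER_MTOK = {
    "model-a": {"input": 2.50, "output": 10.00},
    "model-b": {"input": 0.15, "output": 0.60},
}

# Hypothetical per-call records, shaped like data you might pull from LLM spans.
CALLS = [
    {"step": "plan", "model": "model-a", "user": "u1",
     "input_tokens": 1200, "output_tokens": 300},
    {"step": "search", "model": "model-b", "user": "u1",
     "input_tokens": 4000, "output_tokens": 500},
    {"step": "plan", "model": "model-a", "user": "u2",
     "input_tokens": 800, "output_tokens": 200},
]


def call_cost(call):
    """Cost in USD of one LLM call, from its token counts and model price."""
    price = PRICE_PER_MTOK[call["model"]]
    return (call["input_tokens"] * price["input"]
            + call["output_tokens"] * price["output"]) / 1_000_000


def cost_by(calls, dimension):
    """Total cost grouped by one record field: 'step', 'model', or 'user'."""
    totals = defaultdict(float)
    for call in calls:
        totals[call[dimension]] += call_cost(call)
    return dict(totals)


by_step = cost_by(CALLS, "step")
by_model = cost_by(CALLS, "model")
by_user = cost_by(CALLS, "user")
```

Comparing `cost_by` results before and after a deploy is one way to answer "did my last change increase cost?", provided each record also carries a version or deployment label.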

The pillars — and why the trace is the hero

The three pillars are the familiar metrics, traces, logs. But for agents, the trace does the heavy lifting.

What is a trace?

A trace is a single request's journey, identified by a trace ID, visualized as a waterfall. The example: invoke_workflow crew running 46.37s, broken into environment context → crew created → invoke_agent (Travel Research Specialist) → task execution → chat gpt-4o → execute_tool "Search the internet with Serper" (many times) → POST …. You can see the agent's whole life: which sub-agents ran, which tools it called, how long each took.

What is a span?

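A span is a single operation within a trace, carrying attributes: key-value metadata about that operation. For an LLM call, attributes include deployment.environment, gen_ai.client.token.usage (e.g. 634), and the actual system/user message content. (Strictly speaking, the OTel GenAI conventions define gen_ai.client.token.usage as the token-usage metric; on spans the counts are carried as gen_ai.usage.input_tokens and gen_ai.usage.output_tokens.) Spans are where the "what did it think and what did it cost" lives.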
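
To make the trace/span vocabulary concrete, here is a toy model in plain Python. It is deliberately not the OpenTelemetry API: `ToyTracer`, `Span`, and `waterfall` are hypothetical names invented for this sketch, and the token count is made up. It only shows the three ideas from the text: one trace ID shared by every span, key-value attributes per span, and parent/child nesting that renders as a waterfall.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass(eq=False)
class Span:
    name: str
    trace_id: str
    parent: "Span | None" = None
    attributes: dict = field(default_factory=dict)
    children: list = field(default_factory=list)
    duration_s: float = 0.0


class ToyTracer:
    """Toy stand-in for a tracer; NOT the OpenTelemetry API."""

    def __init__(self):
        self.trace_id = uuid.uuid4().hex  # one ID for the whole request
        self.root = None
        self._current = None

    @contextmanager
    def span(self, name, attributes=None):
        span = Span(name, self.trace_id, parent=self._current,
                    attributes=dict(attributes or {}))
        if self._current is None:
            self.root = span                      # first span is the root
        else:
            self._current.children.append(span)   # nest under the open span
        self._current = span
        start = time.perf_counter()
        try:
            yield span
        finally:
            span.duration_s = time.perf_counter() - start
            self._current = span.parent


def waterfall(span, depth=0):
    """Render the span tree as indented lines, one per span."""
    lines = ["  " * depth + span.name]
    for child in span.children:
        lines.extend(waterfall(child, depth + 1))
    return lines


tracer = ToyTracer()
with tracer.span("invoke_workflow crew"):
    with tracer.span("invoke_agent Travel Research Specialist"):
        with tracer.span("chat gpt-4o", {"gen_ai.request.model": "gpt-4o"}) as llm:
            llm.attributes["gen_ai.usage.input_tokens"] = 500  # made-up count
        with tracer.span("execute_tool search"):
            pass

tree = waterfall(tracer.root)
```

A real SDK adds span IDs, timestamps, status, and export; the point here is only that a trace is a tree of attributed operations under one ID.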

A multi-turn agent trace = nested spans

invoke_workflow crew (46.37s)
  invoke_agent: Travel Research Specialist
    chat gpt-4o (tokens 634)
    execute_tool: Serper search
    execute_tool: Serper search
    execute_tool: Serper search (loop!)

Fig 1 — the trace exposes the agent's reasoning loop: repeated tool calls jump out as a bottleneck.

Building with AI is different — and intentional

The talk drew the now-familiar spectrum: Software (deterministic code) → LLM apps (a model call) → Agents (LLM → tool → loop). The key word over the agent box was "intentional" — the loop is by design, the agent chooses to call tools and iterate. That's exactly why two observability primitives matter:

Primitive | What it captures | Used for
Single turn (LLM obs.) | One LLM request with input/outputs: prompt metadata, model params, response tokens | Debugging individual model performance
Multi turn (agent obs.) | A complete agentic run: all LLM requests + tool usage, sequential and parallel | Identifying bottlenecks and runaway loops in complex workflows

The crucial agent-specific insight. Single-turn LLM observability (one prompt, one response) isn't enough for agents, because an agent is a loop of many LLM calls and tool invocations. You need the multi-turn trace to spot the failure modes unique to agents: a tool called 12 times in a row, a reasoning loop that won't terminate, a sub-agent silently eating the latency budget. The loop is the feature and the bug surface.
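
Once the spans of a run are available as an ordered list of names, the "tool called 12 times in a row" case can be flagged mechanically. A minimal sketch, assuming span names follow the `execute_tool …` pattern seen in the example trace; the threshold and the tool names are hypothetical:

```python
from collections import Counter
from itertools import groupby


def consecutive_repeats(span_names, threshold=3):
    """Return (name, count) for every run of >= threshold identical spans in a row."""
    runs = []
    for name, group in groupby(span_names):
        count = sum(1 for _ in group)
        if count >= threshold:
            runs.append((name, count))
    return runs


def tool_call_counts(span_names, prefix="execute_tool"):
    """Total calls per tool across the run, whether or not they are consecutive."""
    return Counter(name for name in span_names if name.startswith(prefix))


# Hypothetical run: one search tool fired 12 times back to back.
names = (["chat gpt-4o"]
         + ["execute_tool search"] * 12
         + ["chat gpt-4o", "execute_tool fetch"])

suspicious = consecutive_repeats(names)
totals = tool_call_counts(names)
```

Many agents interleave an LLM call between tool calls, so back-to-back runs alone will miss some loops; the per-tool totals are the cruder but more robust signal.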

The role of OpenTelemetry

The glue is OpenTelemetry: the second most active CNCF project (after Kubernetes), the de facto standard across traces/metrics/logs, open and vendor-neutral. The speaker's line: "OTel is the USB-C of observability: one open standard for every signal."

Crucially, OTel now has semantic conventions for generative AI systems: standardized attribute and span names for GenAI inputs/outputs (events), operations (metrics), model operations (model spans), and agent operations (agent spans) — plus technology-specific conventions for Anthropic, Azure AI Inference, AWS Bedrock, OpenAI, and MCP. (They're still evolving; you opt into the latest with OTEL_SEMCONV_STABILITY_OPT_IN=gen_ai_latest_experimental.)

Why GenAI semantic conventions matter. Without them, every framework names its telemetry differently and you can't build portable dashboards or compare across models. With them, gen_ai.client.token.usage means the same thing everywhere, so your cost and latency views work regardless of which model or framework an agent uses. This is the same vendor-neutral "narrow waist" argument from the Kafka observability talk, now applied to agents.
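
The portability argument can be illustrated with a small sketch. The target names (gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.request.model) come from the OTel GenAI conventions, which are still experimental and may change; the "legacy" field names on the left are invented for illustration and do not belong to any real framework.

```python
# Hypothetical legacy field names from two imaginary frameworks, mapped onto
# attribute names defined by the OTel GenAI semantic conventions.
ALIASES = {
    "prompt_tokens": "gen_ai.usage.input_tokens",
    "llm.tokens.in": "gen_ai.usage.input_tokens",
    "completion_tokens": "gen_ai.usage.output_tokens",
    "llm.tokens.out": "gen_ai.usage.output_tokens",
    "model_name": "gen_ai.request.model",
    "llm.model": "gen_ai.request.model",
}


def normalize(attributes):
    """Rename known legacy keys to convention names; leave other keys alone."""
    return {ALIASES.get(key, key): value for key, value in attributes.items()}


def total_tokens_by_model(spans):
    """One query that works for every span once names are normalized."""
    totals = {}
    for attrs in map(normalize, spans):
        model = attrs.get("gen_ai.request.model", "unknown")
        tokens = (attrs.get("gen_ai.usage.input_tokens", 0)
                  + attrs.get("gen_ai.usage.output_tokens", 0))
        totals[model] = totals.get(model, 0) + tokens
    return totals


spans = [
    {"prompt_tokens": 100, "completion_tokens": 20, "model_name": "m1"},
    {"llm.tokens.in": 50, "llm.tokens.out": 5, "llm.model": "m1"},
    {"gen_ai.usage.input_tokens": 10, "gen_ai.usage.output_tokens": 1,
     "gen_ai.request.model": "m2"},
]
tokens_by_model = total_tokens_by_model(spans)
```

With instrumentation that already emits the convention names, the `ALIASES` table disappears and only the query remains, which is the whole point.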

Adding observability to agents — from basic to zero-code

The talk walked a ladder of instrumentation effort:

Basic — auto-instrumentation flags

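# Enable auto-instrumentation of Python's logging module so log records are exported too.
export OTEL_PYTHON_LOGGING_AUTO_INSTRUMENTATION_ENABLED=true
# Run the agent under the auto-instrumentation wrapper; console exporters print to stdout.
opentelemetry-instrument \
  --traces_exporter console --metrics_exporter console --logs_exporter console \
  --service_name dice-server \
  python my_ai_agent.py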

Slow — manual SDK instrumentation

Hand-rolling spans with the OTel SDK (set up a TracerProvider, wrap each LLM call in a span, set token-usage attributes). Maximum control, maximum boilerplate — labelled "SLOW" for a reason.

Better — GenAI-aware libraries

Purpose-built instrumentation that understands LLMs/agents out of the box: OpenLIT (import openlit; openlit.init() or openlit-instrument --service-name … --environment production python my_ai_agent.py), plus OpenInference and OpenLLMetry. Two lines and you get GenAI-convention spans with token usage, model params, and prompts.

2026 — zero-code via eBPF

The newest option: OpenTelemetry eBPF Instrumentation (OBI) — auto-instrumentation that uses eBPF to inspect application executables and the OS networking layer, capturing trace spans and RED (Rate/Errors/Duration) metrics for HTTP/gRPC with no code changes at all. For compiled languages where manual tracepoints are painful, OBI gets you spans for free.
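
To unpack the RED acronym, the sketch below computes the three numbers from a list of request records. This is only an illustration of what the metrics mean; OBI derives them from eBPF probes, not from code like this, and the record shape, window, and values are hypothetical.

```python
def red_metrics(requests, window_s):
    """Compute Rate, Errors, Duration for requests seen in a time window.

    requests: list of (duration_s, is_error) tuples.
    window_s: length of the observation window in seconds.
    """
    count = len(requests)
    if count == 0:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "mean_duration_s": 0.0}
    errors = sum(1 for _, is_error in requests if is_error)
    total_duration = sum(duration for duration, _ in requests)
    return {
        "rate_per_s": count / window_s,            # R: requests per second
        "error_ratio": errors / count,             # E: share of failed requests
        "mean_duration_s": total_duration / count, # D: average latency
    }


# Hypothetical two-second window with four HTTP calls, one of them failing.
observed = [(0.25, False), (0.5, False), (1.0, True), (0.25, False)]
red = red_metrics(observed, window_s=2.0)
```

Production systems usually report duration as a histogram with percentiles rather than a mean; the mean keeps the sketch short.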

The full stack

Instrumentation: OBI (eBPF) + OpenLIT → Backend: Tempo + Prometheus → Visualize: Grafana

Fig 2 — instrument with OBI + OpenLIT, ship traces & metrics to Tempo/Prometheus, visualize in Grafana.

Agent evaluations — testing reasoning, not code paths

The final idea reframed quality assurance for agents: "We are testing reasoning, not code paths." Traditional tests assert deterministic outputs; agents need evaluations of judgment:

  • Single-turn eval: did the agent respond correctly according to the user prompt?
  • Multi-turn eval: did the agent maintain context across a conversation and perform well end-to-end?

Why this is a mindset shift. You can't unit-test "is this answer good?" with an equality assertion. Agent evals score reasoning quality (often with an LLM-as-judge or a rubric) across both a single turn and a whole conversation. Combined with traces (what it did) and metrics (what it cost), evals (how well it reasoned) complete the accountability picture. This pairs directly with the loop-quality scoring from the agentic systems talk.
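
The shape of such an evaluation harness can be sketched in a few lines. Everything here is hypothetical: `Turn`, `keyword_judge`, and the rubrics are invented for illustration, and the keyword judge is only a stand-in for the place where a real setup would call an LLM-as-judge.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    user: str
    agent: str


def keyword_judge(question, answer, rubric):
    """Stub judge: fraction of rubric items that appear in the answer.

    A real harness would call an LLM-as-judge here; this stand-in only
    keeps the sketch runnable and deterministic.
    """
    if not rubric:
        return 0.0
    hits = sum(1 for item in rubric if item.lower() in answer.lower())
    return hits / len(rubric)


def single_turn_eval(turn, rubric, judge=keyword_judge):
    """Score one prompt/response pair against its rubric (0.0 to 1.0)."""
    return judge(turn.user, turn.agent, rubric)


def multi_turn_eval(turns, rubrics, judge=keyword_judge):
    """Average per-turn score across a conversation, one rubric per turn."""
    scores = [judge(t.user, t.agent, r) for t, r in zip(turns, rubrics)]
    return sum(scores) / len(scores) if scores else 0.0


conversation = [
    Turn("I use Prometheus. How do I build an alert?",
         "Create an alert rule with a PromQL query."),
    Turn("What threshold should it use?",
         "Set a threshold on the query result."),
]
# The second rubric still expects PromQL: a crude check that context was kept.
rubrics = [["alert rule", "PromQL"], ["threshold", "PromQL"]]

first_turn_score = single_turn_eval(conversation[0], rubrics[0])
conversation_score = multi_turn_eval(conversation, rubrics)
```

Note that the scores are graded, not pass/fail: the second answer is reasonable on its own but loses half its marks for dropping the datasource context, which is exactly the kind of judgment an equality assertion cannot express.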

FAQ

Isn't normal APM enough for an LLM app?

For a single LLM call, almost — you need request/response, latency, and token usage. For an agent, no: an agent is a loop of many calls and tool invocations, so you need the multi-turn trace to see the whole reasoning chain and catch loops, runaway tool use, and per-step cost.

What should I instrument first?

Token usage and the multi-turn trace. Token usage answers the cost questions (which step/model/user is expensive); the trace answers "what did it do." Start with a GenAI-aware library (OpenLIT/OpenInference/OpenLLMetry) for two-line setup, or OBI for zero-code.

Why do the OTel GenAI semantic conventions matter?

They standardize attribute/span names (e.g. gen_ai.client.token.usage) so dashboards and alerts are portable across models and frameworks. Without them every tool emits different field names and you can't compare or reuse anything.

How do I test something stochastic?

With evaluations, not equality assertions. Score reasoning quality on single-turn (right answer to the prompt) and multi-turn (held context, good end-to-end behaviour), typically via rubrics or LLM-as-judge. Pair evals with traces and cost metrics for full accountability.

Takeaways

  • You don't know what your agent will do until users use it — LLMs are black-box, stochastic, and costly, so observe behaviour.
  • The trace is the hero. A multi-turn waterfall of LLM calls and tool invocations shows what the agent actually did.
  • Single-turn ≠ enough. Agents are loops; only multi-turn traces reveal runaway loops and per-step bottlenecks.
  • Cost is a first-class signal — instrument token usage to answer which step/model/user is expensive.
  • OTel GenAI semantic conventions make telemetry portable; instrument with OpenLIT/OpenInference/OpenLLMetry, or zero-code via OBI (eBPF).
  • Evaluate reasoning, not code paths — single- and multi-turn evals complete the accountability story alongside traces and metrics.

Next in the series — Deep dive 12: Zero GPU, moving from observing AI to paying for it.

© cvam — written in plaintext, served warm