Teams instrument their HTTP handlers and database queries without thinking twice, then ship an LLM feature with zero visibility into the most expensive and least predictable dependency they've ever added. A model call can take 200ms or 30 seconds. It can cost a fraction of a cent or several dollars. It can succeed, fail, or — worst of all — succeed with a bad answer. You can't manage any of that without data.
This guide covers what to capture, why each signal matters, and how to do it with open standards so your instrumentation isn't welded to one vendor.
Why LLM calls are different
Three properties make model calls unlike any other downstream dependency:
- Variable cost per request. A database query costs roughly the same every time. An LLM call's cost scales with input and output tokens — one verbose prompt template change can multiply your bill.
- Latency depends on output length. Generation is token-by-token, so p99 latency is dominated by how much the model decides to write. Time-to-first-token and total duration are different problems with different fixes.
- Failure is fuzzy. Besides hard errors (rate limits, timeouts), there are soft failures: truncated outputs, refusals, malformed JSON. Status code 200 doesn't mean it worked.
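To make the "fuzzy failure" point concrete, here is a minimal sketch of a check that runs after a call has already returned successfully. The function name and labels are hypothetical, and the finish-reason strings `"length"` and `"content_filter"` are assumed to follow an OpenAI-style API; other providers use different values. Refusals are not covered because detecting them usually needs provider-specific signals.

```python
import json
from typing import Optional


def classify_outcome(finish_reason: Optional[str], text: str,
                     expect_json: bool = False) -> str:
    """Return "ok" or a soft-failure label for a call that did not raise."""
    # Assumed OpenAI-style finish reasons; adjust for your provider.
    if finish_reason == "length":
        return "truncated"
    if finish_reason == "content_filter":
        return "filtered"
    if not text.strip():
        return "empty"
    if expect_json:
        try:
            json.loads(text)
        except json.JSONDecodeError:
            return "malformed_json"
    return "ok"
```

Recording the returned label as a span attribute lets you count soft failures alongside hard errors instead of discovering them from user reports.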
What to capture on every call
Treat each model call as a span with structured attributes. The minimum set that pays for itself immediately:
- Model identity — provider, model name, and version. Cost and behavior change across versions; you need to segment by them.
- Token counts — input tokens and output tokens, separately. They drive cost and latency, and they're your early-warning signal for prompt bloat.
- Latency — total duration plus time-to-first-token if you stream. TTFT is what users feel; total duration is what your timeout settings care about.
- Cost: computed from tokens and your price sheet, recorded at call time. Reconstructing cost later from logs is painful because prices and model versions change; attaching it to the span while you still have the token counts in hand costs almost nothing.
- Outcome — error type for hard failures, finish reason for soft ones (length-capped? content-filtered? tool call?).
- Context — which feature, tenant, or workflow triggered the call, so you can attribute spend to the thing that caused it.
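As a rough sketch of the list above, one record per call might look like the following. The field names, the `example-model` entry, and its prices are all made up for illustration; substitute your provider's real price sheet.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical price sheet in USD per 1M tokens. These numbers are
# placeholders, not any provider's actual pricing.
PRICES_PER_MTOK = {
    "example-model": {"input": 2.50, "output": 10.00},
}


def compute_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one call, using the price sheet above."""
    price = PRICES_PER_MTOK[model]
    return (input_tokens * price["input"]
            + output_tokens * price["output"]) / 1_000_000


@dataclass
class LLMCallRecord:
    provider: str                 # model identity
    request_model: str
    response_model: str
    input_tokens: int             # token counts, kept separate
    output_tokens: int
    duration_s: float             # total latency
    ttft_s: Optional[float]       # time-to-first-token, if streaming
    finish_reason: Optional[str]  # soft outcome
    error_type: Optional[str]     # hard failure, if any
    feature: str                  # context for attribution
    tenant: str
    cost_usd: float = 0.0         # filled in at call time


record = LLMCallRecord(
    provider="example-provider", request_model="example-model",
    response_model="example-model", input_tokens=1742, output_tokens=318,
    duration_s=2.4, ttft_s=0.35, finish_reason="stop", error_type=None,
    feature="support-search", tenant="acme",
)
record.cost_usd = compute_cost(record.request_model,
                               record.input_tokens, record.output_tokens)
```

Each field maps onto a span attribute, so the record never needs its own storage; it is only a convenient way to see the minimum set in one place.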
Tracing multi-step AI workflows
Real AI features are rarely a single call. A typical RAG pipeline is: embed the query → search the vector store → assemble the prompt → call the model → post-process. If each step is a child span under one trace, a slow or expensive request decomposes instantly — you can see whether the retrieval, the prompt assembly, or the generation is the problem. This is ordinary distributed tracing; the only new part is the attributes.
The OpenTelemetry GenAI conventions
The OpenTelemetry community publishes semantic conventions for generative-AI spans, so this data is portable across backends. The conventions are still evolving, so check the current specification for exact attribute names. The core attributes look like:
    gen_ai.system = "openai" | "anthropic" | ...
    gen_ai.request.model = "gpt-4o"
    gen_ai.response.model = "gpt-4o-2024-08-06"
    gen_ai.usage.input_tokens = 1742
    gen_ai.usage.output_tokens = 318
    gen_ai.response.finish_reasons = ["stop"]
Libraries like OpenLLMetry auto-instrument popular SDKs and frameworks (OpenAI, Anthropic, LangChain, LlamaIndex, and others) and emit these conventions for you — the same drop-in story as HTTP auto-instrumentation.
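Where auto-instrumentation is not available, the same attributes can be attached by hand. The helper below is a hypothetical convenience function, not part of any library; it only builds a dictionary keyed by the attribute names listed above.

```python
from typing import Dict, List, Union

AttributeValue = Union[str, int, List[str]]


def genai_span_attributes(
    system: str,
    request_model: str,
    response_model: str,
    input_tokens: int,
    output_tokens: int,
    finish_reasons: List[str],
) -> Dict[str, AttributeValue]:
    """Build span attributes using the GenAI attribute names shown above."""
    return {
        "gen_ai.system": system,
        "gen_ai.request.model": request_model,
        "gen_ai.response.model": response_model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        "gen_ai.response.finish_reasons": finish_reasons,
    }


attrs = genai_span_attributes(
    system="openai",
    request_model="gpt-4o",
    response_model="gpt-4o-2024-08-06",
    input_tokens=1742,
    output_tokens=318,
    finish_reasons=["stop"],
)
```

With the OpenTelemetry Python API, a dictionary like this can be passed to `span.set_attributes(attrs)` on the span that wraps the model call.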
Cost: the metric that surprises everyone
Teams that turn on LLM observability tend to find the same two things in the first week: a handful of callers generating a disproportionate share of token spend, and at least one prompt template that grew over time without anyone noticing. Watching cost per feature per day, not just total spend, is what turns the bill from a surprise into a dial you can tune.
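A minimal sketch of the per-feature, per-day rollup, assuming each call has already been reduced to a `(day, feature, cost_usd)` tuple. The function names and the sample rows are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Call = Tuple[str, str, float]  # (day as YYYY-MM-DD, feature, cost in USD)


def cost_per_feature_per_day(calls: Iterable[Call]) -> Dict[Tuple[str, str], float]:
    """Sum cost for every (day, feature) pair."""
    totals: Dict[Tuple[str, str], float] = defaultdict(float)
    for day, feature, cost_usd in calls:
        totals[(day, feature)] += cost_usd
    return dict(totals)


def top_spenders(totals: Dict[Tuple[str, str], float],
                 n: int = 3) -> List[Tuple[Tuple[str, str], float]]:
    """The n most expensive (day, feature) pairs, highest first."""
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Made-up sample data.
calls = [
    ("2024-01-01", "summarize", 0.50),
    ("2024-01-01", "chat", 0.25),
    ("2024-01-01", "summarize", 0.25),
    ("2024-01-02", "chat", 0.125),
]
totals = cost_per_feature_per_day(calls)
```

In practice a tracing backend runs this as a group-by over span attributes; the point is that it only works if feature and cost were attached to the span at call time.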
Getting started
If you're already on OpenTelemetry, LLM observability is an extension, not a new system: add GenAI instrumentation, and your model calls land in the same traces as everything else. aiAxonIQ ingests these conventions natively (it's OpenLLMetry-compatible) and breaks down token usage, latency, and cost per model out of the box — so the day you ship an AI feature is the day you can see what it's doing.