You ship an agent. It works in the demo. Then a user asks it something slightly off-script, and it loops through three tools, retrieves the wrong document, hallucinates a confident answer, and burns 40,000 tokens doing it. Your APM dashboard shows a healthy 200 response and a 9-second latency. It tells you nothing about why the agent was wrong. That gap is the entire reason agent observability exists.
A single agent run is not one LLM call. It is a tree of them: a planning prompt, a tool selection, the tool’s output fed back in, a retrieval step, a re-plan, and a final synthesis. Traditional monitoring sees the HTTP request and the wall-clock time. It cannot see the reasoning, the tool arguments the model chose, what the retriever actually returned, or where in that chain the run went sideways. To debug agents you need trace-level visibility: the full span tree, inputs and outputs at every step, token and cost accounting per call, and a way to attach evaluations to the trace after the fact.
We are AIToolsBakery. We are independent and we sell none of these tools. That matters here because the search results for “agent observability” are dominated by the vendors’ own landing pages and by affiliate roundups that rank tools by referral payout. We are neither. Below is how we actually think about the category, the tools worth knowing, and the honest limits of each. This piece is the agent-tracing companion to our broader work on LLM evaluation tools; evaluation scores a model’s output, observability shows you the path it took to get there.
The 30-second answer: Langfuse is the strongest open-source default for most teams. LangSmith fits if you live in LangChain or LangGraph. Arize Phoenix is the OpenTelemetry-native pick for self-hosters who also want evals. Braintrust and Helicone serve narrower needs: Braintrust for eval-heavy teams, Helicone for a proxy-simple start. Confirm pricing on each vendor page before committing.
What “observability” means for agents, specifically
Three capabilities separate an agent observability tool from a generic logging library.
First, tracing and spans. Every LLM call, tool execution, retrieval, and nested sub-agent should appear as a span in a hierarchical trace, with timing, inputs, outputs, and metadata. Without this you are reading flat logs and reconstructing the call tree in your head.
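To make the idea of a hierarchical trace concrete, here is a minimal sketch of a span tree in plain Python. It is not any vendor's SDK: `Span`, `Tracer`, and the example span names are hypothetical, and real tools add IDs, export, and async handling on top of this shape.

```python
from __future__ import annotations

import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator, List


@dataclass
class Span:
    """One unit of work in a trace: an LLM call, tool call, or retrieval."""
    name: str
    kind: str  # e.g. "agent", "llm", "tool", "retrieval"
    inputs: Dict[str, Any] = field(default_factory=dict)
    outputs: Dict[str, Any] = field(default_factory=dict)
    metadata: Dict[str, Any] = field(default_factory=dict)
    start: float = 0.0
    end: float = 0.0
    children: List[Span] = field(default_factory=list)

    @property
    def duration_s(self) -> float:
        return self.end - self.start


class Tracer:
    """Keeps a stack of open spans so a new span nests under the current one."""

    def __init__(self) -> None:
        self.roots: List[Span] = []
        self._stack: List[Span] = []

    @contextmanager
    def span(self, name: str, kind: str, **inputs: Any) -> Iterator[Span]:
        s = Span(name=name, kind=kind, inputs=inputs, start=time.perf_counter())
        parent = self._stack[-1] if self._stack else None
        (parent.children if parent else self.roots).append(s)
        self._stack.append(s)
        try:
            yield s
        finally:
            # Close the span even if the wrapped step raised.
            s.end = time.perf_counter()
            self._stack.pop()


# Hypothetical agent run: one planning call and one tool call under a root span.
tracer = Tracer()
with tracer.span("answer_question", "agent", question="What is our refund window?") as run:
    with tracer.span("plan", "llm", prompt="Decide which tool to call") as plan:
        plan.outputs["text"] = "search the policy docs"
        plan.metadata["tokens"] = 412
    with tracer.span("search_docs", "tool", query="refund window") as tool:
        tool.outputs["chunks"] = ["Refunds are accepted within 30 days."]
    run.outputs["answer"] = "30 days"
```

The stack is the design choice that matters: each step only declares itself, and nesting falls out of where it was called from, which is what spares you from rebuilding the call tree by hand.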
Second, tool-call and reasoning visibility. Agents fail in the gaps between calls: a malformed tool argument, a retrieved chunk that was off-topic, a plan that ignored the user’s constraint. The tool has to surface the actual function arguments the model emitted and the actual content the tool returned, not just that “a tool ran.”
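As a rough illustration of what "surface the actual arguments" means, the sketch below records the raw argument string the model emitted before trying to parse or run anything, so a malformed call is visible in the span rather than lost in an exception. The function `record_tool_call` and the `search_docs` tool are hypothetical names for this example only.

```python
import json
from typing import Any, Callable, Dict, List


def record_tool_call(name: str, raw_arguments: str, tool: Callable[..., Any]) -> Dict[str, Any]:
    """Run a tool and return a span dict that keeps exactly what the model emitted."""
    span: Dict[str, Any] = {"kind": "tool", "name": name, "raw_arguments": raw_arguments}
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError as exc:
        span["error"] = f"malformed arguments: {exc.msg}"
        return span
    if not isinstance(args, dict):
        span["error"] = "malformed arguments: expected a JSON object"
        return span
    span["arguments"] = args
    try:
        span["output"] = tool(**args)
    except Exception as exc:  # wrong or missing parameters, tool failure, etc.
        span["error"] = f"{type(exc).__name__}: {exc}"
    return span


def search_docs(query: str, top_k: int = 3) -> List[str]:
    # Stand-in for a real retrieval tool.
    return [f"result for {query!r}"][:top_k]


ok = record_tool_call("search_docs", '{"query": "refund window"}', search_docs)
truncated = record_tool_call("search_docs", '{"query": "refund window"', search_docs)
wrong_key = record_tool_call("search_docs", '{"q": "refund window"}', search_docs)
```

All three spans keep `raw_arguments`, so the truncated JSON and the wrong parameter name can be read directly from the trace instead of inferred from a stack trace.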
Third, online evaluation on real traces. Catching failures in production means running evaluators (LLM-as-judge, heuristics, or human review) against live traces, not only against an offline test set. This is where observability overlaps with, but does not replace, dedicated annotation tooling for model evaluation.
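A heuristic evaluator can be very small. The sketch below assumes a trace has been exported as a flat list of span dicts with `kind`, `name`, `inputs`, and `outputs` keys; that shape and the two checks are our own illustration, not a format any of these tools guarantees.

```python
from __future__ import annotations

import json
from typing import Any, Dict, List

# Assumed export shape: one dict per span, in call order.
Trace = List[Dict[str, Any]]


def repeated_tool_calls(trace: Trace, limit: int = 2) -> bool:
    """Flag a likely loop: the same tool called with identical arguments more than `limit` times."""
    counts: Dict[str, int] = {}
    for span in trace:
        if span.get("kind") != "tool":
            continue
        key = span["name"] + "|" + json.dumps(span.get("inputs", {}), sort_keys=True)
        counts[key] = counts.get(key, 0) + 1
    return any(n > limit for n in counts.values())


def empty_retrieval(trace: Trace) -> bool:
    """Flag any retrieval span that came back with no chunks."""
    return any(
        span.get("kind") == "retrieval" and not span.get("outputs", {}).get("chunks")
        for span in trace
    )


def evaluate(trace: Trace) -> Dict[str, bool]:
    """Run every heuristic and return one score per check."""
    return {
        "tool_loop": repeated_tool_calls(trace),
        "empty_retrieval": empty_retrieval(trace),
    }


looping_trace: Trace = [
    {"kind": "tool", "name": "search", "inputs": {"q": "refund"}},
    {"kind": "tool", "name": "search", "inputs": {"q": "refund"}},
    {"kind": "tool", "name": "search", "inputs": {"q": "refund"}},
    {"kind": "retrieval", "name": "vector_lookup", "outputs": {"chunks": []}},
]
scores = evaluate(looping_trace)
```

Checks like these are cheap enough to run on every production trace, which leaves the slower LLM-as-judge and human review for the traces they flag.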
A fourth quality, less of a hard requirement but a strong differentiator, is how well the tool handles nested and multi-agent runs. A supervisor agent that delegates to sub-agents produces a deeply nested trace, and tools vary widely in how legibly they render that hierarchy. We test this deliberately: a flat list of 30 spans is not the same as a clean, collapsible tree you can navigate at 2 a.m. during an incident.
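The difference between a flat list and a collapsible tree is easy to show in a few lines. This is a sketch of the rendering idea only, assuming each span is a dict with `name`, `kind`, and an optional `children` list; the supervisor and sub-agent names are made up.

```python
from __future__ import annotations

from typing import Any, Dict, List, Optional


def count_descendants(span: Dict[str, Any]) -> int:
    """Number of spans nested anywhere below this one."""
    return sum(1 + count_descendants(c) for c in span.get("children", []))


def render(span: Dict[str, Any], depth: int = 0, max_depth: Optional[int] = None) -> List[str]:
    """Return indented lines for a span tree, collapsing subtrees at max_depth."""
    children = span.get("children", [])
    collapsed = max_depth is not None and depth >= max_depth and bool(children)
    marker = "+" if collapsed else "-"
    line = f"{'  ' * depth}{marker} {span['name']} [{span['kind']}]"
    if collapsed:
        return [line + f" ({count_descendants(span)} hidden)"]
    lines = [line]
    for child in children:
        lines.extend(render(child, depth + 1, max_depth))
    return lines


run = {
    "name": "supervisor", "kind": "agent", "children": [
        {"name": "researcher", "kind": "agent", "children": [
            {"name": "search_docs", "kind": "tool"},
            {"name": "summarize", "kind": "llm"},
        ]},
        {"name": "final_answer", "kind": "llm"},
    ],
}

collapsed_view = render(run, max_depth=1)
full_view = render(run)
print("\n".join(collapsed_view))
```

With `max_depth=1` the researcher sub-agent shrinks to a single line with a hidden-span count, which is roughly the navigation behavior to look for when you test a tool on a deep multi-agent run.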
The standard underneath: OpenTelemetry GenAI semantic conventions
Before the tools, one piece of plumbing worth understanding. OpenTelemetry has been defining GenAI semantic conventions: a shared vocabulary for representing LLM and agent activity as spans (model name, token counts, tool calls, and so on). The practical payoff is portability. If your instrumentation emits OpenTelemetry-conformant traces, you can point them at any backend that speaks the same convention instead of being locked to one vendor’s proprietary SDK.
Several tools below are OpenTelemetry-native by design (Phoenix, Laminar, Langtrace). Others support it as one ingestion path among several. We treat strong OpenTelemetry support as a point in a tool’s favor, because it lowers your switching cost later. The conventions are still maturing, so verify current span attribute coverage against your stack rather than assuming full fidelity.
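To give a feel for what a shared vocabulary looks like, here is a sketch that maps a hypothetical internal LLM-call record onto GenAI-style span attributes. The attribute names reflect our understanding of the conventions at the time of writing, and because they are still changing they should be checked against the current specification. The input record shape and `to_genai_attributes` are our own invention.

```python
from typing import Any, Dict


def to_genai_attributes(call: Dict[str, Any]) -> Dict[str, Any]:
    """Translate an in-house call record into GenAI-convention span attributes.

    Attribute names are assumed from the OpenTelemetry GenAI semantic
    conventions and may have changed; verify before relying on them.
    """
    return {
        "gen_ai.operation.name": "chat",
        "gen_ai.request.model": call["model"],
        "gen_ai.usage.input_tokens": call["prompt_tokens"],
        "gen_ai.usage.output_tokens": call["completion_tokens"],
    }


# Hypothetical record from an in-house logger.
attrs = to_genai_attributes(
    {"model": "example-model", "prompt_tokens": 1200, "completion_tokens": 85}
)
```

Keeping this translation in one place is what makes the portability claim practical: if a backend expects a newer attribute name, you change one mapping rather than every call site.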
Langfuse: the open-source default

Langfuse is, for most teams, where we would start. It is open source under the MIT license, genuinely framework-agnostic, and self-hosting is a first-class path rather than an afterthought. It instruments cleanly against the OpenAI SDK, the Vercel AI SDK, Pydantic AI, LangChain, and others, so you are not forced into one agent framework to get good traces.
What you get: hierarchical tracing with full span trees, token and cost tracking, prompt management, datasets, and an evaluation layer you can run against production traces. The trace view is readable, and nested agent and tool spans render the way you actually need to debug them.
The honest limit: at high trace volume the managed cloud cost rises meaningfully, and self-hosting it well means running and maintaining infrastructure (it uses a ClickHouse-backed stack). That is a fair trade for the control and the license, but it is real operational work. Pricing has a usable free tier and scales with traces and seats; confirm current numbers on the Langfuse pricing page. If you are weighing it head to head against the obvious commercial alternative, we go deep in Langfuse vs LangSmith.
LangSmith: the LangChain-native choice
LangSmith is the commercial observability and evaluation platform from the LangChain team. If your agents are built on LangChain or LangGraph, this is the path of least resistance: instrumentation is close to automatic, and the integration with LangGraph (including agent state and graph-level views) is deeper than any third-party tool can offer.
What stands out: excellent tracing for LangChain and LangGraph apps, strong dataset and evaluation workflows, and tight coupling to the framework’s deployment and studio tooling. For teams committed to that ecosystem, the cohesion is the selling point.
The honest limit: it is proprietary SaaS. Self-hosting is gated behind enterprise plans, and the value drops noticeably if you are not using LangChain, because much of the magic is framework-specific. At high trace volumes it tends to be one of the more expensive options in this list. Pricing is per-seat plus usage with a free tier for individuals; confirm current tiers on the vendor page.
Arize Phoenix: OpenTelemetry-native, self-host friendly

Arize Phoenix is the open-source, OpenTelemetry-based tracing and evaluation tool from Arize. It is licensed under the Elastic License 2.0, which lets any company self-host it internally for free; the main restriction is that you cannot resell it as a competing hosted service. For most internal teams that distinction never bites.
What stands out: OpenTelemetry instrumentation throughout, a strong built-in evaluation layer (LLM-as-judge templates, datasets, experiment tracking), and a prompt playground that lets you replay traced calls. Because it is OpenTelemetry-native, it slots into existing tracing setups with less friction than proprietary SDKs. Phoenix is also the natural on-ramp to Arize’s larger production platform (AX) if you later need enterprise scale, though the two are separate products.
The honest limit: the Elastic License is not OSI-approved open source, so if your organization has a strict policy on license types, check it. And while Phoenix’s evals are good, very large-scale production monitoring is where Arize wants you on the paid platform. Verify the current self-host vs paid feature split on Arize’s site.
Braintrust: when evaluation is the center of gravity

Braintrust is a commercial platform built around evaluation, with observability attached rather than the other way round. If your team’s bottleneck is “how do we systematically score and compare agent versions” more than “how do we read one broken trace,” Braintrust is aimed squarely at you.
What stands out: a polished workflow for evals, experiments, and dataset management, with logging and tracing that feed those evaluations. It tends to fit teams running disciplined, repeated comparison of prompts and agent configurations. It overlaps with our prompt management coverage on the prompt-iteration side.
The honest limit: it is proprietary and pricing is quote-driven at the higher tiers, so budget predictability requires a conversation with sales. If pure trace debugging is your only need, a tracing-first tool may be a simpler fit. Confirm current plans and the free tier directly with Braintrust.
Helicone: the proxy-simple on-ramp

Helicone takes a different architectural route. It is an open-source (Apache 2.0) observability tool that can run as a proxy: you point your LLM calls through it and get logging, cost tracking, caching, and rate-limit visibility with very little code change. There is also an async SDK path if you prefer not to route traffic through a proxy.
What stands out: the lowest-friction start in this list. One configuration change and you are capturing requests, costs, and latencies across providers. For teams that mostly make direct LLM calls and want immediate cost and usage visibility, it is hard to beat on time-to-value.
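The proxy pattern itself is simple enough to sketch generically. The code below is not Helicone's implementation or API; `LoggingProxy`, `fake_provider`, and the price table are hypothetical and only show why routing calls through one choke point yields latency, cost, and caching almost for free.

```python
from __future__ import annotations

import json
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

# Hypothetical price per 1,000 tokens, for illustration only.
PRICE_PER_1K = {"example-model": 0.002}


@dataclass
class LoggingProxy:
    """Sits between the app and the provider, logging every request it forwards."""
    upstream: Callable[[Dict[str, Any]], Dict[str, Any]]
    log: List[Dict[str, Any]] = field(default_factory=list)
    cache: Dict[str, Dict[str, Any]] = field(default_factory=dict)

    def send(self, request: Dict[str, Any]) -> Dict[str, Any]:
        key = json.dumps(request, sort_keys=True)
        if key in self.cache:
            # Identical request seen before: no provider call, no new cost.
            self.log.append({"model": request["model"], "cached": True, "cost": 0.0})
            return self.cache[key]
        start = time.perf_counter()
        response = self.upstream(request)
        latency_s = time.perf_counter() - start
        tokens = response.get("total_tokens", 0)
        cost = tokens / 1000 * PRICE_PER_1K.get(request["model"], 0.0)
        self.log.append({
            "model": request["model"], "cached": False,
            "latency_s": latency_s, "tokens": tokens, "cost": cost,
        })
        self.cache[key] = response
        return response


def fake_provider(request: Dict[str, Any]) -> Dict[str, Any]:
    # Stand-in for a real provider call.
    return {"text": "ok", "total_tokens": 1500}


proxy = LoggingProxy(upstream=fake_provider)
proxy.send({"model": "example-model", "prompt": "hello"})
proxy.send({"model": "example-model", "prompt": "hello"})  # served from cache
```

Notice what the log contains: one flat record per request. Nothing here knows that two calls belonged to the same agent run, which is the structural reason a proxy is strongest for per-call visibility.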
The honest limit: the proxy model is excellent for per-call logging but less naturally suited to deep multi-step agent trace trees than the span-first tools above. For complex nested agents you may find the trace hierarchy less rich. Weigh the convenience against how agentic your workload really is. Confirm current limits and pricing on Helicone’s site.
Laminar and Langtrace: the OpenTelemetry-native challengers
Two newer, open-source, OpenTelemetry-native options are worth knowing.
Laminar is purpose-built for AI agents and is open source (Apache 2.0). Its tracing SDK auto-instruments many frameworks (Vercel AI SDK, LangChain, OpenAI, Anthropic, and browser-automation tools among them) with minimal code, and it adds agent-specific touches such as browser session recordings synced to traces and a SQL editor for querying traces and events. For teams building browser or tool-heavy agents, that focus shows.
Langtrace is an open-source, OpenTelemetry-based end-to-end tracing tool for LLM apps, covering popular models, frameworks, and vector databases, with SDKs in Python and TypeScript. Its application is licensed under AGPL 3.0 while its SDKs are Apache 2.0, so check the application license against your policy before self-hosting.
The honest limit for both: they are younger and smaller than Langfuse, LangSmith, or Phoenix, which means smaller communities, faster-moving APIs, and fewer battle-tested large deployments. The OpenTelemetry-native design is a genuine strength; the maturity is the risk you are accepting. Confirm current feature coverage and hosting options on each vendor’s site.
How the agent observability tools compare
| Tool | What it does | Best for | License or tier |
|---|---|---|---|
| Langfuse | Full agent tracing, prompts, datasets, evals | Most teams; framework-agnostic; self-host | Open source (MIT) + paid cloud |
| LangSmith | Tracing + evals, deep LangGraph integration | Teams all-in on LangChain or LangGraph | Commercial SaaS; free tier for individuals |
| Arize Phoenix | OpenTelemetry tracing + evals + playground | Self-hosters wanting evals and OTel-native tracing | Elastic License 2.0; free self-host |
| Braintrust | Eval-first platform with logging attached | Teams centered on systematic eval and comparison | Commercial; quote-driven at scale |
| Helicone | Proxy or async logging, cost and usage | Fast cost and usage visibility on direct calls | Open source (Apache 2.0) + paid cloud |
| Laminar | OTel-native agent tracing, browser sessions | Browser and tool-heavy agents | Open source (Apache 2.0) + cloud |
| Langtrace | OTel-native tracing across models and vector DBs | OTel-first teams wanting broad coverage | App AGPL 3.0, SDKs Apache 2.0 |
Pricing for all paid tiers above changes frequently and several are quote-based. Treat this table as a map of fit, not a price sheet, and confirm current numbers on each vendor page.
A lean way to start
You do not need to evaluate seven tools to make progress. A pragmatic path:
- Pick one default. For most teams that is Langfuse (open source, framework-agnostic). If you are committed to LangChain or LangGraph, start with LangSmith instead.
- Instrument one agent end to end. Get the full span tree visible: every LLM call, tool call, and retrieval, with inputs and outputs.
- Reproduce one real failure in the trace view. Confirm you can see the exact tool arguments and the exact retrieved content. If you cannot, the tool is wrong for you.
- Add one online evaluator. A single LLM-as-judge or heuristic check on production traces beats a perfect offline test suite that never runs.
- Only then weigh a second tool, and prefer OpenTelemetry-native instrumentation so switching stays cheap.
What these tools still cannot do
Observability shows you what your agent did and where it went wrong. It does not decide what “wrong” means. You still write the evaluators, define the success criteria, and read the hard traces yourself. The tools surface the malformed tool call; they do not tell you the prompt was ambiguous.
They also do not fix retrieval quality, prompt design, or a brittle agent architecture. A perfect trace of a badly designed agent is still a badly designed agent, now with excellent documentation of its failures. And online evals, especially LLM-as-judge, carry their own error: a judge model can be confidently wrong about whether the agent was right, which is why human review of a sampled slice never fully goes away.
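One way to keep a judge model honest is to sample a slice of traces for human review and measure how often the two agree. The sketch below assumes verdicts are stored as booleans keyed by trace ID; the function names, sampling rate, and labels are illustrative, not taken from any tool above.

```python
import random
from typing import Dict, List


def sample_for_review(trace_ids: List[str], rate: float, seed: int = 0) -> List[str]:
    """Pick a reproducible random slice of traces for a human to label."""
    if not trace_ids:
        return []
    k = min(len(trace_ids), max(1, round(len(trace_ids) * rate)))
    return random.Random(seed).sample(trace_ids, k)


def agreement(judge: Dict[str, bool], human: Dict[str, bool]) -> float:
    """Fraction of human-labeled traces where the judge gave the same verdict."""
    shared = [trace_id for trace_id in human if trace_id in judge]
    if not shared:
        raise ValueError("no traces were labeled by both judge and human")
    return sum(judge[t] == human[t] for t in shared) / len(shared)


# Hypothetical verdicts: True means "the agent's answer was acceptable".
judge_verdicts = {"t1": True, "t2": True, "t3": False, "t4": True}
human_verdicts = {"t1": True, "t2": False, "t3": False, "t4": True}

rate = agreement(judge_verdicts, human_verdicts)
review_batch = sample_for_review([f"t{i}" for i in range(10)], rate=0.2)
```

In this made-up data the judge agrees with the human on three of four traces. A falling agreement rate is a signal to revisit the judge prompt before trusting its scores on the traces nobody reads.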
Used well, these tools turn agent debugging from guesswork into reading. That is a large step, and it is the right first investment for any team putting agents in front of real users. Just hold the honest expectation: they make the problem visible, and visibility is where the actual work begins.