Topic hub · 13 claims
LLM observability — tracing, logging, and evals for production AI
Once an LLM application reaches production, you need traces, evals, and feedback loops. This hub catalogs the production-grade observability platforms and what each is best for.
Why LLM observability is its own product category
Traditional APM (Datadog, New Relic, Honeycomb) handles HTTP latency, error rates, and infrastructure metrics. None of that captures what matters in LLM applications: which prompts produced bad outputs, how a chain spent its tokens, whether a fine-tune is regressing on the eval set, whether a user accepted a draft, and what each intermediate step of a multi-step agent did. LLM observability emerged as a distinct category between 2022 and 2024 because the questions changed.
The four production-grade platforms (as of 2025)
- LangSmith (LangChain, hosted) is the default for LangChain users: it is built into the LangChain ecosystem and has a strong eval framework.
- Langfuse (open-source and hosted) is the leading open-source option: self-hostable, OpenTelemetry-compatible, and framework-agnostic.
- Helicone (open-source and hosted) routes requests through a proxy and adds caching, retries, and rate limits alongside observability.
- Vellum AI positions itself as a developer platform with evaluation built in: workflow builder, eval orchestration, and production tracing.
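The proxy approach described for Helicone (caching and retries sitting in front of the model call, with a record kept of every request) can be illustrated generically. The sketch below is not Helicone's API or any vendor's implementation; `ObservingProxy`, `CallRecord`, `call_model`, and the flaky stub model are hypothetical names chosen for illustration, and the retry loop omits backoff for brevity.

```python
import hashlib
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class CallRecord:
    """One observed request (hypothetical record shape)."""
    prompt: str
    response: str
    attempts: int       # upstream attempts made (0 on a cache hit)
    cache_hit: bool
    latency_s: float


@dataclass
class ObservingProxy:
    """Wraps a model call with caching, retries, and request logging."""
    call_model: Callable[[str], str]   # assumption: any prompt -> text function
    max_retries: int = 2
    cache: Dict[str, str] = field(default_factory=dict)
    records: List[CallRecord] = field(default_factory=list)

    def complete(self, prompt: str) -> str:
        start = time.perf_counter()
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()

        # Serve identical prompts from the cache without calling upstream.
        if key in self.cache:
            response = self.cache[key]
            self._log(prompt, response, 0, True, start)
            return response

        # Retry failures up to max_retries times (no backoff, for brevity).
        attempts = 0
        while True:
            attempts += 1
            try:
                response = self.call_model(prompt)
                break
            except Exception:
                if attempts > self.max_retries:
                    raise

        self.cache[key] = response
        self._log(prompt, response, attempts, False, start)
        return response

    def _log(self, prompt, response, attempts, cache_hit, start):
        latency = time.perf_counter() - start
        self.records.append(
            CallRecord(prompt, response, attempts, cache_hit, latency)
        )


def make_flaky_model() -> Callable[[str], str]:
    """Stub model that fails on its first call, then echoes the prompt."""
    state = {"calls": 0}

    def model(prompt: str) -> str:
        state["calls"] += 1
        if state["calls"] == 1:
            raise RuntimeError("transient upstream error")
        return f"echo: {prompt}"

    return model


proxy = ObservingProxy(call_model=make_flaky_model())
first = proxy.complete("hello")    # one failed attempt, then success
second = proxy.complete("hello")   # served from the cache
```

In this sketch the observability data is a by-product of routing: because every request passes through `complete`, the record of attempts, cache hits, and latency needs no instrumentation inside the application itself.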
Eval coverage matters more than trace volume
Production LLM observability is not about gathering 100% of traces; it is about catching the 0.5% of regressions that matter. Best practice: define a small, curated eval set (50-500 cases) covering edge cases, safety, tone, and format; run it on every deployment; and alert on regression. Volume-based tracing is the cheap part; eval discipline is where the value lives.
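The practice above (a small curated eval set, run on every deployment, with an alert on regression) can be sketched as a deployment gate. This is a minimal illustration under stated assumptions rather than any platform's eval API: `EvalCase`, `run_evals`, `regressed`, the stub application, and the three example checks are all hypothetical, and a real eval set would hold 50-500 cases rather than three.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    name: str
    prompt: str
    check: Callable[[str], bool]   # returns True when the output is acceptable


def run_evals(app: Callable[[str], str], cases: List[EvalCase]) -> dict:
    """Run every case against the app and report pass rate and failures."""
    if not cases:
        raise ValueError("eval set must contain at least one case")
    failures = [c.name for c in cases if not c.check(app(c.prompt))]
    pass_rate = 1 - len(failures) / len(cases)
    return {"pass_rate": pass_rate, "failures": failures}


def regressed(current: dict, baseline_pass_rate: float,
              tolerance: float = 0.0) -> bool:
    """True when the current pass rate falls below the stored baseline."""
    return current["pass_rate"] < baseline_pass_rate - tolerance


def stub_app(prompt: str) -> str:
    """Stand-in for the deployed LLM application (hypothetical)."""
    if prompt == "":
        return "Please provide a question."
    return '{"answer": "' + prompt.upper() + '"}'


# Tiny eval set covering format, an edge case, and tone.
cases = [
    EvalCase("format_json", "hi",
             lambda out: out.startswith("{") and out.endswith("}")),
    EvalCase("edge_empty_input", "", lambda out: len(out) > 0),
    EvalCase("tone_no_shouting", "hi", lambda out: "!" not in out),
]

report = run_evals(stub_app, cases)              # baseline deployment
broken = run_evals(lambda prompt: "", cases)     # candidate that returns nothing
should_alert = regressed(broken, report["pass_rate"])
```

In a deployment pipeline, `should_alert` would block the release or page the owner; the list in `broken["failures"]` names the cases to inspect first.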
Why verification + observability complement each other
Observability tells you what your model did. Verification (per SourceScore VERITAS) tells you which assertions in the output are factually grounded. Together, observability catches behavioral regressions and verification catches factual hallucinations. Both are required for production-grade LLM systems.
Defined terms (5)
- LLM tracing
- Capturing each step of an LLM application (prompts, retrievals, tool calls, intermediate generations, final outputs) for debugging, replay, and analysis.
- Eval set
- A curated collection of test cases (input and expected behavior) run against an LLM application to detect regressions across deployments. Distinct from training data; it must never leak into the training set.
- LangSmith
- LangChain's hosted observability and eval platform (2023). Traces LangChain runs once tracing is enabled; adds an eval framework and dataset management. Closed-source backend.
- Langfuse
- Open-source LLM observability and tracing platform (founded 2022, YC W23). Self-hostable, OpenTelemetry-compatible, and framework-agnostic. MIT-licensed.
- Helicone
- Open-source LLM observability via a proxy gateway (founded 2022, YC W23). Adds caching, retries, and rate limiting alongside trace capture. Apache 2.0 licensed.
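The "LLM tracing" definition above (capturing each step for debugging, replay, and analysis) can be made concrete with a small in-memory recorder. This is a sketch under stated assumptions and not the API of LangSmith, Langfuse, or Helicone: `Trace`, `Span`, the `step` context manager, and the example step names are hypothetical.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Span:
    """One recorded step of the application."""
    name: str
    kind: str                       # e.g. "retrieval", "tool", "generation"
    inputs: Dict[str, Any]
    outputs: Dict[str, Any] = field(default_factory=dict)
    duration_s: float = 0.0


@dataclass
class Trace:
    """Ordered list of spans for a single request."""
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    spans: List[Span] = field(default_factory=list)

    @contextmanager
    def step(self, name: str, kind: str, **inputs: Any):
        span = Span(name=name, kind=kind, inputs=inputs)
        start = time.perf_counter()
        try:
            yield span              # caller fills span.outputs inside the block
        finally:
            # Record the span even if the step raised, so failures are visible.
            span.duration_s = time.perf_counter() - start
            self.spans.append(span)


trace = Trace()

with trace.step("lookup_docs", "retrieval", query="refund policy") as span:
    span.outputs["documents"] = ["Refunds are issued within 14 days."]

with trace.step("draft_answer", "generation",
                prompt="Summarize the refund policy.") as span:
    span.outputs["text"] = "Refunds take up to 14 days."

summary = [(s.kind, s.name) for s in trace.spans]
```

Because each span stores its inputs as well as its outputs, a stored trace can be replayed step by step, which is what separates tracing from plain request logging.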
All claims in this topic (13)
- AlpacaEval · introduced in Li et al. 2023: LLM-as-judge evaluation benchmark (1.00 · 2 sources)
- Chatbot Arena · introduced in Zheng et al. 2023: LMSYS open platform for evaluating LLMs by human preference (1.00 · 2 sources)
- GLUE benchmark · introduced in paper GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding (Wang et al., 2018) (1.00 · 2 sources)
- Helicone · founded in 2022 by Justin Torre + Cole Gottdank + Scott Nguyen: open-source LLM observability + analytics (YC W23) (1.00 · 2 sources)
- Langfuse · founded in 2022 by Marc Klingen + Max Deichmann + Clemens Rawert: open-source LLM observability + tracing platform (1.00 · 2 sources)
- LangSmith · publicly released on 2023-07-18 by LangChain: LLM observability + evaluation platform (1.00 · 2 sources)
- LMArena (Chatbot Arena) · founded in 2023: LMSYS Chatbot Arena → LMArena.ai 2024 (1.00 · 2 sources)
- LongBench · introduced in paper LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding (Bai et al., THU + Zhipu AI 2023-08-28) (1.00 · 2 sources)
- MMLU benchmark · introduced in paper Measuring Massive Multitask Language Understanding (Hendrycks et al., 2020) (1.00 · 2 sources)
- MTEB benchmark · introduced in Muennighoff et al. 2022: Massive Text Embedding Benchmark (1.00 · 2 sources)
- SuperGLUE benchmark · introduced in paper SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems (Wang et al., 2019) (1.00 · 2 sources)
- SWE-bench · introduced in Jimenez et al. 2024: software engineering benchmark from GitHub issues (1.00 · 2 sources)
- Vellum AI · founded in 2023 by Akash Sharma + Sidd Seethepalli + Noa Flaherty: LLM application development platform (YC W23) (1.00 · 2 sources)