Topic hub · 8 claims
Evaluation, benchmarks, and the harness problem
The benchmarks that define "capable model" — and the methodology caveats that make cross-paper comparisons unreliable. Hand-verified primary sources for every benchmark cited in the literature.
Why benchmarks matter — and why they mislead
Benchmarks are how the field measures progress. MMLU, HumanEval, GLUE, SuperGLUE, Chatbot Arena — each tries to capture a different dimension of capability (knowledge breadth, code generation, language understanding, conversational quality). But the same benchmark name can produce different scores under different evaluation harnesses, prompt formats, and decoding strategies, which is exactly why VERITAS does not ship performance-comparison claims (see /blog/why-no-performance-claims/).
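To make the harness problem concrete, here is a small, self-contained sketch: the same multiple-choice item rendered under two common prompt conventions, plus a strict answer parser of the kind some harnesses use. The item, both templates, and the parser are invented for illustration; they are not taken from any real harness.

```python
# Illustrative only: the question, templates, and parser are invented,
# not copied from any particular evaluation harness.

QUESTION = "Which data structure gives O(1) average-case lookup by key?"
CHOICES = ["Linked list", "Hash table", "Binary heap", "Stack"]
GOLD = "B"

def prompt_letter_style(question, choices):
    """Convention 1: lettered options; the model is expected to emit a single letter."""
    lines = [question]
    lines += [f"{chr(ord('A') + i)}. {c}" for i, c in enumerate(choices)]
    lines.append("Answer:")
    return "\n".join(lines)

def prompt_cloze_style(question):
    """Convention 2: no letters shown; a harness scores each choice string
    by log-likelihood instead of parsing free text."""
    return f"{question}\nAnswer:"

def parse_letter(completion):
    """Strict parser: only the first non-whitespace character counts."""
    stripped = completion.strip()
    return stripped[0].upper() if stripped else None

if __name__ == "__main__":
    print(prompt_letter_style(QUESTION, CHOICES), "\n")
    print(prompt_cloze_style(QUESTION), "\n")
    # Under the strict parser, a verbose but correct completion scores zero:
    for completion in ["B", " B.", "The answer is B"]:
        pred = parse_letter(completion)
        print(f"{completion!r:22} -> parsed {pred!r}, scored correct: {pred == GOLD}")
```

Aggregate that kind of mismatch over thousands of items and two scores reported under the same benchmark name can diverge noticeably, which is the sense in which the number is harness-dependent.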
The classics
GLUE (Wang et al. 2018) and SuperGLUE (Wang et al. 2019) were the first widely adopted multi-task natural-language-understanding benchmarks. ImageNet (Deng et al., CVPR 2009) preceded them in vision. BLEU (Papineni et al., ACL 2002) and ROUGE (Lin, ACL 2004) measured machine-translation and summarization quality via n-gram overlap. These benchmarks shaped a decade of progress.
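As a reminder of what these overlap metrics actually compute, a minimal BLEU call using the sacrebleu package is sketched below; the snippet assumes sacrebleu is installed, and the hypothesis and reference sentences are made up. BLEU is reported on a 0-100 scale and, fitting the theme of this page, its value shifts with tokenizer and reference choices.

```python
# Minimal corpus-level BLEU with sacrebleu (pip install sacrebleu).
# The hypothesis and reference sentences are made-up examples.
import sacrebleu

hypotheses = ["the cat sat on the mat"]
references = [["the cat is sitting on the mat"]]  # one reference stream, parallel to hypotheses

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}")  # 0-100 scale; sensitive to tokenizer settings
```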
The LLM-era benchmarks
MMLU (Hendrycks et al. 2020) tests knowledge breadth across 57 subjects. HumanEval (Chen et al., OpenAI 2021) tests code generation. AlpacaEval (Tatsu Lab 2023) uses LLM-as-judge. Chatbot Arena (LMSYS 2023) uses pairwise human preferences. Each adds methodological subtlety: which split? which prompt template? few-shot or zero-shot? chain-of-thought or direct answer? The right reading is: track benchmarks as trend signals, not absolute rankings.
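Even HumanEval's headline number depends on how pass@k is estimated. Below is a short re-implementation of the unbiased estimator described in Chen et al. 2021, written from the published formula; the sample counts in the demo are invented.

```python
# Unbiased pass@k estimator from the HumanEval/Codex paper (Chen et al., 2021):
# with n sampled completions per problem, of which c pass the unit tests,
# pass@k = 1 - C(n-c, k) / C(n, k). The product form below is numerically
# stable and avoids large binomial coefficients.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k completions drawn (without
    replacement) from n samples with c correct ones passes the tests."""
    if n - c < k:
        return 1.0  # cannot draw k completions that are all incorrect
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

if __name__ == "__main__":
    # Toy numbers: 200 samples per problem, 37 of which pass.
    for k in (1, 10, 100):
        print(f"pass@{k} = {pass_at_k(200, 37, k):.3f}")
```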
Defined terms (3)
- Benchmark
- A standardized dataset and evaluation protocol designed to measure a specific capability across multiple models.
- Evaluation harness
- Software that runs an LLM through a benchmark in a reproducible way. Different harnesses (EleutherAI's LM Evaluation Harness, a.k.a. lm-eval, and Stanford's HELM) can produce different scores for the same nominal benchmark.
- LLM-as-judge
- Evaluation approach where one LLM scores the outputs of another. Used by AlpacaEval and MT-Bench. Cheaper than human evaluation, but biased toward the judge model's preferences (see the sketch after this list).
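A minimal sketch of the LLM-as-judge pattern, under stated assumptions: the prompt wording, the verdict format ("A", "B", or "tie"), and the call_judge stub are invented, not the actual AlpacaEval or MT-Bench protocol. It does show one common bias-control trick, asking the judge twice with the answer order swapped.

```python
# Illustrative LLM-as-judge pairwise comparison. Plug a real judge model into
# call_judge; everything else here is an assumption made for the sketch.

JUDGE_TEMPLATE = """You are grading two answers to the same question.
Question: {question}
Answer A: {answer_a}
Answer B: {answer_b}
Reply with exactly one token: A, B, or tie."""

def call_judge(prompt: str) -> str:
    """Hypothetical stub for whatever judge model is being used."""
    raise NotImplementedError("plug in your judge model here")

def judge_pair(question: str, ans1: str, ans2: str) -> str:
    """Ask the judge twice with the answer order swapped; only agreeing
    verdicts count, otherwise call it a tie (crude position-bias control)."""
    v1 = call_judge(JUDGE_TEMPLATE.format(question=question, answer_a=ans1, answer_b=ans2))
    v2 = call_judge(JUDGE_TEMPLATE.format(question=question, answer_a=ans2, answer_b=ans1))
    # Map the swapped verdict back to the original answer labels.
    v2_mapped = {"A": "B", "B": "A", "tie": "tie"}.get(v2.strip(), "tie")
    return v1.strip() if v1.strip() == v2_mapped else "tie"
```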
All claims in this topic (8)
- AlpacaEval · introduced in Li et al. 2023 — LLM-as-judge evaluation benchmark (1.00 · 2 sources)
- Chatbot Arena · introduced in Zheng et al. 2023 — LMSYS open platform for evaluating LLMs by human preference (1.00 · 2 sources)
- GLUE benchmark · introduced in paper GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding (Wang et al., 2018) (1.00 · 2 sources)
- HumanEval benchmark · introduced in paper Evaluating Large Language Models Trained on Code (Chen et al., 2021) (1.00 · 2 sources)
- LangSmith · publicly released on 2023-07-18 by LangChain — LLM observability + evaluation platform (1.00 · 2 sources)
- MMLU benchmark · introduced in paper Measuring Massive Multitask Language Understanding (Hendrycks et al., 2020) (1.00 · 2 sources)
- MTEB benchmark · introduced in Muennighoff et al. 2022 — Massive Text Embedding Benchmark (1.00 · 2 sources)
- SuperGLUE benchmark · introduced in paper SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems (Wang et al., 2019) (1.00 · 2 sources)