SourceScore

Concept · 2026-05-16

Evaluation harnesses — why the same model scores differently on the same benchmark

An evaluation harness is the software that runs an LLM through a benchmark in a reproducible way. Different harnesses produce different scores for the same model — sometimes 4–10 points apart. Here's why, and how to read benchmark numbers honestly.

Definition

An evaluation harness is the software that runs a language model through a benchmark dataset in a reproducible way — handling prompt formatting, decoding parameters, output parsing, and scoring. The harness determines whether the same model scores a 71 or a 78 on the same nominal benchmark. Different harnesses make different (defensible) choices on each of those steps.
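To make the moving parts concrete, here is a minimal sketch of the four stages a harness controls. Everything in it (the item schema, the generate_fn callable, the prompt template) is illustrative and not taken from any particular harness.

```python
def format_prompt(item, shots):
    """Stage 1, prompt formatting: few-shot exemplars, then the question."""
    exemplars = "\n\n".join(f"Q: {s['question']}\nA: {s['gold']}" for s in shots)
    return f"{exemplars}\n\nQ: {item['question']}\nA:"

def parse_answer(raw_text):
    """Stage 3, output parsing: here, just the first token of the completion."""
    tokens = raw_text.strip().split()
    return tokens[0].strip(".") if tokens else ""

def evaluate(generate_fn, items, shots, temperature=0.0):
    """generate_fn(prompt, temperature) -> completion string (the model call)."""
    correct = 0
    for item in items:
        prompt = format_prompt(item, shots)            # 1. prompt formatting
        raw = generate_fn(prompt, temperature)         # 2. decoding parameters
        pred = parse_answer(raw)                       # 3. output parsing
        correct += int(pred == item["gold"])           # 4. scoring (exact match here)
    return correct / len(items)
```

Every one of those four commented lines is a place where two harnesses can make different, equally defensible choices.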

The three major harnesses in 2026

  • LM Evaluation Harness (EleutherAI) — the de facto open-source harness; powers the Hugging Face Open LLM Leaderboard. ~70 benchmarks, batch evaluation, log-likelihood scoring.
  • HELM (Stanford CRFM) — Holistic Evaluation of Language Models. Broader-scope: not just accuracy but calibration, robustness, fairness, bias, efficiency. Slower to run; richer report.
  • BIG-bench / BIG-bench Hard — community-sourced benchmark suite with 200+ tasks. Often run inside other harnesses rather than standalone.

Labs also run their own internal harnesses (OpenAI Evals, Anthropic's internal eval suite, Google's internal evals). Lab papers often report numbers from these, which is why reproducing a frontier-lab result with an open harness often yields a different number.
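For a sense of what a harness run looks like in practice, here is a sketch using the EleutherAI harness's Python entry point. It assumes the v0.4-era simple_evaluate API and a Hugging Face model id; argument names and result keys can differ between versions, so check the harness docs before relying on it.

```python
# Sketch of a run through the EleutherAI LM Evaluation Harness (lm-eval).
# Assumes the v0.4-era Python API; details may differ in your installed version.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                        # Hugging Face backend
    model_args="pretrained=meta-llama/Llama-3.1-8B",   # any HF model id
    tasks=["mmlu"],
    num_fewshot=5,        # one of the knobs that moves the score
    batch_size=8,
)
print(results["results"])  # per-task metrics, keyed by task name
```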

What varies between harnesses

Six axes where harnesses make different choices:

  1. Prompt format. 0-shot vs. few-shot vs. chain-of-thought. Within few-shot: 5 examples vs. 25. Within CoT: which exemplar prompts, in what order.
  2. Scoring method. Log-likelihood (does the correct answer's token sequence get a higher log-prob than the incorrect alternatives?) vs. generation-then-parse (does the model emit text from which the correct answer string can be parsed before a stop sequence?). On the same multiple-choice question, these two methods produce different numbers, partly because some answers' tokens are more frequent in the model's general output distribution. Both scoring paths are sketched after this list.
  3. Decoding parameters. Temperature, top-p, top-k, max-tokens, stop sequences. A model that scores 67% on HumanEval at T=0 might score 71% at T=0.2 (or vice versa).
  4. Output parsing. "The answer is C." vs. "C" vs. "(C) The Battle of Hastings." All three should count as "answer=C", but different harnesses parse them differently (see the parse_choice helper in the sketch after this list).
  5. Benchmark version. MMLU 2020 vs. MMLU 2024 (errata fixes, contamination cleanup). HumanEval original vs. HumanEval+ (extended test cases). Same nominal benchmark, different actual tests.
  6. Contamination handling. Did the harness check whether the model trained on the benchmark? Did the lab decontaminate beforehand? Two harnesses might run the same model but only one filters out leaked items, producing different scores.
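To make axes 2 and 4 concrete, here is a sketch of the two scoring paths for a single multiple-choice item, plus a lenient answer parser. The function names and model interfaces (loglikelihood_fn, generate_fn) are stand-ins for illustration, not any harness's real API.

```python
import re

CHOICES = ["A", "B", "C", "D"]

def score_loglikelihood(loglikelihood_fn, question, options, gold_idx):
    """Pick the option whose text the model assigns the highest log-probability."""
    logprobs = [
        loglikelihood_fn(f"{question}\nAnswer:", f" {opt}") for opt in options
    ]
    best = max(range(len(options)), key=lambda i: logprobs[i])
    return int(best == gold_idx)

def parse_choice(completion):
    """Lenient parse: accepts 'C', '(C) The Battle of Hastings.', or 'The answer is C.'"""
    text = completion.strip()
    m = re.search(r"answer is\s*\(?([ABCD])\)?", text, re.IGNORECASE)
    if m:
        return m.group(1)
    m = re.match(r"\(?([ABCD])\)?\b", text)
    return m.group(1) if m else None

def score_generation(generate_fn, question, options, gold_idx):
    """Generate free text at T=0, then parse a choice letter out of it."""
    prompt = (
        f"{question}\n"
        + "\n".join(f"({c}) {o}" for c, o in zip(CHOICES, options))
        + "\nAnswer:"
    )
    pred = parse_choice(generate_fn(prompt, temperature=0.0))
    return int(pred == CHOICES[gold_idx])
```

A stricter or more permissive parse_choice, or a different log-likelihood normalization, shifts the final accuracy without the model changing at all.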

How wide is the spread?

A 2024 reproducibility study found that running the same frontier model on MMLU through three different harness setups (LM Evaluation Harness, HELM, and the lab's own defaults) produced scores spread across 4–10 percentage points. On HumanEval the spread was wider: some models showed differences of 12+ points from prompt format alone.

This is not a flaw — every harness is making defensible choices. It IS a reason to treat any single benchmark number as conditional on the harness that produced it.

How to read a benchmark claim honestly

When you see "Model X scores 86.4% on MMLU", the questions you should be able to answer:

  • Which harness produced this number?
  • 0-shot or few-shot? How many shots?
  • Generation-scored or log-likelihood-scored?
  • Temperature setting?
  • Which MMLU version (date of dataset snapshot)?
  • Was decontamination applied? How rigorous?

Without those six pieces of information, "86.4%" is not directly comparable to any other model's 86.4%. They might have been measured under different harness conditions.
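One practical habit is to record those six conditions alongside every score you keep, and refuse to compare scores whose conditions differ. The schema below is our own illustration, not any standard format.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class BenchmarkClaim:
    """The conditions a benchmark score was measured under (field names are illustrative)."""
    model: str
    benchmark: str                 # e.g. "MMLU"
    score: float                   # e.g. 86.4
    harness: str                   # e.g. "lm-eval v0.4", "HELM", "lab-internal"
    num_shots: int                 # 0 for zero-shot
    scoring: str                   # "loglikelihood" or "generation"
    temperature: float
    dataset_version: str           # snapshot date or release tag
    decontaminated: Optional[bool] = None   # None = not reported

def comparable(a: BenchmarkClaim, b: BenchmarkClaim) -> bool:
    """Two scores are directly comparable only if the measurement conditions match."""
    keys = ("benchmark", "harness", "num_shots", "scoring",
            "temperature", "dataset_version")
    return all(getattr(a, k) == getattr(b, k) for k in keys)
```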

Implications for production decisions

Three takeaways for teams choosing models:

  1. Build your own eval. Public benchmarks are cheap to read but expensive to trust. A custom eval on your actual production prompts is the only number that's calibrated to your use case. Even 50 hand-graded examples give you a tighter signal than the public MMLU number for most domains (a minimal grading loop is sketched after this list).
  2. Triangulate. Don't pick a model from a single number. Look at results from at least three harnesses, at least three benchmarks, and the arena leaderboard (LMSYS Chatbot Arena uses pairwise human preferences, which are harness-agnostic). Disagreement is informative.
  3. Re-evaluate after frontier-model updates. "GPT-4 scored X" is conditional on the GPT-4 version active at that time. The model gets updated; the number may not move with it.
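As a sketch of takeaway 1, here is a minimal grading loop over a hand-written case file. The JSONL format, the file name, and the substring grader are assumptions for illustration; in practice the grader is often a human, a rubric, or a stricter programmatic check.

```python
import json

def run_custom_eval(generate_fn, cases_path, grade_fn):
    """cases_path: JSONL file with one {"prompt": ..., "expected": ...} object per line."""
    passed = total = 0
    with open(cases_path) as f:
        for line in f:
            case = json.loads(line)
            output = generate_fn(case["prompt"])
            passed += int(grade_fn(output, case["expected"]))
            total += 1
    return passed / total if total else 0.0

# Example: substring grading against ~50 hand-written cases, e.g.
# run_custom_eval(my_model_call, "prod_eval_cases.jsonl",
#                 lambda out, exp: exp.lower() in out.lower())
```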

Why VERITAS doesn't ship benchmark claims

The six axes above are exactly why SourceScore's VERITAS catalog excludes performance-comparison claims (see why we don't ship them). A claim like "Llama 3 70B beats GPT-3.5 on HumanEval" requires specifying the harness, prompt format, decoding, version, and decontamination just to be meaningful. We'd rather ship 116 claims that are right for the next decade than 1,000 that are right under one harness on Thursday.