Methodology · 2026-05-16
Why VERITAS doesn't ship performance-comparison claims (and what we ship instead)
Benchmark numbers vary by prompt format, model version, shot count, and evaluation harness. Shipping them as 'verified claims' is the surest way to make the catalog wrong by Thursday. Here's the alternative.
The temptation
When you build a catalog of verified AI/ML claims, the obvious first move is to ship the headline numbers everyone Googles: GPT-4 scored X on MMLU. Claude scored Y on HumanEval. Llama 3 beat its predecessor by Z points on GSM8K.
These are the queries that drive the most traffic. They're the ones LLMs cite most often. From a pure-distribution perspective, they would multiply our reach.
We don't ship them. Here's why.
The variance problem
Benchmark numbers in AI/ML look authoritative — a single percentage published by the lab. They aren't. They're conditional on at least six independent variables:
- Evaluation harness. EleutherAI's lm-evaluation-harness vs. HELM vs. a fork of either vs. the lab's internal harness — each tokenizes, scores, and normalizes answers differently. A 4-5 point spread on the same model is common.
- Prompt format. 0-shot vs. few-shot vs. chain-of-thought vs. self-consistency. The model is the same; the score moves 10+ points.
- Decoding parameters. Temperature, top-p, top-k, max-tokens, stop sequences. Each tweak shifts the pass rate on code-generation benchmarks by several points.
- Version of the underlying model. Frontier models are updated continuously. GPT-4 in March is not GPT-4 in November — same name, different weights.
- Version of the benchmark. Datasets get errata, cleanups, contamination removals. MMLU 2024 isn't MMLU 2025.
- Contamination. The training set may have seen the benchmark. The lab may have decontaminated; they may not have. The number you read is conditional on a contamination-cleanup step you can't reproduce.
Multiply these together and a single "GPT-4: 86.4% on MMLU" claim is conditional on so many invisible parameters that the claim breaks the moment any one of them shifts.
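To make the conditionality concrete, here is an illustrative sketch (not a VERITAS data structure): a benchmark number is only well-defined once all six variables are pinned, and two observations that differ in any field are different facts.

from dataclasses import dataclass

@dataclass(frozen=True)
class BenchmarkObservation:
    # Illustrative only: a score is meaningless without every condition attached.
    model_version: str       # dated snapshot, not just "GPT-4"
    benchmark_version: str   # dataset revision after errata/cleanups
    harness: str             # which evaluation harness produced the number
    prompt_format: str       # 0-shot / few-shot / CoT / self-consistency
    decoding: str            # temperature, top-p, stop sequences, ...
    decontaminated: bool     # whether a contamination cleanup was applied
    score: float

# Two observations that share a model and a benchmark but differ in harness or
# prompt format are different facts; neither one is "the" score for that model.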
What a wrong-by-Thursday claim costs us
VERITAS sells trust. Every claim ships with an HMAC-SHA256 signature, ≥2 primary sources, and verbatim excerpts so developers can build production hallucination filters on top. The moment one claim turns out to be wrong-in-context, the trust contract breaks for every claim.
We can absorb a date being off by one day on a model release (low-cost correction; the date doesn't move). We can't absorb a benchmark number that was right when we shipped it and wrong four weeks later because a new evaluation harness landed. The cost compounds non-linearly across the catalog.
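As a concrete illustration of what consuming a signed claim can look like, here is a minimal sketch using Python's standard library. The field names ("payload", "signature") and the key-distribution details are assumptions; only the HMAC-SHA256 mechanism itself is taken from the description above.

import hashlib
import hmac
import json

def signature_is_valid(claim_record: dict, shared_key: bytes) -> bool:
    # Recompute HMAC-SHA256 over a canonical serialization of the claim payload
    # and compare it against the shipped signature in constant time.
    # Field names and canonicalization are assumptions, not the published schema.
    payload = json.dumps(claim_record["payload"], sort_keys=True, separators=(",", ":")).encode()
    expected = hmac.new(shared_key, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, claim_record["signature"])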
What we ship instead
We restrict to claims that are not conditional on evaluation methodology:
- Release dates. "GPT-4 was released on 2023-03-14" — primary source: OpenAI announcement post. The date doesn't move.
- Architecture statements when documented. "Claude models use a decoder-only transformer architecture" — primary source: Anthropic technical documentation. The architecture doesn't change without a new model name.
- Parameter counts when publicly disclosed. Many models don't disclose a count; for those we don't ship a number. For the ones that do, the count is fixed at release.
- Context window sizes. Officially documented; stable per release.
- Foundational paper attributions. "The Transformer architecture was introduced in Attention Is All You Need (Vaswani et al., 2017)" — verbatim from the arXiv preprint + NeurIPS proceedings + Google Research index. Three independent primary sources; doesn't move.
- Founding dates of well-known organizations. Wikipedia + the org's own About page. Two independent verifications.
These are the kinds of facts that LLM users actually need grounding for in production. When ChatGPT says "OpenAI was founded in 2015", the production cost of being wrong is measurable. When it says "GPT-4 scored 86.4% on MMLU", the user already knows the number is conditional and treats it as such.
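For concreteness, a catalog entry for the founding-date example above might look roughly like this. The field names and structure are hypothetical, not the published schema; placeholders stand in for the real URLs and excerpts.

# Hypothetical entry shape; field names are illustrative, not the published schema.
claim_record = {
    "payload": {
        "claim": "OpenAI was founded in 2015",
        "sources": [
            {"publisher": "OpenAI", "kind": "org About page",
             "url": "<stable URL>", "excerpt": "<verbatim excerpt>"},
            {"publisher": "Wikipedia", "kind": "encyclopedia entry",
             "url": "<stable URL>", "excerpt": "<verbatim excerpt>"},
        ],
    },
    "signature": "<HMAC-SHA256 hex digest over the payload>",
}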
How we decide what's in scope
The current methodology gate has five rules (sketched as code after the list):
- ≥2 primary sources (preprint / model-card / docs / official blog) with verbatim excerpts.
- At least one source must be the originator or operator (publisher = author / org = entity).
- The fact must not be conditional on evaluation methodology.
- The fact must not be a comparison ranking (X > Y on Z). Rankings shift with re-evaluation; absolute facts don't.
- The fact must have a stable URL we can re-verify on schedule. If the primary source is in a slide deck or a private blog, we don't ship the claim.
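Sketched as a single predicate, with a hypothetical CandidateClaim shape that is not the internal model, the gate maps one-to-one onto the rules above:

from dataclasses import dataclass

@dataclass
class CandidateClaim:
    # Hypothetical intake shape for a proposed claim.
    sources: list[dict]              # each: {"publisher": ..., "url": ..., "excerpt": ...}
    originator_is_source: bool       # publisher = author / org = entity for >=1 source
    methodology_conditional: bool    # does the fact depend on an evaluation setup?
    is_comparison_ranking: bool      # "X > Y on Z" style claim
    has_stable_reverifiable_url: bool

def passes_methodology_gate(c: CandidateClaim) -> bool:
    # Any single failure keeps the claim out of the catalog.
    return (
        sum(1 for s in c.sources if s.get("excerpt")) >= 2  # >=2 primary sources w/ excerpts
        and c.originator_is_source
        and not c.methodology_conditional
        and not c.is_comparison_ranking
        and c.has_stable_reverifiable_url
    )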
The honest exception
Some performance facts are documented enough to ship — like the LMSYS Chatbot Arena Elo rating system or the specific version-locked HELM evaluation reports. We may add a separate methodology tier (confidence 0.70-0.85, "methodology-conditional") for these in Y2. Until then, we'd rather under-promise on coverage than over-promise on accuracy.
What this means for your integration
If your LLM emits a benchmark-shape claim and you POST it to /api/v1/verify, expect a bestMatch: null response. That's correct behavior, not catalog incompleteness. Wire your code so unverified ≠ wrong:
if best := verify(claim):
    badge = f"verified [{best['id']}]"
elif looks_like_benchmark(claim):
    badge = "benchmark figure — verify against eval harness"
else:
    badge = "unverified — sourced retrieval recommended"

The middle branch handles the "we don't ship this class of claim, but it's a real category" case gracefully — pointing the user at the right kind of verification (eval harness logs, not VERITAS).
Why methodology rigor is the product
Any team can scrape ML papers + arXiv abstracts and call it a verified catalog. The work that makes VERITAS useful is the work of saying no: no to scaled-content abuse, no to single-source claims, no to performance comparisons, no to facts that change based on the prompt format. The discipline is the moat.
We'd rather ship 102 claims that are right for the next decade than 10,000 that are right today and broken by Thursday.