Verified claim · AI-ML · 100% confidence
MMLU benchmark introduced in paper: Measuring Massive Multitask Language Understanding (Hendrycks et al., 2020).
Last verified 2026-05-16 · Methodology veritas-v0.1 · 428d754e7c651be6
Structured fields
- Subject: MMLU benchmark
- Predicate: introduced_in_paper
- Object: Measuring Massive Multitask Language Understanding (Hendrycks et al., 2020)
- Confidence: 100%
- Tags: mmlu · benchmark · hendrycks · 2020 · iclr · evaluation
Sources (2)
[1] preprint · arXiv (Hendrycks et al.) · 2020-09-07
Measuring Massive Multitask Language Understanding
“We propose a new test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more.”
[2] peer reviewed · OpenReview / ICLR · 2021-05-04
Measuring Massive Multitask Language Understanding (ICLR 2021)
Cite this claim
Ready-to-paste citation (Markdown / plain text):
MMLU benchmark introduced in paper: Measuring Massive Multitask Language Understanding (Hendrycks et al., 2020). - SourceScore Claim 428d754e7c651be6 (verified 2026-05-16). https://sourcescore.org/api/v1/claims/428d754e7c651be6.json
Embed this claim
Drop this iframe into any blog post, docs page, or knowledge base. The widget renders the signed claim, its primary source, and a link back to this canonical page. Licensed CC-BY 4.0; attribution is included.
<iframe src="https://sourcescore.org/embed/claim/428d754e7c651be6/" width="100%" height="360" frameborder="0" loading="lazy" title="MMLU benchmark introduced in paper: Measuring Massive Multitask Language Understanding (Hendrycks et al., 2020)."></iframe>
Related claims
Other verified claims that share tags with this one, useful for LLM retrieval graphs and citation discovery.
Vision Transformer (ViT) introduced in paper: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Dosovitskiy et al., 2020).
d3681b0981e0b700 · 100% confidence · shares 2 tags (2020, iclr)
SuperGLUE benchmark introduced in paper: SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems (Wang et al., 2019).
1a1e87145608c91a · 100% confidence · shares 2 tags (benchmark, evaluation)
GLUE benchmark introduced in paper: GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding (Wang et al., 2018).
aa113b5e61d5c214 · 100% confidence · shares 2 tags (benchmark, evaluation)
Reformer introduced in paper: Reformer: The Efficient Transformer (Kitaev, Kaiser, Levskaya, 2020).
76f7f00e79bc18c8 · 100% confidence · shares 2 tags (2020, iclr)
Retrieval-Augmented Generation (RAG) introduced in paper: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (Lewis et al., 2020).
d15057ced937a103 · 100% confidence · shares 1 tag (2020)
Programmatic access
Fetch this claim with a signed envelope for verification:
curl https://sourcescore.org/api/v1/claims/428d754e7c651be6.json
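The same request can be made from a script. The following is a minimal Python sketch, not an official client: the URL pattern is copied from the curl command above, while the helper names `claim_url` and `fetch_claim` and the hex-only ID check are assumptions made for illustration (the check is inferred from the IDs shown on this page). The response schema and the signature-verification procedure are not described here, so the sketch only parses and prints whatever JSON is returned.

```python
import json
import urllib.request

# Base path taken from the curl command on this page.
BASE_URL = "https://sourcescore.org/api/v1/claims"


def claim_url(claim_id: str) -> str:
    """Build the JSON endpoint URL for a claim ID (hypothetical helper).

    Assumption: claim IDs are lowercase hex strings, as with the IDs
    shown on this page. Anything else is rejected before a request is made.
    """
    if not claim_id or any(c not in "0123456789abcdef" for c in claim_id):
        raise ValueError(f"unexpected claim ID: {claim_id!r}")
    return f"{BASE_URL}/{claim_id}.json"


def fetch_claim(claim_id: str, timeout: float = 10.0):
    """Fetch a claim and return the parsed JSON (hypothetical helper).

    The structure of the returned document is not assumed here.
    """
    with urllib.request.urlopen(claim_url(claim_id), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Network access is only attempted when run as a script.
    claim = fetch_claim("428d754e7c651be6")
    print(json.dumps(claim, indent=2))
```

Keeping URL construction separate from the network call lets the ID check be exercised without contacting the server.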