Tag: evaluation

16 verified claims carrying this tag. Each has 2+ primary sources and an HMAC-SHA256 signature.

MMLU benchmark introduced in paper: Measuring Massive Multitask Language Understanding (Hendrycks et al., 2020).
428d754e7c651be6 · 2 sources · 100% confidence
SuperGLUE benchmark introduced in paper: SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems (Wang et al., 2019).
1a1e87145608c91a · 2 sources · 100% confidence
GLUE benchmark introduced in paper: GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding (Wang et al., 2018).
aa113b5e61d5c214 · 2 sources · 100% confidence
Chatbot Arena introduced in: Zheng et al. 2023 — LMSYS open platform for evaluating LLMs by human preference.
789ddc9bc9c3d688 · 2 sources · 100% confidence
AlpacaEval introduced in: Li et al. 2023 — LLM-as-judge evaluation benchmark.
2f14f3078741c0ad · 2 sources · 100% confidence
LangSmith publicly released on: 2023-07-18 by LangChain — LLM observability + evaluation platform.
9ef37fbd1460c501 · 2 sources · 100% confidence
MTEB benchmark introduced in: Muennighoff et al. 2022 — Massive Text Embedding Benchmark.
cccd161dd058a31e · 2 sources · 100% confidence
SWE-bench introduced in: Jimenez et al. 2024 — software engineering benchmark from GitHub issues.
b16b5f5297e5f621 · 2 sources · 100% confidence
LMArena (Chatbot Arena) founded in: 2023 as LMSYS Chatbot Arena; became LMArena.ai in 2024.
88ff5918737d7b6b · 2 sources · 100% confidence
LongBench introduced in paper: LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding (Bai et al., THU + Zhipu AI 2023-08-28).
a41ff9e64baa566f · 2 sources · 100% confidence
GPQA benchmark introduced in paper: GPQA: A Graduate-Level Google-Proof Q&A Benchmark (Rein et al., 2023).
26f75f130f7b395a · 3 sources · 92% confidence
HellaSwag benchmark introduced in paper: HellaSwag: Can a Machine Really Finish Your Sentence? (Zellers et al., 2019).
b3f34e83dd0c53b9 · 3 sources · 92% confidence
TruthfulQA benchmark introduced in paper: TruthfulQA: Measuring How Models Mimic Human Falsehoods (Lin et al., 2021).
824f830889daf33e · 3 sources · 92% confidence
BIG-bench introduced in paper: Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models (Srivastava et al., 2022).
bde28f6f7e14e0e9 · 3 sources · 92% confidence
MMLU-Pro benchmark introduced in paper: MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark (Wang et al., 2024).
2df92e0b0e4c891b · 3 sources · 92% confidence
LiveCodeBench introduced in paper: LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code (Jain et al., 2024).
b474cbe11ab65d51 · 3 sources · 92% confidence
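
The HMAC-SHA256 signatures mentioned at the top of this page can be illustrated with a minimal sketch. The page does not say which key is used, how a claim is serialized before signing, or how the 16-character IDs relate to the signature, so `secret_key`, `sign_claim`, and `verify_claim` below are hypothetical names and the plain UTF-8 serialization is an assumption.

```python
import hashlib
import hmac


def sign_claim(secret_key: bytes, claim_text: str) -> str:
    """Return the hex HMAC-SHA256 of a claim (assumed serialization: UTF-8 text)."""
    return hmac.new(secret_key, claim_text.encode("utf-8"), hashlib.sha256).hexdigest()


def verify_claim(secret_key: bytes, claim_text: str, signature_hex: str) -> bool:
    """Recompute the signature and compare it in constant time."""
    expected = sign_claim(secret_key, claim_text)
    return hmac.compare_digest(expected, signature_hex)


if __name__ == "__main__":
    # Hypothetical key and claim text, for illustration only.
    key = b"example-key-not-the-real-one"
    claim = "GLUE benchmark introduced in paper: GLUE (Wang et al., 2018)."
    signature = sign_claim(key, claim)
    print(verify_claim(key, claim, signature))        # unchanged claim
    print(verify_claim(key, claim + " ", signature))  # altered claim
```

Only a holder of the key can verify a signature under this scheme, so it shows that a claim is unchanged since signing rather than offering public verifiability.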

Related tags

benchmark (13) · 2023 (7) · introduced_in (4) · 2024 (2) · foundational (2) · 2022 (2) · 2019 (2) · reasoning (2) · chatbot-arena (2) · lmsys (2)