Verified claim · AI-ML · 100% confidence
The Pile dataset released on: 2020-12-31.
Last verified 2026-05-16 · Methodology veritas-v0.1 · 4aef1422b96df26c
Structured fields
- Subject
- The Pile dataset
- Predicate
- released_on
- Object
- 2020-12-31
- Confidence
- 100%
- Tags
- the-pile · dataset · pretraining · eleutherai · 2020
Sources (2)
[1] preprint · arXiv (Gao, Biderman, Black, Golding, Hoppe, Foster, Phang, He, Thite, Nabeshima, Presser, Leahy) · 2020-12-31
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
“In this work, we present the Pile: an 825 GiB English text corpus targeted at training large-scale language models.”
[2] official blog · EleutherAI
The Pile — official site
Cite this claim
Ready-to-paste citation (Markdown / plain text):
The Pile dataset released on: 2020-12-31. — SourceScore Claim 4aef1422b96df26c (verified 2026-05-16). https://sourcescore.org/api/v1/claims/4aef1422b96df26c.json
Embed this claim
Drop this iframe into any blog post, docs page, or knowledge base. The widget renders the signed claim + primary source + click-through to this canonical page. CC-BY 4.0; attribution included.
<iframe src="https://sourcescore.org/embed/claim/4aef1422b96df26c/" width="100%" height="360" frameborder="0" loading="lazy" title="The Pile dataset released on: 2020-12-31."></iframe>
Related claims
Other verified claims sharing tags with this one — useful for LLM retrieval graphs and citation discovery.
C4 (Colossal Clean Crawled Corpus) introduced in paper: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (Raffel et al., 2019).
0d24c97977ebd744 · 100% confidence · shares 2 tags (dataset, pretraining)
RedPajama dataset released on: 2023-04-17.
ea8b7be3a49101be · 95% confidence · shares 2 tags (dataset, pretraining)
EleutherAI founded in: 2020.
f018fec775a8e941 · 95% confidence · shares 2 tags (eleutherai, 2020)
Retrieval-Augmented Generation (RAG) introduced in paper: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (Lewis et al., 2020).
d15057ced937a103 · 100% confidence · shares 1 tag (2020)
GPT-3 parameter count: 175000000000.
1ca2cc2864dfb376 · 100% confidence · shares 1 tag (2020)
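The "shares N tags" figures above read like a plain set intersection between this claim's tags and each related claim's tags. A minimal Python sketch of that idea follows; the function name `shared_tags` is hypothetical, and the page does not say how the overlap is actually computed. Only the tags this page reports as shared are used for the related claim, since its full tag list is not shown here.

```python
# Tags listed for this claim under "Structured fields".
PILE_TAGS = ["the-pile", "dataset", "pretraining", "eleutherai", "2020"]


def shared_tags(tags_a, tags_b):
    """Return the tags present in both lists, sorted for stable output.

    Assumption: overlap is an exact, case-sensitive string match.
    """
    return sorted(set(tags_a) & set(tags_b))


# The related claim's complete tag list is not shown on this page, so this
# example uses only the two tags reported as shared with "EleutherAI founded in: 2020."
eleutherai_tags = ["eleutherai", "2020"]
overlap = shared_tags(PILE_TAGS, eleutherai_tags)
print(f"shares {len(overlap)} tags ({', '.join(overlap)})")
```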
Programmatic access
Fetch this claim with a signed envelope for verification:
curl https://sourcescore.org/api/v1/claims/4aef1422b96df26c.json
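The same request can be made from Python with only the standard library. This is a minimal sketch, not an official client: the helper names `claim_url` and `fetch_claim` are hypothetical, the response is assumed to be a JSON object, and no signature check is attempted because the envelope format is not described on this page.

```python
import json
import urllib.request

# Base path taken from the curl example above.
BASE_URL = "https://sourcescore.org/api/v1/claims"


def claim_url(claim_id: str) -> str:
    """Build the JSON endpoint URL for a claim ID."""
    return f"{BASE_URL}/{claim_id}.json"


def fetch_claim(claim_id: str, timeout: float = 10.0) -> dict:
    """Download and parse a claim.

    Assumption: the endpoint returns a JSON object. The field names inside
    it are not documented here, so none are accessed.
    """
    with urllib.request.urlopen(claim_url(claim_id), timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    claim = fetch_claim("4aef1422b96df26c")
    # Print the top-level keys to inspect the envelope before relying on it.
    print(sorted(claim))
```

Inspecting the returned keys first is deliberate: it avoids hard-coding field names that this page does not confirm.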