SourceScore

Verified claim · AI-ML · 100% confidence

The Pile dataset released on: 2020-12-31.

Last verified 2026-05-16 · Methodology veritas-v0.1 · 4aef1422b96df26c

Structured fields

Subject: The Pile dataset
Predicate: released_on
Object: 2020-12-31
Confidence: 100%
Tags: the-pile · dataset · pretraining · eleutherai · 2020

Sources (2)

  1. preprint · arXiv (Gao, Biderman, Black, Golding, Hoppe, Foster, Phang, He, Thite, Nabeshima, Presser, Leahy) · 2020-12-31
    The Pile: An 800GB Dataset of Diverse Text for Language Modeling
    In this work, we present the Pile: an 825 GiB English text corpus targeted at training large-scale language models.
  2. official blog · EleutherAI
    The Pile — official site

Cite this claim

Ready-to-paste citation (Markdown / plain text):

The Pile dataset released on: 2020-12-31. — SourceScore Claim 4aef1422b96df26c (verified 2026-05-16). https://sourcescore.org/api/v1/claims/4aef1422b96df26c.json
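
For pages that cite several claims, the same citation line can be assembled from the structured fields. The sketch below is illustrative only: `format_citation` is a hypothetical helper, not part of any SourceScore client, and it assumes every claim's citation follows the layout shown above.

```python
def format_citation(statement: str, claim_id: str, verified_on: str) -> str:
    """Build a plain-text citation in the layout shown on this page.

    Hypothetical helper: the layout is inferred from the single
    ready-to-paste citation above, not from a published format spec.
    """
    url = f"https://sourcescore.org/api/v1/claims/{claim_id}.json"
    # "\u2014" is the dash separator used in the citation above.
    return (
        f"{statement} \u2014 SourceScore Claim {claim_id} "
        f"(verified {verified_on}). {url}"
    )


citation = format_citation(
    "The Pile dataset released on: 2020-12-31.",
    "4aef1422b96df26c",
    "2026-05-16",
)
print(citation)
```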

Embed this claim

Drop this iframe into any blog post, docs page, or knowledge base. The widget renders the signed claim, its primary source, and a link back to this canonical page. Licensed CC-BY 4.0; attribution is included.

<iframe src="https://sourcescore.org/embed/claim/4aef1422b96df26c/" width="100%" height="360" frameborder="0" loading="lazy" title="The Pile dataset released on: 2020-12-31."></iframe>
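
If the embed is generated by a script rather than pasted by hand, the snippet can be built from the claim ID and statement. This is a minimal sketch: `embed_iframe` is a hypothetical helper, and the URL pattern and attributes are copied from the snippet above on the assumption that they are the same for every claim.

```python
from html import escape


def embed_iframe(claim_id: str, title: str, height: int = 360) -> str:
    """Return an iframe tag matching the embed snippet on this page.

    Hypothetical helper; the attribute set mirrors the snippet above.
    """
    src = f"https://sourcescore.org/embed/claim/{claim_id}/"
    # Escape the title so quotes or angle brackets cannot break the attribute.
    safe_title = escape(title, quote=True)
    return (
        f'<iframe src="{src}" width="100%" height="{height}" '
        f'frameborder="0" loading="lazy" title="{safe_title}"></iframe>'
    )


snippet = embed_iframe(
    "4aef1422b96df26c",
    "The Pile dataset released on: 2020-12-31.",
)
print(snippet)
```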



Programmatic access

Fetch this claim with a signed envelope for verification:

curl https://sourcescore.org/api/v1/claims/4aef1422b96df26c.json
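
The same request can be made from Python with only the standard library. This is a sketch under stated assumptions: `claim_url` and `fetch_claim` are hypothetical helper names, the response is assumed to be a JSON object, and its field names are not documented on this page, so the code returns the parsed JSON unchanged. It does not check the envelope signature, because the signing scheme is not described here.

```python
import json
from urllib.request import Request, urlopen

# Endpoint prefix taken from the curl command above.
API_BASE = "https://sourcescore.org/api/v1/claims"


def claim_url(claim_id: str) -> str:
    """Return the JSON endpoint for a claim ID."""
    return f"{API_BASE}/{claim_id}.json"


def fetch_claim(claim_id: str, timeout: float = 10.0) -> dict:
    """Download and parse the claim envelope.

    Assumption: the endpoint returns a JSON object. No signature
    verification is attempted.
    """
    request = Request(claim_url(claim_id), headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example (performs a network request):
# envelope = fetch_claim("4aef1422b96df26c")
# print(sorted(envelope))  # list the top-level keys to inspect the schema
```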
