Verified claim · AI-ML · 100% confidence
C4 (Colossal Clean Crawled Corpus) introduced in paper: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (Raffel et al., 2019).
Last verified 2026-05-16 · Methodology veritas-v0.1 · 0d24c97977ebd744
Structured fields
- Subject: C4 (Colossal Clean Crawled Corpus)
- Predicate: introduced_in_paper
- Object: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (Raffel et al., 2019)
- Confidence: 100%
- Tags: c4 · dataset · pretraining · google · 2019
Sources (2)
[1] preprint · arXiv (Raffel et al.) · 2019-10-23
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
“We call the resulting dataset the 'Colossal Clean Crawled Corpus' (or C4 for short).”
[2] docs · Google / TensorFlow
c4 — TensorFlow Datasets catalog
Cite this claim
Ready-to-paste citation (Markdown / plain text):
C4 (Colossal Clean Crawled Corpus) introduced in paper: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (Raffel et al., 2019). — SourceScore Claim 0d24c97977ebd744 (verified 2026-05-16). https://sourcescore.org/api/v1/claims/0d24c97977ebd744.json
Embed this claim
Drop this iframe into any blog post, docs page, or knowledge base. The widget renders the signed claim, its primary source, and a click-through link to this canonical page. Licensed CC-BY 4.0; attribution is included.
<iframe src="https://sourcescore.org/embed/claim/0d24c97977ebd744/" width="100%" height="360" frameborder="0" loading="lazy" title="C4 (Colossal Clean Crawled Corpus) introduced in paper: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (Raffel et al., 2019)."></iframe>
Related claims
Other verified claims that share tags with this one, useful for LLM retrieval graphs and citation discovery.
T5 (Text-to-Text Transfer Transformer) introduced in paper: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (Raffel et al., 2019).
ef28341c3b308737 · 100% confidence · shares 2 tags (2019, google)
The Pile dataset released on: 2020-12-31.
4aef1422b96df26c · 100% confidence · shares 2 tags (dataset, pretraining)
ELECTRA introduced in paper: ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators (Clark et al., 2020).
2f9c79357e9d4da9 · 100% confidence · shares 2 tags (pretraining, google)
RedPajama dataset released on: 2023-04-17.
ea8b7be3a49101be · 95% confidence · shares 2 tags (dataset, pretraining)
Gemini Pro released on: 2023-12-06.
e2a6019bd2ce5c97 · 100% confidence · shares 1 tag (google)
Programmatic access
Fetch this claim with a signed envelope for verification:
curl https://sourcescore.org/api/v1/claims/0d24c97977ebd744.json
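The same request can be made from a script. The following is a minimal Python sketch, not an official client: it assumes the endpoint returns a JSON object, and that claim IDs are always 16 lowercase hex characters, as every ID on this page happens to be. The helper names claim_url and fetch_claim are hypothetical. The field names inside the signed envelope are not documented here, so the sketch returns the decoded body untouched and does not attempt signature verification.

```python
import json
import urllib.request

# Base URL taken from the curl command above.
API_BASE = "https://sourcescore.org/api/v1/claims"


def claim_url(claim_id: str) -> str:
    """Build the endpoint URL for a claim ID.

    Assumption: IDs are 16 lowercase hex characters, matching the
    IDs shown on this page. Anything else is rejected early.
    """
    if len(claim_id) != 16 or any(c not in "0123456789abcdef" for c in claim_id):
        raise ValueError(f"unexpected claim ID format: {claim_id!r}")
    return f"{API_BASE}/{claim_id}.json"


def fetch_claim(claim_id: str, timeout: float = 10.0) -> dict:
    """Fetch a claim and return the decoded JSON body as-is.

    Assumption: the response body is a JSON object. Its keys are not
    documented on this page, so nothing is unpacked or verified here.
    """
    with urllib.request.urlopen(claim_url(claim_id), timeout=timeout) as resp:
        return json.load(resp)
```

Calling fetch_claim("0d24c97977ebd744") performs the same GET as the curl command; inspecting the keys of the returned dictionary is a reasonable first step before relying on any particular field.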