Tag
pretraining
4 verified claims carry this tag. Each is backed by 2+ primary sources and an HMAC-SHA256 signature; a verification sketch follows the claim list.
C4 (Colossal Clean Crawled Corpus) introduced in paper: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (Raffel et al., 2019).
0d24c97977ebd744 · 2 sources · 100% confidence
The Pile dataset released on: 2020-12-31.
4aef1422b96df26c · 2 sources · 100% confidence
RedPajama dataset released on: 2023-04-17.
ea8b7be3a49101be · 2 sources · 95% confidence
ELECTRA introduced in paper: ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators (Clark et al., 2020).
2f9c79357e9d4da9 · 2 sources · 100% confidence
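The 16-hex-character IDs shown above are consistent with truncated HMAC-SHA256 digests over each claim's text. As a minimal sketch of how such a signature could be computed and checked: the signing key, the canonical claim encoding, and the 16-character truncation below are all assumptions for illustration, not the registry's documented scheme.

```python
import hashlib
import hmac

def sign_claim(claim_text: str, key: bytes) -> str:
    """Compute an HMAC-SHA256 signature over a claim's text.

    Assumptions: UTF-8 encoding of the raw claim string as the canonical
    form, and truncation to 16 hex characters to match the IDs displayed
    in the listing. The registry's actual scheme may differ.
    """
    digest = hmac.new(key, claim_text.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]

def verify_claim(claim_text: str, key: bytes, expected_id: str) -> bool:
    """Recompute the signature and compare in constant time."""
    return hmac.compare_digest(sign_claim(claim_text, key), expected_id)

# Hypothetical usage; the key and claim text are placeholders.
key = b"registry-signing-key"
claim = "The Pile dataset released on: 2020-12-31."
sig = sign_claim(claim, key)
print(sig, verify_claim(claim, key, sig))
```

Because HMAC is keyed, anyone holding the key can re-derive an ID from a claim's text and detect tampering, while readers without the key can still use the IDs as stable references.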