NVIDIA Releases Nemotron-CC, a 6.3-Trillion-Token English Language Common Crawl Dataset
NVIDIA is excited to announce the release of Nemotron-CC, a 6.3-trillion-token English language Common Crawl dataset for pretraining highly accurate large language models (LLMs), including 1.9 trillion tokens of synthetically generated data.
Results
Figure 1 shows MMLU scores for 8B parameter models trained on 1 trillion tokens, varying only the 73% English Common Crawl portion of the training data. Compared to DCLM, the leading open English Common Crawl dataset, the high-quality subset Nemotron-CC-HQ increases MMLU by 5.6 points.
Figure 1. MMLU scores for 8B parameter models trained for 1 trillion tokens
Furthermore, the full 6.3-trillion-token dataset matches DCLM on MMLU while containing four times more unique real tokens. This unlocks effective training over a long token horizon: an 8B parameter model trained on 15 trillion tokens, 7.2 trillion of which came from Nemotron-CC, outperforms the Llama 3.1 8B model by 5 points on MMLU, 3.1 points on ARC-Challenge, and 0.5 points on average across ten diverse tasks.
Key Insights
Some of the key insights that led to these results include:
- Ensembling different model-based classifiers can help select a larger and more diverse set of high-quality tokens.
- Rephrasing can effectively reduce noise and errors in low-quality data and produce diverse variants with fresh unique tokens from high-quality data, leading to better results in downstream tasks.
- Disabling traditional non-learned heuristic filters for high-quality data can further boost high-quality token yield without hurting accuracy.
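The classifier-ensembling insight above can be sketched as follows. This is a minimal, hypothetical illustration: the stand-in scorers and thresholds are placeholders, not the actual Nemotron-CC classifiers. Taking the maximum over classifiers means a document rated highly by any one of them is kept, widening the pool of high-quality tokens beyond what a single classifier would select.

```python
# Hypothetical sketch of ensembling model-based quality classifiers.
# The scoring functions and thresholds below are illustrative stand-ins,
# not the real DCLM or FineWeb-Edu classifiers.

def dclm_style_score(text: str) -> float:
    # Stand-in for a DCLM-style quality classifier: here, a crude
    # lexical-diversity proxy clipped to [0, 1].
    return min(1.0, len(set(text.lower().split())) / 50)

def edu_style_score(text: str) -> float:
    # Stand-in for a FineWeb-Edu-style educational-value classifier.
    return 1.0 if "theorem" in text.lower() else 0.3

def ensemble_bucket(text: str) -> str:
    # Take the maximum over classifiers, so a document kept by ANY
    # classifier counts toward the high-quality pool.
    score = max(dclm_style_score(text), edu_style_score(text))
    if score >= 0.8:
        return "high"
    if score >= 0.4:
        return "medium"
    return "low"
```

In a real pipeline the classifier scores would come from trained models, and documents would typically be partitioned into quality buckets that receive different downstream treatment (e.g., rephrasing for low-quality buckets).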
Data Curation Steps
Using NVIDIA NeMo Curator, we extracted and cleaned data from Common Crawl and then:
- Filtered it for the English language
- Performed global fuzzy deduplication as well as exact substring deduplication
- Leveraged model-based quality filters such as the DCLM and FineWeb-Edu classifiers
- Applied various heuristic and perplexity filters to further remove lower-quality data
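The fuzzy-deduplication step above is commonly implemented with MinHash signatures, which estimate the Jaccard similarity between documents without comparing full texts. The sketch below is a toy illustration under assumed settings (word 3-gram shingles, 64 hash functions); the production pipeline in NeMo Curator operates at scale with different parameters.

```python
import hashlib

# Toy MinHash sketch for fuzzy deduplication. Shingle size and signature
# length are illustrative assumptions, not production settings.

def shingles(text: str, n: int = 3) -> set[str]:
    # Word n-grams ("shingles") of the document.
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash_signature(text: str, num_hashes: int = 64) -> tuple[int, ...]:
    # For each of num_hashes seeded hash functions, record the minimum
    # hash value over all shingles; equal minima across two documents
    # occur with probability equal to their Jaccard similarity.
    return tuple(
        min(
            int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingles(text)
        )
        for seed in range(num_hashes)
    )

def estimated_jaccard(a: str, b: str) -> float:
    sa, sb = minhash_signature(a), minhash_signature(b)
    return sum(x == y for x, y in zip(sa, sb)) / len(sa)
```

Documents whose estimated similarity exceeds a threshold would be clustered and all but one representative dropped; exact substring deduplication then removes long verbatim overlaps that survive the fuzzy pass.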
We also leveraged synthetic data generation pipelines to generate ~1.9 trillion tokens of synthetic data.
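One ingredient of such pipelines is LLM-based rephrasing, treated differently by quality bucket as noted in the key insights: low-quality text is rewritten to reduce noise and errors, while high-quality text is paraphrased to produce diverse variants with fresh unique tokens. The prompt wording and the two-bucket split below are hypothetical placeholders, not the actual Nemotron-CC prompts.

```python
# Hypothetical prompt templates for LLM-based rephrasing; the wording
# is illustrative, not the actual Nemotron-CC pipeline prompts.

LOW_QUALITY_PROMPT = (
    "Rewrite the following web text in clear, well-formed English, "
    "correcting errors and removing boilerplate, without adding new "
    "facts:\n\n{doc}"
)

HIGH_QUALITY_PROMPT = (
    "Paraphrase the following passage in a different style while "
    "preserving all of its information, to produce a diverse "
    "variant:\n\n{doc}"
)

def build_rephrase_prompt(doc: str, is_high_quality: bool) -> str:
    # Low-quality text is cleaned up; high-quality text is paraphrased
    # to yield additional unique tokens.
    template = HIGH_QUALITY_PROMPT if is_high_quality else LOW_QUALITY_PROMPT
    return template.format(doc=doc)
```

The resulting prompts would be sent to a generator LLM, and its outputs filtered again before being added to the pretraining mix.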
Conclusion
Nemotron-CC is an open, large, high-quality English Common Crawl dataset that enables pretraining highly accurate LLMs over both short and long token horizons. In the future, we hope to release more datasets that are key ingredients for state-of-the-art LLM pretraining, such as a specialized math pretraining dataset.
Acknowledgments
We thank the Common Crawl Foundation for hosting the dataset. We thank Pedro Ortiz Suarez for valuable feedback that improved the paper and Greg Lindahl for help with improving the data formatting and layout.
FAQs
Q: What is Nemotron-CC?
A: Nemotron-CC is a 6.3-trillion-token English language Common Crawl dataset for pretraining highly accurate large language models (LLMs).
Q: What is the difference between Nemotron-CC-HQ and DCLM?
A: The high-quality subset Nemotron-CC-HQ increases MMLU by 5.6 points compared to DCLM, while the full Nemotron-CC dataset matches DCLM on MMLU and contains four times more unique real tokens.
Q: How was the dataset curated?
A: The dataset was curated using NVIDIA NeMo Curator, which extracted and cleaned data from Common Crawl, filtered it for the English language, performed deduplication, and applied various filters to remove lower-quality data.
Q: What is the purpose of the synthetic data generation pipelines?
A: The synthetic data generation pipelines were used to generate ~1.9 trillion tokens of synthetic data to further increase the size and diversity of the dataset.

