NVIDIA Releases Nemotron-CC, a 6.3-Trillion-Token English Language Common Crawl Dataset
NVIDIA is excited to announce the release of Nemotron-CC, a 6.3-trillion-token English language Common Crawl dataset for pretraining highly accurate large language models (LLMs), including 1.9 trillion tokens of synthetically generated data.
Results
Figure 1 shows MMLU scores for 8B parameter models trained on 1 trillion tokens, varying only the 73% English Common Crawl portion of the training data. Compared to DCLM, the leading open English Common Crawl dataset, the high-quality subset Nemotron-CC-HQ increases MMLU by 5.6 points.
Figure 1. MMLU scores for 8B parameter models trained for 1 trillion tokens
Furthermore, the full 6.3-trillion-token dataset matches DCLM on MMLU while containing four times more unique real tokens. This unlocks effective training over a long token horizon: an 8B parameter model trained on 15 trillion tokens, 7.2 trillion of which came from Nemotron-CC, outperforms the Llama 3.1 8B model by 5 points on MMLU, 3.1 points on ARC-Challenge, and 0.5 points on average across ten diverse tasks.
Key Insights
Some of the key insights that led to these results include:
- Ensembling different model-based classifiers can help select a larger and more diverse set of high-quality tokens.
- Rephrasing can effectively reduce noise and errors in low-quality data and produce diverse variants with fresh unique tokens from high-quality data, leading to better results in downstream tasks.
- Disabling traditional non-learned heuristic filters for high-quality data can further boost high-quality token yield without hurting accuracy.
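The classifier-ensembling insight above can be sketched as follows. This is a minimal, hypothetical illustration: the stand-in scorers and thresholds are placeholders, not the actual Nemotron-CC classifiers. Taking the maximum over classifiers means a document rated highly by any one of them is kept, widening the pool of high-quality tokens beyond what a single classifier would select.

```python
# Hypothetical sketch of ensembling model-based quality classifiers.
# The scoring functions and thresholds below are illustrative stand-ins,
# not the real DCLM or FineWeb-Edu classifiers.

def dclm_style_score(text: str) -> float:
    # Stand-in for a DCLM-style quality classifier: here, a crude
    # lexical-diversity proxy clipped to [0, 1].
    return min(1.0, len(set(text.lower().split())) / 50)

def edu_style_score(text: str) -> float:
    # Stand-in for a FineWeb-Edu-style educational-value classifier.
    return 1.0 if "theorem" in text.lower() else 0.3

def ensemble_bucket(text: str) -> str:
    # Take the maximum over classifiers, so a document kept by ANY
    # classifier counts toward the high-quality pool.
    score = max(dclm_style_score(text), edu_style_score(text))
    if score >= 0.8:
        return "high"
    if score >= 0.4:
        return "medium"
    return "low"
```

In a real pipeline the classifier scores would come from trained models, and documents would typically be partitioned into quality buckets that receive different downstream treatment (e.g., rephrasing for low-quality buckets).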
Data Curation Steps
Using NVIDIA NeMo Curator, we extracted and cleaned data from Common Crawl and then:
- Filtered it for the English language
- Performed global fuzzy deduplication as well as exact substring deduplication
- Leveraged model-based quality filters such as the DCLM and FineWeb-Edu classifiers
- Applied various heuristic and perplexity filters to further remove lower-quality data
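The fuzzy-deduplication step above is commonly implemented with MinHash signatures, which estimate the Jaccard similarity between documents without comparing full texts. The sketch below is a toy illustration under assumed settings (word 3-gram shingles, 64 hash functions); the production pipeline in NeMo Curator operates at scale with different parameters.

```python
import hashlib

# Toy MinHash sketch for fuzzy deduplication. Shingle size and signature
# length are illustrative assumptions, not production settings.

def shingles(text: str, n: int = 3) -> set[str]:
    # Word n-grams ("shingles") of the document.
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash_signature(text: str, num_hashes: int = 64) -> tuple[int, ...]:
    # For each of num_hashes seeded hash functions, record the minimum
    # hash value over all shingles; equal minima across two documents
    # occur with probability equal to their Jaccard similarity.
    return tuple(
        min(
            int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingles(text)
        )
        for seed in range(num_hashes)
    )

def estimated_jaccard(a: str, b: str) -> float:
    sa, sb = minhash_signature(a), minhash_signature(b)
    return sum(x == y for x, y in zip(sa, sb)) / len(sa)
```

Documents whose estimated similarity exceeds a threshold would be clustered and all but one representative dropped; exact substring deduplication then removes long verbatim overlaps that survive the fuzzy pass.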
We also leveraged synthetic data generation pipelines to generate ~1.9 trillion tokens of synthetic data.
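One ingredient of such pipelines is LLM-based rephrasing, treated differently by quality bucket as noted in the key insights: low-quality text is rewritten to reduce noise and errors, while high-quality text is paraphrased to produce diverse variants with fresh unique tokens. The prompt wording and the two-bucket split below are hypothetical placeholders, not the actual Nemotron-CC prompts.

```python
# Hypothetical prompt templates for LLM-based rephrasing; the wording
# is illustrative, not the actual Nemotron-CC pipeline prompts.

LOW_QUALITY_PROMPT = (
    "Rewrite the following web text in clear, well-formed English, "
    "correcting errors and removing boilerplate, without adding new "
    "facts:\n\n{doc}"
)

HIGH_QUALITY_PROMPT = (
    "Paraphrase the following passage in a different style while "
    "preserving all of its information, to produce a diverse "
    "variant:\n\n{doc}"
)

def build_rephrase_prompt(doc: str, is_high_quality: bool) -> str:
    # Low-quality text is cleaned up; high-quality text is paraphrased
    # to yield additional unique tokens.
    template = HIGH_QUALITY_PROMPT if is_high_quality else LOW_QUALITY_PROMPT
    return template.format(doc=doc)
```

The resulting prompts would be sent to a generator LLM, and its outputs filtered again before being added to the pretraining mix.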
Conclusion
Nemotron-CC is an open, large, high-quality English Common Crawl dataset that enables pretraining highly accurate LLMs over both short and long token horizons. In the future, we hope to release more datasets that are key ingredients for state-of-the-art LLM pretraining, such as a specialized math pretraining dataset.
Acknowledgments
We thank the Common Crawl Foundation for hosting the dataset. We thank Pedro Ortiz Suarez for valuable feedback that improved the paper and Greg Lindahl for help with improving the data formatting and layout.
FAQs
Q: What is Nemotron-CC?
A: Nemotron-CC is a 6.3-trillion-token English language Common Crawl dataset for pretraining highly accurate large language models (LLMs).
Q: What is the difference between Nemotron-CC-HQ and DCLM?
A: The high-quality subset Nemotron-CC-HQ increases MMLU by 5.6 points compared to DCLM, while the full Nemotron-CC dataset matches DCLM on MMLU and contains four times more unique real tokens.
Q: How was the dataset curated?
A: The dataset was curated using NVIDIA NeMo Curator, which extracted and cleaned data from Common Crawl, filtered it for the English language, performed deduplication, and applied various filters to remove lower-quality data.
Q: What is the purpose of the synthetic data generation pipelines?
A: The synthetic data generation pipelines were used to generate ~1.9 trillion tokens of synthetic data to further increase the size and diversity of the dataset.

