Train Highly Accurate LLMs with Zyda-2
Open-source datasets have significantly democratized access to high-quality training data, lowering the barrier to entry for developers and researchers who want to train cutting-edge generative AI models. By providing free access to diverse, well-curated corpora, they enable the open-source community to train models at or near the frontier, accelerating the advancement of AI.
Building Blocks of Zyda-2
Zyda-2 combines existing sources of open, high-quality tokens, such as DCLM, FineWeb-edu, Dolma, and Zyda-1, and applies robust filtering and cross-deduplication to improve on the performance of each dataset used alone. The result combines the best elements of these sources: many high-quality educational samples for logical reasoning and factual knowledge, while its Zyda-1 component contributes more diversity and variety and excels at linguistic and writing tasks.
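To illustrate the idea behind cross-deduplication, here is a minimal, library-agnostic sketch that removes exact duplicates across multiple source datasets by hashing normalized document text. This is a simplification for intuition only: the actual Zyda-2 pipeline uses NeMo Curator's GPU-accelerated deduplication at terabyte scale, and the function and dataset names below are illustrative.

```python
import hashlib

def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivial formatting
    # differences do not hide duplicates.
    return " ".join(text.lower().split())

def cross_deduplicate(datasets: dict[str, list[str]]) -> dict[str, list[str]]:
    """Keep only the first occurrence of each document across all datasets.

    `datasets` maps a source name to its list of documents; sources
    listed first win ties, mimicking a priority order between sources.
    """
    seen: set[str] = set()
    deduped: dict[str, list[str]] = {}
    for name, docs in datasets.items():
        kept = []
        for doc in docs:
            digest = hashlib.sha256(normalize(doc).encode()).hexdigest()
            if digest not in seen:
                seen.add(digest)
                kept.append(doc)
        deduped[name] = kept
    return deduped

# Toy example: the second source repeats a document from the first.
sources = {
    "dclm": ["The sky is blue.", "Water boils at 100 C."],
    "fineweb-edu": ["The sky is  BLUE.", "Photosynthesis converts light to energy."],
}
deduped = cross_deduplicate(sources)
```

In this sketch the near-identical "sky is blue" document survives only in the first source, which is the essence of cross-deduplication: overlap between source datasets is removed so the combined corpus does not over-weight shared web documents.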
NeMo Curator’s Role in Creating the Dataset
NeMo Curator is a GPU-accelerated data curation library that improves generative AI model performance by processing large-scale, high-quality datasets for pretraining and customization. Yury Tokpanov, dataset lead at Zyphra, said, "NeMo Curator played a crucial role in bringing the dataset to market faster. By using GPUs to accelerate the data processing pipelines, our team reduced the total cost of ownership (TCO) by 2x and processed the data 10x faster (from 3 weeks to 2 days)."
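Curation pipelines like this typically chain many heuristic quality filters before deduplication. The toy sketch below shows the flavor of such a filter, keeping documents that are long enough and mostly natural-language text; the function names and thresholds are illustrative assumptions, not NeMo Curator's actual API.

```python
def alpha_ratio(text: str) -> float:
    # Fraction of characters that are alphabetic; markup- and
    # symbol-heavy pages tend to score low.
    return sum(c.isalpha() for c in text) / max(len(text), 1)

def passes_quality_filters(text: str,
                           min_words: int = 5,
                           min_alpha_ratio: float = 0.6) -> bool:
    """Toy heuristic filter: keep documents that are long enough and
    mostly natural-language text. Thresholds are illustrative only."""
    words = text.split()
    return len(words) >= min_words and alpha_ratio(text) >= min_alpha_ratio

docs = [
    "Photosynthesis converts light energy into chemical energy in plants.",
    "<div>404</div> {} [] ###",
]
kept = [d for d in docs if passes_quality_filters(d)]
```

In a production pipeline, filters like these run over billions of documents, which is where GPU acceleration pays off: the same per-document logic, parallelized, is what turned Zyphra's 3-week CPU job into a 2-day one.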
Train Highly Accurate LLMs with Zyda-2
Zyda-2 is ideal for general high-quality language model pretraining focused on language proficiency; code and math performance require additional specialized datasets. This is because Zyda-2 retains the strengths of the top existing datasets while improving on their weaknesses.
Get Started
Download the Zyda-2 dataset directly from Hugging Face and train higher-accuracy models. It is released under an ODC-By license, which enables you to train on or build off of Zyda-2 subject to the license agreements and terms of use of the original data sources.
Conclusion
Zyda-2 is a significant step forward for open pretraining data: a high-quality, diverse, and well-curated dataset for training highly accurate language models. With the help of NeMo Curator, Zyphra has created a dataset well suited to developing advanced language models.
Frequently Asked Questions
Q: What is Zyda-2?
A: Zyda-2 is a high-quality, pretraining dataset for language models.
Q: What is the size of Zyda-2?
A: Zyda-2 contains 5 trillion (5T) English tokens, making it 5x the size of Zyda-1.
Q: What are the strengths of Zyda-2?
A: Zyda-2 possesses the strengths of the top existing datasets while improving on their weaknesses, making it ideal for general high-quality language model pretraining.
Q: How do I get started with Zyda-2?
A: You can download the Zyda-2 dataset directly from Hugging Face and train higher-accuracy models.

