Open-source large language models (LLMs) excel in English but struggle with other languages, especially the languages of Southeast Asia. This is primarily due to a scarcity of training data in these languages, limited understanding of local cultures, and tokenizers that lack coverage of their unique linguistic structures and expressions.
To fully meet customer needs, enterprises in non-English-speaking countries must go beyond generic models and customize them to capture the nuances of their local languages, ensuring a seamless and impactful customer experience.
In this blog post, we explore how Viettel Solutions, a fast-growing subsidiary of Viettel Corporation, leveraged NVIDIA NeMo Curator to process high-quality Vietnamese data for training Llama 3 ViettelSolution 8B, a state-of-the-art LLM that now ranks near the top of the VMLU leaderboard. NeMo Curator is a GPU-accelerated data-curation tool for preparing large-scale, high-quality datasets for pretraining LLMs.
Prerequisites and Environment Setup
To follow along with the steps presented in this post, make sure you have the following set up:
- Installation
- Install NeMo Curator by following the instructions to install the CPU and CUDA-accelerated modules in the README file of the NeMo Curator repository.
- Install the datasets and jsonlines packages, which are needed later:

```bash
pip install datasets
pip install jsonlines
```
- Data processing with NeMo Curator requires a Dask environment. Dask is a flexible, open-source library for parallel and distributed computing in Python that lets you scale computations across multiple cores or even clusters. By distributing tasks, Dask makes data handling significantly faster and more efficient.
- We ran this experiment on an NVIDIA DGX A100 with a 128-core CPU and 2 TB of RAM to handle the dataset size. Depending on your dataset and computing resources, you may need to adjust the Dask worker configuration. You can start a Dask cluster with the following code:
```python
import nemo_curator
from dask.distributed import Client, LocalCluster

# Start a Dask cluster with 12 workers, each limited to 64 GB of memory.
# Adjust these numbers to match your computing resources.
cluster = LocalCluster(n_workers=12, processes=True, memory_limit="64GB")
client = Client(cluster)
```
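Setting processes=True runs each worker in its own process rather than as a thread, which sidesteps Python's global interpreter lock and lets CPU-bound curation steps run in parallel; the per-worker memory limit keeps a single large partition from exhausting the machine's RAM.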
Data Processing Pipeline Overview
The data curation pipeline includes the following key steps, sketched in code after the list:
- Download and Sharding: The datasets are downloaded from various sources, then combined and sharded for efficient distributed processing.
- Unicode Reformatting: Texts are standardized into a consistent Unicode format.
- Exact Deduplication: Removes exact duplicates to reduce redundancy.
- Quality Filtering:
  - Heuristic Filtering: Applies rule-based filters to remove low-quality content.
  - Classifier-based Filtering: Uses a machine learning classifier to score and filter documents based on quality.
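To make these steps concrete, here is a minimal sketch of how such a pipeline can be assembled from NeMo Curator's documented building blocks (DocumentDataset, AddId, ExactDuplicates, Modify, ScoreFilter, Sequential). The shards/ input directory, the word-count thresholds, and the use of WordCountFilter as a stand-in for the full set of heuristic and classifier-based filters are illustrative assumptions, not the exact configuration Viettel Solutions used; the Data Collection section below shows one way to produce the input shards.

```python
from nemo_curator import AddId, ExactDuplicates, Modify, ScoreFilter, Sequential
from nemo_curator.datasets import DocumentDataset
from nemo_curator.filters import WordCountFilter
from nemo_curator.modifiers import UnicodeReformatter
from nemo_curator.utils.file_utils import get_all_files_paths_under

# Load the sharded JSONL files; each record is expected to carry a "text" field.
files = get_all_files_paths_under("shards/")  # hypothetical input directory
dataset = DocumentDataset.read_json(files, add_filename=True)

# Give every document a unique ID so duplicates can be tracked and removed.
dataset = AddId(id_field="id")(dataset)

# Exact deduplication: hash each document and keep one copy per hash.
duplicates = ExactDuplicates(id_field="id", text_field="text", hash_method="md5")(dataset)
docs_to_remove = duplicates.df.map_partitions(
    lambda part: part[part._hashes.duplicated(keep="first")]
)
dataset.df = dataset.df[~dataset.df["id"].isin(docs_to_remove["id"].compute())]

# Unicode reformatting followed by a simple heuristic filter
# (illustrative thresholds; a real pipeline chains many more filters).
pipeline = Sequential([
    Modify(UnicodeReformatter()),
    ScoreFilter(WordCountFilter(min_words=50, max_words=100_000)),
])
curated = pipeline(dataset)
curated.to_json("curated/", write_to_filename=True)
```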
Data Collection
We sourced content from multiple datasets to enrich the diversity and volume of our training data (a download-and-sharding sketch follows the list). These datasets include:
- The Vietnamese subset of the C4 dataset, a large and diverse collection of web-crawled text data.
- The Vietnamese subset of the OSCAR dataset, version 23.01, an aggregation of web-crawled data.
- Wikipedia’s Vietnamese articles, providing well-structured and informative content.
- A Vietnamese news corpus, offering locally relevant news articles.
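As one hedged illustration of the download and sharding step, the sketch below streams the Vietnamese C4 subset from the Hugging Face Hub with the datasets package installed earlier and writes it out as fixed-size JSONL shards with jsonlines. The allenai/c4 repository name, its vi config, and the shard size are assumptions for illustration; the same pattern applies to the OSCAR, Wikipedia, and news corpora.

```python
import os

import jsonlines
from datasets import load_dataset

# Stream the corpus so the full dataset never has to fit in memory.
# The repository and config names are assumptions; substitute the ones you use.
stream = load_dataset("allenai/c4", "vi", split="train", streaming=True)

SHARD_SIZE = 100_000  # documents per shard (illustrative)
os.makedirs("shards", exist_ok=True)

writer, shard_id = None, 0
for i, record in enumerate(stream):
    # Roll over to a new shard file every SHARD_SIZE documents.
    if i % SHARD_SIZE == 0:
        if writer is not None:
            writer.close()
        writer = jsonlines.open(f"shards/c4_vi_{shard_id:05d}.jsonl", mode="w")
        shard_id += 1
    writer.write({"text": record["text"]})

if writer is not None:
    writer.close()
```

Streaming avoids downloading the entire corpus up front, and fixed-size shards give Dask evenly sized partitions to distribute across workers in the curation pipeline.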
Conclusion
This blog post showcased the data curation pipeline that Viettel Solutions used for Vietnamese text data, along with an analysis of how each stage of the curation process impacts the dataset. The pipeline is built on NVIDIA NeMo Curator, a valuable tool for preparing large datasets for pretraining language models with a focus on quality, efficiency, and scalability. It offers significant advantages in the data curation process, including:
- Improving dataset quality by removing noise and harmful content using heuristic and classifier-based filters.
- Preserving the essential structure of the dataset, ensuring that the core characteristics remain intact post-curation.
- Adapting to different datasets, providing a tailored approach that meets the specific needs of each corpus.
FAQs
Q: What is the main challenge in training large language models for non-English languages?
A: The main challenge is the lack of training data, limited understanding of local cultures, and insufficient tokens to capture unique linguistic structures and expressions.
Q: Why is data curation important for language models?
A: Data curation is important because it enables the creation of high-quality datasets that can be used to train accurate and effective language models.
Q: What is NVIDIA NeMo Curator, and how does it help with data curation?
A: NVIDIA NeMo Curator is a GPU-accelerated data-curation tool for preparing large-scale, high-quality datasets for pretraining LLMs. It provides a range of features, including exact and fuzzy deduplication, heuristic and classifier-based filtering, and more.
Q: How does Viettel Solutions use NVIDIA NeMo Curator for Vietnamese text data?
A: Viettel Solutions uses NVIDIA NeMo Curator to process high-quality Vietnamese data for training Llama 3 ViettelSolution 8B, a state-of-the-art LLM that now ranks near the top of the VMLU leaderboard.

