Overview of NVIDIA NeMo Curator
NVIDIA NeMo Curator is a powerful tool designed to improve generative AI model accuracy by processing text, image, and video data at scale for training and customization. It provides prebuilt pipelines for generating synthetic data to customize and evaluate generative AI systems.
Accelerated Large-Scale Inference with NeMo Curator
NeMo Curator provides an out-of-the-box way to scale classifier inference pipelines to a multinode, multi-GPU setup, and it accelerates inference through the CrossFit library from RAPIDS. Throughput improves through intelligent batching and the use of cuDF for efficient I/O, so the pipelines gain both scalability and per-GPU performance.
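The text does not spell out what "intelligent batching" involves. One common technique, shown here only as an illustration, is to group inputs of similar length so that each batch wastes less space on padding. The sketch below is plain Python and does not use the CrossFit or cuDF APIs. The function names are hypothetical, whitespace splitting stands in for a real tokenizer, and it is an assumption that CrossFit batches in a comparable way.

```python
from typing import List


def token_count(text: str) -> int:
    # Whitespace split is a stand-in for a real tokenizer (assumption).
    return len(text.split())


def length_sorted_batches(texts: List[str], batch_size: int) -> List[List[int]]:
    """Group document indices into batches of similar length."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # Sort indices by token count so neighbors in a batch have similar lengths.
    order = sorted(range(len(texts)), key=lambda i: token_count(texts[i]))
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]


def padded_slots(texts: List[str], batches: List[List[int]]) -> int:
    """Total token slots used when each batch is padded to its longest member."""
    total = 0
    for batch in batches:
        longest = max(token_count(texts[i]) for i in batch)
        total += longest * len(batch)
    return total


if __name__ == "__main__":
    docs = ["a", "a b c d", "a b", "a b c"]
    naive = [[0, 1], [2, 3]]  # batches in arrival order
    smart = length_sorted_batches(docs, batch_size=2)
    print(padded_slots(docs, naive), padded_slots(docs, smart))
```

Returning indices rather than texts lets a caller restore the original document order after inference, which matters when predictions must be written back alongside their source rows.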
Classifier Models
The NVIDIA NeMo Curator team has released four new classifier models:
- Prompt Task and Complexity Classifier: A multiheaded model that classifies English text prompts across 11 task types and six complexity dimensions.
- Instruction Data Guard: A deep learning classification model that helps identify LLM poisoning attacks in datasets and generates a score to predict whether the input data is benign or poisonous.
- Multilingual Domain Classifier: A multilingual text classification model that categorizes content in 52 languages across 26 domains.
- Content Type Classifier DeBERTa: A text classification model that categorizes documents into 11 distinct content types.
Example Input and Output (Instruction Data Guard)
Instruction
What is the average lifespan of a Golden Retriever?
Context
Golden Retrievers are a generally healthy breed; they have an average lifespan of 12 to 13 years. Irresponsible breeding to meet high demand has led to the prevalence of inherited health problems in some breed lines, including allergic skin conditions, eye problems and sometimes snappiness. These problems are rarely encountered in dogs bred from responsible breeders.
Response
The average lifespan of a Golden Retriever is 12 to 13 years.
score=0.000792806502431631
prediction = (score>0.5) = 0
Action:
The model score threshold is 0.5: the prediction is 1 when the score exceeds 0.5 and 0 otherwise.
A prediction of 0 means the input was classified as benign.
A prediction of 1 means the input is suspected to be poisoned and needs to be reviewed.
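The thresholding rule above can be sketched in a few lines of Python. This only illustrates the post-processing step applied to a score. It does not call the model, and the function and constant names are hypothetical.

```python
THRESHOLD = 0.5  # decision threshold stated in the example above


def to_prediction(score: float, threshold: float = THRESHOLD) -> int:
    """Return 1 (suspected poisoned) if the score exceeds the threshold, else 0 (benign)."""
    return int(score > threshold)


def needs_review(score: float, threshold: float = THRESHOLD) -> bool:
    """A prediction of 1 flags the input for manual review."""
    return to_prediction(score, threshold) == 1


if __name__ == "__main__":
    example_score = 0.000792806502431631  # score from the Golden Retriever example
    print(to_prediction(example_score), needs_review(example_score))
```

Because the comparison is strict, a score of exactly 0.5 maps to 0, which matches `prediction = (score>0.5)` in the example.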
Multilingual Domain Classifier
Multilingual Domain Classifier helps developers automatically categorize text content across 52 common languages, including English, Chinese, Arabic, Spanish, and Hindi. The model classifies text into 26 domains, ranging from Arts and Entertainment to Business, Science, and Technology.
Content Type Classifier DeBERTa
Content Type Classifier DeBERTa is an advanced text analysis model that enables automatic categorization of documents into 11 distinct content types, ranging from news articles and blog posts to product websites and analytical pieces.
Get Started
These four new classifier models are now available on Hugging Face. Additionally, the example notebooks are hosted in the NVIDIA/NeMo-Curator GitHub repo, providing step-by-step guidance for using these classifier models. Don’t forget to bookmark the repository to stay updated on future releases and improvements.
Conclusion
NeMo Curator provides a suite of powerful classifier models that can be used to enhance data quality, add metadata, and streamline data preparation. By leveraging these models, developers can build more accurate and robust AI systems that can handle large-scale data processing and analysis.
FAQs
Q: What are the benefits of using NeMo Curator?
A: NeMo Curator provides a suite of powerful classifier models that can be used to enhance data quality, add metadata, and streamline data preparation.
Q: How do I get started with NeMo Curator?
A: You can start by accessing the four new classifier models on Hugging Face and following the example notebooks in the NVIDIA/NeMo-Curator GitHub repo.
Q: What are the limitations of NeMo Curator?
A: NeMo Curator is designed for large-scale data processing and analysis, so it may not be suitable for small-scale or low-resource applications.

