Customizing and Evaluating Embedding Models for Retrieval-Augmented Generation (RAG) Pipelines
As large language models (LLMs) gain popularity in various question-answering systems, retrieval-augmented generation (RAG) pipelines have become a focal point. RAG pipelines combine the generation power of LLMs with external data sources and retrieval mechanisms, enabling models to access domain-specific information that may not have existed during fine-tuning.
Challenges in Customizing and Evaluating Embedding Models
Embedding models play a critical role in RAG systems by converting both the document corpus and user queries into dense numerical vectors. However, pretrained embedding models often fail to capture the nuances of domain-specific data, leading to unreliable search results, missed connections, and poor RAG performance.
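To make the retrieval step concrete, here is a minimal sketch of dense retrieval by cosine similarity. The `embed` function is a hypothetical stand-in (it returns random unit vectors); in a real pipeline you would call an actual embedding model.

```python
import numpy as np

def embed(texts):
    # Hypothetical stand-in for a real embedding model: it just returns
    # deterministic random unit vectors so the example is self-contained.
    rng = np.random.default_rng(0)
    vecs = rng.normal(size=(len(texts), 8))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def retrieve(query_vec, doc_vecs, k=2):
    # For unit-norm vectors, the dot product is the cosine similarity;
    # return the indices of the k most similar documents.
    sims = doc_vecs @ query_vec
    return np.argsort(sims)[::-1][:k]

docs = ["GPU memory tuning", "Quarterly revenue report", "CUDA kernel basics"]
doc_vecs = embed(docs)
query_vec = embed(["How do I tune GPU memory?"])[0]
top_k = retrieve(query_vec, doc_vecs, k=2)
```

If the embedding model does not place domain-specific queries near the right documents in this vector space, no amount of downstream prompting can recover the missing context, which is why customization matters.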
Creating Evaluation and Customization Data for Embedding Models is Challenging
Publicly available datasets often lack relevance when applied to enterprise-specific data. Creating human-annotated enterprise-specific datasets is both expensive and time-consuming, requiring domain experts to label large volumes of data.
Generating High-Quality Synthetic Data with NVIDIA NeMo Curator
NVIDIA NeMo Curator improves generative AI model accuracy by processing text, image, and video data at scale for training and customization. It also provides prebuilt pipelines for generating synthetic data to customize and evaluate embedding models.
The synthetic data generation (SDG) pipeline for generating RAG evaluation data consists of three key components:
- QA Pair-Generating LLM: This component uses an NVIDIA NIM LLM to generate QA pairs from seed documents, with optimized system prompts that guide the LLM to create more context-aware and relevant questions.
- Embedding Model-as-a-Judge for Question Easiness: This component evaluates and ranks the complexity of each question using an embedding model, filtering out generated questions based on their cosine similarity with context documents.
- Answerability Filter for Grounding: This component ensures that each generated question is directly grounded in the seed document, preventing irrelevant or hallucinated questions from being included in the dataset.
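The three stages above can be sketched as follows. All function bodies here are illustrative stand-ins: the real pipeline uses an NVIDIA NIM LLM for QA generation and an LLM judge for answerability, and the `threshold` value is an assumption.

```python
import numpy as np

def generate_qa_pairs(doc):
    # Hypothetical stand-in for the QA-pair-generating LLM; the real
    # pipeline calls an NVIDIA NIM LLM with tuned system prompts.
    topic = doc.split(".")[0]
    return [{"question": f"What does the source say about {topic.lower()}?",
             "answer": topic}]

def easiness_filter(q_vec, ctx_vec, threshold=0.75):
    # Drop questions whose embedding is too similar to their context:
    # such questions are trivially easy for the retriever. Vectors are
    # assumed L2-normalized, so the dot product is cosine similarity.
    return float(q_vec @ ctx_vec) < threshold

def answerability_filter(question, doc):
    # Crude keyword-overlap grounding check; the real pipeline uses an
    # LLM judge to verify the question is answerable from the document.
    words = {w.strip("?,.").lower() for w in question.split()}
    return len(words & set(doc.lower().split())) > 0

doc = "GPU memory can be tuned with unified memory settings."
kept = [qa for qa in generate_qa_pairs(doc)
        if answerability_filter(qa["question"], doc)]
```

Chaining the filters this way means only questions that are both non-trivial and grounded in the seed document survive into the evaluation set.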
Understanding Hard-Negative Mining
Hard negatives play a crucial role in enhancing the performance of contrastive learning for embedding models. By incorporating hard negatives, models are forced to learn more discriminative features, improving their ability to differentiate between similar yet distinct data points.
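The effect of hard negatives is visible in a standard InfoNCE-style contrastive loss, sketched below under the assumption of L2-normalized vectors (the `temperature` value is illustrative). A negative that sits close to the query inflates the denominator, so the model is penalized more than it would be by a random negative.

```python
import numpy as np

def info_nce_loss(q, pos, negs, temperature=0.05):
    # InfoNCE-style loss for one (query, positive, negatives) group;
    # all vectors are assumed L2-normalized. The positive is index 0.
    logits = np.array([q @ pos] + [q @ n for n in negs]) / temperature
    logits -= logits.max()  # subtract max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])
```

Training on hard negatives therefore produces larger gradients exactly where the embedding space is most ambiguous, which is what sharpens the model's ability to separate similar but distinct texts.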
Hard-Negative Mining Methods
There are three methods for generating hard negatives:
- Top-K Selection: The system identifies the top K negative documents that have the highest cosine similarity to the question.
- Threshold-Based Selection: An alternative approach is to set minimum and maximum thresholds for cosine similarity between negatives and the question and select the top K negative documents that lie within these thresholds.
- Positive-Aware Mining: This method uses the positive relevance score as an anchor to more effectively remove false negatives.
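The three strategies can be sketched in one function. This is an illustrative implementation, not NeMo Curator's API: the parameter names and default thresholds (`min_sim`, `max_sim`, `pos_margin`) are assumptions, and vectors are assumed L2-normalized.

```python
import numpy as np

def mine_hard_negatives(q_vec, doc_vecs, pos_idx, k=4, method="topk",
                        min_sim=0.3, max_sim=0.8, pos_margin=0.95):
    # Dot products of unit vectors = cosine similarities to the question.
    sims = doc_vecs @ q_vec
    cand = [i for i in range(len(sims)) if i != pos_idx]
    if method == "threshold":
        # Threshold-based: keep only candidates inside [min_sim, max_sim].
        cand = [i for i in cand if min_sim <= sims[i] <= max_sim]
    elif method == "positive_aware":
        # Positive-aware: drop candidates scoring above a fraction of the
        # positive's own score -- they are likely false negatives.
        ceiling = pos_margin * sims[pos_idx]
        cand = [i for i in cand if sims[i] < ceiling]
    elif method != "topk":
        raise ValueError(f"unknown method: {method}")
    # All three methods finish by taking the top-k most similar survivors.
    cand.sort(key=lambda i: sims[i], reverse=True)
    return cand[:k]
```

Note how positive-aware mining differs from a fixed ceiling: because its cutoff scales with the positive's own similarity, it adapts per query, discarding near-duplicates of the positive that a static threshold might keep.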
Summary
In this post, we discussed how the SDG pipelines from NeMo Curator simplify generating high-quality datasets, enabling the precise evaluation and customization of text embedding models. With these enhanced datasets, you can effectively evaluate and fine-tune RAG performance, gaining insights into how well your retriever systems perform and identifying ways to improve accuracy and relevance.
Conclusion
Customizing and evaluating embedding models for RAG pipelines is a critical step in achieving accurate and relevant results. By using NVIDIA NeMo Curator’s SDG pipelines, you can generate high-quality datasets and optimize your RAG applications at scale with significantly lower costs.
Frequently Asked Questions
- Q: What is a retrieval-augmented generation (RAG) pipeline?
  A: A RAG pipeline combines the generation power of large language models (LLMs) with external data sources and retrieval mechanisms, enabling models to access domain-specific information that may not have existed during fine-tuning.
- Q: What is the role of embedding models in RAG pipelines?
  A: Embedding models convert both the document corpus and user queries into dense numerical vectors, enabling efficient retrieval of relevant documents.
- Q: Why do pretrained embedding models often fall short on domain-specific data?
  A: They frequently miss the nuances of specialized terminology and enterprise content, leading to unreliable search results, missed connections, and poor RAG performance.
- Q: How can I generate high-quality synthetic data for customizing and evaluating embedding models?
  A: You can use NVIDIA NeMo Curator's SDG pipeline, which consists of three key components: a QA pair-generating LLM, an embedding model-as-a-judge for question easiness, and an answerability filter for grounding.
- Q: What is hard-negative mining?
  A: Hard-negative mining selects training examples that are similar to, yet distinct from, the positive document, forcing the embedding model to learn more discriminative features during contrastive learning.
- Q: How can I generate hard negatives?
  A: You can use one of three methods: top-K selection, threshold-based selection, or positive-aware mining.

