Rapid Development of Solutions using Retrieval Augmented Generation (RAG) for Question-and-Answer LLM Workflows
The rapid development of solutions using Retrieval Augmented Generation (RAG) for question-and-answer LLM workflows has led to new types of system architectures. Our work at NVIDIA using AI for internal operations has yielded several important findings about aligning system capabilities with user expectations.
User Expectations
We found that regardless of the intended scope or use case, users generally want to be able to execute non-RAG tasks such as translating documents, editing emails, or even writing code. A vanilla RAG application might be implemented so that it executes a retrieval pipeline on every message, leading to excess token usage and unwanted latency as irrelevant results are included.
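One lightweight mitigation is to classify each message before deciding whether to run retrieval at all. The sketch below illustrates the idea with a simple keyword heuristic; the `route_query` helper and its keyword list are hypothetical stand-ins (a production router would typically use a small LLM, as this application does):

```python
# Minimal routing sketch: decide whether a message needs retrieval
# before invoking the RAG pipeline. The keyword heuristic is a
# hypothetical stand-in for an LLM-based router.

RAG_HINTS = ("policy", "document", "internal", "report")

def route_query(message: str) -> str:
    """Return 'rag' if the message likely needs document retrieval,
    otherwise 'direct' so the LLM answers without a retrieval step."""
    text = message.lower()
    if any(hint in text for hint in RAG_HINTS):
        return "rag"
    return "direct"

def answer(message: str) -> str:
    route = route_query(message)
    if route == "rag":
        # Run the retrieval pipeline, then generate with context.
        return f"[rag] answering with retrieved context: {message}"
    # Skip retrieval entirely: translation, email editing, coding, etc.
    return f"[direct] answering without retrieval: {message}"
```

Routing first means a translation or coding request never pays the latency cost of a vector lookup, and retrieval results never pollute the prompt for tasks that do not need them.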
User Preferences
We also found that users appreciate having access to a web search and summarization capability, even if the application is designed for accessing internal private data. As an example, we used Perplexity’s search API to meet this need.
Basic Architecture
In this post, we share a basic architecture for addressing these issues, using routing and multi-source RAG to produce a chat application that is capable of answering a broad range of questions. This is a slimmed-down version of an application, and there are many ways to build a RAG-based application, but this can help get you started. For more information, see the NVIDIA/GenerativeAIExamples GitHub repo.
System Architecture
Figure 1. System architecture for the chat application
NIM Inference Microservices for LLM Deployment
Our project was built around NVIDIA NIM microservices for several models, including the following:
- llama-3.1-70b-instruct
- llama-3.1-8b-instruct
- llama-3.1-405b-instruct
LlamaIndex Workflow Events
We used LlamaIndex’s ChatEngine class, which provided a turnkey solution for deploying a conversational AI assistant backed by a vector database. While this worked well, we found that we wanted to inject additional steps to augment context and toggle features in a way that required more extensibility.
LlamaIndex Workflow
Figure 2. LlamaIndex Workflow event used to answer user questions
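The event-driven pattern in Figure 2 can be sketched as follows. This is a simplified, pure-Python stand-in for LlamaIndex's Workflow class (the real API uses `@step`-decorated async methods and typed events); the class and event names here are illustrative:

```python
# Simplified stand-in for a LlamaIndex Workflow: each step consumes
# one event type and emits the next, until a StopEvent carries the answer.

from dataclasses import dataclass

@dataclass
class StartEvent:        # user message arrives
    query: str

@dataclass
class RetrieveEvent:     # router decided retrieval is needed
    query: str

@dataclass
class StopEvent:         # final response for the UI
    result: str

class QAWorkflow:
    def route(self, ev: StartEvent):
        # A real implementation would call an LLM router here.
        if "document" in ev.query.lower():
            return RetrieveEvent(query=ev.query)
        return StopEvent(result=f"direct answer to: {ev.query}")

    def retrieve_and_answer(self, ev: RetrieveEvent):
        # A real implementation would query Milvus and synthesize a response.
        return StopEvent(result=f"context-grounded answer to: {ev.query}")

    def run(self, query: str) -> str:
        ev = self.route(StartEvent(query=query))
        if isinstance(ev, RetrieveEvent):
            ev = self.retrieve_and_answer(ev)
        return ev.result
```

Modeling each stage as an event makes it straightforward to inject extra steps (context augmentation, feature toggles) between routing and response synthesis, which is exactly the extensibility the ChatEngine class lacked.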
User Interface via Chainlit
Chainlit includes several features that helped speed up our development and deployment. It supports progress indicators and step summaries using the chainlit.Step decorator, and LlamaIndexCallbackHandler enables automatic tracing. We used a Step decorator for each LlamaIndex Workflow event to expose the application’s inner workings without overwhelming the user.
Setting up the Project Environment
To deploy this project, clone the NVIDIA/GenerativeAIExamples repository, then run the following commands to create and activate a virtual Python environment before installing dependencies:
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
Configuration
After installing the dependencies, make sure that you have a .env file located in the top-level directory of the project with values for the following:
- NVIDIA_API_KEY: Required. You can get an API key for NVIDIA’s services from build.nvidia.com.
- PERPLEXITY_API_KEY: Optional. If it is not provided, the application runs without using Perplexity’s search API. To obtain an API key for Perplexity, follow the instructions.
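A minimal .env file might look like the following (the key values shown are placeholders, not real credentials):

```
NVIDIA_API_KEY=nvapi-xxxxxxxxxxxxxxxx
# Optional: omit this line to run without web search
PERPLEXITY_API_KEY=pplx-xxxxxxxxxxxxxxxx
```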
Project Structure
We organized the project code into separate files:
- LlamaIndex Workflow (workflow.py): Routes queries and aggregates responses from multiple sources.
- Document Ingestion (ingest.py): Loads documents into a Milvus Lite database, which is a simple way to start with Milvus without containers. Milvus Lite’s main limitation is inefficient vector lookup, so consider switching to a dedicated cluster when document collections grow.
- Chainlit Application (chainlit_app.py): Contains functions triggered by events, with the main function (on_message) activating on user messages.
- Configuration (config.py): To experiment with different model types, edit the default values. Here, you can select the models used for routing and chat completion, the number of past messages included from chat history in each completion, and the model Perplexity uses for web search and summarization.
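As a sketch of what such a configuration module might contain, the variable names and default values below are illustrative assumptions, not necessarily those used in the repository:

```python
# Illustrative config.py sketch: a central place for model choices and
# chat-history settings. All names and defaults here are assumptions.

ROUTER_MODEL = "meta/llama-3.1-8b-instruct"    # small, fast model for routing
CHAT_MODEL = "meta/llama-3.1-70b-instruct"     # main chat-completion model
PERPLEXITY_MODEL = "sonar"                     # web search and summarization
CHAT_HISTORY_LENGTH = 10   # past messages included in each completion

def as_dict() -> dict:
    """Expose settings as a dict, e.g. for logging at startup."""
    return {
        "router_model": ROUTER_MODEL,
        "chat_model": CHAT_MODEL,
        "perplexity_model": PERPLEXITY_MODEL,
        "chat_history_length": CHAT_HISTORY_LENGTH,
    }
```

Keeping these values in one module means swapping the 70B model for the 405B model, or shortening the history window, is a one-line change rather than a hunt through the workflow code.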
Building the Core Functionality
This application integrates LlamaIndex and NIM microservices via Chainlit. To show how to implement this logic, we’ll work through the following steps:
- Creating the User Interface
- Implementing the Workflow Event
- Integrating NIM Microservices
Conclusion
We hope this post has been a useful resource for you as you learn more about generative AI and the ways that NIM microservices and LlamaIndex Workflow events can be used together for the fast development of advanced chat functionality.
Frequently Asked Questions
Q: What is Retrieval Augmented Generation (RAG)?
A: RAG is a technique that combines document retrieval with LLM generation: relevant passages are fetched from a knowledge source and included in the prompt, grounding the model’s answer in retrieved context.
Q: What is LlamaIndex?
A: LlamaIndex is a data framework for connecting LLMs to external data, providing tools for document ingestion, indexing, retrieval, and query orchestration. It works with vector databases such as Milvus rather than being one itself.
Q: What is Chainlit?
A: Chainlit is a framework for building chat applications that provides a simple and intuitive way to create conversational AI assistants.
Q: What are some potential features to add to this project?
A: Some potential features to add include multimodal ingestion, user chat history with Chainlit’s Postgres connector, RAG reranking with the NVIDIA Mistral-based reranker, and error handling and timeout management to enhance reliability.

