Rapid Development of Solutions using Retrieval Augmented Generation (RAG) for Question-and-Answer LLM Workflows
The rapid development of solutions using Retrieval Augmented Generation (RAG) for question-and-answer LLM workflows has led to new types of system architectures. Our work at NVIDIA using AI for internal operations has yielded several important findings about aligning system capabilities with user expectations.
User Expectations
We found that regardless of the intended scope or use case, users generally want to be able to execute non-RAG tasks such as translating documents, editing emails, or even writing code. A vanilla RAG application might be implemented so that it executes a retrieval pipeline on every message, leading to excess token usage and unwanted latency as irrelevant results are included.
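One lightweight mitigation is to classify each message before deciding whether to run retrieval at all. The sketch below illustrates the idea with a simple keyword heuristic; the `route_query` helper and its keyword list are hypothetical stand-ins (a production router would typically use a small LLM, as this application does):

```python
# Minimal routing sketch: decide whether a message needs retrieval
# before invoking the RAG pipeline. The keyword heuristic is a
# hypothetical stand-in for an LLM-based router.

RAG_HINTS = ("policy", "document", "internal", "report")

def route_query(message: str) -> str:
    """Return 'rag' if the message likely needs document retrieval,
    otherwise 'direct' so the LLM answers without a retrieval step."""
    text = message.lower()
    if any(hint in text for hint in RAG_HINTS):
        return "rag"
    return "direct"

def answer(message: str) -> str:
    route = route_query(message)
    if route == "rag":
        # Run the retrieval pipeline, then generate with context.
        return f"[rag] answering with retrieved context: {message}"
    # Skip retrieval entirely: translation, email editing, coding, etc.
    return f"[direct] answering without retrieval: {message}"
```

Routing first means a translation or coding request never pays the latency cost of a vector lookup, and retrieval results never pollute the prompt for tasks that do not need them.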
User Preferences
We also found that users appreciate having access to a web search and summarization capability, even if the application is designed for accessing internal private data. As an example, we used Perplexity’s search API to meet this need.
Basic Architecture
In this post, we share a basic architecture for addressing these issues, using routing and multi-source RAG to produce a chat application that is capable of answering a broad range of questions. This is a slimmed-down version of an application, and there are many ways to build a RAG-based application, but this can help get you started. For more information, see the NVIDIA/GenerativeAIExamples GitHub repo.
System Architecture
Figure 1. System architecture for the chat application
NIM Inference Microservices for LLM Deployment
Our project was built around NVIDIA NIM microservices for several models, including the following:
- llama-3.1-70b-instruct
- llama-3.1-8b-instruct
- llama-3.1-405b-instruct
LlamaIndex Workflow Events
We used LlamaIndex’s ChatEngine class, which provided a turnkey solution for deploying a conversational AI assistant backed by a vector database. While this worked well, we found that we wanted to inject additional steps to augment context and toggle features in a way that required more extensibility.
LlamaIndex Workflow
Figure 2. LlamaIndex Workflow event used to answer user questions
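The event-driven pattern in Figure 2 can be sketched as follows. This is a simplified, pure-Python stand-in for LlamaIndex's Workflow class (the real API uses `@step`-decorated async methods and typed events); the class and event names here are illustrative:

```python
# Simplified stand-in for a LlamaIndex Workflow: each step consumes
# one event type and emits the next, until a StopEvent carries the answer.

from dataclasses import dataclass

@dataclass
class StartEvent:        # user message arrives
    query: str

@dataclass
class RetrieveEvent:     # router decided retrieval is needed
    query: str

@dataclass
class StopEvent:         # final response for the UI
    result: str

class QAWorkflow:
    def route(self, ev: StartEvent):
        # A real implementation would call an LLM router here.
        if "document" in ev.query.lower():
            return RetrieveEvent(query=ev.query)
        return StopEvent(result=f"direct answer to: {ev.query}")

    def retrieve_and_answer(self, ev: RetrieveEvent):
        # A real implementation would query Milvus and synthesize a response.
        return StopEvent(result=f"context-grounded answer to: {ev.query}")

    def run(self, query: str) -> str:
        ev = self.route(StartEvent(query=query))
        if isinstance(ev, RetrieveEvent):
            ev = self.retrieve_and_answer(ev)
        return ev.result
```

Modeling each stage as an event makes it straightforward to inject extra steps (context augmentation, feature toggles) between routing and response synthesis, which is exactly the extensibility the ChatEngine class lacked.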
User Interface via Chainlit
Chainlit includes several features that helped speed up our development and deployment. It supports progress indicators and step summaries using the chainlit.Step decorator, and LlamaIndexCallbackHandler enables automatic tracing. We used a Step decorator for each LlamaIndex Workflow event to expose the application’s inner workings without overwhelming the user.
Setting up the Project Environment
To deploy this project, clone the NVIDIA/GenerativeAIExamples repository, then run the following commands to create and activate a virtual Python environment before installing dependencies:
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
Configuration
After installing the dependencies, make sure that you have a .env file located in the top-level directory of the project with values for the following:
- NVIDIA_API_KEY: Required. You can get an API key for NVIDIA’s services from build.nvidia.com.
- PERPLEXITY_API_KEY: Optional. If it is not provided, the application runs without using Perplexity’s search API. To obtain an API key for Perplexity, follow the instructions.
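A minimal .env file might look like the following (the key values shown are placeholders, not real credentials):

```
NVIDIA_API_KEY=nvapi-xxxxxxxxxxxxxxxx
# Optional: omit this line to run without web search
PERPLEXITY_API_KEY=pplx-xxxxxxxxxxxxxxxx
```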
Project Structure
We organized the project code into separate files:
- LlamaIndex Workflow (workflow.py): Routes queries and aggregates responses from multiple sources.
- Document Ingestion (ingest.py): Loads documents into a Milvus Lite database, which is a simple way to start with Milvus without containers. Milvus Lite’s main limitation is inefficient vector lookup, so consider switching to a dedicated cluster when document collections grow.
- Chainlit Application (chainlit_app.py): Contains functions triggered by events, with the main function (on_message) activating on user messages.
- Configuration (config.py): To experiment with different model types, edit the default values. Here, you can select the models used for routing and chat completion, the number of past messages included from chat history in each completion, and the model Perplexity uses for web search and summarization.
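As a sketch of what such a configuration module might contain, the variable names and default values below are illustrative assumptions, not necessarily those used in the repository:

```python
# Illustrative config.py sketch: a central place for model choices and
# chat-history settings. All names and defaults here are assumptions.

ROUTER_MODEL = "meta/llama-3.1-8b-instruct"    # small, fast model for routing
CHAT_MODEL = "meta/llama-3.1-70b-instruct"     # main chat-completion model
PERPLEXITY_MODEL = "sonar"                     # web search and summarization
CHAT_HISTORY_LENGTH = 10   # past messages included in each completion

def as_dict() -> dict:
    """Expose settings as a dict, e.g. for logging at startup."""
    return {
        "router_model": ROUTER_MODEL,
        "chat_model": CHAT_MODEL,
        "perplexity_model": PERPLEXITY_MODEL,
        "chat_history_length": CHAT_HISTORY_LENGTH,
    }
```

Keeping these values in one module means swapping the 70B model for the 405B model, or shortening the history window, is a one-line change rather than a hunt through the workflow code.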
Building the Core Functionality
This application integrates LlamaIndex and NIM microservices via Chainlit. To show how to implement this logic, we’ll work through the following steps:
- Creating the User Interface
- Implementing the Workflow Event
- Integrating NIM Microservices
Conclusion
We hope this post has been a useful resource for you as you learn more about generative AI and the ways that NIM microservices and LlamaIndex Workflow events can be used together for the fast development of advanced chat functionality.
Frequently Asked Questions
Q: What is Retrieval Augmented Generation (RAG)?
A: RAG is a technique that combines document retrieval with LLM generation: relevant passages are fetched from a knowledge source and included in the prompt, grounding the model’s answer in retrieved context.
Q: What is LlamaIndex?
A: LlamaIndex is a data framework for connecting LLMs to external data, providing tools for document ingestion, indexing, retrieval, and query orchestration. It works with vector databases such as Milvus rather than being one itself.
Q: What is Chainlit?
A: Chainlit is a framework for building chat applications that provides a simple and intuitive way to create conversational AI assistants.
Q: What are some potential features to add to this project?
A: Some potential features to add include multimodal ingestion, user chat history with Chainlit’s Postgres connector, RAG reranking with the NVIDIA Mistral-based reranker, and error handling and timeout management to enhance reliability.

