Building a Multimodal Retrieval-Augmented Generation (RAG) System
Building a multimodal RAG system is challenging. The difficulty comes from capturing and indexing information across multiple modalities, including text, images, tables, audio, video, and more. In our previous post, An Easy Introduction to Multimodal Retrieval-Augmented Generation, we discussed how to tackle text and images. This post extends that conversation to audio and video. Specifically, we explore how to build a multimodal RAG pipeline to search for information in videos.
Building RAG for Text, Images, and Videos
Starting from first principles, there are three approaches for building a RAG pipeline that works across multiple modalities, as detailed below and in Figure 1.
Using a Common Embedding Space
The first approach for building a RAG pipeline that works across multiple modalities is using a common embedding space. This approach relies on a single model to project representations of information stored across different modalities into the same embedding space. Using models like CLIP, which has encoders for both images and text, falls into this category. The main upside of this approach is reduced architectural complexity; depending on the diversity of the data used to train the model, it can also flexibly cover a broad range of use cases.
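To make the idea concrete, here is a minimal sketch of retrieval over a common embedding space. The encoders below are deterministic stand-ins, not real CLIP towers; in practice, both `encode_text` and `encode_image` would call the respective CLIP encoders so that images and text land in the same space. The image "index" keyed by caption is purely illustrative.

```python
import numpy as np

def encode_text(text: str, dim: int = 8) -> np.ndarray:
    """Placeholder for a CLIP-style text encoder: a hash-seeded random
    projection, normalized to unit length (deterministic within one run)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def encode_image(caption: str, dim: int = 8) -> np.ndarray:
    """Placeholder for a CLIP-style image encoder. We pretend each image is
    identified by a caption so both modalities share one embedding space."""
    return encode_text(caption, dim)

# Index image embeddings, then retrieve with a text query embedding --
# cross-modal search reduces to nearest-neighbor lookup in the shared space.
index = {name: encode_image(name) for name in ["a cat", "a dog", "a car"]}
query = encode_text("a cat")
best = max(index, key=lambda name: float(index[name] @ query))
```

Because both encoders map into one space, a single similarity search serves every modality, which is exactly the architectural simplification this approach buys.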
Building N Parallel Retrieval Pipelines (Brute Force)
A second method is to build a modality-native (or even submodality-native) retrieval pipeline for each data type and query all pipelines in parallel. This results in multiple sets of chunks spread across different modalities. In this case, two issues arise. First, the number of tokens that need to be ingested by a large language model (LLM) to generate an answer increases massively, raising the cost of running the RAG pipeline. Second, an LLM that can absorb information across multiple modalities is needed. This approach simply moves the problem from the retrieval phase to the generation phase and increases the cost, but in turn simplifies the ingest process and infrastructure.
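The brute-force fan-out can be sketched as follows. The per-modality retrievers here are stubs standing in for real text, image, and audio pipelines, and the scores are hypothetical; the point is only the shape of the merge step.

```python
# Hypothetical modality-native retrievers; each returns (chunk, score) pairs.
def retrieve_text(q):  return [("text chunk about " + q, 0.9)]
def retrieve_image(q): return [("image region matching " + q, 0.7)]
def retrieve_audio(q): return [("audio segment mentioning " + q, 0.6)]

def brute_force_retrieve(query, pipelines, top_k=3):
    """Query every modality-native pipeline and merge all results by score."""
    hits = []
    for retrieve in pipelines:
        hits.extend(retrieve(query))
    # The merged context spans modalities, so the token count (and LLM cost)
    # grows with the number of pipelines -- the trade-off described above.
    return sorted(hits, key=lambda h: h[1], reverse=True)[:top_k]

context = brute_force_retrieve(
    "engine noise", [retrieve_text, retrieve_image, retrieve_audio]
)
```

Note that the merged `context` still contains raw image and audio chunks, which is why this approach requires a generation model that can consume multiple modalities.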
Grounding in a Common Modality
Lastly, information can be ingested from all modalities and grounded in one common modality, such as text. What this means is that all the key information from images, PDFs, videos, audio, and so on, needs to be converted into text for setting up the pipeline. This approach incurs some ingestion cost, and can lead to lossy embeddings, but can be used to unify all modalities effectively for both retrieval and generation.
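A minimal sketch of grounding everything in text is shown below. The converters are stubs: in practice they would be an ASR model for audio, a VLM captioner for images and frames, and a PDF text extractor, but here they just return placeholder strings.

```python
import os

# Hypothetical converters that ground each modality in text.
def transcribe_audio(path): return f"transcript of {path}"
def caption_image(path):    return f"caption of {path}"
def extract_pdf_text(path): return f"text of {path}"

CONVERTERS = {".wav": transcribe_audio, ".png": caption_image, ".pdf": extract_pdf_text}

def ground_in_text(paths):
    """Route every file through a modality-specific converter so that all
    content ends up in one text corpus, ready for a standard text RAG stack."""
    corpus = []
    for p in paths:
        ext = os.path.splitext(p)[1]
        corpus.append(CONVERTERS[ext](p))
    return corpus

docs = ground_in_text(["talk.wav", "slide.png", "paper.pdf"])
```

The conversion step is where the ingestion cost and the potential lossiness mentioned above come from, but after it, a single text embedding model and vector store handle every modality.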
Figure 1. Three different approaches that can be adopted to build a multimodal retrieval pipeline
Complexities with Retrieving Videos
When it comes to retrieving videos, there are several complexities to consider. One of the main challenges is the sheer volume of data: a single minute of video can contain up to 3,600 frames (at 60 frames per second), making it difficult to process and retrieve relevant information.
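The arithmetic behind that figure is simple: the frame count scales linearly with the frame rate.

```python
def frames_per_minute(fps: float) -> int:
    """Number of frames in one minute of video at the given frame rate."""
    return int(fps * 60)

# At 60 fps, a single minute already holds 3,600 frames; even at a more
# typical 30 fps, that is still 1,800 candidate frames to sift through.
per_min_60 = frames_per_minute(60)
per_min_30 = frames_per_minute(30)
```

This is why naively embedding every frame is impractical, and why the pipeline below first reduces each scene to a few representative frames.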

Figure 5. Structural Similarity Index plotted across a scene
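One way to use SSIM for this, sketched below under simplifying assumptions: compare each frame against the last frame kept, and keep a new frame whenever the similarity drops below a threshold. Library implementations (e.g., `skimage.metrics.structural_similarity`) compute SSIM over a sliding window; this simplified version computes the statistics over the whole grayscale frame, with the standard constants for 8-bit pixel range.

```python
import numpy as np

def ssim(x: np.ndarray, y: np.ndarray, c1=6.5025, c2=58.5225) -> float:
    """Global (single-window) SSIM between two grayscale frames.
    c1 = (0.01 * 255)**2 and c2 = (0.03 * 255)**2, the usual constants."""
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx**2 + my**2 + c1) * (vx + vy + c2)
    )

def pick_representative(frames, threshold=0.9):
    """Keep a frame whenever it differs enough (SSIM below threshold)
    from the last frame we kept; returns the kept frame indices."""
    kept = [0]
    for i in range(1, len(frames)):
        if ssim(frames[kept[-1]], frames[i]) < threshold:
            kept.append(i)
    return kept
```

Identical frames score an SSIM of 1.0 and are skipped, so long static stretches of a scene collapse to a single representative frame.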
Blending Audio and Video Information
Once the representative frames are extracted, the next step is to extract all possible information from them. For this, we use a Llama-3-90B VLM NIM. We prompt the VLM to transcribe all the text and information visible on screen, as well as to generate a semantic description of each frame.
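A sketch of composing such a request is shown below. NIM microservices expose an OpenAI-compatible chat API, but the exact model name and message shape here are illustrative assumptions; check the actual NIM API reference before use.

```python
import base64

def build_vlm_request(frame_bytes: bytes,
                      model: str = "meta/llama-3-90b-vision") -> dict:
    """Compose an OpenAI-style chat payload asking a VLM to transcribe
    on-screen text and describe the frame. The model identifier and
    content schema are assumptions, not the confirmed NIM contract."""
    b64 = base64.b64encode(frame_bytes).decode()
    prompt = (
        "Transcribe all text visible in this frame, then give a concise "
        "semantic description of what is happening on screen."
    )
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }

payload = build_vlm_request(b"\x89PNG stub frame bytes")  # stub, not a real frame
```

The returned transcription and description become the text grounding for each frame, which is then blended with the audio transcript in the next step.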
Setting up the Retriever
After blending the audio and visual information into grounded text, we have a coherent text description of the video. We also retain the timestamps for word-level utterances and frames, along with file-level metadata such as the name of the file. Using this information, we create chunks augmented with the metadata and generate embeddings using an embedding model. These embeddings are stored in a vector database along with chunk-level timestamps as metadata.
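The step above can be sketched as follows. The embedder is a deterministic stand-in for a real embedding model, and the in-memory list stands in for a vector database; segment texts and filenames are made up for illustration.

```python
import numpy as np

def embed(text: str, dim: int = 8) -> np.ndarray:
    """Stand-in for a real embedding model (deterministic within one run)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def build_store(segments):
    """segments: (text, start_s, end_s, filename) tuples from the blended
    transcript. Returns an in-memory 'vector store' of (vector, metadata)
    pairs, preserving chunk-level timestamps as metadata."""
    store = []
    for text, start, end, fname in segments:
        store.append((embed(text),
                      {"text": text, "start": start, "end": end, "file": fname}))
    return store

store = build_store([
    ("the presenter shows a bar chart of GPU throughput", 12.0, 18.5, "talk.mp4"),
    ("closing remarks and audience questions", 1750.0, 1800.0, "talk.mp4"),
])
```

Keeping the timestamps alongside each embedding is what later lets the pipeline cite not just which video answered a question, but when.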
Generating Answers
With the vector store set up, it's now possible to talk to your videos. For an incoming user query, embed the query, retrieve candidate chunks, and rerank them to find the most relevant ones. These chunks are then served to the LLM as context to generate an answer. The metadata attached to each chunk makes it possible to cite the referenced videos and the timestamps that were used to answer the question.
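The query path can be sketched end to end as below. The embedder is again a deterministic stub, the term-overlap rerank stands in for a real cross-encoder reranker, and the final LLM call is omitted; the sketch just shows how chunk metadata flows through to the citation.

```python
import numpy as np

def embed(text: str, dim: int = 8) -> np.ndarray:
    """Stand-in for a real embedding model (deterministic within one run)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

# Toy vector store: (embedding, metadata) pairs with chunk-level timestamps.
store = [
    (embed("bar chart of GPU throughput"),
     {"text": "bar chart of GPU throughput", "start": 12.0, "file": "talk.mp4"}),
    (embed("closing remarks and audience questions"),
     {"text": "closing remarks and audience questions", "start": 1750.0,
      "file": "talk.mp4"}),
]

def answer(query, store, top_k=2):
    """Embed the query, retrieve top_k chunks by cosine similarity, rerank
    by term overlap (a stand-in for a cross-encoder reranker), and cite
    the winning chunk's timestamp metadata."""
    q = embed(query)
    candidates = sorted(store, key=lambda item: float(item[0] @ q),
                        reverse=True)[:top_k]
    qterms = set(query.lower().split())
    best = max(candidates,
               key=lambda item: len(qterms & set(item[1]["text"].lower().split())))
    # An LLM would consume best's text as context; here we just cite it.
    meta = best[1]
    return f'{meta["text"]} [{meta["file"]} @ {meta["start"]}s]'

result = answer("GPU throughput", store)
```

The bracketed citation is built entirely from the chunk metadata, which is how the full pipeline points users back to the exact video and timestamp.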