Building a Multimodal Retrieval-Augmented Generation (RAG) System
Building a multimodal RAG system is challenging. The difficulty comes from capturing and indexing information across multiple modalities, including text, images, tables, audio, video, and more. In our previous post, An Easy Introduction to Multimodal Retrieval-Augmented Generation, we discussed how to tackle text and images. This post extends that conversation to audio and video. Specifically, we explore how to build a multimodal RAG pipeline to search for information in videos.
Building RAG for Text, Images, and Videos
Starting from first principles, there are three approaches for building a RAG pipeline that works across multiple modalities, as detailed below and in Figure 1.
Using a Common Embedding Space
The first approach for building a RAG pipeline that works across multiple modalities is using a common embedding space. This approach relies on a single model to project representations of information stored across different modalities into the same embedding space. Using models like CLIP, which has encoders for both images and text, falls into this category. The main upside of this approach is reduced architectural complexity; depending on the diversity of the data used to train the model, it can also flexibly cover a broad range of use cases.
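To make the idea concrete, here is a minimal sketch of retrieval over a common embedding space. The encoders below are deterministic stand-ins, not real CLIP towers; in practice, both `encode_text` and `encode_image` would call the respective CLIP encoders so that images and text land in the same space. The image "index" keyed by caption is purely illustrative.

```python
import numpy as np

def encode_text(text: str, dim: int = 8) -> np.ndarray:
    """Placeholder for a CLIP-style text encoder: a hash-seeded random
    projection, normalized to unit length (deterministic within one run)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def encode_image(caption: str, dim: int = 8) -> np.ndarray:
    """Placeholder for a CLIP-style image encoder. We pretend each image is
    identified by a caption so both modalities share one embedding space."""
    return encode_text(caption, dim)

# Index image embeddings, then retrieve with a text query embedding --
# cross-modal search reduces to nearest-neighbor lookup in the shared space.
index = {name: encode_image(name) for name in ["a cat", "a dog", "a car"]}
query = encode_text("a cat")
best = max(index, key=lambda name: float(index[name] @ query))
```

Because both encoders map into one space, a single similarity search serves every modality, which is exactly the architectural simplification this approach buys.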
Building N Parallel Retrieval Pipelines (Brute Force)
A second method is to build a modality-native (or even submodality-native) retrieval pipeline for each data type and query all pipelines in parallel. This results in multiple sets of chunks spread across different modalities. In this case, two issues arise. First, the number of tokens that need to be ingested by a large language model (LLM) to generate an answer increases massively, raising the cost of running the RAG pipeline. Second, an LLM that can absorb information across multiple modalities is needed. This approach simply moves the problem from the retrieval phase to the generation phase and increases the cost, but in turn simplifies the ingest process and infrastructure.
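The brute-force fan-out can be sketched as follows. The per-modality retrievers here are stubs standing in for real text, image, and audio pipelines, and the scores are hypothetical; the point is only the shape of the merge step.

```python
# Hypothetical modality-native retrievers; each returns (chunk, score) pairs.
def retrieve_text(q):  return [("text chunk about " + q, 0.9)]
def retrieve_image(q): return [("image region matching " + q, 0.7)]
def retrieve_audio(q): return [("audio segment mentioning " + q, 0.6)]

def brute_force_retrieve(query, pipelines, top_k=3):
    """Query every modality-native pipeline and merge all results by score."""
    hits = []
    for retrieve in pipelines:
        hits.extend(retrieve(query))
    # The merged context spans modalities, so the token count (and LLM cost)
    # grows with the number of pipelines -- the trade-off described above.
    return sorted(hits, key=lambda h: h[1], reverse=True)[:top_k]

context = brute_force_retrieve(
    "engine noise", [retrieve_text, retrieve_image, retrieve_audio]
)
```

Note that the merged `context` still contains raw image and audio chunks, which is why this approach requires a generation model that can consume multiple modalities.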
Grounding in a Common Modality
Lastly, information can be ingested from all modalities and grounded in one common modality, such as text. What this means is that all the key information from images, PDFs, videos, audio, and so on, needs to be converted into text for setting up the pipeline. This approach incurs some ingestion cost, and can lead to lossy embeddings, but can be used to unify all modalities effectively for both retrieval and generation.
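A minimal sketch of grounding everything in text is shown below. The converters are stubs: in practice they would be an ASR model for audio, a VLM captioner for images and frames, and a PDF text extractor, but here they just return placeholder strings.

```python
import os

# Hypothetical converters that ground each modality in text.
def transcribe_audio(path): return f"transcript of {path}"
def caption_image(path):    return f"caption of {path}"
def extract_pdf_text(path): return f"text of {path}"

CONVERTERS = {".wav": transcribe_audio, ".png": caption_image, ".pdf": extract_pdf_text}

def ground_in_text(paths):
    """Route every file through a modality-specific converter so that all
    content ends up in one text corpus, ready for a standard text RAG stack."""
    corpus = []
    for p in paths:
        ext = os.path.splitext(p)[1]
        corpus.append(CONVERTERS[ext](p))
    return corpus

docs = ground_in_text(["talk.wav", "slide.png", "paper.pdf"])
```

The conversion step is where the ingestion cost and the potential lossiness mentioned above come from, but after it, a single text embedding model and vector store handle every modality.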
Figure 1. Three different approaches that can be adopted to build a multimodal retrieval pipeline
Complexities with Retrieving Videos
When it comes to retrieving videos, there are several complexities to consider. One of the main challenges is the sheer volume of data: a single minute of video can contain up to 3,600 frames (at 60 frames per second), making it difficult to process and retrieve relevant information.
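The arithmetic behind that figure is simple: the frame count scales linearly with the frame rate.

```python
def frames_per_minute(fps: float) -> int:
    """Number of frames in one minute of video at the given frame rate."""
    return int(fps * 60)

# At 60 fps, a single minute already holds 3,600 frames; even at a more
# typical 30 fps, that is still 1,800 candidate frames to sift through.
per_min_60 = frames_per_minute(60)
per_min_30 = frames_per_minute(30)
```

This is why naively embedding every frame is impractical, and why the pipeline below first reduces each scene to a few representative frames.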

Figure 5. Structural Similarity Index plotted across a scene
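One way to use SSIM for this, sketched below under simplifying assumptions: compare each frame against the last frame kept, and keep a new frame whenever the similarity drops below a threshold. Library implementations (e.g., `skimage.metrics.structural_similarity`) compute SSIM over a sliding window; this simplified version computes the statistics over the whole grayscale frame, with the standard constants for 8-bit pixel range.

```python
import numpy as np

def ssim(x: np.ndarray, y: np.ndarray, c1=6.5025, c2=58.5225) -> float:
    """Global (single-window) SSIM between two grayscale frames.
    c1 = (0.01 * 255)**2 and c2 = (0.03 * 255)**2, the usual constants."""
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx**2 + my**2 + c1) * (vx + vy + c2)
    )

def pick_representative(frames, threshold=0.9):
    """Keep a frame whenever it differs enough (SSIM below threshold)
    from the last frame we kept; returns the kept frame indices."""
    kept = [0]
    for i in range(1, len(frames)):
        if ssim(frames[kept[-1]], frames[i]) < threshold:
            kept.append(i)
    return kept
```

Identical frames score an SSIM of 1.0 and are skipped, so long static stretches of a scene collapse to a single representative frame.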
Blending Audio and Video Information
Once the representative frames are extracted, the next step is to extract all possible information from them. For this, we use a Llama-3-90B VLM NIM. We prompt the VLM to transcribe all the text and information visible on screen, as well as to generate a semantic description of each frame.
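A sketch of composing such a request is shown below. NIM microservices expose an OpenAI-compatible chat API, but the exact model name and message shape here are illustrative assumptions; check the actual NIM API reference before use.

```python
import base64

def build_vlm_request(frame_bytes: bytes,
                      model: str = "meta/llama-3-90b-vision") -> dict:
    """Compose an OpenAI-style chat payload asking a VLM to transcribe
    on-screen text and describe the frame. The model identifier and
    content schema are assumptions, not the confirmed NIM contract."""
    b64 = base64.b64encode(frame_bytes).decode()
    prompt = (
        "Transcribe all text visible in this frame, then give a concise "
        "semantic description of what is happening on screen."
    )
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }

payload = build_vlm_request(b"\x89PNG stub frame bytes")  # stub, not a real frame
```

The returned transcription and description become the text grounding for each frame, which is then blended with the audio transcript in the next step.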
Setting up the Retriever
After blending the audio and visual information into grounded text, we have a coherent text description of the video. We also retain the timestamps for word-level utterances and frames, along with file-level metadata such as the name of the file. Using this information, we create chunks augmented with the metadata and generate embeddings using an embedding model. These embeddings are stored in a vector database along with chunk-level timestamps as metadata.
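The step above can be sketched as follows. The embedder is a deterministic stand-in for a real embedding model, and the in-memory list stands in for a vector database; segment texts and filenames are made up for illustration.

```python
import numpy as np

def embed(text: str, dim: int = 8) -> np.ndarray:
    """Stand-in for a real embedding model (deterministic within one run)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def build_store(segments):
    """segments: (text, start_s, end_s, filename) tuples from the blended
    transcript. Returns an in-memory 'vector store' of (vector, metadata)
    pairs, preserving chunk-level timestamps as metadata."""
    store = []
    for text, start, end, fname in segments:
        store.append((embed(text),
                      {"text": text, "start": start, "end": end, "file": fname}))
    return store

store = build_store([
    ("the presenter shows a bar chart of GPU throughput", 12.0, 18.5, "talk.mp4"),
    ("closing remarks and audience questions", 1750.0, 1800.0, "talk.mp4"),
])
```

Keeping the timestamps alongside each embedding is what later lets the pipeline cite not just which video answered a question, but when.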
Generating Answers
With the vector store set up, it's now possible to talk to your videos. For an incoming user query, embed the query, retrieve candidate chunks, and rerank them to find the most relevant ones. These chunks are then served to the LLM as context to generate an answer. The metadata attached to each chunk makes it possible to cite the referenced videos and the timestamps that were used to answer the question.
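The query path can be sketched end to end as below. The embedder is again a deterministic stub, the term-overlap rerank stands in for a real cross-encoder reranker, and the final LLM call is omitted; the sketch just shows how chunk metadata flows through to the citation.

```python
import numpy as np

def embed(text: str, dim: int = 8) -> np.ndarray:
    """Stand-in for a real embedding model (deterministic within one run)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

# Toy vector store: (embedding, metadata) pairs with chunk-level timestamps.
store = [
    (embed("bar chart of GPU throughput"),
     {"text": "bar chart of GPU throughput", "start": 12.0, "file": "talk.mp4"}),
    (embed("closing remarks and audience questions"),
     {"text": "closing remarks and audience questions", "start": 1750.0,
      "file": "talk.mp4"}),
]

def answer(query, store, top_k=2):
    """Embed the query, retrieve top_k chunks by cosine similarity, rerank
    by term overlap (a stand-in for a cross-encoder reranker), and cite
    the winning chunk's timestamp metadata."""
    q = embed(query)
    candidates = sorted(store, key=lambda item: float(item[0] @ q),
                        reverse=True)[:top_k]
    qterms = set(query.lower().split())
    best = max(candidates,
               key=lambda item: len(qterms & set(item[1]["text"].lower().split())))
    # An LLM would consume best's text as context; here we just cite it.
    meta = best[1]
    return f'{meta["text"]} [{meta["file"]} @ {meta["start"]}s]'

result = answer("GPU throughput", store)
```

The bracketed citation is built entirely from the chunk metadata, which is how the full pipeline points users back to the exact video and timestamp.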