Solving the Challenges of Traditional Video Analytics with VLMs
Building a question-answering chatbot with large language models (LLMs) is now a common workflow for text-based interactions. What about creating an AI system that can answer questions about video and image content? This presents a far more complex task.
Challenges in Traditional Video Analytics
Traditional video analytics tools struggle with limited functionality and a narrow focus on predefined objects, making it difficult to build general-purpose systems that understand and extract rich context from video streams. Developers face the following core challenges:
- Limited understanding: Computer vision models struggle with contextual insights beyond predefined objects.
- Retaining context: Capturing and maintaining relevant context over time across long video streams is challenging.
- Integration complexity: Building a seamless user experience requires integrating multiple AI technologies.
Introducing the AI Blueprint for Video Search and Summarization
To overcome these challenges, we introduce a solution using the NVIDIA AI Blueprint for video search and summarization. This approach enables the development of a visual AI agent capable of multi-step reasoning over video streams.
Combining AI Technologies for a Seamless User Experience
To build a system that not only understands the video but also interacts with users through speech, we combine multiple technologies – video analysis, speech recognition, reasoning, and audio output. We use REST APIs for individual services and orchestrate a cohesive workflow, simplifying scaling, maintenance, and the addition of new features.
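As a minimal sketch of this orchestration pattern, each capability can sit behind its own REST endpoint and be chained from a thin client. The base URLs, endpoint paths, and payload fields below are hypothetical placeholders, not the actual service APIs.

```python
import requests

# Hypothetical base URLs; in practice these point at wherever each
# microservice (ASR, the video agent, TTS) is deployed.
SERVICES = {
    "asr": "http://localhost:8001",
    "video_agent": "http://localhost:8002",
    "tts": "http://localhost:8003",
}

def post_json(service: str, path: str, payload: dict) -> dict:
    """Send a JSON request to one REST microservice and return its JSON reply."""
    response = requests.post(f"{SERVICES[service]}{path}", json=payload, timeout=60)
    response.raise_for_status()
    return response.json()

# Chain the services: speech in -> question answering over video -> speech out.
transcript = post_json("asr", "/transcribe", {"audio": "<base64-encoded audio>"})["text"]
answer = post_json("video_agent", "/chat", {"query": transcript})["answer"]
speech = post_json("tts", "/synthesize", {"text": answer})
```

Because each service is reached the same way, adding a new capability is a matter of registering another endpoint rather than reworking the pipeline.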
Visual AI Agent Workflow Overview
In this workflow, we create a visual AI agent question-answering tool for videos. The tool performs complex multi-step reasoning over a video stream and provides a hands-free user interface by taking speech input and delivering audio output.
Sample Use Case: Open-World Question-Answering on First-Person Video Streams
We showcase the blueprint’s broad contextual understanding by providing it with live first-person point-of-view video streams of everyday activities, not limited to any specific contextual scope. The agent can then answer open-world questions such as, "Where did I leave my concert tickets?" or "What was the name of the coworker I just met?"
Integrating a Reasoning Pipeline, Speech Inputs, and Audio Outputs with AI Blueprint
To build this type of workflow, we need the following components:
- AI Blueprint for video search and summarization
- NVIDIA Morpheus SDK
- NVIDIA Riva NIM microservices for automatic speech recognition (ASR) and text-to-speech (TTS)
- LLM NIM microservice for generating the final response
Architecture of the Agentic RAG Workflow
Figure 1 shows the architecture of the agentic retrieval-augmented generation (RAG) workflow with NVIDIA Morpheus, Riva, and the AI Blueprint. This workflow consists of the following steps:
- Video processing: Stored or streamed video is processed using the AI Blueprint to create natural-language summaries of the events. The blueprint also creates a knowledge graph of the video, which can be queried later through REST APIs.
- Speech-to-text conversion: User audio queries are transcribed into text using the Riva Parakeet model for automatic speech recognition.
- Reasoning pipeline: The Morpheus SDK powers the LLM reasoning pipeline, generating actionable checklists based on the user query.
- Context retrieval: Relevant information is fetched from three parallel pipelines (see the sketch after this list):
- Querying the blueprint to fetch answers from pre-existing summaries and knowledge graphs.
- Sending new queries to the blueprint to retrieve specific insights from the video that weren’t captured during initial processing.
- Performing an internet search to supplement the video insights with additional facts relevant to the scene and user query.
- Final response generation: An LLM NIM microservice synthesizes the gathered context to produce a summarized answer for the user.
- Text-to-speech conversion: The Riva text-to-speech FastPitch model outputs an audio version of the answer.
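The sketch below illustrates the context retrieval and final response generation steps: the three retrieval pipelines run concurrently, and their results are handed to an LLM NIM microservice for synthesis. The blueprint endpoint paths, payload fields, model name, and the web-search stub are assumptions for illustration; the final call assumes the LLM NIM microservice’s OpenAI-compatible chat completions endpoint.

```python
import asyncio
import requests

BLUEPRINT_URL = "http://localhost:8100"  # hypothetical blueprint deployment
LLM_NIM_URL = "http://localhost:8000/v1/chat/completions"  # OpenAI-compatible NIM endpoint

def _post(url: str, payload: dict) -> dict:
    response = requests.post(url, json=payload, timeout=120)
    response.raise_for_status()
    return response.json()

async def query_existing_summary(question: str) -> str:
    # Pipeline 1: answer from the summaries and knowledge graph the blueprint
    # built during initial video processing (endpoint path is a placeholder).
    result = await asyncio.to_thread(_post, f"{BLUEPRINT_URL}/chat", {"query": question})
    return result.get("answer", "")

async def query_video_again(question: str) -> str:
    # Pipeline 2: send a new, targeted query so the blueprint extracts details
    # that were not captured during initial processing.
    result = await asyncio.to_thread(
        _post, f"{BLUEPRINT_URL}/chat", {"query": question, "reprocess": True}
    )
    return result.get("answer", "")

async def web_search(question: str) -> str:
    # Pipeline 3: placeholder for an internet search that supplements the
    # video insights with external facts; plug in any search API here.
    return ""

async def answer(question: str) -> str:
    # Run the three context-retrieval pipelines in parallel.
    contexts = await asyncio.gather(
        query_existing_summary(question),
        query_video_again(question),
        web_search(question),
    )
    context_text = "\n".join(c for c in contexts if c)

    # Final response generation: an LLM NIM microservice synthesizes the
    # gathered context into one concise answer.
    completion = _post(LLM_NIM_URL, {
        "model": "meta/llama-3.1-70b-instruct",  # assumed model name
        "messages": [
            {"role": "system", "content": "Answer the user using only the provided context."},
            {"role": "user", "content": f"Context:\n{context_text}\n\nQuestion: {question}"},
        ],
    })
    return completion["choices"][0]["message"]["content"]

print(asyncio.run(answer("Where did I leave my concert tickets?")))
```

Running the three retrieval pipelines concurrently keeps end-to-end latency close to that of the slowest pipeline rather than the sum of all three.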
Conclusion
Traditional video analytics applications and their development workflows are typically built on a collection of fixed-function, limited models designed to detect and identify only a select set of predefined objects. With generative AI and vision foundation models, it is now possible to build applications with fewer models, each offering broad perception and rich contextual understanding. This new generation of vision language models (VLMs) gives rise to powerful visual AI agents.
FAQs
Q: What are the core challenges in traditional video analytics?
A: Limited understanding, retaining context, and integration complexity.
Q: What is the AI Blueprint for video search and summarization?
A: A cloud-native solution to accelerate the development of visual AI agents, offering a modular architecture with customizable model support and exposing REST APIs for easy integration with other technologies.
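For example, a client might upload a video and request a summary over those REST APIs roughly as follows. The endpoint paths, field names, and parameters here are assumptions for illustration; consult the blueprint’s API reference for the actual schema.

```python
import requests

BLUEPRINT_URL = "http://localhost:8100"  # wherever the blueprint is deployed

# Upload a video file, then request a summary of it (paths and fields assumed).
with open("morning_routine.mp4", "rb") as f:
    upload = requests.post(f"{BLUEPRINT_URL}/files", files={"file": f})
upload.raise_for_status()
file_id = upload.json()["id"]

summary = requests.post(f"{BLUEPRINT_URL}/summarize", json={
    "id": file_id,
    "prompt": "Summarize the notable events in this video.",
})
summary.raise_for_status()
print(summary.json())
```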
Q: What is Morpheus SDK?
A: A powerful LLM engine module that provides native support for NIM microservices, enabling parallelized inference calls on GPU.
Q: What is the LLM reasoning pipeline?
A: A reference architecture for a dynamic agentic reasoning pipeline that generates a preliminary checklist of actionable items for gathering the context needed to answer the user’s query.
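Conceptually, the checklist step can be thought of as a single LLM call that turns the user’s question into a plan of context-gathering actions. The sketch below is an illustration of that idea, not the Morpheus reference pipeline itself; the endpoint, model name, and prompt are assumptions.

```python
import json
import requests

LLM_NIM_URL = "http://localhost:8000/v1/chat/completions"  # OpenAI-compatible NIM endpoint

CHECKLIST_PROMPT = (
    "You are planning how to answer a question about a first-person video. "
    "Return a JSON list of short, actionable context-gathering steps, for example: "
    '["search the existing video summary for ...", '
    '"query the video again for ...", "look up ... on the web"]'
)

def plan_checklist(question: str) -> list[str]:
    """Ask an LLM to produce a preliminary checklist of context-gathering actions."""
    response = requests.post(LLM_NIM_URL, json={
        "model": "meta/llama-3.1-70b-instruct",  # assumed model name
        "messages": [
            {"role": "system", "content": CHECKLIST_PROMPT},
            {"role": "user", "content": question},
        ],
    })
    response.raise_for_status()
    return json.loads(response.json()["choices"][0]["message"]["content"])

print(plan_checklist("Where did I leave my concert tickets?"))
```

Each item in the returned checklist then maps to one of the context-retrieval pipelines described in the architecture above.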

