Solving the Challenges of Traditional Video Analytics with VLMs
Building a question-answering chatbot with large language models (LLMs) is now a common workflow for text-based interactions. What about creating an AI system that can answer questions about video and image content? This presents a far more complex task.
Challenges in Traditional Video Analytics
Traditional video analytics tools struggle with limited functionality and a narrow focus on predefined objects, making it difficult to build general-purpose systems that understand and extract rich context from video streams. Developers face the following core challenges:
- Limited understanding: Computer vision models struggle with contextual insights beyond predefined objects.
- Retaining context: Capturing and maintaining relevant context over time across long video streams is challenging.
- Integration complexity: Building a seamless user experience requires integrating multiple AI technologies.
Introducing the AI Blueprint for Video Search and Summarization
To overcome these challenges, we introduce a solution using the NVIDIA AI Blueprint for video search and summarization. This approach enables the development of a visual AI agent capable of multi-step reasoning over video streams.
Combining AI Technologies for a Seamless User Experience
To build a system that not only understands the video but also interacts with users through speech, we combine multiple technologies – video analysis, speech recognition, reasoning, and audio output. We use REST APIs for individual services and orchestrate a cohesive workflow, simplifying scaling, maintenance, and the addition of new features.
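As a minimal sketch of this orchestration pattern, each capability can sit behind its own REST endpoint and be chained from a thin client. The base URLs, endpoint paths, and payload fields below are hypothetical placeholders, not the actual service APIs.

```python
import requests

# Hypothetical base URLs; in practice these point at wherever each
# microservice (ASR, the video agent, TTS) is deployed.
SERVICES = {
    "asr": "http://localhost:8001",
    "video_agent": "http://localhost:8002",
    "tts": "http://localhost:8003",
}

def post_json(service: str, path: str, payload: dict) -> dict:
    """Send a JSON request to one REST microservice and return its JSON reply."""
    response = requests.post(f"{SERVICES[service]}{path}", json=payload, timeout=60)
    response.raise_for_status()
    return response.json()

# Chain the services: speech in -> question answering over video -> speech out.
transcript = post_json("asr", "/transcribe", {"audio": "<base64-encoded audio>"})["text"]
answer = post_json("video_agent", "/chat", {"query": transcript})["answer"]
speech = post_json("tts", "/synthesize", {"text": answer})
```

Because each service is reached the same way, adding a new capability is a matter of registering another endpoint rather than reworking the pipeline.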
Visual AI Agent Workflow Overview
In this workflow, we create a visual AI agent question-answering tool for videos. The tool performs complex multi-step reasoning over a video stream and provides a hands-free user interface by taking speech input and delivering audio output.
Sample Use Case: Open-World Question-Answering on First-Person Video Streams
We showcase the blueprint’s broad contextual understanding by providing it with live first-person point-of-view video streams of everyday activities, not limited to any specific contextual scope. The agent can then answer open-world questions such as, "Where did I leave my concert tickets?" or "What was the name of the coworker I just met?"
Integrating a Reasoning Pipeline, Speech Inputs, and Audio Outputs with AI Blueprint
To build this type of workflow, we need the following components:
- AI Blueprint for video search and summarization
- NVIDIA Morpheus SDK
- NVIDIA Riva NIM microservices for automatic speech recognition (ASR) and text-to-speech (TTS)
- LLM NIM microservice for generating the final response
Architecture of the Agentic RAG Workflow
Figure 1 shows the architecture of the agentic retrieval-augmented generation (RAG) workflow with NVIDIA Morpheus, Riva, and the AI Blueprint. This workflow consists of the following steps:
- Video processing: Stored or streamed video is processed using the AI Blueprint to create natural-language summaries of the events. The blueprint also creates a knowledge graph of the video, which can be queried later through REST APIs.
- Speech-to-text conversion: User audio queries are transcribed into text using the Riva Parakeet model for automatic speech recognition.
- Reasoning pipeline: The Morpheus SDK powers the LLM reasoning pipeline, generating actionable checklists based on the user query.
- Context retrieval: Relevant information is fetched from three parallel pipelines (see the sketch after this list):
- Querying the blueprint to fetch answers from pre-existing summaries and knowledge graphs.
- Sending new queries to the blueprint to retrieve specific insights from the video that weren’t captured during initial processing.
- Performing an internet search to supplement the video insights with additional facts relevant to the scene and user query.
- Final response generation: An LLM NIM microservice synthesizes the gathered context to produce a summarized answer for the user.
- Text-to-speech conversion: The Riva text-to-speech FastPitch model outputs an audio version of the answer.
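The sketch below illustrates the context retrieval and final response generation steps: the three retrieval pipelines run concurrently, and their results are handed to an LLM NIM microservice for synthesis. The blueprint endpoint paths, payload fields, model name, and the web-search stub are assumptions for illustration; the final call assumes the LLM NIM microservice’s OpenAI-compatible chat completions endpoint.

```python
import asyncio
import requests

BLUEPRINT_URL = "http://localhost:8100"  # hypothetical blueprint deployment
LLM_NIM_URL = "http://localhost:8000/v1/chat/completions"  # OpenAI-compatible NIM endpoint

def _post(url: str, payload: dict) -> dict:
    response = requests.post(url, json=payload, timeout=120)
    response.raise_for_status()
    return response.json()

async def query_existing_summary(question: str) -> str:
    # Pipeline 1: answer from the summaries and knowledge graph the blueprint
    # built during initial video processing (endpoint path is a placeholder).
    result = await asyncio.to_thread(_post, f"{BLUEPRINT_URL}/chat", {"query": question})
    return result.get("answer", "")

async def query_video_again(question: str) -> str:
    # Pipeline 2: send a new, targeted query so the blueprint extracts details
    # that were not captured during initial processing.
    result = await asyncio.to_thread(
        _post, f"{BLUEPRINT_URL}/chat", {"query": question, "reprocess": True}
    )
    return result.get("answer", "")

async def web_search(question: str) -> str:
    # Pipeline 3: placeholder for an internet search that supplements the
    # video insights with external facts; plug in any search API here.
    return ""

async def answer(question: str) -> str:
    # Run the three context-retrieval pipelines in parallel.
    contexts = await asyncio.gather(
        query_existing_summary(question),
        query_video_again(question),
        web_search(question),
    )
    context_text = "\n".join(c for c in contexts if c)

    # Final response generation: an LLM NIM microservice synthesizes the
    # gathered context into one concise answer.
    completion = _post(LLM_NIM_URL, {
        "model": "meta/llama-3.1-70b-instruct",  # assumed model name
        "messages": [
            {"role": "system", "content": "Answer the user using only the provided context."},
            {"role": "user", "content": f"Context:\n{context_text}\n\nQuestion: {question}"},
        ],
    })
    return completion["choices"][0]["message"]["content"]

print(asyncio.run(answer("Where did I leave my concert tickets?")))
```

Running the three retrieval pipelines concurrently keeps end-to-end latency close to that of the slowest pipeline rather than the sum of all three.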
Conclusion
Traditional video analytics applications and their development workflows are typically built on a collection of fixed-function, limited models designed to detect and identify only a select set of predefined objects. With generative AI and vision foundation models, it is now possible to build applications with fewer models, each offering broad perception and rich contextual understanding. This new generation of vision language models (VLMs) gives rise to powerful visual AI agents.
FAQs
Q: What are the core challenges in traditional video analytics?
A: Limited understanding, retaining context, and integration complexity.
Q: What is the AI Blueprint for video search and summarization?
A: A cloud-native solution to accelerate the development of visual AI agents, offering a modular architecture with customizable model support and exposing REST APIs for easy integration with other technologies.
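For example, a client might upload a video and request a summary over those REST APIs roughly as follows. The endpoint paths, field names, and parameters here are assumptions for illustration; consult the blueprint’s API reference for the actual schema.

```python
import requests

BLUEPRINT_URL = "http://localhost:8100"  # wherever the blueprint is deployed

# Upload a video file, then request a summary of it (paths and fields assumed).
with open("morning_routine.mp4", "rb") as f:
    upload = requests.post(f"{BLUEPRINT_URL}/files", files={"file": f})
upload.raise_for_status()
file_id = upload.json()["id"]

summary = requests.post(f"{BLUEPRINT_URL}/summarize", json={
    "id": file_id,
    "prompt": "Summarize the notable events in this video.",
})
summary.raise_for_status()
print(summary.json())
```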
Q: What is Morpheus SDK?
A: A powerful LLM engine module that provides native support for NIM microservices, enabling parallelized inference calls on GPU.
Q: What is the LLM reasoning pipeline?
A: A reference architecture for a dynamic agentic reasoning pipeline that generates a preliminary checklist of actionable items for gathering the context needed to answer the user’s query.
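Conceptually, the checklist step can be thought of as a single LLM call that turns the user’s question into a plan of context-gathering actions. The sketch below is an illustration of that idea, not the Morpheus reference pipeline itself; the endpoint, model name, and prompt are assumptions.

```python
import json
import requests

LLM_NIM_URL = "http://localhost:8000/v1/chat/completions"  # OpenAI-compatible NIM endpoint

CHECKLIST_PROMPT = (
    "You are planning how to answer a question about a first-person video. "
    "Return a JSON list of short, actionable context-gathering steps, for example: "
    '["search the existing video summary for ...", '
    '"query the video again for ...", "look up ... on the web"]'
)

def plan_checklist(question: str) -> list[str]:
    """Ask an LLM to produce a preliminary checklist of context-gathering actions."""
    response = requests.post(LLM_NIM_URL, json={
        "model": "meta/llama-3.1-70b-instruct",  # assumed model name
        "messages": [
            {"role": "system", "content": CHECKLIST_PROMPT},
            {"role": "user", "content": question},
        ],
    })
    response.raise_for_status()
    return json.loads(response.json()["choices"][0]["message"]["content"])

print(plan_checklist("Where did I leave my concert tickets?"))
```

Each item in the returned checklist then maps to one of the context-retrieval pipelines described in the architecture above.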

