Accelerate Custom Video Pipelines with NVIDIA NeMo Framework

Generative AI for Video Foundation Models

Generative AI has evolved from text-based models to multimodal models, with a recent expansion into video that opens up new potential uses across industries. Video models can create new experiences for users or simulate scenarios for training autonomous agents at scale, and they are helping revolutionize fields including robotics, autonomous vehicles, and entertainment.

The Development of Video Foundation Models

The development of video foundation models presents unique challenges due to the vast and varied nature of video data, which underscores the need for scalable pipelines to curate data and to effectively train models that can comprehend temporal and spatial dynamics.

NeMo Framework

We are announcing new video foundation model capabilities in the NVIDIA NeMo framework, an end-to-end training framework that enables you to pretrain and fine-tune your own video foundation models. The framework includes high-throughput data curation, efficient multimodal data loading functionality, scalable model training, and parallelized in-framework inference.

High-Throughput Video Curation through Optimized Pipelines

NeMo Curator improves generative AI model accuracy by efficiently processing and preparing high-quality data, including large video datasets. Using NeMo Curator's scalable data pipelines, you can efficiently clip, annotate, and filter 100 PB or more of video. To remove bottlenecks and optimize performance, NeMo Curator uses a combination of:

  • NVDEC: Hardware decoder
  • NVENC: Hardware encoder
  • Ray: Compute framework for scaling AI applications
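The clip-annotate-filter flow described above can be sketched as a staged pipeline. The sketch below is purely illustrative: the stage functions, `Clip` fields, and quality threshold are hypothetical stand-ins, not the actual NeMo Curator API. In production, each stage would run as a distributed Ray task, with NVDEC and NVENC handling decode and encode on the GPU.

```python
# Hypothetical sketch of a clip -> annotate -> filter curation pipeline.
# All names and thresholds are illustrative, not the NeMo Curator API.
from dataclasses import dataclass

@dataclass
class Clip:
    source: str
    start_s: float
    end_s: float
    caption: str = ""
    quality: float = 0.0

def clip_stage(video: str) -> list:
    # Placeholder for shot detection: split a video into fixed 5 s windows.
    return [Clip(video, t, t + 5.0) for t in (0.0, 5.0, 10.0)]

def annotate_stage(clip: Clip) -> Clip:
    # Placeholder for a captioning model and a quality scorer.
    clip.caption = f"{clip.source}@{clip.start_s:.0f}s"
    clip.quality = clip.end_s - clip.start_s  # stand-in score
    return clip

def filter_stage(clips, min_quality: float = 4.0) -> list:
    # Keep only clips that pass the (illustrative) quality threshold.
    return [c for c in clips if c.quality >= min_quality]

def curate(videos) -> list:
    clips = [c for v in videos for c in clip_stage(v)]
    return filter_stage([annotate_stage(c) for c in clips])

kept = curate(["a.mp4", "b.mp4"])
print(len(kept))  # 6: every 5 s window passes the stand-in filter
```

In a real deployment, the per-clip stages are embarrassingly parallel, which is why a compute framework like Ray fits: each stage becomes a remote task and the scheduler spreads clips across workers.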

Efficient In-Framework Inference

The NeMo framework accelerates inference by distributing denoising operations across multiple GPUs through context parallelism. After parallel denoising, the latent tensors are combined to reconstruct the video sequence before decoding with the Cosmos video tokenizer.
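The shard-denoise-gather pattern can be illustrated with a minimal single-process sketch. The latent video tensor is split along its temporal axis, each shard is denoised independently (here by a stub update; the real model's attention layers exchange keys and values across ranks), and the shards are concatenated back before the tokenizer decodes. The shapes, the `denoise_shard` stub, and the scaling factor are assumptions for illustration, not the NeMo implementation.

```python
# Minimal sketch of context-parallel denoising: split the latent along
# time, denoise each shard "per GPU", then gather before decoding.
# The denoising stub and shapes are illustrative assumptions.
import numpy as np

def denoise_shard(latent_shard: np.ndarray, step: int) -> np.ndarray:
    # Stub for one diffusion denoising step on one rank's shard.
    return latent_shard * 0.9  # stand-in update

def context_parallel_denoise(latents: np.ndarray,
                             world_size: int,
                             steps: int) -> np.ndarray:
    # Split along the temporal axis (axis 0): one shard per GPU rank.
    shards = np.array_split(latents, world_size, axis=0)
    for step in range(steps):
        # In the real setup, these run concurrently on separate GPUs.
        shards = [denoise_shard(s, step) for s in shards]
    # Gather the denoised shards to reconstruct the full latent sequence.
    return np.concatenate(shards, axis=0)

latents = np.ones((16, 32, 32, 8))  # (time, height, width, channels)
out = context_parallel_denoise(latents, world_size=4, steps=2)
print(out.shape)  # (16, 32, 32, 8): full temporal length restored
```

After the gather, the combined latent tensor would be passed to the Cosmos video tokenizer's decoder to produce frames.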

Conclusion

In this post, we covered the NVIDIA NeMo framework features that help you pretrain or fine-tune video foundation models effectively and efficiently. NeMo Curator offers high-throughput data curation through clipping and sharding pipelines, and the Megatron Energon library offers efficient multimodal data loading. The NeMo framework enables scalable video foundation model training by supporting various model parallelism techniques optimized for diffusion and autoregressive models. It also provides efficient in-framework inference by distributing denoising operations across multiple GPUs and incorporating FP8 multi-head attention.

Acknowledgments

Thanks to the following contributors: Parth Mannan, Xiaowei Ren, Zhuoyao Wang, Carl Wang, Jack Chang, Sahil Jain, Shanmugam Ramasamy, Joseph Jennings, Ekaterina Sirazitdinova, Oleg Sudakov, Linnan Wang, Mingyuan Ma, Bobby Chen, Forrest Lin, Hao Wang, Vasanth Rao Naik Sabavat, Sriharsha Niverty, Rong Ou, Pallab Bhattacharya, David Page, Jacob Huffman, Tommy Huang, Nima Tajbakhsh, and Ashwath Aithal.
