Generative AI: Revolutionizing Industries with Multimodal Capabilities
Generative AI has rapidly evolved from text-only models to multimodal models that perform tasks such as image captioning and visual question answering, reflecting a shift toward more human-like AI. The community is now expanding from text and images to video, opening new possibilities across industries.
Video AI Models: Revolutionizing Industries
Video AI models are poised to revolutionize industries such as robotics, automotive, and retail. In robotics, they enhance autonomous navigation in complex, ever-changing environments, which is vital for sectors like manufacturing and warehouse management. In the automotive industry, video AI is propelling autonomous driving, boosting vehicle perception, safety, and predictive maintenance to improve efficiency.
Building Image and Video Foundation Models
To build image and video foundation models, developers must curate and preprocess a large amount of training data, tokenize the resulting high-quality data at high fidelity, train or customize pretrained models efficiently and at scale, and then generate high-quality images and videos during inference.
Announcing NVIDIA NeMo for Multimodal Generative AI
NVIDIA NeMo is an end-to-end platform for developing, customizing, and deploying generative AI models.
NVIDIA just announced the expansion of NeMo to support the end-to-end pipeline for developing multimodal models. NeMo enables you to easily curate high-quality visual data, accelerate training and customization with highly efficient tokenizers and parallelism techniques, and reconstruct high-quality visuals during inference.
Accelerated Video and Image Data Curation
High-quality training data is essential for high-accuracy results from an AI model. However, developers face various challenges in building data processing pipelines, ranging from scaling to data orchestration.
NeMo Curator streamlines the data curation process, making it easier and faster for you to build multimodal generative AI models. Its out-of-the-box experience minimizes the total cost of ownership (TCO) and accelerates time-to-market.
When working with visual data, organizations can quickly reach petabyte scale. NeMo Curator provides an orchestration pipeline that load-balances across multiple GPUs at each stage of data curation. As a result, you can reduce video processing time by 7x compared to a naive GPU-based implementation. The scalable pipelines can efficiently process over 100 PB of data, so large datasets are handled without manual rebalancing.
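To make the idea of per-stage load balancing concrete, here is a minimal sketch of one way to divide a fixed pool of GPUs among pipeline stages. It is not the NeMo Curator API: the stage names, the per-clip costs, and the greedy heuristic are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str                # hypothetical stage name
    seconds_per_clip: float  # assumed cost of one clip on one GPU
    gpus: int = 1


def balance(stages, total_gpus):
    """Greedy allocation: give each spare GPU to the current bottleneck stage."""
    if total_gpus < len(stages):
        raise ValueError("need at least one GPU per stage")
    for stage in stages:
        stage.gpus = 1
    for _ in range(total_gpus - len(stages)):
        # The bottleneck is the stage with the highest time per clip
        # after dividing its work across the GPUs it already has.
        slowest = max(stages, key=lambda s: s.seconds_per_clip / s.gpus)
        slowest.gpus += 1
    return stages


def throughput(stages):
    """Clips per second of the whole pipeline, limited by its slowest stage."""
    return min(s.gpus / s.seconds_per_clip for s in stages)


def make_stages():
    # Illustrative numbers only; real stage costs depend on the workload.
    return [
        Stage("decode", 1.0),
        Stage("filter", 0.5),
        Stage("caption", 4.0),
        Stage("embed", 2.5),
    ]


balanced = balance(make_stages(), total_gpus=8)
even = make_stages()
for stage in even:
    stage.gpus = 2  # naive split: the same number of GPUs for every stage
```

With these assumed costs, the greedy allocation gives the expensive captioning stage four of the eight GPUs and reaches 0.8 clips per second, while the even split is held to 0.5 clips per second by that same stage. The gap depends entirely on how uneven the stage costs are, so it should not be read as a prediction of the 7x figure above.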
NVIDIA Cosmos Tokenizers
Tokenizers map redundant and implicit visual data into compact and semantic tokens, enabling efficient training of large-scale generative models and democratizing their inference on limited computational resources.
Today’s open video and image tokenizers often generate poor data representations, leading to lossy reconstructions, distorted images, and temporally unstable videos. This caps the capability of generative models built on top of them. Inefficient tokenization also slows encoding and decoding and lengthens training and inference times, hurting both developer productivity and the user experience.
NVIDIA Cosmos tokenizers are open models that offer superior visual tokenization with exceptionally large compression rates and cutting-edge reconstruction quality across diverse image and video categories.
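As a rough illustration of what a compression rate means for a video tokenizer, the sketch below computes the size of the token grid for a clip. The temporal and spatial factors in the usage note are hypothetical values picked for round numbers, not the published Cosmos settings, and a real tokenizer may treat the first frame or padding differently.

```python
def token_grid(frames, height, width, t_factor, s_factor):
    """Token grid shape when time shrinks by t_factor and each spatial axis by s_factor."""
    def ceil_div(a, b):
        # Round up so a partial block at the edge still gets a token.
        return -(-a // b)

    return (
        ceil_div(frames, t_factor),
        ceil_div(height, s_factor),
        ceil_div(width, s_factor),
    )


def compression_ratio(frames, height, width, t_factor, s_factor):
    """Pixel positions in the clip divided by tokens in the grid."""
    t, h, w = token_grid(frames, height, width, t_factor, s_factor)
    return (frames * height * width) / (t * h * w)
```

For example, a 32-frame 512x512 clip with an assumed 4x temporal and 8x spatial factor becomes an 8x64x64 grid of 32,768 tokens, which is 256 times fewer positions than the original pixels. A generative model trained on those tokens processes far shorter sequences, which is where the training and inference savings come from.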
Cosmos Tokenizer Architecture
A Cosmos tokenizer uses a sophisticated encoder-decoder structure designed for high efficiency and effective learning. At its core, it employs 3D causal convolution blocks, which are specialized layers that jointly process spatiotemporal information, and uses causal temporal attention that captures long-range dependencies in data.
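The "causal" part of a causal convolution can be shown with a small, dependency-free sketch. It covers only the time axis: each frame is a flat list of pixel values and the spatial part of the 3D kernel is omitted for brevity. The function name and kernel weights are illustrative assumptions, and this is not the Cosmos implementation.

```python
def causal_conv_time(frames, kernel):
    """Convolve along time so that output[t] depends only on frames[0..t].

    frames: list of frames, each a flat list of floats (one value per pixel).
    kernel: temporal weights; kernel[-1] multiplies the current frame and
            kernel[0] multiplies the oldest frame in the window.
    """
    k = len(kernel)
    n_pix = len(frames[0])
    # All padding goes on the past side: k - 1 zero frames before frame 0.
    padded = [[0.0] * n_pix for _ in range(k - 1)] + [list(f) for f in frames]
    out = []
    for t in range(len(frames)):
        window = padded[t:t + k]  # original frames t-k+1 .. t, never t+1
        out.append([
            sum(w * frame[p] for w, frame in zip(kernel, window))
            for p in range(n_pix)
        ])
    return out


clip = [[1.0], [2.0], [3.0], [4.0]]   # four one-pixel frames
weights = [0.25, 0.25, 0.5]           # hypothetical temporal kernel
encoded = causal_conv_time(clip, weights)
```

Because the padding sits entirely in the past, changing a later frame never alters an earlier output. That property lets the same layers handle a single image (a one-frame clip) and a video, and lets frames be encoded as they arrive.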
Build Your Own Multimodal Models with NeMo
The expanded NVIDIA NeMo platform combines at-scale data processing with NeMo Curator and high-quality tokenization and visual reconstruction with the Cosmos tokenizer, so you can build state-of-the-art multimodal generative AI models.
Join the waitlist and be notified when NeMo Curator is available. The tokenizer is available now on the /NVIDIA/cosmos-tokenizer GitHub repo and Hugging Face.
Conclusion
The NVIDIA NeMo platform has expanded to support the development of multimodal generative AI models, enabling the creation of high-quality images and videos. With the Cosmos tokenizer and NeMo Curator, developers can now build state-of-the-art multimodal models that can be used across various industries.
FAQs
Q: What is NeMo Curator?
A: NeMo Curator is a data curation tool that streamlines the data curation process, making it easier and faster for developers to build multimodal generative AI models.
Q: What is the Cosmos tokenizer?
A: The Cosmos tokenizer is an open model that offers superior visual tokenization with exceptionally large compression rates and cutting-edge reconstruction quality across diverse image and video categories.
Q: How does NeMo Curator reduce video processing time?
A: NeMo Curator reduces video processing time by 7x compared to a naive GPU-based implementation, allowing for efficient processing of large datasets.
Q: Is the Cosmos tokenizer available now?
A: Yes, the Cosmos tokenizer is available now on the /NVIDIA/cosmos-tokenizer GitHub repo and Hugging Face.

