Accelerating Llama 3.2 AI Inference Throughput
Meta recently released its Llama 3.2 series of vision language models (VLMs), which come in 11B-parameter and 90B-parameter variants. These models are multimodal, supporting both text and image inputs. Meta has also released text-only small language model (SLM) variants of Llama 3.2 with 1B and 3B parameters. NVIDIA has optimized the Llama 3.2 collection of models for high-throughput, cost-efficient serving across millions of GPUs worldwide, from the most powerful data center and cloud GPUs to local NVIDIA RTX workstations and low-power edge devices with NVIDIA Jetson.
The Llama 3.2 11B and Llama 3.2 90B models pair a vision encoder with a text decoder. The vision encoder is optimized for high-performance inference using the NVIDIA TensorRT library, and the text decoder is optimized using the NVIDIA TensorRT-LLM library.
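As an illustration of how an optimized text decoder can be served, the sketch below uses the TensorRT-LLM high-level Python API (the LLM class) to generate text with one of the Llama 3.2 models. The model ID, sampling settings, and prompt are illustrative assumptions, not the benchmark configuration behind the tables in this post; the vision encoder path for the 11B and 90B VLMs requires a separately built TensorRT engine and is omitted here.

```python
# Minimal sketch: serving a Llama 3.2 text decoder with TensorRT-LLM's
# high-level Python API. Model ID and sampling settings are illustrative
# assumptions, not the benchmark configuration from this post.
from tensorrt_llm import LLM, SamplingParams

def main():
    # Builds (or loads) a TensorRT-LLM engine for the model under the hood.
    llm = LLM(model="meta-llama/Llama-3.2-3B-Instruct")  # hypothetical SLM choice

    sampling = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

    prompts = [
        "Summarize the benefits of optimized inference in one sentence.",
    ]
    for output in llm.generate(prompts, sampling):
        print(output.outputs[0].text)

if __name__ == "__main__":
    main()
```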
Delivering High Throughput and Low Latency
Table 1 shows maximum throughput performance, representing offline use cases, across a range of input and output sequence lengths with a single input image at the maximum supported resolution of 1120 x 1120 pixels. Using a system based on the NVIDIA HGX H200 platform, we run the Llama 3.2 90B model on eight NVIDIA H200 Tensor Core GPUs, each with 141 GB of fast HBM3e memory, connected through NVLink and NVLink Switch, which provide 900 GB/s of GPU-to-GPU bandwidth.
Minimum Latency Performance
Table 2 shows minimum latency performance using the same input and output sequence lengths and input image size.
Throughput Performance on NVIDIA GeForce RTX 4090 with ONNX Runtime
For Windows deployments, NVIDIA has optimized the Llama 3.2 SLMs to run efficiently with the ONNX Runtime generate() API using a DirectML backend. Performance measurements were made using the model checkpoint available on the NGC catalog.
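For readers who want to try this path, the sketch below shows token-by-token generation with the onnxruntime-genai Python package. The model directory, prompt, and search options are illustrative assumptions (method names can vary slightly between onnxruntime-genai releases); on Windows, installing the DirectML build of the package selects the DirectML backend.

```python
# Minimal sketch: running a Llama 3.2 SLM with the ONNX Runtime generate() API
# (onnxruntime-genai). Paths and options are illustrative assumptions.
# On Windows, install the DirectML build: pip install onnxruntime-genai-directml
import onnxruntime_genai as og

# Directory containing the exported ONNX model and its genai_config.json.
model = og.Model("./llama-3.2-3b-instruct-onnx")  # hypothetical local path
tokenizer = og.Tokenizer(model)
stream = tokenizer.create_stream()

prompt = "Explain DirectML in one sentence."
input_tokens = tokenizer.encode(prompt)

params = og.GeneratorParams(model)
params.set_search_options(max_length=256, temperature=0.7)

generator = og.Generator(model, params)
generator.append_tokens(input_tokens)

# Decode one token at a time and print the streamed text.
while not generator.is_done():
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    print(stream.decode(new_token), end="", flush=True)
print()
```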
Better Performance on Llama 3.2 across Platforms
With the NVIDIA accelerated computing platform, you can build models and supercharge your applications with the most performant Llama 3.2 models on any platform, from the data center and cloud to local workstations. Enterprises seeking the fastest time to value can use NVIDIA NIM, part of the NVIDIA AI Enterprise software platform, which offers NVIDIA TensorRT optimized inference on Llama 3.2 and other models from NVIDIA and its partner ecosystem.
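NIM microservices expose an OpenAI-compatible HTTP endpoint, so a deployed Llama 3.2 NIM can be called with the standard OpenAI Python client, as in the sketch below. The base URL, model name, and API key handling are illustrative assumptions that depend on how and where the NIM is deployed.

```python
# Minimal sketch: calling a Llama 3.2 NIM through its OpenAI-compatible API.
# Base URL, model name, and API key are illustrative assumptions and depend
# on your deployment (self-hosted NIM or a hosted API catalog endpoint).
import os
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # hypothetical self-hosted NIM endpoint
    api_key=os.environ.get("NIM_API_KEY", "not-needed-for-local"),
)

response = client.chat.completions.create(
    model="meta/llama-3.2-90b-vision-instruct",  # hypothetical NIM model name
    messages=[{"role": "user", "content": "Describe NVIDIA NIM in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```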
Acknowledgments
We would like to thank George Yuan, Alex Settle, and Chenjie Luo for their efforts in supporting this post.
FAQs
Q: What are the key features of Llama 3.2?
A: Llama 3.2 is a series of vision language models (VLMs) that support both text and image inputs, with variants including 11B and 90B parameters, as well as text-only small language model (SLM) variants with 1B and 3B parameters.
Q: What is the purpose of the Llama 3.2 optimization?
A: The purpose of the Llama 3.2 optimization is to deliver high-throughput, cost-efficient serving across millions of GPUs worldwide, from data center and cloud GPUs to local workstations and edge devices.
Q: What is the minimum latency performance of Llama 3.2?
A: The minimum latency performance of Llama 3.2 is shown in Table 2, measured on the NVIDIA HGX H200 platform using the same input and output sequence lengths and input image size as the maximum throughput results in Table 1.

