Accelerating Llama 3.2 AI Inference Throughput
Meta recently released its Llama 3.2 series of vision language models (VLMs), which come in 11B-parameter and 90B-parameter variants. These models are multimodal, supporting both text and image inputs. Meta has also released text-only small language model (SLM) variants of Llama 3.2 with 1B and 3B parameters. NVIDIA has optimized the Llama 3.2 collection of models for high-throughput, cost-efficient serving across millions of GPUs worldwide, from the most powerful data center and cloud GPUs to local NVIDIA RTX workstations and low-power edge devices with NVIDIA Jetson.
The Llama 3.2 11B and Llama 3.2 90B models pair a vision encoder with a text decoder. The vision encoder is optimized for high-performance inference using the NVIDIA TensorRT library, and the text decoder is optimized using the NVIDIA TensorRT-LLM library.
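As an illustration of how an optimized text decoder can be served, the sketch below uses the TensorRT-LLM high-level Python API (the LLM class) to generate text with one of the Llama 3.2 models. The model ID, sampling settings, and prompt are illustrative assumptions, not the benchmark configuration behind the tables in this post; the vision encoder path for the 11B and 90B VLMs requires a separately built TensorRT engine and is omitted here.

```python
# Minimal sketch: serving a Llama 3.2 text decoder with TensorRT-LLM's
# high-level Python API. Model ID and sampling settings are illustrative
# assumptions, not the benchmark configuration from this post.
from tensorrt_llm import LLM, SamplingParams

def main():
    # Builds (or loads) a TensorRT-LLM engine for the model under the hood.
    llm = LLM(model="meta-llama/Llama-3.2-3B-Instruct")  # hypothetical SLM choice

    sampling = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

    prompts = [
        "Summarize the benefits of optimized inference in one sentence.",
    ]
    for output in llm.generate(prompts, sampling):
        print(output.outputs[0].text)

if __name__ == "__main__":
    main()
```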
Delivering High Throughput and Low Latency
Table 1 shows maximum throughput performance, representing offline use cases, across a range of input and output sequence lengths with a single input image at the maximum supported resolution of 1120 x 1120 pixels. Using a system based on the NVIDIA HGX H200 platform, we run the Llama 3.2 90B model on eight NVIDIA H200 Tensor Core GPUs, each with 141 GB of fast HBM3e memory, connected through NVLink and NVLink Switch, which provide 900 GB/s of GPU-to-GPU bandwidth.
Minimum Latency Performance
Table 2 shows minimum latency performance using the same input and output sequence lengths and input image size.
Throughput Performance on NVIDIA GeForce RTX 4090 with ONNX Runtime
For Windows deployments, NVIDIA has optimized the Llama 3.2 SLMs to run efficiently with the ONNX Runtime generate() API using a DirectML backend. Performance measurements were made using the model checkpoint available on the NGC catalog.
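For readers who want to try this path, the sketch below shows token-by-token generation with the onnxruntime-genai Python package. The model directory, prompt, and search options are illustrative assumptions (method names can vary slightly between onnxruntime-genai releases); on Windows, installing the DirectML build of the package selects the DirectML backend.

```python
# Minimal sketch: running a Llama 3.2 SLM with the ONNX Runtime generate() API
# (onnxruntime-genai). Paths and options are illustrative assumptions.
# On Windows, install the DirectML build: pip install onnxruntime-genai-directml
import onnxruntime_genai as og

# Directory containing the exported ONNX model and its genai_config.json.
model = og.Model("./llama-3.2-3b-instruct-onnx")  # hypothetical local path
tokenizer = og.Tokenizer(model)
stream = tokenizer.create_stream()

prompt = "Explain DirectML in one sentence."
input_tokens = tokenizer.encode(prompt)

params = og.GeneratorParams(model)
params.set_search_options(max_length=256, temperature=0.7)

generator = og.Generator(model, params)
generator.append_tokens(input_tokens)

# Decode one token at a time and print the streamed text.
while not generator.is_done():
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    print(stream.decode(new_token), end="", flush=True)
print()
```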
Better Performance on Llama 3.2 across Platforms
With the NVIDIA accelerated computing platform, you can build models and supercharge your applications with the most performant Llama 3.2 models on any platform, from the data center and cloud to local workstations. Enterprises seeking the fastest time to value can use NVIDIA NIM, part of the NVIDIA AI Enterprise software platform, which offers NVIDIA TensorRT optimized inference on Llama 3.2 and other models from NVIDIA and its partner ecosystem.
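NIM microservices expose an OpenAI-compatible HTTP endpoint, so a deployed Llama 3.2 NIM can be called with the standard OpenAI Python client, as in the sketch below. The base URL, model name, and API key handling are illustrative assumptions that depend on how and where the NIM is deployed.

```python
# Minimal sketch: calling a Llama 3.2 NIM through its OpenAI-compatible API.
# Base URL, model name, and API key are illustrative assumptions and depend
# on your deployment (self-hosted NIM or a hosted API catalog endpoint).
import os
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # hypothetical self-hosted NIM endpoint
    api_key=os.environ.get("NIM_API_KEY", "not-needed-for-local"),
)

response = client.chat.completions.create(
    model="meta/llama-3.2-90b-vision-instruct",  # hypothetical NIM model name
    messages=[{"role": "user", "content": "Describe NVIDIA NIM in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```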
Acknowledgments
We would like to thank George Yuan, Alex Settle, and Chenjie Luo for their efforts in supporting this post.
FAQs
Q: What are the key features of Llama 3.2?
A: Llama 3.2 is a series of vision language models (VLMs) that support both text and image inputs, with variants including 11B and 90B parameters, as well as text-only small language model (SLM) variants with 1B and 3B parameters.
Q: What is the purpose of the Llama 3.2 optimization?
A: The purpose of the Llama 3.2 optimization is to deliver high-throughput, cost-efficient serving across millions of GPUs worldwide, from data center and cloud GPUs to local workstations and edge devices.
Q: What is the minimum latency performance of Llama 3.2?
A: The minimum latency performance of Llama 3.2 is shown in Table 2, measured on the NVIDIA HGX H200 platform using the same input and output sequence lengths and input image size as the maximum throughput results in Table 1.

