The New Generation of Llama AI Models
The newest generation of the popular Llama AI models is here with Llama 4 Scout and Llama 4 Maverick. Accelerated by NVIDIA open-source software, Llama 4 Scout can achieve over 40K output tokens per second on NVIDIA Blackwell B200 GPUs, and both models are available to try as NVIDIA NIM microservices.
Native Multimodal and Multilingual Capabilities
The Llama 4 models are natively multimodal and multilingual, built on a mixture-of-experts (MoE) architecture. They deliver a range of multimodal capabilities while advancing scale, speed, and efficiency, enabling you to build more personalized experiences.
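To make the MoE idea concrete, here is a minimal, illustrative sketch of top-k expert routing in PyTorch. This is not the Llama 4 implementation; the layer sizes, router, and top_k value are arbitrary toy choices, included only to show why an MoE model can have many total parameters but far fewer active per token.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    """Illustrative mixture-of-experts layer with top-k routing.

    Each token is routed to only `top_k` of `num_experts` expert MLPs,
    so only a fraction of the layer's parameters are active per token.
    (Toy example only; not the Llama 4 architecture.)
    """

    def __init__(self, d_model: int = 64, num_experts: int = 16, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)  # scores each expert per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        logits = self.router(x)                             # (tokens, experts)
        weights, indices = logits.topk(self.top_k, dim=-1)  # keep top-k experts
        weights = F.softmax(weights, dim=-1)                # normalize their scores
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e                # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(8, 64)
layer = ToyMoELayer()
print(layer(tokens).shape)  # torch.Size([8, 64])
```

Because only the routed experts run for each token, inference cost scales with the active parameter count (17B for both Llama 4 models) rather than the total parameter count.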
Llama 4 Scout and Llama 4 Maverick
Llama 4 Scout is a 109B-parameter model with 16 experts, of which 17B parameters are active per token. It offers a 10M-token context window and is optimized and quantized to int4 to run on a single NVIDIA H100 GPU. This enables a variety of use cases, including multi-document summarization, parsing extensive user activity for personalized tasks, and reasoning over vast codebases.
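As a back-of-envelope check on the single-GPU claim (my arithmetic, not an official sizing guide), int4 stores each weight in half a byte, so the weights alone fit comfortably in an 80 GB H100, leaving headroom for activations and KV cache:

```python
params = 109e9               # total Llama 4 Scout parameters
bytes_per_weight = 0.5       # int4 = 4 bits = 0.5 bytes per weight
weight_gb = params * bytes_per_weight / 1e9
print(f"{weight_gb:.1f} GB")  # ~54.5 GB of weights vs. 80 GB of H100 memory
```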
Llama 4 Maverick is a 400B-parameter model with 128 experts, likewise with 17B parameters active per token, and supports a 1M-token context window. The model delivers high-performance image and text understanding.
Optimized for NVIDIA TensorRT-LLM
NVIDIA optimized both Llama 4 Scout and Llama 4 Maverick for NVIDIA TensorRT-LLM, an open-source library that accelerates LLM inference performance for the latest foundation models on NVIDIA GPUs.
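For orientation, here is a minimal sketch of running a model through TensorRT-LLM's Python LLM API. The model identifier is an assumption based on Hugging Face naming conventions, and the API surface can vary between TensorRT-LLM releases, so verify both against the project's documentation.

```python
# Minimal TensorRT-LLM LLM API sketch (verify against your installed version).
from tensorrt_llm import LLM, SamplingParams

# Assumed model identifier; substitute the checkpoint you actually use.
llm = LLM(model="meta-llama/Llama-4-Scout-17B-16E-Instruct")

prompts = ["Summarize the benefits of a mixture-of-experts architecture."]
params = SamplingParams(max_tokens=128, temperature=0.7)

for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```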
Performance Optimizations
On the Blackwell B200 GPU, TensorRT-LLM delivers over 40K output tokens per second with the NVIDIA-optimized FP8 version of Llama 4 Scout, and over 30K tokens per second with Llama 4 Maverick.
Blackwell B200 GPU
Blackwell delivers massive performance leaps through architectural innovations, including a second-generation Transformer Engine, fifth-generation NVLink, and FP8, FP6, and FP4 precision, which raise performance for both training and inference. For Llama 4, these advancements deliver 3.4x higher throughput and 2.6x better cost per token compared to the NVIDIA H200.
Post-train Llama Models for Higher Accuracy
Fine-tuning the Llama models is seamless with NVIDIA NeMo, an end-to-end framework built for customizing large language models (LLMs) with your enterprise data. Start by curating high-quality pretraining or fine-tuning datasets using NeMo Curator, which helps extract, filter, and deduplicate structured and unstructured data at scale.
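As an illustration of the curation step, here is a minimal sketch using NeMo Curator's exact-deduplication module. The file path and field names are placeholders, and the API may differ between NeMo Curator releases, so treat this as a starting point and consult the current documentation.

```python
# Sketch of exact deduplication with NeMo Curator (API may differ by release).
from nemo_curator import ExactDuplicates
from nemo_curator.datasets import DocumentDataset

# Placeholder path; expects JSONL records with "id" and "text" fields.
dataset = DocumentDataset.read_json("my_finetune_data/*.jsonl")

# Hash each document's text and group identical hashes to find duplicates.
dedup = ExactDuplicates(id_field="id", text_field="text", hash_method="md5")
duplicates = dedup(dataset)

# Inspect the duplicate set, then drop all but one copy per group
# before fine-tuning.
print(f"Found {len(duplicates.df)} duplicate documents")
```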
Simplifying Deployments with NVIDIA NIM
To make them straightforward for enterprises to adopt, the Llama 4 models will be packaged as NVIDIA NIM microservices, making them easy to deploy on any GPU-accelerated infrastructure with flexibility, data privacy, and enterprise-grade security.
Get Started Today
Try the Llama 4 NIM microservices to experiment with your own data and build a proof of concept by integrating the NVIDIA-hosted API endpoint into your application.
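A quick way to build that proof of concept is to call the NVIDIA-hosted, OpenAI-compatible endpoint with the standard openai Python client. The base URL and model name below reflect the conventions on build.nvidia.com at the time of writing; confirm both there, and supply your own API key.

```python
# Call the NVIDIA-hosted Llama 4 endpoint (OpenAI-compatible API).
# Base URL and model name are assumptions; confirm them on build.nvidia.com.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=os.environ["NVIDIA_API_KEY"],  # your build.nvidia.com / NGC key
)

response = client.chat.completions.create(
    model="meta/llama-4-scout-17b-16e-instruct",
    messages=[{"role": "user",
               "content": "Give me three uses for a 10M-token context window."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```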
Conclusion
The Llama 4 models represent a significant leap forward in AI, offering native multimodal and multilingual capabilities, accelerated performance, and simplified deployment. With the combination of NVIDIA open-source software and NVIDIA NIM microservices, developers, researchers, and businesses can innovate responsibly across a wide variety of applications.
FAQs
Q: What are the key features of Llama 4 Scout and Llama 4 Maverick?
A: Llama 4 Scout is a 109B-parameter MoE model with 16 experts and a 10M-token context window; Llama 4 Maverick is a 400B-parameter MoE model with 128 experts and a 1M-token context window. Both activate 17B parameters per token.
Q: What is the performance of Llama 4 Scout and Llama 4 Maverick on NVIDIA GPUs?
A: With TensorRT-LLM on NVIDIA Blackwell B200 GPUs, Llama 4 Scout can achieve over 40K output tokens per second and Llama 4 Maverick over 30K tokens per second.
Q: How can I fine-tune Llama models for higher accuracy?
A: You can fine-tune Llama models using NVIDIA NeMo, an end-to-end framework built for customizing large language models (LLMs) with your enterprise data.
Q: How can I deploy Llama models in production?
A: You can deploy Llama models as NVIDIA NIM microservices, which makes it easy to deploy them on any GPU-accelerated infrastructure with flexibility, data privacy, and enterprise-grade security.

