The New Generation of Llama AI Models
The newest generation of the popular Llama AI models is here with Llama 4 Scout and Llama 4 Maverick. Accelerated by NVIDIA open-source software, Llama 4 Scout can achieve over 40K output tokens per second on NVIDIA Blackwell B200 GPUs, and both models are available to try as NVIDIA NIM microservices.
Native Multimodal and Multilingual Capabilities
The Llama 4 models are natively multimodal and multilingual, built on a mixture-of-experts (MoE) architecture. They deliver a range of multimodal capabilities while advancing scale, speed, and efficiency, enabling you to build more personalized experiences.
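To make the MoE idea concrete, here is a minimal, illustrative sketch of top-k expert routing in PyTorch. This is not the Llama 4 implementation; the layer sizes, router, and top_k value are arbitrary toy choices, included only to show why an MoE model can have many total parameters but far fewer active per token.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    """Illustrative mixture-of-experts layer with top-k routing.

    Each token is routed to only `top_k` of `num_experts` expert MLPs,
    so only a fraction of the layer's parameters are active per token.
    (Toy example only; not the Llama 4 architecture.)
    """

    def __init__(self, d_model: int = 64, num_experts: int = 16, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)  # scores each expert per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        logits = self.router(x)                             # (tokens, experts)
        weights, indices = logits.topk(self.top_k, dim=-1)  # keep top-k experts
        weights = F.softmax(weights, dim=-1)                # normalize their scores
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e                # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(8, 64)
layer = ToyMoELayer()
print(layer(tokens).shape)  # torch.Size([8, 64])
```

Because only the routed experts run for each token, inference cost scales with the active parameter count (17B for both Llama 4 models) rather than the total parameter count.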
Llama 4 Scout and Llama 4 Maverick
Llama 4 Scout is a 109B-parameter model with 16 experts, of which 17B parameters are active per token. It offers a 10M-token context window and is optimized and quantized to int4 to run on a single NVIDIA H100 GPU. This enables a variety of use cases, including multi-document summarization, parsing extensive user activity for personalized tasks, and reasoning over vast codebases.
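As a back-of-envelope check on the single-GPU claim (my arithmetic, not an official sizing guide), int4 stores each weight in half a byte, so the weights alone fit comfortably in an 80 GB H100, leaving headroom for activations and KV cache:

```python
params = 109e9               # total Llama 4 Scout parameters
bytes_per_weight = 0.5       # int4 = 4 bits = 0.5 bytes per weight
weight_gb = params * bytes_per_weight / 1e9
print(f"{weight_gb:.1f} GB")  # ~54.5 GB of weights vs. 80 GB of H100 memory
```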
Llama 4 Maverick is a 400B-parameter model with 128 experts, likewise with 17B parameters active per token, and supports a 1M-token context window. The model delivers high-performance image and text understanding.
Optimized for NVIDIA TensorRT-LLM
NVIDIA optimized both Llama 4 Scout and Llama 4 Maverick for NVIDIA TensorRT-LLM, an open-source library that accelerates LLM inference performance for the latest foundation models on NVIDIA GPUs.
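For orientation, here is a minimal sketch of running a model through TensorRT-LLM's Python LLM API. The model identifier is an assumption based on Hugging Face naming conventions, and the API surface can vary between TensorRT-LLM releases, so verify both against the project's documentation.

```python
# Minimal TensorRT-LLM LLM API sketch (verify against your installed version).
from tensorrt_llm import LLM, SamplingParams

# Assumed model identifier; substitute the checkpoint you actually use.
llm = LLM(model="meta-llama/Llama-4-Scout-17B-16E-Instruct")

prompts = ["Summarize the benefits of a mixture-of-experts architecture."]
params = SamplingParams(max_tokens=128, temperature=0.7)

for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```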
Performance Optimizations
On the Blackwell B200 GPU, TensorRT-LLM delivers over 40K output tokens per second with the NVIDIA-optimized FP8 version of Llama 4 Scout, and over 30K tokens per second with Llama 4 Maverick.
Blackwell B200 GPU
Blackwell delivers massive performance leaps through architectural innovations, including a second-generation Transformer Engine, fifth-generation NVLink, and FP8, FP6, and FP4 precision, which raise performance for both training and inference. For Llama 4, these advancements deliver 3.4x higher throughput and 2.6x better cost per token compared to the NVIDIA H200.
Post-train Llama Models for Higher Accuracy
Fine-tuning the Llama models is seamless with NVIDIA NeMo, an end-to-end framework built for customizing large language models (LLMs) with your enterprise data. Start by curating high-quality pretraining or fine-tuning datasets using NeMo Curator, which helps extract, filter, and deduplicate structured and unstructured data at scale.
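As an illustration of the curation step, here is a minimal sketch using NeMo Curator's exact-deduplication module. The file path and field names are placeholders, and the API may differ between NeMo Curator releases, so treat this as a starting point and consult the current documentation.

```python
# Sketch of exact deduplication with NeMo Curator (API may differ by release).
from nemo_curator import ExactDuplicates
from nemo_curator.datasets import DocumentDataset

# Placeholder path; expects JSONL records with "id" and "text" fields.
dataset = DocumentDataset.read_json("my_finetune_data/*.jsonl")

# Hash each document's text and group identical hashes to find duplicates.
dedup = ExactDuplicates(id_field="id", text_field="text", hash_method="md5")
duplicates = dedup(dataset)

# Inspect the duplicate set, then drop all but one copy per group
# before fine-tuning.
print(f"Found {len(duplicates.df)} duplicate documents")
```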
Simplifying Deployments with NVIDIA NIM
To make them straightforward for enterprises to adopt, the Llama 4 models will be packaged as NVIDIA NIM microservices, making them easy to deploy on any GPU-accelerated infrastructure with flexibility, data privacy, and enterprise-grade security.
Get Started Today
Try the Llama 4 NIM microservices to experiment with your own data and build a proof of concept by integrating the NVIDIA-hosted API endpoint into your application.
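A quick way to build that proof of concept is to call the NVIDIA-hosted, OpenAI-compatible endpoint with the standard openai Python client. The base URL and model name below reflect the conventions on build.nvidia.com at the time of writing; confirm both there, and supply your own API key.

```python
# Call the NVIDIA-hosted Llama 4 endpoint (OpenAI-compatible API).
# Base URL and model name are assumptions; confirm them on build.nvidia.com.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=os.environ["NVIDIA_API_KEY"],  # your build.nvidia.com / NGC key
)

response = client.chat.completions.create(
    model="meta/llama-4-scout-17b-16e-instruct",
    messages=[{"role": "user",
               "content": "Give me three uses for a 10M-token context window."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```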
Conclusion
The Llama 4 models represent a significant leap forward in AI, offering native multimodal and multilingual capabilities, accelerated performance, and simplified deployment. With the combination of NVIDIA open-source software and NVIDIA NIM microservices, developers, researchers, and businesses can innovate responsibly across a wide variety of applications.
FAQs
Q: What are the key features of Llama 4 Scout and Llama 4 Maverick?
A: Llama 4 Scout is a 109B-parameter MoE model with 16 experts and a 10M-token context window; Llama 4 Maverick is a 400B-parameter MoE model with 128 experts and a 1M-token context window. Both activate 17B parameters per token.
Q: What is the performance of Llama 4 Scout and Llama 4 Maverick on NVIDIA GPUs?
A: With TensorRT-LLM on NVIDIA Blackwell B200 GPUs, Llama 4 Scout can achieve over 40K output tokens per second and Llama 4 Maverick over 30K tokens per second.
Q: How can I fine-tune Llama models for higher accuracy?
A: You can fine-tune Llama models using NVIDIA NeMo, an end-to-end framework built for customizing large language models (LLMs) with your enterprise data.
Q: How can I deploy Llama models in production?
A: You can deploy Llama models as NVIDIA NIM microservices, which makes it easy to deploy them on any GPU-accelerated infrastructure with flexibility, data privacy, and enterprise-grade security.

