Mistral NeMo 12B: A State-of-the-Art Large Language Model
Introduction
NVIDIA and Mistral AI recently unveiled Mistral NeMo 12B, a state-of-the-art large language model (LLM). By consistently outperforming similarly sized models on a wide range of benchmarks, Mistral NeMo 12B has set a new standard for language modeling.
Mistral NeMo Minitron 8B
Building on the success of Mistral NeMo 12B, we announced Mistral-NeMo-Minitron 8B, one of the most advanced open-access models in its size class. This model consistently delivers leading accuracy on nine popular benchmarks. The Mistral-NeMo-Minitron 8B base model was obtained by width-pruning the Mistral NeMo 12B base model, followed by a light retraining process using knowledge distillation.
Model Pruning and Distillation
Model pruning is the process of making a model smaller and leaner, either by dropping layers (depth pruning) or by dropping neurons, attention heads, and embedding channels (width pruning). Pruning is often accompanied by some amount of retraining for accuracy recovery. Model distillation is a technique used to transfer knowledge from a large, complex model, often called the teacher model, to a smaller, simpler student model.
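Conceptually, distillation trains the student to match the teacher's output distribution rather than only the ground-truth labels. A minimal PyTorch sketch of a distillation loss is shown below; the temperature and the exact loss form are illustrative assumptions, not the precise recipe used for these models:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """KL divergence between the teacher's and student's token distributions."""
    # Soften both distributions with the temperature before comparing.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # reduction="batchmean" matches the mathematical definition of KL divergence;
    # the temperature**2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature**2

# Toy usage: a batch of 2 positions over a 10-token vocabulary.
student = torch.randn(2, 10)
teacher = torch.randn(2, 10)
loss = distillation_loss(student, teacher, temperature=2.0)
```

In practice this loss (or a variant of it) is minimized over the retraining dataset while the teacher's weights stay frozen.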
Iterative Pruning and Distillation
The combination of model pruning followed by light retraining through distillation is an effective and cost-efficient approach to train a family of models. For each additional model, just 100-400B tokens are used for retraining—a greater than 40x reduction compared to training from scratch.
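The iterative approach can be sketched as a simple loop: each smaller model in the family is pruned from, and then distilled against, the previous (larger) one. In this sketch, `prune_fn` and `distill_fn` are hypothetical placeholders for the real pruning and retraining steps:

```python
def build_model_family(base_model, target_sizes, prune_fn, distill_fn):
    """Iterative pruning + distillation sketch.

    base_model   -- the largest trained model (e.g., a 12B base model)
    target_sizes -- successively smaller target sizes
    prune_fn     -- placeholder: structurally prunes a model to a target size
    distill_fn   -- placeholder: lightly retrains the student against the teacher
    """
    family = []
    teacher = base_model
    for size in target_sizes:
        student = prune_fn(teacher, size)        # drop layers / neurons / heads
        student = distill_fn(teacher, student)   # light retraining (~100-400B tokens)
        family.append(student)
        teacher = student  # the distilled model becomes the next teacher
    return family
```

The key saving is that only the largest model is trained from scratch; every subsequent model reuses the previous one's weights and needs only a fraction of the original token budget.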
Best Practices for Structured Weight Pruning and Knowledge Distillation
We summarized the learnings from extensive ablation studies into 10 best practices for structured weight pruning combined with knowledge distillation. We found that width pruning consistently outperforms depth pruning and, most importantly, that pruned and distilled models outperform models trained from scratch in quality.
Mistral-NeMo-Minitron 8B
Following our best practices, we width-pruned the Mistral NeMo 12B model to obtain an 8B target model. This section details the steps and parameters used to obtain the Mistral-NeMo-Minitron 8B base model, as well as its performance.
Teacher Fine-Tuning
To correct for the distribution shift between the data the model was originally trained on and our distillation dataset, we first fine-tuned the unpruned Mistral NeMo 12B model on our dataset using 127B tokens. Experiments showed that, without this correction, the teacher provides suboptimal guidance on the dataset during distillation.
Width-Only Pruning
Given our goal of obtaining the strongest 8B model possible, we proceeded with width-only pruning. We pruned both the embedding (hidden) and MLP intermediate dimensions along the width axis to compress Mistral NeMo 12B.
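A minimal sketch of width-pruning the MLP intermediate dimension of a single block is shown below. Here, neurons are ranked by the L2 norm of their input weights for simplicity; in practice, importance is typically estimated from activations on a small calibration set, and the embedding dimension is pruned analogously. All names and the importance metric are illustrative:

```python
import torch
import torch.nn as nn

def prune_mlp_width(fc_in: nn.Linear, fc_out: nn.Linear, keep: int):
    """Keep the `keep` most important intermediate neurons of an MLP block.

    fc_in  -- projects hidden dim -> intermediate dim
    fc_out -- projects intermediate dim -> hidden dim
    """
    # One importance score per intermediate neuron (row of fc_in.weight).
    importance = fc_in.weight.norm(dim=1)
    idx = importance.topk(keep).indices.sort().values

    new_in = nn.Linear(fc_in.in_features, keep, bias=fc_in.bias is not None)
    new_out = nn.Linear(keep, fc_out.out_features, bias=fc_out.bias is not None)
    with torch.no_grad():
        new_in.weight.copy_(fc_in.weight[idx])       # rows = kept neurons
        if fc_in.bias is not None:
            new_in.bias.copy_(fc_in.bias[idx])
        new_out.weight.copy_(fc_out.weight[:, idx])  # columns = kept neurons
        if fc_out.bias is not None:
            new_out.bias.copy_(fc_out.bias)
    return new_in, new_out
```

Because entire neurons are removed (rather than individual weights being zeroed), the pruned block is a smaller dense model that runs faster with no special sparse kernels, and its retained weights serve as the initialization for distillation.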
Distillation Parameters
We distilled the model with a peak learning rate of 1e-4, a minimum learning rate of 4.5e-7, a linear warmup of 60 steps, a cosine decay schedule, and a global batch size of 768, using 380B tokens (the same dataset used for teacher fine-tuning).
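The schedule above (linear warmup followed by cosine decay between the peak and minimum rates) can be written as a small function. The hyperparameter values are taken from the text; the exact functional form is a common convention and an assumption here:

```python
import math

def lr_at_step(step, total_steps, peak_lr=1e-4, min_lr=4.5e-7, warmup_steps=60):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Linear ramp from ~0 up to the peak over the warmup steps.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

With a global batch size of 768, a 380B-token run corresponds to the `total_steps` over which the cosine decay is stretched.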
Mistral-NeMo-Minitron-8B-Instruct
We applied an advanced alignment technique consisting of two-stage instruction fine-tuning and two-stage preference optimization, resulting in a state-of-the-art instruct model with excellent performance in instruction following, language reasoning, function calling, and safety benchmarks.
Performance Benchmarks
We optimized the Mistral-NeMo-Minitron-8B-Base model, the teacher Mistral-NeMo-12B model, and the Llama-3.1-8B model with NVIDIA TensorRT-LLM, an open-source toolkit for optimized LLM inference.
Conclusion
Mistral-NeMo-Minitron-8B provides class-leading accuracy and consistently outperforms recently introduced state-of-the-art models of similar size. It is our first work on distilling the Mistral-NeMo-12B model, and its results strongly support our best practices for structured weight pruning combined with knowledge distillation.
Frequently Asked Questions
Q: What is the key innovation behind Mistral-NeMo-Minitron 8B?
A: The key innovation is the combination of model pruning followed by light retraining through distillation, which is an effective and cost-efficient approach to train a family of models.
Q: What is the difference between depth pruning and width pruning?
A: Depth pruning involves dropping layers, while width pruning involves dropping neurons, attention heads, and embedding channels.
Q: How does Mistral NeMo Minitron 8B compare to other models of similar size?
A: Mistral NeMo Minitron 8B consistently outperforms other models of similar size on a wide range of benchmarks.
Q: How does the performance of Mistral NeMo Minitron 8B change when deployed in FP8 precision?
A: Deployment in FP8 delivers a performance boost of ~1.4x across all three models compared to BF16.

