NVIDIA Blackwell Delivers World-Record DeepSeek-R1 Inference Performance

NVIDIA Announces World-Record DeepSeek-R1 Inference Performance

NVIDIA announced world-record DeepSeek-R1 inference performance at NVIDIA GTC 2025. A single NVIDIA DGX system with eight NVIDIA Blackwell GPUs can achieve over 250 tokens per second per user or a maximum throughput of over 30,000 tokens per second on the massive, state-of-the-art 671 billion parameter DeepSeek-R1 model. These rapid advancements in performance at both ends of the performance spectrum were made possible by improvements to the NVIDIA open ecosystem of inference developer tools, now optimized for the NVIDIA Blackwell architecture.
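To put the two headline figures in everyday terms, the short sketch below converts them into per-token latency, response time, and per-GPU throughput. It is only arithmetic on the numbers quoted above. The function names and the 1,000-token response length are illustrative assumptions, and the calculation ignores time to first token. Because both figures are stated as "over", the results are bounds rather than measurements.

```python
def inter_token_latency_ms(tokens_per_second_per_user: float) -> float:
    """Average time between consecutive output tokens for one user, in milliseconds."""
    return 1000.0 / tokens_per_second_per_user


def response_time_s(num_tokens: int, tokens_per_second_per_user: float) -> float:
    """Time to stream num_tokens output tokens to one user (excludes time to first token)."""
    return num_tokens / tokens_per_second_per_user


def per_gpu_throughput(system_tokens_per_second: float, num_gpus: int) -> float:
    """System-wide throughput divided evenly across the GPUs in the system."""
    return system_tokens_per_second / num_gpus


if __name__ == "__main__":
    # Figures quoted in the article; the 1,000-token response is a hypothetical example.
    print(inter_token_latency_ms(250))   # at most this many ms per token
    print(response_time_s(1000, 250))    # at most this many seconds for 1,000 tokens
    print(per_gpu_throughput(30000, 8))  # at least this many tokens/s per GPU
```

The per-user and maximum-throughput figures describe two different operating points, so they should not be combined into a single number of concurrent users.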

These performance records will improve as the NVIDIA platform continues to push the limits of inference on the latest NVIDIA Blackwell Ultra GPUs and NVIDIA Blackwell GPUs.

Figure 1: NVIDIA B200 GPUs in an NVL8 configuration, running TensorRT-LLM software, deliver the highest published tokens per second per user on the full DeepSeek-R1 671B model

Crafting High-Performance Blackwell Kernels with CUTLASS

CUTLASS, since its 2017 debut, has been instrumental for researchers and developers implementing high-performance CUDA kernels on NVIDIA GPUs. By providing developers with comprehensive tools to design custom operations, such as GEMMs and Convolutions, targeting NVIDIA Tensor Cores, it has been critical for the development of hardware-aware algorithms, powering breakthroughs like FlashAttention and establishing itself as a cornerstone for GPU-accelerated computing.

With the release of CUTLASS 3.8, we’re extending support to the NVIDIA Blackwell architecture, enabling developers to harness next-generation Tensor Cores with support for all new data types. This includes the new narrow-precision MX formats and NVIDIA’s own FP4, so developers can optimize custom algorithms and production workloads for the latest hardware. Figure 7 shows that CUTLASS achieves up to 98% of relative peak performance for Tensor Core operations.
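To give a feel for how coarse a 4-bit floating-point type is, the sketch below rounds a block of values to the E2M1 grid (1 sign bit, 2 exponent bits, 1 mantissa bit) using a shared power-of-two scale per block. This is a simplified illustration of block-scaled narrow-precision quantization, not the CUTLASS implementation and not an exact rendering of any specific format. The function names, the scale-selection rule, and the tie-breaking behavior are assumptions, and real formats also fix the block size and the encoding of the scale, which this sketch does not.

```python
import math

# Non-negative magnitudes representable by a 4-bit E2M1 float.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
E2M1_MAX_EXPONENT = 2  # the largest magnitude is 1.5 * 2**2 = 6.0


def block_scale(block):
    """Pick a power-of-two scale from the largest magnitude in the block (assumed rule)."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0
    return 2.0 ** (math.floor(math.log2(amax)) - E2M1_MAX_EXPONENT)


def quantize_block(block):
    """Return (scale, dequantized values) after rounding each element to the E2M1 grid."""
    scale = block_scale(block)
    dequantized = []
    for x in block:
        # Scale down, then saturate at the largest representable magnitude.
        magnitude = min(abs(x) / scale, E2M1_MAGNITUDES[-1])
        # Round to the nearest grid point (ties go to the smaller magnitude here).
        nearest = min(E2M1_MAGNITUDES, key=lambda g: abs(g - magnitude))
        dequantized.append(math.copysign(nearest, x) * scale)
    return scale, dequantized


if __name__ == "__main__":
    scale, values = quantize_block([0.1, -0.7, 3.0, 6.0])
    print(scale, values)
```

Only eight magnitudes are available per element, which is why a per-block scale matters: it shifts that small grid to wherever the block's values actually live.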

Figure 7: Performance comparison of FP16, BF16, TF32, INT8, FP8, and FP4 for CUTLASS Blackwell GEMMs

Benchmarks were performed on a B200 system. M=K=16384 and N=17290.
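For readers who want to relate the problem size to a "relative peak" percentage, the sketch below uses the conventional GEMM operation count of 2·M·N·K (one multiply and one add per inner-product term). The helper names are illustrative, and the elapsed time and peak throughput in the usage example are hypothetical placeholders, not measured or published values.

```python
def gemm_flops(m: int, n: int, k: int) -> int:
    """Floating-point operations for C = A @ B with A of shape (m, k) and B of shape (k, n)."""
    return 2 * m * n * k


def achieved_tflops(flops: float, seconds: float) -> float:
    """Achieved throughput in TFLOP/s for a kernel that ran for `seconds`."""
    return flops / seconds / 1e12


def relative_peak(achieved: float, peak: float) -> float:
    """Fraction of the hardware peak that was achieved (both in the same unit)."""
    return achieved / peak


if __name__ == "__main__":
    # Problem size quoted for the benchmark.
    flops = gemm_flops(16384, 17290, 16384)
    print(f"{flops / 1e12:.2f} TFLOP per GEMM")

    # Hypothetical placeholders: substitute a measured time and the data type's peak.
    hypothetical_seconds = 0.01
    hypothetical_peak_tflops = 1000.0
    tflops = achieved_tflops(flops, hypothetical_seconds)
    print(f"{relative_peak(tflops, hypothetical_peak_tflops):.1%} of the assumed peak")
```

The peak value depends on the data type, so each bar in a chart like Figure 7 is normalized against a different denominator.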

OpenAI Triton Support for Blackwell

The OpenAI Triton compiler now also supports Blackwell, giving developers and researchers access to the latest architecture features from a Python-based compiler. With this support, Triton can achieve near-optimal performance on several critical use cases. To learn more, see OpenAI Triton on NVIDIA Blackwell Boosts AI Performance and Programmability, co-authored by NVIDIA and OpenAI.

Summary

The NVIDIA Blackwell architecture incorporates many breakthrough capabilities that help accelerate generative AI inference, including a second-generation Transformer Engine with FP4 Tensor Cores and fifth-generation NVLink with NVLink Switch. NVIDIA announced world-record DeepSeek-R1 inference performance at NVIDIA GTC 2025. A single NVIDIA DGX system with eight NVIDIA Blackwell GPUs can achieve over 250 tokens per second per user or a maximum throughput of over 30,000 tokens per second on the massive, state-of-the-art 671 billion parameter DeepSeek-R1 model.

A rich suite of libraries, now optimized for NVIDIA Blackwell, will enable developers to achieve significant increases in inference performance for both today’s AI models and tomorrow’s evolving landscape. Learn more about the NVIDIA AI Inference platform and stay informed about the latest AI inference performance updates.

Acknowledgements

This work would not have been possible without the exceptional contributions of many, including Matthew Nicely, Nick Comly, Gunjan Mehta, Rajeev Rao, Dave Michael, Yiheng Zhang, Brian Nguyen, Asfiya Baig, Akhil Goel, Paulius Micikevicius, June Yang, Alex Settle, Kai Xu, Zhiyu Cheng, and Chenjie Luo.

Frequently Asked Questions

Q1: What is NVIDIA Blackwell?

A: NVIDIA Blackwell is a new
