NVIDIA Blackwell Boosts AI Performance and Programmability

Performance Advances on NVIDIA Blackwell

The NVIDIA Blackwell architecture introduces substantial improvements in both raw computing power and architectural innovations. NVIDIA’s collaboration with OpenAI has focused on leveraging these capabilities transparently through Triton’s compiler infrastructure, particularly in two key areas: matrix multiplications and new precision formats.

Matrix Multiplications

The NVIDIA Blackwell architecture adds a brand-new Tensor Core designed from the ground up for improved throughput and energy efficiency. By extending Triton’s Matrix Multiply-Accumulate (MMA) pipelining machinery, we’ve enabled automatic exploitation of NVIDIA Blackwell’s new Tensor Cores. This required careful analysis of memory access patterns and sophisticated compiler transformations to ensure correct and efficient compute/data-movement overlap.

The result is exceptional performance for both FP8 and FP16 GEMM operations out of the box, with these optimizations automatically applying to any kernel using Triton’s tl.dot primitive. Overall, Triton manages to achieve near-optimal performance, comparable to library implementations across several critical use cases.

Figure 1. Performance improvements with Triton on NVIDIA Blackwell
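To make this concrete, here is a minimal sketch of an FP16 GEMM written against tl.dot. The kernel and wrapper names are ours, boundary masking is omitted for brevity (so M, N, and K are assumed to be multiples of the block sizes), and the tile sizes are illustrative rather than tuned; the point is that nothing in the kernel is Blackwell-specific, yet the compiler maps tl.dot onto the new Tensor Cores and pipelines the loads against it automatically.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                  stride_am, stride_ak, stride_bk, stride_bn,
                  stride_cm, stride_cn,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    # Each program instance computes one BLOCK_M x BLOCK_N tile of C = A @ B.
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for _ in range(0, K, BLOCK_K):
        a = tl.load(a_ptrs)  # no masking: assumes the problem sizes divide the block sizes
        b = tl.load(b_ptrs)
        # tl.dot is the primitive the compiler lowers onto the Tensor Cores;
        # the MMA pipelining described above is applied here automatically.
        acc = tl.dot(a, b, acc)
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk
    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.store(c_ptrs, acc.to(tl.float16))


def matmul(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    M, K = a.shape
    _, N = b.shape
    c = torch.empty((M, N), device=a.device, dtype=torch.float16)
    grid = (triton.cdiv(M, 128), triton.cdiv(N, 128))
    matmul_kernel[grid](a, b, c, M, N, K,
                        a.stride(0), a.stride(1), b.stride(0), b.stride(1),
                        c.stride(0), c.stride(1),
                        BLOCK_M=128, BLOCK_N=128, BLOCK_K=64)
    return c
```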

Flash Attention

Flash attention, a crucial primitive in modern transformer architectures, sees significant speedups on NVIDIA Blackwell through Triton, with up to a 1.5x speedup for FP16 attention over the NVIDIA Hopper GPU architecture. While we continue to optimize absolute performance through ongoing compiler enhancements for FP8 and other precisions, the current work helps customers transition existing products to NVIDIA Blackwell on Day 0. Notably, this performance gain comes “for free”: existing Triton flash attention implementations require no code changes.

Figure 2. Flash attention performance for NVIDIA Blackwell compared to NVIDIA Hopper, showing large performance gains for more complex workloads
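Since no source changes are required, measuring the gain is as simple as re-timing an existing kernel on the new hardware. The sketch below assumes you already have some Triton flash attention callable (the attention(q, k, v) name is a hypothetical stand-in for whatever implementation you ship today) and times it with triton.testing.do_bench.

```python
import torch
import triton


def bench_attention(attention, batch=4, heads=32, seq_len=4096, head_dim=128):
    # `attention` is any existing Triton flash attention callable taking (q, k, v);
    # the same unmodified kernel is timed on Hopper and on Blackwell.
    q, k, v = (torch.randn(batch, heads, seq_len, head_dim,
                           device="cuda", dtype=torch.float16) for _ in range(3))
    ms = triton.testing.do_bench(lambda: attention(q, k, v))
    # Forward-pass FLOPs: two matmuls (QK^T and PV), 2*M*N*K each.
    flops = 4.0 * batch * heads * seq_len * seq_len * head_dim
    print(f"{ms:.3f} ms  |  {flops / ms * 1e-9:.1f} TFLOP/s")
```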

New Precision Formats

NVIDIA Blackwell introduces revolutionary block-scaled floating point formats, including the Open Compute Project (OCP) microscaling formats, which Triton now unlocks for NVIDIA Blackwell-powered hardware acceleration. These formats provide higher average precision at higher performance than the non-native block-scaling techniques frequently emulated in LLM inference projects today. For OCP format support, MXFP8 GEMMs in Triton deliver performance on par with the FP8 GEMMs shown earlier in this post, while performing the block scaling natively in the Tensor Core. Similarly, MXFP4 provides a new operating point in the precision-performance trade-off space, offering double the hardware-accelerated throughput of FP8 and MXFP8 GEMMs.

To learn more about the new block-scaled floating point support, take a look at the new Triton tutorial dedicated to this functionality.
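As a rough illustration of the structure of these formats, the sketch below mimics MX-style block scaling in plain PyTorch: every 32 consecutive elements share a single power-of-two (E8M0) scale, and the rescaled elements would then be stored as FP8 E4M3 for MXFP8 or FP4 E2M1 for MXFP4. On NVIDIA Blackwell this rescaling is applied natively inside the Tensor Core rather than emulated like this; the helper name and the round-up scale choice are our own simplifications, not part of Triton or the OCP specification.

```python
import torch


def mx_block_quantize(x: torch.Tensor, block: int = 32):
    """Toy MX-style block quantization (structure only, not a production codec)."""
    assert x.numel() % block == 0
    x = x.reshape(-1, block)
    # One shared scale per block: a power of two covering the block's max magnitude.
    amax = x.abs().amax(dim=1, keepdim=True).clamp(min=1e-30)
    exponents = torch.ceil(torch.log2(amax))   # E8M0 stores only this exponent
    scaled = x / torch.exp2(exponents)         # elements now fit a narrow FP8/FP4 range
    return scaled, exponents.squeeze(1).to(torch.int32)


# 128 values -> 4 blocks of 32, each with its own shared exponent.
elements, block_exponents = mx_block_quantize(torch.randn(128))
print(elements.shape, block_exponents.shape)  # torch.Size([4, 32]) torch.Size([4])
```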

Areas of Improvement Going Forward

The layout and packing of sub-byte datatype formats like MXFP4 still require care from the end user. We look forward to working with the community to improve the ergonomics for kernel authors and to enable seamless framework integrations.
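For example, MXFP4 elements are four bits wide, so two element codes share each byte, and kernel authors currently have to produce that packed layout themselves. The toy helper below packs already-quantized 4-bit codes two per byte; the nibble order and flat layout are illustrative assumptions on our part, not the layout any particular Triton kernel expects, which is precisely the ergonomic gap described above.

```python
import torch


def pack_fp4_pairs(codes: torch.Tensor) -> torch.Tensor:
    """Pack 4-bit element codes (values 0..15, e.g. FP4 E2M1 bit patterns) two per byte."""
    assert codes.dtype == torch.uint8 and codes.numel() % 2 == 0
    pairs = codes.reshape(-1, 2)
    # Low nibble = even-indexed element, high nibble = odd-indexed element (an arbitrary choice here).
    return (pairs[:, 0] | (pairs[:, 1] << 4)).to(torch.uint8)


packed = pack_fp4_pairs(torch.randint(0, 16, (64,), dtype=torch.uint8))
print(packed.shape)  # torch.Size([32]): half as many bytes as elements
```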

More Information

Philippe Tillet, the creator of Triton, and NVIDIA will be diving into the details of this NVIDIA Blackwell work and the resulting performance at the NVIDIA GTC conference on March 17.

Register to attend GTC 2025 virtually or attend live.

This release establishes a powerful foundation for NVIDIA Blackwell support in Triton—but it’s just the beginning. Here’s how you can help shape what’s next:

Start building with Triton on NVIDIA Blackwell today and unlock the full potential of NVIDIA’s latest architecture while maintaining complete control over your development.

Have ideas or encountered issues? Contact our NVIDIA product manager Matthew Nicely by tagging him on GitHub.

FAQs

Q: What are the key benefits of Triton on NVIDIA Blackwell?
A: Triton on NVIDIA Blackwell provides exceptional performance for both FP8 and FP16 GEMM operations, with automatic exploitation of NVIDIA Blackwell’s new Tensor Cores.

Q: How does Triton improve performance for flash attention?
A: Triton’s flash attention performance on NVIDIA Blackwell achieves up to a 1.5x speedup for FP16 attention over the NVIDIA Hopper GPU architecture.

Q: What are the new precision formats supported by Triton on NVIDIA Blackwell?
A: Triton on NVIDIA Blackwell supports revolutionary block-scaled floating point formats, including the Open Compute Project’s microscaling formats, providing higher average precision at higher performance than non-native block-scaling techniques.

Q: How can I get started with Triton on NVIDIA Blackwell?
A: Start building with Triton on NVIDIA Blackwell today and unlock the full potential of NVIDIA’s latest architecture while maintaining complete control over your development.
