Automating GPU Kernel Generation with DeepSeek-R1 and Inference Time Scaling

The Need for Optimized Attention Kernels and Associated Challenges

Attention is a key concept that has revolutionized the development of large language models (LLMs). It allows AI models to focus selectively on the most relevant parts of input data, making better predictions and finding hidden patterns. However, the computational complexity of the attention operation grows quadratically with the input sequence length, motivating the need for optimized lower-level implementations (GPU kernels) to prevent runtime errors and improve computational efficiency.
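To make the quadratic cost concrete, here is a minimal NumPy sketch of scaled dot-product attention. The (n, n) score matrix is the term whose time and memory grow quadratically with sequence length n, which is exactly what optimized GPU kernels work around:

```python
import numpy as np

def naive_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V.

    Materializing the (n, n) score matrix makes both time and memory
    grow quadratically with the sequence length n.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                 # (n, n) -- the quadratic term
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # (n, d)

# Doubling n quadruples the score-matrix size (n*n entries):
n, d = 4, 8
rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, n, d))
out = naive_attention(Q, K, V)
assert out.shape == (n, d)
```

Optimized kernels such as FlashAttention-style implementations avoid materializing the full (n, n) matrix, which is why hand-tuned GPU code matters at long sequence lengths.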

Creating an optimized GPU kernel for attention is a challenging task that requires significant skill and time, even for experienced software engineers. It involves developing a deep understanding of the problem, designing an effective solution, and implementing it efficiently. Recent LLMs like DeepSeek-R1 have shown promise in code generation tasks, but they still struggle to produce optimized code on the first try. This makes it necessary to use additional strategies at inference time to arrive at optimized code.

Inference-Time Scaling for Generating Optimized GPU Kernels

To generate optimized GPU kernels, NVIDIA engineers have created a new workflow that includes a special verifier along with the DeepSeek-R1 model during inference in a closed-loop fashion for a predetermined duration. The workflow is initialized by a manual prompt, and the DeepSeek-R1 model generates the GPU code (kernel) in the first pass. The verifier runs on an NVIDIA H100 GPU, analyzing the generated kernel and creating new prompts that are provided as input to the DeepSeek-R1 model.

The closed-loop approach improves the code generation process by steering it differently on each iteration. The team found that letting this process continue for 15 minutes resulted in an improved attention kernel.
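The generate-verify loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not NVIDIA's implementation: `model` and `verifier` are hypothetical callables standing in for the DeepSeek-R1 inference endpoint and the H100-based verifier, and the feedback-to-prompt format is invented for illustration.

```python
import time

def closed_loop_kernel_search(model, verifier, initial_prompt, budget_s=15 * 60):
    """Closed-loop inference-time scaling (sketch, hypothetical interfaces).

    `model(prompt)` returns candidate kernel source code;
    `verifier(code)` returns (is_correct, feedback). The loop runs for a
    predetermined time budget, folding the verifier's analysis back into
    the next prompt on every iteration.
    """
    prompt, best = initial_prompt, None
    deadline = time.monotonic() + budget_s
    while time.monotonic() < deadline:
        code = model(prompt)
        ok, feedback = verifier(code)
        if ok:
            best = code  # keep the latest verified kernel
        # Re-prompt with the verifier's feedback to guide the next pass.
        prompt = initial_prompt + "\n\nVerifier feedback:\n" + feedback
    return best

# Toy stand-ins to exercise the loop shape:
def toy_model(prompt):
    return "kernel_v2" if "Verifier feedback" in prompt else "kernel_v1"

def toy_verifier(code):
    return (code == "kernel_v2", "fix shared-memory tiling")

best = closed_loop_kernel_search(
    toy_model, toy_verifier, "Write an attention kernel", budget_s=0.5
)
```

The key design point is that the verifier, not the model, decides when a candidate counts as correct, and its feedback is what differentiates each pass from a simple retry.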

Results

The workflow produced numerically correct kernels for 100% of Level-1 problems and 96% of Level-2 problems, as tested by Stanford’s KernelBench benchmark. The Level-1 solving rate in KernelBench refers to the numerical-correctness metric used to evaluate the ability of LLMs to generate efficient GPU kernels for specific computational tasks.
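A numerical-correctness check of this kind boils down to comparing a candidate kernel's output against a trusted reference on the same inputs, within a floating-point tolerance. The sketch below illustrates the idea in NumPy; the function name and tolerances are illustrative, not KernelBench's actual harness:

```python
import numpy as np

def numerically_correct(candidate_fn, reference_fn, inputs, rtol=1e-3, atol=1e-3):
    """Correctness check in the spirit of KernelBench (illustrative sketch):
    a generated kernel counts as solved only if its output matches a
    reference implementation within tolerance on the same inputs."""
    got = candidate_fn(*inputs)
    want = reference_fn(*inputs)
    return got.shape == want.shape and np.allclose(got, want, rtol=rtol, atol=atol)

# Example: a candidate that reorders a reduction still matches the reference
# within tolerance, while an off-by-one version is rejected.
x = np.random.default_rng(0).standard_normal((128, 64)).astype(np.float32)
ref = lambda a: a.sum(axis=1)
cand = lambda a: a[:, ::-1].sum(axis=1)  # same sum, different accumulation order
assert numerically_correct(cand, ref, (x,))
```

Tolerance-based comparison matters because optimized kernels legitimately reorder floating-point operations, so bit-exact equality with the reference is too strict a bar.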

Figure 3 shows the performance of the automatically generated, optimized attention kernels compared with flex attention.

Inference-Time Scaling Results

Figure 4 shows the number of numerically correct kernels generated over time. The line approaches 95% at ~10 minutes and 100% at 20 minutes.

Conclusion

These results show how the latest DeepSeek-R1 model can be used to generate better GPU kernels by using more computing power during inference time. This is still a new research area with early results on a promising approach that automatically generates effective attention kernels. While we are off to a good start, more work is needed to generate better results consistently for a wider variety of problems. We’re excited about the recent developments in DeepSeek-R1 and its potential.

FAQs

Q: What is the need for optimized attention kernels?
A: The computational complexity of the attention operation grows quadratically with the input sequence length, so optimized lower-level implementations (GPU kernels) are needed to prevent runtime errors and improve computational efficiency.

Q: What is the DeepSeek-R1 model?
A: The DeepSeek-R1 model is a recent large language model that has shown promise in code generation tasks.

Q: What is the workflow for generating optimized GPU kernels?
A: The workflow involves a special verifier along with the DeepSeek-R1 model during inference in a closed-loop fashion for a predetermined duration.

Q: What are the results of the workflow?
A: The workflow produced numerically correct kernels for 100% of Level-1 problems and 96% of Level-2 problems, as tested by Stanford’s KernelBench benchmark.
