Large Language Models for Code Generation: Unlocking the Power of Lookahead Decoding
Qwen2.5-Coder models
The Qwen2.5-Coder models have achieved state-of-the-art performance across popular academic benchmarks. NVIDIA TensorRT-LLM has optimized three popular models from the Qwen2.5-Coder family – the 1.5B, 7B, and 32B versions – for high throughput and low latency.
Lookahead Decoding
Lookahead decoding is a speculative decoding technique that addresses the slow, token-by-token nature of autoregressive generation in LLMs. Instead of producing one token per step, lookahead decoding guesses and verifies multiple tokens in a single step, trading spare GPU compute (FLOPs) for lower latency.
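The guess-and-verify idea can be illustrated with a toy simulation. This is not TensorRT-LLM code: the "model" below is a deterministic function over a repeating character pattern, and the fixed 3-gram guess stands in for lookahead's n-gram pool. The point is the acceptance rule: every verified token comes from the model itself, so output is identical to autoregressive decoding, but each step can emit several tokens when guesses match.

```python
def greedy_next(context):
    # Toy deterministic "model": the next token is read from a fixed pattern.
    pattern = "abcabd"
    return pattern[len(context) % len(pattern)]

def verify(context, guess):
    """Accept the longest prefix of `guess` that matches the model's own
    greedy choices. The token at the first mismatch is still the model's
    correct output, so at least one token is produced per step."""
    accepted = []
    for g in guess:
        t = greedy_next(context + "".join(accepted))
        accepted.append(t)      # always keep the model's token
        if t != g:              # speculation failed beyond this point
            break
    return accepted

context = ""
steps = 0
while len(context) < 12:
    guess = "abc"  # stand-in for an n-gram drawn from the lookahead pool
    context += "".join(verify(context, guess))
    steps += 1

# 12 tokens are generated in far fewer than 12 steps.
print(context, steps)
```

With this toy model, 12 tokens are produced in 4 steps rather than 12, while the output stays byte-identical to plain greedy decoding; real lookahead decoding performs the same verification across a whole window of candidate n-grams in parallel on the GPU.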
Benefits of Lookahead Decoding
- Improves GPU utilization and reduces latency
- Increases throughput without additional training or fine-tuning
- Does not require a separate draft model
Steps to Run Lookahead Decoding with TensorRT-LLM
- Install TensorRT-LLM
- Run lookahead decoding in TensorRT-LLM using the high-level API
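The second step can be sketched with TensorRT-LLM's high-level Python API. This is a hedged sketch, not verified against a specific release: it assumes the `LLM`, `SamplingParams`, and `LookaheadDecodingConfig` classes exposed by `tensorrt_llm` / `tensorrt_llm.llmapi`, whose exact names and parameters may differ across versions, and it requires a supported NVIDIA GPU to actually run. The model ID and the lookahead parameters (window size, n-gram size, verification set size) are illustrative choices.

```python
# Sketch: enabling lookahead decoding via TensorRT-LLM's high-level API.
# Assumes TensorRT-LLM is installed (e.g. `pip install tensorrt-llm`) and a
# supported GPU is present; the import is guarded so the sketch degrades
# gracefully elsewhere.
try:
    from tensorrt_llm import LLM, SamplingParams
    from tensorrt_llm.llmapi import LookaheadDecodingConfig
    HAVE_TRTLLM = True
except ImportError:
    HAVE_TRTLLM = False

def build_lookahead_llm(model="Qwen/Qwen2.5-Coder-7B-Instruct"):
    """Construct an LLM with lookahead speculative decoding enabled.
    The three sizes below control the lookahead window, the candidate
    n-gram length, and how many n-grams are verified per step."""
    config = LookaheadDecodingConfig(
        max_window_size=4,
        max_ngram_size=4,
        max_verification_set_size=4,
    )
    return LLM(model=model, speculative_config=config)

if __name__ == "__main__":
    if HAVE_TRTLLM:
        llm = build_lookahead_llm()
        outputs = llm.generate(
            ["Write a Python function that computes Fibonacci numbers."],
            SamplingParams(max_tokens=128),
        )
        print(outputs[0].outputs[0].text)
    else:
        print("tensorrt_llm is not installed; see the install step above.")
```

Because lookahead decoding only changes how tokens are proposed and verified, no separate draft model is loaded and the generated text matches what the same model would produce autoregressively.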
Performance Gains
The Qwen2.5-Coder models have demonstrated up to a 3.4x throughput boost with TensorRT-LLM lookahead decoding on NVIDIA DGX H200, compared with baseline autoregressive decoding.
Summary
Lookahead speculative decoding boosts LLM throughput without any additional training, fine-tuning, or draft models. We presented benchmarked performance improvements on the Qwen2.5-Coder models. Visit build.nvidia.com to try the Qwen2.5-Coder models optimized with NVIDIA TensorRT-LLM for free.
Acknowledgments
We would like to thank Liwei Ma, Fanrong Li, Nikita Korobov, and Martin Marciniszyn Mehringer for their efforts in supporting this post.
FAQs
Q: What is lookahead decoding?
A: Lookahead decoding is a speculative decoding technique that generates multiple tokens simultaneously, utilizing the parallel processing capabilities of the GPU.
Q: What are the benefits of lookahead decoding?
A: Lookahead decoding improves GPU utilization and reduces latency, increasing throughput without additional training or fine-tuning.
Q: How do I run lookahead decoding with TensorRT-LLM?
A: Follow the steps provided in the article to install and run lookahead decoding with TensorRT-LLM using the high-level API.

