Achieving Throughput Speedups with Speculative Decoding
TensorRT-LLM support for speculative decoding now delivers more than a 3x speedup in total token throughput. TensorRT-LLM is an open-source library that provides blazing-fast inference for numerous popular large language models (LLMs) on NVIDIA GPUs.
Speculative Decoding
Speculative decoding, also referred to as speculative sampling, pays a small additional computation cost to speculatively generate the next several tokens, then uses the target model in a built-in verification step to confirm those tokens. This preserves the quality of the generated output while boosting throughput.
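As a rough mental model, each decoding step lets a small draft model propose a short run of tokens, and the large target model then checks all of them in a single forward pass. The toy Python sketch below illustrates that propose/verify loop with greedy verification; the `draft_next_token` and `target_logits` stubs are placeholders for real models, not TensorRT-LLM APIs.

```python
import numpy as np

VOCAB = 100
rng = np.random.default_rng(0)

def draft_next_token(tokens):
    """Toy stand-in for a small, cheap draft model."""
    return int(rng.integers(0, VOCAB))

def target_logits(tokens):
    """Toy stand-in for the large target model: one forward pass returns
    next-token logits for every position, so all draft tokens can be
    verified at once."""
    return rng.standard_normal((len(tokens), VOCAB))

def speculative_step(tokens, num_draft=5):
    # 1. The draft model speculates `num_draft` tokens autoregressively.
    draft, ctx = [], list(tokens)
    for _ in range(num_draft):
        t = draft_next_token(ctx)
        draft.append(t)
        ctx.append(t)

    # 2. The target model scores the context plus draft in a single pass.
    logits = target_logits(tokens + draft)

    # 3. Greedy verification: keep the longest prefix of draft tokens that
    #    matches the target model's own greedy choices, then append the
    #    target's token at the first mismatch (or a bonus token if all match).
    accepted = []
    for i, t in enumerate(draft):
        target_choice = int(np.argmax(logits[len(tokens) - 1 + i]))
        if target_choice == t:
            accepted.append(t)
        else:
            accepted.append(target_choice)
            break
    else:
        accepted.append(int(np.argmax(logits[-1])))
    return tokens + accepted

print(speculative_step([1, 2, 3]))
```

Because the target model runs only once per batch of draft tokens, every accepted draft token is output the target model did not have to generate autoregressively, which is where the throughput gain comes from.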
Achieving Throughput Speedups
Table 1 and Figure 1 show the difference in throughput (output tokens/second) between no draft model (that is, no speculative decoding) and varying-sized draft models paired with the Llama 3.1 405B target model.
One GPU
Table 2 and Figure 2 demonstrate the difference in throughput (output tokens/second) between no draft model (that is, no speculative decoding) and varying-sized draft models paired with the Llama 3.1 70B target model.
Steps to Get Speculative Decoding Working
- Download draft models
- Install dependencies; TensorRT-LLM requires Python 3.10
- apt-get update && apt-get -y install python3.10 python3-pip openmpi-bin libopenmpi-dev git git-lfs
- git clone -b v0.14.0 https://github.com/NVIDIA/TensorRT-LLM.git
- cd TensorRT-LLM
- pip3 install tensorrt_llm -U --extra-index-url https://pypi.nvidia.com
- python3 -c "import tensorrt_llm"
- Compile the downloaded model checkpoints into the draft and target TRT engines
- cd examples
- python3 quantization/quantize.py --model_dir <path to draft model checkpoint> --dtype float16 --qformat fp8 --kv_cache_dtype fp8 --output_dir /ckpt-draft --calib_size 512 --tp_size 4
- python3 quantization/quantize.py --model_dir <path to Llama 3.1 405B checkpoint> --dtype float16 --qformat fp8 --kv_cache_dtype fp8 --output_dir ./ckpt-target-405b --calib_size 512 --tp_size 4
- trtllm-build --checkpoint_dir ./ckpt-draft --output_dir ./draft-engine --gpt_attention_plugin float16 --workers 4 --gemm_plugin=fp8 --reduce_fusion disable --use_paged_context_fmha=enable --use_fused_mlp enable --multiple_profiles enable --max_batch_size=32 --max_num_tokens=8192 --max_seq_len=131072 --low_latency_gemm_plugin fp8 --speculative_decoding_mode=draft_tokens_external --max_draft_len 10
- trtllm-build --checkpoint_dir ./ckpt-target-405b --output_dir ./target-engine --gpt_attention_plugin float16 --workers 4 --gemm_plugin=fp8 --use_paged_context_fmha=enable --use_fused_mlp enable --multiple_profiles enable --max_batch_size=32 --max_num_tokens=8192 --max_seq_len=131072 --low_latency_gemm_plugin fp8 --speculative_decoding_mode=draft_tokens_external --max_draft_len 10
- Run speculative decoding in TensorRT-LLM
- mpirun -n 8 --allow-run-as-root python3 ./run.py --tokenizer_dir <path to draft model tokenizer> --draft_engine_dir ./draft-engine --engine_dir ./target-engine --draft_target_model_config="[10,[0,1,2,3,4,5,6,7],[0,1,2,3,4,5,6,7],False]" --kv_cache_free_gpu_memory_fraction=0.35 --max_output_len=1024 --kv_cache_enable_block_reuse --input_text="Implement a program to find the common elements in two arrays without using any extra data structures."
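In the run command above, the value passed to --draft_target_model_config packs several settings into one string. The reading below is an assumption based on the draft-target-model example in the TensorRT-LLM repository: the fields are the maximum draft length, the GPU IDs serving the draft engine, the GPU IDs serving the target engine, and whether logits are used for token acceptance.

```python
import ast

# Hedged reading of the --draft_target_model_config value used above.
# Field order is an assumption: [max_draft_len, draft-engine GPU IDs,
# target-engine GPU IDs, use logits for token acceptance].
config = "[10,[0,1,2,3,4,5,6,7],[0,1,2,3,4,5,6,7],False]"
max_draft_len, draft_gpus, target_gpus, use_logits = ast.literal_eval(config)

print(f"max draft tokens per step: {max_draft_len}")
print(f"draft engine GPUs:         {draft_gpus}")
print(f"target engine GPUs:        {target_gpus}")
print(f"accept with logits:        {use_logits}")
```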
Benchmarking Throughput Performance without Speculative Decoding
To benchmark throughput performance without speculative decoding, run the following commands:
- trtllm-build --checkpoint_dir ./ckpt-target-405b --output_dir /data/405B-TRT/ --gpt_attention_plugin float16 --workers 4 --max_batch_size 32 --max_seq_len 131072 --max_num_tokens 8192 --use_fused_mlp enable --reduce_fusion enable --use_paged_context_fmha enable --multiple_profiles enable --gemm_plugin fp8
- python3 /app/tensorrt_llm/benchmarks/cpp/prepare_dataset.py --output token-norm-dist.json --tokenizer /llama-3_1-405b/ token-norm-dist --num-requests 1000 --input-mean 500 --input-stdev 0 --output-mean 200 --output-stdev 0 > /tmp/synthetic.txt
- trtllm-bench --model <model name> latency --engine_dir /data/405B-TRT/ --dataset /tmp/synthetic.txt
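The headline metric in both the speculative and non-speculative runs is output tokens per second. As a quick sanity check on the synthetic workload defined above (1,000 requests, 200 output tokens each), the arithmetic is simply:

```python
# Output tokens/second for the synthetic workload above.
# The elapsed time is a hypothetical placeholder, not a measured result.
num_requests = 1000          # --num-requests
output_tokens_per_req = 200  # --output-mean with --output-stdev 0
elapsed_seconds = 100.0      # hypothetical wall-clock time for the full run

throughput = num_requests * output_tokens_per_req / elapsed_seconds
print(f"{throughput:,.0f} output tokens/second")
```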
TensorRT-LLM with Triton Inference Server
TensorRT-LLM speculative decoding is also supported with the NVIDIA Triton Inference Server backend for production-ready deployments.
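As a rough illustration of what a client request against such a deployment might look like, the sketch below posts to Triton's HTTP generate endpoint. The port, the model name ensemble, and the text_input/max_tokens/text_output field names are assumptions that depend on how the TensorRT-LLM backend's model repository is configured; adjust them to match your deployment.

```python
import requests

# Hypothetical client call against a Triton + TensorRT-LLM deployment.
# URL, model name, and field names are assumptions; check your model
# repository configuration for the actual values.
url = "http://localhost:8000/v2/models/ensemble/generate"
payload = {
    "text_input": "Implement a program to find the common elements "
                  "in two arrays without using any extra data structures.",
    "max_tokens": 256,
}

response = requests.post(url, json=payload, timeout=120)
response.raise_for_status()
print(response.json().get("text_output"))
```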
Summary
TensorRT-LLM provides several features for optimizing and efficiently running large language models across different model architectures. For more information about low-latency optimizations and improved throughput, see the related TensorRT-LLM posts on the NVIDIA Technical Blog.