Achieving Throughput Speedups with Speculative Decoding
TensorRT-LLM support for speculative decoding now delivers more than a 3x speedup in total token throughput. TensorRT-LLM is an open-source library that provides blazing-fast inference for numerous popular large language models (LLMs) on NVIDIA GPUs.
Speculative Decoding
Speculative decoding, also referred to as speculative sampling, pays a small additional computation cost to speculatively generate the next several tokens, then uses the target model in a built-in verification step to confirm those tokens. This preserves the quality of the generated output while boosting throughput.
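As a rough mental model, each decoding step lets a small draft model propose a short run of tokens, and the large target model then checks all of them in a single forward pass. The toy Python sketch below illustrates that propose/verify loop with greedy verification; the `draft_next_token` and `target_logits` stubs are placeholders for real models, not TensorRT-LLM APIs.

```python
import numpy as np

VOCAB = 100
rng = np.random.default_rng(0)

def draft_next_token(tokens):
    """Toy stand-in for a small, cheap draft model."""
    return int(rng.integers(0, VOCAB))

def target_logits(tokens):
    """Toy stand-in for the large target model: one forward pass returns
    next-token logits for every position, so all draft tokens can be
    verified at once."""
    return rng.standard_normal((len(tokens), VOCAB))

def speculative_step(tokens, num_draft=5):
    # 1. The draft model speculates `num_draft` tokens autoregressively.
    draft, ctx = [], list(tokens)
    for _ in range(num_draft):
        t = draft_next_token(ctx)
        draft.append(t)
        ctx.append(t)

    # 2. The target model scores the context plus draft in a single pass.
    logits = target_logits(tokens + draft)

    # 3. Greedy verification: keep the longest prefix of draft tokens that
    #    matches the target model's own greedy choices, then append the
    #    target's token at the first mismatch (or a bonus token if all match).
    accepted = []
    for i, t in enumerate(draft):
        target_choice = int(np.argmax(logits[len(tokens) - 1 + i]))
        if target_choice == t:
            accepted.append(t)
        else:
            accepted.append(target_choice)
            break
    else:
        accepted.append(int(np.argmax(logits[-1])))
    return tokens + accepted

print(speculative_step([1, 2, 3]))
```

Because the target model runs only once per batch of draft tokens, every accepted draft token is output the target model did not have to generate autoregressively, which is where the throughput gain comes from.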
Achieving Throughput Speedups
Table 1 and Figure 1 show the difference in throughput (output tokens/second) between no draft model (that is, no speculative decoding) and varying-sized draft models paired with the Llama 3.1 405B target model.
One GPU
Table 2 and Figure 2 demonstrate the difference in throughput (output tokens/second) between no draft model (that is, no speculative decoding) and varying-sized draft models paired with the Llama 3.1 70B target model.
Steps to Get Speculative Decoding Working
- Download draft models
- Install dependencies; TensorRT-LLM requires Python 3.10
- apt-get update && apt-get -y install python3.10 python3-pip openmpi-bin libopenmpi-dev git git-lfs
- git clone -b v0.14.0 https://github.com/NVIDIA/TensorRT-LLM.git
- cd TensorRT-LLM
- pip3 install tensorrt_llm -U --extra-index-url https://pypi.nvidia.com
- python3 -c "import tensorrt_llm"
- Compile the downloaded model checkpoints into the draft and target TRT engines
- cd examples
- python3 quantization/quantize.py --model_dir <path to draft model checkpoint> --dtype float16 --qformat fp8 --kv_cache_dtype fp8 --output_dir /ckpt-draft --calib_size 512 --tp_size 4
- python3 quantization/quantize.py --model_dir <path to Llama 3.1 405B checkpoint> --dtype float16 --qformat fp8 --kv_cache_dtype fp8 --output_dir ./ckpt-target-405b --calib_size 512 --tp_size 4
- trtllm-build --checkpoint_dir ./ckpt-draft --output_dir ./draft-engine --gpt_attention_plugin float16 --workers 4 --gemm_plugin=fp8 --reduce_fusion disable --use_paged_context_fmha=enable --use_fused_mlp enable --multiple_profiles enable --max_batch_size=32 --max_num_tokens=8192 --max_seq_len=131072 --low_latency_gemm_plugin fp8 --speculative_decoding_mode=draft_tokens_external --max_draft_len 10
- trtllm-build --checkpoint_dir ./ckpt-target-405b --output_dir ./target-engine --gpt_attention_plugin float16 --workers 4 --gemm_plugin=fp8 --use_paged_context_fmha=enable --use_fused_mlp enable --multiple_profiles enable --max_batch_size=32 --max_num_tokens=8192 --max_seq_len=131072 --low_latency_gemm_plugin fp8 --speculative_decoding_mode=draft_tokens_external --max_draft_len 10
- Run speculative decoding in TensorRT-LLM
- mpirun -n 8 --allow-run-as-root python3 ./run.py --tokenizer_dir <path to draft model tokenizer> --draft_engine_dir ./draft-engine --engine_dir ./target-engine --draft_target_model_config="[10,[0,1,2,3,4,5,6,7],[0,1,2,3,4,5,6,7],False]" --kv_cache_free_gpu_memory_fraction=0.35 --max_output_len=1024 --kv_cache_enable_block_reuse --input_text="Implement a program to find the common elements in two arrays without using any extra data structures."
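In the run command above, the value passed to --draft_target_model_config packs several settings into one string. The reading below is an assumption based on the draft-target-model example in the TensorRT-LLM repository: the fields are the maximum draft length, the GPU IDs serving the draft engine, the GPU IDs serving the target engine, and whether logits are used for token acceptance.

```python
import ast

# Hedged reading of the --draft_target_model_config value used above.
# Field order is an assumption: [max_draft_len, draft-engine GPU IDs,
# target-engine GPU IDs, use logits for token acceptance].
config = "[10,[0,1,2,3,4,5,6,7],[0,1,2,3,4,5,6,7],False]"
max_draft_len, draft_gpus, target_gpus, use_logits = ast.literal_eval(config)

print(f"max draft tokens per step: {max_draft_len}")
print(f"draft engine GPUs:         {draft_gpus}")
print(f"target engine GPUs:        {target_gpus}")
print(f"accept with logits:        {use_logits}")
```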
Benchmarking Throughput Performance without Speculative Decoding
To benchmark throughput performance without speculative decoding, run the following commands:
- trtllm-build --checkpoint_dir ./ckpt-target-405b --output_dir /data/405B-TRT/ --gpt_attention_plugin float16 --workers 4 --max_batch_size 32 --max_seq_len 131072 --max_num_tokens 8192 --use_fused_mlp enable --reduce_fusion enable --use_paged_context_fmha enable --multiple_profiles enable --gemm_plugin fp8
- python3 /app/tensorrt_llm/benchmarks/cpp/prepare_dataset.py --output token-norm-dist.json --tokenizer /llama-3_1-405b/ token-norm-dist --num-requests 1000 --input-mean 500 --input-stdev 0 --output-mean 200 --output-stdev 0 > /tmp/synthetic.txt
- trtllm-bench --model <model name> latency --engine_dir /data/405B-TRT/ --dataset /tmp/synthetic.txt
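The headline metric in both the speculative and non-speculative runs is output tokens per second. As a quick sanity check on the synthetic workload defined above (1,000 requests, 200 output tokens each), the arithmetic is simply:

```python
# Output tokens/second for the synthetic workload above.
# The elapsed time is a hypothetical placeholder, not a measured result.
num_requests = 1000          # --num-requests
output_tokens_per_req = 200  # --output-mean with --output-stdev 0
elapsed_seconds = 100.0      # hypothetical wall-clock time for the full run

throughput = num_requests * output_tokens_per_req / elapsed_seconds
print(f"{throughput:,.0f} output tokens/second")
```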
TensorRT-LLM with Triton Inference Server
TensorRT-LLM speculative decoding is also supported with the NVIDIA Triton Inference Server backend for production-ready deployments.
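As a rough illustration of what a client request against such a deployment might look like, the sketch below posts to Triton's HTTP generate endpoint. The port, the model name ensemble, and the text_input/max_tokens/text_output field names are assumptions that depend on how the TensorRT-LLM backend's model repository is configured; adjust them to match your deployment.

```python
import requests

# Hypothetical client call against a Triton + TensorRT-LLM deployment.
# URL, model name, and field names are assumptions; check your model
# repository configuration for the actual values.
url = "http://localhost:8000/v2/models/ensemble/generate"
payload = {
    "text_input": "Implement a program to find the common elements "
                  "in two arrays without using any extra data structures.",
    "max_tokens": 256,
}

response = requests.post(url, json=payload, timeout=120)
response.raise_for_status()
print(response.json().get("text_output"))
```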
Summary
TensorRT-LLM provides several features for optimizing and efficiently running large language models across different model architectures. For more information about low-latency optimizations and improved throughput, see the related TensorRT-LLM posts on the NVIDIA Technical Blog.