TensorRT-LLM Speculative Decoding Boosts Inference Throughput

Achieving Throughput Speedups with Speculative Decoding

Support for speculative decoding in TensorRT-LLM now delivers over a 3x speedup in total token throughput. TensorRT-LLM is an open-source library that provides blazing-fast inference support for numerous popular large language models (LLMs) on NVIDIA GPUs.

Speculative Decoding

Speculative decoding, also referred to as speculative sampling, pays a small additional computation cost to speculatively generate the next several tokens with a lightweight draft model. The larger target model then verifies the drafted tokens in a single forward pass, preserving output quality while boosting throughput.
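
The following is a minimal, framework-agnostic sketch of this draft-then-verify loop in Python. The draft_model and target_model callables and the greedy acceptance rule are illustrative assumptions for exposition, not the TensorRT-LLM API:

    def speculative_decode(draft_model, target_model, prompt_ids, max_new_tokens, draft_len=10):
        tokens = list(prompt_ids)
        generated = 0
        while generated < max_new_tokens:
            # 1. The small draft model cheaply proposes the next draft_len tokens.
            draft = draft_model.generate(tokens, num_tokens=draft_len)

            # 2. The target model scores the prompt plus all drafted tokens in a
            #    single forward pass; preds[i] is its greedy choice for the token
            #    that follows position i of that combined sequence.
            preds = target_model.predict_next(tokens + draft)

            # 3. Accept drafted tokens while they match the target's own choices;
            #    on the first mismatch, keep the target's token and stop.
            accepted = []
            for i, d in enumerate(draft):
                t = preds[len(tokens) + i - 1]
                accepted.append(d if d == t else t)
                if d != t:
                    break
            else:
                # Every draft token was accepted, so the target's prediction after
                # the last drafted token comes for free as a bonus token.
                accepted.append(preds[len(tokens) + len(draft) - 1])

            tokens.extend(accepted)
            generated += len(accepted)
        return tokens

Under this greedy acceptance rule, every emitted token is one the target model would have produced on its own; the gain comes from accepting several draft tokens per target forward pass.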

Achieving Throughput Speedups

Table 1 and Figure 1 show the difference in throughput (output tokens/second) between no draft model (that is, no speculative decoding) and draft models of varying sizes paired with the Llama 3.1 405B target model.

Table 2 and Figure 2 show the difference in throughput (output tokens/second) between no draft model (that is, no speculative decoding) and draft models of varying sizes paired with the Llama 3.1 70B target model.

Steps to Get Speculative Decoding Working

  1. Download the draft and target model checkpoints
  2. Install dependencies (TensorRT-LLM requires Python 3.10)
  3. Compile the downloaded model checkpoints into the draft and target TRT engines
    • cd examples
    • python3 quantization/quantize.py --model_dir <path to draft model> --dtype float16 --qformat fp8 --kv_cache_dtype fp8 --output_dir ./ckpt-draft --calib_size 512 --tp_size 4
    • python3 quantization/quantize.py --model_dir <path to target model> --dtype float16 --qformat fp8 --kv_cache_dtype fp8 --output_dir ./ckpt-target-405b --calib_size 512 --tp_size 4
    • trtllm-build --checkpoint_dir ./ckpt-draft --output_dir ./draft-engine --gpt_attention_plugin float16 --workers 4 --gemm_plugin=fp8 --reduce_fusion disable --use_paged_context_fmha=enable --use_fused_mlp enable --multiple_profiles enable --max_batch_size=32 --max_num_tokens=8192 --max_seq_len=131072 --low_latency_gemm_plugin fp8 --speculative_decoding_mode=draft_tokens_external --max_draft_len 10
    • trtllm-build --checkpoint_dir ./ckpt-target-405b --output_dir ./target-engine --gpt_attention_plugin float16 --workers 4 --gemm_plugin=fp8 --use_paged_context_fmha=enable --use_fused_mlp enable --multiple_profiles enable --max_batch_size=32 --max_num_tokens=8192 --max_seq_len=131072 --low_latency_gemm_plugin fp8 --speculative_decoding_mode=draft_tokens_external --max_draft_len 10
  4. Run speculative decoding in TensorRT-LLM (a small helper for composing the --draft_target_model_config value is sketched after this list)
    • mpirun -n 8 --allow-run-as-root python3 ./run.py --tokenizer_dir <path to tokenizer> --draft_engine_dir ./draft-engine --engine_dir ./target-engine --draft_target_model_config="[10,[0,1,2,3,4,5,6,7],[0,1,2,3,4,5,6,7], False]" --kv_cache_free_gpu_memory_fraction=0.35 --max_output_len=1024 --kv_cache_enable_block_reuse --input_text="Implement a program to find the common elements in two arrays without using any extra data structures."
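
As noted in step 4, the following is a small, hypothetical helper for composing the --draft_target_model_config string passed to run.py. The field interpretation here (maximum draft length, draft-model GPU IDs, target-model GPU IDs, and a logits-exchange flag) is an assumption inferred from the example command, so check the TensorRT-LLM examples documentation for the authoritative format:

    # Hypothetical helper; field meanings are inferred from the example command
    # above rather than taken from the TensorRT-LLM documentation.
    def draft_target_model_config(max_draft_len, draft_device_ids, target_device_ids,
                                  use_logits=False):
        return "[{},[{}],[{}], {}]".format(
            max_draft_len,
            ",".join(str(d) for d in draft_device_ids),
            ",".join(str(d) for d in target_device_ids),
            use_logits,
        )

    # Reproduces the value used in the run.py command above:
    # "[10,[0,1,2,3,4,5,6,7],[0,1,2,3,4,5,6,7], False]"
    print(draft_target_model_config(10, range(8), range(8)))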

Benchmarking Throughput Performance without Speculative Decoding

To benchmark baseline throughput without speculative decoding, build a standalone target engine, prepare a synthetic dataset, and run trtllm-bench using the following commands (a simple speedup calculation is sketched after them):

  • trtllm-build --checkpoint_dir ./ckpt-target-405b --output_dir /data/405B-TRT/ --gpt_attention_plugin float16 --workers 4 --max_batch_size 32 --max_seq_len 131072 --max_num_tokens 8192 --use_fused_mlp enable --reduce_fusion enable --use_paged_context_fmha enable --multiple_profiles enable --gemm_plugin fp8
  • python3 /app/tensorrt_llm/benchmarks/cpp/prepare_dataset.py --output token-norm-dist.json --tokenizer /llama-3_1-405b/ token-norm-dist --num-requests 1000 --input-mean 500 --input-stdev 0 --output-mean 200 --output-stdev 0 > /tmp/synthetic.txt
  • trtllm-bench --model <model name> latency --engine_dir /data/405B-TRT/ --dataset /tmp/synthetic.txt
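
As mentioned above, once both runs complete, the reported output-token throughputs can be compared directly. The sketch below uses placeholder timings rather than measured results; substitute the numbers reported by trtllm-bench and run.py on your own hardware:

    NUM_REQUESTS = 1000          # --num-requests 1000
    OUTPUT_TOKENS_PER_REQ = 200  # --output-mean 200, --output-stdev 0

    def output_token_throughput(elapsed_seconds):
        # Output tokens per second over the whole synthetic workload.
        return NUM_REQUESTS * OUTPUT_TOKENS_PER_REQ / elapsed_seconds

    def speedup(spec_decoding_tps, baseline_tps):
        # Ratio of throughput with vs. without speculative decoding.
        return spec_decoding_tps / baseline_tps

    # Placeholder timings only (not measured results):
    baseline = output_token_throughput(2000.0)     # 100 output tokens/s
    with_spec = output_token_throughput(600.0)     # ~333 output tokens/s
    print(f"{speedup(with_spec, baseline):.1f}x")  # -> 3.3x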

TensorRT-LLM with Triton Inference Server

TensorRT-LLM speculative decoding is also supported with the NVIDIA Triton Inference Server backend for production-ready deployments.
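
For illustration, the sketch below shows a minimal Python client for such a deployment using the tritonclient package. It assumes the standard tensorrtllm_backend ensemble model exposing text_input, max_tokens, and text_output tensors on the default HTTP port; these names and deployment details are assumptions, so adjust them to match your Triton model repository:

    import numpy as np
    import tritonclient.http as httpclient

    # Connect to a locally running Triton server (default HTTP port).
    client = httpclient.InferenceServerClient(url="localhost:8000")

    # Tensor names below assume the standard tensorrtllm_backend "ensemble"
    # model; adjust them to whatever your model repository actually exposes.
    text = np.array([["Implement a program to find the common elements in two arrays."]],
                    dtype=object)
    max_tokens = np.array([[256]], dtype=np.int32)

    inputs = [
        httpclient.InferInput("text_input", list(text.shape), "BYTES"),
        httpclient.InferInput("max_tokens", list(max_tokens.shape), "INT32"),
    ]
    inputs[0].set_data_from_numpy(text)
    inputs[1].set_data_from_numpy(max_tokens)

    result = client.infer(model_name="ensemble", inputs=inputs)
    print(result.as_numpy("text_output"))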

Summary

TensorRT-LLM provides several features for optimizing and efficiently running large language models of different model architectures. For more information about low-latency optimizations and improved throughput, see the related posts on the NVIDIA Technical Blog.
