Boost Llama 3.3: 70B Inference Throughput 3x with NVIDIA TensorRT-LLM

Meta’s Llama collection of open large language models (LLMs) continues to grow with the recent addition of Llama 3.3 70B, a text-only instruction-tuned model. Llama 3.3 improves on the older Llama 3.1 70B model and can even match the capabilities of the larger, more computationally expensive Llama 3.1 405B model on several tasks, including math, reasoning, coding, and multilingual support.

NVIDIA TensorRT-LLM, a powerful inference engine that delivers state-of-the-art performance on the latest LLMs, incorporates many optimizations to deliver outstanding Llama 3.3 70B inference throughput. These include in-flight batching, KV caching, custom FP8 quantization, speculative decoding, and more for fast, cost-efficient LLM serving.

Optimizations for High-Performance Deep Learning Inference

TensorRT-LLM batches multiple requests simultaneously for higher serving throughput. By interleaving requests in the context and generation phases, in-flight batching reduces latency and improves GPU utilization, executing new requests while older requests are still in flight. Finished requests are evicted from the batch, making room for the next set of requests.
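To make the scheduling idea concrete, here is a toy simulation of in-flight batching. It is an illustration only, not the TensorRT-LLM scheduler: request lengths, slot counts, and the one-step context phase are all simplifying assumptions.

```python
# Toy simulation of in-flight (continuous) batching. Each request occupies a
# batch slot until it has produced all of its output tokens; finished requests
# are evicted immediately and pending requests are admitted into free slots,
# instead of waiting for the whole batch to drain (static batching).
from collections import deque

def inflight_batching(output_lengths, max_batch_size):
    """output_lengths: tokens each request must generate. Returns engine steps."""
    pending = deque(range(len(output_lengths)))
    active = {}  # request id -> tokens still to generate
    steps = 0
    while pending or active:
        # Admit new requests into any free slots.
        while pending and len(active) < max_batch_size:
            rid = pending.popleft()
            active[rid] = output_lengths[rid]
        steps += 1  # one forward pass serves every active request
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]  # evict finished request, freeing its slot

    return steps

# Static batching would take max(lengths) steps per batch (12 total here);
# in-flight batching does better whenever output lengths vary.
print(inflight_batching([2, 8, 3, 5, 4, 1], max_batch_size=4))  # → 8
```

The win comes entirely from short requests vacating their slots early, which is exactly what eviction plus immediate admission buys.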

Caching the key and value tensors of previous tokens avoids expensive recomputation of these tensors in the generation phase for the next set of tokens. These computational savings translate directly into higher throughput. However, the KV cache grows linearly with the number of batched requests and with sequence context length, leading to higher memory requirements.

TensorRT-LLM KV caching addresses these challenges through several optimizations, including support for paged KV cache, quantized KV cache, circular buffer KV cache, and KV cache reuse. Each of these optimizations addresses the challenging balance between growing memory size and avoiding unnecessary and expensive recomputation.
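The basic mechanism these optimizations build on can be sketched in a few lines. The NumPy example below is a minimal single-head attention step with a KV cache; the dimensions and random projection matrices are illustrative assumptions, not TensorRT-LLM internals.

```python
# Minimal single-head attention generation step with a KV cache.
# K/V of previous tokens are stored once and reused, so each new token costs
# O(seq_len) instead of recomputing all projections from scratch.
import numpy as np

d = 8  # illustrative head dimension
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def generate_step(x_new, kv_cache):
    """Process one new token embedding, reusing cached K/V of older tokens."""
    q = x_new @ Wq
    kv_cache["k"].append(x_new @ Wk)  # cache grows linearly with sequence length
    kv_cache["v"].append(x_new @ Wv)
    K = np.stack(kv_cache["k"])       # (seq_len, d) -- no recomputation of old K/V
    V = np.stack(kv_cache["v"])
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                # attention output for the new token only

cache = {"k": [], "v": []}
for _ in range(4):
    out = generate_step(rng.standard_normal(d), cache)
print(len(cache["k"]))  # → 4 cached key vectors after 4 generated tokens
```

Paged and quantized KV caches attack the memory side of this trade-off (how `K`/`V` are stored), while cache reuse attacks the compute side (sharing cached blocks across requests).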

Speculative decoding is a popular technique for faster, cost-effective LLM inference with built-in verification of output quality. It is based on the premise that generating multiple sequences of future (draft) tokens is more efficient than processing a single token per step of autoregressive decoding, an inherently time-consuming process. The target model determines how many of these draft tokens to accept, which is far more efficient than generating one token per iteration. TensorRT-LLM supports a growing list of speculative decoding techniques, including draft target, Medusa, Eagle, and lookahead decoding, among others.
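A minimal sketch of the draft-target loop helps make the acceptance logic concrete. This is a simplification under stated assumptions: greedy decoding, token-based matching (as in the token-acceptance mode configured later in this post), and stand-in callables instead of real models.

```python
# Sketch of one draft-target speculative decoding step with token matching.
# The draft model proposes draft_len tokens cheaply; the target model verifies
# them and keeps the longest matching prefix, so every step yields at least
# one target-verified token.
def speculative_step(draft_model, target_model, prefix, draft_len):
    # 1. Small draft model proposes draft_len tokens autoregressively.
    draft, ctx = [], list(prefix)
    for _ in range(draft_len):
        tok = draft_model(ctx)
        draft.append(tok)
        ctx.append(tok)
    # 2. Target model scores prefix + draft (one batched forward pass in a
    #    real engine) and emits its own prediction at every position.
    target = [target_model(list(prefix) + draft[:i]) for i in range(draft_len + 1)]
    # 3. Accept draft tokens while they match the target; the first mismatch
    #    is replaced by the target's token and the rest are discarded.
    accepted = []
    for i, tok in enumerate(draft):
        if tok == target[i]:
            accepted.append(tok)
        else:
            accepted.append(target[i])
            break
    else:
        accepted.append(target[draft_len])  # all matched: bonus target token
    return accepted

# Toy models that "predict" the next integer; the draft diverges after 2 tokens.
tgt = lambda ctx: ctx[-1] + 1
drf = lambda ctx: ctx[-1] + 1 if len(ctx) < 6 else 0
print(speculative_step(drf, tgt, [1, 2, 3, 4], draft_len=4))  # → [5, 6, 7]
```

When the draft model agrees with the target often, each target forward pass emits several tokens instead of one, which is where the throughput speedups below come from.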

Achieving Throughput Speedups with Draft Target Speculative Decoding

Table 1 and Figure 2 highlight the throughput (output tokens/second) speedups of draft models of various sizes, compared against no draft model (that is, no speculative decoding), with the Llama 3.3 70B target model.

Steps to Reproduce Performance Gains

Download the following model checkpoints from Hugging Face and store them in a directory for easy access through the setup process.

git lfs install

# Download target models
git clone https://huggingface.co/meta-llama/Meta-Llama-3.3-70B-Instruct

# Download draft models
git clone https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct
git clone https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct
git clone https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct

After the model checkpoints have been downloaded, install TensorRT-LLM.

# Obtain and start the basic docker image environment (optional).
docker run --rm --ipc=host --runtime=nvidia --gpus all --entrypoint /bin/bash \
  -it nvidia/cuda:12.5.1-devel-ubuntu22.04

# Install dependencies; TensorRT-LLM requires Python 3.10
apt-get update && apt-get -y install python3.10 python3-pip \
  openmpi-bin libopenmpi-dev git git-lfs

# Fetch the library
git clone -b v0.15.0 https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM

# Install TensorRT-LLM from NVIDIA's PyPI index.
pip3 install tensorrt_llm -U --extra-index-url https://pypi.nvidia.com

# Check installation
python3 -c "import tensorrt_llm"

Next, compile the downloaded model checkpoints into draft and target TensorRT engines. These engines are optimized to run inference with best accuracy and highest throughput.

cd examples

# Steps to build target and draft models in FP8 precision on 1 H200

# Create FP8 checkpoints

python3 quantization/quantize.py \
--model_dir=<path/to/draft/model/repo> \
--output_dir=./ckpt-draft \
--dtype float16 --qformat fp8 --kv_cache_dtype fp8 \
--calib_size 512 --tp_size 1

python3 quantization/quantize.py \
--model_dir=<path/to/Meta-Llama-3.3-70B-Instruct> \
--output_dir=./ckpt-target-70b \
--dtype=float16 --qformat fp8 --kv_cache_dtype fp8 \
--calib_size 512 --tp_size 1

# Build draft and target engines
# Important flags for the engine build process:
# --use_paged_context_fmha=enable must be specified, since KV cache reuse is
# required for the draft/target models.

# --speculative_decoding_mode=draft_tokens_external and --max_draft_len must
# be specified for the target model.

trtllm-build \
--checkpoint_dir=./ckpt-draft \
--output_dir=./draft-engine \
--gpt_attention_plugin float16 \
--workers 1 \
--gemm_plugin=fp8 \
--use_paged_context_fmha=enable \
--multiple_profiles enable \
--max_batch_size=32 \
--max_seq_len=131072

trtllm-build \
--checkpoint_dir=./ckpt-target-70b \
--output_dir=./target-engine \
--gpt_attention_plugin float16 \
--workers 1 \
--gemm_plugin=fp8 \
--use_paged_context_fmha=enable \
--multiple_profiles enable \
--max_batch_size=32 \
--max_seq_len=131072 \
--low_latency_gemm_plugin fp8 \
--speculative_decoding_mode=draft_tokens_external \
--max_draft_len 10

Finally, run speculative decoding in TensorRT-LLM.

# Run decoding

# Important flags to set during the run process:
# --draft_engine_dir and --engine_dir must be specified for the draft and
# target engines.

# --draft_target_model_config corresponds to the configuration of
# Draft-Target-Model. For example, [4,[0],[1],False] means draft_len=4, the
# draft model runs on GPU 0, the target model runs on GPU 1, and tokens
# (rather than logits) are used for acceptance.

# Only the C++ session (using the executor as the low-level API) is
# supported; the Python session (--use_py_session) is not supported.

# Run with the Llama 3.3 70B target model
mpirun -n 1 --allow-run-as-root python3 ./run.py \
--tokenizer_dir <path/to/Meta-Llama-3.3-70B-Instruct> \
--draft_engine_dir ./draft-engine \
--engine_dir ./target-engine \
--draft_target_model_config="[10,[0,1,2,3,4,5,6,7],[0,1,2,3,4,5,6,7],False]" \
--kv_cache_free_gpu_memory_fraction=0.35 \
--max_output_len=1024 \
--kv_cache_enable_block_reuse \
--input_text="<input prompt>"
