TensorRT-LLM Supports Recurrent Drafting for Optimizing LLM Inference

Inflight-batching Compatible Engine

Inflight-batching (IFB) is a strategy that significantly improves throughput by batching context-phase and generation-phase requests together. Speculative decoding, coupled with IFB, adds complexity to the pipeline because context-phase requests must be handled differently from generation-phase requests, which require draft-token validation. Since ReDrafter moves the validation logic inside the model definition, the engine needs that logic as well. Similar to the attention plugin, the batch is split into two smaller batches: one for context requests and one for generation requests. Each smaller batch enters its own computational workflow, and at the end the two are combined back into a single batch for drafting.

Note that this approach requires all operators on either path to support empty tensors, which arise when a batch consists entirely of context requests or entirely of generation requests. This capability adds flexibility to the TensorRT-LLM APIs, enabling the definition of more complicated models in the future.
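The split/compute/recombine flow above can be sketched in a few lines of Python. This is an illustrative sketch only; the function names (`process_context`, `validate_and_accept`, `draft`) are placeholders, not the actual TensorRT-LLM API.

```python
# Illustrative sketch of the IFB split/recombine flow. All function names
# here are hypothetical placeholders, not the TensorRT-LLM API.

def process_context(req):
    # Context phase: run prefill; here we just tag the request.
    return {**req, "stage": "prefilled"}

def validate_and_accept(req):
    # Generation phase: validate previously drafted tokens.
    return {**req, "stage": "validated"}

def draft(reqs):
    # Drafting runs once over the recombined batch.
    return [{**r, "stage": "drafted"} for r in reqs]

def run_step(batch):
    ctx = [r for r in batch if r["phase"] == "context"]
    gen = [r for r in batch if r["phase"] == "generation"]
    # Either sub-batch may be empty, which is why every operator on
    # both paths must handle empty tensors.
    out = [process_context(r) for r in ctx] + \
          [validate_and_accept(r) for r in gen]
    return draft(out)

step_out = run_step([{"id": 0, "phase": "context"},
                     {"id": 1, "phase": "generation"}])
print([r["stage"] for r in step_out])  # ['drafted', 'drafted']
```

Note that `run_step([])` also works, mirroring the empty-tensor requirement: an all-context or all-generation batch simply produces an empty sub-batch on the other path.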

Implementing In-Engine Validation and Drafting

To validate and draft inside the engine, TensorRT-LLM is updated with support for numerous new operations so that PyTorch code can be easily translated into a definition of the TensorRT-LLM model.

The following PyTorch code excerpt is Apple’s PyTorch implementation of ReDrafter. The TensorRT-LLM implementation is almost a straightforward line-by-line mapping of the PyTorch version.

PyTorch

def unpack(
    packed_tensor: torch.Tensor,
    unpacker: torch.Tensor,
) -> torch.Tensor:
    assert len(packed_tensor.shape) == 3
    last_dim_size = packed_tensor.shape[2]
    batch_size, beam_width, beam_length = unpacker.shape
    unpacked_data_indices = unpacker.view(
        batch_size, beam_width * beam_length, 1).expand(
        -1, -1, last_dim_size
    )
    unpacked_tensor = torch.gather(
        packed_tensor, 1, unpacked_data_indices).reshape(
        batch_size, beam_width, beam_length, -1
    )
    return unpacked_tensor
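To make the gather semantics concrete, here is a minimal NumPy sketch of the same operation (our illustration, not part of either codebase): each index in `unpacker` selects a row along the packed dimension, producing a tensor of shape (batch, beams, beam_length, hidden).

```python
import numpy as np

def unpack_np(packed, unpacker):
    # packed: (batch, packed_len, hidden); unpacker: (batch, beams, beam_len)
    # Fancy indexing along dim 1 reproduces torch.gather + reshape above.
    b = np.arange(packed.shape[0])[:, None, None]
    return packed[b, unpacker, :]  # -> (batch, beams, beam_len, hidden)

packed = np.array([[[0.0], [10.0], [20.0], [30.0]]])  # (1, 4, 1)
idx = np.array([[[0, 1], [0, 2]]])                    # (1, 2, 2): two beams
out = unpack_np(packed, idx)
print(out.shape)  # (1, 2, 2, 1)
print(out[0, 1])  # second beam gathers rows 0 and 2: [[0.], [20.]]
```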

TensorRT-LLM

def _unpack_beams(
    x: Tensor,
    indices: Tensor,
    num_beams: int,
    beam_length: int
) -> Tensor:
    assert x.rank() == 3
    d0 = shape(x, 0, INT_DTYPE_STR)
    dl = shape(x, -1, INT_DTYPE_STR)
    indices = view(
        indices, [-1, num_beams * beam_length, 1], False)
    res_shape = concat([d0, num_beams, beam_length, dl])
    res = view(gather_nd(x, indices), res_shape, False)
    return res

This, of course, is a very simple example. For a more complex example, see the beam search implementation. With the new functionalities added for ReDrafter, it might be possible to improve the Medusa implementation in TensorRT-LLM to further increase its performance.

ReDrafter Performance in TensorRT-LLM

As benchmarked by Apple, ReDrafter with TensorRT-LLM can provide up to 2.7x throughput improvements on NVIDIA H100 GPUs with TP8 over the base LLM.

Note that the performance improvement of any speculative decoding technique can be heavily impacted by many factors, including:

  • GPU utilization: Speculative decoding is commonly used for low-traffic scenarios, where GPU resources are typically underutilized due to small batch sizes.
  • Average acceptance rate: Speculative decoding increases the latency of each decoding step because it performs extra computation, a significant portion of which is ultimately wasted after validation. As a result, the average acceptance rate must be high enough to pay for that extra latency before speculative decoding shows any performance benefit. The acceptance rate is affected by the number of beams, their lengths, and the quality of the beam search itself (which is impacted by the training data).
  • Task: It is easier to predict future tokens for some tasks (code completion, for example), which leads to a higher acceptance rate, and thus improved performance.
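The acceptance-rate trade-off can be made concrete with a back-of-the-envelope model (ours, not from the benchmark): if a speculative step costs `overhead` times a plain decoding step but accepts `accepted_tokens` tokens on average, the throughput speedup is roughly their ratio.

```python
# Simplified cost model (an assumption for illustration): speedup is the
# average number of accepted tokens per step divided by the relative cost
# of a speculative step versus a plain decoding step.

def estimated_speedup(accepted_tokens: float, overhead: float) -> float:
    return accepted_tokens / overhead

print(estimated_speedup(2.5, 1.3))  # speculation pays off (>1x)
print(estimated_speedup(1.1, 1.3))  # low acceptance makes it slower (<1x)
```

This is why a high average acceptance rate is essential: with the (hypothetical) 1.3x per-step overhead above, accepting fewer than 1.3 tokens per step on average would make speculative decoding a net loss.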

Summary

This collaboration between NVIDIA and Apple has made TensorRT-LLM more powerful and more flexible, enabling the LLM community to innovate more sophisticated models and easily deploy them with TensorRT-LLM to achieve unparalleled performance on NVIDIA GPUs. These new features open exciting possibilities, and we eagerly anticipate the next generation of advanced models from the community that leverage TensorRT-LLM capabilities, driving further improvements in LLM workloads.

FAQs

Q: What is ReDrafter?

A: ReDrafter is a novel speculative decoding technique developed by Apple for large language model (LLM) inference.

Q: What is the benefit of ReDrafter?

A: ReDrafter can significantly boost LLM workload performance on NVIDIA GPUs, providing up to 2.7x throughput improvements on NVIDIA H100 GPUs with TP8 over the base LLM.

Q: How does ReDrafter work?

A: ReDrafter uses recurrent neural network (RNN)-based sampling, referred to as drafting, combined with tree-style attention to predict and verify draft tokens from multiple possible paths for better accuracy and to potentially accept more than one token in each iteration of the decoder.

Q: What are the factors that impact ReDrafter performance?

A: The performance of ReDrafter is impacted by many factors, including GPU utilization, average acceptance rate, and task type.
