NVIDIA TensorRT-LLM Accelerates Encoder-Decoder Model Architectures
NVIDIA recently announced that NVIDIA TensorRT-LLM now accelerates encoder-decoder model architectures. TensorRT-LLM is an open-source library that optimizes inference for diverse model architectures, including the following:
- Decoder-only models, such as Llama 3.1
- Mixture-of-experts (MoE) models, such as Mixtral
- Selective state-space models (SSM), such as Mamba
- Multimodal models for vision-language and video-language applications
In-flight Batching for Encoder-Decoder Architectures
Encoder-decoder models have a different runtime pattern compared with decoder-only models. They use more than one engine (commonly two): the first engine is executed only once per request and has simpler input/output buffers, while the second engine runs auto-regressively and requires more complex handling logic for key-value (KV) cache management and batch management to deliver high throughput at low latency.
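The two-engine pattern above can be sketched in plain Python. The "engines" here are stand-in callables, not real TensorRT engines, and names such as `run_encoder` and `run_decoder_step` are illustrative only; the point is the control flow, where the encoder runs once and the decoder loops per token:

```python
# Minimal sketch of the encoder-decoder runtime pattern (illustrative only).

def run_encoder(input_tokens):
    """Engine 1: executed once per request, producing encoder hidden states."""
    return [[float(tok)] for tok in input_tokens]  # stand-in hidden states

def run_decoder_step(encoder_output, generated, kv_cache):
    """Engine 2: executed auto-regressively, one token per step, reusing the cache."""
    kv_cache.append(len(generated))                     # stand-in for appending K/V
    return len(generated) + len(encoder_output)         # dummy "next token"

def generate(input_tokens, max_new_tokens, eos_token):
    encoder_output = run_encoder(input_tokens)          # runs exactly once
    generated, kv_cache = [], []
    for _ in range(max_new_tokens):                     # runs once per output token
        tok = run_decoder_step(encoder_output, generated, kv_cache)
        generated.append(tok)
        if tok == eos_token:
            break
    return generated
```

Note that the per-step cost stays low only because the decoder reuses the KV cache instead of recomputing attention over all previous tokens.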
Key Extensions for In-Flight Batching (IFB) and KV Cache Management
- Dual paged KV cache management for both the decoder's self-attention cache and the decoder's cross-attention cache, which is computed from the encoder's output.
- Data passing from encoder to decoder, controlled at the LLM request level. When decoder requests are batched in flight, each request's encoder-stage output must be gathered and batched in flight as well.
- Decoupled batching strategy for the encoder and decoder. Because the encoder and decoder can have different sizes and compute properties, requests at each stage should be batched independently and asynchronously.
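The first extension, the dual KV cache, can be sketched as follows. This is a conceptual illustration, not the TensorRT-LLM implementation: the class and method names are hypothetical, and real paged caches store fixed-size blocks of K/V tensors rather than Python lists. The key distinction it shows is that the cross-attention cache is written once from the encoder output and then read-only, while the self-attention cache grows by one entry per generated token:

```python
# Hedged sketch of the dual KV cache for one request (names are illustrative).

class RequestKVCache:
    def __init__(self, encoder_output):
        # Cross-attention K/V: computed once from the encoder output,
        # then only read during decoding.
        self.cross_kv = list(encoder_output)
        # Self-attention K/V: appended to at every decoding step.
        self.self_kv = []

    def step(self, new_kv):
        """Record the K/V entry produced by one auto-regressive decoder step."""
        self.self_kv.append(new_kv)

cache = RequestKVCache(encoder_output=["h0", "h1", "h2"])
for t in range(4):                     # four decoding steps
    cache.step(f"kv{t}")
```

Because the two caches have different lifetimes and growth patterns, managing them in separate paged pools avoids fragmenting one pool with the other's allocations.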
Low-Rank Adaptation Support
Low-rank adaptation (LoRA) is a powerful parameter-efficient fine-tuning (PEFT) technique that enables the customization of LLMs while maintaining impressive performance and minimal resource usage. Instead of updating all model parameters during fine-tuning, LoRA adds small trainable rank decomposition matrices to the model, significantly reducing memory requirements and computational costs.
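The rank decomposition described above can be written out directly. In this NumPy sketch, the frozen weight `W` is augmented by two small trainable matrices `A` and `B` with rank `r` much smaller than the model dimensions, so the effective weight is `W + B @ A`; the matrix sizes are illustrative, and `B` is zero-initialized so the adapted model starts out identical to the base model, as in the original LoRA formulation:

```python
import numpy as np

d_in, d_out, r = 512, 512, 8               # illustrative sizes; r is the LoRA rank
rng = np.random.default_rng(0)

W = rng.standard_normal((d_out, d_in))     # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))                   # trainable up-projection (zero init)

x = rng.standard_normal(d_in)
y = W @ x + B @ (A @ x)                    # LoRA forward pass

# Trainable parameters shrink from d_out*d_in to r*(d_in + d_out).
full_params = d_out * d_in                 # 262,144
lora_params = r * (d_in + d_out)           # 8,192
```

With these sizes, fine-tuning touches roughly 3% of the parameters a full update would, which is where the memory and compute savings come from.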
Benefits of LoRA Support
- Efficient serving of multiple LoRA adapters within a single batch
- Reduced memory footprint through the dynamic loading of LoRA adapters
- Seamless integration with existing BART model deployments
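The first benefit, serving several adapters in one batch, can be sketched as follows. This is an assumption-laden illustration, not the TensorRT-LLM API: each request carries an adapter ID, the base weight is shared across the batch, and only the small low-rank matrices differ per request, which is what makes it cheap to load adapters dynamically and mix them in flight:

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 64, 4
W = rng.standard_normal((d, d))            # base weight shared by all requests

# Hypothetical adapter registry: each entry holds one (A, B) low-rank pair.
adapters = {
    "summarize": (rng.standard_normal((r, d)), rng.standard_normal((d, r))),
    "translate": (rng.standard_normal((r, d)), rng.standard_normal((d, r))),
}

def forward(x, adapter_id):
    """Apply the shared base weight plus the request's own LoRA update."""
    A, B = adapters[adapter_id]
    return W @ x + B @ (A @ x)

# One batch, two requests, two different adapters.
batch = [(rng.standard_normal(d), "summarize"),
         (rng.standard_normal(d), "translate")]
outputs = [forward(x, aid) for x, aid in batch]
```

Storing only the `(A, B)` pairs per adapter, rather than a full copy of `W`, is what keeps the memory footprint small as the number of served adapters grows.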
Summary
NVIDIA TensorRT-LLM continues to expand its capabilities for optimizing and efficiently running LLMs across different architectures. Upcoming enhancements to encoder-decoder models include FP8 quantization, enabling further improvements in latency and throughput. For production deployments, NVIDIA Triton Inference Server provides the ideal platform for serving these models.
FAQs
Q: What is NVIDIA TensorRT-LLM?
A: NVIDIA TensorRT-LLM is an open-source library that optimizes inference for diverse model architectures.
Q: What is the purpose of in-flight batching?
A: In-flight batching improves throughput by inserting new requests into a running batch as soon as earlier requests finish, instead of waiting for the entire batch to complete. For encoder-decoder models, this requires gathering and batching each request's encoder output in flight alongside the decoder's KV cache management.
Q: What is low-rank adaptation (LoRA) support?
A: LoRA support enables customization of LLMs while maintaining impressive performance and minimal resource usage.
Q: What are the benefits of LoRA support?
A: The benefits of LoRA support include efficient serving of multiple LoRA adapters, reduced memory footprint, and seamless integration with existing BART model deployments.

