Easily deploy high-throughput, low-latency inference
Six years ago, NVIDIA set out to create an AI inference server specifically designed for developers building high-throughput, latency-critical production applications. The result was NVIDIA Triton Inference Server, an open-source platform capable of serving models from any AI framework.
Optimizations for AI inference workloads
Inference is a full-stack problem today, requiring high-performance infrastructure and efficient software that makes effective use of that infrastructure. NVIDIA offers a broad ecosystem of AI inference solutions, including NVIDIA TensorRT, a high-performance deep learning inference library with APIs that enable fine-grained optimizations.
Prefill and KV cache optimizations
The TensorRT-LLM library incorporates many state-of-the-art features that accelerate inference performance for large language models (LLMs). These include:
- Key-value (KV) cache early reuse: By reusing the KV cache computed for system prompts across users, KV cache early reuse accelerates time-to-first-token (TTFT) by up to 5x.
- Chunked prefill: Chunked prefill divides the prefill phase into smaller chunks that can be scheduled alongside decode requests, improving GPU utilization and reducing latency.
- KV cache offloading for multiturn interactions: The NVIDIA GH200 Superchip architecture enables efficient KV cache offloading to host memory, improving TTFT by up to 2x in multiturn interactions with Llama models.
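The reuse idea behind these prefill optimizations can be sketched in a few lines. The block below is a toy, framework-free illustration of KV cache early reuse: KV state computed for a shared system prompt is cached so that later requests only prefill their unique suffix. The class name `PrefixKVCache` and the string stand-ins for KV tensors are invented for this sketch; the real TensorRT-LLM implementation manages paged KV blocks on the GPU.

```python
# Toy sketch of KV cache early reuse. KV state computed for a shared
# system prompt is cached and reused across requests, so only the
# user-specific suffix incurs prefill compute. Illustrative only.

class PrefixKVCache:
    def __init__(self):
        self._cache = {}          # token prefix -> cached "KV state"
        self.prefill_tokens = 0   # counts tokens actually prefilled

    def _compute_kv(self, tokens):
        self.prefill_tokens += len(tokens)
        return [f"kv({t})" for t in tokens]   # stand-in for real KV tensors

    def prefill(self, tokens):
        # Find the longest already-cached prefix of this prompt.
        prefix = tuple(tokens)
        while prefix and prefix not in self._cache:
            prefix = prefix[:-1]
        kv = list(self._cache.get(prefix, []))
        # Only the uncached suffix incurs prefill compute; cache every new
        # prefix along the way so later requests can reuse it.
        for t in tokens[len(prefix):]:
            kv = kv + self._compute_kv([t])
            prefix = prefix + (t,)
            self._cache[prefix] = kv
        return kv

system = ["<sys>"] * 100                 # shared 100-token system prompt
cache = PrefixKVCache()
cache.prefill(system + ["user-a"])       # prefills all 101 tokens
cache.prefill(system + ["user-b"])       # reuses the system prompt: 1 new token
print(cache.prefill_tokens)              # 101 + 1 = 102, not 202
```

Without reuse, the second request would re-prefill all 101 tokens; with reuse, only the single user-specific token is computed, which is why TTFT improves most when many users share a long system prompt.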
Decoding optimization
The TensorRT-LLM library also includes decoding optimization techniques, such as:
- Multiblock attention for long sequences: Addressing the challenge of long input sequences, TensorRT-LLM multiblock attention maximizes GPU utilization by distributing tasks across streaming multiprocessors (SMs).
- Speculative decoding for accelerated throughput: Leveraging a smaller draft model alongside a larger target model, speculative decoding enables up to a 3.6x improvement in inference throughput.
- Speculative decoding with Medusa: The Medusa speculative decoding algorithm is available as part of TensorRT-LLM optimizations, predicting multiple subsequent tokens simultaneously to boost throughput.
Multi-GPU inference
TensorRT-LLM also includes multi-GPU inference optimizations, such as:
- MultiShot communication protocol: Traditional Ring AllReduce operations can become a bottleneck in multi-GPU scenarios. TensorRT-LLM MultiShot reduces communication steps to just two, irrespective of GPU count.
- Pipeline parallelism for high-concurrency efficiency: Pipeline parallelism partitions a model's layers across GPUs and passes activations from stage to stage. This requires that GPUs be able to transfer data quickly and efficiently, necessitating a robust GPU-to-GPU interconnect fabric for maximum performance.
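The step-count difference behind MultiShot can be made concrete with a simplified model. The sketch below all-reduces scalar values across simulated "GPUs" using a ring (reduce around the ring, then broadcast back), counting sequential communication hops; no real NCCL or NVSwitch multicast is involved, and the two-step MultiShot count is taken from the description above rather than simulated.

```python
# Toy step-count sketch: a ring-style all-reduce needs 2*(N-1) sequential
# communication hops, while a MultiShot-style approach uses a fixed two
# steps (reduce-scatter + multicast all-gather) regardless of GPU count.

def ring_allreduce(values):
    """All-reduce scalars across 'GPUs' via a ring; returns (result, steps)."""
    n = len(values)
    steps = 0
    # Phase 1: pass a running sum around the ring (N-1 hops).
    acc = values[0]
    for g in range(1, n):
        acc += values[g]
        steps += 1
    # Phase 2: circulate the final sum back around the ring (N-1 hops).
    result = [acc] * n
    steps += n - 1
    return result, steps

res, ring_steps = ring_allreduce([1, 2, 3, 4, 5, 6, 7, 8])
multishot_steps = 2   # constant, per the MultiShot description above
print(res[0], ring_steps, multishot_steps)   # 36 14 2
```

At 8 GPUs the ring already needs 14 sequential hops versus 2, and the gap widens linearly with GPU count, which is why latency-sensitive, small-message AllReduce benefits most.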
Quantization and lower-precision compute
- NVIDIA TensorRT Model Optimizer for precision and performance: The NVIDIA custom FP8 quantization recipe delivers up to 1.44x higher throughput without sacrificing accuracy.
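The core mechanic of post-training quantization, mapping float values into an 8-bit range with a calibrated scale, can be sketched without any framework. Note the hedge: real FP8 recipes (such as E4M3 in TensorRT Model Optimizer) use an 8-bit floating-point format with calibrated per-tensor or per-channel scales; the sketch below uses an int8-style integer grid only to keep the scale/round/dequantize flow easy to follow.

```python
# Toy sketch of per-tensor quantization: pick a scale from the value
# range, round to an 8-bit grid, and dequantize to check the error.
# Int8-style grid used for clarity; real FP8 uses an 8-bit float format.

def quantize(values, qmax=127):
    scale = max(abs(v) for v in values) / qmax   # per-tensor scale
    q = [round(v / scale) for v in values]       # 8-bit integer codes
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.02, -1.27, 0.64, 0.9]
q, scale = quantize(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, max_err <= scale / 2)   # error bounded by half a quantization step
```

Storing 8-bit values halves memory and bandwidth versus FP16, which is where the throughput gain comes from; the calibration step that picks `scale` is what lets accuracy be preserved.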
Evaluating inference performance
Delivering world-class inference performance takes a full technology stack—chips, systems, and software—all contributing to boosting throughput, reducing energy consumption per token, and minimizing costs. MLPerf Inference is one key measure of inference performance, regularly updated to reflect new advances in AI.
The future of AI inference: Emerging trends and technologies
The landscape of AI inference is rapidly evolving, driven by a series of groundbreaking advancements and emerging technologies. Models continue to get smarter, as increases in compute at data center scale enable pretraining larger models.
Get started
Check out How to Get Started with AI Inference, learn more about the NVIDIA AI Inference platform, and stay informed about the latest AI inference performance updates.