Introducing KV Cache Reuse Optimizations in NVIDIA TensorRT-LLM

Priority-based KV Cache Eviction

When an LLM request completes, the KV cache blocks associated with it are retained. Given the bounded size of the KV cache, some cached blocks may need to be evicted to make room for new sequences. By default, eviction follows a least recently used (LRU) policy.

Priority-based eviction is a new feature of the TensorRT-LLM Executor API that enables users to influence how blocks are selected for eviction. Users can specify two attributes that guide block eviction: priority and duration. The priority value sets a block's relative retention priority (how important it is to keep that block in the cache), and the duration value sets how long that priority level applies.

The priority-based eviction API enables an LLM deployer to use knowledge about their workload to improve reuse opportunities by persisting blocks that are likely to be reused. For example, the deployer may want blocks corresponding to a system prompt to stay in the cache as long as possible, or blocks that might be involved in a latency-critical request to persist with higher priority than others (Figure 1).

For each request, you can specify a priority and duration value for discrete ranges of tokens in the input context, along with a priority and duration for blocks allocated during the decode phase. The priority level of a token range applies until the specified duration elapses without reuse, or until the blocks corresponding to the range have been evicted.

When choosing blocks to be evicted, TensorRT-LLM considers the priority levels of tokens within the block. For example, a request with a 500-token system prompt can set the token range [0, 500) to the maximum priority. This way, the cache blocks corresponding to these tokens will only be evicted if absolutely necessary. Alternatively, if you know that blocks will never be reused, you can set the blocks of this request to the lowest priority to ensure that they are evicted first, before other blocks.

This new implementation also biases eviction toward blocks further from the root of the reuse tree, which yields a small performance improvement even when no priority levels are set. In our internal benchmarks, priority-based eviction increased cache hit rate by around 20%, with the exact gain varying by workload.
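To make these semantics concrete, here is a minimal Python sketch of a priority-aware eviction policy (illustrative only, not the actual TensorRT-LLM implementation; the Block fields, the default priority of 0, and the tie-breaking rule are assumptions for demonstration). It picks the victim with the lowest effective priority, falls back to LRU among equals, and lets an elevated priority lapse once its duration passes without reuse:

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_PRIORITY = 0  # assumed default once an elevated priority lapses

@dataclass
class Block:
    block_id: int
    priority: int = DEFAULT_PRIORITY  # higher = more important to retain
    duration: Optional[float] = None  # seconds the priority stays in force
    last_used: float = 0.0            # time of last reuse

def effective_priority(block: Block, now: float) -> int:
    # An elevated priority lapses once `duration` passes without reuse.
    if block.duration is not None and now - block.last_used > block.duration:
        return DEFAULT_PRIORITY
    return block.priority

def pick_eviction_victim(blocks, now):
    # Evict the lowest effective priority; break ties by least recent use.
    return min(blocks, key=lambda b: (effective_priority(b, now), b.last_used))

blocks = [
    Block(0, priority=100, last_used=9.0),                # system prompt
    Block(1, priority=0, last_used=9.5),                  # one-off request
    Block(2, priority=100, duration=5.0, last_used=1.0),  # priority lapsed
]
victim = pick_eviction_victim(blocks, now=10.0)
# Block 2's elevated priority has lapsed and it is the least recently used,
# so it is evicted before the one-off block; the system prompt is retained.
```

In this toy model, a lowest-priority block is always the first eviction candidate regardless of recency, matching the one-off request case described above.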

Priority-based eviction usage examples

# Example 1: One-off request

KvCacheRetentionConfig(
    [TokenRangeRetentionConfig(start=0, end=null, priority=0)],
    decode_priority=0
)

# Example 2: High-priority system prompt

KvCacheRetentionConfig(
    [TokenRangeRetentionConfig(start=0, end=1000, priority=100)]
)

# Example 3: Retain context blocks for 30 seconds, and decode blocks for 10 seconds

KvCacheRetentionConfig(
    [TokenRangeRetentionConfig(start=0, end=null, priority=100, duration=30s)],
    decode_priority=100,
    decode_duration=10s
)

KV Cache Event API

In large-scale LLM-powered applications, deployers often provision multiple serving instances of a model to distribute incoming requests. This raises the question: which instance should process a new request? Requests are often routed to balance load, ensuring efficient utilization and quick processing of any request. The amount of free KV cache on an instance represents its capacity to grow and accept new work.

However, load-based routing may not be optimal. If a moderately loaded instance has already computed and cached the keys and values for a new request, routing the request to this instance might still be preferred to optimize for cache reuse. The KV cache event API enables request routing systems to track which instances have cached or evicted blocks, enabling more intelligent reuse and greater performance.
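To make the routing idea concrete, here is a small self-contained sketch (the instance representation and block-hash prefix matching are assumptions for illustration, not a TensorRT-LLM API): route each request to the instance with the longest cached prefix of its KV blocks, breaking ties by lower load:

```python
def cached_prefix_len(request_blocks, cached):
    # Count how many of the request's leading block hashes are cached.
    n = 0
    for h in request_blocks:
        if h not in cached:
            break
        n += 1
    return n

def route(request_blocks, instances):
    # Prefer the longest cached prefix; among ties, prefer the lower load.
    return max(
        instances,
        key=lambda inst: (cached_prefix_len(request_blocks, inst["cached"]),
                          -inst["load"]),
    )

instances = [
    {"name": "a", "cached": {"h0", "h1"}, "load": 0.2},
    {"name": "b", "cached": {"h0", "h1", "h2"}, "load": 0.6},
]
best = route(["h0", "h1", "h2", "h3"], instances)
# Instance "b" wins: three reusable prefix blocks outweigh its higher load.
```

A production router would weight cache overlap against load and queue depth rather than treating overlap as strictly dominant, but the prefix-matching core is the same.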

The TensorRT-LLM Executor API now exposes a means of tracking updates to the KV cache.

# Set the max size of the internal event buffer. Defaults to 0 (no events)
kv_cache_config = KvCacheConfig(event_buffer_max_size=16384)

executor_config = ExecutorConfig(kv_cache_config)

executor = Executor(executor_config)

# Get an event manager
eventManager = executor.getKvCacheEventManager()

# Wait for new events. Once it returns, it implicitly clears the internal queue of events. Optionally provide a timeout value. If there are no events within this timeout, an empty list is returned.
events = eventManager.getLatestEvents()

When a cache block is stored for reuse, removed, or updated, an event is emitted. These events can be consumed in real time by an application to get an eventually consistent view of the current state of the TensorRT-LLM KV cache. This is especially useful for tracking KV cache reuse opportunities. It can be used on the scale of a single executor to anticipate which requests will have more reuse, or aggregated across many executors to make KV-aware routing and scheduling decisions (Figure 2).
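A consumer of these events might maintain a mirror of each executor's cached blocks. The sketch below is illustrative only; the event field names (`type`, `block_hashes`) are assumptions for demonstration, as the real event schema is defined by the Executor API:

```python
def apply_events(cache_view, events):
    # Fold a batch of cache events into an eventually consistent mirror
    # of an executor's stored blocks.
    for ev in events:
        if ev["type"] == "stored":
            cache_view.update(ev["block_hashes"])
        elif ev["type"] == "removed":
            cache_view.difference_update(ev["block_hashes"])
    return cache_view

view = set()
# In practice, each batch would come from eventManager.getLatestEvents().
apply_events(view, [{"type": "stored", "block_hashes": ["h0", "h1"]}])
apply_events(view, [{"type": "removed", "block_hashes": ["h0"]}])
# view now mirrors the cache: only "h1" remains stored.
```

A router could keep one such mirror per executor and feed them into a cache-aware routing decision like the one sketched earlier.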

Figure 2. A scalable implementation of the KV cache event API, where events are processed across KV tree shards and aggregated across multiple executors to provide KV-aware routing and scheduling of requests that optimizes KV cache reuse opportunities.

Summary

NVIDIA TensorRT-LLM provides several optimizations to efficiently deploy your generative AI applications across NVIDIA-accelerated infrastructure anywhere, including cloud, data center, and workstations. These optimizations lead to significant speedups and better cache reuse on the same hardware, ultimately enabling you to serve the same workload with fewer resources, reducing energy costs and improving total cost of ownership.
