Dynamic Memory Compression

What Impacts LLM Inference Performance?

LLM inference consists of two phases: pre-filling and auto-regressive generation. During generation, to perform self-attention, Transformers append a pair of representations (a key-value pair, or KVP) for every token to a cache. A different KVP is stored for every layer and every attention head, so the KVP cache grows linearly with the sequence length. Because the KVP cache must fit into GPU memory together with the LLM weights, it can occupy a significant share of that memory or even exhaust it.
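To make the growth concrete, here is a minimal sketch of the KVP cache size as a function of model shape and sequence length. The model configuration below (32 layers, 32 heads, head dimension 128, fp16) is an illustrative assumption, not a figure from this article:

```python
def kvp_cache_bytes(n_layers, n_heads, head_dim, seq_len, batch_size, bytes_per_scalar=2):
    """Size of the KVP cache: one key and one value vector per token,
    per layer, per head. The factor of 2 accounts for the K and V pair."""
    return 2 * n_layers * n_heads * head_dim * seq_len * batch_size * bytes_per_scalar

# Illustrative 7B-class model in fp16: 32 layers, 32 heads, head_dim 128
size = kvp_cache_bytes(32, 32, 128, seq_len=4096, batch_size=1)
print(size / 1024**3)  # 2.0 -> 2 GiB of cache for a single 4096-token sequence
```

At larger batch sizes or longer contexts, this cache alone can rival or exceed the memory taken by the model weights.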

Dynamic Memory Compression

Dynamic memory compression (DMC) is a simple way to compress the KVP cache during inference without a drop in model performance. For every token, layer, and head, the model decides separately whether the new key-value pair should be appended to the cache, as in a plain Transformer, or accumulated onto the last entry, using the following equation:

k_i = α_i · k_{i-1} + k_new,i

This equation, lying at the heart of DMC, turns a sub-sequence of keys into a weighted prefix sum, which is reminiscent of recurrent architectures such as xLSTM or RWKV. During inference, the values of α are strictly binary: when α = 1, the new pair is accumulated onto the last cache entry in place, compressing the cache without extending it; when α = 0, it is appended as in a plain Transformer.
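The binary decision above can be sketched as follows. This is a toy illustration with scalar "keys"; real DMC operates on key and value vectors per layer and head, and learns the decisions during training:

```python
def dmc_update(cache, k_new, alpha):
    """Toy DMC cache update for a single head.

    alpha == 1: accumulate k_new onto the last cache entry (compress).
    alpha == 0: append k_new, as a plain Transformer would.
    """
    if alpha == 1 and cache:
        cache[-1] += k_new  # in-place accumulation; cache length unchanged
    else:
        cache.append(k_new)
    return cache

# Four tokens with decisions [0, 1, 0, 1]: the cache ends up half as long
cache = []
for k, a in zip([1.0, 2.0, 4.0, 8.0], [0, 1, 0, 1]):
    dmc_update(cache, k, a)
print(cache)  # [3.0, 12.0]
```

Every α = 1 decision saves one cache slot, so the average of the decisions over a sequence determines the achieved compression rate.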

How Does DMC Work?

DMC opens a third way: between retaining the full, ever-growing cache and applying error-prone training-free compression, a Transformer model can be trained to adaptively compress the conversation state to a desired compression rate. This enables a significant reduction of the conversation state size without replacing the familiar Transformer architecture. DMC does not require training from scratch: existing models can be retrofitted with a negligible amount of additional training, which is more reliable than training-free methods.

What are the Benefits of DMC?

Compression of the conversation state frees up memory, which can be used to accommodate a larger batch size. Because auto-regressive generation is typically memory-bound, this usually translates almost directly into an equivalent increase in throughput. For instance, choosing the maximum batch size that fits in memory with 8x compression on an NVIDIA H100 GPU yields 700% more tokens generated per second than with the vanilla model.
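The arithmetic behind that claim can be sketched as follows. All concrete numbers here (80 GB of GPU memory, 14 GB of fp16 weights for a 7B-class model, 2 GB of KVP cache per sequence) are illustrative assumptions, not measurements from this article:

```python
def max_batch_size(gpu_mem_gb, weights_gb, kv_per_seq_gb, compression=1.0):
    """Largest batch whose (possibly compressed) KVP caches fit beside the weights."""
    free_gb = gpu_mem_gb - weights_gb
    return int(free_gb // (kv_per_seq_gb / compression))

# Illustrative numbers: 80 GB H100, 14 GB fp16 weights, 2 GB of cache per sequence
baseline = max_batch_size(80, 14, 2, compression=1.0)
with_dmc = max_batch_size(80, 14, 2, compression=8.0)
print(baseline, with_dmc, with_dmc / baseline)  # 33 264 8.0
```

An 8x larger batch means roughly 8x the tokens generated per second, i.e. a 700% increase, as long as generation remains memory-bound rather than compute-bound.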

Conclusion

DMC holds promise to push the frontier of LLMs even further. It provides a mechanism for adaptive memory that lies between the linearly growing memory of Transformers and the constant memory of SSMs, offering a better trade-off between capacity and space. This boosts Transformer LLM throughput without sacrificing quality, and enables the accommodation of much longer contexts within the same hardware constraints.

Acknowledgements

We would like to thank Mostofa Patwary and Szymon Migacz for their assistance, as well as Przemysław Strzelczyk, Daniel Korzekwa, and Bryan Catanzaro for helpful discussions and support in releasing this paper. This work was supported in part by the UKRI Centre for Doctoral Training in Natural Language Processing, funded by the UKRI (grant EP/S022481/1) and the University of Edinburgh, School of Informatics and School of Philosophy, Psychology & Language Sciences.

FAQs

Q: What is Dynamic Memory Compression (DMC)?
A: DMC is a simple way to compress the KVP cache during inference without a drop in model performance.

Q: How does DMC work?
A: For every token, layer, and head, DMC decides whether to append the new key-value pair to the cache or accumulate it onto the last entry via k_i = α_i · k_{i-1} + k_new,i, turning a sub-sequence of keys into a weighted prefix sum reminiscent of recurrent architectures such as xLSTM or RWKV.

Q: What are the benefits of DMC?
A: DMC enables a significant reduction of the conversation state size without replacing the familiar Transformer architecture, and frees up memory, which could be used to accommodate a larger batch size.

Q: Can DMC be used with pre-existing LLMs?
A: Yes, DMC can be retrofitted through a negligible amount of additional training, which is more reliable than error-prone training-free methods.
