Load testing and performance benchmarking
Load testing and performance benchmarking are two distinct approaches to evaluating an LLM deployment. Load testing focuses on simulating a large number of concurrent requests to a model to assess its ability to handle real-world traffic at scale. This type of testing helps identify issues related to server capacity, autoscaling policies, network latency, and resource utilization.
In contrast, performance benchmarking, as demonstrated by the NVIDIA GenAI-Perf tool, is concerned with measuring the actual performance of the model itself, such as its throughput, latency, and token-level metrics. This type of testing helps identify issues related to model efficiency, optimization, and configuration.
While load testing is essential for ensuring the model can handle a large volume of requests, performance benchmarking is crucial for understanding the model’s ability to process requests efficiently. By combining both approaches, developers can gain a comprehensive understanding of their LLM deployment capabilities and identify areas for improvement.
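As a rough sketch of the load-testing side, the example below fires a fixed number of concurrent requests at an OpenAI-compatible endpoint and records per-request latencies. The endpoint URL, model name, and payload are illustrative assumptions rather than part of any particular tool.

```python
# Minimal load-test sketch: send many concurrent requests and record latencies.
# The endpoint URL, model name, and payload are illustrative assumptions.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:8000/v1/chat/completions"   # assumed OpenAI-compatible endpoint
PAYLOAD = {
    "model": "my-llm",                               # hypothetical model name
    "messages": [{"role": "user", "content": "Explain KV caching in one sentence."}],
    "max_tokens": 128,
}

def one_request(_):
    start = time.perf_counter()
    resp = requests.post(URL, json=PAYLOAD, timeout=120)
    resp.raise_for_status()
    return time.perf_counter() - start               # end-to-end latency in seconds

concurrency = 32                                     # number of simultaneous requests
with ThreadPoolExecutor(max_workers=concurrency) as pool:
    latencies = list(pool.map(one_request, range(concurrency)))

print(f"p50 latency: {sorted(latencies)[len(latencies) // 2]:.2f} s")
```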
How LLM inference works
Before examining benchmark metrics, it is important to understand how LLM inference works and to become familiar with the related terminology. An LLM application produces results through a series of inference stages. For a typical LLM application, these stages include:
- Prompt: The user provides a query
- Queuing: The query joins the queue to await processing
- Prefill: The model processes the prompt tokens
- Generation: The model outputs a response, one token at a time
A token is a concept specific to LLMs and is core to LLM inference performance metrics. It is the unit, or smallest linguistic entity, that LLMs use to break down and process natural language. The collection of all tokens is known as the vocabulary. Each LLM has its own tokenizer, learned from its training data, that represents input text efficiently. As a rough approximation, for many popular LLMs, each token corresponds to about 0.75 English words.
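As a quick illustration, the sketch below counts tokens for a short sentence with a Hugging Face tokenizer. The choice of the gpt2 tokenizer is an arbitrary assumption, and the exact count varies by tokenizer.

```python
# Token-counting sketch using a Hugging Face tokenizer (model choice is illustrative).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")    # any tokenizer works; gpt2 is just an example

text = "Benchmarking LLM inference starts with counting tokens."
token_ids = tokenizer.encode(text)

print(f"{len(text.split())} words -> {len(token_ids)} tokens")
# The words-per-token ratio differs across tokenizers; ~0.75 words per token
# is only a rough rule of thumb for many popular LLMs.
```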
Sequence length is the number of tokens in a sequence. The Input Sequence Length (ISL) is the number of tokens the LLM receives as input. It includes the user query, any system prompt (instructions for the model, for example), previous chat history, chain-of-thought (CoT) reasoning, and documents retrieved by a retrieval-augmented generation (RAG) pipeline. The Output Sequence Length (OSL) is the number of tokens the LLM generates. Context length is the number of tokens the LLM uses at each generation step, including both the input tokens and the output tokens generated up to that point. Each LLM has a maximum context length that is shared between input and output tokens.
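The budget implied by these definitions can be sketched in a few lines. The maximum context length and token counts below are illustrative assumptions, not properties of any particular model.

```python
# Context-budget sketch: ISL + OSL must fit within the model's maximum context length.
max_context_length = 4096        # illustrative limit; check your model's actual value
isl = 3000                       # input tokens: query + system prompt + history + RAG documents
requested_osl = 1500             # tokens the application would like to generate

available_for_output = max_context_length - isl
osl = min(requested_osl, available_for_output)
print(f"ISL={isl}, OSL capped at {osl} of {requested_osl} requested")
```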
Streaming is an option that allows partial LLM outputs to be streamed back to users in the form of chunks of tokens generated incrementally. This is important for chatbot applications, where it is desirable to receive an initial response quickly. While the user digests the partial content, the next chunk of the result arrives in the background. In contrast, in non-streaming mode, the full answer is returned all at once.
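A minimal sketch of consuming a streamed response from an OpenAI-compatible endpoint is shown below; the base URL and model name are assumptions, and the exact chunk format can differ between serving frameworks.

```python
# Streaming sketch against an OpenAI-compatible endpoint (URL and model name are illustrative).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="my-llm",                                   # hypothetical model name
    messages=[{"role": "user", "content": "Summarize KV caching."}],
    stream=True,                                      # return partial outputs as they are generated
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)  # incremental piece of the answer
print()
```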
LLM inference metrics
This section explains some of the common metrics used in the industry, including time to first token and intertoken latency, as shown in Figure 1.
Time to first token
Time to first token (TTFT) is the time it takes to process the prompt and generate the first token (Figure 2). In other words, it measures how long a user must wait before seeing the model’s output.
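One way to measure TTFT against a streaming, OpenAI-compatible endpoint is to timestamp the request and the arrival of the first content chunk, as in the hedged sketch below. The endpoint and model name are again illustrative, and client-side network time is included in the measurement.

```python
# TTFT measurement sketch: time from sending the request to the first content chunk.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")  # illustrative endpoint

t_start = time.perf_counter()
stream = client.chat.completions.create(
    model="my-llm",                                   # hypothetical model name
    messages=[{"role": "user", "content": "Explain speculative decoding briefly."}],
    stream=True,
)

ttft = None
t_last = t_start
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        t_last = time.perf_counter()                  # time of the most recent content chunk
        if ttft is None:
            ttft = t_last - t_start                   # time to first token

print(f"TTFT: {ttft:.3f} s")
```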
End-to-end request latency
End-to-end request latency (e2e_latency) is the time from when a query is submitted to when the full response is received, including queueing, batching, and network latencies (Figure 3).
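In a streaming measurement, e2e_latency falls out of the same timestamps: the gap between submitting the request and receiving the final chunk. A tiny sketch, with illustrative values standing in for the timestamps recorded in the TTFT example:

```python
# End-to-end latency: submission to the last received token.
# Illustrative values; in practice use t_start, t_last, and ttft measured in the TTFT sketch.
t_start, t_last, ttft = 0.000, 2.450, 0.320           # seconds
e2e_latency = t_last - t_start
print(f"e2e_latency: {e2e_latency:.3f} s "
      f"(TTFT {ttft:.3f} s + generation {e2e_latency - ttft:.3f} s)")
```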
Intertoken latency
Intertoken latency (ITL) is the average time between the generation of consecutive tokens in a sequence. It is also known as time per output token (TPOT).
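A common convention computes ITL from the quantities above, dividing the generation time after the first token by the number of remaining output tokens. The values in the sketch below are illustrative.

```python
# ITL sketch: average gap between consecutive output tokens after the first one.
# All values are illustrative; in practice they come from the measurements above.
e2e_latency = 2.450      # seconds, total request time
ttft = 0.320             # seconds, time to first token
output_tokens = 256      # number of tokens generated (OSL)

itl = (e2e_latency - ttft) / (output_tokens - 1)
print(f"ITL: {itl * 1000:.1f} ms/token")
```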
Tokens per second
Tokens per second (TPS) per system is the total output-token throughput of the system, accounting for all requests being served simultaneously.
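Given a set of completed request records, system TPS is the total number of output tokens divided by the measurement window. The records in the sketch below are illustrative placeholders for what a benchmarking tool would collect.

```python
# System throughput sketch: total output tokens divided by the benchmark window.
# Each record is (request_start_s, request_end_s, output_tokens); values are illustrative.
records = [
    (0.0, 2.4, 256),
    (0.1, 2.9, 301),
    (0.2, 2.7, 244),
]

window = max(end for _, end, _ in records) - min(start for start, _, _ in records)
total_output_tokens = sum(tokens for _, _, tokens in records)
tps_system = total_output_tokens / window
print(f"TPS (system): {tps_system:.1f} tokens/s")
```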
Requests per second
Requests per second (RPS) is the average number of requests that can be successfully completed by the system in a 1-second period.
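RPS follows the same pattern, counting completed requests over the measurement window (again with illustrative records).

```python
# RPS sketch: completed requests divided by the measurement window (illustrative records).
records = [(0.0, 2.4), (0.1, 2.9), (0.2, 2.7)]        # (start_s, end_s) per completed request

window = max(end for _, end in records) - min(start for start, _ in records)
rps = len(records) / window
print(f"RPS: {rps:.2f} requests/s")
```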
Benchmarking parameters and best practices
This section presents some important test parameters and their sweep ranges, which help ensure meaningful and reproducible benchmarking results.
Application use cases and their impact on LLM performance
An application’s specific use cases influence the sequence lengths (ISL and OSL), which in turn impact how quickly a system digests the input to build the KV cache and generates output tokens.
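A benchmark sweep therefore typically covers several use-case shapes. The ISL/OSL values below are rough, illustrative placeholders rather than measured characteristics of any workload.

```python
# Illustrative sweep of use-case shapes; ISL/OSL values are placeholders, not measurements.
use_cases = {
    "chat":          {"isl": 128,  "osl": 128},   # short query, short reply
    "summarization": {"isl": 2048, "osl": 256},   # long input, short output
    "generation":    {"isl": 128,  "osl": 1024},  # short prompt, long output
}

for name, shape in use_cases.items():
    print(f"{name:14s} ISL={shape['isl']:5d}  OSL={shape['osl']:5d}")
```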
Load control parameters
Load control parameters, as defined in this section, are used to induce load on the LLM system. The most common are concurrency, the number of simultaneous in-flight requests, and request rate, the number of new requests issued per second.
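The sketch below contrasts these two load controls: fixed concurrency (a bounded pool of in-flight requests) and fixed request rate (new requests issued on a schedule, regardless of completions). The send_request function is a stand-in for a real client call.

```python
# Load-control sketch: fixed concurrency vs. fixed request rate.
# send_request is a stand-in for a real client call to the serving endpoint.
import time
from concurrent.futures import ThreadPoolExecutor

def send_request(i):
    time.sleep(0.5)                                  # placeholder for actual request latency
    return i

# Fixed concurrency: at most `concurrency` requests are in flight at any time.
concurrency = 8
with ThreadPoolExecutor(max_workers=concurrency) as pool:
    list(pool.map(send_request, range(32)))

# Fixed request rate: issue a new request every 1/rate seconds, independent of completions.
rate = 4.0                                           # requests per second
with ThreadPoolExecutor(max_workers=64) as pool:
    futures = [ ]
    for i in range(16):
        futures.append(pool.submit(send_request, i))
        time.sleep(1.0 / rate)
    for f in futures:
        f.result()
```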
Other parameters
In addition, there are other LLM serving parameters, such as the sampling method, that can affect inference performance as well as the accuracy of the benchmark.
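For example, sampling settings are usually passed per request. The sketch below shows illustrative greedy-like and top_p configurations using OpenAI-style parameter names, whose support can vary by serving framework.

```python
# Illustrative sampling configurations for an OpenAI-compatible request body.
# Parameter support varies by serving framework; these names follow the OpenAI-style API.
greedy_request = {
    "model": "my-llm",                               # hypothetical model name
    "messages": [{"role": "user", "content": "Hello"}],
    "temperature": 0.0,                              # approximates greedy decoding
    "max_tokens": 64,
}

top_p_request = {
    "model": "my-llm",
    "messages": [{"role": "user", "content": "Hello"}],
    "temperature": 0.7,                              # sample from a softened distribution
    "top_p": 0.9,                                    # nucleus sampling over the top 90% of probability mass
    "max_tokens": 64,
}
```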
Get started
LLM performance benchmarking is a critical step toward high-performance, cost-efficient LLM serving at scale. This post has covered the most important metrics and parameters to consider when benchmarking LLM inference. By understanding how LLM inference works and combining load testing with performance benchmarking, developers can build a comprehensive picture of their LLM deployment's capabilities and identify areas for improvement.
FAQs
Q: What is the difference between load testing and performance benchmarking?
A: Load testing focuses on simulating a large number of concurrent requests to a model to assess its ability to handle real-world traffic at scale, while performance benchmarking measures the actual performance of the model itself.
Q: What are the key metrics for LLM inference performance benchmarking?
A: The key metrics include time to first token, end-to-end request latency, intertoken latency, tokens per second, and requests per second.
Q: How do I choose the right sequence length for my LLM application?
A: The sequence length depends on the specific use case and should be chosen based on the expected input and output lengths.
Q: What are the important load control parameters for LLM inference?
A: The important load control parameters include concurrency, request rate, and batch size.
Q: How do I optimize my LLM inference for better performance?
A: Performance can be improved by choosing appropriate sampling settings, such as greedy decoding or top_p sampling, and by tuning serving parameters such as batch size and concurrency.