Optimizing and Deploying Large Language Models with NVIDIA TensorRT-LLM and Triton Inference Server
Hardware and software requirements
For optimizing and deploying your models, you need NVIDIA GPUs that support TensorRT-LLM and Triton Inference Server; recent GPU generations are recommended. You can find the list of supported GPUs in the hardware section of the TensorRT-LLM support matrix. You can also deploy your models on public cloud infrastructure with the appropriate GPU resources, using managed Kubernetes services such as AWS EKS, Azure AKS, GCP GKE, or OCI OKE.
Optimize LLMs with TensorRT-LLM
TensorRT-LLM supports a variety of state-of-the-art models. You can download the model checkpoints from Hugging Face, then use TensorRT-LLM to build engines that contain the model optimizations. Downloading the LLMs requires a Hugging Face access token. You can store this token in a Kubernetes secret, which the Kubernetes Deployment uses in a later step to pull the models.
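As an illustrative sketch only: the secret, model, and directory names below are placeholders, and the exact conversion script depends on the model and TensorRT-LLM version you use. The token storage and engine build might look like:

```shell
# Store the Hugging Face access token as a Kubernetes secret
# (the secret name and key are placeholders).
kubectl create secret generic hf-model-pull \
  --from-literal=password=<your_hf_access_token>

# Convert the downloaded checkpoint into TensorRT-LLM format, then
# build the optimized engine (model name and paths are illustrative).
python3 convert_checkpoint.py --model_dir ./gpt2 \
  --output_dir ./ckpt --dtype float16
trtllm-build --checkpoint_dir ./ckpt \
  --output_dir ./engines/gpt2 --gemm_plugin float16
```

The resulting engine directory is what the Triton model repository points at in the deployment step.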
Autoscale Deployment of LLMs with Kubernetes
After optimizing your LLMs with TensorRT-LLM, you can deploy the models using Triton and autoscale the deployment with Kubernetes. Three main steps are required to deploy LLMs for AI inference:
- Create a Kubernetes Deployment for Triton servers
- Create a Kubernetes Service to expose the Triton servers as a network service
- Autoscale the Deployment using Horizontal Pod Autoscaler (HPA) based on Triton metrics scraped by Prometheus
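The HPA in the last step can be sketched as a manifest like the following; the Deployment name, metric name, and target value are assumptions and must match the Triton metric your Prometheus adapter actually exposes:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: triton-hpa              # placeholder name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: triton-server         # placeholder Deployment name
  minReplicas: 1
  maxReplicas: 4
  metrics:
  - type: Pods
    pods:
      metric:
        name: triton_queue_compute_ratio   # hypothetical custom metric
      target:
        type: AverageValue
        averageValue: "1"
```

With a queue-to-compute style metric, the HPA adds Pods when requests spend proportionally more time waiting in the queue than being computed.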
Helm Chart for LLM Deployment
You can use a Helm chart for deployment, as it is easy to modify and deploy across different environments. To find the Helm chart, see Autoscaling and Load Balancing Generative AI with Triton Server and TensorRT-LLM.
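Installing the chart then reduces to a single command; the release name, chart path, and value overrides below are placeholders for whatever the repository actually provides:

```shell
# Install the chart with environment-specific overrides
# (names and values keys here are hypothetical).
helm install gpt2-deployment ./chart \
  --values ./chart/values.yaml \
  --set model=gpt2
```

Keeping environment-specific settings in a values file is what makes the same chart reusable across clusters.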
Send test inference requests
Finally, you can test the Triton servers by sending inference requests from clients. A sample client folder is provided, from which you can build a client container image:
$ docker build -f ./containers/client.containerfile -t <client_image_name> ./containers/
Next, set the image name in the .yaml file in the clients folder to the client container image that you built. Use the following command to create the client Deployment with one replica:
$ kubectl apply -f ./clients/gpt2.yaml
deployment.apps/client-gpt2 created
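To generate enough load to trigger autoscaling, you can then scale the client Deployment up and watch the HPA react. The Deployment name comes from the output above; the HPA itself is whatever your chart created:

```shell
# Increase inference load by adding client replicas
kubectl scale deployment/client-gpt2 --replicas=10

# Watch the HPA adjust the number of Triton Pods as the
# custom metric crosses its target
kubectl get hpa -w
```

Scaling the clients back down should, after the stabilization window, cause the HPA to remove the extra Triton Pods again.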
Conclusion
In this post, we provided step-by-step instructions for deploying LLMs and autoscaling the deployment in a Kubernetes environment. LLMs can be optimized with NVIDIA TensorRT-LLM, then deployed with NVIDIA Triton Inference Server. Prometheus scrapes the Triton metrics and exposes them to Kubernetes, so the HPA can use a custom metric to scale the number of Pods up or down with the volume of inference requests from clients.
Frequently Asked Questions
Q: What are the hardware requirements for optimizing and deploying LLMs?
A: NVIDIA GPUs that support TensorRT-LLM and Triton Inference Server are required. You can find the list of supported GPUs in the hardware section of the support matrix for TensorRT-LLM.
Q: How can I optimize LLMs using TensorRT-LLM?
A: You can download the model checkpoints from Hugging Face, then use TensorRT-LLM to build engines that contain the model optimizations. To download the LLMs, you will need an access token. You can then create a Kubernetes secret with the access token, which will be used in a later step of Kubernetes Deployment to download the models.
Q: How can I autoscale the deployment of LLMs with Kubernetes?
A: After optimizing your LLMs with TensorRT-LLM, you can deploy the models using Triton and autoscale the deployment with Kubernetes. Three main steps are required to deploy LLMs for AI inference: create a Kubernetes Deployment for Triton servers, create a Kubernetes Service to expose the Triton servers as a network service, and autoscale the Deployment using Horizontal Pod Autoscaler (HPA) based on Triton metrics scraped by Prometheus.
Q: What is a Helm chart for LLM deployment?
A: A Helm chart is a package of files that describes a related set of Kubernetes resources as one application. It is easy to modify and deploy across different environments. You can find the Helm chart for LLM deployment in Autoscaling and Load Balancing Generative AI with Triton Server and TensorRT-LLM.

