Autoscaling NVIDIA NIM Microservices on Kubernetes

Prerequisites

To follow along with this tutorial, you need the following:

  • An NVIDIA AI Enterprise license
  • A Kubernetes cluster version 1.29 or later (we used DGX Cloud Clusters)
  • Admin access to the Kubernetes cluster
  • Kubernetes CLI tool kubectl installed
  • Helm CLI installed

Setting up a Kubernetes cluster

The first step in this tutorial is to set up your Kubernetes cluster with the appropriate components to enable metric scraping and availability to the Kubernetes HPA service. This requires the following components:

  • Kubernetes Metrics Server
  • Prometheus
  • Prometheus Adapter
  • Grafana

Kubernetes Metrics Server

Metrics Server scrapes resource metrics from kubelets and exposes them in the Kubernetes API server through the Metrics API. These metrics are used by both the Horizontal Pod Autoscaler (HPA) and the kubectl top command.

To install the Kubernetes Metrics Server, use Helm:

helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server/
helm upgrade --install metrics-server metrics-server/metrics-server

Prometheus and Grafana

Prometheus and Grafana are well-known tools for scraping metrics from pods and creating dashboards. To install Prometheus and Grafana, use the kube-prometheus-stack Helm chart that includes many different components.

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install [RELEASE_NAME] prometheus-community/kube-prometheus-stack

The Prometheus adapter exposes the metrics scraped by Prometheus in the Kubernetes API server through the Metrics API. This enables HPA to make scaling decisions based on custom metrics from pods.
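To sketch how a pod-level metric such as gpu_cache_usage_perc becomes visible to HPA, a discovery rule in the adapter's ConfigMap could look like the following. This is a minimal example, assuming the default namespace and pod label conventions; adjust the queries to match your deployment:

# Hypothetical rule for the prometheus-adapter config.yaml.
# It discovers the gpu_cache_usage_perc series exported by NIM pods
# and exposes it through the custom metrics API for HPA to target.
rules:
  custom:
  - seriesQuery: 'gpu_cache_usage_perc{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "gpu_cache_usage_perc"
      as: "gpu_cache_usage_perc"
    metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'

After the adapter picks up the configuration, the metric should appear under the /apis/custom.metrics.k8s.io/v1beta1 endpoint, which you can inspect with kubectl get --raw.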

helm install prom-adapter prometheus-community/prometheus-adapter -n prometheus

Make sure that the Prometheus adapter is pointing to the correct Prometheus service endpoint. In this case, we had to edit the deployment and correct the URL:

kubectl edit deployment prom-adapter-prometheus-adapter -n prometheus
spec:
  affinity: {}
  containers:
  - args:
    - /adapter
    - --secure-port=6443
    - --cert-dir=/tmp/cert
    - --prometheus-url=http://prometheus-prometheus.prometheus.svc:9090
    - --metrics-relist-interval=1m
    - --v=4
    - --config=/etc/adapter/config.yaml
    image: registry.k8s.io/prometheus-adapter/prometheus-adapter:v0.12.0

Deploying a NIM microservice

In this tutorial, you use NIM for LLMs as a microservice to scale, specifically using model meta/llama-3.1-8b-instruct.

After deployment, you should note the service name and namespace of your NIM for LLMs microservice, as this will be used in many commands.

NIM for LLMs already exposes a Prometheus endpoint with many interesting metrics. To see the endpoint, use the following commands:

kubectl -n <namespace> port-forward svc/<service_name> 8080

From a browser, go to localhost:8080/metrics and look for the specific metric named gpu_cache_usage_perc. In this post, you use this metric as a basis for autoscaling. This metric shows the percent utilization of the KV cache and is reported by the vLLM stack.
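For orientation, the endpoint returns metrics in the Prometheus exposition format. The output for this metric looks similar to the following (the label and values shown here are illustrative, not measured):

# HELP gpu_cache_usage_perc GPU KV-cache usage. 1 means 100 percent usage
# TYPE gpu_cache_usage_perc gauge
gpu_cache_usage_perc{model_name="meta/llama-3.1-8b-instruct"} 0.09

The value is reported as a fraction between 0 and 1, which is why the HPA target later in this tutorial is expressed in milli-units (100m, that is, 0.1).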

Creating HPA

Now that you have observed the impact of concurrency on KV cache utilization, you can create the HPA resource. Create the HPA resource to scale based on the gpu_cache_usage_perc metric:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gpu-hpa-cache
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: meta-llama3-8b
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: gpu_cache_usage_perc
      target:
        type: AverageValue
        averageValue: 100m

Apply the HPA resource:

kubectl create -f hpa-gpu-cache.yaml -n <namespace>

Run genai-perf at different concurrencies (10, 100, 200) and watch the HPA metric increase with kubectl get hpa:

NAME            REFERENCE                   TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
gpu-hpa-cache   Deployment/meta-llama3-8b   9m/100m   1         10        1          3m37s

Check the number of pods and you should see that autoscaling added two new pods:

NAME                                       READY   STATUS              RESTARTS      AGE
meta-llama3-8b-5c6ddbbfb5-85p6c            1/1     Running             0             25s
meta-llama3-8b-5c6ddbbfb5-dp2mv            1/1     Running             0             146m
meta-llama3-8b-5c6ddbbfb5-sf85v            1/1     Running             0             26s

Conclusion

In this post, we described how to set up your Kubernetes cluster to scale on custom metrics and showed how you can scale a NIM for LLMs based on the KV cache utilization parameter.

There are many advanced areas to explore further in this topic. For example, many other metrics could also be considered for scaling, such as request latency, request throughput, and GPU compute utilization. You can scale on multiple metrics in one HPA resource and scale accordingly.
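As a sketch, an HPA that combines KV cache utilization with a second pod metric could look like the following. The num_requests_running metric name is an assumption; substitute whichever metrics your adapter exposes. When multiple metrics are listed, HPA computes a desired replica count for each and scales to the largest:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gpu-hpa-multi
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: meta-llama3-8b
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: gpu_cache_usage_perc
      target:
        type: AverageValue
        averageValue: 100m
  # Hypothetical second metric: in-flight requests per pod.
  - type: Pods
    pods:
      metric:
        name: num_requests_running
      target:
        type: AverageValue
        averageValue: "5"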

Another area of interest is the ability to create new metrics using Prometheus Query Language (PromQL) and add them to the configmap of the Prometheus adapter so that HPA can scale.
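For example, a ConfigMap rule can derive a per-second throughput metric from a cumulative counter with PromQL. This is a sketch, assuming the microservice exports a request_success_total counter (the counter name and the 2m rate window are assumptions):

rules:
  custom:
  - seriesQuery: 'request_success_total{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "^request_success_total$"
      as: "request_success_per_second"
    # rate() turns the cumulative counter into a per-second throughput
    # that HPA can target, exposed under the new metric name above.
    metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'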

FAQs

Q: What is the purpose of the Kubernetes Metrics Server?
A: The Kubernetes Metrics Server scrapes resource metrics from kubelets and exposes them in the Kubernetes API server through the Metrics API.

Q: What are Prometheus and Grafana?
A: Prometheus and Grafana are well-known tools for scraping metrics from pods and creating dashboards.

Q: How do I create a new metric using Prometheus Query Language (PromQL)?
A: You can create a new metric using Prometheus Query Language (PromQL) and add it to the configmap of the Prometheus adapter so that HPA can scale.

Q: What are some advanced areas to explore further in this topic?
A: Some advanced areas to explore further in this topic include scaling on multiple metrics in one HPA resource, scaling based on request latency, request throughput, and GPU compute utilization, and creating new metrics using Prometheus Query Language (PromQL).
