How to Autoscale Kubernetes Pods Based on GPU


There are several resources available on the internet on how to scale your Kubernetes pods based on CPU, but when it comes to scaling pods based on GPU, it’s hard to find a concise breakdown that outlines each step and how to test it. In this article, we outline the steps to scale Kubernetes pods based on GPU metrics. These steps are performed on AKS (Azure Kubernetes Service), but they work with most cloud service providers as well as with self-managed clusters.

For this tutorial, you will need an API key. Contact us to download yours. 


Step 0: Prerequisites

Kubernetes cluster

You’ll need to have a Kubernetes cluster up and running for this tutorial. To set up an AKS cluster, see this guide from Azure.

Note: Your cluster should have at least two GPU-enabled nodes.

Kubectl

To manage Kubernetes resources, set up the kubectl command line client.

Here is a guide to install kubectl if you haven’t installed it already.
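As a quick sanity check, you can confirm that kubectl can reach your cluster and that your GPU-enabled nodes are in the Ready state:

kubectl get nodes -o wide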

Helm

Helm is used to manage the packaging, configuration, and deployment of resources to the Kubernetes cluster. We’ll make use of Helm throughout this tutorial.

Use this guide and follow your OS specific installation instructions.
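To confirm Helm is installed and on your PATH, you can print its version:

helm version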

Step 1: Install metrics server

Now that we have the prerequisites installed and set up, we’ll move ahead with installing the Kubernetes plugins and tools needed for autoscaling based on GPU metrics.

The metrics server collects resource metrics from the kubelet on each node and exposes them through the Kubernetes Metrics API. Most cloud distributions of Kubernetes (e.g., AKS), as well as local distributions, already have metrics-server installed. If you’re not sure, follow the instructions below to check and, if needed, install it.

  1. To check if you have metrics-server running:

kubectl get pods -A | grep metrics-server

If metrics-server is installed, you should see an output like this:

kube-system  metrics-server-774f99dbf4-tjw6l

  2. In case you don’t have it installed, use the following command to install it:

kubectl apply -f \
https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
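Once metrics-server is running, a quick way to confirm it is serving data is to query node resource usage (it can take a minute or two after installation before results appear):

kubectl top nodes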
				
			


Step 2: Install the NVIDIA device plugin

The NVIDIA device plugin for Kubernetes is a DaemonSet that allows you to run GPU-enabled containers in your cluster.

Install it using the following command:

				
kubectl create -f \
https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.11.0/nvidia-device-plugin.yml

To learn more about the NVIDIA device plugin, see this resource here.
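Once the device plugin pods are running, the GPUs should show up as allocatable resources on your nodes. A quick way to check (assuming the standard nvidia.com/gpu resource name) is:

kubectl describe nodes | grep "nvidia.com/gpu"

Each GPU-enabled node should report nvidia.com/gpu under both Capacity and Allocatable.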

Step 3: Install dcgm exporter

DCGM-Exporter collects GPU telemetry using the Go bindings for NVIDIA’s DCGM and lets you monitor the health and utilization of your GPUs. It exposes an easy-to-consume HTTP endpoint (/metrics) for monitoring tools like Prometheus.

Run the following command to install dcgm-exporter:

				
kubectl create -f \
https://raw.githubusercontent.com/NVIDIA/dcgm-exporter/master/dcgm-exporter.yaml


Once it is running, you can try to query its /metrics endpoint.

First, forward port 9400 of the dcgm-exporter service (run this command in a separate terminal):

				
kubectl port-forward svc/dcgm-exporter 9400:9400


Then query the /metrics endpoint:

				
curl localhost:9400/metrics
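You should see a long list of Prometheus-format metrics. As a rough illustration (labels and values will differ for your cluster and GPU model), the utilization metric we will later scale on looks something like this:

# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-...",device="nvidia0",Hostname="dcgm-exporter-6bjn8"} 0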
				
			


Step 4: Install kube-prometheus-stack

Next, install the Prometheus stack using a customized kube-prometheus-stack values file. This values file includes some changes suggested by NVIDIA (to make Prometheus reachable from your local machine) and an additionalScrapeConfigs entry that creates a job to scrape the metrics exported by dcgm-exporter.

We’ll generate and edit the kube-prometheus-stack.values file in the steps below.

Add & update the helm repo:

				
helm repo add prometheus-community \
https://prometheus-community.github.io/helm-charts

helm repo update


Once the Helm repo is set up, export the chart’s default values so you can modify them:

				
					
helm inspect values prometheus-community/kube-prometheus-stack > /tmp/kube-prometheus-stack.values


In the Prometheus instance section of the chart, update the service type from ClusterIP to NodePort. This change will make the Prometheus server available at port 30090 on each node, so you can reach it from your local machine.

				
From:
 ## Port to expose on each node
 ## Only used if service.type is 'NodePort'
 ##
 nodePort: 30090

 ## Loadbalancer IP
 ## Only use if service.type is "loadbalancer"
 loadBalancerIP: ""
 loadBalancerSourceRanges: []
 ## Service type
 ##
 type: ClusterIP

To:
 ## Port to expose on each node
 ## Only used if service.type is 'NodePort'
 ##
 nodePort: 30090

 ## Loadbalancer IP
 ## Only use if service.type is "loadbalancer"
 loadBalancerIP: ""
 loadBalancerSourceRanges: []
 ## Service type
 ##
 type: NodePort


Update the value of serviceMonitorSelectorNilUsesHelmValues to false.

				
## If true, a nil or {} value for prometheus.prometheusSpec.serviceMonitorSelector
## will cause the prometheus resource to be created with selectors based on
## values in the helm deployment, which will also match the servicemonitors created
##

serviceMonitorSelectorNilUsesHelmValues: false


Add the following scrape config to the additionalScrapeConfigs section of the values file.

				
					
additionalScrapeConfigs:
- job_name: gpu-metrics
  scrape_interval: 1s
  metrics_path: /metrics
  scheme: http
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - default
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_node_name]
    action: replace
    target_label: kubernetes_node


Once you have your values file ready, install kube-prometheus-stack via Helm:

				
helm install prometheus-community/kube-prometheus-stack \
  --create-namespace --namespace prometheus \
  --generate-name \
  --values /tmp/kube-prometheus-stack.values


After installation is finished, your output should look like this.

				
NAME: kube-prometheus-stack-1652691100
LAST DEPLOYED: Mon May 16 14:22:12 2022
NAMESPACE: prometheus
STATUS: deployed
REVISION: 1
NOTES:
kube-prometheus-stack has been installed. Check its status by running:
 kubectl --namespace prometheus get pods -l "release=kube-prometheus-stack-1652691100"
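To double-check that Prometheus is actually scraping the GPU metrics, you can port-forward the Prometheus service and query it directly. The service name below is the one the Prometheus Operator typically creates; adjust it if kubectl get svc -n prometheus shows a different name.

# Forward the Prometheus server port (run in a separate terminal)
kubectl -n prometheus port-forward svc/prometheus-operated 9090:9090

# Ask Prometheus for the GPU utilization metric
curl -s 'http://localhost:9090/api/v1/query?query=DCGM_FI_DEV_GPU_UTIL'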
				
			


Step 5: Install prometheus-adapter

Now we’ll install the prometheus-adapter. The adapter gathers available metrics from Prometheus at regular intervals and exposes them through the Kubernetes custom metrics API, which is what the HorizontalPodAutoscaler consumes.

				
prometheus_service=$(kubectl get svc -nprometheus -lapp=kube-prometheus-stack-prometheus -ojsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}')

helm upgrade \
--install prometheus-adapter prometheus-community/prometheus-adapter \
--set rbac.create=true,prometheus.url=http://${prometheus_service}.prometheus.svc.cluster.local,prometheus.port=9090


This will take a moment to set up. Once the adapter is up, you should be able to query GPU metrics through the Kubernetes custom metrics API.
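To confirm, you can list the metrics exposed by the adapter and look for the DCGM metric we’ll scale on (it may take a minute after installation for the metric to appear):

kubectl get --raw '/apis/custom.metrics.k8s.io/v1beta1' | grep DCGM_FI_DEV_GPU_UTIL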

Step 6: Create an HPA which scales based on GPU

Now that all the pieces are in place, create a HorizontalPodAutoscaler and configure it to scale on the basis of the GPU utilization metric (DCGM_FI_DEV_GPU_UTIL):

				
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: my-gpu-app
spec:
  maxReplicas: 3  # Update this accordingly
  minReplicas: 1
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-gpu-app
  metrics:
  - type: Pods  # scale based on GPU
    pods:
      metric:
        name: DCGM_FI_DEV_GPU_UTIL
      target:
        type: AverageValue
        averageValue: 80

There are other GPU metrics available besides DCGM_FI_DEV_GPU_UTIL. You can find a complete list of available metrics in the DCGM-Exporter docs.
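Once your GPU pods are running (Step 8 onward), you can also inspect the raw per-pod values the HPA will consume by querying the custom metrics API directly; this is a handy debugging step if the HPA ever reports unknown targets:

kubectl get --raw '/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/DCGM_FI_DEV_GPU_UTIL'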

Step 7: Create a LoadBalancer service (Optional)

This is an optional step to expose your app to the web. If your cluster runs on a cloud service provider, there’s a good chance the provider will allocate a public IP address for this LoadBalancer service, which you can then use to interact with your application. Alternatively, you can create a service of type NodePort and access your app through that.

				
apiVersion: v1
kind: Service
metadata:
  name: app-ip
  labels:
    component: app
spec:
  type: LoadBalancer
  selector:
    component: app
  ports:
    - name: http
      port: 80
      targetPort: 8080

In this configuration, we assume that our app listens on port 8080 inside the pod, and we map it to port 80 on the service.
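If you create the LoadBalancer service, the external IP can take a little while to be provisioned by your cloud provider. You can watch for it with:

kubectl get svc app-ip --watch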

Step 8: Putting it all together

Now that we have all the pieces we need, let’s create a Kubernetes manifest file and save it as autoscaling-demo.yml.

For demonstration, we’ll use the container image of the deid application, Private AI’s container-based de-identification system. You can use any GPU-based application of your choice.

				
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-gpu-app
spec:
  replicas: 1
  selector:
    matchLabels:
      component: app
  template:
    metadata:
      labels:
        component: app
    spec:
      containers:
        - name: app
          securityContext:
            capabilities: # SYS_ADMIN capability needed for DCGM Exporter
              add:
                - SYS_ADMIN
          resources:
            limits:
              nvidia.com/gpu: 1
          image: privateai/deid:2.11full_gpu # You can use any GPU-based image

---
apiVersion: v1
kind: Service
metadata:
  name: app-ip
  labels:
    component: app
spec:
  type: LoadBalancer
  selector:
    component: app
  ports:
    - name: http
      port: 80
      targetPort: 8080 # The port might be different for your application

---
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: my-gpu-app
spec:
  maxReplicas: 2 # Update this according to your desired number of replicas
  minReplicas: 1
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-gpu-app
  metrics:
    - type: Pods
      pods:
        metric:
          name: DCGM_FI_DEV_GPU_UTIL
        target:
          type: AverageValue
          averageValue: 30


Step 9: Create a deployment

Run the kubectl create command to create your deployment:

				
kubectl create -f autoscaling-demo.yml


Once your deployment is complete, you should be able to see the running status of the pods and of our HorizontalPodAutoscaler, which will scale based on GPU utilization.

To check the status of the pods:

				
$ kubectl get pods

NAME                                 READY   STATUS             RESTARTS   AGE
dcgm-exporter-6bjn8                  1/1     Running            0          3h37m
dcgm-exporter-xmn74                  1/1     Running            0          3h37m
my-gpu-app-675b967d56-q7swb          1/1     Running            0          12m
prometheus-adapter-6696b6d76-g2csx   1/1     Running            0          104m


To check the status of the HorizontalPodAutoscaler:

				
$ kubectl get hpa

NAME         REFERENCE               TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
my-gpu-app   Deployment/my-gpu-app   0/30      1         2         1          2m15s


To get your public/external IP:

				
$ kubectl get svc

NAME                 TYPE           CLUSTER-IP     EXTERNAL-IP     PORT(S)        AGE
app-ip               LoadBalancer   10.0.208.227   20.233.60.124   80:31074/TCP   15s
dcgm-exporter        ClusterIP      10.0.116.180   <none>          9400/TCP       3h55m
kubernetes           ClusterIP      10.0.0.1       <none>          443/TCP        4h26m
prometheus-adapter   ClusterIP      10.0.12.96     <none>          443/TCP        122m


Here, 20.233.60.124 is your external IP.

Step 10: Test autoscaling

Increase the GPU utilization by making requests to the application. When the average GPU utilization (the target) crosses 30, the threshold we set in the HPA, you’ll observe that the application scales up and spins up another pod.

Making a request to your app

Here we are making requests to the /deidentify_text endpoint of our deid container. You can make requests to any resource that utilizes the GPU.

				
for ((i=1;i<=10;i++)); do \
curl -X POST http://20.233.60.124/deidentify_text \
-H 'content-type: application/json' \
-d '{"text": ["My name is John and my friend is Grace", "I live in Berlin"], "unique_pii_markers": false, "key": ""}' & \
done

Need an API key? Contact us to download yours.

Meanwhile, keep observing the status of the HorizontalPodAutoscaler. When the GPU utilization (target) crosses 30, the system will automatically spin up another pod.
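A convenient way to keep an eye on it is to watch the HPA resource:

kubectl get hpa my-gpu-app --watch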

				
$ kubectl get hpa

NAME         REFERENCE               TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
my-gpu-app   Deployment/my-gpu-app   40/30     1         2         2          30m

Check the status of the pods and you’ll notice that we now have another my-gpu-app pod, spun up by our autoscaler.
				
$ kubectl get pods

NAME                                 READY   STATUS             RESTARTS   AGE
dcgm-exporter-6bjn8                  1/1     Running            0          3h37m
dcgm-exporter-xmn74                  1/1     Running            0          3h37m
my-gpu-app-675b967d56-q7swb          1/1     Running            0          30m
my-gpu-app-572f924e36-q7swb          1/1     Running            0          5m
prometheus-adapter-6696b6d76-g2csx   1/1     Running            0          104m


Additional resources for Kubernetes GPU deployment

Interested in receiving more tech tips like autoscaling Kubernetes pods based on GPU? Sign up for Private AI’s mailing list to get notified about the latest information on machine learning deployment, privacy, and more.
