There are several resources available on the internet on how to scale your Kubernetes pods based on CPU, but when it comes to scaling based on GPU, it's hard to find a concise breakdown that outlines each step and how to test it. In this article, we outline the steps to scale Kubernetes pods based on GPU metrics. These steps are performed on AKS (Azure Kubernetes Service), but they work with most cloud service providers as well as with self-managed clusters.
For this tutorial, you will need an API key. Contact us to download yours.
Step 0: Prerequisites
Kubernetes cluster
You’ll need to have a Kubernetes cluster up and running for this tutorial. To set up an AKS cluster, see this guide from Azure.
Note: Your cluster should have at least 2 GPU enabled nodes.
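If your AKS cluster doesn't have GPU nodes yet, one way to add them is with an extra node pool via the Azure CLI. The sketch below is only illustrative; the resource group, cluster name, node pool name, and VM size are placeholders that you should adjust to your environment.
az aks nodepool add \
--resource-group myResourceGroup \
--cluster-name myAKSCluster \
--name gpunodepool \
--node-count 2 \
--node-vm-size Standard_NC6s_v3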
Kubectl
To manage Kubernetes resources, set up the kubectl command line client.
Here is a guide to install kubectl if you haven’t installed it already.
Helm
Helm is used to manage the packaging, configuration, and deployment of resources to the Kubernetes cluster. In this tutorial, we'll make use of Helm.
Use this guide and follow your OS specific installation instructions.
Step 1: Install metrics server
Now that we have the prerequisites installed and set up, we'll move ahead with installing the Kubernetes plugins and tools needed to set up autoscaling based on GPU metrics.
The metrics server collects resource metrics from the Kubelet and exposes them through the Kubernetes Metrics API. Most cloud distributions of Kubernetes (e.g. AKS), as well as local distributions, already have metrics-server installed. If you're not sure, follow the instructions below to check and, if needed, install it.
- To check if you have metrics-server running:
kubectl get pods -A | grep metrics-server
If the metrics-server is installed, you should see an output like this.
kube-system metrics-server-774f99dbf4-tjw6l
- In case you don’t have it installed, use the following command to install it.
kubectl apply -f \
https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
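Once installed, a quick way to confirm the Metrics API is responding is to ask for node metrics:
kubectl top nodes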
Step 2: Install nvidia device plugin
The NVIDIA device plugin for Kubernetes is a DaemonSet that allows you to run GPU-enabled containers in your cluster.
Install it using the following command:
kubectl create -f \
https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.11.0/nvidia-device-plugin.yml
To learn more about the Nvidia device plugin, see this resource here.
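After the DaemonSet is running, the GPUs should appear as allocatable resources on your nodes. One way to check (assuming the plugin has finished registering on each node):
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"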
Step 3: Install dcgm exporter
DCGM-Exporter collects GPU telemetry using Go bindings of NVIDIA's DCGM APIs and lets you monitor the health and utilization of your GPUs. It exposes an easy-to-consume HTTP endpoint (/metrics) for monitoring tools like Prometheus.
Run the following command to install dcgm-exporter:
kubectl create -f \
https://raw.githubusercontent.com/NVIDIA/dcgm-exporter/master/dcgm-exporter.yaml
Once it is running, you can try to query its /metrics endpoint.
First, forward port 9400 of dcgm-exporter service. (Run this command in a separate terminal)
kubectl port-forward svc/dcgm-exporter 9400:9400
Query /metrics endpoint.
curl localhost:9400/metrics
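The output contains a long list of metrics; the one we'll scale on later is DCGM_FI_DEV_GPU_UTIL, so you can filter for it directly:
curl -s localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL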
Step 4: Install kube-prometheus-stack
Next, install the Prometheus stack using a customized kube-prometheus-stack values file. This values file includes changes suggested by NVIDIA (to make Prometheus available to your local machine) and an additionalScrapeConfigs entry that creates a job to scrape the metrics exported by dcgm-exporter.
We'll build the kube-prometheus-stack.values file in the steps below.
Add & update the helm repo:
helm repo add prometheus-community \
https://prometheus-community.github.io/helm-charts
helm repo update
Once we have the helm repo set up, inspect the helm chart and modify the settings.
helm inspect values prometheus-community/kube-prometheus-stack > /tmp/kube-prometheus-stack.values
In the Prometheus instance section of the chart, update the service type from ClusterIP to NodePort. This change will expose the Prometheus server on each node at port 30090.
From:
## Port to expose on each node
## Only used if service.type is 'NodePort'
##
nodePort: 30090
## Loadbalancer IP
## Only use if service.type is "loadbalancer"
loadBalancerIP: ""
loadBalancerSourceRanges: []
## Service type
##
type: ClusterIP
To:
## Port to expose on each node
## Only used if service.type is 'NodePort'
##
nodePort: 30090
## Loadbalancer IP
## Only use if service.type is "loadbalancer"
loadBalancerIP: ""
loadBalancerSourceRanges: []
## Service type
##
type: NodePort
Update the value of serviceMonitorSelectorNilUsesHelmValues to false.
## If true, a nil or {} value for prometheus.prometheusSpec.serviceMonitorSelector
## will cause the prometheus resource to be created with selectors based on
## values in the helm deployment, which will also match the servicemonitors created
##
serviceMonitorSelectorNilUsesHelmValues: false
Add the following scrape config to the additionalScrapeConfigs section of the values file.
additionalScrapeConfigs:
- job_name: gpu-metrics
  scrape_interval: 1s
  metrics_path: /metrics
  scheme: http
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - default
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_node_name]
    action: replace
    target_label: kubernetes_node
Once you have your helm chart ready, install kube-prometheus-stack via Helm.
helm install prometheus-community/kube-prometheus-stack \
--create-namespace --namespace prometheus \
--generate-name \
--values /tmp/kube-prometheus-stack.values
After installation is finished, your output should look like this.
NAME: kube-prometheus-stack-1652691100
LAST DEPLOYED: Mon May 16 14:22:12 2022
NAMESPACE: prometheus
STATUS: deployed
REVISION: 1
NOTES:
kube-prometheus-stack has been installed. Check its status by running:
kubectl --namespace prometheus get pods -l "release=kube-prometheus-stack-1652691100"
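Before moving on, it's worth confirming that Prometheus is actually scraping the GPU metrics. One way to check, assuming the operator-created prometheus-operated service (service names may differ in your release), is to port-forward it in a separate terminal and query the Prometheus HTTP API for the GPU utilization metric:
kubectl port-forward -n prometheus svc/prometheus-operated 9090:9090
curl -s 'http://localhost:9090/api/v1/query?query=DCGM_FI_DEV_GPU_UTIL'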
Step 5: Install prometheus-adapter
Now we'll install the prometheus-adapter. The adapter gathers the available metrics from Prometheus at regular intervals and exposes them through the Kubernetes custom metrics API, which the HorizontalPodAutoscaler can consume.
prometheus_service=$(kubectl get svc -nprometheus -lapp=kube-prometheus-stack-prometheus -ojsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}')
helm upgrade \
--install prometheus-adapter prometheus-community/prometheus-adapter \
--set rbac.create=true,prometheus.url=http://${prometheus_service}.prometheus.svc.cluster.local,prometheus.port=9090
This will take a moment to set up. Once it's up, you should be able to query the custom metrics API that the adapter exposes.
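For example, to check that the DCGM metrics are discoverable through the adapter (jq is optional and only used for readability):
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq -r '.resources[].name' | grep -i dcgm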
Step 6: Create an HPA which scales based on GPU
Now that all the pieces are in place, create a HorizontalPodAutoscaler and configure it to scale based on the GPU utilization metric (DCGM_FI_DEV_GPU_UTIL).
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: my-gpu-app
spec:
  maxReplicas: 3 # Update this accordingly
  minReplicas: 1
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-gpu-app
  metrics:
  - type: Pods # scale based on GPU utilization
    pods:
      metric:
        name: DCGM_FI_DEV_GPU_UTIL
      target:
        type: AverageValue
        averageValue: 80
There are more GPU metrics available than just DCGM_FI_DEV_GPU_UTIL. You can find a complete list of available metrics in the DCGM-Exporter docs.
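Since dcgm-exporter is already running in the default namespace, you can also verify that the exact metric the HPA will consume resolves through the custom metrics API. This is only a sanity check; adjust the namespace if your setup differs.
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/DCGM_FI_DEV_GPU_UTIL" | jq .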
Step 7: Create a LoadBalancer service (Optional)
This is an optional step to expose your app to the web. If you are setting up your cluster using a cloud service provider, there's a good chance that it'll allocate a public IP address, which you can use to interact with your application. Alternatively, you can create a service of type NodePort and access your app through that.
apiVersion: v1
kind: Service
metadata:
  name: app-ip
  labels:
    component: app
spec:
  type: LoadBalancer
  selector:
    component: app
  ports:
  - name: http
    port: 80
    targetPort: 8080
In this configuration, we assume that our app listens on port 8080 inside the pod, and we map it to port 80 of the service.
Step 8: Putting it all together
Now that we have all the external pieces that we need, let’s create a Kubernetes manifest file and save it as autoscaling-demo.yml.
For the demonstration, we'll use the container image of the deid application, Private AI's container-based de-identification system. You can use any GPU-based application of your choice.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-gpu-app
spec:
  replicas: 1
  selector:
    matchLabels:
      component: app
  template:
    metadata:
      labels:
        component: app
    spec:
      containers:
      - name: app
        securityContext:
          capabilities: # SYS_ADMIN capability needed for the DCGM Exporter
            add:
            - SYS_ADMIN
        resources:
          limits:
            nvidia.com/gpu: 1
        image: privateai/deid:2.11full_gpu # You can use any GPU based image
---
apiVersion: v1
kind: Service
metadata:
  name: app-ip
  labels:
    component: app
spec:
  type: LoadBalancer
  selector:
    component: app
  ports:
  - name: http
    port: 80
    targetPort: 8080 # The port might be different for your application
---
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: my-gpu-app
spec:
  maxReplicas: 2 # Update this according to your desired number of replicas
  minReplicas: 1
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-gpu-app
  metrics:
  - type: Pods
    pods:
      metric:
        name: DCGM_FI_DEV_GPU_UTIL
      target:
        type: AverageValue
        averageValue: 30
Step 9: Create a deployment
Run the kubectl create command to create your deployment.
kubectl create -f autoscaling-demo.yml
Once your deployment is complete, you should be able to see the running status of pods and our HorizontalPodAutoscaler, which will scale based on GPU utilization.
To check the status of pods
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
dcgm-exporter-6bjn8 1/1 Running 0 3h37m
dcgm-exporter-xmn74 1/1 Running 0 3h37m
my-gpu-app-675b967d56-q7swb 1/1 Running 0 12m
prometheus-adapter-6696b6d76-g2csx 1/1 Running 0 104m
To check the status of Horizontal Pod Autoscaler
$ kubectl get hpa
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
my-gpu-app Deployment/my-gpu-app 0/30 1 2 1 2m15s
Getting your public/external IP
$ kubectl get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
app-ip LoadBalancer 10.0.208.227 20.233.60.124 80:31074/TCP 15s
dcgm-exporter ClusterIP 10.0.116.180 <none> 9400/TCP 3h55m
kubernetes ClusterIP 10.0.0.1 <none> 443/TCP 4h26m
prometheus-adapter ClusterIP 10.0.12.96 <none> 443/TCP 122m
Here, 20.233.60.124 is your external IP.
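Rather than copying the address by hand, you can also capture it into a shell variable (the service name app-ip comes from the manifest above) and substitute it into the requests that follow:
EXTERNAL_IP=$(kubectl get svc app-ip -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo $EXTERNAL_IP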
Step 10: Test autoscaling
Increase the GPU utilization by making requests to the application. When the average GPU utilization (the target) crosses 30, the threshold we set in the HPA, you'll observe the application scale up and spin up another pod.
Making a request to your app
Here we are making requests to the /deidentify_text endpoint of our deid container. You can make requests to any resource that utilizes the GPU.
for ((i=1;i<=10;i++)); \
do curl -X POST http://20.233.60.124/deidentify_text \
-H 'content-type: application/json' \
-d '{"text": ["My name is John and my friend is Grace", "I live in Berlin"], "unique_pii_markers": false, "key": ""}' & \
done
Need an API key? Contact us to download yours.
Meanwhile, keep observing the status of the Horizontal Pod Autoscaler. When the GPU utilization (target) crosses 30, the system will automatically spin up another pod.
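To follow the scale-up as it happens, you can also stream HPA updates with the --watch flag instead of polling manually:
kubectl get hpa my-gpu-app --watch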
$ kubectl get hpa
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
my-gpu-app Deployment/my-gpu-app 40/30 1 2 2 30m
Check the status of the pods; you'll notice that the autoscaler has spun up another my-gpu-app pod.
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
dcgm-exporter-6bjn8 1/1 Running 0 3h37m
dcgm-exporter-xmn74 1/1 Running 0 3h37m
my-gpu-app-675b967d56-q7swb 1/1 Running 0 30m
my-gpu-app-572f924e36-q7swb 1/1 Running 0 5m
prometheus-adapter-6696b6d76-g2csx 1/1 Running 0 104m
Additional resources for Kubernetes GPU deployment
- DCGM Exporter documentation
- GitHub repository for NVIDIA device plugin for Kubernetes
- Quick deployment guide for Azure Kubernetes Service (AKS)
- Kubectl cheat sheet for common kubectl commands
- GPU load based autoscaling on Kubernetes clusters
Interested in receiving more tech tips like autoscaling Kubernetes pods based on GPU? Sign up for Private AI’s mailing list to get notified about the latest information on machine learning deployment, privacy, and more.
Sign up for our Community API
The “get to know us” plan. Our full product, but limited to 75 API calls per day and hosted by us.