Introduction
As cloud-native applications grow in scale and sophistication, the ability to dynamically allocate computing resources becomes critical—especially for workloads that require GPUs. Kubernetes provides built-in autoscaling, but it often falls short when you need to scale based on external or custom metrics. This is where KEDA (Kubernetes Event-Driven Autoscaling) excels. KEDA allows your Kubernetes workloads to scale based on real-time metrics from sources like Prometheus.
In this tutorial, we'll walk through how to autoscale an AMD GPU-based workload running on Kubernetes (DOKS) using Prometheus and KEDA. This setup allows you to react to live metrics and optimize GPU utilization efficiently and cost-effectively.
Key Takeaways
- Event-Driven GPU Autoscaling: Learn how to use KEDA to autoscale AMD GPU workloads on Kubernetes (DOKS) based on real-time metrics from Prometheus, enabling responsive and cost-efficient scaling beyond standard CPU-based autoscaling.
- Separation of Workloads: Discover best practices for isolating GPU and non-GPU workloads using multiple node pools, optimizing resource allocation, and reducing operational costs.
- Native AMD GPU Support: Take advantage of DOKS’s built-in support for AMD MI300X GPUs, with the AMD GPU device plugin and metrics exporter enabled by default for seamless integration and observability.
- Custom Metrics for Scaling: See how to configure Prometheus to collect custom GPU metrics and use them as triggers for scaling, allowing you to match resource allocation to actual workload demand (e.g., machine learning inference, video processing, real-time analytics).
- Hands-On, Step-by-Step Guidance: Follow a practical walkthrough covering cluster setup, node pool configuration, enabling GPU metrics, deploying a sample workload, and implementing KEDA-based autoscaling.
- Testing and Validation: Learn how to simulate GPU load, observe autoscaling in action, and verify that new GPU nodes are provisioned and workloads are rescheduled automatically as demand increases.
- Production-Ready Patterns: Gain insights into building robust, scalable, and cost-effective GPU-powered applications on Kubernetes using open-source tools and cloud-native best practices.
By the end of this tutorial, you’ll be able to deploy and autoscale GPU workloads on DOKS with confidence, leveraging event-driven scaling to optimize performance and cost for demanding AI/ML and compute-intensive applications.
Prerequisites
- A cloud account with access to DOKS.
- A DO API token with write permissions
- Basic knowledge of Kubernetes and Helm
Step 1 - Create a DOKS Cluster with AMD GPU Support
You can create a new DOKS cluster by clicking the Create button at the top right of your cloud provider dashboard, or by navigating to Resources within your project and selecting:
- Kubernetes → Create a Kubernetes Cluster.
- Choose Kubernetes Version and Region
- Kubernetes Version: Select the latest stable version to ensure security and feature parity.
- Region: Choose a region that supports GPU Droplets—TOR1, NYC2, or ATL1 are recommended if you're planning to deploy AMD MI300X or other GPU workloads.
- Next, specify the VPC and service networks.
- Once you've named your cluster and chosen a region, configure a node pool. This is where your applications will run.
Note: We recommend creating two node pools:
Node Pool #1: For GPU Workloads
- GPU Type: Under *Choose cluster capacity*, go to the GPU tab.
- Machine Type: Select a GPU Droplet, such as AMD MI300X.
- Min/Max Nodes: Set a minimum of 0 (to save cost when idle) and a higher maximum depending on workload needs.
Node Pool #2: For General/Peripheral Workloads
- Droplet Type: Choose Standard, Dedicated CPU, or Memory-Optimized, depending on your app needs.
- Purpose: This pool is ideal for running Prometheus, dashboards (e.g., Grafana), KEDA, web APIs as well as kube-system components.
- Autoscaling: You can optionally enable autoscaling as well.
Why two node pools?
Isolating GPU and non-GPU workloads allows better scaling control, reduces resource waste, and avoids scheduling conflicts. Peripheral services don’t need expensive GPU nodes.
Note: For AMD GPUs, the AMD GPU device plugin is enabled by default.
Step 2 - Create the Cluster and Enable GPU Metrics
- Review your selections.
- Click Create Kubernetes Cluster.
To enable the AMD GPU Device Metrics Exporter, send a PUT request to your cluster's update endpoint in the provider API:
curl -X PUT \
-H "Content-Type: application/json" \
-H "Authorization: Bearer <your-api-token>" \
-d '{"name": "<your_cluster_name>", "amd_gpu_device_metrics_exporter_plugin": {"enabled": true}}' \
"<your-cluster-update-api-url>"
Step 3 - Install the kube-prometheus-stack and KEDA via Helm
- Prometheus will monitor and collect the custom metrics needed by KEDA.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack -n kube-system
Next, install KEDA with Helm. KEDA will evaluate Prometheus metrics and scale your workloads accordingly.
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda
Step 4 - Deploy a GPU Workload with Custom Metrics
We’ll deploy a sample Go application that exposes a Prometheus gauge metric. This workload requests a GPU and will be scaled based on the metric value.
Create a file named resources.yaml and paste the following manifests:
apiVersion: v1
kind: ConfigMap
metadata:
  name: go-program
data:
  main.go: |
    package main

    import (
        "log"
        "net/http"
        "strconv"

        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promhttp"
    )

    // customGauge holds the simulated load value that drives autoscaling.
    var customGauge = prometheus.NewGauge(prometheus.GaugeOpts{
        Name: "my_custom_gauge",
        Help: "This is a custom gauge metric",
    })

    func init() {
        prometheus.MustRegister(customGauge)
    }

    // handler sets the gauge from the ?value= query parameter.
    func handler(w http.ResponseWriter, r *http.Request) {
        value := r.URL.Query().Get("value")
        floatVal, err := strconv.ParseFloat(value, 64)
        if err != nil {
            http.Error(w, "Bad Request: invalid input", http.StatusBadRequest)
            return
        }
        customGauge.Set(floatVal)
        w.Write([]byte("Hello, from the handler!"))
    }

    func main() {
        http.HandleFunc("/", handler)
        http.Handle("/metrics", promhttp.Handler())
        log.Fatal(http.ListenAndServe(":8080", nil))
    }
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: go-from-configmap
spec:
  replicas: 1
  selector:
    matchLabels:
      app: go-from-configmap
  template:
    metadata:
      labels:
        app: go-from-configmap
    spec:
      # Schedule only onto AMD GPU nodes
      nodeSelector:
        www.progressiverobot.com/gpu-brand: amd
      tolerations:
        - key: CriticalAddonsOnly
          operator: Exists
        - key: amd.com/gpu
          operator: Exists
      containers:
        - name: go-runner
          image: golang:1.24
          command: ["/bin/sh", "-c"]
          resources:
            requests:
              amd.com/gpu: "1"
            limits:
              amd.com/gpu: "1"
          ports:
            - containerPort: 8080
          # Copy the Go source from the ConfigMap, fetch dependencies, and run it
          args:
            - |
              set -e
              cp /config/*.go /app/
              cd /app
              [ -f go.mod ] || go mod init temp
              go mod tidy
              go run .
          volumeMounts:
            - name: go-code
              mountPath: /config
            - name: app-dir
              mountPath: /app
      volumes:
        - name: go-code
          configMap:
            name: go-program
        - name: app-dir
          emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: go-from-configmap
  labels:
    app: go-from-configmap
spec:
  selector:
    app: go-from-configmap
  ports:
    - name: metrics
      port: 8080
      targetPort: 8080
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: go-from-configmap
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app: go-from-configmap
  endpoints:
    - port: metrics
      path: /metrics
      interval: 30s
  namespaceSelector:
    matchNames:
      - kube-system
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: go-from-configmap
spec:
  scaleTargetRef:
    name: go-from-configmap
  pollingInterval: 30   # seconds between Prometheus queries
  cooldownPeriod: 30    # only applies when scaling to zero
  minReplicaCount: 1
  maxReplicaCount: 8
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-operated.kube-system.svc.cluster.local:9090
        metricName: my_custom_gauge
        threshold: '80'
        # Average gauge value per replica
        query: sum(my_custom_gauge{}) / sum(kube_deployment_status_replicas{deployment="go-from-configmap"})
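Apply the resources in the kube-system namespace, which is the namespace the ServiceMonitor's namespaceSelector watches (if you use a different namespace, update matchNames accordingly):
kubectl apply -f resources.yaml -n kube-system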
Step 5 - What You Just Deployed
ConfigMap
We created a ConfigMap that stores a simple Go HTTP server as a string. This app:
- Runs an HTTP server on port 8080.
- Defines a Prometheus gauge metric, my_custom_gauge. This metric represents a simulated load value and is the core driver for our autoscaling logic.
- Exposes a /metrics endpoint compatible with Prometheus scraping.
- Offers a root endpoint, /, where you can update the gauge by calling /?value=123.
Note: This setup is for demo purposes only and not intended for production use.
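The /metrics endpoint returns the Prometheus text exposition format, which is what Prometheus scrapes. As a rough illustration, the Go sketch below pulls the gauge value out of that text. The helper gaugeValue is hypothetical and deliberately minimal: it only handles an unlabeled sample line such as the one this app emits, and it is not a general exposition-format parser.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// gaugeValue scans Prometheus text-format output and returns the value of
// the first unlabeled sample named metric. Comment lines (# HELP, # TYPE)
// are skipped; labeled samples such as metric{a="b"} 1 are not handled.
func gaugeValue(exposition, metric string) (float64, bool) {
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		fields := strings.Fields(line)
		if len(fields) >= 2 && fields[0] == metric {
			v, err := strconv.ParseFloat(fields[1], 64)
			if err != nil {
				return 0, false
			}
			return v, true
		}
	}
	return 0, false
}

func main() {
	// The gauge's lines in GET /metrics after calling /?value=160
	// (the default Go runtime and process metrics are omitted here).
	sample := `# HELP my_custom_gauge This is a custom gauge metric
# TYPE my_custom_gauge gauge
my_custom_gauge 160
`
	v, ok := gaugeValue(sample, "my_custom_gauge")
	fmt.Println(v, ok) // prints: 160 true
}
```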
Deployment
We define a Deployment that uses the official golang:1.24 image. On startup, it:
- Copies the Go source from the mounted ConfigMap into an emptyDir, initializes a Go module, and downloads dependencies with go mod tidy.
- Compiles and runs the program with go run, which exposes the HTTP endpoints.
- Requests 1 AMD GPU via resource requests and limits (amd.com/gpu: 1).
Service
- The Service resource allows Prometheus to access our /metrics endpoint.
- It targets pods labeled app: go-from-configmap.
- It exposes port 8080 under the port name metrics, which the ServiceMonitor references.
ServiceMonitor
To make Prometheus aware of our scraping endpoint, we define a ServiceMonitor. This resource:
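- Tells Prometheus to scrape Services matching the app: go-from-configmap label in the kube-system namespace.
- Configures the scrape path as /metrics.
- Sets the scrape interval to 30s.
The release: prometheus label matters: by default, the Prometheus instance deployed by the kube-prometheus-stack chart only selects ServiceMonitors labeled with its Helm release name (prometheus in Step 3).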
Prometheus will now continuously collect the latest values of my_custom_gauge.
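To confirm what Prometheus has collected, and roughly the value KEDA's Prometheus trigger reads, you can send the ScaledObject's query to Prometheus's HTTP API at /api/v1/query. This is a minimal sketch, assuming you have forwarded the prometheus-operated service to localhost:9090 (for example with kubectl port-forward svc/prometheus-operated 9090:9090 -n kube-system); the type and helper names are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/url"
	"os"
	"strconv"
)

// queryResponse models the parts of an instant-query response used here.
// Each result's value is [<unix time>, "<sample value as a string>"].
type queryResponse struct {
	Status string `json:"status"`
	Data   struct {
		ResultType string `json:"resultType"`
		Result     []struct {
			Metric map[string]string `json:"metric"`
			Value  [2]any            `json:"value"`
		} `json:"result"`
	} `json:"data"`
}

// firstValue extracts the first sample value from a raw response body.
func firstValue(body []byte) (float64, error) {
	var qr queryResponse
	if err := json.Unmarshal(body, &qr); err != nil {
		return 0, err
	}
	if qr.Status != "success" || len(qr.Data.Result) == 0 {
		return 0, fmt.Errorf("no result (status %q)", qr.Status)
	}
	s, ok := qr.Data.Result[0].Value[1].(string)
	if !ok {
		return 0, fmt.Errorf("unexpected value type")
	}
	return strconv.ParseFloat(s, 64)
}

func main() {
	// Same query as the ScaledObject trigger.
	q := `sum(my_custom_gauge{}) / sum(kube_deployment_status_replicas{deployment="go-from-configmap"})`
	resp, err := http.Get("http://localhost:9090/api/v1/query?query=" + url.QueryEscape(q))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	body, err := io.ReadAll(resp.Body)
	resp.Body.Close()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	v, err := firstValue(body)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("average gauge per replica: %g\n", v)
}
```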
ScaledObject (KEDA)
The KEDA ScaledObject ties everything together. It configures KEDA to:
- Use Prometheus as the metric source
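- Watch the my_custom_gauge metric stored in Prometheus.
- Evaluate the query sum(my_custom_gauge{}) / sum(kube_deployment_status_replicas{deployment="go-from-configmap"}), that is, the average gauge value per replica, against a threshold of 80.
In short, KEDA will:
- Poll Prometheus every 30 seconds.
- Scale the go-from-configmap deployment between 1 and 8 replicas.
- Scale back down when the load decreases. The 30s cooldownPeriod applies only to scaling to zero; because minReplicaCount is 1, scale-down from N replicas toward 1 is handled by the HPA that KEDA creates, which by default uses a 5-minute scale-down stabilization window.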
Step 6 - Test GPU Autoscaling
Get the pod name and exec into it:
kubectl get pods -n kube-system -l app=go-from-configmap
kubectl exec -it -n kube-system <go-pod-name> -- bash
Set the gauge to a high value (e.g. 160):
curl http://127.0.0.1:8080/?value=160
exit
Watch for pod scaling:
kubectl get pods -w -n kube-system -l app=go-from-configmap
KEDA will evaluate the query:
160 (gauge value) / 1 (replica) = 160, which exceeds the threshold of 80, so a second replica is triggered.
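The arithmetic above can be sketched in Go. This is a simplified model, not KEDA's or the HPA's actual code: it assumes the Prometheus trigger uses KEDA's default AverageValue metric type, so the HPA that KEDA manages computes ceil(metric / threshold) whenever the metric deviates from threshold × current replicas by more than the HPA's default 10% tolerance, then clamps the result to minReplicaCount and maxReplicaCount. It ignores stabilization windows and scaling rate limits, and desiredReplicas is a hypothetical name.

```go
package main

import (
	"fmt"
	"math"
)

// hpaTolerance is the HPA's default tolerance (assumed unchanged).
const hpaTolerance = 0.1

// desiredReplicas is a simplified model of how the HPA created by KEDA
// turns an external AverageValue metric into a replica count.
func desiredReplicas(current int, metric, threshold float64, minReplicas, maxReplicas int) int {
	desired := current
	ratio := metric / (threshold * float64(current))
	if math.Abs(1.0-ratio) > hpaTolerance {
		desired = int(math.Ceil(metric / threshold))
	}
	if desired < minReplicas {
		desired = minReplicas
	}
	if desired > maxReplicas {
		desired = maxReplicas
	}
	return desired
}

func main() {
	gaugeSum := 160.0 // sum(my_custom_gauge{})
	replicas := 1     // sum(kube_deployment_status_replicas{...})
	query := gaugeSum / float64(replicas)
	fmt.Println(query, desiredReplicas(replicas, query, 80, 1, 8)) // prints: 160 2
}
```

Under this same model, once the pending second replica is counted by kube_deployment_status_replicas while only the first pod reports 160, the query drops to 80 and desiredReplicas(2, 80, 80, 1, 8) returns 1; the HPA's scale-down stabilization window delays acting on such a drop.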
At this point, the new pod should be in the *Pending* state, because there are not enough GPUs available to satisfy the deployment's amd.com/gpu request. In the cloud UI, you should see a new autoscaling event (a new node coming up) in the AMD GPU node pool. Wait until the new GPU node finishes bootstrapping, then check the pod again: it should have moved from *Pending* to *Running*.
The query is continuously evaluated and scaling stops or continues depending on the metric’s behavior.
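If you want to repeat the test without exec-ing into the pod, you can set the gauge from your workstation instead. This sketch assumes you first run kubectl port-forward svc/go-from-configmap 8080:8080 -n kube-system; kubectl port-forward connects to a single pod behind the Service, so the gauge is set on that pod only. The helpers gaugeURL and setGauge are hypothetical.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"os"
	"strconv"
)

// gaugeURL builds the demo app's root URL with ?value=<v>,
// e.g. http://127.0.0.1:8080/?value=160.
func gaugeURL(baseURL string, v float64) string {
	q := url.Values{"value": {strconv.FormatFloat(v, 'f', -1, 64)}}
	return baseURL + "/?" + q.Encode()
}

// setGauge calls the root endpoint, which sets my_custom_gauge on
// whichever pod serves the request, and returns the response body.
func setGauge(baseURL string, v float64) (string, error) {
	resp, err := http.Get(gaugeURL(baseURL, v))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status %s: %s", resp.Status, body)
	}
	return string(body), nil
}

func main() {
	// Assumes: kubectl port-forward svc/go-from-configmap 8080:8080 -n kube-system
	msg, err := setGauge("http://127.0.0.1:8080", 160)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(msg)
}
```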
FAQs
1. Can I use KEDA to autoscale any type of GPU workload on DOKS?
Yes. KEDA is agnostic to the type of workload as long as you can expose a metric (such as GPU utilization, queue length, or custom application metrics) that Prometheus can scrape. For AMD GPU workloads, ensure the AMD device plugin and metrics exporter are enabled (default on DOKS GPU node pools).
2. Which cloud provider regions support AMD GPU nodes for DOKS?
As of this writing, AMD MI300X GPU nodes are available in select regions such as TOR1, NYC2, and ATL1. Always check the cloud provider's documentation for the latest supported regions and GPU types.
3. How does KEDA differ from Kubernetes’ built-in Horizontal Pod Autoscaler (HPA)?
KEDA extends Kubernetes autoscaling by allowing you to scale workloads based on *external* or *custom* metrics (like Prometheus queries, queue length, or cloud events), not just CPU or memory. This is especially useful for GPU workloads where utilization patterns may not correlate with CPU usage.
4. What happens if there are not enough available GPUs in the node pool?
If your scaling logic triggers more pods than there are available GPUs, new pods will remain in the Pending state until additional GPU nodes are provisioned. DOKS will automatically scale the GPU node pool (if autoscaling is enabled) to accommodate the increased demand, subject to your configured limits.
5. Can I use this approach for other types of accelerators (e.g., NVIDIA GPUs) or on other Kubernetes platforms?
Yes. The KEDA + Prometheus pattern is platform- and vendor-agnostic. You can adapt this approach for NVIDIA GPUs or other accelerators by using the appropriate device plugin and metrics exporter for your hardware and Kubernetes distribution.
Conclusion
This guide demonstrated how to build an autoscaling GPU workload on DOKS using KEDA and Prometheus, all with native Kubernetes constructs and open-source tooling.
This architecture empowers your workloads to:
- Automatically scale in response to real-world metrics—not just CPU
- Optimize GPU costs by dynamically managing replicas
- Easily adapt to use cases like machine learning inference, video processing, or real-time analytics
By combining event-driven scaling with custom observability, teams can efficiently manage GPU resources and deliver high-performance compute workloads with minimal manual intervention.