Custom Kubernetes Schedulers for GPU-Aware Pod Placement
The Default Scheduler's Blind Spot: Why ML Workloads Suffer
The default Kubernetes scheduler, kube-scheduler, is a masterpiece of general-purpose orchestration. It excels at bin-packing workloads based on CPU and memory requests, keeping cluster-wide resource utilization high. However, for high-performance computing (HPC) and distributed machine learning workloads, this generality becomes a significant limitation. The scheduler is fundamentally unaware of the underlying hardware topology that is critical for performance.
Consider a distributed training job using NVIDIA's NCCL for high-speed inter-GPU communication. The default scheduler might place two communicating pods on nodes separated by slow network links rather than a shared high-speed fabric, or spread a job across mismatched hardware: a job built for A100 GPUs might have one of its pods scheduled on a node with an older V100 GPU, causing either runtime failures or performance degradation to the lowest common denominator.
The kube-scheduler sees nvidia.com/gpu: 1 as a fungible resource. It cannot differentiate between an A100-80GB and a T4, nor can it understand that GPUs 0 and 1 on node-a are connected via NVLink, while GPUs 2 and 3 are not. This topological ignorance is the primary motivation for extending its logic.
This article dives deep into one of the most powerful and production-ready methods for solving this: building a Scheduler Extender. We will design, implement, and deploy a Go-based webhook that injects GPU topology and NUMA awareness into the core Kubernetes scheduling pipeline, enabling intelligent, performance-oriented placement decisions for demanding ML workloads.
Architectural Choice: Extender vs. Full Custom Scheduler
Before we write any code, we must choose our architectural pattern. Kubernetes offers two primary avenues for custom scheduling:
1. A full custom scheduler: You write and deploy a second scheduler binary alongside the default one, and pods opt into it by setting schedulerName in their spec. This offers maximum control but requires you to reimplement or vendor much of the default scheduler's logic (like preemption, affinity/anti-affinity, and basic resource fitting). It's a heavy lift and prone to bit-rot as the upstream scheduler evolves.
2. A scheduler extender: You configure kube-scheduler to call out to an external HTTP/S endpoint at two key phases of its scheduling cycle:
   * Filter: The extender receives the pod and the candidate nodes and must answer, for each node, a simple yes/no: is this node a valid host for this pod? This is where we'll implement hard constraints, like matching a specific GPU model.
   * Prioritize: For all nodes that passed the Filter phase, the extender receives the pod and the list of viable nodes. It returns a score (0-10) for each node, which kube-scheduler combines, weighted, with its own internal scores to make the final decision. This is where we'll implement soft preferences, like NUMA affinity or co-location.
For our use case, the Extender is the superior choice. It allows us to surgically inject our specialized logic while continuing to leverage the battle-hardened foundation of the default scheduler. We don't need to reinvent the wheel for pod preemption or taints and tolerations; we simply augment the existing process with our domain-specific knowledge.
Implementation: A GPU-Aware Scheduler Extender in Go
Let's build our extender. We'll use Go and the standard net/http library, as the core logic is simply handling JSON payloads over HTTP. For local development, we'll use kind (Kubernetes in Docker) to simulate a multi-node cluster.
Prerequisites
- Go (1.21+, matching the builder image used later)
- Docker
- kubectl
- kind
First, create a kind cluster. We'll start with a simple single-node setup; the end-to-end test later assumes three nodes (see the config sketch below).
kind create cluster --name gpu-scheduler-demo
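The end-to-end test labels one control-plane node and two workers. If you would rather start with all three nodes now, delete the single-node cluster and recreate it from a config file. This is a sketch: kind will name the nodes gpu-scheduler-demo-control-plane, -worker, and -worker2 (the names used in the test), and on multi-node clusters the control-plane node may keep its NoSchedule taint, which you would then need to remove for the test pod to land there.
# kind-multinode.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
  - role: worker

kind delete cluster --name gpu-scheduler-demo
kind create cluster --name gpu-scheduler-demo --config kind-multinode.yaml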
Project Structure
Our Go project will be straightforward:
/gpu-scheduler-extender
├── go.mod
├── go.sum
├── main.go
├── handler.go # HTTP handlers for /filter and /prioritize
├── types.go # Struct definitions for Kubernetes API payloads
├── Dockerfile
└── /deploy
├── extender.yaml # Deployment and Service for our extender
└── scheduler-config.yaml # KubeSchedulerConfiguration
Core Data Structures (`types.go`)
The kube-scheduler communicates with the extender using specific JSON structures. We must define these in Go to correctly unmarshal the requests and marshal our responses.
// types.go
package main
import (
v1 "k8s.io/api/core/v1"
)
// ExtenderArgs represents the arguments sent by the scheduler to the extender.
type ExtenderArgs struct {
Pod *v1.Pod `json:"pod"`
Nodes *v1.NodeList `json:"nodes"`
NodeNames *[]string `json:"nodeNames"`
}
// ExtenderFilterResult represents the result of the filter operation.
type ExtenderFilterResult struct {
Nodes *v1.NodeList `json:"nodes,omitempty"`
NodeNames *[]string `json:"nodeNames,omitempty"`
FailedNodes map[string]string `json:"failedNodes,omitempty"`
Error string `json:"error,omitempty"`
}
// HostPriority represents the priority of a single host.
type HostPriority struct {
Host string `json:"host"`
Score int `json:"score"`
}
// HostPriorityList is a collection of HostPriority.
type HostPriorityList []HostPriority
These types are the foundation of our communication protocol with kube-scheduler.
The HTTP Handlers (`handler.go`)
This is the heart of our extender. We'll create two main handlers: handleFilter and handlePrioritize.
// handler.go
package main
import (
"encoding/json"
"io"
"log"
"net/http"
"strings"
v1 "k8s.io/api/core/v1"
)
const (
// The label key we'll use to request a specific GPU model.
gpuModelLabel = "accelerator.sched.io/gpu-model"
// The node label key where the GPU model is advertised.
nodeGpuModelLabel = "nvidia.com/gpu.product"
// The node label key for NUMA node affinity.
numaNodeLabel = "nvidia.com/gpu.numa"
)
func handleFilter(w http.ResponseWriter, r *http.Request) {
body, err := io.ReadAll(r.Body)
if err != nil {
http.Error(w, "Failed to read request body", http.StatusInternalServerError)
return
}
var args ExtenderArgs
if err := json.Unmarshal(body, &args); err != nil {
http.Error(w, "Failed to unmarshal request", http.StatusBadRequest)
return
}
if args.Pod == nil || args.Nodes == nil {
http.Error(w, "Invalid request: pod or nodes missing", http.StatusBadRequest)
return
}
// The logic for filtering nodes
result := filterNodes(args.Pod, args.Nodes)
responseBody, err := json.Marshal(result)
if err != nil {
http.Error(w, "Failed to marshal response", http.StatusInternalServerError)
return
}
w.Header().Set("Content-Type", "application/json")
w.WriteHeader(http.StatusOK)
_, _ = w.Write(responseBody)
}
// filterNodes implements our first piece of custom logic: GPU model matching.
func filterNodes(pod *v1.Pod, nodes *v1.NodeList) ExtenderFilterResult {
requestedModel, ok := pod.Labels[gpuModelLabel]
if !ok {
// If the pod doesn't request a specific model, we don't filter any nodes.
// The default scheduler can handle it.
return ExtenderFilterResult{Nodes: nodes}
}
log.Printf("Pod %s/%s requests GPU model: %s", pod.Namespace, pod.Name, requestedModel)
filteredNodes := v1.NodeList{}
failedNodes := make(map[string]string)
for _, node := range nodes.Items {
nodeModel, ok := node.Labels[nodeGpuModelLabel]
if !ok {
failedNodes[node.Name] = "Node does not have a GPU model label"
continue
}
if strings.EqualFold(nodeModel, requestedModel) {
filteredNodes.Items = append(filteredNodes.Items, node)
} else {
failedNodes[node.Name] = "GPU model mismatch"
}
}
log.Printf("Filtering complete. Passed nodes: %d, Failed nodes: %d", len(filteredNodes.Items), len(failedNodes))
return ExtenderFilterResult{
Nodes: &filteredNodes,
FailedNodes: failedNodes,
}
}
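Because filterNodes is a pure function over its inputs, it is easy to pin down with a quick table-style test before wiring anything into a cluster. A minimal sketch (the file name handler_test.go is an assumption):
// handler_test.go
package main

import (
	"testing"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func TestFilterNodesByGPUModel(t *testing.T) {
	// Pod that explicitly requests an A100 via our custom label.
	pod := &v1.Pod{ObjectMeta: metav1.ObjectMeta{
		Name:   "training-pod-1",
		Labels: map[string]string{gpuModelLabel: "NVIDIA-A100-SXM4-80GB"},
	}}
	// One matching node, one mismatched node, one unlabeled node.
	nodes := &v1.NodeList{Items: []v1.Node{
		{ObjectMeta: metav1.ObjectMeta{Name: "node-a100", Labels: map[string]string{nodeGpuModelLabel: "NVIDIA-A100-SXM4-80GB"}}},
		{ObjectMeta: metav1.ObjectMeta{Name: "node-v100", Labels: map[string]string{nodeGpuModelLabel: "Tesla-V100-PCIE-16GB"}}},
		{ObjectMeta: metav1.ObjectMeta{Name: "node-cpu"}},
	}}

	result := filterNodes(pod, nodes)

	if len(result.Nodes.Items) != 1 || result.Nodes.Items[0].Name != "node-a100" {
		t.Fatalf("expected only node-a100 to pass, got %+v", result.Nodes.Items)
	}
	if len(result.FailedNodes) != 2 {
		t.Fatalf("expected 2 failed nodes, got %d", len(result.FailedNodes))
	}
}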
// handlePrioritize will contain our scoring logic.
func handlePrioritize(w http.ResponseWriter, r *http.Request) {
body, err := io.ReadAll(r.Body)
if err != nil {
http.Error(w, "Failed to read request body", http.StatusInternalServerError)
return
}
var args ExtenderArgs
if err := json.Unmarshal(body, &args); err != nil {
http.Error(w, "Failed to unmarshal request", http.StatusBadRequest)
return
}
	if args.Pod == nil || (args.Nodes == nil && args.NodeNames == nil) {
		http.Error(w, "Invalid request: pod or nodes missing", http.StatusBadRequest)
		return
	}
	// Unless nodeCacheCapable is set to true in the extender config, the scheduler
	// sends full Node objects rather than just names, so accept either form.
	var nodeNames []string
	if args.NodeNames != nil {
		nodeNames = *args.NodeNames
	} else {
		for _, node := range args.Nodes.Items {
			nodeNames = append(nodeNames, node.Name)
		}
	}
	// The logic for scoring nodes
	priorities := prioritizeNodes(args.Pod, nodeNames, args.Nodes)
responseBody, err := json.Marshal(priorities)
if err != nil {
http.Error(w, "Failed to marshal response", http.StatusInternalServerError)
return
}
w.Header().Set("Content-Type", "application/json")
w.WriteHeader(http.StatusOK)
_, _ = w.Write(responseBody)
}
// prioritizeNodes implements NUMA affinity scoring.
func prioritizeNodes(pod *v1.Pod, nodeNames []string, nodes *v1.NodeList) HostPriorityList {
hostPriorities := make(HostPriorityList, 0, len(nodeNames))
// For simplicity, we'll assume a pod requesting a GPU implicitly prefers NUMA node 0.
// A production system might parse this from an annotation.
_, gpuRequested := pod.Spec.Containers[0].Resources.Limits["nvidia.com/gpu"]
if !gpuRequested {
// If no GPU is requested, we don't apply any special priority.
for _, name := range nodeNames {
hostPriorities = append(hostPriorities, HostPriority{Host: name, Score: 0})
}
return hostPriorities
}
log.Printf("Prioritizing nodes for GPU pod %s/%s", pod.Namespace, pod.Name)
	nodeMap := make(map[string]v1.Node)
	if nodes != nil {
		for _, node := range nodes.Items {
			nodeMap[node.Name] = node
		}
	}
for _, name := range nodeNames {
node := nodeMap[name]
score := 0
numaNode, ok := node.Labels[numaNodeLabel]
if ok && numaNode == "0" {
// High score for nodes where the GPU is on NUMA node 0
score = 10
log.Printf("Node %s gets high priority (score 10) for NUMA affinity", name)
} else {
// Lower score for other nodes
score = 1
log.Printf("Node %s gets low priority (score 1) due to lack of NUMA affinity", name)
}
hostPriorities = append(hostPriorities, HostPriority{Host: name, Score: score})
}
return hostPriorities
}
The Main Server (`main.go`)
This file ties everything together, setting up the HTTP server and its routes.
// main.go
package main
import (
"log"
"net/http"
)
func main() {
http.HandleFunc("/filter", handleFilter)
http.HandleFunc("/prioritize", handlePrioritize)
log.Println("Starting GPU scheduler extender on :8888")
if err := http.ListenAndServe(":8888", nil); err != nil {
log.Fatalf("Failed to start server: %v", err)
}
}
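Because the scheduler blocks on our responses (and enforces its own deadline via the extender's httpTimeout setting), it is worth bounding request handling on our side as well. A hardened variant of main, with illustrative timeout values:
// main.go (hardened variant)
package main

import (
	"log"
	"net/http"
	"time"
)

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/filter", handleFilter)
	mux.HandleFunc("/prioritize", handlePrioritize)

	srv := &http.Server{
		Addr:         ":8888",
		Handler:      mux,
		ReadTimeout:  5 * time.Second,  // Bound time spent reading the scheduler's request
		WriteTimeout: 5 * time.Second,  // Bound time spent writing our response
		IdleTimeout:  30 * time.Second, // Recycle idle keep-alive connections
	}

	log.Println("Starting GPU scheduler extender on :8888")
	if err := srv.ListenAndServe(); err != nil {
		log.Fatalf("Failed to start server: %v", err)
	}
}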
Containerizing the Extender (`Dockerfile`)
We'll use a multi-stage build to create a minimal, secure container image.
# build stage
FROM golang:1.21-alpine AS builder
WORKDIR /app
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -a -installsuffix cgo -o gpu-scheduler-extender .
# final stage
FROM alpine:latest
WORKDIR /root/
COPY --from=builder /app/gpu-scheduler-extender .
EXPOSE 8888
CMD ["./gpu-scheduler-extender"]
Build and push the image to a registry of your choice.
docker build -t your-registry/gpu-scheduler-extender:v1.0 .
docker push your-registry/gpu-scheduler-extender:v1.0
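If you are iterating with kind, you can skip the registry round-trip and load the image directly into the cluster nodes:
kind load docker-image your-registry/gpu-scheduler-extender:v1.0 --name gpu-scheduler-demo
If you go this route, change imagePullPolicy in the Deployment below to IfNotPresent, since Always forces a pull from the registry.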
Deployment and Configuration in Kubernetes
This is the most critical part. We need to deploy our extender and, more importantly, tell kube-scheduler how to use it.
Extender Deployment (`deploy/extender.yaml`)
This is a standard Deployment and Service to run our Go application in the cluster.
# deploy/extender.yaml
apiVersion: v1
kind: Service
metadata:
name: gpu-extender-svc
namespace: kube-system
spec:
ports:
- port: 80
targetPort: 8888
selector:
app: gpu-scheduler-extender
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: gpu-scheduler-extender
namespace: kube-system
labels:
app: gpu-scheduler-extender
spec:
replicas: 2 # Run multiple for HA
selector:
matchLabels:
app: gpu-scheduler-extender
template:
metadata:
labels:
app: gpu-scheduler-extender
spec:
containers:
- name: extender
image: your-registry/gpu-scheduler-extender:v1.0
imagePullPolicy: Always
ports:
- containerPort: 8888
`kube-scheduler` Configuration (`deploy/scheduler-config.yaml`)
This KubeSchedulerConfiguration object is the glue. It tells kube-scheduler that our extender exists and how to communicate with it.
# deploy/scheduler-config.yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
leaderElection:
  leaderElect: true
clientConnection:
  kubeconfig: /etc/kubernetes/scheduler.conf
profiles:
  - schedulerName: default-scheduler
    # The default plugins remain enabled for this profile.
# Extenders are configured at the top level and apply to all profiles.
extenders:
  - urlPrefix: "http://gpu-extender-svc.kube-system.svc.cluster.local"
    filterVerb: "filter"
    prioritizeVerb: "prioritize"
    weight: 2               # Give our priority scores a weight of 2
    enableHTTPS: false
    nodeCacheCapable: false # The scheduler sends full Node objects, not just names
    ignorable: false        # Critical: if the extender fails, scheduling fails.
Key Configuration Parameters:
- urlPrefix: The address of our extender's Service.
- filterVerb & prioritizeVerb: The URL paths appended to urlPrefix for the filter and prioritize calls.
- weight: A multiplier for the scores our extender returns, applied before they are combined with the default scheduler's internal scoring. This lets you tune how influential the extender is; with weight: 2, every point we award counts double.
- nodeCacheCapable: Whether the extender maintains its own node cache. When false, the scheduler sends full Node objects in each request; when true, it sends only node names.
- ignorable: A critical production decision. If true, a failure to contact the extender is ignored and scheduling proceeds without our custom logic. If false, scheduling for the pod fails. For mandatory hardware requirements, false is the correct choice.
Applying the Configuration
Applying this configuration depends on your cluster setup. On kind, the control-plane node is a kubeadm control plane running in a container, so the same static-pod approach applies (docker exec into the node). For a self-managed cluster, you would modify the static pod manifest for kube-scheduler on your control plane nodes (typically at /etc/kubernetes/manifests/kube-scheduler.yaml) to mount this config file and pass it as a command-line argument:
# In /etc/kubernetes/manifests/kube-scheduler.yaml
...
spec:
containers:
- command:
- kube-scheduler
- --config=/etc/kubernetes/scheduler-config.yaml # <-- Add this line
...
volumeMounts:
- name: scheduler-config
mountPath: /etc/kubernetes/scheduler-config.yaml
readOnly: true
...
volumes:
- name: scheduler-config
hostPath:
path: /path/to/your/scheduler-config.yaml # <-- Path on the host
type: File
After saving this file, the kubelet on the control plane node will automatically restart the kube-scheduler pod with the new configuration.
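On kubeadm-based clusters (including kind), the scheduler static pod carries the component=kube-scheduler label, so you can confirm that the restarted pod picked up the new flag with something like:
kubectl -n kube-system get pods -l component=kube-scheduler
kubectl -n kube-system describe pod -l component=kube-scheduler | grep -- --config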
End-to-End Test Scenario
Let's validate our setup. We need to simulate nodes with different GPU labels.
First, label the nodes. Find your kind node names (kubectl get nodes) and apply labels to simulate a heterogeneous cluster; the names below match the three-node kind cluster described in the prerequisites.
# Node 1: Powerful, NUMA-aligned A100
kubectl label node gpu-scheduler-demo-control-plane nvidia.com/gpu.product=NVIDIA-A100-SXM4-80GB
kubectl label node gpu-scheduler-demo-control-plane nvidia.com/gpu.numa=0
# Node 2: Another A100, but on a different NUMA node
kubectl label node gpu-scheduler-demo-worker nvidia.com/gpu.product=NVIDIA-A100-SXM4-80GB
kubectl label node gpu-scheduler-demo-worker nvidia.com/gpu.numa=1
# Node 3: An older generation V100
kubectl label node gpu-scheduler-demo-worker2 nvidia.com/gpu.product=Tesla-V100-PCIE-16GB
kubectl label node gpu-scheduler-demo-worker2 nvidia.com/gpu.numa=0
Next, define a pod that explicitly requests an A100 through our custom label. (Since kind nodes have no real GPUs, you may also need to advertise a fake nvidia.com/gpu capacity on each node via the Node status subresource so the default resource-fit check passes.)
# pod-a100.yaml
apiVersion: v1
kind: Pod
metadata:
name: training-pod-1
labels:
accelerator.sched.io/gpu-model: "NVIDIA-A100-SXM4-80GB"
spec:
containers:
- name: cuda-container
image: nvidia/cuda:11.4.2-base-ubuntu20.04
command: ["sleep", "3600"]
resources:
limits:
nvidia.com/gpu: 1
Apply the manifest and let the scheduler, with our extender in the loop, place it:
kubectl apply -f pod-a100.yaml
Now, check the extender's logs:
kubectl logs -n kube-system -l app=gpu-scheduler-extender -f
You should see output similar to this:
Pod default/training-pod-1 requests GPU model: NVIDIA-A100-SXM4-80GB
Filtering complete. Passed nodes: 2, Failed nodes: 1
Prioritizing nodes for GPU pod default/training-pod-1
Node gpu-scheduler-demo-control-plane gets high priority (score 10) for NUMA affinity
Node gpu-scheduler-demo-worker gets low priority (score 1) due to lack of NUMA affinity
Finally, verify the pod's placement:
kubectl describe pod training-pod-1
The output should show that the pod was scheduled on gpu-scheduler-demo-control-plane, the node that passed our filter and received the highest priority score.
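For a quicker check, the NODE column of the wide output shows the placement directly:
kubectl get pod training-pod-1 -o wide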
Advanced Edge Cases and Production Considerations
A simple PoC is one thing; a production-grade system is another. Here's what senior engineers must consider:
1. Extender Availability and Performance
Our extender is now a critical component in the scheduling pipeline. Its failure or slowness directly impacts cluster operations.
- Availability: The extender sits on the scheduling hot path. Run at least two replicas behind the Service (as our Deployment does), and remember that with ignorable: false an unreachable extender blocks every pod that needs it.
- Caching: Instead of reprocessing the full NodeList sent with each request, the extender should use the client-go library to create an Informer. An informer maintains an up-to-date, in-memory cache of cluster objects (like Nodes and Pods), which is orders of magnitude faster than querying the API server on every call, and it also lets you set nodeCacheCapable: true so the scheduler sends only node names.
Example Informer Setup Snippet:
// In your main function, before starting the HTTP server.
// Imports needed: k8s.io/client-go/rest, k8s.io/client-go/kubernetes,
// k8s.io/client-go/informers, k8s.io/client-go/tools/cache.
config, err := rest.InClusterConfig()
if err != nil {
	log.Fatalf("Failed to load in-cluster config: %v", err)
}
clientset, err := kubernetes.NewForConfig(config)
if err != nil {
	log.Fatalf("Failed to create clientset: %v", err)
}
factory := informers.NewSharedInformerFactory(clientset, 0)
nodeInformer := factory.Core().V1().Nodes().Informer()
nodeLister := factory.Core().V1().Nodes().Lister()
stopCh := make(chan struct{})
defer close(stopCh)
factory.Start(stopCh)
if !cache.WaitForCacheSync(stopCh, nodeInformer.HasSynced) {
log.Fatalf("Failed to sync cache")
}
// Now, your handlers can use nodeLister.Get(nodeName) for fast, local lookups.
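The informer needs read access to cluster objects. A minimal RBAC sketch, assuming a dedicated gpu-scheduler-extender ServiceAccount (the name is illustrative, and the Deployment would also need spec.serviceAccountName set to it):
apiVersion: v1
kind: ServiceAccount
metadata:
  name: gpu-scheduler-extender
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: gpu-scheduler-extender
rules:
  - apiGroups: [""]
    resources: ["nodes", "pods"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: gpu-scheduler-extender
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: gpu-scheduler-extender
subjects:
  - kind: ServiceAccount
    name: gpu-scheduler-extender
    namespace: kube-system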
2. Security
The communication between kube-scheduler and the extender should be secured.
- TLS: Set enableHTTPS: true in your scheduler config. This requires the extender to serve traffic over TLS and the scheduler to trust the CA that signed its certificate (supplied via the extender's tlsConfig). Use a tool like cert-manager to automatically provision and rotate certificates for your extender's Service.
- RBAC: If the extender uses an informer, its ServiceAccount needs permissions to get, list, and watch nodes and pods. Create a ClusterRole and ClusterRoleBinding that grant only these specific, minimal permissions (see the sketch above).
3. Advanced Topology-Aware Gang Scheduling
Our current model prioritizes individual pods. For distributed training, we need to co-locate a group of pods (a gang) on interconnected nodes.
This requires the extender to be stateful. When the first pod of a job (e.g., identified by a training-job-id label) is scheduled, the extender must record its placement. When subsequent pods from the same job arrive, the extender's Prioritize logic will consult this state.
- In-Memory (Simple): A sync.Map in your Go application can store job-id -> node-name mappings. This is fast but not durable across extender restarts.
- CRD (Production-Ready): Define a Custom Resource Definition, e.g., TrainingJobPlacement. Your extender would create or update a CR instance for each job, storing the placement decisions. This is durable, observable via kubectl, and the canonical way to store custom state in Kubernetes.
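The in-memory option above mentions sync.Map; an equivalent and arguably more idiomatic sketch uses an ordinary map guarded by a sync.Mutex, since we want a set of node names per job. Everything here (the jobIDLabel constant, the placementStore type) is illustrative, not part of the extender built above:
// placement.go (sketch)
package main

import "sync"

const jobIDLabel = "training-job-id"

// placementStore remembers which nodes a job's pods have landed on.
// It is lost on restart; a CRD-backed store is the durable alternative.
type placementStore struct {
	mu    sync.Mutex
	byJob map[string]map[string]struct{} // job-id -> set of node names
}

func newPlacementStore() *placementStore {
	return &placementStore{byJob: make(map[string]map[string]struct{})}
}

// Record notes that a pod of the given job was placed on nodeName.
func (s *placementStore) Record(jobID, nodeName string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.byJob[jobID] == nil {
		s.byJob[jobID] = make(map[string]struct{})
	}
	s.byJob[jobID][nodeName] = struct{}{}
}

// NodesFor returns the nodes already hosting pods of the given job.
func (s *placementStore) NodesFor(jobID string) []string {
	s.mu.Lock()
	defer s.mu.Unlock()
	nodes := make([]string, 0, len(s.byJob[jobID]))
	for n := range s.byJob[jobID] {
		nodes = append(nodes, n)
	}
	return nodes
}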
The Prioritize logic would then look like this:
1. Extract training-job-id from the incoming pod's labels.
2. Query the API server for other pods with the same label.
3. Find the nodes where those pods are running.
4. For each candidate node, check its labels for interconnects (e.g., nvlink-fabric-id: fabric-1).
5. Give the highest score to nodes that are part of the same high-speed fabric as the already-placed pods.
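A sketch of that scoring pass. It assumes the caller has already resolved steps 1-3 (for example, via a pod lister built from the same informer factory) into a peers slice, and that nodes expose an nvlink-fabric-id label as in step 4; both the peers plumbing and the label are assumptions, not existing APIs:
// scoreForGang implements steps 4-5: peers are the pods that share the incoming
// pod's training-job-id, and nodeByName maps node names to Node objects from
// the node informer's cache.
func scoreForGang(candidate v1.Node, peers []*v1.Pod, nodeByName map[string]v1.Node) int {
	const fabricLabel = "nvlink-fabric-id" // illustrative interconnect label from step 4

	if len(peers) == 0 {
		return 0 // First pod of the job: no co-location preference yet.
	}

	candidateFabric, ok := candidate.Labels[fabricLabel]
	if !ok {
		return 0 // Candidate is not part of any known high-speed fabric.
	}

	for _, peer := range peers {
		peerNode, found := nodeByName[peer.Spec.NodeName]
		if !found {
			continue
		}
		if peerNode.Labels[fabricLabel] == candidateFabric {
			return 10 // Step 5: same fabric as already-placed gang members.
		}
	}
	return 1 // Schedulable, but off-fabric relative to the rest of the gang.
}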
Conclusion: Unlocking True Hardware Potential
By implementing a scheduler extender, we've transformed Kubernetes from a generic resource orchestrator into a topology-aware, high-performance scheduler tailored for our specific ML workloads. We've moved beyond simple resource requests to make intelligent placement decisions based on GPU models, NUMA locality, and the potential for high-speed interconnects.
This pattern is not limited to GPUs. The same architecture can be used to manage scheduling for FPGAs, specialized storage, software license availability, or even datacenter-level concerns like power consumption and cooling zones. The scheduler extender is a powerful tool that allows platform and MLOps engineers to bridge the gap between application performance requirements and the underlying hardware reality, unlocking the full potential of their infrastructure without forking or replacing the core of Kubernetes.