Optimizing Kubernetes Scheduler Extenders for GPU-Aware Workloads
The Default Scheduler's Blind Spot: Specialized Hardware
The default Kubernetes scheduler, kube-scheduler, is a marvel of engineering, adept at placing general-purpose workloads across a cluster with impressive efficiency. It operates on a two-phase cycle: Filtering (finding nodes where a pod can run) and Scoring (ranking the viable nodes to find the best one). For stateless web servers or typical microservices that primarily consume CPU and memory, its default predicates and priorities are more than sufficient.
However, this efficiency breaks down when confronted with specialized, stateful hardware like GPUs. In a sophisticated multi-tenant Machine Learning platform, scheduling requirements become far more granular:
* Heterogeneous Hardware: A cluster may contain a mix of GPU models (e.g., NVIDIA A100s for training, T4s for inference). A pod must be scheduled on a node with the correct model.
* Resource Granularity: A pod might require a specific amount of VRAM (e.g., 20GB), which is a sub-resource of the GPU card itself. The default scheduler has no visibility into VRAM usage.
* Topology Awareness: High-performance training jobs might require multiple GPUs connected by a high-speed interconnect like NVLink. The scheduler must not only find a node with enough available GPUs but ensure they have the required physical connectivity.
* Custom Logic: Business logic, such as prioritizing workloads for a specific team or ensuring certain jobs land on cost-effective spot instances, is outside the purview of the default scheduler.
While Scheduler Plugins offer a more integrated way to extend the scheduler, they require compiling a custom scheduler binary. For a more decoupled, API-driven approach, the Scheduler Extender provides a powerful, albeit sharp-edged, tool. It allows you to augment the scheduling process via external webhooks.
This article is not an introduction. We assume you understand the basic scheduling cycle. We will dive directly into building a production-grade, high-performance scheduler extender in Go to solve the GPU-aware scheduling problem, focusing on the patterns and pitfalls encountered in real-world, large-scale clusters.
The Scheduler Extender Webhook Contract
A scheduler extender is fundamentally a web server that exposes endpoints kube-scheduler calls during its scheduling cycle. The communication is synchronous and blocking—a critical performance consideration we'll address later. The extender can intervene at four points, but we'll focus on the two most important:
* Filter (filterVerb): kube-scheduler sends the pod and the nodes that passed its own predicates; the extender returns the subset it considers feasible, along with per-node failure reasons for the rest.
* Prioritize (prioritizeVerb): the extender returns a score between 0 and 10 for each remaining node; kube-scheduler combines these scores with its own to make the final decision.
Let's start with a minimal Go server to visualize the data flow. This will serve as our foundation.
// main.go
package main
import (
	"encoding/json"
	"io"
	"log"
	"net/http"

	schedulerapi "k8s.io/kube-scheduler/extender/v1"
)
func main() {
http.HandleFunc("/filter", filterHandler)
http.HandleFunc("/prioritize", prioritizeHandler)
log.Println("Starting scheduler extender server on :8888")
if err := http.ListenAndServe(":8888", nil); err != nil {
log.Fatalf("Failed to start server: %v", err)
}
}
func filterHandler(w http.ResponseWriter, r *http.Request) {
	body, err := io.ReadAll(r.Body)
if err != nil {
http.Error(w, "Failed to read request body", http.StatusInternalServerError)
return
}
defer r.Body.Close()
var args schedulerapi.ExtenderArgs
if err := json.Unmarshal(body, &args); err != nil {
http.Error(w, "Failed to unmarshal request", http.StatusBadRequest)
return
}
log.Printf("FILTER request for Pod: %s/%s with %d candidate nodes", args.Pod.Namespace, args.Pod.Name, len(args.Nodes.Items))
// For now, we approve all nodes.
result := schedulerapi.ExtenderFilterResult{
Nodes: args.Nodes,
FailedNodes: make(map[string]string),
Error: "",
}
w.Header().Set("Content-Type", "application/json")
if err := json.NewEncoder(w).Encode(result); err != nil {
log.Printf("Error encoding response: %v", err)
}
}
func prioritizeHandler(w http.ResponseWriter, r *http.Request) {
	body, err := io.ReadAll(r.Body)
if err != nil {
http.Error(w, "Failed to read request body", http.StatusInternalServerError)
return
}
defer r.Body.Close()
var args schedulerapi.ExtenderArgs
if err := json.Unmarshal(body, &args); err != nil {
http.Error(w, "Failed to unmarshal request", http.StatusBadRequest)
return
}
log.Printf("PRIORITIZE request for Pod: %s/%s on %d nodes", args.Pod.Namespace, args.Pod.Name, len(args.Nodes.Items))
// For now, we give every node a score of 5.
scores := make(schedulerapi.HostPriorityList, len(args.Nodes.Items))
for i, node := range args.Nodes.Items {
scores[i] = schedulerapi.HostPriority{
Host: node.Name,
Score: 5,
}
}
w.Header().Set("Content-Type", "application/json")
if err := json.NewEncoder(w).Encode(scores); err != nil {
log.Printf("Error encoding response: %v", err)
}
}
This simple server just logs the requests and passes all nodes through. To integrate it, we need to configure kube-scheduler.
Scheduler Configuration
Create a scheduler-config.yaml file:
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
leaderElection:
leaderElect: true
clientConnection:
kubeconfig: "/etc/kubernetes/scheduler.conf"
extenders:
- urlPrefix: "http://<extender-service-ip>:8888"
filterVerb: "filter"
prioritizeVerb: "prioritize"
weight: 10
enableHTTPS: false
nodeCacheCapable: false # Set to true only if the extender resolves nodes from its own cache and reads args.NodeNames; the handlers below expect full Node objects in args.Nodes
ignorable: false # If true, scheduling proceeds even if the extender is down
You would then run kube-scheduler with the --config flag pointing to this file. In a real cluster, you'd mount this as a ConfigMap into the kube-scheduler pod.
Production-Grade Filtering: State Management is Key
Our core task is to filter nodes based on GPU availability. The pod spec might look like this:
apiVersion: v1
kind: Pod
metadata:
name: training-job-resnet50
annotations:
# Custom annotations our extender will read
gpu-extender.mle.com/model: "NVIDIA-A100-SXM4-40GB"
gpu-extender.mle.com/vram-mb: "30000"
spec:
containers:
- name: training-container
image: my-tf-image
resources:
limits:
# Requesting a GPU from the device plugin
nvidia.com/gpu: "1"
How does our extender know which nodes have an A100 with 30GB of VRAM free? This information is not native to Kubernetes Node objects. We need a way to track the state of each GPU on each node.
Anti-Pattern: Do not try to manage this state inside the extender itself. An extender should be stateless. If it crashes, all allocation state is lost. Furthermore, with multiple replicas for HA, you'd need a complex state synchronization mechanism.
Production Pattern: Use a Custom Resource Definition (CRD) to model GPU state and an operator to keep it updated. This offloads state management to the Kubernetes API server, giving us persistence, consistency, and observability for free.
Let's define a GPU CRD:
# gpu.mle.com_gpus.yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
name: gpus.gpu.mle.com
spec:
group: gpu.mle.com
names:
kind: GPU
listKind: GPUList
plural: gpus
singular: gpu
scope: Cluster
versions:
- name: v1alpha1
served: true
storage: true
schema:
openAPIV3Schema:
type: object
properties:
spec:
type: object
properties:
nodeName:
type: string
uuid:
type: string
model:
type: string
totalVRAMmb:
type: integer
status:
type: object
properties:
allocatedVRAMmb:
type: integer
podNamespace:
type: string
podName:
type: string
phase:
type: string # e.g., Available, Allocated
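Before wiring this into an operator, it helps to see the CRD in the form the Go code will consume. The following is a hedged sketch of the generated API types, assuming standard code-generator conventions (deepcopy functions and +genclient markers omitted); the field names mirror the schema above and the extender code later in the article.
// gpu_types.go (sketch) -- illustrative Go types backing the GPU CRD above.
// The package path and layout are assumptions, not generated output.
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// GPU models a single physical GPU as a cluster-scoped resource.
type GPU struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   GPUSpec   `json:"spec,omitempty"`
	Status GPUStatus `json:"status,omitempty"`
}

// GPUSpec holds the immutable facts discovered on the node.
type GPUSpec struct {
	NodeName    string `json:"nodeName"`
	UUID        string `json:"uuid"`
	Model       string `json:"model"`
	TotalVRAMmb int    `json:"totalVRAMmb"`
}

// GPUStatus tracks the current allocation, maintained by the operator.
type GPUStatus struct {
	AllocatedVRAMmb int    `json:"allocatedVRAMmb"`
	PodNamespace    string `json:"podNamespace,omitempty"`
	PodName         string `json:"podName,omitempty"`
	Phase           string `json:"phase"` // "Available" or "Allocated"
}

// GPUList is the list type required by the generated clientset and listers.
type GPUList struct {
	metav1.TypeMeta `json:",inline"`
	metav1.ListMeta `json:"metadata,omitempty"`
	Items           []GPU `json:"items"`
}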
An accompanying operator (or even a simple DaemonSet on each GPU node) would be responsible for:
1. Discovering the physical GPUs on its node (e.g., by querying nvidia-smi).
2. Creating a GPU custom resource for each physical GPU in the cluster.
3. Watching pod-to-node bindings and, when a GPU pod lands on the node, updating the status of the corresponding GPU resource to Allocated, including the pod's name and VRAM usage.
With this state management system in place, our extender's job becomes much simpler: it just needs to read these GPU resources.
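To make the first two responsibilities concrete, here is a minimal sketch of the discovery step such a node agent might perform. The nvidia-smi query, the object naming scheme, and the generated clientset accessor (GpuV1alpha1) are illustrative assumptions, not a fixed API.
// gpu_discovery.go (sketch) -- run as a DaemonSet on each GPU node.
package main

import (
	"context"
	"fmt"
	"os"
	"os/exec"
	"strconv"
	"strings"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	gpuv1alpha1 "path/to/your/gpu-crd/pkg/apis/gpu.mle.com/v1alpha1"
	gpuclientset "path/to/your/gpu-crd/pkg/client/clientset/versioned"
)

// discoverAndRegisterGPUs queries nvidia-smi on the local node and creates one
// GPU custom resource per physical device.
func discoverAndRegisterGPUs(ctx context.Context, client gpuclientset.Interface) error {
	nodeName := os.Getenv("NODE_NAME") // injected via the downward API

	out, err := exec.CommandContext(ctx, "nvidia-smi",
		"--query-gpu=uuid,name,memory.total", "--format=csv,noheader,nounits").Output()
	if err != nil {
		return fmt.Errorf("nvidia-smi failed: %w", err)
	}

	for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
		fields := strings.Split(line, ",")
		if len(fields) != 3 {
			continue
		}
		uuid := strings.TrimSpace(fields[0])
		model := strings.TrimSpace(fields[1])
		vram, _ := strconv.Atoi(strings.TrimSpace(fields[2])) // MiB

		gpu := &gpuv1alpha1.GPU{
			ObjectMeta: metav1.ObjectMeta{
				// Hypothetical naming scheme: one object per node/GPU pair.
				Name: fmt.Sprintf("%s-%s", nodeName, strings.ToLower(uuid)),
			},
			Spec: gpuv1alpha1.GPUSpec{
				NodeName:    nodeName,
				UUID:        uuid,
				Model:       model,
				TotalVRAMmb: vram,
			},
		}
		if _, err := client.GpuV1alpha1().GPUs().Create(ctx, gpu, metav1.CreateOptions{}); err != nil {
			// A real agent would treat AlreadyExists as success and reconcile
			// existing objects instead of failing outright.
			return err
		}
	}
	return nil
}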
Performance Optimization: Caching with Informers
A naive extender implementation might query the API server for GPU resources on every /filter request. In a large, busy cluster scheduling hundreds of pods per minute, this would DDoS your own API server. The latency of these API calls would also cripple scheduling throughput.
The solution is to maintain a local, in-memory cache of all relevant objects. The client-go library provides informers for this exact purpose.
Let's build a more sophisticated extender that uses informers for Node and GPU objects.
// gpu_extender.go
package main
import (
// ... other imports
"context"
"fmt"
"strconv"
"time"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/apimachinery/pkg/labels"
"k8s.io/client-go/informers"
"k8s.io/client-go/kubernetes"
"k8s.io/client-go/tools/cache"
"k8s.io/client-go/tools/clientcmd"
// Import our custom GPU client and types
gpuclientset "path/to/your/gpu-crd/pkg/client/clientset/versioned"
gpuinformers "path/to/your/gpu-crd/pkg/client/informers/externalversions"
gpulisters "path/to/your/gpu-crd/pkg/client/listers/gpu.mle.com/v1alpha1"
)
// Extender holds listers that read Node and GPU objects from the shared informer caches.
type Extender struct {
	nodeLister corelisters.NodeLister
	gpuLister  gpulisters.GPULister
}
func (e *Extender) filterHandler(w http.ResponseWriter, r *http.Request) {
// ... read and decode ExtenderArgs ...
pod := args.Pod
requiredModel, ok1 := pod.Annotations["gpu-extender.mle.com/model"]
vramStr, ok2 := pod.Annotations["gpu-extender.mle.com/vram-mb"]
if !ok1 || !ok2 {
// Pod does not request a specific GPU, so we don't filter.
// Let it pass to other extenders or default scheduling.
// ... write success response with all nodes ...
return
}
	requiredVRAM, err := strconv.Atoi(vramStr)
	if err != nil {
		// ... write an ExtenderFilterResult with Error set and return ...
		return
	}
filteredNodes := []v1.Node{}
failedNodes := make(map[string]string)
for _, node := range args.Nodes.Items {
if e.hasAvailableGPU(node.Name, requiredModel, requiredVRAM) {
filteredNodes = append(filteredNodes, node)
} else {
failedNodes[node.Name] = fmt.Sprintf("No available %s with %dMB VRAM", requiredModel, requiredVRAM)
}
}
result := schedulerapi.ExtenderFilterResult{
Nodes: &v1.NodeList{Items: filteredNodes},
FailedNodes: failedNodes,
Error: "",
}
// ... encode and write response ...
}
// hasAvailableGPU checks our local cache for a suitable GPU on the given node.
func (e *Extender) hasAvailableGPU(nodeName, model string, vram int) bool {
// Use the lister to get all GPU objects. This reads from the cache.
gpus, err := e.gpuLister.List(labels.Everything())
if err != nil {
log.Printf("Error listing GPUs from cache: %v", err)
return false // Fail closed
}
for _, gpu := range gpus {
		if gpu.Spec.NodeName == nodeName &&
			gpu.Spec.Model == model &&
			gpu.Status.Phase == "Available" &&
			(gpu.Spec.TotalVRAMmb-gpu.Status.AllocatedVRAMmb) >= vram {
return true // Found a suitable, available GPU
}
}
return false
}
func main() {
// ... create kubernetes config ...
kubeClient := kubernetes.NewForConfigOrDie(config)
gpuClient := gpuclientset.NewForConfigOrDie(config)
factory := informers.NewSharedInformerFactory(kubeClient, 30*time.Second)
gpuFactory := gpuinformers.NewSharedInformerFactory(gpuClient, 30*time.Second)
nodeInformer := factory.Core().V1().Nodes().Informer()
gpuInformer := gpuFactory.Gpu().V1alpha1().GPUs().Informer()
	extender := &Extender{
		nodeLister: factory.Core().V1().Nodes().Lister(),
		gpuLister:  gpuFactory.Gpu().V1alpha1().GPUs().Lister(),
	}
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
	// SharedInformerFactory.Start is non-blocking; it launches its own goroutines.
	factory.Start(ctx.Done())
	gpuFactory.Start(ctx.Done())
// Wait for the caches to be synced before starting the web server
if !cache.WaitForCacheSync(ctx.Done(), nodeInformer.HasSynced, gpuInformer.HasSynced) {
log.Fatal("Failed to sync caches")
}
log.Println("Caches synced, starting server...")
http.HandleFunc("/filter", extender.filterHandler)
// ... register other handlers ...
log.Fatal(http.ListenAndServe(":8888", nil))
}
This implementation is vastly more performant. The filterHandler now queries an in-memory cache (gpuLister) that is kept up-to-date by the informer framework in the background. The latency of a filter request is reduced from multiple milliseconds (for an API call) to microseconds (for a memory lookup). This is the single most important optimization for a scheduler extender.
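If listing every GPU object on each request ever becomes a hotspot in very large clusters, the informer's indexer can be extended with a custom index keyed by node name. A minimal sketch, assuming a gpuv1alpha1 types package and the gpuInformer from main(); the index name is illustrative and must be registered before the factory is started.
// gpu_index.go (sketch) -- optional refinement for per-node cache lookups.
package main

import (
	"k8s.io/client-go/tools/cache"

	gpuv1alpha1 "path/to/your/gpu-crd/pkg/apis/gpu.mle.com/v1alpha1"
)

const gpusByNodeIndex = "gpus-by-node"

// addGPUNodeIndex registers an index keyed by spec.nodeName. Call it before
// the informer factory is started.
func addGPUNodeIndex(gpuInformer cache.SharedIndexInformer) error {
	return gpuInformer.AddIndexers(cache.Indexers{
		gpusByNodeIndex: func(obj interface{}) ([]string, error) {
			gpu, ok := obj.(*gpuv1alpha1.GPU)
			if !ok {
				return nil, nil
			}
			return []string{gpu.Spec.NodeName}, nil
		},
	})
}

// gpusOnNode reads only the cached GPU objects for a single node, avoiding a
// scan of every GPU in the cluster.
func gpusOnNode(gpuInformer cache.SharedIndexInformer, nodeName string) ([]*gpuv1alpha1.GPU, error) {
	objs, err := gpuInformer.GetIndexer().ByIndex(gpusByNodeIndex, nodeName)
	if err != nil {
		return nil, err
	}
	gpus := make([]*gpuv1alpha1.GPU, 0, len(objs))
	for _, obj := range objs {
		if gpu, ok := obj.(*gpuv1alpha1.GPU); ok {
			gpus = append(gpus, gpu)
		}
	}
	return gpus, nil
}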
Advanced Prioritization: Bin Packing for GPUs
After filtering, we may have several nodes that can run the pod. The Prioritize step helps kube-scheduler choose the best one. A common strategy is Bin Packing: placing the pod on the most-utilized node that can still fit it. This packs workloads tightly, leaving other nodes completely free for larger jobs.
Let's implement a bin-packing scoring logic. A higher score will be given to nodes that have a higher percentage of their total VRAM allocated.
// Extender struct and main function are the same as before
type nodeGPUStats struct {
totalVRAM int
allocatedVRAM int
}
func (e *Extender) prioritizeHandler(w http.ResponseWriter, r *http.Request) {
// ... read and decode ExtenderArgs ...
// 1. Pre-calculate GPU stats for all nodes in the cluster from our cache
nodeStats := make(map[string]nodeGPUStats)
gpus, err := e.gpuLister.List(labels.Everything())
if err != nil {
// ... handle error, return zero scores ...
}
for _, gpu := range gpus {
stats := nodeStats[gpu.Spec.NodeName]
stats.totalVRAM += gpu.Spec.TotalVRAMmb
stats.allocatedVRAM += gpu.Status.AllocatedVRAMmb
nodeStats[gpu.Spec.NodeName] = stats
}
scores := make(schedulerapi.HostPriorityList, len(args.Nodes.Items))
for i, node := range args.Nodes.Items {
stats, ok := nodeStats[node.Name]
var score int64 = 0
if ok && stats.totalVRAM > 0 {
// Calculate utilization percentage and scale to 0-10 score range.
// Higher utilization = higher score.
utilization := (float64(stats.allocatedVRAM) / float64(stats.totalVRAM)) * 100
score = int64(utilization / 10)
}
scores[i] = schedulerapi.HostPriority{
Host: node.Name,
Score: score,
}
}
// ... encode and write response ...
}
In this handler, we iterate through our cached GPU objects to build an aggregate view of VRAM utilization per node. Then, for each candidate node passed by the scheduler, we calculate a score from 0-10 based on this utilization. kube-scheduler will favor the node with the highest score, achieving our bin-packing goal.
Handling the Edge: Failure, Races, and Preemption
A working extender is one thing; a production-ready one anticipates failure.
Edge Case 1: Extender Unavailability
What happens if your extender deployment crashes? The ignorable flag in the SchedulerConfiguration is critical here.
* ignorable: true: kube-scheduler will log an error and proceed with scheduling as if the extender doesn't exist. This maintains cluster availability but means your custom scheduling logic is bypassed, potentially placing GPU pods on incorrect nodes.
* ignorable: false: kube-scheduler will fail the scheduling attempt for the pod. The pod will stay Pending with a FailedScheduling event reporting the extender error. This enforces your custom policies, but it can halt scheduling for every pod handled by this scheduler if the extender is down.
For critical workloads like GPU scheduling, ignorable: false is usually the correct choice. This requires you to run your extender in a highly available configuration (e.g., multiple replicas, anti-affinity rules).
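One practical piece of that HA setup is a readiness endpoint that keeps traffic away from a replica whose caches have not yet synced. A minimal sketch, assuming the HasSynced functions from the earlier main(); the /readyz path and probe wiring are illustrative choices.
// readyz.go (sketch) -- gate Service traffic on informer cache readiness.
package main

import (
	"net/http"

	"k8s.io/client-go/tools/cache"
)

// readyzHandler reports ready only once every informer cache has synced, so a
// freshly started replica never answers filter requests from an empty cache.
func readyzHandler(hasSynced ...cache.InformerSynced) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		for _, synced := range hasSynced {
			if !synced() {
				http.Error(w, "caches not synced", http.StatusServiceUnavailable)
				return
			}
		}
		w.WriteHeader(http.StatusOK)
	}
}

// In main(): http.HandleFunc("/readyz", readyzHandler(nodeInformer.HasSynced, gpuInformer.HasSynced))
// and point the Deployment's readinessProbe at /readyz.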
Edge Case 2: Race Conditions
Consider two identical pods, P1 and P2, being scheduled concurrently. The flow might look like this:
1. kube-scheduler calls /filter for P1. Your extender sees that node gpu-node-1 has one A100 available and returns it as a valid node.
2. kube-scheduler calls /filter for P2. Your extender's cache hasn't been updated yet, so it also sees gpu-node-1 as having one A100 available and returns it.
3. kube-scheduler might decide to place both P1 and P2 on gpu-node-1.
This is a classic race condition. Fortunately, the system is self-correcting at the kubelet level. The NVIDIA device plugin ensures that only one pod can actually claim the physical GPU, so one of the pods will fail to start. However, your extender's internal state (represented by the GPU CRDs) could become inconsistent.
This is where optimistic assumption comes in, mirroring the scheduler's own internal assume step. A more advanced extender can implement the preempt and bind verbs. When bindVerb is configured, kube-scheduler delegates binding to the extender: the call to /bind is both the instruction to perform the actual binding and your signal to optimistically update your state (a sketch follows below). Your operator would then perform the final reconciliation.
By using a CRD managed by an operator, you have a robust reconciliation loop. Even if the extender's optimistic update is wrong, the operator, as the source of truth, will eventually correct the state of the GPU custom resource by observing the actual pod-to-node bindings.
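Here is a hedged sketch of such a /bind handler, assuming bindVerb: "bind" is added to the extender configuration. With bindVerb set, the extender must create the Binding itself; the markGPUAssumed callback is a hypothetical hook into your optimistic bookkeeping, not part of the code shown earlier.
// bind_handler.go (sketch) -- delegated binding plus optimistic state update.
package main

import (
	"encoding/json"
	"log"
	"net/http"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	schedulerapi "k8s.io/kube-scheduler/extender/v1"
)

func newBindHandler(kubeClient kubernetes.Interface, markGPUAssumed func(node, ns, pod string)) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		defer r.Body.Close()

		var args schedulerapi.ExtenderBindingArgs
		result := schedulerapi.ExtenderBindingResult{}
		if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
			result.Error = err.Error()
		} else {
			// Perform the real binding via the Pods "bind" subresource.
			binding := &v1.Binding{
				ObjectMeta: metav1.ObjectMeta{Name: args.PodName, UID: args.PodUID},
				Target:     v1.ObjectReference{Kind: "Node", Name: args.Node},
			}
			if err := kubeClient.CoreV1().Pods(args.PodNamespace).Bind(r.Context(), binding, metav1.CreateOptions{}); err != nil {
				result.Error = err.Error()
			} else {
				// Optimistically record the allocation; the operator remains the
				// source of truth and will reconcile this against the running pod.
				markGPUAssumed(args.Node, args.PodNamespace, args.PodName)
			}
		}

		w.Header().Set("Content-Type", "application/json")
		if err := json.NewEncoder(w).Encode(&result); err != nil {
			log.Printf("Error encoding bind response: %v", err)
		}
	}
}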
Edge Case 3: Preemption
When a high-priority pod needs to be scheduled but no resources are available, kube-scheduler may evict a lower-priority pod. During this preemption check, the entire scheduling cycle, including calls to your extender, is re-run for the high-priority pod on the node where the victim pod would be evicted.
Your extender's logic must be idempotent and consistent. The scores it produces during a preemption simulation must be the same as during a regular scheduling cycle. Our cache-based approach ensures this, as the logic is purely a function of the current state of the cluster represented in the cache.
Conclusion: A Powerful, Precise Instrument
The Kubernetes Scheduler Extender is not a tool for everyday problems. It is a precise instrument for scenarios where the default scheduling logic is fundamentally insufficient. For managing specialized hardware like GPUs in a multi-tenant environment, it provides the necessary hook to inject domain-specific, business-critical logic directly into the cluster's brain.
Building a production-grade extender requires moving beyond simple webhook handlers. The key architectural patterns are:
* Externalize state: model hardware state with a Custom Resource Definition kept up to date by an operator, rather than holding allocation state inside the extender itself.
* Cache aggressively: use client-go informers to maintain a local, in-memory cache of all required Kubernetes objects. All decisions in the hot path should be served from this cache.
* Plan for failure: run the extender in a highly available configuration and set ignorable: false to ensure your scheduling policies are always enforced. Ensure your logic is idempotent to behave correctly during preemption.
By following these principles, you can transform the scheduler from a general-purpose tool into a highly specialized, intelligent system tailored to the unique demands of your most critical workloads.