Optimizing Kubernetes Scheduler Extenders for GPU-Aware Workloads

Goh Ling Yong

The Default Scheduler's Blind Spot: Specialized Hardware

The default Kubernetes scheduler, kube-scheduler, is a marvel of engineering, adept at placing general-purpose workloads across a cluster with impressive efficiency. It operates on a two-phase cycle: Filtering (finding nodes where a pod can run) and Scoring (ranking the viable nodes to find the best one). For stateless web servers or typical microservices that primarily consume CPU and memory, its default predicates and priorities are more than sufficient.

However, this efficiency breaks down when confronted with specialized, stateful hardware like GPUs. In a sophisticated multi-tenant Machine Learning platform, scheduling requirements become far more granular:

* Heterogeneous Hardware: A cluster may contain a mix of GPU models (e.g., NVIDIA A100s for training, T4s for inference). A pod must be scheduled on a node with the correct model.

* Resource Granularity: A pod might require a specific amount of VRAM (e.g., 20GB), which is a sub-resource of the GPU card itself. The default scheduler has no visibility into VRAM usage.

* Topology Awareness: High-performance training jobs might require multiple GPUs connected by a high-speed interconnect like NVLink. The scheduler must not only find a node with enough available GPUs but ensure they have the required physical connectivity.

* Custom Logic: Business logic, such as prioritizing workloads for a specific team or ensuring certain jobs land on cost-effective spot instances, is outside the purview of the default scheduler.

While Scheduler Plugins offer a more integrated way to extend the scheduler, they require compiling a custom scheduler binary. For a more decoupled, API-driven approach, the Scheduler Extender provides a powerful, albeit sharp-edged, tool. It allows you to augment the scheduling process via external webhooks.

This article is not an introduction. We assume you understand the basic scheduling cycle. We will dive directly into building a production-grade, high-performance scheduler extender in Go to solve the GPU-aware scheduling problem, focusing on the patterns and pitfalls encountered in real-world, large-scale clusters.

The Scheduler Extender Webhook Contract

A scheduler extender is fundamentally a web server that exposes endpoints kube-scheduler calls during its scheduling cycle. The communication is synchronous and blocking—a critical performance consideration we'll address later. The extender can intervene at four points, but we'll focus on the two most important:

  • Filter: The extender receives the pod being scheduled and a list of candidate nodes that have passed the scheduler's default predicates. It returns a filtered list of nodes that also satisfy its custom logic.
  • Prioritize (Scoring): The extender receives the pod and the filtered list of nodes. It returns a list of scores, ranking each node's suitability. The kube-scheduler combines these scores with its own to make the final decision.

Let's start with a minimal Go server to visualize the data flow. This will serve as our foundation.

    go
    // main.go
    package main
    
    import (
    	"encoding/json"
    	"io"
    	"log"
    	"net/http"
    
    	schedulerapi "k8s.io/kube-scheduler/extender/v1"
    )
    
    func main() {
    	http.HandleFunc("/filter", filterHandler)
    	http.HandleFunc("/prioritize", prioritizeHandler)
    
    	log.Println("Starting scheduler extender server on :8888")
    	if err := http.ListenAndServe(":8888", nil); err != nil {
    		log.Fatalf("Failed to start server: %v", err)
    	}
    }
    
    func filterHandler(w http.ResponseWriter, r *http.Request) {
    	body, err := io.ReadAll(r.Body)
    	if err != nil {
    		http.Error(w, "Failed to read request body", http.StatusInternalServerError)
    		return
    	}
    	defer r.Body.Close()
    
    	var args schedulerapi.ExtenderArgs
    	if err := json.Unmarshal(body, &args); err != nil {
    		http.Error(w, "Failed to unmarshal request", http.StatusBadRequest)
    		return
    	}
    
    	log.Printf("FILTER request for Pod: %s/%s with %d candidate nodes", args.Pod.Namespace, args.Pod.Name, len(args.Nodes.Items))
    
    	// For now, we approve all nodes.
    	result := schedulerapi.ExtenderFilterResult{
    		Nodes:       args.Nodes,
    		FailedNodes: make(map[string]string),
    		Error:       "",
    	}
    
    	w.Header().Set("Content-Type", "application/json")
    	if err := json.NewEncoder(w).Encode(result); err != nil {
    		log.Printf("Error encoding response: %v", err)
    	}
    }
    
    func prioritizeHandler(w http.ResponseWriter, r *http.Request) {
    	body, err := io.ReadAll(r.Body)
    	if err != nil {
    		http.Error(w, "Failed to read request body", http.StatusInternalServerError)
    		return
    	}
    	defer r.Body.Close()
    
    	var args schedulerapi.ExtenderArgs
    	if err := json.Unmarshal(body, &args); err != nil {
    		http.Error(w, "Failed to unmarshal request", http.StatusBadRequest)
    		return
    	}
    
    	log.Printf("PRIORITIZE request for Pod: %s/%s on %d nodes", args.Pod.Namespace, args.Pod.Name, len(args.Nodes.Items))
    
    	// For now, we give every node a score of 5.
    	scores := make(schedulerapi.HostPriorityList, len(args.Nodes.Items))
    	for i, node := range args.Nodes.Items {
    		scores[i] = schedulerapi.HostPriority{
    			Host:  node.Name,
    			Score: 5,
    		}
    	}
    
    	w.Header().Set("Content-Type", "application/json")
    	if err := json.NewEncoder(w).Encode(scores); err != nil {
    		log.Printf("Error encoding response: %v", err)
    	}
    }

    This simple server just logs the requests and passes all nodes through. To integrate it, we need to configure kube-scheduler.
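
    Before wiring the extender into kube-scheduler, it helps to poke the webhook by hand and watch the round trip. The following standalone program is a small smoke test of the /filter endpoint; it assumes the server above is listening on localhost:8888, and the pod and node names in it are purely illustrative.

    go
    // filter_smoke.go: builds an ExtenderArgs payload by hand, POSTs it to the
    // /filter endpoint, and prints which nodes the extender accepted.
    package main
    
    import (
    	"bytes"
    	"encoding/json"
    	"fmt"
    	"log"
    	"net/http"
    
    	v1 "k8s.io/api/core/v1"
    	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    	schedulerapi "k8s.io/kube-scheduler/extender/v1"
    )
    
    func main() {
    	args := schedulerapi.ExtenderArgs{
    		Pod: &v1.Pod{ObjectMeta: metav1.ObjectMeta{Name: "demo-pod", Namespace: "default"}},
    		Nodes: &v1.NodeList{Items: []v1.Node{
    			{ObjectMeta: metav1.ObjectMeta{Name: "gpu-node-1"}},
    			{ObjectMeta: metav1.ObjectMeta{Name: "gpu-node-2"}},
    		}},
    	}
    
    	payload, err := json.Marshal(args)
    	if err != nil {
    		log.Fatalf("Failed to marshal ExtenderArgs: %v", err)
    	}
    
    	resp, err := http.Post("http://localhost:8888/filter", "application/json", bytes.NewReader(payload))
    	if err != nil {
    		log.Fatalf("Filter request failed: %v", err)
    	}
    	defer resp.Body.Close()
    
    	var result schedulerapi.ExtenderFilterResult
    	if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
    		log.Fatalf("Failed to decode ExtenderFilterResult: %v", err)
    	}
    	for _, node := range result.Nodes.Items {
    		fmt.Println("accepted node:", node.Name)
    	}
    }

    Run against the pass-through server above, it should simply echo back both node names.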

    Scheduler Configuration

    Create a scheduler-config.yaml file:

    yaml
    apiVersion: kubescheduler.config.k8s.io/v1
    kind: KubeSchedulerConfiguration
    leaderElection:
      leaderElect: true
    clientConnection:
      kubeconfig: "/etc/kubernetes/scheduler.conf"
    
    extenders:
      - urlPrefix: "http://<extender-service-ip>:8888"
        filterVerb: "filter"
        prioritizeVerb: "prioritize"
        weight: 10
        enableHTTPS: false
        nodeCacheCapable: false # When true, the scheduler sends only node names and expects the extender to resolve them from its own cache; the handlers below expect full Node objects
        ignorable: false # If true, scheduling proceeds even if the extender is down

    You would then run kube-scheduler with the --config flag pointing to this file. In a real cluster, you'd mount this as a ConfigMap into the kube-scheduler pod.

    Production-Grade Filtering: State Management is Key

    Our core task is to filter nodes based on GPU availability. The pod spec might look like this:

    yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: training-job-resnet50
      annotations:
        # Custom annotations our extender will read
        gpu-extender.mle.com/model: "NVIDIA-A100-SXM4-40GB"
        gpu-extender.mle.com/vram-mb: "30000"
    spec:
      containers:
      - name: training-container
        image: my-tf-image
        resources:
          limits:
            # Requesting a GPU from the device plugin
            nvidia.com/gpu: "1"

    How does our extender know which nodes have an A100 with 30GB of VRAM free? This information is not native to Kubernetes Node objects. We need a way to track the state of each GPU on each node.

    Anti-Pattern: Do not try to manage this state inside the extender itself. An extender should be stateless. If it crashes, all allocation state is lost. Furthermore, with multiple replicas for HA, you'd need a complex state synchronization mechanism.

    Production Pattern: Use a Custom Resource Definition (CRD) to model GPU state and an operator to keep it updated. This offloads state management to the Kubernetes API server, giving us persistence, consistency, and observability for free.

    Let's define a GPU CRD:

    yaml
    # gpu.mle.com_gpus.yaml
    apiVersion: apiextensions.k8s.io/v1
    kind: CustomResourceDefinition
    metadata:
      name: gpus.gpu.mle.com
    spec:
      group: gpu.mle.com
      names:
        kind: GPU
        listKind: GPUList
        plural: gpus
        singular: gpu
      scope: Cluster
      versions:
        - name: v1alpha1
          served: true
          storage: true
          schema:
            openAPIV3Schema:
              type: object
              properties:
                spec:
                  type: object
                  properties:
                    nodeName:
                      type: string
                    uuid:
                      type: string
                    model:
                      type: string
                    totalVRAMmb:
                      type: integer
                status:
                  type: object
                  properties:
                    allocatedVRAMmb:
                      type: integer
                    podNamespace:
                      type: string
                    podName:
                      type: string
                    phase:
                      type: string # e.g., Available, Allocated

    An accompanying operator (or even a simple DaemonSet on each GPU node) would be responsible for:

  • Discovering GPUs on a node at startup (e.g., by running nvidia-smi).
  • Creating one GPU custom resource for each physical GPU in the cluster.
  • Watching for pod assignments and updating the status of the corresponding GPU resource to Allocated, including the pod's name and VRAM usage.

    With this state management system in place, our extender's job becomes much simpler: it just needs to read these GPU resources.
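
    Before we turn to the extender, here is a hedged sketch of the discovery half of such an agent, intended to run as a DaemonSet on each GPU node. It assumes an in-cluster config, a NODE_NAME environment variable injected via the Downward API, and generated Go types and a clientset for the GPU CRD; the gpu-crd import paths are the same placeholders used in the extender code later in this article, and the Go field names simply mirror the CRD schema above. Treat it as an outline, not a finished operator.

    go
    // gpu_discovery.go: discovers the GPUs on this node via nvidia-smi and
    // creates one GPU custom resource per physical device.
    package main
    
    import (
    	"context"
    	"log"
    	"os"
    	"os/exec"
    	"strconv"
    	"strings"
    
    	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    	"k8s.io/client-go/rest"
    
    	gpuv1alpha1 "path/to/your/gpu-crd/pkg/apis/gpu.mle.com/v1alpha1"
    	gpuclientset "path/to/your/gpu-crd/pkg/client/clientset/versioned"
    )
    
    func main() {
    	nodeName := os.Getenv("NODE_NAME") // injected via the Downward API in the DaemonSet spec
    
    	config, err := rest.InClusterConfig()
    	if err != nil {
    		log.Fatalf("Failed to build in-cluster config: %v", err)
    	}
    	client := gpuclientset.NewForConfigOrDie(config)
    
    	// One CSV line per device: "<uuid>, <model>, <total VRAM in MiB>".
    	out, err := exec.Command("nvidia-smi",
    		"--query-gpu=uuid,name,memory.total",
    		"--format=csv,noheader,nounits").Output()
    	if err != nil {
    		log.Fatalf("nvidia-smi failed: %v", err)
    	}
    
    	for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
    		fields := strings.Split(line, ",")
    		if len(fields) != 3 {
    			continue
    		}
    		vram, _ := strconv.Atoi(strings.TrimSpace(fields[2]))
    
    		gpu := &gpuv1alpha1.GPU{
    			ObjectMeta: metav1.ObjectMeta{Name: strings.ToLower(strings.TrimSpace(fields[0]))},
    			Spec: gpuv1alpha1.GPUSpec{
    				NodeName:    nodeName,
    				UUID:        strings.TrimSpace(fields[0]),
    				Model:       strings.TrimSpace(fields[1]),
    				TotalVRAMmb: vram,
    			},
    			Status: gpuv1alpha1.GPUStatus{Phase: "Available"},
    		}
    		// One GPU object per physical device; update-on-restart handling is elided.
    		if _, err := client.GpuV1alpha1().GPUs().Create(context.TODO(), gpu, metav1.CreateOptions{}); err != nil {
    			log.Printf("Failed to create GPU %s: %v", gpu.Name, err)
    		}
    	}
    }

    The allocation-tracking half (watching pod bindings and flipping the phase to Allocated) follows the same pattern, writing to the status fields instead of the spec.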

    Performance Optimization: Caching with Informers

    A naive extender implementation might query the API server for GPU resources on every /filter request. In a large, busy cluster scheduling hundreds of pods per minute, this would DDoS your own API server. The latency of these API calls would also cripple scheduling throughput.

    The solution is to maintain a local, in-memory cache of all relevant objects. The client-go library provides informers for this exact purpose.

    Let's build a more sophisticated extender that uses informers for Node and GPU objects.

    go
    // gpu_extender.go
    package main
    
    import (
    	// ... other imports
    	"context"
    	"fmt"
    	"strconv"
    	"time"
    
    	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    	"k8s.io/apimachinery/pkg/labels"
    	"k8s.io/client-go/informers"
    	"k8s.io/client-go/kubernetes"
    	"k8s.io/client-go/tools/cache"
    	"k8s.io/client-go/tools/clientcmd"
    
        // Import our custom GPU client and types
    	gpuclientset "path/to/your/gpu-crd/pkg/client/clientset/versioned"
    	gpuinformers "path/to/your/gpu-crd/pkg/client/informers/externalversions"
    	gpulisters "path/to/your/gpu-crd/pkg/client/listers/gpu.mle.com/v1alpha1"
    )
    
    // Extender holds listers that read from the shared informer caches.
    type Extender struct {
    	nodeLister corelisters.NodeLister
    	gpuLister  gpulisters.GPULister
    }
    
    func (e *Extender) filterHandler(w http.ResponseWriter, r *http.Request) {
        // ... read and decode ExtenderArgs ...
    
    	pod := args.Pod
    	requiredModel, ok1 := pod.Annotations["gpu-extender.mle.com/model"]
    	vramStr, ok2 := pod.Annotations["gpu-extender.mle.com/vram-mb"]
    	if !ok1 || !ok2 {
    		// Pod does not request a specific GPU, so we don't filter.
    		// Let it pass to other extenders or default scheduling.
    		// ... write success response with all nodes ...
    		return
    	}
    
    	requiredVRAM, err := strconv.Atoi(vramStr)
    	if err != nil {
            // ... handle error, fail this node ...
        }
    
    	filteredNodes := []v1.Node{}
    	failedNodes := make(map[string]string)
    
    	for _, node := range args.Nodes.Items {
    		if e.hasAvailableGPU(node.Name, requiredModel, requiredVRAM) {
    			filteredNodes = append(filteredNodes, node)
    		} else {
    			failedNodes[node.Name] = fmt.Sprintf("No available %s with %dMB VRAM", requiredModel, requiredVRAM)
    		}
    	}
    
    	result := schedulerapi.ExtenderFilterResult{
    		Nodes:       &v1.NodeList{Items: filteredNodes},
    		FailedNodes: failedNodes,
    		Error:       "",
    	}
        // ... encode and write response ...
    }
    
    // hasAvailableGPU checks our local cache for a suitable GPU on the given node.
    func (e *Extender) hasAvailableGPU(nodeName, model string, vram int) bool {
    	// Use the lister to get all GPU objects. This reads from the cache.
    	gpus, err := e.gpuLister.List(labels.Everything())
    	if err != nil {
    		log.Printf("Error listing GPUs from cache: %v", err)
    		return false // Fail closed
    	}
    
    	for _, gpu := range gpus {
    		if gpu.Spec.NodeName == nodeName &&
    			gpu.Spec.Model == model &&
    			gpu.Status.Phase == "Available" &&
    			(gpu.Spec.TotalVRAMmb-gpu.Status.AllocatedVRAMmb) >= vram {
    			return true // Found a suitable, available GPU
    		}
    	}
    
    	return false
    }
    
    func main() {
    	// ... create kubernetes config ...
    	kubeClient := kubernetes.NewForConfigOrDie(config)
    	gpuClient := gpuclientset.NewForConfigOrDie(config)
    
    	factory := informers.NewSharedInformerFactory(kubeClient, 30*time.Second)
    	gpuFactory := gpuinformers.NewSharedInformerFactory(gpuClient, 30*time.Second)
    
    	nodeInformer := factory.Core().V1().Nodes().Informer()
    	gpuInformer := gpuFactory.Gpu().V1alpha1().GPUs().Informer()
    
    	extender := &Extender{
    		nodeLister: factory.Core().V1().Nodes().Lister(),
    		gpuLister:  gpuFactory.Gpu().V1alpha1().GPUs().Lister(),
    	}
    
    	ctx, cancel := context.WithCancel(context.Background())
    	defer cancel()
    
    	go factory.Start(ctx.Done())
    	go gpuFactory.Start(ctx.Done())
    
    	// Wait for the caches to be synced before starting the web server
    	if !cache.WaitForCacheSync(ctx.Done(), nodeInformer.HasSynced, gpuInformer.HasSynced) {
    		log.Fatal("Failed to sync caches")
    	}
    
    	log.Println("Caches synced, starting server...")
    
    	http.HandleFunc("/filter", extender.filterHandler)
    	// ... register other handlers ...
    	log.Fatal(http.ListenAndServe(":8888", nil))
    }

    This implementation is vastly more performant. The filterHandler now queries an in-memory cache (gpuLister) that is kept up-to-date by the informer framework in the background. The latency of a filter request is reduced from multiple milliseconds (for an API call) to microseconds (for a memory lookup). This is the single most important optimization for a scheduler extender.
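
    One further refinement is worth considering: hasAvailableGPU lists every GPU object in the cluster and then filters by node name, which is O(total GPUs) for every candidate node. Shared informers support secondary indexes, so a sketch like the following narrows each lookup to a single node's GPUs. It assumes the generated GPU types package is imported as gpuv1alpha1 (a placeholder, like the other gpu-crd paths), and "byNode" is just an arbitrary index name.

    go
    // gpuNodeIndex names a secondary index on spec.nodeName.
    const gpuNodeIndex = "byNode"
    
    // registerGPUNodeIndex must be called on the GPU informer before the
    // factories are started; it indexes cached GPU objects by node name.
    func registerGPUNodeIndex(informer cache.SharedIndexInformer) error {
    	return informer.AddIndexers(cache.Indexers{
    		gpuNodeIndex: func(obj interface{}) ([]string, error) {
    			gpu, ok := obj.(*gpuv1alpha1.GPU)
    			if !ok {
    				return nil, nil
    			}
    			return []string{gpu.Spec.NodeName}, nil
    		},
    	})
    }
    
    // gpusOnNode returns only the cached GPU objects on nodeName, avoiding a
    // full scan of every GPU in the cluster.
    func gpusOnNode(informer cache.SharedIndexInformer, nodeName string) ([]*gpuv1alpha1.GPU, error) {
    	objs, err := informer.GetIndexer().ByIndex(gpuNodeIndex, nodeName)
    	if err != nil {
    		return nil, err
    	}
    	gpus := make([]*gpuv1alpha1.GPU, 0, len(objs))
    	for _, obj := range objs {
    		if gpu, ok := obj.(*gpuv1alpha1.GPU); ok {
    			gpus = append(gpus, gpu)
    		}
    	}
    	return gpus, nil
    }

    registerGPUNodeIndex would be called on gpuInformer in main() before the factories are started, and hasAvailableGPU would iterate over gpusOnNode(...) instead of the full lister output. For a cluster with a few dozen GPUs the difference is negligible, but it keeps the hot path flat as the fleet grows.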

    Advanced Prioritization: Bin Packing for GPUs

    After filtering, we may have several nodes that can run the pod. The Prioritize step helps kube-scheduler choose the best one. A common strategy is Bin Packing: placing the pod on the most-utilized node that can still fit it. This packs workloads tightly, leaving other nodes completely free for larger jobs.

    Let's implement a bin-packing scoring logic. A higher score will be given to nodes that have a higher percentage of their total VRAM allocated.

    go
    // Extender struct and main function are the same as before
    
    type nodeGPUStats struct {
    	totalVRAM     int
    	allocatedVRAM int
    }
    
    func (e *Extender) prioritizeHandler(w http.ResponseWriter, r *http.Request) {
        // ... read and decode ExtenderArgs ...
    
    	// 1. Pre-calculate GPU stats for all nodes in the cluster from our cache
    	nodeStats := make(map[string]nodeGPUStats)
    	gpus, err := e.gpuLister.List(labels.Everything())
    	if err != nil {
            // ... handle error, return zero scores ...
        }
    
    	for _, gpu := range gpus {
    		stats := nodeStats[gpu.Spec.NodeName]
    		stats.totalVRAM += gpu.Spec.TotalVRAMmb
    		stats.allocatedVRAM += gpu.Status.AllocatedVRAMmb
    		nodeStats[gpu.Spec.NodeName] = stats
    	}
    
    	scores := make(schedulerapi.HostPriorityList, len(args.Nodes.Items))
    	for i, node := range args.Nodes.Items {
    		stats, ok := nodeStats[node.Name]
    		var score int64 = 0
    
    		if ok && stats.totalVRAM > 0 {
    			// Calculate utilization percentage and scale to 0-10 score range.
    			// Higher utilization = higher score.
    			utilization := (float64(stats.allocatedVRAM) / float64(stats.totalVRAM)) * 100
    			score = int64(utilization / 10)
    		}
    
    		scores[i] = schedulerapi.HostPriority{
    			Host:  node.Name,
    			Score: score,
    		}
    	}
        // ... encode and write response ...
    }

    In this handler, we iterate through our cached GPU objects to build an aggregate view of VRAM utilization per node. Then, for each candidate node passed by the scheduler, we calculate a score from 0 to 10 based on this utilization; 10 is the maximum priority an extender may return, and kube-scheduler multiplies the extender's score by its configured weight before combining it with its own priorities. For example, a node with 30,000 of 40,000 MB VRAM allocated scores int64(75 / 10) = 7. kube-scheduler will favor the node with the highest combined score, achieving our bin-packing goal.

    Handling the Edge: Failure, Races, and Preemption

    A working extender is one thing; a production-ready one anticipates failure.

    Edge Case 1: Extender Unavailability

    What happens if your extender deployment crashes? The ignorable flag in the SchedulerConfiguration is critical here.

    * ignorable: true: kube-scheduler will log an error and proceed with scheduling as if the extender doesn't exist. This maintains cluster availability but means your custom scheduling logic is bypassed, potentially placing GPU pods on incorrect nodes.

    * ignorable: false: kube-scheduler will fail the scheduling attempt for the pod. The pod will go into a Pending state with a reason like SchedulerExtenderError. This enforces your custom policies but can halt all pod scheduling if the extender is down.

    For critical workloads like GPU scheduling, ignorable: false is usually the correct choice. This requires you to run your extender in a highly available configuration (e.g., multiple replicas, anti-affinity rules).
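
    Because ignorable: false turns the extender into a hard dependency of scheduling, each replica should also refuse traffic until its informer caches are warm; a freshly restarted pod that filters against an empty cache would otherwise reject every node. A minimal sketch of a readiness endpoint, assuming the informers from the caching example and a matching readinessProbe on the extender's Deployment:

    go
    // readyHandler reports 200 only once all informer caches have synced, so a
    // readinessProbe keeps cold replicas out of the Service endpoints.
    func readyHandler(hasSynced ...cache.InformerSynced) http.HandlerFunc {
    	return func(w http.ResponseWriter, r *http.Request) {
    		for _, synced := range hasSynced {
    			if !synced() {
    				http.Error(w, "caches not synced", http.StatusServiceUnavailable)
    				return
    			}
    		}
    		w.WriteHeader(http.StatusOK)
    	}
    }
    
    // In main(), after constructing the informers:
    //   http.HandleFunc("/readyz", readyHandler(nodeInformer.HasSynced, gpuInformer.HasSynced))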

    Edge Case 2: Race Conditions

    Consider two identical pods, P1 and P2, scheduled back to back. The flow might look like this:

  • kube-scheduler calls /filter for P1. Your extender sees a node gpu-node-1 has one A100 available and returns it as a valid node.
  • Moments later, kube-scheduler calls /filter for P2. Your extender's cache hasn't been updated yet, so it also sees gpu-node-1 has one A100 available and returns it.
  • kube-scheduler might decide to place both P1 and P2 on gpu-node-1.

    This is a classic race condition. Fortunately, the system is self-correcting at the kubelet level: the NVIDIA device plugin ensures that only one pod can actually claim the physical GPU, and the other pod will fail to start. However, your extender's internal state (represented by the GPU CRDs) could become inconsistent.

    This is where the scheduler's assume-and-bind step comes in. A more advanced extender can implement the preempt and bind verbs (preemptVerb and bindVerb in the configuration). When kube-scheduler decides to bind a pod, it calls the extender's /bind endpoint; this is your signal to optimistically update your state. Your operator would then perform the final reconciliation.
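
    To make that concrete, here is a hedged sketch of a /bind handler. It assumes bindVerb: "bind" has been added to the extender entry in the scheduler configuration, that the Extender struct also carries a kubeClient (a kubernetes.Interface), and that markGPUAllocated is a hypothetical helper that patches the chosen GPU resource's status. Because configuring a bind verb shifts responsibility for the binding itself onto the extender, the handler must also create the actual pod-to-node Binding.

    go
    // bindHandler optimistically records the GPU allocation for the chosen node,
    // then performs the pod-to-node binding on behalf of kube-scheduler.
    func (e *Extender) bindHandler(w http.ResponseWriter, r *http.Request) {
    	var args schedulerapi.ExtenderBindingArgs
    	if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
    		http.Error(w, "Failed to unmarshal request", http.StatusBadRequest)
    		return
    	}
    	defer r.Body.Close()
    
    	result := schedulerapi.ExtenderBindingResult{}
    
    	// Optimistic bookkeeping: markGPUAllocated (hypothetical) updates the GPU
    	// CR status; the operator remains the source of truth and reconciles drift.
    	if err := e.markGPUAllocated(args.Node, args.PodNamespace, args.PodName); err != nil {
    		result.Error = err.Error()
    	} else {
    		binding := &v1.Binding{
    			ObjectMeta: metav1.ObjectMeta{Name: args.PodName, Namespace: args.PodNamespace, UID: args.PodUID},
    			Target:     v1.ObjectReference{Kind: "Node", Name: args.Node},
    		}
    		if err := e.kubeClient.CoreV1().Pods(args.PodNamespace).Bind(r.Context(), binding, metav1.CreateOptions{}); err != nil {
    			result.Error = err.Error()
    		}
    	}
    
    	w.Header().Set("Content-Type", "application/json")
    	if err := json.NewEncoder(w).Encode(result); err != nil {
    		log.Printf("Error encoding response: %v", err)
    	}
    }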

    By using a CRD managed by an operator, you have a robust reconciliation loop. Even if the extender's optimistic update is wrong, the operator, as the source of truth, will eventually correct the state of the GPU custom resource by observing the actual pod-to-node bindings.

    Edge Case 3: Preemption

    When a high-priority pod needs to be scheduled but no resources are available, kube-scheduler may evict lower-priority pods. During this preemption check, the filtering and scoring logic, including calls to your extender, is re-evaluated for the high-priority pod against candidate nodes as they would look once the victim pods are gone.

    Your extender's logic must be idempotent and consistent. The scores it produces during a preemption simulation must be the same as during a regular scheduling cycle. Our cache-based approach ensures this, as the logic is purely a function of the current state of the cluster represented in the cache.

    Conclusion: A Powerful, Precise Instrument

    The Kubernetes Scheduler Extender is not a tool for everyday problems. It is a precise instrument for scenarios where the default scheduling logic is fundamentally insufficient. For managing specialized hardware like GPUs in a multi-tenant environment, it provides the necessary hook to inject domain-specific, business-critical logic directly into the cluster's brain.

    Building a production-grade extender requires moving beyond simple webhook handlers. The key architectural patterns are:

  • Decouple State Management: Do not manage state within the extender itself. Use CRDs and an operator to model and reconcile the state of your custom resources (like GPUs) via the Kubernetes control plane.
  • Aggressively Cache: The synchronous nature of the extender webhook makes latency your primary enemy. Use client-go informers to maintain a local, in-memory cache of all required Kubernetes objects. All decisions in the hot path should be served from this cache.
  • Design for Failure: Run your extender in a highly-available configuration and set ignorable: false to ensure your scheduling policies are always enforced. Ensure your logic is idempotent to behave correctly during preemption.

    By following these principles, you can transform the scheduler from a general-purpose tool into a highly-specialized, intelligent system tailored to the unique demands of your most critical workloads.
