Custom Kubernetes Schedulers for GPU-Aware Pod Placement

Goh Ling Yong

The Default Scheduler's Blind Spot: Why ML Workloads Suffer

The default Kubernetes scheduler, kube-scheduler, is a masterpiece of general-purpose orchestration. It excels at bin-packing workloads based on CPU and memory requests, keeping cluster-wide resource utilization high. However, for high-performance computing (HPC) and distributed machine learning workloads, this generality becomes a significant limitation: the scheduler is fundamentally unaware of the underlying hardware topology that is critical for performance.

Consider a distributed training job using NVIDIA's NCCL for high-speed inter-GPU communication. The default scheduler might place two communicating pods on nodes that:

  • Lack a high-speed interconnect: One pod lands on a node with an NVLink fabric, while the other is on a standard PCIe-based node. Communication defaults to slower Ethernet, crippling training throughput.
  • Violate NUMA locality: A pod gets scheduled on a multi-socket server, but its CPU cores are allocated on a different NUMA node from the GPU it's assigned. This forces memory access across the slower interconnect between sockets (QPI/UPI), introducing significant latency.
  • Use heterogeneous GPUs: A job expecting homogeneous A100 GPUs might have one of its pods scheduled on a node with an older V100 GPU, causing either runtime failures or performance degradation to the lowest common denominator.

    The kube-scheduler sees nvidia.com/gpu: 1 as a fungible resource. It cannot differentiate between an A100-80GB and a T4, nor can it understand that GPUs 0 and 1 on node-a are connected via NVLink, while GPUs 2 and 3 are not. This topological ignorance is the primary motivation for extending its logic.

    This article dives deep into one of the most powerful and production-ready methods for solving this: building a Scheduler Extender. We will design, implement, and deploy a Go-based webhook that injects GPU topology and NUMA awareness into the core Kubernetes scheduling pipeline, enabling intelligent, performance-oriented placement decisions for demanding ML workloads.

    Architectural Choice: Extender vs. Full Custom Scheduler

    Before we write any code, we must choose our architectural pattern. Kubernetes offers two primary avenues for custom scheduling:

  • Multiple Schedulers: You can run a completely separate, custom-built scheduler in your cluster. Pods opt-in by specifying the schedulerName in their spec. This offers maximum control but requires you to reimplement or vendor much of the default scheduler's logic (like preemption, affinity/anti-affinity, and basic resource fitting). It's a heavy lift and prone to bit-rot as the upstream scheduler evolves.
  • Scheduler Extender: This is a webhook-based mechanism. You configure kube-scheduler to call out to an external HTTP/S endpoint at two key phases of its scheduling cycle:
    * Filter: The extender receives the pod and the list of candidate nodes that survived the default filters. It returns the subset it considers valid hosts, plus a map of rejected nodes and the reasons. This is where we'll implement hard constraints, like matching a specific GPU model.

    * Prioritize: For all nodes that passed the Filter phase, the extender receives the pod and the list of viable nodes. It returns a score (0-10) for each node. The kube-scheduler multiplies these scores by the extender's configured weight and adds them to its own internal scores to make a final decision. This is where we'll implement soft preferences, like NUMA affinity or co-location.

    For our use case, the Extender is the superior choice. It allows us to surgically inject our specialized logic while continuing to leverage the battle-hardened foundation of the default scheduler. We don't need to reinvent the wheel for pod preemption or taints and tolerations; we simply augment the existing process with our domain-specific knowledge.

    Implementation: A GPU-Aware Scheduler Extender in Go

    Let's build our extender. We'll use Go and the standard net/http library, as the core logic is simply handling JSON payloads over HTTP. For local development, we'll use kind (Kubernetes in Docker) to simulate a multi-node cluster.

    Prerequisites

  • Go (1.21+, matching the Dockerfile we'll build with)
  • Docker
  • kubectl
  • kind

    First, create a kind cluster. A single-node cluster is enough to develop against, but the end-to-end test at the end of this article labels one control-plane node and two workers, so a multi-node variant is sketched right after the command below.

    bash
    kind create cluster --name gpu-scheduler-demo
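
    The default kind cluster is a single control-plane node. If you want to follow the three-node test scenario at the end of this article exactly, a multi-node cluster helps. A minimal sketch (the kind-config.yaml file name is my own choice, not something kind requires):

    yaml
    # kind-config.yaml -- three nodes to match the labeling in the test scenario later.
    kind: Cluster
    apiVersion: kind.x-k8s.io/v1alpha4
    nodes:
    - role: control-plane
    - role: worker
    - role: worker

    bash
    kind create cluster --name gpu-scheduler-demo --config kind-config.yaml
    # Multi-node kind clusters keep the control-plane NoSchedule taint; remove it so
    # the demo pod can land there as described later (older releases use the key
    # node-role.kubernetes.io/master instead).
    kubectl taint nodes gpu-scheduler-demo-control-plane node-role.kubernetes.io/control-plane:NoSchedule-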

    Project Structure

    Our Go project will be straightforward:

    text
    /gpu-scheduler-extender
    ├── go.mod
    ├── go.sum
    ├── main.go
    ├── handler.go      # HTTP handlers for /filter and /prioritize
    ├── types.go        # Struct definitions for Kubernetes API payloads
    ├── Dockerfile
    └── /deploy
        ├── extender.yaml   # Deployment and Service for our extender
        └── scheduler-config.yaml # KubeSchedulerConfiguration

    Core Data Structures (`types.go`)

    The kube-scheduler communicates with the extender using specific JSON structures. We must define these in Go to correctly unmarshal the requests and marshal our responses.

    go
    // types.go
    package main
    
    import (
    	v1 "k8s.io/api/core/v1"
    )
    
    // ExtenderArgs represents the arguments sent by the scheduler to the extender.
    type ExtenderArgs struct {
    	Pod  *v1.Pod   `json:"pod"`
    	Nodes *v1.NodeList `json:"nodes"`
    	NodeNames *[]string `json:"nodeNames"`
    }
    
    // ExtenderFilterResult represents the result of the filter operation.
    type ExtenderFilterResult struct {
    	Nodes *v1.NodeList `json:"nodes,omitempty"`
    	NodeNames *[]string `json:"nodeNames,omitempty"`
    	FailedNodes map[string]string `json:"failedNodes,omitempty"`
    	Error string `json:"error,omitempty"`
    }
    
    // HostPriority represents the priority of a single host.
    type HostPriority struct {
    	Host  string `json:"host"`
    	Score int    `json:"score"`
    }
    
    // HostPriorityList is a collection of HostPriority.
    type HostPriorityList []HostPriority

    These types are the foundation of our communication protocol with kube-scheduler.
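
    These structs mirror the published extender API. If you'd rather not maintain them by hand, one option (not used in the rest of this article) is to import the canonical definitions, which guarantees the JSON field names match what kube-scheduler actually sends:

    go
    // Optional alternative to hand-rolled types; the import alias is arbitrary.
    import extenderv1 "k8s.io/kube-scheduler/extender/v1"

    // extenderv1.ExtenderArgs, extenderv1.ExtenderFilterResult and
    // extenderv1.HostPriorityList play the same roles as the structs above.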

    The HTTP Handlers (`handler.go`)

    This is the heart of our extender. We'll create two main handlers: handleFilter and handlePrioritize.

    go
    // handler.go
    package main
    
    import (
    	"encoding/json"
    	"io"
    	"log"
    	"net/http"
    	"strings"
    
    	v1 "k8s.io/api/core/v1"
    )
    
    const (
    	// The label key we'll use to request a specific GPU model.
    	gpuModelLabel = "accelerator.sched.io/gpu-model"
    	// The node label key where the GPU model is advertised.
    	nodeGpuModelLabel = "nvidia.com/gpu.product"
    	// The node label key for NUMA node affinity.
    	numaNodeLabel = "nvidia.com/gpu.numa"
    )
    
    func handleFilter(w http.ResponseWriter, r *http.Request) {
    	body, err := io.ReadAll(r.Body)
    	if err != nil {
    		http.Error(w, "Failed to read request body", http.StatusInternalServerError)
    		return
    	}
    
    	var args ExtenderArgs
    	if err := json.Unmarshal(body, &args); err != nil {
    		http.Error(w, "Failed to unmarshal request", http.StatusBadRequest)
    		return
    	}
    
    	if args.Pod == nil || args.Nodes == nil {
    		http.Error(w, "Invalid request: pod or nodes missing", http.StatusBadRequest)
    		return
    	}
    
    	// The logic for filtering nodes
    	result := filterNodes(args.Pod, args.Nodes)
    
    	responseBody, err := json.Marshal(result)
    	if err != nil {
    		http.Error(w, "Failed to marshal response", http.StatusInternalServerError)
    		return
    	}
    
    	w.Header().Set("Content-Type", "application/json")
    	w.WriteHeader(http.StatusOK)
    	_, _ = w.Write(responseBody)
    }
    
    // filterNodes implements our first piece of custom logic: GPU model matching.
    func filterNodes(pod *v1.Pod, nodes *v1.NodeList) ExtenderFilterResult {
    	requestedModel, ok := pod.Labels[gpuModelLabel]
    	if !ok {
    		// If the pod doesn't request a specific model, we don't filter any nodes.
    		// The default scheduler can handle it.
    		return ExtenderFilterResult{Nodes: nodes}
    	}
    
    	log.Printf("Pod %s/%s requests GPU model: %s", pod.Namespace, pod.Name, requestedModel)
    
    	filteredNodes := v1.NodeList{}
    	failedNodes := make(map[string]string)
    
    	for _, node := range nodes.Items {
    		nodeModel, ok := node.Labels[nodeGpuModelLabel]
    		if !ok {
    			failedNodes[node.Name] = "Node does not have a GPU model label"
    			continue
    		}
    
    		if strings.EqualFold(nodeModel, requestedModel) {
    			filteredNodes.Items = append(filteredNodes.Items, node)
    		} else {
    			failedNodes[node.Name] = "GPU model mismatch"
    		}
    	}
    
    	log.Printf("Filtering complete. Passed nodes: %d, Failed nodes: %d", len(filteredNodes.Items), len(failedNodes))
    
    	return ExtenderFilterResult{
    		Nodes:       &filteredNodes,
    		FailedNodes: failedNodes,
    	}
    }
    
    // handlePrioritize will contain our scoring logic.
    func handlePrioritize(w http.ResponseWriter, r *http.Request) {
        body, err := io.ReadAll(r.Body)
        if err != nil {
            http.Error(w, "Failed to read request body", http.StatusInternalServerError)
            return
        }
    
        var args ExtenderArgs
        if err := json.Unmarshal(body, &args); err != nil {
            http.Error(w, "Failed to unmarshal request", http.StatusBadRequest)
            return
        }
    
        if args.Pod == nil || args.Nodes == nil {
            http.Error(w, "Invalid request: pod or nodes missing", http.StatusBadRequest)
            return
        }

        // With nodeCacheCapable left at its default (false) in our scheduler
        // configuration, the scheduler sends full Node objects rather than bare
        // node names, so derive the candidate names from the NodeList.
        nodeNames := make([]string, 0, len(args.Nodes.Items))
        for _, node := range args.Nodes.Items {
            nodeNames = append(nodeNames, node.Name)
        }

        // The logic for scoring nodes
        priorities := prioritizeNodes(args.Pod, nodeNames, args.Nodes)
    
        responseBody, err := json.Marshal(priorities)
        if err != nil {
            http.Error(w, "Failed to marshal response", http.StatusInternalServerError)
            return
        }
    
        w.Header().Set("Content-Type", "application/json")
        w.WriteHeader(http.StatusOK)
        _, _ = w.Write(responseBody)
    }
    
    // prioritizeNodes implements NUMA affinity scoring.
    func prioritizeNodes(pod *v1.Pod, nodeNames []string, nodes *v1.NodeList) HostPriorityList {
    	hostPriorities := make(HostPriorityList, 0, len(nodeNames))
    
    	// For simplicity, we'll assume a pod requesting a GPU implicitly prefers NUMA node 0.
    	// A production system might parse this from an annotation.
    	// Check every container, not just the first, for a GPU limit.
    	gpuRequested := false
    	for _, c := range pod.Spec.Containers {
    		if _, ok := c.Resources.Limits["nvidia.com/gpu"]; ok {
    			gpuRequested = true
    			break
    		}
    	}
    	if !gpuRequested {
    		// If no GPU is requested, we don't apply any special priority.
    		for _, name := range nodeNames {
    			hostPriorities = append(hostPriorities, HostPriority{Host: name, Score: 0})
    		}
    		return hostPriorities
    	}
    
    	log.Printf("Prioritizing nodes for GPU pod %s/%s", pod.Namespace, pod.Name)
    
    	nodeMap := make(map[string]v1.Node)
    	for _, node := range nodes.Items {
    		nodeMap[node.Name] = node
    	}
    
    	for _, name := range nodeNames {
    		node := nodeMap[name]
    		score := 0
    		numaNode, ok := node.Labels[numaNodeLabel]
    		if ok && numaNode == "0" {
    			// High score for nodes where the GPU is on NUMA node 0
    			score = 10
    			log.Printf("Node %s gets high priority (score 10) for NUMA affinity", name)
    		} else {
    			// Lower score for other nodes
    			score = 1
    			log.Printf("Node %s gets low priority (score 1) due to lack of NUMA affinity", name)
    		}
    		hostPriorities = append(hostPriorities, HostPriority{Host: name, Score: score})
    	}
    
    	return hostPriorities
    }
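
    Because filterNodes is a pure function of the pod and the candidate nodes, it can be unit tested without a cluster. A minimal sketch (the node names and label values below are made up for the test):

    go
    // handler_test.go
    package main

    import (
    	"testing"

    	v1 "k8s.io/api/core/v1"
    	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    func TestFilterNodesMatchesGpuModel(t *testing.T) {
    	pod := &v1.Pod{ObjectMeta: metav1.ObjectMeta{
    		Name:      "training-pod-1",
    		Namespace: "default",
    		Labels:    map[string]string{gpuModelLabel: "NVIDIA-A100-SXM4-80GB"},
    	}}
    	nodes := &v1.NodeList{Items: []v1.Node{
    		{ObjectMeta: metav1.ObjectMeta{
    			Name:   "node-a100",
    			Labels: map[string]string{nodeGpuModelLabel: "NVIDIA-A100-SXM4-80GB"},
    		}},
    		{ObjectMeta: metav1.ObjectMeta{
    			Name:   "node-v100",
    			Labels: map[string]string{nodeGpuModelLabel: "Tesla-V100-PCIE-16GB"},
    		}},
    	}}

    	result := filterNodes(pod, nodes)

    	if len(result.Nodes.Items) != 1 || result.Nodes.Items[0].Name != "node-a100" {
    		t.Fatalf("expected only node-a100 to pass, got %+v", result.Nodes)
    	}
    	if _, rejected := result.FailedNodes["node-v100"]; !rejected {
    		t.Fatalf("expected node-v100 in FailedNodes, got %+v", result.FailedNodes)
    	}
    }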
    

    The Main Server (`main.go`)

    This file ties everything together, setting up the HTTP server and its routes.

    go
    // main.go
    package main
    
    import (
    	"log"
    	"net/http"
    )
    
    func main() {
    	http.HandleFunc("/filter", handleFilter)
    	http.HandleFunc("/prioritize", handlePrioritize)
    
    	log.Println("Starting GPU scheduler extender on :8888")
    	if err := http.ListenAndServe(":8888", nil); err != nil {
    		log.Fatalf("Failed to start server: %v", err)
    	}
    }
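
    One production nicety worth considering (not shown in the snippet above): explicit server timeouts, so a wedged handler can't hold a scheduler request open indefinitely (the extender entry on the scheduler side also supports an httpTimeout field). A sketch, assuming an extra time import:

    go
    // Hardened variant of main() with explicit timeouts.
    func main() {
    	http.HandleFunc("/filter", handleFilter)
    	http.HandleFunc("/prioritize", handlePrioritize)

    	srv := &http.Server{
    		Addr:         ":8888",
    		ReadTimeout:  5 * time.Second,
    		WriteTimeout: 5 * time.Second,
    	}

    	log.Println("Starting GPU scheduler extender on :8888")
    	if err := srv.ListenAndServe(); err != nil {
    		log.Fatalf("Failed to start server: %v", err)
    	}
    }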

    Containerizing the Extender (`Dockerfile`)

    We'll use a multi-stage build to create a minimal, secure container image.

    Dockerfile
    # build stage
    FROM golang:1.21-alpine AS builder
    
    WORKDIR /app
    
    COPY go.mod go.sum ./
    RUN go mod download
    
    COPY . .
    
    RUN CGO_ENABLED=0 GOOS=linux go build -a -installsuffix cgo -o gpu-scheduler-extender .
    
    # final stage
    FROM alpine:latest
    
    WORKDIR /root/
    
    COPY --from=builder /app/gpu-scheduler-extender .
    
    EXPOSE 8888
    
    CMD ["./gpu-scheduler-extender"]

    Build and push the image to a registry of your choice.

    bash
    docker build -t your-registry/gpu-scheduler-extender:v1.0 .
    docker push your-registry/gpu-scheduler-extender:v1.0
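
    If you're following along on kind, you can skip the registry round trip and load the image straight onto the cluster's nodes (in that case, set imagePullPolicy: IfNotPresent in the Deployment below so the kubelet uses the preloaded image):

    bash
    kind load docker-image your-registry/gpu-scheduler-extender:v1.0 --name gpu-scheduler-demo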

    Deployment and Configuration in Kubernetes

    This is the most critical part. We need to deploy our extender and, more importantly, tell kube-scheduler how to use it.

    Extender Deployment (`deploy/extender.yaml`)

    This is a standard Deployment and Service to run our Go application in the cluster.

    yaml
    # deploy/extender.yaml
    apiVersion: v1
    kind: Service
    metadata:
      name: gpu-extender-svc
      namespace: kube-system
    spec:
      ports:
      - port: 80
        targetPort: 8888
      selector:
        app: gpu-scheduler-extender
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: gpu-scheduler-extender
      namespace: kube-system
      labels:
        app: gpu-scheduler-extender
    spec:
      replicas: 2 # Run multiple for HA
      selector:
        matchLabels:
          app: gpu-scheduler-extender
      template:
        metadata:
          labels:
            app: gpu-scheduler-extender
        spec:
          containers:
          - name: extender
            image: your-registry/gpu-scheduler-extender:v1.0
            imagePullPolicy: Always
            ports:
            - containerPort: 8888

    `kube-scheduler` Configuration (`deploy/scheduler-config.yaml`)

    This KubeSchedulerConfiguration object is the glue. It tells the cluster's kube-scheduler that our extender exists and how to communicate with it.

    yaml
    # deploy/scheduler-config.yaml
    apiVersion: kubescheduler.config.k8s.io/v1
    kind: KubeSchedulerConfiguration
    leaderElection:
      leaderElect: true
    clientConnection:
      kubeconfig: /etc/kubernetes/scheduler.conf
    # Extenders are configured at the top level of the configuration object and
    # apply to all scheduling profiles.
    extenders:
    - urlPrefix: "http://gpu-extender-svc.kube-system.svc.cluster.local"
      filterVerb: "filter"
      prioritizeVerb: "prioritize"
      weight: 2            # Give our priority score a weight of 2
      enableHTTPS: false
      # nodeCacheCapable defaults to false, so the scheduler sends full Node
      # objects in each request, which is what our handlers expect.
      nodeCacheCapable: false
      ignorable: false     # Critical: if the extender fails, scheduling fails.
    Key Configuration Parameters:

  • urlPrefix: The address of our extender's Service. Note that kubeadm runs kube-scheduler as a static pod with hostNetwork: true, so it resolves names through the host's DNS; if cluster DNS names like *.svc.cluster.local aren't resolvable from the control-plane host, point this at the Service's ClusterIP or a NodePort instead.
  • filterVerb & prioritizeVerb: The URL paths for the filter and prioritize calls.
  • weight: A multiplier for the scores our extender returns. This allows you to tune how influential your extender is compared to the default scheduler's internal scoring.
  • ignorable: This is a critical production decision. If true, a failure to contact the extender will be ignored, and scheduling will proceed without custom logic. If false, scheduling for the pod will fail. For mandatory hardware requirements, false is the correct choice.

    Applying the Configuration

    Applying this configuration depends on your cluster setup. For kind, the cleanest route is to bake it in when the cluster is created (a sketch follows below). For a self-managed cluster, you would modify the static pod manifest for kube-scheduler on your control-plane nodes (typically at /etc/kubernetes/manifests/kube-scheduler.yaml) to mount this config file and pass it as a command-line argument:

    yaml
    # In /etc/kubernetes/manifests/kube-scheduler.yaml
    ... 
    spec:
      containers:
      - command:
        - kube-scheduler
        - --config=/etc/kubernetes/scheduler-config.yaml # <-- Add this line
    ... 
        volumeMounts:
        - name: scheduler-config
          mountPath: /etc/kubernetes/scheduler-config.yaml
          readOnly: true
    ... 
      volumes:
      - name: scheduler-config
        hostPath:
          path: /path/to/your/scheduler-config.yaml # <-- Path on the host
          type: File

    After saving this file, the kubelet on the control plane node will automatically restart the kube-scheduler pod with the new configuration.
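
    For kind, one way to achieve the same thing at cluster-creation time is to mount the config file into the control-plane node and patch the kubeadm ClusterConfiguration so kube-scheduler starts with --config. A sketch extending the earlier kind-config.yaml (exact field shapes vary across kind and kubeadm versions, so treat this as a starting point rather than a drop-in file):

    yaml
    # kind-config.yaml (control-plane entry only; the worker entries are unchanged)
    nodes:
    - role: control-plane
      extraMounts:
      - hostPath: /absolute/path/to/deploy/scheduler-config.yaml
        containerPath: /etc/kubernetes/scheduler-config.yaml
        readOnly: true
      kubeadmConfigPatches:
      - |
        kind: ClusterConfiguration
        scheduler:
          extraArgs:
            config: /etc/kubernetes/scheduler-config.yaml
          extraVolumes:
          - name: scheduler-config
            hostPath: /etc/kubernetes/scheduler-config.yaml
            mountPath: /etc/kubernetes/scheduler-config.yaml
            readOnly: true
            pathType: File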

    End-to-End Test Scenario

    Let's validate our setup. We need to simulate nodes with different GPU labels.

  • Label the Nodes: Get your kind node names (kubectl get nodes) and apply labels to simulate a heterogeneous cluster.
    bash
        # Node 1: Powerful, NUMA-aligned A100
        kubectl label node gpu-scheduler-demo-control-plane nvidia.com/gpu.product=NVIDIA-A100-SXM4-80GB
        kubectl label node gpu-scheduler-demo-control-plane nvidia.com/gpu.numa=0
    
        # Node 2: Another A100, but on a different NUMA node
        kubectl label node gpu-scheduler-demo-worker nvidia.com/gpu.product=NVIDIA-A100-SXM4-80GB
        kubectl label node gpu-scheduler-demo-worker nvidia.com/gpu.numa=1
    
        # Node 3: An older generation V100
        kubectl label node gpu-scheduler-demo-worker2 nvidia.com/gpu.product=Tesla-V100-PCIE-16GB
        kubectl label node gpu-scheduler-demo-worker2 nvidia.com/gpu.numa=0
  • Deploy a Pod Requesting a Specific GPU: Note that on a plain kind cluster nothing advertises nvidia.com/gpu capacity, so the pod will stay Pending on resource fit alone unless you stub that extended resource on the nodes or run this step against a real GPU cluster.

    yaml
        # pod-a100.yaml
        apiVersion: v1
        kind: Pod
        metadata:
          name: training-pod-1
          labels:
            accelerator.sched.io/gpu-model: "NVIDIA-A100-SXM4-80GB"
        spec:
          containers:
          - name: cuda-container
            image: nvidia/cuda:11.4.2-base-ubuntu20.04
            command: ["sleep", "3600"]
            resources:
              limits:
                nvidia.com/gpu: 1
  • Apply and Observe:
    bash
        kubectl apply -f pod-a100.yaml

    Now, check the extender's logs:

    bash
    kubectl logs -n kube-system -l app=gpu-scheduler-extender -f

    You should see output similar to this:

    text
        Pod default/training-pod-1 requests GPU model: NVIDIA-A100-SXM4-80GB
        Filtering complete. Passed nodes: 2, Failed nodes: 1
        Prioritizing nodes for GPU pod default/training-pod-1
        Node gpu-scheduler-demo-control-plane gets high priority (score 10) for NUMA affinity
        Node gpu-scheduler-demo-worker gets low priority (score 1) due to lack of NUMA affinity

    Finally, verify the pod's placement:

    bash
    kubectl describe pod training-pod-1

    The output should show that the pod was scheduled on gpu-scheduler-demo-control-plane, the node that passed our filter and received the highest priority score.

    Advanced Edge Cases and Production Considerations

    A simple PoC is one thing; a production-grade system is another. Here's what senior engineers must consider:

    1. Extender Availability and Performance

    Our extender is now a critical component in the scheduling pipeline. Its failure or slowness directly impacts cluster operations.

  • High Availability: Always run at least two replicas of your extender deployment, spread across failure domains (e.g., different nodes or availability zones) using pod anti-affinity.
  • Performance: Every pod scheduling decision incurs at least two HTTP round trips to your extender. This latency adds up. To mitigate this, you must cache node information within the extender. Instead of relying on the NodeList sent with each request, the extender should use the client-go library to create an Informer. An informer maintains an up-to-date, in-memory cache of cluster objects (like Nodes and Pods), which is orders of magnitude faster than querying the API server on every call.
    Example informer setup snippet:

    go
        // In your main function, before starting the HTTP server.
        // Requires k8s.io/client-go/rest, k8s.io/client-go/kubernetes,
        // k8s.io/client-go/informers and k8s.io/client-go/tools/cache.
        config, err := rest.InClusterConfig()
        if err != nil {
            log.Fatalf("Failed to load in-cluster config: %v", err)
        }
        clientset, err := kubernetes.NewForConfig(config)
        if err != nil {
            log.Fatalf("Failed to create clientset: %v", err)
        }
        factory := informers.NewSharedInformerFactory(clientset, 0)
        nodeInformer := factory.Core().V1().Nodes().Informer()
        nodeLister := factory.Core().V1().Nodes().Lister()

        stopCh := make(chan struct{})
        defer close(stopCh)
        factory.Start(stopCh)
        if !cache.WaitForCacheSync(stopCh, nodeInformer.HasSynced) {
            log.Fatalf("Failed to sync node cache")
        }
        // Now, your handlers can use nodeLister.Get(nodeName) for fast, local lookups.

    2. Security

    The communication between kube-scheduler and the extender should be secured.

  • TLS: Set enableHTTPS: true in your scheduler config and have your extender service serve traffic over TLS; the scheduler also needs to trust the extender's CA, supplied through the extender entry's tlsConfig. Use a tool like cert-manager to automatically provision and rotate certificates for your extender's Service.
  • RBAC: The extender's ServiceAccount needs permissions to get, list, and watch nodes and pods if it's using an informer. Create a ClusterRole and ClusterRoleBinding that grant only these specific, minimal permissions (a sketch follows this list).
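
    A minimal RBAC sketch for the informer-based variant (the names are illustrative; the Deployment would also need a matching serviceAccountName):

    yaml
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: gpu-scheduler-extender
      namespace: kube-system
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: gpu-scheduler-extender
    rules:
    - apiGroups: [""]
      resources: ["nodes", "pods"]
      verbs: ["get", "list", "watch"]
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: gpu-scheduler-extender
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: gpu-scheduler-extender
    subjects:
    - kind: ServiceAccount
      name: gpu-scheduler-extender
      namespace: kube-system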

    3. Advanced Topology-Aware Gang Scheduling

    Our current model prioritizes individual pods. For distributed training, we need to co-locate a group of pods (a gang) on interconnected nodes.

    This requires the extender to be stateful. When the first pod of a job (e.g., identified by a training-job-id label) is scheduled, the extender must record its placement. When subsequent pods from the same job arrive, the extender's Prioritize logic will consult this state.

  • State Storage:

    - In-Memory (Simple): A sync.Map in your Go application can store job-id -> node-name mappings. This is fast but not durable across extender restarts.

    - CRD (Production-Ready): Define a Custom Resource Definition, e.g., TrainingJobPlacement. Your extender would create or update a CR instance for each job, storing the placement decisions. This is durable, observable via kubectl, and the canonical way to store custom state in Kubernetes.

    The Prioritize logic would then look like this:

    1. Extract training-job-id from the incoming pod's labels.

    2. Query the API server for other pods with the same label.

    3. Find the nodes where those pods are running.

    4. For each candidate node, check its labels for interconnects (e.g., nvlink-fabric-id: fabric-1).

    5. Give the highest score to nodes that are part of the same high-speed fabric as the already-placed pods.
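
    A minimal sketch of the in-memory variant of this flow (jobPlacements, recordPlacement, and gangScore are illustrative names, not part of the handler code above; a production version would persist the state in the CRD described earlier and update it when pods are actually bound):

    go
    // Gang-aware scoring sketch; lives alongside handler.go in package main and
    // requires "sync" in the import block.
    var jobPlacements sync.Map // training-job-id -> NVLink fabric ID of already-placed pods

    const (
    	jobIDLabel  = "training-job-id"
    	fabricLabel = "nvlink-fabric-id"
    )

    // recordPlacement would be called from a pod-binding watch (not shown) once a
    // member of the gang has actually landed on a node.
    func recordPlacement(jobID string, node *v1.Node) {
    	jobPlacements.Store(jobID, node.Labels[fabricLabel])
    }

    // gangScore biases later members of a job toward nodes that share a high-speed
    // fabric with members that are already running; prioritizeNodes could add this
    // value into its per-node score.
    func gangScore(pod *v1.Pod, node v1.Node) int {
    	jobID, ok := pod.Labels[jobIDLabel]
    	if !ok {
    		return 0 // not part of a gang; no opinion
    	}
    	fabric, found := jobPlacements.Load(jobID)
    	if !found {
    		return 0 // first member of the gang; nothing to co-locate with yet
    	}
    	if f, _ := fabric.(string); f != "" && f == node.Labels[fabricLabel] {
    		return 10 // same fabric as the already-placed pods
    	}
    	return 1
    }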

    Conclusion: Unlocking True Hardware Potential

    By implementing a scheduler extender, we've transformed Kubernetes from a generic resource orchestrator into a topology-aware, high-performance scheduler tailored for our specific ML workloads. We've moved beyond simple resource requests to make intelligent placement decisions based on GPU models, NUMA locality, and the potential for high-speed interconnects.

    This pattern is not limited to GPUs. The same architecture can be used to manage scheduling for FPGAs, specialized storage, software license availability, or even datacenter-level concerns like power consumption and cooling zones. The scheduler extender is a powerful tool that allows platform and MLOps engineers to bridge the gap between application performance requirements and the underlying hardware reality, unlocking the full potential of their infrastructure without forking or replacing the core of Kubernetes.
