Custom Kubernetes Schedulers for GPU-Intensive ML Workloads

Goh Ling Yong

The Inadequacy of the Default Scheduler for GPU Workloads

The kube-scheduler is designed for versatility, balancing resource requests (cpu, memory) across a cluster. However, when scheduling GPU-intensive machine learning workloads, especially distributed training jobs, this generalist approach reveals critical gaps. Senior engineers managing MLOps platforms quickly discover that the default scheduler is blind to the nuanced requirements of high-performance computing.

Key limitations include:

  • GPU Model Agnosticism: The scheduler sees a GPU request (nvidia.com/gpu: 4) as a fungible commodity. It cannot differentiate between an NVIDIA A100 and a V100 on its own. A training job compiled with CUDA features specific to the Ampere architecture may be scheduled on a node with older Volta GPUs, causing runtime failures. While nodeSelector or nodeAffinity can enforce this, it's a static, manual constraint that doesn't integrate into a dynamic scoring and balancing model.
  • Lack of Topology Awareness: For multi-GPU training jobs using frameworks like Horovod or PyTorch DistributedDataParallel, inter-GPU communication bandwidth is a primary performance bottleneck. The default scheduler has no concept of NVLink or NVSwitch topology. It might place a 4-GPU pod across two separate NUMA nodes or PCI-e buses, forcing communication over the much slower QPI/UPI interconnect instead of high-speed NVLink. This can degrade training performance by 20-40% or more, silently eroding the value of expensive hardware.
  • No Concept of VRAM Fragmentation: The scheduler only checks the count of available GPUs, not their state. A node might have 4 available GPUs, but each could have 90% of its VRAM occupied by other processes. A new pod requesting a large model might fail to initialize, even though it was successfully scheduled. A smarter scheduler could score nodes based on available VRAM or GPU utilization.
  • Inefficient Bin-Packing: The default scheduler's scoring logic might spread pods across many nodes to balance load (LeastAllocated strategy), which is often undesirable for GPU clusters. For ML workloads, you often want to pack pods tightly onto as few nodes as possible (MostAllocated strategy) to maximize locality and potentially scale down unused nodes to save costs. A custom scheduler allows you to implement this logic precisely.
These limitations are not bugs; they are consequences of a design optimized for general-purpose, stateless applications. To truly optimize GPU clusters, we need to inject domain-specific knowledge directly into the scheduling logic. This is precisely what the Kubernetes scheduler framework allows.


    Architecture of a Custom Scheduler Plugin

    Instead of forking and modifying kube-scheduler, the modern approach is to build plugins that hook into its well-defined Scheduler Framework. This framework provides extension points that allow you to implement custom logic without rewriting the entire scheduler.

    For our GPU-aware scheduler, we will focus on two primary extension points:

  • Filter (FilterPlugin): This is a predicate function. For a given pod, the scheduler iterates through all nodes and runs the Filter plugin. If the plugin returns Success, the node is a feasible candidate. If it returns Unschedulable, the node is immediately rejected for this pod. We will use this to ensure a node has the correct type and number of GPUs.
  • Score (ScorePlugin): After filtering, all feasible nodes are passed to the Score plugin. This plugin assigns an integer score (typically 0-100) to each node, where a higher score is better. The scheduler then selects the node with the highest total score from all scoring plugins. We will use this to implement our NVLink topology-aware logic.
    The overall process for a single pod is:

    Pending Pod -> PreFilter -> Filter (all nodes, in parallel) -> PostFilter (only if no node passed Filter) -> PreScore -> Score (all feasible nodes, in parallel) -> Reserve -> Bind

    We will implement a scheduler that performs the following logic:

    * Filter Phase:

    * Check if the pod requests GPUs (nvidia.com/gpu).

    * Check if the node has a label specifying its GPU model (e.g., gpu-model=nvidia-a100).

    * Ensure the pod's requested GPU model (via an annotation like gpu-workload-model: nvidia-a100) matches the node's label.

    * Reject the node if there's a mismatch or if the required GPUs are not available.

    * Score Phase:

    * For multi-GPU pods, check for a node annotation that describes its NVLink topology (e.g., nvlink-topology: '0-1-2-3|4-5-6-7', where GPUs joined by "-" share an NVLink group and "|" separates groups).

    * Assign a high score to nodes where the requested number of GPUs can be placed on a single, fully-connected NVLink bridge.

    * Assign a lower score to nodes where GPUs are split across buses.

    * Assign the lowest score to nodes without topology information.

    Let's build this in Go.


    Core Implementation in Go

    We'll use the official k8s.io/kubernetes and k8s.io/client-go libraries. Ensure your Go environment is set up.

    Project Setup:

    bash
    go mod init custom-gpu-scheduler
    # Pin these to the Kubernetes minor version you target (e.g. v1.28.x / v0.28.x).
    # Note: importing k8s.io/kubernetes as a library also requires replace
    # directives in go.mod for its staging modules (k8s.io/api, k8s.io/client-go, ...).
    go get k8s.io/component-base@<version>
    go get k8s.io/client-go@<version>
    go get k8s.io/kubernetes@<version>

    main.go: This file will register our custom plugin and start the scheduler component.

    go
    // main.go
    package main
    
    import (
    	"os"
    
    	"k8s.io/component-base/logs"
    	"k8s.io/kubernetes/cmd/kube-scheduler/app"
    
    	"custom-gpu-scheduler/pkg/scheduler"
    )
    
    func main() {
    	logs.InitLogs()
    	defer logs.FlushLogs()
    
    	// Register our custom plugin with the scheduler framework.
    	// The command constructor from kube-scheduler/app will build a scheduler
    	// that includes our plugin.
    	command := app.NewSchedulerCommand(
    		app.WithPlugin(scheduler.Name, scheduler.New),
    	)
    
    	if err := command.Execute(); err != nil {
    		os.Exit(1)
    	}
    }

    pkg/scheduler/scheduler.go: This is where our core logic resides.

    go
    // pkg/scheduler/scheduler.go
    package scheduler
    
    import (
    	"context"
    	"fmt"
    	"strconv"
    	"strings"
    
    	"github.com/go-logr/logr"
    	v1 "k8s.io/api/core/v1"
    	"k8s.io/apimachinery/pkg/runtime"
    	"k8s.io/klog/v2"
    	"k8s.io/kubernetes/pkg/scheduler/framework"
    )
    
    const (
    	Name = "GpuTopologyScheduler"
    
    	// Annotations used on Pods and Nodes
    	PodGpuModelAnnotation    = "gpu-workload-model"
    	NodeGpuModelLabel        = "gpu-model"
    	NodeGpuTopologyAnnotation = "nvlink-topology"
    
    	// Resource name
    	NvidiaGpuResource = "nvidia.com/gpu"
    )
    
    // GpuTopologyScheduler is a scheduler plugin that's aware of GPU models and NVLink topology.
    type GpuTopologyScheduler struct {
    	handle framework.Handle
    	log    logr.Logger
    }
    
    // Ensure GpuTopologyScheduler implements the necessary interfaces.
    var _ framework.FilterPlugin = &GpuTopologyScheduler{}
    var _ framework.ScorePlugin = &GpuTopologyScheduler{}
    
    // New initializes a new plugin and returns it.
    func New(_ runtime.Object, h framework.Handle) (framework.Plugin, error) {
    	return &GpuTopologyScheduler{
    		handle: h,
    		log:    klog.FromContext(context.Background()),
    	}, nil
    }
    
    // Name returns the name of the plugin.
    func (s *GpuTopologyScheduler) Name() string {
    	return Name
    }
    
    // Filter plugin implementation
    func (s *GpuTopologyScheduler) Filter(ctx context.Context, _ *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
    	log := s.log.WithValues("pod", klog.KObj(pod), "node", klog.KObj(nodeInfo.Node()))
    
    	// Step 1: Check if the Pod requests GPUs. If not, this plugin has no opinion.
    	requestedGpus := getGpuRequest(pod)
    	if requestedGpus == 0 {
    		log.V(4).Info("Pod does not request GPUs, skipping filter")
    		// Filter plugins may not return Skip; a nil status is treated as Success.
    		return nil
    	}
    
    	// Step 2: Check that the Node has GPUs at all. Verifying that enough GPUs
    	// are actually free for this pod is left to the default NodeResourcesFit plugin.
    	node := nodeInfo.Node()
    	if node.Status.Allocatable.Name(NvidiaGpuResource, "").Value() == 0 {
    		log.V(2).Info("Node has no allocatable GPUs")
    		return framework.NewStatus(framework.Unschedulable, "node has no GPUs")
    	}
    
    	// Step 3: Enforce GPU model affinity.
    	podGpuModel, ok := pod.Annotations[PodGpuModelAnnotation]
    	if !ok {
    		// If pod doesn't specify a model, we don't apply a model-based filter.
    		// A production scheduler might reject such pods.
    		log.V(4).Info("Pod does not specify a GPU model annotation, skipping model check")
    		return framework.NewStatus(framework.Success)
    	}
    
    	nodeGpuModel, ok := node.Labels[NodeGpuModelLabel]
    	if !ok {
    		log.V(2).Info("Node does not have GPU model label", "label", NodeGpuModelLabel)
    		return framework.NewStatus(framework.Unschedulable, fmt.Sprintf("node missing label %s", NodeGpuModelLabel))
    	}
    
    	if podGpuModel != nodeGpuModel {
    		log.V(2).Info("GPU model mismatch", "pod_model", podGpuModel, "node_model", nodeGpuModel)
    		return framework.NewStatus(framework.Unschedulable, fmt.Sprintf("pod requires GPU model %s, but node has %s", podGpuModel, nodeGpuModel))
    	}
    
    	log.V(4).Info("Pod and Node GPU models match, filter successful")
    	return framework.NewStatus(framework.Success)
    }
    
    // Score plugin implementation
    func (s *GpuTopologyScheduler) Score(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) (int64, *framework.Status) {
    	nodeInfo, err := s.handle.SnapshotSharedLister().NodeInfos().Get(nodeName)
    	if err != nil {
    		return 0, framework.AsStatus(fmt.Errorf("getting node info for %s: %w", nodeName, err))
    	}
    	node := nodeInfo.Node()
    	log := s.log.WithValues("pod", klog.KObj(pod), "node", klog.KObj(node))
    
    	requestedGpus := getGpuRequest(pod)
    	if requestedGpus <= 1 {
    		// For single-GPU pods, topology doesn't matter. Give a neutral score.
    		return 50, framework.NewStatus(framework.Success)
    	}
    
    	topology, ok := node.Annotations[NodeGpuTopologyAnnotation]
    	if !ok {
    		log.V(3).Info("Node is missing topology annotation, giving low score")
    		return 10, framework.NewStatus(framework.Success) // Low score for nodes without topology info
    	}
    
    	// Example topology: "0-1-2-3|4-5-6-7". GPUs joined by "-" share a fully
    	// connected NVLink group; "|" separates groups. A production implementation
    	// would validate and parse this more defensively; here we only find the
    	// largest group and compare its size to the request.
    	maxGroupSize := 0
    	groups := strings.Split(topology, "|")
    	for _, group := range groups {
    		linkedGpus := strings.Split(group, "-")
    		if len(linkedGpus) > maxGroupSize {
    			maxGroupSize = len(linkedGpus)
    		}
    	}
    
    	if int64(maxGroupSize) >= requestedGpus {
    		log.V(2).Info("Node has a suitable NVLink group for the pod", "requested_gpus", requestedGpus, "max_group_size", maxGroupSize)
    		return 100, framework.NewStatus(framework.Success) // Highest score for perfect topology match
    	}
    
    	log.V(2).Info("Node does not have a large enough NVLink group", "requested_gpus", requestedGpus, "max_group_size", maxGroupSize)
    	return 20, framework.NewStatus(framework.Success) // Higher than no-info, but lower than perfect match
    }
    
    // ScoreExtensions returns a ScoreExtensions interface if the plugin implements one, or nil.
    func (s *GpuTopologyScheduler) ScoreExtensions() framework.ScoreExtensions {
    	return nil // We don't need normalization
    }
    
    // getGpuRequest sums GPU limits across the Pod's containers. For extended
    // resources like nvidia.com/gpu, requests must equal limits, so limits suffice.
    func getGpuRequest(pod *v1.Pod) int64 {
    	var count int64
    	for _, container := range pod.Spec.Containers {
    		if val, ok := container.Resources.Limits[NvidiaGpuResource]; ok {
    			count += val.Value()
    		}
    	}
    	return count
    }

    This code provides the fundamental building blocks. The Filter method performs a hard rejection based on GPU model compatibility. The Score method implements our business logic: it gives the highest score to nodes that can contain the entire multi-GPU pod within a single NVLink group, promoting optimal performance.
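
    As the comment in Score admits, the inline string handling is deliberately naive. One way to harden it is to pull the parsing into a small, independently testable helper. The sketch below is a hypothetical pkg/scheduler/topology.go (not part of the code above) that follows the same '0-1-2-3|4-5-6-7' convention and simply ignores malformed groups:

    go
    // pkg/scheduler/topology.go (hypothetical helper for the plugin above)
    package scheduler
    
    import (
    	"strconv"
    	"strings"
    )
    
    // maxNVLinkGroupSize returns the size of the largest fully connected NVLink
    // group encoded in an annotation such as "0-1-2-3|4-5-6-7". Groups are
    // separated by "|" and GPU indices within a group by "-". Malformed groups
    // are ignored rather than counted.
    func maxNVLinkGroupSize(topology string) int {
    	largest := 0
    	for _, group := range strings.Split(topology, "|") {
    		ids := strings.Split(strings.TrimSpace(group), "-")
    		valid := true
    		for _, id := range ids {
    			if _, err := strconv.Atoi(strings.TrimSpace(id)); err != nil {
    				valid = false
    				break
    			}
    		}
    		if valid && len(ids) > largest {
    			largest = len(ids)
    		}
    	}
    	return largest
    }

    Score could then call maxNVLinkGroupSize(topology) instead of splitting strings inline, and a table-driven unit test over a handful of annotation strings keeps regressions out of the scoring path.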


    Deployment and Configuration

    Now, let's get this scheduler running in a cluster.

    1. Containerize the Scheduler

    Create a Dockerfile:

    dockerfile
    # Use a distroless base image for a smaller, more secure final image.
    FROM gcr.io/distroless/static:nonroot
    
    WORKDIR /
    COPY custom-gpu-scheduler .
    
    USER nonroot:nonroot
    
    ENTRYPOINT ["/custom-gpu-scheduler"]

    Build and push the image:

    bash
    # Build the static Go binary
    CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -o custom-gpu-scheduler .
    
    # Build and push the Docker image
    docker build -t your-repo/custom-gpu-scheduler:v0.1.0 .
    docker push your-repo/custom-gpu-scheduler:v0.1.0

    2. Create the Scheduler Configuration

    This KubeSchedulerConfiguration file tells the scheduler binary to run a profile named gpu-topology-scheduler with our custom plugin active in the filter and score phases. Plugins listed under enabled are appended to the defaults for that phase; default plugins stay active unless explicitly disabled.

    scheduler-config.yaml:

    yaml
    apiVersion: kubescheduler.config.k8s.io/v1
    kind: KubeSchedulerConfiguration
    leaderElection:
      leaderElect: true
    # clientConnection.kubeconfig is omitted here: running in-cluster, the
    # scheduler falls back to the ServiceAccount credentials mounted into the pod.
    profiles:
      - schedulerName: gpu-topology-scheduler
        plugins:
          # Our custom plugin is enabled in the phases it implements. Entries under
          # "enabled" are appended to the defaults; default plugins remain active
          # unless explicitly listed under "disabled".
          filter:
            enabled:
              - name: "GpuTopologyScheduler"
          score:
            enabled:
              - name: "GpuTopologyScheduler"
          # These defaults are normally enabled already; listing them keeps the
          # profile explicit about what runs in the other phases.
          queueSort:
            enabled:
              - name: "PrioritySort"
          preFilter:
            enabled:
              - name: "NodeResourcesFit"
          bind:
            enabled:
              - name: "DefaultBinder"
        # pluginConfig passes per-plugin arguments; empty args fall back to each
        # plugin's defaults. Our plugin takes no arguments in this example.
        pluginConfig:
          - name: "DefaultPreemption"
            args: {}
          - name: "InterPodAffinity"
            args: {}
          - name: "NodeAffinity"
            args: {}
          # Configure our plugin if needed (not in this example)
          - name: "GpuTopologyScheduler"
            args: {}

    3. Deploy the Scheduler to Kubernetes

    We need a Deployment, a ServiceAccount, and RBAC so the scheduler can read Pods and Nodes, create Bindings, and manage its leader-election Lease. The ClusterRole below is a minimal illustration; in practice it is often simpler to bind the ServiceAccount to the built-in system:kube-scheduler and system:volume-scheduler ClusterRoles, which already grant everything the default plugins need (events, CSI objects, and so on).

    scheduler-deployment.yaml:

    yaml
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: gpu-topology-scheduler
      namespace: kube-system
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: gpu-topology-scheduler-role
    rules:
      - apiGroups: [""]
        resources: ["nodes", "pods", "pods/status", "replicationcontrollers", "services"]
        verbs: ["get", "list", "watch"]
      - apiGroups: ["apps", "extensions"]
        resources: ["replicasets", "statefulsets"]
        verbs: ["get", "list", "watch"]
      - apiGroups: ["policy"]
        resources: ["poddisruptionbudgets"]
        verbs: ["get", "list", "watch"]
      - apiGroups: [""]
        resources: ["bindings", "pods/binding"]
        verbs: ["create"]
      - apiGroups: ["coordination.k8s.io"]
        resources: ["leases"]
        verbs: ["create", "get", "update"]
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: gpu-topology-scheduler-binding
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: gpu-topology-scheduler-role
    subjects:
      - kind: ServiceAccount
        name: gpu-topology-scheduler
        namespace: kube-system
    ---
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: scheduler-config
      namespace: kube-system
    data:
      scheduler-config.yaml: | # Paste the content of scheduler-config.yaml here
        apiVersion: kubescheduler.config.k8s.io/v1
        kind: KubeSchedulerConfiguration
        # ... (rest of config)
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: gpu-topology-scheduler
      namespace: kube-system
      labels:
        app: gpu-topology-scheduler
    spec:
      replicas: 2 # Run with HA
      selector:
        matchLabels:
          app: gpu-topology-scheduler
      template:
        metadata:
          labels:
            app: gpu-topology-scheduler
        spec:
          serviceAccountName: gpu-topology-scheduler
          containers:
            - name: scheduler
              image: your-repo/custom-gpu-scheduler:v0.1.0
              args:
                # The binary is already the image ENTRYPOINT; only flags are needed here.
                - "--config=/etc/kubernetes/scheduler-config.yaml"
                - "--v=3"
              volumeMounts:
                - name: scheduler-config-volume
                  mountPath: /etc/kubernetes
          volumes:
            - name: scheduler-config-volume
              configMap:
                name: scheduler-config

    Apply this manifest: kubectl apply -f scheduler-deployment.yaml.


    Using the Custom Scheduler

    To use the scheduler, a Pod simply needs to specify its schedulerName.

    First, let's prepare our nodes. In production this labeling would typically be done by a DaemonSet or an operator (a sketch of such an agent follows the commands below); here we do it by hand.

    bash
    # Node 1: 8x A100s, with two 4-GPU NVLink groups
    kubectl label node node-1 gpu-model=nvidia-a100
    kubectl annotate node node-1 nvlink-topology='0-1-2-3|4-5-6-7'
    
    # Node 2: 8x V100s, with no topology info provided
    kubectl label node node-2 gpu-model=nvidia-v100
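
    A minimal, hypothetical sketch of the client-go side of such a node-labeling agent (run as a DaemonSet, with NODE_NAME injected via the downward API); the actual NVLink detection is stubbed out:

    go
    // cmd/topology-labeler/main.go (hypothetical node-labeling agent)
    package main
    
    import (
    	"context"
    	"encoding/json"
    	"fmt"
    	"os"
    
    	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    	"k8s.io/apimachinery/pkg/types"
    	"k8s.io/client-go/kubernetes"
    	"k8s.io/client-go/rest"
    )
    
    // detectGpuTopology would inspect the local GPUs (for example via NVML or by
    // parsing `nvidia-smi topo -m`) and encode the result in the "0-1-2-3|4-5-6-7"
    // format the scheduler expects. Hard-coded here as a placeholder.
    func detectGpuTopology() (model, topology string) {
    	return "nvidia-a100", "0-1-2-3|4-5-6-7"
    }
    
    func main() {
    	nodeName := os.Getenv("NODE_NAME") // injected via the downward API
    	cfg, err := rest.InClusterConfig()
    	if err != nil {
    		panic(err)
    	}
    	client := kubernetes.NewForConfigOrDie(cfg)
    
    	model, topology := detectGpuTopology()
    	patch, _ := json.Marshal(map[string]interface{}{
    		"metadata": map[string]interface{}{
    			"labels":      map[string]string{"gpu-model": model},
    			"annotations": map[string]string{"nvlink-topology": topology},
    		},
    	})
    	if _, err := client.CoreV1().Nodes().Patch(context.Background(), nodeName,
    		types.StrategicMergePatchType, patch, metav1.PatchOptions{}); err != nil {
    		panic(err)
    	}
    	fmt.Printf("node %s labeled: gpu-model=%s nvlink-topology=%s\n", nodeName, model, topology)
    }

    Such an agent needs its own RBAC permission to patch Node objects, which the scheduler's ClusterRole above deliberately does not grant.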

    Now, let's create a Pod that requests 4 A100 GPUs for a distributed training job.

    training-pod.yaml:

    yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: distributed-training-job-a100
      annotations:
        gpu-workload-model: "nvidia-a100"
    spec:
      schedulerName: gpu-topology-scheduler # KEY: This tells Kubernetes to use our scheduler
      containers:
        - name: training-container
          image: nvidia/cuda:11.8.0-base-ubuntu22.04
          command: ["sleep", "3600"]
          resources:
            limits:
              nvidia.com/gpu: 4

    When you kubectl apply -f training-pod.yaml, the following will happen:

  • The pod is assigned to gpu-topology-scheduler.
  • Our scheduler's Filter plugin runs:

    * It evaluates node-1: the pod asks for nvidia-a100, and the node has the label gpu-model=nvidia-a100. Pass.

    * It evaluates node-2: the pod asks for nvidia-a100, but the node has the label gpu-model=nvidia-v100. Fail; node-2 is filtered out.

  • Our scheduler's Score plugin runs on the remaining candidates (only node-1):

    * It sees the pod requests 4 GPUs.

    * It reads node-1's annotation nvlink-topology='0-1-2-3|4-5-6-7'.

    * It finds a group of size 4 that can fit the request.

    * It assigns node-1 a score of 100.

  • node-1 is the highest-scoring node, and the pod is bound to it.

    If we submitted a similar pod asking for V100s, it would be correctly placed on node-2. If we submitted a 5-GPU A100 pod, it would still land on node-1, but with a lower score (20) because the request does not fit within a single NVLink group, signaling a potential performance compromise.


    Advanced Considerations and Edge Cases

    Building a custom scheduler for production requires more than just the core logic. Here are critical factors senior engineers must address.

    Scheduling Latency and Throughput

    Every line of code in your Filter and Score plugins adds latency to pod scheduling, and both run once per node per pod. If your logic involves complex computations or external API calls (an anti-pattern!), you can significantly slow down the scheduler. For reference, the upstream scheduler's scalability target is on the order of 100 pods per second in large clusters, a budget that heavyweight plugins can quickly erode.

    Mitigation:

    * Performance Profiling: Use Go's built-in pprof tooling. The scheduler binary can expose a pprof endpoint. Analyze CPU and memory profiles under load to identify bottlenecks in your plugins.

    * Caching: For data that doesn't change often (like node topology), use the framework.Handle to access the scheduler's shared snapshot lister. This is an in-memory cache of cluster state, far faster than querying the API server directly.

    * Pre-computation: If scoring is complex, consider if some parts can be pre-computed and stored in the framework.CycleState. This is a key-value store that persists for the duration of a single pod's scheduling attempt, allowing you to pass data from PreFilter to Filter to Score without recalculating.
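
    To make the pre-computation pattern concrete, here is a minimal sketch for the plugin above, assuming a recent framework version in which PreFilter returns (*PreFilterResult, *Status). It computes the pod's GPU request once per scheduling cycle and stashes it in CycleState; the saving here is trivial, but the same pattern carries arbitrarily expensive per-pod work from PreFilter to Filter and Score:

    go
    // Additions to pkg/scheduler/scheduler.go (sketch). The plugin would also
    // need to be enabled under "preFilter" in the profile configuration.
    
    const gpuRequestStateKey = framework.StateKey(Name + "/gpuRequest")
    
    // gpuRequestState caches the pod's GPU request for the current scheduling cycle.
    type gpuRequestState struct {
    	count int64
    }
    
    // Clone implements framework.StateData.
    func (g *gpuRequestState) Clone() framework.StateData {
    	return &gpuRequestState{count: g.count}
    }
    
    var _ framework.PreFilterPlugin = &GpuTopologyScheduler{}
    
    // PreFilter runs once per pod per cycle and stores the GPU request in CycleState.
    func (s *GpuTopologyScheduler) PreFilter(ctx context.Context, state *framework.CycleState, pod *v1.Pod) (*framework.PreFilterResult, *framework.Status) {
    	state.Write(gpuRequestStateKey, &gpuRequestState{count: getGpuRequest(pod)})
    	return nil, framework.NewStatus(framework.Success)
    }
    
    // PreFilterExtensions is required by the PreFilterPlugin interface; we have none.
    func (s *GpuTopologyScheduler) PreFilterExtensions() framework.PreFilterExtensions {
    	return nil
    }
    
    // readGpuRequest is what Filter and Score would call instead of getGpuRequest.
    func readGpuRequest(state *framework.CycleState, pod *v1.Pod) int64 {
    	if data, err := state.Read(gpuRequestStateKey); err == nil {
    		if cached, ok := data.(*gpuRequestState); ok {
    			return cached.count
    		}
    	}
    	return getGpuRequest(pod) // fall back to recomputing if the state is missing
    }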

    High Availability (HA)

    A single scheduler instance is a single point of failure. If it crashes, no new pods (using that scheduler) can be scheduled.

    Solution:

    Run multiple replicas of your scheduler deployment (as shown in our Deployment manifest with replicas: 2). Leader election (leaderElection.leaderElect: true in the KubeSchedulerConfiguration) is crucial: it uses a Lease object to ensure that only one replica actively schedules at any given time. If the leader fails, another replica acquires the lease and takes over, providing near-seamless failover.

    Preemption

    What happens if a high-priority ML training job arrives but there are no available resources, while low-priority batch processing pods are occupying the needed GPUs? The job will remain pending.

    Solution:

    Your scheduler needs preemption logic. This is a complex process in which the scheduler identifies lower-priority pods that can be evicted to make room for the pending high-priority pod. The simplest option is to rely on the DefaultPreemption plugin (a PostFilter plugin, enabled by default) in your profile. For truly custom behavior, you would implement your own PostFilter plugin, the extension point where preemption runs, which significantly increases the complexity of your scheduler.

    Race Conditions and State Staleness

    The scheduler operates on a snapshot of the cluster's state. It's possible for a node's state to change after the Score phase but before the Bind phase. For example, another pod lands on your chosen node, consuming the GPUs you thought were free.

    Solution:

    The Bind phase is the final step. The DefaultBinder plugin attempts to create the Binding object, and if the node no longer has the resources, the Kubelet's admission check rejects the pod (it is typically marked Failed with an OutOf<resource> reason); its controller then creates a replacement that goes through scheduling again. Your scheduler should be stateless and idempotent: it should make the same correct decision given the same cluster state, regardless of previous failed attempts.

    Conclusion: The Trade-off of Customization

    Building a custom Kubernetes scheduler is a powerful but non-trivial undertaking. It's not a solution for every problem. You should only consider it when you have high-value, specialized workloads where the performance and efficiency gains from bespoke scheduling logic justify the development and maintenance overhead.

    For GPU-intensive ML platforms, the calculus often leans in favor of custom schedulers. The ability to perform topology-aware scheduling, enforce fine-grained resource constraints, and implement custom bin-packing strategies can translate directly into faster training times, higher hardware utilization, and significant cost savings. By leveraging the scheduler framework, you can inject this critical domain knowledge into your cluster, transforming it from a general-purpose compute grid into a highly optimized, high-performance computing environment.
