Custom Kubernetes Schedulers for GPU-Intensive ML Workloads

Goh Ling Yong

The Inadequacy of the Default Scheduler for GPU Workloads

The kube-scheduler is designed for versatility, balancing resource requests (cpu, memory) across a cluster. However, when scheduling GPU-intensive machine learning workloads, especially distributed training jobs, this generalist approach reveals critical gaps. Senior engineers managing MLOps platforms quickly discover that the default scheduler is blind to the nuanced requirements of high-performance computing.

Key limitations include:

  • GPU Model Agnosticism: The scheduler sees a GPU request (nvidia.com/gpu: 4) as a fungible commodity. It cannot differentiate between an NVIDIA A100 and a V100 on its own. A training job compiled with CUDA features specific to the Ampere architecture may be scheduled on a node with older Volta GPUs, causing runtime failures. While nodeSelector or nodeAffinity can enforce this, it's a static, manual constraint that doesn't integrate into a dynamic scoring and balancing model.
  • Lack of Topology Awareness: For multi-GPU training jobs using frameworks like Horovod or PyTorch DistributedDataParallel, inter-GPU communication bandwidth is a primary performance bottleneck. The default scheduler has no concept of NVLink or NVSwitch topology. It might place a 4-GPU pod across two separate NUMA nodes or PCI-e buses, forcing communication over the much slower QPI/UPI interconnect instead of high-speed NVLink. This can degrade training performance by 20-40% or more, silently eroding the value of expensive hardware.
  • No Concept of VRAM Fragmentation: The scheduler only checks the count of available GPUs, not their state. A node might have 4 available GPUs, but each could have 90% of its VRAM occupied by other processes. A new pod requesting a large model might fail to initialize, even though it was successfully scheduled. A smarter scheduler could score nodes based on available VRAM or GPU utilization.
  • Inefficient Bin-Packing: The default scheduler's scoring logic might spread pods across many nodes to balance load (LeastAllocated strategy), which is often undesirable for GPU clusters. For ML workloads, you often want to pack pods tightly onto as few nodes as possible (MostAllocated strategy) to maximize locality and potentially scale down unused nodes to save costs. A custom scheduler allows you to implement this logic precisely.
These limitations are not bugs; they are consequences of a design optimized for general-purpose, stateless applications. To truly optimize GPU clusters, we need to inject domain-specific knowledge directly into the scheduling logic. This is precisely what the Kubernetes scheduler framework allows.


    Architecture of a Custom Scheduler Plugin

    Instead of forking and modifying kube-scheduler, the modern approach is to build plugins that hook into its well-defined Scheduler Framework. This framework provides extension points that allow you to implement custom logic without rewriting the entire scheduler.

    For our GPU-aware scheduler, we will focus on two primary extension points:

  • Filter (FilterPlugin): This is a predicate function. For a given pod, the scheduler iterates through all nodes and runs the Filter plugin. If the plugin returns Success, the node is a feasible candidate. If it returns Unschedulable, the node is immediately rejected for this pod. We will use this to ensure a node has the correct type and number of GPUs.
  • Score (ScorePlugin): After filtering, all feasible nodes are passed to the Score plugin. This plugin assigns an integer score (typically 0-100) to each node, where a higher score is better. The scheduler then selects the node with the highest total score from all scoring plugins. We will use this to implement our NVLink topology-aware logic.
    The overall process for a single pod is:

    Pending Pod -> PreFilter -> Filter (all nodes, in parallel) -> PostFilter (only if no node passed Filter) -> PreScore -> Score (all feasible nodes, in parallel) -> Reserve -> Bind

    We will implement a scheduler that performs the following logic:

    * Filter Phase:

    * Check if the pod requests GPUs (nvidia.com/gpu).

    * Check if the node has a label specifying its GPU model (e.g., gpu-model=nvidia-a100).

    * Ensure the pod's requested GPU model (via an annotation like gpu-workload-model: nvidia-a100) matches the node's label.

    * Reject the node if there's a mismatch or if the required GPUs are not available.

    * Score Phase:

    * For multi-GPU pods, check for a node annotation that describes its NVLink topology (e.g., nvlink-topology: '0-1-2-3|4-5-6-7', where GPUs joined by "-" share an NVLink group and "|" separates groups).

    * Assign a high score to nodes where the requested number of GPUs can be placed on a single, fully-connected NVLink bridge.

    * Assign a lower score to nodes where GPUs are split across buses.

    * Assign the lowest score to nodes without topology information.

    Let's build this in Go.


    Core Implementation in Go

    We'll use the official k8s.io/kubernetes and k8s.io/client-go libraries. Ensure your Go environment is set up.

    Project Setup:

    bash
    go mod init custom-gpu-scheduler
    # Pin these to the Kubernetes minor version you target (e.g. v1.28.x / v0.28.x).
    # Note: importing k8s.io/kubernetes as a library also requires replace
    # directives in go.mod for its staging modules (k8s.io/api, k8s.io/client-go, ...).
    go get k8s.io/component-base@<version>
    go get k8s.io/client-go@<version>
    go get k8s.io/kubernetes@<version>

    main.go: This file will register our custom plugin and start the scheduler component.

    go
    // main.go
    package main
    
    import (
    	"os"
    
    	"k8s.io/component-base/logs"
    	"k8s.io/kubernetes/cmd/kube-scheduler/app"
    
    	"custom-gpu-scheduler/pkg/scheduler"
    )
    
    func main() {
    	logs.InitLogs()
    	defer logs.FlushLogs()
    
    	// Register our custom plugin with the scheduler framework.
    	// The command constructor from kube-scheduler/app will build a scheduler
    	// that includes our plugin.
    	command := app.NewSchedulerCommand(
    		app.WithPlugin(scheduler.Name, scheduler.New),
    	)
    
    	if err := command.Execute(); err != nil {
    		os.Exit(1)
    	}
    }

    pkg/scheduler/scheduler.go: This is where our core logic resides.

    go
    // pkg/scheduler/scheduler.go
    package scheduler
    
    import (
    	"context"
    	"fmt"
    	"strconv"
    	"strings"
    
    	"github.com/go-logr/logr"
    	v1 "k8s.io/api/core/v1"
    	"k8s.io/apimachinery/pkg/runtime"
    	"k8s.io/klog/v2"
    	"k8s.io/kubernetes/pkg/scheduler/framework"
    )
    
    const (
    	Name = "GpuTopologyScheduler"
    
    	// Annotations used on Pods and Nodes
    	PodGpuModelAnnotation    = "gpu-workload-model"
    	NodeGpuModelLabel        = "gpu-model"
    	NodeGpuTopologyAnnotation = "nvlink-topology"
    
    	// Resource name
    	NvidiaGpuResource = "nvidia.com/gpu"
    )
    
    // GpuTopologyScheduler is a scheduler plugin that's aware of GPU models and NVLink topology.
    type GpuTopologyScheduler struct {
    	handle framework.Handle
    	log    logr.Logger
    }
    
    // Ensure GpuTopologyScheduler implements the necessary interfaces.
    var _ framework.FilterPlugin = &GpuTopologyScheduler{}
    var _ framework.ScorePlugin = &GpuTopologyScheduler{}
    
    // New initializes a new plugin and returns it.
    func New(_ runtime.Object, h framework.Handle) (framework.Plugin, error) {
    	return &GpuTopologyScheduler{
    		handle: h,
    		log:    klog.FromContext(context.Background()),
    	}, nil
    }
    
    // Name returns the name of the plugin.
    func (s *GpuTopologyScheduler) Name() string {
    	return Name
    }
    
    // Filter plugin implementation
    func (s *GpuTopologyScheduler) Filter(ctx context.Context, _ *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
    	log := s.log.WithValues("pod", klog.KObj(pod), "node", klog.KObj(nodeInfo.Node()))
    
    	// Step 1: Check if the Pod requests GPUs. If not, this plugin has no opinion.
    	requestedGpus := getGpuRequest(pod)
    	if requestedGpus == 0 {
    		log.V(4).Info("Pod does not request GPUs, skipping filter")
    		// Filter plugins may not return Skip; a nil status is treated as Success.
    		return nil
    	}
    
    	// Step 2: Check that the Node has GPUs at all. Verifying that enough GPUs
    	// are actually free for this pod is left to the default NodeResourcesFit plugin.
    	node := nodeInfo.Node()
    	if node.Status.Allocatable.Name(NvidiaGpuResource, "").Value() == 0 {
    		log.V(2).Info("Node has no allocatable GPUs")
    		return framework.NewStatus(framework.Unschedulable, "node has no GPUs")
    	}
    
    	// Step 3: Enforce GPU model affinity.
    	podGpuModel, ok := pod.Annotations[PodGpuModelAnnotation]
    	if !ok {
    		// If pod doesn't specify a model, we don't apply a model-based filter.
    		// A production scheduler might reject such pods.
    		log.V(4).Info("Pod does not specify a GPU model annotation, skipping model check")
    		return framework.NewStatus(framework.Success)
    	}
    
    	nodeGpuModel, ok := node.Labels[NodeGpuModelLabel]
    	if !ok {
    		log.V(2).Info("Node does not have GPU model label", "label", NodeGpuModelLabel)
    		return framework.NewStatus(framework.Unschedulable, fmt.Sprintf("node missing label %s", NodeGpuModelLabel))
    	}
    
    	if podGpuModel != nodeGpuModel {
    		log.V(2).Info("GPU model mismatch", "pod_model", podGpuModel, "node_model", nodeGpuModel)
    		return framework.NewStatus(framework.Unschedulable, fmt.Sprintf("pod requires GPU model %s, but node has %s", podGpuModel, nodeGpuModel))
    	}
    
    	log.V(4).Info("Pod and Node GPU models match, filter successful")
    	return framework.NewStatus(framework.Success)
    }
    
    // Score plugin implementation
    func (s *GpuTopologyScheduler) Score(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) (int64, *framework.Status) {
    	nodeInfo, err := s.handle.SnapshotSharedLister().NodeInfos().Get(nodeName)
    	if err != nil {
    		return 0, framework.AsStatus(fmt.Errorf("getting node info for %s: %w", nodeName, err))
    	}
    	node := nodeInfo.Node()
    	log := s.log.WithValues("pod", klog.KObj(pod), "node", klog.KObj(node))
    
    	requestedGpus := getGpuRequest(pod)
    	if requestedGpus <= 1 {
    		// For single-GPU pods, topology doesn't matter. Give a neutral score.
    		return 50, framework.NewStatus(framework.Success)
    	}
    
    	topology, ok := node.Annotations[NodeGpuTopologyAnnotation]
    	if !ok {
    		log.V(3).Info("Node is missing topology annotation, giving low score")
    		return 10, framework.NewStatus(framework.Success) // Low score for nodes without topology info
    	}
    
    	// Example topology: "0-1-2-3|4-5-6-7". GPUs joined by "-" share a fully
    	// connected NVLink group; "|" separates groups. A production implementation
    	// would validate and parse this more defensively; here we only find the
    	// largest group and compare its size to the request.
    	maxGroupSize := 0
    	groups := strings.Split(topology, "|")
    	for _, group := range groups {
    		linkedGpus := strings.Split(group, "-")
    		if len(linkedGpus) > maxGroupSize {
    			maxGroupSize = len(linkedGpus)
    		}
    	}
    
    	if int64(maxGroupSize) >= requestedGpus {
    		log.V(2).Info("Node has a suitable NVLink group for the pod", "requested_gpus", requestedGpus, "max_group_size", maxGroupSize)
    		return 100, framework.NewStatus(framework.Success) // Highest score for perfect topology match
    	}
    
    	log.V(2).Info("Node does not have a large enough NVLink group", "requested_gpus", requestedGpus, "max_group_size", maxGroupSize)
    	return 20, framework.NewStatus(framework.Success) // Higher than no-info, but lower than perfect match
    }
    
    // ScoreExtensions returns a ScoreExtensions interface if the plugin implements one, or nil.
    func (s *GpuTopologyScheduler) ScoreExtensions() framework.ScoreExtensions {
    	return nil // We don't need normalization
    }
    
    // getGpuRequest sums GPU limits across the Pod's containers. For extended
    // resources like nvidia.com/gpu, requests must equal limits, so limits suffice.
    func getGpuRequest(pod *v1.Pod) int64 {
    	var count int64
    	for _, container := range pod.Spec.Containers {
    		if val, ok := container.Resources.Limits[NvidiaGpuResource]; ok {
    			count += val.Value()
    		}
    	}
    	return count
    }

    This code provides the fundamental building blocks. The Filter method performs a hard rejection based on GPU model compatibility. The Score method implements our business logic: it gives the highest score to nodes that can contain the entire multi-GPU pod within a single NVLink group, promoting optimal performance.
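
    As the comment in Score admits, the inline string handling is deliberately naive. One way to harden it is to pull the parsing into a small, independently testable helper. The sketch below is a hypothetical pkg/scheduler/topology.go (not part of the code above) that follows the same '0-1-2-3|4-5-6-7' convention and simply ignores malformed groups:

    go
    // pkg/scheduler/topology.go (hypothetical helper for the plugin above)
    package scheduler
    
    import (
    	"strconv"
    	"strings"
    )
    
    // maxNVLinkGroupSize returns the size of the largest fully connected NVLink
    // group encoded in an annotation such as "0-1-2-3|4-5-6-7". Groups are
    // separated by "|" and GPU indices within a group by "-". Malformed groups
    // are ignored rather than counted.
    func maxNVLinkGroupSize(topology string) int {
    	largest := 0
    	for _, group := range strings.Split(topology, "|") {
    		ids := strings.Split(strings.TrimSpace(group), "-")
    		valid := true
    		for _, id := range ids {
    			if _, err := strconv.Atoi(strings.TrimSpace(id)); err != nil {
    				valid = false
    				break
    			}
    		}
    		if valid && len(ids) > largest {
    			largest = len(ids)
    		}
    	}
    	return largest
    }

    Score could then call maxNVLinkGroupSize(topology) instead of splitting strings inline, and a table-driven unit test over a handful of annotation strings keeps regressions out of the scoring path.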


    Deployment and Configuration

    Now, let's get this scheduler running in a cluster.

    1. Containerize the Scheduler

    Create a Dockerfile:

    dockerfile
    # Use a distroless base image for a smaller, more secure final image.
    FROM gcr.io/distroless/static:nonroot
    
    WORKDIR /
    COPY custom-gpu-scheduler .
    
    USER nonroot:nonroot
    
    ENTRYPOINT ["/custom-gpu-scheduler"]

    Build and push the image:

    bash
    # Build the static Go binary
    CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -o custom-gpu-scheduler .
    
    # Build and push the Docker image
    docker build -t your-repo/custom-gpu-scheduler:v0.1.0 .
    docker push your-repo/custom-gpu-scheduler:v0.1.0

    2. Create the Scheduler Configuration

    This KubeSchedulerConfiguration file tells the scheduler binary to run a profile named gpu-topology-scheduler with our custom plugin active in the filter and score phases. Plugins listed under enabled are appended to the defaults for that phase; default plugins stay active unless explicitly disabled.

    scheduler-config.yaml:

    yaml
    apiVersion: kubescheduler.config.k8s.io/v1
    kind: KubeSchedulerConfiguration
    leaderElection:
      leaderElect: true
    # clientConnection.kubeconfig is omitted here: running in-cluster, the
    # scheduler falls back to the ServiceAccount credentials mounted into the pod.
    profiles:
      - schedulerName: gpu-topology-scheduler
        plugins:
          # Our custom plugin is enabled in the phases it implements. Entries under
          # "enabled" are appended to the defaults; default plugins remain active
          # unless explicitly listed under "disabled".
          filter:
            enabled:
              - name: "GpuTopologyScheduler"
          score:
            enabled:
              - name: "GpuTopologyScheduler"
          # These defaults are normally enabled already; listing them keeps the
          # profile explicit about what runs in the other phases.
          queueSort:
            enabled:
              - name: "PrioritySort"
          preFilter:
            enabled:
              - name: "NodeResourcesFit"
          bind:
            enabled:
              - name: "DefaultBinder"
        # pluginConfig passes per-plugin arguments; empty args fall back to each
        # plugin's defaults. Our plugin takes no arguments in this example.
        pluginConfig:
          - name: "DefaultPreemption"
            args: {}
          - name: "InterPodAffinity"
            args: {}
          - name: "NodeAffinity"
            args: {}
          # Configure our plugin if needed (not in this example)
          - name: "GpuTopologyScheduler"
            args: {}

    3. Deploy the Scheduler to Kubernetes

    We need a Deployment, a ServiceAccount, and RBAC so the scheduler can read Pods and Nodes, create Bindings, and manage its leader-election Lease. The ClusterRole below is a minimal illustration; in practice it is often simpler to bind the ServiceAccount to the built-in system:kube-scheduler and system:volume-scheduler ClusterRoles, which already grant everything the default plugins need (events, CSI objects, and so on).

    scheduler-deployment.yaml:

    yaml
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: gpu-topology-scheduler
      namespace: kube-system
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: gpu-topology-scheduler-role
    rules:
      - apiGroups: [""]
        resources: ["nodes", "pods", "pods/status", "replicationcontrollers", "services"]
        verbs: ["get", "list", "watch"]
      - apiGroups: ["apps", "extensions"]
        resources: ["replicasets", "statefulsets"]
        verbs: ["get", "list", "watch"]
      - apiGroups: ["policy"]
        resources: ["poddisruptionbudgets"]
        verbs: ["get", "list", "watch"]
      - apiGroups: [""]
        resources: ["bindings", "pods/binding"]
        verbs: ["create"]
      - apiGroups: ["coordination.k8s.io"]
        resources: ["leases"]
        verbs: ["create", "get", "update"]
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: gpu-topology-scheduler-binding
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: gpu-topology-scheduler-role
    subjects:
      - kind: ServiceAccount
        name: gpu-topology-scheduler
        namespace: kube-system
    ---
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: scheduler-config
      namespace: kube-system
    data:
      scheduler-config.yaml: | # Paste the content of scheduler-config.yaml here
        apiVersion: kubescheduler.config.k8s.io/v1
        kind: KubeSchedulerConfiguration
        # ... (rest of config)
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: gpu-topology-scheduler
      namespace: kube-system
      labels:
        app: gpu-topology-scheduler
    spec:
      replicas: 2 # Run with HA
      selector:
        matchLabels:
          app: gpu-topology-scheduler
      template:
        metadata:
          labels:
            app: gpu-topology-scheduler
        spec:
          serviceAccountName: gpu-topology-scheduler
          containers:
            - name: scheduler
              image: your-repo/custom-gpu-scheduler:v0.1.0
              args:
                # The binary is already the image ENTRYPOINT; only flags are needed here.
                - "--config=/etc/kubernetes/scheduler-config.yaml"
                - "--v=3"
              volumeMounts:
                - name: scheduler-config-volume
                  mountPath: /etc/kubernetes
          volumes:
            - name: scheduler-config-volume
              configMap:
                name: scheduler-config

    Apply this manifest: kubectl apply -f scheduler-deployment.yaml.


    Using the Custom Scheduler

    To use the scheduler, a Pod simply needs to specify its schedulerName.

    First, let's prepare our nodes. In production this labeling would typically be done by a DaemonSet or an operator (a sketch of such an agent follows the commands below); here we do it by hand.

    bash
    # Node 1: 8x A100s, with two 4-GPU NVLink groups
    kubectl label node node-1 gpu-model=nvidia-a100
    kubectl annotate node node-1 nvlink-topology='0-1-2-3|4-5-6-7'
    
    # Node 2: 8x V100s, with no topology info provided
    kubectl label node node-2 gpu-model=nvidia-v100
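
    A minimal, hypothetical sketch of the client-go side of such a node-labeling agent (run as a DaemonSet, with NODE_NAME injected via the downward API); the actual NVLink detection is stubbed out:

    go
    // cmd/topology-labeler/main.go (hypothetical node-labeling agent)
    package main
    
    import (
    	"context"
    	"encoding/json"
    	"fmt"
    	"os"
    
    	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    	"k8s.io/apimachinery/pkg/types"
    	"k8s.io/client-go/kubernetes"
    	"k8s.io/client-go/rest"
    )
    
    // detectGpuTopology would inspect the local GPUs (for example via NVML or by
    // parsing `nvidia-smi topo -m`) and encode the result in the "0-1-2-3|4-5-6-7"
    // format the scheduler expects. Hard-coded here as a placeholder.
    func detectGpuTopology() (model, topology string) {
    	return "nvidia-a100", "0-1-2-3|4-5-6-7"
    }
    
    func main() {
    	nodeName := os.Getenv("NODE_NAME") // injected via the downward API
    	cfg, err := rest.InClusterConfig()
    	if err != nil {
    		panic(err)
    	}
    	client := kubernetes.NewForConfigOrDie(cfg)
    
    	model, topology := detectGpuTopology()
    	patch, _ := json.Marshal(map[string]interface{}{
    		"metadata": map[string]interface{}{
    			"labels":      map[string]string{"gpu-model": model},
    			"annotations": map[string]string{"nvlink-topology": topology},
    		},
    	})
    	if _, err := client.CoreV1().Nodes().Patch(context.Background(), nodeName,
    		types.StrategicMergePatchType, patch, metav1.PatchOptions{}); err != nil {
    		panic(err)
    	}
    	fmt.Printf("node %s labeled: gpu-model=%s nvlink-topology=%s\n", nodeName, model, topology)
    }

    Such an agent needs its own RBAC permission to patch Node objects, which the scheduler's ClusterRole above deliberately does not grant.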

    Now, let's create a Pod that requests 4 A100 GPUs for a distributed training job.

    training-pod.yaml:

    yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: distributed-training-job-a100
      annotations:
        gpu-workload-model: "nvidia-a100"
    spec:
      schedulerName: gpu-topology-scheduler # KEY: This tells Kubernetes to use our scheduler
      containers:
        - name: training-container
          image: nvidia/cuda:11.8.0-base-ubuntu22.04
          command: ["sleep", "3600"]
          resources:
            limits:
              nvidia.com/gpu: 4

    When you kubectl apply -f training-pod.yaml, the following will happen:

  • The pod is assigned to gpu-topology-scheduler.
  • Our scheduler's Filter plugin runs:

    * It evaluates node-1: the pod asks for nvidia-a100, and the node has the label gpu-model=nvidia-a100. Pass.

    * It evaluates node-2: the pod asks for nvidia-a100, but the node has the label gpu-model=nvidia-v100. Fail; node-2 is filtered out.

  • Our scheduler's Score plugin runs on the remaining candidates (only node-1):

    * It sees the pod requests 4 GPUs.

    * It reads node-1's annotation nvlink-topology='0-1-2-3|4-5-6-7'.

    * It finds a group of size 4 that can fit the request.

    * It assigns node-1 a score of 100.

  • node-1 is the highest-scoring node, and the pod is bound to it.

    If we submitted a similar pod asking for V100s, it would be correctly placed on node-2. If we submitted a 5-GPU A100 pod, it would still land on node-1, but with a lower score (20) because the request does not fit within a single NVLink group, signaling a potential performance compromise.


    Advanced Considerations and Edge Cases

    Building a custom scheduler for production requires more than just the core logic. Here are critical factors senior engineers must address.

    Scheduling Latency and Throughput

    Every line of code in your Filter and Score plugins adds latency to pod scheduling, and both run once per node per pod. If your logic involves complex computations or external API calls (an anti-pattern!), you can significantly slow down the scheduler. For reference, the upstream scheduler's scalability target is on the order of 100 pods per second in large clusters, a budget that heavyweight plugins can quickly erode.

    Mitigation:

    * Performance Profiling: Use Go's built-in pprof tooling. The scheduler binary can expose a pprof endpoint. Analyze CPU and memory profiles under load to identify bottlenecks in your plugins.

    * Caching: For data that doesn't change often (like node topology), use the framework.Handle to access the scheduler's shared snapshot lister. This is an in-memory cache of cluster state, far faster than querying the API server directly.

    * Pre-computation: If scoring is complex, consider if some parts can be pre-computed and stored in the framework.CycleState. This is a key-value store that persists for the duration of a single pod's scheduling attempt, allowing you to pass data from PreFilter to Filter to Score without recalculating.
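
    To make the pre-computation pattern concrete, here is a minimal sketch for the plugin above, assuming a recent framework version in which PreFilter returns (*PreFilterResult, *Status). It computes the pod's GPU request once per scheduling cycle and stashes it in CycleState; the saving here is trivial, but the same pattern carries arbitrarily expensive per-pod work from PreFilter to Filter and Score:

    go
    // Additions to pkg/scheduler/scheduler.go (sketch). The plugin would also
    // need to be enabled under "preFilter" in the profile configuration.
    
    const gpuRequestStateKey = framework.StateKey(Name + "/gpuRequest")
    
    // gpuRequestState caches the pod's GPU request for the current scheduling cycle.
    type gpuRequestState struct {
    	count int64
    }
    
    // Clone implements framework.StateData.
    func (g *gpuRequestState) Clone() framework.StateData {
    	return &gpuRequestState{count: g.count}
    }
    
    var _ framework.PreFilterPlugin = &GpuTopologyScheduler{}
    
    // PreFilter runs once per pod per cycle and stores the GPU request in CycleState.
    func (s *GpuTopologyScheduler) PreFilter(ctx context.Context, state *framework.CycleState, pod *v1.Pod) (*framework.PreFilterResult, *framework.Status) {
    	state.Write(gpuRequestStateKey, &gpuRequestState{count: getGpuRequest(pod)})
    	return nil, framework.NewStatus(framework.Success)
    }
    
    // PreFilterExtensions is required by the PreFilterPlugin interface; we have none.
    func (s *GpuTopologyScheduler) PreFilterExtensions() framework.PreFilterExtensions {
    	return nil
    }
    
    // readGpuRequest is what Filter and Score would call instead of getGpuRequest.
    func readGpuRequest(state *framework.CycleState, pod *v1.Pod) int64 {
    	if data, err := state.Read(gpuRequestStateKey); err == nil {
    		if cached, ok := data.(*gpuRequestState); ok {
    			return cached.count
    		}
    	}
    	return getGpuRequest(pod) // fall back to recomputing if the state is missing
    }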

    High Availability (HA)

    A single scheduler instance is a single point of failure. If it crashes, no new pods (using that scheduler) can be scheduled.

    Solution:

    Run multiple replicas of your scheduler deployment (as shown in our Deployment manifest with replicas: 2). Leader election (leaderElection.leaderElect: true in the KubeSchedulerConfiguration) is crucial: it uses a Lease object to ensure that only one replica actively schedules at any given time. If the leader fails, another replica acquires the lease and takes over, providing near-seamless failover.

    Preemption

    What happens if a high-priority ML training job arrives but there are no available resources, while low-priority batch processing pods are occupying the needed GPUs? The job will remain pending.

    Solution:

    Your scheduler needs preemption logic. This is a complex process in which the scheduler identifies lower-priority pods that can be evicted to make room for the pending high-priority pod. The simplest option is to rely on the DefaultPreemption plugin (a PostFilter plugin, enabled by default) in your profile. For truly custom behavior, you would implement your own PostFilter plugin, the extension point where preemption runs, which significantly increases the complexity of your scheduler.

    Race Conditions and State Staleness

    The scheduler operates on a snapshot of the cluster's state. It's possible for a node's state to change after the Score phase but before the Bind phase. For example, another pod lands on your chosen node, consuming the GPUs you thought were free.

    Solution:

    The Bind phase is the final step. The DefaultBinder plugin attempts to create the Binding object, and if the node no longer has the resources, the Kubelet's admission check rejects the pod (it is typically marked Failed with an OutOf<resource> reason); its controller then creates a replacement that goes through scheduling again. Your scheduler should be stateless and idempotent: it should make the same correct decision given the same cluster state, regardless of previous failed attempts.

    Conclusion: The Trade-off of Customization

    Building a custom Kubernetes scheduler is a powerful but non-trivial undertaking. It's not a solution for every problem. You should only consider it when you have high-value, specialized workloads where the performance and efficiency gains from bespoke scheduling logic justify the development and maintenance overhead.

    For GPU-intensive ML platforms, the calculus often leans in favor of custom schedulers. The ability to perform topology-aware scheduling, enforce fine-grained resource constraints, and implement custom bin-packing strategies can translate directly into faster training times, higher hardware utilization, and significant cost savings. By leveraging the scheduler framework, you can inject this critical domain knowledge into your cluster, transforming it from a general-purpose compute grid into a highly optimized, high-performance computing environment.
