Building a GPU-Aware Kubernetes Scheduler for ML Workloads
The Problem: Why the Default Scheduler Fails High-Performance ML
The kube-scheduler is a marvel of engineering, but its default configuration is optimized for general-purpose stateless and stateful applications. When it comes to specialized, high-performance workloads like distributed machine learning training, its one-size-fits-all approach reveals critical limitations. The scheduler sees a node with eight NVIDIA A100 GPUs as simply having a capacity of nvidia.com/gpu: 8. It has no intrinsic understanding that four of those GPUs might be on one NUMA node connected by high-speed NVLink, while the other four are on another, connected only by the slower PCIe bus.
For a distributed training job using a framework like Horovod, PyTorch DistributedDataParallel, or DeepSpeed, the interconnect bandwidth between GPUs is a primary determinant of training throughput. A job requiring four GPUs might be scheduled by default across two different NUMA nodes, or even worse, across two different physical machines. The resulting communication overhead through PCIe or the network fabric can cripple performance, negating the benefits of distributed training.
Consider this common scenario:
A StatefulSet or a custom operator like MPIJob launches a 4-pod training job, with each pod requesting one GPU (nvidia.com/gpu: 1). The default scheduler might place them like this:
* pod-0 -> node-A, GPU-0
* pod-1 -> node-A, GPU-5
* pod-2 -> node-B, GPU-1
* pod-3 -> node-B, GPU-2
If GPU-0 and GPU-5 on node-A are on different sockets, communication between pod-0 and pod-1 is already suboptimal. The communication between pods on node-A and node-B is even worse, bottlenecked by the network. A topology-aware scheduler would understand that the ideal placement is to find a single node with four available GPUs connected by NVLink and place all four pods there.
This is the gap we will fill. We need a scheduler that can read node topology metadata and make intelligent placement decisions based on pod-specific requirements.
Architectural Choices: Extender vs. Framework Plugin
Before diving into code, it's crucial to choose the right architectural pattern for customizing the Kubernetes scheduler. There are two primary methods:
Scheduler Extender: the kube-scheduler is configured with a URL to an external HTTP server (the extender). During the scheduling cycle, the scheduler sends a request containing the pod and a list of candidate nodes to the extender. The extender responds with a list of filtered (and optionally, scored) nodes.
* Pros: Language-agnostic (can be written in any language that can run an HTTP server), decoupled from the scheduler's lifecycle.
* Cons: High latency due to network hops and serialization/deserialization overhead. Less powerful, as it has limited access to the scheduler's internal state and caching mechanisms. It's generally considered a legacy approach.
Scheduler Framework Plugin: custom logic is compiled into a scheduler binary as plugins that the framework invokes at well-defined extension points during the scheduling cycle.
* Pros: Extremely high performance as it runs in-process with the scheduler. Full access to the scheduler's internal cache and framework handle, allowing for complex and stateful logic. The de facto standard for serious scheduler customization.
* Cons: Must be written in Go. Tightly coupled with the Kubernetes scheduler's source code and version.
For performance-critical ML workloads, the latency introduced by an extender is unacceptable. We will be building a Scheduler Framework Plugin to ensure our scheduling decisions are as fast and efficient as possible.
Deep Dive: The Kubernetes Scheduler Framework Extension Points
The Scheduler Framework provides a series of extension points that allow us to inject custom logic. For our GPU-aware scheduler, the most important are Filter and Score.
* Filter: This extension point is used to eliminate nodes that cannot run the pod. The scheduler runs Filter plugins sequentially. If any plugin marks a node as unschedulable, it's removed from consideration for the current pod. Our Filter plugin will check if a node has the required GPU topology (e.g., a sufficient number of GPUs connected by NVLink).
* Score: After filtering, the remaining nodes are candidates. The Score extension point is used to rank these candidates. Each Score plugin returns an integer score between framework.MinNodeScore (0) and framework.MaxNodeScore (100) for each node. The scheduler applies per-plugin weights, sums the weighted scores from all Score plugins, and picks the node with the highest total. Our Score plugin will prioritize nodes that are a better fit, for instance, by preferring to co-locate pods from the same training job or implementing bin-packing logic.
Other relevant points include PreFilter (for pre-computation before filtering nodes) and Permit (for implementing gang scheduling), but Filter and Score are the core of our solution.
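For orientation, these are the two interfaces our plugins will satisfy, paraphrased from the k8s.io/kubernetes/pkg/scheduler/framework package (exact signatures can shift slightly between Kubernetes minor versions, so treat this as an excerpt rather than a contract):

// Paraphrased from k8s.io/kubernetes/pkg/scheduler/framework.
type FilterPlugin interface {
	Plugin // provides Name() string
	// Filter returns Success, Unschedulable, or an Error status for one node.
	Filter(ctx context.Context, state *CycleState, pod *v1.Pod, nodeInfo *NodeInfo) *Status
}

type ScorePlugin interface {
	Plugin
	// Score returns a value between MinNodeScore (0) and MaxNodeScore (100).
	Score(ctx context.Context, state *CycleState, p *v1.Pod, nodeName string) (int64, *Status)
	// ScoreExtensions can return a normalizer, or nil if raw scores are final.
	ScoreExtensions() ScoreExtensions
}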
Implementation Prerequisite: Discovering GPU Topology
Our scheduler can only make decisions based on data it has. It needs to know the GPU topology of each node in the cluster. The best way to achieve this is by using the NVIDIA GPU Feature Discovery (GFD) tool. GFD runs as a DaemonSet and automatically labels each node with detailed GPU information. For example, it can produce labels like:
nvidia.com/gpu.count: '8'
nvidia.com/gpu.product: 'NVIDIA-A100-SXM4-40GB'
nvidia.com/gpu.memory: '40536'
nvidia.com/gpu.topology.nvlink: 'true'
The nvidia.com/gpu.topology.nvlink: 'true' label is the critical piece of information our scheduler will use. For more complex scenarios, GFD builds on node-feature-discovery (NFD) and can publish richer feature data through it, but for our purposes, simple node labels are sufficient.
Implementation: Building the Topology-Aware Scheduler Plugin
Let's write the Go code for our plugin. We'll create two plugins: GpuTopologyFilter and JobLocalityScore.
Our project structure will look like this:
gpu-scheduler/
├── go.mod
├── go.sum
├── main.go
└── plugins/
    ├── filter.go
    └── score.go
First, set up the Go module:
go mod init github.com/your-org/gpu-scheduler
go get k8s.io/kubernetes@<version>      # pick the release matching your cluster
go get k8s.io/component-base@<version>
go get k8s.io/klog/v2
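One wrinkle worth flagging: the k8s.io/kubernetes module pins its staging repositories (k8s.io/api, k8s.io/apimachinery, k8s.io/client-go, and so on) to v0.0.0, so a plain go get cannot resolve them. You need replace directives in go.mod mapping each staging module to the v0.x release that matches your Kubernetes version. A minimal sketch, with versions shown only as examples (pin them to the release you actually target):

module github.com/your-org/gpu-scheduler

go 1.21

require (
	k8s.io/component-base v0.28.4
	k8s.io/klog/v2 v2.100.1
	k8s.io/kubernetes v1.28.4
)

// k8s.io/kubernetes pins its staging modules to v0.0.0; map each one you pull
// in to the matching v0.x release (v1.28.4 -> v0.28.4). The real list is longer:
// run `go mod tidy` and add a replace line for every unresolved k8s.io module.
replace (
	k8s.io/api => k8s.io/api v0.28.4
	k8s.io/apimachinery => k8s.io/apimachinery v0.28.4
	k8s.io/apiserver => k8s.io/apiserver v0.28.4
	k8s.io/client-go => k8s.io/client-go v0.28.4
	k8s.io/component-base => k8s.io/component-base v0.28.4
	k8s.io/component-helpers => k8s.io/component-helpers v0.28.4
	k8s.io/kube-scheduler => k8s.io/kube-scheduler v0.28.4
)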
The `GpuTopologyFilter` Plugin
This plugin will filter out nodes that don't meet the GPU topology requirements specified in a pod's annotations.
plugins/filter.go
package plugins

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/klog/v2"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

const (
	// GpuTopologyFilterName is the name of the plugin used in the plugin registry and configurations.
	GpuTopologyFilterName = "GpuTopologyFilter"
	// AnnotationGpuTopologyRequirement is the pod annotation that declares a topology requirement.
	AnnotationGpuTopologyRequirement = "ml-workload.io/gpu-topology-required"
	// LabelGpuTopologyNvlink is the node label published by NVIDIA GFD.
	LabelGpuTopologyNvlink = "nvidia.com/gpu.topology.nvlink"
)

// GpuTopologyFilter is a scheduler plugin that filters nodes based on GPU topology.
type GpuTopologyFilter struct{}

var _ framework.FilterPlugin = &GpuTopologyFilter{}

// Name returns name of the plugin.
func (f *GpuTopologyFilter) Name() string {
	return GpuTopologyFilterName
}

// Filter is the function that the scheduler framework calls for each node to check if the pod can be scheduled on it.
func (f *GpuTopologyFilter) Filter(ctx context.Context, state *framework.CycleState, pod *corev1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	node := nodeInfo.Node()
	if node == nil {
		return framework.NewStatus(framework.Error, "node not found")
	}
	klog.Infof("Filtering pod %s on node %s", pod.Name, node.Name)

	// 1. Check if the pod even requests GPUs. If not, this plugin has no opinion.
	requestsGpu := false
	for _, container := range pod.Spec.Containers {
		if _, ok := container.Resources.Limits["nvidia.com/gpu"]; ok {
			requestsGpu = true
			break
		}
	}
	if !requestsGpu {
		klog.Infof("Pod %s does not request GPUs. Skipping filter.", pod.Name)
		return framework.NewStatus(framework.Success)
	}

	// 2. Check for the specific topology annotation on the pod.
	topologyRequirement, ok := pod.Annotations[AnnotationGpuTopologyRequirement]
	if !ok {
		klog.Infof("Pod %s has no GPU topology requirement. Skipping filter.", pod.Name)
		return framework.NewStatus(framework.Success)
	}

	// 3. We have a requirement. Now check if the node satisfies it.
	nodeLabels := node.GetLabels()

	switch topologyRequirement {
	case "nvlink":
		nvlinkAvailable, labelExists := nodeLabels[LabelGpuTopologyNvlink]
		if !labelExists || nvlinkAvailable != "true" {
			msg := fmt.Sprintf("Node %s lacks required NVLink topology", node.Name)
			klog.Warningf("Pod %s failed filter on node %s: %s", pod.Name, node.Name, msg)
			return framework.NewStatus(framework.UnschedulableAndUnresolvable, msg)
		}
	default:
		// If the annotation has an unknown value, we can choose to fail or succeed. Failing is safer.
		msg := fmt.Sprintf("Unknown GPU topology requirement: %s", topologyRequirement)
		klog.Warningf("Pod %s failed filter on node %s: %s", pod.Name, node.Name, msg)
		return framework.NewStatus(framework.UnschedulableAndUnresolvable, msg)
	}

	klog.Infof("Pod %s PASSED filter on node %s", pod.Name, node.Name)
	return framework.NewStatus(framework.Success)
}

// NewGpuTopologyFilter initializes a new plugin and returns it.
func NewGpuTopologyFilter(_ runtime.Object, _ framework.Handle) (framework.Plugin, error) {
	return &GpuTopologyFilter{}, nil
}
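Because Filter takes plain API objects plus a framework.NodeInfo, it can be unit-tested without a cluster. A minimal sketch (the file name and test values are illustrative, not part of the project above):

// plugins/filter_test.go -- minimal sketch of exercising Filter directly.
package plugins

import (
	"context"
	"testing"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

func TestGpuTopologyFilterNvlink(t *testing.T) {
	// A pod that requests one GPU and demands NVLink topology.
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name:        "trainer-0",
			Annotations: map[string]string{AnnotationGpuTopologyRequirement: "nvlink"},
		},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name: "train",
				Resources: corev1.ResourceRequirements{
					Limits: corev1.ResourceList{"nvidia.com/gpu": resource.MustParse("1")},
				},
			}},
		},
	}

	// One node labeled by GFD as NVLink-capable, one without the label.
	nvlinkNode := &corev1.Node{ObjectMeta: metav1.ObjectMeta{
		Name:   "gpu-node-1",
		Labels: map[string]string{LabelGpuTopologyNvlink: "true"},
	}}
	plainNode := &corev1.Node{ObjectMeta: metav1.ObjectMeta{Name: "gpu-node-2"}}

	plugin := &GpuTopologyFilter{}

	nvlinkInfo := framework.NewNodeInfo()
	nvlinkInfo.SetNode(nvlinkNode)
	if status := plugin.Filter(context.Background(), nil, pod, nvlinkInfo); !status.IsSuccess() {
		t.Fatalf("expected NVLink node to pass, got: %v", status)
	}

	plainInfo := framework.NewNodeInfo()
	plainInfo.SetNode(plainNode)
	if status := plugin.Filter(context.Background(), nil, pod, plainInfo); status.IsSuccess() {
		t.Fatalf("expected non-NVLink node to be rejected")
	}
}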
The `JobLocalityScore` Plugin
This plugin encourages co-location of pods belonging to the same job (identified by a job-name label). It is a soft affinity heuristic rather than true gang scheduling, but it helps keep distributed training pods close together.
plugins/score.go
package plugins

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/klog/v2"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

const (
	JobLocalityScoreName = "JobLocalityScore"
	// LabelJobName identifies pods belonging to the same job.
	LabelJobName = "job-name"
)

// JobLocalityScore is a plugin that favors nodes where other pods of the same job are running.
type JobLocalityScore struct {
	handle framework.Handle
}

var _ framework.ScorePlugin = &JobLocalityScore{}

// Name returns the name of the plugin.
func (s *JobLocalityScore) Name() string {
	return JobLocalityScoreName
}

// Score is the function that the scheduler framework calls to score a node.
func (s *JobLocalityScore) Score(ctx context.Context, state *framework.CycleState, p *corev1.Pod, nodeName string) (int64, *framework.Status) {
	// 1. Check if the pod belongs to a job.
	jobName, ok := p.Labels[LabelJobName]
	if !ok {
		// Not a job pod, this plugin doesn't apply.
		return 0, framework.NewStatus(framework.Success)
	}

	// 2. Get node information.
	nodeInfo, err := s.handle.SnapshotSharedLister().NodeInfos().Get(nodeName)
	if err != nil {
		return 0, framework.NewStatus(framework.Error, fmt.Sprintf("getting node %q from Snapshot: %v", nodeName, err))
	}

	// 3. Iterate over existing pods on the node to find matches.
	// nodeInfo.Pods holds *framework.PodInfo entries; the Pod object lives on the Pod field.
	score := int64(0)
	for _, existingPod := range nodeInfo.Pods {
		if existingPod.Pod.Labels[LabelJobName] == jobName {
			// Found another pod from the same job on this node. Increase the score.
			score += 25
		}
	}

	// Cap the score at the max score.
	if score > framework.MaxNodeScore {
		score = framework.MaxNodeScore
	}

	klog.Infof("Pod %s on node %s received locality score of %d", p.Name, nodeName, score)
	return score, framework.NewStatus(framework.Success)
}

// ScoreExtensions of the Score plugin.
func (s *JobLocalityScore) ScoreExtensions() framework.ScoreExtensions {
	return nil // We don't need normalization.
}

// NewJobLocalityScore initializes a new plugin and returns it.
func NewJobLocalityScore(_ runtime.Object, h framework.Handle) (framework.Plugin, error) {
	return &JobLocalityScore{
		handle: h,
	}, nil
}
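Returning nil from ScoreExtensions works here because we cap the raw score at framework.MaxNodeScore ourselves. If you later want uncapped raw scores (say, counting every co-located pod), the framework's normalization hook can rescale them after all candidate nodes are scored. A minimal sketch of what that could look like for this plugin, added to plugins/score.go (ScoreExtensions would then return s instead of nil):

// Sketch: rescale raw per-node scores into [0, framework.MaxNodeScore].
// To enable this, ScoreExtensions() above would return s instead of nil.
func (s *JobLocalityScore) NormalizeScore(ctx context.Context, state *framework.CycleState, p *corev1.Pod, scores framework.NodeScoreList) *framework.Status {
	var highest int64
	for _, nodeScore := range scores {
		if nodeScore.Score > highest {
			highest = nodeScore.Score
		}
	}
	if highest == 0 {
		// No node hosts pods from this job yet; leave all scores at zero.
		return framework.NewStatus(framework.Success)
	}
	for i := range scores {
		scores[i].Score = scores[i].Score * framework.MaxNodeScore / highest
	}
	return framework.NewStatus(framework.Success)
}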
Tying It All Together: `main.go`
This file registers our plugins with the scheduler framework's command-line interface.
main.go
package main

import (
	"os"

	"k8s.io/component-base/cli"
	"k8s.io/kubernetes/cmd/kube-scheduler/app"

	"github.com/your-org/gpu-scheduler/plugins"

	// Ensure that scheduler framework options are registered
	_ "k8s.io/kubernetes/pkg/scheduler/apis/config/scheme"
)

func main() {
	// Register the plugins with the scheduler framework.
	command := app.NewSchedulerCommand(
		app.WithPlugin(plugins.GpuTopologyFilterName, plugins.NewGpuTopologyFilter),
		app.WithPlugin(plugins.JobLocalityScoreName, plugins.NewJobLocalityScore),
	)
	code := cli.Run(command)
	os.Exit(code)
}
Packaging and Deployment
Now we need to build our scheduler, containerize it, and deploy it to the Kubernetes cluster.
`Dockerfile`
We'll use a multi-stage build to keep the final image small.
# Build stage
FROM golang:1.21 as builder
WORKDIR /workspace
COPY go.mod go.mod
COPY go.sum go.sum
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -a -o gpu-scheduler main.go
# Final stage
FROM gcr.io/distroless/static:nonroot
WORKDIR /
COPY --from=builder /workspace/gpu-scheduler .
USER 65532:65532
ENTRYPOINT ["/gpu-scheduler"]
Build and push the image:
export IMG=your-registry/gpu-topology-scheduler:v0.1.0
docker build -t ${IMG} .
docker push ${IMG}
Kubernetes Deployment Manifests
Deploying a second scheduler requires a few components:
* A KubeSchedulerConfiguration file to enable our plugins.
* A ConfigMap to hold this configuration.
* RBAC setup (ClusterRole and ClusterRoleBinding) to give the scheduler permissions.
* A Deployment to run the scheduler pod.
scheduler-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-scheduler-config
  namespace: kube-system
data:
  scheduler-config.yaml: |
    apiVersion: kubescheduler.config.k8s.io/v1
    kind: KubeSchedulerConfiguration
    leaderElection:
      leaderElect: true
      resourceName: gpu-topology-scheduler
      resourceNamespace: kube-system
    profiles:
      - schedulerName: gpu-topology-scheduler
        plugins:
          filter:
            enabled:
              - name: GpuTopologyFilter
          score:
            enabled:
              - name: JobLocalityScore
        pluginConfig:
          - name: GpuTopologyFilter
            args: {}
          - name: JobLocalityScore
            args: {}
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: gpu-scheduler-role
rules:
  - apiGroups: [""]
    resources: ["nodes", "pods", "pods/binding", "pods/status", "replicationcontrollers", "services", "namespaces", "persistentvolumes", "persistentvolumeclaims"]
    verbs: ["get", "list", "watch", "create", "update", "patch"]
  - apiGroups: ["apps"]
    resources: ["replicasets", "statefulsets"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["policy"]
    resources: ["poddisruptionbudgets"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["storage.k8s.io"]
    resources: ["storageclasses", "csinodes", "csidrivers", "csistoragecapacities"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["create", "get", "update"]
  - apiGroups: ["", "events.k8s.io"]
    resources: ["events"]
    verbs: ["create", "patch", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: gpu-scheduler-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: gpu-scheduler-role
subjects:
  - kind: ServiceAccount
    name: default
    namespace: kube-system
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-topology-scheduler
  namespace: kube-system
  labels:
    app: gpu-topology-scheduler
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-topology-scheduler
  template:
    metadata:
      labels:
        app: gpu-topology-scheduler
    spec:
      containers:
        - name: scheduler
          image: your-registry/gpu-topology-scheduler:v0.1.0 # <-- IMPORTANT: Use your image
          args:
            - "--config=/etc/kubernetes/scheduler-config.yaml"
            - "-v=4"
          volumeMounts:
            - name: scheduler-config-volume
              mountPath: /etc/kubernetes
      volumes:
        - name: scheduler-config-volume
          configMap:
            name: gpu-scheduler-config
Apply this manifest to your cluster:
kubectl apply -f scheduler-config.yaml
Using the Custom Scheduler
To use our new scheduler, a pod must specify its schedulerName in the spec. Here is an example of a pod that requires an NVLink-enabled node.
test-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-test-nvlink
  annotations:
    ml-workload.io/gpu-topology-required: "nvlink"
  labels:
    job-name: "distributed-training-job-1"
spec:
  schedulerName: gpu-topology-scheduler # <-- Use our scheduler!
  containers:
    - name: cuda-container
      image: nvidia/cuda:12.0.0-base-ubuntu22.04
      command: ["sleep", "3600"]
      resources:
        limits:
          nvidia.com/gpu: 1
When you apply this pod, you can check the logs of your custom scheduler to see the filtering and scoring in action:
kubectl logs -n kube-system -l app=gpu-topology-scheduler -f
You should see log lines like:
Filtering pod cuda-test-nvlink on node gpu-node-1
Pod cuda-test-nvlink PASSED filter on node gpu-node-1
Pod cuda-test-nvlink on node gpu-node-1 received locality score of 0
Advanced Considerations and Edge Cases
This implementation is a solid foundation, but production environments demand more.
* True Gang Scheduling: Our JobLocalityScore plugin only *encourages* co-location; it doesn't guarantee it. If a job needs 4 pods to start simultaneously (all-or-nothing), this is insufficient. For true gang scheduling, you would need to implement the Permit extension point. The Permit plugin would wait for a minimum number of pods from a job to arrive at the Permit stage before allowing any of them to proceed to binding (see the sketch after this list). This often requires a controller to manage pod groups and timeouts. Projects like Volcano are dedicated schedulers built for these batch workloads.
* Preemption: What happens if a high-priority training job arrives and the cluster is full of low-priority pods? Your custom scheduler needs to participate in the preemption logic. This involves configuring preemptionPolicy for PriorityClasses and potentially implementing a PostFilter plugin to nominate nodes for preemption if a pod is unschedulable.
* Performance and Scalability: In a cluster with thousands of nodes and a high rate of pod churn, your scheduler's performance is paramount. The Filter and Score functions are in the critical path. Avoid expensive computations or external API calls within them. Leverage the PreFilter stage to perform calculations once per pod rather than once per node (a PreFilter sketch follows this list). Use the framework.Handle to access the scheduler's shared informer cache instead of making direct API server calls.
* Observability: A scheduler that makes opaque decisions is a nightmare to debug. Instrument your plugins with metrics. Use the Prometheus client library for Go to expose counters and histograms.
* scheduler_plugin_filter_evaluations_total{plugin="GpuTopologyFilter", result="success|fail"}
* scheduler_plugin_score_evaluation_duration_seconds{plugin="JobLocalityScore"}
Add structured logging with klog to trace the decision-making process for a specific pod, which is invaluable when a user asks, "Why isn't my pod scheduling?"
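To make the gang-scheduling idea concrete, here is a sketch of a Permit plugin that holds pods at the Permit stage until enough members of the same job have arrived. The ml-workload.io/min-members annotation and the GangPermit name are hypothetical conventions for this example; a production version also needs registration in main.go, a rejection path when the timeout expires, and ideally a PodGroup-style CRD:

// Sketch only: gang admission via the Permit extension point. The
// "ml-workload.io/min-members" annotation is a made-up convention for this example.
package plugins

import (
	"context"
	"strconv"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

type GangPermit struct {
	handle framework.Handle
}

var _ framework.PermitPlugin = &GangPermit{}

func (g *GangPermit) Name() string { return "GangPermit" }

func (g *GangPermit) Permit(ctx context.Context, state *framework.CycleState, p *corev1.Pod, nodeName string) (*framework.Status, time.Duration) {
	jobName, ok := p.Labels[LabelJobName]
	if !ok {
		return framework.NewStatus(framework.Success), 0
	}
	minMembers, err := strconv.Atoi(p.Annotations["ml-workload.io/min-members"])
	if err != nil || minMembers <= 1 {
		return framework.NewStatus(framework.Success), 0
	}

	// Count gang members already parked at the Permit stage, plus this pod.
	arrived := 1
	g.handle.IterateOverWaitingPods(func(wp framework.WaitingPod) {
		if wp.GetPod().Labels[LabelJobName] == jobName {
			arrived++
		}
	})

	if arrived >= minMembers {
		// Quorum reached: release every waiting member of the gang for binding.
		g.handle.IterateOverWaitingPods(func(wp framework.WaitingPod) {
			if wp.GetPod().Labels[LabelJobName] == jobName {
				wp.Allow("GangPermit")
			}
		})
		return framework.NewStatus(framework.Success), 0
	}

	// Not enough members yet: wait (bounded) for the rest of the gang to arrive.
	return framework.NewStatus(framework.Wait), 30 * time.Second
}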
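And to illustrate the PreFilter point above: parsing pod annotations is cheap, but the pattern matters once per-node work gets expensive. The sketch below, added to plugins/filter.go (which already imports context, corev1, and the framework package), parses the topology requirement once per scheduling cycle and stashes it in CycleState for Filter to read. The signature shown matches recent framework versions, where PreFilter returns a *framework.PreFilterResult; older releases return only a *Status. If you adopt this, also enable GpuTopologyFilter under the preFilter extension point in the KubeSchedulerConfiguration profile.

// Sketch: compute once per pod in PreFilter, read per node in Filter.
const gpuTopologyStateKey = framework.StateKey("ml-workload.io/gpu-topology")

type gpuTopologyState struct {
	requirement string
}

// Clone implements framework.StateData; the value is immutable, so a shallow copy is fine.
func (s *gpuTopologyState) Clone() framework.StateData {
	return &gpuTopologyState{requirement: s.requirement}
}

func (f *GpuTopologyFilter) PreFilter(ctx context.Context, state *framework.CycleState, pod *corev1.Pod) (*framework.PreFilterResult, *framework.Status) {
	state.Write(gpuTopologyStateKey, &gpuTopologyState{
		requirement: pod.Annotations[AnnotationGpuTopologyRequirement],
	})
	// Returning a nil PreFilterResult means every node remains a candidate.
	return nil, framework.NewStatus(framework.Success)
}

func (f *GpuTopologyFilter) PreFilterExtensions() framework.PreFilterExtensions {
	return nil
}

// Inside Filter, read the cached value instead of re-parsing annotations:
//
//	data, err := state.Read(gpuTopologyStateKey)
//	if err == nil {
//		requirement := data.(*gpuTopologyState).requirement
//		// ... use requirement ...
//	}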
Conclusion: The Power of Custom Scheduling
By moving beyond the default kube-scheduler, you unlock the ability to tailor Kubernetes to the specific needs of your most demanding workloads. We've demonstrated how to build a topology-aware scheduler that understands the physical layout of GPUs, a critical requirement for efficient ML training. This pattern of using node feature discovery to label nodes and a custom scheduler plugin to consume those labels is a powerful and extensible model.
The journey doesn't end here. The same principles can be applied to schedule based on network fabric (e.g., InfiniBand), storage locality, power consumption, or even real-time utilization metrics, transforming your Kubernetes cluster into a truly performance-optimized platform for any specialized workload.