Building a GPU-Aware Kubernetes Scheduler for ML Workloads

Goh Ling Yong

The Problem: Why the Default Scheduler Fails High-Performance ML

The kube-scheduler is a marvel of engineering, but its default configuration is optimized for general-purpose stateless and stateful applications. When it comes to specialized, high-performance workloads like distributed machine learning training, its one-size-fits-all approach reveals critical limitations. The scheduler sees a node with eight NVIDIA A100 GPUs as simply having a capacity of nvidia.com/gpu: 8. It has no intrinsic understanding that four of those GPUs might be on one NUMA node connected by high-speed NVLink, while the other four are on another, connected only by the slower PCIe bus.

For a distributed training job using a framework like Horovod, PyTorch DistributedDataParallel, or DeepSpeed, the interconnect bandwidth between GPUs is a primary determinant of training throughput. A job requiring four GPUs might be scheduled by default across two different NUMA nodes, or even worse, across two different physical machines. The resulting communication overhead through PCIe or the network fabric can cripple performance, negating the benefits of distributed training.
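
On a multi-GPU node you can inspect this interconnect topology directly with nvidia-smi; the resulting matrix is exactly the information the default scheduler never sees:

bash
# Prints the GPU interconnect matrix: NV# entries mark NVLink connections,
# PIX/PXB mark PCIe paths, and SYS marks paths that cross the NUMA interconnect.
nvidia-smi topo -m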

Consider this common scenario:

A StatefulSet or a custom resource like MPIJob (managed by the MPI Operator) launches a 4-pod training job, with each pod requesting one GPU (nvidia.com/gpu: 1). The default scheduler might place them like this:

* pod-0 -> node-A, GPU-0

* pod-1 -> node-A, GPU-5

* pod-2 -> node-B, GPU-1

* pod-3 -> node-B, GPU-2

If GPU-0 and GPU-5 on node-A are on different sockets, communication between pod-0 and pod-1 is already suboptimal. The communication between pods on node-A and node-B is even worse, bottlenecked by the network. A topology-aware scheduler would understand that the ideal placement is to find a single node with four available GPUs connected by NVLink and place all four pods there.

This is the gap we will fill. We need a scheduler that can read node topology metadata and make intelligent placement decisions based on pod-specific requirements.

Architectural Choices: Extender vs. Framework Plugin

Before diving into code, it's crucial to choose the right architectural pattern for customizing the Kubernetes scheduler. There are two primary methods:

  • Scheduler Extender: This is the older, webhook-based approach. The kube-scheduler is configured with a URL to an external HTTP server (the extender). During the scheduling cycle, the scheduler sends a request containing the pod and a list of candidate nodes to the extender. The extender responds with a list of filtered (and optionally, scored) nodes.

    * Pros: Language-agnostic (can be written in any language that can run an HTTP server), decoupled from the scheduler's lifecycle.

    * Cons: High latency due to network hops and serialization/deserialization overhead. Less powerful, as it has limited access to the scheduler's internal state and caching mechanisms. It's generally considered a legacy approach.

  • Scheduler Framework Plugin: Introduced in Kubernetes 1.15 and now the standard, this approach involves writing plugins in Go that are compiled directly into a scheduler binary. These plugins register for specific extension points (hooks) in the scheduling cycle.

    * Pros: Extremely high performance as it runs in-process with the scheduler. Full access to the scheduler's internal cache and framework handle, allowing for complex and stateful logic. The de facto standard for serious scheduler customization.

    * Cons: Must be written in Go. Tightly coupled with the Kubernetes scheduler's source code and version.

    For performance-critical ML workloads, the latency introduced by an extender is unacceptable. We will be building a Scheduler Framework Plugin to ensure our scheduling decisions are as fast and efficient as possible.

    Deep Dive: The Kubernetes Scheduler Framework Extension Points

    The Scheduler Framework provides a series of extension points that allow us to inject custom logic. For our GPU-aware scheduler, the most important are Filter and Score.

    * Filter: This extension point is used to eliminate nodes that cannot run the pod. The scheduler runs Filter plugins sequentially. If any plugin marks a node as unschedulable, it's removed from consideration for the current pod. Our Filter plugin will check if a node has the required GPU topology (e.g., a sufficient number of GPUs connected by NVLink).

    * Score: After filtering, the remaining nodes are candidates. The Score extension point is used to rank these candidates. Each Score plugin returns an integer score (typically 0-100) for each node. The scheduler sums the scores from all Score plugins and picks the node with the highest total score. Our Score plugin will prioritize nodes that are a better fit, for instance, by preferring to co-locate pods from the same training job or implementing bin-packing logic.

    Other relevant points include PreFilter (for pre-computation before filtering nodes) and Permit (for implementing gang scheduling), but Filter and Score are the core of our solution.

    Implementation Prerequisite: Discovering GPU Topology

    Our scheduler can only make decisions based on data it has. It needs to know the GPU topology of each node in the cluster. The best way to achieve this is by using the NVIDIA GPU Feature Discovery (GFD) tool. GFD runs as a DaemonSet and automatically labels each node with detailed GPU information. For example, it can produce labels like:

    yaml
    nvidia.com/gpu.count: '8'
    nvidia.com/gpu.product: 'NVIDIA-A100-SXM4-40GB'
    nvidia.com/gpu.memory: '40536'
    nvidia.com/gpu.topology.nvlink: 'true'

    The nvidia.com/gpu.topology.nvlink: 'true' label is the critical piece of information our scheduler will use. For richer data, GFD builds on Node Feature Discovery (NFD), which can surface more detailed feature information, but for our purposes simple node labels are sufficient.
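
    Once GFD is running, you can confirm the labels it applied (the label keys here simply mirror the example above; adjust them to whatever your GFD version actually emits):

    bash
    # Show GPU-related labels as extra columns for every node.
    kubectl get nodes -L nvidia.com/gpu.count,nvidia.com/gpu.product,nvidia.com/gpu.topology.nvlink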

    Implementation: Building the Topology-Aware Scheduler Plugin

    Let's write the Go code for our plugin. We'll create two plugins: GpuTopologyFilter and JobLocalityScore.

    Our project structure will look like this:

    text
    gpu-scheduler/
    ├── go.mod
    ├── go.sum
    ├── main.go
    └── plugins/
        ├── filter.go
        └── score.go

    First, set up the Go module:

    bash
    go mod init github.com/your-org/gpu-scheduler
    go get k8s.io/kubernetes@v1.28.0   # pin to the minor version of your target cluster
    go get k8s.io/component-base@v0.28.0
    go get k8s.io/klog/v2

    Because k8s.io/kubernetes pins its staging modules (k8s.io/api, k8s.io/apimachinery, k8s.io/component-base, and friends) to v0.0.0, your go.mod also needs replace directives mapping them to the matching v0.28.x releases before the build will resolve.

    The `GpuTopologyFilter` Plugin

    This plugin will filter out nodes that don't meet the GPU topology requirements specified in a pod's annotations.

    plugins/filter.go

    go
    package plugins
    
    import (
    	"context"
    	"fmt"
    
    	corev1 "k8s.io/api/core/v1"
    	"k8s.io/apimachinery/pkg/runtime"
    	"k8s.io/klog/v2"
    	"k8s.io/kubernetes/pkg/scheduler/framework"
    )
    
    const (
    	// Name is the name of the plugin used in the plugin registry and configurations.
    	GpuTopologyFilterName = "GpuTopologyFilter"
    
    	// Pod annotations
    	AnnotationGpuTopologyRequirement = "ml-workload.io/gpu-topology-required"
    
    	// Node labels from NVIDIA GFD
    	LabelGpuTopologyNvlink = "nvidia.com/gpu.topology.nvlink"
    )
    
    // GpuTopologyFilter is a scheduler plugin that filters nodes based on GPU topology.
    type GpuTopologyFilter struct{}
    
    var _ framework.FilterPlugin = &GpuTopologyFilter{}
    
    // Name returns name of the plugin.
    func (f *GpuTopologyFilter) Name() string {
    	return GpuTopologyFilterName
    }
    
    // Filter is the function that the scheduler framework calls for each node to check if the pod can be scheduled on it.
    func (f *GpuTopologyFilter) Filter(ctx context.Context, state *framework.CycleState, pod *corev1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
    	klog.Infof("Filtering pod %s on node %s", pod.Name, nodeInfo.Node().Name)
    
    	// 1. Check if the pod even requests GPUs. If not, this plugin has no opinion.
    	requestsGpu := false
    	for _, container := range pod.Spec.Containers {
    		if _, ok := container.Resources.Limits["nvidia.com/gpu"]; ok {
    			requestsGpu = true
    			break
    		}
    	}
    	if !requestsGpu {
    		klog.Infof("Pod %s does not request GPUs. Skipping filter.", pod.Name)
    		return framework.NewStatus(framework.Success)
    	}
    
    	// 2. Check for the specific topology annotation on the pod.
    	topologyRequirement, ok := pod.Annotations[AnnotationGpuTopologyRequirement]
    	if !ok {
    		klog.Infof("Pod %s has no GPU topology requirement. Skipping filter.", pod.Name)
    		return framework.NewStatus(framework.Success)
    	}
    
    	// 3. We have a requirement. Now check if the node satisfies it.
    	node := nodeInfo.Node()
    	if node == nil {
    		return framework.NewStatus(framework.Error, "node not found")
    	}
    	nodeLabels := node.GetLabels()
    
    	switch topologyRequirement {
    	case "nvlink":
    		nvlinkAvailable, labelExists := nodeLabels[LabelGpuTopologyNvlink]
    		if !labelExists || nvlinkAvailable != "true" {
    			msg := fmt.Sprintf("Node %s lacks required NVLink topology", node.Name)
    			klog.Warningf("Pod %s failed filter on node %s: %s", pod.Name, node.Name, msg)
    			return framework.NewStatus(framework.UnschedulableAndUnresolvable, msg)
    		}
    	default:
    		// If the annotation has an unknown value, we can choose to fail or succeed. Failing is safer.
    		msg := fmt.Sprintf("Unknown GPU topology requirement: %s", topologyRequirement)
    		klog.Warningf("Pod %s failed filter on node %s: %s", pod.Name, node.Name, msg)
    		return framework.NewStatus(framework.UnschedulableAndUnresolvable, msg)
    	}
    
    	klog.Infof("Pod %s PASSED filter on node %s", pod.Name, node.Name)
    	return framework.NewStatus(framework.Success)
    }
    
    // NewGpuTopologyFilter initializes a new plugin and returns it.
    func NewGpuTopologyFilter(_ runtime.Object, _ framework.Handle) (framework.Plugin, error) {
    	return &GpuTopologyFilter{}, nil
    }
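
    A quick unit-test sketch for the happy path (the file name plugins/filter_test.go and the fixture values are assumptions, not part of the project layout above):

    go
    package plugins

    import (
    	"context"
    	"testing"

    	corev1 "k8s.io/api/core/v1"
    	"k8s.io/apimachinery/pkg/api/resource"
    	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    	"k8s.io/kubernetes/pkg/scheduler/framework"
    )

    func TestGpuTopologyFilterNvlink(t *testing.T) {
    	// A pod that requests one GPU and declares the NVLink topology requirement.
    	pod := &corev1.Pod{
    		ObjectMeta: metav1.ObjectMeta{
    			Name:        "trainer-0",
    			Annotations: map[string]string{AnnotationGpuTopologyRequirement: "nvlink"},
    		},
    		Spec: corev1.PodSpec{
    			Containers: []corev1.Container{{
    				Resources: corev1.ResourceRequirements{
    					Limits: corev1.ResourceList{"nvidia.com/gpu": resource.MustParse("1")},
    				},
    			}},
    		},
    	}

    	// A node labeled by GFD as NVLink-capable.
    	node := &corev1.Node{
    		ObjectMeta: metav1.ObjectMeta{
    			Name:   "gpu-node-1",
    			Labels: map[string]string{LabelGpuTopologyNvlink: "true"},
    		},
    	}
    	nodeInfo := framework.NewNodeInfo()
    	nodeInfo.SetNode(node)

    	plugin := &GpuTopologyFilter{}
    	if status := plugin.Filter(context.Background(), nil, pod, nodeInfo); !status.IsSuccess() {
    		t.Fatalf("expected pod to pass the filter, got: %v", status)
    	}
    }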
    

    The `JobLocalityScore` Plugin

    This plugin encourages co-location of pods belonging to the same job (identified by a job-name label). It is a simple affinity heuristic rather than true gang scheduling (a distinction we return to later), but it helps keep distributed training pods close together.

    plugins/score.go

    go
    package plugins
    
    import (
    	"context"
    	"fmt"
    
    	corev1 "k8s.io/api/core/v1"
    	"k8s.io/apimachinery/pkg/runtime"
    	"k8s.io/klog/v2"
    	"k8s.io/kubernetes/pkg/scheduler/framework"
    )
    
    const (
    	JobLocalityScoreName = "JobLocalityScore"
    
    	// Label to identify pods belonging to the same job
    	LabelJobName = "job-name"
    )
    
    // JobLocalityScore is a plugin that favors nodes where other pods of the same job are running.
    type JobLocalityScore struct {
    	handle framework.Handle
    }
    
    var _ framework.ScorePlugin = &JobLocalityScore{}
    
    // Name returns the name of the plugin.
    func (s *JobLocalityScore) Name() string {
    	return JobLocalityScoreName
    }
    
    // Score is the function that the scheduler framework calls to score a node.
    func (s *JobLocalityScore) Score(ctx context.Context, state *framework.CycleState, p *corev1.Pod, nodeName string) (int64, *framework.Status) {
    	// 1. Check if the pod belongs to a job.
    	jobName, ok := p.Labels[LabelJobName]
    	if !ok {
    		// Not a job pod, this plugin doesn't apply.
    		return 0, framework.NewStatus(framework.Success)
    	}
    
    	// 2. Get node information.
    	nodeInfo, err := s.handle.SnapshotSharedLister().NodeInfos().Get(nodeName)
    	if err != nil {
    		return 0, framework.NewStatus(framework.Error, fmt.Sprintf("getting node %q from Snapshot: %v", nodeName, err))
    	}
    
    	// 3. Iterate over existing pods on the node to find matches.
    	// nodeInfo.Pods holds *framework.PodInfo entries, so the pod object is reached via .Pod.
    	score := int64(0)
    	for _, existingPod := range nodeInfo.Pods {
    		if existingPod.Pod.Labels[LabelJobName] == jobName {
    			// Found another pod from the same job on this node. Increase the score.
    			score += 25
    		}
    	}
    
    	// Cap the score at the max score.
    	if score > framework.MaxNodeScore {
    		score = framework.MaxNodeScore
    	}
    
    	klog.Infof("Pod %s on node %s received locality score of %d", p.Name, nodeName, score)
    	return score, framework.NewStatus(framework.Success)
    }
    
    // ScoreExtensions of the Score plugin.
    func (s *JobLocalityScore) ScoreExtensions() framework.ScoreExtensions {
    	return nil // We don't need normalization.
    }
    
    // NewJobLocalityScore initializes a new plugin and returns it.
    func NewJobLocalityScore(_ runtime.Object, h framework.Handle) (framework.Plugin, error) {
    	return &JobLocalityScore{
    		handle: h,
    	}, nil
    }

    Tying It All Together: `main.go`

    This file registers our plugins with the scheduler framework's command-line interface.

    main.go

    go
    package main
    
    import (
    	"os"
    
    	"k8s.io/component-base/cli"
    	"k8s.io/kubernetes/cmd/kube-scheduler/app"
    
    	"github.com/your-org/gpu-scheduler/plugins"
    	// Blank import to ensure the KubeSchedulerConfiguration API scheme is registered.
    	_ "k8s.io/kubernetes/pkg/scheduler/apis/config/scheme"
    )
    
    func main() {
    	// Register the plugin with the scheduler framework.
    	command := app.NewSchedulerCommand(
    		app.WithPlugin(plugins.GpuTopologyFilterName, plugins.NewGpuTopologyFilter),
    		app.WithPlugin(plugins.JobLocalityScoreName, plugins.NewJobLocalityScore),
    	)
    
    	code := cli.Run(command)
    	os.Exit(code)
    }
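
    Before containerizing, you can build and smoke-test the binary locally against a development cluster. Point it at a plain KubeSchedulerConfiguration file (the same content we embed in a ConfigMap in the next section; the file name here is arbitrary), and when running outside the cluster set clientConnection.kubeconfig in that file so the scheduler can reach the API server:

    bash
    go build -o gpu-scheduler .
    ./gpu-scheduler --config=local-scheduler-config.yaml -v=4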

    Packaging and Deployment

    Now we need to build our scheduler, containerize it, and deploy it to the Kubernetes cluster.

    `Dockerfile`

    We'll use a multi-stage build to keep the final image small.

    Dockerfile
    # Build stage
    FROM golang:1.21 as builder
    
    WORKDIR /workspace
    
    COPY go.mod go.mod
    COPY go.sum go.sum
    RUN go mod download
    
    COPY . .
    
    RUN CGO_ENABLED=0 GOOS=linux go build -a -o gpu-scheduler main.go
    
    # Final stage
    FROM gcr.io/distroless/static:nonroot
    
    WORKDIR /
    COPY --from=builder /workspace/gpu-scheduler .
    
    USER 65532:65532
    
    ENTRYPOINT ["/gpu-scheduler"]

    Build and push the image:

    bash
    export IMG=your-registry/gpu-topology-scheduler:v0.1.0
    docker build -t ${IMG} .
    docker push ${IMG}

    Kubernetes Deployment Manifests

    Deploying a second scheduler requires a few components:

  • A KubeSchedulerConfiguration file to enable our plugins.
  • A ConfigMap to hold this configuration.
  • An RBAC setup (ClusterRole and ClusterRoleBinding) to give the scheduler permissions.
  • A Deployment to run the scheduler pod.

    scheduler-config.yaml

    yaml
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: gpu-scheduler-config
      namespace: kube-system
    data:
      scheduler-config.yaml: |
        apiVersion: kubescheduler.config.k8s.io/v1
        kind: KubeSchedulerConfiguration
        leaderElection:
          leaderElect: true
          resourceName: gpu-topology-scheduler
          resourceNamespace: kube-system
        profiles:
          - schedulerName: gpu-topology-scheduler
            plugins:
              filter:
                enabled:
                  - name: GpuTopologyFilter
              score:
                enabled:
                  - name: JobLocalityScore
            pluginConfig:
              - name: GpuTopologyFilter
                args: {}
              - name: JobLocalityScore
                args: {}
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: gpu-scheduler-role
    rules:
    - apiGroups: [""]
      resources: ["nodes", "pods", "pods/binding", "pods/status", "replicationcontrollers", "services", "persistentvolumes", "persistentvolumeclaims", "namespaces"]
      verbs: ["get", "list", "watch", "create", "update", "patch"]
    - apiGroups: ["", "events.k8s.io"]
      resources: ["events"]
      verbs: ["create", "patch", "update"]
    - apiGroups: ["apps"]
      resources: ["replicasets", "statefulsets"]
      verbs: ["get", "list", "watch"]
    - apiGroups: ["policy"]
      resources: ["poddisruptionbudgets"]
      verbs: ["get", "list", "watch"]
    - apiGroups: ["storage.k8s.io"]
      resources: ["storageclasses", "csinodes", "csidrivers", "csistoragecapacities"]
      verbs: ["get", "list", "watch"]
    - apiGroups: ["coordination.k8s.io"]
      resources: ["leases"]
      verbs: ["create", "get", "update"]
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: gpu-scheduler-binding
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: gpu-scheduler-role
    subjects:
    - kind: ServiceAccount
      name: default
      namespace: kube-system
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: gpu-topology-scheduler
      namespace: kube-system
      labels:
        app: gpu-topology-scheduler
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: gpu-topology-scheduler
      template:
        metadata:
          labels:
            app: gpu-topology-scheduler
        spec:
          containers:
          - name: scheduler
            image: your-registry/gpu-topology-scheduler:v0.1.0 # <-- IMPORTANT: Use your image
            args:
            - "--config=/etc/kubernetes/scheduler-config.yaml"
            - "-v=4"
            volumeMounts:
            - name: scheduler-config-volume
              mountPath: /etc/kubernetes
          volumes:
          - name: scheduler-config-volume
            configMap:
              name: gpu-scheduler-config

    Apply this manifest to your cluster:

    kubectl apply -f scheduler-config.yaml
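
    Before pointing workloads at the new scheduler, confirm its pod is running and that it has acquired its leader-election lease:

    bash
    kubectl -n kube-system get pods -l app=gpu-topology-scheduler
    kubectl -n kube-system get lease gpu-topology-scheduler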

    Using the Custom Scheduler

    To use our new scheduler, a pod must specify its schedulerName in the spec. Here is an example of a pod that requires an NVLink-enabled node.

    test-pod.yaml

    yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: cuda-test-nvlink
      annotations:
        ml-workload.io/gpu-topology-required: "nvlink"
      labels:
        job-name: "distributed-training-job-1"
    spec:
      schedulerName: gpu-topology-scheduler # <-- Use our scheduler!
      containers:
      - name: cuda-container
        image: nvidia/cuda:12.0.0-base-ubuntu22.04
        command: ["sleep", "3600"]
        resources:
          limits:
            nvidia.com/gpu: 1

    When you apply this pod, you can check the logs of your custom scheduler to see the filtering and scoring in action:

    kubectl logs -n kube-system -l app=gpu-topology-scheduler -f

    You should see log lines like:

    Filtering pod cuda-test-nvlink on node gpu-node-1

    Pod cuda-test-nvlink PASSED filter on node gpu-node-1

    Pod cuda-test-nvlink on node gpu-node-1 received locality score of 0
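
    You can also confirm which node the pod actually landed on:

    bash
    kubectl get pod cuda-test-nvlink -o jsonpath='{.spec.nodeName}{"\n"}'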

    Advanced Considerations and Edge Cases

    This implementation is a solid foundation, but production environments demand more.

    * True Gang Scheduling: Our JobLocalityScore plugin only encourages co-location; it doesn't guarantee it. If a job needs 4 pods to start simultaneously (all-or-nothing), this is insufficient. For true gang scheduling, you would need to implement the Permit extension point. The Permit plugin would wait for a minimum number of pods from a job to arrive at the Permit stage before allowing any of them to proceed to binding (see the sketch at the end of this section). This often requires a controller to manage pod groups and timeouts. Projects like Volcano are dedicated schedulers built for these batch workloads.

    * Preemption: What happens if a high-priority training job arrives and the cluster is full of low-priority pods? Your custom scheduler needs to participate in the preemption logic. This involves configuring preemptionPolicy for PriorityClasses and potentially implementing a PostFilter plugin to nominate nodes for preemption if a pod is unschedulable.

    * Performance and Scalability: In a cluster with thousands of nodes and a high rate of pod churn, your scheduler's performance is paramount. The Filter and Score functions are in the critical path. Avoid expensive computations or external API calls within them. Leverage the PreFilter stage to perform calculations once per pod rather than once per node. Use the framework.Handle to access the scheduler's shared informer cache instead of making direct API server calls.

    * Observability: A scheduler that makes opaque decisions is a nightmare to debug. Instrument your plugins with metrics. Use the Prometheus client library for Go to expose counters and histograms, for example (a registration sketch follows at the end of this section):

    * scheduler_plugin_filter_evaluations_total{plugin="GpuTopologyFilter", result="success|fail"}

    * scheduler_plugin_score_evaluation_duration_seconds{plugin="JobLocalityScore"}

    Add structured logging with klog to trace the decision-making process for a specific pod, which is invaluable when a user asks, "Why isn't my pod scheduling?"
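
    To make the metrics suggestion concrete, here is a minimal registration sketch. It uses the component-base metrics wrappers rather than the raw Prometheus client so that the series appear on the scheduler's existing /metrics endpoint; the file name plugins/metrics.go and the exact metric names are assumptions that simply mirror the examples above.

    go
    // plugins/metrics.go (hypothetical file name).
    package plugins

    import (
    	"k8s.io/component-base/metrics"
    	"k8s.io/component-base/metrics/legacyregistry"
    )

    var (
    	// Counts Filter evaluations, labeled by plugin name and result ("success" or "fail").
    	filterEvaluations = metrics.NewCounterVec(
    		&metrics.CounterOpts{
    			Name: "scheduler_plugin_filter_evaluations_total",
    			Help: "Number of Filter evaluations, by plugin and result.",
    		},
    		[]string{"plugin", "result"},
    	)

    	// Tracks how long Score evaluations take, labeled by plugin name.
    	scoreDuration = metrics.NewHistogramVec(
    		&metrics.HistogramOpts{
    			Name: "scheduler_plugin_score_evaluation_duration_seconds",
    			Help: "Latency of Score evaluations, by plugin.",
    		},
    		[]string{"plugin"},
    	)
    )

    func init() {
    	// Register with the scheduler's shared registry so the metrics are served alongside its built-in ones.
    	legacyregistry.MustRegister(filterEvaluations, scoreDuration)
    }

    Inside Filter you would then call filterEvaluations.WithLabelValues(GpuTopologyFilterName, "success").Inc() (or "fail") on the corresponding code paths, and wrap Score with a timer that observes into scoreDuration.

    Similarly, to make the gang-scheduling idea from the first bullet concrete, here is a rough sketch of a Permit plugin (a hypothetical plugins/permit.go). It is illustrative only: the pod-group-size annotation, the timeout, and the naive quorum counting are assumptions, and a production implementation would track pod groups and timeouts with a dedicated controller, as Volcano does.

    go
    // plugins/permit.go (hypothetical file name).
    package plugins

    import (
    	"context"
    	"strconv"
    	"time"

    	corev1 "k8s.io/api/core/v1"
    	"k8s.io/apimachinery/pkg/runtime"
    	"k8s.io/kubernetes/pkg/scheduler/framework"
    )

    const (
    	GangPermitName = "GangPermit"

    	// Hypothetical annotation declaring how many pods of the job must reach
    	// the Permit stage before any of them is allowed to bind.
    	AnnotationPodGroupSize = "ml-workload.io/pod-group-size"

    	gangWaitTimeout = 2 * time.Minute
    )

    // GangPermit parks pods of a job at the Permit stage until the whole group is ready.
    type GangPermit struct {
    	handle framework.Handle
    }

    var _ framework.PermitPlugin = &GangPermit{}

    func (g *GangPermit) Name() string { return GangPermitName }

    func (g *GangPermit) Permit(ctx context.Context, state *framework.CycleState, p *corev1.Pod, nodeName string) (*framework.Status, time.Duration) {
    	jobName, ok := p.Labels[LabelJobName]
    	if !ok {
    		return framework.NewStatus(framework.Success), 0
    	}
    	required, err := strconv.Atoi(p.Annotations[AnnotationPodGroupSize])
    	if err != nil || required <= 1 {
    		return framework.NewStatus(framework.Success), 0
    	}

    	// Count this pod plus any siblings already waiting at the Permit stage.
    	waiting := 1
    	g.handle.IterateOverWaitingPods(func(wp framework.WaitingPod) {
    		if wp.GetPod().Labels[LabelJobName] == jobName {
    			waiting++
    		}
    	})

    	if waiting < required {
    		// Not enough of the group has arrived yet: park this pod until the timeout.
    		return framework.NewStatus(framework.Wait), gangWaitTimeout
    	}

    	// Quorum reached: release every waiting sibling, then let this pod proceed to binding.
    	g.handle.IterateOverWaitingPods(func(wp framework.WaitingPod) {
    		if wp.GetPod().Labels[LabelJobName] == jobName {
    			wp.Allow(GangPermitName)
    		}
    	})
    	return framework.NewStatus(framework.Success), 0
    }

    // NewGangPermit initializes the plugin.
    func NewGangPermit(_ runtime.Object, h framework.Handle) (framework.Plugin, error) {
    	return &GangPermit{handle: h}, nil
    }

    Like the other plugins, it would be registered in main.go via app.WithPlugin and enabled under the permit extension point of the scheduling profile.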

    Conclusion: The Power of Custom Scheduling

    By moving beyond the default kube-scheduler, you unlock the ability to tailor Kubernetes to the specific needs of your most demanding workloads. We've demonstrated how to build a topology-aware scheduler that understands the physical layout of GPUs, a critical requirement for efficient ML training. This pattern of using node feature discovery to label nodes and a custom scheduler plugin to consume those labels is a powerful and extensible model.

    The journey doesn't end here. The same principles can be applied to schedule based on network fabric (e.g., InfiniBand), storage locality, power consumption, or even real-time utilization metrics, transforming your Kubernetes cluster into a truly performance-optimized platform for any specialized workload.
