K8s Custom Schedulers for GPU Bin-Packing in ML Workloads

Goh Ling Yong

The Default Scheduler's Shortcomings for GPU Workloads

For senior engineers managing large-scale machine learning platforms on Kubernetes, the limitations of the default scheduler become apparent quickly, especially with expensive GPU resources. While excellent for general-purpose stateless applications, its default plugin set (primarily NodeResourcesFit, NodeName, TaintToleration, and a scoring strategy that balances utilization across nodes) falls short for specialized, high-value workloads. The core issues are:

  • GPU Fragmentation: The default scheduler's scoring logic often spreads pods across nodes to balance utilization. Consider a cluster with three 8-GPU nodes. If you schedule three separate 2-GPU jobs, the scheduler might place one on each node. Now, each node has 6 GPUs free. If a high-priority 8-GPU training job arrives, it cannot be scheduled, despite 18 total GPUs being available across the cluster. The resources are fragmented.
  • Lack of Topology Awareness: Modern multi-GPU servers feature high-speed interconnects like NVIDIA's NVLink or AMD's Infinity Fabric. For distributed training workloads (e.g., using torch.distributed), placing cooperating pods on GPUs connected by a high-speed interconnect is critical for performance. The default scheduler has no concept of this sub-node topology. It sees nvidia.com/gpu: 8 as eight fungible resources, not as four NVLink-paired sets.
  • Suboptimal Cost Efficiency (Bin-Packing): For cloud environments, consolidating workloads onto the minimum number of nodes (bin-packing) allows the cluster autoscaler to scale down unused nodes, directly reducing costs. The default scheduler's spreading behavior actively works against this goal.

To solve these production issues, we must move beyond simple nodeSelector or affinity rules and implement custom scheduling logic. This post focuses on building a high-performance, bin-packing scheduler for GPUs using the modern Kubernetes Scheduling Framework.


    Architecture: Scheduler Extenders vs. The Scheduling Framework

    Before we build, it's crucial to understand the two primary mechanisms for customizing scheduling in Kubernetes. While you might encounter legacy systems using extenders, the Scheduling Framework is the standard for modern implementations.

    The Legacy Approach: Scheduler Extenders

    A scheduler extender is an external webhook (HTTP server) that the default scheduler calls out to during its decision-making process. You configure the scheduler to send pod and node information to your extender's endpoints for two main operations:

    * Filter: The extender receives a list of nodes and returns a subset that are eligible to run the pod.

    * Prioritize (Score): The extender receives the filtered list of nodes and returns a score for each, indicating preference.
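
    In configuration terms, an extender is wired into the KubeSchedulerConfiguration. A minimal sketch (the service URL and verb paths below are illustrative assumptions, not a required convention):

    yaml
    apiVersion: kubescheduler.config.k8s.io/v1
    kind: KubeSchedulerConfiguration
    extenders:
      - urlPrefix: "http://gpu-extender.kube-system.svc:8888/scheduler"  # hypothetical extender service
        filterVerb: "filter"          # POST <urlPrefix>/filter
        prioritizeVerb: "prioritize"  # POST <urlPrefix>/prioritize
        weight: 1
        nodeCacheCapable: true
        managedResources:
          - name: "nvidia.com/gpu"
            ignoredByScheduler: false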

    Pros:

    * Language Agnostic: You can write it in any language that can host an HTTP server (Python, Node.js, etc.).

    * Decoupled: Runs as a separate process, isolated from the Kubernetes control plane.

    Cons:

    * Performance Overhead: Every scheduling decision for a relevant pod incurs at least two network round-trips. This latency is unacceptable in large, dynamic clusters with high pod churn.

    * Limited Integration: Extenders have a very coarse-grained view. They cannot hook into other critical parts of the scheduling cycle like binding or preemption.

    * State Management: Maintaining a consistent view of the cluster state in an external service is complex and prone to race conditions.

    The Modern Approach: The Scheduling Framework

    The Scheduling Framework, introduced in Kubernetes v1.15 and graduated to stable, provides a set of well-defined extension points (Go interfaces) that allow you to compile custom logic directly into the scheduler binary. These plugins run in-process, eliminating network overhead and providing deep integration.

    Key Extension Points:

    * QueueSort: Defines the order in which pods are taken from the scheduling queue.

    * PreFilter: Performs preliminary checks on a pod before iterating through nodes.

    * Filter: Similar to the extender's filter, determines if a node can run the pod. Can be stateful.

    * PostFilter: Called if no nodes passed the Filter phase. Useful for preemption logic.

    * PreScore: A pre-computation phase before scoring each node individually.

    * Score: The core of custom logic. Assigns an integer score to each node that passed the filter phase.

    * Reserve: Claims resources on a node before the pod is bound.

    * Permit: A final gate before binding, allowing for asynchronous checks (e.g., waiting for a resource quota to be approved).

    * PreBind / Bind / PostBind: Hooks around the process of binding the pod to the node.
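
    For orientation, the two extension points implemented in this post look like this in the framework package (signatures abridged from k8s.io/kubernetes/pkg/scheduler/framework):

    go
    // Abridged framework interfaces; the full definitions live in
    // k8s.io/kubernetes/pkg/scheduler/framework.
    type Plugin interface {
    	Name() string
    }
    
    type ScorePlugin interface {
    	Plugin
    	// Score ranks a node that passed filtering; results should fall in [0, MaxNodeScore].
    	Score(ctx context.Context, state *CycleState, p *v1.Pod, nodeName string) (int64, *Status)
    	ScoreExtensions() ScoreExtensions
    }
    
    type FilterPlugin interface {
    	Plugin
    	// Filter decides whether the pod can run on the given node.
    	Filter(ctx context.Context, state *CycleState, pod *v1.Pod, nodeInfo *NodeInfo) *Status
    }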

    Decision: For any serious, performance-sensitive use case like GPU scheduling, the Scheduling Framework is the unequivocally correct choice. It offers superior performance, tighter integration, and a more robust model for state management.


    Implementing a GPU Bin-Packing Scheduler Plugin

    Our goal is to create a scheduler that prioritizes nodes with the highest existing GPU allocation. This will consolidate GPU pods, leaving other nodes completely free for large, multi-GPU jobs or for the cluster autoscaler to terminate.

    We'll implement a custom Score plugin. The scoring logic is simple: a node's score is the percentage of its allocatable GPUs already requested by running pods, scaled to the framework's [0, 100] score range. For example, a node with 8 allocatable GPUs and 6 already requested scores (6 × 100) / 8 = 75, while an empty GPU node scores 0, so already-packed nodes win.

    Project Setup

    First, set up a new Go project. We will be building a custom scheduler binary that includes our plugin.

    bash
    mkdir gpu-scheduler
    cd gpu-scheduler
    go mod init github.com/my-org/gpu-scheduler
    # Get the Kubernetes dependencies, pinned to your cluster's version (see the go.mod note below)
    go get k8s.io/component-base@<version>
    go get k8s.io/kubernetes@<version>
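
    Note that k8s.io/kubernetes pins its staging repositories (k8s.io/api, k8s.io/apimachinery, k8s.io/component-base, and friends) at v0.0.0, so your go.mod needs matching replace directives before the commands above will resolve. A minimal sketch, assuming you target a v1.28.x cluster (swap in the versions that match yours):

    go
    // go.mod (excerpt) -- versions are placeholders, keep them aligned with your cluster.
    require k8s.io/kubernetes v1.28.0
    
    replace (
    	k8s.io/api => k8s.io/api v0.28.0
    	k8s.io/apimachinery => k8s.io/apimachinery v0.28.0
    	k8s.io/client-go => k8s.io/client-go v0.28.0
    	k8s.io/component-base => k8s.io/component-base v0.28.0
    	// ...plus the remaining k8s.io/* staging modules listed in the
    	// go.mod of k8s.io/kubernetes for your chosen version.
    )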

    Now, let's create our plugin file: pkg/scheduler/plugin.go.

    The `Score` Plugin Implementation

    Our plugin needs to implement the ScorePlugin interface from the scheduling framework.

    go
    // pkg/scheduler/plugin.go
    package scheduler
    
    import (
    	"context"
    	"fmt"
    
    	v1 "k8s.io/api/core/v1"
    	"k8s.io/apimachinery/pkg/runtime"
    	"k8s.io/klog/v2"
    	"k8s.io/kubernetes/pkg/scheduler/framework"
    )
    
    const (
    	// Name is the name of the plugin used in the KubeSchedulerConfiguration.
    	Name              = "GPUBinPacking"
    	// GPUResourceName is the name of the GPU resource.
    	GPUResourceName = "nvidia.com/gpu"
    )
    
    // GPUBinPacking is a score plugin that favors nodes with higher GPU utilization.
    type GPUBinPacking struct {
    	handle framework.Handle
    }
    
    // Asserts that GPUBinPacking implements the ScorePlugin interface.
    var _ framework.ScorePlugin = &GPUBinPacking{}
    
    // Name returns the name of the plugin.
    func (pl *GPUBinPacking) Name() string {
    	return Name
    }
    
    // Score is the main logic for the plugin. It calculates a score for a node based on its GPU utilization.
    func (pl *GPUBinPacking) Score(ctx context.Context, state *framework.CycleState, p *v1.Pod, nodeName string) (int64, *framework.Status) {
    	nodeInfo, err := pl.handle.SnapshotSharedLister().NodeInfos().Get(nodeName)
    	if err != nil {
    		return 0, framework.AsStatus(fmt.Errorf("getting node %q from snapshot: %w", nodeName, err))
    	}
    
    	// Get the total allocatable GPUs on the node.
    	allocatableGPUs, ok := nodeInfo.Node().Status.Allocatable[GPUResourceName]
    	if !ok || allocatableGPUs.IsZero() {
    		// If the node has no allocatable GPUs, it's not a candidate for GPU pods.
    		// A score of 0 is appropriate, as it doesn't contribute to bin-packing.
    		klog.Infof("Node %s has no allocatable GPUs, scoring 0", nodeName)
    		return 0, framework.NewStatus(framework.Success)
    	}
    
    	totalGPUs := allocatableGPUs.Value()
    	if totalGPUs == 0 {
    		return 0, framework.NewStatus(framework.Success)
    	}
    
    	// Calculate the sum of GPUs requested by existing pods on the node.
    	requestedGPUs := int64(0)
    	for _, podInfo := range nodeInfo.Pods {
    		for _, container := range podInfo.Pod.Spec.Containers {
    			if req, ok := container.Resources.Requests[GPUResourceName]; ok {
    				requestedGPUs += req.Value()
    			}
    		}
    	}
    
    	// The score is the percentage of GPUs used, scaled to the framework's score range [0, 100].
    	// A higher score means the node is more utilized, which is what we want for bin-packing.
    	score := (requestedGPUs * framework.MaxNodeScore) / totalGPUs
    
    	klog.Infof("Node: %s, Allocatable GPUs: %d, Requested GPUs: %d, Score: %d", nodeName, totalGPUs, requestedGPUs, score)
    
    	return score, framework.NewStatus(framework.Success)
    }
    
    // ScoreExtensions returns a ScoreExtensions interface if the plugin implements it.
    func (pl *GPUBinPacking) ScoreExtensions() framework.ScoreExtensions {
    	return nil // We don't need normalization.
    }
    
    // New initializes a new plugin and returns it.
    func New(_ runtime.Object, h framework.Handle) (framework.Plugin, error) {
    	return &GPUBinPacking{
    		handle: h,
    	}, nil
    }
    

    Main Program to Register the Plugin

    Now, we need a main.go to create a new scheduler command and register our custom plugin.

    go
    // cmd/scheduler/main.go
    package main
    
    import (
    	"os"
    
    	"k8s.io/component-base/cli"
    	"k8s.io/kubernetes/cmd/kube-scheduler/app"
    
    	"github.com/my-org/gpu-scheduler/pkg/scheduler"
    )
    
    func main() {
    	// Register the plugin with the scheduler framework registry.
    	command := app.NewSchedulerCommand(
    		app.WithPlugin(scheduler.Name, scheduler.New),
    	)
    
    	if err := cli.RunNoErrOutput(command); err != nil {
    		os.Exit(1)
    	}
    }

    This simple main function imports our plugin package and uses the app.WithPlugin option to make the GPUBinPacking plugin available to the scheduler's configuration.
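
    For a quick out-of-cluster smoke test against a development cluster, you can build and run the binary directly. The config path below is illustrative; when running outside the cluster, point clientConnection.kubeconfig in the config file at a reachable kubeconfig:

    bash
    go build -o bin/gpu-scheduler ./cmd/scheduler
    ./bin/gpu-scheduler --config=./dev/scheduler-config.yaml --v=3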


    Edge Case: Implementing a Topology-Aware `Filter` Plugin

    Bin-packing is great, but what about performance-sensitive workloads that need GPUs connected via NVLink? A standard resources: { requests: { nvidia.com/gpu: 2 } } doesn't guarantee this. We can solve this with a custom Filter plugin that reads a pod annotation.

    Let's assume our nodes are annotated by an admin or a device-plugin helper with their NVLink groupings, for example gpu-topology.my-org.com/nvlink-groups: "0-1,2-3,4-5,6-7". (We use a node annotation rather than a label because commas are not valid characters in label values.)
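
    Applying that annotation by hand might look like this (the node name is illustrative):

    bash
    kubectl annotate node gpu-node-a100-1 \
      gpu-topology.my-org.com/nvlink-groups="0-1,2-3,4-5,6-7"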

    A pod can request a tightly-coupled pair by using an annotation:

    yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: distributed-training-job-1
      annotations:
        gpu-topology.my-org.com/nvlink-count: "2"
    spec:
      schedulerName: gpu-binpacking-scheduler
      containers:
      - name: cuda-worker
        image: nvidia/cuda:11.4.0-base-ubuntu20.04
        resources:
          limits:
            nvidia.com/gpu: "2"
          requests:
            nvidia.com/gpu: "2"

    Let's implement a Filter plugin to enforce this.

    go
    // pkg/scheduler/topology_filter.go
    package scheduler
    
    import (
    	"context"
    	"strconv"
    	"strings"
    
    	v1 "k8s.io/api/core/v1"
    	"k8s.io/apimachinery/pkg/runtime"
    	"k8s.io/klog/v2"
    	"k8s.io/kubernetes/pkg/scheduler/framework"
    )
    
    const (
    	TopologyFilterName         = "GPUTopologyFilter"
    	NVLinkAnnotation           = "gpu-topology.my-org.com/nvlink-count"
    	NVLinkGroupsNodeAnnotation = "gpu-topology.my-org.com/nvlink-groups"
    )
    
    // GPUTopologyFilter checks if a node has enough GPUs within a single NVLink group.
    type GPUTopologyFilter struct{}
    
    var _ framework.FilterPlugin = &GPUTopologyFilter{}
    
    func (f *GPUTopologyFilter) Name() string {
    	return TopologyFilterName
    }
    
    func (f *GPUTopologyFilter) Filter(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
    	// Check if the pod requests NVLink-connected GPUs.
    	nvlinkCountStr, ok := pod.Annotations[NVLinkAnnotation]
    	if !ok {
    		// This pod doesn't care about topology, so we don't filter.
    		return framework.NewStatus(framework.Success)
    	}
    
    	nvlinkCount, err := strconv.Atoi(nvlinkCountStr)
    	if err != nil || nvlinkCount <= 1 {
    		// Invalid annotation or trivial request, pass.
    		return framework.NewStatus(framework.Success)
    	}
    
    	// Check if the node exposes NVLink topology information.
    	node := nodeInfo.Node()
    	if node == nil {
    		return framework.NewStatus(framework.Error, "node not found")
    	}
    	nvlinkGroupsStr, ok := node.Annotations[NVLinkGroupsNodeAnnotation]
    	if !ok {
    		klog.Infof("Node %s is rejected for pod %s because it lacks the NVLink topology annotation", node.Name, pod.Name)
    		return framework.NewStatus(framework.UnschedulableAndUnresolvable, "Node lacks NVLink topology information")
    	}
    
    	// Check if any NVLink group on the node is large enough.
    	nvlinkGroups := strings.Split(nvlinkGroupsStr, ",")
    	for _, group := range nvlinkGroups {
    		gpuIndices := strings.Split(group, "-")
    		if len(gpuIndices) >= nvlinkCount {
    			// Found a suitable group. We don't need to check for available GPUs here, as the
    			// default NodeResourcesFit plugin already handles the total GPU count.
    			// A more advanced implementation would track allocation per-group.
    			klog.Infof("Node %s is a candidate for pod %s, found NVLink group of size %d", node.Name, pod.Name, len(gpuIndices))
    			return framework.NewStatus(framework.Success)
    		}
    	}
    
    	klog.Infof("Node %s is rejected for pod %s, no NVLink group of size %d found", node.Name, pod.Name, nvlinkCount)
    	return framework.NewStatus(framework.Unschedulable, "No available NVLink group of the required size")
    }
    
    // NewTopologyFilter initializes a new plugin and returns it.
    func NewTopologyFilter(_ runtime.Object, _ framework.Handle) (framework.Plugin, error) {
    	return &GPUTopologyFilter{}, nil
    }

    We would then register this new plugin in our main.go as well:

    go
    // cmd/scheduler/main.go (updated)
    // ...
    command := app.NewSchedulerCommand(
        app.WithPlugin(scheduler.Name, scheduler.New), // Our Score plugin
        app.WithPlugin(scheduler.TopologyFilterName, scheduler.NewTopologyFilter), // Our new Filter plugin
    )
    // ...

    Deployment and Configuration in a Production Cluster

    Now that we have the code, we need to package and deploy it.

    1. Packaging with a Dockerfile

    Create a Dockerfile for a minimal, multi-stage Go build.

    dockerfile
    # Build stage
    FROM golang:1.21-alpine AS builder
    
    WORKDIR /app
    
    COPY go.mod go.sum ./
    RUN go mod download
    
    COPY . .
    
    # Build the scheduler binary
    RUN CGO_ENABLED=0 GOOS=linux go build -o /gpu-scheduler ./cmd/scheduler
    
    # Final stage
    FROM alpine:latest
    
    WORKDIR /root/
    
    # Copy the binary from the builder stage
    COPY --from=builder /gpu-scheduler .
    
    # The scheduler binary is the entrypoint
    ENTRYPOINT ["/root/gpu-scheduler"]

    Build and push the image:

    bash
    docker build -t your-registry/gpu-scheduler:v1.0.0 .
    docker push your-registry/gpu-scheduler:v1.0.0

    2. Kubernetes Manifests

    We need a Deployment, RBAC rules, and a KubeSchedulerConfiguration.

    scheduler-config.yaml: This ConfigMap holds the configuration for our scheduler profile.

    yaml
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: gpu-scheduler-config
      namespace: kube-system
    data:
      scheduler-config.yaml: |
        apiVersion: kubescheduler.config.k8s.io/v1
        kind: KubeSchedulerConfiguration
        leaderElection:
          leaderElect: true
        profiles:
          - schedulerName: gpu-binpacking-scheduler
            plugins:
              # Default plugins are enabled at these extension points.
              # We add our custom plugins to the respective phases.
              filter:
                enabled:
                  - name: GPUTopologyFilter
              score:
                enabled:
                  - name: GPUBinPacking
                disabled:
                  # We disable NodeResourcesBalancedAllocation to enforce our bin-packing.
                  - name: NodeResourcesBalancedAllocation
            pluginConfig:
              - name: GPUBinPacking
                args: {}
              - name: GPUTopologyFilter
                args: {}

    scheduler-deployment.yaml: This deploys the scheduler itself.

    yaml
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: gpu-scheduler
      namespace: kube-system
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: gpu-scheduler-role
    rules:
      # Add all the necessary permissions for a scheduler.
      # This is a truncated list for brevity; mirror the built-in system:kube-scheduler
      # and system:volume-scheduler ClusterRoles for the full set.
      - apiGroups: [""]
        resources: ["nodes", "pods", "pods/binding", "replicationcontrollers"]
        verbs: ["get", "list", "watch", "create", "update"]
      - apiGroups: [""]
        resources: ["events"]
        verbs: ["create", "patch", "update"]
      # Required because leaderElection is enabled in the scheduler config.
      - apiGroups: ["coordination.k8s.io"]
        resources: ["leases"]
        verbs: ["get", "create", "update"]
      - apiGroups: ["apps"]
        resources: ["replicasets"]
        verbs: ["get", "list", "watch"]
      # ... more rules required
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: gpu-scheduler-binding
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: gpu-scheduler-role
    subjects:
    - kind: ServiceAccount
      name: gpu-scheduler
      namespace: kube-system
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: gpu-scheduler
      namespace: kube-system
      labels:
        app: gpu-scheduler
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: gpu-scheduler
      template:
        metadata:
          labels:
            app: gpu-scheduler
        spec:
          serviceAccountName: gpu-scheduler
          containers:
          - name: scheduler
            image: your-registry/gpu-scheduler:v1.0.0
            args:
            - --config=/etc/kubernetes/scheduler-config.yaml
            - --v=3 # Verbose logging
            volumeMounts:
            - name: scheduler-config-volume
              mountPath: /etc/kubernetes
          volumes:
          - name: scheduler-config-volume
            configMap:
              name: gpu-scheduler-config

    Apply these manifests:

    bash
    kubectl apply -f scheduler-config.yaml
    kubectl apply -f scheduler-deployment.yaml
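
    Before pointing workloads at the new scheduler, confirm it started cleanly and acquired its leader lease:

    bash
    kubectl -n kube-system rollout status deployment/gpu-scheduler
    kubectl -n kube-system logs deployment/gpu-scheduler | head -n 20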

    3. Using the Custom Scheduler

    To use the scheduler, simply specify schedulerName in your pod spec:

    yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-job-1
    spec:
      schedulerName: gpu-binpacking-scheduler
      containers:
      - name: cuda-container
        image: nvidia/cuda:11.4.0-base-ubuntu20.04
        resources:
          limits:
            nvidia.com/gpu: "2"

    After applying, check the pod's events to confirm it was scheduled by your custom scheduler:

    bash
    kubectl describe pod gpu-job-1
    
    # ... Output ...
    Events:
      Type    Reason     Age    From                        Message
      ----    ------     ----   ----                        -------
      Normal  Scheduled  2s     gpu-binpacking-scheduler    Successfully assigned default/gpu-job-1 to node-g4dn-xlarge-1

    Advanced Considerations and Performance Tuning

    Preemption and Priority

    Our bin-packing strategy can create contention. A high-priority pod might need a spot on a fully-packed node currently occupied by low-priority pods. The Kubernetes scheduler handles this via preemption, but our custom scoring must cooperate.

    By default, our score doesn't consider pod priority. Preemption itself is handled by the DefaultPreemption PostFilter plugin, which remains enabled in our profile: it runs when a pod fails the Filter phase and selects lower-priority victims to evict so the pod can fit. In a heavily customized scheduler you can replace or augment it with your own PostFilter plugin, for example to prefer victims whose eviction frees an entire NVLink group.
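
    For preemption to do anything useful, your training jobs need an explicit priority. A minimal PriorityClass sketch (the name and value are illustrative); pods opt in by setting priorityClassName: gpu-training-high in their spec:

    yaml
    apiVersion: scheduling.k8s.io/v1
    kind: PriorityClass
    metadata:
      name: gpu-training-high
    value: 1000000
    globalDefault: false
    description: "High-priority distributed GPU training jobs."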

    Performance Benchmarking

    It's critical to validate that your custom plugin doesn't introduce scheduling latency.

    * Scheduler Throughput: Use a tool like kubemark (a hollow-node simulator) to create thousands of virtual nodes and pods. Measure the rate at which your scheduler can place pods (pods/second). Compare this against the default scheduler baseline.

    * Scheduling Latency: The scheduler itself exposes Prometheus metrics. Monitor scheduler_scheduling_algorithm_duration_seconds and scheduler_framework_extension_point_duration_seconds to pinpoint latency in your custom plugins. A Score plugin should typically execute in microseconds.

    As a concrete example, a poorly implemented plugin that makes external API calls or performs heavy computations could increase P99 scheduling latency from ~50ms to ~500ms, severely impacting cluster responsiveness.

    Observability

    Instrument your plugin with custom metrics for observability.

  • Expose Metrics: Because the kube-scheduler serves its /metrics endpoint from the component-base metrics registry, register custom collectors through k8s.io/component-base/metrics and legacyregistry rather than the default Prometheus registry, or they will not appear. A minimal sketch (the metric name and file are illustrative):

    go
    // pkg/scheduler/metrics.go
    package scheduler
    
    import (
    	"k8s.io/component-base/metrics"
    	"k8s.io/component-base/metrics/legacyregistry"
    )
    
    var binPackScores = metrics.NewHistogramVec(
    	&metrics.HistogramOpts{
    		Subsystem: "gpuscheduler",
    		Name:      "binpack_node_score",
    		Help:      "Distribution of node scores produced by the GPU bin-packing plugin.",
    		Buckets:   []float64{0, 10, 25, 50, 75, 90, 100},
    	},
    	[]string{"node"}, // watch label cardinality on very large clusters
    )
    
    func init() {
    	legacyregistry.MustRegister(binPackScores)
    }
    
    // ...then in Score(), after computing the score:
    //   binPackScores.WithLabelValues(nodeName).Observe(float64(score))

  • Scrape Metrics: Configure your Prometheus instance to scrape the /metrics endpoint of your custom scheduler's pod.

    This lets you build dashboards that visualize the scheduler's behavior, such as the distribution of scores across nodes, which helps you debug and tune your logic.
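
    A Prometheus scrape job for the scheduler pod might look like the sketch below. It assumes the scheduler serves metrics on its default secure port (10259) and that the scraping service account is authorized to read /metrics; adapt it to your monitoring stack:

    yaml
    scrape_configs:
      - job_name: gpu-scheduler
        scheme: https
        tls_config:
          insecure_skip_verify: true
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        kubernetes_sd_configs:
          - role: pod
            namespaces:
              names: ["kube-system"]
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_label_app]
            regex: gpu-scheduler
            action: keep
          - source_labels: [__meta_kubernetes_pod_ip]
            target_label: __address__
            replacement: ${1}:10259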

    Conclusion: Beyond Bin-Packing

    We've built and deployed a production-ready custom scheduler that implements a GPU-aware bin-packing and topology-aware filtering strategy. This solves a tangible and expensive problem in ML infrastructure.

    The Kubernetes Scheduling Framework is an exceptionally powerful tool. The patterns discussed here can be extended to solve other complex scheduling problems:

    * Network-Aware Scheduling: For distributed data processing (like Spark), a Score plugin could query a service mesh or network monitoring tool to get real-time latency between nodes, then score nodes to minimize cross-node traffic for a given job.

    * I/O-Aware Scheduling: For database workloads, a plugin could favor nodes with local NVMe storage, scoring them higher than nodes with network-attached storage.

    * License-Aware Scheduling: In environments with floating software licenses (e.g., for EDA tools), a Permit plugin could check out a license from a license server before allowing the pod to be bound, preventing jobs from starting only to fail due to license unavailability.

    By moving beyond the default scheduler, you can transform Kubernetes from a generic container orchestrator into a highly optimized, application-aware platform tailored to your specific business and performance requirements.
