K8s Custom Schedulers for GPU Bin-Packing in ML Workloads

Goh Ling Yong

The Default Scheduler's Shortcomings for GPU Workloads

For senior engineers managing large-scale machine learning platforms on Kubernetes, the limitations of the default scheduler become apparent quickly, especially with expensive GPU resources. While excellent for general-purpose stateless applications, its default plugin set (primarily NodeResourcesFit, NodeName, TaintToleration, and a scoring strategy that balances utilization across nodes) falls short for specialized, high-value workloads. The core issues are:

  • GPU Fragmentation: The default scheduler's scoring logic often spreads pods across nodes to balance utilization. Consider a cluster with three 8-GPU nodes. If you schedule three separate 2-GPU jobs, the scheduler might place one on each node. Now, each node has 6 GPUs free. If a high-priority 8-GPU training job arrives, it cannot be scheduled, despite 18 total GPUs being available across the cluster. The resources are fragmented.
  • Lack of Topology Awareness: Modern multi-GPU servers feature high-speed interconnects like NVIDIA's NVLink or AMD's Infinity Fabric. For distributed training workloads (e.g., using torch.distributed), placing cooperating pods on GPUs connected by a high-speed interconnect is critical for performance. The default scheduler has no concept of this sub-node topology. It sees nvidia.com/gpu: 8 as eight fungible resources, not as four NVLink-paired sets.
  • Suboptimal Cost Efficiency (Bin-Packing): For cloud environments, consolidating workloads onto the minimum number of nodes (bin-packing) allows the cluster autoscaler to scale down unused nodes, directly reducing costs. The default scheduler's spreading behavior actively works against this goal.

To solve these production issues, we must move beyond simple nodeSelector or affinity rules and implement custom scheduling logic. This post focuses on building a high-performance, bin-packing scheduler for GPUs using the modern Kubernetes Scheduling Framework.


    Architecture: Scheduler Extenders vs. The Scheduling Framework

    Before we build, it's crucial to understand the two primary mechanisms for customizing scheduling in Kubernetes. While you might encounter legacy systems using extenders, the Scheduling Framework is the standard for modern implementations.

    The Legacy Approach: Scheduler Extenders

    A scheduler extender is an external webhook (HTTP server) that the default scheduler calls out to during its decision-making process. You configure the scheduler to send pod and node information to your extender's endpoints for two main operations:

    * Filter: The extender receives a list of nodes and returns a subset that are eligible to run the pod.

    * Prioritize (Score): The extender receives the filtered list of nodes and returns a score for each, indicating preference.
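
    In configuration terms, an extender is wired into the KubeSchedulerConfiguration. A minimal sketch (the service URL and verb paths below are illustrative assumptions, not a required convention):

    yaml
    apiVersion: kubescheduler.config.k8s.io/v1
    kind: KubeSchedulerConfiguration
    extenders:
      - urlPrefix: "http://gpu-extender.kube-system.svc:8888/scheduler"  # hypothetical extender service
        filterVerb: "filter"          # POST <urlPrefix>/filter
        prioritizeVerb: "prioritize"  # POST <urlPrefix>/prioritize
        weight: 1
        nodeCacheCapable: true
        managedResources:
          - name: "nvidia.com/gpu"
            ignoredByScheduler: false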

    Pros:

    * Language Agnostic: You can write it in any language that can host an HTTP server (Python, Node.js, etc.).

    * Decoupled: Runs as a separate process, isolated from the Kubernetes control plane.

    Cons:

    * Performance Overhead: Every scheduling decision for a relevant pod incurs at least two network round-trips. This latency is unacceptable in large, dynamic clusters with high pod churn.

    * Limited Integration: Extenders have a very coarse-grained view. They cannot hook into other critical parts of the scheduling cycle like binding or preemption.

    * State Management: Maintaining a consistent view of the cluster state in an external service is complex and prone to race conditions.

    The Modern Approach: The Scheduling Framework

    The Scheduling Framework, introduced in Kubernetes v1.15 and graduated to stable, provides a set of well-defined extension points (Go interfaces) that allow you to compile custom logic directly into the scheduler binary. These plugins run in-process, eliminating network overhead and providing deep integration.

    Key Extension Points:

    * QueueSort: Defines the order in which pods are taken from the scheduling queue.

    * PreFilter: Performs preliminary checks on a pod before iterating through nodes.

    * Filter: Similar to the extender's filter, determines if a node can run the pod. Can be stateful.

    * PostFilter: Called if no nodes passed the Filter phase. Useful for preemption logic.

    * PreScore: A pre-computation phase before scoring each node individually.

    * Score: The core of custom logic. Assigns an integer score to each node that passed the filter phase.

    * Reserve: Claims resources on a node before the pod is bound.

    * Permit: A final gate before binding, allowing for asynchronous checks (e.g., waiting for a resource quota to be approved).

    * PreBind / Bind / PostBind: Hooks around the process of binding the pod to the node.
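
    For orientation, the two extension points implemented in this post look like this in the framework package (signatures abridged from k8s.io/kubernetes/pkg/scheduler/framework):

    go
    // Abridged framework interfaces; the full definitions live in
    // k8s.io/kubernetes/pkg/scheduler/framework.
    type Plugin interface {
    	Name() string
    }
    
    type ScorePlugin interface {
    	Plugin
    	// Score ranks a node that passed filtering; results should fall in [0, MaxNodeScore].
    	Score(ctx context.Context, state *CycleState, p *v1.Pod, nodeName string) (int64, *Status)
    	ScoreExtensions() ScoreExtensions
    }
    
    type FilterPlugin interface {
    	Plugin
    	// Filter decides whether the pod can run on the given node.
    	Filter(ctx context.Context, state *CycleState, pod *v1.Pod, nodeInfo *NodeInfo) *Status
    }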

    Decision: For any serious, performance-sensitive use case like GPU scheduling, the Scheduling Framework is the unequivocally correct choice. It offers superior performance, tighter integration, and a more robust model for state management.


    Implementing a GPU Bin-Packing Scheduler Plugin

    Our goal is to create a scheduler that prioritizes nodes with the highest existing GPU allocation. This will consolidate GPU pods, leaving other nodes completely free for large, multi-GPU jobs or for the cluster autoscaler to terminate.

    We'll implement a custom Score plugin. The scoring logic is simple: a node's score is the percentage of its allocatable GPUs already requested by running pods, scaled to the framework's [0, 100] score range. For example, a node with 8 allocatable GPUs and 6 already requested scores (6 × 100) / 8 = 75, while an empty GPU node scores 0, so already-packed nodes win.

    Project Setup

    First, set up a new Go project. We will be building a custom scheduler binary that includes our plugin.

    bash
    mkdir gpu-scheduler
    cd gpu-scheduler
    go mod init github.com/my-org/gpu-scheduler
    # Get the Kubernetes dependencies, pinned to your cluster's version (see the go.mod note below)
    go get k8s.io/component-base@<version>
    go get k8s.io/kubernetes@<version>
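
    Note that k8s.io/kubernetes pins its staging repositories (k8s.io/api, k8s.io/apimachinery, k8s.io/component-base, and friends) at v0.0.0, so your go.mod needs matching replace directives before the commands above will resolve. A minimal sketch, assuming you target a v1.28.x cluster (swap in the versions that match yours):

    go
    // go.mod (excerpt) -- versions are placeholders, keep them aligned with your cluster.
    require k8s.io/kubernetes v1.28.0
    
    replace (
    	k8s.io/api => k8s.io/api v0.28.0
    	k8s.io/apimachinery => k8s.io/apimachinery v0.28.0
    	k8s.io/client-go => k8s.io/client-go v0.28.0
    	k8s.io/component-base => k8s.io/component-base v0.28.0
    	// ...plus the remaining k8s.io/* staging modules listed in the
    	// go.mod of k8s.io/kubernetes for your chosen version.
    )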

    Now, let's create our plugin file: pkg/scheduler/plugin.go.

    The `Score` Plugin Implementation

    Our plugin needs to implement the ScorePlugin interface from the scheduling framework.

    go
    // pkg/scheduler/plugin.go
    package scheduler
    
    import (
    	"context"
    	"fmt"
    
    	v1 "k8s.io/api/core/v1"
    	"k8s.io/apimachinery/pkg/runtime"
    	"k8s.io/klog/v2"
    	"k8s.io/kubernetes/pkg/scheduler/framework"
    )
    
    const (
    	// Name is the name of the plugin used in the KubeSchedulerConfiguration.
    	Name              = "GPUBinPacking"
    	// GPUResourceName is the name of the GPU resource.
    	GPUResourceName = "nvidia.com/gpu"
    )
    
    // GPUBinPacking is a score plugin that favors nodes with higher GPU utilization.
    type GPUBinPacking struct {
    	handle framework.Handle
    }
    
    // Asserts that GPUBinPacking implements the ScorePlugin interface.
    var _ framework.ScorePlugin = &GPUBinPacking{}
    
    // Name returns the name of the plugin.
    func (pl *GPUBinPacking) Name() string {
    	return Name
    }
    
    // Score is the main logic for the plugin. It calculates a score for a node based on its GPU utilization.
    func (pl *GPUBinPacking) Score(ctx context.Context, state *framework.CycleState, p *v1.Pod, nodeName string) (int64, *framework.Status) {
    	nodeInfo, err := pl.handle.SnapshotSharedLister().NodeInfos().Get(nodeName)
    	if err != nil {
    		return 0, framework.AsStatus(fmt.Errorf("getting node %q from snapshot: %w", nodeName, err))
    	}
    
    	// Get the total allocatable GPUs on the node.
    	allocatableGPUs, ok := nodeInfo.Node().Status.Allocatable[GPUResourceName]
    	if !ok || allocatableGPUs.IsZero() {
    		// If the node has no allocatable GPUs, it's not a candidate for GPU pods.
    		// A score of 0 is appropriate, as it doesn't contribute to bin-packing.
    		klog.Infof("Node %s has no allocatable GPUs, scoring 0", nodeName)
    		return 0, framework.NewStatus(framework.Success)
    	}
    
    	totalGPUs := allocatableGPUs.Value()
    	if totalGPUs == 0 {
    		return 0, framework.NewStatus(framework.Success)
    	}
    
    	// Calculate the sum of GPUs requested by existing pods on the node.
    	requestedGPUs := int64(0)
    	for _, podInfo := range nodeInfo.Pods {
    		for _, container := range podInfo.Pod.Spec.Containers {
    			if req, ok := container.Resources.Requests[GPUResourceName]; ok {
    				requestedGPUs += req.Value()
    			}
    		}
    	}
    
    	// The score is the percentage of GPUs used, scaled to the framework's score range [0, 100].
    	// A higher score means the node is more utilized, which is what we want for bin-packing.
    	score := (requestedGPUs * framework.MaxNodeScore) / totalGPUs
    
    	klog.Infof("Node: %s, Allocatable GPUs: %d, Requested GPUs: %d, Score: %d", nodeName, totalGPUs, requestedGPUs, score)
    
    	return score, framework.NewStatus(framework.Success)
    }
    
    // ScoreExtensions returns a ScoreExtensions interface if the plugin implements it.
    func (pl *GPUBinPacking) ScoreExtensions() framework.ScoreExtensions {
    	return nil // We don't need normalization.
    }
    
    // New initializes a new plugin and returns it.
    func New(_ runtime.Object, h framework.Handle) (framework.Plugin, error) {
    	return &GPUBinPacking{
    		handle: h,
    	}, nil
    }
    

    Main Program to Register the Plugin

    Now, we need a main.go to create a new scheduler command and register our custom plugin.

    go
    // cmd/scheduler/main.go
    package main
    
    import (
    	"os"
    
    	"k8s.io/component-base/cli"
    	"k8s.io/kubernetes/cmd/kube-scheduler/app"
    
    	"github.com/my-org/gpu-scheduler/pkg/scheduler"
    )
    
    func main() {
    	// Register the plugin with the scheduler framework registry.
    	command := app.NewSchedulerCommand(
    		app.WithPlugin(scheduler.Name, scheduler.New),
    	)
    
    	if err := cli.RunNoErrOutput(command); err != nil {
    		os.Exit(1)
    	}
    }

    This simple main function imports our plugin package and uses the app.WithPlugin option to make the GPUBinPacking plugin available to the scheduler's configuration.
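
    For a quick out-of-cluster smoke test against a development cluster, you can build and run the binary directly. The config path below is illustrative; when running outside the cluster, point clientConnection.kubeconfig in the config file at a reachable kubeconfig:

    bash
    go build -o bin/gpu-scheduler ./cmd/scheduler
    ./bin/gpu-scheduler --config=./dev/scheduler-config.yaml --v=3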


    Edge Case: Implementing a Topology-Aware `Filter` Plugin

    Bin-packing is great, but what about performance-sensitive workloads that need GPUs connected via NVLink? A standard resources: { requests: { nvidia.com/gpu: 2 } } doesn't guarantee this. We can solve this with a custom Filter plugin that reads a pod annotation.

    Let's assume our nodes are annotated by an admin or a device-plugin helper with their NVLink groupings, for example gpu-topology.my-org.com/nvlink-groups: "0-1,2-3,4-5,6-7". (We use a node annotation rather than a label because commas are not valid characters in label values.)
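
    Applying that annotation by hand might look like this (the node name is illustrative):

    bash
    kubectl annotate node gpu-node-a100-1 \
      gpu-topology.my-org.com/nvlink-groups="0-1,2-3,4-5,6-7"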

    A pod can request a tightly-coupled pair by using an annotation:

    yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: distributed-training-job-1
      annotations:
        gpu-topology.my-org.com/nvlink-count: "2"
    spec:
      schedulerName: gpu-binpacking-scheduler
      containers:
      - name: cuda-worker
        image: nvidia/cuda:11.4.0-base-ubuntu20.04
        resources:
          limits:
            nvidia.com/gpu: "2"
          requests:
            nvidia.com/gpu: "2"

    Let's implement a Filter plugin to enforce this.

    go
    // pkg/scheduler/topology_filter.go
    package scheduler
    
    import (
    	"context"
    	"strconv"
    	"strings"
    
    	v1 "k8s.io/api/core/v1"
    	"k8s.io/apimachinery/pkg/runtime"
    	"k8s.io/klog/v2"
    	"k8s.io/kubernetes/pkg/scheduler/framework"
    )
    
    const (
    	TopologyFilterName         = "GPUTopologyFilter"
    	NVLinkAnnotation           = "gpu-topology.my-org.com/nvlink-count"
    	NVLinkGroupsNodeAnnotation = "gpu-topology.my-org.com/nvlink-groups"
    )
    
    // GPUTopologyFilter checks if a node has enough GPUs within a single NVLink group.
    type GPUTopologyFilter struct{}
    
    var _ framework.FilterPlugin = &GPUTopologyFilter{}
    
    func (f *GPUTopologyFilter) Name() string {
    	return TopologyFilterName
    }
    
    func (f *GPUTopologyFilter) Filter(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
    	// Check if the pod requests NVLink-connected GPUs.
    	nvlinkCountStr, ok := pod.Annotations[NVLinkAnnotation]
    	if !ok {
    		// This pod doesn't care about topology, so we don't filter.
    		return framework.NewStatus(framework.Success)
    	}
    
    	nvlinkCount, err := strconv.Atoi(nvlinkCountStr)
    	if err != nil || nvlinkCount <= 1 {
    		// Invalid annotation or trivial request, pass.
    		return framework.NewStatus(framework.Success)
    	}
    
    	// Check if the node exposes NVLink topology information.
    	node := nodeInfo.Node()
    	if node == nil {
    		return framework.NewStatus(framework.Error, "node not found")
    	}
    	nvlinkGroupsStr, ok := node.Annotations[NVLinkGroupsNodeAnnotation]
    	if !ok {
    		klog.Infof("Node %s is rejected for pod %s because it lacks the NVLink topology annotation", node.Name, pod.Name)
    		return framework.NewStatus(framework.UnschedulableAndUnresolvable, "Node lacks NVLink topology information")
    	}
    
    	// Check if any NVLink group on the node is large enough.
    	nvlinkGroups := strings.Split(nvlinkGroupsStr, ",")
    	for _, group := range nvlinkGroups {
    		gpuIndices := strings.Split(group, "-")
    		if len(gpuIndices) >= nvlinkCount {
    			// Found a suitable group. We don't need to check for available GPUs here, as the
    			// default NodeResourcesFit plugin already handles the total GPU count.
    			// A more advanced implementation would track allocation per-group.
    			klog.Infof("Node %s is a candidate for pod %s, found NVLink group of size %d", node.Name, pod.Name, len(gpuIndices))
    			return framework.NewStatus(framework.Success)
    		}
    	}
    
    	klog.Infof("Node %s is rejected for pod %s, no NVLink group of size %d found", node.Name, pod.Name, nvlinkCount)
    	return framework.NewStatus(framework.Unschedulable, "No available NVLink group of the required size")
    }
    
    // NewTopologyFilter initializes a new plugin and returns it.
    func NewTopologyFilter(_ runtime.Object, _ framework.Handle) (framework.Plugin, error) {
    	return &GPUTopologyFilter{}, nil
    }

    We would then register this new plugin in our main.go as well:

    go
    // cmd/scheduler/main.go (updated)
    // ...
    command := app.NewSchedulerCommand(
        app.WithPlugin(scheduler.Name, scheduler.New), // Our Score plugin
        app.WithPlugin(scheduler.TopologyFilterName, scheduler.NewTopologyFilter), // Our new Filter plugin
    )
    // ...

    Deployment and Configuration in a Production Cluster

    Now that we have the code, we need to package and deploy it.

    1. Packaging with a Dockerfile

    Create a Dockerfile for a minimal, multi-stage Go build.

    dockerfile
    # Build stage
    FROM golang:1.21-alpine AS builder
    
    WORKDIR /app
    
    COPY go.mod go.sum ./
    RUN go mod download
    
    COPY . .
    
    # Build the scheduler binary
    RUN CGO_ENABLED=0 GOOS=linux go build -o /gpu-scheduler ./cmd/scheduler
    
    # Final stage
    FROM alpine:latest
    
    WORKDIR /root/
    
    # Copy the binary from the builder stage
    COPY --from=builder /gpu-scheduler .
    
    # The scheduler binary is the entrypoint
    ENTRYPOINT ["/root/gpu-scheduler"]

    Build and push the image:

    bash
    docker build -t your-registry/gpu-scheduler:v1.0.0 .
    docker push your-registry/gpu-scheduler:v1.0.0

    2. Kubernetes Manifests

    We need a Deployment, RBAC rules, and a KubeSchedulerConfiguration.

    scheduler-config.yaml: This ConfigMap holds the configuration for our scheduler profile.

    yaml
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: gpu-scheduler-config
      namespace: kube-system
    data:
      scheduler-config.yaml: |
        apiVersion: kubescheduler.config.k8s.io/v1
        kind: KubeSchedulerConfiguration
        leaderElection:
          leaderElect: true
        profiles:
          - schedulerName: gpu-binpacking-scheduler
            plugins:
              # Default plugins are enabled at these extension points.
              # We add our custom plugins to the respective phases.
              filter:
                enabled:
                  - name: GPUTopologyFilter
              score:
                enabled:
                  - name: GPUBinPacking
                disabled:
                  # We disable NodeResourcesBalancedAllocation to enforce our bin-packing.
                  - name: NodeResourcesBalancedAllocation
            pluginConfig:
              - name: GPUBinPacking
                args: {}
              - name: GPUTopologyFilter
                args: {}

    scheduler-deployment.yaml: This deploys the scheduler itself.

    yaml
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: gpu-scheduler
      namespace: kube-system
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: gpu-scheduler-role
    rules:
      # Add all the necessary permissions for a scheduler.
      # This is a truncated list for brevity; mirror the built-in system:kube-scheduler
      # and system:volume-scheduler ClusterRoles for the full set.
      - apiGroups: [""]
        resources: ["nodes", "pods", "pods/binding", "replicationcontrollers"]
        verbs: ["get", "list", "watch", "create", "update"]
      - apiGroups: [""]
        resources: ["events"]
        verbs: ["create", "patch", "update"]
      # Required because leaderElection is enabled in the scheduler config.
      - apiGroups: ["coordination.k8s.io"]
        resources: ["leases"]
        verbs: ["get", "create", "update"]
      - apiGroups: ["apps"]
        resources: ["replicasets"]
        verbs: ["get", "list", "watch"]
      # ... more rules required
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: gpu-scheduler-binding
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: gpu-scheduler-role
    subjects:
    - kind: ServiceAccount
      name: gpu-scheduler
      namespace: kube-system
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: gpu-scheduler
      namespace: kube-system
      labels:
        app: gpu-scheduler
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: gpu-scheduler
      template:
        metadata:
          labels:
            app: gpu-scheduler
        spec:
          serviceAccountName: gpu-scheduler
          containers:
          - name: scheduler
            image: your-registry/gpu-scheduler:v1.0.0
            args:
            - --config=/etc/kubernetes/scheduler-config.yaml
            - --v=3 # Verbose logging
            volumeMounts:
            - name: scheduler-config-volume
              mountPath: /etc/kubernetes
          volumes:
          - name: scheduler-config-volume
            configMap:
              name: gpu-scheduler-config

    Apply these manifests:

    bash
    kubectl apply -f scheduler-config.yaml
    kubectl apply -f scheduler-deployment.yaml
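
    Before pointing workloads at the new scheduler, confirm it started cleanly and acquired its leader lease:

    bash
    kubectl -n kube-system rollout status deployment/gpu-scheduler
    kubectl -n kube-system logs deployment/gpu-scheduler | head -n 20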

    3. Using the Custom Scheduler

    To use the scheduler, simply specify schedulerName in your pod spec:

    yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-job-1
    spec:
      schedulerName: gpu-binpacking-scheduler
      containers:
      - name: cuda-container
        image: nvidia/cuda:11.4.0-base-ubuntu20.04
        resources:
          limits:
            nvidia.com/gpu: "2"

    After applying, check the pod's events to confirm it was scheduled by your custom scheduler:

    bash
    kubectl describe pod gpu-job-1
    
    # ... Output ...
    Events:
      Type    Reason     Age    From                        Message
      ----    ------     ----   ----                        -------
      Normal  Scheduled  2s     gpu-binpacking-scheduler    Successfully assigned default/gpu-job-1 to node-g4dn-xlarge-1

    Advanced Considerations and Performance Tuning

    Preemption and Priority

    Our bin-packing strategy can create contention. A high-priority pod might need a spot on a fully-packed node currently occupied by low-priority pods. The Kubernetes scheduler handles this via preemption, but our custom scoring must cooperate.

    By default, our score doesn't consider pod priority. Preemption itself is handled by the DefaultPreemption PostFilter plugin, which remains enabled in our profile: it runs when a pod fails the Filter phase and selects lower-priority victims to evict so the pod can fit. In a heavily customized scheduler you can replace or augment it with your own PostFilter plugin, for example to prefer victims whose eviction frees an entire NVLink group.
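
    For preemption to do anything useful, your training jobs need an explicit priority. A minimal PriorityClass sketch (the name and value are illustrative); pods opt in by setting priorityClassName: gpu-training-high in their spec:

    yaml
    apiVersion: scheduling.k8s.io/v1
    kind: PriorityClass
    metadata:
      name: gpu-training-high
    value: 1000000
    globalDefault: false
    description: "High-priority distributed GPU training jobs."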

    Performance Benchmarking

    It's critical to validate that your custom plugin doesn't introduce scheduling latency.

    * Scheduler Throughput: Use a tool like kubemark (a hollow-node simulator) to create thousands of virtual nodes and pods. Measure the rate at which your scheduler can place pods (pods/second). Compare this against the default scheduler baseline.

    * Scheduling Latency: The scheduler itself exposes Prometheus metrics. Monitor scheduler_scheduling_algorithm_duration_seconds and scheduler_framework_extension_point_duration_seconds to pinpoint latency in your custom plugins. A Score plugin should typically execute in microseconds.

    As a concrete example, a poorly implemented plugin that makes external API calls or performs heavy computations could increase P99 scheduling latency from ~50ms to ~500ms, severely impacting cluster responsiveness.

    Observability

    Instrument your plugin with custom metrics for observability.

  • Expose Metrics: Because the kube-scheduler serves its /metrics endpoint from the component-base metrics registry, register custom collectors through k8s.io/component-base/metrics and legacyregistry rather than the default Prometheus registry, or they will not appear. A minimal sketch (the metric name and file are illustrative):

    go
    // pkg/scheduler/metrics.go
    package scheduler
    
    import (
    	"k8s.io/component-base/metrics"
    	"k8s.io/component-base/metrics/legacyregistry"
    )
    
    var binPackScores = metrics.NewHistogramVec(
    	&metrics.HistogramOpts{
    		Subsystem: "gpuscheduler",
    		Name:      "binpack_node_score",
    		Help:      "Distribution of node scores produced by the GPU bin-packing plugin.",
    		Buckets:   []float64{0, 10, 25, 50, 75, 90, 100},
    	},
    	[]string{"node"}, // watch label cardinality on very large clusters
    )
    
    func init() {
    	legacyregistry.MustRegister(binPackScores)
    }
    
    // ...then in Score(), after computing the score:
    //   binPackScores.WithLabelValues(nodeName).Observe(float64(score))

  • Scrape Metrics: Configure your Prometheus instance to scrape the /metrics endpoint of your custom scheduler's pod.

    This lets you build dashboards that visualize the scheduler's behavior, such as the distribution of scores across nodes, which helps you debug and tune your logic.
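
    A Prometheus scrape job for the scheduler pod might look like the sketch below. It assumes the scheduler serves metrics on its default secure port (10259) and that the scraping service account is authorized to read /metrics; adapt it to your monitoring stack:

    yaml
    scrape_configs:
      - job_name: gpu-scheduler
        scheme: https
        tls_config:
          insecure_skip_verify: true
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        kubernetes_sd_configs:
          - role: pod
            namespaces:
              names: ["kube-system"]
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_label_app]
            regex: gpu-scheduler
            action: keep
          - source_labels: [__meta_kubernetes_pod_ip]
            target_label: __address__
            replacement: ${1}:10259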

    Conclusion: Beyond Bin-Packing

    We've built and deployed a production-ready custom scheduler that implements a GPU-aware bin-packing and topology-aware filtering strategy. This solves a tangible and expensive problem in ML infrastructure.

    The Kubernetes Scheduling Framework is an exceptionally powerful tool. The patterns discussed here can be extended to solve other complex scheduling problems:

    * Network-Aware Scheduling: For distributed data processing (like Spark), a Score plugin could query a service mesh or network monitoring tool to get real-time latency between nodes, then score nodes to minimize cross-node traffic for a given job.

    * I/O-Aware Scheduling: For database workloads, a plugin could favor nodes with local NVMe storage, scoring them higher than nodes with network-attached storage.

    * License-Aware Scheduling: In environments with floating software licenses (e.g., for EDA tools), a Permit plugin could check out a license from a license server before allowing the pod to be bound, preventing jobs from starting only to fail due to license unavailability.

    By moving beyond the default scheduler, you can transform Kubernetes from a generic container orchestrator into a highly optimized, application-aware platform tailored to your specific business and performance requirements.
