GPU Bin-Packing with a Custom Kubernetes Scheduler Framework Plugin

Goh Ling Yong

The Problem: GPU Fragmentation and the Default Scheduler's Shortcomings

In any large-scale Kubernetes cluster running Machine Learning workloads, GPU resource management is a primary operational and financial challenge. The default Kubernetes scheduler (kube-scheduler) scores nodes with a set of plugins that, by default, favor a 'spread' strategy: the NodeResourcesFit plugin's default LeastAllocated scoring strategy (exposed as a standalone NodeResourcesLeastAllocated plugin in older releases) gives higher scores to nodes with more available resources. While this is a reasonable default for general-purpose stateless applications, it's profoundly inefficient for expensive, non-divisible resources like GPUs.

This 'spread' logic leads to GPU fragmentation. Imagine a cluster with three 8-GPU nodes. The scheduler might place one 1-GPU pod on each node. Your cluster is now running three pods but is utilizing three expensive nodes, with 21 GPUs sitting idle. The Cluster Autoscaler sees that all three nodes are in use and will not scale them down. The ideal scenario, bin-packing, would be to place all three pods on a single node, leaving the other two completely empty and prime for termination by the autoscaler. This consolidation directly translates to significant cost savings.

While scheduler extenders were an early solution, they introduce network latency into the critical scheduling path and have limited access to the scheduler's internal state. The modern, performant, and correct approach is to implement a Scheduler Framework Plugin. This article provides a production-focused guide to building, configuring, and deploying a custom scoring plugin in Go to achieve GPU bin-packing.


Architectural Prerequisite: Scheduler Framework vs. Extenders

Before diving into code, it's crucial to understand why we're choosing the Scheduler Framework.

* Scheduler Extenders: An extender is a simple webhook. The scheduler makes an HTTP call to an external service at the Filter and Prioritize (scoring) stages.

  * Pros: Language-agnostic.

  * Cons: High latency (network hop), statelessness (the extender doesn't have access to the scheduler's cache), and increased operational complexity (managing another service).

* Scheduler Framework: A Go plugin interface that allows you to compile your custom logic directly into the scheduler binary. Your code runs in-process, giving it direct access to the scheduler's cache and eliminating network overhead.

  * Pros: High performance, deep integration, access to the full scheduling context.

  * Cons: Requires writing Go, slightly more complex initial setup.

For performance-critical, state-aware logic like resource-based bin-packing, the Scheduler Framework is the only production-viable choice.

Implementing the GPU Bin-Packing Scoring Plugin

Our goal is to create a plugin that implements the Score extension point. This plugin will calculate a score for each node based on its GPU utilization, giving the highest scores to nodes that are already running GPU workloads.

1. Project Setup

First, set up a Go project. We'll need dependencies from k8s.io repositories. Make sure your Go version is compatible with the target Kubernetes version's libraries.

bash
mkdir gpu-bin-pack-scheduler
cd gpu-bin-pack-scheduler
go mod init github.com/your-org/gpu-bin-pack-scheduler

# Get the necessary Kubernetes dependencies (adjust the versions to match your cluster;
# a 1.28 cluster pairs the k8s.io/[email protected] tag with the v0.28.3 staging modules)
go get k8s.io/[email protected]
go get k8s.io/[email protected]
go get k8s.io/[email protected]
go get k8s.io/[email protected]
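
Note that k8s.io/kubernetes is not published as a consumable library: its go.mod pins every k8s.io staging module to v0.0.0, so your own go.mod needs replace directives pointing them at real releases before go get and go mod tidy will succeed. A minimal sketch of that block (the v0.28.3 versions are an assumption; match them to the release you target):

go.mod
// go.mod (excerpt)
replace (
	k8s.io/api => k8s.io/api v0.28.3
	k8s.io/apimachinery => k8s.io/apimachinery v0.28.3
	k8s.io/apiserver => k8s.io/apiserver v0.28.3
	k8s.io/client-go => k8s.io/client-go v0.28.3
	k8s.io/component-base => k8s.io/component-base v0.28.3
	k8s.io/kube-scheduler => k8s.io/kube-scheduler v0.28.3
	// ...and so on for any remaining k8s.io/* staging modules that
	// `go mod tidy` reports as required at v0.0.0.
)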

2. Plugin Boilerplate

Create a file plugin.go. This will contain our plugin's definition, constructor, and the core logic.

go
// plugin.go
package main

import (
	"context"
	"fmt"

	"k8s.io/apimachinery/pkg/runtime"
	v1 "k8s.io/api/core/v1"
	"k8s.io/klog/v2"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

const (
	// Name is the name of the plugin used in the scheduler configuration.
	Name              = "GPUBinPacking"
	// GPUDevice is the name of the GPU resource.
	GPUDevice v1.ResourceName = "nvidia.com/gpu"
)

// GPUBinPacking is a plugin that favors nodes with high GPU utilization.
type GPUBinPacking struct {
	handle framework.Handle
}

var _ framework.ScorePlugin = &GPUBinPacking{}

// New initializes a new plugin and returns it.
func New(_ runtime.Object, h framework.Handle) (framework.Plugin, error) {
	return &GPUBinPacking{
		handle: h,
	}, nil
}

// Name returns the name of the plugin.
func (pl *GPUBinPacking) Name() string {
	return Name
}

// The core scoring logic will go here...

This code sets up the basic structure. We define a GPUBinPacking struct that will satisfy the framework.ScorePlugin interface once the Score method is added in the next step (the var _ assertion enforces this at compile time). The New function is our constructor, and Name provides the identifier we'll use in the scheduler configuration YAML.

3. Implementing the `Score` Logic

The Score method is the heart of our plugin. It's called for every node that passes the Filter phase. It must return a score (from framework.MinNodeScore to framework.MaxNodeScore) and a status. A higher score means the node is a better fit.

Our logic will be:

  • Calculate the total requested GPUs by the incoming pod.
  • For the given node, get its total allocatable GPU capacity.
  • For the given node, get the sum of GPU requests from all pods already running on it.
  • Calculate a score based on the formula: (requestedGPUs + usedGPUs) / capacityGPUs. This rewards nodes where the new pod pushes utilization toward 100%. For example, a node with 8 allocatable GPUs and 5 already requested scores (2 + 5) / 8 = 0.875 for an incoming 2-GPU pod.
  • Scale this score to the [0-100] range required by the framework (0.875 becomes a score of 87).

go
// Add these methods and helpers to plugin.go

// Score is invoked at the score extension point.
func (pl *GPUBinPacking) Score(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) (int64, *framework.Status) {
	nodeInfo, err := pl.handle.SnapshotSharedLister().NodeInfos().Get(nodeName)
	if err != nil {
		return 0, framework.AsStatus(fmt.Errorf("getting node %q from snapshot: %w", nodeName, err))
	}

	node := nodeInfo.Node()
	if node == nil {
		return 0, framework.AsStatus(fmt.Errorf("node %q not found", nodeName))
	}

	// Get allocatable GPUs on the node.
	allocatableGPUs, hasGPU := node.Status.Allocatable[GPUDevice]
	if !hasGPU || allocatableGPUs.Value() == 0 {
		// The node has no GPUs, so this plugin expresses no preference for it.
		return 0, framework.NewStatus(framework.Success)
	}

	requestedGPUs := getPodGPURequest(pod)
	if requestedGPUs == 0 {
		// The pod doesn't need GPUs, so we don't influence its scheduling.
		return 0, framework.NewStatus(framework.Success)
	}

	// Calculate used GPUs on the node.
	usedGPUs := getUsedGPUs(nodeInfo)

	capacity := allocatableGPUs.Value()
	utilization := (float64(usedGPUs) + float64(requestedGPUs)) / float64(capacity)

	// If utilization > 1, the pod won't fit. This should be caught by the Filter
	// phase (e.g. NodeResourcesFit), but as a safeguard we return a 0 score.
	if utilization > 1 {
		klog.V(4).Infof("Pod %s/%s cannot fit on node %s due to GPU capacity", pod.Namespace, pod.Name, nodeName)
		return 0, framework.NewStatus(framework.Success)
	}

	// Scale the score into the [MinNodeScore, MaxNodeScore] range (0-100).
	score := int64(utilization * float64(framework.MaxNodeScore))

	klog.V(5).Infof("GPU bin-packing score for pod %s/%s on node %s: %d", pod.Namespace, pod.Name, nodeName, score)
	return score, framework.NewStatus(framework.Success)
}

// ScoreExtensions of the Score plugin.
func (pl *GPUBinPacking) ScoreExtensions() framework.ScoreExtensions {
	// We don't need normalization, so we return nil.
	return nil
}

// getPodGPURequest returns the total GPU request across a pod's containers.
func getPodGPURequest(pod *v1.Pod) int64 {
	var total int64
	for _, container := range pod.Spec.Containers {
		if req, ok := container.Resources.Requests[GPUDevice]; ok {
			total += req.Value()
		}
	}
	return total
}

// getUsedGPUs returns the sum of GPU requests from pods already assigned to the node.
func getUsedGPUs(nodeInfo *framework.NodeInfo) int64 {
	var used int64
	for _, pod := range nodeInfo.Pods {
		// Ignore terminal pods.
		if pod.Pod.Status.Phase == v1.PodSucceeded || pod.Pod.Status.Phase == v1.PodFailed {
			continue
		}
		used += getPodGPURequest(pod.Pod)
	}
	return used
}

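Before wiring the plugin into a scheduler binary, it's worth sanity-checking the helpers. Below is a minimal test sketch; the gpuPod helper is an illustrative construct for the test, not part of the plugin itself.

go
// plugin_test.go (sketch)
package main

import (
	"testing"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// gpuPod builds a pod requesting n GPUs (test-only helper).
func gpuPod(name string, n int64) *v1.Pod {
	return &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: name},
		Spec: v1.PodSpec{
			Containers: []v1.Container{{
				Name: "c",
				Resources: v1.ResourceRequirements{
					Requests: v1.ResourceList{
						GPUDevice: *resource.NewQuantity(n, resource.DecimalSI),
					},
				},
			}},
		},
	}
}

func TestGPUAccounting(t *testing.T) {
	// A node already running pods that request 3 and 2 GPUs.
	nodeInfo := framework.NewNodeInfo(gpuPod("a", 3), gpuPod("b", 2))

	if got := getUsedGPUs(nodeInfo); got != 5 {
		t.Fatalf("used GPUs = %d, want 5", got)
	}
	if got := getPodGPURequest(gpuPod("incoming", 2)); got != 2 {
		t.Fatalf("requested GPUs = %d, want 2", got)
	}
	// With 8 allocatable GPUs, the Score formula yields (5+2)/8 = 0.875 -> a score of 87.
}
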
4. Main Function to Register the Plugin

Finally, we need a main.go file to register our plugin with the scheduler's command-line interface. Because plugin.go also declares package main, main.go can reference Name and New directly, with no import of our own module.

go
// main.go
package main

import (
	"os"

	"k8s.io/component-base/cli"
	"k8s.io/kubernetes/cmd/kube-scheduler/app"
)

func main() {
	// Register the plugin factory under its name. Name and New are defined in
	// plugin.go, which is part of this same main package.
	command := app.NewSchedulerCommand(
		app.WithPlugin(Name, New),
	)

	if err := cli.RunNoErrOutput(command); err != nil {
		os.Exit(1)
	}
}

Now you can build this into a container image.

Dockerfile
FROM golang:1.21-alpine AS builder

WORKDIR /go/src/app
COPY go.mod go.sum ./
RUN go mod download

COPY . .

RUN CGO_ENABLED=0 GOOS=linux go build -o /usr/local/bin/gpu-bin-pack-scheduler .

# The final stage ships only the binary; naming it kube-scheduler is a
# convention that mirrors the upstream component, not a requirement.
FROM alpine:latest
COPY --from=builder /usr/local/bin/gpu-bin-pack-scheduler /usr/local/bin/kube-scheduler

ENTRYPOINT ["/usr/local/bin/kube-scheduler"]

Deployment and Configuration

Running a custom scheduler isn't as simple as deploying a pod. You need to provide a configuration file that tells the scheduler binary which profile and plugins to use.

1. KubeSchedulerConfiguration

This ConfigMap defines a scheduler profile named gpu-bin-packer. The existing kube-scheduler keeps serving the default-scheduler profile, so our instance only needs to define the new one.

Crucially, we:

  • Disable the score extension of the default NodeResourcesFit plugin, whose LeastAllocated strategy is what spreads pods (its Filter extension stays active).
  • Enable our GPUBinPacking plugin in the score phase.
yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-scheduler-config
  namespace: kube-system
data:
  scheduler-config.yaml: |
    apiVersion: kubescheduler.config.k8s.io/v1
    kind: KubeSchedulerConfiguration
    leaderElection:
      leaderElect: true
      resourceNamespace: kube-system
      # Use a dedicated Lease so we don't contend with the default kube-scheduler's lock.
      resourceName: gpu-bin-pack-scheduler
    profiles:
      # The default kube-scheduler keeps handling the default-scheduler profile,
      # so this instance only defines the bin-packing profile.
      - schedulerName: gpu-bin-packer
        plugins:
          # Default plugins stay enabled in each phase.
          # We only need to specify what we want to change.
          score:
            disabled:
              # Disable the default scoring (LeastAllocated) that spreads pods
              - name: NodeResourcesFit
            enabled:
              - name: GPUBinPacking
                weight: 100 # Give our plugin maximum weight
              # We still want other default scoring plugins to run
              - name: NodeResourcesBalancedAllocation
                weight: 5
              - name: ImageLocality
                weight: 5

2. Deployment of the Custom Scheduler

Next, we deploy our custom scheduler as a Deployment in the kube-system namespace. It will run alongside the default kube-scheduler.

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-bin-pack-scheduler
  namespace: kube-system
  labels:
    component: gpu-bin-pack-scheduler
spec:
  replicas: 1
  selector:
    matchLabels:
      component: gpu-bin-pack-scheduler
  template:
    metadata:
      labels:
        component: gpu-bin-pack-scheduler
    spec:
      serviceAccountName: kube-scheduler # Use a ServiceAccount with appropriate permissions
      containers:
        - name: scheduler-plugin
          image: your-registry/gpu-bin-pack-scheduler:latest
          args:
            - --config=/etc/kubernetes/scheduler-config.yaml
            - --v=4 # Increase verbosity for debugging
          resources:
            requests:
              cpu: "100m"
              memory: "256Mi"
          volumeMounts:
            - name: scheduler-config-volume
              mountPath: /etc/kubernetes
      volumes:
        - name: scheduler-config-volume
          configMap:
            name: gpu-scheduler-config
Note: You will need to create a ClusterRole and ClusterRoleBinding that grants your scheduler's ServiceAccount permissions equivalent to the default system:kube-scheduler role.
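
Rather than authoring those rules from scratch, one common pattern is to bind the ServiceAccount to the built-in system:kube-scheduler and system:volume-scheduler ClusterRoles, plus a small Role for the dedicated leader-election Lease configured above. The manifest below is a sketch of that approach; the binding names are illustrative, and the ServiceAccount name matches the Deployment above.

yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kube-scheduler # referenced by the Deployment above
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: gpu-bin-pack-scheduler-as-kube-scheduler
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:kube-scheduler
subjects:
  - kind: ServiceAccount
    name: kube-scheduler
    namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: gpu-bin-pack-scheduler-as-volume-scheduler
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:volume-scheduler
subjects:
  - kind: ServiceAccount
    name: kube-scheduler
    namespace: kube-system
---
# Permissions for the custom leader-election Lease (gpu-bin-pack-scheduler).
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: gpu-bin-pack-scheduler-leases
  namespace: kube-system
rules:
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["create", "get", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: gpu-bin-pack-scheduler-leases
  namespace: kube-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: gpu-bin-pack-scheduler-leases
subjects:
  - kind: ServiceAccount
    name: kube-scheduler
    namespace: kube-system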

3. Using the Custom Scheduler

To have a pod scheduled by our new scheduler, simply specify its schedulerName in the pod spec:

yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload-1
spec:
  schedulerName: gpu-bin-packer # This is the key!
  containers:
    - name: cuda-container
      image: nvidia/cuda:11.4.0-base-ubuntu20.04
      command: ["sleep", "3600"]
      resources:
        limits:
          nvidia.com/gpu: 1

When you apply this pod, you can check the logs of your gpu-bin-pack-scheduler pod to see the scoring decisions.

Advanced Considerations and Edge Cases

Performance Profiling

The scheduler is a critical, latency-sensitive component. A slow scoring plugin can degrade the performance of the entire cluster. Use Go's built-in pprof to profile your plugin under load; the kube-scheduler serves /debug/pprof endpoints on its secure port when profiling is enabled (the default). Ensure your scoring logic is efficient and avoids expensive computations or I/O. Our implementation relies on the scheduler's in-memory cache (SnapshotSharedLister), which is extremely fast.
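
Beyond live profiling, a quick Go microbenchmark of the per-node scan helps keep regressions visible in CI. A minimal sketch, assuming it sits in plugin_test.go alongside the imports from the test above:

go
// BenchmarkGetUsedGPUs measures the helper that walks every pod on a node.
func BenchmarkGetUsedGPUs(b *testing.B) {
	// Simulate a busy node with 50 single-GPU pods already assigned.
	oneGPU := v1.ResourceList{GPUDevice: *resource.NewQuantity(1, resource.DecimalSI)}
	pods := make([]*v1.Pod, 0, 50)
	for i := 0; i < 50; i++ {
		pods = append(pods, &v1.Pod{
			Spec: v1.PodSpec{
				Containers: []v1.Container{{
					Resources: v1.ResourceRequirements{Requests: oneGPU},
				}},
			},
		})
	}
	nodeInfo := framework.NewNodeInfo(pods...)

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		_ = getUsedGPUs(nodeInfo)
	}
}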

Interaction with the Cluster Autoscaler

This is the entire point of our exercise. With the bin-packing scheduler consolidating GPU pods onto a minimal set of nodes, other GPU nodes will become completely empty. The Cluster Autoscaler will identify these nodes as unneeded (all pods can be scheduled elsewhere) and, after a configurable timeout (e.g., 10 minutes), will terminate them. When new GPU pods arrive, the Cluster Autoscaler will provision new nodes as required. This pack-and-scale cycle is the key to cost optimization.

Edge Case: Multi-GPU Pods and NUMA Topology

Our current scoring logic is simple. It doesn't account for NUMA topology. A high-performance computing (HPC) or distributed training job might require 4 GPUs that are all on the same NUMA node for low-latency interconnect. An 8-GPU machine might have 4 GPUs on NUMA node 0 and 4 on NUMA node 1. If 1 GPU is already used on each NUMA node, the machine has 6 free GPUs, but cannot satisfy a 4-GPU same-NUMA-node request.

To solve this, you would need to extend your scheduler with a custom Filter plugin:

  • The pod would need to express its topology requirements (e.g., via annotations).
  • The Filter plugin would read these annotations.
  • It would need detailed node topology information, perhaps from a device plugin such as the NVIDIA GPU Operator, which can expose topology as node labels.
  • The Filter plugin would then check whether the node has a NUMA node with enough free GPUs to satisfy the pod's request; if not, the node is filtered out before the scoring phase (see the sketch below).

This demonstrates the power of combining different plugin extension points to solve complex, real-world scheduling problems.
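
As a rough illustration, here is what such a Filter plugin could look like. It would live in the same package and be registered with a second app.WithPlugin call (and enabled under plugins.filter in the profile); the annotation and label keys below are hypothetical stand-ins for whatever your topology-aware tooling actually publishes.

go
// numa_filter.go (sketch)
package main

import (
	"context"
	"fmt"
	"strconv"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// Hypothetical keys; substitute the ones your device-plugin stack exposes.
const (
	sameNUMAGPUsAnnotation  = "example.com/same-numa-gpus"         // pod: GPUs wanted on one NUMA node
	maxFreeGPUsPerNUMALabel = "example.com/max-free-gpus-per-numa" // node: best single-NUMA availability
)

// NUMAFilter rejects nodes that cannot place the requested GPUs on a single NUMA node.
type NUMAFilter struct{}

var _ framework.FilterPlugin = &NUMAFilter{}

func (pl *NUMAFilter) Name() string { return "NUMAGPUFilter" }

// Filter is invoked at the filter extension point, once per node.
func (pl *NUMAFilter) Filter(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	want, ok := pod.Annotations[sameNUMAGPUsAnnotation]
	if !ok {
		return nil // no topology requirement; the node is acceptable
	}
	wanted, err := strconv.ParseInt(want, 10, 64)
	if err != nil {
		return framework.NewStatus(framework.UnschedulableAndUnresolvable, "invalid same-numa-gpus annotation")
	}

	node := nodeInfo.Node()
	if node == nil {
		return framework.AsStatus(fmt.Errorf("node not found"))
	}

	free, err := strconv.ParseInt(node.Labels[maxFreeGPUsPerNUMALabel], 10, 64)
	if err != nil || free < wanted {
		// Filtered out before the scoring phase ever runs.
		return framework.NewStatus(framework.Unschedulable, "no NUMA node with enough free GPUs")
	}
	return nil
}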

Conclusion

Moving beyond the default Kubernetes scheduler is a significant step toward a mature, optimized, and cost-effective infrastructure, especially for specialized workloads. By leveraging the Scheduler Framework, you can inject domain-specific logic directly into the heart of Kubernetes. The GPU bin-packing plugin we developed is not a theoretical exercise; it is a practical, high-impact solution to a common problem in MLOps. It directly enables better hardware utilization and facilitates aggressive, cost-saving autoscaling, turning a simple scheduling tweak into a substantial financial and operational win.
