Custom Kubernetes Scheduler Plugins for GPU-Aware Bin Packing

Goh Ling Yong

The Inefficiency of Default Scheduling in GPU-Intensive Clusters

As a senior engineer responsible for a Kubernetes cluster that serves machine learning training jobs, you've likely encountered a frustrating and costly paradox: your cluster reports low overall GPU utilization, yet new large-scale training pods frequently get stuck in a Pending state due to Insufficient nvidia.com/gpu resources. This isn't a bug; it's a direct consequence of the default Kubernetes scheduler's design philosophy.

The default scheduler, whose NodeResourcesFit plugin scores nodes with a LeastAllocated strategy by default, prioritizes spreading workloads evenly across available nodes. For stateless web applications, this is a sensible strategy for resilience. For GPU workloads, it's catastrophic. This "spreading" behavior leads to resource fragmentation: a cluster of nodes, each with a small number of GPUs available, becomes incapable of scheduling a pod that requires many GPUs, even when the total number of free GPUs across the cluster is more than sufficient.

Consider this scenario:

  • Cluster: 4 nodes, each with 8 NVIDIA A100 GPUs.
  • Total Capacity: 32 GPUs.
  • Workload: Four pods arrive, each requesting 4 GPUs (nvidia.com/gpu: 4).
The default LeastAllocated scoring will likely place one pod on each of the four nodes. The state of the cluster becomes:

  • Node 1: 4/8 GPUs used.
  • Node 2: 4/8 GPUs used.
  • Node 3: 4/8 GPUs used.
  • Node 4: 4/8 GPUs used.
Now a critical, high-priority pod arrives requesting 8 GPUs. Despite 16 GPUs being technically free in the cluster, no single node can satisfy the request. The pod is unschedulable. This fragmentation translates directly into wasted resources, delayed experiments, and inflated cloud bills.

    The solution is to invert the scheduler's logic for GPU pods. Instead of spreading them out, we must pack them tightly onto as few nodes as possible. This strategy, known as bin packing, keeps nodes either fully utilized or completely empty, maximizing the number of nodes with full GPU capacity available for large jobs. Kubernetes does ship a generic knob for this (the NodeResourcesFit plugin's MostAllocated scoring strategy), but to get GPU-specific behavior, such as scoring only nvidia.com/gpu and later handling heterogeneous GPU types, we will build a custom scheduler plugin.

    This article provides a production-focused guide to implementing such a plugin. We will build a GPUBinPack scheduler plugin from scratch in Go, integrate it into the Kubernetes Scheduler Framework, and deploy it to a cluster. We will go beyond a simple implementation to discuss performance, edge cases like heterogeneous GPU types, and the critical interaction with preemption.


    The Kubernetes Scheduler Framework: Our Extension Points

    Before we write any code, it's crucial to understand the specific extension points within the kube-scheduler that we will leverage. The modern scheduler is not a monolith but a collection of plugins operating at different phases of the scheduling cycle. For our bin packing goal, the most important phases are Filter and Score.

  • Filter: This phase is a predicate step. Plugins in this phase inspect a pod and a node and return a status of Success, Unschedulable, or Error. If any filter plugin returns Unschedulable, the node is immediately eliminated from consideration for the current pod. The default NodeResourcesFit plugin already handles basic GPU availability checks here, so we likely won't need a custom filter.
  • Score: This is the heart of our implementation. After filtering, the scheduler is left with a list of candidate nodes that can run the pod. The Score phase ranks these candidates: each score plugin inspects a node and returns an integer score (default range 0-100). The scheduler sums the weighted scores from all active score plugins for each node, and the node with the highest total is chosen.
  • PreScore: An optional but important optimization phase. If scoring needs the same pre-computed data for every node, a PreScore plugin can compute it once per scheduling cycle and cache it in the CycleState. This avoids redundant calculations in the Score phase, which is called once per candidate node.

    Our strategy is to create a custom Score plugin that gives higher scores to nodes that are already heavily utilized with GPU workloads, guiding the scheduler to "pack" new GPU pods onto those nodes. We will primarily implement the ScorePlugin interface, but we will also implement a PreScorePlugin to demonstrate production-grade performance optimization.

    go
    // A brief look at the interfaces we'll be implementing from k8s.io/kubernetes/pkg/scheduler/framework
    
    // ScorePlugin is an interface for Score plugins.
    // The score indicates how desirable the node is. The scheduler will choose a node with the highest score.
    type ScorePlugin interface {
    	Plugin
    	Score(ctx context.Context, state *CycleState, p *v1.Pod, nodeName string) (int64, *Status)
    	ScoreExtensions() ScoreExtensions
    }
    
    // PreScorePlugin is an interface for PreScore plugins.
    // These plugins are called before Score plugins to pre-compute data that can be shared across Score plugins.
    type PreScorePlugin interface {
    	Plugin
    	PreScore(ctx context.Context, state *CycleState, pod *v1.Pod, nodes []*v1.Node) *Status
    }

    By disabling the default LeastAllocated scoring of NodeResourcesFit and substituting our own GPUBinPack Score plugin, we can fundamentally change the scheduling behavior for our target workloads.


    Implementation: The `GPUBinPack` Go Plugin

    Let's begin building the plugin. We'll start with a standard Go project structure and add the necessary Kubernetes dependencies.

    Project Setup:

    bash
    # Create project directory
    mkdir gpu-bin-pack-scheduler
    cd gpu-bin-pack-scheduler
    
    # Initialize Go module
    go mod init github.com/your-org/gpu-bin-pack-scheduler
    
     # Get Kubernetes dependencies (pin versions that match your target cluster).
     # Note: the k8s.io/kubernetes module is versioned v1.x; its staging repos use v0.x.
     go get k8s.io/[email protected]
    go get k8s.io/[email protected]
    go get k8s.io/[email protected]
    go get k8s.io/[email protected]

    The Core Logic: GPUBinPack Struct and Score Function

    We'll create a file pkg/scheduler/plugin.go to house our implementation. The core idea of our scoring algorithm is:

    Score = (requestedGPUsOnNode + incomingPodGPUs) / allocatableGPUsOnNode * MaxNodeScore

    A higher ratio indicates a more "full" node, which is exactly what we want for bin packing. For example, with an incoming 4-GPU pod, a node with 4 of 8 GPUs already requested scores ((4 + 4) / 8) * 100 = 100, while an empty 8-GPU node scores only 50.

    go
    // pkg/scheduler/plugin.go
    package scheduler
    
    import (
    	"context"
    	"fmt"
    
    	v1 "k8s.io/api/core/v1"
    	"k8s.io/apimachinery/pkg/runtime"
    	"k8s.io/klog/v2"
    	"k8s.io/kubernetes/pkg/scheduler/framework"
    )
    
    const (
    	// Name is the name of the plugin used in the plugin registry and configurations.
    	Name = "GPUBinPack"
    	// gpuResourceName is the name of the GPU resource we are targeting.
    	gpuResourceName = "nvidia.com/gpu"
    )
    
    // GPUBinPack is a scheduler plugin that implements bin packing for GPUs.
    type GPUBinPack struct {
    	h framework.Handle
    }
    
    var _ framework.PreScorePlugin = &GPUBinPack{}
    var _ framework.ScorePlugin = &GPUBinPack{}
    
    // Name returns the name of the plugin.
    func (pl *GPUBinPack) Name() string {
    	return Name
    }
    
    // New initializes a new plugin and returns it.
    func New(_ runtime.Object, h framework.Handle) (framework.Plugin, error) {
    	return &GPUBinPack{
    		h: h,
    	}, nil
    }
    
    // stateData is a struct to hold pre-computed node GPU information.
    type stateData struct {
    	requestedGPUs int64
    }
    
    // Clone is required to implement framework.StateData.
    func (s *stateData) Clone() framework.StateData {
    	return s
    }
    
    // PreScore is called once per scheduling cycle to pre-compute and cache node data.
    func (pl *GPUBinPack) PreScore(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodes []*v1.Node) *framework.Status {
    	// If the incoming pod doesn't request GPUs, this plugin is a no-op for it.
    	if !requestsGPU(pod) {
    		return framework.NewStatus(framework.Skip)
    	}
    
    	nodeInfoList, err := pl.h.SnapshotSharedLister().NodeInfos().List()
    	if err != nil {
    		return framework.AsStatus(fmt.Errorf("listing node infos: %w", err))
    	}
    
    	for _, nodeInfo := range nodeInfoList {
    		node := nodeInfo.Node()
    		if node == nil {
    			continue
    		}
    
    		requestedGPUs := int64(0)
    		for _, p := range nodeInfo.Pods {
    			requestedGPUs += podRequestedGpu(p.Pod)
    		}
    		
    		state.Write(pl.preScoreStateKey(node.Name), &stateData{requestedGPUs: requestedGPUs})
    	}
    	return nil
    }
    
    func (pl *GPUBinPack) preScoreStateKey(nodeName string) framework.StateKey {
    	return framework.StateKey(fmt.Sprintf("%s/%s", Name, nodeName))
    }
    
    // Score is called for each node that passed the Filter phase.
    func (pl *GPUBinPack) Score(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) (int64, *framework.Status) {
     	// PreScore returns Skip for non-GPU pods, so Score is normally not
     	// called for them. Guard defensively anyway; note that Score plugins
     	// may not return a Skip status, so we return a neutral score instead.
     	if !requestsGPU(pod) {
     		return 0, nil
     	}
    
    	nodeInfo, err := pl.h.SnapshotSharedLister().NodeInfos().Get(nodeName)
    	if err != nil {
    		return 0, framework.AsStatus(fmt.Errorf("getting node %q from snapshot: %w", nodeName, err))
    	}
    
     	node := nodeInfo.Node()
     	if node == nil {
     		return 0, framework.AsStatus(fmt.Errorf("node %q not found in snapshot", nodeName))
     	}
     
     	allocatableGPUs, hasGPU := node.Status.Allocatable[gpuResourceName]
     	if !hasGPU || allocatableGPUs.Value() == 0 {
     		// The NodeResourcesFit filter should already have excluded GPU-less
     		// nodes for GPU pods, but guard against division by zero regardless.
     		return 0, nil
     	}
    	
     	// Retrieve pre-computed data from CycleState.
     	data, err := state.Read(pl.preScoreStateKey(nodeName))
     	if err != nil {
     		// PreScore should have written this key; treat a miss as an error for
     		// this pod's scheduling cycle rather than silently recomputing.
     		return 0, framework.AsStatus(fmt.Errorf("reading pre-scored state for node %q: %w", nodeName, err))
     	}
    
    	preScoredData, ok := data.(*stateData)
    	if !ok {
    		return 0, framework.AsStatus(fmt.Errorf("invalid state data: %T", data))
    	}
    
    	requestedGPUsOnNode := preScoredData.requestedGPUs
    	incomingPodGPUs := podRequestedGpu(pod)
    	totalAllocatableGPUs := allocatableGPUs.Value()
    
    	// The core bin packing logic
    	finalRequestedGPUs := requestedGPUsOnNode + incomingPodGPUs
    	score := (float64(finalRequestedGPUs) / float64(totalAllocatableGPUs)) * float64(framework.MaxNodeScore)
    
    	klog.Infof("GPUBinPack Score for node '%s': ((%d + %d) / %d) * 100 = %d", nodeName, requestedGPUsOnNode, incomingPodGPUs, totalAllocatableGPUs, int64(score))
    
    	return int64(score), nil
    }
    
    // ScoreExtensions returns a ScoreExtensions interface if the plugin implements one.
    func (pl *GPUBinPack) ScoreExtensions() framework.ScoreExtensions {
    	return nil
    }
    
    // Helper function to check if a pod requests GPUs.
    func requestsGPU(pod *v1.Pod) bool {
    	for _, container := range pod.Spec.Containers {
    		if _, ok := container.Resources.Limits[gpuResourceName]; ok {
    			return true
    		}
    	}
    	return false
    }
    
    // Helper function to get the number of GPUs a pod requests.
    func podRequestedGpu(pod *v1.Pod) int64 {
    	totalGpus := int64(0)
    	for _, container := range pod.Spec.Containers {
    		if gpu, ok := container.Resources.Limits[gpuResourceName]; ok {
    			totalGpus += gpu.Value()
    		}
    	}
    	return totalGpus
    }

    Analysis of the Implementation:

  • PreScore Optimization: Instead of calculating the current GPU usage of a node every time Score is called (once per node), we do it once for all nodes in PreScore. We iterate through the nodeInfoList, calculate the sum of GPU requests from all pods currently on that node, and store it in the CycleState using a unique key. This significantly reduces redundant computation, especially in large clusters.
  • Score Logic: The Score function retrieves the pre-calculated usage from the CycleState. It then adds the incoming pod's GPU request and normalizes this value against the node's total allocatable GPUs. The result is scaled to the MaxNodeScore (which is 100). Nodes that become fuller after scheduling the pod receive a higher score.
  • Graceful Skip: The requestsGPU helper ensures that our plugin doesn't interfere with the scheduling of non-GPU pods. By returning framework.Skip from PreScore, we tell the scheduler framework to bypass our Score plugin entirely for that pod, letting the remaining score plugins rank nodes for it as usual.
  • Edge Case Handling: We explicitly check if a node has any allocatable GPUs (node.Status.Allocatable[gpuResourceName]). This prevents division-by-zero errors and ensures our plugin doesn't score nodes that are irrelevant to our workload.
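
    Before wiring the plugin into a scheduler binary, the helpers are worth pinning down with a quick test. A minimal sketch, assuming the package layout above:

     go
     // pkg/scheduler/plugin_test.go
     package scheduler
     
     import (
     	"testing"
     
     	v1 "k8s.io/api/core/v1"
     	"k8s.io/apimachinery/pkg/api/resource"
     )
     
     func TestGPUHelpers(t *testing.T) {
     	// A pod requesting 4 GPUs via limits (the device-plugin convention).
     	gpuPod := &v1.Pod{
     		Spec: v1.PodSpec{
     			Containers: []v1.Container{{
     				Resources: v1.ResourceRequirements{
     					Limits: v1.ResourceList{
     						gpuResourceName: resource.MustParse("4"),
     					},
     				},
     			}},
     		},
     	}
     	if !requestsGPU(gpuPod) {
     		t.Fatal("expected gpuPod to be detected as a GPU pod")
     	}
     	if got := podRequestedGpu(gpuPod); got != 4 {
     		t.Fatalf("podRequestedGpu = %d, want 4", got)
     	}
     
     	// A pod with no GPU limits must be ignored by the plugin.
     	cpuPod := &v1.Pod{Spec: v1.PodSpec{Containers: []v1.Container{{}}}}
     	if requestsGPU(cpuPod) {
     		t.Fatal("expected cpuPod not to be detected as a GPU pod")
     	}
     }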
    Main Program to Register the Plugin

    Now we need an entrypoint that registers our plugin with the scheduler framework.

    go
    // cmd/scheduler/main.go
    package main
    
    import (
    	"os"
    
    	"k8s.io/component-base/cli"
    	"k8s.io/kubernetes/cmd/kube-scheduler/app"
    
    	"github.com/your-org/gpu-bin-pack-scheduler/pkg/scheduler"
    )
    
    func main() {
    	// Register the plugin with the scheduler framework registry
    	command := app.NewSchedulerCommand(
    		app.WithPlugin(scheduler.Name, scheduler.New),
    	)
    
    	code := cli.Run(command)
    	os.Exit(code)
    }

    This main.go uses the standard kube-scheduler command but includes our plugin in the registry, making it available for selection via the scheduler configuration.
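
    Before containerizing, a local smoke test confirms the plugin actually registers. Below is a minimal configuration sketch; the kubeconfig path is illustrative, and the --authentication-skip-lookup flag avoids needing in-cluster authentication config when running outside the cluster:

     yaml
     # local-config.yaml: minimal profile for a local smoke test
     apiVersion: kubescheduler.config.k8s.io/v1
     kind: KubeSchedulerConfiguration
     clientConnection:
       kubeconfig: /home/you/.kube/config
     leaderElection:
       leaderElect: false
     profiles:
       - schedulerName: gpu-bin-pack-scheduler
         plugins:
           preScore:
             enabled:
               - name: GPUBinPack
           score:
             enabled:
               - name: GPUBinPack

     bash
     # If GPUBinPack were not registered, the scheduler would refuse to start
     # with a configuration error naming the unknown plugin.
     go run ./cmd/scheduler --config=local-config.yaml --authentication-skip-lookup=true --v=3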


    Building and Deploying the Custom Scheduler

    With the code complete, we now package it into a container and deploy it to our Kubernetes cluster.

    Dockerfile for an Optimized Build

    We use a multi-stage Dockerfile to keep the final image small and secure.

    dockerfile
    # Dockerfile
    
    # Build stage
    FROM golang:1.21-alpine AS builder
    
    WORKDIR /app
    
    COPY go.mod go.sum ./
    RUN go mod download
    
    COPY . .
    
    # Build the binary
    RUN CGO_ENABLED=0 GOOS=linux go build -a -o gpu-bin-pack-scheduler ./cmd/scheduler
    
    # Final stage
    FROM alpine:latest
    
    WORKDIR /root/
    
    # Copy the binary from the builder stage
    COPY --from=builder /app/gpu-bin-pack-scheduler .
    
    # Run the scheduler
    ENTRYPOINT ["./gpu-bin-pack-scheduler"]

    Build and push the image to a registry accessible by your cluster:

    bash
    docker build -t your-registry/gpu-bin-pack-scheduler:v1.0.0 .
    docker push your-registry/gpu-bin-pack-scheduler:v1.0.0

    Kubernetes Manifests for Deployment

    Deploying a custom scheduler requires three key components:

  • RBAC: a ServiceAccount bound to the built-in system:kube-scheduler and system:volume-scheduler ClusterRoles, giving the scheduler permission to read cluster state (pods, nodes) and write Binding objects.
  • Configuration: a KubeSchedulerConfiguration file (provided via a ConfigMap) that defines our scheduler profile, enabling the GPUBinPack plugin and disabling conflicting default score plugins.
  • Deployment: a Deployment to run the scheduler pod itself.

    Here are the complete manifests:

    yaml
    # scheduler-manifests.yaml
    
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: gpu-bin-pack-scheduler-sa
      namespace: kube-system
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: gpu-bin-pack-scheduler-as-kube-scheduler
    subjects:
    - kind: ServiceAccount
      name: gpu-bin-pack-scheduler-sa
      namespace: kube-system
    roleRef:
      kind: ClusterRole
      name: system:kube-scheduler
      apiGroup: rbac.authorization.k8s.io
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: gpu-bin-pack-scheduler-as-volume-scheduler
    subjects:
    - kind: ServiceAccount
      name: gpu-bin-pack-scheduler-sa
      namespace: kube-system
    roleRef:
      kind: ClusterRole
      name: system:volume-scheduler
      apiGroup: rbac.authorization.k8s.io
    ---
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: gpu-bin-pack-scheduler-config
      namespace: kube-system
    data:
      scheduler-config.yaml: |
        apiVersion: kubescheduler.config.k8s.io/v1
        kind: KubeSchedulerConfiguration
        leaderElection:
          # With a single replica, leader election is unnecessary. If you run
          # multiple replicas, re-enable it, and note that the built-in
          # system:kube-scheduler ClusterRole only allows get/update on the
          # lease named "kube-scheduler", so a custom lock resourceName must
          # be added to that role (or granted via a dedicated Role).
          leaderElect: false
        profiles:
          - schedulerName: gpu-bin-pack-scheduler
            plugins:
              # We keep the default plugins for other phases
              queueSort:
                enabled:
                  - name: PrioritySort
              filter:
                enabled:
                  - name: NodeUnschedulable
                  - name: NodeName
                  - name: TaintToleration
                  - name: NodeAffinity
                  - name: NodePorts
                  - name: NodeResourcesFit
                  - name: VolumeRestrictions
                  - name: EBSLimits
                  - name: GCEPDLimits
                  - name: NodeVolumeLimits
                  - name: AzureDiskLimits
                  - name: VolumeBinding
                  - name: VolumeZone
                  - name: PodTopologySpread
                  - name: InterPodAffinity
              preScore:
                enabled:
                  - name: GPUBinPack
              score:
                enabled:
                  - name: GPUBinPack
                    weight: 10
                  # We explicitly keep some default scoring plugins but disable the one we are replacing
                  - name: PodTopologySpread
                    weight: 2
                  - name: TaintToleration
                    weight: 1
                  - name: NodeAffinity
                    weight: 1
                disabled:
                  # NodeResourcesFit's default LeastAllocated scoring spreads
                  # pods and would directly counteract GPUBinPack. Disabling it
                  # here removes only its score; its Filter check still runs.
                  - name: NodeResourcesFit
                  - name: NodeResourcesBalancedAllocation
            pluginConfig:
              - name: GPUBinPack
                args: {}
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: gpu-bin-pack-scheduler
      namespace: kube-system
      labels:
        component: gpu-bin-pack-scheduler
    spec:
      replicas: 1
      selector:
        matchLabels:
          component: gpu-bin-pack-scheduler
      template:
        metadata:
          labels:
            component: gpu-bin-pack-scheduler
        spec:
          serviceAccountName: gpu-bin-pack-scheduler-sa
          containers:
          - name: scheduler
            image: your-registry/gpu-bin-pack-scheduler:v1.0.0
            args:
            - --config=/etc/kubernetes/scheduler-config.yaml
            - --v=3 # Verbose logging to see scores
            resources:
              requests:
                cpu: "100m"
                memory: "256Mi"
              limits:
                cpu: "500m"
                memory: "512Mi"
            volumeMounts:
            - name: scheduler-config-volume
              mountPath: /etc/kubernetes
          volumes:
          - name: scheduler-config-volume
            configMap:
              name: gpu-bin-pack-scheduler-config

    Key Configuration Details:

  • schedulerName: We define a new profile named gpu-bin-pack-scheduler. Pods must specify this name in their spec to be handled by our scheduler.
  • plugins: We enable our GPUBinPack plugin in both the preScore and score phases.
  • disabled: This is critical. We explicitly disable the score extension of NodeResourcesFit, whose default LeastAllocated strategy would directly counteract our packing scores, along with NodeResourcesBalancedAllocation. NodeResourcesFit's Filter check remains active, so basic resource fitting is still enforced.
  • weight: We give our plugin a high weight (10) to ensure its decision has a significant impact relative to other active scoring plugins like PodTopologySpread.
    Apply the manifests:

     bash
     kubectl apply -f scheduler-manifests.yaml
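
    Before submitting test workloads, confirm the scheduler pod is healthy; the label selector below matches the Deployment above:

     bash
     kubectl -n kube-system get pods -l component=gpu-bin-pack-scheduler
     kubectl -n kube-system logs deploy/gpu-bin-pack-scheduler --tail=20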


    Verification and Advanced Considerations

    With the scheduler running, let's test it. Create two GPU pods, each specifying the new scheduler.

    yaml
    # test-pods.yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-pod-1
    spec:
      schedulerName: gpu-bin-pack-scheduler # Use our scheduler
      containers:
      - name: cuda-container
        image: nvidia/cuda:12.1.0-base-ubuntu22.04
        command: ["sleep", "3600"]
        resources:
          limits:
            nvidia.com/gpu: "4"
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-pod-2
    spec:
      schedulerName: gpu-bin-pack-scheduler # Use our scheduler
      containers:
      - name: cuda-container
        image: nvidia/cuda:12.1.0-base-ubuntu22.04
        command: ["sleep", "3600"]
        resources:
          limits:
            nvidia.com/gpu: "4"

    Apply this manifest and observe the placement:

     bash
     kubectl apply -f test-pods.yaml
     kubectl get pods -o wide

    You should see both gpu-pod-1 and gpu-pod-2 scheduled onto the same node, assuming that node has at least 8 GPUs. You can also inspect the scheduler logs to see the scores it calculated:

     bash
     kubectl logs -n kube-system deployment/gpu-bin-pack-scheduler
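
    With --v=3, the score line from our plugin appears for each candidate node. Expect output along these lines; the timestamps, source locations, node names, and numbers are purely illustrative and will differ in your cluster:

     I1105 10:42:17.123456       1 plugin.go:112] GPUBinPack Score for node 'node-2': ((4 + 4) / 8) * 100 = 100
     I1105 10:42:17.123789       1 plugin.go:112] GPUBinPack Score for node 'node-3': ((0 + 4) / 8) * 100 = 50

    Note how the half-full node outscores the empty one: that gap is the bin packing behavior at work.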

    Advanced Consideration: Heterogeneous GPU Types

    Real-world clusters often have nodes with different GPU models (e.g., T4 for inference, A100 for training). Our current plugin is naive to this; it only considers the count of nvidia.com/gpu. A production implementation should be enhanced to handle this.

    Solution: Use node labels to identify GPU types (e.g., gpu-type=a100). The pod would request a specific type using a nodeSelector or nodeAffinity.

    Our plugin can be modified to:

  • Check whether the pod carries a gpu-type node selector.
  • In the Score function, only score nodes whose gpu-type label matches the requested type.
  • Calculate the bin packing score relative to the allocatable GPUs on nodes of that specific type. This prevents trying to pack an A100 pod onto a node full of T4s; a sketch of the matching check follows this list.
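
    Below is a minimal sketch of that matching check. It assumes a hypothetical gpu-type label key on nodes and a matching nodeSelector on GPU pods; neither name is a standard convention, so substitute whatever labeling scheme your cluster already uses:

     go
     // gpuTypeLabel is a hypothetical label key, e.g. gpu-type=a100 on nodes,
     // with the same key appearing in a GPU pod's nodeSelector.
     const gpuTypeLabel = "gpu-type"
     
     // nodeMatchesGPUType reports whether the node can satisfy the pod's
     // requested GPU type. Pods with no gpu-type selector match any node.
     func nodeMatchesGPUType(pod *v1.Pod, node *v1.Node) bool {
     	wantType, ok := pod.Spec.NodeSelector[gpuTypeLabel]
     	if !ok {
     		return true
     	}
     	return node.Labels[gpuTypeLabel] == wantType
     }

    Inside Score, a non-matching node would then receive a neutral score before any ratio is computed:

     go
     	if !nodeMatchesGPUType(pod, node) {
     		return 0, nil // wrong GPU type; let matching nodes win the ranking
     	}

    In practice, the NodeAffinity filter already removes non-matching nodes whenever the pod sets a nodeSelector, so this check is defensive; its real value is keeping the scoring math scoped to the correct hardware pool.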
    Advanced Consideration: Interaction with Preemption

    Bin packing can create tension with pod priority and preemption. Imagine a node is fully packed with low-priority pods. A high-priority pod arrives that could fit on this node if some low-priority pods were preempted.

  • The default preemption logic, executed in the PostFilter phase, will kick in. It identifies victim pods on the packed node and evicts them to make room.
  • This is generally desirable behavior: our bin packing strategy successfully consolidated resources, making the node a viable (and likely only) candidate for the high-priority pod after preemption.
  • The risk is cascading preemptions. If the scheduler packs too aggressively, it can concentrate load on a few "hot" nodes that become constant preemption targets, causing churn for low-priority workloads.
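
    To observe this interaction in a test cluster, a PriorityClass and a pod that uses it are enough. A minimal sketch; the class name and value are illustrative:

     yaml
     # priority.yaml
     apiVersion: scheduling.k8s.io/v1
     kind: PriorityClass
     metadata:
       name: training-critical
     value: 1000000
     preemptionPolicy: PreemptLowerPriority
     globalDefault: false
     description: "High-priority training jobs; may preempt lower-priority pods."
     ---
     apiVersion: v1
     kind: Pod
     metadata:
       name: gpu-pod-critical
     spec:
       schedulerName: gpu-bin-pack-scheduler
       priorityClassName: training-critical
       containers:
       - name: cuda-container
         image: nvidia/cuda:12.1.0-base-ubuntu22.04
         command: ["sleep", "3600"]
         resources:
           limits:
             nvidia.com/gpu: "8"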

    Engineers must balance the aggressiveness of bin packing (e.g., by adjusting the plugin's weight) with the stability needs of lower-priority jobs. For some environments, it might be preferable to have a slightly less packed cluster to provide more landing spots for high-priority pods without requiring preemption.

    Conclusion

    The Kubernetes default scheduler is a general-purpose tool, and like any such tool, it falls short in specialized, high-performance domains like GPU-based computing. By leveraging the scheduler framework, we can move beyond its limitations and implement sophisticated, domain-specific logic. We have walked through a complete, production-focused workflow (plugin implementation, packaging, configuration, and deployment) for a GPU bin packing plugin that directly addresses resource fragmentation, leading to higher cluster utilization, faster scheduling for large jobs, and significant cost savings.

    This level of customization is not trivial. It requires a deep understanding of both Kubernetes internals and the specific workload's characteristics. However, for organizations operating at scale, the investment in building custom scheduler plugins is a powerful lever for optimizing performance and efficiency, transforming Kubernetes from a generic container orchestrator into a finely tuned, application-aware computing platform.
