Optimizing GPU Utilization with a Custom Kubernetes Bin-Packing Scheduler

Goh Ling Yong
Technology enthusiast and software architect specializing in AI-driven development tools and modern software engineering practices. Passionate about the intersection of artificial intelligence and human creativity in building tomorrow's digital solutions.

The High Cost of GPU Fragmentation in Kubernetes

In modern cloud-native environments, particularly those running AI/ML workloads, NVIDIA GPUs are a critical but costly resource. The default Kubernetes scheduler (default-scheduler) scores nodes with plugins such as NodeResourcesFit, whose default LeastAllocated strategy favors spreading pods across as many nodes as possible. While this strategy promotes resilience for general-purpose stateless applications, it is profoundly inefficient and expensive for GPU workloads.

Consider a common scenario: a cluster with a node pool of expensive g2-standard-4 nodes, each equipped with one NVIDIA L4 GPU. If you schedule four separate training jobs, each requesting a single GPU, the default scheduler will likely place each pod on a different node. The result? Four active, high-cost nodes, each with its GPU at 100% utilization but its CPU and memory potentially underutilized. More critically, the Cluster Autoscaler sees all four nodes as active and cannot scale down the pool, even if the total workload could theoretically fit on a single, more powerful node. This fragmentation directly translates to wasted resources and a massively inflated cloud bill.

To solve this, we need to invert the scheduler's logic for these specific workloads. Instead of spreading, we need to pack them as tightly as possible. This is a classic bin-packing problem. The goal is to fill up one GPU node completely before scheduling a pod on a second one. This consolidation frees up entire nodes, making them eligible for termination by the Cluster Autoscaler.
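
To make the packing intuition concrete, here is a toy first-fit sketch in Go. It is purely illustrative: the request sizes and node capacity are made up, and this helper is not part of the scheduler we build below.

go
package main

import "fmt"

// firstFit places each GPU request on the first node with enough remaining
// capacity, opening a new node only when nothing fits -- the "fill one node
// before opening another" behavior we want from the scheduler.
func firstFit(requests []int, nodeCapacity int) [][]int {
	var nodes [][]int   // requests packed onto each node
	var remaining []int // remaining capacity per node
	for _, r := range requests {
		placed := false
		for i := range nodes {
			if remaining[i] >= r {
				nodes[i] = append(nodes[i], r)
				remaining[i] -= r
				placed = true
				break
			}
		}
		if !placed {
			nodes = append(nodes, []int{r})
			remaining = append(remaining, nodeCapacity-r)
		}
	}
	return nodes
}

func main() {
	// Four 1-GPU jobs on 4-GPU nodes: first-fit uses a single node.
	fmt.Println(firstFit([]int{1, 1, 1, 1}, 4)) // [[1 1 1 1]]
}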

This article details the implementation of a custom Kubernetes scheduler focused exclusively on GPU bin-packing. We will leverage the Kubernetes scheduler framework in Go to create a custom scoring plugin that prioritizes nodes with the highest existing GPU allocation, deploy it as a secondary scheduler in our cluster, and demonstrate its direct impact on workload placement and cost efficiency.


The Scheduler Framework: Our Toolkit for Customization

The Kubernetes scheduler is not a monolith. Since v1.19, it's built on a highly pluggable architecture called the scheduler framework. This framework defines a series of extension points (interfaces) that allow developers to inject custom logic into the scheduling lifecycle. For our purposes, the most important extension point is Score.

Here's a quick refresher on the scheduling cycle phases and their relevant extension points:

  • Sorting: QueueSort plugins sort pods in the scheduling queue.
  • Filtering: PreFilter and Filter plugins eliminate nodes that cannot run the pod (e.g., insufficient resources, failed taints/tolerations).
  • Scoring: PreScore and Score plugins rank the nodes that passed the filtering phase. Each Score plugin returns an integer score for each node, from 0 to framework.MaxNodeScore (100). The scheduler then sums the weighted scores from all active scoring plugins to determine the final rank.
  • Reserving and binding: Reserve and Permit plugins reserve resources and can hold or reject the pod at the end of the scheduling cycle; PreBind, Bind, and PostBind plugins then execute the binding of the pod to the chosen node.

Our custom scheduler will implement the ScorePlugin interface, sketched below. We will design a scoring function that gives the highest scores to nodes that already have GPU pods running, thereby encouraging the scheduler to place new GPU pods on those same nodes until they are full.
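
For reference, here is the shape of that interface, paraphrased from the k8s.io/kubernetes/pkg/scheduler/framework package as of v1.28 (see the package source for the authoritative definition):

go
// Paraphrased from k8s.io/kubernetes/pkg/scheduler/framework (v1.28).
type ScorePlugin interface {
	Plugin // provides Name() string

	// Score ranks a node that survived filtering; the returned value must be
	// between 0 and framework.MaxNodeScore (100).
	Score(ctx context.Context, state *CycleState, p *v1.Pod, nodeName string) (int64, *Status)

	// ScoreExtensions returns an optional normalizer, or nil if unused.
	ScoreExtensions() ScoreExtensions
}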

    Designing the GPU Bin-Packing Scoring Logic

    The core of our custom scheduler is its scoring algorithm. The logic should be simple, effective, and computationally inexpensive.

    Our goal is to reward nodes that have a higher number of allocated GPUs. A simple linear scoring function will suffice:

    score = (maxScore * allocatedGpuRequests) / totalGpuCapacity

    Let's break this down:

    * maxScore: The maximum score a plugin can return, typically framework.MaxNodeScore (which is 100).

    * allocatedGpuRequests: The sum of GPU requests from all pods currently scheduled on the node.

    * totalGpuCapacity: The total number of allocatable GPUs on the node.

    Example Walkthrough:

    Assume a cluster with two nodes, node-a and node-b, each with 4 GPUs.

    node-a currently has 3 pods, each requesting 1 GPU. allocatedGpuRequests = 3. Its score would be (100 * 3) / 4 = 75.

    node-b has 1 pod requesting 1 GPU. allocatedGpuRequests = 1. Its score would be (100 * 1) / 4 = 25.

    A new pod requesting 1 GPU arrives. Our plugin will score node-a at 75 and node-b at 25. The scheduler will choose node-a, packing the fourth pod onto it and leaving node-b with minimal utilization, making it a prime candidate for scale-down if its single pod is later removed.

    This approach directly counteracts the scheduler's default LeastAllocated scoring behavior and achieves our bin-packing goal.
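
    As a quick sanity check on the formula, here is a tiny, self-contained Go sketch that reproduces the walkthrough above (the scoreNode helper is illustrative only; the real plugin computes this inside its Score method):

    go
    package main

    import "fmt"

    // scoreNode mirrors the bin-packing formula:
    // score = (maxScore * allocatedGpuRequests) / totalGpuCapacity.
    func scoreNode(allocated, capacity, maxScore int64) int64 {
        if capacity == 0 {
            return 0 // nodes without GPUs get no bin-packing preference
        }
        return maxScore * allocated / capacity
    }

    func main() {
        const maxScore = 100
        fmt.Println(scoreNode(3, 4, maxScore)) // node-a: 75
        fmt.Println(scoreNode(1, 4, maxScore)) // node-b: 25
    }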

    Implementation: Building the Scheduler in Go

    Let's translate our design into a functional Go program. We'll create a new scheduler plugin and compile it into a scheduler binary.

    Step 1: Project Setup

    Initialize a new Go module.

    bash
    # Create project directory
    mkdir gpu-bin-packer-scheduler
    cd gpu-bin-packer-scheduler
    
    # Initialize Go module
    go mod init github.com/your-org/gpu-bin-packer-scheduler
    
    # Add Kubernetes dependencies
    go get k8s.io/kubernetes@v1.28.0
    go get k8s.io/api@v0.28.0
    go get k8s.io/apimachinery@v0.28.0
    go get k8s.io/klog/v2@v2.100.1

    Note: Ensure you use compatible versions of the Kubernetes libraries; this example targets Kubernetes v1.28, so every k8s.io module should be pinned to the matching v1.28.x / v0.28.x release. Also be aware that depending on k8s.io/kubernetes directly requires replace directives in go.mod mapping each k8s.io staging module (k8s.io/api, k8s.io/apimachinery, k8s.io/client-go, and so on) to its matching v0.28.x tag, because the main module references them as v0.0.0.

    Step 2: Implementing the Scoring Plugin

    Create a file pkg/scheduler/plugin.go to house our plugin's logic.

    go
    package scheduler
    
    import (
    	"context"
    	"fmt"
    
    	v1 "k8s.io/api/core/v1"
    	"k8s.io/apimachinery/pkg/runtime"
    	"k8s.io/klog/v2"
    	"k8s.io/kubernetes/pkg/scheduler/framework"
    )
    
    const (
    	// Name is the name of the plugin used in the plugin registry and configurations.
    	Name = "GpuBinPacker"
    	// GpuResourceName is the name of the GPU resource.
    	GpuResourceName = "nvidia.com/gpu"
    )
    
    // GpuBinPacker is a score plugin that favors nodes with higher GPU allocation.
    type GpuBinPacker struct {
    	handle framework.Handle
    }
    
    // Asserts that GpuBinPacker implements the ScorePlugin interface.
    var _ framework.ScorePlugin = &GpuBinPacker{}
    
    // Name returns the name of the plugin.
    func (pl *GpuBinPacker) Name() string {
    	return Name
    }
    
    // Score is the function that ranks a node.
    func (pl *GpuBinPacker) Score(ctx context.Context, state *framework.CycleState, p *v1.Pod, nodeName string) (int64, *framework.Status) {
    	nodeInfo, err := pl.handle.SnapshotSharedLister().NodeInfos().Get(nodeName)
    	if err != nil {
    		return 0, framework.AsStatus(fmt.Errorf("getting node %q from Snapshot: %w", nodeName, err))
    	}
    
    	node := nodeInfo.Node()
    	if node == nil {
    		return 0, framework.AsStatus(fmt.Errorf("node %q not found", nodeName))
    	}
    
    	// If the node has no GPUs, it should not be scored by this plugin.
    	totalGpuCapacity, hasGpu := node.Status.Allocatable[GpuResourceName]
    	if !hasGpu || totalGpuCapacity.IsZero() {
    		klog.V(4).Infof("Node %s has no allocatable GPUs, scoring 0", nodeName)
    		return 0, nil
    	}
    
    	// Calculate the sum of GPU requests from all pods on the node.
    	allocatedGpu := int64(0)
    	for _, podInfo := range nodeInfo.Pods {
    		// nodeInfo.Pods holds *framework.PodInfo; the v1.Pod is in its Pod field.
    		for _, container := range podInfo.Pod.Spec.Containers {
    			if req, ok := container.Resources.Requests[GpuResourceName]; ok {
    				allocatedGpu += req.Value()
    			}
    		}
    	}
    
    	// The score is the percentage of GPUs allocated, scaled to the max score.
    	// Higher allocation = higher score = more likely to be chosen.
    	score := (framework.MaxNodeScore * allocatedGpu) / totalGpuCapacity.Value()
    
    	klog.V(5).Infof("Node: %s, Allocated GPUs: %d, Total GPUs: %d, Score: %d", nodeName, allocatedGpu, totalGpuCapacity.Value(), score)
    
    	return score, nil
    }
    
    // ScoreExtensions returns a ScoreExtensions interface if it exists.
    func (pl *GpuBinPacker) ScoreExtensions() framework.ScoreExtensions {
    	return nil // We don't need to implement this for our simple case.
    }
    
    // New initializes a new plugin and returns it.
    func New(_ runtime.Object, h framework.Handle) (framework.Plugin, error) {
    	return &GpuBinPacker{
    		handle: h,
    	}, nil
    }

    Key Implementation Details:

    * handle.SnapshotSharedLister().NodeInfos().Get(nodeName): This is the efficient, canonical way to get node and pod information within the scheduler framework. It uses a cached snapshot of the cluster state for the current scheduling cycle, avoiding direct API server calls for every scoring calculation.

    * node.Status.Allocatable[GpuResourceName]: We check for the presence and capacity of nvidia.com/gpu resources. If a node has no GPUs, we give it a score of 0.

    * Resource Iteration: We iterate through all pods on the nodeInfo and sum their GPU requests. This is the core of our allocatedGpuRequests calculation.

    * Scoring Formula: The final score is calculated exactly as designed. klog statements are added for verbose logging, which is invaluable for debugging scheduling decisions.

    Step 3: Registering the Plugin and Building the Scheduler Binary

    Now, we create the main entrypoint for our custom scheduler in cmd/scheduler/main.go.

    go
    package main
    
    import (
    	"os"
    
    	"k8s.io/component-base/cli"
    	"k8s.io/kubernetes/cmd/kube-scheduler/app"
    
    	"github.com/your-org/gpu-bin-packer-scheduler/pkg/scheduler"
    )
    
    func main() {
    	// Register the custom plugin with the scheduler framework registry.
    	command := app.NewSchedulerCommand(
    		app.WithPlugin(scheduler.Name, scheduler.New),
    	)
    
    	code := cli.Run(command)
    	os.Exit(code)
    }

    This code is surprisingly simple. We reuse app.NewSchedulerCommand from kube-scheduler's own codebase and pass the app.WithPlugin option to register our GpuBinPacker plugin under its name. The result is a scheduler binary that is functionally identical to the default kube-scheduler but also contains our custom plugin, ready to be activated via configuration.


    Deployment and Configuration in a Production Cluster

    Building the binary is only half the battle. Deploying it correctly with the right configuration and permissions is critical.

    Step 1: Containerize the Scheduler

    We need a lean, production-ready Docker image. A multi-stage Dockerfile is perfect for this.

    Dockerfile
    # --- Build Stage ---
    FROM golang:1.21-alpine AS builder
    
    WORKDIR /app
    
    # Copy go.mod and go.sum files to download dependencies
    COPY go.mod go.sum ./
    RUN go mod download
    
    # Copy the source code
    COPY . .
    
    # Build the scheduler binary
    # CGO_ENABLED=0 is important for a static binary
    # -ldflags "-w -s" strips debug symbols to reduce size
    RUN CGO_ENABLED=0 go build -ldflags "-w -s" -o /gpu-bin-packer-scheduler ./cmd/scheduler
    
    # --- Final Stage ---
    FROM alpine:3.18
    
    # Copy the static binary from the builder stage
    COPY --from=builder /gpu-bin-packer-scheduler /usr/local/bin/scheduler
    
    # The scheduler runs as non-root user for security
    USER 65532:65532
    
    ENTRYPOINT ["/usr/local/bin/scheduler"]

    Build and push this image to your container registry:

    bash
    docker build -t your-registry/gpu-bin-packer-scheduler:v1.0.0 .
    docker push your-registry/gpu-bin-packer-scheduler:v1.0.0

    Step 2: RBAC Configuration

    Our scheduler needs permissions to interact with the Kubernetes API server: reading nodes and pods, creating pod bindings, updating pod status, and managing leases for leader election. For simplicity and correctness, we bind our ServiceAccount to the same built-in system:kube-scheduler ClusterRole that the default kube-scheduler uses (in production you may also want to bind system:volume-scheduler, which the stock scheduler relies on for volume binding).

    deploy/rbac.yaml:

    yaml
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: gpu-bin-packer-scheduler
      namespace: kube-system
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: gpu-bin-packer-scheduler-as-scheduler
    subjects:
    - kind: ServiceAccount
      name: gpu-bin-packer-scheduler
      namespace: kube-system
    roleRef:
      kind: ClusterRole
      name: system:kube-scheduler # Use the built-in role
      apiGroup: rbac.authorization.k8s.io

    Step 3: KubeSchedulerConfiguration

    This is where we activate our plugin. We create a KubeSchedulerConfiguration file that defines a new scheduler profile. In this profile, we disable the default scoring plugins that spread pods and enable our GpuBinPacker plugin with a high weight.

    deploy/scheduler-config.yaml:

    yaml
    apiVersion: kubescheduler.config.k8s.io/v1
    kind: KubeSchedulerConfiguration
    leaderElection:
      leaderElect: true
      resourceNamespace: kube-system
      resourceName: gpu-bin-packer-scheduler
    # clientConnection.kubeconfig is deliberately omitted: with no kubeconfig set,
    # the scheduler falls back to the pod's in-cluster ServiceAccount credentials.
    
    profiles:
      - schedulerName: gpu-bin-packer-scheduler
        plugins:
          # Default plugins are still used for Filter, Reserve, etc.
          # We are only customizing the Score phase.
          score:
            # Disable default scoring that conflicts with our bin-packing goal.
            # Since v1.23 the LeastAllocated behavior lives inside NodeResourcesFit's
            # scoring strategy, so we disable NodeResourcesFit's Score here (its Filter still runs).
            disabled:
            - name: NodeResourcesFit
            - name: NodeResourcesBalancedAllocation
            # Enable our custom plugin with a weight.
            enabled:
            - name: GpuBinPacker
              weight: 100

    We will mount this configuration into our scheduler's pod via a ConfigMap.

    bash
    kubectl create configmap gpu-scheduler-config --from-file=deploy/scheduler-config.yaml -n kube-system

    Step 4: Deploying the Scheduler

    Finally, we create a Deployment for our scheduler in the kube-system namespace.

    deploy/deployment.yaml:

    yaml
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: gpu-bin-packer-scheduler
      namespace: kube-system
      labels:
        app: gpu-bin-packer-scheduler
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: gpu-bin-packer-scheduler
      template:
        metadata:
          labels:
            app: gpu-bin-packer-scheduler
        spec:
          serviceAccountName: gpu-bin-packer-scheduler
          containers:
          - name: scheduler
            image: your-registry/gpu-bin-packer-scheduler:v1.0.0
            args:
            - --config=/etc/kubernetes/scheduler-config.yaml
            - --v=4 # Set log level for debugging
            resources:
              requests:
                cpu: "100m"
                memory: "100Mi"
            volumeMounts:
            - name: scheduler-config-volume
              mountPath: /etc/kubernetes
          volumes:
          - name: scheduler-config-volume
            configMap:
              name: gpu-scheduler-config

    Apply all the manifests:

    bash
    kubectl apply -f deploy/rbac.yaml
    kubectl apply -f deploy/deployment.yaml

    Your custom scheduler is now running and ready to accept pods!


    Using the Scheduler and Verifying the Behavior

    To use our new scheduler, a pod must specify its schedulerName in the pod spec.

    Let's create four identical pods requesting one GPU each.

    test-pods.yaml:

    yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-pod-1
    spec:
      schedulerName: gpu-bin-packer-scheduler # <-- Use our custom scheduler
      containers:
      - name: cuda-container
        image: nvidia/cuda:12.1.0-base-ubuntu22.04
        command: ["sleep", "3600"]
        resources:
          limits:
            nvidia.com/gpu: 1
    ---
    # ... (repeat for gpu-pod-2, gpu-pod-3, gpu-pod-4)

    Verification Scenario:

  • Initial State: Assume a cluster with a GPU node pool managed by the Cluster Autoscaler. Let's say it has two n1-standard-8 nodes, each with 2 NVIDIA T4 GPUs.
  • Deploy Pod 1: kubectl apply -f test-pods.yaml (just pod 1). It gets scheduled on node-1.
  • Deploy Pod 2: Apply the manifest for pod 2. Check the scheduler logs (kubectl logs -n kube-system -l app=gpu-bin-packer-scheduler). You should see GpuBinPacker score node-1 at (100 * 1) / 2 = 50 and node-2 at 0, so the pod is placed on node-1.
  • Deploy Pods 3 & 4: Apply the manifests for pods 3 and 4. Pod 3 will land on node-2 because node-1 is now full (its filter plugins will fail). Pod 4 will be bin-packed with pod 3 on node-2.
  • Now, let's observe the output:

    bash
    $ kubectl get pods -o wide
    
    NAME        READY   STATUS    RESTARTS   AGE   IP          NODE     NOMINATED NODE   READINESS GATES
    gpu-pod-1   1/1     Running   0          5m    10.4.1.5    node-1   <none>           <none>
    gpu-pod-2   1/1     Running   0          4m    10.4.1.6    node-1   <none>           <none>
    gpu-pod-3   1/1     Running   0          3m    10.4.2.8    node-2   <none>           <none>
    gpu-pod-4   1/1     Running   0          2m    10.4.2.9    node-2   <none>           <none>

    The pods are tightly packed. Now, if you kubectl delete pod gpu-pod-3 gpu-pod-4, node-2 becomes completely empty. The Cluster Autoscaler will identify this underutilized node and, after a configurable timeout (typically 10 minutes), terminate it, achieving our goal of cost savings.

    Advanced Considerations and Edge Cases

    Handling Heterogeneous GPU Types

    In a real-world cluster, you might have nodes with different GPU types (e.g., nvidia.com/gpu-type-t4 and nvidia.com/gpu-type-a100). Our current scheduler only looks at nvidia.com/gpu and doesn't differentiate. This is a problem because a pod requesting a T4 should not be scored based on A100 utilization.

    To handle this, the scoring logic must be more sophisticated. It needs to check the specific GPU resource type requested by the pod.

    Improved Score Logic Snippet:

    go
    // Inside the Score function... (this snippet also needs "strings" added to the import block)
    
    // Find the specific GPU type the pod is requesting.
    var podGpuRequestType v1.ResourceName
    for _, container := range p.Spec.Containers {
        for resourceName := range container.Resources.Requests {
            if strings.HasPrefix(string(resourceName), "nvidia.com/") {
                podGpuRequestType = resourceName
                break
            }
        }
        if podGpuRequestType != "" {
            break
        }
    }
    
    // If the pod doesn't request a GPU, this plugin shouldn't score.
    if podGpuRequestType == "" {
        return 0, nil
    }
    
    // Now, calculate score based on the specific GPU type.
    totalGpuCapacity, hasGpu := node.Status.Allocatable[podGpuRequestType]
    if !hasGpu || totalGpuCapacity.IsZero() {
        return 0, nil
    }
    
    allocatedGpu := int64(0)
    for _, podInfo := range nodeInfo.Pods {
        // nodeInfo.Pods holds *framework.PodInfo; read the pod spec via the Pod field.
        for _, container := range podInfo.Pod.Spec.Containers {
            if req, ok := container.Resources.Requests[podGpuRequestType]; ok {
                allocatedGpu += req.Value()
            }
        }
    }
    
    score := (framework.MaxNodeScore * allocatedGpu) / totalGpuCapacity.Value()
    return score, nil

    This revised logic correctly isolates the scoring to the specific GPU resource type the incoming pod requests, making the scheduler viable in a heterogeneous environment.

    Performance at Scale

    The Score function is in the hot path of the scheduling loop. It's called for every pod for every node that passes the filter stage. Our current implementation iterates through all pods on a node (O(P) where P is the number of pods on the node). In clusters with nodes running hundreds of pods, this could introduce latency.

    For hyper-scale clusters, you might consider pre-calculation. The PreScore extension point could be used to calculate the GPU allocation for all nodes once per scheduling cycle and store it in the CycleState. The Score function would then just read this pre-computed value from the state, making it an O(1) operation.
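
    The sketch below shows one way to wire this up, assuming we extend the same GpuBinPacker type. The preScoreState type and state key are illustrative names, the method signature matches the v1.28 PreScorePlugin interface, and the plugin would also need to be enabled at the preScore extension point in the scheduler profile.

    go
    // Assert that the plugin also implements PreScore.
    var _ framework.PreScorePlugin = &GpuBinPacker{}

    // preScoreStateKey identifies our cached data in the CycleState.
    const preScoreStateKey = framework.StateKey("PreScore" + Name)

    // preScoreState caches the summed GPU requests per node for one cycle.
    type preScoreState struct {
        allocated map[string]int64 // node name -> allocated GPU requests
    }

    // Clone implements framework.StateData; the map is read-only after PreScore.
    func (s *preScoreState) Clone() framework.StateData { return s }

    // PreScore runs once per pod per scheduling cycle, before Score is called
    // for each node, so the per-pod iteration happens only once per cycle.
    func (pl *GpuBinPacker) PreScore(ctx context.Context, state *framework.CycleState, p *v1.Pod, nodes []*v1.Node) *framework.Status {
        s := &preScoreState{allocated: make(map[string]int64, len(nodes))}
        for _, node := range nodes {
            nodeInfo, err := pl.handle.SnapshotSharedLister().NodeInfos().Get(node.Name)
            if err != nil {
                continue // skip nodes missing from the snapshot
            }
            var sum int64
            for _, podInfo := range nodeInfo.Pods {
                for _, c := range podInfo.Pod.Spec.Containers {
                    if req, ok := c.Resources.Requests[GpuResourceName]; ok {
                        sum += req.Value()
                    }
                }
            }
            s.allocated[node.Name] = sum
        }
        state.Write(preScoreStateKey, s)
        return nil
    }

    // In Score, replace the per-pod loop with a map lookup:
    //
    //   if data, err := state.Read(preScoreStateKey); err == nil {
    //       if s, ok := data.(*preScoreState); ok {
    //           allocatedGpu = s.allocated[nodeName]
    //       }
    //   }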

    Preemption and Priority

    Our scheduler only influences pod placement through scoring. It does not interfere with Kubernetes's built-in preemption mechanisms. If a high-priority pod needs to be scheduled and the cluster is full, the default preemption logic will still kick in to evict lower-priority pods to make space. Bin-packing can sometimes create resource hotspots, making preemption more likely, which is a trade-off to consider. For batch-style, non-interactive workloads like ML training, this is often an acceptable trade-off for the cost benefits.

    Conclusion

    The default Kubernetes scheduler is a powerful, general-purpose tool, but its one-size-fits-all approach can be suboptimal for specialized, high-cost resources. By leveraging the scheduler framework, we can surgically inject domain-specific logic to solve critical operational problems like GPU fragmentation.

    The bin-packing scheduler we've built is not just a theoretical exercise; it's a production-grade pattern that directly addresses a significant cost driver in AI/ML platforms. By consolidating GPU workloads, it maximizes resource utilization and works in concert with the Cluster Autoscaler to dynamically adjust infrastructure to the precise demands of the workload, turning a scheduling problem into a direct and substantial cost-saving solution.
