Optimizing GPU Utilization with a Custom Kubernetes Bin-Packing Scheduler

Goh Ling Yong
Technology enthusiast and software architect specializing in AI-driven development tools and modern software engineering practices. Passionate about the intersection of artificial intelligence and human creativity in building tomorrow's digital solutions.

The High Cost of GPU Fragmentation in Kubernetes

In modern cloud-native environments, particularly those running AI/ML workloads, NVIDIA GPUs are a critical but costly resource. The default Kubernetes scheduler (default-scheduler) scores nodes with plugins such as NodeResourcesFit, whose default LeastAllocated strategy favors spreading pods across as many nodes as possible. While this strategy promotes resilience for general-purpose stateless applications, it is profoundly inefficient and expensive for GPU workloads.

Consider a common scenario: a cluster with a node pool of expensive g2-standard-4 nodes, each equipped with one NVIDIA L4 GPU. If you schedule four separate training jobs, each requesting a single GPU, the default scheduler will likely place each pod on a different node. The result? Four active, high-cost nodes, each with its GPU at 100% utilization but its CPU and memory potentially underutilized. More critically, the Cluster Autoscaler sees all four nodes as active and cannot scale down the pool, even if the total workload could theoretically fit on a single, more powerful node. This fragmentation directly translates to wasted resources and a massively inflated cloud bill.

To solve this, we need to invert the scheduler's logic for these specific workloads. Instead of spreading, we need to pack them as tightly as possible. This is a classic bin-packing problem. The goal is to fill up one GPU node completely before scheduling a pod on a second one. This consolidation frees up entire nodes, making them eligible for termination by the Cluster Autoscaler.
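
To make the packing intuition concrete, here is a toy first-fit sketch in Go. It is purely illustrative: the request sizes and node capacity are made up, and this helper is not part of the scheduler we build below.

go
package main

import "fmt"

// firstFit places each GPU request on the first node with enough remaining
// capacity, opening a new node only when nothing fits -- the "fill one node
// before opening another" behavior we want from the scheduler.
func firstFit(requests []int, nodeCapacity int) [][]int {
	var nodes [][]int   // requests packed onto each node
	var remaining []int // remaining capacity per node
	for _, r := range requests {
		placed := false
		for i := range nodes {
			if remaining[i] >= r {
				nodes[i] = append(nodes[i], r)
				remaining[i] -= r
				placed = true
				break
			}
		}
		if !placed {
			nodes = append(nodes, []int{r})
			remaining = append(remaining, nodeCapacity-r)
		}
	}
	return nodes
}

func main() {
	// Four 1-GPU jobs on 4-GPU nodes: first-fit uses a single node.
	fmt.Println(firstFit([]int{1, 1, 1, 1}, 4)) // [[1 1 1 1]]
}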

This article details the implementation of a custom Kubernetes scheduler focused exclusively on GPU bin-packing. We will leverage the Kubernetes scheduler framework in Go to create a custom scoring plugin that prioritizes nodes with the highest existing GPU allocation, deploy it as a secondary scheduler in our cluster, and demonstrate its direct impact on workload placement and cost efficiency.


The Scheduler Framework: Our Toolkit for Customization

The Kubernetes scheduler is not a monolith. Since v1.19, it's built on a highly pluggable architecture called the scheduler framework. This framework defines a series of extension points (interfaces) that allow developers to inject custom logic into the scheduling lifecycle. For our purposes, the most important extension point is Score.

Here's a quick refresher on the scheduling cycle phases and their relevant extension points:

  • Sorting: QueueSort plugins sort pods in the scheduling queue.
  • Filtering: PreFilter and Filter plugins eliminate nodes that cannot run the pod (e.g., insufficient resources, failed taints/tolerations).
  • Scoring: PreScore and Score plugins rank the nodes that passed the filtering phase. Each Score plugin returns an integer score for each node, from 0 to framework.MaxNodeScore (100). The scheduler then sums the weighted scores from all active scoring plugins to determine the final rank.
  • Reserving and binding: Reserve and Permit plugins reserve resources and can hold or reject the pod at the end of the scheduling cycle; PreBind, Bind, and PostBind plugins then execute the binding of the pod to the chosen node.

Our custom scheduler will implement the ScorePlugin interface, sketched below. We will design a scoring function that gives the highest scores to nodes that already have GPU pods running, thereby encouraging the scheduler to place new GPU pods on those same nodes until they are full.
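
For reference, here is the shape of that interface, paraphrased from the k8s.io/kubernetes/pkg/scheduler/framework package as of v1.28 (see the package source for the authoritative definition):

go
// Paraphrased from k8s.io/kubernetes/pkg/scheduler/framework (v1.28).
type ScorePlugin interface {
	Plugin // provides Name() string

	// Score ranks a node that survived filtering; the returned value must be
	// between 0 and framework.MaxNodeScore (100).
	Score(ctx context.Context, state *CycleState, p *v1.Pod, nodeName string) (int64, *Status)

	// ScoreExtensions returns an optional normalizer, or nil if unused.
	ScoreExtensions() ScoreExtensions
}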

    Designing the GPU Bin-Packing Scoring Logic

    The core of our custom scheduler is its scoring algorithm. The logic should be simple, effective, and computationally inexpensive.

    Our goal is to reward nodes that have a higher number of allocated GPUs. A simple linear scoring function will suffice:

    score = (maxScore * allocatedGpuRequests) / totalGpuCapacity

    Let's break this down:

    * maxScore: The maximum score a plugin can return, typically framework.MaxNodeScore (which is 100).

    * allocatedGpuRequests: The sum of GPU requests from all pods currently scheduled on the node.

    * totalGpuCapacity: The total number of allocatable GPUs on the node.

    Example Walkthrough:

    Assume a cluster with two nodes, node-a and node-b, each with 4 GPUs.

    node-a currently has 3 pods, each requesting 1 GPU. allocatedGpuRequests = 3. Its score would be (100 * 3) / 4 = 75.

    node-b has 1 pod requesting 1 GPU. allocatedGpuRequests = 1. Its score would be (100 * 1) / 4 = 25.

    A new pod requesting 1 GPU arrives. Our plugin will score node-a at 75 and node-b at 25. The scheduler will choose node-a, packing the fourth pod onto it and leaving node-b with minimal utilization, making it a prime candidate for scale-down if its single pod is later removed.

    This approach directly counteracts the scheduler's default LeastAllocated scoring behavior and achieves our bin-packing goal.
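
    As a quick sanity check on the formula, here is a tiny, self-contained Go sketch that reproduces the walkthrough above (the scoreNode helper is illustrative only; the real plugin computes this inside its Score method):

    go
    package main

    import "fmt"

    // scoreNode mirrors the bin-packing formula:
    // score = (maxScore * allocatedGpuRequests) / totalGpuCapacity.
    func scoreNode(allocated, capacity, maxScore int64) int64 {
        if capacity == 0 {
            return 0 // nodes without GPUs get no bin-packing preference
        }
        return maxScore * allocated / capacity
    }

    func main() {
        const maxScore = 100
        fmt.Println(scoreNode(3, 4, maxScore)) // node-a: 75
        fmt.Println(scoreNode(1, 4, maxScore)) // node-b: 25
    }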

    Implementation: Building the Scheduler in Go

    Let's translate our design into a functional Go program. We'll create a new scheduler plugin and compile it into a scheduler binary.

    Step 1: Project Setup

    Initialize a new Go module.

    bash
    # Create project directory
    mkdir gpu-bin-packer-scheduler
    cd gpu-bin-packer-scheduler
    
    # Initialize Go module
    go mod init github.com/your-org/gpu-bin-packer-scheduler
    
    # Add Kubernetes dependencies
    go get k8s.io/kubernetes@v1.28.0
    go get k8s.io/api@v0.28.0
    go get k8s.io/apimachinery@v0.28.0
    go get k8s.io/klog/v2@v2.100.1

    Note: Ensure you use compatible versions of the Kubernetes libraries; this example targets Kubernetes v1.28, so every k8s.io module should be pinned to the matching v1.28.x / v0.28.x release. Also be aware that depending on k8s.io/kubernetes directly requires replace directives in go.mod mapping each k8s.io staging module (k8s.io/api, k8s.io/apimachinery, k8s.io/client-go, and so on) to its matching v0.28.x tag, because the main module references them as v0.0.0.

    Step 2: Implementing the Scoring Plugin

    Create a file pkg/scheduler/plugin.go to house our plugin's logic.

    go
    package scheduler
    
    import (
    	"context"
    	"fmt"
    
    	v1 "k8s.io/api/core/v1"
    	"k8s.io/apimachinery/pkg/runtime"
    	"k8s.io/klog/v2"
    	"k8s.io/kubernetes/pkg/scheduler/framework"
    )
    
    const (
    	// Name is the name of the plugin used in the plugin registry and configurations.
    	Name = "GpuBinPacker"
    	// GpuResourceName is the name of the GPU resource.
    	GpuResourceName = "nvidia.com/gpu"
    )
    
    // GpuBinPacker is a score plugin that favors nodes with higher GPU allocation.
    type GpuBinPacker struct {
    	handle framework.Handle
    }
    
    // Asserts that GpuBinPacker implements the ScorePlugin interface.
    var _ framework.ScorePlugin = &GpuBinPacker{}
    
    // Name returns the name of the plugin.
    func (pl *GpuBinPacker) Name() string {
    	return Name
    }
    
    // Score is the function that ranks a node.
    func (pl *GpuBinPacker) Score(ctx context.Context, state *framework.CycleState, p *v1.Pod, nodeName string) (int64, *framework.Status) {
    	nodeInfo, err := pl.handle.SnapshotSharedLister().NodeInfos().Get(nodeName)
    	if err != nil {
    		return 0, framework.AsStatus(fmt.Errorf("getting node %q from Snapshot: %w", nodeName, err))
    	}
    
    	node := nodeInfo.Node()
    	if node == nil {
    		return 0, framework.AsStatus(fmt.Errorf("node %q not found", nodeName))
    	}
    
    	// If the node has no GPUs, it should not be scored by this plugin.
    	totalGpuCapacity, hasGpu := node.Status.Allocatable[GpuResourceName]
    	if !hasGpu || totalGpuCapacity.IsZero() {
    		klog.V(4).Infof("Node %s has no allocatable GPUs, scoring 0", nodeName)
    		return 0, nil
    	}
    
    	// Calculate the sum of GPU requests from all pods on the node.
    	allocatedGpu := int64(0)
    	for _, podInfo := range nodeInfo.Pods {
    		// nodeInfo.Pods holds *framework.PodInfo; the v1.Pod is in its Pod field.
    		for _, container := range podInfo.Pod.Spec.Containers {
    			if req, ok := container.Resources.Requests[GpuResourceName]; ok {
    				allocatedGpu += req.Value()
    			}
    		}
    	}
    
    	// The score is the percentage of GPUs allocated, scaled to the max score.
    	// Higher allocation = higher score = more likely to be chosen.
    	score := (framework.MaxNodeScore * allocatedGpu) / totalGpuCapacity.Value()
    
    	klog.V(5).Infof("Node: %s, Allocated GPUs: %d, Total GPUs: %d, Score: %d", nodeName, allocatedGpu, totalGpuCapacity.Value(), score)
    
    	return score, nil
    }
    
    // ScoreExtensions returns a ScoreExtensions interface if it exists.
    func (pl *GpuBinPacker) ScoreExtensions() framework.ScoreExtensions {
    	return nil // We don't need to implement this for our simple case.
    }
    
    // New initializes a new plugin and returns it.
    func New(_ runtime.Object, h framework.Handle) (framework.Plugin, error) {
    	return &GpuBinPacker{
    		handle: h,
    	}, nil
    }

    Key Implementation Details:

    * handle.SnapshotSharedLister().NodeInfos().Get(nodeName): This is the efficient, canonical way to get node and pod information within the scheduler framework. It uses a cached snapshot of the cluster state for the current scheduling cycle, avoiding direct API server calls for every scoring calculation.

    * node.Status.Allocatable[GpuResourceName]: We check for the presence and capacity of nvidia.com/gpu resources. If a node has no GPUs, we give it a score of 0.

    * Resource Iteration: We iterate through all pods on the nodeInfo and sum their GPU requests. This is the core of our allocatedGpuRequests calculation.

    * Scoring Formula: The final score is calculated exactly as designed. klog statements are added for verbose logging, which is invaluable for debugging scheduling decisions.

    Step 3: Registering the Plugin and Building the Scheduler Binary

    Now, we create the main entrypoint for our custom scheduler in cmd/scheduler/main.go.

    go
    package main
    
    import (
    	"os"
    
    	"k8s.io/component-base/cli"
    	"k8s.io/kubernetes/cmd/kube-scheduler/app"
    
    	"github.com/your-org/gpu-bin-packer-scheduler/pkg/scheduler"
    )
    
    func main() {
    	// Register the custom plugin with the scheduler framework registry.
    	command := app.NewSchedulerCommand(
    		app.WithPlugin(scheduler.Name, scheduler.New),
    	)
    
    	code := cli.Run(command)
    	os.Exit(code)
    }

    This code is surprisingly simple. We reuse app.NewSchedulerCommand from kube-scheduler's own codebase and pass the app.WithPlugin option to register our GpuBinPacker plugin under its name. The result is a scheduler binary that is functionally identical to the default kube-scheduler but also contains our custom plugin, ready to be activated via configuration.


    Deployment and Configuration in a Production Cluster

    Building the binary is only half the battle. Deploying it correctly with the right configuration and permissions is critical.

    Step 1: Containerize the Scheduler

    We need a lean, production-ready Docker image. A multi-stage Dockerfile is perfect for this.

    Dockerfile
    # --- Build Stage ---
    FROM golang:1.21-alpine AS builder
    
    WORKDIR /app
    
    # Copy go.mod and go.sum files to download dependencies
    COPY go.mod go.sum ./
    RUN go mod download
    
    # Copy the source code
    COPY . .
    
    # Build the scheduler binary
    # CGO_ENABLED=0 is important for a static binary
    # -ldflags "-w -s" strips debug symbols to reduce size
    RUN CGO_ENABLED=0 go build -ldflags "-w -s" -o /gpu-bin-packer-scheduler ./cmd/scheduler
    
    # --- Final Stage ---
    FROM alpine:3.18
    
    # Copy the static binary from the builder stage
    COPY --from=builder /gpu-bin-packer-scheduler /usr/local/bin/scheduler
    
    # The scheduler runs as non-root user for security
    USER 65532:65532
    
    ENTRYPOINT ["/usr/local/bin/scheduler"]

    Build and push this image to your container registry:

    bash
    docker build -t your-registry/gpu-bin-packer-scheduler:v1.0.0 .
    docker push your-registry/gpu-bin-packer-scheduler:v1.0.0

    Step 2: RBAC Configuration

    Our scheduler needs permissions to interact with the Kubernetes API server: reading nodes and pods, creating pod bindings, updating pod status, and managing leases for leader election. For simplicity and correctness, we bind our ServiceAccount to the same built-in system:kube-scheduler ClusterRole that the default kube-scheduler uses (in production you may also want to bind system:volume-scheduler, which the stock scheduler relies on for volume binding).

    deploy/rbac.yaml:

    yaml
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: gpu-bin-packer-scheduler
      namespace: kube-system
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: gpu-bin-packer-scheduler-as-scheduler
    subjects:
    - kind: ServiceAccount
      name: gpu-bin-packer-scheduler
      namespace: kube-system
    roleRef:
      kind: ClusterRole
      name: system:kube-scheduler # Use the built-in role
      apiGroup: rbac.authorization.k8s.io

    Step 3: KubeSchedulerConfiguration

    This is where we activate our plugin. We create a KubeSchedulerConfiguration file that defines a new scheduler profile. In this profile, we disable the default scoring plugins that spread pods and enable our GpuBinPacker plugin with a high weight.

    deploy/scheduler-config.yaml:

    yaml
    apiVersion: kubescheduler.config.k8s.io/v1
    kind: KubeSchedulerConfiguration
    leaderElection:
      leaderElect: true
      resourceNamespace: kube-system
      resourceName: gpu-bin-packer-scheduler
    # clientConnection.kubeconfig is deliberately omitted: with no kubeconfig set,
    # the scheduler falls back to the pod's in-cluster ServiceAccount credentials.
    
    profiles:
      - schedulerName: gpu-bin-packer-scheduler
        plugins:
          # Default plugins are still used for Filter, Reserve, etc.
          # We are only customizing the Score phase.
          score:
            # Disable default scoring that conflicts with our bin-packing goal.
            # Since v1.23 the LeastAllocated behavior lives inside NodeResourcesFit's
            # scoring strategy, so we disable NodeResourcesFit's Score here (its Filter still runs).
            disabled:
            - name: NodeResourcesFit
            - name: NodeResourcesBalancedAllocation
            # Enable our custom plugin with a weight.
            enabled:
            - name: GpuBinPacker
              weight: 100

    We will mount this configuration into our scheduler's pod via a ConfigMap.

    bash
    kubectl create configmap gpu-scheduler-config --from-file=deploy/scheduler-config.yaml -n kube-system

    Step 4: Deploying the Scheduler

    Finally, we create a Deployment for our scheduler in the kube-system namespace.

    deploy/deployment.yaml:

    yaml
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: gpu-bin-packer-scheduler
      namespace: kube-system
      labels:
        app: gpu-bin-packer-scheduler
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: gpu-bin-packer-scheduler
      template:
        metadata:
          labels:
            app: gpu-bin-packer-scheduler
        spec:
          serviceAccountName: gpu-bin-packer-scheduler
          containers:
          - name: scheduler
            image: your-registry/gpu-bin-packer-scheduler:v1.0.0
            args:
            - --config=/etc/kubernetes/scheduler-config.yaml
            - --v=4 # Set log level for debugging
            resources:
              requests:
                cpu: "100m"
                memory: "100Mi"
            volumeMounts:
            - name: scheduler-config-volume
              mountPath: /etc/kubernetes
          volumes:
          - name: scheduler-config-volume
            configMap:
              name: gpu-scheduler-config

    Apply all the manifests:

    bash
    kubectl apply -f deploy/rbac.yaml
    kubectl apply -f deploy/deployment.yaml

    Your custom scheduler is now running and ready to accept pods!


    Using the Scheduler and Verifying the Behavior

    To use our new scheduler, a pod must specify its schedulerName in the pod spec.

    Let's create four identical pods requesting one GPU each.

    test-pods.yaml:

    yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-pod-1
    spec:
      schedulerName: gpu-bin-packer-scheduler # <-- Use our custom scheduler
      containers:
      - name: cuda-container
        image: nvidia/cuda:12.1.0-base-ubuntu22.04
        command: ["sleep", "3600"]
        resources:
          limits:
            nvidia.com/gpu: 1
    ---
    # ... (repeat for gpu-pod-2, gpu-pod-3, gpu-pod-4)

    Verification Scenario:

  • Initial State: Assume a cluster with a GPU node pool managed by the Cluster Autoscaler. Let's say it has two n1-standard-8 nodes, each with 2 NVIDIA T4 GPUs.
  • Deploy Pod 1: kubectl apply -f test-pods.yaml (just pod 1). It gets scheduled on node-1.
  • Deploy Pod 2: Apply the manifest for pod 2. Check the scheduler logs (kubectl logs -n kube-system -l app=gpu-bin-packer-scheduler). You should see GpuBinPacker score node-1 at (100 * 1) / 2 = 50 and node-2 at 0, so the pod is placed on node-1.
  • Deploy Pods 3 & 4: Apply the manifests for pods 3 and 4. Pod 3 will land on node-2 because node-1 is now full (its filter plugins will fail). Pod 4 will be bin-packed with pod 3 on node-2.
  • Now, let's observe the output:

    bash
    $ kubectl get pods -o wide
    
    NAME        READY   STATUS    RESTARTS   AGE   IP          NODE     NOMINATED NODE   READINESS GATES
    gpu-pod-1   1/1     Running   0          5m    10.4.1.5    node-1   <none>           <none>
    gpu-pod-2   1/1     Running   0          4m    10.4.1.6    node-1   <none>           <none>
    gpu-pod-3   1/1     Running   0          3m    10.4.2.8    node-2   <none>           <none>
    gpu-pod-4   1/1     Running   0          2m    10.4.2.9    node-2   <none>           <none>

    The pods are tightly packed. Now, if you kubectl delete pod gpu-pod-3 gpu-pod-4, node-2 becomes completely empty. The Cluster Autoscaler will identify this underutilized node and, after a configurable timeout (typically 10 minutes), terminate it, achieving our goal of cost savings.

    Advanced Considerations and Edge Cases

    Handling Heterogeneous GPU Types

    In a real-world cluster, you might have nodes with different GPU types (e.g., nvidia.com/gpu-type-t4 and nvidia.com/gpu-type-a100). Our current scheduler only looks at nvidia.com/gpu and doesn't differentiate. This is a problem because a pod requesting a T4 should not be scored based on A100 utilization.

    To handle this, the scoring logic must be more sophisticated. It needs to check the specific GPU resource type requested by the pod.

    Improved Score Logic Snippet:

    go
    // Inside the Score function... (this snippet also needs "strings" added to the import block)
    
    // Find the specific GPU type the pod is requesting.
    var podGpuRequestType v1.ResourceName
    for _, container := range p.Spec.Containers {
        for resourceName := range container.Resources.Requests {
            if strings.HasPrefix(string(resourceName), "nvidia.com/") {
                podGpuRequestType = resourceName
                break
            }
        }
        if podGpuRequestType != "" {
            break
        }
    }
    
    // If the pod doesn't request a GPU, this plugin shouldn't score.
    if podGpuRequestType == "" {
        return 0, nil
    }
    
    // Now, calculate score based on the specific GPU type.
    totalGpuCapacity, hasGpu := node.Status.Allocatable[podGpuRequestType]
    if !hasGpu || totalGpuCapacity.IsZero() {
        return 0, nil
    }
    
    allocatedGpu := int64(0)
    for _, podInfo := range nodeInfo.Pods {
        // nodeInfo.Pods holds *framework.PodInfo; read the pod spec via the Pod field.
        for _, container := range podInfo.Pod.Spec.Containers {
            if req, ok := container.Resources.Requests[podGpuRequestType]; ok {
                allocatedGpu += req.Value()
            }
        }
    }
    
    score := (framework.MaxNodeScore * allocatedGpu) / totalGpuCapacity.Value()
    return score, nil

    This revised logic correctly isolates the scoring to the specific GPU resource type the incoming pod requests, making the scheduler viable in a heterogeneous environment.

    Performance at Scale

    The Score function is in the hot path of the scheduling loop. It's called for every pod for every node that passes the filter stage. Our current implementation iterates through all pods on a node (O(P) where P is the number of pods on the node). In clusters with nodes running hundreds of pods, this could introduce latency.

    For hyper-scale clusters, you might consider pre-calculation. The PreScore extension point could be used to calculate the GPU allocation for all nodes once per scheduling cycle and store it in the CycleState. The Score function would then just read this pre-computed value from the state, making it an O(1) operation.
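
    The sketch below shows one way to wire this up, assuming we extend the same GpuBinPacker type. The preScoreState type and state key are illustrative names, the method signature matches the v1.28 PreScorePlugin interface, and the plugin would also need to be enabled at the preScore extension point in the scheduler profile.

    go
    // Assert that the plugin also implements PreScore.
    var _ framework.PreScorePlugin = &GpuBinPacker{}

    // preScoreStateKey identifies our cached data in the CycleState.
    const preScoreStateKey = framework.StateKey("PreScore" + Name)

    // preScoreState caches the summed GPU requests per node for one cycle.
    type preScoreState struct {
        allocated map[string]int64 // node name -> allocated GPU requests
    }

    // Clone implements framework.StateData; the map is read-only after PreScore.
    func (s *preScoreState) Clone() framework.StateData { return s }

    // PreScore runs once per pod per scheduling cycle, before Score is called
    // for each node, so the per-pod iteration happens only once per cycle.
    func (pl *GpuBinPacker) PreScore(ctx context.Context, state *framework.CycleState, p *v1.Pod, nodes []*v1.Node) *framework.Status {
        s := &preScoreState{allocated: make(map[string]int64, len(nodes))}
        for _, node := range nodes {
            nodeInfo, err := pl.handle.SnapshotSharedLister().NodeInfos().Get(node.Name)
            if err != nil {
                continue // skip nodes missing from the snapshot
            }
            var sum int64
            for _, podInfo := range nodeInfo.Pods {
                for _, c := range podInfo.Pod.Spec.Containers {
                    if req, ok := c.Resources.Requests[GpuResourceName]; ok {
                        sum += req.Value()
                    }
                }
            }
            s.allocated[node.Name] = sum
        }
        state.Write(preScoreStateKey, s)
        return nil
    }

    // In Score, replace the per-pod loop with a map lookup:
    //
    //   if data, err := state.Read(preScoreStateKey); err == nil {
    //       if s, ok := data.(*preScoreState); ok {
    //           allocatedGpu = s.allocated[nodeName]
    //       }
    //   }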

    Preemption and Priority

    Our scheduler only influences pod placement through scoring. It does not interfere with Kubernetes's built-in preemption mechanisms. If a high-priority pod needs to be scheduled and the cluster is full, the default preemption logic will still kick in to evict lower-priority pods to make space. Bin-packing can sometimes create resource hotspots, making preemption more likely, which is a trade-off to consider. For batch-style, non-interactive workloads like ML training, this is often an acceptable trade-off for the cost benefits.

    Conclusion

    The default Kubernetes scheduler is a powerful, general-purpose tool, but its one-size-fits-all approach can be suboptimal for specialized, high-cost resources. By leveraging the scheduler framework, we can surgically inject domain-specific logic to solve critical operational problems like GPU fragmentation.

    The bin-packing scheduler we've built is not just a theoretical exercise; it's a production-grade pattern that directly addresses a significant cost driver in AI/ML platforms. By consolidating GPU workloads, it maximizes resource utilization and works in concert with the Cluster Autoscaler to dynamically adjust infrastructure to the precise demands of the workload, turning a scheduling problem into a direct and substantial cost-saving solution.
