GPU Bin-Packing with a Custom Kubernetes Scheduler Framework Plugin
The Problem: GPU Fragmentation and the Default Scheduler's Shortcomings
In any large-scale Kubernetes cluster running machine learning workloads, GPU resource management is a primary operational and financial challenge. The default Kubernetes scheduler (kube-scheduler) employs a set of scoring plugins that, out of the box, favor a 'spread' strategy. The NodeResourcesFit plugin's default LeastAllocated scoring strategy (formerly the standalone NodeResourcesLeastAllocated plugin), for instance, gives higher scores to nodes with more free resources. While this is a reasonable default for general-purpose stateless applications, it is profoundly inefficient for expensive, non-divisible resources like GPUs.
This 'spread' logic leads to GPU fragmentation. Imagine a cluster with three 8-GPU nodes. The scheduler might place one 1-GPU pod on each node. Your cluster is now running three pods but is utilizing three expensive nodes, with 21 GPUs sitting idle. The Cluster Autoscaler sees that all three nodes are in use and will not scale them down. The ideal scenario, bin-packing, would be to place all three pods on a single node, leaving the other two completely empty and prime for termination by the autoscaler. This consolidation directly translates to significant cost savings.
While scheduler extenders were an early solution, they introduce network latency into the critical scheduling path and have limited access to the scheduler's internal state. The modern, performant, and correct approach is to implement a Scheduler Framework Plugin. This article provides a production-focused guide to building, configuring, and deploying a custom scoring plugin in Go to achieve GPU bin-packing.
Architectural Prerequisite: Scheduler Framework vs. Extenders
Before diving into code, it's crucial to understand why we're choosing the Scheduler Framework.
* Scheduler Extenders: An extender is a simple webhook. The scheduler makes an HTTP call to an external service at the Filter and Prioritize (scoring) stages.
  * Pros: Language-agnostic.
  * Cons: High latency (an extra network hop), statelessness (the extender has no access to the scheduler's cache), and increased operational complexity (managing another service).
* Scheduler Framework: A Go plugin interface that allows you to compile your custom logic directly into the scheduler binary. Your code runs in-process, giving it direct access to the scheduler's cache and eliminating network overhead.
  * Pros: High performance, deep integration, access to the full scheduling context.
  * Cons: Requires writing Go, slightly more complex initial setup.
For performance-critical, state-aware logic like resource-based bin-packing, the Scheduler Framework is the only production-viable choice.
Implementing the GPU Bin-Packing Scoring Plugin
Our goal is to create a plugin that implements the Score extension point. This plugin will calculate a score for each node based on its GPU utilization, giving the highest scores to nodes that are already running GPU workloads.
1. Project Setup
First, set up a Go project. We'll need dependencies from the k8s.io repositories; make sure your Go version is compatible with the libraries of the Kubernetes release you are targeting.
mkdir gpu-bin-pack-scheduler
cd gpu-bin-pack-scheduler
go mod init github.com/your-org/gpu-bin-pack-scheduler
# Get the necessary Kubernetes dependencies (pin <version> to the release matching your cluster)
go get k8s.io/api@<version>
go get k8s.io/apimachinery@<version>
go get k8s.io/component-base@<version>
go get k8s.io/kubernetes@<version>
Note that k8s.io/kubernetes pins its staging repositories (k8s.io/api, k8s.io/apimachinery, k8s.io/component-base, and so on) to v0.0.0, so your go.mod will also need replace directives pointing each staging module at the tagged release you are building against.
2. Plugin Boilerplate
Create a file pkg/gpubinpacking/plugin.go. Keeping the plugin in its own package (rather than in package main) lets main.go import and register it. This file will contain our plugin's definition, constructor, and the core logic.
// pkg/gpubinpacking/plugin.go
package gpubinpacking
import (
"context"
"fmt"
"k8s.io/apimachinery/pkg/runtime"
v1 "k8s.io/api/core/v1"
"k8s.io/klog/v2"
"k8s.io/kubernetes/pkg/scheduler/framework"
)
const (
// Name is the name of the plugin used in the scheduler configuration.
Name = "GPUBinPacking"
// GPUDevice is the name of the GPU resource.
GPUDevice v1.ResourceName = "nvidia.com/gpu"
)
// GPUBinPacking is a plugin that favors nodes with high GPU utilization.
type GPUBinPacking struct {
handle framework.Handle
}
var _ framework.ScorePlugin = &GPUBinPacking{}
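// NOTE: this constructor signature matches the scheduler framework libraries up to
// roughly Kubernetes 1.28; in newer releases the plugin factory also receives a
// context.Context as its first argument, so adjust New if you build against them.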
// New initializes a new plugin and returns it.
func New(_ runtime.Object, h framework.Handle) (framework.Plugin, error) {
return &GPUBinPacking{
handle: h,
}, nil
}
// Name returns the name of the plugin.
func (pl *GPUBinPacking) Name() string {
return Name
}
// The core scoring logic will go here...
This code sets up the basic structure. We define a GPUBinPacking struct that satisfies the framework.ScorePlugin interface. The New function is our constructor, and Name provides the identifier we'll use in the scheduler configuration YAML.
3. Implementing the `Score` Logic
The Score method is the heart of our plugin. It is called for every node that passes the Filter phase and must return a score (from framework.MinNodeScore to framework.MaxNodeScore) and a status. A higher score means the node is a better fit.
Our logic will be:
- Calculate the total number of GPUs requested by the incoming pod.
- For the given node, get its total allocatable GPU capacity.
- For the given node, sum the GPU requests of all pods already running on it.
- Compute (requestedGPUs + usedGPUs) / capacityGPUs. This rewards nodes where the new pod pushes utilization toward 100%. For example, a 2-GPU pod landing on an 8-GPU node that already has 5 GPUs requested yields (5 + 2) / 8 = 0.875.
- Scale the result into the [0-100] range required by the framework, so the example above scores 87.
// Add this method to the GPUBinPacking struct in plugin.go
// Score invoked at the score extension point.
func (pl *GPUBinPacking) Score(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) (int64, *framework.Status) {
nodeInfo, err := pl.handle.SnapshotSharedLister().NodeInfos().Get(nodeName)
if err != nil {
return 0, framework.AsStatus(fmt.Errorf("getting node %q from snapshot: %w", nodeName, err))
}
node := nodeInfo.Node()
if node == nil {
return 0, framework.AsStatus(fmt.Errorf("node %q not found", nodeName))
}
// Get allocatable GPUs on the node
allocatableGPUs, hasGPU := node.Status.Allocatable[GPUDevice]
if !hasGPU || allocatableGPUs.Value() == 0 {
		// If the node has no allocatable GPUs, this plugin is neutral for it.
		// We return a 0 score, giving the node no preference.
return 0, framework.NewStatus(framework.Success)
}
requestedGPUs := getPodGPURequest(pod)
if requestedGPUs == 0 {
// If the pod doesn't need GPUs, we don't influence scheduling.
return 0, framework.NewStatus(framework.Success)
}
// Calculate used GPUs on the node
usedGPUs := getUsedGPUs(nodeInfo)
capacity := allocatableGPUs.Value()
utilization := (float64(usedGPUs) + float64(requestedGPUs)) / float64(capacity)
// If utilization > 1, it means the pod won't fit.
// This should ideally be caught by the Filter phase (e.g., NodeResourcesFit),
// but as a safeguard, we return a 0 score.
if utilization > 1 {
klog.V(4).Infof("Pod %s/%s cannot fit on node %s due to GPU capacity", pod.Namespace, pod.Name, nodeName)
return 0, framework.NewStatus(framework.Success)
}
// Scale the score to be between MinNodeScore and MaxNodeScore (0-100)
score := int64(utilization * float64(framework.MaxNodeScore))
klog.V(5).Infof("GPU Bin-Packing Score for pod %s/%s on node %s: %d", pod.Namespace, pod.Name, nodeName, score)
return score, framework.NewStatus(framework.Success)
}
// ScoreExtensions of the Score plugin.
func (pl *GPUBinPacking) ScoreExtensions() framework.ScoreExtensions {
// We don't need normalization, so we return nil.
return nil
}
// Helper function to get total GPU request for a pod
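// (For simplicity, init containers and pod-level overhead are ignored here.)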
func getPodGPURequest(pod *v1.Pod) int64 {
var total int64
for _, container := range pod.Spec.Containers {
if req, ok := container.Resources.Requests[GPUDevice]; ok {
total += req.Value()
}
}
return total
}
// Helper function to get used GPUs on a node
func getUsedGPUs(nodeInfo *framework.NodeInfo) int64 {
var used int64
for _, pod := range nodeInfo.Pods {
// Ignore terminal pods
if pod.Pod.Status.Phase == v1.PodSucceeded || pod.Pod.Status.Phase == v1.PodFailed {
continue
}
used += getPodGPURequest(pod.Pod)
}
return used
}
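Since these helpers are pure functions over standard scheduler types, they can be unit-tested without a cluster. The sketch below assumes the package layout used in this article; the gpuPod fixture is an illustrative helper, not part of any Kubernetes library.
// pkg/gpubinpacking/plugin_test.go
package gpubinpacking

import (
	"testing"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// gpuPod builds a pod with one container requesting the given number of GPUs.
func gpuPod(name, gpus string) *v1.Pod {
	return &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: name},
		Spec: v1.PodSpec{
			Containers: []v1.Container{{
				Name: "c",
				Resources: v1.ResourceRequirements{
					Requests: v1.ResourceList{GPUDevice: resource.MustParse(gpus)},
				},
			}},
		},
	}
}

func TestGetPodGPURequest(t *testing.T) {
	if got := getPodGPURequest(gpuPod("p1", "2")); got != 2 {
		t.Errorf("expected 2 GPUs requested, got %d", got)
	}
}

func TestGetUsedGPUs(t *testing.T) {
	// NodeInfo built from two running pods requesting 1 and 3 GPUs.
	nodeInfo := framework.NewNodeInfo(gpuPod("p1", "1"), gpuPod("p2", "3"))
	if got := getUsedGPUs(nodeInfo); got != 4 {
		t.Errorf("expected 4 GPUs used, got %d", got)
	}
}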
4. Main Function to Register the Plugin
Finally, we need a main.go file to register our plugin with the scheduler's command-line interface.
// main.go
package main
import (
"os"
"k8s.io/component-base/cli"
"k8s.io/kubernetes/cmd/kube-scheduler/app"
	// Import our plugin package
	"github.com/your-org/gpu-bin-pack-scheduler/pkg/gpubinpacking"
)

func main() {
	// Register the plugin under the Name referenced by the KubeSchedulerConfiguration
	command := app.NewSchedulerCommand(
		app.WithPlugin(gpubinpacking.Name, gpubinpacking.New),
)
if err := cli.RunNoErrOutput(command); err != nil {
os.Exit(1)
}
}
Now you can build this into a container image.
FROM golang:1.21-alpine AS builder
WORKDIR /go/src/app
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -o /usr/local/bin/gpu-bin-pack-scheduler .
# The final image contains only the statically linked binary
FROM alpine:latest
# Naming the binary kube-scheduler is a convention, not a requirement;
# the Deployment below invokes it explicitly via the entrypoint.
COPY --from=builder /usr/local/bin/gpu-bin-pack-scheduler /usr/local/bin/kube-scheduler
ENTRYPOINT ["/usr/local/bin/kube-scheduler"]
Deployment and Configuration
Running a custom scheduler isn't as simple as deploying a pod. You need to provide a specific configuration file that tells Kubernetes how to use your plugin.
1. KubeSchedulerConfiguration
This ConfigMap defines a scheduler profile. We create a new profile named gpu-bin-packer and configure its plugins section.
Crucially, we:
- Disable the default NodeResourcesFit scoring, whose LeastAllocated strategy spreads pods across nodes (its Filter extension point stays enabled).
- Enable our GPUBinPacking plugin in the score phase with a dominant weight.
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-scheduler-config
  namespace: kube-system
data:
  scheduler-config.yaml: |
    apiVersion: kubescheduler.config.k8s.io/v1
    kind: KubeSchedulerConfiguration
    leaderElection:
      leaderElect: true
      # Use a dedicated lease so we don't contend with the default
      # kube-scheduler's "kube-scheduler" lock in kube-system.
      resourceName: gpu-bin-pack-scheduler
    profiles:
      # Only define the custom profile. The pre-existing kube-scheduler keeps
      # handling pods with schedulerName: default-scheduler; declaring that
      # profile here as well would make two schedulers race over the same pods.
      - schedulerName: gpu-bin-packer
        plugins:
          # Default plugins stay enabled in each phase.
          # We only need to specify what we want to change.
          score:
            disabled:
              # Disable the default scoring that spreads pods
              - name: NodeResourcesFit
            enabled:
              - name: GPUBinPacking
                weight: 100 # Give our plugin a dominant weight
              # We still want other default scoring plugins to run
              - name: NodeResourcesBalancedAllocation
                weight: 5
              - name: ImageLocality
                weight: 5
2. Deployment of the Custom Scheduler
Next, we deploy our custom scheduler as a Deployment in the kube-system namespace. It will run alongside the default kube-scheduler.
apiVersion: apps/v1
kind: Deployment
metadata:
name: gpu-bin-pack-scheduler
namespace: kube-system
labels:
component: gpu-bin-pack-scheduler
spec:
replicas: 1
selector:
matchLabels:
component: gpu-bin-pack-scheduler
template:
metadata:
labels:
component: gpu-bin-pack-scheduler
spec:
serviceAccountName: kube-scheduler # Use a ServiceAccount with appropriate permissions
containers:
- name: scheduler-plugin
image: your-registry/gpu-bin-pack-scheduler:latest
args:
- --config=/etc/kubernetes/scheduler-config.yaml
- --v=4 # Increase verbosity for debugging
resources:
requests:
cpu: "100m"
memory: "256Mi"
volumeMounts:
- name: scheduler-config-volume
mountPath: /etc/kubernetes
volumes:
- name: scheduler-config-volume
configMap:
name: gpu-scheduler-config
Note: You will need to create a ClusterRoleBinding that grants your scheduler's ServiceAccount permissions equivalent to the built-in system:kube-scheduler ClusterRole (and, for clusters using volume scheduling, system:volume-scheduler), or an equivalent custom ClusterRole.
3. Using the Custom Scheduler
To have a pod scheduled by our new scheduler, simply specify its schedulerName in the pod spec:
apiVersion: v1
kind: Pod
metadata:
name: gpu-workload-1
spec:
schedulerName: gpu-bin-packer # This is the key!
containers:
- name: cuda-container
image: nvidia/cuda:11.4.0-base-ubuntu20.04
command: ["sleep", "3600"]
resources:
limits:
nvidia.com/gpu: 1
When you apply this pod, you can check the logs of your gpu-bin-pack-scheduler pod to see the scoring decisions.
Advanced Considerations and Edge Cases
Performance Profiling
The scheduler is a critical, latency-sensitive component, and a slow scoring plugin can degrade the performance of the entire cluster. Use Go's built-in pprof to profile your plugin under load: the kube-scheduler exposes a /debug/pprof endpoint when profiling is enabled (the default). Ensure your scoring logic is efficient and avoids expensive computation or I/O. Our implementation relies on the scheduler's in-memory cache (via SnapshotSharedLister), which is extremely fast.
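Score runs once per candidate node for every pod, so the aggregation helpers sit on a hot path. Before profiling the full scheduler under load, a quick Go micro-benchmark is a cheap sanity check; this sketch reuses the hypothetical gpuPod fixture from the test file shown earlier.
// pkg/gpubinpacking/score_bench_test.go
package gpubinpacking

import (
	"fmt"
	"testing"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

func BenchmarkGetUsedGPUs(b *testing.B) {
	// Simulate a node already running many single-GPU pods.
	pods := make([]*v1.Pod, 0, 64)
	for i := 0; i < 64; i++ {
		pods = append(pods, gpuPod(fmt.Sprintf("p%d", i), "1"))
	}
	nodeInfo := framework.NewNodeInfo(pods...)

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		_ = getUsedGPUs(nodeInfo)
	}
}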
Interaction with the Cluster Autoscaler
This is the entire point of our exercise. With the bin-packing scheduler consolidating GPU pods onto a minimal set of nodes, other GPU nodes will become completely empty. The Cluster Autoscaler will identify these nodes as unneeded (any remaining pods can be scheduled elsewhere) and, after a configurable timeout (10 minutes by default, controlled by --scale-down-unneeded-time), will terminate them. When new GPU pods arrive, the Cluster Autoscaler will provision new nodes as required. This pack-and-scale cycle is the key to cost optimization.
Edge Case: Multi-GPU Pods and NUMA Topology
Our current scoring logic is simple. It doesn't account for NUMA topology. A high-performance computing (HPC) or distributed training job might require 4 GPUs that are all on the same NUMA node for low-latency interconnect. An 8-GPU machine might have 4 GPUs on NUMA node 0 and 4 on NUMA node 1. If 1 GPU is already used on each NUMA node, the machine has 6 free GPUs, but cannot satisfy a 4-GPU same-NUMA-node request.
To solve this, you would need to extend your scheduler with a custom Filter plugin; a minimal skeleton is sketched after this list.
- The pod would need to express its topology requirements (e.g., via annotations).
- The Filter plugin would read these annotations.
- It would need detailed node topology information, perhaps from a device plugin like the NVIDIA GPU Operator, which can expose topology as node labels.
- The Filter plugin would then check whether the node has a NUMA node with enough free GPUs to satisfy the pod's request. If not, the node is filtered out before the scoring phase.
This demonstrates the power of combining different plugin extension points to solve complex, real-world scheduling problems.
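As a rough, hypothetical sketch only: the annotation key example.com/gpus-same-numa and the node label example.com/numa-max-free-gpus below are invented placeholders, and the per-NUMA free-GPU figure would have to come from whatever topology exporter you actually run. The skeleton simply shows where such logic plugs into the framework.
// pkg/gpubinpacking/numa_filter.go - hypothetical NUMA-aware Filter plugin skeleton
package gpubinpacking

import (
	"context"
	"strconv"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

const (
	// NUMAFilterName would be enabled under the profile's filter phase.
	NUMAFilterName = "NUMAGPUFilter"
	// Hypothetical pod annotation: number of GPUs that must share a NUMA node.
	sameNUMAAnnotation = "example.com/gpus-same-numa"
	// Hypothetical node label maintained by a topology exporter: the largest
	// number of free GPUs currently available on any single NUMA node.
	maxFreePerNUMALabel = "example.com/numa-max-free-gpus"
)

// NUMAGPUFilter rejects nodes that cannot satisfy a same-NUMA GPU request.
type NUMAGPUFilter struct{}

var _ framework.FilterPlugin = &NUMAGPUFilter{}

func (pl *NUMAGPUFilter) Name() string { return NUMAFilterName }

func (pl *NUMAGPUFilter) Filter(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	want, ok := pod.Annotations[sameNUMAAnnotation]
	if !ok {
		// The pod declared no topology requirement; don't filter anything out.
		return framework.NewStatus(framework.Success)
	}
	needed, err := strconv.ParseInt(want, 10, 64)
	if err != nil || needed <= 0 {
		// Ignore malformed annotations rather than blocking scheduling.
		return framework.NewStatus(framework.Success)
	}
	free, err := strconv.ParseInt(nodeInfo.Node().Labels[maxFreePerNUMALabel], 10, 64)
	if err != nil || free < needed {
		return framework.NewStatus(framework.Unschedulable, "no NUMA node with enough free GPUs")
	}
	return framework.NewStatus(framework.Success)
}
To use something like this, you would register it with app.WithPlugin alongside the score plugin and enable it under the profile's filter phase.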
Conclusion
Moving beyond the default Kubernetes scheduler is a significant step toward a mature, optimized, and cost-effective infrastructure, especially for specialized workloads. By leveraging the Scheduler Framework, you can inject domain-specific logic directly into the heart of Kubernetes. The GPU bin-packing plugin we developed is not a theoretical exercise; it is a practical, high-impact solution to a common problem in MLOps. It directly enables better hardware utilization and facilitates aggressive, cost-saving autoscaling, turning a simple scheduling tweak into a substantial financial and operational win.