K8s Custom Schedulers for GPU Bin-Packing in ML Workloads
The Default Scheduler's Shortcomings for GPU Workloads
For senior engineers managing large-scale machine learning platforms on Kubernetes, the limitations of the default-scheduler become apparent quickly, especially with expensive GPU resources. While excellent for general-purpose stateless applications, its default policies—primarily NodeResourcesFit, NodeName, TaintToleration, and a balanced scoring strategy—fall short for specialized, high-value workloads. The core issues are:
* Topology unawareness: For distributed training jobs (e.g., with torch.distributed), placing cooperating pods on GPUs connected by a high-speed interconnect is critical for performance. The default scheduler has no concept of this sub-node topology. It sees nvidia.com/gpu: 8 as eight fungible resources, not as four NVLink-paired sets.
To solve these production issues, we must move beyond simple nodeSelector or affinity rules and implement custom scheduling logic. This post will focus on building a high-performance, bin-packing scheduler for GPUs using the modern Kubernetes Scheduling Framework.
Architecture: Scheduler Extenders vs. The Scheduling Framework
Before we build, it's crucial to understand the two primary mechanisms for customizing scheduling in Kubernetes. While you might encounter legacy systems using extenders, the Scheduling Framework is the standard for modern implementations.
The Legacy Approach: Scheduler Extenders
A scheduler extender is an external webhook (HTTP server) that the default scheduler calls out to during its decision-making process. You configure the scheduler to send pod and node information to your extender's endpoints for two main operations:
* Filter: The extender receives a list of nodes and returns a subset that are eligible to run the pod.
* Prioritize (Score): The extender receives the filtered list of nodes and returns a score for each, indicating preference.
Pros:
* Language Agnostic: You can write it in any language that can host an HTTP server (Python, Node.js, etc.).
* Decoupled: Runs as a separate process, isolated from the Kubernetes control plane.
Cons:
* Performance Overhead: Every scheduling decision for a relevant pod incurs at least two network round-trips. This latency is unacceptable in large, dynamic clusters with high pod churn.
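* Limited Integration: Extenders have a very coarse-grained view. They are restricted to a small set of webhook verbs (filter, prioritize, preempt, and bind) and cannot hook into finer-grained points of the scheduling cycle such as Reserve, Permit, or PreBind.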
* State Management: Maintaining a consistent view of the cluster state in an external service is complex and prone to race conditions.
The Modern Approach: The Scheduling Framework
The Scheduling Framework, introduced as alpha in Kubernetes v1.15 and stable since v1.19, provides a set of well-defined extension points (Go interfaces) that allow you to compile custom logic directly into the scheduler binary. These plugins run in-process, eliminating network overhead and providing deep integration.
Key Extension Points:
* QueueSort: Defines the order in which pods are taken from the scheduling queue.
* PreFilter: Performs preliminary checks on a pod before iterating through nodes.
* Filter: Similar to the extender's filter, determines if a node can run the pod. Can be stateful.
* PostFilter: Called if no nodes passed the Filter phase. Useful for preemption logic.
* PreScore: A pre-computation phase before scoring each node individually.
* Score: The core of custom logic. Assigns an integer score to each node that passed the filter phase.
* Reserve: Claims resources on a node before the pod is bound.
* Permit: A final gate before binding, allowing for asynchronous checks (e.g., waiting for a resource quota to be approved).
* PreBind / Bind / PostBind: Hooks around the process of binding the pod to the node.
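To make the flow through the Filter and Score extension points concrete, here is a toy, stdlib-only sketch of one scheduling cycle: drop infeasible nodes, sum the scores of the rest, pick the best. The types and functions (node, filterFn, scoreFn, schedule) are hypothetical simplifications for illustration, not the framework's real interfaces.

```go
package main

import (
	"errors"
	"fmt"
)

// node is a hypothetical, simplified stand-in for the scheduler's view of a node.
type node struct {
	name     string
	freeGPUs int64
	usedGPUs int64
}

// filterFn and scoreFn loosely mirror the roles of Filter and Score plugins.
type filterFn func(n node, wantGPUs int64) bool
type scoreFn func(n node) int64

// schedule runs every filter, sums every score, and returns the best feasible node.
func schedule(nodes []node, wantGPUs int64, filters []filterFn, scorers []scoreFn) (string, error) {
	best, bestScore, found := "", int64(-1), false
	for _, n := range nodes {
		feasible := true
		for _, f := range filters {
			if !f(n, wantGPUs) {
				feasible = false
				break
			}
		}
		if !feasible {
			continue
		}
		var total int64
		for _, s := range scorers {
			total += s(n)
		}
		if total > bestScore {
			best, bestScore, found = n.name, total, true
		}
	}
	if !found {
		return "", errors.New("no feasible node")
	}
	return best, nil
}

// fitsGPUs rejects nodes without enough free GPUs.
func fitsGPUs(n node, want int64) bool { return n.freeGPUs >= want }

// mostUsed prefers nodes that already run more GPU work.
func mostUsed(n node) int64 { return n.usedGPUs }

func main() {
	nodes := []node{
		{name: "node-a", freeGPUs: 8, usedGPUs: 0},
		{name: "node-b", freeGPUs: 2, usedGPUs: 6},
		{name: "node-c", freeGPUs: 1, usedGPUs: 7}, // filtered out: only 1 GPU free
	}
	picked, err := schedule(nodes, 2, []filterFn{fitsGPUs}, []scoreFn{mostUsed})
	fmt.Println(picked, err) // node-b <nil>
}
```

The real framework adds queueing, reservation, and binding around this core, but the filter-then-score shape is the part a custom plugin usually plugs into.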
Decision: For any serious, performance-sensitive use case like GPU scheduling, the Scheduling Framework is the unequivocally correct choice. It offers superior performance, tighter integration, and a more robust model for state management.
Implementing a GPU Bin-Packing Scheduler Plugin
Our goal is to create a scheduler that prioritizes nodes with the highest existing GPU allocation. This will consolidate GPU pods, leaving other nodes completely free for large, multi-GPU jobs or for the cluster autoscaler to terminate.
We'll implement a custom Score plugin. The scoring logic will be simple: a node's score is proportional to the percentage of its allocatable GPUs that are already requested by running pods.
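As a quick worked example of that formula, the sketch below computes the score with plain integers. The function name binPackScore and the clamping of out-of-range inputs are assumptions made for illustration; maxNodeScore mirrors the framework's MaxNodeScore of 100.

```go
package main

import "fmt"

// maxNodeScore mirrors framework.MaxNodeScore (100).
const maxNodeScore = 100

// binPackScore scales requested/allocatable GPUs to the range [0, maxNodeScore].
// Out-of-range inputs are clamped (an assumption of this sketch).
func binPackScore(requestedGPUs, allocatableGPUs int64) int64 {
	if allocatableGPUs <= 0 {
		return 0
	}
	if requestedGPUs < 0 {
		requestedGPUs = 0
	}
	if requestedGPUs > allocatableGPUs {
		requestedGPUs = allocatableGPUs
	}
	return requestedGPUs * maxNodeScore / allocatableGPUs
}

func main() {
	fmt.Println(binPackScore(0, 8)) // 0: empty node, least preferred
	fmt.Println(binPackScore(6, 8)) // 75
	fmt.Println(binPackScore(8, 8)) // 100: full node
	fmt.Println(binPackScore(0, 0)) // 0: node without GPUs
}
```

A full node still scores 100 here; it is the Filter phase (NodeResourcesFit), not the score, that keeps pods off nodes without enough free GPUs.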
Project Setup
First, set up a new Go project. We will be building a custom scheduler binary that includes our plugin.
mkdir gpu-scheduler
cd gpu-scheduler
go mod init github.com/my-org/gpu-scheduler
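# Get the necessary Kubernetes dependencies.
# Replace <version> with the release you target; the staging modules (k8s.io/component-base)
# are tagged v0.x.y where k8s.io/kubernetes is tagged v1.x.y.
# Note: k8s.io/kubernetes pins its k8s.io staging modules to v0.0.0 in its go.mod,
# so your own go.mod needs matching replace directives for them.
go get k8s.io/component-base/cli@<version>
go get k8s.io/kubernetes/cmd/kube-scheduler/app@<version>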
Now, let's create our plugin file: pkg/scheduler/plugin.go.
The `Score` Plugin Implementation
Our plugin needs to implement the ScorePlugin interface from the scheduling framework.
// pkg/scheduler/plugin.go
package scheduler

import (
	"context"
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/klog/v2"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

const (
	// Name is the name of the plugin used in the KubeSchedulerConfiguration.
	Name = "GPUBinPacking"
	// GPUResourceName is the name of the GPU resource.
	GPUResourceName = "nvidia.com/gpu"
)

// GPUBinPacking is a score plugin that favors nodes with higher GPU utilization.
type GPUBinPacking struct {
	handle framework.Handle
}

// Asserts that GPUBinPacking implements the ScorePlugin interface.
var _ framework.ScorePlugin = &GPUBinPacking{}

// Name returns the name of the plugin.
func (pl *GPUBinPacking) Name() string {
	return Name
}

// Score is the main logic for the plugin. It calculates a score for a node based on its GPU utilization.
func (pl *GPUBinPacking) Score(ctx context.Context, state *framework.CycleState, p *v1.Pod, nodeName string) (int64, *framework.Status) {
	nodeInfo, err := pl.handle.SnapshotSharedLister().NodeInfos().Get(nodeName)
	if err != nil {
		return 0, framework.AsStatus(fmt.Errorf("getting node %q from snapshot: %w", nodeName, err))
	}

	// Get the total allocatable GPUs on the node.
	allocatableGPUs, ok := nodeInfo.Node().Status.Allocatable[GPUResourceName]
	if !ok || allocatableGPUs.IsZero() {
		// If the node has no allocatable GPUs, it's not a candidate for GPU pods.
		// A score of 0 is appropriate, as it doesn't contribute to bin-packing.
		klog.Infof("Node %s has no allocatable GPUs, scoring 0", nodeName)
		return 0, framework.NewStatus(framework.Success)
	}
	totalGPUs := allocatableGPUs.Value()

	// Calculate the sum of GPUs requested by existing pods on the node.
	// nodeInfo.Pods holds *framework.PodInfo values; the pod itself is in the Pod field.
	requestedGPUs := int64(0)
	for _, podInfo := range nodeInfo.Pods {
		for _, container := range podInfo.Pod.Spec.Containers {
			if req, ok := container.Resources.Requests[GPUResourceName]; ok {
				requestedGPUs += req.Value()
			}
		}
	}

	// The score is the percentage of GPUs used, scaled to the framework's score range [0, 100].
	// A higher score means the node is more utilized, which is what we want for bin-packing.
	score := (requestedGPUs * framework.MaxNodeScore) / totalGPUs
	klog.Infof("Node: %s, Allocatable GPUs: %d, Requested GPUs: %d, Score: %d", nodeName, totalGPUs, requestedGPUs, score)
	return score, framework.NewStatus(framework.Success)
}

// ScoreExtensions returns a ScoreExtensions interface if the plugin implements it.
func (pl *GPUBinPacking) ScoreExtensions() framework.ScoreExtensions {
	return nil // We don't need normalization.
}

// New initializes a new plugin and returns it.
// Note: in newer Kubernetes releases the plugin factory takes a context.Context as its
// first argument; match the signature of the version you build against.
func New(_ runtime.Object, h framework.Handle) (framework.Plugin, error) {
	return &GPUBinPacking{
		handle: h,
	}, nil
}
Main Program to Register the Plugin
Now, we need a main.go to create a new scheduler command and register our custom plugin.
// cmd/scheduler/main.go
package main

import (
	"os"

	"k8s.io/component-base/cli"
	"k8s.io/kubernetes/cmd/kube-scheduler/app"

	"github.com/my-org/gpu-scheduler/pkg/scheduler"
)

func main() {
	// Register the plugin with the scheduler framework registry.
	command := app.NewSchedulerCommand(
		app.WithPlugin(scheduler.Name, scheduler.New),
	)
	// cli.Run prints any error and returns the process exit code
	// (cli.RunNoErrOutput would exit without reporting what went wrong).
	code := cli.Run(command)
	os.Exit(code)
}
This simple main function imports our plugin package and uses the app.WithPlugin option to make the GPUBinPacking plugin available to the scheduler's configuration.
Edge Case: Implementing a Topology-Aware `Filter` Plugin
Bin-packing is great, but what about performance-sensitive workloads that need GPUs connected via NVLink? A standard resources: { requests: { nvidia.com/gpu: 2 } } doesn't guarantee this. We can solve this with a custom Filter plugin that reads a pod annotation.
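Let's assume our nodes carry NVLink topology metadata set by an admin or a device plugin helper, for example: gpu-topology.my-org.com/nvlink-groups: "0-1,2-3,4-5,6-7". Note that Kubernetes label values cannot contain commas, so a value in this exact form has to be stored as a node annotation; to keep it as a label, pick a different group separator such as "_".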
A pod can request a tightly-coupled pair by using an annotation:
apiVersion: v1
kind: Pod
metadata:
  name: distributed-training-job-1
  annotations:
    gpu-topology.my-org.com/nvlink-count: "2"
spec:
  schedulerName: gpu-binpacking-scheduler
  containers:
    - name: cuda-worker
      image: nvidia/cuda:11.4.0-base-ubuntu20.04
      resources:
        limits:
          nvidia.com/gpu: "2"
        requests:
          nvidia.com/gpu: "2"
Let's implement a Filter plugin to enforce this.
// pkg/scheduler/topology_filter.go
package scheduler

import (
	"context"
	"strconv"
	"strings"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/klog/v2"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

const (
	TopologyFilterName   = "GPUTopologyFilter"
	NVLinkAnnotation     = "gpu-topology.my-org.com/nvlink-count"
	NVLinkGroupNodeLabel = "gpu-topology.my-org.com/nvlink-groups"
)

// GPUTopologyFilter checks if a node has enough GPUs within a single NVLink group.
type GPUTopologyFilter struct{}

var _ framework.FilterPlugin = &GPUTopologyFilter{}

func (f *GPUTopologyFilter) Name() string {
	return TopologyFilterName
}

func (f *GPUTopologyFilter) Filter(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	// Check if the pod requests NVLink-connected GPUs.
	nvlinkCountStr, ok := pod.Annotations[NVLinkAnnotation]
	if !ok {
		// This pod doesn't care about topology, so we don't filter.
		return framework.NewStatus(framework.Success)
	}
	nvlinkCount, err := strconv.Atoi(nvlinkCountStr)
	if err != nil || nvlinkCount <= 1 {
		// Invalid annotation or trivial request, pass.
		return framework.NewStatus(framework.Success)
	}

	// Check if the node has the topology metadata.
	node := nodeInfo.Node()
	if node == nil {
		return framework.NewStatus(framework.Error, "node not found")
	}
	// Label values cannot contain commas, so accept the same key as a node
	// annotation first and fall back to the label.
	nvlinkGroupsStr, ok := node.Annotations[NVLinkGroupNodeLabel]
	if !ok {
		nvlinkGroupsStr, ok = node.Labels[NVLinkGroupNodeLabel]
	}
	if !ok {
		klog.Infof("Node %s is rejected for pod %s because it lacks NVLink topology information", node.Name, pod.Name)
		return framework.NewStatus(framework.UnschedulableAndUnresolvable, "Node lacks NVLink topology information")
	}

	// Check if any NVLink group on the node is large enough.
	// Each dash-separated token counts as one GPU index, so "0-1" is a group of two.
	nvlinkGroups := strings.Split(nvlinkGroupsStr, ",")
	for _, group := range nvlinkGroups {
		gpuIndices := strings.Split(group, "-")
		if len(gpuIndices) >= nvlinkCount {
			// Found a suitable group. We don't need to check for available GPUs here, as the
			// default NodeResourcesFit plugin already handles the total GPU count.
			// A more advanced implementation would track allocation per-group.
			klog.Infof("Node %s is a candidate for pod %s, found NVLink group of size %d", node.Name, pod.Name, len(gpuIndices))
			return framework.NewStatus(framework.Success)
		}
	}

	klog.Infof("Node %s is rejected for pod %s, no NVLink group of size %d found", node.Name, pod.Name, nvlinkCount)
	return framework.NewStatus(framework.Unschedulable, "No available NVLink group of the required size")
}

// NewTopologyFilter initializes a new plugin and returns it.
func NewTopologyFilter(_ runtime.Object, _ framework.Handle) (framework.Plugin, error) {
	return &GPUTopologyFilter{}, nil
}
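The filter only checks group size, not whether the GPUs in a group are actually free. A minimal, stdlib-only sketch of the per-group tracking it alludes to might look like the following. The helper names (parseGroups, hasFreeGroup) and the allocated map are hypothetical: the scheduler does not know which GPU indices are in use by default, so that information would have to come from somewhere else, for example a device plugin helper or the plugin's own Reserve-phase bookkeeping.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseGroups turns "0-1,2-3" into [][]int{{0, 1}, {2, 3}}.
// Each dash-separated token is treated as a single GPU index (not a range),
// matching how the filter counts group members.
func parseGroups(s string) ([][]int, error) {
	var groups [][]int
	for _, g := range strings.Split(s, ",") {
		var indices []int
		for _, tok := range strings.Split(strings.TrimSpace(g), "-") {
			n, err := strconv.Atoi(tok)
			if err != nil {
				return nil, fmt.Errorf("bad GPU index %q in group %q: %w", tok, g, err)
			}
			indices = append(indices, n)
		}
		groups = append(groups, indices)
	}
	return groups, nil
}

// hasFreeGroup reports whether any group has at least want unallocated GPUs.
// allocated is hypothetical input: the set of GPU indices already in use.
func hasFreeGroup(groups [][]int, allocated map[int]bool, want int) bool {
	for _, g := range groups {
		free := 0
		for _, idx := range g {
			if !allocated[idx] {
				free++
			}
		}
		if free >= want {
			return true
		}
	}
	return false
}

func main() {
	groups, err := parseGroups("0-1,2-3,4-5,6-7")
	if err != nil {
		panic(err)
	}
	allocated := map[int]bool{0: true, 2: true, 4: true}
	fmt.Println(hasFreeGroup(groups, allocated, 2)) // true: group 6-7 is fully free

	allocated[6] = true
	fmt.Println(hasFreeGroup(groups, allocated, 2)) // false: every group has one GPU taken
}
```

The second call shows the case the simple filter misses: four GPUs are free on the node, so NodeResourcesFit would pass a 2-GPU pod, yet no NVLink pair is intact.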
We would then register this new plugin in our main.go as well:
// cmd/scheduler/main.go (updated)
// ...
	command := app.NewSchedulerCommand(
		app.WithPlugin(scheduler.Name, scheduler.New),                              // Our Score plugin
		app.WithPlugin(scheduler.TopologyFilterName, scheduler.NewTopologyFilter), // Our new Filter plugin
	)
// ...
Deployment and Configuration in a Production Cluster
Now that we have the code, we need to package and deploy it.
1. Packaging with a Dockerfile
Create a Dockerfile for a minimal, multi-stage Go build.
# Build stage
FROM golang:1.21-alpine AS builder
WORKDIR /app
COPY go.mod go.sum ./
RUN go mod download
COPY . .
# Build the scheduler binary
RUN CGO_ENABLED=0 GOOS=linux go build -o /gpu-scheduler ./cmd/scheduler
# Final stage
FROM alpine:latest
WORKDIR /root/
# Copy the binary from the builder stage
COPY --from=builder /gpu-scheduler .
# The scheduler binary is the entrypoint
ENTRYPOINT ["/root/gpu-scheduler"]
Build and push the image:
docker build -t your-registry/gpu-scheduler:v1.0.0 .
docker push your-registry/gpu-scheduler:v1.0.0
2. Kubernetes Manifests
We need a Deployment, RBAC rules, and a KubeSchedulerConfiguration.
scheduler-config.yaml: This ConfigMap holds the configuration for our scheduler profile.
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-scheduler-config
  namespace: kube-system
data:
  scheduler-config.yaml: |
    apiVersion: kubescheduler.config.k8s.io/v1
    kind: KubeSchedulerConfiguration
    leaderElection:
      leaderElect: true
      # Use a dedicated lease so this scheduler does not contend with the
      # default kube-scheduler for the same leader-election lock.
      resourceNamespace: kube-system
      resourceName: gpu-binpacking-scheduler
    profiles:
      - schedulerName: gpu-binpacking-scheduler
        plugins:
          # Default plugins are enabled at these extension points.
          # We add our custom plugins to the respective phases.
          filter:
            enabled:
              - name: GPUTopologyFilter
          score:
            enabled:
              - name: GPUBinPacking
            disabled:
              # We disable NodeResourcesBalancedAllocation to enforce our bin-packing.
              # Note: NodeResourcesFit still scores with its default LeastAllocated
              # strategy, which favors emptier nodes and can dilute the packing.
              - name: NodeResourcesBalancedAllocation
        pluginConfig:
          - name: GPUBinPacking
            args: {}
          - name: GPUTopologyFilter
            args: {}
scheduler-deployment.yaml: This deploys the scheduler itself.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: gpu-scheduler
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: gpu-scheduler-role
rules:
  # Add all the necessary permissions for a scheduler.
  # This is a truncated list for brevity. Refer to the default system:kube-scheduler role.
  - apiGroups: [""]
    resources: ["nodes", "pods", "pods/binding", "replicationcontrollers"]
    verbs: ["get", "list", "watch", "create", "update"]
  - apiGroups: ["apps"]
    resources: ["replicasets"]
    verbs: ["get", "list", "watch"]
  # ... more rules required
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: gpu-scheduler-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: gpu-scheduler-role
subjects:
  - kind: ServiceAccount
    name: gpu-scheduler
    namespace: kube-system
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-scheduler
  namespace: kube-system
  labels:
    app: gpu-scheduler
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-scheduler
  template:
    metadata:
      labels:
        app: gpu-scheduler
    spec:
      serviceAccountName: gpu-scheduler
      containers:
        - name: scheduler
          image: your-registry/gpu-scheduler:v1.0.0
          args:
            - --config=/etc/kubernetes/scheduler-config.yaml
            - --v=3 # Verbose logging
          volumeMounts:
            - name: scheduler-config-volume
              mountPath: /etc/kubernetes
      volumes:
        - name: scheduler-config-volume
          configMap:
            name: gpu-scheduler-config
Apply these manifests:
kubectl apply -f scheduler-config.yaml
kubectl apply -f scheduler-deployment.yaml
3. Using the Custom Scheduler
To use the scheduler, simply specify schedulerName in your pod spec:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-job-1
spec:
  schedulerName: gpu-binpacking-scheduler
  containers:
    - name: cuda-container
      image: nvidia/cuda:11.4.0-base-ubuntu20.04
      resources:
        limits:
          nvidia.com/gpu: "2"
After applying, check the pod's events to confirm it was scheduled by your custom scheduler:
kubectl describe pod gpu-job-1
# ... Output ...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 2s gpu-binpacking-scheduler Successfully assigned default/gpu-job-1 to node-g4dn-xlarge-1
Advanced Considerations and Performance Tuning
Preemption and Priority
Our bin-packing strategy can create contention. A high-priority pod might need a spot on a fully-packed node currently occupied by low-priority pods. The Kubernetes scheduler handles this via preemption, but our custom scoring must cooperate.
By default, our score doesn't consider pod priority. Preemption itself is handled by the DefaultPreemption plugin, a PostFilter plugin that is enabled by default; TaintToleration and InterPodAffinity play no part in it. In a heavily customized scheduler, you might need your own PostFilter plugin. A PostFilter plugin is called when no node passes the Filter phase. It can identify lower-priority pods whose eviction would make room for the current pod.
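To illustrate the kind of decision a PostFilter plugin makes, here is a stdlib-only sketch of victim selection on a single node: evict the lowest-priority GPU pods first until enough GPUs are free. The podOnNode type and pickVictims function are hypothetical, and the real preemption logic weighs more factors (PodDisruptionBudgets, for example) than this sketch does.

```go
package main

import (
	"fmt"
	"sort"
)

// podOnNode is a hypothetical, simplified view of a pod running on a node.
type podOnNode struct {
	name     string
	priority int32
	gpus     int64
}

// pickVictims returns the lowest-priority pods whose eviction frees at least
// needGPUs on the node. Only pods with a priority strictly below
// preemptorPriority are considered. ok is false if preemption cannot help.
func pickVictims(pods []podOnNode, freeGPUs, needGPUs int64, preemptorPriority int32) (victims []string, ok bool) {
	if freeGPUs >= needGPUs {
		return nil, true // nothing to evict
	}
	candidates := make([]podOnNode, 0, len(pods))
	for _, p := range pods {
		if p.priority < preemptorPriority && p.gpus > 0 {
			candidates = append(candidates, p)
		}
	}
	// Evict the least important pods first.
	sort.SliceStable(candidates, func(i, j int) bool {
		return candidates[i].priority < candidates[j].priority
	})
	for _, p := range candidates {
		victims = append(victims, p.name)
		freeGPUs += p.gpus
		if freeGPUs >= needGPUs {
			return victims, true
		}
	}
	return nil, false
}

func main() {
	pods := []podOnNode{
		{name: "train-low", priority: 10, gpus: 2},
		{name: "notebook", priority: 0, gpus: 1},
		{name: "serving", priority: 1000, gpus: 4},
	}
	victims, ok := pickVictims(pods, 1, 4, 100)
	fmt.Println(victims, ok) // [notebook train-low] true
}
```

On a tightly packed cluster this trade-off matters: bin-packing concentrates low-priority pods on the same nodes, so a single preemption can evict several of them at once.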
Performance Benchmarking
It's critical to validate that your custom plugin doesn't introduce scheduling latency.
* Scheduler Throughput: Use a tool like kubemark (a hollow-node simulator) to create thousands of virtual nodes and pods. Measure the rate at which your scheduler can place pods (pods/second). Compare this against the default scheduler baseline.
* Scheduling Latency: The scheduler itself exposes Prometheus metrics. Monitor scheduler_scheduling_algorithm_duration_seconds and scheduler_framework_extension_point_duration_seconds for the overall picture, and scheduler_plugin_execution_duration_seconds (labeled by plugin and extension point) to pinpoint latency in your custom plugins. A Score plugin should typically execute in microseconds.
As an illustration, a poorly implemented plugin that makes external API calls or performs heavy computations could increase P99 scheduling latency by an order of magnitude (for example, from ~50ms to ~500ms), severely impacting cluster responsiveness.
Observability
Instrument your plugin with custom metrics for observability.
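// pkg/scheduler/metrics.go
import (
	"github.com/prometheus/client_golang/prometheus"
	"k8s.io/component-base/metrics/legacyregistry"
)

var (
	binPackScores = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			// The histogram records scores, not durations, so the name carries no time unit.
			Name: "gpuscheduler_binpack_score",
			Help: "Histogram of scores calculated by the bin-packing plugin.",
			// Scores fall in [0, 100]; the default buckets are tuned for seconds.
			Buckets: prometheus.LinearBuckets(0, 10, 11),
		},
		// A per-node label means one series set per node; watch cardinality on large clusters.
		[]string{"node_name"},
	)
)

// Register once at package init rather than in New(), which can run once per profile.
// Assumption: the scheduler serves /metrics from component-base's legacy registry,
// so the collector is registered there instead of on the default Prometheus registry.
func init() {
	legacyregistry.RawMustRegister(binPackScores)
}

// ... in Score() ...
binPackScores.WithLabelValues(nodeName).Observe(float64(score))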
These metrics are exposed on the /metrics endpoint of your custom scheduler's pod. This allows you to create dashboards that visualize your scheduler's behavior, such as the distribution of scores across nodes, helping you debug and tune your logic.
Conclusion: Beyond Bin-Packing
We've built and deployed a custom scheduler that implements GPU-aware bin-packing and topology-aware filtering. This solves a tangible and expensive problem in ML infrastructure.
The Kubernetes Scheduling Framework is an exceptionally powerful tool. The patterns discussed here can be extended to solve other complex scheduling problems:
* Network-Aware Scheduling: For distributed data processing (like Spark), a Score plugin could query a service mesh or network monitoring tool to get real-time latency between nodes, then score nodes to minimize cross-node traffic for a given job.
* I/O-Aware Scheduling: For database workloads, a plugin could favor nodes with local NVMe storage, scoring them higher than nodes with network-attached storage.
* License-Aware Scheduling: In environments with floating software licenses (e.g., for EDA tools), a Permit plugin could check out a license from a license server before allowing the pod to be bound, preventing jobs from starting only to fail due to license unavailability.
By moving beyond the default scheduler, you can transform Kubernetes from a generic container orchestrator into a highly optimized, application-aware platform tailored to your specific business and performance requirements.