Custom Kubernetes Schedulers for GPU-Intensive ML Workloads
The Inadequacy of the Default Scheduler for GPU Workloads
The default kube-scheduler is designed for versatility, balancing resource requests (cpu, memory) across a cluster. However, when scheduling GPU-intensive machine learning workloads, especially distributed training jobs, this generalist approach reveals critical gaps. Senior engineers managing MLOps platforms quickly discover that the default scheduler is blind to the nuanced requirements of high-performance computing.
Key limitations include:
* GPU model blindness: The scheduler treats a GPU request (e.g., nvidia.com/gpu: 4) as a fungible commodity. It cannot differentiate between an NVIDIA A100 and a V100 on its own. A training job compiled with CUDA features specific to the Ampere architecture may be scheduled on a node with older Volta GPUs, causing runtime failures. While nodeSelector or nodeAffinity can enforce this, it's a static, manual constraint that doesn't integrate into a dynamic scoring and balancing model.
* Spreading over packing: The default scoring favors spreading pods across nodes (the LeastAllocated strategy), which is often undesirable for GPU clusters. For ML workloads, you often want to pack pods tightly onto as few nodes as possible (the MostAllocated strategy) to maximize locality and potentially scale down unused nodes to save costs. A custom scheduler allows you to implement this logic precisely.

These limitations are not bugs; they are consequences of a design optimized for general-purpose stateless applications. To truly optimize GPU clusters, we need to inject domain-specific knowledge directly into the scheduling logic. This is precisely what the Kubernetes scheduler framework allows.
Architecture of a Custom Scheduler Plugin
Instead of forking and modifying kube-scheduler, the modern approach is to build plugins that hook into its well-defined Scheduler Framework. This framework provides extension points that allow you to implement custom logic without rewriting the entire scheduler.
For our GPU-aware scheduler, we will focus on two primary extension points:
* Filter (FilterPlugin): This is a predicate function. For a given pod, the scheduler iterates through all nodes and runs the Filter plugin. If the plugin returns Success, the node is a feasible candidate. If it returns Unschedulable, the node is immediately rejected for this pod. We will use this to ensure a node has the correct type and number of GPUs.
* Score (ScorePlugin): After filtering, all feasible nodes are passed to the Score plugin. This plugin assigns an integer score (0-100) to each node, where a higher score is better. The scheduler then selects the node with the highest total score from all scoring plugins. We will use this to implement our NVLink topology-aware logic.

The overall process for a single pod is:
Pending Pod -> PreFilter -> Filter (on all nodes in parallel) -> PostFilter (only if no node survived filtering) -> PreScore -> Score (on all filtered nodes in parallel) -> Reserve -> Bind
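Both hooks correspond to small Go interfaces. For orientation, here is a simplified excerpt of what they look like in k8s.io/kubernetes/pkg/scheduler/framework; exact signatures shift slightly between Kubernetes minor versions, so treat this as a sketch rather than the authoritative definition:

```go
// Simplified excerpt of the framework interfaces our plugin will satisfy
// (from k8s.io/kubernetes/pkg/scheduler/framework); not a complete listing.
type FilterPlugin interface {
	Plugin
	// Filter decides whether a single node can run the pod. Returning an
	// Unschedulable status removes the node from consideration.
	Filter(ctx context.Context, state *CycleState, pod *v1.Pod, nodeInfo *NodeInfo) *Status
}

type ScorePlugin interface {
	Plugin
	// Score ranks a feasible node; higher is better, capped at MaxNodeScore (100).
	Score(ctx context.Context, state *CycleState, pod *v1.Pod, nodeName string) (int64, *Status)
	// ScoreExtensions exposes an optional normalization hook, or nil.
	ScoreExtensions() ScoreExtensions
}
```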
We will implement a scheduler that performs the following logic:
* Filter Phase:
  * Check if the pod requests GPUs (nvidia.com/gpu).
  * Check if the node has a label specifying its GPU model (e.g., gpu-model=nvidia-a100).
  * Ensure the pod's requested GPU model (via an annotation like gpu-workload-model: nvidia-a100) matches the node's label.
  * Reject the node if there's a mismatch or if the required GPUs are not available.
* Score Phase:
  * For multi-GPU pods, check for a node annotation that describes its NVLink topology (e.g., nvlink-topology: '0-1-2-3|4-5-6-7', where | separates fully-connected groups and - separates the GPU indices within a group).
  * Assign a high score to nodes where the requested number of GPUs can be placed on a single, fully-connected NVLink group.
  * Assign a lower score to nodes where the GPUs would be split across groups.
  * Assign the lowest score to nodes without topology information.
Let's build this in Go.
Core Implementation in Go
We'll use the official k8s.io/kubernetes and k8s.io/client-go libraries. Ensure your Go environment is set up.
Project Setup:
go mod init custom-gpu-scheduler
go get k8s.io/component-base/logs@<version>
go get k8s.io/kubernetes@<version>
go get k8s.io/client-go@<version>

Pin all three to releases matching the Kubernetes version your cluster runs (k8s.io/kubernetes uses v1.Y.Z tags, the staging modules use v0.Y.Z).
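One wrinkle worth calling out: k8s.io/kubernetes is not published as a standalone library, and its go.mod pins every k8s.io/* staging repository to v0.0.0. Your module therefore needs replace directives mapping each staging module to the released version that matches your target Kubernetes release. A go.mod excerpt as a sketch, assuming (purely for illustration) a 1.28.x target:

```
// go.mod (excerpt) -- the v1.28.4 / v0.28.4 versions are illustrative only;
// use the release that matches the scheduler you intend to run against.
require k8s.io/kubernetes v1.28.4

replace (
	k8s.io/api => k8s.io/api v0.28.4
	k8s.io/apimachinery => k8s.io/apimachinery v0.28.4
	k8s.io/apiserver => k8s.io/apiserver v0.28.4
	k8s.io/client-go => k8s.io/client-go v0.28.4
	k8s.io/component-base => k8s.io/component-base v0.28.4
	k8s.io/component-helpers => k8s.io/component-helpers v0.28.4
	k8s.io/kube-scheduler => k8s.io/kube-scheduler v0.28.4
	// ...plus the remaining k8s.io/* staging modules listed in
	// k8s.io/kubernetes' own go.mod.
)
```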
main.go: This file will register our custom plugin and start the scheduler component.
// main.go
package main
import (
"os"
"k8s.io/component-base/logs"
"k8s.io/kubernetes/cmd/kube-scheduler/app"
"custom-gpu-scheduler/pkg/scheduler"
)
func main() {
logs.InitLogs()
defer logs.FlushLogs()
// Register our custom plugin with the scheduler framework.
// The command constructor from kube-scheduler/app will build a scheduler
// that includes our plugin.
command := app.NewSchedulerCommand(
app.WithPlugin(scheduler.Name, scheduler.New),
)
if err := command.Execute(); err != nil {
os.Exit(1)
}
}
pkg/scheduler/scheduler.go: This is where our core logic resides.
// pkg/scheduler/scheduler.go
package scheduler
import (
"context"
"fmt"
"strconv"
"strings"
"github.com/go-logr/logr"
v1 "k8s.io/api/core/v1"
"k8s.io/apimachinery/pkg/runtime"
"k8s.io/klog/v2"
"k8s.io/kubernetes/pkg/scheduler/framework"
)
const (
Name = "GpuTopologyScheduler"
// Annotations used on Pods and Nodes
PodGpuModelAnnotation = "gpu-workload-model"
NodeGpuModelLabel = "gpu-model"
NodeGpuTopologyAnnotation = "nvlink-topology"
// Resource name
NvidiaGpuResource = "nvidia.com/gpu"
)
// GpuTopologyScheduler is a scheduler plugin that's aware of GPU models and NVLink topology.
type GpuTopologyScheduler struct {
handle framework.Handle
log logr.Logger
}
// Ensure GpuTopologyScheduler implements the necessary interfaces.
var _ framework.FilterPlugin = &GpuTopologyScheduler{}
var _ framework.ScorePlugin = &GpuTopologyScheduler{}
// New initializes a new plugin and returns it.
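// Note: in newer Kubernetes releases the plugin factory signature gains a
// leading context.Context parameter (func New(ctx context.Context, obj
// runtime.Object, h framework.Handle) ...); match whichever signature the
// scheduler version you vendor expects.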
func New(_ runtime.Object, h framework.Handle) (framework.Plugin, error) {
return &GpuTopologyScheduler{
handle: h,
log: klog.FromContext(context.Background()),
}, nil
}
// Name returns the name of the plugin.
func (s *GpuTopologyScheduler) Name() string {
return Name
}
// Filter plugin implementation
func (s *GpuTopologyScheduler) Filter(ctx context.Context, _ *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
log := s.log.WithValues("pod", klog.KObj(pod), "node", klog.KObj(nodeInfo.Node()))
// Step 1: Check if the Pod requests GPUs. If not, this plugin has no opinion.
requestedGpus := getGpuRequest(pod)
if requestedGpus == 0 {
log.V(4).Info("Pod does not request GPUs, skipping filter")
// Skip is only meaningful in PreFilter/PreScore; in Filter we return
// Success so the node stays feasible when this plugin has no opinion.
return framework.NewStatus(framework.Success)
}
// Step 2: Check if the Node has GPUs.
node := nodeInfo.Node()
if node.Status.Allocatable.Name(NvidiaGpuResource, "").Value() == 0 {
log.V(2).Info("Node has no allocatable GPUs")
return framework.NewStatus(framework.Unschedulable, "node has no GPUs")
}
// Step 3: Enforce GPU model affinity.
podGpuModel, ok := pod.Annotations[PodGpuModelAnnotation]
if !ok {
// If pod doesn't specify a model, we don't apply a model-based filter.
// A production scheduler might reject such pods.
log.V(4).Info("Pod does not specify a GPU model annotation, skipping model check")
return framework.NewStatus(framework.Success)
}
nodeGpuModel, ok := node.Labels[NodeGpuModelLabel]
if !ok {
log.V(2).Info("Node does not have GPU model label", "label", NodeGpuModelLabel)
return framework.NewStatus(framework.Unschedulable, fmt.Sprintf("node missing label %s", NodeGpuModelLabel))
}
if podGpuModel != nodeGpuModel {
log.V(2).Info("GPU model mismatch", "pod_model", podGpuModel, "node_model", nodeGpuModel)
return framework.NewStatus(framework.Unschedulable, fmt.Sprintf("pod requires GPU model %s, but node has %s", podGpuModel, nodeGpuModel))
}
log.V(4).Info("Pod and Node GPU models match, filter successful")
return framework.NewStatus(framework.Success)
}
// Score plugin implementation
func (s *GpuTopologyScheduler) Score(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) (int64, *framework.Status) {
nodeInfo, err := s.handle.SnapshotSharedLister().NodeInfos().Get(nodeName)
if err != nil {
return 0, framework.AsStatus(fmt.Errorf("getting node info for %s: %w", nodeName, err))
}
node := nodeInfo.Node()
log := s.log.WithValues("pod", klog.KObj(pod), "node", klog.KObj(node))
requestedGpus := getGpuRequest(pod)
if requestedGpus <= 1 {
// For single-GPU pods, topology doesn't matter. Give a neutral score.
return 50, framework.NewStatus(framework.Success)
}
topology, ok := node.Annotations[NodeGpuTopologyAnnotation]
if !ok {
log.V(3).Info("Node is missing topology annotation, giving low score")
return 10, framework.NewStatus(framework.Success) // Low score for nodes without topology info
}
// Example topology: "0-1-2-3|4-5-6-7" means GPUs 0-3 form one fully-connected
// NVLink group and GPUs 4-7 form another ("|" separates groups, "-" separates
// GPU indices within a group).
// A more robust implementation would validate and parse this properly.
// For this example, we just check whether a group of at least 'requestedGpus' size exists.
maxGroupSize := 0
groups := strings.Split(topology, "|")
for _, group := range groups {
linkedGpus := strings.Split(group, "-")
if len(linkedGpus) > maxGroupSize {
maxGroupSize = len(linkedGpus)
}
}
if int64(maxGroupSize) >= requestedGpus {
log.V(2).Info("Node has a suitable NVLink group for the pod", "requested_gpus", requestedGpus, "max_group_size", maxGroupSize)
return 100, framework.NewStatus(framework.Success) // Highest score for perfect topology match
}
log.V(2).Info("Node does not have a large enough NVLink group", "requested_gpus", requestedGpus, "max_group_size", maxGroupSize)
return 20, framework.NewStatus(framework.Success) // Higher than no-info, but lower than perfect match
}
// ScoreExtensions returns a ScoreExtensions interface if the plugin implements one, or nil.
func (s *GpuTopologyScheduler) ScoreExtensions() framework.ScoreExtensions {
return nil // We don't need normalization
}
// Helper to get GPU request from a Pod spec
func getGpuRequest(pod *v1.Pod) int64 {
var count int64
for _, container := range pod.Spec.Containers {
if val, ok := container.Resources.Limits[NvidiaGpuResource]; ok {
count += val.Value()
}
}
return count
}
This code provides the fundamental building blocks. The Filter method performs a hard rejection based on GPU model compatibility. The Score method implements our business logic: it gives the highest score to nodes that can contain the entire multi-GPU pod within a single NVLink group, promoting optimal performance.
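One refinement worth making before this goes to production: pull the topology parsing out of Score into a pure helper so it can be unit-tested without constructing a framework.Handle. The sketch below (maxNVLinkGroupSize is our own name, not a framework API) shows that refactor together with a table-driven test:

```go
// pkg/scheduler/topology.go (sketch)
package scheduler

import "strings"

// maxNVLinkGroupSize parses an annotation such as "0-1-2-3|4-5-6-7" and
// returns the size of the largest fully-connected GPU group.
func maxNVLinkGroupSize(topology string) int {
	largest := 0
	for _, group := range strings.Split(topology, "|") {
		if n := len(strings.Split(group, "-")); n > largest {
			largest = n
		}
	}
	return largest
}
```

```go
// pkg/scheduler/topology_test.go (sketch)
package scheduler

import "testing"

func TestMaxNVLinkGroupSize(t *testing.T) {
	cases := map[string]int{
		"0-1-2-3|4-5-6-7": 4,
		"0-1|2-3|4-5|6-7": 2,
		"0":               1,
	}
	for topo, want := range cases {
		if got := maxNVLinkGroupSize(topo); got != want {
			t.Errorf("maxNVLinkGroupSize(%q) = %d, want %d", topo, got, want)
		}
	}
}
```

With the helper in place, Score would simply call maxNVLinkGroupSize(topology) instead of parsing inline.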
Deployment and Configuration
Now, let's get this scheduler running in a cluster.
1. Containerize the Scheduler
Create a Dockerfile:
# Use a distroless base image for a smaller, more secure final image.
FROM gcr.io/distroless/static:nonroot
WORKDIR /
COPY custom-gpu-scheduler .
USER nonroot:nonroot
ENTRYPOINT ["/custom-gpu-scheduler"]
Build and push the image:
# Build the static Go binary
CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -o custom-gpu-scheduler .
# Build and push the Docker image
docker build -t your-repo/custom-gpu-scheduler:v0.1.0 .
docker push your-repo/custom-gpu-scheduler:v0.1.0
2. Create the Scheduler Configuration
This KubeSchedulerConfiguration file tells the scheduler binary to register a profile that activates our custom plugin. With the v1 API, all default plugins stay enabled unless you explicitly disable them, so we only need to add our plugin to the filter and score phases (and can disable conflicting defaults if needed).

scheduler-config.yaml:
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
leaderElection:
leaderElect: true
# clientConnection.kubeconfig is intentionally omitted: when the scheduler runs
# as a Deployment with the ServiceAccount defined below, it falls back to
# in-cluster credentials.
profiles:
  - schedulerName: gpu-topology-scheduler
    plugins:
      # Enable our custom plugin in the phases it implements. With the v1 API,
      # the default plugins (PrioritySort, NodeResourcesFit, DefaultBinder,
      # DefaultPreemption, and so on) remain enabled unless explicitly disabled.
      filter:
        enabled:
          - name: "GpuTopologyScheduler"
      score:
        enabled:
          - name: "GpuTopologyScheduler"
        # If a default scoring plugin competes with our topology logic,
        # disable it explicitly, for example:
        # disabled:
        #   - name: "NodeResourcesBalancedAllocation"
    # Per-plugin arguments would go under pluginConfig; our plugin takes none.
3. Deploy the Scheduler to Kubernetes
We need a Deployment, a ServiceAccount, and the necessary ClusterRole and ClusterRoleBinding for the scheduler to read Pods and Nodes, bind Pods to Nodes, and hold its leader-election lease.
scheduler-deployment.yaml:
apiVersion: v1
kind: ServiceAccount
metadata:
name: gpu-topology-scheduler
namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: gpu-topology-scheduler-role
rules:
  - apiGroups: [""]
    resources: ["nodes", "pods", "pods/status", "replicationcontrollers", "services", "namespaces", "persistentvolumes", "persistentvolumeclaims"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["apps", "extensions"]
    resources: ["replicasets", "statefulsets"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["policy"]
    resources: ["poddisruptionbudgets"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["storage.k8s.io"]
    resources: ["storageclasses", "csinodes", "csidrivers", "csistoragecapacities"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["bindings", "pods/binding"]
    verbs: ["create"]
  - apiGroups: [""]
    resources: ["pods/status"]
    verbs: ["patch", "update"]
  - apiGroups: ["", "events.k8s.io"]
    resources: ["events"]
    verbs: ["create", "patch", "update"]
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["create", "get", "update"]
# Depending on which default plugins remain enabled, further read access may be
# needed; binding the ServiceAccount to the built-in system:kube-scheduler
# ClusterRole is a convenient alternative.
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: gpu-topology-scheduler-binding
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: gpu-topology-scheduler-role
subjects:
- kind: ServiceAccount
name: gpu-topology-scheduler
namespace: kube-system
---
apiVersion: v1
kind: ConfigMap
metadata:
name: scheduler-config
namespace: kube-system
data:
scheduler-config.yaml: | # Paste the content of scheduler-config.yaml here
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
# ... (rest of config)
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: gpu-topology-scheduler
namespace: kube-system
labels:
app: gpu-topology-scheduler
spec:
replicas: 2 # Run with HA
selector:
matchLabels:
app: gpu-topology-scheduler
template:
metadata:
labels:
app: gpu-topology-scheduler
spec:
serviceAccountName: gpu-topology-scheduler
containers:
- name: scheduler
image: your-repo/custom-gpu-scheduler:v0.1.0
args:
# The image ENTRYPOINT already runs /custom-gpu-scheduler, so only flags are passed here.
- "--config=/etc/kubernetes/scheduler-config.yaml"
- "--v=3"
volumeMounts:
- name: scheduler-config-volume
mountPath: /etc/kubernetes
volumes:
- name: scheduler-config-volume
configMap:
name: scheduler-config
Apply this manifest: kubectl apply -f scheduler-deployment.yaml.
Using the Custom Scheduler
To use the scheduler, a Pod simply needs to specify its schedulerName.
First, let's prepare our nodes. This would typically be done by a daemonset or an operator.
# Node 1: 8x A100s, with two 4-GPU NVLink groups
kubectl label node node-1 gpu-model=nvidia-a100
kubectl annotate node node-1 nvlink-topology='0-1-2-3|4-5-6-7'
# Node 2: 8x V100s, with no topology info provided
kubectl label node node-2 gpu-model=nvidia-v100
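In a real cluster you would not label nodes by hand; a node-local agent (for example a DaemonSet that inspects nvidia-smi topo -m) would patch its own Node object. A minimal sketch of that patching step using client-go, with the label and annotation values hard-coded for brevity and the node name assumed to arrive via the downward API:

```go
// Sketch of a node-labeling agent's update step (values hard-coded here;
// a real agent would derive them from the local GPU topology).
package main

import (
	"context"
	"fmt"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig() // the agent runs as a pod on the node
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	nodeName := os.Getenv("NODE_NAME") // injected via the downward API
	patch := []byte(`{
	  "metadata": {
	    "labels":      {"gpu-model": "nvidia-a100"},
	    "annotations": {"nvlink-topology": "0-1-2-3|4-5-6-7"}
	  }
	}`)

	// Strategic merge patch adds/overwrites just these labels and annotations.
	_, err = client.CoreV1().Nodes().Patch(context.Background(), nodeName,
		types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("node", nodeName, "labeled and annotated")
}
```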
Now, let's create a Pod that requests 4 A100 GPUs for a distributed training job.
training-pod.yaml:
apiVersion: v1
kind: Pod
metadata:
name: distributed-training-job-a100
annotations:
gpu-workload-model: "nvidia-a100"
spec:
schedulerName: gpu-topology-scheduler # KEY: This tells Kubernetes to use our scheduler
containers:
- name: training-container
image: nvidia/cuda:11.8.0-base-ubuntu22.04
command: ["sleep", "3600"]
resources:
limits:
nvidia.com/gpu: 4
When you kubectl apply -f training-pod.yaml, the following happens:
1. The pod is picked up by our scheduler because spec.schedulerName is gpu-topology-scheduler.
2. The Filter plugin runs on every node:
   * It evaluates node-1: the pod asks for nvidia-a100 and the node carries gpu-model=nvidia-a100. Pass.
   * It evaluates node-2: the pod asks for nvidia-a100 but the node carries gpu-model=nvidia-v100. Fail; node-2 is filtered out.
3. The Score plugin runs on the remaining candidates (only node-1):
   * It sees the pod requests 4 GPUs.
   * It reads node-1's annotation nvlink-topology='0-1-2-3|4-5-6-7'.
   * It finds a group of size 4 that can fit the request and assigns node-1 a score of 100.
4. node-1 is the highest-scoring node, and the pod is bound to it.

If we submitted a similar pod asking for V100s, it would be correctly placed on node-2. If we submitted a 5-GPU A100 pod, it would still be scheduled on node-1 but would receive a lower score (20) because it doesn't fit in a single NVLink group, signaling a potential performance compromise.
Advanced Considerations and Edge Cases
Building a custom scheduler for production requires more than just the core logic. Here are critical factors senior engineers must address.
Scheduling Latency and Throughput
Every line of code in your Filter and Score plugins adds latency to pod scheduling, and both run once per candidate node for every pod. If your logic involves complex computations or external API calls (an anti-pattern!), you can significantly slow down the scheduler. The upstream kube-scheduler sustains on the order of hundreds of pod placements per second in scalability tests; heavyweight plugins eat directly into that budget.
Mitigation:
* Performance Profiling: Use Go's built-in pprof tooling. The scheduler binary can expose a pprof endpoint. Analyze CPU and memory profiles under load to identify bottlenecks in your plugins.
* Caching: For data that doesn't change often (like node topology), use the framework.Handle to access the scheduler's shared snapshot lister. This is an in-memory cache of cluster state, far faster than querying the API server directly.
* Pre-computation: If scoring is complex, consider whether some parts can be pre-computed and stored in the framework.CycleState. This is a key-value store that persists for the duration of a single pod's scheduling attempt, allowing you to pass data from PreFilter to Filter to Score without recalculating (see the sketch below).
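To make the CycleState idea concrete, here is a sketch of a PreFilter that computes the pod's GPU request once and caches it for later phases. The gpuRequestKey and gpuRequestData names are ours, and the PreFilter signature shown matches the framework releases this article targets (it has changed across versions); if you adopt this, the plugin also has to be enabled under the preFilter extension point in the KubeSchedulerConfiguration.

```go
// Cache the pod's GPU request in CycleState during PreFilter so that Filter
// and Score don't re-walk the container list.
const gpuRequestKey = framework.StateKey(Name + "/gpuRequest")

type gpuRequestData struct {
	requested int64
}

// Clone is required by framework.StateData.
func (d *gpuRequestData) Clone() framework.StateData {
	return &gpuRequestData{requested: d.requested}
}

var _ framework.PreFilterPlugin = &GpuTopologyScheduler{}

func (s *GpuTopologyScheduler) PreFilter(ctx context.Context, state *framework.CycleState, pod *v1.Pod) (*framework.PreFilterResult, *framework.Status) {
	state.Write(gpuRequestKey, &gpuRequestData{requested: getGpuRequest(pod)})
	return nil, framework.NewStatus(framework.Success)
}

// PreFilterExtensions is part of the PreFilterPlugin interface; we don't
// need AddPod/RemovePod hooks here.
func (s *GpuTopologyScheduler) PreFilterExtensions() framework.PreFilterExtensions {
	return nil
}

// cachedGpuRequest is what Filter and Score would call instead of getGpuRequest.
func (s *GpuTopologyScheduler) cachedGpuRequest(state *framework.CycleState, pod *v1.Pod) int64 {
	if data, err := state.Read(gpuRequestKey); err == nil {
		if d, ok := data.(*gpuRequestData); ok {
			return d.requested
		}
	}
	// Fall back to recomputing if the state entry is missing.
	return getGpuRequest(pod)
}
```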
High Availability (HA)
A single scheduler instance is a single point of failure. If it crashes, no new pods (using that scheduler) can be scheduled.
Solution:
Run multiple replicas of your scheduler deployment (as shown in our Deployment manifest with replicas: 2). Leader election (the leaderElection.leaderElect setting in the KubeSchedulerConfiguration, equivalent to the --leader-elect flag and on by default) is crucial. It uses a Lease object in Kubernetes to ensure that only one scheduler replica is active at any given time. If the leader fails, another replica acquires the lease and takes over, providing near-seamless failover.
Preemption
What happens if a high-priority ML training job arrives but there are no available resources, while low-priority batch processing pods are occupying the needed GPUs? The job will remain pending.
Solution:
Your scheduler needs preemption logic. This is a complex process in which the scheduler identifies low-priority pods that can be evicted to make room for the high-priority pod. Preemption lives at the PostFilter extension point; the DefaultPreemption plugin implements it and remains enabled by default in our profile. For truly custom behavior (for example, preferring victims whose eviction frees a whole NVLink group), you would implement your own PostFilter plugin that selects preemption candidates, which significantly increases the complexity of your scheduler.
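For orientation, custom preemption would hang off the PostFilter extension point, which the framework invokes only when a pod failed filtering on every node. A heavily hedged skeleton (signature per the framework releases this article targets; the victim-selection logic is deliberately left as a comment):

```go
// Hypothetical PostFilter hook: this is where custom preemption-candidate
// selection would live. Check the exact signature in the framework version
// you vendor before adopting this.
var _ framework.PostFilterPlugin = &GpuTopologyScheduler{}

func (s *GpuTopologyScheduler) PostFilter(ctx context.Context, state *framework.CycleState,
	pod *v1.Pod, filteredNodeStatusMap framework.NodeToStatusMap) (*framework.PostFilterResult, *framework.Status) {
	// A real implementation would rank victim pods (for example, preferring
	// victims whose eviction frees a full NVLink group), issue the evictions,
	// and return a nominated node. Returning Unschedulable tells the framework
	// this plugin could not make the pod schedulable.
	return nil, framework.NewStatus(framework.Unschedulable, "custom preemption not implemented")
}
```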
Race Conditions and State Staleness
The scheduler operates on a snapshot of the cluster's state. It's possible for a node's state to change after the Score phase but before the Bind phase; for example, another pod lands on your chosen node, consuming the GPUs you thought were free.
Solution:
The Bind phase is the final operation: the DefaultBinder plugin creates the Binding object, and the scheduler's cache "assumes" the pod onto the node so subsequent decisions already account for the consumed GPUs. If the race still wins, the kubelet rejects the pod at admission; the pod is marked Failed with an OutOf<resource> reason (e.g., OutOfnvidia.com/gpu), and its owning controller (Deployment, Job, etc.) creates a replacement that goes through scheduling again. Your scheduler should be stateless and idempotent; it should make the same correct decision given the same cluster state, regardless of previous failed attempts.
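If your plugin keeps its own state (for example, which NVLink group it intends a pod to occupy), the Reserve extension point is the place to claim and release it: the scheduler cache already accounts for standard resource usage of assumed pods, but it knows nothing about plugin-private bookkeeping. A sketch, assuming the struct gains a mutex and a reservations map (both our own additions) and that the plugin is also enabled under the reserve extension point:

```go
// Sketch: plugin-local bookkeeping via the Reserve extension point. Assumes
// GpuTopologyScheduler has been extended with:
//     mu           sync.Mutex
//     reservations map[string]int64 // nodeName -> GPUs reserved in-flight
var _ framework.ReservePlugin = &GpuTopologyScheduler{}

// Reserve is called once a node has been chosen, before binding starts.
func (s *GpuTopologyScheduler) Reserve(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) *framework.Status {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.reservations[nodeName] += getGpuRequest(pod)
	return framework.NewStatus(framework.Success)
}

// Unreserve rolls the bookkeeping back if a later phase (Permit, Bind) fails.
func (s *GpuTopologyScheduler) Unreserve(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.reservations[nodeName] -= getGpuRequest(pod)
}
```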
Conclusion: The Trade-off of Customization
Building a custom Kubernetes scheduler is a powerful but non-trivial undertaking. It's not a solution for every problem. You should only consider it when you have high-value, specialized workloads where the performance and efficiency gains from bespoke scheduling logic justify the development and maintenance overhead.
For GPU-intensive ML platforms, the calculus often leans in favor of custom schedulers. The ability to perform topology-aware scheduling, enforce fine-grained resource constraints, and implement custom bin-packing strategies can translate directly into faster training times, higher hardware utilization, and significant cost savings. By leveraging the scheduler framework, you can inject this critical domain knowledge into your cluster, transforming it from a general-purpose compute grid into a highly optimized, high-performance computing environment.