Building a GPU-Aware Kubernetes Scheduler for ML Workloads
The Problem: Why the Default Scheduler Fails High-Performance ML
The kube-scheduler is a marvel of engineering, but its default configuration is optimized for general-purpose stateless and stateful applications. When it comes to specialized, high-performance workloads like distributed machine learning training, its one-size-fits-all approach reveals critical limitations. The scheduler sees a node with eight NVIDIA A100 GPUs as simply having a capacity of nvidia.com/gpu: 8. It has no intrinsic understanding that four of those GPUs might be on one NUMA node connected by high-speed NVLink, while the other four are on another, connected only by the slower PCIe bus.
For a distributed training job using a framework like Horovod, PyTorch DistributedDataParallel, or DeepSpeed, the interconnect bandwidth between GPUs is a primary determinant of training throughput. A job requiring four GPUs might be scheduled by default across two different NUMA nodes, or even worse, across two different physical machines. The resulting communication overhead through PCIe or the network fabric can cripple performance, negating the benefits of distributed training.
Consider this common scenario:
A StatefulSet or a custom operator like MPIJob launches a 4-pod training job, with each pod requesting one GPU (nvidia.com/gpu: 1). The default scheduler might place them like this:
* pod-0 -> node-A, GPU-0
* pod-1 -> node-A, GPU-5
* pod-2 -> node-B, GPU-1
* pod-3 -> node-B, GPU-2
If GPU-0 and GPU-5 on node-A are on different sockets, communication between pod-0 and pod-1 is already suboptimal. The communication between pods on node-A and node-B is even worse, bottlenecked by the network. A topology-aware scheduler would understand that the ideal placement is to find a single node with four available GPUs connected by NVLink and place all four pods there.
This is the gap we will fill. We need a scheduler that can read node topology metadata and make intelligent placement decisions based on pod-specific requirements.
Architectural Choices: Extender vs. Framework Plugin
Before diving into code, it's crucial to choose the right architectural pattern for customizing the Kubernetes scheduler. There are two primary methods:
Scheduler Extender: the kube-scheduler is configured with a URL to an external HTTP server (the extender). During the scheduling cycle, the scheduler sends a request containing the pod and a list of candidate nodes to the extender. The extender responds with a list of filtered (and optionally, scored) nodes.
* Pros: Language-agnostic (can be written in any language that can run an HTTP server), decoupled from the scheduler's lifecycle.
* Cons: High latency due to network hops and serialization/deserialization overhead. Less powerful, as it has limited access to the scheduler's internal state and caching mechanisms. It's generally considered a legacy approach.
Scheduler Framework Plugin: custom logic is compiled into a scheduler binary as plugins that the framework invokes at well-defined extension points during the scheduling cycle.
* Pros: Extremely high performance as it runs in-process with the scheduler. Full access to the scheduler's internal cache and framework handle, allowing for complex and stateful logic. The de facto standard for serious scheduler customization.
* Cons: Must be written in Go. Tightly coupled with the Kubernetes scheduler's source code and version.
For performance-critical ML workloads, the latency introduced by an extender is unacceptable. We will be building a Scheduler Framework Plugin to ensure our scheduling decisions are as fast and efficient as possible.
Deep Dive: The Kubernetes Scheduler Framework Extension Points
The Scheduler Framework provides a series of extension points that allow us to inject custom logic. For our GPU-aware scheduler, the most important are Filter and Score.
* Filter: This extension point is used to eliminate nodes that cannot run the pod. The scheduler runs Filter plugins sequentially. If any plugin marks a node as unschedulable, it's removed from consideration for the current pod. Our Filter plugin will check if a node has the required GPU topology (e.g., a sufficient number of GPUs connected by NVLink).
* Score: After filtering, the remaining nodes are candidates. The Score extension point is used to rank these candidates. Each Score plugin returns an integer score between framework.MinNodeScore (0) and framework.MaxNodeScore (100) for each node. The scheduler applies per-plugin weights, sums the weighted scores from all Score plugins, and picks the node with the highest total. Our Score plugin will prioritize nodes that are a better fit, for instance, by preferring to co-locate pods from the same training job or implementing bin-packing logic.
Other relevant points include PreFilter (for pre-computation before filtering nodes) and Permit (for implementing gang scheduling), but Filter and Score are the core of our solution.
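For orientation, these are the two interfaces our plugins will satisfy, paraphrased from the k8s.io/kubernetes/pkg/scheduler/framework package (exact signatures can shift slightly between Kubernetes minor versions, so treat this as an excerpt rather than a contract):

// Paraphrased from k8s.io/kubernetes/pkg/scheduler/framework.
type FilterPlugin interface {
	Plugin // provides Name() string
	// Filter returns Success, Unschedulable, or an Error status for one node.
	Filter(ctx context.Context, state *CycleState, pod *v1.Pod, nodeInfo *NodeInfo) *Status
}

type ScorePlugin interface {
	Plugin
	// Score returns a value between MinNodeScore (0) and MaxNodeScore (100).
	Score(ctx context.Context, state *CycleState, p *v1.Pod, nodeName string) (int64, *Status)
	// ScoreExtensions can return a normalizer, or nil if raw scores are final.
	ScoreExtensions() ScoreExtensions
}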
Implementation Prerequisite: Discovering GPU Topology
Our scheduler can only make decisions based on data it has. It needs to know the GPU topology of each node in the cluster. The best way to achieve this is by using the NVIDIA GPU Feature Discovery (GFD) tool. GFD runs as a DaemonSet and automatically labels each node with detailed GPU information. For example, it can produce labels like:
nvidia.com/gpu.count: '8'
nvidia.com/gpu.product: 'NVIDIA-A100-SXM4-40GB'
nvidia.com/gpu.memory: '40536'
nvidia.com/gpu.topology.nvlink: 'true'
The nvidia.com/gpu.topology.nvlink: 'true' label is the critical piece of information our scheduler will use. For more complex scenarios, GFD builds on node-feature-discovery (NFD) and can publish richer feature data through it, but for our purposes, simple node labels are sufficient.
Implementation: Building the Topology-Aware Scheduler Plugin
Let's write the Go code for our plugin. We'll create two plugins: GpuTopologyFilter and JobLocalityScore.
Our project structure will look like this:
gpu-scheduler/
├── go.mod
├── go.sum
├── main.go
└── plugins/
    ├── filter.go
    └── score.go
First, set up the Go module:
go mod init github.com/your-org/gpu-scheduler
go get k8s.io/kubernetes@<version>      # pick the release matching your cluster
go get k8s.io/component-base@<version>
go get k8s.io/klog/v2
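One wrinkle worth flagging: the k8s.io/kubernetes module pins its staging repositories (k8s.io/api, k8s.io/apimachinery, k8s.io/client-go, and so on) to v0.0.0, so a plain go get cannot resolve them. You need replace directives in go.mod mapping each staging module to the v0.x release that matches your Kubernetes version. A minimal sketch, with versions shown only as examples (pin them to the release you actually target):

module github.com/your-org/gpu-scheduler

go 1.21

require (
	k8s.io/component-base v0.28.4
	k8s.io/klog/v2 v2.100.1
	k8s.io/kubernetes v1.28.4
)

// k8s.io/kubernetes pins its staging modules to v0.0.0; map each one you pull
// in to the matching v0.x release (v1.28.4 -> v0.28.4). The real list is longer:
// run `go mod tidy` and add a replace line for every unresolved k8s.io module.
replace (
	k8s.io/api => k8s.io/api v0.28.4
	k8s.io/apimachinery => k8s.io/apimachinery v0.28.4
	k8s.io/apiserver => k8s.io/apiserver v0.28.4
	k8s.io/client-go => k8s.io/client-go v0.28.4
	k8s.io/component-base => k8s.io/component-base v0.28.4
	k8s.io/component-helpers => k8s.io/component-helpers v0.28.4
	k8s.io/kube-scheduler => k8s.io/kube-scheduler v0.28.4
)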
The `GpuTopologyFilter` Plugin
This plugin will filter out nodes that don't meet the GPU topology requirements specified in a pod's annotations.
plugins/filter.go
package plugins

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/klog/v2"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

const (
	// GpuTopologyFilterName is the name of the plugin used in the plugin registry and configurations.
	GpuTopologyFilterName = "GpuTopologyFilter"
	// AnnotationGpuTopologyRequirement is the pod annotation that declares a topology requirement.
	AnnotationGpuTopologyRequirement = "ml-workload.io/gpu-topology-required"
	// LabelGpuTopologyNvlink is the node label published by NVIDIA GFD.
	LabelGpuTopologyNvlink = "nvidia.com/gpu.topology.nvlink"
)

// GpuTopologyFilter is a scheduler plugin that filters nodes based on GPU topology.
type GpuTopologyFilter struct{}

var _ framework.FilterPlugin = &GpuTopologyFilter{}

// Name returns name of the plugin.
func (f *GpuTopologyFilter) Name() string {
	return GpuTopologyFilterName
}

// Filter is the function that the scheduler framework calls for each node to check if the pod can be scheduled on it.
func (f *GpuTopologyFilter) Filter(ctx context.Context, state *framework.CycleState, pod *corev1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	node := nodeInfo.Node()
	if node == nil {
		return framework.NewStatus(framework.Error, "node not found")
	}
	klog.Infof("Filtering pod %s on node %s", pod.Name, node.Name)

	// 1. Check if the pod even requests GPUs. If not, this plugin has no opinion.
	requestsGpu := false
	for _, container := range pod.Spec.Containers {
		if _, ok := container.Resources.Limits["nvidia.com/gpu"]; ok {
			requestsGpu = true
			break
		}
	}
	if !requestsGpu {
		klog.Infof("Pod %s does not request GPUs. Skipping filter.", pod.Name)
		return framework.NewStatus(framework.Success)
	}

	// 2. Check for the specific topology annotation on the pod.
	topologyRequirement, ok := pod.Annotations[AnnotationGpuTopologyRequirement]
	if !ok {
		klog.Infof("Pod %s has no GPU topology requirement. Skipping filter.", pod.Name)
		return framework.NewStatus(framework.Success)
	}

	// 3. We have a requirement. Now check if the node satisfies it.
	nodeLabels := node.GetLabels()

	switch topologyRequirement {
	case "nvlink":
		nvlinkAvailable, labelExists := nodeLabels[LabelGpuTopologyNvlink]
		if !labelExists || nvlinkAvailable != "true" {
			msg := fmt.Sprintf("Node %s lacks required NVLink topology", node.Name)
			klog.Warningf("Pod %s failed filter on node %s: %s", pod.Name, node.Name, msg)
			return framework.NewStatus(framework.UnschedulableAndUnresolvable, msg)
		}
	default:
		// If the annotation has an unknown value, we can choose to fail or succeed. Failing is safer.
		msg := fmt.Sprintf("Unknown GPU topology requirement: %s", topologyRequirement)
		klog.Warningf("Pod %s failed filter on node %s: %s", pod.Name, node.Name, msg)
		return framework.NewStatus(framework.UnschedulableAndUnresolvable, msg)
	}

	klog.Infof("Pod %s PASSED filter on node %s", pod.Name, node.Name)
	return framework.NewStatus(framework.Success)
}

// NewGpuTopologyFilter initializes a new plugin and returns it.
func NewGpuTopologyFilter(_ runtime.Object, _ framework.Handle) (framework.Plugin, error) {
	return &GpuTopologyFilter{}, nil
}
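Because Filter takes plain API objects plus a framework.NodeInfo, it can be unit-tested without a cluster. A minimal sketch (the file name and test values are illustrative, not part of the project above):

// plugins/filter_test.go -- minimal sketch of exercising Filter directly.
package plugins

import (
	"context"
	"testing"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

func TestGpuTopologyFilterNvlink(t *testing.T) {
	// A pod that requests one GPU and demands NVLink topology.
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name:        "trainer-0",
			Annotations: map[string]string{AnnotationGpuTopologyRequirement: "nvlink"},
		},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name: "train",
				Resources: corev1.ResourceRequirements{
					Limits: corev1.ResourceList{"nvidia.com/gpu": resource.MustParse("1")},
				},
			}},
		},
	}

	// One node labeled by GFD as NVLink-capable, one without the label.
	nvlinkNode := &corev1.Node{ObjectMeta: metav1.ObjectMeta{
		Name:   "gpu-node-1",
		Labels: map[string]string{LabelGpuTopologyNvlink: "true"},
	}}
	plainNode := &corev1.Node{ObjectMeta: metav1.ObjectMeta{Name: "gpu-node-2"}}

	plugin := &GpuTopologyFilter{}

	nvlinkInfo := framework.NewNodeInfo()
	nvlinkInfo.SetNode(nvlinkNode)
	if status := plugin.Filter(context.Background(), nil, pod, nvlinkInfo); !status.IsSuccess() {
		t.Fatalf("expected NVLink node to pass, got: %v", status)
	}

	plainInfo := framework.NewNodeInfo()
	plainInfo.SetNode(plainNode)
	if status := plugin.Filter(context.Background(), nil, pod, plainInfo); status.IsSuccess() {
		t.Fatalf("expected non-NVLink node to be rejected")
	}
}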
The `JobLocalityScore` Plugin
This plugin encourages co-location of pods belonging to the same job (identified by a job-name label). It is a soft affinity heuristic rather than true gang scheduling, but it helps keep distributed training pods close together.
plugins/score.go
package plugins

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/klog/v2"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

const (
	JobLocalityScoreName = "JobLocalityScore"
	// LabelJobName identifies pods belonging to the same job.
	LabelJobName = "job-name"
)

// JobLocalityScore is a plugin that favors nodes where other pods of the same job are running.
type JobLocalityScore struct {
	handle framework.Handle
}

var _ framework.ScorePlugin = &JobLocalityScore{}

// Name returns the name of the plugin.
func (s *JobLocalityScore) Name() string {
	return JobLocalityScoreName
}

// Score is the function that the scheduler framework calls to score a node.
func (s *JobLocalityScore) Score(ctx context.Context, state *framework.CycleState, p *corev1.Pod, nodeName string) (int64, *framework.Status) {
	// 1. Check if the pod belongs to a job.
	jobName, ok := p.Labels[LabelJobName]
	if !ok {
		// Not a job pod, this plugin doesn't apply.
		return 0, framework.NewStatus(framework.Success)
	}

	// 2. Get node information.
	nodeInfo, err := s.handle.SnapshotSharedLister().NodeInfos().Get(nodeName)
	if err != nil {
		return 0, framework.NewStatus(framework.Error, fmt.Sprintf("getting node %q from Snapshot: %v", nodeName, err))
	}

	// 3. Iterate over existing pods on the node to find matches.
	// nodeInfo.Pods holds *framework.PodInfo entries; the Pod object lives on the Pod field.
	score := int64(0)
	for _, existingPod := range nodeInfo.Pods {
		if existingPod.Pod.Labels[LabelJobName] == jobName {
			// Found another pod from the same job on this node. Increase the score.
			score += 25
		}
	}

	// Cap the score at the max score.
	if score > framework.MaxNodeScore {
		score = framework.MaxNodeScore
	}

	klog.Infof("Pod %s on node %s received locality score of %d", p.Name, nodeName, score)
	return score, framework.NewStatus(framework.Success)
}

// ScoreExtensions of the Score plugin.
func (s *JobLocalityScore) ScoreExtensions() framework.ScoreExtensions {
	return nil // We don't need normalization.
}

// NewJobLocalityScore initializes a new plugin and returns it.
func NewJobLocalityScore(_ runtime.Object, h framework.Handle) (framework.Plugin, error) {
	return &JobLocalityScore{
		handle: h,
	}, nil
}
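Returning nil from ScoreExtensions works here because we cap the raw score at framework.MaxNodeScore ourselves. If you later want uncapped raw scores (say, counting every co-located pod), the framework's normalization hook can rescale them after all candidate nodes are scored. A minimal sketch of what that could look like for this plugin, added to plugins/score.go (ScoreExtensions would then return s instead of nil):

// Sketch: rescale raw per-node scores into [0, framework.MaxNodeScore].
// To enable this, ScoreExtensions() above would return s instead of nil.
func (s *JobLocalityScore) NormalizeScore(ctx context.Context, state *framework.CycleState, p *corev1.Pod, scores framework.NodeScoreList) *framework.Status {
	var highest int64
	for _, nodeScore := range scores {
		if nodeScore.Score > highest {
			highest = nodeScore.Score
		}
	}
	if highest == 0 {
		// No node hosts pods from this job yet; leave all scores at zero.
		return framework.NewStatus(framework.Success)
	}
	for i := range scores {
		scores[i].Score = scores[i].Score * framework.MaxNodeScore / highest
	}
	return framework.NewStatus(framework.Success)
}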
Tying It All Together: `main.go`
This file registers our plugins with the scheduler framework's command-line interface.
main.go
package main

import (
	"os"

	"k8s.io/component-base/cli"
	"k8s.io/kubernetes/cmd/kube-scheduler/app"

	"github.com/your-org/gpu-scheduler/plugins"

	// Ensure that scheduler framework options are registered
	_ "k8s.io/kubernetes/pkg/scheduler/apis/config/scheme"
)

func main() {
	// Register the plugins with the scheduler framework.
	command := app.NewSchedulerCommand(
		app.WithPlugin(plugins.GpuTopologyFilterName, plugins.NewGpuTopologyFilter),
		app.WithPlugin(plugins.JobLocalityScoreName, plugins.NewJobLocalityScore),
	)
	code := cli.Run(command)
	os.Exit(code)
}
Packaging and Deployment
Now we need to build our scheduler, containerize it, and deploy it to the Kubernetes cluster.
`Dockerfile`
We'll use a multi-stage build to keep the final image small.
# Build stage
FROM golang:1.21 as builder
WORKDIR /workspace
COPY go.mod go.mod
COPY go.sum go.sum
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -a -o gpu-scheduler main.go
# Final stage
FROM gcr.io/distroless/static:nonroot
WORKDIR /
COPY --from=builder /workspace/gpu-scheduler .
USER 65532:65532
ENTRYPOINT ["/gpu-scheduler"]
Build and push the image:
export IMG=your-registry/gpu-topology-scheduler:v0.1.0
docker build -t ${IMG} .
docker push ${IMG}
Kubernetes Deployment Manifests
Deploying a second scheduler requires a few components:
* A KubeSchedulerConfiguration file to enable our plugins.
* A ConfigMap to hold this configuration.
* RBAC setup (ClusterRole and ClusterRoleBinding) to give the scheduler permissions.
* A Deployment to run the scheduler pod.
scheduler-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-scheduler-config
  namespace: kube-system
data:
  scheduler-config.yaml: |
    apiVersion: kubescheduler.config.k8s.io/v1
    kind: KubeSchedulerConfiguration
    leaderElection:
      leaderElect: true
      resourceName: gpu-topology-scheduler
      resourceNamespace: kube-system
    profiles:
      - schedulerName: gpu-topology-scheduler
        plugins:
          filter:
            enabled:
              - name: GpuTopologyFilter
          score:
            enabled:
              - name: JobLocalityScore
        pluginConfig:
          - name: GpuTopologyFilter
            args: {}
          - name: JobLocalityScore
            args: {}
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: gpu-scheduler-role
rules:
  - apiGroups: [""]
    resources: ["nodes", "pods", "pods/binding", "pods/status", "replicationcontrollers", "services", "namespaces", "persistentvolumes", "persistentvolumeclaims"]
    verbs: ["get", "list", "watch", "create", "update", "patch"]
  - apiGroups: ["apps"]
    resources: ["replicasets", "statefulsets"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["policy"]
    resources: ["poddisruptionbudgets"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["storage.k8s.io"]
    resources: ["storageclasses", "csinodes", "csidrivers", "csistoragecapacities"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["create", "get", "update"]
  - apiGroups: ["", "events.k8s.io"]
    resources: ["events"]
    verbs: ["create", "patch", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: gpu-scheduler-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: gpu-scheduler-role
subjects:
  - kind: ServiceAccount
    name: default
    namespace: kube-system
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-topology-scheduler
  namespace: kube-system
  labels:
    app: gpu-topology-scheduler
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-topology-scheduler
  template:
    metadata:
      labels:
        app: gpu-topology-scheduler
    spec:
      containers:
        - name: scheduler
          image: your-registry/gpu-topology-scheduler:v0.1.0 # <-- IMPORTANT: Use your image
          args:
            - "--config=/etc/kubernetes/scheduler-config.yaml"
            - "-v=4"
          volumeMounts:
            - name: scheduler-config-volume
              mountPath: /etc/kubernetes
      volumes:
        - name: scheduler-config-volume
          configMap:
            name: gpu-scheduler-config
Apply this manifest to your cluster:
kubectl apply -f scheduler-config.yaml
Using the Custom Scheduler
To use our new scheduler, a pod must specify its schedulerName in the spec. Here is an example of a pod that requires an NVLink-enabled node.
test-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-test-nvlink
  annotations:
    ml-workload.io/gpu-topology-required: "nvlink"
  labels:
    job-name: "distributed-training-job-1"
spec:
  schedulerName: gpu-topology-scheduler # <-- Use our scheduler!
  containers:
    - name: cuda-container
      image: nvidia/cuda:12.0.0-base-ubuntu22.04
      command: ["sleep", "3600"]
      resources:
        limits:
          nvidia.com/gpu: 1
When you apply this pod, you can check the logs of your custom scheduler to see the filtering and scoring in action:
kubectl logs -n kube-system -l app=gpu-topology-scheduler -f
You should see log lines like:
Filtering pod cuda-test-nvlink on node gpu-node-1
Pod cuda-test-nvlink PASSED filter on node gpu-node-1
Pod cuda-test-nvlink on node gpu-node-1 received locality score of 0
Advanced Considerations and Edge Cases
This implementation is a solid foundation, but production environments demand more.
* True Gang Scheduling: Our JobLocalityScore plugin only *encourages* co-location; it doesn't guarantee it. If a job needs 4 pods to start simultaneously (all-or-nothing), this is insufficient. For true gang scheduling, you would need to implement the Permit extension point. The Permit plugin would wait for a minimum number of pods from a job to arrive at the Permit stage before allowing any of them to proceed to binding (see the sketch after this list). This often requires a controller to manage pod groups and timeouts. Projects like Volcano are dedicated schedulers built for these batch workloads.
* Preemption: What happens if a high-priority training job arrives and the cluster is full of low-priority pods? Your custom scheduler needs to participate in the preemption logic. This involves configuring preemptionPolicy for PriorityClasses and potentially implementing a PostFilter plugin to nominate nodes for preemption if a pod is unschedulable.
* Performance and Scalability: In a cluster with thousands of nodes and a high rate of pod churn, your scheduler's performance is paramount. The Filter and Score functions are in the critical path. Avoid expensive computations or external API calls within them. Leverage the PreFilter stage to perform calculations once per pod rather than once per node (a PreFilter sketch follows this list). Use the framework.Handle to access the scheduler's shared informer cache instead of making direct API server calls.
* Observability: A scheduler that makes opaque decisions is a nightmare to debug. Instrument your plugins with metrics. Use the Prometheus client library for Go to expose counters and histograms.
* scheduler_plugin_filter_evaluations_total{plugin="GpuTopologyFilter", result="success|fail"}
* scheduler_plugin_score_evaluation_duration_seconds{plugin="JobLocalityScore"}
Add structured logging with klog to trace the decision-making process for a specific pod, which is invaluable when a user asks, "Why isn't my pod scheduling?"
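To make the gang-scheduling idea concrete, here is a sketch of a Permit plugin that holds pods at the Permit stage until enough members of the same job have arrived. The ml-workload.io/min-members annotation and the GangPermit name are hypothetical conventions for this example; a production version also needs registration in main.go, a rejection path when the timeout expires, and ideally a PodGroup-style CRD:

// Sketch only: gang admission via the Permit extension point. The
// "ml-workload.io/min-members" annotation is a made-up convention for this example.
package plugins

import (
	"context"
	"strconv"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

type GangPermit struct {
	handle framework.Handle
}

var _ framework.PermitPlugin = &GangPermit{}

func (g *GangPermit) Name() string { return "GangPermit" }

func (g *GangPermit) Permit(ctx context.Context, state *framework.CycleState, p *corev1.Pod, nodeName string) (*framework.Status, time.Duration) {
	jobName, ok := p.Labels[LabelJobName]
	if !ok {
		return framework.NewStatus(framework.Success), 0
	}
	minMembers, err := strconv.Atoi(p.Annotations["ml-workload.io/min-members"])
	if err != nil || minMembers <= 1 {
		return framework.NewStatus(framework.Success), 0
	}

	// Count gang members already parked at the Permit stage, plus this pod.
	arrived := 1
	g.handle.IterateOverWaitingPods(func(wp framework.WaitingPod) {
		if wp.GetPod().Labels[LabelJobName] == jobName {
			arrived++
		}
	})

	if arrived >= minMembers {
		// Quorum reached: release every waiting member of the gang for binding.
		g.handle.IterateOverWaitingPods(func(wp framework.WaitingPod) {
			if wp.GetPod().Labels[LabelJobName] == jobName {
				wp.Allow("GangPermit")
			}
		})
		return framework.NewStatus(framework.Success), 0
	}

	// Not enough members yet: wait (bounded) for the rest of the gang to arrive.
	return framework.NewStatus(framework.Wait), 30 * time.Second
}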
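And to illustrate the PreFilter point above: parsing pod annotations is cheap, but the pattern matters once per-node work gets expensive. The sketch below, added to plugins/filter.go (which already imports context, corev1, and the framework package), parses the topology requirement once per scheduling cycle and stashes it in CycleState for Filter to read. The signature shown matches recent framework versions, where PreFilter returns a *framework.PreFilterResult; older releases return only a *Status. If you adopt this, also enable GpuTopologyFilter under the preFilter extension point in the KubeSchedulerConfiguration profile.

// Sketch: compute once per pod in PreFilter, read per node in Filter.
const gpuTopologyStateKey = framework.StateKey("ml-workload.io/gpu-topology")

type gpuTopologyState struct {
	requirement string
}

// Clone implements framework.StateData; the value is immutable, so a shallow copy is fine.
func (s *gpuTopologyState) Clone() framework.StateData {
	return &gpuTopologyState{requirement: s.requirement}
}

func (f *GpuTopologyFilter) PreFilter(ctx context.Context, state *framework.CycleState, pod *corev1.Pod) (*framework.PreFilterResult, *framework.Status) {
	state.Write(gpuTopologyStateKey, &gpuTopologyState{
		requirement: pod.Annotations[AnnotationGpuTopologyRequirement],
	})
	// Returning a nil PreFilterResult means every node remains a candidate.
	return nil, framework.NewStatus(framework.Success)
}

func (f *GpuTopologyFilter) PreFilterExtensions() framework.PreFilterExtensions {
	return nil
}

// Inside Filter, read the cached value instead of re-parsing annotations:
//
//	data, err := state.Read(gpuTopologyStateKey)
//	if err == nil {
//		requirement := data.(*gpuTopologyState).requirement
//		// ... use requirement ...
//	}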
Conclusion: The Power of Custom Scheduling
By moving beyond the default kube-scheduler, you unlock the ability to tailor Kubernetes to the specific needs of your most demanding workloads. We've demonstrated how to build a topology-aware scheduler that understands the physical layout of GPUs, a critical requirement for efficient ML training. This pattern of using node feature discovery to label nodes and a custom scheduler plugin to consume those labels is a powerful and extensible model.
The journey doesn't end here. The same principles can be applied to schedule based on network fabric (e.g., InfiniBand), storage locality, power consumption, or even real-time utilization metrics, transforming your Kubernetes cluster into a truly performance-optimized platform for any specialized workload.