GPU-Aware Scheduling in K8s: A Custom Scheduler Implementation
The Inadequacy of Default Scheduling for Specialized Hardware
For senior engineers managing large-scale Kubernetes clusters, especially for AI/ML workloads, the limitations of the default kube-scheduler become apparent quickly. While it excels at general-purpose container orchestration—spreading pods across nodes, respecting resource requests, and handling affinities—it operates on a generic understanding of resources like cpu and memory. When it comes to specialized hardware like GPUs, this generic model falls short.
The default scheduler sees nvidia.com/gpu: 4 as four fungible units. It has no intrinsic knowledge of the underlying hardware topology. This is a critical blind spot in production environments where:
* Multi-GPU training jobs run significantly faster when their GPUs are connected via NVLink, so which GPUs a pod lands on matters as much as which node it lands on.
* Single-GPU inference pods, if spread naively, fragment the cluster and break up the very GPU pairs that training jobs depend on.
* The built-in node-scoring strategies (LeastAllocated vs. MostAllocated) are often too simplistic to enforce such a sophisticated policy.
To solve these problems, we must move beyond tweaking scheduler policies and build a custom scheduler tailored to our specific hardware and workload requirements. This post provides a production-focused walkthrough of implementing a custom GPU-aware scheduler using the Kubernetes scheduler framework.
The Scheduler Framework: Our Toolkit for Customization
The scheduler framework is a pluggable architecture, introduced in Kubernetes 1.15 and stable since 1.19, that makes building custom schedulers manageable. It exposes a set of extension points (interfaces) that allow us to inject our own logic into the scheduling cycle. The key stages are:
* Filtering (Filter): Predicate phase. These plugins determine whether a pod *can* run on a given node. If any Filter plugin returns Unschedulable, the node is rejected for that pod. This is where we'll enforce hard requirements, like the availability of an NVLink-connected GPU pair.
* Scoring (Score): Priority phase. After filtering, nodes that can run the pod are scored by these plugins. The node with the highest cumulative score is chosen. This is where we'll implement our preference logic, like preferring to pack inference pods together.
Other extension points like PreFilter, PostFilter, Reserve, and Permit offer finer-grained control, which we'll touch on when discussing edge cases like preemption.
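For orientation, here is a condensed view of the two plugin interfaces we will implement, paraphrased from k8s.io/kubernetes/pkg/scheduler/framework; check the exact definitions in the Kubernetes version you vendor, as this excerpt is not meant to compile on its own.
// Every plugin implements Plugin; Filter and Score plugins add one method each.
type Plugin interface {
	Name() string
}
type FilterPlugin interface {
	Plugin
	// Returning an Unschedulable status removes the node from consideration for this pod.
	Filter(ctx context.Context, state *CycleState, pod *v1.Pod, nodeInfo *NodeInfo) *Status
}
type ScorePlugin interface {
	Plugin
	// Score ranks each node that survived filtering; higher is better.
	Score(ctx context.Context, state *CycleState, pod *v1.Pod, nodeName string) (int64, *Status)
	// ScoreExtensions returns an optional normalization hook (may be nil).
	ScoreExtensions() ScoreExtensions
}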
The Production Scenario: Optimizing a Mixed ML Cluster
Let's define our target environment:
* A Kubernetes cluster with several worker nodes.
* Each node is equipped with four NVIDIA A100 GPUs.
* The GPUs are connected in pairs via NVLink: GPU0-GPU1 and GPU2-GPU3.
* We have two primary workload types:
1. training-job pods: Request 2 GPUs (nvidia.com/gpu: 2) and must be placed on an NVLink-connected pair for optimal performance.
2. inference-server pods: Request 1 GPU (nvidia.com/gpu: 1) and have no topology requirements. We want to pack these pods tightly to consolidate fragmentation and leave nodes free for training jobs.
Our custom scheduler, gpu-aware-scheduler, will enforce these rules.
Step 1: Annotating Nodes with Topology Information
Our scheduler needs data to make decisions. Since Kubernetes has no built-in concept of NVLink topology, we must provide it. The most common pattern is to use a DaemonSet, such as Node Feature Discovery (NFD) together with NVIDIA's GPU Feature Discovery (GFD), to inspect the node's hardware and apply labels or annotations.
For our example, let's assume this discovery process results in the following node annotation:
apiVersion: v1
kind: Node
metadata:
name: node-1
annotations:
gpu.nvidia.com/nvlink-pairs: "0-1,2-3"
This simple annotation tells our scheduler which GPU device IDs form high-speed pairs.
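If you'd rather not run full NFD/GFD, a small node-local agent deployed as a DaemonSet can publish the annotation itself. The following is a minimal, hypothetical sketch: detectNvlinkPairs stands in for real NVML-based discovery, and NODE_NAME is assumed to be injected via the downward API.
// cmd/topology-agent/main.go (hypothetical): patch the local node with its NVLink topology.
package main

import (
	"context"
	"fmt"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// detectNvlinkPairs is a placeholder; a real agent would query NVML.
func detectNvlinkPairs() string { return "0-1,2-3" }

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)

	nodeName := os.Getenv("NODE_NAME")
	patch := []byte(fmt.Sprintf(
		`{"metadata":{"annotations":{"gpu.nvidia.com/nvlink-pairs":%q}}}`,
		detectNvlinkPairs()))

	if _, err := clientset.CoreV1().Nodes().Patch(
		context.TODO(), nodeName, types.MergePatchType, patch, metav1.PatchOptions{}); err != nil {
		panic(err)
	}
}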
Step 2: Implementing the Custom `Filter` Plugin
Our first task is to write a Filter plugin that prevents training-job pods from being scheduled on nodes where a free NVLink pair isn't available. We'll write this in Go, as it's the native language of Kubernetes.
First, we define our plugin structure.
// pkg/scheduler/plugins/nvlink.go
package plugins
import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/klog/v2"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)
const (
Name = "NvlinkTopology"
TrainingPodLabelKey = "workload-type"
TrainingPodLabelValue = "training-job"
NvlinkAnnotation = "gpu.nvidia.com/nvlink-pairs"
GPUResourceName = "nvidia.com/gpu"
)
type NvlinkTopology struct{}
var _ framework.FilterPlugin = &NvlinkTopology{}
func (pl *NvlinkTopology) Name() string {
return Name
}
// New initializes a new plugin and returns it.
func New(_ runtime.Object, _ framework.Handle) (framework.Plugin, error) {
return &NvlinkTopology{}, nil
}
Now for the core Filter logic. The scheduler framework passes the pod being scheduled and a NodeInfo object, which contains a snapshot of the node's state, including pods already running on it.
// pkg/scheduler/plugins/nvlink.go (continued)
func (pl *NvlinkTopology) Filter(ctx context.Context, _ *framework.CycleState, pod *corev1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
// 1. Check if this pod is a training job that needs topology awareness.
workloadType, ok := pod.Labels[TrainingPodLabelKey]
if !ok || workloadType != TrainingPodLabelValue {
klog.V(4).Infof("Pod %s is not a training job, skipping NvlinkTopology filter.", pod.Name)
return framework.NewStatus(framework.Success)
}
// 2. Check if the pod requests exactly 2 GPUs.
// In a real-world scenario, this could be more flexible.
requestedGPUs := getGpuRequest(pod)
if requestedGPUs != 2 {
klog.V(4).Infof("Training pod %s requests %d GPUs, not 2. Skipping filter.", pod.Name, requestedGPUs)
return framework.NewStatus(framework.Success)
}
// 3. Get node's NVLink topology from annotations.
node := nodeInfo.Node()
if node == nil {
return framework.NewStatus(framework.Error, "node not found")
}
nvlinkAnnotation, ok := node.Annotations[NvlinkAnnotation]
if !ok || nvlinkAnnotation == "" {
klog.V(3).Infof("Node %s has no NVLink topology annotation, considering it unschedulable for training pods.", node.Name)
return framework.NewStatus(framework.Unschedulable, "Node lacks NVLink topology information")
}
nvlinkPairs, err := parseNvlinkPairs(nvlinkAnnotation)
if err != nil {
return framework.NewStatus(framework.Error, fmt.Sprintf("failed to parse nvlink annotation: %v", err))
}
// 4. Determine which GPUs are already in use on the node.
// This is a simplified model. Production schedulers often rely on device plugin state.
usedGpuIDs := getUsedGpuIDs(nodeInfo)
// 5. Check if any NVLink pair is fully available.
for _, pair := range nvlinkPairs {
if !usedGpuIDs[pair[0]] && !usedGpuIDs[pair[1]] {
klog.V(2).Infof("Found available NVLink pair %v on node %s for pod %s", pair, node.Name, pod.Name)
return framework.NewStatus(framework.Success)
}
}
klog.V(2).Infof("No available NVLink pair on node %s for pod %s", node.Name, pod.Name)
return framework.NewStatus(framework.Unschedulable, "No available NVLink-connected GPU pair")
}
// Helper functions, shared with the Score plugin and implemented in
// pkg/scheduler/plugins/helpers.go (shown in full at the end of this post):
//   getGpuRequest(pod *corev1.Pod) int64
//   parseNvlinkPairs(annotation string) ([][2]int, error)
//   getUsedGpuIDs(nodeInfo *framework.NodeInfo) map[int]bool
This Filter plugin ensures that our hard requirement is met: a training-job pod will only be considered for nodes where an entire NVLink pair is free.
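Before wiring the plugin into a scheduler binary, it's worth exercising the Filter logic in isolation. Below is a minimal unit-test sketch; it assumes the helper functions from the end of this post live in the same package, and it builds the NodeInfo snapshot by hand with framework.NewNodeInfo rather than running a real scheduler.
// pkg/scheduler/plugins/nvlink_test.go (sketch)
package plugins

import (
	"context"
	"testing"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// gpuPod builds a pod requesting the given number of GPUs.
func gpuPod(name, gpus string, labels map[string]string) *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: name, Labels: labels},
		Spec: corev1.PodSpec{Containers: []corev1.Container{{
			Name: "c",
			Resources: corev1.ResourceRequirements{
				Limits: corev1.ResourceList{GPUResourceName: resource.MustParse(gpus)},
			},
		}}},
	}
}

func TestFilterRejectsNodeWithoutFreePair(t *testing.T) {
	node := &corev1.Node{ObjectMeta: metav1.ObjectMeta{
		Name:        "node-1",
		Annotations: map[string]string{NvlinkAnnotation: "0-1,2-3"},
	}}

	// Three GPUs already in use: both NVLink pairs are broken.
	nodeInfo := framework.NewNodeInfo(
		gpuPod("existing-a", "2", nil),
		gpuPod("existing-b", "1", nil),
	)
	nodeInfo.SetNode(node)

	trainingPod := gpuPod("train", "2", map[string]string{TrainingPodLabelKey: TrainingPodLabelValue})

	pl := &NvlinkTopology{}
	status := pl.Filter(context.Background(), nil, trainingPod, nodeInfo)
	if status.Code() != framework.Unschedulable {
		t.Fatalf("expected Unschedulable, got %v", status)
	}
}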
Step 3: Implementing the Custom `Score` Plugin
Next, we'll implement a Score plugin to guide the scheduler's preferences. Our goal is to pack inference-server pods onto nodes that are already fragmented, leaving nodes with pristine NVLink pairs available for future training-job pods.
Our scoring logic will be:
* For training-job pods: give a high score to nodes that have the *fewest* available GPUs, as long as they can fit the job. This encourages using up nearly-full nodes first.
* For inference-server pods: give a high score to nodes that already have other inference pods running and whose GPU usage would *not* break up a complete NVLink pair. This is a classic bin-packing strategy with a topology-aware twist.
// pkg/scheduler/plugins/nvlinkscore.go
package plugins
// ... imports and constants ...
const (
ScorePluginName = "NvlinkTopologyScorer"
InferencePodLabelValue = "inference-server"
)
type NvlinkTopologyScorer struct {
handle framework.Handle
}
var _ framework.ScorePlugin = &NvlinkTopologyScorer{}
func (pl *NvlinkTopologyScorer) Name() string {
return ScorePluginName
}
func NewScorer(_ runtime.Object, h framework.Handle) (framework.Plugin, error) {
return &NvlinkTopologyScorer{handle: h}, nil
}
func (pl *NvlinkTopologyScorer) Score(ctx context.Context, state *framework.CycleState, pod *corev1.Pod, nodeName string) (int64, *framework.Status) {
nodeInfo, err := pl.handle.SnapshotSharedLister().NodeInfos().Get(nodeName)
if err != nil {
return 0, framework.NewStatus(framework.Error, fmt.Sprintf("getting node %q from snapshot: %s", nodeName, err))
}
workloadType := pod.Labels[TrainingPodLabelKey]
if workloadType == InferencePodLabelValue {
return pl.scoreForInference(pod, nodeInfo)
}
if workloadType == TrainingPodLabelValue {
return pl.scoreForTraining(pod, nodeInfo)
}
// Default score for other pods
return 50, framework.NewStatus(framework.Success)
}
func (pl *NvlinkTopologyScorer) scoreForInference(pod *corev1.Pod, nodeInfo *framework.NodeInfo) (int64, *framework.Status) {
// Goal: Pack inference pods onto nodes that are already fragmented or have other inference pods.
node := nodeInfo.Node()
usedGpuIDs := getUsedGpuIDs(nodeInfo)
nvlinkPairs, _ := parseNvlinkPairs(node.Annotations[NvlinkAnnotation])
// Score 1: Higher score for nodes with more used GPUs (bin-packing).
score := int64(len(usedGpuIDs)) * 10
// Score 2: Add a bonus if placing this pod does NOT break a fresh NVLink pair.
// This is a complex calculation. A simpler proxy is to count how many full pairs are left.
availablePairs := 0
for _, pair := range nvlinkPairs {
if !usedGpuIDs[pair[0]] && !usedGpuIDs[pair[1]] {
availablePairs++
}
}
// If there are no available GPUs, filter should have caught it, but for safety:
totalGpus, _ := node.Status.Allocatable[GPUResourceName]
if int(totalGpus.Value())-len(usedGpuIDs) < 1 {
return 0, framework.NewStatus(framework.Success) // Should not happen
}
// If scheduling this pod leaves at least one pair intact, give a high bonus.
if availablePairs > 0 {
// Let's check the number of single free GPUs
singleFreeGpus := 0
for i := 0; i < int(totalGpus.Value()); i++ {
if !usedGpuIDs[i] {
isPaired := false
for _, pair := range nvlinkPairs {
if (i == pair[0] && !usedGpuIDs[pair[1]]) || (i == pair[1] && !usedGpuIDs[pair[0]]) {
isPaired = true
break
}
}
if !isPaired {
singleFreeGpus++
}
}
}
// Prefer nodes that already have a 'stray' GPU to use up.
if singleFreeGpus > 0 {
score += 50
}
}
return score, framework.NewStatus(framework.Success)
}
func (pl *NvlinkTopologyScorer) scoreForTraining(pod *corev1.Pod, nodeInfo *framework.NodeInfo) (int64, *framework.Status) {
// Goal: Use nodes that are already somewhat utilized to leave empty nodes free.
// This is a slightly different form of bin-packing.
usedGpuCount := len(getUsedGpuIDs(nodeInfo))
// We want to give a higher score to nodes with more GPUs used, promoting consolidation.
// Filter has already guaranteed a pair is available.
return int64(usedGpuCount), framework.NewStatus(framework.Success)
}
// ScoreExtensions returns a ScoreExtensions interface if the plugin implements one.
func (pl *NvlinkTopologyScorer) ScoreExtensions() framework.ScoreExtensions {
// We don't need normalization, so we return nil.
return nil
}
// NOTE: The helper functions `getUsedGpuIDs` and `parseNvlinkPairs` are shared with the Filter plugin.
This scoring logic is non-trivial. It demonstrates how you can encode complex business/operational rules directly into the scheduling process.
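One note on scoring ranges: the framework expects node scores within [framework.MinNodeScore, framework.MaxNodeScore], i.e. 0 to 100. Our raw scores stay well below 100 on 4-GPU nodes, so returning nil from ScoreExtensions is fine, but on denser topologies you would return the plugin itself and normalize. A minimal sketch of what that could look like:
// If ScoreExtensions() returned pl instead of nil, NormalizeScore would be called
// once per pod with the raw scores of all filtered nodes and must rescale them
// into [MinNodeScore, MaxNodeScore].
func (pl *NvlinkTopologyScorer) NormalizeScore(ctx context.Context, state *framework.CycleState, pod *corev1.Pod, scores framework.NodeScoreList) *framework.Status {
	var highest int64 = 1
	for _, s := range scores {
		if s.Score > highest {
			highest = s.Score
		}
	}
	for i := range scores {
		scores[i].Score = scores[i].Score * framework.MaxNodeScore / highest
	}
	return nil
}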
Step 4: Configuring and Deploying the Scheduler
With our plugins written, we need to package and deploy them.
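First, the plugins are compiled into a scheduler binary. The usual out-of-tree pattern is to wrap the upstream kube-scheduler command and register the plugins by name. A minimal cmd/scheduler/main.go sketch follows; the module path is an assumption, and the exact app.WithPlugin and plugin-factory signatures have shifted slightly across Kubernetes releases, so match them to the version you vendor.
// cmd/scheduler/main.go
package main

import (
	"os"

	"k8s.io/kubernetes/cmd/kube-scheduler/app"

	// Module path is an assumption; adjust to your repository layout.
	"your-repo/gpu-aware-scheduler/pkg/scheduler/plugins"
)

func main() {
	// Register our plugins so they can be enabled via KubeSchedulerConfiguration.
	command := app.NewSchedulerCommand(
		app.WithPlugin(plugins.Name, plugins.New),
		app.WithPlugin(plugins.ScorePluginName, plugins.NewScorer),
	)
	if err := command.Execute(); err != nil {
		os.Exit(1)
	}
}
With that entry point in place, a multi-stage Dockerfile builds the image: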
FROM golang:1.19 AS builder
WORKDIR /workspace
COPY go.mod go.mod
COPY go.sum go.sum
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -a -o gpu-aware-scheduler ./cmd/scheduler
FROM alpine:latest
WORKDIR /root/
COPY --from=builder /workspace/gpu-aware-scheduler .
ENTRYPOINT ["/root/gpu-aware-scheduler"]
Next, we need a KubeSchedulerConfiguration file to enable our plugins:
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
leaderElection:
leaderElect: true
resourceName: gpu-aware-scheduler
resourceNamespace: kube-system
profiles:
- schedulerName: gpu-aware-scheduler
plugins:
filter:
enabled:
- name: NvlinkTopology
score:
enabled:
- name: NvlinkTopologyScorer
weight: 2
# We should still include default plugins for basic functionality
- name: NodeResourcesBalancedAllocation
weight: 1
- name: ImageLocality
weight: 1
pluginConfig:
- name: NvlinkTopology
args: {}
- name: NvlinkTopologyScorer
args: {}
Finally, the scheduler runs as a Deployment in the kube-system namespace. It needs RBAC permissions to read pods and nodes, create pod bindings, update pod status, and manage leases for leader election; binding its ServiceAccount to the built-in system:kube-scheduler ClusterRole is a common starting point.
# rbac.yaml, deployment.yaml ...
apiVersion: apps/v1
kind: Deployment
metadata:
name: gpu-aware-scheduler
namespace: kube-system
labels:
component: gpu-aware-scheduler
spec:
replicas: 1
selector:
matchLabels:
component: gpu-aware-scheduler
template:
metadata:
labels:
component: gpu-aware-scheduler
spec:
serviceAccountName: gpu-aware-scheduler-sa # Requires ClusterRole and ClusterRoleBinding
containers:
- name: scheduler
image: your-repo/gpu-aware-scheduler:v0.1.0
args:
- "--config=/etc/kubernetes/scheduler-config.yaml"
- "-v=4"
volumeMounts:
- name: scheduler-config
mountPath: /etc/kubernetes
volumes:
- name: scheduler-config
configMap:
name: gpu-aware-scheduler-config
Step 5: Using the Custom Scheduler
To use our new scheduler, pods must specify its name in their spec:
Training Pod Example:
apiVersion: v1
kind: Pod
metadata:
name: large-model-training
labels:
workload-type: "training-job"
spec:
schedulerName: gpu-aware-scheduler
containers:
- name: training-container
image: a-training-image
resources:
limits:
nvidia.com/gpu: 2
Inference Pod Example:
apiVersion: v1
kind: Pod
metadata:
name: inference-server-abc
labels:
workload-type: "inference-server"
spec:
schedulerName: gpu-aware-scheduler
containers:
- name: inference-container
image: an-inference-image
resources:
limits:
nvidia.com/gpu: 1
When these pods are created, the default scheduler ignores them; our gpu-aware-scheduler, which watches for pods whose spec.schedulerName matches its own, picks them up and schedules them.
Advanced Topics and Edge Cases
A production-ready scheduler needs to handle more than the happy path.
Edge Case 1: Preemption
What if a high-priority training-job is pending but all NVLink pairs are occupied, some by low-priority inference-server pods? The scheduler needs to preempt (evict) the low-priority pods. This requires:
* PriorityClass: defining PriorityClass resources and assigning them to pods, so the scheduler knows which pods may be evicted for which.
* A PostFilter plugin: our scheduler can implement the PostFilter extension point. If Filter fails for a high-priority pod on every node, PostFilter is called. Here, we can identify nodes where preempting lower-priority pods would make the node feasible and nominate them. A skeleton follows.
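A skeleton of that PostFilter hook is sketched below. The signature matches the framework versions this post targets (it has shifted slightly in newer releases), and victim selection and eviction are deliberately elided.
// PostFilter runs only when Filter rejected every node for the pod. A real
// implementation would pick victim pods (for example, lower-priority inference
// servers occupying an NVLink pair), drive their eviction, and nominate the node.
func (pl *NvlinkTopology) PostFilter(ctx context.Context, state *framework.CycleState,
	pod *corev1.Pod, filteredNodeStatusMap framework.NodeToStatusMap) (*framework.PostFilterResult, *framework.Status) {

	// Preemption only makes sense for prioritized training jobs.
	if pod.Labels[TrainingPodLabelKey] != TrainingPodLabelValue || pod.Spec.Priority == nil {
		return nil, framework.NewStatus(framework.Unschedulable, "preemption not applicable")
	}

	for nodeName, status := range filteredNodeStatusMap {
		if status.Code() != framework.Unschedulable {
			continue
		}
		// TODO: using a framework.Handle kept from New, list this node's pods,
		// find lower-priority victims whose removal frees an NVLink pair, and
		// return a PostFilterResult nominating nodeName.
		_ = nodeName
	}

	return nil, framework.NewStatus(framework.Unschedulable, "no node becomes feasible through preemption")
}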
Edge Case 2: Accurate GPU Allocation Tracking
Our getUsedGpuIDs helper is a simplification: it assumes pods use GPUs sequentially (0, 1, 2...), which is not guaranteed. The NVIDIA device plugin allocates specific devices (by UUID) and advertises them to the kubelet, which exposes them to the container through the NVIDIA_VISIBLE_DEVICES environment variable. The scheduler cannot see this variable, so it cannot know which physical devices a running pod actually holds. The state-of-the-art solution is to have the device plugin itself (or an agent beside it) expose allocation state via node annotations or a custom resource, which the scheduler then reads as its source of truth. This creates a tighter feedback loop between allocation and scheduling.
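As a sketch of the scheduler-side consumer, assume the device plugin (or an agent next to it) publishes the currently allocated device indices in a node annotation. The annotation name below is hypothetical; this helper would live alongside the others in the plugins package, so its imports are already in place.
// AllocatedGpusAnnotation is a hypothetical annotation, e.g. "0,3", maintained
// by a node-local agent that watches the device plugin's allocations.
const AllocatedGpusAnnotation = "gpu.nvidia.com/allocated-devices"

// getUsedGpuIDsFromAnnotation reads real allocation state instead of assuming
// sequential device assignment.
func getUsedGpuIDsFromAnnotation(node *corev1.Node) (map[int]bool, error) {
	used := make(map[int]bool)
	raw, ok := node.Annotations[AllocatedGpusAnnotation]
	if !ok || raw == "" {
		return used, nil // nothing allocated, or no agent running on this node
	}
	for _, s := range strings.Split(raw, ",") {
		id, err := strconv.Atoi(strings.TrimSpace(s))
		if err != nil {
			return nil, fmt.Errorf("invalid device id %q in %s: %v", s, AllocatedGpusAnnotation, err)
		}
		used[id] = true
	}
	return used, nil
}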
Performance Considerations
The scheduler is a critical control plane component. A slow scheduler can cripple cluster throughput.
* Caching: The scheduler framework provides a snapshot of the cluster state. For more complex calculations (e.g., parsing topology repeatedly), our plugin can implement its own internal cache; be mindful of cache invalidation when nodes or pods change (a minimal sketch follows this list).
* Algorithm Complexity: Our inference scoring iterates over every GPU and every NVLink pair on a node, so it is roughly O(N × P) per node rather than constant time. That is negligible with 4 GPUs per node but worth watching on denser topologies. Always benchmark your plugins; the scheduler exposes Prometheus metrics for plugin execution latency (scheduler_plugin_execution_duration_seconds).
* Concurrency: The framework runs scheduling cycles for different pods in parallel. Ensure your plugins are thread-safe.
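To make the caching and thread-safety points concrete, here is a minimal sketch of an internal cache for parsed NVLink topology. It is keyed by the raw annotation string, so a changed annotation naturally bypasses stale entries; add "sync" to the plugin package's imports.
// topologyCache memoizes parseNvlinkPairs results across scheduling cycles.
type topologyCache struct {
	mu    sync.RWMutex
	pairs map[string][][2]int // raw annotation value -> parsed pairs
}

func newTopologyCache() *topologyCache {
	return &topologyCache{pairs: make(map[string][][2]int)}
}

func (c *topologyCache) get(annotation string) ([][2]int, error) {
	c.mu.RLock()
	cached, ok := c.pairs[annotation]
	c.mu.RUnlock()
	if ok {
		return cached, nil
	}
	parsed, err := parseNvlinkPairs(annotation)
	if err != nil {
		return nil, err
	}
	c.mu.Lock()
	c.pairs[annotation] = parsed
	c.mu.Unlock()
	return parsed, nil
}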
Full Code Example for Helper Functions
Here are the implementations for the helper functions used in our plugins.
// pkg/scheduler/plugins/helpers.go
package plugins

import (
	"fmt"
	"strconv"
	"strings"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)
func getGpuRequest(pod *corev1.Pod) int64 {
for _, container := range pod.Spec.Containers {
if val, ok := container.Resources.Limits[GPUResourceName]; ok {
return val.Value()
}
}
return 0
}
func parseNvlinkPairs(annotation string) ([][2]int, error) {
var pairs [][2]int
pairStrs := strings.Split(annotation, ",")
for _, p := range pairStrs {
ids := strings.Split(p, "-")
if len(ids) != 2 {
return nil, fmt.Errorf("invalid pair format: %s", p)
}
id1, err1 := strconv.Atoi(ids[0])
id2, err2 := strconv.Atoi(ids[1])
if err1 != nil || err2 != nil {
return nil, fmt.Errorf("invalid GPU ID in pair: %s", p)
}
pairs = append(pairs, [2]int{id1, id2})
}
return pairs, nil
}
// getUsedGpuIDs is a simplified model. It assumes sequential allocation and doesn't account for
// specific device UUIDs. A production implementation would need a more robust way to track this.
func getUsedGpuIDs(nodeInfo *framework.NodeInfo) map[int]bool {
used := make(map[int]bool)
var gpuCount int64 = 0
	for _, p := range nodeInfo.Pods {
		// nodeInfo.Pods is a slice of *framework.PodInfo; the Pod object itself is p.Pod.
		// Ignore pods in a terminal state.
		if p.Pod.Status.Phase == corev1.PodSucceeded || p.Pod.Status.Phase == corev1.PodFailed {
			continue
		}
		requested := getGpuRequest(p.Pod)
for i := int64(0); i < requested; i++ {
used[int(gpuCount+i)] = true
}
gpuCount += requested
}
return used
}
Conclusion
Building a custom Kubernetes scheduler is an advanced task, but it's a powerful tool for optimizing specialized, high-value workloads. By leveraging the scheduler framework, we can move beyond the one-size-fits-all model of the default scheduler and implement nuanced, topology-aware, and policy-driven placement strategies. This allows platform engineers to extract maximum performance and utilization from expensive hardware resources like GPUs, directly impacting the efficiency and cost-effectiveness of production AI/ML systems. While the initial investment is significant, the control it provides over the orchestration of critical workloads is often indispensable at scale.