GPU-Aware Scheduling in K8s: A Custom Scheduler Implementation

16 min read
Goh Ling Yong
Technology enthusiast and software architect specializing in AI-driven development tools and modern software engineering practices. Passionate about the intersection of artificial intelligence and human creativity in building tomorrow's digital solutions.

The Inadequacy of Default Scheduling for Specialized Hardware

For senior engineers managing large-scale Kubernetes clusters, especially for AI/ML workloads, the limitations of the default kube-scheduler become apparent quickly. While it excels at general-purpose container orchestration—spreading pods across nodes, respecting resource requests, and handling affinities—it operates on a generic understanding of resources like cpu and memory. When it comes to specialized hardware like GPUs, this generic model falls short.

The default scheduler sees nvidia.com/gpu: 4 as four fungible units. It has no intrinsic knowledge of the underlying hardware topology. This is a critical blind spot in production environments where:

  • Interconnect Matters: High-performance computing (HPC) and distributed model training often require multiple GPUs connected via high-speed interconnects like NVLink or NVSwitch. Placing two communicating pods on GPUs on the same node but across different NUMA nodes or without a direct NVLink bridge can introduce significant performance degradation.
  • Heterogeneous Hardware: A single cluster might contain nodes with different GPU models (e.g., NVIDIA A100s for training, T4s for inference). The default scheduler can't differentiate them beyond simple node labels, making it difficult to route specific workloads to the appropriate hardware class automatically.
  • Bin Packing vs. Spreading: You might want to tightly pack low-priority inference pods onto a few nodes to keep other nodes with powerful, fully-interconnected GPUs free for large, high-priority training jobs. The default scheduler's scoring logic (LeastAllocated vs. MostAllocated) is often too simplistic to enforce such a sophisticated policy.

To solve these problems, we must move beyond tweaking scheduler policies and build a custom scheduler tailored to our specific hardware and workload requirements. This post provides a production-focused walkthrough of implementing a custom GPU-aware scheduler using the Kubernetes scheduler-framework.

    The Scheduler Framework: Our Toolkit for Customization

    The scheduler-framework is a pluggable architecture introduced in Kubernetes 1.15 that makes building custom schedulers manageable. It exposes a set of extension points (interfaces) that allow us to inject our own logic into the scheduling cycle. The key stages are:

    * Filtering (Filter): Predicate phase. These plugins determine if a pod *can* run on a given node. If any Filter plugin returns Unschedulable, the node is rejected for that pod. This is where we'll enforce hard requirements, like the availability of an NVLink-connected GPU pair.

    * Scoring (Score): Priority phase. After filtering, nodes that can run the pod are scored by these plugins. The node with the highest cumulative score is chosen. This is where we'll implement our preference logic, like preferring to pack inference pods together.

    Other extension points like PreFilter, PostFilter, Reserve, and Permit offer finer-grained control, which we'll touch on when discussing edge cases like preemption.
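
    For orientation, the two interfaces we will implement look roughly like this (simplified from k8s.io/kubernetes/pkg/scheduler/framework; exact signatures can shift slightly between Kubernetes versions):

    go
    // Simplified view of the framework interfaces used in this post.
    // v1 here is "k8s.io/api/core/v1".
    type FilterPlugin interface {
    	Plugin
    	// Filter is called once per candidate node. Returning an Unschedulable
    	// status removes that node from consideration for the pod.
    	Filter(ctx context.Context, state *CycleState, pod *v1.Pod, nodeInfo *NodeInfo) *Status
    }

    type ScorePlugin interface {
    	Plugin
    	// Score ranks a node that survived filtering, in the range [0, 100].
    	Score(ctx context.Context, state *CycleState, pod *v1.Pod, nodeName string) (int64, *Status)
    	// ScoreExtensions exposes optional score normalization; it may return nil.
    	ScoreExtensions() ScoreExtensions
    }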

    The Production Scenario: Optimizing a Mixed ML Cluster

    Let's define our target environment:

    * A Kubernetes cluster with several worker nodes.

    * Each node is equipped with four NVIDIA A100 GPUs.

    * The GPUs are connected in pairs via NVLink: GPU0-GPU1 and GPU2-GPU3.

    * We have two primary workload types:

    1. training-job pods: Request 2 GPUs (nvidia.com/gpu: 2) and must be placed on an NVLink-connected pair for optimal performance.

    2. inference-server pods: Request 1 GPU (nvidia.com/gpu: 1) and have no topology requirements. We want to pack these pods tightly to consolidate fragmentation and leave nodes free for training jobs.

    Our custom scheduler, gpu-aware-scheduler, will enforce these rules.

    Step 1: Annotating Nodes with Topology Information

    Our scheduler needs data to make decisions. Since Kubernetes has no built-in concept of NVLink topology, we must provide it. The most common pattern is to run a DaemonSet that inspects each node's hardware and publishes labels or annotations; Node Feature Discovery (NFD) combined with NVIDIA's GPU Feature Discovery (GFD) is the usual choice.

    For our example, let's assume this discovery process results in the following node annotation:

    yaml
    apiVersion: v1
    kind: Node
    metadata:
      name: node-1
      annotations:
        gpu.nvidia.com/nvlink-pairs: "0-1,2-3"

    This simple annotation tells our scheduler which GPU device IDs form high-speed pairs. (For quick experiments you can also apply it by hand: kubectl annotate node node-1 gpu.nvidia.com/nvlink-pairs="0-1,2-3".)

    Step 2: Implementing the Custom `Filter` Plugin

    Our first task is to write a Filter plugin that prevents training-job pods from being scheduled on nodes where a free NVLink pair isn't available. We'll write this in Go, as it's the native language of Kubernetes.

    First, we define our plugin structure.

    go
    // pkg/scheduler/plugins/nvlink.go
    package plugins
    
    import (
    	"context"
    	"fmt"
    
    	corev1 "k8s.io/api/core/v1"
    	"k8s.io/apimachinery/pkg/runtime"
    	"k8s.io/klog/v2"
    	"k8s.io/kubernetes/pkg/scheduler/framework"
    )
    
    const (
    	Name = "NvlinkTopology"
    	TrainingPodLabelKey = "workload-type"
    	TrainingPodLabelValue = "training-job"
    	NvlinkAnnotation = "gpu.nvidia.com/nvlink-pairs"
    	GPUResourceName = "nvidia.com/gpu"
    )
    
    type NvlinkTopology struct{}
    
    var _ framework.FilterPlugin = &NvlinkTopology{}
    
    func (pl *NvlinkTopology) Name() string {
    	return Name
    }
    
    // New initializes a new plugin and returns it.
    func New(_ runtime.Object, _ framework.Handle) (framework.Plugin, error) {
    	return &NvlinkTopology{}, nil
    }

    Now for the core Filter logic. The scheduler framework passes the pod being scheduled and a NodeInfo object, which contains a snapshot of the node's state, including pods already running on it.

    go
    // pkg/scheduler/plugins/nvlink.go (continued)
    
    func (pl *NvlinkTopology) Filter(ctx context.Context, _ *framework.CycleState, pod *corev1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
    	// 1. Check if this pod is a training job that needs topology awareness.
    	workloadType, ok := pod.Labels[TrainingPodLabelKey]
    	if !ok || workloadType != TrainingPodLabelValue {
    		klog.V(4).Infof("Pod %s is not a training job, skipping NvlinkTopology filter.", pod.Name)
    		return framework.NewStatus(framework.Success)
    	}
    
    	// 2. Check if the pod requests exactly 2 GPUs.
    	// In a real-world scenario, this could be more flexible.
    	requestedGPUs := getGpuRequest(pod)
    	if requestedGPUs != 2 {
    		klog.V(4).Infof("Training pod %s requests %d GPUs, not 2. Skipping filter.", pod.Name, requestedGPUs)
    		return framework.NewStatus(framework.Success)
    	}
    
    	// 3. Get node's NVLink topology from annotations.
    	node := nodeInfo.Node()
    	if node == nil {
    		return framework.NewStatus(framework.Error, "node not found")
    	}
    	nvlinkAnnotation, ok := node.Annotations[NvlinkAnnotation]
    	if !ok || nvlinkAnnotation == "" {
    		klog.V(3).Infof("Node %s has no NVLink topology annotation, considering it unschedulable for training pods.", node.Name)
    		return framework.NewStatus(framework.Unschedulable, "Node lacks NVLink topology information")
    	}
    
    	nvlinkPairs, err := parseNvlinkPairs(nvlinkAnnotation)
    	if err != nil {
    		return framework.NewStatus(framework.Error, fmt.Sprintf("failed to parse nvlink annotation: %v", err))
    	}
    
    	// 4. Determine which GPUs are already in use on the node.
    	// This is a simplified model. Production schedulers often rely on device plugin state.
    	usedGpuIDs := getUsedGpuIDs(nodeInfo)
    
    	// 5. Check if any NVLink pair is fully available.
    	for _, pair := range nvlinkPairs {
    		if !usedGpuIDs[pair[0]] && !usedGpuIDs[pair[1]] {
    			klog.V(2).Infof("Found available NVLink pair %v on node %s for pod %s", pair, node.Name, pod.Name)
    			return framework.NewStatus(framework.Success)
    		}
    	}
    
    	klog.V(2).Infof("No available NVLink pair on node %s for pod %s", node.Name, pod.Name)
    	return framework.NewStatus(framework.Unschedulable, "No available NVLink-connected GPU pair")
    }
    
    // Helper functions used above. Their full implementations live in
    // pkg/scheduler/plugins/helpers.go, shown at the end of this post:
    //   getGpuRequest(pod *corev1.Pod) int64
    //   parseNvlinkPairs(annotation string) ([][2]int, error)
    //   getUsedGpuIDs(nodeInfo *framework.NodeInfo) map[int]bool
    

    This Filter plugin ensures that our hard requirement is met: a training-job pod will only be considered for nodes where an entire NVLink pair is free.
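
    Before moving on, it's worth sanity-checking the filter against a synthetic node. The sketch below assumes the helper implementations shown at the end of this post and lives in the same package as the plugin; it builds a framework.NodeInfo by hand rather than touching a real cluster.

    go
    // pkg/scheduler/plugins/nvlink_test.go
    package plugins

    import (
    	"context"
    	"testing"

    	corev1 "k8s.io/api/core/v1"
    	"k8s.io/apimachinery/pkg/api/resource"
    	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    	"k8s.io/kubernetes/pkg/scheduler/framework"
    )

    // trainingPod builds a 2-GPU training pod for test purposes.
    func trainingPod(name string) *corev1.Pod {
    	return &corev1.Pod{
    		ObjectMeta: metav1.ObjectMeta{
    			Name:   name,
    			Labels: map[string]string{TrainingPodLabelKey: TrainingPodLabelValue},
    		},
    		Spec: corev1.PodSpec{
    			Containers: []corev1.Container{{
    				Name: "train",
    				Resources: corev1.ResourceRequirements{
    					Limits: corev1.ResourceList{GPUResourceName: resource.MustParse("2")},
    				},
    			}},
    		},
    	}
    }

    func TestNvlinkTopologyFilter(t *testing.T) {
    	pl := &NvlinkTopology{}
    	node := &corev1.Node{
    		ObjectMeta: metav1.ObjectMeta{
    			Name:        "node-1",
    			Annotations: map[string]string{NvlinkAnnotation: "0-1,2-3"},
    		},
    	}

    	// An empty node with a valid topology annotation should accept the pod.
    	nodeInfo := framework.NewNodeInfo()
    	nodeInfo.SetNode(node)
    	if status := pl.Filter(context.Background(), nil, trainingPod("train-1"), nodeInfo); !status.IsSuccess() {
    		t.Fatalf("expected Success on empty node, got %v", status)
    	}

    	// With both pairs consumed by two earlier training pods, the next one must be rejected.
    	nodeInfo = framework.NewNodeInfo(trainingPod("busy-1"), trainingPod("busy-2"))
    	nodeInfo.SetNode(node)
    	if status := pl.Filter(context.Background(), nil, trainingPod("train-2"), nodeInfo); status.Code() != framework.Unschedulable {
    		t.Fatalf("expected Unschedulable on fully used node, got %v", status)
    	}
    }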

    Step 3: Implementing the Custom `Score` Plugin

    Next, we'll implement a Score plugin to guide the scheduler's preferences. Our goal is to pack inference-server pods onto nodes that are already fragmented, leaving nodes with pristine NVLink pairs available for future training-job pods.

    Our scoring logic will be:

    * For training-job pods: Give a high score to nodes that have the *fewest* available GPUs, as long as they can fit the job. This encourages using up nearly-full nodes first.

    * For inference-server pods: Give a high score to nodes that already have other inference pods running and whose GPU usage would *not* break up a complete NVLink pair. This is a classic bin-packing strategy with a topology-aware twist.

    go
    // pkg/scheduler/plugins/nvlinkscore.go
    package plugins
    
    import (
    	"context"
    	"fmt"
    
    	corev1 "k8s.io/api/core/v1"
    	"k8s.io/apimachinery/pkg/runtime"
    	"k8s.io/kubernetes/pkg/scheduler/framework"
    )
    
    const (
    	ScorePluginName = "NvlinkTopologyScorer"
    	InferencePodLabelValue = "inference-server"
    )
    
    type NvlinkTopologyScorer struct {
    	handle framework.Handle
    }
    
    var _ framework.ScorePlugin = &NvlinkTopologyScorer{}
    
    func (pl *NvlinkTopologyScorer) Name() string {
    	return ScorePluginName
    }
    
    func NewScorer(_ runtime.Object, h framework.Handle) (framework.Plugin, error) {
    	return &NvlinkTopologyScorer{handle: h}, nil
    }
    
    func (pl *NvlinkTopologyScorer) Score(ctx context.Context, state *framework.CycleState, pod *corev1.Pod, nodeName string) (int64, *framework.Status) {
    	nodeInfo, err := pl.handle.SnapshotSharedLister().NodeInfos().Get(nodeName)
    	if err != nil {
    		return 0, framework.NewStatus(framework.Error, fmt.Sprintf("getting node %q from snapshot: %s", nodeName, err))
    	}
    
    	workloadType := pod.Labels[TrainingPodLabelKey]
    
    	if workloadType == InferencePodLabelValue {
    		return pl.scoreForInference(pod, nodeInfo)
    	}
    
    	if workloadType == TrainingPodLabelValue {
    		return pl.scoreForTraining(pod, nodeInfo)
    	}
    
    	// Default score for other pods
    	return 50, framework.NewStatus(framework.Success)
    }
    
    func (pl *NvlinkTopologyScorer) scoreForInference(pod *corev1.Pod, nodeInfo *framework.NodeInfo) (int64, *framework.Status) {
    	// Goal: Pack inference pods onto nodes that are already fragmented or have other inference pods.
    	node := nodeInfo.Node()
    	usedGpuIDs := getUsedGpuIDs(nodeInfo)
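    	// A missing or malformed annotation simply yields no pairs here, so the pair-preservation bonus below is skipped.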
    	nvlinkPairs, _ := parseNvlinkPairs(node.Annotations[NvlinkAnnotation])
    
    	// Score 1: Higher score for nodes with more used GPUs (bin-packing).
    	score := int64(len(usedGpuIDs)) * 10
    
    	// Score 2: Add a bonus if placing this pod does NOT break a fresh NVLink pair.
    	// This is a complex calculation. A simpler proxy is to count how many full pairs are left.
    	availablePairs := 0
    	for _, pair := range nvlinkPairs {
    		if !usedGpuIDs[pair[0]] && !usedGpuIDs[pair[1]] {
    			availablePairs++
    		}
    	}
    
    	// If there are no available GPUs, filter should have caught it, but for safety:
    	totalGpus, _ := node.Status.Allocatable[GPUResourceName]
    	if int(totalGpus.Value())-len(usedGpuIDs) < 1 {
    		return 0, framework.NewStatus(framework.Success) // Should not happen
    	}
    
    	// If scheduling this pod leaves at least one pair intact, give a high bonus.
    	if availablePairs > 0 {
    		// Let's check the number of single free GPUs
    		singleFreeGpus := 0
    		for i := 0; i < int(totalGpus.Value()); i++ {
    			if !usedGpuIDs[i] {
    				isPaired := false
    				for _, pair := range nvlinkPairs {
    					if (i == pair[0] && !usedGpuIDs[pair[1]]) || (i == pair[1] && !usedGpuIDs[pair[0]]) {
    						isPaired = true
    						break
    					}
    				}
    				if !isPaired {
    					singleFreeGpus++
    				}
    			}
    		}
    		// Prefer nodes that already have a 'stray' GPU to use up.
    		if singleFreeGpus > 0 {
    			score += 50
    		}
    	}
    
    	return score, framework.NewStatus(framework.Success)
    }
    
    func (pl *NvlinkTopologyScorer) scoreForTraining(pod *corev1.Pod, nodeInfo *framework.NodeInfo) (int64, *framework.Status) {
    	// Goal: Use nodes that are already somewhat utilized to leave empty nodes free.
    	// This is a slightly different form of bin-packing.
    	usedGpuCount := len(getUsedGpuIDs(nodeInfo))
    	// We want to give a higher score to nodes with more GPUs used, promoting consolidation.
    	// Filter has already guaranteed a pair is available.
    	return int64(usedGpuCount), framework.NewStatus(framework.Success)
    }
    
    // ScoreExtensions returns a ScoreExtensions interface if the plugin implements one.
    func (pl *NvlinkTopologyScorer) ScoreExtensions() framework.ScoreExtensions {
    	// We don't need normalization, so we return nil.
    	return nil
    }
    
    // NOTE: The helper functions `getUsedGpuIDs` and `parseNvlinkPairs` are shared with the Filter plugin.

    This scoring logic is non-trivial, but it shows how operational policy can be encoded directly into the scheduling process. To make it concrete, compare two candidate nodes for an inference-server pod: an empty node scores 0 (no GPUs used, and every free GPU still belongs to an intact pair, so no bonus), while a node where GPU 0 is already busy scores 60 (a base score of 10 for one used GPU, plus the 50-point bonus, since pair 2-3 is still intact and GPU 1 is a stray whose NVLink partner is occupied). The fragmented node wins, and the pristine node stays free for training jobs.

    Step 4: Configuring and Deploying the Scheduler

    With our plugins written, we need to package and deploy them.
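
    The Dockerfile below builds ./cmd/scheduler, so the plugins first need an entrypoint that registers them with the scheduler framework's out-of-tree plugin mechanism. A minimal sketch follows; the module path is an assumption, and the plugin factory signature differs slightly across Kubernetes releases, so match it to the framework version you vendor.

    go
    // cmd/scheduler/main.go
    package main

    import (
    	"os"

    	"k8s.io/component-base/cli"
    	"k8s.io/kubernetes/cmd/kube-scheduler/app"

    	// Assumed module path for the plugins package shown above.
    	"example.com/gpu-aware-scheduler/pkg/scheduler/plugins"
    )

    func main() {
    	// Register our plugins under the names referenced in the
    	// KubeSchedulerConfiguration profile defined in the next step.
    	command := app.NewSchedulerCommand(
    		app.WithPlugin(plugins.Name, plugins.New),
    		app.WithPlugin(plugins.ScorePluginName, plugins.NewScorer),
    	)
    	os.Exit(cli.Run(command))
    }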

  • Compile and Containerize: Compile the Go code into a static binary and build a Docker image. The entrypoint of the container will be our custom scheduler binary.
    dockerfile
        FROM golang:1.19 AS builder
        WORKDIR /workspace
        COPY go.mod go.mod
        COPY go.sum go.sum
        RUN go mod download
        COPY . .
        RUN CGO_ENABLED=0 GOOS=linux go build -a -o gpu-aware-scheduler ./cmd/scheduler
    
        FROM alpine:latest
        WORKDIR /root/
        COPY --from=builder /workspace/gpu-aware-scheduler .
        ENTRYPOINT ["/root/gpu-aware-scheduler"]
  • Create a Scheduler Configuration: We define a KubeSchedulerConfiguration file to enable our plugins.
    yaml
        apiVersion: kubescheduler.config.k8s.io/v1
        kind: KubeSchedulerConfiguration
        leaderElection:
          leaderElect: true
          resourceName: gpu-aware-scheduler
          resourceNamespace: kube-system
        profiles:
          - schedulerName: gpu-aware-scheduler
            plugins:
              filter:
                enabled:
                  - name: NvlinkTopology
              score:
                enabled:
                  - name: NvlinkTopologyScorer
                    weight: 2
                  # We should still include default plugins for basic functionality
                  - name: NodeResourcesBalancedAllocation
                    weight: 1
                  - name: ImageLocality
                    weight: 1
            pluginConfig:
              - name: NvlinkTopology
                args: {}
              - name: NvlinkTopologyScorer
                args: {}
  • Deploy the Scheduler: We deploy our custom scheduler as a Deployment in the kube-system namespace, mounting the KubeSchedulerConfiguration above from a ConfigMap (gpu-aware-scheduler-config). Its ServiceAccount needs RBAC permissions comparable to the default scheduler's: read access to nodes and pods, permission to create pods/binding and update pod status, and access to coordination.k8s.io leases for leader election.
    yaml
        # rbac.yaml, deployment.yaml ...
        apiVersion: apps/v1
        kind: Deployment
        metadata:
          name: gpu-aware-scheduler
          namespace: kube-system
          labels:
            component: gpu-aware-scheduler
        spec:
          replicas: 1
          selector:
            matchLabels:
              component: gpu-aware-scheduler
          template:
            metadata:
              labels:
                component: gpu-aware-scheduler
            spec:
              serviceAccountName: gpu-aware-scheduler-sa # Requires ClusterRole and ClusterRoleBinding
              containers:
              - name: scheduler
                image: your-repo/gpu-aware-scheduler:v0.1.0
                args:
                - "--config=/etc/kubernetes/scheduler-config.yaml"
                - "-v=4"
                volumeMounts:
                - name: scheduler-config
                  mountPath: /etc/kubernetes
              volumes:
              - name: scheduler-config
                configMap:
                  name: gpu-aware-scheduler-config

    Step 5: Using the Custom Scheduler

    To use our new scheduler, pods must specify its name in their spec:

    Training Pod Example:

    yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: large-model-training
      labels:
        workload-type: "training-job"
    spec:
      schedulerName: gpu-aware-scheduler
      containers:
      - name: training-container
        image: a-training-image
        resources:
          limits:
            nvidia.com/gpu: 2

    Inference Pod Example:

    yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: inference-server-abc
      labels:
        workload-type: "inference-server"
    spec:
      schedulerName: gpu-aware-scheduler
      containers:
      - name: inference-container
        image: an-inference-image
        resources:
          limits:
            nvidia.com/gpu: 1

    When these pods are created, they carry spec.schedulerName: gpu-aware-scheduler, so the default kube-scheduler ignores them and our gpu-aware-scheduler, which watches for pods with that scheduler name, picks them up instead.

    Advanced Topics and Edge Cases

    A production-ready scheduler needs to handle more than the happy path.

    Edge Case 1: Preemption

    What if a high-priority training-job is pending but all NVLink pairs are occupied, some by low-priority inference-server pods? The scheduler needs to preempt (evict) the low-priority pods. This requires:

  • Priority and Preemption: Defining PriorityClass resources and assigning them to pods.
  • PostFilter Plugin: Our scheduler can implement the PostFilter extension point. When no node passes Filter for a pod, PostFilter plugins are called; here we can identify nodes where preempting lower-priority pods would make the pod feasible and nominate one of them, as sketched below.
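
    The following skeleton shows where that logic would hook in. It is a sketch only: real preemption has to simulate victim removal, respect PodDisruptionBudgets and pod priorities, and return nomination details, and the PostFilter signature shown here matches recent framework versions but may differ in yours.

    go
    // A skeletal PostFilter hook for the NvlinkTopology plugin (sketch, not a working implementation).
    var _ framework.PostFilterPlugin = &NvlinkTopology{}

    func (pl *NvlinkTopology) PostFilter(ctx context.Context, state *framework.CycleState,
    	pod *corev1.Pod, filteredNodeStatusMap framework.NodeToStatusMap) (*framework.PostFilterResult, *framework.Status) {

    	// Only attempt preemption on behalf of high-priority training jobs.
    	if pod.Labels[TrainingPodLabelKey] != TrainingPodLabelValue {
    		return nil, framework.NewStatus(framework.Unschedulable)
    	}

    	// Here we would walk the nodes that failed Filter, simulate evicting
    	// lower-priority inference pods, and check whether a full NVLink pair
    	// would become free; the chosen node would then be nominated.
    	return nil, framework.NewStatus(framework.Unschedulable, "preemption not implemented in this sketch")
    }
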
    Edge Case 2: Accurate GPU Allocation Tracking

    Our getUsedGpuIDs helper is a simplification: it assumes pods consume GPU indices sequentially (0, 1, 2, ...), which is not guaranteed. The real assignment happens after scheduling, on the node: the kubelet's device manager picks specific device IDs from those advertised by the NVIDIA device plugin, and the plugin's Allocate response exposes them to the container via the NVIDIA_VISIBLE_DEVICES environment variable. The scheduler never sees this assignment. A more robust approach is to have the device plugin (or a companion agent) publish its allocation state via node annotations or a custom resource, which the scheduler then reads as its source of truth. This creates a tighter feedback loop between allocation and scheduling.
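
    To make that concrete, suppose (purely hypothetically) that a device plugin or node agent published the allocated device indices in an annotation such as gpu.nvidia.com/allocated-devices: "0,3". Neither the annotation name nor the format exists in the NVIDIA device plugin today; they are assumptions for this sketch. A tracking helper in the same plugins package that trusts such an annotation instead of guessing sequential IDs could look like this:

    go
    // Hypothetical annotation published by the device plugin (or a node agent);
    // this is an assumption for illustration, not an existing NVIDIA feature.
    const AllocatedDevicesAnnotation = "gpu.nvidia.com/allocated-devices"

    // getUsedGpuIDsFromAnnotation treats the node annotation as the source of
    // truth for which GPU indices are currently allocated on the node.
    func getUsedGpuIDsFromAnnotation(node *corev1.Node) (map[int]bool, error) {
    	used := make(map[int]bool)
    	raw, ok := node.Annotations[AllocatedDevicesAnnotation]
    	if !ok || raw == "" {
    		return used, nil // nothing reported as allocated
    	}
    	for _, s := range strings.Split(raw, ",") {
    		id, err := strconv.Atoi(strings.TrimSpace(s))
    		if err != nil {
    			return nil, fmt.Errorf("invalid device id %q in %s: %v", s, AllocatedDevicesAnnotation, err)
    		}
    		used[id] = true
    	}
    	return used, nil
    }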

    Performance Considerations

    The scheduler is a critical control plane component. A slow scheduler can cripple cluster throughput.

    * Caching: The scheduler-framework provides a snapshot of the cluster state. For more complex calculations (e.g., parsing topology annotations repeatedly), our plugin can keep its own internal cache; be mindful of invalidation when nodes or pods change. A small sketch of such a cache follows this list.

    * Algorithm Complexity: Our inference-scoring loop iterates over every GPU and, for each free GPU, over the NVLink pairs, so it is roughly quadratic in the number of devices per node in the worst case. That is negligible for 4-8 GPUs but worth watching on denser nodes, and the cost is multiplied by the number of candidate nodes. Always benchmark your plugins; the scheduler exposes Prometheus metrics for plugin execution latency (scheduler_plugin_execution_duration_seconds).

    * Concurrency: The framework runs scheduling cycles for different pods in parallel. Ensure your plugins are thread-safe.
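
    As an example of the caching point above, here is a minimal, thread-safe topology cache one could add to the plugins package. It re-parses a node's NVLink annotation only when the raw annotation value changes, which also handles invalidation; the file name and structure are assumptions of this sketch.

    go
    // pkg/scheduler/plugins/topologycache.go
    package plugins

    import "sync"

    // topologyCache memoizes parsed NVLink pairs per node, keyed by node name.
    type topologyCache struct {
    	mu      sync.RWMutex
    	entries map[string]topologyEntry
    }

    type topologyEntry struct {
    	raw   string   // annotation value, used to detect changes
    	pairs [][2]int // parsed result
    }

    func newTopologyCache() *topologyCache {
    	return &topologyCache{entries: make(map[string]topologyEntry)}
    }

    // pairsFor returns the cached pairs for a node, re-parsing only when the
    // annotation value differs from the one seen last time.
    func (c *topologyCache) pairsFor(nodeName, annotation string) ([][2]int, error) {
    	c.mu.RLock()
    	e, ok := c.entries[nodeName]
    	c.mu.RUnlock()
    	if ok && e.raw == annotation {
    		return e.pairs, nil
    	}

    	pairs, err := parseNvlinkPairs(annotation)
    	if err != nil {
    		return nil, err
    	}

    	c.mu.Lock()
    	c.entries[nodeName] = topologyEntry{raw: annotation, pairs: pairs}
    	c.mu.Unlock()
    	return pairs, nil
    }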

    Full Code Example for Helper Functions

    Here are the implementations for the helper functions used in our plugins.

    go
    // pkg/scheduler/plugins/helpers.go
    package plugins
    
    import (
    	"fmt"
    	"strconv"
    	"strings"
    
    	corev1 "k8s.io/api/core/v1"
    	"k8s.io/kubernetes/pkg/scheduler/framework"
    )
    
    // getGpuRequest returns the total number of GPUs requested by the pod,
    // summed across all containers (extended resources are requested via limits).
    func getGpuRequest(pod *corev1.Pod) int64 {
    	var total int64
    	for _, container := range pod.Spec.Containers {
    		if val, ok := container.Resources.Limits[GPUResourceName]; ok {
    			total += val.Value()
    		}
    	}
    	return total
    }
    
    func parseNvlinkPairs(annotation string) ([][2]int, error) {
    	var pairs [][2]int
    	pairStrs := strings.Split(annotation, ",")
    	for _, p := range pairStrs {
    		ids := strings.Split(p, "-")
    		if len(ids) != 2 {
    			return nil, fmt.Errorf("invalid pair format: %s", p)
    		}
    		id1, err1 := strconv.Atoi(ids[0])
    		id2, err2 := strconv.Atoi(ids[1])
    		if err1 != nil || err2 != nil {
    			return nil, fmt.Errorf("invalid GPU ID in pair: %s", p)
    		}
    		pairs = append(pairs, [2]int{id1, id2})
    	}
    	return pairs, nil
    }
    
    // getUsedGpuIDs is a simplified model. It assumes sequential allocation and doesn't account for
    // specific device UUIDs. A production implementation would need a more robust way to track this.
    func getUsedGpuIDs(nodeInfo *framework.NodeInfo) map[int]bool {
    	used := make(map[int]bool)
    	var gpuCount int64 = 0
    	for _, pi := range nodeInfo.Pods {
    		pod := pi.Pod // nodeInfo.Pods holds *framework.PodInfo wrappers, not *corev1.Pod
    		// Ignore pods in a terminal state
    		if pod.Status.Phase == corev1.PodSucceeded || pod.Status.Phase == corev1.PodFailed {
    			continue
    		}
    		requested := getGpuRequest(pod)
    		for i := int64(0); i < requested; i++ {
    			used[int(gpuCount+i)] = true
    		}
    		gpuCount += requested
    	}
    	return used
    }
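
    A quick test for the annotation parser keeps the format honest as the topology discovery tooling evolves (a sketch, in the same package as the helpers):

    go
    // pkg/scheduler/plugins/helpers_test.go
    package plugins

    import "testing"

    func TestParseNvlinkPairs(t *testing.T) {
    	pairs, err := parseNvlinkPairs("0-1,2-3")
    	if err != nil {
    		t.Fatalf("unexpected error: %v", err)
    	}
    	want := [][2]int{{0, 1}, {2, 3}}
    	if len(pairs) != len(want) {
    		t.Fatalf("got %d pairs, want %d", len(pairs), len(want))
    	}
    	for i := range want {
    		if pairs[i] != want[i] {
    			t.Errorf("pair %d: got %v, want %v", i, pairs[i], want[i])
    		}
    	}
    	// Malformed entries should be rejected rather than silently dropped.
    	if _, err := parseNvlinkPairs("0-1,bogus"); err == nil {
    		t.Errorf("expected error for malformed annotation")
    	}
    }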

    Conclusion

    Building a custom Kubernetes scheduler is an advanced task, but it's a powerful tool for optimizing specialized, high-value workloads. By leveraging the scheduler-framework, we can move beyond the one-size-fits-all model of the default scheduler and implement nuanced, topology-aware, and policy-driven placement strategies. This allows platform engineers to extract maximum performance and utilization from expensive hardware resources like GPUs, directly impacting the efficiency and cost-effectiveness of production AI/ML systems. While the initial investment is significant, the control it provides over the orchestration of critical workloads is often indispensable at scale.
