Advanced StatefulSet Scheduling with Custom Kubernetes Scheduler Plugins

Goh Ling Yong

The Default Scheduler's Blind Spot: Nuanced Stateful Workloads

The stock kube-scheduler is a marvel of engineering, adeptly handling the placement of thousands of pods across a cluster using a sophisticated set of predicates (Filters) and priorities (Scores). For stateless applications, its combination of pod/node affinity, resource requests, and topology spread constraints is typically sufficient. However, for high-performance stateful workloads—distributed databases (Cassandra, Vitess), message queues (Kafka), or search indexes (Elasticsearch)—these general-purpose tools represent a blunt instrument where a scalpel is required.

The core limitation is a lack of deep, real-time, infrastructure-specific context. The default scheduler is unaware of:

* I/O Contention: It treats a node with idle NVMe drives the same as one whose spinning disks are saturated by a noisy neighbor, as long as CPU and memory requests are met.

* Heterogeneous Storage Tiers: It cannot natively distinguish between a node backed by premium, low-latency storage and one with archival-grade, high-latency disks, unless explicitly guided by crude node labels.

* Fine-Grained Network Topology: Beyond basic zone/region awareness, it doesn't understand network switch proximity or cross-rack latency, which is critical for leader election and replication traffic in distributed systems.

* Dynamic Node State: A node's health and performance are not static. A temporary I/O bottleneck or a 'flapping' network interface is a transient state that the default scheduler cannot react to during a scheduling cycle.

Attempting to solve these issues with a proliferation of complex node labels and intricate affinity rules quickly leads to a brittle, unmanageable configuration. This is the inflection point where senior engineers must consider moving beyond configuration and into programmatic control by extending the scheduler itself.

This article details the implementation of a custom scheduler plugin designed to address these stateful workload challenges directly. We will build a Storage-Aware Plugin that filters nodes based on storage hardware and scores them based on real-time disk I/O metrics, ensuring our StatefulSet pods land on the most performant and appropriate nodes.

The Kubernetes Scheduling Framework: Surgical Extension Points

Before the Scheduling Framework (KEP-624), customizing placement meant either forking the scheduler or bolting webhook-based extenders onto it, with the latency and operational overhead that implies. The modern Scheduling Framework instead provides a set of extension points, allowing us to inject custom logic into specific phases of the scheduling cycle without modifying the core. For our stateful workload scenario, we will focus on two primary extension points:

  • Filter: This extension point runs for every node in the cluster. If any Filter plugin returns 'unschedulable' for a node, that node is immediately removed from consideration for the current pod. This is our gatekeeper. We'll use it to disqualify nodes that don't have the required storage hardware (e.g., must be NVMe).
  • Score: After filtering, the remaining nodes are passed to the Score plugins. Each Score plugin returns an integer score (typically 0-100) for each node, indicating its desirability. The scheduler then sums the weighted scores from all Score plugins to produce a final ranking. This is our optimization engine. We'll use it to rank nodes based on lowest disk I/O, giving preference to quiescent nodes.
  • Other relevant points we'll touch upon:

    * PreFilter: Runs before Filter. Useful for pre-computing data that can be shared across Filter and Score calls for a single pod, preventing redundant calculations.

    * Reserve: The final step before binding. It allows a plugin to reserve resources on a node, knowing that the pod is very likely to be bound there. This is critical for preventing race conditions if multiple custom schedulers are operating.

    By implementing these interfaces in Go, we can create a powerful, context-aware scheduling logic that integrates seamlessly with the kube-scheduler binary.
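
For orientation, here are those extension-point interfaces in abridged form, roughly as they appear in k8s.io/kubernetes/pkg/scheduler/framework for v1.28. Treat this as a sketch and consult your vendored copy, since signatures occasionally shift between releases:

go
// Abridged plugin interfaces from the scheduling framework (types such as
// CycleState, NodeInfo, Status, and PreFilterResult live in the same package).
type FilterPlugin interface {
	Plugin
	Filter(ctx context.Context, state *CycleState, pod *v1.Pod, nodeInfo *NodeInfo) *Status
}

type ScorePlugin interface {
	Plugin
	Score(ctx context.Context, state *CycleState, p *v1.Pod, nodeName string) (int64, *Status)
	ScoreExtensions() ScoreExtensions
}

type PreFilterPlugin interface {
	Plugin
	PreFilter(ctx context.Context, state *CycleState, p *v1.Pod) (*PreFilterResult, *Status)
	PreFilterExtensions() PreFilterExtensions
}

type ReservePlugin interface {
	Plugin
	Reserve(ctx context.Context, state *CycleState, p *v1.Pod, nodeName string) *Status
	Unreserve(ctx context.Context, state *CycleState, p *v1.Pod, nodeName string)
}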

    Scenario: Implementing a Production-Grade Storage-Aware Plugin

    Our goal is to schedule pods for a custom, I/O-intensive StatefulSet. The scheduling requirements are:

  • Hard Requirement: Pods must be placed on nodes with storage=nvme labels.
  • Soft Requirement (Optimization): Pods should be placed on nodes with the lowest current disk I/O utilization to minimize latency and 'noisy neighbor' effects.
To achieve this, we need a mechanism to expose real-time node metrics to our plugin. A common production pattern is a DaemonSet (similar in spirit to node-exporter) that collects metrics and periodically updates the Node object's annotations. CRDs are another option, but annotations are simpler for this use case and avoid adding another controller dependency.

    Let's assume a DaemonSet on each node is responsible for maintaining the following annotation:

    storage.my-company.com/io-utilization: "15" (A string representing percentage from 0-100)
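
The metrics agent itself is outside the scope of this article, but a minimal sketch of the node-side process is shown below. Everything here is illustrative: the NODE_NAME environment variable is assumed to be injected via the downward API, and sampleIOUtilization is a hypothetical stand-in for real /proc/diskstats arithmetic.

go
// node-annotator.go (illustrative sketch, not part of the scheduler plugin):
// periodically publishes the node's disk I/O utilization as an annotation.
package main

import (
	"context"
	"fmt"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

const annotationKey = "storage.my-company.com/io-utilization"

// sampleIOUtilization is a placeholder; a real agent would derive a 0-100 value
// from /proc/diskstats deltas or from an existing exporter.
func sampleIOUtilization() int { return 0 }

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	nodeName := os.Getenv("NODE_NAME") // injected via the downward API

	for range time.Tick(30 * time.Second) {
		patch := []byte(fmt.Sprintf(
			`{"metadata":{"annotations":{%q:%q}}}`,
			annotationKey, fmt.Sprintf("%d", sampleIOUtilization())))
		if _, err := client.CoreV1().Nodes().Patch(
			context.TODO(), nodeName, types.MergePatchType, patch, metav1.PatchOptions{}); err != nil {
			fmt.Fprintf(os.Stderr, "failed to annotate node %s: %v\n", nodeName, err)
		}
	}
}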

    Step 1: Project Setup and Plugin Scaffolding

First, we'll scaffold the Go project. A working Go toolchain and basic familiarity with Go modules are assumed.

    bash
    # Create project directory
    mkdir storage-aware-scheduler
    cd storage-aware-scheduler
    
    # Initialize Go module
    go mod init github.com/your-org/storage-aware-scheduler
    
# Get Kubernetes dependencies
# Use versions that match your target cluster (v1.28 here, to match the
# registry.k8s.io/kube-scheduler:v1.28.0 base image used later)
go get k8s.io/kubernetes@v1.28.0
go get k8s.io/api@v0.28.0
go get k8s.io/apimachinery@v0.28.0
go get k8s.io/client-go@v0.28.0
go get k8s.io/component-base@v0.28.0
    
    # Tidy up
    go mod tidy
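
One wrinkle to be aware of: the k8s.io/kubernetes module pins its staging repositories (k8s.io/api, k8s.io/client-go, and so on) to v0.0.0, so consuming it as a library requires matching replace directives in go.mod, one per staging module in your dependency graph. An abridged excerpt for v1.28.0:

go
// go.mod (excerpt) — add one replace per k8s.io staging module your build pulls in
replace (
	k8s.io/api => k8s.io/api v0.28.0
	k8s.io/apimachinery => k8s.io/apimachinery v0.28.0
	k8s.io/client-go => k8s.io/client-go v0.28.0
	k8s.io/component-base => k8s.io/component-base v0.28.0
	k8s.io/kube-scheduler => k8s.io/kube-scheduler v0.28.0
)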

    Now, let's create our main plugin file, plugin.go.

    go
    // plugin.go
    package main
    
    import (
    	"context"
    	"fmt"
    	"strconv"
    
    	"k8s.io/apimachinery/pkg/runtime"
    	v1 "k8s.io/api/core/v1"
    	"k8s.io/klog/v2"
    	framework "k8s.io/kubernetes/pkg/scheduler/framework"
    )
    
    const (
    	// Name is the name of the plugin used in the plugin registry and configurations.
    	Name = "StorageAwareScheduler"
    
    	// The annotation key for I/O utilization on a Node object.
    	AnnotationKeyIOUtilization = "storage.my-company.com/io-utilization"
    
    	// The label key for required storage type on a Node object.
    	LabelKeyStorageType = "storage"
    	// The required value for the storage type label.
    	RequiredStorageTypeValue = "nvme"
    )
    
    // StorageAware is a scheduler plugin that filters and scores nodes based on storage attributes.
    type StorageAware struct {
    	handle framework.Handle
    }
    
    // Ensure StorageAware implements the necessary interfaces.
    var _ framework.FilterPlugin = &StorageAware{}
    var _ framework.ScorePlugin = &StorageAware{}
    
    // Name returns the name of the plugin.
    func (s *StorageAware) Name() string {
    	return Name
    }
    
    // New initializes a new plugin and returns it.
    func New(_ runtime.Object, h framework.Handle) (framework.Plugin, error) {
    	klog.InfoS("Initializing new StorageAware scheduler plugin")
    	return &StorageAware{
    		handle: h,
    	}, nil
    }

    This code sets up the basic structure. We define our plugin's Name, create the StorageAware struct, and implement the Name() and New() factory functions required by the framework. The var _ ... lines are a compile-time check to ensure our struct correctly implements the interfaces we intend to use.

    Step 2: Implementing the `Filter` Plugin Logic

    The Filter logic is straightforward. It checks if a node has the storage=nvme label. If not, it's rejected.

    go
    // Add this method to the StorageAware struct in plugin.go
    
    // Filter is called by the framework to filter out nodes that do not fit the pod.
    func (s *StorageAware) Filter(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
    	klog.V(4).InfoS("Filtering node for storage requirements", "pod", klog.KObj(pod), "node", klog.KObj(nodeInfo.Node()))
    
    	node := nodeInfo.Node()
    	if node == nil {
    		return framework.NewStatus(framework.Error, "node not found")
    	}
    
    	storageType, ok := node.Labels[LabelKeyStorageType]
    	if !ok {
    		klog.V(2).InfoS("Node rejected: missing storage type label", "node", klog.KObj(node), "requiredLabel", LabelKeyStorageType)
    		return framework.NewStatus(framework.Unschedulable, fmt.Sprintf("node does not have label %s", LabelKeyStorageType))
    	}
    
    	if storageType != RequiredStorageTypeValue {
    		klog.V(2).InfoS("Node rejected: incorrect storage type", "node", klog.KObj(node), "labelValue", storageType, "requiredValue", RequiredStorageTypeValue)
    		return framework.NewStatus(framework.Unschedulable, fmt.Sprintf("node has storage type %s, requires %s", storageType, RequiredStorageTypeValue))
    	}
    
    	klog.V(4).InfoS("Node passed storage filter", "node", klog.KObj(node))
    	return framework.NewStatus(framework.Success)
    }

    This Filter method is robust. It handles nil nodes, missing labels, and incorrect label values, providing clear reasons for unschedulability, which are invaluable for debugging (kubectl describe pod ...).
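
The filter logic is also trivially unit-testable without a cluster. A minimal table-driven sketch, reusing the constants and struct defined in plugin.go:

go
// plugin_test.go
package main

import (
	"context"
	"testing"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	framework "k8s.io/kubernetes/pkg/scheduler/framework"
)

func TestFilter(t *testing.T) {
	cases := []struct {
		name   string
		labels map[string]string
		want   framework.Code
	}{
		{"nvme node passes", map[string]string{LabelKeyStorageType: RequiredStorageTypeValue}, framework.Success},
		{"hdd node rejected", map[string]string{LabelKeyStorageType: "hdd"}, framework.Unschedulable},
		{"unlabeled node rejected", nil, framework.Unschedulable},
	}

	plugin := &StorageAware{} // the framework handle is not used by Filter
	for _, tc := range cases {
		t.Run(tc.name, func(t *testing.T) {
			node := &v1.Node{ObjectMeta: metav1.ObjectMeta{Name: "node-1", Labels: tc.labels}}
			nodeInfo := framework.NewNodeInfo()
			nodeInfo.SetNode(node)

			status := plugin.Filter(context.Background(), nil, &v1.Pod{}, nodeInfo)
			if status.Code() != tc.want {
				t.Errorf("got status %v, want %v", status.Code(), tc.want)
			}
		})
	}
}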

    Step 3: Implementing the `Score` Plugin Logic

    This is the core of our custom logic. The Score plugin will read the storage.my-company.com/io-utilization annotation from each node that passed the Filter stage. It will then normalize this value into a score from 0 to 100, where a lower I/O utilization results in a higher score.

    go
    // Add these methods to the StorageAware struct in plugin.go
    
    // ScoreExtensions returns a ScoreExtensions interface if the plugin implements one.
    func (s *StorageAware) ScoreExtensions() framework.ScoreExtensions {
    	return s
    }
    
    // Score is called by the framework to score a node.
    func (s *StorageAware) Score(ctx context.Context, state *framework.CycleState, p *v1.Pod, nodeName string) (int64, *framework.Status) {
    	nodeInfo, err := s.handle.SnapshotSharedLister().NodeInfos().Get(nodeName)
    	if err != nil {
    		return 0, framework.NewStatus(framework.Error, fmt.Sprintf("getting node %q from snapshot: %v", nodeName, err))
    	}
    
    	node := nodeInfo.Node()
    	if node == nil {
    		return 0, framework.NewStatus(framework.Error, "node not found")
    	}
    
    	// Default score if annotation is not present
    	score := int64(framework.MinNodeScore) // Default to lowest score
    
    	ioUtilString, ok := node.Annotations[AnnotationKeyIOUtilization]
    	if !ok {
    		klog.V(3).InfoS("Node has no I/O utilization annotation, assigning minimum score", "node", klog.KObj(node))
    		return score, framework.NewStatus(framework.Success)
    	}
    
    	ioUtil, err := strconv.ParseInt(ioUtilString, 10, 64)
    	if err != nil {
    		klog.Warningf("Failed to parse I/O utilization annotation for node %s: %v", nodeName, err)
    		return score, framework.NewStatus(framework.Success) // Return min score on parse error
    	}
    
    	// The lower the I/O utilization, the higher the score.
    	// We map utilization [0, 100] to score [100, 0].
    	if ioUtil < 0 {
    		ioUtil = 0
    	} else if ioUtil > 100 {
    		ioUtil = 100
    	}
    	score = 100 - ioUtil
    
    	klog.V(5).InfoS("Scoring node based on I/O utilization", "node", klog.KObj(node), "ioUtilization", ioUtil, "score", score)
    	return score, framework.NewStatus(framework.Success)
    }
    
    // NormalizeScore normalizes the score for the node.
    func (s *StorageAware) NormalizeScore(ctx context.Context, state *framework.CycleState, p *v1.Pod, scores framework.NodeScoreList) *framework.Status {
    	// The Score function already returns a value in the [0, 100] range, so we don't need complex normalization.
    	// We can just check for the highest score and scale if needed, but for now, this is sufficient.
    	// A more complex plugin might find the min/max scores in the list and scale them to the 0-100 range.
    	return framework.NewStatus(framework.Success)
    }

    Key Implementation Details:

* ScoreExtensions: We implement this to provide a NormalizeScore function. Our current scoring logic already produces values in the 0-100 range, but this hook is essential for plugins whose raw scores fall in arbitrary ranges (e.g., raw latency in milliseconds); the scheduler uses it to bring every plugin's output onto a common scale before weights are applied (a min-max sketch follows this list).

    * Error Handling: We gracefully handle missing annotations or parse errors by assigning a minimum score. This makes the scheduler resilient to transient issues with the metrics-gathering DaemonSet.

    * Inverse Proportionality: The core logic score = 100 - ioUtil implements our desired behavior: lower I/O means a higher score, making the node more attractive.
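
For reference, a normalization pass for a plugin whose raw scores are unbounded might look like the following min-max scaling sketch. It is not part of the StorageAware plugin above (our scores are already in range) and assumes the same imports plus the standard library's math package:

go
// normalizeMinMax scales arbitrary raw scores into [MinNodeScore, MaxNodeScore],
// inverting them so that lower raw values (e.g. lower latency) score higher.
func normalizeMinMax(scores framework.NodeScoreList) *framework.Status {
	var min, max int64 = math.MaxInt64, math.MinInt64
	for _, s := range scores {
		if s.Score < min {
			min = s.Score
		}
		if s.Score > max {
			max = s.Score
		}
	}
	if min == max {
		// All candidates are equal; give every node the maximum score.
		for i := range scores {
			scores[i].Score = framework.MaxNodeScore
		}
		return framework.NewStatus(framework.Success)
	}
	for i := range scores {
		scores[i].Score = framework.MaxNodeScore -
			(scores[i].Score-min)*(framework.MaxNodeScore-framework.MinNodeScore)/(max-min)
	}
	return framework.NewStatus(framework.Success)
}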

    Step 4: Plugin Registration

    Finally, we need a main.go file to register our plugin with the Kubernetes component framework.

    go
    // main.go
    package main
    
    import (
    	"os"
    
    	"k8s.io/component-base/cli"
    	"k8s.io/kubernetes/cmd/kube-scheduler/app"
    
    	_ "k8s.io/component-base/logs/json/register"
    )
    
    func main() {
    	// Register the plugin
    	command := app.NewSchedulerCommand(
    		app.WithPlugin(Name, New),
    	)
    
    	code := cli.Run(command)
    	os.Exit(code)
    }

    This main function uses the app.NewSchedulerCommand constructor, which is the standard entry point for scheduler-like components. The app.WithPlugin option registers our StorageAware plugin's name and its factory function (New). When the scheduler starts, it will know how to instantiate our plugin.

    Deployment and Configuration in a Production Cluster

    Writing the Go code is only half the battle. Integrating it into a live cluster requires careful packaging and configuration.

    Step 1: Packaging the Plugin in a Container

    We'll use a multi-stage Dockerfile to create a lean, production-ready image. This avoids shipping the entire Go toolchain in our final image.

    dockerfile
    # Dockerfile
    
    # Stage 1: Build the plugin binary
    FROM golang:1.21-alpine AS builder
    
    WORKDIR /build
    
    # Copy go.mod and go.sum and download dependencies
    COPY go.mod go.sum ./
    RUN go mod download
    
    # Copy the source code
    COPY . .
    
    # Build the binary
    # CGO_ENABLED=0 is important for a static binary
    # -ldflags "-w -s" strips debug symbols to reduce size
    RUN CGO_ENABLED=0 GOOS=linux go build -a -ldflags "-w -s" -o storage-aware-scheduler .
    
# Stage 2: Create the final image
# The official kube-scheduler image gives us a minimal, version-matched runtime.
# (The default scheduler logic itself is compiled into our binary via
# app.NewSchedulerCommand, not inherited from the base image.)
FROM registry.k8s.io/kube-scheduler:v1.28.0

# Copy our custom plugin binary into the image.
# COPY preserves the executable bit set by `go build`; the distroless base image
# has no shell, so a RUN chmod step would fail here anyway.
COPY --from=builder /build/storage-aware-scheduler /usr/local/bin/storage-aware-scheduler

    Build and push this image to your container registry:

    bash
    docker build -t your-registry/storage-aware-scheduler:v1.0.0 .
    docker push your-registry/storage-aware-scheduler:v1.0.0

    Step 2: Creating a Custom Scheduler Configuration

    We need to tell Kubernetes how to use our plugin. This is done via a KubeSchedulerConfiguration file. This is a critical piece that defines a new scheduler profile.

    yaml
    # scheduler-config.yaml
    apiVersion: kubescheduler.config.k8s.io/v1
    kind: KubeSchedulerConfiguration
# clientConnection.kubeconfig is intentionally omitted: running in-cluster under a
# ServiceAccount, the scheduler falls back to in-cluster credentials.
leaderElection:
  # A single replica does not need leader election. If you scale this Deployment,
  # enable it with a dedicated resourceName/resourceNamespace (plus matching RBAC)
  # so it does not contend for the default kube-scheduler's lease.
  leaderElect: false
profiles:
  # This binary serves only our custom profile. The cluster's existing
  # kube-scheduler keeps owning the default-scheduler profile; serving the same
  # schedulerName from two binaries would make them race to bind the same pods.
  - schedulerName: storage-aware-scheduler
    plugins:
      # Default plugins remain enabled unless explicitly disabled, so we only
      # list our additions for each extension point.
      filter:
        enabled:
          - name: "StorageAwareScheduler" # Add our custom filter
      score:
        enabled:
          - name: "StorageAwareScheduler" # Add our custom scorer
            weight: 1 # relative weight among score plugins
        # To disable a conflicting default scorer, add a disabled list, e.g.:
        # disabled:
        #   - name: NodeResourcesBalancedAllocation
    pluginConfig:
      - name: StorageAwareScheduler
        # We can pass arguments to our plugin's New() function here.
        # For this example, we don't have any.
        args: {}

    Key Configuration Points:

* Profiles: This binary serves only the storage-aware-scheduler profile. The cluster's existing kube-scheduler continues to own the default-scheduler profile, so normal cluster operations are unaffected and the two schedulers never compete to bind the same pods.

* Enabling Plugins: Default plugins remain enabled in every profile unless explicitly disabled, so we only add StorageAwareScheduler to the filter and score extension points. This composition is powerful; we leverage the battle-tested default logic (like checking resource requests) and augment it with our specific logic.

* pluginConfig: This section allows you to pass structured configuration to your plugin's New function, making your plugin reusable and configurable without a recompile (see the decoding sketch below).
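
As a hypothetical sketch of what consuming those args could look like: StorageAwareArgs and NewWithArgs are illustrative names of our own invention, while DecodeInto is provided by k8s.io/kubernetes/pkg/scheduler/framework/runtime.

go
// Assumes an extra import in plugin.go:
//   frameworkruntime "k8s.io/kubernetes/pkg/scheduler/framework/runtime"

// StorageAwareArgs is a hypothetical argument struct populated from pluginConfig.args.
type StorageAwareArgs struct {
	// RequiredStorageType would override the hard-coded "nvme" requirement.
	RequiredStorageType string `json:"requiredStorageType,omitempty"`
}

// NewWithArgs is a variant of New() that decodes the pluginConfig args. A real
// implementation would store the decoded args on the StorageAware struct so that
// Filter and Score can read them.
func NewWithArgs(obj runtime.Object, h framework.Handle) (framework.Plugin, error) {
	args := &StorageAwareArgs{RequiredStorageType: RequiredStorageTypeValue}
	if obj != nil {
		if err := frameworkruntime.DecodeInto(obj, args); err != nil {
			return nil, fmt.Errorf("decoding %s plugin args: %w", Name, err)
		}
	}
	klog.InfoS("StorageAware plugin configured", "requiredStorageType", args.RequiredStorageType)
	return &StorageAware{handle: h}, nil
}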

    Step 3: Deploying the Custom Scheduler

    We will deploy our scheduler as a Deployment in the kube-system namespace.

    First, create a ConfigMap from the configuration file:

    bash
    kubectl create configmap -n kube-system storage-aware-scheduler-config --from-file=scheduler-config.yaml

    Now, the Deployment manifest:

    yaml
    # custom-scheduler-deployment.yaml
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: storage-aware-scheduler
      namespace: kube-system
    ---
apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: storage-aware-scheduler-as-scheduler
    subjects:
      - kind: ServiceAccount
        name: storage-aware-scheduler
        namespace: kube-system
    roleRef:
      kind: ClusterRole
      name: system:kube-scheduler
      apiGroup: rbac.authorization.k8s.io
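---
# StatefulSet pods typically use PersistentVolumeClaims; binding them during
# scheduling needs the permissions in the built-in system:volume-scheduler role.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: storage-aware-scheduler-as-volume-scheduler
subjects:
  - kind: ServiceAccount
    name: storage-aware-scheduler
    namespace: kube-system
roleRef:
  kind: ClusterRole
  name: system:volume-scheduler
  apiGroup: rbac.authorization.k8s.io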
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: storage-aware-scheduler
      namespace: kube-system
      labels:
        app: storage-aware-scheduler
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: storage-aware-scheduler
      template:
        metadata:
          labels:
            app: storage-aware-scheduler
        spec:
          serviceAccountName: storage-aware-scheduler
          containers:
            - name: scheduler-plugin
              image: your-registry/storage-aware-scheduler:v1.0.0
              # The command tells the container to run our binary
              command:
                - /usr/local/bin/storage-aware-scheduler
              args:
                - --config=/etc/kubernetes/scheduler-config.yaml
                - --v=3 # Verbosity level for logging
              resources:
                requests:
                  cpu: "100m"
                  memory: "256Mi"
              volumeMounts:
                - name: scheduler-config-volume
                  mountPath: /etc/kubernetes
          volumes:
            - name: scheduler-config-volume
              configMap:
                name: storage-aware-scheduler-config

This manifest sets up the necessary RBAC permissions (reusing the built-in system:kube-scheduler ClusterRole, plus system:volume-scheduler so the scheduler can bind the PersistentVolumeClaims our StatefulSets depend on) and mounts our ConfigMap so the scheduler binary can read its configuration.

    Step 4: Using the Custom Scheduler

    With our scheduler running, using it is as simple as specifying the schedulerName in a pod's spec. Here's an example StatefulSet:

    yaml
    # my-database-statefulset.yaml
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: my-database
    spec:
      serviceName: "my-database"
      replicas: 3
      selector:
        matchLabels:
          app: my-database
      template:
        metadata:
          labels:
            app: my-database
        spec:
          schedulerName: storage-aware-scheduler # HERE is the magic!
          containers:
            - name: database-container
              image: your-database-image:latest
              # ... ports, volumeMounts, etc.

When you apply this manifest, the default kube-scheduler ignores these pods (their spec.schedulerName does not match its profile) and our storage-aware-scheduler picks them up for placement.

    Advanced Considerations and Performance

    A custom scheduler is a critical component. Its performance and reliability directly impact cluster stability.

    Performance and Scalability

    * Plugin Latency: Every millisecond added to your plugin's execution time is multiplied by the number of nodes. A slow Score function can severely degrade overall scheduling throughput. Never perform blocking I/O (like external API calls) directly in a Filter or Score function.

* Solution: Caching and PreFilter: If you need external data, use the PreFilter stage to fetch it once per pod. Store the result in the CycleState object, a key-value store that persists for the duration of a single scheduling attempt. Your Filter and Score functions can then read this data from memory, which is orders of magnitude faster (see the sketch after this list).

    * Benchmarking: Use tools like kubemark or custom-written performance test clients to simulate creating thousands of pods. Monitor the scheduler_scheduling_algorithm_duration_seconds and scheduler_framework_extension_point_duration_seconds metrics from your custom scheduler to pinpoint bottlenecks.
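
Here is a minimal sketch of that PreFilter + CycleState pattern for our plugin. The external lookup (fetchTierInfo) and the cached payload are hypothetical; the CycleState read/write API comes from the framework package.

go
// A per-cycle cache shared between PreFilter, Filter, and Score.
const tierStateKey framework.StateKey = Name + "/tier-info"

type tierInfo struct {
	preferredTier string
}

// Clone satisfies framework.StateData.
func (t *tierInfo) Clone() framework.StateData { c := *t; return &c }

// fetchTierInfo stands in for an expensive external call (CMDB, storage API, ...).
func fetchTierInfo(p *v1.Pod) string { return "premium" }

// PreFilter runs once per pod and stashes the result in CycleState.
func (s *StorageAware) PreFilter(ctx context.Context, state *framework.CycleState, p *v1.Pod) (*framework.PreFilterResult, *framework.Status) {
	state.Write(tierStateKey, &tierInfo{preferredTier: fetchTierInfo(p)})
	return nil, framework.NewStatus(framework.Success)
}

func (s *StorageAware) PreFilterExtensions() framework.PreFilterExtensions { return nil }

// readTierInfo is how Filter or Score would retrieve the cached data from memory.
func readTierInfo(state *framework.CycleState) (*tierInfo, error) {
	data, err := state.Read(tierStateKey)
	if err != nil {
		return nil, err
	}
	info, ok := data.(*tierInfo)
	if !ok {
		return nil, fmt.Errorf("unexpected CycleState data type %T", data)
	}
	return info, nil
}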

    Example Benchmark Comparison:

| Metric (p99 latency) | Default Scheduler | Storage-Aware Scheduler | Impact |
| --- | --- | --- | --- |
| Scheduling algorithm duration | 8ms | 11ms | +3ms per pod (acceptable) |
| Filter extension point duration | 1ms | 1.5ms | +0.5ms due to label check (negligible) |
| Score extension point duration | 2ms | 4ms | +2ms due to annotation parsing (monitor this) |

    This analysis shows a modest, acceptable performance impact. If the Score duration were significantly higher (e.g., >20ms), it would indicate a need for optimization.

    Edge Case: Stale Metrics Data

    What if the DaemonSet updating the I/O annotation gets stuck or delayed? Our scheduler might make decisions based on stale data. A robust solution involves adding a timestamp to the annotation:

    storage.my-company.com/io-utilization: '{"value": 15, "timestamp": 1678886400}'

    Your Score plugin should then parse this JSON, check the timestamp, and if it's too old (e.g., > 60 seconds), it should disregard the value and return a default low score. This makes the system resilient to failures in the metrics pipeline.
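
A sketch of that freshness check, assuming the standard library's encoding/json and time packages (maxMetricAge is an illustrative threshold):

go
// ioMetric mirrors the JSON-valued annotation shown above.
type ioMetric struct {
	Value     int64 `json:"value"`     // utilization percentage, 0-100
	Timestamp int64 `json:"timestamp"` // Unix seconds when the sample was taken
}

const maxMetricAge = 60 * time.Second

// parseIOMetric returns the utilization and whether the reading is fresh enough
// to trust; on parse errors or stale data the caller should fall back to the
// minimum score, exactly as we already do for a missing annotation.
func parseIOMetric(raw string, now time.Time) (int64, bool) {
	var m ioMetric
	if err := json.Unmarshal([]byte(raw), &m); err != nil {
		return 0, false
	}
	if now.Sub(time.Unix(m.Timestamp, 0)) > maxMetricAge {
		return 0, false
	}
	return m.Value, true
}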

    When Not to Write a Custom Scheduler

    Building and maintaining a custom scheduler is a significant operational commitment. Before embarking on this path, exhaust all native options:

  • Advanced podAntiAffinity: Use requiredDuringSchedulingIgnoredDuringExecution with topologyKey: kubernetes.io/hostname to ensure no two stateful pods land on the same node.
  • topologySpreadConstraints: This is a powerful tool for spreading pods across failure domains (nodes, racks, zones). It can often solve availability requirements without custom code.
  • The Descheduler: If your problem is rebalancing an already running cluster, the descheduler might be a better fit. It evicts pods that violate policies, allowing them to be rescheduled correctly.
A custom scheduler is the right tool when your scheduling logic depends on dynamic, external, or non-native state that cannot be adequately represented by labels and affinity rules. Our I/O utilization use case is a prime example. By taking control of the cluster's brain, you can achieve a level of performance and optimization for stateful workloads that is simply unattainable with the default toolset.
