Advanced StatefulSet Scheduling with Custom Kubernetes Scheduler Plugins
The Default Scheduler's Blind Spot: Nuanced Stateful Workloads
The stock kube-scheduler is a marvel of engineering, adeptly handling the placement of thousands of pods across a cluster using a sophisticated set of predicates (Filters) and priorities (Scores). For stateless applications, its combination of pod/node affinity, resource requests, and topology spread constraints is typically sufficient. However, for high-performance stateful workloads—distributed databases (Cassandra, Vitess), message queues (Kafka), or search indexes (Elasticsearch)—these general-purpose tools represent a blunt instrument where a scalpel is required.
The core limitation is a lack of deep, real-time, infrastructure-specific context. The default scheduler is unaware of:
* I/O Contention: It treats a node with idle NVMe drives the same as one whose spinning disks are saturated by a noisy neighbor, as long as CPU and memory requests are met.
* Heterogeneous Storage Tiers: It cannot natively distinguish between a node backed by premium, low-latency storage and one with archival-grade, high-latency disks, unless explicitly guided by crude node labels.
* Fine-Grained Network Topology: Beyond basic zone/region awareness, it doesn't understand network switch proximity or cross-rack latency, which is critical for leader election and replication traffic in distributed systems.
* Dynamic Node State: A node's health and performance are not static. A temporary I/O bottleneck or a 'flapping' network interface are transient states the default scheduler cannot react to during a scheduling cycle.
Attempting to solve these issues with a proliferation of complex node labels and intricate affinity rules quickly leads to a brittle, unmanageable configuration. This is the inflection point where senior engineers must consider moving beyond configuration and into programmatic control by extending the scheduler itself.
This article details the implementation of a custom scheduler plugin designed to address these stateful workload challenges directly. We will build a Storage-Aware Plugin that filters nodes based on storage hardware and scores them based on real-time disk I/O metrics, ensuring our StatefulSet pods land on the most performant and appropriate nodes.
The Kubernetes Scheduling Framework: Surgical Extension Points
Before the Scheduling Framework (KEP-624), customizing scheduling behavior meant forking the kube-scheduler or bolting on slow, webhook-based scheduler extenders. The modern Scheduling Framework provides a set of extension points, allowing us to inject custom logic into specific phases of the scheduling cycle without modifying the core. For our stateful workload scenario, we will focus on two primary extension points:
* Filter: This extension point runs for every node in the cluster. If any Filter plugin returns 'unschedulable' for a node, that node is immediately removed from consideration for the current pod. This is our gatekeeper. We'll use it to disqualify nodes that don't have the required storage hardware (e.g., must be NVMe).
* Score: After filtering, the remaining nodes are passed to the Score plugins. Each Score plugin returns an integer score (typically 0-100) for each node, indicating its desirability. The scheduler then sums the weighted scores from all Score plugins to produce a final ranking. This is our optimization engine. We'll use it to rank nodes based on lowest disk I/O, giving preference to quiescent nodes.

Other relevant points we'll touch upon:
* PreFilter: Runs before Filter. Useful for pre-computing data that can be shared across Filter and Score calls for a single pod, preventing redundant calculations.
* Reserve: The final step of the scheduling cycle before binding. It allows a plugin to reserve resources on a node once the pod is very likely to be bound there, which is critical for preventing race conditions between the scheduling cycle and the asynchronous binding cycle (for example, while an earlier pod is still being bound).
By implementing these interfaces in Go, we can create a powerful, context-aware scheduling logic that integrates seamlessly with the kube-scheduler binary.
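For orientation, these are the two interfaces our plugin will implement, abridged from k8s.io/kubernetes/pkg/scheduler/framework. Comments are simplified for illustration and the exact definitions can shift between releases, so treat your vendored interface.go as authoritative.

```go
// Abridged from the scheduling framework's interface definitions (v1.28-era).
type Plugin interface {
	Name() string
}

type FilterPlugin interface {
	Plugin
	// Filter is called once per node; an Unschedulable status removes the
	// node from consideration for the pod being scheduled.
	Filter(ctx context.Context, state *CycleState, pod *v1.Pod, nodeInfo *NodeInfo) *Status
}

type ScorePlugin interface {
	Plugin
	// Score ranks a node that survived filtering; higher is more desirable.
	Score(ctx context.Context, state *CycleState, p *v1.Pod, nodeName string) (int64, *Status)
	// ScoreExtensions exposes the optional NormalizeScore hook.
	ScoreExtensions() ScoreExtensions
}
```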
Scenario: Implementing a Production-Grade Storage-Aware Plugin
Our goal is to schedule pods for a custom, I/O-intensive StatefulSet. The scheduling requirements are:
* Hard requirement (Filter): pods must run only on nodes with local NVMe storage, identified by storage=nvme labels.
* Soft preference (Score): among the remaining nodes, favor those with the lowest current disk I/O utilization.

To achieve this, we need a mechanism to expose real-time node metrics to our plugin. A common production pattern is to have a DaemonSet (like node-exporter) that collects metrics and periodically updates the Node object's annotations. While CRDs are another option, annotations are simpler for this use case and avoid adding another controller dependency.
Let's assume a DaemonSet on each node is responsible for maintaining the following annotation:
storage.my-company.com/io-utilization: "15" (A string representing percentage from 0-100)
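The agent side of this pattern is out of scope for this article, but for context, here is a minimal sketch of how a node-local DaemonSet pod could publish that annotation with client-go. The readIOUtilization helper and the NODE_NAME environment variable (injected via the Downward API) are assumptions; the Nodes().Patch call is standard client-go.

```go
// annotation-publisher sketch (not part of the scheduler plugin itself).
package main

import (
	"context"
	"fmt"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	nodeName := os.Getenv("NODE_NAME") // assumed to be injected via the Downward API

	for range time.Tick(30 * time.Second) {
		util := readIOUtilization() // hypothetical metric source, e.g. /proc/diskstats deltas
		patch := fmt.Sprintf(
			`{"metadata":{"annotations":{"storage.my-company.com/io-utilization":"%d"}}}`, util)
		if _, err := client.CoreV1().Nodes().Patch(
			context.TODO(), nodeName, types.StrategicMergePatchType,
			[]byte(patch), metav1.PatchOptions{}); err != nil {
			fmt.Fprintf(os.Stderr, "failed to update node annotation: %v\n", err)
		}
	}
}

// readIOUtilization is a placeholder; a real agent would compute a 0-100
// utilization figure from disk statistics.
func readIOUtilization() int { return 0 }
```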
Step 1: Project Setup and Plugin Scaffolding
We'll structure our Go project. You need a working Go environment and knowledge of Go modules.
# Create project directory
mkdir storage-aware-scheduler
cd storage-aware-scheduler
# Initialize Go module
go mod init github.com/your-org/storage-aware-scheduler
# Get Kubernetes dependencies
# Use a version that matches your target cluster
go get k8s.io/kubernetes@v1.28.0
go get k8s.io/api@v0.28.0
go get k8s.io/apimachinery@v0.28.0
go get k8s.io/client-go@v0.28.0
go get k8s.io/component-base@v0.28.0
# Note: depending on k8s.io/kubernetes requires replace directives in go.mod that
# pin its k8s.io/* staging modules to matching v0.28.x versions; the
# kubernetes-sigs/scheduler-plugins repository shows a working layout.
# Tidy up
go mod tidy
Now, let's create our main plugin file, plugin.go.
// plugin.go
package main
import (
"context"
"fmt"
"strconv"
"k8s.io/apimachinery/pkg/runtime"
v1 "k8s.io/api/core/v1"
"k8s.io/klog/v2"
framework "k8s.io/kubernetes/pkg/scheduler/framework"
)
const (
// Name is the name of the plugin used in the plugin registry and configurations.
Name = "StorageAwareScheduler"
// The annotation key for I/O utilization on a Node object.
AnnotationKeyIOUtilization = "storage.my-company.com/io-utilization"
// The label key for required storage type on a Node object.
LabelKeyStorageType = "storage"
// The required value for the storage type label.
RequiredStorageTypeValue = "nvme"
)
// StorageAware is a scheduler plugin that filters and scores nodes based on storage attributes.
type StorageAware struct {
handle framework.Handle
}
// Ensure StorageAware implements the necessary interfaces.
var _ framework.FilterPlugin = &StorageAware{}
var _ framework.ScorePlugin = &StorageAware{}
// Name returns the name of the plugin.
func (s *StorageAware) Name() string {
return Name
}
// New initializes a new plugin and returns it.
func New(_ runtime.Object, h framework.Handle) (framework.Plugin, error) {
klog.InfoS("Initializing new StorageAware scheduler plugin")
return &StorageAware{
handle: h,
}, nil
}
This code sets up the basic structure. We define our plugin's Name, create the StorageAware struct, and implement the Name() and New() factory functions required by the framework. The var _ ... lines are a compile-time check to ensure our struct correctly implements the interfaces we intend to use.
Step 2: Implementing the `Filter` Plugin Logic
The Filter logic is straightforward. It checks if a node has the storage=nvme label. If not, it's rejected.
// Add this method to the StorageAware struct in plugin.go
// Filter is called by the framework to filter out nodes that do not fit the pod.
func (s *StorageAware) Filter(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
klog.V(4).InfoS("Filtering node for storage requirements", "pod", klog.KObj(pod), "node", klog.KObj(nodeInfo.Node()))
node := nodeInfo.Node()
if node == nil {
return framework.NewStatus(framework.Error, "node not found")
}
storageType, ok := node.Labels[LabelKeyStorageType]
if !ok {
klog.V(2).InfoS("Node rejected: missing storage type label", "node", klog.KObj(node), "requiredLabel", LabelKeyStorageType)
return framework.NewStatus(framework.Unschedulable, fmt.Sprintf("node does not have label %s", LabelKeyStorageType))
}
if storageType != RequiredStorageTypeValue {
klog.V(2).InfoS("Node rejected: incorrect storage type", "node", klog.KObj(node), "labelValue", storageType, "requiredValue", RequiredStorageTypeValue)
return framework.NewStatus(framework.Unschedulable, fmt.Sprintf("node has storage type %s, requires %s", storageType, RequiredStorageTypeValue))
}
klog.V(4).InfoS("Node passed storage filter", "node", klog.KObj(node))
return framework.NewStatus(framework.Success)
}
This Filter method is robust. It handles nil nodes, missing labels, and incorrect label values, providing clear reasons for unschedulability, which are invaluable for debugging (kubectl describe pod ...).
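Before wiring the plugin into a cluster, the Filter logic is easy to exercise with a plain Go unit test. This is a minimal sketch, assuming the test sits next to plugin.go in package main; framework.NewNodeInfo and SetNode are the real framework test helpers.

```go
package main

import (
	"context"
	"testing"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	framework "k8s.io/kubernetes/pkg/scheduler/framework"
)

func TestFilterRejectsNonNVMeNodes(t *testing.T) {
	plugin := &StorageAware{}

	// Build a NodeInfo for a node with the wrong storage label.
	nodeInfo := framework.NewNodeInfo()
	nodeInfo.SetNode(&v1.Node{ObjectMeta: metav1.ObjectMeta{
		Name:   "node-hdd",
		Labels: map[string]string{"storage": "hdd"},
	}})

	status := plugin.Filter(context.Background(), nil, &v1.Pod{}, nodeInfo)
	if status.Code() != framework.Unschedulable {
		t.Fatalf("expected Unschedulable, got %v", status.Code())
	}
}
```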
Step 3: Implementing the `Score` Plugin Logic
This is the core of our custom logic. The Score plugin will read the storage.my-company.com/io-utilization annotation from each node that passed the Filter stage. It will then normalize this value into a score from 0 to 100, where a lower I/O utilization results in a higher score.
// Add these methods to the StorageAware struct in plugin.go
// ScoreExtensions returns a ScoreExtensions interface if the plugin implements one.
func (s *StorageAware) ScoreExtensions() framework.ScoreExtensions {
return s
}
// Score is called by the framework to score a node.
func (s *StorageAware) Score(ctx context.Context, state *framework.CycleState, p *v1.Pod, nodeName string) (int64, *framework.Status) {
nodeInfo, err := s.handle.SnapshotSharedLister().NodeInfos().Get(nodeName)
if err != nil {
return 0, framework.NewStatus(framework.Error, fmt.Sprintf("getting node %q from snapshot: %v", nodeName, err))
}
node := nodeInfo.Node()
if node == nil {
return 0, framework.NewStatus(framework.Error, "node not found")
}
// Default score if annotation is not present
score := int64(framework.MinNodeScore) // Default to lowest score
ioUtilString, ok := node.Annotations[AnnotationKeyIOUtilization]
if !ok {
klog.V(3).InfoS("Node has no I/O utilization annotation, assigning minimum score", "node", klog.KObj(node))
return score, framework.NewStatus(framework.Success)
}
ioUtil, err := strconv.ParseInt(ioUtilString, 10, 64)
if err != nil {
klog.Warningf("Failed to parse I/O utilization annotation for node %s: %v", nodeName, err)
return score, framework.NewStatus(framework.Success) // Return min score on parse error
}
// The lower the I/O utilization, the higher the score.
// We map utilization [0, 100] to score [100, 0].
if ioUtil < 0 {
ioUtil = 0
} else if ioUtil > 100 {
ioUtil = 100
}
score = 100 - ioUtil
klog.V(5).InfoS("Scoring node based on I/O utilization", "node", klog.KObj(node), "ioUtilization", ioUtil, "score", score)
return score, framework.NewStatus(framework.Success)
}
// NormalizeScore normalizes the score for the node.
func (s *StorageAware) NormalizeScore(ctx context.Context, state *framework.CycleState, p *v1.Pod, scores framework.NodeScoreList) *framework.Status {
// The Score function already returns a value in the [0, 100] range, so we don't need complex normalization.
// We can just check for the highest score and scale if needed, but for now, this is sufficient.
// A more complex plugin might find the min/max scores in the list and scale them to the 0-100 range.
return framework.NewStatus(framework.Success)
}
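The comment in NormalizeScore above mentions min/max scaling. For reference, here is a minimal sketch of what that could look like for a hypothetical plugin whose raw scores fall in an arbitrary range (say, measured latency in milliseconds); it is not needed for StorageAware, whose scores already fall in [0, 100].

```go
// normalize_sketch.go (illustrative only; not used by the StorageAware plugin)
package main

import (
	"math"

	framework "k8s.io/kubernetes/pkg/scheduler/framework"
)

// normalizeInverted rescales arbitrary raw scores into [0, MaxNodeScore],
// mapping the lowest raw value (e.g. lowest latency) to the highest score.
func normalizeInverted(scores framework.NodeScoreList) {
	var lo, hi int64 = math.MaxInt64, math.MinInt64
	for _, s := range scores {
		if s.Score < lo {
			lo = s.Score
		}
		if s.Score > hi {
			hi = s.Score
		}
	}
	for i := range scores {
		if hi == lo {
			// All nodes look identical; give them all the top score.
			scores[i].Score = framework.MaxNodeScore
			continue
		}
		scores[i].Score = framework.MaxNodeScore * (hi - scores[i].Score) / (hi - lo)
	}
}
```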
Key Implementation Details:
* ScoreExtensions: We implement this to provide a NormalizeScore function. While our current scoring logic naturally produces a 0-100 range, this hook is essential for plugins whose raw scores might be in arbitrary ranges (e.g., raw latency in milliseconds). The scheduler uses this to ensure all plugins' scores are weighted fairly.
* Error Handling: We gracefully handle missing annotations or parse errors by assigning a minimum score. This makes the scheduler resilient to transient issues with the metrics-gathering DaemonSet.
* Inverse Proportionality: The core logic score = 100 - ioUtil implements our desired behavior: lower I/O means a higher score, making the node more attractive.
Step 4: Plugin Registration
Finally, we need a main.go file to register our plugin with the Kubernetes component framework.
// main.go
package main
import (
"os"
"k8s.io/component-base/cli"
"k8s.io/kubernetes/cmd/kube-scheduler/app"
_ "k8s.io/component-base/logs/json/register"
)
func main() {
// Register the plugin
command := app.NewSchedulerCommand(
app.WithPlugin(Name, New),
)
code := cli.Run(command)
os.Exit(code)
}
This main function uses the app.NewSchedulerCommand constructor, which is the standard entry point for scheduler-like components. The app.WithPlugin option registers our StorageAware plugin's name and its factory function (New). When the scheduler starts, it will know how to instantiate our plugin.
Deployment and Configuration in a Production Cluster
Writing the Go code is only half the battle. Integrating it into a live cluster requires careful packaging and configuration.
Step 1: Packaging the Plugin in a Container
We'll use a multi-stage Dockerfile to create a lean, production-ready image. This avoids shipping the entire Go toolchain in our final image.
# Dockerfile
# Stage 1: Build the plugin binary
FROM golang:1.21-alpine AS builder
WORKDIR /build
# Copy go.mod and go.sum and download dependencies
COPY go.mod go.sum ./
RUN go mod download
# Copy the source code
COPY . .
# Build the binary
# CGO_ENABLED=0 is important for a static binary
# -ldflags "-w -s" strips debug symbols to reduce size
RUN CGO_ENABLED=0 GOOS=linux go build -a -ldflags "-w -s" -o storage-aware-scheduler .
# Stage 2: Create the final image
# The official kube-scheduler image gives us a minimal base matching the
# control-plane version we built against. Note that the default scheduling
# logic is compiled into our binary via app.NewSchedulerCommand; the base image
# simply provides a lean, compatible runtime environment.
FROM registry.k8s.io/kube-scheduler:v1.28.0
# Copy our custom scheduler binary into the image. The base image is
# distroless-style (no shell), so we cannot RUN chmod here; the binary produced
# by `go build` is already executable and COPY preserves its mode.
COPY --from=builder /build/storage-aware-scheduler /usr/local/bin/storage-aware-scheduler
Build and push this image to your container registry:
docker build -t your-registry/storage-aware-scheduler:v1.0.0 .
docker push your-registry/storage-aware-scheduler:v1.0.0
Step 2: Creating a Custom Scheduler Configuration
We need to tell Kubernetes how to use our plugin. This is done via a KubeSchedulerConfiguration file. This is a critical piece that defines a new scheduler profile.
# scheduler-config.yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
leaderElection:
  # With a single replica we can disable leader election. For an HA deployment,
  # enable it with a dedicated resourceName (not "kube-scheduler") and grant the
  # ServiceAccount get/update access to that Lease.
  leaderElect: false
# clientConnection.kubeconfig is intentionally omitted: running in-cluster, the
# scheduler authenticates with its pod's ServiceAccount token.
profiles:
  - schedulerName: storage-aware-scheduler
    plugins:
      # All default plugins stay enabled for every extension point unless they
      # are explicitly disabled, so we only list our additions here.
      filter:
        enabled:
          - name: "StorageAwareScheduler" # Add our custom filter
      score:
        enabled:
          - name: "StorageAwareScheduler" # Add our custom scorer
        # disabled:
        #   # Disable a default scorer if it conflicts with ours, e.g.:
        #   - name: NodeResourcesBalancedAllocation
    pluginConfig:
      - name: StorageAwareScheduler
        # We can pass arguments to our plugin's New() function here.
        # For this example, we don't have any.
        args: {}
Key Configuration Points:
* Profiles: We define a single profile, storage-aware-scheduler, for this binary. The cluster's existing default-scheduler keeps running in the control plane untouched, so normal cluster operations are unaffected; only pods that explicitly request our schedulerName are handed to the new scheduler.
* Enabling Plugins: Inside our custom profile, all default plugins remain enabled (they are only removed if listed under disabled), and we explicitly add StorageAwareScheduler to the filter and score extension points. This composition is powerful; we leverage the battle-tested default logic (like checking for resource requests) and augment it with our specific logic.
* pluginConfig: This section allows you to pass structured configuration to your plugin's New function, making your plugin reusable and configurable without a recompile.
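To make the pluginConfig point concrete, here is a sketch of how the factory function could decode those args. The StorageAwareArgs struct and its field are hypothetical; frameworkruntime.DecodeInto is the helper commonly used by out-of-tree plugins for exactly this purpose.

```go
// A sketch of an argument-aware factory, assuming it replaces New in plugin.go.
package main

import (
	"k8s.io/apimachinery/pkg/runtime"
	framework "k8s.io/kubernetes/pkg/scheduler/framework"
	frameworkruntime "k8s.io/kubernetes/pkg/scheduler/framework/runtime"
)

// StorageAwareArgs is a hypothetical argument struct, e.g. to make the
// required label value configurable instead of hard-coding "nvme".
type StorageAwareArgs struct {
	RequiredStorageType string `json:"requiredStorageType,omitempty"`
}

// NewWithArgs decodes pluginConfig.args into StorageAwareArgs before
// constructing the plugin.
func NewWithArgs(obj runtime.Object, h framework.Handle) (framework.Plugin, error) {
	args := StorageAwareArgs{RequiredStorageType: RequiredStorageTypeValue}
	if obj != nil {
		if err := frameworkruntime.DecodeInto(obj, &args); err != nil {
			return nil, err
		}
	}
	// A fuller implementation would store args on the StorageAware struct and
	// consult it in Filter instead of the package-level constant.
	return &StorageAware{handle: h}, nil
}
```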
Step 3: Deploying the Custom Scheduler
We will deploy our scheduler as a Deployment in the kube-system namespace.
First, create a ConfigMap from the configuration file:
kubectl create configmap -n kube-system storage-aware-scheduler-config --from-file=scheduler-config.yaml
Now, the Deployment manifest:
# custom-scheduler-deployment.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
name: storage-aware-scheduler
namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: storage-aware-scheduler-as-scheduler
subjects:
- kind: ServiceAccount
name: storage-aware-scheduler
namespace: kube-system
roleRef:
kind: ClusterRole
name: system:kube-scheduler
apiGroup: rbac.authorization.k8s.io
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: storage-aware-scheduler-as-volume-scheduler
subjects:
  - kind: ServiceAccount
    name: storage-aware-scheduler
    namespace: kube-system
roleRef:
  kind: ClusterRole
  # Needed so the VolumeBinding plugin can operate on PVCs/PVs for StatefulSets.
  name: system:volume-scheduler
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: storage-aware-scheduler
namespace: kube-system
labels:
app: storage-aware-scheduler
spec:
replicas: 1
selector:
matchLabels:
app: storage-aware-scheduler
template:
metadata:
labels:
app: storage-aware-scheduler
spec:
serviceAccountName: storage-aware-scheduler
containers:
- name: scheduler-plugin
image: your-registry/storage-aware-scheduler:v1.0.0
# The command tells the container to run our binary
command:
- /usr/local/bin/storage-aware-scheduler
args:
- --config=/etc/kubernetes/scheduler-config.yaml
- --v=3 # Verbosity level for logging
resources:
requests:
cpu: "100m"
memory: "256Mi"
volumeMounts:
- name: scheduler-config-volume
mountPath: /etc/kubernetes
volumes:
- name: scheduler-config-volume
configMap:
name: storage-aware-scheduler-config
This manifest sets up the necessary RBAC permissions, reusing the built-in system:kube-scheduler ClusterRole and adding system:volume-scheduler so the VolumeBinding plugin can handle the StatefulSet's PersistentVolumeClaims, and mounts our ConfigMap so the scheduler binary can read its configuration.
Step 4: Using the Custom Scheduler
With our scheduler running, using it is as simple as specifying the schedulerName in a pod's spec. Here's an example StatefulSet:
# my-database-statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: my-database
spec:
serviceName: "my-database"
replicas: 3
selector:
matchLabels:
app: my-database
template:
metadata:
labels:
app: my-database
spec:
schedulerName: storage-aware-scheduler # HERE is the magic!
containers:
- name: database-container
image: your-database-image:latest
# ... ports, volumeMounts, etc.
When you apply this manifest, the Kubernetes control plane will bypass the default-scheduler and hand these pods over to our storage-aware-scheduler for placement.
Advanced Considerations and Performance
A custom scheduler is a critical component. Its performance and reliability directly impact cluster stability.
Performance and Scalability
* Plugin Latency: Every millisecond added to your plugin's execution time is multiplied by the number of nodes. A slow Score function can severely degrade overall scheduling throughput. Never perform blocking I/O (like external API calls) directly in a Filter or Score function.
* Solution: Caching and PreFilter: If you need external data, use the PreFilter stage to fetch it once per pod. Store the result in the CycleState object, a key-value store that persists for the duration of a single scheduling attempt. Your Filter and Score functions can then read this data from memory, which is orders of magnitude faster (a sketch follows this list).
* Benchmarking: Use tools like kubemark or custom-written performance test clients to simulate creating thousands of pods. Monitor the scheduler_scheduling_algorithm_duration_seconds and scheduler_framework_extension_point_duration_seconds metrics from your custom scheduler to pinpoint bottlenecks.
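Here is the CycleState pattern from the list above as a minimal sketch, meant to be added to plugin.go (it reuses its existing imports). The per-pod "tier" annotation is a hypothetical stand-in for any lookup you would not want to repeat for every node; CycleState.Write/Read and the PreFilter interface methods are part of the framework.

```go
const precomputedStateKey = framework.StateKey(Name + "/precomputed")

// precomputedState carries per-pod data shared across extension points.
type precomputedState struct {
	requiredTier string
}

// Clone implements framework.StateData.
func (p *precomputedState) Clone() framework.StateData {
	c := *p
	return &c
}

// PreFilter runs once per pod, before Filter/Score run per node, and stashes
// the expensive-to-compute data in CycleState.
func (s *StorageAware) PreFilter(ctx context.Context, state *framework.CycleState, pod *v1.Pod) (*framework.PreFilterResult, *framework.Status) {
	// Hypothetical per-pod lookup we do not want to repeat for every node.
	tier := pod.Annotations["storage.my-company.com/tier"]
	state.Write(precomputedStateKey, &precomputedState{requiredTier: tier})
	return nil, framework.NewStatus(framework.Success)
}

// PreFilterExtensions is required by the PreFilterPlugin interface.
func (s *StorageAware) PreFilterExtensions() framework.PreFilterExtensions {
	return nil
}

// Reading it back later, e.g. inside Filter or Score:
//
//	if data, err := state.Read(precomputedStateKey); err == nil {
//		pre := data.(*precomputedState)
//		_ = pre.requiredTier
//	}
```

Remember that a PreFilter method only runs if the plugin is also enabled under the preFilter extension point in the scheduler profile.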
Example Benchmark Comparison:
| Metric (p99 Latency) | Default Scheduler | Storage-Aware Scheduler | Impact |
|---|---|---|---|
| Scheduling Algorithm Duration | 8ms | 11ms | +3ms per pod (acceptable) |
| filter Extension Point Duration | 1ms | 1.5ms | +0.5ms due to label check (negligible) |
| score Extension Point Duration | 2ms | 4ms | +2ms due to annotation parsing (monitor this) |
This analysis shows a modest, acceptable performance impact. If the Score duration were significantly higher (e.g., >20ms), it would indicate a need for optimization.
Edge Case: Stale Metrics Data
What if the DaemonSet updating the I/O annotation gets stuck or delayed? Our scheduler might make decisions based on stale data. A robust solution involves adding a timestamp to the annotation:
storage.my-company.com/io-utilization: '{"value": 15, "timestamp": 1678886400}'
Your Score plugin should then parse this JSON, check the timestamp, and if it's too old (e.g., > 60 seconds), it should disregard the value and return a default low score. This makes the system resilient to failures in the metrics pipeline.
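A minimal sketch of that check, assuming the JSON annotation format shown above; the helper name and the 60-second budget are illustrative choices.

```go
// staleness_sketch.go (illustrative helper for the JSON-valued annotation)
package main

import (
	"encoding/json"
	"time"
)

type ioUtilizationSample struct {
	Value     int64 `json:"value"`
	Timestamp int64 `json:"timestamp"` // Unix seconds, set by the metrics DaemonSet
}

// parseFreshUtilization returns the utilization and true only when the sample
// parses cleanly and is newer than maxAge; otherwise the caller should fall
// back to the minimum score.
func parseFreshUtilization(raw string, maxAge time.Duration) (int64, bool) {
	var sample ioUtilizationSample
	if err := json.Unmarshal([]byte(raw), &sample); err != nil {
		return 0, false
	}
	if time.Since(time.Unix(sample.Timestamp, 0)) > maxAge {
		return 0, false
	}
	return sample.Value, true
}
```

Score would then call parseFreshUtilization(node.Annotations[AnnotationKeyIOUtilization], 60*time.Second) and keep the minimum score whenever the second return value is false.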
When Not to Write a Custom Scheduler
Building and maintaining a custom scheduler is a significant operational commitment. Before embarking on this path, exhaust all native options:
* podAntiAffinity: Use requiredDuringSchedulingIgnoredDuringExecution with topologyKey: kubernetes.io/hostname to ensure no two stateful pods land on the same node.
* topologySpreadConstraints: This is a powerful tool for spreading pods across failure domains (nodes, racks, zones). It can often solve availability requirements without custom code.

A custom scheduler is the right tool when your scheduling logic depends on dynamic, external, or non-native state that cannot be adequately represented by labels and affinity rules. Our I/O utilization use case is a prime example. By taking control of the cluster's brain, you can achieve a level of performance and optimization for stateful workloads that is simply unattainable with the default toolset.