Advanced K8s Pod Placement: Custom Schedulers & Topology Spreading

17 min read
Goh Ling Yong
Technology enthusiast and software architect specializing in AI-driven development tools and modern software engineering practices. Passionate about the intersection of artificial intelligence and human creativity in building tomorrow's digital solutions.

The Ceiling of Declarative Scheduling: Why Defaults Fall Short

As a senior engineer managing large-scale Kubernetes clusters, you've mastered the basics. nodeSelector, nodeAffinity, and taints/tolerations are second nature for directing workloads. Yet, you've inevitably hit a wall where these tools are too coarse. They answer "can a pod run here?" but struggle with the more nuanced, critical question: "should a pod run here for optimal performance, availability, and cost?"

Consider these real-world production scenarios:

  • True High Availability: How do you guarantee that a rolling update for your critical API doesn't momentarily place all new pods in a single availability zone or on a single physical host, creating a temporary single point of failure?
  • Data Gravity: For a data processing pipeline, scheduling a compute pod on the same node as its required multi-gigabyte dataset can reduce network latency from milliseconds to microseconds, drastically improving job completion times. The default scheduler has no concept of this external state.
  • Resource Fragmentation: How do you prevent a few resource-hungry pods from consuming all the large nodes, leaving only fragmented resources that can't fit new, large requests, even though total cluster capacity is sufficient?
  • License Constraints: Your organization has a limited number of expensive, per-core software licenses. Pods using this software must be scheduled on a specific, pre-approved subset of nodes, and you need to load-balance them carefully to stay within compliance.

This is where the default kube-scheduler's generic filtering (Predicates) and scoring (Priorities) algorithms show their limitations. To solve these problems, we need to move to a more sophisticated level of control. This post will explore two powerful, production-ready techniques:

    * podTopologySpreadConstraints: A declarative, in-tree mechanism for fine-grained control over how pods are distributed across failure domains.

    * Custom Scheduler Plugins: The ultimate escape hatch, allowing you to inject bespoke business logic directly into the Kubernetes scheduling lifecycle using the Go-based scheduler framework.

    We will not be covering the basics. We assume you know how to write a Deployment manifest and understand what the scheduler does at a high level. Let's dive deep.


    Mastering `podTopologySpreadConstraints` for Resilient Architectures

    podTopologySpreadConstraints evolved from podAntiAffinity but provides far more granular control. Instead of a simple "don't run here if X is present," it allows you to define rules about the degree of imbalance (maxSkew) of pods across topological domains (topologyKey).

    Core Concepts Breakdown

  • topologyKey: The node label that defines the boundary of the failure domain. Common values are topology.kubernetes.io/zone, kubernetes.io/hostname, or custom labels like rack-id.
  • maxSkew: The maximum permitted difference between the number of matching pods in any two domains (the skew is the pod count in the most populated domain minus the count in the least populated eligible domain). A maxSkew of 1 for a service with 3 replicas spread across 3 zones means each zone must have exactly one pod.
  • whenUnsatisfiable: Defines the scheduler's behavior if the constraint cannot be met. DoNotSchedule (default) will fail to schedule the pod. ScheduleAnyway will schedule the pod but prioritize nodes that minimize the skew.
  • labelSelector: How the scheduler identifies which pods belong to the group being spread. This must match the labels of the pods you are controlling.

    Scenario 1: Bulletproof High Availability for a Stateless Service

    Let's enforce strict HA for a critical api-gateway service. We want to ensure pods are spread as evenly as possible across both availability zones and individual nodes within those zones.

    yaml
    # api-gateway-deployment.yaml
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: api-gateway
      labels:
        app: api-gateway
    spec:
      replicas: 6
      selector:
        matchLabels:
          app: api-gateway
      template:
        metadata:
          labels:
            app: api-gateway
        spec:
          containers:
          - name: gateway-container
            image: nginx:1.21
          topologySpreadConstraints:
          - maxSkew: 1
            topologyKey: "topology.kubernetes.io/zone"
            whenUnsatisfiable: DoNotSchedule
            labelSelector:
              matchLabels:
                app: api-gateway
            # Optional: Add a note for human operators
            # This constraint ensures we never have a skew of more than 1 pod across AZs.
          - maxSkew: 1
            topologyKey: "kubernetes.io/hostname"
            whenUnsatisfiable: ScheduleAnyway
            labelSelector:
              matchLabels:
                app: api-gateway
            # This is a soft preference to spread across nodes within a zone.
            # We prioritize AZ spread over node spread.

    Analysis of this Production Pattern:

    * Two-Tier Spreading: We apply two constraints. The first is a hard requirement (DoNotSchedule) to spread across zones. If the cluster has 3 zones, this ensures that after scheduling 6 pods, each zone has exactly 2. If an entire zone goes down, replacement pods that would violate the skew simply cannot be scheduled, rather than silently piling into the surviving zones.

    * Soft Preference: The second constraint is a soft preference (ScheduleAnyway) to spread across hostnames. This tells the scheduler: "Within a given zone, try to place pods on different nodes, but if you can't, don't fail the scheduling attempt." This is crucial for handling uneven node counts or resource availability within zones.

    * Rolling Update Safety: During a rolling update, maxSkew: 1 is critical. As old pods are terminated and new ones are created, the scheduler always tries to maintain the balance. For example, when the 7th pod (a new one) is being scheduled while an old one is still terminating, it will be placed in the zone with the fewest matching pods, maintaining the spread throughout the update process.

    Scenario 2: Co-locating Data Processors with a Distributed Cache

    Imagine a scenario with a distributed cache like Redis (managed by a StatefulSet) and a fleet of stateless data processing workers. For performance, we want to spread the workers evenly across the same failure domains, racks in this example, as the cache pods.

    yaml
    # data-processor-deployment.yaml
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: data-processor
    spec:
      replicas: 10
      selector:
        matchLabels:
          app: data-processor
      template:
        metadata:
          labels:
            app: data-processor
        spec:
          containers:
          - name: processor
            image: my-data-processor:v1.2
          topologySpreadConstraints:
          - maxSkew: 1
            # This key assumes your nodes are labeled with their rack ID
            topologyKey: "topology.acme.com/rack-id"
            whenUnsatisfiable: DoNotSchedule
            # IMPORTANT: This selector targets the CACHE pods, not our own pods!
            labelSelector:
              matchLabels:
                app: redis-cache
            # Count only nodes where this pod could actually land when
            # calculating skew (explained in the analysis below)
            nodeAffinityPolicy: Honor
            nodeTaintsPolicy: Honor

    Advanced Pattern Analysis:

    * Cross-Workload Spreading: The key insight here is that labelSelector does not have to target the pods within the same Deployment. We are spreading our data-processor pods relative to the topology of the redis-cache pods. This ensures our compute workloads follow our data workloads.

    * Interaction with Affinity/Taints: The nodeAffinityPolicy and nodeTaintsPolicy fields (available in recent Kubernetes versions) control whether nodes that don't match the pod's nodeAffinity, or that carry taints the pod doesn't tolerate, are included in the skew calculation. Setting them to Honor gives a more accurate spread calculation by only considering the pool of nodes where the pod could actually be scheduled.

    Edge Cases and Performance Considerations

    * Performance Overhead: Topology spreading is not free. For every pod scheduling attempt, the scheduler must list all nodes and pods in the cluster that match the constraints to calculate skews. In clusters with thousands of nodes and tens of thousands of pods, this can add measurable latency to pod startup times. Use this feature judiciously for critical workloads.

    * Skew Calculation with minDomains: For autoscaling scenarios, the newer minDomains field helps you enforce a minimum level of spread even when replica counts are low. For example, minDomains: 3 forces the first three pods to land in three different domains. It is only honored when whenUnsatisfiable is DoNotSchedule; see the sketch after this list.

    * Cluster Imbalance: If your underlying nodes are not evenly distributed (e.g., 10 nodes in us-east-1a, 2 nodes in us-east-1b), a strict maxSkew: 1 can lead to unschedulable pods. You must align your constraints with the physical reality of your cluster topology.
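
    As a minimal sketch, here is how minDomains slots into a constraint, reusing the api-gateway labels from Scenario 1:

    yaml
    # Sketch: force early replicas to fan out across at least three zones,
    # even while the Deployment is still scaling up.
    topologySpreadConstraints:
    - maxSkew: 1
      minDomains: 3  # only honored with whenUnsatisfiable: DoNotSchedule
      topologyKey: "topology.kubernetes.io/zone"
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: api-gateway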


    Building a Custom Scheduler Plugin for Data Locality

    podTopologySpreadConstraints are powerful, but they operate only on Kubernetes-native primitives like labels. What if your scheduling logic depends on an external system, like a data catalog, a GPU monitoring service, or a license manager?

    This is the domain of custom scheduler plugins. We will now build a scheduler plugin that scores nodes based on whether they have a specific dataset cached locally. This information resides in an external metadata service.

    The Kubernetes Scheduler Framework

    The modern way to build custom schedulers is not to fork and modify the kube-scheduler source, but to write plugins that hook into its well-defined extension points. The most important extension points:

    * Filter: Like predicates. Can the pod run on this node? Returns Success, Unschedulable, or Error.

    * Score: Like priorities. If the pod can run on multiple nodes, this plugin assigns a numerical score (0-100) to each valid node. The node with the highest combined score from all Score plugins wins.

    * Reserve/Unreserve: A transactional step. Once a node is chosen, the Reserve plugins claim the resources. If a later step fails, Unreserve rolls it back.

    * Permit: A final gate before binding. Once a node is chosen, a Permit plugin can approve the pod, reject it, or hold it in a waiting state (e.g., to wait for a resource to become available) before allowing it to proceed to bind.

    Our data locality use case is a perfect fit for a Score plugin.
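
    For orientation, the two interfaces most plugins implement look roughly like this (abridged from the framework package pinned below; check your version for the exact signatures):

    go
    // Abridged from k8s.io/kubernetes/pkg/scheduler/framework (v1.28).
    type FilterPlugin interface {
    	Plugin
    	Filter(ctx context.Context, state *CycleState, pod *v1.Pod, nodeInfo *NodeInfo) *Status
    }

    type ScorePlugin interface {
    	Plugin
    	// Score ranks a node that already passed Filter: 0 to MaxNodeScore (100).
    	Score(ctx context.Context, state *CycleState, p *v1.Pod, nodeName string) (int64, *Status)
    	// ScoreExtensions may return nil when no normalization is needed.
    	ScoreExtensions() ScoreExtensions
    }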

    Step 1: Project Setup and Plugin Skeleton

    We'll write our plugin in Go. First, set up your Go module.

    bash
    # Create the project directory
    mkdir data-locality-scheduler
    cd data-locality-scheduler
    
    # Initialize Go module
    go mod init github.com/your-org/data-locality-scheduler
    
    # Get Kubernetes dependencies (use a recent, compatible version)
    go get k8s.io/[email protected]
    go get k8s.io/[email protected]

    Now, let's create the plugin skeleton. Our plugin will be called DataLocalityScorer.

    go
    // plugins/data_locality_scorer.go
    package plugins
    
    import (
    	"context"
    	"fmt" // used by the full Score implementation below
    	"log" // used by the full Score implementation below
    
    	v1 "k8s.io/api/core/v1"
    	"k8s.io/apimachinery/pkg/runtime"
    	"k8s.io/kubernetes/pkg/scheduler/framework"
    )
    
    const (
    	// Name is the name of the plugin used in the scheduler configuration.
    	Name = "DataLocalityScorer"
    	// PodDatasetAnnotation is the annotation key on a pod to specify required dataset.
    	PodDatasetAnnotation = "data.locality.acme.com/dataset-id"
    	// NodeDatasetLabelPrefix is the prefix for node labels indicating cached datasets.
    	NodeDatasetLabelPrefix = "data.cache.acme.com/"
    )
    
    // DataLocalityScorer is a plugin that scores nodes based on data locality.
    type DataLocalityScorer struct {
    	handle framework.Handle
    }
    
    // Compile-time check to ensure our plugin implements the ScorePlugin interface.
    var _ framework.ScorePlugin = &DataLocalityScorer{}
    
    // Name returns the name of the plugin.
    func (dls *DataLocalityScorer) Name() string {
    	return Name
    }
    
    // New creates a new instance of the DataLocalityScorer plugin.
    func New(_ runtime.Object, h framework.Handle) (framework.Plugin, error) {
    	return &DataLocalityScorer{
    		handle: h,
    	}, nil
    }
    
    // Score is the main logic of our plugin.
    func (dls *DataLocalityScorer) Score(ctx context.Context, state *framework.CycleState, p *v1.Pod, nodeName string) (int64, *framework.Status) {
        // Implementation coming next
        return 0, framework.NewStatus(framework.Success)
    }
    
    // ScoreExtensions returns a ScoreExtensions interface if the plugin implements one.
    func (dls *DataLocalityScorer) ScoreExtensions() framework.ScoreExtensions {
    	return nil // We don't need normalization for this simple plugin.
    }

    Step 2: Implementing the Core Scoring Logic

    This is where we add our business logic. The Score function is called for every node that passed the Filter phase.

    Our logic:

  • Check if the pod has the data.locality.acme.com/dataset-id annotation. If not, it's not our concern; return a neutral score of 0.
  • If it does, get the node object for the given nodeName.
  • Check if the node has the label data.cache.acme.com/<dataset-id>: "true".
  • If the label exists, return the maximum possible score (framework.MaxNodeScore).
  • If not, return the minimum score (0).

    Note: A real-world implementation would call an external API. For simplicity and testability, we'll simulate this by checking node labels, a common pattern for conveying node-specific state to the scheduler.

    go
    // plugins/data_locality_scorer.go (updated Score function)
    
    func (dls *DataLocalityScorer) Score(ctx context.Context, state *framework.CycleState, p *v1.Pod, nodeName string) (int64, *framework.Status) {
    	log.Printf("Scoring pod %s on node %s", p.Name, nodeName)
    
    	// 1. Check for our specific annotation.
    	datasetID, ok := p.Annotations[PodDatasetAnnotation]
    	if !ok {
    		// This pod doesn't care about data locality, so we don't influence its score.
    		log.Printf("Pod %s has no dataset annotation, returning neutral score.", p.Name)
    		return 0, nil
    	}
    
    	// 2. Get the node object to inspect its labels.
    	nodeInfo, err := dls.handle.SnapshotSharedLister().NodeInfos().Get(nodeName)
    	if err != nil {
    		return 0, framework.AsStatus(fmt.Errorf("getting node %q from snapshot: %w", nodeName, err))
    	}
    	node := nodeInfo.Node()
    
    	// 3. Construct the expected label key.
    	expectedLabelKey := NodeDatasetLabelPrefix + datasetID
    
    	// 4. Check if the label exists and is true.
    	if labelValue, exists := node.Labels[expectedLabelKey]; exists && labelValue == "true" {
    		log.Printf("Node %s has required dataset %s. Assigning max score.", nodeName, datasetID)
    		return framework.MaxNodeScore, nil
    	}
    
    	// 5. The node does not have the data.
    	log.Printf("Node %s does not have dataset %s. Assigning min score.", nodeName, datasetID)
    	return 0, nil
    }

    Step 3: Registering the Plugin

    Now we need a main.go to build a runnable scheduler binary that includes our plugin.

    go
    // main.go
    package main
    
    import (
    	"os"
    
    	"k8s.io/component-base/cli"
    	"k8s.io/kubernetes/cmd/kube-scheduler/app"
    
    	"github.com/your-org/data-locality-scheduler/plugins"
    )
    
    func main() {
    	// Register our custom plugin with the scheduler framework's registry.
    	command := app.NewSchedulerCommand(
    		app.WithPlugin(plugins.Name, plugins.New),
    	)
    
    	code := cli.Run(command)
    	os.Exit(code)
    }
    

    Step 4: Packaging and Deploying the Custom Scheduler

    We'll package our scheduler using a multi-stage Dockerfile for a minimal, secure final image.

    dockerfile
    # Dockerfile
    
    # --- Build Stage ---
    FROM golang:1.21-alpine AS builder
    
    WORKDIR /app
    
    # Copy Go module files
    COPY go.mod go.sum ./
    # Download dependencies
    RUN go mod download
    
    # Copy the source code
    COPY . .
    
    # Build the binary
    RUN CGO_ENABLED=0 GOOS=linux go build -a -o custom-scheduler .
    
    # --- Final Stage ---
    FROM alpine:3.18
    
    WORKDIR /bin
    
    # Copy the binary from the builder stage
    COPY --from=builder /app/custom-scheduler .
    
    # Run the binary
    ENTRYPOINT ["/bin/custom-scheduler"]

    Build and push the image:

    bash
    docker build -t your-registry/data-locality-scheduler:v1.0 .
    docker push your-registry/data-locality-scheduler:v1.0

    Step 5: Configuring and Deploying to Kubernetes

    This is the most critical and often misunderstood part. We don't replace the default scheduler; we run ours alongside it and tell pods to use it via schedulerName.

    1. KubeSchedulerConfiguration:

    This YAML defines a "profile" for our scheduler, telling it which plugins to enable for each extension point.

    yaml
    # scheduler-config.yaml
    apiVersion: kubescheduler.config.k8s.io/v1
    kind: KubeSchedulerConfiguration
    leaderElection:
      leaderElect: true
    # When running in-cluster under a ServiceAccount (as we do below), omit
    # clientConnection.kubeconfig: the scheduler falls back to in-cluster credentials.
    
    profiles:
      - schedulerName: data-locality-scheduler
        plugins:
          # The default plugins (filtering, scoring, binding, etc.) remain enabled
          # unless explicitly disabled; we are only ADDING our custom scoring logic.
          score:
            enabled:
              - name: DataLocalityScorer
                weight: 5
            disabled:
              # Default score plugins we don't want can be switched off here.
              - name: ImageLocality
        # Plugin-specific configuration can go here if needed.
        pluginConfig:
          - name: DataLocalityScorer
            args: {}

    The weight multiplies our plugin's normalized score when it is summed with the other score plugins' results, letting you tune how strongly data locality pulls against signals like balanced resource allocation.

    2. Deployment Manifests:

    We need a ServiceAccount, ClusterRole, ClusterRoleBinding, a ConfigMap for the config, and a Deployment to run the scheduler.

    yaml
    # scheduler-deployment.yaml
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: data-locality-scheduler
      namespace: kube-system
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: data-locality-scheduler-as-scheduler
    subjects:
      - kind: ServiceAccount
        name: data-locality-scheduler
        namespace: kube-system
    roleRef:
      kind: ClusterRole
      name: system:kube-scheduler # Use the built-in role for schedulers
      apiGroup: rbac.authorization.k8s.io
    ---
    # Custom schedulers also need the volume-scheduler role to bind pods
    # that use PersistentVolumeClaims.
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: data-locality-scheduler-as-volume-scheduler
    subjects:
      - kind: ServiceAccount
        name: data-locality-scheduler
        namespace: kube-system
    roleRef:
      kind: ClusterRole
      name: system:volume-scheduler
      apiGroup: rbac.authorization.k8s.io
    ---
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: data-locality-scheduler-config
      namespace: kube-system
    data:
      scheduler-config.yaml: |
        # Paste the content of scheduler-config.yaml here
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: data-locality-scheduler
      namespace: kube-system
      labels:
        app: data-locality-scheduler
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: data-locality-scheduler
      template:
        metadata:
          labels:
            app: data-locality-scheduler
        spec:
          serviceAccountName: data-locality-scheduler
          containers:
            - name: scheduler
              image: your-registry/data-locality-scheduler:v1.0
              args:
                - "--config=/etc/kubernetes/scheduler-config.yaml"
                - "-v=4" # Verbose logging
              resources:
                requests:
                  cpu: "100m"
                  memory: "256Mi"
              volumeMounts:
                - name: scheduler-config-volume
                  mountPath: /etc/kubernetes
          volumes:
            - name: scheduler-config-volume
              configMap:
                name: data-locality-scheduler-config

    Step 6: Using the Custom Scheduler

    Now, the final step. Let's simulate the environment and schedule a pod.

    First, label some nodes to indicate they have our cached dataset.

    bash
    # Node-1 has dataset-123
    kubectl label node k8s-node-1 data.cache.acme.com/dataset-123=true
    
    # Node-2 has dataset-456
    kubectl label node k8s-node-2 data.cache.acme.com/dataset-456=true
    
    # Node-3 has no specific datasets

    Now, create a pod that requests dataset-123 and tells Kubernetes to use our new scheduler.

    yaml
    # test-pod.yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: data-job-123
      annotations:
        # This pod requires dataset-123
        data.locality.acme.com/dataset-id: "dataset-123"
    spec:
      # CRITICAL: Specify the scheduler to use.
      schedulerName: data-locality-scheduler
      containers:
        - name: processor
          image: busybox
          command: ["sh", "-c", "echo 'Processing data...' && sleep 3600"]

    Apply the manifest with kubectl apply -f test-pod.yaml, then check where the pod landed:

    bash
    kubectl get pod data-job-123 -o wide

    You will see that the pod is reliably scheduled on k8s-node-1. If you create another pod requesting dataset-456, it will land on k8s-node-2. A pod with no annotation will be scheduled according to the default scoring plugins we left enabled.


    Final Production Considerations

    * Observability is Non-Negotiable: Your custom scheduler is a critical cluster component. Expose Prometheus metrics from your plugin. Track scheduling attempts, successes, failures, and crucially, the latency of your Score function: a slow external API call can bring your entire cluster's pod scheduling to a crawl. A minimal instrumentation sketch follows.
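
    This sketch assumes the DataLocalityScorer type from Step 1 and the standard prometheus/client_golang library; the metric name is ours, wiring the registry into an HTTP endpoint is omitted, and the scoring logic from Step 2 is assumed to be renamed to scoreNode.

    go
    // Add to plugins/data_locality_scorer.go; extra imports:
    //   "time"
    //   "github.com/prometheus/client_golang/prometheus"
    
    // scoreLatency records how long each Score call takes, labeled by outcome.
    var scoreLatency = prometheus.NewHistogramVec(
    	prometheus.HistogramOpts{
    		Name:    "data_locality_score_duration_seconds",
    		Help:    "Latency of DataLocalityScorer.Score calls.",
    		Buckets: prometheus.DefBuckets,
    	},
    	[]string{"result"},
    )
    
    func init() {
    	prometheus.MustRegister(scoreLatency)
    }
    
    // Score wraps the real scoring logic with a latency histogram.
    func (dls *DataLocalityScorer) Score(ctx context.Context, state *framework.CycleState, p *v1.Pod, nodeName string) (int64, *framework.Status) {
    	start := time.Now()
    	score, status := dls.scoreNode(ctx, state, p, nodeName)
    
    	result := "success"
    	if !status.IsSuccess() { // a nil status counts as success
    		result = "error"
    	}
    	scoreLatency.WithLabelValues(result).Observe(time.Since(start).Seconds())
    	return score, status
    }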

    * High Availability: The provided Deployment runs a single replica. For production, you must run multiple replicas. The leaderElection block in our KubeSchedulerConfiguration ensures that only one instance is active at a time, preventing race conditions. Give the lock a dedicated name so it does not collide with the default scheduler's, as sketched below.
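
    A sketch of the leaderElection block with a dedicated lock (the field names come from KubeSchedulerConfiguration; the lease name is ours):

    yaml
    leaderElection:
      leaderElect: true
      # Use a dedicated Lease object so we never compete with the default
      # scheduler's "kube-scheduler" lock.
      resourceName: data-locality-scheduler
      resourceNamespace: kube-system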

    * Error Handling and State: What happens if your metadata service is down? Your Score function should handle that gracefully. A common pattern is to return a neutral score (0) and log an error. This allows scheduling to proceed based on other criteria, degrading gracefully rather than failing completely; a sketch follows.
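
    A sketch of that pattern, extending the plugin file (add net/http and time to its imports). The metadata-service URL and endpoint shape are hypothetical:

    go
    // scoreFromMetadataService asks an external metadata service whether
    // nodeName has datasetID cached, degrading to a neutral score on failure.
    func (dls *DataLocalityScorer) scoreFromMetadataService(ctx context.Context, datasetID, nodeName string) (int64, *framework.Status) {
    	// Bound the call so a slow service cannot stall the scheduling queue.
    	ctx, cancel := context.WithTimeout(ctx, 200*time.Millisecond)
    	defer cancel()
    
    	url := fmt.Sprintf("http://metadata-service.kube-system.svc/datasets/%s/nodes/%s", datasetID, nodeName)
    	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
    	if err != nil {
    		return 0, framework.AsStatus(err)
    	}
    
    	resp, err := http.DefaultClient.Do(req)
    	if err != nil {
    		// Service down or timed out: log and return a neutral score so the
    		// remaining Score plugins still produce a sensible placement.
    		log.Printf("metadata service unavailable, returning neutral score: %v", err)
    		return 0, nil
    	}
    	defer resp.Body.Close()
    
    	if resp.StatusCode == http.StatusOK {
    		return framework.MaxNodeScore, nil
    	}
    	return 0, nil
    }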

    * CycleState for Caching: If multiple plugins in your profile need the same external data, the first plugin can fetch it and store it in the framework.CycleState object. This object is a key-value store that persists for the duration of a single pod's scheduling attempt, allowing subsequent plugins to read the data without making redundant API calls. A minimal sketch follows.
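
    A minimal sketch of the pattern, again extending the plugin file; the state key, datasetState type, and fetchNodesWithData helper are all ours. Remember that a plugin implementing PreScore must also be enabled at the preScore extension point in the profile.

    go
    // datasetStateKey namespaces our entry inside the per-pod CycleState.
    const datasetStateKey framework.StateKey = "acme.com/dataset-locality"
    
    // datasetState caches which nodes hold the pod's dataset for this cycle.
    type datasetState struct {
    	nodesWithData map[string]bool
    }
    
    // Clone satisfies framework.StateData; the map is read-only after PreScore,
    // so sharing the pointer is safe.
    func (d *datasetState) Clone() framework.StateData { return d }
    
    // PreScore runs once per pod, before Score is called for each candidate node.
    func (dls *DataLocalityScorer) PreScore(ctx context.Context, state *framework.CycleState, p *v1.Pod, nodes []*v1.Node) *framework.Status {
    	// fetchNodesWithData is a hypothetical helper that queries the external
    	// metadata service exactly once for this pod's dataset.
    	state.Write(datasetStateKey, &datasetState{nodesWithData: fetchNodesWithData(ctx, p)})
    	return nil
    }
    
    // Inside Score (or any later phase), read the cached result back:
    func readDatasetState(state *framework.CycleState) (*datasetState, error) {
    	v, err := state.Read(datasetStateKey)
    	if err != nil {
    		return nil, err
    	}
    	s, ok := v.(*datasetState)
    	if !ok {
    		return nil, fmt.Errorf("unexpected CycleState entry type %T", v)
    	}
    	return s, nil
    }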
