Advanced K8s Pod Placement: Custom Schedulers & Topology Spreading
The Ceiling of Declarative Scheduling: Why Defaults Fall Short
As a senior engineer managing large-scale Kubernetes clusters, you've mastered the basics. nodeSelector, nodeAffinity, and taints/tolerations are second nature for directing workloads. Yet, you've inevitably hit a wall where these tools are too coarse. They answer "can a pod run here?" but struggle with the more nuanced, critical question: "should a pod run here for optimal performance, availability, and cost?"
Consider these real-world production scenarios:
*   A critical API gateway whose replicas must survive the loss of an availability zone without piling onto a single surviving node.
*   Stateless data-processing workers that should follow the topology of a distributed cache for latency reasons.
*   Batch jobs that should land on nodes where a multi-gigabyte dataset is already cached locally, as tracked by an external metadata service.
This is where the default kube-scheduler's generic filtering (Predicates) and scoring (Priorities) algorithms show their limitations. To solve these problems, we need to move to a more sophisticated level of control. This post will explore two powerful, production-ready techniques:
*   topologySpreadConstraints: A declarative, in-tree mechanism for fine-grained control over how pods are distributed across failure domains.
*   Custom Scheduler Plugins: The ultimate escape hatch, allowing you to inject bespoke business logic directly into the Kubernetes scheduling lifecycle using the Go-based scheduler framework.
We will not be covering the basics. We assume you know how to write a Deployment manifest and understand what the scheduler does at a high level. Let's dive deep.
Mastering `topologySpreadConstraints` for Resilient Architectures
topologySpreadConstraints evolved from podAntiAffinity but provides far more granular control. Instead of a simple "don't run here if X is present," it allows you to define rules about the degree of imbalance (maxSkew) of pods across topological domains (topologyKey).
Core Concepts Breakdown
*   topologyKey: The node label that defines the boundary of the failure domain. Common values are topology.kubernetes.io/zone, kubernetes.io/hostname, or custom labels like rack-id.
*   maxSkew: The maximum permitted difference between the number of matching pods in any two domains. A maxSkew of 1 for a service with 3 replicas spread across 3 zones means each zone must have exactly one pod.
*   whenUnsatisfiable: Defines the scheduler's behavior if the constraint cannot be met. DoNotSchedule (the default) leaves the pod Pending. ScheduleAnyway will schedule the pod but prioritize nodes that minimize the skew.
*   labelSelector: How the scheduler identifies which pods belong to the group being spread. This must match the labels of the pods you are controlling.
Scenario 1: Bulletproof High Availability for a Stateless Service
Let's enforce strict HA for a critical api-gateway service. We want to ensure pods are spread as evenly as possible across both availability zones and individual nodes within those zones.
# api-gateway-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-gateway
  labels:
    app: api-gateway
spec:
  replicas: 6
  selector:
    matchLabels:
      app: api-gateway
  template:
    metadata:
      labels:
        app: api-gateway
    spec:
      containers:
      - name: gateway-container
        image: nginx:1.21
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: "topology.kubernetes.io/zone"
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: api-gateway
        # Optional: Add a note for human operators
        # This constraint ensures we never have a skew of more than 1 pod across AZs.
      - maxSkew: 1
        topologyKey: "kubernetes.io/hostname"
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: api-gateway
        # This is a soft preference to spread across nodes within a zone.
        # We prioritize AZ spread over node spread.
Analysis of this Production Pattern:
*   Two-Tier Spreading: We apply two constraints. The first is a *hard requirement* (DoNotSchedule) to spread across zones. If the cluster has 3 zones, this ensures that after scheduling 6 pods, each zone has exactly 2. If one zone goes down but its nodes remain registered, replacement pods may sit Pending rather than pile into the surviving zones; that is the deliberate trade-off of a hard constraint.
*   Soft Preference: The second constraint is a soft preference (ScheduleAnyway) to spread across hostnames. This tells the scheduler: "Within a given zone, *try* to place pods on different nodes, but if you can't, don't fail the scheduling attempt." This is crucial for handling scenarios with uneven node counts or resource availability within zones.
*   Rolling Update Safety: During a rolling update, maxSkew: 1 keeps the distribution balanced as old pods terminate and new ones are created: each new pod lands in the zone with the fewest matching pods. One caveat: by default the labelSelector also matches the outgoing ReplicaSet's pods, which can distort the skew calculation mid-rollout. On Kubernetes 1.27+, the matchLabelKeys field scopes the calculation to pods from the same revision, as shown in the sketch below.
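A minimal sketch of that pattern, assuming a cluster new enough to support matchLabelKeys (beta since 1.27):
# excerpt from the pod template spec above
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: "topology.kubernetes.io/zone"
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: api-gateway
        # pod-template-hash is stamped on pods by the Deployment controller;
        # listing it here means only pods from the current revision count
        # toward the skew calculation.
        matchLabelKeys:
          - pod-template-hash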
Scenario 2: Co-locating Data Processors with a Distributed Cache
Imagine a scenario with a distributed cache like Redis (managed by a StatefulSet) and a fleet of stateless data processing workers. For performance, we want to place the workers in the same failure domains (racks, in this example) as the cache pods, spread evenly relative to the cache's own distribution.
# data-processor-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: data-processor
spec:
  replicas: 10
  selector:
    matchLabels:
      app: data-processor
  template:
    metadata:
      labels:
        app: data-processor
    spec:
      containers:
      - name: processor
        image: my-data-processor:v1.2
      topologySpreadConstraints:
      - maxSkew: 1
        # This key assumes your nodes are labeled with their rack ID
        topologyKey: "topology.acme.com/rack-id"
        whenUnsatisfiable: DoNotSchedule
        # IMPORTANT: This selector targets the CACHE pods, not our own pods!
        labelSelector:
          matchLabels:
            app: redis-cache
        # Honor: when computing skew, only count nodes that satisfy this pod's
        # nodeAffinity/nodeSelector (and, below, its taint tolerations).
        nodeAffinityPolicy: Honor
        nodeTaintsPolicy: Honor
Advanced Pattern Analysis:
*   Cross-Workload Spreading: The key insight here is that labelSelector does not have to target the pods within the same Deployment. We are spreading our data-processor pods relative to the topology of the redis-cache pods. This ensures our compute workloads follow our data workloads.
*   Interaction with Affinity/Taints: The nodeAffinityPolicy and nodeTaintsPolicy fields (available in recent K8s versions) control whether nodes that don't match the pod's nodeAffinity or are tainted are included in the skew calculation. Setting them to Honor provides a more accurate spread calculation by only considering the pool of nodes where the pod *could* actually be scheduled.
Edge Cases and Performance Considerations
*   Performance Overhead: Topology spreading is not free. For every pod scheduling attempt, the scheduler must count matching pods across every candidate domain in its internal snapshot. In clusters with thousands of nodes and tens of thousands of pods, this can add measurable latency to pod startup times. Use this feature judiciously for critical workloads.
*   Skew Calculation with minDomains: For autoscaling scenarios, the minDomains field (honored only with whenUnsatisfiable: DoNotSchedule) helps you ensure a minimum level of spread even when replica counts are low. For example, minDomains: 3 forces the first three pods to land in three different domains; see the sketch after this list.
*   Cluster Imbalance: If your underlying nodes are not evenly distributed (e.g., 10 nodes in us-east-1a, 2 nodes in us-east-1b), a strict maxSkew: 1 can lead to unschedulable pods. You must align your constraints with the physical reality of your cluster topology.
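A brief sketch of minDomains in context (the surrounding Deployment fields are elided; the field is available in recent Kubernetes releases):
      topologySpreadConstraints:
      - maxSkew: 1
        # Until pods occupy at least 3 distinct zones, the global minimum is
        # treated as 0, which forces early replicas into separate zones.
        minDomains: 3
        topologyKey: "topology.kubernetes.io/zone"
        whenUnsatisfiable: DoNotSchedule  # minDomains requires DoNotSchedule
        labelSelector:
          matchLabels:
            app: api-gateway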
Building a Custom Scheduler Plugin for Data Locality
topologySpreadConstraints are powerful, but they operate only on Kubernetes-native primitives like labels. What if your scheduling logic depends on an external system, like a data catalog, a GPU monitoring service, or a license manager?
This is the domain of custom scheduler plugins. We will now build a scheduler plugin that scores nodes based on whether they have a specific dataset cached locally. This information resides in an external metadata service.
The Kubernetes Scheduler Framework
The modern way to build custom schedulers is not to fork and modify the kube-scheduler source, but to write plugins that hook into its well-defined extension points. A few key points:
*   Filter: Like predicates. Can the pod run on this node? Returns Success, Unschedulable, or Error.
*   Score: Like priorities. If the pod *can* run on multiple nodes, this plugin assigns a numerical score (0-100) to each valid node. The node with the highest combined score from all Score plugins wins.
*   Reserve/Unreserve: A transactional step. Once a node is chosen, the Reserve plugins claim the resources. If a later step fails, Unreserve rolls it back.
*   Permit: A final gate that runs before the pod is bound to the chosen node. This plugin can approve, deny, or hold the pod for a period (e.g., to wait for a resource to become available) before allowing binding to proceed.
Our data locality use case is a perfect fit for a Score plugin.
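Before we build it, here is a minimal sketch of the Filter shape for contrast — a hypothetical plugin that rejects nodes missing a marker label (acme.com/schedulable-tier is an invented example; imports match the skeleton in Step 1):
// plugins/example_filter.go -- illustrative only, not used in this post's build
package plugins

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// RequireTierLabel filters out nodes lacking a hypothetical marker label.
type RequireTierLabel struct{}

var _ framework.FilterPlugin = &RequireTierLabel{}

func (f *RequireTierLabel) Name() string { return "RequireTierLabel" }

// Filter runs once per candidate node; returning Unschedulable removes the
// node from consideration for this pod.
func (f *RequireTierLabel) Filter(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	node := nodeInfo.Node()
	if node == nil {
		return framework.NewStatus(framework.Error, "node not found in snapshot")
	}
	if _, ok := node.Labels["acme.com/schedulable-tier"]; !ok {
		return framework.NewStatus(framework.Unschedulable, "missing acme.com/schedulable-tier label")
	}
	return nil // a nil status means Success
}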
Step 1: Project Setup and Plugin Skeleton
We'll write our plugin in Go. First, set up your Go module.
# Create the project directory
mkdir data-locality-scheduler
cd data-locality-scheduler
# Initialize Go module
go mod init github.com/your-org/data-locality-scheduler
# Get Kubernetes dependencies (use a recent version compatible with your
# cluster; the v1.28.0 / v0.28.0 versions below are illustrative)
go get k8s.io/kubernetes@v1.28.0
go get k8s.io/component-base@v0.28.0
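One wrinkle worth knowing: k8s.io/kubernetes pins its staging repos (k8s.io/api, k8s.io/apimachinery, k8s.io/client-go, and friends) to v0.0.0, so your go.mod also needs replace directives mapping each staging module to its matching published version. A sketch, assuming the illustrative versions above:
// go.mod (excerpt) -- versions are illustrative, match them to your cluster
require k8s.io/kubernetes v1.28.0

replace (
	k8s.io/api => k8s.io/api v0.28.0
	k8s.io/apimachinery => k8s.io/apimachinery v0.28.0
	k8s.io/client-go => k8s.io/client-go v0.28.0
	k8s.io/component-base => k8s.io/component-base v0.28.0
	// ...one line per staging repo your build pulls in
)
Now, let's create the plugin skeleton. Our plugin will be called DataLocalityScorer.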
// plugins/data_locality_scorer.go
package plugins
import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)
const (
	// Name is the name of the plugin used in the scheduler configuration.
	Name = "DataLocalityScorer"
	// PodDatasetAnnotation is the annotation key on a pod to specify required dataset.
	PodDatasetAnnotation = "data.locality.acme.com/dataset-id"
	// NodeDatasetLabelPrefix is the prefix for node labels indicating cached datasets.
	NodeDatasetLabelPrefix = "data.cache.acme.com/"
)
// DataLocalityScorer is a plugin that scores nodes based on data locality.
type DataLocalityScorer struct {
	handle framework.Handle
}
// Compile-time check to ensure our plugin implements the ScorePlugin interface.
var _ framework.ScorePlugin = &DataLocalityScorer{}
// Name returns the name of the plugin.
func (dls *DataLocalityScorer) Name() string {
	return Name
}
// New creates a new instance of the DataLocalityScorer plugin.
func New(_ runtime.Object, h framework.Handle) (framework.Plugin, error) {
	return &DataLocalityScorer{
		handle: h,
	}, nil
}
// Score is the main logic of our plugin.
func (dls *DataLocalityScorer) Score(ctx context.Context, state *framework.CycleState, p *v1.Pod, nodeName string) (int64, *framework.Status) {
    // Implementation coming next
    return 0, framework.NewStatus(framework.Success)
}
// ScoreExtensions returns a ScoreExtensions interface if the plugin implements one.
func (dls *DataLocalityScorer) ScoreExtensions() framework.ScoreExtensions {
	return nil // We don't need normalization for this simple plugin.
}
Step 2: Implementing the Core Scoring Logic
This is where we add our business logic. The Score function is called for every node that passed the Filter phase.
Our logic:
1. Check whether the pod carries the data.locality.acme.com/dataset-id annotation. If not, it's not our concern; return a neutral score of 0.
2. Fetch the node object for nodeName from the scheduler's snapshot.
3. Check whether the node has the label data.cache.acme.com/<dataset-id>: "true".
4. If it does, return the maximum score (framework.MaxNodeScore).
5. If not, return the minimum score (0).
Note: A real-world implementation would call an external API. For simplicity and testability, we'll simulate this by checking node labels, a common pattern for conveying node-specific state to the scheduler.
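If you did call an external API here, put it behind a short timeout with a neutral fallback so a slow dependency cannot stall the scheduling cycle. A hypothetical sketch (metadataClient and LookupNodesWithDataset are illustrative stand-ins, not a real library; add "time" to the imports):
// Hypothetical external lookup -- metadataClient is a stand-in field you would
// add to DataLocalityScorer; it is NOT part of the plugin built below.
func (dls *DataLocalityScorer) hasDatasetViaAPI(ctx context.Context, nodeName, datasetID string) (bool, error) {
	// Bound the call: a slow metadata service must not stall scheduling.
	ctx, cancel := context.WithTimeout(ctx, 200*time.Millisecond)
	defer cancel()
	nodes, err := dls.metadataClient.LookupNodesWithDataset(ctx, datasetID)
	if err != nil {
		// Callers should log this and fall back to a neutral score of 0,
		// letting other plugins drive the decision (see Final Considerations).
		return false, err
	}
	for _, n := range nodes {
		if n == nodeName {
			return true, nil
		}
	}
	return false, nil
}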
// plugins/data_locality_scorer.go (updated Score function; add "fmt" and "log" to the import block)
func (dls *DataLocalityScorer) Score(ctx context.Context, state *framework.CycleState, p *v1.Pod, nodeName string) (int64, *framework.Status) {
	log.Printf("Scoring pod %s on node %s", p.Name, nodeName)
	// 1. Check for our specific annotation.
	datasetID, ok := p.Annotations[PodDatasetAnnotation]
	if !ok {
		// This pod doesn't care about data locality, so we don't influence its score.
		log.Printf("Pod %s has no dataset annotation, returning neutral score.", p.Name)
		return 0, nil
	}
	// 2. Get the node object to inspect its labels.
	nodeInfo, err := dls.handle.SnapshotSharedLister().NodeInfos().Get(nodeName)
	if err != nil {
		return 0, framework.AsStatus(fmt.Errorf("getting node %q from snapshot: %w", nodeName, err))
	}
	node := nodeInfo.Node()
	// 3. Construct the expected label key.
	expectedLabelKey := NodeDatasetLabelPrefix + datasetID
	// 4. Check if the label exists and is true.
	if labelValue, exists := node.Labels[expectedLabelKey]; exists && labelValue == "true" {
		log.Printf("Node %s has required dataset %s. Assigning max score.", nodeName, datasetID)
		return framework.MaxNodeScore, nil
	}
	// 5. The node does not have the data.
	log.Printf("Node %s does not have dataset %s. Assigning min score.", nodeName, datasetID)
	return 0, nil
}
Step 3: Registering the Plugin
Now we need a main.go to build a runnable scheduler binary that includes our plugin.
// main.go
package main
import (
	"os"
	"k8s.io/component-base/cli"
	"k8s.io/kubernetes/cmd/kube-scheduler/app"
	"github.com/your-org/data-locality-scheduler/plugins"
)
func main() {
	// Register our custom plugin with the scheduler framework's registry.
	command := app.NewSchedulerCommand(
		app.WithPlugin(plugins.Name, plugins.New),
	)
	code := cli.Run(command)
	os.Exit(code)
}
Step 4: Packaging and Deploying the Custom Scheduler
We'll package our scheduler using a multi-stage Dockerfile for a minimal, secure final image.
# Dockerfile
# --- Build Stage ---
FROM golang:1.21-alpine AS builder
WORKDIR /app
# Copy Go module files
COPY go.mod go.sum ./
# Download dependencies
RUN go mod download
# Copy the source code
COPY . .
# Build the binary
RUN CGO_ENABLED=0 GOOS=linux go build -a -o custom-scheduler .
# --- Final Stage ---
FROM alpine:3.18
WORKDIR /bin
# Copy the binary from the builder stage
COPY --from=builder /app/custom-scheduler .
# Run the binary
ENTRYPOINT ["/bin/custom-scheduler"]
Build and push the image:
docker build -t your-registry/data-locality-scheduler:v1.0 .
docker push your-registry/data-locality-scheduler:v1.0
Step 5: Configuring and Deploying to Kubernetes
This is the most critical and often misunderstood part. We don't replace the default scheduler; we run ours alongside it and tell pods to use it via schedulerName.
1. KubeSchedulerConfiguration:
This YAML defines a "profile" for our scheduler, telling it which plugins to enable for each extension point.
# scheduler-config.yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
leaderElection:
  leaderElect: true
# clientConnection.kubeconfig is omitted on purpose: running in-cluster under a
# ServiceAccount, the scheduler uses in-cluster credentials. A host path like
# /etc/kubernetes/scheduler.conf only exists for control-plane-hosted schedulers.
profiles:
  - schedulerName: data-locality-scheduler
    plugins:
      # We still need the default plugins for filtering, pre-scoring, etc.
      # We are only ADDING our custom scoring logic.
      score:
        enabled:
          - name: DataLocalityScorer
          # Also enable default scoring plugins for a balanced decision.
          - name: NodeResourcesBalancedAllocation
          - name: ImageLocality
          - name: TaintToleration
          - name: NodeAffinity
        disabled:
          # We can disable default plugins we don't want. Note: in the v1 API,
          # least-allocated scoring lives inside NodeResourcesFit (the separate
          # NodeResourcesLeastAllocated plugin was removed), so we disable that
          # plugin's score extension instead.
          - name: NodeResourcesFit
    # Plugin-specific configuration can go here if needed.
    pluginConfig:
      - name: DataLocalityScorer
        args: {}
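If your plugin does take configuration, the convention is to decode the args object inside New. A sketch, assuming a hypothetical DataLocalityScorerArgs struct that you define and register yourself (frameworkruntime is "k8s.io/kubernetes/pkg/scheduler/framework/runtime"):
// Hypothetical args decoding in New() -- DataLocalityScorerArgs is a stand-in.
func New(obj runtime.Object, h framework.Handle) (framework.Plugin, error) {
	args := &DataLocalityScorerArgs{}
	if obj != nil {
		// DecodeInto unmarshals the profile's pluginConfig args into our struct.
		if err := frameworkruntime.DecodeInto(obj, args); err != nil {
			return nil, fmt.Errorf("decoding %s args: %w", Name, err)
		}
	}
	return &DataLocalityScorer{handle: h}, nil
}
2. Deployment Manifests: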
We need a ServiceAccount, ClusterRole, ClusterRoleBinding, a ConfigMap for the config, and a Deployment to run the scheduler. (If your workloads use PersistentVolumeClaims, you will likely also need a binding to the built-in system:volume-scheduler role.)
# scheduler-deployment.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: data-locality-scheduler
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: data-locality-scheduler-as-scheduler
subjects:
  - kind: ServiceAccount
    name: data-locality-scheduler
    namespace: kube-system
roleRef:
  kind: ClusterRole
  name: system:kube-scheduler # Use the built-in role for schedulers
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: data-locality-scheduler-config
  namespace: kube-system
data:
  scheduler-config.yaml: |
    # Paste the content of scheduler-config.yaml here
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: data-locality-scheduler
  namespace: kube-system
  labels:
    app: data-locality-scheduler
spec:
  replicas: 1
  selector:
    matchLabels:
      app: data-locality-scheduler
  template:
    metadata:
      labels:
        app: data-locality-scheduler
    spec:
      serviceAccountName: data-locality-scheduler
      containers:
        - name: scheduler
          image: your-registry/data-locality-scheduler:v1.0
          args:
            - "--config=/etc/kubernetes/scheduler-config.yaml"
            - "-v=4" # Verbose logging
          resources:
            requests:
              cpu: "100m"
              memory: "256Mi"
          volumeMounts:
            - name: scheduler-config-volume
              mountPath: /etc/kubernetes
      volumes:
        - name: scheduler-config-volume
          configMap:
            name: data-locality-scheduler-config
Step 6: Using the Custom Scheduler
Now, the final step. Let's simulate the environment and schedule a pod.
First, label some nodes to indicate they have our cached dataset.
# Node-1 has dataset-123
kubectl label node k8s-node-1 data.cache.acme.com/dataset-123=true
# Node-2 has dataset-456
kubectl label node k8s-node-2 data.cache.acme.com/dataset-456=true
# Node-3 has no specific datasets
Now, create a pod that requests dataset-123 and tells Kubernetes to use our new scheduler.
# test-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: data-job-123
  annotations:
    # This pod requires dataset-123
    data.locality.acme.com/dataset-id: "dataset-123"
spec:
  # CRITICAL: Specify the scheduler to use.
  schedulerName: data-locality-scheduler
  containers:
    - name: processor
      image: busybox
      command: ["sh", "-c", "echo 'Processing data...' && sleep 3600"]Apply this manifest: kubectl apply -f test-pod.yaml. Then, check where it landed:
kubectl get pod data-job-123 -o wide
You should see the pod land on k8s-node-1: our plugin scores that node framework.MaxNodeScore and every other node 0, which dominates the combined score in most clusters. (If other score plugins outweigh it, raise the plugin's weight in the scheduler profile; see the sketch below.) A second pod requesting dataset-456 will land on k8s-node-2, and a pod with no annotation is scheduled purely by the default scoring plugins we left enabled.
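For reference, a minimal sketch of raising the plugin's weight in the profile (weight multiplies a score plugin's contribution to the combined node score):
# scheduler-config.yaml (excerpt)
profiles:
  - schedulerName: data-locality-scheduler
    plugins:
      score:
        enabled:
          - name: DataLocalityScorer
            weight: 5  # count this plugin's score five times as much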
Final Production Considerations
*   Observability is Non-Negotiable: Your custom scheduler is a critical cluster component. Expose Prometheus metrics from your plugin. Track scheduling attempts, successes, failures, and crucially, the latency of your Score function. A slow external API call can bring your entire cluster's pod scheduling to a crawl.
*   High Availability: The provided Deployment runs a single replica. For production, you must run multiple replicas. The leaderElection block in our KubeSchedulerConfiguration ensures that only one instance is active at a time, preventing race conditions.
*   Error Handling and State: What happens if your metadata service is down? Your Score function should handle that gracefully. A common pattern is to return a neutral score (0) and log an error. This allows scheduling to proceed based on other criteria, degrading gracefully rather than failing completely.
*   CycleState for Caching: If multiple plugins in your profile need the same external data, the first plugin can fetch it and store it in the framework.CycleState object. This object is a key-value store that persists for the duration of a single pod's scheduling attempt, allowing subsequent plugins to read the data without making redundant API calls.
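A minimal sketch of that pattern, assuming an earlier plugin (e.g., a PreScore) has fetched the data; the datasetLocations type and key name are illustrative:
// Illustrative CycleState usage -- the type and key below are stand-ins.
const datasetStateKey = framework.StateKey("acme.com/dataset-locations")

// datasetLocations satisfies framework.StateData (a single Clone method).
type datasetLocations struct {
	nodes map[string]bool // node name -> has the dataset
}

// A shallow clone is fine here because the data is treated as read-only.
func (d *datasetLocations) Clone() framework.StateData { return d }

// Writer side (e.g., PreScore): fetch once, store for this scheduling cycle.
func storeLocations(state *framework.CycleState, nodes map[string]bool) {
	state.Write(datasetStateKey, &datasetLocations{nodes: nodes})
}

// Reader side (e.g., Score): reuse the data, no second external call.
func readLocations(state *framework.CycleState) (*datasetLocations, error) {
	data, err := state.Read(datasetStateKey)
	if err != nil {
		return nil, err // nothing was written this cycle
	}
	locs, ok := data.(*datasetLocations)
	if !ok {
		return nil, fmt.Errorf("unexpected CycleState data type %T", data)
	}
	return locs, nil
}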