Optimizing K8s Operator Reconciliation for High-Cardinality CRDs

Goh Ling Yong

The High-Cardinality Catastrophe: When Standard Operators Fail

In the Kubernetes ecosystem, the operator pattern has emerged as the de facto standard for managing complex, stateful applications. Using the powerful abstractions provided by libraries like controller-runtime, engineers can encode operational knowledge into a control loop that continuously reconciles the desired state (defined in a Custom Resource) with the actual state of the cluster.

For most use cases—managing a database cluster, a message queue, or a handful of application deployments—the default scaffolding provided by tools like Kubebuilder or Operator SDK is more than sufficient. The controller watches for changes to its primary resource (AppConfig in our examples) and any secondary resources (Deployments, Services, ConfigMaps), and on every event, it triggers a Reconcile function. This works flawlessly for tens or even hundreds of CRs.

The architectural assumptions of this default model begin to break down catastrophically when you enter the realm of high-cardinality CRDs. Imagine a scenario where your operator manages a lightweight, per-tenant or per-user configuration. A large SaaS platform could easily have 10,000, 50,000, or even more instances of your AppConfig CR. At this scale, several critical bottlenecks emerge:

  • The Thundering Herd Problem: A common pattern is for an operator to Watch a shared resource, like a global ConfigMap containing operator-wide settings. If that ConfigMap is updated, a naive watch-and-map setup enqueues a reconciliation request for every single CR the controller manages. With 50,000 CRs, 50,000 reconciliation requests are dumped into the work queue simultaneously, overwhelming the operator and hammering the Kubernetes API server.
  • Memory Pressure from Informer Caches: To avoid constant API server queries, controllers rely on informers that maintain an in-memory cache of the resources they watch. While incredibly efficient, this cache's memory footprint is directly proportional to the number and size of the objects. Caching 50,000 AppConfig objects, plus associated Deployments and Services, can lead to gigabytes of RAM usage, making your operator a significant resource hog.
  • API Server Saturation: Even with a cache, the Reconcile function frequently interacts with the API server to update status subresources, create/update secondary resources, and apply finalizers. With a high MaxConcurrentReconciles setting, thousands of concurrent reconciliation loops can easily saturate the API server's request limits, leading to 429 Too Many Requests errors and cascading failures across the cluster.
This article dissects these failure modes and provides three advanced, production-proven patterns for building operators that can scale to manage tens of thousands of CRs without collapsing. We will move from fine-grained tuning to fundamental architectural changes.

    Let's assume we are working with the following simplified AppConfig CRD:

    go
    // api/v1/appconfig_types.go
    
    package v1
    
    import (
    	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )
    
    // AppConfigSpec defines the desired state of AppConfig
    type AppConfigSpec struct {
    	// Image is the container image to deploy.
    	Image string `json:"image"`
    
    	// Replicas is the number of desired replicas.
    	Replicas *int32 `json:"replicas"`
    
    	// GlobalConfigMapName is the name of a shared ConfigMap to source settings from.
    	GlobalConfigMapName string `json:"globalConfigMapName"`
    }
    
    // AppConfigStatus defines the observed state of AppConfig
    type AppConfigStatus struct {
    	// Conditions represent the latest available observations of an AppConfig's state.
    	Conditions []metav1.Condition `json:"conditions,omitempty"`
    
    	// ReadyReplicas is the number of ready replicas.
    	ReadyReplicas int32 `json:"readyReplicas"`
    }
    
    //+kubebuilder:object:root=true
    //+kubebuilder:subresource:status
    
    // AppConfig is the Schema for the appconfigs API
    type AppConfig struct {
    	metav1.TypeMeta   `json:",inline"`
    	metav1.ObjectMeta `json:"metadata,omitempty"`
    
    	Spec   AppConfigSpec   `json:"spec,omitempty"`
    	Status AppConfigStatus `json:"status,omitempty"`
    }
    
    //+kubebuilder:object:root=true
    
    // AppConfigList contains a list of AppConfig
    type AppConfigList struct {
    	metav1.TypeMeta `json:",inline"`
    	metav1.ListMeta `json:"metadata,omitempty"`
    	Items           []AppConfig `json:"items"`
    }
    
    func init() {
    	SchemeBuilder.Register(&AppConfig{}, &AppConfigList{})
    }

    Pattern 1: Surgical Reconciliation with Event-Filtering Predicates

    The most immediate and impactful optimization is to stop reconciling when it's not necessary. By default, any change to a watched object (creation, update, deletion) triggers reconciliation. However, many of these changes are irrelevant to the controller's logic. For instance, a status update on our AppConfig CR should not trigger another reconciliation loop that re-evaluates the spec.

    controller-runtime provides a powerful mechanism for this: Predicate Functions. Predicates are filters that run before an event is enqueued for reconciliation. If the predicate returns false, the event is dropped entirely.

    Use Case: Ignoring Status Subresource Updates

    Our Reconcile function typically ends by updating the status subresource of the AppConfig CR. This update itself generates an UpdateEvent. Without a predicate, this would immediately trigger another reconciliation for the same object, creating a wasteful cycle. We can prevent this with a predicate that only allows events where the resource's Generation has changed. The Generation is a metadata field that increments only when the spec of the object is modified.

    go
    // internal/controller/appconfig_controller.go
    
    import (
    	// ... other imports
    	"sigs.k8s.io/controller-runtime/pkg/predicate"
    	"sigs.k8s.io/controller-runtime/pkg/event"
    )
    
    // ... Reconcile function ...
    
    func (r *AppConfigReconciler) SetupWithManager(mgr ctrl.Manager) error {
    	return ctrl.NewControllerManagedBy(mgr).
    		For(&appconfigv1.AppConfig{}).
    		Owns(&appsv1.Deployment{}).
    		// Add the predicate here
    		WithEventFilter(predicate.GenerationChangedPredicate{}).
    		Complete(r)
    }

    The built-in GenerationChangedPredicate is a great starting point. It handles UpdateEvent by comparing event.ObjectOld.GetGeneration() with event.ObjectNew.GetGeneration(), and it lets all CreateEvent and DeleteEvent through. One caveat: WithEventFilter applies the predicate to every watch the controller registers, including the owned Deployments, whose status-only updates also leave Generation untouched. If your reconciler needs those events (for example, to keep ReadyReplicas current in the AppConfig status), scope the predicate to the primary resource instead, as sketched below.
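
    A minimal sketch of that per-watch scoping using builder.WithPredicates; the rest of the setup is unchanged from the earlier example:

    go
    // internal/controller/appconfig_controller.go

    import (
    	// ... other imports
    	"sigs.k8s.io/controller-runtime/pkg/builder"
    	"sigs.k8s.io/controller-runtime/pkg/predicate"
    )

    func (r *AppConfigReconciler) SetupWithManager(mgr ctrl.Manager) error {
    	return ctrl.NewControllerManagedBy(mgr).
    		// Filter only the primary resource on Generation changes...
    		For(&appconfigv1.AppConfig{}, builder.WithPredicates(predicate.GenerationChangedPredicate{})).
    		// ...so that status-only updates to owned Deployments still reach Reconcile.
    		Owns(&appsv1.Deployment{}).
    		Complete(r)
    }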

    Advanced Use Case: Defeating the Thundering Herd

    Now let's tackle the thundering herd problem. Our AppConfig spec references a GlobalConfigMapName. The operator needs to watch this ConfigMap to react to changes. A naive setup looks like this:

    go
    // Naive setup - DO NOT USE IN PRODUCTION AT SCALE
    func (r *AppConfigReconciler) SetupWithManager(mgr ctrl.Manager) error {
    	return ctrl.NewControllerManagedBy(mgr).
    		For(&appconfigv1.AppConfig{}).
    		Owns(&appsv1.Deployment{}).
            // This watch will cause a thundering herd
    		Watches(
    			&corev1.ConfigMap{},
    			handler.EnqueueRequestsFromMapFunc(r.enqueueRequestsForConfigMap),
    		).
    		WithEventFilter(predicate.GenerationChangedPredicate{}).
    		Complete(r)
    }
    
    // enqueueRequestsForConfigMap finds all AppConfigs that use this ConfigMap and enqueues them.
    func (r *AppConfigReconciler) enqueueRequestsForConfigMap(ctx context.Context, obj client.Object) []reconcile.Request {
        // ... logic to list ALL AppConfigs and check if they use this ConfigMap ...
        // This is the source of the problem. It returns a request for every AppConfig.
    }

    When the global-settings ConfigMap changes, this triggers a reconciliation for all 50,000 AppConfig instances. We can solve this with a custom predicate on the Watches call. Let's say our operator only cares about a specific key, appconfig.special.key, within the ConfigMap. We can write a predicate that only allows the event to pass if the value of that specific key has changed.

    go
    // internal/controller/predicates.go
    package controller
    
    import (
    	"sigs.k8s.io/controller-runtime/pkg/event"
    	"sigs.k8s.io/controller-runtime/pkg/predicate"
    	corev1 "k8s.io/api/core/v1"
    	logf "sigs.k8s.io/controller-runtime/pkg/log"
    )
    
    // ConfigMapDataChangedPredicate filters events for a specific key in a ConfigMap.
    func ConfigMapDataChangedPredicate(key string) predicate.Predicate {
    	log := logf.Log.WithName("configmap-predicate")
    
    	return predicate.Funcs{
    		UpdateFunc: func(e event.UpdateEvent) bool {
    			// We only care about Update events for ConfigMaps.
    			oldCM, ok := e.ObjectOld.(*corev1.ConfigMap)
    			if !ok {
    				log.Error(nil, "UpdateEvent ObjectOld is not a ConfigMap", "object", e.ObjectOld)
    				return false
    			}
    
    			newCM, ok := e.ObjectNew.(*corev1.ConfigMap)
    			if !ok {
    				log.Error(nil, "UpdateEvent ObjectNew is not a ConfigMap", "object", e.ObjectNew)
    				return false
    			}
    
    			oldValue := oldCM.Data[key]
    			newValue := newCM.Data[key]
    
    			// If the value of our specific key has changed, trigger reconciliation.
    			if oldValue != newValue {
    				log.Info("ConfigMap key changed, allowing reconciliation", "key", key, "configmap", newCM.Name)
    				return true
    			}
    
    			// Otherwise, ignore the event.
    			return false
    		},
    		CreateFunc: func(e event.CreateEvent) bool {
    			// We might want to reconcile if a new ConfigMap is created that could be relevant.
    			return true
    		},
    		DeleteFunc: func(e event.DeleteEvent) bool {
    			// Reconcile if a relevant ConfigMap is deleted.
    			return true
    		},
    		GenericFunc: func(e event.GenericEvent) bool {
    			// Generic events are rare, but we can allow them just in case.
    			return true
    		},
    	}
    }

    Now, we apply this predicate to our watch:

    go
    // internal/controller/appconfig_controller.go
    
    // ... in SetupWithManager ...
    		Watches(
    			&corev1.ConfigMap{},
    			handler.EnqueueRequestsFromMapFunc(r.enqueueRequestsForConfigMap),
    			builder.WithPredicates(ConfigMapDataChangedPredicate("appconfig.special.key")),
    		).
    // ...

    With this change, updates to the global-settings ConfigMap that modify other keys will be completely ignored by our controller, eliminating the thundering herd problem for irrelevant changes.
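
    The predicate keeps irrelevant ConfigMap updates out of the work queue, but when a relevant update does get through, the map handler still has to decide which AppConfigs to enqueue. Listing all 50,000 CRs and filtering them in memory would undo much of the gain, so one reasonable implementation uses a field index so the cached client returns only the AppConfigs that reference the changed ConfigMap. The sketch below is one way to do this; the index key .spec.globalConfigMapName is an arbitrary name we register ourselves, and the generation predicate from earlier is omitted for brevity:

    go
    // internal/controller/appconfig_controller.go

    import (
    	"context"

    	corev1 "k8s.io/api/core/v1"
    	ctrl "sigs.k8s.io/controller-runtime"
    	"sigs.k8s.io/controller-runtime/pkg/builder"
    	"sigs.k8s.io/controller-runtime/pkg/client"
    	"sigs.k8s.io/controller-runtime/pkg/handler"
    	"sigs.k8s.io/controller-runtime/pkg/reconcile"
    	// ... other imports (appconfigv1, appsv1)
    )

    // globalConfigMapField is an arbitrary index key registered below.
    const globalConfigMapField = ".spec.globalConfigMapName"

    func (r *AppConfigReconciler) SetupWithManager(mgr ctrl.Manager) error {
    	// Register a field index so the map handler can look up AppConfigs by the
    	// ConfigMap they reference instead of listing every AppConfig in the cluster.
    	if err := mgr.GetFieldIndexer().IndexField(context.Background(), &appconfigv1.AppConfig{}, globalConfigMapField,
    		func(obj client.Object) []string {
    			return []string{obj.(*appconfigv1.AppConfig).Spec.GlobalConfigMapName}
    		}); err != nil {
    		return err
    	}

    	return ctrl.NewControllerManagedBy(mgr).
    		For(&appconfigv1.AppConfig{}).
    		Owns(&appsv1.Deployment{}).
    		Watches(
    			&corev1.ConfigMap{},
    			handler.EnqueueRequestsFromMapFunc(r.enqueueRequestsForConfigMap),
    			builder.WithPredicates(ConfigMapDataChangedPredicate("appconfig.special.key")),
    		).
    		Complete(r)
    }

    // enqueueRequestsForConfigMap enqueues only the AppConfigs whose spec references the changed ConfigMap.
    func (r *AppConfigReconciler) enqueueRequestsForConfigMap(ctx context.Context, obj client.Object) []reconcile.Request {
    	var list appconfigv1.AppConfigList
    	if err := r.List(ctx, &list, client.MatchingFields{globalConfigMapField: obj.GetName()}); err != nil {
    		return nil
    	}

    	requests := make([]reconcile.Request, 0, len(list.Items))
    	for i := range list.Items {
    		requests = append(requests, reconcile.Request{
    			NamespacedName: client.ObjectKeyFromObject(&list.Items[i]),
    		})
    	}
    	return requests
    }

    The index matches on name alone, mirroring the GlobalConfigMapName field in the spec; whether to also scope the lookup to a namespace depends on how your tenants and the shared ConfigMap are laid out.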


    Pattern 2: Controller Sharding for Horizontal Scalability

    Even with perfect predicates, you may reach a point where a single controller pod cannot handle the sheer volume of CRs. The CPU required for reconciliation logic and the memory for the informer cache become a bottleneck. The solution is to scale the controller horizontally, just like any other stateless application. This is called controller sharding.

    The core idea is to run multiple replicas of your operator deployment, where each replica is responsible for a distinct subset (a "shard") of the total CRs. The key is to ensure that each CR is managed by exactly one replica at any given time.

    Implementation Strategy

    We can achieve this by using a consistent hashing algorithm on the CR's name or UID to assign it to a shard. Each operator pod, upon startup, determines its own shard ID and then uses a predicate to filter out events for CRs belonging to other shards.

  • Assigning a Shard ID: The operator pod needs to know which shard it is. A common way is to use the pod's ordinal index from a StatefulSet or derive it from the pod name (e.g., my-operator-7 is shard 7).
  • Sharding Predicate: We'll create a predicate that performs the hash calculation and comparison.
    go
    // internal/controller/sharding.go
    package controller
    
    import (
    	"hash/fnv"
    	"os"
    	"strconv"
    	"strings"

    	"sigs.k8s.io/controller-runtime/pkg/client"
    	"sigs.k8s.io/controller-runtime/pkg/event"
    	logf "sigs.k8s.io/controller-runtime/pkg/log"
    	"sigs.k8s.io/controller-runtime/pkg/predicate"
    )
    
    // ShardPredicate filters objects based on a consistent hash of their name.
    func ShardPredicate() (predicate.Predicate, error) {
    	log := logf.Log.WithName("shard-predicate")
    
    	podName := os.Getenv("POD_NAME")
    	podNamespace := os.Getenv("POD_NAMESPACE")
    	numShardsStr := os.Getenv("NUM_SHARDS")
    
    	if podName == "" || podNamespace == "" || numShardsStr == "" {
    		log.Info("Disabling sharding: POD_NAME, POD_NAMESPACE, or NUM_SHARDS not set. Controller will process all resources.")
    		return predicate.Funcs{
    			UpdateFunc:  func(e event.UpdateEvent) bool { return true },
    			CreateFunc:  func(e event.CreateEvent) bool { return true },
    			DeleteFunc:  func(e event.DeleteEvent) bool { return true },
    			GenericFunc: func(e event.GenericEvent) bool { return true },
    		}, nil
    	}
    
    	// Extract ordinal from pod name (e.g., "my-operator-2" -> 2)
    	lastDash := strings.LastIndex(podName, "-")
    	if lastDash == -1 || lastDash == len(podName)-1 {
    		log.Info("Disabling sharding: Could not parse ordinal from POD_NAME", "podName", podName)
    		return predicate.Funcs{ /* ... return true ... */ }, nil
    	}
    	shardID, err := strconv.Atoi(podName[lastDash+1:])
    	if err != nil {
    		log.Info("Disabling sharding: Could not parse ordinal from POD_NAME", "podName", podName, "error", err)
    		return predicate.Funcs{ /* ... return true ... */ }, nil
    	}
    
    	numShards, err := strconv.Atoi(numShardsStr)
    	if err != nil || numShards <= 0 {
    		log.Error(err, "Invalid NUM_SHARDS value, disabling sharding", "value", numShardsStr)
    		return predicate.Funcs{ /* ... return true ... */ }, nil
    	}
    
    	log.Info("Sharding enabled", "shardID", shardID, "numShards", numShards)
    
    	return predicate.Funcs{
    		UpdateFunc: func(e event.UpdateEvent) bool {
    			return isObjectInShard(e.ObjectNew, numShards, shardID)
    		},
    		CreateFunc: func(e event.CreateEvent) bool {
    			return isObjectInShard(e.Object, numShards, shardID)
    		},
    		DeleteFunc: func(e event.DeleteEvent) bool {
    			return isObjectInShard(e.Object, numShards, shardID)
    		},
    		GenericFunc: func(e event.GenericEvent) bool {
    			return isObjectInShard(e.Object, numShards, shardID)
    		},
    	}, nil
    }
    
    func isObjectInShard(obj client.Object, numShards, shardID int) bool {
    	// Use a fresh hasher per call: predicate funcs can run concurrently,
    	// so sharing a single hash.Hash32 across events would be a data race.
    	hasher := fnv.New32a()

    	key := obj.GetNamespace() + "/" + obj.GetName()
    	hasher.Write([]byte(key))

    	targetShard := int(hasher.Sum32() % uint32(numShards))
    	return targetShard == shardID
    }
  • Applying the Shard Predicate: In your main.go, you construct this predicate and pass it to your controller setup.
    go
    // cmd/main.go
    
    func main() {
    	// ... setup code ...
    
    	shardPredicate, err := controller.ShardPredicate()
    	if err != nil {
    		setupLog.Error(err, "unable to create shard predicate")
    		os.Exit(1)
    	}
    
    	if err = (&controller.AppConfigReconciler{
    		Client: mgr.GetClient(),
    		Scheme: mgr.GetScheme(),
    	}).SetupWithManager(mgr, shardPredicate); err != nil {
    		setupLog.Error(err, "unable to create controller", "controller", "AppConfig")
    		os.Exit(1)
    	}
    
    	// ... start manager ...
    }
    
    // internal/controller/appconfig_controller.go
    
    func (r *AppConfigReconciler) SetupWithManager(mgr ctrl.Manager, shardPredicate predicate.Predicate) error {
    	return ctrl.NewControllerManagedBy(mgr).
    		For(&appconfigv1.AppConfig{}).
    		// ... other watches ...
    		WithEventFilter(predicate.And(predicate.GenerationChangedPredicate{}, shardPredicate)).
    		Complete(r)
    }

    Edge Cases and Considerations for Sharding

    * StatefulSet Requirement: Your operator should be deployed as a StatefulSet to get stable pod names (my-operator-0, my-operator-1, etc.) for reliable shard IDs.

    * Rebalancing: This simple sharding model does not handle rebalancing. When you change NUM_SHARDS (e.g., scaling from 3 to 5 replicas), the mapping of CRs to shards changes, which can leave a CR briefly managed by two replicas or by none (the standalone sketch after this list shows how many keys get reassigned). More advanced implementations use a coordination mechanism, such as a leader that assigns shard ranges, to rebalance gracefully.

    * Cache Memory: The biggest potential win of sharding is memory reduction, but an event predicate alone does not deliver it: the informer still lists and caches every object, and the predicate only drops events afterwards. To actually shrink the cache, each replica must also scope what it watches, for example by labelling each CR with its shard and configuring the manager's cache with a matching label selector, so each pod caches roughly Total CRs / Num Shards objects.
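
    To make the rebalancing concern concrete, the standalone snippet below (not part of the operator; the keys are illustrative) hashes a few namespace/name keys the same way the ShardPredicate sketch does and prints their shard for 3 versus 5 shards. With plain modulo hashing, many keys land on a different shard after scaling, which is exactly the window where ownership can overlap or lapse.

    go
    // sharddemo/main.go — standalone illustration, not part of the operator.
    package main

    import (
    	"fmt"
    	"hash/fnv"
    )

    // shardFor mirrors the modulo assignment used by the ShardPredicate sketch above.
    func shardFor(key string, numShards int) int {
    	h := fnv.New32a()
    	h.Write([]byte(key))
    	return int(h.Sum32() % uint32(numShards))
    }

    func main() {
    	keys := []string{
    		"tenant-a/appconfig-1",
    		"tenant-b/appconfig-7",
    		"tenant-c/appconfig-42",
    		"tenant-d/appconfig-9001",
    	}
    	for _, k := range keys {
    		// With plain modulo hashing, scaling from 3 to 5 shards reassigns many keys,
    		// which is why rebalancing needs coordination in a real deployment.
    		fmt.Printf("%-25s shard(3)=%d shard(5)=%d\n", k, shardFor(k, 3), shardFor(k, 5))
    	}
    }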


    Pattern 3: Resilient Finalizers for Graceful Deletion at Scale

    Finalizers are a critical mechanism for ensuring external resources are cleaned up before a Kubernetes object is deleted. When you delete a CR that carries a finalizer, Kubernetes merely sets its deletionTimestamp and leaves the object in place. It is the operator's job to perform the cleanup and then remove the finalizer; only once the finalizer list is empty does the API server actually delete the object.

    At scale, this process is fraught with peril. Imagine a request to delete 1,000 AppConfig CRs. Your Reconcile function for each will be triggered. If the external resource cleanup involves calling a slow or unreliable API, your reconciliation loops can get stuck.

    The Problem: Stuck Finalizers

    If the external API call hangs for 30 seconds, a reconciliation worker is tied up for 30 seconds. With the default MaxConcurrentReconciles of 1, you can only process a handful of deletions per minute. If the API is down entirely, no deletions proceed at all, leading to a massive backlog of resources stuck in Terminating.
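
    One immediate mitigation is to raise worker concurrency so a few slow cleanups do not monopolize the controller. This does not fix a hanging dependency and it increases API server load, but it is a one-line change. A minimal sketch using controller-runtime's controller.Options (the value 8 is an arbitrary example):

    go
    // internal/controller/appconfig_controller.go

    import (
    	// ... other imports
    	"sigs.k8s.io/controller-runtime/pkg/controller"
    )

    func (r *AppConfigReconciler) SetupWithManager(mgr ctrl.Manager) error {
    	return ctrl.NewControllerManagedBy(mgr).
    		For(&appconfigv1.AppConfig{}).
    		Owns(&appsv1.Deployment{}).
    		// Allow more reconciliations (and therefore more finalizer cleanups) in flight at once.
    		// The right value depends on what the API server and the external service can tolerate.
    		WithOptions(controller.Options{MaxConcurrentReconciles: 8}).
    		Complete(r)
    }

    The deeper fix, described next, is to stop blocking the worker in the first place.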

    Solution: Asynchronous Cleanup and Intelligent Requeuing

    Don't perform long-running, blocking operations directly in the Reconcile function. Instead, use the finalizer logic to trigger an asynchronous cleanup process and requeue with an appropriate backoff.

    go
    // internal/controller/appconfig_controller.go

    import (
    	// ... other imports (appconfigv1, ctrl, etc.)
    	"context"
    	"fmt"
    	"time"

    	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
    	"sigs.k8s.io/controller-runtime/pkg/log"
    )

    const appConfigFinalizer = "appconfig.my.domain/finalizer"
    
    func (r *AppConfigReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    	log := log.FromContext(ctx)
    
    	appConfig := &appconfigv1.AppConfig{}
    	if err := r.Get(ctx, req.NamespacedName, appConfig); err != nil {
    		// ... handle not found ...
    	}
    
    	// Check if the object is being deleted
    	if appConfig.ObjectMeta.DeletionTimestamp.IsZero() {
    		// It's not being deleted, so we add our finalizer if it doesn't exist.
    		if !controllerutil.ContainsFinalizer(appConfig, appConfigFinalizer) {
    			controllerutil.AddFinalizer(appConfig, appConfigFinalizer)
    			if err := r.Update(ctx, appConfig); err != nil {
    				return ctrl.Result{}, err
    			}
    		}
    	} else {
    		// The object is being deleted.
    		if controllerutil.ContainsFinalizer(appConfig, appConfigFinalizer) {
    			// Our finalizer is present, so let's perform cleanup.
    			if err := r.cleanupExternalResources(ctx, appConfig); err != nil {
    				// If cleanup fails, we requeue for a retry. 
    				// IMPORTANT: Use exponential backoff to avoid hammering a failing API.
    				log.Error(err, "Failed to cleanup external resources, requeueing")
    				return ctrl.Result{RequeueAfter: 30 * time.Second}, nil // Simple backoff, consider exponential
    			}
    
    			// Cleanup was successful, remove the finalizer.
    			controllerutil.RemoveFinalizer(appConfig, appConfigFinalizer)
    			if err := r.Update(ctx, appConfig); err != nil {
    				return ctrl.Result{}, err
    			}
    		}
    
    		// Stop reconciliation as the item is being deleted.
    		return ctrl.Result{}, nil
    	}
    
    	// ... main reconciliation logic ...
    	return ctrl.Result{}, nil
    }
    
    // cleanupExternalResources simulates a potentially slow or failing API call.
    func (r *AppConfigReconciler) cleanupExternalResources(ctx context.Context, appConfig *appconfigv1.AppConfig) error {
    	log := log.FromContext(ctx)
    	log.Info("Cleaning up external resources...")
    
    	// Create a context with a timeout for the external API call.
    	// This prevents the reconcile loop from hanging indefinitely.
    	cleanupCtx, cancel := context.WithTimeout(ctx, 15*time.Second)
    	defer cancel()
    
    	// DUMMY IMPLEMENTATION: In a real scenario, this would be an HTTP or gRPC call.
    	// We simulate a call that might fail.
    	if time.Now().Second()%4 == 0 { // Fail 25% of the time
    		return fmt.Errorf("failed to connect to external cleanup service")
    	}
    
    	// Simulate a slow API call
    	select {
    	case <-time.After(5 * time.Second):
    		log.Info("Successfully cleaned up external resources")
    		return nil
    	case <-cleanupCtx.Done():
    		return fmt.Errorf("cleanup timed out: %w", cleanupCtx.Err())
    	}
    }

    Key Takeaways from this pattern:

    * Timeout Your Calls: Never make a network call within a Reconcile loop without a context timeout. This prevents a single hung call from stalling a worker thread indefinitely.

    * Requeue on Failure: If cleanup fails, don't just return the error. That hands the item back to the work queue's default rate limiter, whose exponential backoff starts at a few milliseconds and will hammer a failing external system on the first retries. Returning ctrl.Result{RequeueAfter: ...} instead gives you explicit control over the retry interval.

    * Idempotency: Your cleanup logic must be idempotent. It might be called multiple times if previous attempts failed, so it should not error out if the external resource is already gone (see the sketch below).
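
    Putting the last two points together, here is a sketch of what an idempotent, time-bounded cleanupExternalResources might look like against a hypothetical HTTP cleanup service; the URL and response handling are assumptions, not part of the original example:

    go
    // cleanupExternalResources, sketched against a hypothetical HTTP cleanup API.
    // A 404 means the external resource is already gone, so repeated calls stay idempotent.
    func (r *AppConfigReconciler) cleanupExternalResources(ctx context.Context, appConfig *appconfigv1.AppConfig) error {
    	cleanupCtx, cancel := context.WithTimeout(ctx, 15*time.Second)
    	defer cancel()

    	// Hypothetical endpoint keyed by namespace/name; replace with your real cleanup API.
    	url := fmt.Sprintf("https://cleanup.internal.example/api/v1/tenants/%s/%s",
    		appConfig.Namespace, appConfig.Name)

    	req, err := http.NewRequestWithContext(cleanupCtx, http.MethodDelete, url, nil)
    	if err != nil {
    		return err
    	}

    	resp, err := http.DefaultClient.Do(req)
    	if err != nil {
    		return fmt.Errorf("calling cleanup service: %w", err)
    	}
    	defer resp.Body.Close()

    	switch resp.StatusCode {
    	case http.StatusOK, http.StatusNoContent, http.StatusNotFound:
    		// Treating 404 as success is what keeps repeated cleanup attempts safe.
    		return nil
    	default:
    		return fmt.Errorf("cleanup service returned status %d", resp.StatusCode)
    	}
    }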


    Performance Analysis and Monitoring

    Implementing these patterns is only half the battle. You must monitor your operator's performance to understand its behavior under load. Key Prometheus metrics exposed by controller-runtime are essential:

    * controller_runtime_reconcile_total: The total number of reconciliations. A sudden spike can indicate a thundering herd.

    * controller_runtime_reconcile_errors_total: Tracks reconciliation failures. A rising count points to bugs or external system failures.

    * controller_runtime_reconcile_time_seconds (Histogram): The latency of your Reconcile function. Monitor the p95 and p99 latencies to catch performance regressions.

    * workqueue_depth: The number of items waiting in the work queue. If this number is consistently high, your controller cannot keep up with the event rate.

    * workqueue_adds_total: The rate at which items are added to the queue. This is your inbound event rate.

    By building a dashboard with these metrics, you can quantitatively measure the impact of your optimizations.
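
    When the built-in metrics are not enough (for example, to see load per shard), custom metrics can be registered on controller-runtime's shared Prometheus registry so they are scraped from the same /metrics endpoint. A minimal sketch; the metric name and label are assumptions:

    go
    // internal/controller/metrics.go
    package controller

    import (
    	"github.com/prometheus/client_golang/prometheus"
    	"sigs.k8s.io/controller-runtime/pkg/metrics"
    )

    var (
    	// reconcilesByShard is a hypothetical custom metric counting reconciliations per shard.
    	reconcilesByShard = prometheus.NewCounterVec(
    		prometheus.CounterOpts{
    			Name: "appconfig_reconciles_by_shard_total",
    			Help: "Reconciliations handled by this operator replica, labelled by shard ID.",
    		},
    		[]string{"shard"},
    	)
    )

    func init() {
    	// controller-runtime exposes a shared registry that is scraped alongside its built-in metrics.
    	metrics.Registry.MustRegister(reconcilesByShard)
    }

    A reconciler could then call reconcilesByShard.WithLabelValues(strconv.Itoa(shardID)).Inc() at the start of each Reconcile.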

    | Scenario (20,000 CRs)       | workqueue_depth (peak on ConfigMap update) | p99 Reconcile Latency | Operator Memory Usage | API Server 429s |
    | --------------------------- | ------------------------------------------ | --------------------- | --------------------- | --------------- |
    | Naive Operator              | ~20,000                                     | 500ms                 | 2.5 GiB               | High            |
    | With Event Predicates       | ~10 (only relevant updates)                 | 500ms                 | 2.5 GiB               | Low             |
    | Sharded Operator (4 shards) | ~2 (per shard)                              | 450ms                 | 650 MiB (per shard)   | Low             |

    This hypothetical benchmark illustrates the dramatic improvements. Predicates solve the work queue explosion, while sharding solves the memory pressure and allows for higher concurrent processing across the cluster.

    Conclusion

    Building a Kubernetes operator that is robust at high-cardinality is an exercise in moving from imperative logic to declarative filtering and architecture. The default controller-runtime behavior is optimized for simplicity, not for massive scale. By embracing advanced patterns like event-filtering predicates, controller sharding, and resilient finalizer logic, you can build controllers that are not just functional but are truly cloud-native—scalable, resilient, and efficient. These techniques are the dividing line between a proof-of-concept operator and a production system capable of managing infrastructure at the scale of a modern SaaS platform.
