Building a Stateful Kubernetes Operator with Kubebuilder and Finalizers


The State Management Gap in Standard Kubernetes Resources

For senior engineers operating in the Kubernetes ecosystem, the limitations of standard workloads like Deployments and even StatefulSets become apparent when managing truly complex, stateful applications. Consider a distributed database, a sharded message queue, or a custom caching system. These applications often have lifecycle dependencies that extend beyond the Kubernetes API. They might need to:

  • Register and deregister nodes with an external service discovery mechanism.
  • Perform a data rebalancing or draining operation before a node is terminated.
  • Flush in-memory data to an external object store (like S3) before shutdown.
  • Clean up cloud resources (e.g., DNS records, load balancer rules) that were provisioned on its behalf.

When a user executes kubectl delete my-app, Kubernetes's default garbage collection is powerful but unaware of these external requirements. It will terminate pods and remove API objects, potentially leading to data loss, orphaned cloud resources, or an inconsistent application state. This is the gap the Operator pattern, specifically with robust lifecycle hooks, is designed to fill.

This article bypasses introductory concepts. We assume you are proficient in Go, understand Kubernetes's controller-runtime, and have perhaps built a basic operator before. Our focus is on the advanced implementation details required for production-grade stateful operators. We will construct an operator for a ClusteredCache custom resource, paying special attention to two critical, often misunderstood, aspects:

* Finalizers: For ensuring graceful, multi-step deletion logic that can interact with systems outside of Kubernetes.

* Status Sub-resources: For providing detailed, accurate, and standardized feedback on the state of the managed application.

We will write idempotent reconciliation logic, handle non-trivial edge cases, and optimize our controller's performance. Let's begin.


1. Defining the `ClusteredCache` API

First, we define our Custom Resource Definition (CRD). A well-designed API is the foundation of a reliable operator. Our ClusteredCache will manage a cluster of cache nodes, exposing configuration for size, version, and a detailed status block.

After scaffolding a new project with kubebuilder init and kubebuilder create api --group cache --version v1 --kind ClusteredCache, we'll flesh out api/v1/clusteredcache_types.go.

```go
// api/v1/clusteredcache_types.go

package v1

import (
	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// ClusteredCacheSpec defines the desired state of ClusteredCache
type ClusteredCacheSpec struct {
	// Number of cache node replicas.
	// +kubebuilder:validation:Minimum=1
	Size int32 `json:"size"`

	// Version of the cache node image to deploy.
	// +kubebuilder:validation:MinLength=1
	Version string `json:"version"`
}

// ClusteredCacheStatus defines the observed state of ClusteredCache
type ClusteredCacheStatus struct {
	// Conditions represent the latest available observations of the ClusteredCache's state.
	// +optional
	// +patchMergeKey=type
	// +patchStrategy=merge
	Conditions []metav1.Condition `json:"conditions,omitempty" patchStrategy:"merge" patchMergeKey:"type"`

	// ReadyReplicas is the number of cache nodes that are currently running and ready.
	ReadyReplicas int32 `json:"readyReplicas,omitempty"`

	// CurrentVersion is the version observed on the running StatefulSet.
	CurrentVersion string `json:"currentVersion,omitempty"`
}

//+kubebuilder:object:root=true
//+kubebuilder:subresource:status
//+kubebuilder:printcolumn:name="Size",type="integer",JSONPath=".spec.size"
//+kubebuilder:printcolumn:name="Ready",type="string",JSONPath=".status.readyReplicas"
//+kubebuilder:printcolumn:name="Status",type="string",JSONPath=".status.conditions[?(@.type==\"Available\")].status"
//+kubebuilder:printcolumn:name="Age",type="date",JSONPath=".metadata.creationTimestamp"

// ClusteredCache is the Schema for the clusteredcaches API
type ClusteredCache struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   ClusteredCacheSpec   `json:"spec,omitempty"`
	Status ClusteredCacheStatus `json:"status,omitempty"`
}

//+kubebuilder:object:root=true

// ClusteredCacheList contains a list of ClusteredCache
type ClusteredCacheList struct {
	metav1.TypeMeta `json:",inline"`
	metav1.ListMeta `json:"metadata,omitempty"`
	Items           []ClusteredCache `json:"items"`
}

func init() {
	SchemeBuilder.Register(&ClusteredCache{}, &ClusteredCacheList{})
}
```

Key details in this definition:

* +kubebuilder:subresource:status: This is critical. It enables the status sub-resource, meaning updates to .spec and .status go through separate API endpoints (the main resource endpoint and its /status sub-path). This keeps spec and status writes from clobbering each other and is a baseline best practice for operators.

* metav1.Condition: We use the standard Kubernetes Condition type for our status. This allows tools like kubectl wait and other controllers to understand the state of our resource in a standardized way (e.g., type: Available, status: True). A short example of reading these conditions programmatically follows this list.

* Validation Markers: +kubebuilder:validation markers generate OpenAPI v3 schema for server-side validation, preventing invalid configurations from ever being persisted.
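
Because we use the standard Condition type, any Go client can interrogate our resource with the stock helpers. Below is a minimal sketch, not part of the kubebuilder scaffold, showing how another controller or a test might check the Available condition; the cachev1 import path is an assumption based on the scaffolded project layout.

```go
package example

import (
	"context"

	"k8s.io/apimachinery/pkg/api/meta"
	"sigs.k8s.io/controller-runtime/pkg/client"

	cachev1 "example.com/clusteredcache/api/v1" // assumed module path for the scaffolded API
)

// isCacheAvailable reports whether the ClusteredCache identified by key has
// its "Available" condition set to True.
func isCacheAvailable(ctx context.Context, c client.Client, key client.ObjectKey) (bool, error) {
	cache := &cachev1.ClusteredCache{}
	if err := c.Get(ctx, key, cache); err != nil {
		return false, err
	}
	// meta.IsStatusConditionTrue scans the Conditions slice for the given type
	// and returns true only if its Status is "True".
	return meta.IsStatusConditionTrue(cache.Status.Conditions, "Available"), nil
}
```

The same convention is what lets kubectl wait --for=condition=Available clusteredcache/<name> work without any custom tooling.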

2. The Core Reconciliation Loop: Desired vs. Actual State

The heart of the operator is the Reconcile method in internal/controller/clusteredcache_controller.go. Its sole purpose is to drive the current state of the system towards the desired state defined in the ClusteredCache resource.

Our ClusteredCache operator will manage a StatefulSet to run the cache pods and a headless Service for peer discovery.

Here is a foundational Reconcile implementation. We will build upon this significantly.

```go
// internal/controller/clusteredcache_controller.go

// ... imports

const (
	cacheFinalizer = "cache.example.com/finalizer"
)

type ClusteredCacheReconciler struct {
	client.Client
	Scheme *runtime.Scheme
}

func (r *ClusteredCacheReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	log := log.FromContext(ctx)

	// 1. Fetch the ClusteredCache instance
	cache := &cachev1.ClusteredCache{}
	err := r.Get(ctx, req.NamespacedName, cache)
	if err != nil {
		if errors.IsNotFound(err) {
			log.Info("ClusteredCache resource not found. Ignoring since object must be deleted")
			return ctrl.Result{}, nil
		}
		log.Error(err, "Failed to get ClusteredCache")
		return ctrl.Result{}, err
	}

    // ... Finalizer and deletion logic will go here ...

	// 2. Reconcile the child StatefulSet
	foundSts := &appsv1.StatefulSet{}
	err = r.Get(ctx, types.NamespacedName{Name: cache.Name, Namespace: cache.Namespace}, foundSts)
	if err != nil && errors.IsNotFound(err) {
		// Define and create a new StatefulSet
		sts := r.statefulSetForClusteredCache(cache)
		log.Info("Creating a new StatefulSet", "StatefulSet.Namespace", sts.Namespace, "StatefulSet.Name", sts.Name)
		if err = r.Create(ctx, sts); err != nil {
			log.Error(err, "Failed to create new StatefulSet", "StatefulSet.Namespace", sts.Namespace, "StatefulSet.Name", sts.Name)
			return ctrl.Result{}, err
		}
		// StatefulSet created successfully - return and requeue
		return ctrl.Result{Requeue: true}, nil
	} else if err != nil {
		log.Error(err, "Failed to get StatefulSet")
		return ctrl.Result{}, err
	}

	// 3. Ensure the StatefulSet size and version are correct
	size := cache.Spec.Size
	version := cache.Spec.Version

	// Check if the spec has changed and requires an update
	needsUpdate := false
	if *foundSts.Spec.Replicas != size {
		log.Info("StatefulSet replica count mismatch", "Expected", size, "Found", *foundSts.Spec.Replicas)
		foundSts.Spec.Replicas = &size
		needsUpdate = true
	}

	// Assuming the first container is our cache container
	if len(foundSts.Spec.Template.Spec.Containers) > 0 && foundSts.Spec.Template.Spec.Containers[0].Image != "my-cache-image:"+version {
		log.Info("StatefulSet image version mismatch", "Expected", version, "Found", foundSts.Spec.Template.Spec.Containers[0].Image)
		foundSts.Spec.Template.Spec.Containers[0].Image = "my-cache-image:" + version
		needsUpdate = true
	}

	if needsUpdate {
		if err = r.Update(ctx, foundSts); err != nil {
			log.Error(err, "Failed to update StatefulSet", "StatefulSet.Namespace", foundSts.Namespace, "StatefulSet.Name", foundSts.Name)
			return ctrl.Result{}, err
		}
		// Spec updated, requeue to check status after update
		return ctrl.Result{Requeue: true}, nil
	}

    // ... Status update logic will go here ...

	return ctrl.Result{}, nil
}

// statefulSetForClusteredCache returns a ClusteredCache StatefulSet object
func (r *ClusteredCacheReconciler) statefulSetForClusteredCache(c *cachev1.ClusteredCache) *appsv1.StatefulSet {
	labels := map[string]string{"app": "clustered-cache", "cache_cr": c.Name}
	replicas := c.Spec.Size
	image := "my-cache-image:" + c.Spec.Version

	sts := &appsv1.StatefulSet{
		ObjectMeta: metav1.ObjectMeta{
			Name:      c.Name,
			Namespace: c.Namespace,
		},
		Spec: appsv1.StatefulSetSpec{
			Replicas: &replicas,
			Selector: &metav1.LabelSelector{
				MatchLabels: labels,
			},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{
					Labels: labels,
				},
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{{
						Image: image,
						Name:  "cache-node",
						Ports: []corev1.ContainerPort{{
							ContainerPort: 8080,
							Name:          "cache",
						}},
					}},
				},
			},
		},
	}

	// Set ClusteredCache instance as the owner and controller
	ctrl.SetControllerReference(c, sts, r.Scheme)
	return sts
}

// ... SetupWithManager function ...
```
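
The headless Service for peer discovery follows the same pattern and is omitted above for brevity. A minimal sketch of its builder is shown below; the serviceForClusteredCache name, the -headless suffix, and the port are illustrative choices, not part of the scaffold. Unlike the snippet above, it also propagates the error from ctrl.SetControllerReference, which can fail if, for example, the object already has a different controller owner.

```go
// serviceForClusteredCache returns a headless Service that gives each cache
// pod a stable DNS entry for peer discovery.
func (r *ClusteredCacheReconciler) serviceForClusteredCache(c *cachev1.ClusteredCache) (*corev1.Service, error) {
	labels := map[string]string{"app": "clustered-cache", "cache_cr": c.Name}

	svc := &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{
			Name:      c.Name + "-headless",
			Namespace: c.Namespace,
			Labels:    labels,
		},
		Spec: corev1.ServiceSpec{
			ClusterIP: corev1.ClusterIPNone, // headless: DNS resolves directly to pod IPs
			Selector:  labels,
			Ports: []corev1.ServicePort{{
				Name: "cache",
				Port: 8080,
			}},
		},
	}

	// Surface the owner-reference error instead of discarding it.
	if err := ctrl.SetControllerReference(c, svc, r.Scheme); err != nil {
		return nil, err
	}
	return svc, nil
}
```

The Reconcile loop would create it with the same get-or-create logic used for the StatefulSet, and the StatefulSet's Spec.ServiceName should point at this Service so the pods receive stable network identities.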

A critical line here is ctrl.SetControllerReference(c, sts, r.Scheme). This establishes an owner-reference relationship. When the ClusteredCache CR is deleted, Kubernetes's garbage collector will automatically delete the owned StatefulSet and Service. This is the default behavior, but it is not sufficient for our stateful cleanup needs.

3. Deep Dive: Implementing Finalizers for Graceful Deletion

This is where we address the core problem. When a ClusteredCache object is deleted, we need to intercept the deletion process, perform our custom cleanup logic, and only then allow Kubernetes to remove the object.

The Finalizer Mechanism:

  • A finalizer is simply a string key added to the metadata.finalizers array of an object.
  • When a user requests to delete an object that has finalizers, the API server does not delete it immediately. Instead, it sets the metadata.deletionTimestamp field on the object.
    • This puts the object in a "deletion in progress" state. The object is still visible via the API.
  • It is the responsibility of the controller that added the finalizer to notice the deletionTimestamp, perform its cleanup logic, and then remove its finalizer from the metadata.finalizers array.
    • Once the finalizers array is empty, the Kubernetes API server completes the deletion of the object.

Let's modify our Reconcile loop to implement this pattern.

```go
// internal/controller/clusteredcache_controller.go (inside Reconcile)

func (r *ClusteredCacheReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	log := log.FromContext(ctx)

	cache := &cachev1.ClusteredCache{}
	err := r.Get(ctx, req.NamespacedName, cache)
	// ... (error handling as before)

	// Check if the object is being deleted
	isCacheMarkedToBeDeleted := cache.GetDeletionTimestamp() != nil
	if isCacheMarkedToBeDeleted {
		if controllerutil.ContainsFinalizer(cache, cacheFinalizer) {
			// Run our finalizer logic. If it fails, we requeue the request.
			if err := r.finalizeClusteredCache(ctx, log, cache); err != nil {
				// Don't remove the finalizer if cleanup fails.
				return ctrl.Result{}, err
			}

			// Cleanup was successful. Remove finalizer and update the object.
			// This is what allows the deletion to proceed.
			log.Info("Successfully finalized ClusteredCache. Removing finalizer.")
			controllerutil.RemoveFinalizer(cache, cacheFinalizer)
			err := r.Update(ctx, cache)
			if err != nil {
				return ctrl.Result{}, err
			}
		}
		// Stop reconciliation as the item is being deleted
		return ctrl.Result{}, nil
	}

	// Add finalizer for this CR if it doesn't have one
	if !controllerutil.ContainsFinalizer(cache, cacheFinalizer) {
		log.Info("Adding finalizer for the ClusteredCache")
		controllerutil.AddFinalizer(cache, cacheFinalizer)
		err = r.Update(ctx, cache)
		if err != nil {
			return ctrl.Result{}, err
		}
	}

	// ... (rest of the reconciliation logic for StatefulSet, Service, etc.)
}

// finalizeClusteredCache performs the cleanup logic before the CR is deleted.
func (r *ClusteredCacheReconciler) finalizeClusteredCache(ctx context.Context, log logr.Logger, c *cachev1.ClusteredCache) error {
	// This is our placeholder for complex cleanup logic.
	// In a real-world scenario, this could involve:
	// 1. Calling an external API to deregister the cluster.
	// 2. Creating a Kubernetes Job to backup data from a PersistentVolume to S3.
	// 3. Waiting for all connections to drain gracefully.

	log.Info("Starting finalizer logic for ClusteredCache", "name", c.Name)

	// Example: Simulate calling an external API that takes time.
	log.Info("Deregistering from external monitoring service...")
	time.Sleep(2 * time.Second) // Simulate network latency

	// Example: Simulate backing up data.
	// We could check the status of a backup Job here.
	log.Info("Flushing cache data to persistent storage...")
	time.Sleep(3 * time.Second) // Simulate I/O operations

	log.Info("Finalizer logic completed successfully.")
	return nil // Return nil on success
}
```

This structure is robust. If finalizeClusteredCache returns an error, the Reconcile function will exit, and the request will be requeued. The finalizer remains on the object, preventing deletion until the cleanup logic succeeds on a subsequent reconciliation attempt. This guarantees our external cleanup is performed.
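
The time.Sleep calls are only placeholders. For the backup step mentioned in the comments, a more realistic approach is to drive a Kubernetes Job and let the finalizer report "not finished yet" until it completes, leaning on the requeue behaviour just described. The sketch below is our own illustration: the Job name, the backup image, and the errBackupInProgress sentinel are assumptions, not part of the scaffold.

```go
// Assumes batchv1 "k8s.io/api/batch/v1" is imported alongside the packages
// already used in this controller file.

// errBackupInProgress tells the caller that cleanup has started but is not
// done; returning it from the finalizer keeps the finalizer in place and
// lets controller-runtime requeue the request.
var errBackupInProgress = fmt.Errorf("final backup job still running")

func (r *ClusteredCacheReconciler) ensureBackupJob(ctx context.Context, c *cachev1.ClusteredCache) error {
	jobName := c.Name + "-final-backup" // illustrative naming convention
	job := &batchv1.Job{}
	err := r.Get(ctx, types.NamespacedName{Name: jobName, Namespace: c.Namespace}, job)
	if errors.IsNotFound(err) {
		// No Job yet: create it and report that the backup is in progress.
		job = &batchv1.Job{
			ObjectMeta: metav1.ObjectMeta{Name: jobName, Namespace: c.Namespace},
			Spec: batchv1.JobSpec{
				Template: corev1.PodTemplateSpec{
					Spec: corev1.PodSpec{
						RestartPolicy: corev1.RestartPolicyOnFailure,
						Containers: []corev1.Container{{
							Name:  "backup",
							Image: "my-cache-backup:latest", // hypothetical backup image
						}},
					},
				},
			},
		}
		if err := r.Create(ctx, job); err != nil {
			return err
		}
		return errBackupInProgress
	} else if err != nil {
		return err
	}

	// The Job exists: only report success once it has actually completed.
	if job.Status.Succeeded > 0 {
		return nil
	}
	return errBackupInProgress
}
```

finalizeClusteredCache would call ensureBackupJob and simply propagate errBackupInProgress, so the finalizer (and therefore the object) stays put until the backup Job succeeds.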

4. Advanced Status Management with Conditions

A silent operator is a useless operator. We must provide clear, machine-readable status updates. Using the standard Condition type is the correct way to do this.

Let's add status reconciliation to the end of our Reconcile loop. We'll check the status of the owned StatefulSet and update our ClusteredCache status accordingly.

```go
// internal/controller/clusteredcache_controller.go (end of Reconcile)

// ... after ensuring StatefulSet spec is correct ...

// 4. Update the ClusteredCache status

// First, create a deep copy of the status to avoid modifying the cache in-place
// before the update call.
newStatus := *cache.Status.DeepCopy()

newStatus.ReadyReplicas = foundSts.Status.ReadyReplicas
newStatus.CurrentVersion = cache.Spec.Version // Assume update succeeded

// Set the "Available" condition based on the StatefulSet's status.
if foundSts.Status.ReadyReplicas == cache.Spec.Size {
	availableCond := metav1.Condition{
		Type:               "Available",
		Status:             metav1.ConditionTrue,
		Reason:             "Ready",
		Message:            "All cache nodes are ready.",
		LastTransitionTime: metav1.Now(),
	}
	meta.SetStatusCondition(&newStatus.Conditions, availableCond)
} else {
	progressingCond := metav1.Condition{
		Type:               "Available",
		Status:             metav1.ConditionFalse,
		Reason:             "Progressing",
		Message:            fmt.Sprintf("Waiting for all cache nodes to be ready (%d/%d)", foundSts.Status.ReadyReplicas, cache.Spec.Size),
		LastTransitionTime: metav1.Now(),
	}
	meta.SetStatusCondition(&newStatus.Conditions, progressingCond)
}

// Only update the status if it has changed
if !reflect.DeepEqual(cache.Status, newStatus) {
	cache.Status = newStatus
	log.Info("Updating ClusteredCache status")
	if err := r.Status().Update(ctx, cache); err != nil {
		log.Error(err, "Failed to update ClusteredCache status")
		return ctrl.Result{}, err
	}
}

return ctrl.Result{}, nil
```

Key Production Patterns here:

* r.Status().Update(ctx, cache): We use the .Status() sub-resource client. This is critical. Using the generic r.Update() client could lead to race conditions where a user changes the .spec at the same time our controller is updating the .status, causing one of the updates to be lost. The sub-resource client mitigates this; a conflict-tolerant variant is sketched after this list.

* meta.SetStatusCondition: This utility from k8s.io/apimachinery/pkg/api/meta correctly adds or updates a condition in the Conditions slice, handling transitions and timestamps properly.

* reflect.DeepEqual: We compare the old status with the new status before calling the API server. This prevents empty updates, reducing load on the API server and avoiding unnecessary reconciliation events for other controllers that might be watching our resource.
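
Even through the status client, an Update can still fail with a conflict if the copy of the object in our informer cache is stale. A conflict-tolerant variant, sketched below, wraps the write in retry.RetryOnConflict from k8s.io/client-go/util/retry and re-reads the object on every attempt; the updateStatusWithRetry helper name is ours, not part of the scaffold.

```go
// updateStatusWithRetry re-fetches the ClusteredCache before each attempt so
// that a stale resourceVersion only costs us a retry, not a failed reconcile.
// Assumes retry "k8s.io/client-go/util/retry" is imported.
func (r *ClusteredCacheReconciler) updateStatusWithRetry(ctx context.Context, key client.ObjectKey, desired cachev1.ClusteredCacheStatus) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		latest := &cachev1.ClusteredCache{}
		if err := r.Get(ctx, key, latest); err != nil {
			return err
		}
		latest.Status = desired
		return r.Status().Update(ctx, latest)
	})
}
```

In practice, returning the conflict error and letting the request requeue is often good enough; the retry helper simply avoids an extra reconcile round-trip.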

5. Edge Cases and Performance Optimizations

Building an operator that works on the happy path is simple. Building one that is resilient and efficient requires handling the edge cases.

A. Idempotent Error Handling and Requeueing

What if creating the StatefulSet fails because of a transient API server error or a failing admission webhook? The Reconcile function will return an error, and controller-runtime will requeue the request with exponential backoff by default. This is often the desired behavior for unexpected errors.

However, sometimes we want more control. For example, if a resource we depend on is not yet ready, we might want to requeue after a fixed delay instead of treating it as a hard error.

```go
// Example of controlled requeue
configMapName := "some-required-config"
foundConfigMap := &corev1.ConfigMap{}
err := r.Get(ctx, types.NamespacedName{Name: configMapName, Namespace: cache.Namespace}, foundConfigMap)
if err != nil {
	if errors.IsNotFound(err) {
		log.Info("Dependent ConfigMap not found, requeueing after 10 seconds")
		// Don't return an error, just ask to be reconciled again after a delay.
		return ctrl.Result{RequeueAfter: 10 * time.Second}, nil
	}
	// For other errors (e.g., permissions), return the error for exponential backoff.
	return ctrl.Result{}, err
}
```
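
The "idempotent" half of this section's title matters just as much as the requeueing: because a reconcile (or a finalizer) can run many times for the same object, external side effects must tolerate repetition. A hedged sketch of what that looks like inside finalizeClusteredCache, where externalClient and its IsNotFound helper are hypothetical stand-ins for whatever registry the cache talks to:

```go
// Deregistration must be safe to repeat: if the external system says the
// cluster is already gone, treat that as success so a half-finished previous
// attempt cannot wedge the finalizer forever.
if err := externalClient.DeregisterCluster(ctx, c.Name); err != nil {
	if externalClient.IsNotFound(err) { // hypothetical helper on the hypothetical client
		log.Info("Cluster already deregistered; nothing to do")
	} else {
		return fmt.Errorf("deregistering cluster %q: %w", c.Name, err)
	}
}
```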

B. Optimizing Reconciliation with Event Predicates

By default, our controller will be triggered for every change to the ClusteredCache resource or any of the resources it owns (like the StatefulSet). This includes changes to the .status field that our own controller makes!

This creates a feedback loop: we update the status, which triggers a new reconciliation, which might update the status again. While our reflect.DeepEqual check prevents API calls, the reconciliation function itself still runs, consuming CPU.

We can filter these unnecessary events using a predicate. The predicate.GenerationChangedPredicate is perfect for this. It only triggers a reconciliation if the metadata.generation of a resource changes. The generation is a number that the API server increments only when the .spec of the resource changes. Status updates do not affect it.

We apply this predicate in the controller's SetupWithManager function in internal/controller/clusteredcache_controller.go.

```go
// internal/controller/clusteredcache_controller.go

func (r *ClusteredCacheReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&cachev1.ClusteredCache{}, builder.WithPredicates(predicate.GenerationChangedPredicate{})).
		Owns(&appsv1.StatefulSet{}).
		// You can also attach predicates to owned resources, e.g.:
		// Owns(&appsv1.StatefulSet{}, builder.WithPredicates(somePredicate)).
		// Note that predicate.ResourceVersionChangedPredicate passes nearly every
		// update, since any write to the StatefulSet bumps its resourceVersion.
		Complete(r)
}
```

This simple change significantly reduces the number of pointless reconciliations, making the operator far more efficient in a busy cluster.
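
One caveat to keep in mind: because label and annotation edits do not bump metadata.generation, GenerationChangedPredicate filters those events out as well. If the operator also needs to react to annotation changes, controller-runtime's predicates can be combined; a minimal sketch:

```go
func (r *ClusteredCacheReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&cachev1.ClusteredCache{}, builder.WithPredicates(
			// Reconcile when the spec changes (generation bump) OR when
			// annotations change; both predicates ship with controller-runtime.
			predicate.Or(
				predicate.GenerationChangedPredicate{},
				predicate.AnnotationChangedPredicate{},
			),
		)).
		Owns(&appsv1.StatefulSet{}).
		Complete(r)
}
```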

C. Concurrency Control

By default, a controller reconciles resources sequentially. If you have thousands of ClusteredCache instances, a long reconciliation on one instance (e.g., waiting for a finalizer) will block all others. You can configure the maximum number of concurrent reconciles in main.go.

```go
// cmd/main.go

// ...

ctrl.NewControllerManagedBy(mgr).
	For(&cachev1.ClusteredCache{}).
	// ...
	WithOptions(controller.Options{MaxConcurrentReconciles: 10}).
	Complete(&controller.ClusteredCacheReconciler{ ... })

// ...
```

Setting MaxConcurrentReconciles to a value greater than 1 allows the operator to work on multiple ClusteredCache objects simultaneously. The optimal value depends on the nature of the reconciliation logic (CPU-bound vs. I/O-bound) and the resources available to the operator pod.
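
In a standard kubebuilder layout the same option can just as well be set inside SetupWithManager, which keeps all of the controller's wiring in one place. A sketch, where controller refers to sigs.k8s.io/controller-runtime/pkg/controller:

```go
func (r *ClusteredCacheReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&cachev1.ClusteredCache{}, builder.WithPredicates(predicate.GenerationChangedPredicate{})).
		Owns(&appsv1.StatefulSet{}).
		// Allow up to 10 ClusteredCache objects to be reconciled in parallel.
		WithOptions(controller.Options{MaxConcurrentReconciles: 10}).
		Complete(r)
}
```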

Conclusion: Beyond Basic Automation

We have moved far beyond a simple "create-if-not-exists" operator. By implementing a robust reconciliation loop that correctly utilizes finalizers for stateful cleanup and provides clear, standardized feedback via the status sub-resource, we have built the foundation for a production-grade controller.

The key takeaways for senior engineers are:

* The Operator's contract is to manage the *entire* lifecycle of an application, including its tear-down. Finalizers are the canonical mechanism for fulfilling this contract when external or ordered operations are required.

* The status sub-resource is not optional. It is a critical feedback mechanism. Use the dedicated status client (.Status().Update()) and standardized Condition types to be a good citizen in the Kubernetes ecosystem.

* Efficiency matters at scale. Use predicates (GenerationChangedPredicate) to eliminate redundant reconciliation cycles and configure concurrency to prevent bottlenecks.

This pattern of managing desired state, handling external dependencies through finalizers, and reporting observed state through status conditions is the fundamental building block that allows Kubernetes to reliably manage the most complex software systems in the world. Mastering it is essential for anyone building truly cloud-native platforms.
