Advanced Kubebuilder: Managing StatefulSets with Finalizers & Leader Election

Goh Ling Yong

The Problem: Why Vanilla StatefulSets Fall Short

As seasoned Kubernetes engineers, we appreciate StatefulSet for its guarantees around stable network identifiers and persistent storage. However, for truly complex stateful applications—like a custom replicated database, a distributed cache, or a message queue cluster—its lifecycle management capabilities are often insufficient.

Consider the deletion process. When a StatefulSet object is deleted, Kubernetes' garbage collector dutifully removes the pods and their associated resources; unlike a scale-down, deleting the StatefulSet itself makes no guarantee about ordered pod termination. But what if your application requires a more sophisticated shutdown sequence?

  • Coordinated Data Backup: Before a database node is terminated, you must trigger a final backup or ensure its data has been fully replicated to its peers.
  • Connection Draining: A pod might need to stop accepting new connections and wait for existing ones to complete gracefully before shutting down.
  • Cluster Rebalancing: Decommissioning a node in a distributed system often requires a controlled data rebalancing process to be initiated and completed before the node's PVC is released.

Vanilla StatefulSet deletion, even combined with pod preStop hooks and termination grace periods, doesn't provide the control needed for these external, state-aware operations. This is where a custom Kubernetes operator becomes indispensable. In this post, we will build an operator to manage a KeyValueStore custom resource, which in turn provisions a StatefulSet. Our focus will be on two production-critical patterns that elevate an operator from a simple automation tool to a resilient, state-aware system manager: finalizers and leader election.


    Section 1: Scaffolding the `KeyValueStore` Operator

    We'll assume you have a working Go environment and have installed the kubebuilder CLI. We'll start by scaffolding the project and defining our custom resource's API.

    bash
    # 1. Initialize the project
    mkdir keyvaluestore-operator && cd keyvaluestore-operator
    kubebuilder init --domain my.domain --repo my.domain/keyvaluestore-operator
    
    # 2. Create the API for our KeyValueStore resource
    kubebuilder create api --group cache --version v1alpha1 --kind KeyValueStore

    This generates the boilerplate. Our primary focus is on defining the schema in api/v1alpha1/keyvaluestore_types.go. We need a Spec to define the desired state and a Status to report the observed state. A robust Status is crucial for operators; it provides visibility and enables other components to react to the state of our custom resource.

    Here is a production-oriented schema. Note the use of json tags and +kubebuilder markers, which control CRD generation.

    api/v1alpha1/keyvaluestore_types.go

    go
    package v1alpha1
    
    import (
    	appsv1 "k8s.io/api/apps/v1"
    	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )
    
    // KeyValueStoreSpec defines the desired state of KeyValueStore
    type KeyValueStoreSpec struct {
    	// +kubebuilder:validation:Minimum=1
    	// +kubebuilder:validation:Maximum=10
    	// Number of desired pods. The controller defaults this to 3 when unset.
    	// +optional
    	Replicas *int32 `json:"replicas,omitempty"`
    
    	// Image is the container image to run for the KeyValueStore.
    	Image string `json:"image"`
    
    	// Port is the port to expose on the service.
    	Port int32 `json:"port"`
    }
    
    // KeyValueStoreStatus defines the observed state of KeyValueStore
    type KeyValueStoreStatus struct {
    	// Conditions represent the latest available observations of the KeyValueStore's state.
    	// +optional
    	// +patchMergeKey=type
    	// +patchStrategy=merge
    	Conditions []metav1.Condition `json:"conditions,omitempty" patchStrategy:"merge" patchMergeKey:"type"`
    
    	// ReadyReplicas is the number of pods created by the StatefulSet that have a Ready condition.
    	ReadyReplicas int32 `json:"readyReplicas,omitempty"`
    }
    
    //+kubebuilder:object:root=true
    //+kubebuilder:subresource:status
    //+kubebuilder:printcolumn:name="Replicas",type="integer",JSONPath=".spec.replicas"
    //+kubebuilder:printcolumn:name="Ready",type="string",JSONPath=".status.readyReplicas"
    //+kubebuilder:printcolumn:name="Status",type="string",JSONPath=".status.conditions[?(@.type=='Available')].status"
    //+kubebuilder:printcolumn:name="Age",type="date",JSONPath=".metadata.creationTimestamp"
    
    // KeyValueStore is the Schema for the keyvaluestores API
    type KeyValueStore struct {
    	metav1.TypeMeta   `json:",inline"`
    	metav1.ObjectMeta `json:"metadata,omitempty"`
    
    	Spec   KeyValueStoreSpec   `json:"spec,omitempty"`
    	Status KeyValueStoreStatus `json:"status,omitempty"`
    }
    
    //+kubebuilder:object:root=true
    
    // KeyValueStoreList contains a list of KeyValueStore
    type KeyValueStoreList struct {
    	metav1.TypeMeta `json:",inline"`
    	metav1.ListMeta `json:"metadata,omitempty"`
    	Items           []KeyValueStore `json:"items"`
    }
    
    func init() {
    	SchemeBuilder.Register(&KeyValueStore{}, &KeyValueStoreList{})
    }

    After defining the types, run make manifests generate to update the CRD and generated code.
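
    To exercise the new type end to end, you can install the CRD into a cluster and create a sample resource. The image, port, and resource name below are illustrative values, not part of the generated scaffold.

    bash
    # Install the CRD into the cluster targeted by your current kubeconfig
    make install

    # Create a sample KeyValueStore (placeholder image and port)
    cat <<EOF | kubectl apply -f -
    apiVersion: cache.my.domain/v1alpha1
    kind: KeyValueStore
    metadata:
      name: my-store
    spec:
      image: redis:7.2
      replicas: 3
      port: 6379
    EOF

    The printcolumn markers defined above control the extra columns you will see in kubectl get keyvaluestores output.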


    Section 2: Implementing the Core Reconciliation Loop

    The heart of the operator is the Reconcile method in controllers/keyvaluestore_controller.go. Its job is to converge the current state of the system towards the desired state defined in the KeyValueStore Spec.

    Our initial reconciliation logic will be straightforward:

  • Fetch the KeyValueStore instance.
  • Check if a corresponding StatefulSet exists. If not, create one.
  • If it exists, ensure its spec (replicas, image) matches the KeyValueStore spec. If not, update it.
  • Update the KeyValueStore status based on the StatefulSet's status.

    Here is the initial implementation of the Reconcile function. Note the use of helper functions to keep the main loop clean.

    controllers/keyvaluestore_controller.go

    go
    import (
        "context"

        // ... other imports
        appsv1 "k8s.io/api/apps/v1"
        corev1 "k8s.io/api/core/v1"
        "k8s.io/apimachinery/pkg/api/errors"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/types"
        ctrl "sigs.k8s.io/controller-runtime"
        "sigs.k8s.io/controller-runtime/pkg/log"
        cachev1alpha1 "my.domain/keyvaluestore-operator/api/v1alpha1"
    )
    
    func (r *KeyValueStoreReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    	logger := log.FromContext(ctx)
    
    	// 1. Fetch the KeyValueStore instance
    	instance := &cachev1alpha1.KeyValueStore{}
    	err := r.Get(ctx, req.NamespacedName, instance)
    	if err != nil {
    		if errors.IsNotFound(err) {
    			logger.Info("KeyValueStore resource not found. Ignoring since object must be deleted.")
    			return ctrl.Result{}, nil
    		}
    		logger.Error(err, "Failed to get KeyValueStore")
    		return ctrl.Result{}, err
    	}
    
    	// 2. Check if the StatefulSet already exists, if not create a new one
    	found := &appsv1.StatefulSet{}
    	err = r.Get(ctx, types.NamespacedName{Name: instance.Name, Namespace: instance.Namespace}, found)
    	if err != nil && errors.IsNotFound(err) {
    		// Define a new statefulset
    		sts := r.statefulSetForKeyValueStore(instance)
    		logger.Info("Creating a new StatefulSet", "StatefulSet.Namespace", sts.Namespace, "StatefulSet.Name", sts.Name)
    		err = r.Create(ctx, sts)
    		if err != nil {
    			logger.Error(err, "Failed to create new StatefulSet", "StatefulSet.Namespace", sts.Namespace, "StatefulSet.Name", sts.Name)
    			return ctrl.Result{}, err
    		}
    		// StatefulSet created successfully - return and requeue
    		return ctrl.Result{Requeue: true}, nil
    	} else if err != nil {
    		logger.Error(err, "Failed to get StatefulSet")
    		return ctrl.Result{}, err
    	}
    
    	// 3. Ensure the StatefulSet spec is up to date
    	// Default replicas to 3 if not set
    	desiredReplicas := int32(3)
    	if instance.Spec.Replicas != nil {
    		desiredReplicas = *instance.Spec.Replicas
    	}
    
    	if *found.Spec.Replicas != desiredReplicas || found.Spec.Template.Spec.Containers[0].Image != instance.Spec.Image {
    		found.Spec.Replicas = &desiredReplicas
    		found.Spec.Template.Spec.Containers[0].Image = instance.Spec.Image
    		err = r.Update(ctx, found)
    		if err != nil {
    			logger.Error(err, "Failed to update StatefulSet", "StatefulSet.Namespace", found.Namespace, "StatefulSet.Name", found.Name)
    			return ctrl.Result{}, err
    		}
    		// Spec updated - return and requeue
    		return ctrl.Result{Requeue: true}, nil
    	}
    
    	// 4. Update the KeyValueStore status with the current state
    	// This is a simplified status update. A production operator would have more complex logic.
    	instance.Status.ReadyReplicas = found.Status.ReadyReplicas
    	// ... more status updates here ...
    	err = r.Status().Update(ctx, instance)
    	if err != nil {
    		logger.Error(err, "Failed to update KeyValueStore status")
    		return ctrl.Result{}, err
    	}
    
    	return ctrl.Result{}, nil
    }
    
    // statefulSetForKeyValueStore returns a KeyValueStore StatefulSet object
    func (r *KeyValueStoreReconciler) statefulSetForKeyValueStore(kvs *cachev1alpha1.KeyValueStore) *appsv1.StatefulSet {
    	// ... implementation to create a StatefulSet object ...
        // This should include labels, owner references, container spec, volume claims etc.
        // Setting an OwnerReference is CRITICAL for garbage collection.
        sts := &appsv1.StatefulSet{
            // ... metadata and spec
        }
        ctrl.SetControllerReference(kvs, sts, r.Scheme)
        return sts
    }
    
    // SetupWithManager sets up the controller with the Manager.
    func (r *KeyValueStoreReconciler) SetupWithManager(mgr ctrl.Manager) error {
    	return ctrl.NewControllerManagedBy(mgr).
    		For(&cachev1alpha1.KeyValueStore{}).
    		Owns(&appsv1.StatefulSet{}). // This is key! It watches StatefulSets and enqueues the owner.
    		Complete(r)
    }
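
    The statefulSetForKeyValueStore helper is elided above for brevity. As a reference point, here is a minimal sketch of what its body might contain; the labels, container name, and the omission of volumeClaimTemplates, probes, and resource requests are simplifying assumptions, not the definitive implementation.

    go
    // statefulSetForKeyValueStore (sketch): labels and container name are
    // illustrative choices; a production version would also define
    // volumeClaimTemplates, probes, and resource requests.
    func (r *KeyValueStoreReconciler) statefulSetForKeyValueStore(kvs *cachev1alpha1.KeyValueStore) *appsv1.StatefulSet {
    	labels := map[string]string{"app": "keyvaluestore", "keyvaluestore-cr": kvs.Name}

    	replicas := int32(3) // same default the reconcile loop applies
    	if kvs.Spec.Replicas != nil {
    		replicas = *kvs.Spec.Replicas
    	}

    	sts := &appsv1.StatefulSet{
    		ObjectMeta: metav1.ObjectMeta{
    			Name:      kvs.Name,
    			Namespace: kvs.Namespace,
    			Labels:    labels,
    		},
    		Spec: appsv1.StatefulSetSpec{
    			ServiceName: kvs.Name, // assumes a matching headless Service is created elsewhere
    			Replicas:    &replicas,
    			Selector:    &metav1.LabelSelector{MatchLabels: labels},
    			Template: corev1.PodTemplateSpec{
    				ObjectMeta: metav1.ObjectMeta{Labels: labels},
    				Spec: corev1.PodSpec{
    					Containers: []corev1.Container{{
    						Name:  "keyvaluestore",
    						Image: kvs.Spec.Image,
    						Ports: []corev1.ContainerPort{{
    							Name:          "client",
    							ContainerPort: kvs.Spec.Port,
    						}},
    					}},
    				},
    			},
    		},
    	}

    	// The owner reference ties the StatefulSet's lifecycle to the
    	// KeyValueStore; error handling omitted here for brevity.
    	_ = ctrl.SetControllerReference(kvs, sts, r.Scheme)
    	return sts
    }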

    This basic loop works for creation and updates. But when you run kubectl delete keyvaluestore my-store, the OwnerReference ensures the StatefulSet is deleted, but our operator has no chance to intervene.


    Section 3: Production Pattern: Graceful Deletion with Finalizers

    A finalizer is a key in an object's metadata that signals to the controller that there are pre-delete operations to perform. When you ask the API server to delete an object with finalizers, it does not delete it immediately. Instead, it sets the deletionTimestamp field on the object and returns. The object remains visible via the API. It is now the responsibility of the controller managing that finalizer to perform its cleanup tasks and then remove the finalizer from the object. Once the finalizer list is empty, the API server permanently deletes the object.
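
    You can observe this state directly: while the finalizer is still present, the object lingers with a populated deletionTimestamp. For example (the resource name is illustrative):

    bash
    # Inspect an object that appears stuck in "Terminating"
    kubectl get keyvaluestore my-store \
      -o jsonpath='{.metadata.deletionTimestamp}{"\n"}{.metadata.finalizers}{"\n"}'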

    Let's implement this. We will add a finalizer to our KeyValueStore resource. When a user requests deletion, our operator will detect the deletionTimestamp, perform a simulated backup, and only then remove the finalizer to allow deletion to complete.

    Step 1: Define the Finalizer Name

    It's good practice to use a domain-scoped name.

    controllers/keyvaluestore_controller.go

    go
    const keyValueStoreFinalizer = "cache.my.domain/finalizer"

    Step 2: Modify the Reconciliation Loop

    We need to restructure our Reconcile function to handle the deletion lifecycle.

    go
    func (r *KeyValueStoreReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    	logger := log.FromContext(ctx)
    
    	instance := &cachev1alpha1.KeyValueStore{}
    	err := r.Get(ctx, req.NamespacedName, instance)
    	// ... error handling for not found ...
    
    	// Check if the instance is being deleted
    	isMarkedForDeletion := instance.GetDeletionTimestamp() != nil
    	if isMarkedForDeletion {
    		if containsString(instance.GetFinalizers(), keyValueStoreFinalizer) {
    			// Run our finalizer logic
    			if err := r.finalizeKeyValueStore(ctx, instance); err != nil {
    				// Don't remove the finalizer if the finalization logic fails.
    				// The reconciliation will be retried.
    				return ctrl.Result{}, err
    			}
    
    			// Remove finalizer. Once all finalizers are removed, the object will be deleted.
    			instance.SetFinalizers(removeString(instance.GetFinalizers(), keyValueStoreFinalizer))
    			err := r.Update(ctx, instance)
    			if err != nil {
    				return ctrl.Result{}, err
    			}
    		}
    		return ctrl.Result{}, nil
    	}
    
    	// Add finalizer for this CR if it doesn't exist
    	if !containsString(instance.GetFinalizers(), keyValueStoreFinalizer) {
    		logger.Info("Adding finalizer for the KeyValueStore")
    		instance.SetFinalizers(append(instance.GetFinalizers(), keyValueStoreFinalizer))
    		err = r.Update(ctx, instance)
    		if err != nil {
    			return ctrl.Result{}, err
    		}
    	}
    
    	// ... Rest of the reconciliation logic (create/update StatefulSet) goes here ...
        // This part is unchanged from the previous section.
    
    	return ctrl.Result{}, nil
    }
    
    // finalizeKeyValueStore performs the cleanup logic before object deletion.
    func (r *KeyValueStoreReconciler) finalizeKeyValueStore(ctx context.Context, kvs *cachev1alpha1.KeyValueStore) error {
    	logger := log.FromContext(ctx)
    	logger.Info("Performing finalization tasks for KeyValueStore", "name", kvs.Name)
    
    	// In a real-world scenario, you would implement your pre-delete logic here.
    	// For example:
    	// 1. Call an external API to trigger a data backup.
    	// 2. Scale down the StatefulSet to 0 to gracefully terminate pods.
    	// 3. Wait for the backup to complete.
    
    	// This is a placeholder for that logic.
    	logger.Info("Simulating a call to an external backup service...")
    	// time.Sleep(15 * time.Second) // Simulate a long-running task
    	logger.Info("Backup simulation complete. Finalization successful.")
    
    	return nil
    }
    
    // Helper functions for finalizer string slice manipulation
    func containsString(slice []string, s string) bool {
    	for _, item := range slice {
    		if item == s {
    			return true
    		}
    	}
    	return false
    }
    
    func removeString(slice []string, s string) []string {
    	result := []string{}
    	for _, item := range slice {
    		if item == s {
    			continue
    		}
    		result = append(result, item)
    	}
    	return result
    }
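
    As an aside, controller-runtime already ships finalizer helpers in its controllerutil package, so the hand-rolled containsString/removeString functions above are optional. A sketch of the same flow using them:

    go
    import "sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"

    // Excerpt from Reconcile using the controllerutil helpers instead of the
    // hand-rolled string-slice functions.
    if instance.GetDeletionTimestamp() != nil {
    	if controllerutil.ContainsFinalizer(instance, keyValueStoreFinalizer) {
    		if err := r.finalizeKeyValueStore(ctx, instance); err != nil {
    			return ctrl.Result{}, err
    		}
    		controllerutil.RemoveFinalizer(instance, keyValueStoreFinalizer)
    		if err := r.Update(ctx, instance); err != nil {
    			return ctrl.Result{}, err
    		}
    	}
    	return ctrl.Result{}, nil
    }

    if !controllerutil.ContainsFinalizer(instance, keyValueStoreFinalizer) {
    	controllerutil.AddFinalizer(instance, keyValueStoreFinalizer)
    	if err := r.Update(ctx, instance); err != nil {
    		return ctrl.Result{}, err
    	}
    }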

    Now, the lifecycle is robust:

  • Creation: The operator adds the finalizer. The object cannot be deleted without the operator's consent.
  • Deletion Request: A user runs kubectl delete. The API server sets deletionTimestamp.
  • Reconciliation (Deletion): The operator sees the timestamp, executes finalizeKeyValueStore, and upon success, removes its finalizer.
  • Actual Deletion: The API server sees the finalizer list is empty and deletes the object. The OwnerReference on the StatefulSet then triggers its garbage collection.


    Section 4: High Availability with Leader Election

    In a production environment, your operator is a critical control plane component. It cannot be a single point of failure. You must run multiple replicas of your operator pod. But this introduces a new problem: if multiple pods are running the same controller logic, they will all try to reconcile the same KeyValueStore object simultaneously. This leads to:

  • Race Conditions: Multiple controllers fighting to update the same StatefulSet or KeyValueStore status.
  • API Server Throttling: Each operator instance issues GET, UPDATE, and CREATE calls, overwhelming the Kubernetes API server.

    Leader election solves this. Only one instance of the operator, the "leader," is active at any given time. The other instances remain on standby, ready to take over if the leader fails. Kubebuilder's controller-runtime provides this functionality out of the box.

    If you inspect main.go, you'll find the configuration:

    main.go

    go
    // ...
    mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
        Scheme:                 scheme,
        MetricsBindAddress:     metricsAddr,
        Port:                   9443,
        HealthProbeBindAddress: probeAddr,
        LeaderElection:         enableLeaderElection, // <-- bound to the --leader-elect flag
        LeaderElectionID:       "f8f9d78d.my.domain",
    })
    // ...

    In the scaffolded main.go, LeaderElection is bound to the --leader-elect flag, which defaults to false, so you must enable it (via the flag in your Deployment args, or by hard-coding true) before running multiple replicas. Once enabled, it works by creating a Lease object in the cluster (older versions used ConfigMap- or Endpoints-based locks). All operator instances attempt to acquire a lock on this Lease; only one succeeds and becomes the leader. The leader continuously renews its lease, and if it fails to do so (e.g., the pod crashes or becomes unresponsive), another instance acquires the lock and becomes the new leader.
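
    You can watch the election from outside the operator by inspecting that Lease. Its name matches LeaderElectionID; the namespace below assumes the default kubebuilder deployment layout (<project>-system):

    bash
    # Show which pod currently holds the leader lock
    kubectl -n keyvaluestore-operator-system get lease f8f9d78d.my.domain \
      -o jsonpath='{.spec.holderIdentity}{"\n"}'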

    Advanced Tuning of Leader Election

    For most use cases, the defaults are fine. But in high-stakes environments, you may need to tune the parameters to balance responsiveness against API server load. These are configured in ctrl.Options:

  • LeaseDuration: The duration that non-leader candidates will wait to force acquire leadership. Default is 15s.
  • RenewDeadline: The duration that the leader will retry refreshing its lease before giving up. Default is 10s.
  • RetryPeriod: The duration the clients should wait between attempting to acquire or renew a lease. Default is 2s.

    The key relationship is LeaseDuration > RenewDeadline, with RenewDeadline comfortably larger than RetryPeriod. After a leader dies without releasing the lock, a standby can take up to roughly LeaseDuration to force-acquire it, so LeaseDuration is effectively your worst-case failover latency.

  • Aggressive Failover (e.g., for a critical database operator): You might reduce these values. A new leader is elected faster, but all operator pods (leader and standbys) talk to the API server more frequently to check and renew the lease.

    go
    // Note: ctrl.Options takes *time.Duration for these fields.
    leaseDuration := 10 * time.Second
    renewDeadline := 7 * time.Second
    retryPeriod := 2 * time.Second

    mgr, err := ctrl.NewManager(config, ctrl.Options{
        // ...
        LeaderElection:   true,
        LeaderElectionID: "critical-db.my.domain",
        LeaseDuration:    &leaseDuration,
        RenewDeadline:    &renewDeadline,
        RetryPeriod:      &retryPeriod,
    })
  • Less Aggressive (e.g., for a batch-job operator where a brief leadership gap is acceptable): You can increase these values to reduce chatter with the API server.

    go
    leaseDuration := 60 * time.Second
    renewDeadline := 40 * time.Second
    retryPeriod := 5 * time.Second

    mgr, err := ctrl.NewManager(config, ctrl.Options{
        // ...
        LeaderElection:   true,
        LeaderElectionID: "batch-job.my.domain",
        LeaseDuration:    &leaseDuration,
        RenewDeadline:    &renewDeadline,
        RetryPeriod:      &retryPeriod,
    })

    Choosing the right values is a trade-off between failover latency and control plane load.


    Section 5: Edge Cases and Performance Considerations

    Production systems are defined by how they handle failure. Here are some advanced considerations for your reconciliation loop.

    1. Reconcile Loop Backoff

    What if a transient error occurs (e.g., the API server is temporarily unavailable, or a webhook denies an update)? The default controller-runtime behavior is to requeue the request with an exponential backoff. If you return ctrl.Result{}, err, this happens automatically. However, sometimes you want to control this explicitly. For example, if your finalizer calls an external API that is rate-limited, you should not retry immediately.

    go
    // Inside the reconcile loop
    if err := r.callExternalRateLimitedAPI(); err != nil {
        if isRateLimitError(err) {
            logger.Info("External API is rate-limiting us. Requeuing after a delay.")
            // Return a specific requeue time to avoid hammering the service.
            return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
        }
        return ctrl.Result{}, err // Let controller-runtime handle other errors with exponential backoff
    }

    2. Idempotency and Partial Failures

    The reconciliation loop MUST be idempotent. This means running it multiple times with the same input (KeyValueStore state) must produce the same result. Consider a partial failure: the operator successfully updates the StatefulSet's image, but then fails to update the KeyValueStore's Status field before returning. On the next reconciliation, the operator must not fail. It should observe that the StatefulSet is already in the desired state and proceed directly to the status update, effectively healing itself.

    This is why we always read before we write. We fetch the current state of the StatefulSet (r.Get(...)) and compare it to the desired state. We only issue an Update call if there's a delta.
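
    As a small, concrete illustration, even the status write can be guarded by a comparison so that a retried reconciliation performs no redundant API calls (a sketch reusing the instance and found objects from the earlier excerpt):

    go
    // Only write status when the observed value actually changed; re-running
    // this after a partial failure simply converges with no extra writes.
    if instance.Status.ReadyReplicas != found.Status.ReadyReplicas {
    	instance.Status.ReadyReplicas = found.Status.ReadyReplicas
    	if err := r.Status().Update(ctx, instance); err != nil {
    		return ctrl.Result{}, err
    	}
    }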

    3. Controller Concurrency

    By default, a controller reconciles one object at a time. If you have thousands of KeyValueStore instances, a long reconciliation for one (e.g., waiting for a backup) will block all others. You can increase concurrency via the MaxConcurrentReconciles option.

    controllers/keyvaluestore_controller.go (in SetupWithManager)

    go
    // Note: controller.Options comes from "sigs.k8s.io/controller-runtime/pkg/controller".
    func (r *KeyValueStoreReconciler) SetupWithManager(mgr ctrl.Manager) error {
    	return ctrl.NewControllerManagedBy(mgr).
    		For(&cachev1alpha1.KeyValueStore{}).
    		Owns(&appsv1.StatefulSet{}).
    		WithOptions(controller.Options{MaxConcurrentReconciles: 5}). // <-- Here
    		Complete(r)
    }

    This allows the controller to run up to 5 Reconcile invocations in parallel goroutines. Caution: this is a powerful tool, but it requires your reconciliation logic to be thread-safe. The workqueue guarantees that a given object (req.NamespacedName) is never reconciled by more than one worker at a time, so per-object races are not a concern; be mindful, however, of any shared clients, caches, or mutable state on your Reconciler struct.

    Conclusion

    We have moved far beyond a basic operator. By implementing Finalizers, we have given our operator the power to manage the entire lifecycle of a stateful application, including performing critical pre-deletion tasks that prevent data loss. By understanding and tuning Leader Election, we have transformed our operator from a single point of failure into a highly available, fault-tolerant component of our production control plane.

    These patterns—graceful shutdown via finalizers, high availability via leader election, and robust error handling with idempotent reconciliation—are the hallmarks of a production-grade Kubernetes operator. They provide the safety, reliability, and predictability required to automate the management of your most critical stateful services.
