Advanced Go Operator: Managing StatefulSet Failovers with Finalizers

Goh Ling Yong

Beyond the Basic Reconciler: The StatefulSet Update Problem

For senior engineers managing stateful applications on Kubernetes, the limitations of the default StatefulSet rolling update strategy are a common source of production anxiety. While a StatefulSet provides stable network identifiers and persistent storage, its OnDelete or RollingUpdate strategies operate with a critical blind spot: they only understand Kubernetes primitives like pod readiness probes. They have no intrinsic knowledge of your application's internal state, such as cluster quorum, data replication status, or active client connections.

Consider a three-node Raft-based distributed database like etcd or a message queue like Kafka. A standard rolling update will terminate pod-2, wait for pod-2 (new version) to become ready, and then proceed to pod-1. This process can be catastrophic if:

  • Quorum is lost: Terminating a master node before a new one is elected and synchronized can bring the entire cluster down.
  • Data is not fully replicated: If pod-2 holds primary shards, the StatefulSet controller has no way of knowing if they have been successfully handed over to other replicas before termination.
  • Connection draining is incomplete: In-flight transactions or long-lived client connections can be abruptly cut, leading to data inconsistency or client-side errors.

To solve this, we must build an intelligence layer that understands the application's state. This is the perfect use case for a custom Kubernetes Operator. However, a naive operator that simply updates the StatefulSet's pod template spec is no better than kubectl apply. The real solution lies in taking direct control of the pod lifecycle during an update. This article demonstrates an advanced pattern: using pod-level finalizers to pause the StatefulSet's update process, allowing our operator to perform critical application-aware checks before permitting pod termination.


    Architectural Overview: The Operator and its Custom Resource

    Our goal is to manage a DistributedCache application. We will define a Custom Resource Definition (CRD) named ClusteredCache that represents a single instance of our application. The operator will watch ClusteredCache resources and orchestrate the underlying StatefulSet and its pods.

    Our operator's core responsibility during an upgrade is to execute this sequence for each pod, in reverse ordinal order (from n-1 down to 0):

  • Detect Spec Change: The operator detects a change in the ClusteredCache spec (e.g., a new container image).
  • Initiate Controlled Update: Instead of immediately updating the StatefulSet template, the operator begins a controlled, pod-by-pod update.
  • Seize Control with a Finalizer: The operator adds a custom finalizer (e.g., cache.my.domain/pre-terminate-hook) to the target pod (e.g., my-cache-2).
  • Trigger Termination: The operator now updates the StatefulSet template. The StatefulSet controller sees the change and attempts to delete my-cache-2 to replace it. The deletion is blocked because our finalizer is present, and the pod enters the Terminating state.
  • Perform Application-Aware Checks: The operator's reconciler, seeing the pod is terminating, executes its custom logic. This could involve calling a /ready-for-shutdown endpoint on the other pods, checking a metrics endpoint to ensure replication lag is zero, or verifying that cluster leadership has been transferred.
  • Release Control: Once all checks pass, the operator removes the finalizer from the pod.
  • Complete Termination: With the finalizer gone, the Kubernetes API server allows the pod deletion to complete.
  • Wait for New Pod: The StatefulSet controller creates the new version of the pod. The operator waits for it to become fully Ready.
  • Repeat: The process repeats for the next pod (my-cache-1).

    This pattern gives our operator ultimate authority over the update process, ensuring application-level safety that Kubernetes alone cannot provide.

    Step 1: Defining the `ClusteredCache` CRD

    Everything starts with a well-defined API. We'll use kubebuilder to scaffold our project and define the ClusteredCache type. The Spec will contain the desired state, and the Status will reflect the observed state, which is crucial for idempotent reconciliation.

    File: api/v1/clusteredcache_types.go

    go
    package v1
    
    import (
    	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )
    
    // ClusteredCacheSpec defines the desired state of ClusteredCache
    type ClusteredCacheSpec struct {
    	// Replicas is the desired number of instances.
    	// +kubebuilder:validation:Minimum=1
    	Replicas *int32 `json:"replicas"`
    
    	// Image is the container image to run for the cache pods.
    	Image string `json:"image"`
    }
    
    // ClusteredCacheStatus defines the observed state of ClusteredCache
    type ClusteredCacheStatus struct {
    	// Conditions represent the latest available observations of the ClusteredCache's state.
    	// +optional
    	// +patchMergeKey=type
    	// +patchStrategy=merge
    	Conditions []metav1.Condition `json:"conditions,omitempty" patchStrategy:"merge" patchMergeKey:"type"`
    
    	// ReadyReplicas is the number of pods created by the StatefulSet that have a Ready condition.
    	ReadyReplicas int32 `json:"readyReplicas,omitempty"`
    
    	// CurrentVersion is the version of the image currently deployed.
    	CurrentVersion string `json:"currentVersion,omitempty"`
    
    	// TargetVersion is the version of the image being updated to.
    	TargetVersion string `json:"targetVersion,omitempty"`
    }
    
    //+kubebuilder:object:root=true
    //+kubebuilder:subresource:status
    //+kubebuilder:printcolumn:name="Replicas",type="integer",JSONPath=".spec.replicas"
    //+kubebuilder:printcolumn:name="Ready",type="string",JSONPath=".status.readyReplicas"
    //+kubebuilder:printcolumn:name="Version",type="string",JSONPath=".status.currentVersion"
    //+kubebuilder:printcolumn:name="Status",type="string",JSONPath=".status.conditions[?(@.type==\"Available\")].reason"
    
    // ClusteredCache is the Schema for the clusteredcaches API
    type ClusteredCache struct {
    	metav1.TypeMeta   `json:",inline"`
    	metav1.ObjectMeta `json:"metadata,omitempty"`
    
    	Spec   ClusteredCacheSpec   `json:"spec,omitempty"`
    	Status ClusteredCacheStatus `json:"status,omitempty"`
    }
    
    //+kubebuilder:object:root=true
    
    // ClusteredCacheList contains a list of ClusteredCache
    type ClusteredCacheList struct {
    	metav1.TypeMeta `json:",inline"`
    	metav1.ListMeta `json:"metadata,omitempty"`
    	Items           []ClusteredCache `json:"items"`
    }
    
    func init() {
    	SchemeBuilder.Register(&ClusteredCache{}, &ClusteredCacheList{})
    }

    Key elements here:

    * +kubebuilder:subresource:status: This is critical. It tells controller-gen to generate the CRD with the status subresource enabled, so spec and status writes go through separate API endpoints and our operator's status updates cannot conflict with a user editing the spec.

    * Conditions: We use the standard metav1.Condition type to provide detailed, machine-readable status updates. This is a best practice for modern operators.

    * CurrentVersion and TargetVersion: These fields in the Status allow us to track the progress of an upgrade, making our reconciler more robust and idempotent.

    After defining the types, run make manifests and make install to apply the CRD to your cluster.
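
    Because the reconciler will set conditions from several places (upgrade start, upgrade failure, steady state), it helps to funnel them through the condition helpers in k8s.io/apimachinery/pkg/api/meta rather than appending to the Conditions slice by hand. A minimal sketch of such a helper, assuming it lives in the controller package; the module path and the "Available" condition type are illustrative:

    go
    import (
    	"k8s.io/apimachinery/pkg/api/meta"
    	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

    	cachev1 "my.domain/clusteredcache/api/v1" // module path is illustrative
    )

    // setAvailableCondition records an "Available" condition on the status.
    // meta.SetStatusCondition de-duplicates by Type and only bumps
    // LastTransitionTime when the Status value actually changes, so calling
    // this on every reconcile does not churn the object.
    func setAvailableCondition(cache *cachev1.ClusteredCache, status metav1.ConditionStatus, reason, message string) {
    	meta.SetStatusCondition(&cache.Status.Conditions, metav1.Condition{
    		Type:    "Available",
    		Status:  status,
    		Reason:  reason,
    		Message: message,
    	})
    }

    The Status printer column defined above reads the Reason of this condition, so kubectl get clusteredcaches can surface values like AllReplicasReady or UpgradeFailed at a glance.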

    Step 2: The Core Reconciler Logic with Finalizers

    Now we implement the heart of the operator. The Reconcile method in internal/controller/clusteredcache_controller.go is where our logic will live. We'll break it down into the main reconciliation flow and the specialized upgrade logic.

    First, let's define our finalizer constant:

    go
    const (
    	cachePodFinalizer = "cache.my.domain/pre-terminate-hook"
    )
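
    Because the operator mutates pod metadata directly to add and remove this finalizer, it also needs RBAC on pods, not just on the CR and the StatefulSet. A sketch of the kubebuilder RBAC markers that would sit above the Reconcile method, following this article's cache.my.domain group:

    go
    //+kubebuilder:rbac:groups=cache.my.domain,resources=clusteredcaches,verbs=get;list;watch;create;update;patch;delete
    //+kubebuilder:rbac:groups=cache.my.domain,resources=clusteredcaches/status,verbs=get;update;patch
    //+kubebuilder:rbac:groups=apps,resources=statefulsets,verbs=get;list;watch;create;update;patch
    //+kubebuilder:rbac:groups="",resources=pods,verbs=get;list;watch;update;patch

    Re-run make manifests after editing these so the generated ClusterRole picks up the pod permissions; without update on pods, every finalizer change will be rejected by the API server.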

    The main Reconcile function acts as a dispatcher.

    File: internal/controller/clusteredcache_controller.go (Reconcile method)

    go
    func (r *ClusteredCacheReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    	log := log.FromContext(ctx)
    
    	// 1. Fetch the ClusteredCache instance
    	cache := &cachev1.ClusteredCache{}
    	if err := r.Get(ctx, req.NamespacedName, cache); err != nil {
    		return ctrl.Result{}, client.IgnoreNotFound(err)
    	}
    
    	// 2. Handle deletion: Right now, we don't have external resources, so we just remove our own finalizer.
    	// A real-world operator might need to deprovision cloud resources here.
    	if !cache.ObjectMeta.DeletionTimestamp.IsZero() {
    		// The object is being deleted
    		// Our logic doesn't require a finalizer on the CR itself, but if it did, we'd handle it here.
    		return ctrl.Result{}, nil
    	}
    
    	// 3. Reconcile the StatefulSet
    	sts := &appsv1.StatefulSet{}
    	err := r.Get(ctx, types.NamespacedName{Name: cache.Name, Namespace: cache.Namespace}, sts)
    	if err != nil && errors.IsNotFound(err) {
    		log.Info("Creating a new StatefulSet")
    		sts, err = r.statefulSetForClusteredCache(cache)
    		if err != nil {
    			log.Error(err, "Failed to define new StatefulSet")
    			// ... update status with error condition ...
    			return ctrl.Result{}, err
    		}
    		if err = r.Create(ctx, sts); err != nil {
    			log.Error(err, "Failed to create new StatefulSet")
    			// ... update status with error condition ...
    			return ctrl.Result{}, err
    		}
    		// StatefulSet created, requeue to check status after a short delay
    		return ctrl.Result{RequeueAfter: time.Second * 5}, nil
    	} else if err != nil {
    		log.Error(err, "Failed to get StatefulSet")
    		return ctrl.Result{}, err
    	}
    
    	// 4. Check for an upgrade and orchestrate if necessary
    	if r.isUpgradeRequired(cache, sts) {
    		log.Info("Upgrade required. Initiating controlled rolling update.")
    		// This is our advanced logic
    		return r.reconcileUpgrade(ctx, cache, sts)
    	}
    
    	// 5. If not upgrading, ensure the StatefulSet is scaled correctly
    	if *cache.Spec.Replicas != *sts.Spec.Replicas {
    		log.Info("Scaling StatefulSet", "current", *sts.Spec.Replicas, "desired", *cache.Spec.Replicas)
    		sts.Spec.Replicas = cache.Spec.Replicas
    		if err := r.Update(ctx, sts); err != nil {
    			log.Error(err, "Failed to scale StatefulSet")
    			return ctrl.Result{}, err
    		}
    	}
    
    	// 6. Update the ClusteredCache status
    	return r.updateCacheStatus(ctx, cache, sts)
    }
    
    func (r *ClusteredCacheReconciler) isUpgradeRequired(cache *cachev1.ClusteredCache, sts *appsv1.StatefulSet) bool {
    	// Simple check: is the image in the spec different from the one in the StatefulSet?
    	return cache.Spec.Image != sts.Spec.Template.Spec.Containers[0].Image
    }

    This structure sets up a clear flow: fetch, check for deletion, ensure the child resource (StatefulSet) exists, and then delegate to specialized functions for upgrades or status updates.
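
    The statefulSetForClusteredCache helper referenced in step 3 is plain construction code, but two details matter for the rest of the pattern: the app label, which the health check later selects on, and the owner reference, which ties the StatefulSet back to the ClusteredCache. A trimmed-down sketch, assuming the kubebuilder-scaffolded reconciler exposes its Scheme field; the container name and headless Service name are placeholders, and probes and ports are omitted as deployment-specific details:

    go
    func (r *ClusteredCacheReconciler) statefulSetForClusteredCache(cache *cachev1.ClusteredCache) (*appsv1.StatefulSet, error) {
    	// This label is what performApplicationHealthCheck lists pods by later.
    	labels := map[string]string{"app": cache.Name}

    	sts := &appsv1.StatefulSet{
    		ObjectMeta: metav1.ObjectMeta{
    			Name:      cache.Name,
    			Namespace: cache.Namespace,
    		},
    		Spec: appsv1.StatefulSetSpec{
    			ServiceName: cache.Name, // assumes a matching headless Service exists
    			Replicas:    cache.Spec.Replicas,
    			Selector:    &metav1.LabelSelector{MatchLabels: labels},
    			Template: corev1.PodTemplateSpec{
    				ObjectMeta: metav1.ObjectMeta{Labels: labels},
    				Spec: corev1.PodSpec{
    					Containers: []corev1.Container{{
    						Name:  "cache",
    						Image: cache.Spec.Image,
    					}},
    				},
    			},
    		},
    	}

    	// The controller reference lets Kubernetes garbage-collect the StatefulSet
    	// with the ClusteredCache and lets Owns() watches route its events back
    	// to this reconciler.
    	if err := ctrl.SetControllerReference(cache, sts, r.Scheme); err != nil {
    		return nil, err
    	}
    	return sts, nil
    }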

    Step 3: Deep Dive - The `reconcileUpgrade` Implementation

    This is where the finalizer pattern comes to life. The reconcileUpgrade function is responsible for the entire orchestrated update process.

    go
    func (r *ClusteredCacheReconciler) reconcileUpgrade(ctx context.Context, cache *cachev1.ClusteredCache, sts *appsv1.StatefulSet) (ctrl.Result, error) {
    	log := log.FromContext(ctx)
    
    	// Set status to indicate an upgrade is in progress
    	// (Implementation of setUpgradeStatus omitted for brevity)
    	// This should update cache.Status.TargetVersion and add an "Upgrading" condition.
    
    	// We update pods in reverse ordinal order (e.g., for 3 replicas: pod-2, pod-1, pod-0)
    	for i := *sts.Spec.Replicas - 1; i >= 0; i-- {
    		pod := &corev1.Pod{}
    		podName := fmt.Sprintf("%s-%d", sts.Name, i)
    		err := r.Get(ctx, types.NamespacedName{Name: podName, Namespace: sts.Namespace}, pod)
    		if err != nil {
    			if errors.IsNotFound(err) {
    				// The old pod is gone but the StatefulSet controller has not
    				// recreated it yet; wait for the replacement to appear.
    				log.Info("Pod not found, waiting for recreation", "PodName", podName)
    				return ctrl.Result{RequeueAfter: 5 * time.Second}, nil
    			}
    			log.Error(err, "Failed to get pod for upgrade check", "PodName", podName)
    			return ctrl.Result{}, err
    		}

    		// Check that this pod is already updated AND Ready before moving on
    		if pod.Spec.Containers[0].Image == cache.Spec.Image {
    			if !isPodReady(pod) {
    				log.Info("Updated pod is not Ready yet, waiting", "PodName", podName)
    				return ctrl.Result{RequeueAfter: 5 * time.Second}, nil
    			}
    			log.V(1).Info("Pod is already updated and Ready, skipping", "PodName", podName)
    			continue
    		}
    
    		// This is the pod we need to update next.
    		log.Info("Processing pod for update", "PodName", podName)
    
    		// If the pod is not yet terminating, we need to start the process.
    		if pod.ObjectMeta.DeletionTimestamp.IsZero() {
    			// 1. Add our finalizer to the pod to block deletion
    			if !controllerutil.ContainsFinalizer(pod, cachePodFinalizer) {
    				log.Info("Adding finalizer to pod", "PodName", podName)
    				controllerutil.AddFinalizer(pod, cachePodFinalizer)
    				if err := r.Update(ctx, pod); err != nil {
    					log.Error(err, "Failed to add finalizer to pod", "PodName", podName)
    					return ctrl.Result{}, err
    				}
    			}
    
    			// 2. Update the StatefulSet template. This will trigger the deletion of the pod.
    			log.Info("Updating StatefulSet template to trigger pod replacement")
    			sts.Spec.Template.Spec.Containers[0].Image = cache.Spec.Image
    			if err := r.Update(ctx, sts); err != nil {
    				log.Error(err, "Failed to update StatefulSet for upgrade")
    				return ctrl.Result{}, err
    			}
    
    			// We've initiated the termination. Requeue to handle the next state.
    			return ctrl.Result{Requeue: true}, nil
    		}
    
    		// If we reach here, the pod IS terminating (DeletionTimestamp is not zero)
    		// and our finalizer should be on it.
    		if controllerutil.ContainsFinalizer(pod, cachePodFinalizer) {
    			log.Info("Pod is terminating, performing application health checks", "PodName", podName)
    
    			// 3. Perform application-specific pre-termination checks
    			healthy, err := r.performApplicationHealthCheck(ctx, cache, pod)
    			if err != nil {
    				log.Error(err, "Application health check failed", "PodName", podName)
    				// Requeue with a backoff to try again later
    				return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
    			}
    
    			if !healthy {
    				log.Info("Application is not ready for pod termination, waiting...", "PodName", podName)
    				// Requeue to check again
    				return ctrl.Result{RequeueAfter: 15 * time.Second}, nil
    			}
    
    			// 4. Health checks passed. Remove the finalizer to allow deletion.
    			log.Info("Application health checks passed. Removing finalizer.", "PodName", podName)
    			controllerutil.RemoveFinalizer(pod, cachePodFinalizer)
    			if err := r.Update(ctx, pod); err != nil {
    				log.Error(err, "Failed to remove finalizer from pod", "PodName", podName)
    				return ctrl.Result{}, err
    			}
    		}
    
    		// The pod is being handled. We must wait for it to be fully gone and the new one to be ready
    		// before proceeding to the next pod in the loop. So we requeue immediately.
    		log.Info("Waiting for pod to be replaced", "PodName", podName)
    		return ctrl.Result{RequeueAfter: 5 * time.Second}, nil
    	}
    
    	// If the loop completes, all pods are updated.
    	log.Info("All pods have been updated successfully.")
    	// Reset the StatefulSet's update strategy if we changed it (optional)
    	// Update status to reflect completion
        // ... update status with success condition ...
    	return ctrl.Result{}, nil
    }

    The logic is sequential and idempotent. At any point, if the operator crashes and restarts, it will re-evaluate the state of the cluster and pick up exactly where it left off. For example, if it crashes after adding the finalizer but before updating the StatefulSet, on the next reconcile it will see the finalizer is present, skip adding it, and proceed to update the StatefulSet.

    The Application Health Check

    The performApplicationHealthCheck function is where your domain-specific knowledge is encoded. This is a placeholder for what would be a robust implementation.

    go
    // performApplicationHealthCheck checks if the rest of the cluster is healthy enough
    // to tolerate the termination of the given pod.
    func (r *ClusteredCacheReconciler) performApplicationHealthCheck(ctx context.Context, cache *cachev1.ClusteredCache, podToTerminate *corev1.Pod) (bool, error) {
    	log := log.FromContext(ctx)
    	log.Info("Executing health check for cache cluster", "TerminatingPod", podToTerminate.Name)
    
    	// Get all pods for this ClusteredCache instance
    	podList := &corev1.PodList{}
    	listOpts := []client.ListOption{
    		client.InNamespace(cache.Namespace),
    		client.MatchingLabels{"app": cache.Name}, // Assuming we set this label
    	}
    	if err := r.List(ctx, podList, listOpts...); err != nil {
    		return false, fmt.Errorf("failed to list pods for health check: %w", err)
    	}
    
    	// Check the health of all OTHER pods in the cluster
    	for _, pod := range podList.Items {
    		// Skip the pod that is already terminating
    		if pod.Name == podToTerminate.Name {
    			continue
    		}
    
    		// Example Check 1: Is the pod running and ready?
    		if !isPodReady(&pod) {
    			log.Info("Health check failed: A peer pod is not ready", "PeerPod", pod.Name)
    			return false, nil
    		}
    
    		// Example Check 2: Query an application-specific endpoint
    		// This requires setting up a way to communicate with the pod, e.g., via a client.
    		// For a real implementation, you'd use a proper HTTP client with timeouts and retries.
    		/*
    			clusterClient := NewClusterClientForPod(pod)
    			isSafe, err := clusterClient.IsReadyForFailover()
    			if err != nil {
    				log.Error(err, "Failed to query pod's failover status endpoint", "PeerPod", pod.Name)
    				return false, err // Return error to trigger backoff
    			}
    			if !isSafe {
    				log.Info("Health check failed: Peer pod reports it is not safe for failover", "PeerPod", pod.Name)
    				return false, nil // Return false to retry soon
    			}
    		*/
    	}
    
    	// If we get here, all other pods are ready and report it's safe to proceed.
    	log.Info("Health check passed!")
    	return true, nil
    }
    
    func isPodReady(pod *corev1.Pod) bool {
    	for _, cond := range pod.Status.Conditions {
    		if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionTrue {
    			return true
    		}
    	}
    	return false
    }
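
    For the commented-out application-specific check, one workable approach is a short-lived HTTP call against each peer's pod IP, which the operator can reach when it runs in-cluster. The sketch below is illustrative: the port (8080), the /ready-for-shutdown path mentioned earlier, and the "200 means safe" convention are assumptions about the cache application, not part of any real API.

    go
    // checkPeerReadyForFailover asks a peer pod whether the cluster can tolerate
    // the upcoming termination. Port, path, and response contract are assumed.
    func checkPeerReadyForFailover(ctx context.Context, pod *corev1.Pod) (bool, error) {
    	if pod.Status.PodIP == "" {
    		return false, fmt.Errorf("pod %s has no IP yet", pod.Name)
    	}

    	url := fmt.Sprintf("http://%s:8080/ready-for-shutdown", pod.Status.PodIP)
    	reqCtx, cancel := context.WithTimeout(ctx, 5*time.Second)
    	defer cancel()

    	req, err := http.NewRequestWithContext(reqCtx, http.MethodGet, url, nil)
    	if err != nil {
    		return false, err
    	}

    	resp, err := http.DefaultClient.Do(req)
    	if err != nil {
    		// Returning the error lets the caller requeue with a backoff.
    		return false, err
    	}
    	defer resp.Body.Close()

    	// Assumed convention: 200 means this peer can absorb the failover.
    	return resp.StatusCode == http.StatusOK, nil
    }

    Inside performApplicationHealthCheck, an error from this call maps onto the 30-second backoff in reconcileUpgrade, while a clean "not yet" maps onto the shorter 15-second retry.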

    Advanced Edge Cases and Production Considerations

    This pattern is powerful, but in a production environment, several edge cases must be handled gracefully.

    1. Stuck Upgrades

    Problem: What if performApplicationHealthCheck never returns true? The upgrade will be permanently stuck on one pod.

    Solution: Implement a timeout mechanism. The Upgrading condition in the ClusteredCache status already carries a LastTransitionTime (every metav1.Condition does). In the reconcileUpgrade function, check whether the time since that transition exceeds a configured deadline, e.g., 15 minutes (a sketch follows this list). If it does, the operator should:

  • Mark the upgrade as failed: Update the ClusteredCache status with a condition like Type: Available, Status: False, Reason: UpgradeFailed, Message: Timed out waiting for health checks on pod my-cache-2.
  • Attempt a rollback: This is complex. The safest automatic action might be to stop proceeding with the upgrade. A more advanced operator could try to revert the StatefulSet template to the CurrentVersion from the status and remove the finalizer to let the old pod restart. This requires careful state management to avoid flapping.
  • Alert: Fire an alert to notify the SRE team that manual intervention is required.
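
    A minimal sketch of the deadline check, assuming the reconciler sets an Upgrading condition when it starts the controlled update; the condition type and the 15-minute deadline are illustrative:

    go
    // upgradeDeadlineExceeded reports whether the current upgrade has been in
    // flight longer than we are willing to wait for health checks.
    func upgradeDeadlineExceeded(cache *cachev1.ClusteredCache) bool {
    	cond := meta.FindStatusCondition(cache.Status.Conditions, "Upgrading")
    	if cond == nil || cond.Status != metav1.ConditionTrue {
    		return false
    	}
    	return time.Since(cond.LastTransitionTime.Time) > 15*time.Minute
    }

    reconcileUpgrade would call this before re-running the health checks and, if it returns true, switch to the failure handling described above instead of requeuing again.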

    2. Controller Idempotency and Status Management

    Problem: The Reconcile function can be triggered multiple times for the same state. Our logic must not perform duplicate actions, like trying to add a finalizer that already exists.

    Solution: The code shown already handles this well by checking for the existence of the finalizer before adding it (!controllerutil.ContainsFinalizer(...)). This principle must be applied everywhere. Before taking any action, read the current state from the cluster and compare it to the desired state. The status subresource is your best friend. Storing CurrentVersion and TargetVersion allows the operator to know if it's in the middle of an upgrade even if it just restarted.
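
    The updateCacheStatus helper called in step 6 of the reconciler is where this bookkeeping lives. A minimal sketch, reusing the setAvailableCondition helper sketched in Step 1 and deliberately deriving every status field from observed state rather than from what the operator intended to do:

    go
    func (r *ClusteredCacheReconciler) updateCacheStatus(ctx context.Context, cache *cachev1.ClusteredCache, sts *appsv1.StatefulSet) (ctrl.Result, error) {
    	cache.Status.ReadyReplicas = sts.Status.ReadyReplicas

    	if sts.Status.ReadyReplicas == *cache.Spec.Replicas {
    		// With every replica ready, the template image is what is actually
    		// running, so it is safe to record it as the current version.
    		cache.Status.CurrentVersion = sts.Spec.Template.Spec.Containers[0].Image
    		setAvailableCondition(cache, metav1.ConditionTrue, "AllReplicasReady", "All cache pods are ready")
    	} else {
    		setAvailableCondition(cache, metav1.ConditionFalse, "ReplicasNotReady",
    			fmt.Sprintf("%d of %d cache pods are ready", sts.Status.ReadyReplicas, *cache.Spec.Replicas))
    	}

    	// Writes go through the status subresource, so they cannot clobber
    	// spec changes made by users, and vice versa.
    	if err := r.Status().Update(ctx, cache); err != nil {
    		return ctrl.Result{}, err
    	}
    	return ctrl.Result{}, nil
    }

    If the status update hits a conflict because the cached object is stale, returning the error and letting the work item requeue is usually simpler than retrying in place.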

    3. Operator Crash and Restart

    Problem: The operator pod itself can be rescheduled or crash.

    Solution: This pattern is inherently resilient to operator failure. The state is stored on the Kubernetes objects themselves (the pod finalizer, the ClusteredCache status, the StatefulSet spec). When the operator restarts, its reconciliation loop for the ClusteredCache will trigger. It will see a pod in the Terminating state with its finalizer and pick up the health-checking process exactly where it left off. This is a fundamental benefit of the Kubernetes controller model.

    4. Manual Intervention and State Drift

    Problem: A cluster administrator with high privileges might manually intervene, for example, by running kubectl patch pod my-cache-2 --type=json -p='[{"op": "remove", "path": "/metadata/finalizers"}]' to force-delete a pod.

    Solution: The operator must be able to detect this state drift. On the next reconciliation, the operator might see that the StatefulSet template has been updated, but the pod it thought was terminating is now gone and a new one is running. Its internal state model is now out of sync. The reconciler should be written to be a level-based state machine, not an edge-based one. It should always re-evaluate the entire state of all pods and decide the next correct action, rather than assuming its last action succeeded. In this case, it would see my-cache-2 is now running the new version and simply move on to my-cache-1.

    5. Performance and Scalability

    Problem: A single operator managing thousands of ClusteredCache resources could become a bottleneck.

    Solution:

    * Concurrent Reconciles: The controller-runtime manager allows you to configure MaxConcurrentReconciles. For operators that perform blocking operations (like our health checks), increasing this can improve throughput.

    * Targeted Watches: Ensure your controller is only watching the resources it needs. Use Owns(&appsv1.StatefulSet{}) and Owns(&corev1.Pod{}) in your SetupWithManager function. This way, a change to any pod owned by a ClusteredCache will trigger a reconcile for that specific ClusteredCache instance, not all of them (see the SetupWithManager sketch after this list).

    * Client-Side Caching: The controller-runtime client is a caching client by default. This is highly efficient, but be aware that it can be slightly stale. For operations that require absolute latest state (which is rare), you can use a non-caching client (manager.GetAPIReader()). For our use case, the caching client is perfectly fine.
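
    Both of these knobs come together in SetupWithManager. A sketch of what that wiring might look like; note that Owns() matches the controller owner reference, and pods created by a StatefulSet are controlled by the StatefulSet rather than the ClusteredCache, so depending on your ownership model you may need a Watches() with a custom mapping (for example, by the app label) to route pod events to the right ClusteredCache:

    go
    import (
    	appsv1 "k8s.io/api/apps/v1"
    	corev1 "k8s.io/api/core/v1"
    	ctrl "sigs.k8s.io/controller-runtime"
    	"sigs.k8s.io/controller-runtime/pkg/controller"

    	cachev1 "my.domain/clusteredcache/api/v1" // module path is illustrative
    )

    func (r *ClusteredCacheReconciler) SetupWithManager(mgr ctrl.Manager) error {
    	return ctrl.NewControllerManagedBy(mgr).
    		For(&cachev1.ClusteredCache{}).
    		Owns(&appsv1.StatefulSet{}).
    		Owns(&corev1.Pod{}).
    		// A single ClusteredCache is still reconciled serially; this only
    		// lets different ClusteredCache instances be processed in parallel.
    		WithOptions(controller.Options{MaxConcurrentReconciles: 4}).
    		Complete(r)
    }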

    Conclusion

    The default Kubernetes controllers provide powerful building blocks, but they are intentionally generic. For complex, stateful applications, achieving true operational safety requires encoding application-specific knowledge into the control plane. The pattern of using a custom operator to apply pod-level finalizers allows you to precisely intercept and manage the pod lifecycle during critical procedures like rolling updates.

    By seizing control from the StatefulSet controller at the most critical moment—just before termination—you can perform any validation necessary to guarantee the stability and availability of your service. This moves your infrastructure from simply managing containers to actively orchestrating the application's state, which is the ultimate promise of the Operator pattern. While more complex than a simple Helm chart, this level of control is non-negotiable for running mission-critical stateful services in production.
