Advanced Go Operator: Managing StatefulSet Failovers with Finalizers
Beyond the Basic Reconciler: The StatefulSet Update Problem
For senior engineers managing stateful applications on Kubernetes, the limitations of the default StatefulSet rolling update strategy are a common source of production anxiety. While a StatefulSet provides stable network identifiers and persistent storage, its OnDelete or RollingUpdate strategies operate with a critical blind spot: they only understand Kubernetes primitives like pod readiness probes. They have no intrinsic knowledge of your application's internal state, such as cluster quorum, data replication status, or active client connections.
Consider a three-node Raft-based distributed database like etcd or a message queue like Kafka. A standard rolling update will terminate pod-2, wait for pod-2 (new version) to become ready, and then proceed to pod-1. This process can be catastrophic if:
* The replacement pod passes its readiness probe before it has finished re-syncing data, leaving the cluster one failure away from losing quorum.
* The terminated pod was the current leader and leadership was never transferred, causing an avoidable write outage.
* Clients with active connections are dropped abruptly because the application was never drained.
To solve this, we must build an intelligence layer that understands the application's state. This is the perfect use case for a custom Kubernetes Operator. However, a naive operator that simply updates the StatefulSet's pod template spec is no better than kubectl apply. The real solution lies in taking direct control of the pod lifecycle during an update. This article demonstrates an advanced pattern: using pod-level finalizers to pause the StatefulSet's update process, allowing our operator to perform critical application-aware checks before permitting pod termination.
Architectural Overview: The Operator and its Custom Resource
Our goal is to manage a DistributedCache application. We will define a Custom Resource Definition (CRD) named ClusteredCache that represents a single instance of our application. The operator will watch ClusteredCache resources and orchestrate the underlying StatefulSet and its pods.
Our operator's core responsibility during an upgrade is to execute this sequence for each pod, in reverse ordinal order (from n-1 down to 0):
1. A user changes the ClusteredCache spec (e.g., a new container image).
2. Rather than immediately rewriting the StatefulSet template, the operator begins a controlled, pod-by-pod update.
3. It adds a finalizer (cache.my.domain/pre-terminate-hook) to the target pod (e.g., my-cache-2).
4. It updates the StatefulSet template. The StatefulSet controller sees the change and attempts to delete my-cache-2 to replace it. The deletion is blocked because our finalizer is present, and the pod enters the Terminating state.
5. The operator performs application-aware safety checks, such as calling a /ready-for-shutdown endpoint on the other pods, checking a metrics endpoint to ensure replication lag is zero, or verifying that cluster leadership has been transferred.
6. Once the checks pass, the operator removes the finalizer. The pod is deleted, the StatefulSet controller creates its replacement, and the operator waits for the new pod to become Ready.
7. It then proceeds to the next pod (my-cache-1).
This pattern gives our operator ultimate authority over the update process, ensuring application-level safety that Kubernetes alone cannot provide.
Step 1: Defining the `ClusteredCache` CRD
Everything starts with a well-defined API. We'll use kubebuilder to scaffold our project and define the ClusteredCache type. The Spec will contain the desired state, and the Status will reflect the observed state, which is crucial for idempotent reconciliation.
File: api/v1/clusteredcache_types.go
package v1

import (
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// ClusteredCacheSpec defines the desired state of ClusteredCache
type ClusteredCacheSpec struct {
    // Replicas is the desired number of instances.
    // +kubebuilder:validation:Minimum=1
    Replicas *int32 `json:"replicas"`

    // Image is the container image to run for the cache pods.
    Image string `json:"image"`
}

// ClusteredCacheStatus defines the observed state of ClusteredCache
type ClusteredCacheStatus struct {
    // Conditions represent the latest available observations of the ClusteredCache's state.
    // +optional
    // +patchMergeKey=type
    // +patchStrategy=merge
    Conditions []metav1.Condition `json:"conditions,omitempty" patchStrategy:"merge" patchMergeKey:"type"`

    // ReadyReplicas is the number of pods created by the StatefulSet that have a Ready condition.
    ReadyReplicas int32 `json:"readyReplicas,omitempty"`

    // CurrentVersion is the version of the image currently deployed.
    CurrentVersion string `json:"currentVersion,omitempty"`

    // TargetVersion is the version of the image being updated to.
    TargetVersion string `json:"targetVersion,omitempty"`
}

//+kubebuilder:object:root=true
//+kubebuilder:subresource:status
//+kubebuilder:printcolumn:name="Replicas",type="integer",JSONPath=".spec.replicas"
//+kubebuilder:printcolumn:name="Ready",type="integer",JSONPath=".status.readyReplicas"
//+kubebuilder:printcolumn:name="Version",type="string",JSONPath=".status.currentVersion"
//+kubebuilder:printcolumn:name="Status",type="string",JSONPath=".status.conditions[?(@.type==\"Available\")].reason"

// ClusteredCache is the Schema for the clusteredcaches API
type ClusteredCache struct {
    metav1.TypeMeta   `json:",inline"`
    metav1.ObjectMeta `json:"metadata,omitempty"`

    Spec   ClusteredCacheSpec   `json:"spec,omitempty"`
    Status ClusteredCacheStatus `json:"status,omitempty"`
}

//+kubebuilder:object:root=true

// ClusteredCacheList contains a list of ClusteredCache
type ClusteredCacheList struct {
    metav1.TypeMeta `json:",inline"`
    metav1.ListMeta `json:"metadata,omitempty"`
    Items           []ClusteredCache `json:"items"`
}

func init() {
    SchemeBuilder.Register(&ClusteredCache{}, &ClusteredCacheList{})
}
Key elements here:
* +kubebuilder:subresource:status: This is critical. It tells controller-gen to enable the status subresource on the CRD, which prevents conflicts between our operator updating the status and a user updating the spec.
* Conditions: We use the standard metav1.Condition type to provide detailed, machine-readable status updates. This is a best practice for modern operators.
* CurrentVersion and TargetVersion: These fields in the Status allow us to track the progress of an upgrade, making our reconciler more robust and idempotent.
After defining the types, run make manifests and make install to apply the CRD to your cluster.
Step 2: The Core Reconciler Logic with Finalizers
Now we implement the heart of the operator. The Reconcile method in internal/controller/clusteredcache_controller.go is where our logic will live. We'll break it down into the main reconciliation flow and the specialized upgrade logic.
First, let's define our finalizer constant:
const (
    cachePodFinalizer = "cache.my.domain/pre-terminate-hook"
)
The main Reconcile function acts as a dispatcher.
File: internal/controller/clusteredcache_controller.go (Reconcile method)
func (r *ClusteredCacheReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    log := log.FromContext(ctx)

    // 1. Fetch the ClusteredCache instance
    cache := &cachev1.ClusteredCache{}
    if err := r.Get(ctx, req.NamespacedName, cache); err != nil {
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }

    // 2. Handle deletion: Right now, we don't own external resources, so there is nothing to clean up.
    // A real-world operator might need to deprovision cloud resources here.
    if !cache.ObjectMeta.DeletionTimestamp.IsZero() {
        // The object is being deleted.
        // Our logic doesn't require a finalizer on the CR itself, but if it did, we'd handle it here.
        return ctrl.Result{}, nil
    }

    // 3. Reconcile the StatefulSet
    sts := &appsv1.StatefulSet{}
    err := r.Get(ctx, types.NamespacedName{Name: cache.Name, Namespace: cache.Namespace}, sts)
    if err != nil && errors.IsNotFound(err) {
        log.Info("Creating a new StatefulSet")
        sts, err = r.statefulSetForClusteredCache(cache)
        if err != nil {
            log.Error(err, "Failed to define new StatefulSet")
            // ... update status with error condition ...
            return ctrl.Result{}, err
        }
        if err = r.Create(ctx, sts); err != nil {
            log.Error(err, "Failed to create new StatefulSet")
            // ... update status with error condition ...
            return ctrl.Result{}, err
        }
        // StatefulSet created, requeue to check status after a short delay
        return ctrl.Result{RequeueAfter: time.Second * 5}, nil
    } else if err != nil {
        log.Error(err, "Failed to get StatefulSet")
        return ctrl.Result{}, err
    }

    // 4. Check for an upgrade and orchestrate if necessary
    if r.isUpgradeRequired(cache, sts) {
        log.Info("Upgrade required. Initiating controlled rolling update.")
        // This is our advanced logic
        return r.reconcileUpgrade(ctx, cache, sts)
    }

    // 5. If not upgrading, ensure the StatefulSet is scaled correctly
    if *cache.Spec.Replicas != *sts.Spec.Replicas {
        log.Info("Scaling StatefulSet", "current", *sts.Spec.Replicas, "desired", *cache.Spec.Replicas)
        sts.Spec.Replicas = cache.Spec.Replicas
        if err := r.Update(ctx, sts); err != nil {
            log.Error(err, "Failed to scale StatefulSet")
            return ctrl.Result{}, err
        }
    }

    // 6. Update the ClusteredCache status
    return r.updateCacheStatus(ctx, cache, sts)
}
func (r *ClusteredCacheReconciler) isUpgradeRequired(cache *cachev1.ClusteredCache, sts *appsv1.StatefulSet) bool {
    // An upgrade is required if the desired image differs from the StatefulSet template,
    // or if a previously started upgrade (tracked in the status) has not finished yet.
    // Without the second check, this would return false as soon as the template is patched,
    // leaving a pod stuck in Terminating with our finalizer still on it.
    if cache.Spec.Image != sts.Spec.Template.Spec.Containers[0].Image {
        return true
    }
    return cache.Status.TargetVersion != "" && cache.Status.TargetVersion != cache.Status.CurrentVersion
}
This structure sets up a clear flow: fetch, check for deletion, ensure the child resource (StatefulSet) exists, and then delegate to specialized functions for upgrades or status updates.
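For reference, here is a minimal sketch of the statefulSetForClusteredCache helper referenced in step 3 above. The container details (ports, probes, volume claims) are placeholders, the headless Service name is an assumption, and it relies on the Scheme field that kubebuilder scaffolds onto the reconciler; the parts that matter for this article are the app label (used later by the health check) and the controller owner reference.

// statefulSetForClusteredCache builds the desired StatefulSet for a ClusteredCache.
// Minimal sketch: a real pod spec would include probes, resources, and volume claims.
func (r *ClusteredCacheReconciler) statefulSetForClusteredCache(cache *cachev1.ClusteredCache) (*appsv1.StatefulSet, error) {
    labels := map[string]string{"app": cache.Name}

    sts := &appsv1.StatefulSet{
        ObjectMeta: metav1.ObjectMeta{
            Name:      cache.Name,
            Namespace: cache.Namespace,
        },
        Spec: appsv1.StatefulSetSpec{
            Replicas:    cache.Spec.Replicas,
            ServiceName: cache.Name, // assumes a matching headless Service exists
            Selector:    &metav1.LabelSelector{MatchLabels: labels},
            Template: corev1.PodTemplateSpec{
                ObjectMeta: metav1.ObjectMeta{Labels: labels},
                Spec: corev1.PodSpec{
                    Containers: []corev1.Container{{
                        Name:  "cache",
                        Image: cache.Spec.Image,
                    }},
                },
            },
        },
    }

    // The owner reference lets Owns(&appsv1.StatefulSet{}) route StatefulSet events back to
    // this ClusteredCache and enables garbage collection when the ClusteredCache is deleted.
    if err := ctrl.SetControllerReference(cache, sts, r.Scheme); err != nil {
        return nil, err
    }
    return sts, nil
}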
Step 3: Deep Dive - The `reconcileUpgrade` Implementation
This is where the finalizer pattern comes to life. The reconcileUpgrade function is responsible for the entire orchestrated update process.
func (r *ClusteredCacheReconciler) reconcileUpgrade(ctx context.Context, cache *cachev1.ClusteredCache, sts *appsv1.StatefulSet) (ctrl.Result, error) {
    log := log.FromContext(ctx)

    // Set status to indicate an upgrade is in progress
    // (Implementation of setUpgradeStatus omitted for brevity)
    // This should update cache.Status.TargetVersion and add an "Upgrading" condition.

    // We update pods in reverse ordinal order (e.g., for 3 replicas: pod-2, pod-1, pod-0)
    for i := *sts.Spec.Replicas - 1; i >= 0; i-- {
        pod := &corev1.Pod{}
        podName := fmt.Sprintf("%s-%d", sts.Name, i)
        err := r.Get(ctx, types.NamespacedName{Name: podName, Namespace: sts.Namespace}, pod)
        if err != nil {
            if errors.IsNotFound(err) {
                // The old pod is gone and the StatefulSet controller has not recreated it yet.
                log.Info("Pod not found, waiting for the StatefulSet controller to recreate it", "PodName", podName)
                return ctrl.Result{RequeueAfter: 5 * time.Second}, nil
            }
            log.Error(err, "Failed to get pod for upgrade check", "PodName", podName)
            return ctrl.Result{}, err
        }

        // Check if this pod is already updated
        if pod.Spec.Containers[0].Image == cache.Spec.Image {
            // Only move on once the replacement pod is Ready, so we never take down two pods at once.
            if !isPodReady(pod) {
                log.Info("Updated pod is not Ready yet, waiting before continuing", "PodName", podName)
                return ctrl.Result{RequeueAfter: 5 * time.Second}, nil
            }
            log.V(1).Info("Pod is already updated, skipping", "PodName", podName)
            continue
        }

        // This is the pod we need to update next.
        log.Info("Processing pod for update", "PodName", podName)

        // If the pod is not yet terminating, we need to start the process.
        if pod.ObjectMeta.DeletionTimestamp.IsZero() {
            // 1. Add our finalizer to the pod to block deletion
            if !controllerutil.ContainsFinalizer(pod, cachePodFinalizer) {
                log.Info("Adding finalizer to pod", "PodName", podName)
                controllerutil.AddFinalizer(pod, cachePodFinalizer)
                if err := r.Update(ctx, pod); err != nil {
                    log.Error(err, "Failed to add finalizer to pod", "PodName", podName)
                    return ctrl.Result{}, err
                }
            }

            // 2. Update the StatefulSet template. This will trigger the deletion of the pod.
            log.Info("Updating StatefulSet template to trigger pod replacement")
            sts.Spec.Template.Spec.Containers[0].Image = cache.Spec.Image
            if err := r.Update(ctx, sts); err != nil {
                log.Error(err, "Failed to update StatefulSet for upgrade")
                return ctrl.Result{}, err
            }

            // We've initiated the termination. Requeue to handle the next state.
            return ctrl.Result{Requeue: true}, nil
        }

        // If we reach here, the pod IS terminating (DeletionTimestamp is not zero)
        // and our finalizer should be on it.
        if controllerutil.ContainsFinalizer(pod, cachePodFinalizer) {
            log.Info("Pod is terminating, performing application health checks", "PodName", podName)

            // 3. Perform application-specific pre-termination checks
            healthy, err := r.performApplicationHealthCheck(ctx, cache, pod)
            if err != nil {
                log.Error(err, "Application health check failed", "PodName", podName)
                // Requeue with a backoff to try again later
                return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
            }
            if !healthy {
                log.Info("Application is not ready for pod termination, waiting...", "PodName", podName)
                // Requeue to check again
                return ctrl.Result{RequeueAfter: 15 * time.Second}, nil
            }

            // 4. Health checks passed. Remove the finalizer to allow deletion.
            log.Info("Application health checks passed. Removing finalizer.", "PodName", podName)
            controllerutil.RemoveFinalizer(pod, cachePodFinalizer)
            if err := r.Update(ctx, pod); err != nil {
                log.Error(err, "Failed to remove finalizer from pod", "PodName", podName)
                return ctrl.Result{}, err
            }
        }

        // The pod is being handled. We must wait for it to be fully gone and the new one to be ready
        // before proceeding to the next pod in the loop. So we requeue shortly.
        log.Info("Waiting for pod to be replaced", "PodName", podName)
        return ctrl.Result{RequeueAfter: 5 * time.Second}, nil
    }

    // If the loop completes, all pods are updated and Ready.
    log.Info("All pods have been updated successfully.")
    // Reset the StatefulSet's update strategy if we changed it (optional)
    // Update status to reflect completion: set Status.CurrentVersion = Status.TargetVersion
    // and add a success condition, so isUpgradeRequired returns false again.
    return ctrl.Result{}, nil
}
The logic is sequential and idempotent. At any point, if the operator crashes and restarts, it will re-evaluate the state of the cluster and pick up exactly where it left off. For example, if it crashes after adding the finalizer but before updating the StatefulSet, on the next reconcile it will see the finalizer is present, skip adding it, and proceed to update the StatefulSet.
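As a companion to the comments above, a rough sketch of setUpgradeStatus might look like the following. It uses meta.SetStatusCondition from k8s.io/apimachinery/pkg/api/meta; the "Upgrading" condition type and reason string are illustrative choices, not a standard. The completion path at the end of reconcileUpgrade would do the inverse: set Status.CurrentVersion to Status.TargetVersion and flip the condition back to False.

// setUpgradeStatus marks the ClusteredCache as upgrading toward the image in its spec.
// Sketch only: condition type and reason names are illustrative.
func (r *ClusteredCacheReconciler) setUpgradeStatus(ctx context.Context, cache *cachev1.ClusteredCache) error {
    cache.Status.TargetVersion = cache.Spec.Image
    meta.SetStatusCondition(&cache.Status.Conditions, metav1.Condition{
        Type:    "Upgrading",
        Status:  metav1.ConditionTrue,
        Reason:  "RollingUpdateInProgress",
        Message: fmt.Sprintf("Upgrading to image %s", cache.Spec.Image),
    })
    // Write through the status subresource so we never conflict with spec updates.
    return r.Status().Update(ctx, cache)
}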
The Application Health Check
The performApplicationHealthCheck function is where your domain-specific knowledge is encoded. This is a placeholder for what would be a robust implementation.
// performApplicationHealthCheck checks if the rest of the cluster is healthy enough
// to tolerate the termination of the given pod.
func (r *ClusteredCacheReconciler) performApplicationHealthCheck(ctx context.Context, cache *cachev1.ClusteredCache, podToTerminate *corev1.Pod) (bool, error) {
    log := log.FromContext(ctx)
    log.Info("Executing health check for cache cluster", "TerminatingPod", podToTerminate.Name)

    // Get all pods for this ClusteredCache instance
    podList := &corev1.PodList{}
    listOpts := []client.ListOption{
        client.InNamespace(cache.Namespace),
        client.MatchingLabels{"app": cache.Name}, // Assuming we set this label
    }
    if err := r.List(ctx, podList, listOpts...); err != nil {
        return false, fmt.Errorf("failed to list pods for health check: %w", err)
    }

    // Check the health of all OTHER pods in the cluster
    for _, pod := range podList.Items {
        // Skip the pod that is already terminating
        if pod.Name == podToTerminate.Name {
            continue
        }

        // Example Check 1: Is the pod running and ready?
        if !isPodReady(&pod) {
            log.Info("Health check failed: A peer pod is not ready", "PeerPod", pod.Name)
            return false, nil
        }

        // Example Check 2: Query an application-specific endpoint
        // This requires setting up a way to communicate with the pod, e.g., via a client.
        // For a real implementation, you'd use a proper HTTP client with timeouts and retries.
        /*
            clusterClient := NewClusterClientForPod(pod)
            isSafe, err := clusterClient.IsReadyForFailover()
            if err != nil {
                log.Error(err, "Failed to query pod's failover status endpoint", "PeerPod", pod.Name)
                return false, err // Return error to trigger backoff
            }
            if !isSafe {
                log.Info("Health check failed: Peer pod reports it is not safe for failover", "PeerPod", pod.Name)
                return false, nil // Return false to retry soon
            }
        */
    }

    // If we get here, all other pods are ready and report it's safe to proceed.
    log.Info("Health check passed!")
    return true, nil
}

func isPodReady(pod *corev1.Pod) bool {
    for _, cond := range pod.Status.Conditions {
        if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionTrue {
            return true
        }
    }
    return false
}
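To make "Example Check 2" concrete, here is a hypothetical helper that queries a peer pod's /ready-for-shutdown endpoint over its pod IP. The endpoint path, port 8080, and the "200 means safe" convention are assumptions about the cache application, and reaching pod IPs directly assumes the operator runs inside the cluster.

// isPeerReadyForShutdown asks a peer pod whether the cluster can tolerate losing a member.
// Hypothetical sketch: the /ready-for-shutdown endpoint and port 8080 are assumptions
// about the cache application, not a standard API.
func isPeerReadyForShutdown(ctx context.Context, pod *corev1.Pod) (bool, error) {
    if pod.Status.PodIP == "" {
        return false, fmt.Errorf("pod %s has no IP yet", pod.Name)
    }

    url := fmt.Sprintf("http://%s:8080/ready-for-shutdown", pod.Status.PodIP)
    req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
    if err != nil {
        return false, err
    }

    httpClient := &http.Client{Timeout: 2 * time.Second}
    resp, err := httpClient.Do(req)
    if err != nil {
        return false, err
    }
    defer resp.Body.Close()

    // Convention assumed here: 200 means "safe to proceed", anything else means "not yet".
    return resp.StatusCode == http.StatusOK, nil
}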
Advanced Edge Cases and Production Considerations
This pattern is powerful, but in a production environment, several edge cases must be handled gracefully.
1. Stuck Upgrades
Problem: What if performApplicationHealthCheck never returns true? The upgrade will be permanently stuck on one pod.
Solution: Implement a timeout mechanism. In the ClusteredCache status, add a LastTransitionTime to your Upgrading condition. In the reconcileUpgrade function, check if the time since the last transition exceeds a configured deadline (e.g., 15 minutes). If it does, the operator should:
1. Halt the upgrade and update the ClusteredCache status with a condition like Type: Available, Status: False, Reason: UpgradeFailed, Message: Timed out waiting for health checks on pod my-cache-2.
2. Optionally roll back by reverting the StatefulSet template to the CurrentVersion from the status and removing the finalizer to let the old pod restart. This requires careful state management to avoid flapping.
A sketch of the deadline check is shown below.
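A minimal sketch of that deadline check, assuming setUpgradeStatus maintains an Upgrading condition as described earlier; the condition type and the 15-minute deadline are illustrative. reconcileUpgrade would call this before entering its pod loop. It uses meta.FindStatusCondition from k8s.io/apimachinery/pkg/api/meta.

// upgradeDeadline is how long we tolerate a single pod blocking the rollout (illustrative value).
const upgradeDeadline = 15 * time.Minute

// upgradeTimedOut reports whether the in-progress upgrade has exceeded the deadline.
func upgradeTimedOut(cache *cachev1.ClusteredCache) bool {
    cond := meta.FindStatusCondition(cache.Status.Conditions, "Upgrading")
    if cond == nil || cond.Status != metav1.ConditionTrue {
        return false // No upgrade in progress.
    }
    // LastTransitionTime was set when the condition flipped to True.
    return time.Since(cond.LastTransitionTime.Time) > upgradeDeadline
}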
2. Controller Idempotency and Status Management
Problem: The Reconcile function can be triggered multiple times for the same state. Our logic must not perform duplicate actions, like trying to add a finalizer that already exists.
Solution: The code shown already handles this well by checking for the existence of the finalizer before adding it (!controllerutil.ContainsFinalizer(...)). This principle must be applied everywhere. Before taking any action, read the current state from the cluster and compare it to the desired state. The status subresource is your best friend. Storing CurrentVersion and TargetVersion allows the operator to know if it's in the middle of an upgrade even if it just restarted. A sketch of a status updater built on this principle follows.
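For illustration, here is a sketch of the updateCacheStatus helper called at the end of Reconcile. The Available condition type and its reason strings are illustrative; the key points are that it derives everything from observed cluster state and writes through the status subresource.

// updateCacheStatus mirrors the observed StatefulSet state into the ClusteredCache status.
// Sketch only: condition reasons are illustrative.
func (r *ClusteredCacheReconciler) updateCacheStatus(ctx context.Context, cache *cachev1.ClusteredCache, sts *appsv1.StatefulSet) (ctrl.Result, error) {
    cache.Status.ReadyReplicas = sts.Status.ReadyReplicas
    cache.Status.CurrentVersion = sts.Spec.Template.Spec.Containers[0].Image

    available := metav1.Condition{
        Type:    "Available",
        Status:  metav1.ConditionFalse,
        Reason:  "PodsNotReady",
        Message: fmt.Sprintf("%d/%d replicas ready", sts.Status.ReadyReplicas, *cache.Spec.Replicas),
    }
    if sts.Status.ReadyReplicas == *cache.Spec.Replicas {
        available.Status = metav1.ConditionTrue
        available.Reason = "AllReplicasReady"
    }
    meta.SetStatusCondition(&cache.Status.Conditions, available)

    // Status().Update writes only the status subresource, so it cannot clobber a concurrent spec change.
    if err := r.Status().Update(ctx, cache); err != nil {
        return ctrl.Result{}, err
    }
    return ctrl.Result{}, nil
}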
3. Operator Crash and Restart
Problem: The operator pod itself can be rescheduled or crash.
Solution: This pattern is inherently resilient to operator failure. The state is stored on the Kubernetes objects themselves (the pod finalizer, the ClusteredCache status, the StatefulSet spec). When the operator restarts, its reconciliation loop for the ClusteredCache will trigger. It will see a pod in the Terminating state with its finalizer and pick up the health-checking process exactly where it left off. This is a fundamental benefit of the Kubernetes controller model.
4. Manual Intervention and State Drift
Problem: A cluster administrator with high privileges might manually intervene, for example, by running kubectl patch pod my-cache-2 --type=json -p='[{"op": "remove", "path": "/metadata/finalizers"}]' to force-delete a pod.
Solution: The operator must be able to detect this state drift. On the next reconciliation, the operator might see that the StatefulSet template has been updated, but the pod it thought was terminating is now gone and a new one is running. Its internal state model is now out of sync. The reconciler should be written as a level-based state machine, not an edge-based one. It should always re-evaluate the entire state of all pods and decide the next correct action, rather than assuming its last action succeeded. In this case, it would see my-cache-2 is now running the new version and simply move on to my-cache-1.
5. Performance and Scalability
Problem: A single operator managing thousands of ClusteredCache resources could become a bottleneck.
Solution:
* Concurrent Reconciles: The controller-runtime manager allows you to configure MaxConcurrentReconciles. For operators that perform blocking operations (like our health checks), increasing this can improve throughput.
* Targeted Watches: Ensure your controller is only watching the resources it needs. Use Owns(&appsv1.StatefulSet{}) in your SetupWithManager function, and watch pods with a handler that maps each pod back to its ClusteredCache (pods carry an owner reference to the StatefulSet, not to the ClusteredCache, so a plain Owns(&corev1.Pod{}) would not fire). This way, a change to any pod belonging to a ClusteredCache triggers a reconcile for that specific ClusteredCache instance, not all of them. A sketch of such a SetupWithManager appears after this list.
* Client-Side Caching: The controller-runtime client is a caching client by default. This is highly efficient, but be aware that it can be slightly stale. For operations that require the absolute latest state (which is rare), you can use a non-caching client (manager.GetAPIReader()). For our use case, the caching client is perfectly fine.
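Putting the concurrency and watch advice together, SetupWithManager might look like the sketch below. It assumes controller-runtime v0.15 or newer (where Watches accepts a client.Object and map functions take a context), and it reuses the app label from the health-check code to map pod events back to their ClusteredCache.

// SetupWithManager wires the controller's watches and concurrency settings.
// Sketch only: assumes controller-runtime >= v0.15 and the "app" label convention used earlier.
func (r *ClusteredCacheReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&cachev1.ClusteredCache{}).
        Owns(&appsv1.StatefulSet{}).
        Watches(&corev1.Pod{}, handler.EnqueueRequestsFromMapFunc(
            func(ctx context.Context, obj client.Object) []reconcile.Request {
                // Pods are owned by the StatefulSet, so map them back to the
                // ClusteredCache via the "app" label set on the pod template.
                name, ok := obj.GetLabels()["app"]
                if !ok {
                    return nil
                }
                return []reconcile.Request{{NamespacedName: types.NamespacedName{
                    Name:      name,
                    Namespace: obj.GetNamespace(),
                }}}
            })).
        WithOptions(controller.Options{MaxConcurrentReconciles: 4}).
        Complete(r)
}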
Conclusion
The default Kubernetes controllers provide powerful building blocks, but they are intentionally generic. For complex, stateful applications, achieving true operational safety requires encoding application-specific knowledge into the control plane. The pattern of using a custom operator to apply pod-level finalizers allows you to precisely intercept and manage the pod lifecycle during critical procedures like rolling updates.
By seizing control from the StatefulSet controller at the most critical moment—just before termination—you can perform any validation necessary to guarantee the stability and availability of your service. This moves your infrastructure from simply managing containers to actively orchestrating the application's state, which is the ultimate promise of the Operator pattern. While more complex than a simple Helm chart, this level of control is non-negotiable for running mission-critical stateful services in production.