Kubebuilder Operator for StatefulSet Zero-Downtime Updates

Goh Ling Yong

The Flaw in Standard StatefulSet Updates for Critical Systems

As senior engineers, we appreciate the declarative power of Kubernetes. A StatefulSet is a testament to this, providing stable identities and persistent storage for stateful applications. However, its default RollingUpdate strategy, while effective for many scenarios, harbors a critical flaw when managing tightly-coupled, primary-replica clustered systems like PostgreSQL, Redis, or Elasticsearch.

The default strategy updates pods in reverse ordinal order (from N-1 down to 0). For a primary-replica cluster where pod-0 is typically the primary, that ordering happens to leave the primary for last, but ordering alone is not enough: the controller treats every pod identically. A plain pod replacement is a blunt instrument that knows nothing about the application's internal state. A rolling update will happily terminate the primary pod before a replica has been cleanly promoted, leading to a service outage, write interruptions, or worse, a split-brain scenario.

This is where the Operator pattern transcends simple automation. A custom operator allows us to inject application-specific intelligence into the control plane. This article details the implementation of a sophisticated Kubernetes Operator using Kubebuilder that orchestrates a zero-downtime update for a hypothetical ClusteredDatabase application. We will seize control from the default StatefulSet controller by using an OnDelete update strategy and managing the entire failover and update process within our custom reconciliation loop.

We will not be covering the basics of what an Operator is or how to install Kubebuilder. We assume you are familiar with Go, Kubernetes controllers, and Custom Resource Definitions (CRDs).


Defining the `ClusteredDatabase` API

First, we define the contract for our operator. After scaffolding a new project with kubebuilder create api, we'll flesh out our api/v1alpha1/clustereddatabase_types.go file. Our CRD needs to capture the desired state (Spec) and expose the observed state (Status).

The Spec: Desired State

The Spec will define the configuration for our database cluster.

go
// api/v1alpha1/clustereddatabase_types.go

package v1alpha1

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// ClusteredDatabaseSpec defines the desired state of ClusteredDatabase
type ClusteredDatabaseSpec struct {
	// +kubebuilder:validation:Minimum=1
	// Number of desired pods. Must be at least 1.
	Replicas *int32 `json:"replicas"`

	// Image is the container image to run for the database.
	Image string `json:"image"`

	// Resources defines the CPU and memory requests and limits for the database container.
	Resources corev1.ResourceRequirements `json:"resources,omitempty"`

	// Storage defines the persistent storage configuration for the database.
	Storage StorageSpec `json:"storage"`
}

// StorageSpec defines the storage configuration for each pod in the cluster.
type StorageSpec struct {
	// StorageClassName is the name of the StorageClass to use for the PersistentVolumeClaims.
	StorageClassName string `json:"storageClassName"`

	// Size is the size of the storage volume, e.g., "10Gi".
	Size string `json:"size"`
}

The Status: Observed State

The Status subresource is critical for a robust operator. It provides visibility into the operator's actions and the cluster's health. It's the primary mechanism for users and other controllers to understand the current state without having to inspect multiple underlying resources.

go
// api/v1alpha1/clustereddatabase_types.go (continued)

// ClusteredDatabaseStatus defines the observed state of ClusteredDatabase
type ClusteredDatabaseStatus struct {
	// ReadyReplicas is the number of pods in the cluster that are in a Ready state.
	ReadyReplicas int32 `json:"readyReplicas"`

	// CurrentPrimary identifies the pod that is currently acting as the primary.
	CurrentPrimary string `json:"currentPrimary,omitempty"`

	// Conditions represent the latest available observations of the ClusteredDatabase's state.
	// +optional
	Conditions []metav1.Condition `json:"conditions,omitempty"`
}

// These are the valid conditions for a ClusteredDatabase.
const (
    // AvailableCondition indicates that the database cluster is fully operational and available.
    AvailableCondition string = "Available"
    // UpdatingCondition indicates that the cluster is undergoing an update.
    UpdatingCondition string = "Updating"
    // DegradedCondition indicates that the cluster is not fully operational.
    DegradedCondition string = "Degraded"
)

// +kubebuilder:object:root=true
// +kubebuilder:subresource:status
// +kubebuilder:printcolumn:name="Replicas",type="integer",JSONPath=".spec.replicas"
// +kubebuilder:printcolumn:name="Ready",type="integer",JSONPath=".status.readyReplicas"
// +kubebuilder:printcolumn:name="Primary",type="string",JSONPath=".status.currentPrimary"
// +kubebuilder:printcolumn:name="Status",type="string",JSONPath=".status.conditions[?(@.type==\"Available\")].status"
// +kubebuilder:printcolumn:name="Age",type="date",JSONPath=".metadata.creationTimestamp"

// ClusteredDatabase is the Schema for the clustereddatabases API
type ClusteredDatabase struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   ClusteredDatabaseSpec   `json:"spec,omitempty"`
	Status ClusteredDatabaseStatus `json:"status,omitempty"`
}

// +kubebuilder:object:root=true

// ClusteredDatabaseList contains a list of ClusteredDatabase
type ClusteredDatabaseList struct {
	metav1.TypeMeta `json:",inline"`
	metav1.ListMeta `json:"metadata,omitempty"`
	Items           []ClusteredDatabase `json:"items"`
}

func init() {
	SchemeBuilder.Register(&ClusteredDatabase{}, &ClusteredDatabaseList{})
}

Note the use of +kubebuilder markers. The subresource:status marker is crucial; it tells the API server that the .status field should be treated as a separate endpoint, preventing race conditions where a client's update to the spec overwrites the operator's update to the status.
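
In practice this means spec and status are written through different endpoints. As a minimal sketch of how the controller will take advantage of this later (the helper name setReadyReplicas is illustrative, not part of the scaffold):

go
// setReadyReplicas records the observed replica count on the status subresource.
// r.Status() returns a client scoped to /status, so this write cannot clobber a
// concurrent change to .spec, and a spec update cannot clobber our status.
func (r *ClusteredDatabaseReconciler) setReadyReplicas(ctx context.Context, db *dbv1alpha1.ClusteredDatabase, ready int32) error {
	db.Status.ReadyReplicas = ready
	return r.Status().Update(ctx, db)
}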

After running make manifests generate, our CRD is ready.


The Core Reconciliation Loop: Orchestrating State

The heart of the operator is the Reconcile function in controllers/clustereddatabase_controller.go. Our implementation will be idempotent and state-driven, reacting to the difference between the desired state (CRD Spec) and the observed state of the cluster.

Initial Setup and Finalizers

First, we handle object fetching and finalizer logic. A finalizer prevents the custom resource from being deleted until our operator has performed the necessary cleanup, such as removing the PersistentVolumeClaims left behind by the StatefulSet (the StatefulSet itself is garbage-collected through its owner reference). This is a must-have in production to prevent orphaned resources.

go
// controllers/clustereddatabase_controller.go

import (
	"context"
	"fmt"
	"time"

	"github.com/go-logr/logr"
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/api/meta"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"

	dbv1alpha1 "github.com/your-repo/clustereddb-operator/api/v1alpha1"
)

const clusteredDBFinalizer = "db.example.com/finalizer"

func (r *ClusteredDatabaseReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	log := r.Log.WithValues("clustereddatabase", req.NamespacedName)

	// Fetch the ClusteredDatabase instance
	db := &dbv1alpha1.ClusteredDatabase{}
	err := r.Get(ctx, req.NamespacedName, db)
	if err != nil {
		if errors.IsNotFound(err) {
			log.Info("ClusteredDatabase resource not found. Ignoring since object must be deleted")
			return ctrl.Result{}, nil
		}
		log.Error(err, "Failed to get ClusteredDatabase")
		return ctrl.Result{}, err
	}

	// Check if the instance is being deleted
	isMarkedForDeletion := db.GetDeletionTimestamp() != nil
	if isMarkedForDeletion {
		if controllerutil.ContainsFinalizer(db, clusteredDBFinalizer) {
			// Run finalization logic
			if err := r.finalizeClusteredDB(ctx, log, db); err != nil {
				return ctrl.Result{}, err
			}

			// Remove finalizer. Once all finalizers are removed, the object will be deleted.
			controllerutil.RemoveFinalizer(db, clusteredDBFinalizer)
			err := r.Update(ctx, db)
			if err != nil {
				return ctrl.Result{}, err
			}
		}
		return ctrl.Result{}, nil
	}

	// Add finalizer for this CR if it doesn't exist
	if !controllerutil.ContainsFinalizer(db, clusteredDBFinalizer) {
		controllerutil.AddFinalizer(db, clusteredDBFinalizer)
		err = r.Update(ctx, db)
		if err != nil {
			return ctrl.Result{}, err
		}
	}

	// ... main reconciliation logic follows
	return r.reconcileCluster(ctx, log, db)
}

func (r *ClusteredDatabaseReconciler) finalizeClusteredDB(ctx context.Context, log logr.Logger, db *dbv1alpha1.ClusteredDatabase) error {
	log.Info("Finalizing ClusteredDatabase")
	// In a real operator, you might want to perform a graceful shutdown of the database,
	// backup data, or other cleanup tasks.
	// For this example, we'll just log.
	log.Info("ClusteredDatabase finalizer logic executed.")
	return nil
}
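
The finalizer above only logs. One task a production finalizer typically performs is removing the PersistentVolumeClaims that volumeClaimTemplates leave behind, since they are not garbage-collected with the StatefulSet. A hedged sketch of such a helper, which finalizeClusteredDB could call (cleanupPVCs is our own name; the PVC names follow the standard <claimTemplate>-<statefulSet>-<ordinal> pattern):

go
// cleanupPVCs deletes the data PVCs created from the StatefulSet's
// volumeClaimTemplates, i.e. data-<name>-0, data-<name>-1, ...
func (r *ClusteredDatabaseReconciler) cleanupPVCs(ctx context.Context, log logr.Logger, db *dbv1alpha1.ClusteredDatabase) error {
	for i := int32(0); i < *db.Spec.Replicas; i++ {
		pvc := &corev1.PersistentVolumeClaim{}
		name := types.NamespacedName{
			Name:      fmt.Sprintf("data-%s-%d", db.Name, i),
			Namespace: db.Namespace,
		}
		if err := r.Get(ctx, name, pvc); err != nil {
			if errors.IsNotFound(err) {
				continue // already gone
			}
			return err
		}
		log.Info("Deleting PVC during finalization", "pvc", name.Name)
		if err := r.Delete(ctx, pvc); err != nil && !errors.IsNotFound(err) {
			return err
		}
	}
	return nil
}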

Initial Deployment and StatefulSet Management

On the first reconciliation, we'll create the necessary resources: a headless Service for network identity and the StatefulSet itself.

go
// controllers/clustereddatabase_controller.go

// A portion of the main reconcileCluster function
func (r *ClusteredDatabaseReconciler) reconcileCluster(ctx context.Context, log logr.Logger, db *dbv1alpha1.ClusteredDatabase) (ctrl.Result, error) {
	// Reconcile Headless Service
	headlessSvc := &corev1.Service{}
	err := r.Get(ctx, types.NamespacedName{Name: db.Name + "-headless", Namespace: db.Namespace}, headlessSvc)
	if err != nil && errors.IsNotFound(err) {
		// Define and create a new headless service
		hls := r.serviceForClusteredDB(db)
		log.Info("Creating a new Headless Service", "Service.Namespace", hls.Namespace, "Service.Name", hls.Name)
		if err := r.Create(ctx, hls); err != nil {
			log.Error(err, "Failed to create new Headless Service")
			return ctrl.Result{}, err
		}
	} else if err != nil {
		log.Error(err, "Failed to get Headless Service")
		return ctrl.Result{}, err
	}

	// Reconcile StatefulSet
	sts := &appsv1.StatefulSet{}
	err = r.Get(ctx, types.NamespacedName{Name: db.Name, Namespace: db.Namespace}, sts)
	if err != nil && errors.IsNotFound(err) {
		// Define and create a new StatefulSet
		newSts := r.statefulSetForClusteredDB(db)
		log.Info("Creating a new StatefulSet", "StatefulSet.Namespace", newSts.Namespace, "StatefulSet.Name", newSts.Name)
		if err = r.Create(ctx, newSts); err != nil {
			log.Error(err, "Failed to create new StatefulSet")
			return ctrl.Result{}, err
		}
		// StatefulSet created, requeue to check status after a short delay
		return ctrl.Result{RequeueAfter: time.Second * 5}, nil
	}

	// ... update logic follows

	return ctrl.Result{}, nil
}

// Helper function to define the StatefulSet
func (r *ClusteredDatabaseReconciler) statefulSetForClusteredDB(db *dbv1alpha1.ClusteredDatabase) *appsv1.StatefulSet {
	labels := map[string]string{"app": db.Name}
	replicas := db.Spec.Replicas

	// CRITICAL: Set the update strategy to OnDelete
	updateStrategy := appsv1.StatefulSetUpdateStrategy{
		Type: appsv1.OnDeleteStatefulSetStrategyType,
	}

	sts := &appsv1.StatefulSet{
		ObjectMeta: metav1.ObjectMeta{
			Name:      db.Name,
			Namespace: db.Namespace,
		},
		Spec: appsv1.StatefulSetSpec{
			ServiceName: db.Name + "-headless",
			Replicas:    replicas,
			Selector:    &metav1.LabelSelector{MatchLabels: labels},
			UpdateStrategy: updateStrategy,
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{{
						Image:     db.Spec.Image,
						Name:      "database",
						Ports:     []corev1.ContainerPort{{ContainerPort: 5432, Name: "db"}},
						Resources: db.Spec.Resources,
						// ... readiness/liveness probes are essential here
					}},
				},
			},
			VolumeClaimTemplates: []corev1.PersistentVolumeClaim{{
				ObjectMeta: metav1.ObjectMeta{Name: "data"},
				Spec: corev1.PersistentVolumeClaimSpec{
					AccessModes: []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
					StorageClassName: &db.Spec.Storage.StorageClassName,
					Resources: corev1.ResourceRequirements{
						Requests: corev1.ResourceList{
							corev1.ResourceStorage: resource.MustParse(db.Spec.Storage.Size),
						},
					},
				},
			}},
		},
	}

	controllerutil.SetControllerReference(db, sts, r.Scheme)
	return sts
}

// serviceForClusteredDB and other helpers would be defined similarly

The most important line here is Type: appsv1.OnDeleteStatefulSetStrategyType. This tells the StatefulSet controller to not automatically update pods when the template spec changes. It will only replace a pod if it is manually deleted. This gives our operator exclusive control over the update orchestration.
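
For completeness, the serviceForClusteredDB helper referenced in the code comment above is unremarkable. A sketch might look like the following; ClusterIP: None is what makes the Service headless and gives each pod a stable DNS name:

go
// serviceForClusteredDB returns the headless Service that gives each pod in the
// StatefulSet a stable DNS identity (<pod>.<svc>.<namespace>.svc).
func (r *ClusteredDatabaseReconciler) serviceForClusteredDB(db *dbv1alpha1.ClusteredDatabase) *corev1.Service {
	labels := map[string]string{"app": db.Name}
	svc := &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{
			Name:      db.Name + "-headless",
			Namespace: db.Namespace,
			Labels:    labels,
		},
		Spec: corev1.ServiceSpec{
			ClusterIP: corev1.ClusterIPNone, // headless: no virtual IP, per-pod DNS records
			Selector:  labels,
			Ports:     []corev1.ServicePort{{Name: "db", Port: 5432}},
		},
	}
	controllerutil.SetControllerReference(db, svc, r.Scheme)
	return svc
}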


Implementing the Zero-Downtime Update Logic

This is where our operator's intelligence shines. The update process is triggered when we observe a difference between the StatefulSet's current spec (e.g., the pod image) and the spec defined in our ClusteredDatabase CRD.

Our strategy is as follows:

  • Detect a required update.
  • Update every replica pod (every pod except the current primary), one at a time, in reverse ordinal order.
  • Perform a controlled failover: promote a healthy, updated replica to become the new primary.
  • Update the old primary pod, which will rejoin the cluster as a replica.
  • Update the cluster status to reflect the successful update.

Detecting the Need for an Update

In our reconciliation loop, after ensuring the StatefulSet exists, we compare its image to our CRD's spec.

go
// Inside the reconcileCluster function

// ... after getting the StatefulSet

currentImage := sts.Spec.Template.Spec.Containers[0].Image
desiredImage := db.Spec.Image

// Trigger the orchestrated update when the template is stale, or when a previous
// update is still in flight (with OnDelete, outdated pods linger until we delete
// them, and the StatefulSet reports this via UpdatedReplicas).
if currentImage != desiredImage || sts.Status.UpdatedReplicas != *sts.Spec.Replicas {
	log.Info("Update required, initiating zero-downtime update", "current", currentImage, "desired", desiredImage)
	return r.performUpdate(ctx, log, db, sts)
}

// ... logic to check other spec changes (replicas, resources) would go here

// If no updates are needed, we reconcile the current state (e.g., check primary health)
// and update the status.
return r.reconcileStatus(ctx, log, db, sts)
The `performUpdate` Function

This function orchestrates the entire update flow. It is a state machine driven by the observed state of the pods in the cluster.

go
// controllers/clustereddatabase_controller.go

const (
	// RoleLabel marks each pod's current role ("primary" or "replica").
	// The primary Service selects on this label, so flipping it is what
	// redirects client traffic during a failover.
	RoleLabel string = "db.example.com/role"
)

func (r *ClusteredDatabaseReconciler) performUpdate(ctx context.Context, log logr.Logger, db *dbv1alpha1.ClusteredDatabase, sts *appsv1.StatefulSet) (ctrl.Result, error) {
	// First, update the StatefulSet's template. This won't trigger pod updates due to the OnDelete strategy;
	// the StatefulSet controller will only bump its `updateRevision`.
	sts.Spec.Template.Spec.Containers[0].Image = db.Spec.Image
	if err := r.Update(ctx, sts); err != nil {
		log.Error(err, "Failed to update StatefulSet spec for image change")
		return ctrl.Result{}, err
	}

	// Get all pods belonging to this StatefulSet
	podList := &corev1.PodList{}
	if err := r.List(ctx, podList, client.InNamespace(db.Namespace), client.MatchingLabels{"app": db.Name}); err != nil {
		log.Error(err, "Failed to list pods for StatefulSet")
		return ctrl.Result{}, err
	}

	// Identify the current primary
	primaryPodName := findPrimary(podList) // Helper to find the pod labeled RoleLabel=primary
	if primaryPodName == "" {
		log.Info("No primary found during update, may be initial setup. Requeuing.")
		// This could happen if the primary pod is down. We need a robust recovery strategy here.
		// For now, we'll requeue.
		return ctrl.Result{RequeueAfter: time.Second * 10}, nil
	}

	// 1. Update replicas first, in reverse ordinal order
	for i := int(*db.Spec.Replicas) - 1; i >= 0; i-- {
		podName := fmt.Sprintf("%s-%d", db.Name, i)
		if podName == primaryPodName {
			continue // Skip the primary for now
		}

		// Check if this replica pod needs an update
		pod := findPodByName(podList, podName)
		if pod == nil {
			log.Info("Replica pod not found, waiting for it to be created", "pod", podName)
			return ctrl.Result{RequeueAfter: time.Second * 10}, nil
		}

		if pod.Labels["controller-revision-hash"] != sts.Status.UpdateRevision {
			log.Info("Deleting outdated replica pod to trigger update", "pod", pod.Name)
			if err := r.Delete(ctx, pod); err != nil {
				log.Error(err, "Failed to delete replica pod for update")
				return ctrl.Result{}, err
			}
			// Wait for the pod to be recreated and become ready
			log.Info("Waiting for replica to be updated", "pod", pod.Name)
			return ctrl.Result{RequeueAfter: time.Second * 15}, nil
		}
	}

	// 2. All replicas are updated. Now, perform the failover.
	log.Info("All replicas are up-to-date. Preparing for primary failover.")

	// Find a suitable candidate for promotion (must be Ready and on the update revision)
	newPrimaryCandidate := findPromotionCandidate(podList, primaryPodName, sts.Status.UpdateRevision)
	if newPrimaryCandidate == nil {
		log.Error(nil, "No healthy, updated replica found to promote. Aborting update.")
		// This is a critical state. We should set a 'Degraded' condition on the CRD.
		return ctrl.Result{RequeueAfter: time.Minute * 1}, nil
	}

	log.Info("Promoting new primary", "candidate", newPrimaryCandidate.Name)

	// In a real scenario, you would exec into the pods:
	// 1. exec into the old primary: run a pre-demotion script (e.g., `pg_ctl stop -m fast`)
	// 2. exec into the new primary: run a promotion script (e.g., `pg_ctl promote`)

	// We represent the role change by flipping the role labels. Because the primary
	// Service selects on RoleLabel=primary, this also re-targets client traffic.
	primaryPod := findPodByName(podList, primaryPodName)

	primaryPod.Labels[RoleLabel] = "replica"
	newPrimaryCandidate.Labels[RoleLabel] = "primary"

	if err := r.Update(ctx, primaryPod); err != nil {
		return ctrl.Result{}, err
	}
	if err := r.Update(ctx, newPrimaryCandidate); err != nil {
		return ctrl.Result{}, err
	}

	// Ensure the primary service is in place and pointing at the new primary
	if err := r.updatePrimaryService(ctx, db, newPrimaryCandidate.Name); err != nil {
		log.Error(err, "CRITICAL: Failed to update primary service. Manual intervention may be required.")
		// Revert the labels if this fails? Complex error recovery is key here.
		return ctrl.Result{}, err
	}

	// 3. Delete the old primary pod
	log.Info("Deleting old primary pod to complete the update", "pod", primaryPod.Name)
	if err := r.Delete(ctx, primaryPod); err != nil {
		log.Error(err, "Failed to delete old primary pod")
		return ctrl.Result{}, err
	}

	log.Info("Zero-downtime update process complete. Waiting for cluster to stabilize.")
	return ctrl.Result{RequeueAfter: time.Second * 10}, nil
}
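
The small helpers used above are plain scans over the pod list. Sketches, under the same assumptions as the rest of the code (the role label plus the StatefulSet's controller-revision-hash label):

go
// findPrimary returns the name of the pod currently labeled as primary, or "".
func findPrimary(podList *corev1.PodList) string {
	for _, pod := range podList.Items {
		if pod.Labels[RoleLabel] == "primary" {
			return pod.Name
		}
	}
	return ""
}

// findPodByName returns a pointer to the named pod, or nil if it is not listed.
func findPodByName(podList *corev1.PodList, name string) *corev1.Pod {
	for i := range podList.Items {
		if podList.Items[i].Name == name {
			return &podList.Items[i]
		}
	}
	return nil
}

// findPromotionCandidate returns a Ready replica that is already on the update
// revision, or nil if none qualifies.
func findPromotionCandidate(podList *corev1.PodList, currentPrimary, updateRevision string) *corev1.Pod {
	for i := range podList.Items {
		pod := &podList.Items[i]
		if pod.Name == currentPrimary || pod.Labels["controller-revision-hash"] != updateRevision {
			continue
		}
		for _, cond := range pod.Status.Conditions {
			if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionTrue {
				return pod
			}
		}
	}
	return nil
}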

Managing the Primary Service

For clients to connect to the database, they need a stable endpoint that always points to the primary. Our operator must manage a separate ClusterIP service for this purpose.

yaml
# Example primary-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: my-db-primary
spec:
  ports:
  - port: 5432
    protocol: TCP
    targetPort: 5432
  selector:
    app: my-db
    db.example.com/role: primary # This selector is key
  type: ClusterIP

During the failover, our operator's updatePrimaryService function doesn't need to mutate the Service itself; it only needs to ensure the Service exists with the selector shown above. Because Service selectors match pod labels, flipping the db.example.com/role label in performUpdate is what actually redirects traffic to the newly promoted primary. This is a clean, declarative approach.
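
With that in mind, updatePrimaryService reduces to a create-if-missing check. A minimal sketch, assuming the Service is named <name>-primary as in the YAML above and skipping drift detection for brevity:

go
// updatePrimaryService ensures a ClusterIP Service whose selector targets the pod
// currently labeled role=primary. The newPrimary argument is kept for logging and
// verification; the label selector does the actual redirection.
func (r *ClusteredDatabaseReconciler) updatePrimaryService(ctx context.Context, db *dbv1alpha1.ClusteredDatabase, newPrimary string) error {
	svc := &corev1.Service{}
	name := types.NamespacedName{Name: db.Name + "-primary", Namespace: db.Namespace}
	err := r.Get(ctx, name, svc)
	if err == nil {
		return nil // Service exists; its selector already follows the role flip.
	}
	if !errors.IsNotFound(err) {
		return err
	}

	svc = &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: name.Name, Namespace: name.Namespace},
		Spec: corev1.ServiceSpec{
			Type: corev1.ServiceTypeClusterIP,
			Selector: map[string]string{
				"app":     db.Name,
				RoleLabel: "primary",
			},
			Ports: []corev1.ServicePort{{Port: 5432}},
		},
	}
	controllerutil.SetControllerReference(db, svc, r.Scheme)
	return r.Create(ctx, svc)
}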


Edge Cases and Production Hardening

A simple success path is not enough for a production system. Here's how we handle common edge cases.

Idempotency

The reconciliation loop must be idempotent. If the operator crashes mid-update, it should be able to recover its state on the next run. Our implementation achieves this by constantly observing the state of the cluster. For example, if it crashes after deleting a replica, on the next run it will see the pod is missing, wait for it to be recreated by the StatefulSet, and continue. If it crashes after promoting a new primary but before deleting the old one, it will see that the old primary pod is not up-to-date and proceed to delete it. State is stored in the cluster (pod revisions, role labels), not in the operator's memory.
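
To make that concrete, here is a small illustrative helper (not part of the code above) that derives the update's progress purely from what the cluster reports; a freshly restarted operator gets the same answer the crashed one would have:

go
// outdatedPods returns the pods not yet on the StatefulSet's update revision.
// The operator never records its own progress; this list, recomputed on every
// reconcile, IS the progress.
func outdatedPods(podList *corev1.PodList, sts *appsv1.StatefulSet) []corev1.Pod {
	var stale []corev1.Pod
	for _, pod := range podList.Items {
		if pod.Labels["controller-revision-hash"] != sts.Status.UpdateRevision {
			stale = append(stale, pod)
		}
	}
	return stale
}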

Failed Pods

What if a new replica pod fails to become Ready after being updated? Our logic currently requeues with a fixed delay. A more robust implementation would use exponential backoff and eventually set a Degraded condition on the ClusteredDatabase CRD status after a certain number of failed attempts. This signals to human operators that intervention is required.

go
// Inside the reconcile loop

// If a pod is not ready after a timeout
meta.SetStatusCondition(&db.Status.Conditions, metav1.Condition{
	Type:    dbv1alpha1.DegradedCondition,
	Status:  metav1.ConditionTrue,
	Reason:  "UpdateFailed",
	Message: fmt.Sprintf("Pod %s failed to become ready after update", pod.Name),
})
// Update the status on the API server
if err := r.Status().Update(ctx, db); err != nil {
	// handle error
}
// Requeue with exponential backoff; nextBackoffDuration grows with the number
// of failed attempts (see the sketch below)
return ctrl.Result{RequeueAfter: nextBackoffDuration}, nil
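
Note that returning an error (or Requeue: true) already gets the workqueue's default exponential backoff; an explicit RequeueAfter as above overrides it, so the delay has to be computed ourselves. A hedged sketch, assuming we track the attempt count in an annotation on the ClusteredDatabase object (an illustrative convention, not a Kubernetes API) and that "strconv" is imported:

go
// updateAttemptsAnnotation is a hypothetical annotation used to persist the
// failure count across operator restarts.
const updateAttemptsAnnotation = "db.example.com/update-attempts"

// nextBackoff derives an exponential requeue delay from the persisted attempt count.
func nextBackoff(db *dbv1alpha1.ClusteredDatabase) time.Duration {
	attempts, _ := strconv.Atoi(db.Annotations[updateAttemptsAnnotation]) // "" parses to 0
	if attempts > 6 {
		attempts = 6 // cap the delay at 5s * 2^6 = 320s
	}
	return time.Duration(1<<uint(attempts)) * 5 * time.Second // 5s, 10s, 20s, ...
}

The reconciler would also have to increment the annotation on each failed attempt and clear it once the update succeeds; that bookkeeping is omitted here.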

Controller Concurrency

For an operator that performs such delicate, sequential operations, concurrency deserves explicit attention. controller-runtime never runs two reconciles for the same object at once, and its default is already MaxConcurrentReconciles: 1, but stating it explicitly documents the intent and prevents a future "performance tweak" from parallelizing the controller:

go
// main.go (or, more idiomatically, SetupWithManager in the controller package)

func main() {
	// ... setup code

	// controller.Options comes from "sigs.k8s.io/controller-runtime/pkg/controller"
	ctrl.NewControllerManagedBy(mgr).
		For(&dbv1alpha1.ClusteredDatabase{}).
		WithOptions(controller.Options{MaxConcurrentReconciles: 1}). // Explicit, even though 1 is the default
		Owns(&appsv1.StatefulSet{}).
		Owns(&corev1.Service{}).
		Complete(&r)

	// ... start manager
}

Keeping MaxConcurrentReconciles at 1 ensures that only one Reconcile invocation is active for our ClusteredDatabase controller at any given time. Combined with controller-runtime's guarantee that a single object is never handed to two workers at once, this rules out interleaved update orchestrations.


Conclusion

We have designed and implemented the core logic for a sophisticated Kubernetes Operator that goes far beyond basic resource templating. By taking direct control of the update process for a StatefulSet using the OnDelete strategy, we injected application-specific logic to perform a zero-downtime, controlled failover for a primary-replica database cluster.

This pattern demonstrates the true power of Kubernetes extensibility. The built-in controllers provide powerful general-purpose primitives, but for mission-critical, stateful applications, a custom operator is often not a luxury but a necessity. The key takeaways are:

  • Seize Control: Use the OnDelete update strategy to disable the default controller logic and implement your own.
  • State in the Cluster: Use labels, annotations, and status conditions to track and drive your state machine. This ensures idempotency and resilience.
  • Orchestrate, Don't Just Create: The operator's value is in managing the lifecycle of the application, especially complex transitions like version upgrades and failovers.
  • Harden for Production: Implement finalizers, handle error states gracefully, set status conditions, and control concurrency to build a truly robust system.

This approach provides a blueprint for managing any complex, clustered application on Kubernetes, ensuring the highest levels of availability and operational autonomy.
