Idempotent Kubernetes Operators with Finalizers for Stateful Services

Goh Ling Yong

The Fragility of State in a Declarative World

In the Kubernetes ecosystem, the control plane's primary directive is to converge the cluster's actual state with a user-defined desired state. For stateless applications managed by Deployments, this model is remarkably effective. However, when managing stateful services—databases, message queues, distributed caches—the convergence model reveals its sharp edges. A simple kubectl delete can trigger a cascade of ungraceful terminations, leaving orphaned persistent volumes, dangling DNS entries in an external service-discovery system, or a corrupted cluster state.

This is the problem domain where Kubernetes Operators excel. By extending the Kubernetes API with Custom Resource Definitions (CRDs) and implementing custom control loops, we can encode complex, domain-specific operational logic directly into the cluster. But building an operator is not enough. A poorly designed operator can be more dangerous than manual management, introducing race conditions and non-deterministic behavior.

This article bypasses introductory concepts and dives directly into two architectural patterns that are fundamental to building production-grade, reliable operators: idempotency and finalizers. We will construct an operator for a fictional replicated database, RepliDB, to illustrate these concepts with production-ready Go code using the controller-runtime library.

Our goal is to build a controller that can:

  • Reliably converge a StatefulSet and Service to match our RepliDB CR specification.
  • Survive controller restarts and intermittent failures without creating duplicate resources or entering error loops.
  • Perform a multi-step, graceful shutdown sequence upon deletion, including interaction with a hypothetical external service, before allowing the RepliDB resource to be removed from the API server.

    Defining the Custom Resource: `RepliDB`

    First, let's define the API for our stateful service. This CRD specifies the desired state for our replicated database. We'll use Kubebuilder markers to generate the CRD manifest and boilerplate Go code.

    api/v1alpha1/replidb_types.go

    go
    package v1alpha1
    
    import (
    	appsv1 "k8s.io/api/apps/v1"
    	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )
    
    // RepliDBSpec defines the desired state of RepliDB
    type RepliDBSpec struct {
    	// +kubebuilder:validation:Minimum=1
    	// +kubebuilder:validation:Maximum=9
    	// Number of database replicas.
    	Replicas int32 `json:"replicas"`
    
    	// Image is the container image for the database node.
    	Image string `json:"image"`
    
    	// StorageClassName is the name of the StorageClass for the PersistentVolumeClaims.
    	StorageClassName string `json:"storageClassName"`
    }
    
    // RepliDBStatus defines the observed state of RepliDB
    type RepliDBStatus struct {
    	// Conditions represent the latest available observations of the RepliDB's state.
    	// +optional
    	Conditions []metav1.Condition `json:"conditions,omitempty"`
    
    	// ReadyReplicas is the number of database nodes that are considered ready.
    	ReadyReplicas int32 `json:"readyReplicas,omitempty"`
    }
    
    //+kubebuilder:object:root=true
    //+kubebuilder:subresource:status
    //+kubebuilder:printcolumn:name="Replicas",type=integer,JSONPath=".spec.replicas"
    //+kubebuilder:printcolumn:name="Ready",type=integer,JSONPath=".status.readyReplicas"
    //+kubebuilder:printcolumn:name="Age",type=date,JSONPath=".metadata.creationTimestamp"
    
    // RepliDB is the Schema for the replidbs API
    type RepliDB struct {
    	metav1.TypeMeta   `json:",inline"`
    	metav1.ObjectMeta `json:"metadata,omitempty"`
    
    	Spec   RepliDBSpec   `json:"spec,omitempty"`
    	Status RepliDBStatus `json:"status,omitempty"`
    }
    
    //+kubebuilder:object:root=true
    
    // RepliDBList contains a list of RepliDB
    type RepliDBList struct {
    	metav1.TypeMeta `json:",inline"`
    	metav1.ListMeta `json:"metadata,omitempty"`
    	Items           []RepliDB `json:"items"`
    }
    
    func init() {
    	SchemeBuilder.Register(&RepliDB{}, &RepliDBList{})
    }

    This CRD allows a user to specify the number of replicas, the container image, and the storage class. The status subresource will track readiness and conditions, which is critical for observability.

    The Core Principle: Idempotent Reconciliation

    The heart of any operator is the Reconcile function. It's invoked whenever a change occurs to the custom resource or any of its owned resources. A common novice mistake is to write procedural logic: "if the StatefulSet doesn't exist, create it". This approach is fragile. The reconciliation loop can be triggered multiple times for the same state, and it can fail or be restarted at any point.

    Idempotency is the property where an operation can be applied multiple times without changing the result beyond the initial application. In our operator, the reconciliation must be idempotent. Whether it runs once or five times in a row, the outcome on the cluster should be identical: the actual state should converge to the desired state.

    This is achieved by structuring the Reconcile function as a state comparison and convergence engine:

  • Fetch the primary resource (RepliDB).
  • Construct the desired state of all secondary resources (e.g., StatefulSet, Service) in memory based on the RepliDB spec.
  • Fetch the actual state of secondary resources from the API server.
  • Compare desired vs. actual.
  • Act only on the differences. If a resource doesn't exist, create it. If it exists but its spec is wrong, update it. If it exists and is correct, do nothing.

    Implementing the Idempotent Loop

    Let's implement this pattern for our RepliDB's StatefulSet. We'll create a helper function that builds the desired StatefulSet object. This separates state definition from state enforcement.

    controllers/replidb_controller.go (inside the RepliDBReconciler struct)

    go
    import (
    	// ... other imports
    	appsv1 "k8s.io/api/apps/v1"
    	corev1 "k8s.io/api/core/v1"
    	"k8s.io/apimachinery/pkg/api/errors"
    	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    	"k8s.io/apimachinery/pkg/types"
    	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
    )
    
    // desiredStatefulSetForRepliDB constructs the desired StatefulSet object.
    func (r *RepliDBReconciler) desiredStatefulSetForRepliDB(db *repldbv1alpha1.RepliDB) *appsv1.StatefulSet {
    	labels := map[string]string{
    		"app":       "replidb",
    		"replidb_cr": db.Name,
    	}
    
    	sts := &appsv1.StatefulSet{
    		ObjectMeta: metav1.ObjectMeta{
    			Name:      db.Name,
    			Namespace: db.Namespace,
    			Labels:    labels,
    		},
    		Spec: appsv1.StatefulSetSpec{
    			Replicas: &db.Spec.Replicas,
    			Selector: &metav1.LabelSelector{MatchLabels: labels},
    			ServiceName: db.Name, // Headless service name
    			Template: corev1.PodTemplateSpec{
    				ObjectMeta: metav1.ObjectMeta{Labels: labels},
    				Spec: corev1.PodSpec{
    					Containers: []corev1.Container{{
    						Name:  "database",
    						Image: db.Spec.Image,
    						Ports: []corev1.ContainerPort{{
    							ContainerPort: 5432,
    							Name:          "db-port",
    						}},
    					}},
    				},
    			},
    		},
    	}
    
    	// Set RepliDB instance as the owner and controller.
    	// This is crucial for garbage collection and for the reconciliation loop to be triggered on StatefulSet changes.
    	controllerutil.SetControllerReference(db, sts, r.Scheme)
    	return sts
    }
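
    The headless Service referenced by ServiceName above can be constructed with the same desired-state pattern. A minimal sketch follows; this helper is not part of the original listing, and the port, port name, and labels are assumed to mirror the StatefulSet:

    go
    // desiredHeadlessServiceForRepliDB constructs the headless Service that gives
    // the StatefulSet pods their stable DNS identities.
    func (r *RepliDBReconciler) desiredHeadlessServiceForRepliDB(db *repldbv1alpha1.RepliDB) *corev1.Service {
    	labels := map[string]string{
    		"app":        "replidb",
    		"replidb_cr": db.Name,
    	}
    
    	svc := &corev1.Service{
    		ObjectMeta: metav1.ObjectMeta{
    			Name:      db.Name, // Must match Spec.ServiceName on the StatefulSet.
    			Namespace: db.Namespace,
    			Labels:    labels,
    		},
    		Spec: corev1.ServiceSpec{
    			ClusterIP: corev1.ClusterIPNone, // Headless: per-pod DNS records, no virtual IP.
    			Selector:  labels,
    			Ports: []corev1.ServicePort{{
    				Name: "db-port",
    				Port: 5432,
    			}},
    		},
    	}
    
    	// Owner reference, as for the StatefulSet above.
    	controllerutil.SetControllerReference(db, svc, r.Scheme)
    	return svc
    }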

    Now, the Reconcile function uses these helpers to enforce the state. The StatefulSet path is shown below; the Service is reconciled the same way.

    controllers/replidb_controller.go (inside the Reconcile method, initial implementation)

    go
    func (r *RepliDBReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    	log := log.FromContext(ctx)
    
    	// 1. Fetch the RepliDB instance
    	var replidb repldbv1alpha1.RepliDB
    	if err := r.Get(ctx, req.NamespacedName, &replidb); err != nil {
    		if errors.IsNotFound(err) {
    			log.Info("RepliDB resource not found. Ignoring since object must be deleted.")
    			return ctrl.Result{}, nil
    		}
    		log.Error(err, "Failed to get RepliDB")
    		return ctrl.Result{}, err
    	}
    
    	// 2. Reconcile the StatefulSet
    	foundSts := &appsv1.StatefulSet{}
    	err := r.Get(ctx, types.NamespacedName{Name: replidb.Name, Namespace: replidb.Namespace}, foundSts)
    
    	desiredSts := r.desiredStatefulSetForRepliDB(&replidb)
    
    	if err != nil && errors.IsNotFound(err) {
    		// StatefulSet does not exist, create it.
    		log.Info("Creating a new StatefulSet", "StatefulSet.Namespace", desiredSts.Namespace, "StatefulSet.Name", desiredSts.Name)
    		if err := r.Create(ctx, desiredSts); err != nil {
    			log.Error(err, "Failed to create new StatefulSet", "StatefulSet.Namespace", desiredSts.Namespace, "StatefulSet.Name", desiredSts.Name)
    			return ctrl.Result{}, err
    		}
    		// StatefulSet created successfully - return and requeue
    		return ctrl.Result{Requeue: true}, nil
    	} else if err != nil {
    		log.Error(err, "Failed to get StatefulSet")
    		return ctrl.Result{}, err
    	}
    
    	// 3. Ensure the StatefulSet spec is up to date.
    	// A simple deep equality check is often too naive. Real-world controllers need more sophisticated diffing.
    	// For this example, we'll check key fields.
    	if *foundSts.Spec.Replicas != replidb.Spec.Replicas || foundSts.Spec.Template.Spec.Containers[0].Image != replidb.Spec.Image {
    		log.Info("StatefulSet spec out of sync, updating...")
    		foundSts.Spec.Replicas = &replidb.Spec.Replicas
    		foundSts.Spec.Template.Spec.Containers[0].Image = replidb.Spec.Image
    		if err := r.Update(ctx, foundSts); err != nil {
    			log.Error(err, "Failed to update StatefulSet", "StatefulSet.Namespace", foundSts.Namespace, "StatefulSet.Name", foundSts.Name)
    			return ctrl.Result{}, err
    		}
    		return ctrl.Result{Requeue: true}, nil
    	}
    
    	// ... update status and finish reconciliation ...
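    	// One possible status-update step (a sketch; assumes the meta helpers
    	// from "k8s.io/apimachinery/pkg/api/meta" are imported and that
    	// "Available" is a condition type we define for this operator):
    	replidb.Status.ReadyReplicas = foundSts.Status.ReadyReplicas
    	available := metav1.ConditionFalse
    	if foundSts.Status.ReadyReplicas == replidb.Spec.Replicas {
    		available = metav1.ConditionTrue
    	}
    	meta.SetStatusCondition(&replidb.Status.Conditions, metav1.Condition{
    		Type:               "Available",
    		Status:             available,
    		Reason:             "StatefulSetReconciled",
    		Message:            "Observed StatefulSet ready replicas",
    		ObservedGeneration: replidb.Generation,
    	})
    	if err := r.Status().Update(ctx, &replidb); err != nil {
    		log.Error(err, "Failed to update RepliDB status")
    		return ctrl.Result{}, err
    	}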
    
    	return ctrl.Result{}, nil
    }

    This logic handles creation and updates idempotently. If you run kubectl apply with the same manifest multiple times, the controller will find that the StatefulSet spec matches the desired state and will simply exit the loop without performing any writes to the API server.
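
    As an aside, controller-runtime also ships a helper, controllerutil.CreateOrUpdate, that folds this get/create/compare/update sequence into a single call. The following is a condensed sketch of the same StatefulSet reconciliation, not a drop-in replacement; it assumes the scaffolded reconciler embeds client.Client (referenced here as r.Client):

    go
    	sts := &appsv1.StatefulSet{
    		ObjectMeta: metav1.ObjectMeta{Name: replidb.Name, Namespace: replidb.Namespace},
    	}
    	op, err := controllerutil.CreateOrUpdate(ctx, r.Client, sts, func() error {
    		desired := r.desiredStatefulSetForRepliDB(&replidb)
    		// Converge the mutable fields to the desired state on every pass.
    		sts.Labels = desired.Labels
    		sts.Spec.Replicas = desired.Spec.Replicas
    		sts.Spec.Template = desired.Spec.Template
    		// Selector and ServiceName are immutable; set them only at creation time.
    		if sts.CreationTimestamp.IsZero() {
    			sts.Spec.Selector = desired.Spec.Selector
    			sts.Spec.ServiceName = desired.Spec.ServiceName
    		}
    		return controllerutil.SetControllerReference(&replidb, sts, r.Scheme)
    	})
    	if err != nil {
    		return ctrl.Result{}, err
    	}
    	log.Info("Reconciled StatefulSet", "operation", op)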

    The Deletion Dilemma: Why Finalizers are Essential

    Our current operator correctly manages the lifecycle of the StatefulSet. But what happens when a user executes kubectl delete replidb my-db?

  • The RepliDB object is marked for deletion by the Kubernetes API server.
  • Because we used SetControllerReference, Kubernetes's built-in garbage collector will see that the StatefulSet is owned by the deleted RepliDB and will proceed to delete it.
  • The StatefulSet controller then terminates its Pods.

    This process is abrupt. It provides no opportunity for our operator to perform pre-deletion cleanup. For our RepliDB, we might need to:

    * Notify a primary node to gracefully hand over leadership.

    * Flush in-memory buffers to disk.

    * De-register the database nodes from an external service discovery endpoint.

    * Back up the data from the PersistentVolumes before they are released.

    This is where finalizers become the critical tool. A finalizer is a key in an object's metadata.finalizers list. When a finalizer is present, a kubectl delete command does not immediately delete the object. Instead, it sets the metadata.deletionTimestamp field to the current time. The object remains visible to the API and to our controller, but in a "deleting" state.

    It is now the responsibility of the controller that added the finalizer to perform its cleanup tasks and, once complete, remove the finalizer key from the list. Only when the finalizers list is empty will the API server actually delete the object.

    Implementing a Finalizer for Graceful Shutdown

    Let's integrate a finalizer into our RepliDBReconciler.

    Step 1: Define the finalizer name.

    It's a best practice to use a domain-qualified name to avoid collisions.

    controllers/replidb_controller.go

    go
    const replidbFinalizer = "db.example.com/finalizer"

    Step 2: Modify the Reconcile function to manage the finalizer.

    The reconciliation logic now splits into two main paths:

  • If DeletionTimestamp is not set: The object is not being deleted. This is the normal reconciliation path. Our first job here is to ensure our finalizer is present.
  • If DeletionTimestamp is set: The object is being deleted. We must execute our cleanup logic. If cleanup is successful, we remove the finalizer.

    Here is the complete, production-grade Reconcile function structure:

    controllers/replidb_controller.go

    go
    func (r *RepliDBReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    	log := log.FromContext(ctx)
    
    	// 1. Fetch the RepliDB instance
    	var replidb repldbv1alpha1.RepliDB
    	if err := r.Get(ctx, req.NamespacedName, &replidb); err != nil {
    		if errors.IsNotFound(err) {
    			return ctrl.Result{}, nil
    		}
    		return ctrl.Result{}, err
    	}
    
    	// 2. Check if the object is being deleted
    	isMarkedForDeletion := replidb.GetDeletionTimestamp() != nil
    	if isMarkedForDeletion {
    		if controllerutil.ContainsFinalizer(&replidb, replidbFinalizer) {
    			// Run our finalizer logic. If it fails, we'll try again later.
    			if err := r.finalizeRepliDB(ctx, &replidb); err != nil {
    				log.Error(err, "Failed to finalize RepliDB")
    				return ctrl.Result{}, err
    			}
    
    			// If successful, remove the finalizer and update the object.
    			controllerutil.RemoveFinalizer(&replidb, replidbFinalizer)
    			if err := r.Update(ctx, &replidb); err != nil {
    				return ctrl.Result{}, err
    			}
    		}
    		return ctrl.Result{}, nil
    	}
    
    	// 3. Add finalizer for this CR if it doesn't exist
    	if !controllerutil.ContainsFinalizer(&replidb, replidbFinalizer) {
    		log.Info("Adding finalizer for RepliDB")
    		controllerutil.AddFinalizer(&replidb, replidbFinalizer)
    		if err := r.Update(ctx, &replidb); err != nil {
    			return ctrl.Result{}, err
    		}
    	}
    
    	// 4. Run the regular reconciliation logic (create/update StatefulSet, etc.)
    	// ... [The idempotent reconciliation logic from before] ...
    	foundSts := &appsv1.StatefulSet{}
    	err := r.Get(ctx, types.NamespacedName{Name: replidb.Name, Namespace: replidb.Namespace}, foundSts)
        // ... etc ...
    
    	return ctrl.Result{}, nil
    }
    
    // finalizeRepliDB contains the logic to run before deleting the CR.
    func (r *RepliDBReconciler) finalizeRepliDB(ctx context.Context, db *repldbv1alpha1.RepliDB) error {
    	log := log.FromContext(ctx)
    	log.Info("Starting finalization for RepliDB")
    
    	// In a real-world scenario, you would add complex cleanup logic here.
    	// For example, calling an external API to de-register the database.
    	// We'll simulate this with a log message.
    	log.Info("De-registering database from external discovery service", "dbName", db.Name)
    	// http.Post("https://discovery.example.com/deregister", ...)
    
    	// The built-in garbage collection will handle the owned StatefulSet, but if you had
    	// other external resources not managed by owner references, you would clean them up here.
    	// For example, deleting a DNS record in Route53 or a load balancer in a cloud provider.
    
    	log.Info("Finalization for RepliDB successful")
    	return nil
    }

    This structure is robust. When a RepliDB is created, the reconciler's first action is to add the finalizer. From that point on, the object cannot be fully deleted from the cluster until our finalizeRepliDB function runs successfully and the finalizer is removed. This guarantees our cleanup logic will execute.
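
    One optional hardening of the finalizer-removal step: the r.Update call can fail with an optimistic-concurrency conflict if another client modified the RepliDB between our read and our write. A merge patch built with client.MergeFrom (from sigs.k8s.io/controller-runtime/pkg/client) is not resourceVersion-gated by default, so it avoids conflicts caused by unrelated changes. This is a sketch of that variant, not a required part of the pattern:

    go
    	// Remove the finalizer with a merge patch instead of a full Update.
    	base := replidb.DeepCopy()
    	controllerutil.RemoveFinalizer(&replidb, replidbFinalizer)
    	if err := r.Patch(ctx, &replidb, client.MergeFrom(base)); err != nil {
    		return ctrl.Result{}, err
    	}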

    Advanced Edge Cases and Performance Considerations

    Building a simple operator is straightforward. Building one that withstands the chaos of a production environment requires thinking about failure modes.

    Edge Case 1: Partial Failure During Finalization

    What if our finalizeRepliDB function fails? For instance, the external discovery service it calls is unavailable. In our current implementation, the function returns an error, and the Reconcile call is requeued with exponential backoff by controller-runtime. This is the desired behavior. The RepliDB object will remain in a Terminating state, with the deletionTimestamp set, until the discovery service becomes available and our finalizer logic succeeds. This prevents the system from entering an inconsistent state where the database pods are gone but are still registered as active in the discovery service.
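
    If the outage is expected and a fixed retry interval is preferable to the growing error-driven backoff, a variant (an assumption on our part, not shown above; it also needs the standard "time" package) is to return a RequeueAfter result with a nil error from the deletion branch:

    go
    	if err := r.finalizeRepliDB(ctx, &replidb); err != nil {
    		log.Error(err, "Finalization failed, will retry")
    		// A nil error with RequeueAfter retries on a fixed schedule and keeps
    		// the failure from inflating the controller's error/backoff counters.
    		return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
    	}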

    Edge Case 2: Controller Restarts

    The operator Pod can be evicted, crash, or be rescheduled at any time. How does this affect our logic?

    * During normal reconciliation: State is stored in etcd (the CR itself). When the new controller pod starts, it will receive events for all RepliDB objects and its idempotent reconciliation loop will simply verify the state, making no changes if everything is already converged.

    * During finalization: The state is also stored in etcd (the deletionTimestamp and the finalizer key). If the controller crashes mid-cleanup, the new instance will see that the object is still marked for deletion and still has the finalizer, and it will re-trigger the finalizeRepliDB function. This is why the cleanup logic itself must also be idempotent. For example, the external API call should be a DELETE request, which is typically idempotent, rather than a state-toggling POST.
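
    To make that concrete, here is a minimal sketch of an idempotent de-registration call that finalizeRepliDB could delegate to. It uses only the standard library (context, fmt, net/http); the endpoint path and its 404 semantics are hypothetical (only the discovery.example.com host appears earlier). The point is that repeating the call after a crash or requeue is harmless:

    go
    // deregisterFromDiscovery removes a database from a hypothetical external
    // discovery service. Treating 404 as success keeps the call idempotent.
    func deregisterFromDiscovery(ctx context.Context, dbName string) error {
    	url := fmt.Sprintf("https://discovery.example.com/databases/%s", dbName)
    	req, err := http.NewRequestWithContext(ctx, http.MethodDelete, url, nil)
    	if err != nil {
    		return err
    	}
    	resp, err := http.DefaultClient.Do(req)
    	if err != nil {
    		return err // Network failure: the reconcile is requeued and retried.
    	}
    	defer resp.Body.Close()
    
    	// 2xx means we de-registered it now; 404 means a previous, partially
    	// completed finalization already did. Both count as success.
    	if resp.StatusCode == http.StatusNotFound || (resp.StatusCode >= 200 && resp.StatusCode < 300) {
    		return nil
    	}
    	return fmt.Errorf("discovery service returned status %d", resp.StatusCode)
    }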

    Performance: Watching, Predicates, and Caching

    By default, controller-runtime sets up watches that trigger reconciliation on any change to the primary resource (RepliDB) or secondary resources (StatefulSet, Service). For an operator managing thousands of CRs, this can be inefficient.

  • Event Filtering with Predicates: Many changes don't require reconciliation. For example, a change to the status subresource of our RepliDB, which our own controller is writing, shouldn't trigger another reconciliation loop. We can use predicates to filter these events.

    controllers/replidb_controller.go (in SetupWithManager)

    go
        import "sigs.k8s.io/controller-runtime/pkg/predicate"
    
        func (r *RepliDBReconciler) SetupWithManager(mgr ctrl.Manager) error {
            return ctrl.NewControllerManagedBy(mgr).
                For(&repldbv1alpha1.RepliDB{}).
                Owns(&appsv1.StatefulSet{}).
                WithEventFilter(predicate.GenerationChangedPredicate{}).
                Complete(r)
        }

    GenerationChangedPredicate filters out update events in which the object's metadata.generation has not changed. Since status updates don't increment the generation but spec changes do, this effectively ignores status-only updates, significantly reducing unnecessary reconciliations.
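
    Note that WithEventFilter applies the predicate to every watch the builder registers, including the Owns watch on the StatefulSet, so StatefulSet status-only changes (such as readyReplicas moving) would also be filtered. If the reconciler needs those events, one option is to scope the predicate to the primary resource. A sketch using builder.WithPredicates from sigs.k8s.io/controller-runtime/pkg/builder:

    go
        func (r *RepliDBReconciler) SetupWithManager(mgr ctrl.Manager) error {
            return ctrl.NewControllerManagedBy(mgr).
                // Ignore generation-unchanged (e.g., status-only) events for the CR...
                For(&repldbv1alpha1.RepliDB{}, builder.WithPredicates(predicate.GenerationChangedPredicate{})).
                // ...while still reconciling on any change to the owned StatefulSet.
                Owns(&appsv1.StatefulSet{}).
                Complete(r)
        }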

  • API Server Caching: controller-runtime's client is, by default, a caching client. r.Get() calls read from a local in-memory cache (an informer) that is kept in sync with the API server. This is highly efficient and avoids overwhelming etcd. However, it's crucial to remember that r.Update() and r.Create() calls go directly to the API server. This cache-on-read, direct-on-write pattern is fundamental to operator performance.
  • Concurrent Reconciles: For high-throughput operators, you can configure the controller manager to run multiple reconciliation loops in parallel.

    main.go

    go
        // ...
        // Alternatively, pass these options via the builder's WithOptions in SetupWithManager.
        c, err := controller.New("replidb-controller", mgr, controller.Options{
            Reconciler:              &controllers.RepliDBReconciler{ /* ... */ },
            MaxConcurrentReconciles: 5, // Default is 1
        })
        if err != nil {
            // handle the error (elided)
        }
        _ = c // watches for this controller are registered elsewhere (elided)
        // ...

    Use this with caution: controller-runtime guarantees that two reconciliations for the same object never run concurrently, but reconciliations for different objects can, so races are still possible when they interact with the same external, non-atomic systems.

    Conclusion: From Logic to Reliability

    We have moved beyond a basic operator implementation to a robust, production-oriented architecture. The key takeaways are not about the specific code, but the patterns they represent:

    * Idempotency is non-negotiable. Structure every reconciliation as a stateless comparison of desired vs. actual state. Your operator must be able to crash and restart at any moment without corrupting the system it manages.

    * Finalizers are the canonical mechanism for safe deletion. They provide the essential hook for performing graceful shutdown and cleaning up any external resources that Kubernetes's garbage collector is unaware of. They transform deletion from an abrupt event into a managed, observable process.

    By combining these two patterns, you provide a powerful abstraction that allows users to manage the entire lifecycle of a complex, stateful application with the same declarative kubectl apply and kubectl delete commands they use for simple, stateless services. This is the true power of the operator pattern: encoding deep, resilient operational knowledge into software that runs as a first-class citizen on your Kubernetes cluster.
