Building Stateful Kubernetes Operators with Kubebuilder and Finalizers

Goh Ling Yong

The Operational Gap with Standard Stateful Resources

As a senior engineer, you've likely deployed stateful applications on Kubernetes using StatefulSets. They provide foundational primitives: stable network identities, ordered deployment, and persistent storage. However, for complex systems like a distributed database cluster (e.g., PostgreSQL with replication, a custom NoSQL store), StatefulSets are merely the starting point. They don't understand the application's internal logic.

Consider the deletion of a database cluster. A simple kubectl delete statefulset my-db terminates the pods; by default the PersistentVolumeClaims are left behind, and whether the underlying PersistentVolumes are then retained or deleted depends on their reclaim policy once those PVCs go away. It will not:

  • Gracefully flush in-memory buffers to disk.
  • Notify other cluster members of its departure.
  • Perform a final, pre-deletion backup.
  • Execute a data migration or schema cleanup routine.

This is the operational gap where human intervention or brittle shell scripts traditionally reside. The Operator pattern closes this gap by encoding this operational knowledge into a Kubernetes-native controller.

This post will not cover the basics of the Operator pattern. We assume you understand Custom Resource Definitions (CRDs) and the concept of a reconciliation loop. Instead, we will focus on a critical, advanced technique for managing the lifecycle of stateful resources: Finalizers. Finalizers are the mechanism by which an Operator can safely manage the termination of its resources, ensuring critical cleanup tasks are performed before Kubernetes permanently deletes an object.

Scaffolding the StatefulDB Operator

We'll use Kubebuilder to create our Operator. We're building a hypothetical StatefulDB Operator. Assume you have Go, Docker, and Kubebuilder installed.

bash
# 1. Initialize the project
mkdir statefuldb-operator && cd statefuldb-operator
kubebuilder init --domain my.domain --repo my.domain/statefuldb-operator

# 2. Create the API for our StatefulDB resource
kubebuilder create api --group database --version v1alpha1 --kind StatefulDB

Now, let's define a sophisticated CRD in api/v1alpha1/statefuldb_types.go. A production-grade CRD needs a rich Spec for configuration and a detailed Status for observability.

go
// api/v1alpha1/statefuldb_types.go

package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// BackupStrategy defines the backup configuration.
type BackupStrategy struct {
	// S3Bucket is the target S3 bucket for backups.
	// +kubebuilder:validation:Required
	S3Bucket string `json:"s3Bucket"`

	// Schedule is a cron-style backup schedule.
	// +kubebuilder:validation:Optional
	Schedule string `json:"schedule,omitempty"`
}

// StatefulDBSpec defines the desired state of StatefulDB
type StatefulDBSpec struct {
	// Number of desired pods. Defaults to 1.
	// +kubebuilder:validation:Minimum=1
	// +kubebuilder:default=1
	Replicas *int32 `json:"replicas"`

	// Image is the container image to run for the database.
	// +kubebuilder:validation:Required
	Image string `json:"image"`

	// StorageClassName for the PersistentVolumeClaims.
	// +kubebuilder:validation:Required
	StorageClassName string `json:"storageClassName"`

	// StorageSize for the PersistentVolumeClaims (e.g., "10Gi").
	// +kubebuilder:validation:Required
	StorageSize string `json:"storageSize"`

	// Backup configuration.
	// +kubebuilder:validation:Optional
	Backup BackupStrategy `json:"backup,omitempty"`
}

// StatefulDBStatus defines the observed state of StatefulDB
type StatefulDBStatus struct {
	// Conditions represent the latest available observations of the StatefulDB's state.
	// +optional
	// +patchMergeKey=type
	// +patchStrategy=merge
	Conditions []metav1.Condition `json:"conditions,omitempty" patchStrategy:"merge" patchMergeKey:"type"`
}

//+kubebuilder:object:root=true
//+kubebuilder:subresource:status
//+kubebuilder:printcolumn:name="Replicas",type="integer",JSONPath=".spec.replicas"
//+kubebuilder:printcolumn:name="Status",type="string",JSONPath=".status.conditions[?(@.type==\"Ready\")].status"
//+kubebuilder:printcolumn:name="Age",type="date",JSONPath=".metadata.creationTimestamp"

// StatefulDB is the Schema for the statefuldbs API
type StatefulDB struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   StatefulDBSpec   `json:"spec,omitempty"`
	Status StatefulDBStatus `json:"status,omitempty"`
}

//+kubebuilder:object:root=true

// StatefulDBList contains a list of StatefulDB
type StatefulDBList struct {
	metav1.TypeMeta `json:",inline"`
	metav1.ListMeta `json:"metadata,omitempty"`
	Items           []StatefulDB `json:"items"`
}

func init() {
	SchemeBuilder.Register(&StatefulDB{}, &StatefulDBList{})
}

After defining the types, run make manifests generate to update the CRD manifests and generated code.
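
With the CRD installed in a cluster (for example via make install), a minimal StatefulDB manifest might look like the sketch below. All field values are illustrative placeholders, not defaults shipped by the Operator.

bash
# Apply a sample StatefulDB custom resource (illustrative values)
cat <<EOF | kubectl apply -f -
apiVersion: database.my.domain/v1alpha1
kind: StatefulDB
metadata:
  name: my-db-instance
spec:
  replicas: 3
  image: postgres:16            # any database image; placeholder
  storageClassName: standard
  storageSize: 10Gi
  backup:
    s3Bucket: my-db-backups     # placeholder bucket name
    schedule: "0 2 * * *"
EOF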

The Core Reconciler: Implementing Idempotency

The heart of the operator is the Reconcile method in controllers/statefuldb_controller.go. Its primary responsibility is to converge the actual state of the cluster with the desired state defined in the StatefulDB custom resource.

A key principle here is idempotency. The Reconcile function may be called multiple times for the same object. It must produce the same result regardless of how many times it's executed.

Here is a snippet of a basic reconciliation for creating the underlying StatefulSet.

go
// controllers/statefuldb_controller.go

import (
	// ... other imports
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

func (r *StatefulDBReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	log := r.Log.WithValues("statefuldb", req.NamespacedName)

	// 1. Fetch the StatefulDB instance
	instance := &databasev1alpha1.StatefulDB{}
	err := r.Get(ctx, req.NamespacedName, instance)
	if err != nil {
		if errors.IsNotFound(err) {
			log.Info("StatefulDB resource not found. Ignoring since object must be deleted")
			return ctrl.Result{}, nil
		}
		log.Error(err, "Failed to get StatefulDB")
		return ctrl.Result{}, err
	}

	// 2. Check if the StatefulSet already exists, if not create a new one
	found := &appsv1.StatefulSet{}
	err = r.Get(ctx, types.NamespacedName{Name: instance.Name, Namespace: instance.Namespace}, found)
	if err != nil && errors.IsNotFound(err) {
		// Define a new StatefulSet
		sts := r.statefulSetForStatefulDB(instance)
		log.Info("Creating a new StatefulSet", "StatefulSet.Namespace", sts.Namespace, "StatefulSet.Name", sts.Name)
		if err = r.Create(ctx, sts); err != nil {
			log.Error(err, "Failed to create new StatefulSet", "StatefulSet.Namespace", sts.Namespace, "StatefulSet.Name", sts.Name)
			return ctrl.Result{}, err
		}
		// StatefulSet created successfully - return and requeue
		return ctrl.Result{Requeue: true}, nil
	} else if err != nil {
		log.Error(err, "Failed to get StatefulSet")
		return ctrl.Result{}, err
	}

	// 3. Ensure the StatefulSet size is the same as the spec
	// ... update logic here ...

	return ctrl.Result{}, nil
}

// statefulSetForStatefulDB returns a StatefulDB StatefulSet object
func (r *StatefulDBReconciler) statefulSetForStatefulDB(db *databasev1alpha1.StatefulDB) *appsv1.StatefulSet {
	// ... logic to build the StatefulSet object ...
	// CRITICAL: Set the owner reference so that when the StatefulDB CR is deleted,
	// the StatefulSet is garbage collected by Kubernetes.
	sts := &appsv1.StatefulSet{ /* ... */ }
	ctrl.SetControllerReference(db, sts, r.Scheme)
	return sts
}

ctrl.SetControllerReference is vital. It establishes an owner-dependent relationship. If StatefulDB is deleted, Kubernetes' garbage collector will automatically delete the StatefulSet. But as we discussed, this is insufficient for our needs.
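
For completeness, the elided step 3 might look roughly like the following sketch. It reconciles only the replica count and is not a full update strategy; it reuses the instance, found, and log variables from the Reconcile snippet above.

go
// 3. Ensure the StatefulSet size matches the spec (minimal sketch)
if found.Spec.Replicas == nil || *found.Spec.Replicas != *instance.Spec.Replicas {
	found.Spec.Replicas = instance.Spec.Replicas
	if err := r.Update(ctx, found); err != nil {
		log.Error(err, "Failed to update StatefulSet replicas")
		return ctrl.Result{}, err
	}
	// Requeue so the next reconciliation observes the updated state.
	return ctrl.Result{Requeue: true}, nil
}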

Advanced Lifecycle Control with Finalizers

Here we arrive at the core of this post. To perform actions before our StatefulDB resource is deleted, we need to use a finalizer. A finalizer is a key in the resource's metadata.finalizers list. When a resource has a finalizer, a kubectl delete command does not immediately delete it. Instead, it sets the metadata.deletionTimestamp field to the current time and enters a "terminating" state. The resource is only removed from the API server after its finalizers list is empty.

Our Operator will watch for this deletionTimestamp and perform its cleanup logic. Only upon successful completion will it remove its own finalizer, allowing the deletion to proceed.
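
You can observe this intermediate state directly: while the object is terminating, both fields are visible in its metadata. The resource and object names below are the ones used later in this post.

bash
# Show the deletion timestamp and any finalizers still blocking deletion
kubectl get statefuldb my-db-instance \
  -o jsonpath='{.metadata.deletionTimestamp}{"\n"}{.metadata.finalizers}{"\n"}'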

Step 1: Define and Add the Finalizer

First, let's define our finalizer name. It's best practice to use a domain-qualified name to avoid collisions.

go
// controllers/statefuldb_controller.go
const statefulDBFinalizer = "database.my.domain/finalizer"

Next, in our Reconcile function, we check if the object is being deleted. If not, we ensure our finalizer is present.

go
// ... inside Reconcile function, after fetching the instance

// Check if the instance is being deleted
isMarkedForDeletion := instance.GetDeletionTimestamp() != nil
if isMarkedForDeletion {
    if controllerutil.ContainsFinalizer(instance, statefulDBFinalizer) {
        // Our finalizer is present, so let's handle external dependency cleanup.
        if err := r.handleFinalizer(ctx, instance); err != nil {
            // if fail to delete the external dependency here, return with error
            // so that it can be retried.
            return ctrl.Result{}, err
        }

        // Remove our finalizer from the list and update it.
        controllerutil.RemoveFinalizer(instance, statefulDBFinalizer)
        if err := r.Update(ctx, instance); err != nil {
            return ctrl.Result{}, err
        }
    }
    return ctrl.Result{}, nil
}

// Add finalizer for this CR if it doesn't exist yet
if !controllerutil.ContainsFinalizer(instance, statefulDBFinalizer) {
    log.Info("Adding finalizer for the StatefulDB")
    controllerutil.AddFinalizer(instance, statefulDBFinalizer)
    if err := r.Update(ctx, instance); err != nil {
        return ctrl.Result{}, err
    }
}

// ... rest of the reconciliation logic (creating StatefulSet, etc.)

This logic partitions the Reconcile function into two main paths:

  • Normal Reconciliation: deletionTimestamp is nil. The Operator adds its finalizer and proceeds to manage the StatefulSet.
  • Deletion Reconciliation: deletionTimestamp is set. The Operator executes its cleanup logic (handleFinalizer). Upon success, it removes the finalizer and updates the resource. This signals to Kubernetes that the Operator is finished, and the resource can be deleted.
Step 2: Implement the Finalizer Handler

Our handleFinalizer function will contain the critical pre-delete logic. In our case, we want to create a Kubernetes Job to perform a final backup to S3 before the StatefulSet and its PVCs are destroyed.

go
// controllers/statefuldb_controller.go

import (
	// ... other imports
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func (r *StatefulDBReconciler) handleFinalizer(ctx context.Context, db *databasev1alpha1.StatefulDB) error {
	log := r.Log.WithValues("statefuldb", types.NamespacedName{Name: db.Name, Namespace: db.Namespace})

	// Check if a backup job already exists.
	backupJob := &batchv1.Job{}
	jobName := db.Name + "-final-backup"
	err := r.Get(ctx, types.NamespacedName{Name: jobName, Namespace: db.Namespace}, backupJob)

	if err != nil && errors.IsNotFound(err) {
		// Job doesn't exist, create it.
		log.Info("Creating final backup job")
		job := r.newBackupJob(db, jobName)
		if err := r.Create(ctx, job); err != nil {
			log.Error(err, "Failed to create final backup job")
			return err
		}
		// Job created. The finalizer must stay in place until the job finishes,
		// so signal "not done yet" with an error; the request will be requeued.
		return fmt.Errorf("final backup job %s created; waiting for completion", jobName)
	} else if err != nil {
		log.Error(err, "Failed to get backup job")
		return err
	}

	// Job exists, check its status.
	if isJobFinished(backupJob) {
		if backupJob.Status.Succeeded > 0 {
			log.Info("Final backup job completed successfully.")
			// Cleanup: delete the job itself.
			_ = r.Delete(ctx, backupJob, client.PropagationPolicy(metav1.DeletePropagationBackground))
			return nil // Success! The finalizer can now be removed.
		}
		log.Error(nil, "Final backup job failed.")
		// The job failed. We are in a tricky state.
		// We could retry, or update the status to indicate failure.
		// For now, we return an error to halt finalizer removal.
		return fmt.Errorf("final backup job %s failed", jobName)
	}

	log.Info("Final backup job is still running. Waiting...")
	// Returning an error ensures we don't proceed to remove the finalizer.
	// A better approach would be to return a ctrl.Result{RequeueAfter: ...}
	// from the main loop, but for simplicity we let the rate limiter retry.
	return fmt.Errorf("backup job %s not finished", jobName)
}

func (r *StatefulDBReconciler) newBackupJob(db *databasev1alpha1.StatefulDB, jobName string) *batchv1.Job {
	// In a real implementation, this would mount the PVC from the StatefulSet
	// and run a backup tool.
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{
			Name:      jobName,
			Namespace: db.Namespace,
		},
		Spec: batchv1.JobSpec{
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{{
						Name:  "backup-pod",
						Image: "my-backup-tool:latest", // Your backup tool image
						Args:  []string{"--bucket", db.Spec.Backup.S3Bucket, "--source", "/data"},
					}},
					RestartPolicy: corev1.RestartPolicyNever,
				},
			},
		},
	}
}

func isJobFinished(job *batchv1.Job) bool {
	for _, c := range job.Status.Conditions {
		if (c.Type == batchv1.JobComplete || c.Type == batchv1.JobFailed) && c.Status == corev1.ConditionTrue {
			return true
		}
	}
	return false
}

This implementation is stateful. When handleFinalizer is first called, it creates the backup Job. On subsequent calls, it checks the Job's status. It only returns a nil error (signaling success) once the Job has completed successfully. If the Job fails or is still running, it returns an error, causing the controller-runtime to requeue the request and preventing the finalizer from being removed.
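
If you would rather not treat "still waiting" as an error, one alternative is to have the handler report completion explicitly and let the caller poll with RequeueAfter. The sketch below changes the handler's signature to (done bool, err error); the 10-second interval is an arbitrary choice, and the time package must be imported.

go
// Variant of the deletion path: handleFinalizer returns (done bool, err error).
done, err := r.handleFinalizer(ctx, instance)
if err != nil {
    return ctrl.Result{}, err
}
if !done {
    // Backup still in progress: poll again shortly without surfacing an error.
    return ctrl.Result{RequeueAfter: 10 * time.Second}, nil
}
controllerutil.RemoveFinalizer(instance, statefulDBFinalizer)
if err := r.Update(ctx, instance); err != nil {
    return ctrl.Result{}, err
}
return ctrl.Result{}, nil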

Edge Cases and Production-Hardening

What we have is a solid foundation, but production systems demand we consider the edge cases.

Edge Case 1: The Backup Job Fails Indefinitely

In our current code, if the backup Job fails, handleFinalizer will continuously return an error. The StatefulDB resource will be stuck in a Terminating state forever, and kubectl delete will hang. This is a common pitfall.

Solution: Update the CRD Status to reflect the failure. This makes the state observable to humans and other automation.

First, we need to add a Condition type to our status, which we already did in our _types.go file. Now let's use it.

go
// In handleFinalizer, when backupJob fails:
if backupJob.Status.Failed > 0 {
    log.Error(nil, "Final backup job failed.")
    // Update status to reflect the failure
    backupFailedCondition := metav1.Condition{
        Type:    "BackupFailed",
        Status:  metav1.ConditionTrue,
        Reason:  "FinalBackupJobFailed",
        Message: fmt.Sprintf("The final backup job %s failed", jobName),
    }
    // This requires a helper function to set conditions properly
    // meta.SetStatusCondition(&db.Status.Conditions, backupFailedCondition)
    // if err := r.Status().Update(ctx, db); err != nil { ... }

    // Now, what's the policy? Do we retry? Or halt?
    // For now, we halt to prevent accidental data loss.
    return fmt.Errorf("final backup job %s failed", jobName)
}
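
Filling in those commented lines is straightforward with the meta helpers from apimachinery. A minimal sketch (meta is k8s.io/apimachinery/pkg/api/meta, which also sets LastTransitionTime for you):

go
// Record the condition on the CR and persist it through the status subresource.
meta.SetStatusCondition(&db.Status.Conditions, backupFailedCondition)
if err := r.Status().Update(ctx, db); err != nil {
    log.Error(err, "Failed to update StatefulDB status")
    return err
}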

By updating the status, an administrator can run kubectl describe statefuldb my-db-instance and see the BackupFailed condition, allowing them to debug the job's pods. They could then manually fix the issue and perhaps delete the failed job, allowing the next reconciliation to retry creating it.
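
For example, an administrator might recover along these lines; the job name follows the <cr-name>-final-backup convention used in newBackupJob.

bash
# Inspect the failure reason surfaced on the custom resource
kubectl describe statefuldb my-db-instance

# Check the failed job's output, then remove the job so the operator recreates it
kubectl logs job/my-db-instance-final-backup
kubectl delete job my-db-instance-final-backup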

Edge Case 2: Operator Pod Crashes During Finalization

What if the operator pod dies after creating the backup Job but before the next reconciliation? This is where the idempotent and state-driven nature of the Operator pattern shines. When the operator restarts, its Reconcile function will be called for the terminating StatefulDB resource. The handleFinalizer logic will run again. It will use r.Get to check for the backup Job. Since the job already exists, it won't try to create a new one. It will simply proceed to check the job's status. The state is stored in the Kubernetes API server (as a Job object), not in the operator's memory, making the process resilient to crashes.

Performance Consideration: Exponential Backoff

If creating the backup Job fails due to a transient issue (e.g., API server is temporarily overloaded), our current code will retry immediately, potentially exacerbating the problem. The controller-runtime manager is configured with a default rate limiter that provides exponential backoff, so simple requeues are generally safe. However, for specific known transient errors, you can provide more intelligent requeue logic.

go
// Inside Reconcile, handling an error from r.Create(ctx, job)
if err != nil {
    // Check if it's a transient error, e.g., an admission webhook denial
    if IsTransientError(err) {
        log.Info("Transient error creating job, will requeue after a delay")
        return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
    }
    log.Error(err, "Failed to create backup Job")
    return ctrl.Result{}, err
}
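
IsTransientError is not a library function; it stands in for whatever retry policy you choose. One sketch using the apimachinery errors package already imported in this controller:

go
// IsTransientError reports whether an API error is likely to resolve on its own,
// making a delayed requeue preferable to surfacing a hard failure.
func IsTransientError(err error) bool {
    return errors.IsServerTimeout(err) ||
        errors.IsTimeout(err) ||
        errors.IsTooManyRequests(err) ||
        errors.IsServiceUnavailable(err)
}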

Full Lifecycle Walkthrough

Let's trace the entire lifecycle with our robust, finalizer-aware operator:

  • Creation: kubectl apply -f my-db.yaml is run.
  • Reconcile #1: The Operator sees a new StatefulDB with no deletionTimestamp. It adds database.my.domain/finalizer to metadata.finalizers and updates the CR. This triggers another reconciliation.
  • Reconcile #2: The Operator sees the finalizer is present. It proceeds with normal logic. It finds no StatefulSet and creates one. It updates the StatefulDB status with a Provisioning condition.
  • Reconcile #3...N: The Operator checks the status of the StatefulSet replicas. Once they are all ready, it updates the StatefulDB status, setting the Ready condition to True.
  • Deletion: kubectl delete statefuldb my-db-instance is run.
  • API Server Action: The API server receives the delete request. It sees the finalizer and, instead of deleting the object, sets the deletionTimestamp.
  • Reconcile #N+1: The Operator's reconciliation loop is triggered. It sees deletionTimestamp is not nil. It enters the finalizer logic path.
  • Reconcile #N+2 (Finalizer Logic): handleFinalizer is called. It finds no existing backup Job, so it creates one and returns an error indicating the backup is still pending, which requeues the request without removing the finalizer.
  • Reconcile #N+3...M: On subsequent reconciliations, handleFinalizer finds the running Job and checks its status. It keeps returning errors like backup job not finished, causing requeues until the job is complete.
  • Reconcile #M+1 (Job Complete): The Job finishes successfully. handleFinalizer sees job.Status.Succeeded > 0. It logs success and returns nil.
  • Finalizer Removal: Back in the main Reconcile function, since handleFinalizer returned no error, the Operator calls controllerutil.RemoveFinalizer and updates the StatefulDB object.
  • API Server Final Deletion: The API server sees the StatefulDB object has a deletionTimestamp and an empty finalizers list. It now proceeds to permanently delete the object from etcd.
  • Garbage Collection: Because our StatefulSet had an ownerReference pointing to the StatefulDB, Kubernetes' built-in garbage collector now deletes the StatefulSet and its Pods; the PVCs created from its volumeClaimTemplates are retained unless the StatefulSet's persistentVolumeClaimRetentionPolicy says otherwise.

Conclusion

Finalizers transform a Kubernetes Operator from a simple resource creator into a true lifecycle manager. By intercepting the deletion process, an Operator can perform critical, state-aware cleanup tasks that are impossible with standard Kubernetes resources alone. This pattern is essential for building robust, production-grade controllers for any non-trivial stateful application.

While the implementation requires careful state management—checking for existing cleanup tasks, handling their failures, and updating status—the result is a powerful automation of complex operational knowledge. This moves your infrastructure closer to a truly self-managing system, which is the ultimate promise of the Operator pattern.
