Managing Stateful Deletion with Kubernetes Operator Finalizers

13 min read
Goh Ling Yong
Technology enthusiast and software architect specializing in AI-driven development tools and modern software engineering practices. Passionate about the intersection of artificial intelligence and human creativity in building tomorrow's digital solutions.

The Deletion Fallacy in Stateful Kubernetes Management

In a stateless world, Kubernetes's default deletion mechanism is a model of efficiency. When a user executes kubectl delete pod my-pod, the API server marks the object with a deletionTimestamp, and the responsible components (such as the node's kubelet and the garbage collector) terminate the resource and clean up its dependents. This works flawlessly for ephemeral workloads.

However, for a Kubernetes Operator managing stateful, external resources—such as an AWS RDS instance, a Google Cloud Storage bucket, or a SaaS subscription provisioned via an API—this fire-and-forget approach is a direct path to resource leaks, security vulnerabilities, and runaway cloud bills. The core problem is a lifecycle mismatch: the Kubernetes object representing the external resource can be removed from etcd long before the actual external resource has been safely deprovisioned.

Consider an operator managing a ManagedDatabase Custom Resource (CR). A simplified reconciliation loop might look like this:

  • A ManagedDatabase CR is created.
  • The operator's Reconcile function is triggered. It checks if an RDS instance exists for this CR.
  • If not, it calls the AWS API to create one and stores the instance ID in the CR's status.

Now, what happens upon deletion?

  • kubectl delete manageddatabase my-db is executed.
  • The ManagedDatabase object is marked with a deletionTimestamp.
  • The operator's Reconcile function is triggered again.
  • The operator should call the AWS API to delete the RDS instance.
  • But what if the operator pod is terminated during a deployment rollout just after the kubectl delete command is issued? What if the AWS API call fails temporarily?

With no finalizer in place, nothing blocks the deletion: once the deletionTimestamp is set, Kubernetes will remove the ManagedDatabase object from etcd regardless of whether the cleanup ever ran. Once the object is gone, the operator loses its source of truth and has no way to know it was responsible for a now-orphaned RDS instance. This is the stateful deletion fallacy: assuming in-cluster object deletion guarantees out-of-cluster resource cleanup.
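To make the failure mode concrete, here is a minimal sketch of that naive, finalizer-less reconciler. The ManagedDatabaseReconciler, its RDSClient field, and the examplev1 types are hypothetical stand-ins for illustration, not code this article's operator will use:

    go
    // Naive, finalizer-less deletion handling (hypothetical types; imports as in the
    // controller shown later: context, ctrl, client).
    func (r *ManagedDatabaseReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    	db := &examplev1.ManagedDatabase{}
    	if err := r.Get(ctx, req.NamespacedName, db); err != nil {
    		// If the object has already been removed from etcd, we get NotFound here
    		// and have lost the only record of the RDS instance we created.
    		return ctrl.Result{}, client.IgnoreNotFound(err)
    	}
    
    	if !db.ObjectMeta.DeletionTimestamp.IsZero() {
    		// Best-effort cleanup. With no finalizer, a CR is typically removed as soon
    		// as it is deleted, so this branch may never even run.
    		return ctrl.Result{}, r.RDSClient.DeleteInstance(ctx, db.Status.InstanceID)
    	}
    
    	// ... create the RDS instance and record its ID in db.Status ...
    	return ctrl.Result{}, nil
    }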

This is where Finalizers become an indispensable pattern. A finalizer is a namespaced key added to an object's metadata.finalizers list. Its presence acts as a lock, preventing Kubernetes from fully deleting the object. The object remains in a Terminating state (visible via its deletionTimestamp) until all finalizers are removed from the list. This gives your controller a guaranteed window to perform complex, multi-step, and potentially long-running cleanup operations.

This article will dissect the implementation of a robust finalizer pattern in a Go-based operator using controller-runtime, focusing on production-grade concerns like idempotency, error handling, and asynchronous cleanup.

Core Mechanics: The Finalizer-Aware Reconciliation Loop

Let's architect the Reconcile function to correctly handle the finalizer lifecycle. We'll define a StatefulSetWithBucket custom resource that manages a standard StatefulSet and an associated (mocked) external object storage bucket.

Our finalizer will be a unique string, typically following the domain/name convention, e.g., apps.mycompany.com/bucket-finalizer.

First, let's define our CRD types in api/v1alpha1/statefulsetwithbucket_types.go:

    go
    // api/v1alpha1/statefulsetwithbucket_types.go
    
    package v1alpha1
    
    import (
    	corev1 "k8s.io/api/core/v1"
    	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )
    
    // StatefulSetWithBucketSpec defines the desired state of StatefulSetWithBucket
    type StatefulSetWithBucketSpec struct {
    	// Replicas is the number of desired replicas.
    	Replicas *int32 `json:"replicas,omitempty"`
    
    	// Template for the pods.
    	Template corev1.PodTemplateSpec `json:"template"`
    
    	// BucketName is the name of the external bucket to create.
    	BucketName string `json:"bucketName"`
    }
    
    // StatefulSetWithBucketStatus defines the observed state of StatefulSetWithBucket
    type StatefulSetWithBucketStatus struct {
    	// BucketProvisioned indicates whether the external bucket has been created.
    	BucketProvisioned bool `json:"bucketProvisioned,omitempty"`
    
    	// BucketID is the unique identifier for the external bucket.
    	BucketID string `json:"bucketID,omitempty"`
    
    	// Conditions represent the latest available observations of an object's state.
    	Conditions []metav1.Condition `json:"conditions,omitempty"`
    }
    
    //+kubebuilder:object:root=true
    //+kubebuilder:subresource:status
    
    // StatefulSetWithBucket is the Schema for the statefulsetwithbuckets API
    type StatefulSetWithBucket struct {
    	metav1.TypeMeta   `json:",inline"`
    	metav1.ObjectMeta `json:"metadata,omitempty"`
    
    	Spec   StatefulSetWithBucketSpec   `json:"spec,omitempty"`
    	Status StatefulSetWithBucketStatus `json:"status,omitempty"`
    }
    
    //+kubebuilder:object:root=true
    
    // StatefulSetWithBucketList contains a list of StatefulSetWithBucket
    type StatefulSetWithBucketList struct {
    	metav1.TypeMeta `json:",inline"`
    	metav1.ListMeta `json:"metadata,omitempty"`
    	Items           []StatefulSetWithBucket `json:"items"`
    }
    
    func init() {
    	SchemeBuilder.Register(&StatefulSetWithBucket{}, &StatefulSetWithBucketList{})
    }

Now, let's structure the reconciler in internal/controller/statefulsetwithbucket_controller.go. The core logic splits into two main paths based on the deletionTimestamp.

    go
    // internal/controller/statefulsetwithbucket_controller.go
    
    package controller
    
    import (
    	// ... other imports
    	"context"
    	"time"
    
    	"k8s.io/apimachinery/pkg/runtime"
    	ctrl "sigs.k8s.io/controller-runtime"
    	"sigs.k8s.io/controller-runtime/pkg/client"
    	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
    	"sigs.k8s.io/controller-runtime/pkg/log"
    
    	appsv1alpha1 "my.domain/stateful-bucket-operator/api/v1alpha1"
    )
    
    const bucketFinalizer = "apps.mycompany.com/bucket-finalizer"
    
    // Mock external client for bucket management
    type MockBucketClient struct{}
    
    func (c *MockBucketClient) CreateBucket(ctx context.Context, name string) (string, error) {
    	// Simulate API call
    	log.FromContext(ctx).Info("Creating external bucket", "name", name)
    	time.Sleep(1 * time.Second)
    	return "bucket-" + name + "-12345", nil
    }
    
    func (c *MockBucketClient) DeleteBucket(ctx context.Context, bucketID string) error {
    	// Simulate API call
    	log.FromContext(ctx).Info("Deleting external bucket", "id", bucketID)
    	time.Sleep(2 * time.Second)
    	// To simulate errors, you could add logic here to return an error sometimes.
    	return nil
    }
    
    type StatefulSetWithBucketReconciler struct {
    	client.Client
    	Scheme       *runtime.Scheme
    	BucketClient *MockBucketClient // In a real app, this would be a proper client.
    }
    
    func (r *StatefulSetWithBucketReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    	logger := log.FromContext(ctx)
    
    	// 1. Fetch the StatefulSetWithBucket instance
    	instance := &appsv1alpha1.StatefulSetWithBucket{}
    	if err := r.Get(ctx, req.NamespacedName, instance); err != nil {
    		// Handle not-found errors, which can occur after an object is deleted.
    		return ctrl.Result{}, client.IgnoreNotFound(err)
    	}
    
    	// 2. The core finalizer logic
    	if instance.ObjectMeta.DeletionTimestamp.IsZero() {
    		// The object is NOT being deleted. 
    		// Ensure our finalizer is present.
    		if !controllerutil.ContainsFinalizer(instance, bucketFinalizer) {
    			logger.Info("Adding finalizer for StatefulSetWithBucket")
    			controllerutil.AddFinalizer(instance, bucketFinalizer)
    			if err := r.Update(ctx, instance); err != nil {
    				return ctrl.Result{}, err
    			}
    		}
    	} else {
    		// The object IS being deleted.
    		if controllerutil.ContainsFinalizer(instance, bucketFinalizer) {
    			// Our finalizer is present, so run cleanup logic.
    			if err := r.cleanupExternalResources(ctx, instance); err != nil {
    				// If cleanup fails, return an error to requeue the request.
    				// The finalizer is NOT removed, so the object won't be deleted.
    				logger.Error(err, "Failed to cleanup external resources")
    				return ctrl.Result{}, err
    			}
    
    			// Cleanup was successful. Remove the finalizer.
    			logger.Info("External resources cleaned up. Removing finalizer.")
    			controllerutil.RemoveFinalizer(instance, bucketFinalizer)
    			if err := r.Update(ctx, instance); err != nil {
    				return ctrl.Result{}, err
    			}
    		}
    
    		// Stop reconciliation as the item is being deleted
    		return ctrl.Result{}, nil
    	}
    
    	// 3. Main reconciliation logic (create/update resources)
    	// ... (Code to manage the StatefulSet and the external bucket)
    	if !instance.Status.BucketProvisioned {
    		bucketID, err := r.BucketClient.CreateBucket(ctx, instance.Spec.BucketName)
    		if err != nil {
    			logger.Error(err, "Failed to create external bucket")
    			return ctrl.Result{}, err
    		}
    		instance.Status.BucketProvisioned = true
    		instance.Status.BucketID = bucketID
    		if err := r.Status().Update(ctx, instance); err != nil {
    			return ctrl.Result{}, err
    		}
    		logger.Info("Successfully provisioned external bucket", "bucketID", bucketID)
    	}
    	
    	// ... (Logic to create/update the associated Kubernetes StatefulSet would go here)
    
    	return ctrl.Result{}, nil
    }
    
    func (r *StatefulSetWithBucketReconciler) cleanupExternalResources(ctx context.Context, instance *appsv1alpha1.StatefulSetWithBucket) error {
    	logger := log.FromContext(ctx)
    
    	if instance.Status.BucketID == "" {
    		logger.Info("Bucket ID not found in status, assuming it was never created.")
    		return nil
    	}
    
    	logger.Info("Starting cleanup of external bucket", "bucketID", instance.Status.BucketID)
    	if err := r.BucketClient.DeleteBucket(ctx, instance.Status.BucketID); err != nil {
    		// Here, you might want to check for specific errors. 
    		// If the error indicates the bucket is already gone, you can treat it as a success.
    		return err
    	}
    
    	logger.Info("Successfully deleted external bucket", "bucketID", instance.Status.BucketID)
    	return nil
    }
    
    // SetupWithManager sets up the controller with the Manager.
    func (r *StatefulSetWithBucketReconciler) SetupWithManager(mgr ctrl.Manager) error {
    	// ... setup code
    }

This structure is the foundation of the pattern:

  • Check Deletion Timestamp: The if instance.ObjectMeta.DeletionTimestamp.IsZero() block is the primary fork in the logic.
  • Add Finalizer: On a new or updated object, we idempotently add our finalizer. The controllerutil.ContainsFinalizer check prevents unnecessary updates.
  • Execute Cleanup: When the deletion timestamp is set, we check for our finalizer. If present, we execute our cleanup logic.
  • Handle Cleanup Failure: If cleanupExternalResources returns an error, we immediately return the error to the controller-runtime manager. This triggers a requeue with exponential backoff, ensuring we retry the cleanup later. The crucial point is that the finalizer is not removed.
  • Remove Finalizer: Only after cleanupExternalResources succeeds do we remove the finalizer with controllerutil.RemoveFinalizer and update the object. Once the finalizers list is empty, the Kubernetes garbage collector is free to permanently delete the object.
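The StatefulSet management elided near the end of Reconcile is not the focus of this article, but for completeness it could be filled in with controllerutil.CreateOrUpdate. A minimal sketch, where the label scheme and ServiceName are illustrative assumptions (and appsv1 "k8s.io/api/apps/v1" plus metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" are assumed additional imports):

    go
    // Sketch of the elided StatefulSet step using controllerutil.CreateOrUpdate.
    func (r *StatefulSetWithBucketReconciler) reconcileStatefulSet(ctx context.Context, instance *appsv1alpha1.StatefulSetWithBucket) error {
    	sts := &appsv1.StatefulSet{
    		ObjectMeta: metav1.ObjectMeta{Name: instance.Name, Namespace: instance.Namespace},
    	}
    	_, err := controllerutil.CreateOrUpdate(ctx, r.Client, sts, func() error {
    		labels := map[string]string{"app": instance.Name} // illustrative label scheme
    		sts.Spec.Replicas = instance.Spec.Replicas
    		sts.Spec.ServiceName = instance.Name // assumes a matching headless Service
    		sts.Spec.Selector = &metav1.LabelSelector{MatchLabels: labels}
    		sts.Spec.Template = instance.Spec.Template
    		if sts.Spec.Template.Labels == nil {
    			sts.Spec.Template.Labels = map[string]string{}
    		}
    		for k, v := range labels {
    			sts.Spec.Template.Labels[k] = v
    		}
    		// Owning the StatefulSet means it is garbage-collected along with the CR.
    		return controllerutil.SetControllerReference(instance, sts, r.Scheme)
    	})
    	return err
    }

CreateOrUpdate fetches the current object, applies the mutation function, and only issues an update when something actually changed, which keeps this path idempotent.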
Advanced Concern: Long-Running Cleanup and Operator Responsiveness

The previous implementation works, but it has a significant production flaw: the DeleteBucket call is synchronous and blocks the reconciliation worker. If deleting the bucket takes 30 seconds, or even minutes (e.g., waiting for an S3 bucket with many objects to be emptied), the operator's worker goroutine for this CR is completely tied up. If you have a limited number of concurrent workers (MaxConcurrentReconciles in the controller options), a few slow deletions can starve the entire operator, preventing it from reconciling other new or updated CRs.
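That worker pool is sized where the controller is wired up. A minimal sketch extending the SetupWithManager shown earlier with controller-runtime's builder options (the value 4 is arbitrary, and "sigs.k8s.io/controller-runtime/pkg/controller" is an assumed additional import):

    go
    func (r *StatefulSetWithBucketReconciler) SetupWithManager(mgr ctrl.Manager) error {
    	return ctrl.NewControllerManagedBy(mgr).
    		For(&appsv1alpha1.StatefulSetWithBucket{}).
    		WithOptions(controller.Options{
    			// Upper bound on Reconcile calls running in parallel for this controller;
    			// slow, synchronous cleanup ties these workers up.
    			MaxConcurrentReconciles: 4,
    		}).
    		Complete(r)
    }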

The solution is an asynchronous cleanup pattern. Instead of performing the deletion directly, the reconciler offloads the task to a separate, out-of-band process, typically a Kubernetes Job. The reconciler's role shifts from doing the work to orchestrating it.

Here's how we can refactor our logic:

  • When cleanup is required, the operator creates a Job configured to run a pod with the necessary logic/credentials to delete the external resource.
  • The operator adds an annotation or status condition to the CR to track that a cleanup job is in progress.
  • The Reconcile function, on subsequent runs for the terminating CR, will now check the status of the Job instead of re-attempting the deletion itself.
  • Only when the Job completes successfully does the operator remove the finalizer.

Let's implement this more robust pattern.

First, we need to update our status to track the cleanup job.

    go
    // api/v1alpha1/statefulsetwithbucket_types.go
    
    // ... inside StatefulSetWithBucketStatus
    type StatefulSetWithBucketStatus struct {
        // ... other fields
    
    	// CleanupJobName is the name of the Job created to clean up the external bucket.
    	CleanupJobName string `json:"cleanupJobName,omitempty"`
    }

Next, we refactor the reconciler. We will need RBAC permissions to create, get, and list Jobs.

    go
    // internal/controller/statefulsetwithbucket_controller.go
    
    // +kubebuilder:rbac:groups=batch,resources=jobs,verbs=get;list;watch;create;update;patch;delete
    
    // ... other code
    //
    // (In addition to the imports shown earlier, this refactor assumes: "fmt",
    // batchv1 "k8s.io/api/batch/v1", corev1 "k8s.io/api/core/v1",
    // metav1 "k8s.io/apimachinery/pkg/apis/meta/v1",
    // apierrors "k8s.io/apimachinery/pkg/api/errors",
    // "k8s.io/apimachinery/pkg/types", and "k8s.io/utils/ptr".)
    
    func (r *StatefulSetWithBucketReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
        // ... (fetch instance logic is the same)
    
    	if !instance.ObjectMeta.DeletionTimestamp.IsZero() {
    		if controllerutil.ContainsFinalizer(instance, bucketFinalizer) {
    			// Run async cleanup logic.
    			res, err := r.cleanupExternalResourcesAsync(ctx, instance)
    			if err != nil {
    				return ctrl.Result{}, err
    			}
    		// A requeue request means the cleanup Job was just created or is still
    		// running, so do not touch the finalizer yet.
    		if res.Requeue || res.RequeueAfter > 0 {
    			return res, nil
    		}
    
    			// Cleanup was successful (Job finished). Remove the finalizer.
    			logger.Info("Async cleanup finished. Removing finalizer.")
    			controllerutil.RemoveFinalizer(instance, bucketFinalizer)
    			if err := r.Update(ctx, instance); err != nil {
    				return ctrl.Result{}, err
    			}
    		}
    		return ctrl.Result{}, nil
    	}
    
        // ... (normal reconciliation logic)
    }
    
    func (r *StatefulSetWithBucketReconciler) cleanupExternalResourcesAsync(ctx context.Context, instance *appsv1alpha1.StatefulSetWithBucket) (ctrl.Result, error) {
    	logger := log.FromContext(ctx)
    
    	// If no bucket was ever provisioned, we are done.
    	if instance.Status.BucketID == "" {
    		logger.Info("Bucket ID not found, no cleanup needed.")
    		return ctrl.Result{}, nil
    	}
    
    	// Check if the cleanup job already exists.
    	jobName := fmt.Sprintf("cleanup-%s", instance.Name)
    	job := &batchv1.Job{}
    	err := r.Get(ctx, types.NamespacedName{Name: jobName, Namespace: instance.Namespace}, job)
    
    	if err != nil && apierrors.IsNotFound(err) {
    		// Job does not exist, so let's create it.
    		logger.Info("Creating cleanup job for external bucket", "jobName", jobName)
    		cleanupJob := r.newCleanupJob(instance, jobName)
    		if err := r.Create(ctx, cleanupJob); err != nil {
    			logger.Error(err, "Failed to create cleanup job")
    			return ctrl.Result{}, err
    		}
    		// Record the job name in status so humans and tooling can find it later.
    		instance.Status.CleanupJobName = jobName
    		if err := r.Status().Update(ctx, instance); err != nil {
    			return ctrl.Result{}, err
    		}
    		// Job created, requeue to check its status later.
    		return ctrl.Result{RequeueAfter: 15 * time.Second}, nil
    
    	} else if err != nil {
    		// Some other error occurred when trying to get the job.
    		logger.Error(err, "Failed to get cleanup job")
    		return ctrl.Result{}, err
    	}
    
    	// Job already exists, let's check its status.
    	if job.Status.Succeeded > 0 {
    		logger.Info("Cleanup job succeeded.")
    		// Job is finished, we can proceed with finalizer removal.
    		return ctrl.Result{}, nil
    	}
    
    	if job.Status.Failed > 0 {
    		// Job failed. This is a critical state. 
    		// It requires manual intervention or a more complex retry strategy.
    		logger.Error(fmt.Errorf("cleanup job failed"), "The cleanup job has failed. Manual intervention may be required.", "jobName", jobName)
    		// We return an error to keep requeueing, but the finalizer will remain, blocking deletion.
    		return ctrl.Result{}, fmt.Errorf("cleanup job %s failed", jobName)
    	}
    
    	// Job is still running.
    	logger.Info("Cleanup job is still running. Requeuing.")
    	return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
    }
    
    func (r *StatefulSetWithBucketReconciler) newCleanupJob(instance *appsv1alpha1.StatefulSetWithBucket, jobName string) *batchv1.Job {
    	// In a real operator, this image would be your own utility image that contains
    	// the logic and credentials to talk to your external service (e.g., AWS CLI, a custom tool).
    	image := "my-company/cleanup-tool:latest"
    	bucketID := instance.Status.BucketID
    
    	return &batchv1.Job{
    		ObjectMeta: metav1.ObjectMeta{
    			Name:      jobName,
    			Namespace: instance.Namespace,
    			OwnerReferences: []metav1.OwnerReference{
    				*metav1.NewControllerRef(instance, appsv1alpha1.GroupVersion.WithKind("StatefulSetWithBucket")),
    			},
    		},
    		Spec: batchv1.JobSpec{
    			Template: corev1.PodTemplateSpec{
    				Spec: corev1.PodSpec{
    					Containers: []corev1.Container{
    						{
    							Name:    "cleaner",
    							Image:   image,
    							Args:    []string{"delete-bucket", "--id", bucketID},
    						},
    					},
    					RestartPolicy: corev1.RestartPolicyOnFailure,
    				},
    			},
    			BackoffLimit: ptr.To[int32](4),
    		},
    	}
    }

This asynchronous approach offers several advantages:

  • Operator Responsiveness: The main controller is no longer blocked. It fires off the Job and can immediately proceed to reconcile other resources.
  • Decoupling: The cleanup logic is encapsulated in a separate container image. This is great for separation of concerns and security, as the main operator pod might not need the (potentially powerful) credentials required for deletion.
  • Robustness: Kubernetes Jobs have built-in retry mechanisms (BackoffLimit), making the cleanup process more resilient to transient failures.
  • Observability: The state of the cleanup is now a first-class Kubernetes object (Job) that can be inspected with kubectl, monitored, and alerted on.
Edge Case: The Stuck Finalizer

The most dreaded problem with this pattern is the "stuck finalizer." This occurs when a finalizer is present on an object, but the controller responsible for removing it is unable to do so. The object will remain in the Terminating state indefinitely.

Common causes include:

  • Bug in the Operator: The cleanup logic has a bug that prevents it from ever completing successfully.
  • Permanent External Error: The external API required for cleanup is down, or the resource is in a state where it cannot be deleted (e.g., an S3 bucket with a legal hold).
  • Operator Uninstalled: The operator deployment is deleted from the cluster before its custom resources are. There is no longer any code running to process the finalizer.

Handling this requires both proactive design and a documented operational procedure.

Proactive Design:

  • Metrics and Alerts: Your operator should expose Prometheus metrics. A metric like operator_terminating_resources_count can be used to create an alert if a resource is stuck in the Terminating state for too long (e.g., > 1 hour).
  • Status Conditions: Use status conditions on your CR to provide detailed feedback about the cleanup process. If the Job fails, update the CR's status with a Condition of type Degraded and a clear message explaining why, as in the sketch below.
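A minimal sketch of that condition update, using apimachinery's condition helpers ("k8s.io/apimachinery/pkg/api/meta" is an assumed additional import, and the type, reason, and message strings are illustrative choices):

    go
    // Record a Degraded condition on the CR when the cleanup Job fails, so that
    // kubectl describe on the stuck, terminating object explains why.
    func (r *StatefulSetWithBucketReconciler) markCleanupDegraded(ctx context.Context, instance *appsv1alpha1.StatefulSetWithBucket, jobName string) error {
    	meta.SetStatusCondition(&instance.Status.Conditions, metav1.Condition{
    		Type:    "Degraded",
    		Status:  metav1.ConditionTrue,
    		Reason:  "CleanupJobFailed",
    		Message: fmt.Sprintf("cleanup job %s failed; external bucket %s may still exist", jobName, instance.Status.BucketID),
    	})
    	return r.Status().Update(ctx, instance)
    }

The failed-Job branch in cleanupExternalResourcesAsync could call this before returning its error.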
Operational Procedure:

When a stuck finalizer is detected, an administrator must intervene. The process is typically:

  • Investigate: Use kubectl describe on the CR and check the operator's logs to understand why the cleanup is failing.
  • Manual Cleanup: Manually perform the action the operator was trying to do. For our example, this would mean using the AWS console or CLI to delete the orphaned S3 bucket.
  • Force Removal: Once the external resource is confirmed to be gone, you can manually remove the finalizer from the Kubernetes object. This is a powerful and potentially dangerous operation.
    bash
    # WARNING: Only do this after ensuring the external resource is cleaned up.
    # The JSON patch removes the finalizer at index 0 of metadata.finalizers. If the
    # object carries more than one finalizer, inspect the list first and target the
    # correct index (or edit the object and delete just your operator's entry).
    kubectl patch statefulsetwithbucket <instance-name> -n <namespace> --type 'json' -p='[{"op": "remove", "path": "/metadata/finalizers/0"}]'

This command removes the finalizer through the API server, clearing the way for the object to finally be deleted. It's the escape hatch for when the automated process fails irrevocably.

Conclusion: Finalizers as a Contract

Finalizers are more than just a Kubernetes feature; they represent a contract between your operator and the cluster. By adding a finalizer, your operator tells the API server, "Do not delete this object until I have completed my essential, stateful cleanup tasks." This contract is fundamental to building reliable systems on Kubernetes that interact with the outside world.

By implementing a robust, idempotent, and asynchronous finalizer pattern, you elevate your operator from a simple resource provisioner to a true lifecycle manager. You ensure that the entire lifecycle of your managed resources—creation, updates, and especially deletion—is handled gracefully, preventing orphaned resources and ensuring the stability and cost-effectiveness of your platform. While the logic is more complex than a simple create/update loop, it is the non-negotiable price of admission for writing production-ready, stateful operators.
