Stateful Cleanup: Implementing Finalizers in Custom Kubernetes Operators

Goh Ling Yong

The Orphaned Resource Problem in Production

In a mature Kubernetes environment, operators are the standard for managing complex, stateful applications. A common pattern is to define a Custom Resource (CR) that represents a high-level concept, like a ManagedDatabase or a DNSEntry. The operator's reconciliation loop then translates this CR into lower-level resources, often interacting with external APIs—a cloud provider's SDK to provision a database, a DNS provider's API to create a record, etc.

The creation and update logic is often straightforward. The real challenge, frequently underestimated, is deletion. When a user executes kubectl delete manageddatabase my-prod-db, Kubernetes' default behavior is to simply remove the ManagedDatabase object from the etcd data store. The API server has no inherent knowledge of the actual RDS instance or Cloud SQL database this object represents. The operator, which was previously watching this object, gets a 'delete' event, but by then, the object is gone. The result is an orphaned external resource: a running database with no corresponding Kubernetes object to manage it, leading to resource leaks, security vulnerabilities, and billing surprises.

This is where the Kubernetes finalizer pattern becomes not just a best practice, but a necessity for production-grade operators. A finalizer is a mechanism that prevents the immediate deletion of a resource, allowing controllers to execute pre-delete hooks for cleanup. It transforms the operator from a simple create/update engine into a true lifecycle manager.

This article provides a deep dive into the implementation of finalizers within a Go-based operator built with Kubebuilder, focusing on idempotency, edge case handling, and production-ready patterns.

Prerequisites

This guide assumes you are a senior engineer with a solid understanding of:

* The Kubernetes operator pattern and reconciliation loops.

* Go programming language.

* Experience with Kubebuilder or Operator-SDK for scaffolding operators.

* The basics of Custom Resource Definitions (CRDs).

We will not cover the initial setup of an operator project. We will jump directly into modifying the reconciliation logic to implement a robust finalizer.

The Finalizer Mechanism: A Deeper Look

A finalizer is simply a string key added to the metadata.finalizers list of any Kubernetes object. When the Kubernetes API server receives a delete request for an object that has finalizers, it does not delete it immediately. Instead, it updates the object by setting a metadata.deletionTimestamp. This timestamp signals to all controllers watching the object that a deletion has been requested.

The object remains in the API server in a 'terminating' state until all keys in its metadata.finalizers list have been removed. It is the sole responsibility of the controller that added a specific finalizer to also remove it after its cleanup logic has successfully completed.

Our operator's reconciliation loop will leverage this behavior:

  • On Creation/Update: If our CR does not have our operator's finalizer, we add it. This 'registers' the object for our cleanup logic.
  • On Deletion Request: The reconciler will see that deletionTimestamp is set.
  • Cleanup Execution: The reconciler executes its cleanup logic (e.g., call the cloud provider API to delete the database).
  • Finalizer Removal: Upon successful cleanup, the reconciler removes its finalizer key from the metadata.finalizers list and updates the object.
  • Final Deletion: Once our finalizer (and any others) is gone, the Kubernetes garbage collector is free to permanently delete the CR object from etcd.
Implementing Finalizers in a Go Operator

    Let's consider a practical scenario. We have a CloudDatabase CRD that manages a database instance on a fictional external cloud provider. Our goal is to ensure that when a CloudDatabase CR is deleted, the corresponding database in the cloud is also deleted.

    First, let's define our finalizer key as a constant in our controller file (clouddatabase_controller.go):

    go
    const cloudDatabaseFinalizer = "database.example.com/finalizer"

    Using a domain-qualified name is a best practice to avoid collisions with other controllers that might be managing the same object.
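    The controllerutil helpers used throughout the reconciler below (ContainsFinalizer, AddFinalizer, RemoveFinalizer) are thin operations on the metadata.finalizers string slice. A stdlib sketch of their semantics, using a simplified stand-in for metav1.ObjectMeta, makes the mechanism concrete:

```go
package main

import "fmt"

// objectMeta stands in for metav1.ObjectMeta; only the finalizers list matters here.
type objectMeta struct {
	Finalizers []string
}

func containsFinalizer(m *objectMeta, f string) bool {
	for _, item := range m.Finalizers {
		if item == f {
			return true
		}
	}
	return false
}

// addFinalizer appends f only if it is not already present (idempotent).
func addFinalizer(m *objectMeta, f string) {
	if !containsFinalizer(m, f) {
		m.Finalizers = append(m.Finalizers, f)
	}
}

// removeFinalizer deletes every occurrence of f, preserving the order of the rest.
func removeFinalizer(m *objectMeta, f string) {
	kept := m.Finalizers[:0]
	for _, item := range m.Finalizers {
		if item != f {
			kept = append(kept, item)
		}
	}
	m.Finalizers = kept
}

func main() {
	m := &objectMeta{}
	addFinalizer(m, "database.example.com/finalizer")
	addFinalizer(m, "database.example.com/finalizer") // second add is a no-op
	fmt.Println(m.Finalizers)
	removeFinalizer(m, "database.example.com/finalizer")
	fmt.Println(len(m.Finalizers))
}
```

    In the real operator you should use the controllerutil versions rather than hand-rolling these, but the point stands: add and remove must both be idempotent, because the reconciler may run them repeatedly.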

    The Modified Reconciliation Loop

    The core of our work lies in restructuring the Reconcile function. The standard scaffolded function needs to be augmented with logic to handle both the normal reconciliation path and the deletion path.

    Here is a complete, production-ready Reconcile function structure:

    go
    // CloudDatabaseReconciler reconciles a CloudDatabase object
    type CloudDatabaseReconciler struct {
    	client.Client
    	Log    logr.Logger
    	Scheme *runtime.Scheme
        // A client for our external cloud provider API
    	CloudAPIClient *cloudapi.Client
    }
    
    func (r *CloudDatabaseReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    	log := r.Log.WithValues("clouddatabase", req.NamespacedName)
    
    	// 1. Fetch the CloudDatabase instance
    	instance := &databasev1alpha1.CloudDatabase{}
    	err := r.Get(ctx, req.NamespacedName, instance)
    	if err != nil {
    		if errors.IsNotFound(err) {
    			// Request object not found, could have been deleted after reconcile request.
    			// Owned objects are automatically garbage collected. For additional cleanup logic use finalizers.
    			// Return and don't requeue
    			log.Info("CloudDatabase resource not found. Ignoring since object must be deleted")
    			return ctrl.Result{}, nil
    		}
    		// Error reading the object - requeue the request.
    		log.Error(err, "Failed to get CloudDatabase")
    		return ctrl.Result{}, err
    	}
    
    	// 2. Check if the instance is being deleted
    	isMarkedForDeletion := instance.GetDeletionTimestamp() != nil
    	if isMarkedForDeletion {
    		if controllerutil.ContainsFinalizer(instance, cloudDatabaseFinalizer) {
    			// Run finalization logic. If it fails, don't remove the finalizer so we can retry.
    			if err := r.finalizeCloudDatabase(ctx, log, instance); err != nil {
                    // If finalization fails, we must return an error to trigger a retry.
    				return ctrl.Result{}, err
    			}
    
    			// Remove finalizer. Once all finalizers are removed, the object will be deleted.
    			controllerutil.RemoveFinalizer(instance, cloudDatabaseFinalizer)
    			err := r.Update(ctx, instance)
    			if err != nil {
    				return ctrl.Result{}, err
    			}
    		}
            // Stop reconciliation as the item is being deleted
    		return ctrl.Result{}, nil
    	}
    
    	// 3. Add finalizer for new or updated instances
    	if !controllerutil.ContainsFinalizer(instance, cloudDatabaseFinalizer) {
    		log.Info("Adding finalizer for CloudDatabase")
    		controllerutil.AddFinalizer(instance, cloudDatabaseFinalizer)
    		err = r.Update(ctx, instance)
    		if err != nil {
    			return ctrl.Result{}, err
    		}
    	}
    
    	// 4. Standard reconciliation logic (create/update external resource)
        // Check if the external DB exists
        db, err := r.CloudAPIClient.GetDatabase(ctx, instance.Spec.DBInstanceID)
        if err != nil {
            if cloudapi.IsNotFound(err) {
                log.Info("Creating external database", "InstanceID", instance.Spec.DBInstanceID)
                // Create logic here...
                // ... after creation, update status
                instance.Status.Provisioned = true
                instance.Status.URL = "..."
                if err := r.Status().Update(ctx, instance); err != nil {
                    log.Error(err, "Failed to update CloudDatabase status")
                    return ctrl.Result{}, err
                }
                return ctrl.Result{}, nil
            }
            return ctrl.Result{}, err
        }
    
        // Update logic here (e.g., sync specs)
        // ...
    
    	return ctrl.Result{}, nil
    }

    The `finalizeCloudDatabase` Function: Idempotency is Key

    The real work of cleanup happens in a dedicated function. A critical requirement for any finalization logic is idempotency. The reconciliation loop can be triggered multiple times for a terminating object if the cleanup fails or if the operator restarts. Your cleanup code must handle this gracefully.

    go
    // finalizeCloudDatabase performs the cleanup of external resources.
    func (r *CloudDatabaseReconciler) finalizeCloudDatabase(ctx context.Context, log logr.Logger, db *databasev1alpha1.CloudDatabase) error {
    	log.Info("Starting finalization for CloudDatabase", "instanceID", db.Spec.DBInstanceID)
    
    	// Call the external API to delete the database
    	err := r.CloudAPIClient.DeleteDatabase(ctx, db.Spec.DBInstanceID)
    	if err != nil {
            // CRITICAL: Check if the error is because the resource is already gone.
            // If so, we can consider the cleanup successful.
    		if cloudapi.IsNotFound(err) {
    			log.Info("External database already deleted. Finalization successful.")
    			return nil
    		}
    
    		// Any other error means we should retry.
    		log.Error(err, "Failed to delete external database during finalization")
    		return err
    	}
    
    	log.Info("Successfully finalized CloudDatabase")
    	return nil
    }

    The most important piece of this function is handling the IsNotFound error from the cloud API client. If a previous attempt to finalize failed midway, or if someone manually deleted the database out-of-band, our DeleteDatabase call might fail because the resource no longer exists. We must treat this specific error as a success condition for our cleanup; otherwise, the finalizer will never be removed, and the CR will be stuck in a Terminating state forever.

    Advanced Edge Cases and Production Patterns

    While the above implementation works, production environments introduce more complexity. Let's explore how to handle common edge cases.

    Edge Case: The Stuck Finalizer

    A bug in your operator or an unrecoverable issue in the external API could lead to a situation where the finalizer can never be removed. The CR will be stuck with deletionTimestamp set, and kubectl delete --force won't help.

    Debugging:

  • Check Operator Logs: The first step is always to inspect the operator logs for the specific CR. The logs should indicate why the finalizeCloudDatabase function is failing.
  • Inspect the CR Status: A well-designed operator should update the CR's status subresource with conditions reflecting the failure. For example, a Degraded condition with a message like FinalizationFailed: Cloud API returned 503 Service Unavailable.
  • Manual Intervention (The Last Resort):

    If the operator cannot be fixed, and the object must be deleted, you can manually remove the finalizer using kubectl patch. This is a dangerous operation and should only be performed when you have manually verified that the external resource has been cleaned up.

    bash
    kubectl patch clouddatabase my-prod-db --type json -p='[{"op": "remove", "path": "/metadata/finalizers"}]'

    This command tells the API server to remove the entire finalizers array, which then allows the garbage collector to delete the object. Be aware that this also strips finalizers added by other controllers; if the object carries more than one, remove only your entry by index instead (for example, "path": "/metadata/finalizers/0").

    Pattern: Asynchronous Cleanup for Long-Running Tasks

    What if deleting the external resource takes minutes? A synchronous DeleteDatabase call in the reconciler would block that worker goroutine, potentially starving other reconciliations. This can severely impact the operator's performance and responsiveness.

    A better pattern is to make the cleanup process asynchronous.

  • Update Status to Terminating: In the finalizeCloudDatabase function, instead of calling the blocking delete function directly, first update the CR's status to indicate cleanup is in progress.
  • Trigger Asynchronous Job: Start a goroutine or, even better, create a Kubernetes Job to perform the actual deletion.
  • Requeue and Wait: The Reconcile function returns ctrl.Result{RequeueAfter: 30 * time.Second} (setting Requeue: true alongside RequeueAfter is redundant). The reconciler will then periodically check the status of the cleanup.
    Modified `finalizeCloudDatabase` for Async Operations:

    go
    func (r *CloudDatabaseReconciler) finalizeCloudDatabase(ctx context.Context, log logr.Logger, db *databasev1alpha1.CloudDatabase) error {
        // Check status to see if cleanup has already been initiated
        if db.Status.Phase == "Terminating" {
            log.Info("Cleanup job already in progress, checking status...")
            cleanupComplete, err := r.checkCleanupJobStatus(ctx, db)
            if err != nil {
                log.Error(err, "Failed to check cleanup job status")
                return err // Retry
            }
    
            if cleanupComplete {
                log.Info("Cleanup job finished successfully.")
                return nil // Success, finalizer will be removed
            }
    
            log.Info("Cleanup job still running, will check again later.")
            // Returning an error forces a retry with exponential backoff. This works,
            // but it logs every poll as a failure. A cleaner approach is to watch the
            // Job object itself (e.g. via Owns() in SetupWithManager) so the
            // reconciler is triggered as soon as the Job completes, instead of polling.
            return fmt.Errorf("cleanup job is not yet complete")
    
        } else {
            log.Info("Initiating asynchronous cleanup job.")
            db.Status.Phase = "Terminating"
            db.Status.Message = "External resource deletion in progress."
            if err := r.Status().Update(ctx, db); err != nil {
                log.Error(err, "Failed to update status to Terminating")
                return err
            }
            
            if err := r.launchCleanupJob(ctx, db); err != nil {
                log.Error(err, "Failed to launch cleanup job")
                // Optionally update status to reflect this failure
                return err
            }
    
            // Return an error to force requeue, so we can start monitoring the job status
            return fmt.Errorf("cleanup job initiated, awaiting completion")
        }
    }

    This approach makes the operator far more resilient and performant, as the main reconciliation loop remains unblocked and can service other CRs.
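    The control flow above is worth internalizing: the reconciler keeps calling the finalize function until it finally returns nil, and only then removes the finalizer. A minimal stdlib simulation of that two-phase state machine (initiate on the first call, poll thereafter), with hypothetical names standing in for the CR and the cleanup Job, looks like this:

```go
package main

import (
	"errors"
	"fmt"
)

// fakeDB stands in for the CloudDatabase CR; Phase mirrors db.Status.Phase.
type fakeDB struct {
	Phase string
}

// pollsRemaining simulates a cleanup Job that needs a few polls before finishing.
var pollsRemaining = 2

func checkCleanupJobStatus() bool {
	if pollsRemaining > 0 {
		pollsRemaining--
		return false
	}
	return true
}

var errInProgress = errors.New("cleanup in progress")

// finalize mirrors the structure of finalizeCloudDatabase: the first call
// transitions the object to Terminating and starts cleanup; subsequent calls
// poll until the cleanup reports done.
func finalize(db *fakeDB) error {
	if db.Phase == "Terminating" {
		if checkCleanupJobStatus() {
			return nil // done; the caller removes the finalizer
		}
		return errInProgress
	}
	db.Phase = "Terminating"
	return errInProgress
}

func main() {
	db := &fakeDB{}
	attempts := 0
	for {
		attempts++
		if err := finalize(db); err == nil {
			break
		}
	}
	fmt.Println(attempts) // one initiate + two pending polls + one success
}
```

    Every call is safe to repeat, which is exactly the property the real finalizer needs when the operator restarts mid-cleanup.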

    Pattern: Using the Status Subresource for Clear Feedback

    Never leave users guessing about the state of a resource. The status subresource is your primary tool for communicating the operator's actions. During finalization, this is crucial.

    Extend your CRD's Status struct:

    go
    // CloudDatabaseStatus defines the observed state of CloudDatabase
    type CloudDatabaseStatus struct {
    	// INSERT ADDITIONAL STATUS FIELD - define observed state of cluster
    	// Important: Run "make" to regenerate code after modifying this file
    
    	// Conditions represent the latest available observations of an object's state
    	Conditions  []metav1.Condition `json:"conditions,omitempty"`
    	URL         string             `json:"url,omitempty"`
    	Provisioned bool               `json:"provisioned,omitempty"`
    }

    Kubebuilder encourages using the metav1.Condition type, which is a standard across many Kubernetes resources. You can define conditions like Provisioned, Ready, and Degraded.
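    A key behavior of meta.SetStatusCondition is that it upserts: a condition with a matching Type is updated in place (preserving LastTransitionTime when the Status has not changed), and only a genuinely new Type is appended. A stdlib sketch of that semantics, with a simplified condition type standing in for metav1.Condition:

```go
package main

import (
	"fmt"
	"time"
)

// condition mirrors the metav1.Condition fields that matter for this sketch.
type condition struct {
	Type               string
	Status             string
	Reason             string
	Message            string
	LastTransitionTime time.Time
}

// setCondition updates the condition with a matching Type in place, or appends
// a new one. LastTransitionTime is only bumped when the Status actually flips,
// mirroring the behavior of meta.SetStatusCondition.
func setCondition(conditions *[]condition, newCond condition) {
	for i, c := range *conditions {
		if c.Type == newCond.Type {
			if c.Status == newCond.Status {
				newCond.LastTransitionTime = c.LastTransitionTime
			} else if newCond.LastTransitionTime.IsZero() {
				newCond.LastTransitionTime = time.Now()
			}
			(*conditions)[i] = newCond
			return
		}
	}
	if newCond.LastTransitionTime.IsZero() {
		newCond.LastTransitionTime = time.Now()
	}
	*conditions = append(*conditions, newCond)
}

func main() {
	var conds []condition
	setCondition(&conds, condition{Type: "Degraded", Status: "True", Reason: "FinalizationFailed"})
	setCondition(&conds, condition{Type: "Degraded", Status: "True", Reason: "FinalizationFailed"})
	fmt.Println(len(conds)) // still 1: same Type is updated in place, not appended
}
```

    In the operator itself, use the real meta.SetStatusCondition; the upsert behavior is what keeps repeated failed finalization attempts from flooding the status with duplicate Degraded entries.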

    During finalization, you can set a condition:

    go
    import (
        "k8s.io/apimachinery/pkg/api/meta"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )
    
    // In the reconcile loop, after a failed finalization attempt:
    
    err := r.finalizeCloudDatabase(ctx, log, instance)
    if err != nil {
        log.Error(err, "Finalization failed, updating status and requeuing")
        
        // Set a condition to reflect the failure
        degradedCondition := metav1.Condition{
            Type:    "Degraded",
            Status:  metav1.ConditionTrue,
            Reason:  "FinalizationFailed",
            Message: fmt.Sprintf("Failed to delete external resource: %v", err),
        }
        meta.SetStatusCondition(&instance.Status.Conditions, degradedCondition)
    
        if statusUpdateErr := r.Status().Update(ctx, instance); statusUpdateErr != nil {
            log.Error(statusUpdateErr, "Failed to update status after finalization failure")
            // Return the original error to ensure we retry the finalization
            return ctrl.Result{}, err 
        }
        
        return ctrl.Result{}, err
    }

    This provides invaluable, structured feedback to users and other automation tools, allowing them to understand precisely why a resource is stuck in a Terminating state.

    Conclusion: From Controller to Lifecycle Manager

    Implementing finalizers elevates a Kubernetes operator from a simple automation tool to a robust, production-grade lifecycle manager. By preventing the premature deletion of Custom Resources, finalizers provide the necessary hook to ensure that external, non-Kubernetes resources are cleaned up reliably and gracefully.

    The key takeaways for building a production-ready finalizer implementation are:

    * Stateful Control: The deletionTimestamp is your signal to switch from reconciliation to finalization logic.

    * Idempotency is Non-Negotiable: Your cleanup logic *will* be retried. Ensure it can handle cases where the external resource is already gone.

    * Handle Asynchronicity: For long-running cleanup tasks, adopt an asynchronous pattern using status updates and background jobs to keep your operator responsive.

    * Communicate via Status: Use conditions in the CR's status subresource to provide clear, actionable feedback to users about the finalization process, especially on failure.

    By mastering this pattern, you can build operators that safely manage critical stateful resources, prevent costly resource leaks, and provide the reliability expected in modern cloud-native systems.
