K8s Finalizers: A Deep Dive into Stateful Resource Deletion

Goh Ling Yong

The Deletion Fallacy in Kubernetes

As a seasoned engineer working with Kubernetes, you understand that the platform's declarative nature is its greatest strength. You define the desired state, and a constellation of controllers works tirelessly to make it so. However, this model has a subtle but critical impedance mismatch when it comes to deletion, especially for resources that manage state outside the cluster.

A `kubectl delete` command sends a request to the API server to remove an object from etcd. For a stateless Deployment this is fine: garbage collection cascades to the owned ReplicaSets, their Pods are terminated, and the object is gone. But what if your Custom Resource (CR), say a DatabaseInstance, represents a managed PostgreSQL database in AWS RDS? A simple deletion from etcd orphans the actual database, leaving you with a running, billable resource that Kubernetes no longer knows about.

This is where the standard Kubernetes deletion mechanism falls short. It's a one-way, asynchronous operation that provides no hook for pre-deletion cleanup. The solution, and the core focus of this article, is the Finalizer pattern. A finalizer is an entry in an object's metadata that tells Kubernetes, "Do not delete this object yet. A controller is still performing cleanup tasks."

This article assumes you are familiar with Go, Kubernetes controllers, and the operator pattern. We will not cover the basics of kubebuilder or CRD creation. Instead, we will focus exclusively on the advanced implementation details of using finalizers to build a production-grade, state-aware controller.

Anatomy of a Finalized Deletion

Before diving into code, let's dissect the mechanism. A finalizer is simply a string added to the metadata.finalizers array of an object. When a user requests to delete an object that has finalizers:

1. Deletion is Intercepted: The Kubernetes API server receives the DELETE request. Because the finalizers array is not empty, it does not remove the object immediately.
2. The DeletionTimestamp is Set: The API server updates the object, setting metadata.deletionTimestamp to the current time. The object is now in a "terminating" state but remains visible via the API.
3. Reconciliation is Triggered: This update event triggers the reconciliation loop in any controller watching this resource type.
4. Controller's Responsibility: Inside the Reconcile function, your controller must detect that DeletionTimestamp is set. This is the signal to begin cleanup.
5. External Cleanup Logic: The controller executes its pre-delete logic: calling the AWS API to terminate the RDS instance, unmounting a persistent volume, revoking API keys, etc.
6. Finalizer Removal: Only after the external cleanup has completed and been verified does the controller remove its specific finalizer string from the metadata.finalizers array and update the object.
7. Final Deletion: Once the finalizers array is empty and the deletionTimestamp is set, the Kubernetes garbage collector is free to permanently delete the object from etcd.

This two-phase-commit-like process ensures that your controller gets a chance to gracefully tear down all associated external resources before the Kubernetes representation disappears.
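From the controller's perspective, this whole protocol reduces to a single gate at the top of the reconcile loop. The following is a minimal, schematic sketch of that gate, assuming the usual controller-runtime imports and a reconciler that embeds client.Client; `myFinalizer`, `MyResource`, and `cleanupExternal` are placeholders, not code from the controller we build below:

```go
func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	obj := &myv1alpha1.MyResource{} // placeholder CR type
	if err := r.Get(ctx, req.NamespacedName, obj); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	if !obj.GetDeletionTimestamp().IsZero() {
		// Terminating: run cleanup, then release the object.
		if controllerutil.ContainsFinalizer(obj, myFinalizer) {
			if err := cleanupExternal(ctx, obj); err != nil {
				return ctrl.Result{}, err // retried until cleanup succeeds
			}
			controllerutil.RemoveFinalizer(obj, myFinalizer)
			return ctrl.Result{}, r.Update(ctx, obj)
		}
		return ctrl.Result{}, nil
	}

	// Live: make sure the finalizer guards the object before doing real work.
	if !controllerutil.ContainsFinalizer(obj, myFinalizer) {
		controllerutil.AddFinalizer(obj, myFinalizer)
		if err := r.Update(ctx, obj); err != nil {
			return ctrl.Result{}, err
		}
	}
	// ... normal reconciliation ...
	return ctrl.Result{}, nil
}
```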

    Building a `CloudDatabase` Controller

    Let's build a practical example: a controller that manages a CloudDatabase CR. This CR will have a spec defining the database engine (e.g., postgres) and size, and a status reflecting its provisioned state and external ID.

    1. The `CloudDatabase` CRD Type Definition

    First, we define our types in Go using kubebuilder markers. The key is to have a robust status sub-resource to track the external state.

```go
    // file: api/v1alpha1/clouddatabase_types.go
    
    package v1alpha1
    
    import (
    	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )
    
    // CloudDatabaseSpec defines the desired state of CloudDatabase
    type CloudDatabaseSpec struct {
    	Engine string `json:"engine"`
    	SizeGB int    `json:"sizeGb"`
    }
    
    // CloudDatabaseStatus defines the observed state of CloudDatabase
    type CloudDatabaseStatus struct {
    	// Represents the observations of a CloudDatabase's current state.
    	// Important: Run "make" to regenerate code after modifying this file
    
    	// +optional
    	ExternalID string `json:"externalId,omitempty"`
    
    	// +optional
    	Phase string `json:"phase,omitempty"`
    
    	// +listType=map
    	// +listMapKey=type
    	// +patchStrategy=merge
    	// +patchMergeKey=type
    	// +optional
    	Conditions []metav1.Condition `json:"conditions,omitempty" patchStrategy:"merge" patchMergeKey:"type"`
    }
    
    //+kubebuilder:object:root=true
    //+kubebuilder:subresource:status
    //+kubebuilder:printcolumn:name="Engine",type="string",JSONPath=".spec.engine"
    //+kubebuilder:printcolumn:name="Status",type="string",JSONPath=".status.phase"
    //+kubebuilder:printcolumn:name="Age",type="date",JSONPath=".metadata.creationTimestamp"
    
    // CloudDatabase is the Schema for the clouddatabases API
    type CloudDatabase struct {
    	metav1.TypeMeta   `json:",inline"`
    	metav1.ObjectMeta `json:"metadata,omitempty"`
    
    	Spec   CloudDatabaseSpec   `json:"spec,omitempty"`
    	Status CloudDatabaseStatus `json:"status,omitempty"`
    }
    
    //+kubebuilder:object:root=true
    
    // CloudDatabaseList contains a list of CloudDatabase
    type CloudDatabaseList struct {
    	metav1.TypeMeta `json:",inline"`
    	metav1.ListMeta `json:"metadata,omitempty"`
    	Items           []CloudDatabase `json:"items"`
    }
    
    func init() {
    	SchemeBuilder.Register(&CloudDatabase{}, &CloudDatabaseList{})
}
```

    2. The Core Reconciliation Logic with Finalizers

    Now for the heart of the implementation: the Reconcile method. We'll use the controller-runtime library, which provides excellent helpers.

```go
    // file: internal/controller/clouddatabase_controller.go
    
    package controller
    
    import (
    	"context"
    	"fmt"
    	"time"
    
    	"k8s.io/apimachinery/pkg/api/errors"
    	"k8s.io/apimachinery/pkg/runtime"
    	ctrl "sigs.k8s.io/controller-runtime"
    	"sigs.k8s.io/controller-runtime/pkg/client"
    	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
    	"sigs.k8s.io/controller-runtime/pkg/log"
    
    	dbv1alpha1 "finalizer-demo/api/v1alpha1"
    )
    
    // A mock external client to simulate a cloud provider API
    type MockCloudDBClient struct{}
    
    func (c *MockCloudDBClient) CreateDatabase(name string, engine string, size int) (string, error) {
    	fmt.Printf("MOCK_API: Creating database '%s' with engine '%s'\n", name, engine)
    	// Simulate API call latency
    	time.Sleep(2 * time.Second)
    	return fmt.Sprintf("ext-%s", name), nil
    }
    
    func (c *MockCloudDBClient) GetDatabaseStatus(externalID string) (string, error) {
    	fmt.Printf("MOCK_API: Getting status for external ID '%s'\n", externalID)
    	time.Sleep(500 * time.Millisecond)
    	// In a real scenario, this would return "CREATING", "AVAILABLE", "DELETING", etc.
    	return "AVAILABLE", nil
    }
    
    func (c *MockCloudDBClient) DeleteDatabase(externalID string) error {
    	fmt.Printf("MOCK_API: Deleting database with external ID '%s'\n", externalID)
    	// Simulate a long deletion process
    	time.Sleep(5 * time.Second)
    	fmt.Printf("MOCK_API: Successfully deleted database '%s'\n", externalID)
    	return nil
    }
    
    // CloudDatabaseReconciler reconciles a CloudDatabase object
    type CloudDatabaseReconciler struct {
    	client.Client
    	Scheme          *runtime.Scheme
    	MockCloudClient *MockCloudDBClient // Our mock client
    }
    
    // The finalizer string our controller will manage
    const cloudDBFinalizer = "db.example.com/finalizer"
    
    func (r *CloudDatabaseReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    	logger := log.FromContext(ctx)
    
    	// 1. Fetch the CloudDatabase instance
    	dbInstance := &dbv1alpha1.CloudDatabase{}
    	if err := r.Get(ctx, req.NamespacedName, dbInstance); err != nil {
    		if errors.IsNotFound(err) {
    			logger.Info("CloudDatabase resource not found. Ignoring since object must be deleted.")
    			return ctrl.Result{}, nil
    		}
    		logger.Error(err, "Failed to get CloudDatabase")
    		return ctrl.Result{}, err
    	}
    
    	// 2. Check if the object is being deleted
    	isMarkedForDeletion := dbInstance.GetDeletionTimestamp() != nil
    	if isMarkedForDeletion {
    		if controllerutil.ContainsFinalizer(dbInstance, cloudDBFinalizer) {
    			// Run our finalization logic. If it fails, we'll requeue the request
    			// so we can retry again later. This is the core of the pattern.
    			if err := r.finalizeCloudDatabase(ctx, dbInstance); err != nil {
    				// Don't remove the finalizer if cleanup fails.
    				// The reconciliation will be retried automatically.
    				logger.Error(err, "Failed to finalize CloudDatabase")
    				return ctrl.Result{}, err
    			}
    
    			// Cleanup was successful. Remove our finalizer.
    			logger.Info("External database deleted, removing finalizer")
    			controllerutil.RemoveFinalizer(dbInstance, cloudDBFinalizer)
    			if err := r.Update(ctx, dbInstance); err != nil {
    				return ctrl.Result{}, err
    			}
    		}
    		// Stop reconciliation as the item is being deleted
    		return ctrl.Result{}, nil
    	}
    
    	// 3. The object is NOT being deleted, so we add our finalizer if it doesn't exist.
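	// Doing this before provisioning anything externally matters: it closes
	// the window in which an external resource could exist without a
	// finalizer guarding its cleanup.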
    	if !controllerutil.ContainsFinalizer(dbInstance, cloudDBFinalizer) {
    		logger.Info("Adding finalizer for CloudDatabase")
    		controllerutil.AddFinalizer(dbInstance, cloudDBFinalizer)
    		if err := r.Update(ctx, dbInstance); err != nil {
    			return ctrl.Result{}, err
    		}
    	}
    
    	// 4. Main reconciliation logic: Provision the external database if it doesn't exist
    	if dbInstance.Status.ExternalID == "" {
    		logger.Info("Provisioning a new external database")
    		externalID, err := r.MockCloudClient.CreateDatabase(dbInstance.Name, dbInstance.Spec.Engine, dbInstance.Spec.SizeGB)
    		if err != nil {
    			logger.Error(err, "Failed to create external database")
    			dbInstance.Status.Phase = "Failed"
    			_ = r.Status().Update(ctx, dbInstance) // Use status subresource
    			return ctrl.Result{}, err
    		}
    
    		dbInstance.Status.ExternalID = externalID
    		dbInstance.Status.Phase = "Provisioned"
    		if err := r.Status().Update(ctx, dbInstance); err != nil {
    			logger.Error(err, "Failed to update CloudDatabase status")
    			return ctrl.Result{}, err
    		}
    		logger.Info("Successfully provisioned external database", "ExternalID", externalID)
    		return ctrl.Result{}, nil
    	}
    
    	// TODO: Add logic to handle updates to the spec (e.g., resizing the DB)
    
    	logger.Info("Reconciliation complete, no action taken.")
    	return ctrl.Result{}, nil
    }
    
    // finalizeCloudDatabase contains the logic to clean up the external resource.
    func (r *CloudDatabaseReconciler) finalizeCloudDatabase(ctx context.Context, db *dbv1alpha1.CloudDatabase) error {
    	logger := log.FromContext(ctx)
    
    	if db.Status.ExternalID == "" {
    		logger.Info("External database not found in status, nothing to clean up.")
    		return nil
    	}
    
    	logger.Info("Starting finalizer cleanup for external database", "ExternalID", db.Status.ExternalID)
    	if err := r.MockCloudClient.DeleteDatabase(db.Status.ExternalID); err != nil {
    		// This is a critical error. The finalizer will NOT be removed, and this
    		// function will be called again on the next reconciliation.
    		return fmt.Errorf("failed to delete external database %s: %w", db.Status.ExternalID, err)
    	}
    
    	logger.Info("Successfully finalized CloudDatabase")
    	return nil
    }
    
    // SetupWithManager sets up the controller with the Manager.
    func (r *CloudDatabaseReconciler) SetupWithManager(mgr ctrl.Manager) error {
    	return ctrl.NewControllerManagedBy(mgr).
    		For(&dbv1alpha1.CloudDatabase{}).
    		Complete(r)
}
```

    Advanced Edge Cases and Production Patterns

    The simple implementation above works for the happy path, but production environments are never that simple. Here's how to harden your finalizer logic.

    Edge Case 1: Idempotent and Resumable Cleanup

    Problem: The DeleteDatabase call to the cloud provider might be a long-running operation. What if the controller pod crashes after initiating the deletion but before it completes and removes the finalizer? On restart, the controller will reconcile the same object again. If your cleanup logic is not idempotent, you might try to delete an already-deleting resource, causing an API error from the cloud provider.

    Solution: Enhance the status sub-resource to track the cleanup state. Use Kubernetes Conditions for this, as they are the standard pattern.

    First, add a Deleting condition type to your status constants:

```go
    const (
        ConditionTypeReady      = "Ready"
        ConditionTypeDeleting   = "Deleting"
)
```

    Next, modify the finalizeCloudDatabase function to be state-aware:

```go
    import (
    	"k8s.io/apimachinery/pkg/api/meta"
    	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )
    
    func (r *CloudDatabaseReconciler) finalizeCloudDatabase(ctx context.Context, db *dbv1alpha1.CloudDatabase) error {
    	logger := log.FromContext(ctx)
    
    	// Check if deletion has already been initiated
    	if meta.IsStatusConditionTrue(db.Status.Conditions, ConditionTypeDeleting) {
    		logger.Info("External database deletion is already in progress.")
    		
    		// Here you would check the status from the cloud provider
    		// For this mock, we assume it's done after a delay.
    		// In a real implementation, you'd poll the provider's API.
    		status, err := r.MockCloudClient.GetDatabaseStatus(db.Status.ExternalID)
    		if err != nil {
    			// If the resource is truly gone, the API might return a 404. That's a success for us.
    			if isCloudResourceNotFound(err) { // isCloudResourceNotFound is a hypothetical function
    				logger.Info("External database confirmed deleted by provider API.")
    				return nil // Success! The finalizer can be removed.
    			}
    			return err // Some other API error, retry.
    		}
    
		if status == "DELETING" {
			logger.Info("Deletion still in progress according to cloud provider. Requeuing.")
			// This is a normal wait, not a failure. The typed RequeueError below is
			// translated by Reconcile into a fixed-delay requeue (with a nil error),
			// which avoids the exponential backoff reserved for genuine errors.
			return &RequeueError{RequeueAfter: 30 * time.Second}
		}
    
    		logger.Info("External database confirmed deleted.")
    		return nil // Success
    	}
    
    	// Deletion not yet initiated. Start it now.
    	logger.Info("Starting finalizer cleanup for external database", "ExternalID", db.Status.ExternalID)
    	if err := r.MockCloudClient.DeleteDatabase(db.Status.ExternalID); err != nil {
    		// Update status to reflect failure
    		meta.SetStatusCondition(&db.Status.Conditions, metav1.Condition{
    			Type:    ConditionTypeDeleting,
    			Status:  metav1.ConditionFalse,
    			Reason:  "DeletionFailed",
    			Message: fmt.Sprintf("Failed to initiate deletion: %v", err),
    		})
    		if updateErr := r.Status().Update(ctx, db); updateErr != nil {
    			return updateErr
    		}
    		return err
    	}
    
    	// Update status to indicate deletion is in progress
    	logger.Info("Successfully initiated external database deletion. Updating status.")
    	meta.SetStatusCondition(&db.Status.Conditions, metav1.Condition{
    		Type:   ConditionTypeDeleting,
    		Status: metav1.ConditionTrue,
    		Reason: "DeletionInProgress",
    		Message: "External database deletion has been initiated.",
    	})
    	if err := r.Status().Update(ctx, db); err != nil {
    		return err
    	}
    
    	// Requeue to check on deletion progress later.
    	return &RequeueError{RequeueAfter: 30 * time.Second}
    }
    
    // Custom error type for controlled requeueing
    type RequeueError struct {
    	RequeueAfter time.Duration
    }
    
    func (e *RequeueError) Error() string {
    	return fmt.Sprintf("requeue after %s", e.RequeueAfter)
    }
    
    // In your main Reconcile function, you'd handle this custom error:
    if err := r.finalizeCloudDatabase(ctx, dbInstance); err != nil {
        if requeueErr, ok := err.(*RequeueError); ok {
            return ctrl.Result{RequeueAfter: requeueErr.RequeueAfter}, nil
        }
        return ctrl.Result{}, err
}
```

    This approach is robust. It uses the status as the source of truth for the external operation, making the process resumable and idempotent.
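One caveat with the type assertion above: it will not match a RequeueError that an intermediate layer has wrapped with fmt.Errorf("...: %w", err). A slightly more defensive variant uses the standard library's errors.As. Note the alias, since the controller file already imports k8s.io/apimachinery/pkg/api/errors under the name errors:

```go
import (
	stderrors "errors" // stdlib errors, aliased to avoid clashing with apimachinery's
)

// In Reconcile's deletion branch:
if err := r.finalizeCloudDatabase(ctx, dbInstance); err != nil {
	var requeueErr *RequeueError
	if stderrors.As(err, &requeueErr) {
		// A planned wait, not a failure: requeue after a fixed delay.
		return ctrl.Result{RequeueAfter: requeueErr.RequeueAfter}, nil
	}
	return ctrl.Result{}, err
}
```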

    Edge Case 2: The Stuck `Terminating` Resource

    Problem: What if your cleanup logic has a permanent bug, or the external API is down indefinitely? The finalizeCloudDatabase function will always return an error, the finalizer will never be removed, and the object will be stuck in the Terminating state forever. This is a common operational headache.

    Solution: There is no magic bullet here, but the solution involves robust monitoring and clear operational procedures.

• Monitoring and Alerting: Set up alerts (e.g., with Prometheus) that fire if a resource of your CRD's kind has been in a Terminating state for too long (e.g., > 1 hour). Assuming kube-state-metrics is configured to expose CustomResourceState metrics for your CRD, the query would look something like:

```promql
sum(kube_customresource_metadata_deletion_timestamp{group="db.example.com", version="v1alpha1", kind="CloudDatabase"}) by (namespace, customresource) > 0
```

• Manual Intervention: An operator must be able to manually resolve the situation. This usually involves:

  a. Manually cleaning up the external resource (e.g., deleting the RDS instance via the AWS console).

  b. Forcing the removal of the finalizer from the Kubernetes object. This can be done with kubectl patch:

```bash
kubectl patch clouddatabase my-db-instance --type json -p='[{"op": "remove", "path": "/metadata/finalizers"}]'
```

    This is a powerful and dangerous command. It should only be used when the external resource has been confirmed to be gone.

• Controller-side Timeouts: For non-critical cleanup, you could program a timeout into the controller. If cleanup has been failing for over 24 hours, the controller could log a critical error, emit a Kubernetes Event, and remove the finalizer anyway to unblock the system. This is a design choice that depends on whether orphaning the resource is better or worse than having a stuck object; a sketch of this escape hatch follows below.
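Here is a hedged sketch of that escape hatch, placed at the top of the deletion branch of Reconcile. The 24-hour budget is an arbitrary choice, corev1 is "k8s.io/api/core/v1", and the Recorder field (a record.EventRecorder from k8s.io/client-go/tools/record, obtainable via mgr.GetEventRecorderFor) is an addition to the reconciler struct, not something the controller above already has:

```go
const finalizerTimeout = 24 * time.Hour

// In the deletion branch of Reconcile, before attempting cleanup:
if time.Since(dbInstance.GetDeletionTimestamp().Time) > finalizerTimeout {
	logger.Error(nil, "Finalizer cleanup exceeded timeout; orphaning external resource",
		"ExternalID", dbInstance.Status.ExternalID)
	// r.Recorder is the hypothetical EventRecorder field described above.
	r.Recorder.Event(dbInstance, corev1.EventTypeWarning, "FinalizerTimeout",
		"Cleanup failed for too long; removing finalizer and orphaning the external database")
	controllerutil.RemoveFinalizer(dbInstance, cloudDBFinalizer)
	return ctrl.Result{}, r.Update(ctx, dbInstance)
}
```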
Edge Case 3: Multiple Finalizers and Coordination

    Problem: It's possible for multiple, independent controllers to place finalizers on the same object. For example, a BackupController might add a finalizer to our CloudDatabase object to ensure it takes a final snapshot before deletion. The CloudDatabaseController knows nothing about this other controller.

    Solution: This is handled elegantly by the finalizer mechanism itself, provided each controller is well-behaved.

  • Unique Finalizer Keys: Each controller MUST use a unique, domain-scoped finalizer key (e.g., db.example.com/finalizer, backup.example.com/finalizer). This prevents collisions.
  • Independent Removal: Each controller is responsible only for removing its own finalizer. It should never touch finalizers belonging to other controllers.
  • No Guaranteed Order: The order in which finalizers are processed is not guaranteed. When an object is deleted, both the CloudDatabaseController and the BackupController will be reconciled. They will perform their cleanup in parallel. The object will only be deleted after both controllers have removed their respective finalizers.
• Your code should always use controllerutil.RemoveFinalizer, which safely removes just your specific finalizer from the slice rather than overwriting the whole slice.
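Concretely, coordination costs nothing as long as the keys are disjoint and each controller only ever removes its own. A tiny sketch, where the backup finalizer belongs to a hypothetical second controller:

```go
const (
	cloudDBFinalizer = "db.example.com/finalizer"     // owned by CloudDatabaseController
	backupFinalizer  = "backup.example.com/finalizer" // owned by a hypothetical BackupController
)

// In the CloudDatabase controller's deletion path: remove only our own key.
// The API server keeps the object alive until backupFinalizer is gone too.
controllerutil.RemoveFinalizer(db, cloudDBFinalizer)
```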

    Performance and API Server Load

    In a busy cluster with thousands of CRs, an inefficient finalizer implementation can cause significant performance degradation.

    Problem: During a long cleanup (e.g., a database taking 10 minutes to delete), what's the right requeue strategy? Naively returning an error causes controller-runtime to use an exponential backoff queue, which is great for genuine failures but not for waiting. Constantly updating the status every few seconds can flood the API server.

    Solution: A combination of intelligent requeueing and judicious status updates.

  • Controlled Requeueing: As shown in the RequeueError example, when you are simply waiting for an external system, return ctrl.Result{RequeueAfter: duration} with a nil error. This puts the item back in the work queue to be processed after a specific delay (e.g., 30 seconds) without treating it as a failure and without the exponential backoff.
• Minimize Status Updates: Do not update the status on every single check. Update it only when the state changes meaningfully:

  - When deletion is first initiated (DeletionInProgress).
  - If a transient error occurs (DeletionFailedTransient).
  - If a permanent error is confirmed (DeletionFailedPermanent).

  Polling an external API every 5 seconds and writing status back to the Kubernetes API every 5 seconds is an anti-pattern. The polling can be frequent, but the writes to the API server should be sparse; a conditional update helper that enforces this is sketched below.
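A minimal sketch of such a conditional write; the helper name is ours, not part of the controller above:

```go
// setPhaseIfChanged writes the status subresource only when the phase actually
// changes, so frequent external polling never becomes frequent API writes.
func (r *CloudDatabaseReconciler) setPhaseIfChanged(ctx context.Context, db *dbv1alpha1.CloudDatabase, phase string) error {
	if db.Status.Phase == phase {
		return nil // no meaningful change; skip the write entirely
	}
	db.Status.Phase = phase
	return r.Status().Update(ctx, db)
}
```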

• Concurrency: Tune the MaxConcurrentReconciles option when setting up your controller. If your cleanup logic is I/O-bound (calling cloud APIs), you can likely increase this value. If it's CPU-bound, keep it low. High concurrency can also trigger rate limiting from your cloud provider during a mass deletion event.

```go
// Typically in SetupWithManager; controller.Options comes from
// "sigs.k8s.io/controller-runtime/pkg/controller".
err = ctrl.NewControllerManagedBy(mgr).
	For(&dbv1alpha1.CloudDatabase{}).
	WithOptions(controller.Options{MaxConcurrentReconciles: 5}). // default is 1
	Complete(r)
```

    Conclusion

    The Finalizer pattern is not just a feature; it's the cornerstone of building robust operators for stateful applications on Kubernetes. By intercepting the deletion process, it allows controllers to extend Kubernetes's declarative model to resources living outside the cluster, ensuring that 'delete' means 'gracefully tear down and then delete.'

    A production-grade implementation moves beyond the basic happy path. It requires idempotent, resumable cleanup logic, often tracked via the object's status sub-resource. It demands careful consideration of failure modes, like stuck terminations, and requires robust monitoring and operational playbooks. Finally, by using intelligent requeue strategies and minimizing API server chatter, you can ensure your stateful controller is not only correct but also a well-behaved, performant citizen of the cluster.
