Kubernetes Operators: Finalizers for Stateful Workload Deletion

Goh Ling Yong

The Deletion Race Condition in Stateful Operators

In the realm of Kubernetes Operators, managing the creation and update lifecycle of a Custom Resource (CR) is often the primary focus. The reconciliation loop, a cornerstone of the operator pattern, excels at converging the current state toward the desired state defined in a CR. However, the deletion lifecycle presents a fundamentally different and more perilous challenge, especially for operators managing stateful, external resources.

When a user executes kubectl delete my-resource, the default behavior of the Kubernetes API server is swift and decisive: the object is removed from etcd. For a stateless application, this is often sufficient. The garbage collector will handle dependent objects like Pods and ReplicaSets. But for an operator managing a stateful workload—such as a cloud-hosted PostgreSQL instance, a message queue topic, or an object storage bucket—this default behavior is catastrophic. The CR, which represents the source of truth for the external resource, vanishes before the operator's controller has a chance to perform critical cleanup tasks:

  • Take a final backup or snapshot of a database.
  • Drain active connections gracefully.
  • De-register the service from a discovery mechanism.
  • De-provision the underlying IaaS/PaaS resource to prevent billing leaks.

This creates a classic race condition. The controller's reconciliation loop might be triggered by the deletion event, but there's no guarantee it will complete its cleanup logic before the CR is gone. Once the CR is deleted, the operator loses all information about the external resource it was managing, leading to orphaned infrastructure and potential data integrity issues.

This is where Finalizers become an indispensable tool. A finalizer is a metadata key that tells the Kubernetes API server to block the physical deletion of a resource until specific conditions are met. It effectively transforms the deletion process from a single, immediate action into a two-phase, controller-managed workflow.

The Finalizer Mechanism: A Deconstructive Look

At its core, a Kubernetes finalizer is simply a string added to the metadata.finalizers list of an object. Its presence fundamentally alters the object's deletion lifecycle. Let's dissect the precise sequence of events:

  1. Deletion Request: A user or process issues a DELETE request to the API server for an object that has a finalizer in its metadata.finalizers array.

  2. Deletion Timestamp Addition: The API server receives the request. Instead of deleting the object from etcd, it performs an UPDATE operation, setting the metadata.deletionTimestamp field to the current time. The object is now in a terminating state, but it still exists and is visible via the API.

  3. Controller Reconciliation: This update event triggers the operator's reconciliation loop for the object. Inside the Reconcile function, the controller's logic must now check for the presence of this deletionTimestamp.

     go
         // In the Reconcile function
         if managedResource.GetDeletionTimestamp() != nil {
             // Object is being deleted, execute finalizer logic
             // ...
         }

  4. Cleanup Execution: Upon detecting the deletionTimestamp, the controller executes its pre-defined cleanup logic. This is the critical phase where it interacts with external systems to de-provision resources, take backups, and so on. This process must be idempotent, as the reconciliation could be triggered multiple times if an error occurs.

  5. Finalizer Removal: Once the controller has successfully completed all cleanup tasks, its final action is to remove its specific finalizer string from the metadata.finalizers array and issue an UPDATE call to the API server for the object.

     go
         import "sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"

         // After successful cleanup
         controllerutil.RemoveFinalizer(&managedResource, myFinalizerName)
         if err := r.Update(ctx, &managedResource); err != nil {
             // Handle update error
             return ctrl.Result{}, err
         }

  6. Garbage Collection: The API server sees this final UPDATE and checks the object again. The deletionTimestamp is still set, but the finalizers list is now empty (or at least, the specific finalizer it was blocking on is gone). The API server proceeds with the final step: garbage collecting the object and removing it from etcd.

This two-phase deletion process provides the hook necessary for controllers to execute complex, potentially long-running cleanup logic before the authoritative CR is lost.
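
To see the two phases from a client's perspective, here is a minimal sketch. It assumes a configured controller-runtime client named k8sClient and a reachable cluster (or envtest environment); the ConfigMap, the demo helper, and the finalizer name are purely illustrative, since any object kind can carry a finalizer.

go
    import (
    	"context"
    	"fmt"

    	corev1 "k8s.io/api/core/v1"
    	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    	"sigs.k8s.io/controller-runtime/pkg/client"
    	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
    )

    // demoFinalizerTimeline is a hypothetical helper used only for illustration.
    func demoFinalizerTimeline(ctx context.Context, k8sClient client.Client) error {
    	cm := &corev1.ConfigMap{
    		ObjectMeta: metav1.ObjectMeta{
    			Name:       "finalizer-demo",
    			Namespace:  "default",
    			Finalizers: []string{"example.com/demo-finalizer"}, // illustrative finalizer name
    		},
    	}
    	if err := k8sClient.Create(ctx, cm); err != nil {
    		return err
    	}

    	// The DELETE call does not remove the object; because a finalizer is
    	// present, the API server only sets metadata.deletionTimestamp.
    	if err := k8sClient.Delete(ctx, cm); err != nil {
    		return err
    	}

    	fetched := &corev1.ConfigMap{}
    	if err := k8sClient.Get(ctx, client.ObjectKeyFromObject(cm), fetched); err != nil {
    		return err
    	}
    	fmt.Println("terminating:", fetched.GetDeletionTimestamp() != nil) // true
    	fmt.Println("finalizers:", fetched.GetFinalizers())                // [example.com/demo-finalizer]

    	// Only once the finalizer is removed does the API server garbage collect the object.
    	controllerutil.RemoveFinalizer(fetched, "example.com/demo-finalizer")
    	return k8sClient.Update(ctx, fetched)
    }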

    Production Implementation: `ManagedDatabase` Operator

    Let's build a practical, production-oriented example. We'll create an operator that manages a ManagedDatabase CRD. When a ManagedDatabase is deleted, our operator must perform a graceful shutdown: place the database in a read-only maintenance mode, trigger a final snapshot, wait for completion, and then de-provision the database instance from a (mocked) cloud provider API.

    Step 1: Defining the CRD with Status and Finalizer

    Our api/v1/manageddatabase_types.go needs status fields to track the state of our resource, including during deletion.

    go
    // api/v1/manageddatabase_types.go
    
    package v1
    
    import (
    	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )
    
    // The name of our custom finalizer
    const ManagedDatabaseFinalizer = "database.example.com/finalizer"
    
    // ManagedDatabaseSpec defines the desired state of ManagedDatabase
    type ManagedDatabaseSpec struct {
    	Engine   string `json:"engine"` // e.g., "postgres"
    	Version  string `json:"version"`
    	Replicas int32  `json:"replicas"`
    }
    
    // ManagedDatabaseStatus defines the observed state of ManagedDatabase
    type ManagedDatabaseStatus struct {
    	Conditions []metav1.Condition `json:"conditions,omitempty"`
    	InstanceID string             `json:"instanceID,omitempty"`
    	Phase      string             `json:"phase,omitempty"` // e.g., Provisioning, Available, Terminating, Snapshotting
    }
    
    //+kubebuilder:object:root=true
    //+kubebuilder:subresource:status
    
    // ManagedDatabase is the Schema for the manageddatabases API
    type ManagedDatabase struct {
    	metav1.TypeMeta   `json:",inline"`
    	metav1.ObjectMeta `json:"metadata,omitempty"`
    
    	Spec   ManagedDatabaseSpec   `json:"spec,omitempty"`
    	Status ManagedDatabaseStatus `json:"status,omitempty"`
    }
    
    //+kubebuilder:object:root=true
    
    // ManagedDatabaseList contains a list of ManagedDatabase
    type ManagedDatabaseList struct {
    	metav1.TypeMeta `json:",inline"`
    	metav1.ListMeta `json:"metadata,omitempty"`
    	Items           []ManagedDatabase `json:"items"`
    }
    
    func init() {
    	SchemeBuilder.Register(&ManagedDatabase{}, &ManagedDatabaseList{})
    }

    Step 2: Implementing the Controller's Reconciliation Logic

    The core logic resides in internal/controller/manageddatabase_controller.go. We'll structure the Reconcile function to handle the deletion path explicitly.

    go
    // internal/controller/manageddatabase_controller.go
    
    package controller
    
    import (
    	"context"
    	"fmt"
    	"time"
    
    	databasev1 "example.com/managed-db-operator/api/v1"
    	"k8s.io/apimachinery/pkg/runtime"
    	"k8s.io/client-go/tools/record"
    	ctrl "sigs.k8s.io/controller-runtime"
    	"sigs.k8s.io/controller-runtime/pkg/client"
    	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
    	"sigs.k8s.io/controller-runtime/pkg/log"
    )
    
    // Mock Cloud API Client
    type MockCloudAPI struct{}
    
    func (m *MockCloudAPI) ProvisionDatabase(spec databasev1.ManagedDatabaseSpec) (string, error) {
    	// Simulate provisioning, return a unique ID
    	return fmt.Sprintf("db-%d", time.Now().UnixNano()), nil
    }
    
    func (m *MockCloudAPI) SetMaintenanceMode(instanceID string) error {
    	// Simulate API call
    	fmt.Printf("API: Setting maintenance mode for %s\n", instanceID)
    	time.Sleep(2 * time.Second)
    	return nil
    }
    
    func (m *MockCloudAPI) TriggerSnapshot(instanceID string) (string, error) {
    	// Simulate async snapshot, return snapshot ID
    	fmt.Printf("API: Triggering snapshot for %s\n", instanceID)
    	snapshotID := fmt.Sprintf("snap-%d", time.Now().UnixNano())
    	time.Sleep(1 * time.Second)
    	return snapshotID, nil
    }
    
    func (m *MockCloudAPI) GetSnapshotStatus(snapshotID string) (string, error) {
    	// Simulate checking status. For this example, we'll just make it take a few seconds.
    	fmt.Printf("API: Checking snapshot status for %s\n", snapshotID)
    	time.Sleep(5 * time.Second) 
    	return "Completed", nil // or "InProgress"
    }
    
    func (m *MockCloudAPI) DeprovisionDatabase(instanceID string) error {
    	// Simulate deprovisioning
    	fmt.Printf("API: Deprovisioning database %s\n", instanceID)
    	time.Sleep(3 * time.Second)
    	return nil
    }
    
    // ManagedDatabaseReconciler reconciles a ManagedDatabase object
    type ManagedDatabaseReconciler struct {
    	client.Client
    	Scheme   *runtime.Scheme
    	Recorder record.EventRecorder
    	CloudAPI *MockCloudAPI
    }
    
    func (r *ManagedDatabaseReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    	logger := log.FromContext(ctx)
    
    	// Fetch the ManagedDatabase instance
    	db := &databasev1.ManagedDatabase{}
    	if err := r.Get(ctx, req.NamespacedName, db); err != nil {
    		return ctrl.Result{}, client.IgnoreNotFound(err)
    	}
    
    	// Check if the object is being deleted
    	isMarkedForDeletion := db.GetDeletionTimestamp() != nil
    	if isMarkedForDeletion {
    		if controllerutil.ContainsFinalizer(db, databasev1.ManagedDatabaseFinalizer) {
    			// Run finalization logic. If it fails, we'll requeue the request.
    			if err := r.finalizeManagedDatabase(ctx, db); err != nil {
    				logger.Error(err, "Failed to finalize ManagedDatabase")
    				r.Recorder.Event(db, "Warning", "FinalizationFailed", err.Error())
    				return ctrl.Result{}, err
    			}
    
    			// Remove finalizer. Once all finalizers are removed, the object will be deleted.
    			logger.Info("Database finalized successfully, removing finalizer")
    			controllerutil.RemoveFinalizer(db, databasev1.ManagedDatabaseFinalizer)
    			if err := r.Update(ctx, db); err != nil {
    				return ctrl.Result{}, err
    			}
    		}
    		return ctrl.Result{}, nil
    	}
    
    	// Add finalizer for this CR if it doesn't exist
    	if !controllerutil.ContainsFinalizer(db, databasev1.ManagedDatabaseFinalizer) {
    		logger.Info("Adding finalizer for ManagedDatabase")
    		controllerutil.AddFinalizer(db, databasev1.ManagedDatabaseFinalizer)
    		if err := r.Update(ctx, db); err != nil {
    			return ctrl.Result{}, err
    		}
    	}
    
    	// --- Normal Reconciliation Logic ---
    	if db.Status.InstanceID == "" {
    		logger.Info("Provisioning new database")
    		db.Status.Phase = "Provisioning"
    		if err := r.Status().Update(ctx, db); err != nil {
    			return ctrl.Result{}, err
    		}
    
    		instanceID, err := r.CloudAPI.ProvisionDatabase(db.Spec)
    		if err != nil {
    			logger.Error(err, "Failed to provision database")
    			r.Recorder.Event(db, "Warning", "ProvisioningFailed", err.Error())
    			return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
    		}
    
    		db.Status.InstanceID = instanceID
    		db.Status.Phase = "Available"
    		if err := r.Status().Update(ctx, db); err != nil {
    			return ctrl.Result{}, err
    		}
    		r.Recorder.Event(db, "Normal", "Provisioned", fmt.Sprintf("Database %s provisioned", instanceID))
    	}
    
    	return ctrl.Result{}, nil
    }
    
    // SetupWithManager sets up the controller with the Manager.
    func (r *ManagedDatabaseReconciler) SetupWithManager(mgr ctrl.Manager) error {
    	r.CloudAPI = &MockCloudAPI{}
    	r.Recorder = mgr.GetEventRecorderFor("manageddatabase-controller")
    	return ctrl.NewControllerManagedBy(mgr).
    		For(&databasev1.ManagedDatabase{}).
    		Complete(r)
    }

    Step 3: Implementing the Idempotent Finalization Logic

    The finalizeManagedDatabase function is where the stateful cleanup occurs. It must be designed to be idempotent—that is, it can be safely executed multiple times without adverse effects. This is crucial because an error at any step will cause the Reconcile function to be called again later.

    We'll use the Status.Phase field to track our progress through the multi-step cleanup.

    go
    // internal/controller/manageddatabase_controller.go (continued)
    
    func (r *ManagedDatabaseReconciler) finalizeManagedDatabase(ctx context.Context, db *databasev1.ManagedDatabase) error {
    	logger := log.FromContext(ctx)
    
    	if db.Status.InstanceID == "" {
    		logger.Info("Database instance not found, nothing to finalize")
    		return nil
    	}
    
    	// Step 1: Set Maintenance Mode
    	if db.Status.Phase != "Terminating-Maintenance" && db.Status.Phase != "Terminating-Snapshotting" && db.Status.Phase != "Terminating-Deprovisioning" {
    		logger.Info("Step 1: Setting maintenance mode", "InstanceID", db.Status.InstanceID)
    		if err := r.CloudAPI.SetMaintenanceMode(db.Status.InstanceID); err != nil {
    			return fmt.Errorf("failed to set maintenance mode: %w", err)
    		}
    		db.Status.Phase = "Terminating-Maintenance"
    		if err := r.Status().Update(ctx, db); err != nil {
    			return err
    		}
    		r.Recorder.Event(db, "Normal", "FinalizeStep", "Maintenance mode enabled")
    	}
    
    	// Step 2: Trigger and wait for final snapshot
    	if db.Status.Phase != "Terminating-Snapshotting" && db.Status.Phase != "Terminating-Deprovisioning" {
    		logger.Info("Step 2: Triggering final snapshot", "InstanceID", db.Status.InstanceID)
    		snapshotID, err := r.CloudAPI.TriggerSnapshot(db.Status.InstanceID)
    		if err != nil {
    			return fmt.Errorf("failed to trigger snapshot: %w", err)
    		}
    
    		db.Status.Phase = "Terminating-Snapshotting"
    		if err := r.Status().Update(ctx, db); err != nil {
    			return err
    		}
    
    		// This is a simplified polling loop. In a real-world scenario, you'd requeue.
    		for {
    			status, err := r.CloudAPI.GetSnapshotStatus(snapshotID)
    			if err != nil {
    				return fmt.Errorf("failed to get snapshot status: %w", err)
    			}
    			if status == "Completed" {
    				logger.Info("Snapshot completed", "SnapshotID", snapshotID)
    				r.Recorder.Event(db, "Normal", "FinalizeStep", "Final snapshot completed")
    				break
    			}
    			logger.Info("Waiting for snapshot to complete...", "SnapshotID", snapshotID)
    			// Instead of sleeping, a better pattern is to return a requeue result.
    			// For simplicity here, we sleep. See advanced patterns below.
    			time.Sleep(5 * time.Second)
    		}
    	}
    
    	// Step 3: Deprovision the database instance
    	if db.Status.Phase != "Terminating-Deprovisioning" {
    		logger.Info("Step 3: Deprovisioning database instance", "InstanceID", db.Status.InstanceID)
    		if err := r.CloudAPI.DeprovisionDatabase(db.Status.InstanceID); err != nil {
    			return fmt.Errorf("failed to deprovision database: %w", err)
    		}
    
    		db.Status.Phase = "Terminating-Deprovisioning"
    		if err := r.Status().Update(ctx, db); err != nil {
    			return err
    		}
    		r.Recorder.Event(db, "Normal", "FinalizeStep", "Database instance deprovisioned")
    	}
    
    	logger.Info("All finalization steps completed successfully")
    	return nil
    }

    This implementation demonstrates the core pattern: using the Status subresource as a state machine to ensure idempotency. If the operator crashes after setting maintenance mode but before triggering the snapshot, the next reconciliation will see Phase is Terminating-Maintenance and skip directly to the snapshot step.
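
    The finalizer being attached during normal reconciliation is the foundation of this state machine, and it is easy to verify without a real cluster. Below is a sketch of a test using controller-runtime's fake client; it assumes a reasonably recent controller-runtime release (for WithStatusSubresource) and the kubebuilder-generated AddToScheme, and the resource name and spec values are arbitrary test fixtures.

    go
    // internal/controller/manageddatabase_controller_test.go (sketch)

    package controller

    import (
    	"context"
    	"testing"

    	databasev1 "example.com/managed-db-operator/api/v1"
    	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    	"k8s.io/apimachinery/pkg/runtime"
    	"k8s.io/apimachinery/pkg/types"
    	"k8s.io/client-go/tools/record"
    	ctrl "sigs.k8s.io/controller-runtime"
    	"sigs.k8s.io/controller-runtime/pkg/client/fake"
    	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
    )

    func TestReconcileAddsFinalizer(t *testing.T) {
    	s := runtime.NewScheme()
    	if err := databasev1.AddToScheme(s); err != nil {
    		t.Fatal(err)
    	}

    	db := &databasev1.ManagedDatabase{
    		ObjectMeta: metav1.ObjectMeta{Name: "orders-db", Namespace: "default"},
    		Spec:       databasev1.ManagedDatabaseSpec{Engine: "postgres", Version: "16", Replicas: 1},
    	}

    	c := fake.NewClientBuilder().
    		WithScheme(s).
    		WithObjects(db).
    		// Needed so r.Status().Update works against the fake client.
    		WithStatusSubresource(&databasev1.ManagedDatabase{}).
    		Build()

    	r := &ManagedDatabaseReconciler{
    		Client:   c,
    		Scheme:   s,
    		Recorder: record.NewFakeRecorder(16),
    		CloudAPI: &MockCloudAPI{},
    	}

    	req := ctrl.Request{NamespacedName: types.NamespacedName{Name: "orders-db", Namespace: "default"}}
    	if _, err := r.Reconcile(context.Background(), req); err != nil {
    		t.Fatalf("reconcile failed: %v", err)
    	}

    	got := &databasev1.ManagedDatabase{}
    	if err := c.Get(context.Background(), req.NamespacedName, got); err != nil {
    		t.Fatalf("get failed: %v", err)
    	}
    	if !controllerutil.ContainsFinalizer(got, databasev1.ManagedDatabaseFinalizer) {
    		t.Fatalf("expected finalizer %q to be present", databasev1.ManagedDatabaseFinalizer)
    	}
    }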

    Advanced Patterns and Edge Case Management

    Simple finalizer logic can fail in complex production environments. Senior engineers must anticipate and handle these scenarios.

    1. Handling Long-Running Finalization Tasks

    The polling loop (for {}) in our finalizeManagedDatabase function is a dangerous anti-pattern. It blocks the controller's worker goroutine, preventing it from processing other resources. A production-grade operator must never block.

    The correct pattern is to return a ctrl.Result that instructs the manager to requeue the request after a delay.

    Refactored Snapshot Logic:

    go
    // In finalizeManagedDatabase...
    
    // We need a way to store the snapshot ID across reconciliations,
    // so we add it to the status:
    // type ManagedDatabaseStatus struct { ...; SnapshotIDForDeletion string `json:"snapshotIDForDeletion,omitempty"`; ... }

    // Declared at package level, next to the reconciler (requires the standard
    // library "errors" package). It signals that cleanup is still in progress:
    // the caller must requeue and must NOT remove the finalizer yet.
    var errFinalizationInProgress = errors.New("finalization in progress")

    // Step 2: Trigger the final snapshot, then poll it across reconciliations
    if db.Status.Phase == "Terminating-Maintenance" {
        logger.Info("Triggering final snapshot...")
        snapshotID, err := r.CloudAPI.TriggerSnapshot(db.Status.InstanceID)
        if err != nil {
            return fmt.Errorf("failed to trigger snapshot: %w", err)
        }

        db.Status.Phase = "Terminating-Snapshotting"
        db.Status.SnapshotIDForDeletion = snapshotID
        if err := r.Status().Update(ctx, db); err != nil {
            return err
        }

        // Do NOT return nil here: the caller treats nil as "cleanup complete"
        // and would remove the finalizer. Signal that work remains instead.
        return errFinalizationInProgress
    }

    if db.Status.Phase == "Terminating-Snapshotting" {
        logger.Info("Checking snapshot status...", "SnapshotID", db.Status.SnapshotIDForDeletion)
        status, err := r.CloudAPI.GetSnapshotStatus(db.Status.SnapshotIDForDeletion)
        if err != nil {
            return fmt.Errorf("failed to get snapshot status: %w", err)
        }

        if status != "Completed" {
            logger.Info("Snapshot still in progress, will check again shortly")
            return errFinalizationInProgress
        }

        logger.Info("Snapshot complete, moving to deprovisioning")
        db.Status.Phase = "Terminating-Deprovisioning"
        if err := r.Status().Update(ctx, db); err != nil {
            return err
        }
        // The deprovisioning branch (Phase == "Terminating-Deprovisioning")
        // runs on the next reconciliation and is the only path that returns
        // nil, which allows the caller to remove the finalizer.
        return errFinalizationInProgress
    }

    // The main Reconcile function's deletion branch then distinguishes
    // "still in progress" from real failures; see the sketch below.

    This non-blocking approach is critical for operator performance and scalability.
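
    Tying it together, the deletion branch of Reconcile can translate the in-progress sentinel into a delayed requeue rather than an error. A sketch under those assumptions follows: errFinalizationInProgress is the sentinel introduced above, the 30-second interval is an arbitrary choice, and the standard library errors package must be imported.

    go
    // In Reconcile, the deletion branch becomes:
    if db.GetDeletionTimestamp() != nil {
    	if controllerutil.ContainsFinalizer(db, databasev1.ManagedDatabaseFinalizer) {
    		if err := r.finalizeManagedDatabase(ctx, db); err != nil {
    			if errors.Is(err, errFinalizationInProgress) {
    				// Cleanup is underway but not finished; poll again later
    				// without surfacing it as a failure.
    				return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
    			}
    			logger.Error(err, "Failed to finalize ManagedDatabase")
    			r.Recorder.Event(db, "Warning", "FinalizationFailed", err.Error())
    			return ctrl.Result{}, err
    		}

    		// finalizeManagedDatabase returned nil: all cleanup is done, so
    		// releasing the finalizer lets the API server delete the object.
    		controllerutil.RemoveFinalizer(db, databasev1.ManagedDatabaseFinalizer)
    		if err := r.Update(ctx, db); err != nil {
    			return ctrl.Result{}, err
    		}
    	}
    	return ctrl.Result{}, nil
    }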

    2. The "Stuck Finalizer" Problem

    A common failure mode is a bug in the finalizer logic that prevents it from ever completing and removing the finalizer string. This leaves the object in a permanent Terminating state, unable to be deleted. This is sometimes called "Finalizer Hell."

    Debugging and Mitigation:

    * Observability: Implement metrics. A Prometheus gauge operator_stuck_finalizers that tracks the number of objects with a deletionTimestamp older than a threshold (e.g., 1 hour) is essential for detection.

    * Controller Logs: The first step is always to inspect the operator logs for the specific resource to understand why the finalization logic is failing or getting stuck.

    * Manual Intervention (The "Break Glass" Procedure): In an emergency, an administrator can manually remove the finalizer from the object. This is a dangerous operation as it can lead to orphaned resources, but it may be necessary to unblock a system.

    bash
        # Inspect the stuck object and its finalizers
        kubectl get manageddatabase my-db -o yaml

        # Option 1: remove the finalizer with a JSON patch
        # (index 0 assumes our finalizer is the first entry in metadata.finalizers)
        kubectl patch manageddatabase my-db --type json \
          -p='[{"op": "remove", "path": "/metadata/finalizers/0"}]'

        # Option 2: edit the object interactively and delete the entry
        #   finalizers:
        #   - database.example.com/finalizer   <-- delete this line
        kubectl edit manageddatabase my-db
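
    The same break-glass removal can also be issued programmatically, for example from an internal cleanup tool. Below is a hedged sketch using controller-runtime's raw JSON patch support; the helper and its name are hypothetical, and clearing the list carries the same orphaned-resource risk as the kubectl commands above.

    go
    import (
    	"context"

    	databasev1 "example.com/managed-db-operator/api/v1"
    	"k8s.io/apimachinery/pkg/types"
    	"sigs.k8s.io/controller-runtime/pkg/client"
    )

    // forceRemoveFinalizers is a hypothetical emergency helper. It clears ALL
    // finalizers, so any external resources the operator still tracked may be
    // orphaned (the same caveat as the kubectl commands above).
    func forceRemoveFinalizers(ctx context.Context, c client.Client, namespace, name string) error {
    	db := &databasev1.ManagedDatabase{}
    	if err := c.Get(ctx, client.ObjectKey{Namespace: namespace, Name: name}, db); err != nil {
    		return err
    	}
    	// Equivalent to the kubectl JSON patch, but replacing the whole list is
    	// less fragile than removing a specific index.
    	patch := []byte(`[{"op": "replace", "path": "/metadata/finalizers", "value": []}]`)
    	return c.Patch(ctx, db, client.RawPatch(types.JSONPatchType, patch))
    }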

    3. Concurrency and Multiple Finalizers

    An object can have multiple finalizers, managed by different controllers. For example, a backup operator might add a finalizer to our ManagedDatabase to ensure it archives the final snapshot. The Kubernetes API server will not delete the object until all finalizers have been removed from the list. Your controller's logic must be robust to this; it should only ever add and remove its own, uniquely-named finalizer and not interfere with others.
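
    Concretely, this means never replacing the finalizers list wholesale. The controllerutil helpers already operate on a single entry, as this small sketch illustrates; the second finalizer and its owner are hypothetical.

    go
    // The object may legitimately carry finalizers owned by other controllers:
    //
    //   finalizers:
    //   - database.example.com/finalizer            (ours)
    //   - backup.example.com/archive-snapshot       (hypothetical backup operator)

    // Correct: remove only our own entry. The other finalizer stays in place
    // and continues to block deletion until its controller is satisfied.
    controllerutil.RemoveFinalizer(db, databasev1.ManagedDatabaseFinalizer)

    // Incorrect: wiping the whole list would also strip the backup operator's
    // finalizer and defeat its cleanup guarantees.
    // db.SetFinalizers(nil)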

    Observability for Finalizer Workflows

    To run stateful operators in production, you must monitor their behavior, especially during deletion.

    Key Metrics:

    * finalizer_latency_seconds (Histogram): Measures the time from when deletionTimestamp is set to when the finalizer is removed. This helps identify slow cleanup processes.

    * finalizer_failures_total (Counter): A counter, labeled by finalization step (e.g., step="snapshot"), that increments each time a cleanup step fails and needs to be retried.

    Implementation with Prometheus:

    go
    // In your controller setup
    import (
    	"github.com/prometheus/client_golang/prometheus"
    	"sigs.k8s.io/controller-runtime/pkg/metrics"
    )
    
    var (
    	finalizerLatency = prometheus.NewHistogramVec(
    		prometheus.HistogramOpts{
    			Name: "manageddatabase_finalizer_latency_seconds",
    			Help: "Latency of ManagedDatabase finalization",
    		},
    		[]string{"name"},
    	)
    	finalizerFailures = prometheus.NewCounterVec(
    		prometheus.CounterOpts{
    			Name: "manageddatabase_finalizer_failures_total",
    			Help: "Total number of failures during finalization",
    		},
    		[]string{"step"},
    	)
    )
    
    func init() {
    	metrics.Registry.MustRegister(finalizerLatency, finalizerFailures)
    }
    
    // In your Reconcile function, when deletion starts:
    startTime := time.Now()
    
    // When finalizer is removed:
    finalizerLatency.WithLabelValues(db.Name).Observe(time.Since(startTime).Seconds())
    
    // When an error occurs in a step:
    // if err := r.CloudAPI.SetMaintenanceMode(...); err != nil {
    //     finalizerFailures.WithLabelValues("maintenance_mode").Inc()
    //     return err
    // }
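
    The operator_stuck_finalizers gauge mentioned in the debugging section can be fed by a small background runnable that periodically lists resources and counts objects whose deletionTimestamp is older than a threshold. A sketch under those assumptions follows; the one-hour threshold, one-minute sampling interval, and helper names are arbitrary choices.

    go
    import (
    	"context"
    	"time"

    	databasev1 "example.com/managed-db-operator/api/v1"
    	"github.com/prometheus/client_golang/prometheus"
    	"sigs.k8s.io/controller-runtime/pkg/manager"
    	"sigs.k8s.io/controller-runtime/pkg/metrics"
    )

    var stuckFinalizers = prometheus.NewGauge(prometheus.GaugeOpts{
    	Name: "manageddatabase_stuck_finalizers",
    	Help: "ManagedDatabases that have been terminating for longer than the threshold",
    })

    func init() {
    	metrics.Registry.MustRegister(stuckFinalizers)
    }

    // registerStuckFinalizerWatcher is a hypothetical helper, called from main.go
    // after the manager is created, e.g.:
    //   registerStuckFinalizerWatcher(mgr, time.Hour)
    func registerStuckFinalizerWatcher(mgr manager.Manager, threshold time.Duration) error {
    	return mgr.Add(manager.RunnableFunc(func(ctx context.Context) error {
    		ticker := time.NewTicker(time.Minute)
    		defer ticker.Stop()
    		for {
    			select {
    			case <-ctx.Done():
    				return nil
    			case <-ticker.C:
    				var list databasev1.ManagedDatabaseList
    				if err := mgr.GetClient().List(ctx, &list); err != nil {
    					continue // skip this sample on transient errors
    				}
    				stuck := 0
    				for _, db := range list.Items {
    					if ts := db.GetDeletionTimestamp(); ts != nil && time.Since(ts.Time) > threshold {
    						stuck++
    					}
    				}
    				stuckFinalizers.Set(float64(stuck))
    			}
    		}
    	}))
    }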

    Conclusion

    The finalizer pattern is not an optional enhancement for stateful operators; it is a fundamental requirement for their correctness and safety. By intercepting the deletion process, finalizers empower operators to perform graceful, multi-step teardowns of the external resources they manage, preventing orphaned infrastructure and ensuring data integrity. A production-grade implementation demands more than just adding and removing a string; it requires a deep understanding of idempotency, state management via the CR's status, non-blocking asynchronous operations, and robust observability. Mastering finalizers is a critical step in transitioning from building simple Kubernetes controllers to engineering truly resilient, production-ready, autonomous systems.
