K8s Finalizers: A Deep Dive into Stateful Resource Deletion
The Deletion Fallacy in Kubernetes
As a seasoned engineer working with Kubernetes, you understand that the platform's declarative nature is its greatest strength. You define the desired state, and a constellation of controllers works tirelessly to make it so. However, this model has a subtle but critical impedance mismatch when it comes to deletion, especially for resources that manage state outside the cluster.
A kubectl delete my-crd my-resource command sends a request to the API server to remove an object from etcd. For a stateless Deployment, this is fine; the garbage collector cascades the deletion through owner references, the underlying ReplicaSets and Pods are cleaned up, and the object is gone. But what if your Custom Resource (CR), say a DatabaseInstance, represents a managed PostgreSQL database in AWS RDS? A simple deletion from etcd orphans the actual database, leaving you with a running, billable resource that Kubernetes no longer knows about.
This is where the standard Kubernetes deletion mechanism falls short. It's a one-way, asynchronous operation that provides no hook for pre-deletion cleanup. The solution, and the core focus of this article, is the Finalizer pattern. Finalizers are entries in an object's metadata.finalizers list that tell Kubernetes, "Do not delete this object yet. A controller is performing cleanup tasks."
This article assumes you are familiar with Go, Kubernetes controllers, and the operator pattern. We will not cover the basics of kubebuilder or CRD creation. Instead, we will focus exclusively on the advanced implementation details of using finalizers to build a production-grade, state-aware controller.
Anatomy of a Finalized Deletion
Before diving into code, let's dissect the mechanism. A finalizer is simply a string added to the metadata.finalizers array of an object. When a user requests to delete an object that has finalizers:
1. Deletion is Requested: Instead of removing the object, the API server sees that the finalizers array is not empty.
2. DeletionTimestamp is Set: The API server updates the object, setting a metadata.deletionTimestamp to the current time. The object is now in a "terminating" state but remains visible via the API.
3. The Controller Detects Termination: In your Reconcile function, your controller must now detect that DeletionTimestamp is set. This is the signal to begin cleanup.
4. Cleanup and Finalizer Removal: Once cleanup succeeds, the controller removes its entry from the metadata.finalizers array and updates the object.
5. Garbage Collection: Once the finalizers array is empty and the deletionTimestamp is set, the Kubernetes garbage collector is free to permanently delete the object from etcd.
This two-phase commit-like process ensures that your controller gets a chance to gracefully tear down all associated external resources before the Kubernetes representation disappears.
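In reconciler terms, this lifecycle collapses into two boolean checks: is DeletionTimestamp set, and is our finalizer present? Here is a minimal sketch, assuming controller-runtime (the helper name deletionPhase is ours, for illustration only):
import (
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)
// deletionPhase classifies where an object sits in the finalized-deletion
// lifecycle described above.
func deletionPhase(obj client.Object, myFinalizer string) string {
	terminating := obj.GetDeletionTimestamp() != nil
	hasFinalizer := controllerutil.ContainsFinalizer(obj, myFinalizer)
	switch {
	case !terminating && !hasFinalizer:
		return "register finalizer" // claim the object before provisioning anything
	case !terminating:
		return "reconcile normally" // steady state
	case hasFinalizer:
		return "run cleanup, then remove finalizer" // steps 3 and 4
	default:
		return "wait for garbage collection" // step 5; nothing left for us to do
	}
}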
Building a `CloudDatabase` Controller
Let's build a practical example: a controller that manages a CloudDatabase CR. This CR will have a spec defining the database engine (e.g., postgres) and size, and a status reflecting its provisioned state and external ID.
1. The `CloudDatabase` CRD Type Definition
First, we define our types in Go using kubebuilder markers. The key is to have a robust status sub-resource to track the external state.
// file: api/v1alpha1/clouddatabase_types.go
package v1alpha1
import (
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)
// CloudDatabaseSpec defines the desired state of CloudDatabase
type CloudDatabaseSpec struct {
Engine string `json:"engine"`
SizeGB int `json:"sizeGb"`
}
// CloudDatabaseStatus defines the observed state of CloudDatabase
type CloudDatabaseStatus struct {
// Represents the observations of a CloudDatabase's current state.
// Important: Run "make" to regenerate code after modifying this file
// +optional
ExternalID string `json:"externalId,omitempty"`
// +optional
Phase string `json:"phase,omitempty"`
// +listType=map
// +listMapKey=type
// +patchStrategy=merge
// +patchMergeKey=type
// +optional
Conditions []metav1.Condition `json:"conditions,omitempty" patchStrategy:"merge" patchMergeKey:"type"`
}
//+kubebuilder:object:root=true
//+kubebuilder:subresource:status
//+kubebuilder:printcolumn:name="Engine",type="string",JSONPath=".spec.engine"
//+kubebuilder:printcolumn:name="Status",type="string",JSONPath=".status.phase"
//+kubebuilder:printcolumn:name="Age",type="date",JSONPath=".metadata.creationTimestamp"
// CloudDatabase is the Schema for the clouddatabases API
type CloudDatabase struct {
metav1.TypeMeta `json:",inline"`
metav1.ObjectMeta `json:"metadata,omitempty"`
Spec CloudDatabaseSpec `json:"spec,omitempty"`
Status CloudDatabaseStatus `json:"status,omitempty"`
}
//+kubebuilder:object:root=true
// CloudDatabaseList contains a list of CloudDatabase
type CloudDatabaseList struct {
metav1.TypeMeta `json:",inline"`
metav1.ListMeta `json:"metadata,omitempty"`
Items []CloudDatabase `json:"items"`
}
func init() {
SchemeBuilder.Register(&CloudDatabase{}, &CloudDatabaseList{})
}
2. The Core Reconciliation Logic with Finalizers
Now for the heart of the implementation: the Reconcile method. We'll use the controller-runtime library, which provides excellent helpers.
// file: internal/controller/clouddatabase_controller.go
package controller
import (
"context"
"fmt"
"time"
"k8s.io/apimachinery/pkg/api/errors"
"k8s.io/apimachinery/pkg/runtime"
ctrl "sigs.k8s.io/controller-runtime"
"sigs.k8s.io/controller-runtime/pkg/client"
"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
"sigs.k8s.io/controller-runtime/pkg/log"
dbv1alpha1 "finalizer-demo/api/v1alpha1"
)
// A mock external client to simulate a cloud provider API
type MockCloudDBClient struct{}
func (c *MockCloudDBClient) CreateDatabase(name string, engine string, size int) (string, error) {
fmt.Printf("MOCK_API: Creating database '%s' with engine '%s'\n", name, engine)
// Simulate API call latency
time.Sleep(2 * time.Second)
return fmt.Sprintf("ext-%s", name), nil
}
func (c *MockCloudDBClient) GetDatabaseStatus(externalID string) (string, error) {
fmt.Printf("MOCK_API: Getting status for external ID '%s'\n", externalID)
time.Sleep(500 * time.Millisecond)
// In a real scenario, this would return "CREATING", "AVAILABLE", "DELETING", etc.
return "AVAILABLE", nil
}
func (c *MockCloudDBClient) DeleteDatabase(externalID string) error {
fmt.Printf("MOCK_API: Deleting database with external ID '%s'\n", externalID)
// Simulate a long deletion process
time.Sleep(5 * time.Second)
fmt.Printf("MOCK_API: Successfully deleted database '%s'\n", externalID)
return nil
}
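// Design note (our addition, not part of the original file): hiding the
// concrete client behind a small interface makes it trivial to swap in a
// real cloud SDK wrapper, or a test double, without touching the reconciler.
// The mock above already satisfies it implicitly.
type CloudDBClient interface {
	CreateDatabase(name, engine string, size int) (string, error)
	GetDatabaseStatus(externalID string) (string, error)
	DeleteDatabase(externalID string) error
}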
// CloudDatabaseReconciler reconciles a CloudDatabase object
type CloudDatabaseReconciler struct {
client.Client
Scheme *runtime.Scheme
MockCloudClient *MockCloudDBClient // Our mock client
}
// The finalizer string our controller will manage
const cloudDBFinalizer = "db.example.com/finalizer"
func (r *CloudDatabaseReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
logger := log.FromContext(ctx)
// 1. Fetch the CloudDatabase instance
dbInstance := &dbv1alpha1.CloudDatabase{}
if err := r.Get(ctx, req.NamespacedName, dbInstance); err != nil {
if errors.IsNotFound(err) {
logger.Info("CloudDatabase resource not found. Ignoring since object must be deleted.")
return ctrl.Result{}, nil
}
logger.Error(err, "Failed to get CloudDatabase")
return ctrl.Result{}, err
}
// 2. Check if the object is being deleted
isMarkedForDeletion := dbInstance.GetDeletionTimestamp() != nil
if isMarkedForDeletion {
if controllerutil.ContainsFinalizer(dbInstance, cloudDBFinalizer) {
// Run our finalization logic. If it fails, we'll requeue the request
// so we can retry again later. This is the core of the pattern.
if err := r.finalizeCloudDatabase(ctx, dbInstance); err != nil {
// Don't remove the finalizer if cleanup fails.
// The reconciliation will be retried automatically.
logger.Error(err, "Failed to finalize CloudDatabase")
return ctrl.Result{}, err
}
// Cleanup was successful. Remove our finalizer.
logger.Info("External database deleted, removing finalizer")
controllerutil.RemoveFinalizer(dbInstance, cloudDBFinalizer)
if err := r.Update(ctx, dbInstance); err != nil {
return ctrl.Result{}, err
}
}
// Stop reconciliation as the item is being deleted
return ctrl.Result{}, nil
}
// 3. The object is NOT being deleted, so we add our finalizer if it doesn't exist.
if !controllerutil.ContainsFinalizer(dbInstance, cloudDBFinalizer) {
logger.Info("Adding finalizer for CloudDatabase")
controllerutil.AddFinalizer(dbInstance, cloudDBFinalizer)
if err := r.Update(ctx, dbInstance); err != nil {
return ctrl.Result{}, err
}
}
// 4. Main reconciliation logic: Provision the external database if it doesn't exist
if dbInstance.Status.ExternalID == "" {
logger.Info("Provisioning a new external database")
externalID, err := r.MockCloudClient.CreateDatabase(dbInstance.Name, dbInstance.Spec.Engine, dbInstance.Spec.SizeGB)
if err != nil {
logger.Error(err, "Failed to create external database")
dbInstance.Status.Phase = "Failed"
_ = r.Status().Update(ctx, dbInstance) // Use status subresource
return ctrl.Result{}, err
}
dbInstance.Status.ExternalID = externalID
dbInstance.Status.Phase = "Provisioned"
if err := r.Status().Update(ctx, dbInstance); err != nil {
logger.Error(err, "Failed to update CloudDatabase status")
return ctrl.Result{}, err
}
logger.Info("Successfully provisioned external database", "ExternalID", externalID)
return ctrl.Result{}, nil
}
// TODO: Add logic to handle updates to the spec (e.g., resizing the DB)
logger.Info("Reconciliation complete, no action taken.")
return ctrl.Result{}, nil
}
// finalizeCloudDatabase contains the logic to clean up the external resource.
func (r *CloudDatabaseReconciler) finalizeCloudDatabase(ctx context.Context, db *dbv1alpha1.CloudDatabase) error {
logger := log.FromContext(ctx)
if db.Status.ExternalID == "" {
logger.Info("External database not found in status, nothing to clean up.")
return nil
}
logger.Info("Starting finalizer cleanup for external database", "ExternalID", db.Status.ExternalID)
if err := r.MockCloudClient.DeleteDatabase(db.Status.ExternalID); err != nil {
// This is a critical error. The finalizer will NOT be removed, and this
// function will be called again on the next reconciliation.
return fmt.Errorf("failed to delete external database %s: %w", db.Status.ExternalID, err)
}
logger.Info("Successfully finalized CloudDatabase")
return nil
}
// SetupWithManager sets up the controller with the Manager.
func (r *CloudDatabaseReconciler) SetupWithManager(mgr ctrl.Manager) error {
return ctrl.NewControllerManagedBy(mgr).
For(&dbv1alpha1.CloudDatabase{}).
Complete(r)
}
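Before hardening anything, it's worth locking in the happy path with a test. Below is a minimal sketch using controller-runtime's fake client; it assumes the kubebuilder-generated AddToScheme for our API group and a controller-runtime version >= v0.15 (for WithStatusSubresource), and the test name is ours:
// file: internal/controller/clouddatabase_controller_test.go
package controller
import (
	"context"
	"testing"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/types"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client/fake"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
	dbv1alpha1 "finalizer-demo/api/v1alpha1"
)
func TestReconcileAddsFinalizer(t *testing.T) {
	scheme := runtime.NewScheme()
	_ = dbv1alpha1.AddToScheme(scheme) // generated by kubebuilder alongside the types
	db := &dbv1alpha1.CloudDatabase{
		ObjectMeta: metav1.ObjectMeta{Name: "test-db", Namespace: "default"},
		Spec:       dbv1alpha1.CloudDatabaseSpec{Engine: "postgres", SizeGB: 10},
	}
	c := fake.NewClientBuilder().
		WithScheme(scheme).
		WithObjects(db).
		WithStatusSubresource(db).
		Build()
	r := &CloudDatabaseReconciler{Client: c, Scheme: scheme, MockCloudClient: &MockCloudDBClient{}}
	req := ctrl.Request{NamespacedName: types.NamespacedName{Name: "test-db", Namespace: "default"}}
	if _, err := r.Reconcile(context.Background(), req); err != nil {
		t.Fatalf("reconcile failed: %v", err)
	}
	got := &dbv1alpha1.CloudDatabase{}
	if err := c.Get(context.Background(), req.NamespacedName, got); err != nil {
		t.Fatalf("get failed: %v", err)
	}
	if !controllerutil.ContainsFinalizer(got, cloudDBFinalizer) {
		t.Fatalf("expected finalizer %q on object", cloudDBFinalizer)
	}
}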
Advanced Edge Cases and Production Patterns
The simple implementation above works for the happy path, but production environments are never that simple. Here's how to harden your finalizer logic.
Edge Case 1: Idempotent and Resumable Cleanup
Problem: The DeleteDatabase call to the cloud provider might be a long-running operation. What if the controller pod crashes after initiating the deletion but before it completes and removes the finalizer? On restart, the controller will reconcile the same object again. If your cleanup logic is not idempotent, you might try to delete an already-deleting resource, causing an API error from the cloud provider.
Solution: Enhance the status sub-resource to track the cleanup state. Use Kubernetes Conditions for this, as they are the standard pattern.
First, add a Deleting condition type to your status constants:
const (
ConditionTypeReady = "Ready"
ConditionTypeDeleting = "Deleting"
)
Next, modify the finalizeCloudDatabase function to be state-aware:
import (
"k8s.io/apimachinery/pkg/api/meta"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)
func (r *CloudDatabaseReconciler) finalizeCloudDatabase(ctx context.Context, db *dbv1alpha1.CloudDatabase) error {
logger := log.FromContext(ctx)
// Check if deletion has already been initiated
if meta.IsStatusConditionTrue(db.Status.Conditions, ConditionTypeDeleting) {
logger.Info("External database deletion is already in progress.")
// Here you would check the status from the cloud provider
// For this mock, we assume it's done after a delay.
// In a real implementation, you'd poll the provider's API.
status, err := r.MockCloudClient.GetDatabaseStatus(db.Status.ExternalID)
if err != nil {
// If the resource is truly gone, the API might return a 404. That's a success for us.
if isCloudResourceNotFound(err) { // isCloudResourceNotFound is a hypothetical function
logger.Info("External database confirmed deleted by provider API.")
return nil // Success! The finalizer can be removed.
}
return err // Some other API error, retry.
}
if status == "DELETING" {
logger.Info("Deletion still in progress according to cloud provider. Requeuing.")
// We return a typed *RequeueError; the caller translates it into a
// ctrl.Result{RequeueAfter: ...} with a nil error, avoiding exponential
// backoff for what is a normal wait rather than a failure.
return &RequeueError{RequeueAfter: 30 * time.Second}
}
logger.Info("External database confirmed deleted.")
return nil // Success
}
// Deletion not yet initiated. Start it now.
logger.Info("Starting finalizer cleanup for external database", "ExternalID", db.Status.ExternalID)
if err := r.MockCloudClient.DeleteDatabase(db.Status.ExternalID); err != nil {
// Update status to reflect failure
meta.SetStatusCondition(&db.Status.Conditions, metav1.Condition{
Type: ConditionTypeDeleting,
Status: metav1.ConditionFalse,
Reason: "DeletionFailed",
Message: fmt.Sprintf("Failed to initiate deletion: %v", err),
})
if updateErr := r.Status().Update(ctx, db); updateErr != nil {
return updateErr
}
return err
}
// Update status to indicate deletion is in progress
logger.Info("Successfully initiated external database deletion. Updating status.")
meta.SetStatusCondition(&db.Status.Conditions, metav1.Condition{
Type: ConditionTypeDeleting,
Status: metav1.ConditionTrue,
Reason: "DeletionInProgress",
Message: "External database deletion has been initiated.",
})
if err := r.Status().Update(ctx, db); err != nil {
return err
}
// Requeue to check on deletion progress later.
return &RequeueError{RequeueAfter: 30 * time.Second}
}
// Custom error type for controlled requeueing
type RequeueError struct {
RequeueAfter time.Duration
}
func (e *RequeueError) Error() string {
return fmt.Sprintf("requeue after %s", e.RequeueAfter)
}
// In your main Reconcile function, you'd handle this custom error:
if err := r.finalizeCloudDatabase(ctx, dbInstance); err != nil {
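// A plain type assertion is sufficient here because *RequeueError is
// returned unwrapped; if you ever wrap it with fmt.Errorf("...: %w", err),
// switch to errors.As instead.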
if requeueErr, ok := err.(*RequeueError); ok {
return ctrl.Result{RequeueAfter: requeueErr.RequeueAfter}, nil
}
return ctrl.Result{}, err
}
This approach is robust. It uses the status as the source of truth for the external operation, making the process resumable and idempotent.
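A design variation worth noting: give the finalize helper the same (ctrl.Result, error) signature as Reconcile and drop the custom error type entirely. Below is a sketch under that assumption; the name finalizeWithResult is ours, and the convention is that an empty Result with a nil error means "cleanup confirmed, remove the finalizer":
func (r *CloudDatabaseReconciler) finalizeWithResult(ctx context.Context, db *dbv1alpha1.CloudDatabase) (ctrl.Result, error) {
	// Deletion not yet initiated: start it and record the condition.
	if !meta.IsStatusConditionTrue(db.Status.Conditions, ConditionTypeDeleting) {
		if err := r.MockCloudClient.DeleteDatabase(db.Status.ExternalID); err != nil {
			return ctrl.Result{}, err
		}
		meta.SetStatusCondition(&db.Status.Conditions, metav1.Condition{
			Type:    ConditionTypeDeleting,
			Status:  metav1.ConditionTrue,
			Reason:  "DeletionInProgress",
			Message: "External database deletion has been initiated.",
		})
		if err := r.Status().Update(ctx, db); err != nil {
			return ctrl.Result{}, err
		}
		return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
	}
	// Deletion already in progress: poll the provider.
	status, err := r.MockCloudClient.GetDatabaseStatus(db.Status.ExternalID)
	if err != nil {
		return ctrl.Result{}, err
	}
	if status == "DELETING" {
		return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
	}
	return ctrl.Result{}, nil // deletion confirmed; the caller removes the finalizer
}
The trade-off: the caller must interpret a zero Result as "done," which is less explicit than a named error type, but it keeps the plumbing identical to every other reconcile path.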
Edge Case 2: The Stuck `Terminating` Resource
Problem: What if your cleanup logic has a permanent bug, or the external API is down indefinitely? The finalizeCloudDatabase function will always return an error, the finalizer will never be removed, and the object will be stuck in the Terminating state forever. This is a common operational headache.
Solution: There is no magic bullet here, but the solution involves robust monitoring and clear operational procedures.
1. Monitoring and Alerting: Create an alert that fires when a resource has been stuck in the Terminating state for too long (e.g., > 1 hour). The query would look something like: sum(kube_customresource_metadata_deletion_timestamp{group="db.example.com", version="v1alpha1", kind="CloudDatabase"}) by (namespace, customresource) > 0
2. An Operational Runbook: When the alert fires, an operator follows a documented procedure that involves:
a. Manually cleaning up the external resource (e.g., deleting the RDS instance via the AWS console).
b. Forcing the removal of the finalizer from the Kubernetes object. This can be done with kubectl patch:
kubectl patch clouddatabase my-db-instance --type json -p='[{"op": "remove", "path": "/metadata/finalizers"}]'
This is a powerful and dangerous command: it removes every finalizer on the object, not just yours. It should only be used when the external resource has been confirmed to be gone and no other controller still depends on its own finalizer being honored.
Edge Case 3: Multiple Finalizers and Coordination
Problem: It's possible for multiple, independent controllers to place finalizers on the same object. For example, a BackupController might add a finalizer to our CloudDatabase object to ensure it takes a final snapshot before deletion. The CloudDatabaseController knows nothing about this other controller.
Solution: This is handled elegantly by the finalizer mechanism itself, provided each controller is well-behaved.
1. Unique Finalizer Names: Each controller adds its own, uniquely named finalizer (e.g., db.example.com/finalizer, backup.example.com/finalizer). This prevents collisions.
2. Independent, Parallel Cleanup: When the object is marked for deletion, both the CloudDatabaseController and the BackupController will be reconciled. They will perform their cleanup in parallel. The object will only be deleted after both controllers have removed their respective finalizers.
3. Surgical Removal: Your code should always use controllerutil.RemoveFinalizer, which safely removes just your specific finalizer from the slice, rather than overwriting the whole slice (see the sketch below).
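A minimal sketch of that surgical removal; the helper removeOurFinalizer is ours, and we assume a recent controller-runtime release where RemoveFinalizer reports whether it changed anything:
// removeOurFinalizer deletes only this controller's entry, leaving e.g.
// backup.example.com/finalizer (owned by the independent BackupController)
// untouched.
func (r *CloudDatabaseReconciler) removeOurFinalizer(ctx context.Context, db *dbv1alpha1.CloudDatabase) error {
	// RemoveFinalizer mutates the slice in place; because it reports whether
	// anything changed, a no-op Update can be skipped entirely.
	if controllerutil.RemoveFinalizer(db, cloudDBFinalizer) {
		return r.Update(ctx, db)
	}
	return nil
}
Even after this Update succeeds, the object remains in etcd until the BackupController has removed its own entry as well.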
Performance and API Server Load
In a busy cluster with thousands of CRs, an inefficient finalizer implementation can cause significant performance degradation.
Problem: During a long cleanup (e.g., a database taking 10 minutes to delete), what's the right requeue strategy? Naively returning an error causes controller-runtime to use an exponential backoff queue, which is great for genuine failures but not for waiting. Constantly updating the status every few seconds can flood the API server.
Solution: A combination of intelligent requeueing and judicious status updates.
1. Use RequeueAfter for Waiting: As in the RequeueError example, when you are simply waiting for an external system, return ctrl.Result{RequeueAfter: duration} with a nil error. This puts the item back in the work queue to be processed after a specific delay (e.g., 30 seconds) without treating it as a failure and without the exponential backoff.
2. Minimize Status Updates: Avoid updating the status on every single check. Update it only when the state changes meaningfully:
- When deletion is first initiated (DeletionInProgress).
- If a transient error occurs (DeletionFailedTransient).
- If a permanent error is confirmed (DeletionFailedPermanent).
Polling an external API every 5 seconds and writing status back to the Kubernetes API every 5 seconds is an anti-pattern. The polling can be frequent, but the writes to the API server should be sparse.
3. Tune Concurrency: Use the MaxConcurrentReconciles option when setting up your controller manager. If your cleanup logic is I/O-bound (calling cloud APIs), you can likely increase this value. If it's CPU-bound, keep it low. A high concurrency could lead to rate limiting from your cloud provider during a mass deletion event (see the rate-limiting sketch after this snippet).
// in SetupWithManager; controller.Options comes from sigs.k8s.io/controller-runtime/pkg/controller
err = ctrl.NewControllerManagedBy(mgr).
For(&dbv1alpha1.CloudDatabase{}).
WithOptions(controller.Options{MaxConcurrentReconciles: 5}). // Default is 1
Complete(r)
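If you do raise concurrency, consider pairing it with a client-side rate limiter on outbound provider calls. A minimal sketch using golang.org/x/time/rate follows; the wrapper type and its numbers are illustrative, not part of the demo controller:
import (
	"context"
	"golang.org/x/time/rate"
)
// RateLimitedCloudClient throttles outbound provider calls so that a
// mass-deletion event (many finalizers firing at once) stays under the
// provider's rate limits. Hypothetical wrapper around the mock above.
type RateLimitedCloudClient struct {
	inner   *MockCloudDBClient
	limiter *rate.Limiter // e.g., rate.NewLimiter(rate.Limit(5), 10) = 5 req/s, burst of 10
}
func (c *RateLimitedCloudClient) DeleteDatabase(ctx context.Context, externalID string) error {
	// Wait blocks until a token is available or the context is cancelled.
	if err := c.limiter.Wait(ctx); err != nil {
		return err
	}
	return c.inner.DeleteDatabase(externalID)
}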
Conclusion
The Finalizer pattern is not just a feature; it's the cornerstone of building robust operators for stateful applications on Kubernetes. By intercepting the deletion process, it allows controllers to extend Kubernetes's declarative model to resources living outside the cluster, ensuring that 'delete' means 'gracefully tear down and then delete.'
A production-grade implementation moves beyond the basic happy path. It requires idempotent, resumable cleanup logic, often tracked via the object's status sub-resource. It demands careful consideration of failure modes, like stuck terminations, and requires robust monitoring and operational playbooks. Finally, by using intelligent requeue strategies and minimizing API server chatter, you can ensure your stateful controller is not only correct but also a well-behaved, performant citizen of the cluster.