Idempotent Kubernetes Operators with Finalizers for Stateful Apps
The Idempotency Imperative in Operator Design
For senior engineers working with Kubernetes, the concept of a reconciliation loop is fundamental. An operator continuously compares the desired state (defined in a Custom Resource, or CR) with the actual state of the cluster and takes action to converge the two. However, the true complexity lies not in the "if desired != actual" check, but in ensuring the convergence logic is idempotent. An idempotent operation, when applied multiple times, yields the same result as applying it once. In the chaotic, asynchronous world of a distributed system like Kubernetes, this is not a nice-to-have; it is a non-negotiable requirement for stability.
Reconciliation loops can be triggered at any time: a change to the CR, a change to a managed resource (like a Pod crashing), or a periodic resync. If your operator's logic for creating a ConfigMap simply calls clientset.CoreV1().ConfigMaps(ns).Create(...) without checking for its existence, a resync will cause the Create call to fail with an AlreadyExists error, and every subsequent reconciliation will fail the same way, leaving the controller stuck in a hot error-retry loop.
A naive, non-idempotent reconciliation might look like this:
// WARNING: NON-IDEMPOTENT AND FLAWED LOGIC
func (r *MyResourceReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    // ... fetch MyResource CR ...

    // Flaw 1: Blindly creates a StatefulSet.
    // This will fail on every subsequent reconciliation if the StatefulSet already exists.
    err := r.Client.Create(ctx, desiredStatefulSet)
    if err != nil {
        // This will create a hot loop of errors if the resource exists.
        return ctrl.Result{}, err
    }

    // Flaw 2: Blindly creates a Service.
    // Same issue as above.
    err = r.Client.Create(ctx, desiredService)
    if err != nil {
        return ctrl.Result{}, err
    }

    return ctrl.Result{}, nil
}
The correct approach involves a CreateOrUpdate pattern, often implemented by first attempting to Get the resource. If it's not found (errors.IsNotFound(err)), you Create it. If it is found, you compare its spec with the desired spec and Update it if necessary. This ensures that repeated reconciliations don't cause errors or unintended side effects.
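controller-runtime ships a helper, controllerutil.CreateOrUpdate, that encapsulates this get-then-create-or-update dance. Here is a minimal sketch using the ConfigMap example from above; the helper name, ConfigMap contents, and the r.Client field are illustrative assumptions rather than part of the flawed reconciler shown earlier.

import (
    "context"

    corev1 "k8s.io/api/core/v1"
    "sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// ensureConfigMap is a hypothetical helper showing the idempotent pattern:
// CreateOrUpdate fetches the ConfigMap if it exists, creates it if it doesn't,
// and only issues an Update when the mutate closure actually changes something.
func (r *MyResourceReconciler) ensureConfigMap(ctx context.Context, name, ns string) error {
    cm := &corev1.ConfigMap{}
    cm.Name = name
    cm.Namespace = ns

    // The mutate closure is run against the live object (or an empty one if it
    // doesn't exist yet); repeated calls converge to the same state.
    _, err := controllerutil.CreateOrUpdate(ctx, r.Client, cm, func() error {
        if cm.Data == nil {
            cm.Data = map[string]string{}
        }
        cm.Data["config.yaml"] = "replicas: 3" // illustrative desired content
        return nil
    })
    return err
}

The same pattern applies to the StatefulSet and Service from the flawed example above.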
While idempotency during creation and updates is a well-understood problem, the real challenge arises during deletion, especially for stateful applications. A simple kubectl delete my-crd triggers a garbage collection cascade, but Kubernetes has no native concept of the complex, ordered teardown a stateful service requires. This is where finalizers become the critical tool for building truly robust operators.
The Stateful Deletion Challenge: A Case Study with `ShardDB`
Let's consider a practical, complex scenario. We're building an operator to manage ShardDB, a custom sharded database. Each ShardDB instance consists of a StatefulSet with PersistentVolumeClaims (PVCs) for data storage. Critically, our database has an off-cluster dependency: a cloud backup service where all data must be archived before the infrastructure is decommissioned to prevent data loss.
The ShardDB Custom Resource Definition (CRD) might look like this:
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: sharddbs.db.example.com
spec:
  group: db.example.com
  names:
    kind: ShardDB
    plural: sharddbs
    singular: sharddb
  scope: Namespaced
  versions:
    - name: v1alpha1
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                replicas:
                  type: integer
                  minimum: 1
                storageClassName:
                  type: string
                backupServiceURL:
                  type: string
            status:
              type: object
              properties:
                phase:
                  type: string
                conditions:
                  type: array
                  items: # ... standard condition types
      served: true
      storage: true
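On the Go side, the API types referenced later as dbv1alpha1 would mirror this schema. A minimal kubebuilder-style sketch, with the file path and field names assumed to match the CRD above (deepcopy and scheme registration boilerplate omitted):

// sharddb_types.go (hypothetical path: api/v1alpha1/sharddb_types.go)
package v1alpha1

import (
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// ShardDBSpec mirrors the spec properties declared in the CRD schema.
type ShardDBSpec struct {
    Replicas         int32  `json:"replicas"`
    StorageClassName string `json:"storageClassName,omitempty"`
    BackupServiceURL string `json:"backupServiceURL"`
}

// ShardDBStatus mirrors the status properties; Phase is used later for
// tracking long-running cleanup ("Archiving", "DeletingStatefulSet", ...).
type ShardDBStatus struct {
    Phase      string             `json:"phase,omitempty"`
    Conditions []metav1.Condition `json:"conditions,omitempty"`
}

// ShardDB is the top-level custom resource handled by the reconciler.
type ShardDB struct {
    metav1.TypeMeta   `json:",inline"`
    metav1.ObjectMeta `json:"metadata,omitempty"`

    Spec   ShardDBSpec   `json:"spec,omitempty"`
    Status ShardDBStatus `json:"status,omitempty"`
}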
When a user issues kubectl delete sharddb my-database, the desired teardown sequence is:
1. The ShardDB object should not be immediately removed from the API server.
2. The operator must call the backupServiceURL to initiate a final, complete backup of the data on the PVCs.
3. Once the backup is confirmed complete, the operator deletes the StatefulSet, which terminates the database pods.
4. Only then should the ShardDB object be removed from the API server.
Standard Kubernetes garbage collection fails this requirement spectacularly. When the ShardDB object is deleted, its owner references would cause the StatefulSet to be deleted immediately. The pods would terminate, the PVCs might be orphaned or removed depending on retention settings and the underlying volumes' reclaim policy, but the critical backup step would be skipped entirely, leading to catastrophic data loss.
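That cascade exists because the operator typically marks itself as the controlling owner of the StatefulSet during normal reconciliation. A sketch of that step, assuming a hypothetical buildStatefulSet helper and the standard r.Scheme field on the reconciler:

import (
    appsv1 "k8s.io/api/apps/v1"
    "sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// desiredStatefulSet renders the StatefulSet for a ShardDB and stamps the CR
// as its controlling owner. It is exactly this owner reference that makes
// garbage collection delete the StatefulSet the moment the ShardDB disappears,
// with no opportunity to run a backup first.
func (r *ShardDBReconciler) desiredStatefulSet(shardDB *dbv1alpha1.ShardDB) (*appsv1.StatefulSet, error) {
    sts := buildStatefulSet(shardDB) // hypothetical helper that fills in the spec
    if err := controllerutil.SetControllerReference(shardDB, sts, r.Scheme); err != nil {
        return nil, err
    }
    return sts, nil
}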
Introducing Finalizers for Graceful Deletion
Finalizers are the Kubernetes mechanism to solve this exact problem. A finalizer is simply a string key added to an object's metadata.finalizers list. When an object has one or more finalizers in this list, a kubectl delete command does not immediately delete it. Instead, the API server sets the metadata.deletionTimestamp field to the current time and leaves the object in a Terminating state.
The object will remain in this state, visible via the API, until all keys are removed from its metadata.finalizers list. This gives controllers a chance to execute pre-delete cleanup logic.
The modified reconciliation flow becomes:
1. The deletionTimestamp is zero. The operator ensures its finalizer key is present in the metadata.finalizers list. If not, it adds it and updates the object. This is a critical first step. Then, it proceeds with normal reconciliation (creating/updating the StatefulSet, etc.).
2. The deletionTimestamp is non-zero. The operator now knows the user wants to delete the resource. It checks if its finalizer key is still present.
   * If yes, it executes the cleanup logic (call backup service, delete PVCs, etc.). Upon successful completion, it removes its finalizer key from the list and updates the object.
   * If no, its cleanup is already done, and it does nothing.
Once the metadata.finalizers list is empty, the Kubernetes garbage collector is free to permanently delete the object.
Here is what the core logic branch in our reconciler will look like:
import (
    "context"

    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"
    "sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
    "sigs.k8s.io/controller-runtime/pkg/log"

    dbv1alpha1 "example.com/sharddb-operator/api/v1alpha1" // assumed module path for the ShardDB API types
)

const shardDBFinalizer = "db.example.com/finalizer"

func (r *ShardDBReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    logger := log.FromContext(ctx)

    // Fetch the ShardDB instance.
    shardDB := &dbv1alpha1.ShardDB{}
    if err := r.Get(ctx, req.NamespacedName, shardDB); err != nil {
        // Ignore not-found errors (the object may already be gone); return anything else.
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }

    // Check if the object is being deleted.
    if shardDB.ObjectMeta.DeletionTimestamp.IsZero() {
        // The object is not being deleted, so we add our finalizer if it doesn't exist.
        if !controllerutil.ContainsFinalizer(shardDB, shardDBFinalizer) {
            logger.Info("Adding finalizer for ShardDB")
            controllerutil.AddFinalizer(shardDB, shardDBFinalizer)
            if err := r.Update(ctx, shardDB); err != nil {
                return ctrl.Result{}, err
            }
        }
    } else {
        // The object is being deleted.
        if controllerutil.ContainsFinalizer(shardDB, shardDBFinalizer) {
            logger.Info("Performing finalizer logic for ShardDB")

            // Run our finalizer logic. If it fails, we'll try again later.
            if err := r.handleFinalizer(ctx, shardDB); err != nil {
                // Don't remove the finalizer if cleanup fails. The next reconciliation will retry.
                return ctrl.Result{}, err
            }

            // Finalizer logic succeeded. Remove the finalizer so the object can be deleted.
            logger.Info("Finalizer logic successful. Removing finalizer.")
            controllerutil.RemoveFinalizer(shardDB, shardDBFinalizer)
            if err := r.Update(ctx, shardDB); err != nil {
                return ctrl.Result{}, err
            }
        }

        // Stop reconciliation as the item is being deleted.
        return ctrl.Result{}, nil
    }

    // ... your normal reconciliation logic for creating/updating the StatefulSet, Service, etc. ...

    return ctrl.Result{}, nil
}
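For completeness, the reconciler also has to be registered with the manager so that both CR changes and changes to the owned StatefulSet trigger reconciliation. A minimal sketch, assuming the standard kubebuilder-style wiring in the same controller file (with appsv1 imported from k8s.io/api/apps/v1):

// SetupWithManager wires the reconciler into the controller-runtime manager.
// Watching the ShardDB CR plus the StatefulSets it owns is what produces the
// "managed resource changed" triggers and resyncs described earlier.
func (r *ShardDBReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&dbv1alpha1.ShardDB{}).
        Owns(&appsv1.StatefulSet{}).
        Complete(r)
}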
Production-Grade Implementation with Operator SDK
Let's build out the handleFinalizer function with production-level considerations. This involves idempotent, multi-step cleanup.
Our ShardDB controller needs a client for the external backup service. For this example, we'll mock it.
// MOCK Backup Service Client
type BackupServiceClient struct {
    // In a real implementation, this would hold an http.Client, auth tokens, etc.
}

// BackupStatus represents the state of a backup in the external service.
type BackupStatus string

const (
    BackupNotFound   BackupStatus = "NotFound"
    BackupInProgress BackupStatus = "InProgress"
    BackupCompleted  BackupStatus = "Completed"
    BackupFailed     BackupStatus = "Failed"
)

func (c *BackupServiceClient) TriggerBackup(instanceID string) error {
    fmt.Printf("[BackupClient] Triggering backup for %s\n", instanceID)
    // Simulates an API call that starts a backup job.
    return nil
}

func (c *BackupServiceClient) GetBackupStatus(instanceID string) (BackupStatus, error) {
    fmt.Printf("[BackupClient] Getting backup status for %s\n", instanceID)
    // This mock would be replaced by a real API call.
    // Here we can simulate different states for testing.
    return BackupCompleted, nil
}
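The reconciler itself needs to carry this client alongside the usual controller-runtime plumbing. A sketch of the struct the snippets here assume (field names are illustrative):

import (
    "k8s.io/apimachinery/pkg/runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// ShardDBReconciler reconciles ShardDB objects. The embedded client.Client
// provides Get/List/Create/Update/Delete against the API server, Scheme is
// needed for setting owner references, and BackupClient talks to the external
// service referenced by spec.backupServiceURL.
type ShardDBReconciler struct {
    client.Client
    Scheme       *runtime.Scheme
    BackupClient *BackupServiceClient
}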
The handleFinalizer function orchestrates the cleanup sequence. It must be idempotent at every step.
// In sharddb_controller.go
func (r *ShardDBReconciler) handleFinalizer(ctx context.Context, shardDB *dbv1alpha1.ShardDB) error {
    logger := log.FromContext(ctx)
    instanceID := string(shardDB.UID) // Use UID for a unique identifier

    // Step 1: Trigger and verify backup with the external service.
    status, err := r.BackupClient.GetBackupStatus(instanceID)
    if err != nil {
        logger.Error(err, "Failed to get backup status from external service")
        return err // Requeue and try again
    }

    switch status {
    case BackupNotFound:
        // Idempotency: If we haven't even started the backup, trigger it.
        logger.Info("Backup not found, triggering now.", "instanceID", instanceID)
        if err := r.BackupClient.TriggerBackup(instanceID); err != nil {
            logger.Error(err, "Failed to trigger backup")
            return err
        }
        // Requeue immediately to check status on the next loop.
        return fmt.Errorf("backup triggered, waiting for completion")
    case BackupInProgress:
        // Backup is running, we need to wait. Requeue after a delay.
        logger.Info("Backup is in progress, waiting...", "instanceID", instanceID)
        // Returning an error forces a requeue with exponential backoff, which is what we want.
        return fmt.Errorf("backup in progress, requeuing")
    case BackupFailed:
        // The backup failed. This requires manual intervention.
        // We can update the CR status to reflect this and stop trying.
        failErr := fmt.Errorf("backup for instance %s reported as failed", instanceID)
        logger.Error(failErr, "Backup has failed permanently. Manual intervention required.")
        // TODO: Update ShardDB status to a 'DeletionFailed' state.
        return failErr // Keep retrying unless a permanent error is confirmed.
    case BackupCompleted:
        // The backup is complete. We can proceed.
        logger.Info("Backup completed successfully.", "instanceID", instanceID)
    }

    // Step 2: Delete the StatefulSet.
    // The deletion of the StatefulSet will cascade to its Pods.
    foundSts := &appsv1.StatefulSet{}
    err = r.Get(ctx, client.ObjectKey{Name: shardDB.Name, Namespace: shardDB.Namespace}, foundSts)
    if err != nil && errors.IsNotFound(err) {
        // StatefulSet is already gone, move to the next step.
        logger.Info("StatefulSet already deleted.")
    } else if err == nil {
        // StatefulSet found, delete it.
        logger.Info("Deleting StatefulSet.")
        if err := r.Delete(ctx, foundSts); err != nil {
            logger.Error(err, "Failed to delete StatefulSet")
            return err
        }
        // Wait for it to be fully deleted.
        return fmt.Errorf("waiting for statefulset to be deleted")
    } else {
        logger.Error(err, "Failed to get StatefulSet")
        return err
    }

    // Step 3: Delete the PersistentVolumeClaims.
    // This is often a critical step that requires care.
    pvcList := &corev1.PersistentVolumeClaimList{}
    opts := []client.ListOption{
        client.InNamespace(shardDB.Namespace),
        client.MatchingLabels{"app": shardDB.Name}, // Use labels to find associated PVCs
    }
    if err := r.List(ctx, pvcList, opts...); err != nil {
        logger.Error(err, "Failed to list PVCs")
        return err
    }

    if len(pvcList.Items) > 0 {
        logger.Info("Deleting PersistentVolumeClaims", "count", len(pvcList.Items))
        for i := range pvcList.Items {
            pvc := &pvcList.Items[i]
            if err := r.Delete(ctx, pvc); err != nil {
                // If a PVC fails to delete, return the error; the next reconciliation retries this step.
                logger.Error(err, "Failed to delete PVC", "pvcName", pvc.Name)
                return err
            }
        }
        return fmt.Errorf("waiting for PVCs to be deleted")
    }

    logger.Info("All cleanup steps completed.")
    return nil // Success!
}
This implementation demonstrates several key production patterns:
* State Machine Logic: The switch statement on the backup status acts as a simple state machine, ensuring we don't re-trigger a backup that's already in progress or completed.
* Idempotent Checks: Before each deletion, we check whether the resource (StatefulSet, PVC) still exists. If it's already gone (perhaps from a previous, partially successful finalizer run), we simply move on.
* Requeueing for Waits: Instead of blocking the reconciliation loop with time.Sleep, we return an error or a reconcile.Result{RequeueAfter: ...}. This releases the worker thread and allows controller-runtime to manage retries, typically with exponential backoff. Returning fmt.Errorf(...) is a common and effective way to trigger a requeue.
Advanced Edge Cases and Performance Considerations
A robust operator must be designed to handle failure, not just the happy path.
1. Operator Crash During Finalization
Imagine the operator pod crashes after successfully triggering the backup but before deleting the StatefulSet. The ShardDB CR still has its finalizer. When the operator restarts, a new reconciliation is triggered for the ShardDB object.
Because our handleFinalizer logic is idempotent, it will:
1. Call GetBackupStatus. The service reports BackupCompleted.
2. Check for the StatefulSet. It finds it.
3. Issue the Delete command for the StatefulSet.
The system gracefully recovers and picks up exactly where it left off. The ShardDB object simply remains in the Terminating state until the operator comes back online and finishes the job.
2. External Service Unavailability
What if the backupServiceURL is down? The r.BackupClient.GetBackupStatus() call will fail. Our function returns this error. The controller-runtime manager will see the error and requeue the reconciliation request. By default, it uses an exponential backoff algorithm, so it won't hammer the failing service. It might retry after 1s, then 2s, 4s, 8s, and so on. This is a crucial behavior for being a good citizen in a microservices ecosystem.
For more control, you can return a specific requeue time:
return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
3. The Stuck Finalizer Problem
This is a classic operational issue. A bug in the operator's finalizer logic (e.g., it never successfully completes) or a misconfiguration means the finalizer is never removed. The result is an object that can never be deleted. kubectl delete will hang indefinitely.
As an administrator, you must have a procedure for this. The 'break glass' solution is to manually remove the finalizer from the object:
# Remove the finalizers field entirely with a JSON patch:
kubectl patch sharddb my-database -n my-namespace --type='json' -p='[{"op": "remove", "path": "/metadata/finalizers"}]'
# Or, equivalently, set the finalizers list to empty with a merge patch:
kubectl patch sharddb my-database -n my-namespace --type merge -p '{"metadata":{"finalizers":[]}}'
WARNING: This is a dangerous operation. Manually removing a finalizer bypasses all cleanup logic. In our ShardDB example, this would lead to orphaned PVCs and, more importantly, a database deleted without a final backup. This command should only be used when you are certain the operator is broken and have performed manual cleanup.
4. Performance of Long-Running Cleanup Tasks
Our TriggerBackup call was asynchronous. But what if a cleanup step is synchronous and takes several minutes? For example, waiting for a large data volume to be snapshotted. A long-running reconciliation loop is an anti-pattern. It holds a worker goroutine, reducing the operator's ability to process other events.
The advanced pattern here is to manage the state via the CR's status subresource:
1. In handleFinalizer, when a long task begins, update the CR's status: shardDB.Status.Phase = "Archiving" and r.Status().Update(ctx, shardDB).
2. Have Reconcile return ctrl.Result{RequeueAfter: 30 * time.Second}, nil immediately. Do not block.
3. On the next reconciliation, the handleFinalizer function is entered again. It should first check shardDB.Status.Phase. If it's Archiving, it polls the external service for completion. If not yet complete, it returns another RequeueAfter. If complete, it updates the status to DeletingStatefulSet and proceeds to the next step.
This keeps the reconciliation loops short and fast, making the operator more scalable and responsive.
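A minimal sketch of that phase-driven approach follows. It assumes the Phase field from the ShardDB status shown earlier and a hypothetical variant of the cleanup function that returns a ctrl.Result so it can request delayed requeues; the phase names are illustrative.

// handleFinalizerPhased is a hypothetical, phase-driven variant of the cleanup
// logic. Each reconciliation does one small unit of work, records progress in
// status.phase, and asks to be called back later instead of blocking a worker.
func (r *ShardDBReconciler) handleFinalizerPhased(ctx context.Context, shardDB *dbv1alpha1.ShardDB) (ctrl.Result, error) {
    instanceID := string(shardDB.UID)

    switch shardDB.Status.Phase {
    case "", "Archiving":
        status, err := r.BackupClient.GetBackupStatus(instanceID)
        if err != nil {
            return ctrl.Result{}, err // transient error: requeue with backoff
        }
        if status == BackupNotFound {
            if err := r.BackupClient.TriggerBackup(instanceID); err != nil {
                return ctrl.Result{}, err
            }
        }
        if status != BackupCompleted {
            // Record where we are and come back later without holding the worker.
            shardDB.Status.Phase = "Archiving"
            if err := r.Status().Update(ctx, shardDB); err != nil {
                return ctrl.Result{}, err
            }
            return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
        }
        // Backup done: advance the phase and continue on the next pass.
        shardDB.Status.Phase = "DeletingStatefulSet"
        if err := r.Status().Update(ctx, shardDB); err != nil {
            return ctrl.Result{}, err
        }
        return ctrl.Result{Requeue: true}, nil

    case "DeletingStatefulSet":
        // ... delete the StatefulSet and PVCs exactly as in handleFinalizer above,
        // then return ctrl.Result{}, nil so the caller removes the finalizer.
        return ctrl.Result{}, nil
    }

    return ctrl.Result{}, nil
}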
Conclusion: Beyond Deletion Hooks
Finalizers are far more than a simple pre-delete hook. They are a fundamental control mechanism that enables operators to fully manage the lifecycle of a resource, transforming Kubernetes from a stateless application orchestrator into a platform capable of handling complex, stateful workloads with guarantees against data loss.
By combining the finalizer pattern with strict idempotency in the reconciliation logic, we build operators that are resilient to failure, predictable in their behavior, and capable of performing the kind of complex, ordered operations that stateful systems demand. For senior engineers building cloud-native platforms, mastering this pattern is not just an implementation detail—it's the cornerstone of creating reliable, production-grade automation on Kubernetes.