Idempotent K8s Operators: Finalizers & Leader Election for Stateful Apps

13 min read
Goh Ling Yong
Technology enthusiast and software architect specializing in AI-driven development tools and modern software engineering practices. Passionate about the intersection of artificial intelligence and human creativity in building tomorrow's digital solutions.

The Flaw in Naive Reconciliation for Stateful Systems

As a senior engineer building on Kubernetes, you're likely familiar with the operator pattern. At its core is the reconciliation loop: observe the declared state of a Custom Resource (CR) and make the world (e.g., pods, services, external resources) match it. For stateless applications managed entirely within the cluster, simple CreateOrUpdate logic in the Reconcile function often suffices.
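
For in-cluster children, that pattern is typically a few lines with controller-runtime's controllerutil.CreateOrUpdate helper. The sketch below is illustrative only: it borrows the DatabaseReconciler and Database types introduced later in this article and manages a ConfigMap as a stand-in child resource.

go
// A sketch of "make the world match the spec" for a purely in-cluster child.
// ensureConfigMap is an illustrative helper, not part of the operator built below.
// (imports: corev1 "k8s.io/api/core/v1", metav1 "k8s.io/apimachinery/pkg/apis/meta/v1",
//  controllerutil "sigs.k8s.io/controller-runtime/pkg/controller/controllerutil")
func (r *DatabaseReconciler) ensureConfigMap(ctx context.Context, db *dbv1alpha1.Database) error {
	cm := &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{Name: db.Name + "-config", Namespace: db.Namespace},
	}
	_, err := controllerutil.CreateOrUpdate(ctx, r.Client, cm, func() error {
		// Mutate the object in place to the desired state derived from the CR.
		cm.Data = map[string]string{"instanceSize": db.Spec.InstanceSize}
		// Own the ConfigMap so it is garbage-collected along with the CR.
		return controllerutil.SetControllerReference(db, cm, r.Scheme)
	})
	return err
}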

However, this approach shatters when the operator manages stateful resources external to Kubernetes—a cloud database, a storage bucket, a SaaS subscription. Consider a Database operator that provisions a PostgreSQL instance on AWS RDS. A naive reconciliation might look like this:

  • Get the Database CR.
  • Check whether an RDS instance with a corresponding tag exists.
  • If not, create it.
  • If it exists, ensure its configuration (instance size, version) matches the CR's spec.
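
In Go, that naive loop might look like the sketch below. It reuses the DatabaseReconciler type and the illustrative ExternalProviderClient defined later in this article; IsNotFound is an assumed helper for the provider's not-found error. Note that nothing here handles deletion, and nothing coordinates multiple replicas.

go
// A naive reconciler: create-or-converge only. Nothing handles deletion,
// and nothing stops multiple replicas from running this loop at once.
func (r *DatabaseReconciler) naiveReconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	db := &dbv1alpha1.Database{}
	if err := r.Get(ctx, req.NamespacedName, db); err != nil {
		// CR gone? The loop simply stops; the external RDS instance is untouched.
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	if _, err := r.ExternalDBProvider.FindInstance(ctx, db.Name); err != nil {
		if IsNotFound(err) {
			// No external instance yet: create one.
			_, createErr := r.ExternalDBProvider.CreateInstance(ctx, db.Spec.InstanceSize)
			return ctrl.Result{}, createErr
		}
		return ctrl.Result{}, err
	}

	// Instance exists: ensure size/version match db.Spec (omitted).
	return ctrl.Result{}, nil
}
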
This workflow has two critical failure modes:

  • Orphaned Resources on Deletion: When a user runs kubectl delete database my-prod-db, the Database object is removed from the Kubernetes API. The operator receives a deletion event, but the naive reconciler has no logic to handle it. The reconciliation loop for this object simply stops. The external AWS RDS instance, which may be costing hundreds of dollars a day, is now orphaned—it exists without a corresponding Kubernetes resource to manage it.
  • Split-Brain with Multiple Replicas: To make the operator highly available, you run multiple pods. Without coordination, every replica receives events for the same Database CR. Each replica might simultaneously check for the RDS instance, find it missing, and issue its own CreateDBInstance API call to AWS. This race condition leads to duplicate resources, API rate limiting, and a chaotic external state.
To build production-grade, reliable operators for stateful systems, we must solve these problems with robust engineering patterns. This article provides a deep dive into two such patterns: Finalizers for guaranteeing graceful cleanup, and Leader Election for safe high availability.


    Deep Dive: The Finalizer Pattern for Graceful Deletion

    A Finalizer is not a piece of code, but rather a piece of data: a list of strings in the metadata.finalizers field of any Kubernetes object. When you add a string to this list, you are creating a pre-deletion hook. The Kubernetes garbage collector is now aware of your controller's interest in this object.

    When a user attempts to delete an object with a finalizer, the API server does not immediately delete it. Instead, it sets the metadata.deletionTimestamp field to the current time and leaves the object in the API. The object is now in a "terminating" state. Your operator, watching for changes, will receive an update event for this object. It is now your operator's responsibility to perform any necessary cleanup and, only when cleanup is complete, remove its finalizer string from the list. Once the finalizers list is empty and deletionTimestamp is set, the Kubernetes garbage collector will finally delete the object.

    Implementing a Finalizer in a Go Operator

    Let's implement this for our Database operator. We'll use the controller-runtime library, which is the standard for building operators in Go.

    First, we define a unique finalizer name for our controller. This prevents conflicts with other controllers that might also be managing the same object.

    go
    // controllers/database_controller.go
    const databaseFinalizer = "database.example.com/finalizer"

    Now, we modify our Reconcile function. The logic branches based on the presence of the deletionTimestamp.

    go
    // controllers/database_controller.go
    
    import (
    	"context"
    
    	"github.com/go-logr/logr"
    	kerrors "k8s.io/apimachinery/pkg/api/errors"
    	"k8s.io/apimachinery/pkg/runtime"
    	ctrl "sigs.k8s.io/controller-runtime"
    	"sigs.k8s.io/controller-runtime/pkg/client"
    	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
    
    	dbv1alpha1 "github.com/example/database-operator/api/v1alpha1"
    )
    
    // DatabaseReconciler reconciles a Database object
    type DatabaseReconciler struct {
    	client.Client
    	Log    logr.Logger
    	Scheme *runtime.Scheme
    	// A mock external client for demonstration
    	ExternalDBProvider *ExternalProviderClient
    }
    
    func (r *DatabaseReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    	log := r.Log.WithValues("database", req.NamespacedName)
    
    	// 1. Fetch the Database instance
    	db := &dbv1alpha1.Database{}
    	if err := r.Get(ctx, req.NamespacedName, db); err != nil {
    		if kerrors.IsNotFound(err) {
    			// Object not found, probably deleted. Nothing to do.
    			log.Info("Database resource not found. Ignoring since object must be deleted")
    			return ctrl.Result{}, nil
    		}
    		log.Error(err, "Failed to get Database resource")
    		return ctrl.Result{}, err
    	}
    
    	// 2. Examine deletion timestamp to determine if object is under deletion
    	if db.ObjectMeta.DeletionTimestamp.IsZero() {
    		// The object is not being deleted, so we add our finalizer if it doesn't exist.
    		// This ensures our cleanup logic is run before the object is deleted.
    		if !controllerutil.ContainsFinalizer(db, databaseFinalizer) {
    			log.Info("Adding finalizer for Database")
    			controllerutil.AddFinalizer(db, databaseFinalizer)
    			if err := r.Update(ctx, db); err != nil {
    				return ctrl.Result{}, err
    			}
    		}
    	} else {
    		// The object is being deleted
    		if controllerutil.ContainsFinalizer(db, databaseFinalizer) {
    			log.Info("Performing cleanup for Database")
    			// Our finalizer is present, so we should perform cleanup
    			if err := r.cleanupExternalResources(ctx, db); err != nil {
    				// If cleanup fails, we don't remove the finalizer. 
    				// The reconciliation will be retried with exponential backoff.
    				log.Error(err, "Failed to cleanup external resources")
    				return ctrl.Result{}, err
    			}
    
    			// Cleanup was successful, so we can remove our finalizer.
    			log.Info("External resources cleaned up, removing finalizer")
    			controllerutil.RemoveFinalizer(db, databaseFinalizer)
    			if err := r.Update(ctx, db); err != nil {
    				return ctrl.Result{}, err
    			}
    		}
    
    		// Stop reconciliation as the item is being deleted
    		return ctrl.Result{}, nil
    	}
    
    	// --- Main Reconciliation Logic for Creation/Update ---
    	// This is where you would check if the external DB exists and create/update it.
    	log.Info("Reconciling Database creation/update")
    	externalID, err := r.ExternalDBProvider.FindInstance(ctx, db.Name)
    	if err != nil {
    		if IsNotFound(err) { // Assuming a custom error type
    			log.Info("External database not found, creating it")
    			newID, createErr := r.ExternalDBProvider.CreateInstance(ctx, db.Spec.InstanceSize)
    			if createErr != nil {
    				log.Error(createErr, "Failed to create external database")
    				// Update status with failure condition
    				db.Status.State = "Failed"
    				db.Status.Message = createErr.Error()
    				_ = r.Status().Update(ctx, db)
    				return ctrl.Result{}, createErr
    			}
    			// Update status with the new ID and state
    			db.Status.State = "Provisioned"
    			db.Status.DBInstanceID = newID
    			if statusErr := r.Status().Update(ctx, db); statusErr != nil {
    				return ctrl.Result{}, statusErr
    			}
    			return ctrl.Result{}, nil
    		}
    		return ctrl.Result{}, err // Some other FindInstance error
    	}
    
    	log.Info("External database already exists", "ID", externalID)
    	// Add logic here to check if the existing instance matches the spec
    	// and update it if necessary.
    
    	return ctrl.Result{}, nil
    }
    
    func (r *DatabaseReconciler) cleanupExternalResources(ctx context.Context, db *dbv1alpha1.Database) error {
    	// This function must be idempotent.
    	log := r.Log.WithValues("database", client.ObjectKeyFromObject(db).String())
    
    	// We use the status field to find the external resource ID.
    	// If the status or ID is empty, the resource was likely never created.
    	if db.Status.DBInstanceID == "" {
    		log.Info("DBInstanceID is empty, assuming external resource was never created or already deleted.")
    		return nil
    	}
    
    	log.Info("Deleting external database instance", "ID", db.Status.DBInstanceID)
    	err := r.ExternalDBProvider.DeleteInstance(ctx, db.Status.DBInstanceID)
    	if err != nil {
    		if IsNotFound(err) {
    			// If the external resource is already gone, we can consider cleanup successful.
    			log.Info("External database instance already deleted.")
    			return nil
    		}
    		return err
    	}
    
    	return nil
    }
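
    The spec-drift check left as a comment above might look like the sketch below. GetInstance and UpdateInstance are assumed methods on the illustrative ExternalProviderClient; substitute whatever your real provider SDK offers.

    go
    // ensureInstanceMatchesSpec converges an existing external instance toward the CR's spec.
    func (r *DatabaseReconciler) ensureInstanceMatchesSpec(ctx context.Context, db *dbv1alpha1.Database, externalID string) error {
    	current, err := r.ExternalDBProvider.GetInstance(ctx, externalID) // assumed method
    	if err != nil {
    		return err
    	}
    	if current.InstanceSize == db.Spec.InstanceSize {
    		return nil // no drift, nothing to do
    	}
    	// Drift detected: resize the external instance to match the declared spec.
    	return r.ExternalDBProvider.UpdateInstance(ctx, externalID, db.Spec.InstanceSize) // assumed method
    }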

    Edge Cases and Production Patterns for Finalizers

  • Idempotent Cleanup: The cleanupExternalResources function must be idempotent. If the reconciliation loop fails after the external resource is deleted but before the finalizer is removed, the function will be called again. Your code must handle the case where the resource it's trying to delete is already gone. Notice how we check for an IsNotFound error from our provider and treat it as a success.
  • Handling Cleanup Failures: If the external API call to delete the resource fails for a transient reason (e.g., network error, API throttling), our code returns an error. controller-runtime will requeue the reconciliation request with exponential backoff. The finalizer remains, acting as a lock that prevents the Database object's deletion until the external resource is confirmed to be gone. This is the core of the pattern's reliability.
  • Finalizer Becomes Stuck: What if the cleanup logic has a permanent bug, or the external API credentials become invalid? The finalizer will be stuck, and kubectl delete will hang indefinitely. This is a common operational issue. You must have robust monitoring and alerting on reconciliation errors. An administrator with sufficient permissions can manually patch the object to remove the finalizer (kubectl patch database my-prod-db --type json -p='[{"op": "remove", "path": "/metadata/finalizers"}]'), but this should be a last resort after manually verifying the external resource is gone.
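
    To make these failure modes visible to your monitoring, one option (a sketch reusing the State and Message status fields from the example above) is to record cleanup failures on the object's status before returning the error:

    go
    // In the deletion branch of Reconcile: surface cleanup failures on status so
    // dashboards and alerts can catch a finalizer that never clears.
    if err := r.cleanupExternalResources(ctx, db); err != nil {
    	log.Error(err, "Failed to cleanup external resources")
    	db.Status.State = "DeletionFailed" // illustrative state value
    	db.Status.Message = err.Error()
    	if statusErr := r.Status().Update(ctx, db); statusErr != nil {
    		log.Error(statusErr, "Failed to record cleanup failure on status")
    	}
    	// Keep the finalizer; controller-runtime retries with exponential backoff.
    	return ctrl.Result{}, err
    }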

    High Availability with Leader Election

    Running a single pod of your operator is a single point of failure. The obvious solution is to run multiple replicas. But as discussed, this leads to a "split-brain" or "thundering herd" problem where multiple controllers reconcile the same object simultaneously.

    The solution is Leader Election. All operator pods start up, but only one acquires a "lease" to become the leader. Only the leader pod will start and run the reconciliation loops. The other pods remain on standby, periodically attempting to acquire the lease. If the leader pod crashes or loses network connectivity to the API server, its lease expires, and one of the standby pods will acquire the lease and become the new leader.

    controller-runtime makes configuring this remarkably simple in your main.go file.

    Code Example: Configuring Leader Election

    go
    // main.go
    
    import (
    	// ... other imports
    	"flag"
    	"os"
    
    	"k8s.io/apimachinery/pkg/runtime"
    	utilruntime "k8s.io/apimachinery/pkg/util/runtime"
    	clientgoscheme "k8s.io/client-go/kubernetes/scheme"
    	ctrl "sigs.k8s.io/controller-runtime"
    	"sigs.k8s.io/controller-runtime/pkg/healthz"
    	"sigs.k8s.io/controller-runtime/pkg/log/zap"
    
    	dbv1alpha1 "github.com/example/database-operator/api/v1alpha1"
    	"github.com/example/database-operator/controllers"
    )
    
    var (
    	scheme   = runtime.NewScheme()
    	setupLog = ctrl.Log.WithName("setup")
    )
    
    func init() {
    	// Register built-in and custom API types with the manager's scheme.
    	utilruntime.Must(clientgoscheme.AddToScheme(scheme))
    	utilruntime.Must(dbv1alpha1.AddToScheme(scheme))
    }
    
    func main() {
    	var metricsAddr string
    	var enableLeaderElection bool
    	var probeAddr string
    	flag.StringVar(&metricsAddr, "metrics-bind-address", ":8080", "The address the metric endpoint binds to.")
    	flag.StringVar(&probeAddr, "health-probe-bind-address", ":8081", "The address the probe endpoint binds to.")
    	flag.BoolVar(&enableLeaderElection, "leader-elect", false,
    		"Enable leader election for controller manager. "+
    			"Enabling this will ensure there is only one active controller manager.")
    	// ...
    
    	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
    		Scheme:                 scheme,
    		MetricsBindAddress:     metricsAddr,
    		Port:                   9443,
    		HealthProbeBindAddress: probeAddr,
    		LeaderElection:         enableLeaderElection,
    		LeaderElectionID:       "a642157c.example.com", // Must be unique per operator
    		// LeaderElectionNamespace can be used to scope leases to a single namespace.
    		// If not set, it uses the namespace of the operator pod.
    	})
    	if err != nil {
    		setupLog.Error(err, "unable to start manager")
    		os.Exit(1)
    	}
    
    	if err = (&controllers.DatabaseReconciler{
    		Client: mgr.GetClient(),
    		Log:    ctrl.Log.WithName("controllers").WithName("Database"),
    		Scheme: mgr.GetScheme(),
    		// ...
    	}).SetupWithManager(mgr); err != nil {
    		setupLog.Error(err, "unable to create controller", "controller", "Database")
    		os.Exit(1)
    	}
    
    	// ... setup webhooks, etc.
    
    	setupLog.Info("starting manager")
    	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
    		setupLog.Error(err, "problem running manager")
    		os.Exit(1)
    	}
    }

    By enabling leader election and providing a unique LeaderElectionID, you get a robust, battle-tested HA mechanism. Under the hood, controller-runtime uses the k8s.io/client-go/tools/leaderelection package. It creates a Lease object (or a ConfigMap/Endpoints lock in older versions) in the operator's namespace. All pods attempt to atomically update this object to claim leadership, and the Kubernetes API server's optimistic concurrency control guarantees that only one can succeed.
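
    For context, here is roughly what that machinery looks like when driven directly through client-go. This is a minimal sketch for illustration only (controller-runtime does all of this for you); the lease name, namespace, and identity values are placeholders.

    go
    // leaderelection_sketch.go: what controller-runtime's leader election does under the hood.
    package main
    
    import (
    	"context"
    	"os"
    	"time"
    
    	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    	"k8s.io/client-go/kubernetes"
    	"k8s.io/client-go/tools/leaderelection"
    	"k8s.io/client-go/tools/leaderelection/resourcelock"
    	ctrl "sigs.k8s.io/controller-runtime"
    )
    
    func main() {
    	cfg := ctrl.GetConfigOrDie()
    	clientset := kubernetes.NewForConfigOrDie(cfg)
    
    	// Each replica identifies itself, typically by pod name.
    	id, _ := os.Hostname()
    
    	// The Lease object every replica competes for via optimistic concurrency.
    	lock := &resourcelock.LeaseLock{
    		LeaseMeta: metav1.ObjectMeta{
    			Name:      "a642157c.example.com",
    			Namespace: "my-operator-namespace", // placeholder
    		},
    		Client:     clientset.CoordinationV1(),
    		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
    	}
    
    	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
    		Lock:            lock,
    		ReleaseOnCancel: true,
    		LeaseDuration:   15 * time.Second,
    		RenewDeadline:   10 * time.Second,
    		RetryPeriod:     2 * time.Second,
    		Callbacks: leaderelection.LeaderCallbacks{
    			OnStartedLeading: func(ctx context.Context) {
    				// This is the point at which controller-runtime starts your controllers.
    			},
    			OnStoppedLeading: func() {
    				// Lost the lease: stop all work immediately to avoid split-brain.
    				os.Exit(1)
    			},
    			OnNewLeader: func(identity string) {
    				// Observed on every replica whenever leadership changes hands.
    			},
    		},
    	})
    }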

    Advanced Considerations for Leader Election

    While the default settings are sensible, production environments demand fine-tuning.

  • LeaseDuration vs. RenewDeadline: The ctrl.Options struct allows you to configure LeaseDuration, RenewDeadline, and RetryPeriod.
    • LeaseDuration: How long a non-leader will wait before trying to acquire the lease again. Default: 15s.
    • RenewDeadline: How long the leader has to renew its lease before it is considered expired. Default: 10s.
    • RetryPeriod: How often a non-leader pod will try to acquire the lease. Default: 2s.

    The relationship is critical: LeaseDuration must be greater than RenewDeadline. The gap (LeaseDuration - RenewDeadline) is the window for a successful renewal. A shorter RenewDeadline means faster failover if the leader dies, but it also means the leader must update the Lease object more frequently, increasing load on the API server. For most operators, the defaults are fine. For operators managing thousands of CRs where failover time is critical, you might tune these down to, say, RenewDeadline: 7s and LeaseDuration: 10s, as sketched after this list.

  • Cache Synchronization on Leadership Change: When a new pod becomes leader, it cannot start reconciling immediately. Its local cache (an in-memory copy of watched resources) is empty. If it acted immediately, it might make incorrect decisions based on this empty cache. controller-runtime handles this gracefully. The mgr.Start() call internally ensures that before any reconciler is started, the new leader's mgr.GetCache() blocks until it has successfully synced with the API server. This is a critical detail that prevents acting on stale data during a failover event.
  • Monitoring Leader Election: How do you know who the leader is? You can inspect the Lease object:
    kubectl get lease a642157c.example.com -n my-operator-namespace -o yaml

    The holderIdentity field in the spec shows the identity of the current leader, which controller-runtime derives from the pod's hostname. Furthermore, controller-runtime surfaces client-go's leader-election Prometheus metrics (such as the leader_election_master_status gauge) on its metrics endpoint, and these should absolutely be part of your monitoring dashboards.
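
    Returning to the lease timings above: in controller-runtime these are plain fields on ctrl.Options (pointers to time.Duration). Below is a sketch of the tighter failover settings mentioned earlier; the values are illustrative rather than prescriptive.

    go
    // main.go (sketch): tighter leader-election timings for faster failover.
    leaseDuration := 10 * time.Second
    renewDeadline := 7 * time.Second
    retryPeriod := 2 * time.Second
    
    mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
    	Scheme:           scheme,
    	LeaderElection:   true,
    	LeaderElectionID: "a642157c.example.com",
    	LeaseDuration:    &leaseDuration,
    	RenewDeadline:    &renewDeadline,
    	RetryPeriod:      &retryPeriod,
    })
    if err != nil {
    	setupLog.Error(err, "unable to start manager")
    	os.Exit(1)
    }
    // ... register controllers with mgr as shown earlier.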


    Tying It All Together: Performance and Scalability

    Now that we have a robust and highly available operator, we need to ensure it performs at scale. An operator managing 10,000 CRs has very different performance characteristics from one managing 10.

    Technique 1: MaxConcurrentReconciles

    Each controller processes reconciliation requests with a pool of worker goroutines, and the default MaxConcurrentReconciles in controller-runtime is 1, so requests are handled one at a time. For an operator that makes blocking API calls to an external service, a single worker quickly becomes a bottleneck. You can increase the worker count in SetupWithManager:

    go
    // controllers/database_controller.go
    import "sigs.k8s.io/controller-runtime/pkg/controller"
    
    func (r *DatabaseReconciler) SetupWithManager(mgr ctrl.Manager) error {
    	return ctrl.NewControllerManagedBy(mgr).
    		For(&dbv1alpha1.Database{}).
    		WithOptions(controller.Options{MaxConcurrentReconciles: 5}). // Tune this value
    		Complete(r)
    }

    Increasing MaxConcurrentReconciles to 5 allows the controller to work on up to 5 different Database objects at the same time. This is not multi-threading the reconciliation of a single object. A reconciliation for db-A will always run to completion before another reconciliation for db-A is started. This setting increases throughput by processing different objects in parallel.

    Warning: Be a good citizen. If your reconciler is hammering an external API, increasing concurrency can get you rate-limited. Tune this value carefully based on the limitations of the systems you integrate with.
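
    A common mitigation, independent of how many workers you run, is to rate-limit calls to the external API. The sketch below wraps the illustrative ExternalProviderClient with golang.org/x/time/rate; RateLimitedProvider is a hypothetical wrapper, not part of any SDK.

    go
    import (
    	"context"
    
    	"golang.org/x/time/rate"
    )
    
    // RateLimitedProvider caps outbound calls to the external API regardless of
    // how many reconcile workers are running concurrently.
    type RateLimitedProvider struct {
    	inner   *ExternalProviderClient
    	limiter *rate.Limiter // e.g. rate.NewLimiter(rate.Limit(5), 10): 5 req/s, burst of 10
    }
    
    func (p *RateLimitedProvider) CreateInstance(ctx context.Context, size string) (string, error) {
    	// Block until the limiter grants a token, or the context is cancelled.
    	if err := p.limiter.Wait(ctx); err != nil {
    		return "", err
    	}
    	return p.inner.CreateInstance(ctx, size)
    }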

    Technique 2: Filtering Events with Predicates

    Your controller will receive an event for every single change to a Database object. This includes changes you make yourself, such as updating the status subresource. A common anti-pattern is an infinite reconciliation loop:

  • Reconciler runs for db-A.
  • At the end, it updates db-A.status.
  • This update triggers a new event.
  • The reconciler runs again for db-A, even though nothing in the spec changed.

    This wastes CPU and can cause unnecessary API calls. We can filter these redundant events using predicates. predicate.GenerationChangedPredicate is perfect for this; it only allows events through if the metadata.generation field has changed. This field is only incremented by the API server when the spec of an object changes. Status updates do not affect it.

    go
    // controllers/database_controller.go
    import "sigs.k8s.io/controller-runtime/pkg/predicate"
    
    func (r *DatabaseReconciler) SetupWithManager(mgr ctrl.Manager) error {
    	return ctrl.NewControllerManagedBy(mgr).
    		For(&dbv1alpha1.Database{}).
    		WithEventFilter(predicate.GenerationChangedPredicate{}).
    		WithOptions(controller.Options{MaxConcurrentReconciles: 5}).
    		Complete(r)
    }

    This simple addition dramatically reduces unnecessary reconciliations, saving resources and preventing noise.
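
    If you ever need finer control than GenerationChangedPredicate offers, you can express the same idea by hand with predicate.Funcs. The sketch below is roughly equivalent to what GenerationChangedPredicate does on updates (event handlers left nil default to allowing the event through), and you would pass it to WithEventFilter exactly as above.

    go
    import (
    	"sigs.k8s.io/controller-runtime/pkg/event"
    	"sigs.k8s.io/controller-runtime/pkg/predicate"
    )
    
    // specChanged reconciles only when an update bumps metadata.generation,
    // i.e. when the spec changed; create, delete, and generic events pass through.
    var specChanged = predicate.Funcs{
    	UpdateFunc: func(e event.UpdateEvent) bool {
    		return e.ObjectOld.GetGeneration() != e.ObjectNew.GetGeneration()
    	},
    }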

    Conclusion

    Moving from a basic operator to a production-grade, stateful operator requires a fundamental shift in thinking. We must move beyond the simple "make world match spec" model and architect for the full lifecycle of both the Kubernetes resource and its external counterpart.

  • Finalizers are not optional for stateful operators; they are the required mechanism for ensuring that external resource cleanup is idempotent, reliable, and tied directly to the lifecycle of the Kubernetes object.
  • Leader Election is the standard, battle-tested solution for providing high availability without introducing dangerous race conditions or split-brain scenarios.
  • Performance Tuning via MaxConcurrentReconciles and event filtering with Predicates is essential for ensuring your operator can scale to manage thousands of resources without overwhelming the Kubernetes API server or the external systems it integrates with.
    By deeply understanding and correctly implementing these advanced patterns, you can build Kubernetes operators that automate complex, stateful workloads with the same reliability and robustness expected of a cloud-native control plane.
