Implementing Finalizers and Leader Election in Go K8s Operators

18 min read
Goh Ling Yong
Technology enthusiast and software architect specializing in AI-driven development tools and modern software engineering practices. Passionate about the intersection of artificial intelligence and human creativity in building tomorrow's digital solutions.

Beyond the Basic Reconcile: Production-Ready Operator Patterns

For senior engineers building on Kubernetes, creating a basic Custom Resource Definition (CRD) and a corresponding operator is a solved problem. The scaffolding from tools like Kubebuilder or the Operator SDK gets you a Reconcile loop that can create or update resources. However, the chasm between a scaffolded operator and a production-grade, resilient controller is vast. This gap is bridged by correctly implementing two critical patterns: Finalizers and Leader Election.

  • Finalizers address the resource lifecycle problem: What happens when a user runs kubectl delete my-cr? Without a finalizer, the Custom Resource (CR) vanishes from the API server, but the external resources it managed—a cloud database, an S3 bucket, a DNS entry—are orphaned, leading to resource leaks and manual cleanup nightmares.
  • Leader Election addresses the high-availability problem: In production, you run multiple replicas of your operator for fault tolerance. But how do you prevent all replicas from reconciling the same CR simultaneously, leading to race conditions, API throttling from external providers, and inconsistent state?

This article assumes you are already familiar with the basics of the operator pattern, Go, and the controller-runtime library. We will not cover setting up a project. Instead, we will focus exclusively on the advanced implementation details, edge cases, and tuning of finalizers and leader election to build controllers that are safe, reliable, and scalable.


    Deep Dive: Finalizers for Graceful Deletion and Resource Cleanup

    A finalizer is a namespaced key added to an object's metadata.finalizers array. Its presence acts as a lock, signaling to the Kubernetes garbage collector that the object cannot be fully deleted until the finalizer is removed. The responsibility for removing the finalizer lies with the controller that added it, but only after it has successfully performed all necessary cleanup tasks.

    The Core Problem: Orphaned External Resources

    Imagine an operator managing S3Bucket CRs. A typical Reconcile function might look like this:

  • Fetch the S3Bucket CR.
  • Check if a corresponding bucket exists in AWS S3.
  • If not, create it using the AWS SDK.
  • Update the S3Bucket CR's status with the bucket name and ARN.

    When kubectl delete s3bucket my-test-bucket is executed, the API server simply removes the object from etcd. The delete event may still trigger a reconcile, but by then the Get returns NotFound, so the controller has nothing left to tell it what to clean up. The S3Bucket CR is gone, but the actual S3 bucket in your AWS account remains, silently accruing costs.
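
    To make the failure mode concrete, here is roughly what that naive loop looks like in Go. This is a sketch only: s3Client, bucketsv1, and the BucketExists/CreateBucket helpers are hypothetical stand-ins, not a real AWS SDK surface.

    go
    // A hypothetical, naive S3Bucket reconciler with no finalizer.
    // s3Client, bucketsv1, BucketExists and CreateBucket are stand-in names.
    func (r *S3BucketReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
        bucket := &bucketsv1.S3Bucket{}
        if err := r.Get(ctx, req.NamespacedName, bucket); err != nil {
            // On deletion the CR is already gone by the time we get here, so the
            // controller never learns which AWS bucket it should remove.
            return ctrl.Result{}, client.IgnoreNotFound(err)
        }

        exists, err := r.s3Client.BucketExists(ctx, bucket.Spec.BucketName)
        if err != nil {
            return ctrl.Result{}, err
        }
        if !exists {
            arn, err := r.s3Client.CreateBucket(ctx, bucket.Spec.BucketName)
            if err != nil {
                return ctrl.Result{}, err
            }
            bucket.Status.ARN = arn
            return ctrl.Result{}, r.Status().Update(ctx, bucket)
        }
        return ctrl.Result{}, nil
    }

    Creation and updates work fine; the problem is that nothing in this loop ever runs usefully when the CR is deleted, because the object disappears before the controller can ask what it owned.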

    The Finalizer-Driven Reconcile Flow

    By integrating a finalizer, we fundamentally change the deletion process into a two-phase commit orchestrated by our controller.

  • Deletion Request: A user requests deletion. The API server, seeing a finalizer on the object, does not delete it. Instead, it sets the metadata.deletionTimestamp field to the current time and updates the object. The object is now in a Terminating state.
  • Reconciliation Trigger: This update triggers our Reconcile loop.
  • Cleanup Logic: Our controller detects the non-nil deletionTimestamp and executes its cleanup logic (e.g., deleting the S3 bucket).
  • Finalizer Removal: Upon successful cleanup, the controller removes its finalizer from the metadata.finalizers array and updates the object.
  • Garbage Collection: The API server sees the object's deletionTimestamp is set and its finalizers array is now empty. It proceeds with the final deletion of the object from etcd.
    Implementation with `controller-runtime`

    Let's implement this for a hypothetical WebApp CR that manages an external monitoring service entry. We'll define a unique finalizer name to avoid conflicts with other controllers.

    go
    // MyWebApp reconciler
    const webAppFinalizer = "webapp.my.domain/finalizer"
    
    func (r *WebAppReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
        log := log.FromContext(ctx)
    
        // 1. Fetch the WebApp instance
        webapp := &webappv1.WebApp{}
        if err := r.Get(ctx, req.NamespacedName, webapp); err != nil {
            if errors.IsNotFound(err) {
                // Request object not found, could have been deleted after reconcile request.
                // Owned objects are automatically garbage collected. For additional cleanup logic use finalizers.
                // Return and don't requeue
                log.Info("WebApp resource not found. Ignoring since object must be deleted")
                return ctrl.Result{}, nil
            }
            // Error reading the object - requeue the request.
            log.Error(err, "Failed to get WebApp")
            return ctrl.Result{}, err
        }
    
        // 2. Examine DeletionTimestamp to determine if the object is under deletion.
        if webapp.ObjectMeta.DeletionTimestamp.IsZero() {
            // The object is not being deleted, so if it does not have our finalizer,
            // then lets add it and update the object. This is equivalent to registering our finalizer.
            if !controllerutil.ContainsFinalizer(webapp, webAppFinalizer) {
                log.Info("Adding Finalizer for the WebApp")
                controllerutil.AddFinalizer(webapp, webAppFinalizer)
                if err := r.Update(ctx, webapp); err != nil {
                    return ctrl.Result{}, err
                }
            }
        } else {
            // The object is being deleted
            if controllerutil.ContainsFinalizer(webapp, webAppFinalizer) {
                log.Info("Performing Finalizer Operations for WebApp before deletion")
    
                // Perform all cleanup tasks before removing the finalizer.
                if err := r.cleanupExternalResources(ctx, webapp); err != nil {
                    // If cleanup fails, do not remove the finalizer so that we can retry during the next reconciliation.
                    log.Error(err, "Failed to cleanup external resources")
                    return ctrl.Result{}, err
                }
    
                // Once all finalizer operations are successful, remove the finalizer.
                log.Info("Removing Finalizer for WebApp after successful cleanup")
                controllerutil.RemoveFinalizer(webapp, webAppFinalizer)
                if err := r.Update(ctx, webapp); err != nil {
                    return ctrl.Result{}, err
                }
            }
            // Stop reconciliation as the item is being deleted
            return ctrl.Result{}, nil
        }
    
        // --- Your normal reconciliation logic for create/update goes here --- 
        // For example, ensuring the external monitoring service is configured.
        if err := r.ensureMonitoringService(ctx, webapp); err != nil {
            return ctrl.Result{}, err
        }
    
        return ctrl.Result{}, nil
    }
    
    // cleanupExternalResources simulates deleting an entry from an external monitoring service
    func (r *WebAppReconciler) cleanupExternalResources(ctx context.Context, webapp *webappv1.WebApp) error {
        // This is a placeholder for real-world cleanup logic.
        // It must be idempotent. For example, if the external service API
        // returns a 'NotFound' error, we should consider that a success.
        log := log.FromContext(ctx)
        log.Info("Deleting external monitoring entry for WebApp", "WebAppName", webapp.Name)
    
        // _, err := monitoringClient.DeleteEntry(webapp.Status.MonitoringID)
        // if err != nil && !monitoringClient.IsNotFound(err) {
        //     return err
        // }
        
        // Simulating success for this example
        return nil
    }
    
    // ensureMonitoringService is part of the normal reconcile logic
    func (r *WebAppReconciler) ensureMonitoringService(ctx context.Context, webapp *webappv1.WebApp) error {
        // Placeholder for creating/updating the monitoring entry
        log := log.FromContext(ctx)
        log.Info("Ensuring monitoring service is configured for WebApp", "WebAppName", webapp.Name)
        return nil
    }
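
    For completeness, the reconciler is wired into the manager with the standard controller-runtime builder (the scaffolded SetupWithManager). Watching the WebApp type is what delivers the update event when the API server sets deletionTimestamp, so the deletion flow needs no extra configuration:

    go
    // SetupWithManager registers the reconciler so that create, update and delete
    // events for WebApp objects (including the deletionTimestamp update) enqueue
    // reconcile requests.
    func (r *WebAppReconciler) SetupWithManager(mgr ctrl.Manager) error {
        return ctrl.NewControllerManagedBy(mgr).
            For(&webappv1.WebApp{}).
            Complete(r)
    }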

    Production Considerations & Edge Cases

  • Idempotency is Non-Negotiable: Your cleanup function (cleanupExternalResources in our example) must be idempotent. The Reconcile loop could be triggered multiple times for a deleting object if the finalizer removal step fails. If your cleanup logic tries to delete a resource that's already gone, it should not return an error. Handle NotFound errors from your cloud provider SDKs as success conditions (a sketch of this shape follows this list).
  • Stuck Finalizers: The most common failure mode is a bug in the controller that prevents it from removing its finalizer. The object will be stuck in the Terminating state forever. Debugging involves inspecting the controller logs to see why cleanupExternalResources is failing or why the finalizer removal Update call is not succeeding. As a last resort, an administrator can manually patch the object to remove the finalizer, but this is dangerous as it can lead to orphaned resources.
        bash
        # DANGEROUS: Use only when the controller is confirmed broken and resources are cleaned up manually.
        kubectl patch webapp my-webapp --type json -p='[{"op": "remove", "path": "/metadata/finalizers"}]'
  • Handling Multiple Finalizers: An object can have multiple finalizers from different controllers. For example, a database CR might have a finalizer from your operator for deleting the database instance and another from a backup operator for deleting backup archives. Each controller is responsible only for its own finalizer. The object will only be deleted after all finalizers are removed.
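
    To make the idempotency point concrete, one possible shape for the cleanup helper is shown below. monitoringClient, its DeleteEntry method, and the monitoring.IsNotFound helper are hypothetical stand-ins for whatever external SDK your operator calls.

    go
    // cleanupExternalResources must tolerate being called repeatedly for the same
    // object: deleting something that is already gone counts as success.
    func (r *WebAppReconciler) cleanupExternalResources(ctx context.Context, webapp *webappv1.WebApp) error {
        if webapp.Status.MonitoringID == "" {
            // Nothing was ever provisioned, or a previous attempt already cleaned up.
            return nil
        }

        err := r.monitoringClient.DeleteEntry(ctx, webapp.Status.MonitoringID)
        if err != nil && !monitoring.IsNotFound(err) {
            // Genuine failure: keep the finalizer so the next reconcile retries.
            return err
        }

        // Deleted now, or already gone: safe to remove the finalizer.
        return nil
    }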

    Deep Dive: Leader Election for High Availability

    Running a single replica of your operator is a single point of failure. The standard practice is to run two or more replicas. However, this introduces a new challenge: preventing multiple pods from acting as the controller for the same resource at the same time, a situation known as "split-brain."

    controller-runtime solves this with a built-in leader election mechanism. When enabled, all controller-manager pods will contend to acquire a lock on a shared resource object—typically a Lease object in the cluster. Only one pod, the "leader," will succeed. The leader will actively run the Reconcile loops, while the other pods remain on standby, periodically attempting to acquire the lock.

    If the leader pod crashes, stops responding, or its lease expires, one of the standby pods will acquire the lock and become the new leader, ensuring the operator remains available.
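
    controller-runtime's manager builds this on top of client-go's leaderelection package, and you normally interact with it only through the Manager options shown in the next section. Still, a stripped-down sketch of the raw mechanism makes the Lease semantics easier to reason about. The lease name and namespace are reused from this article's examples, and error handling is omitted:

    go
    // Roughly what controller-runtime does for you: take a Lease-based lock,
    // run the controllers only while holding it, and stop if it is lost.
    // Imports: os, time, metav1 (k8s.io/apimachinery/pkg/apis/meta/v1),
    // k8s.io/client-go/kubernetes, k8s.io/client-go/tools/leaderelection,
    // k8s.io/client-go/tools/leaderelection/resourcelock, sigs.k8s.io/controller-runtime.
    func runWithLeaderElection(ctx context.Context, runControllers func(context.Context)) {
        clientset := kubernetes.NewForConfigOrDie(ctrl.GetConfigOrDie())
        identity, _ := os.Hostname() // recorded as holderIdentity in the Lease

        lock := &resourcelock.LeaseLock{
            LeaseMeta:  metav1.ObjectMeta{Name: "d6a20333.my.domain", Namespace: "my-operator-namespace"},
            Client:     clientset.CoordinationV1(),
            LockConfig: resourcelock.ResourceLockConfig{Identity: identity},
        }

        leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
            Lock:            lock,
            ReleaseOnCancel: true,
            LeaseDuration:   15 * time.Second, // how long the lock is valid without renewal
            RenewDeadline:   10 * time.Second, // leader gives up if it cannot renew within this window
            RetryPeriod:     2 * time.Second,  // how often the leader renews and candidates retry
            Callbacks: leaderelection.LeaderCallbacks{
                OnStartedLeading: runControllers,        // only the leader runs the controllers
                OnStoppedLeading: func() { os.Exit(1) }, // lost the lease: stop doing work immediately
                OnNewLeader:      func(id string) {},    // standbys can observe leadership changes here
            },
        })
    }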

    Implementation in `main.go`

    Leader election is configured at the Manager level in your main.go file. The Kubebuilder scaffold provides the necessary flags and options out of the box.

    go
    // main.go
    
    var (
        // ... other variables
        enableLeaderElection          bool
        leaderElectionLeaseDuration   time.Duration
        leaderElectionRenewDeadline   time.Duration
        leaderElectionRetryPeriod     time.Duration
    )
    
    func init() {
        // ... other flag definitions
        flag.BoolVar(&enableLeaderElection, "leader-elect", false,
            "Enable leader election for controller manager. "+
                "Enabling this will ensure there is only one active controller manager.")
        flag.DurationVar(&leaderElectionLeaseDuration, "leader-elect-lease-duration", 60*time.Second,
            "Duration that non-leader candidates will wait to force acquire leadership.")
        flag.DurationVar(&leaderElectionRenewDeadline, "leader-elect-renew-deadline", 40*time.Second,
            "Duration that the acting controlplane will retry refreshing leadership before giving up.")
        flag.DurationVar(&leaderElectionRetryPeriod, "leader-elect-retry-period", 5*time.Second,
            "Duration the LeaderElector clients should wait between tries of actions.")
    }
    
    func main() {
        // ... flag parsing and setup ...
    
        mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
            Scheme:                        scheme,
            MetricsBindAddress:            metricsAddr,
            Port:                          9443,
            HealthProbeBindAddress:        probeAddr,
            LeaderElection:                enableLeaderElection,
            LeaderElectionID:              "d6a20333.my.domain", // Must be unique per operator
            LeaderElectionReleaseOnCancel: true, // Recommended for faster failover on graceful shutdown
            LeaseDuration:                 &leaderElectionLeaseDuration,
            RenewDeadline:                 &leaderElectionRenewDeadline,
            RetryPeriod:                   &leaderElectionRetryPeriod,
        })
    
        // ... setup reconcilers with manager ...
    
        // ... start manager ...
    }

    Tuning Leader Election Parameters for Performance and Reliability

    The default values are a safe starting point, but in a production environment, tuning these parameters is crucial. They control the trade-off between failover speed and the load placed on the Kubernetes API server.

  • --leader-elect: The master switch. Set to true in your production deployment manifests.
  • --leader-elect-lease-duration: The total duration of the lease. A non-leader will wait this long before attempting to take over. If the current leader fails to renew its lease within this period, it is considered deposed. Lowering this value leads to faster failover.
  • --leader-elect-renew-deadline: The duration within which the current leader must successfully renew its lease. This value must be less than LeaseDuration. The leader will attempt to renew its lease continuously, but if it fails for this entire duration (e.g., due to network partition from the API server), it will relinquish leadership. This is the primary knob for controlling failover time.
  • --leader-elect-retry-period: The interval at which both the leader and non-leaders attempt their respective actions (renewing or acquiring the lease). This should be significantly smaller than RenewDeadline to allow for multiple renewal attempts before the deadline is hit.
    Tuning Strategy

    The fundamental relationship is: LeaseDuration > RenewDeadline > RetryPeriod.
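
    The underlying client-go leader elector rejects configurations where these invariants do not hold, so it is worth failing fast at startup rather than discovering the problem at lease-acquisition time. A minimal guard, assuming the flag variables from the main.go excerpt above plus the scaffolded setupLog logger and os import:

    go
    // Validate the ordering LeaseDuration > RenewDeadline > RetryPeriod before
    // constructing the manager.
    if leaderElectionLeaseDuration <= leaderElectionRenewDeadline {
        setupLog.Error(nil, "leader-elect-lease-duration must be greater than leader-elect-renew-deadline")
        os.Exit(1)
    }
    if leaderElectionRenewDeadline <= leaderElectionRetryPeriod {
        setupLog.Error(nil, "leader-elect-renew-deadline must be greater than leader-elect-retry-period")
        os.Exit(1)
    }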

  • Scenario 1: High-Stakes Operator (e.g., managing production databases)
    • Goal: Very fast failover to minimize downtime.
    • Configuration: LeaseDuration: 15s, RenewDeadline: 10s, RetryPeriod: 2s.
    • Trade-off: This configuration puts more load on the API server, as all operator replicas will be trying to update the Lease object more frequently. It's suitable for clusters with a healthy API server and a small number of high-stakes operators.

  • Scenario 2: Batch Job or Low-Priority Operator
    • Goal: Reduce API server load; failover time is less critical.
    • Configuration: LeaseDuration: 60s, RenewDeadline: 40s, RetryPeriod: 5s.
    • Trade-off: Failover can take up to 60 seconds. This is acceptable for controllers where a minute of reconciliation pause has no significant impact.

    Observability and Debugging

    When things go wrong, you need to know who the leader is and why a failover might be happening.

  • Check the Lease Object: The Lease object is the source of truth. It's created in the same namespace as your operator pods.
        bash
        # The lease name matches the LeaderElectionID set in main.go
        kubectl get lease d6a20333.my.domain -n my-operator-namespace -o yaml

    The output will show you the holderIdentity (the pod name of the current leader), acquireTime, and renewTime; the short Go sketch at the end of this section reads the same fields programmatically.

  • Inspect Controller Logs:
    • Leader Pod: Will log messages like "successfully acquired lease" and "became leader".
    • Standby Pods: Will periodically log "attempting to acquire leader lease...". A pod that previously held leadership and failed to renew it exits with a "leader election lost" error.

    If you see a pod repeatedly winning and losing the election (leader flapping), it's often a sign that the leader cannot renew its lease in time, for example because the pod is CPU-starved or throttled (a very heavy Reconcile workload can contribute to this) or because the API server is responding slowly. This can be a reason to increase the RenewDeadline, and LeaseDuration along with it.
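
    When flapping is suspected, reading the Lease from a small Go tool (or a health endpoint) beats repeatedly parsing kubectl output; leaseTransitions in particular counts how often leadership has changed hands. A minimal sketch, reusing the lease name and namespace from the earlier examples:

    go
    // leasewatch: print the current leader-election state for the operator.
    // Nil checks on the pointer fields are omitted for brevity.
    package main

    import (
        "context"
        "fmt"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
        ctrl "sigs.k8s.io/controller-runtime"
    )

    func main() {
        clientset := kubernetes.NewForConfigOrDie(ctrl.GetConfigOrDie())

        lease, err := clientset.CoordinationV1().Leases("my-operator-namespace").
            Get(context.Background(), "d6a20333.my.domain", metav1.GetOptions{})
        if err != nil {
            panic(err)
        }

        fmt.Println("holder:     ", *lease.Spec.HolderIdentity)
        fmt.Println("acquired:   ", lease.Spec.AcquireTime)
        fmt.Println("renewed:    ", lease.Spec.RenewTime)
        fmt.Println("transitions:", *lease.Spec.LeaseTransitions)
    }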


    Tying It All Together: A Production Lifecycle Scenario

    Let's trace the complete lifecycle of a DatabaseCluster CR that uses both patterns. Our operator manages a primary and replica PostgreSQL instance in a cloud provider.

  • Deployment: The DatabaseCluster operator is deployed with 3 replicas and leader-elect=true.
    • operator-pod-0, operator-pod-1, and operator-pod-2 start.
    • They all contend for the Lease object.
    • operator-pod-1 wins and becomes the leader. pod-0 and pod-2 become standbys.

  • Creation: A user applies a DatabaseCluster manifest.
    • The Reconcile loop in operator-pod-1 is triggered.
    • It sees the object is new and adds the databasecluster.my.domain/finalizer.
    • It calls the cloud provider's API to provision the primary and replica DB instances.
    • It updates the CR's status with the instance IDs and connection strings.

  • Leader Failure: The node running operator-pod-1 fails.
    • operator-pod-1 stops renewing its lease.
    • After the LeaseDuration elapses without a renewal, the lease is considered expired.
    • operator-pod-0 and operator-pod-2 attempt to acquire the lease. operator-pod-2 succeeds and becomes the new leader.
    • The new leader's Reconcile loop runs for the existing DatabaseCluster CR. It checks the cloud provider and sees the DB instances already exist. Because its logic is idempotent, it simply confirms the state and updates the status if needed, then waits for the next change.

  • Deletion: A user runs kubectl delete databasecluster my-prod-db.
    • The API server sets deletionTimestamp on the CR. This does not delete the object because the finalizer is present.
    • The update triggers reconciliation in the current leader, operator-pod-2.
    • The Reconcile function detects the deletionTimestamp and the finalizer.
    • It executes its cleanup logic: it calls the cloud provider API to terminate the primary and replica DB instances.
    • Edge Case: The API call to terminate the replica fails due to a transient network error. The cleanup function returns an error.
    • The Reconcile loop returns the error, and controller-runtime requeues the request (see the sketch after this walkthrough). The finalizer is not removed. The DatabaseCluster CR remains in the Terminating state.
    • A few seconds later, the Reconcile loop runs again. This time, the cleanup logic successfully terminates both instances.
    • The controller removes its finalizer from the CR and updates it.
    • The Kubernetes garbage collector now sees a Terminating object with no finalizers and permanently deletes it from etcd.
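
    The transient-failure edge case in the deletion walkthrough is handled entirely by what the deletion branch returns. A short sketch, reusing the WebApp names from the finalizer example, of the two common options:

    go
    // Inside the deletion branch of Reconcile, after detecting deletionTimestamp.
    if err := r.cleanupExternalResources(ctx, webapp); err != nil {
        // Option 1: return the error. controller-runtime requeues the request with
        // exponential backoff, and the finalizer stays in place until cleanup succeeds.
        return ctrl.Result{}, err

        // Option 2: if the external API is known to need time to settle, log the
        // error and ask for a fixed retry interval instead of backoff:
        //   log.Error(err, "cleanup failed, retrying in 30s")
        //   return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
    }

    // Cleanup succeeded: remove the finalizer so the API server can delete the object.
    controllerutil.RemoveFinalizer(webapp, webAppFinalizer)
    return ctrl.Result{}, r.Update(ctx, webapp)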

    Conclusion

    Finalizers and leader election are not optional features for any serious Kubernetes operator. They are the fundamental building blocks of reliability and resilience. Finalizers provide the transactional control needed to manage the lifecycle of external resources, preventing costly leaks and ensuring clean system state. Leader election provides the high availability required for production workloads, allowing your operator to survive node failures and deployments without compromising control or causing destructive race conditions.

    By mastering the implementation details, tuning parameters, and edge cases of these two patterns, you elevate your controllers from simple automation scripts to robust, enterprise-grade components of the Kubernetes ecosystem.
