Implementing Finalizers and Leader Election in Go K8s Operators
Beyond the Basic Reconcile: Production-Ready Operator Patterns
For senior engineers building on Kubernetes, creating a basic Custom Resource Definition (CRD) and a corresponding operator is a solved problem. The scaffolding from tools like Kubebuilder or the Operator SDK gets you a Reconcile loop that can create or update resources. However, the chasm between a scaffolded operator and a production-grade, resilient controller is vast. This gap is bridged by correctly implementing two critical patterns: Finalizers and Leader Election.
What happens when a user runs kubectl delete my-cr? Without a finalizer, the Custom Resource (CR) vanishes from the API server, but the external resources it managed (a cloud database, an S3 bucket, a DNS entry) are orphaned, leading to resource leaks and manual cleanup nightmares.
This article assumes you are already familiar with the basics of the operator pattern, Go, and the controller-runtime library. We will not cover setting up a project. Instead, we will focus exclusively on the advanced implementation details, edge cases, and tuning of finalizers and leader election to build controllers that are safe, reliable, and scalable.
Deep Dive: Finalizers for Graceful Deletion and Resource Cleanup
A finalizer is a namespaced key added to an object's metadata.finalizers array. Its presence acts as a lock, signaling to the Kubernetes API server that the object cannot be fully deleted until the finalizer is removed. The responsibility for removing the finalizer lies with the controller that added it, but only after it has successfully performed all necessary cleanup tasks.
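To make the "array of keys" idea concrete, here is a minimal sketch in plain Go of the three operations a controller performs on that array. The objectMeta type and the helper names are hypothetical stand-ins written for this illustration; in a real operator the controllerutil helpers shown later in this article do this work.

```go
package main

import "fmt"

// objectMeta is a hypothetical stand-in for the part of Kubernetes
// object metadata that holds finalizers. It is not metav1.ObjectMeta.
type objectMeta struct {
	Finalizers []string
}

// containsFinalizer reports whether name is present in the list.
func containsFinalizer(m *objectMeta, name string) bool {
	for _, f := range m.Finalizers {
		if f == name {
			return true
		}
	}
	return false
}

// addFinalizer appends name unless it is already present and
// reports whether the list changed.
func addFinalizer(m *objectMeta, name string) bool {
	if containsFinalizer(m, name) {
		return false
	}
	m.Finalizers = append(m.Finalizers, name)
	return true
}

// removeFinalizer drops every occurrence of name and reports
// whether the list changed.
func removeFinalizer(m *objectMeta, name string) bool {
	kept := m.Finalizers[:0]
	changed := false
	for _, f := range m.Finalizers {
		if f == name {
			changed = true
			continue
		}
		kept = append(kept, f)
	}
	m.Finalizers = kept
	return changed
}

func main() {
	const name = "example.my.domain/finalizer" // hypothetical finalizer key
	m := &objectMeta{}

	fmt.Println(addFinalizer(m, name)) // true: the list changed
	fmt.Println(addFinalizer(m, name)) // false: already present

	removed := removeFinalizer(m, name)
	fmt.Println(removed, len(m.Finalizers)) // true 0
}
```

Returning a "changed" flag is a design choice of this sketch: it lets the caller skip an API update when nothing was modified.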
The Core Problem: Orphaned External Resources
Imagine an operator managing S3Bucket CRs. A typical Reconcile function might look like this:
- Fetch the S3Bucket CR.
- Check if a corresponding bucket exists in AWS S3.
- If not, create it using the AWS SDK.
- Update the S3Bucket CR's status with the bucket name and ARN.
When kubectl delete s3bucket my-test-bucket is executed, the Kubernetes API server receives the request. The object is removed from etcd, and the reconcile loop might not even get a chance to run. The S3Bucket CR is gone, but the actual S3 bucket in your AWS account remains, silently accruing costs.
The Finalizer-Driven Reconcile Flow
By integrating a finalizer, we fundamentally change the deletion process into a two-phase commit orchestrated by our controller.
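1. A user requests deletion. Because a finalizer is present, the API server does not remove the object; it sets the metadata.deletionTimestamp field to the current time and updates the object. The object is now in a Terminating state.
2. The update triggers the controller's Reconcile loop.
3. The controller detects the deletionTimestamp and executes its cleanup logic (e.g., deleting the S3 bucket).
4. Once cleanup succeeds, the controller removes its finalizer from the metadata.finalizers array and updates the object.
5. The API server sees that the deletionTimestamp is set and the finalizers array is now empty. It proceeds with the final deletion of the object from etcd.
Implementation with `controller-runtime`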
Let's implement this for a hypothetical WebApp CR that manages an external monitoring service entry. We'll define a unique finalizer name to avoid conflicts with other controllers.
// MyWebApp reconciler
import (
	"context"

	"k8s.io/apimachinery/pkg/api/errors"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
	"sigs.k8s.io/controller-runtime/pkg/log"

	webappv1 "my.domain/webapp/api/v1" // placeholder import path; use your project's API package
)

const webAppFinalizer = "webapp.my.domain/finalizer"

func (r *WebAppReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	log := log.FromContext(ctx)

	// 1. Fetch the WebApp instance
	webapp := &webappv1.WebApp{}
	if err := r.Get(ctx, req.NamespacedName, webapp); err != nil {
		if errors.IsNotFound(err) {
			// Request object not found, could have been deleted after reconcile request.
			// Owned objects are automatically garbage collected. For additional cleanup logic use finalizers.
			// Return and don't requeue
			log.Info("WebApp resource not found. Ignoring since object must be deleted")
			return ctrl.Result{}, nil
		}
		// Error reading the object - requeue the request.
		log.Error(err, "Failed to get WebApp")
		return ctrl.Result{}, err
	}

	// 2. Examine DeletionTimestamp to determine if the object is under deletion.
	if webapp.ObjectMeta.DeletionTimestamp.IsZero() {
		// The object is not being deleted, so if it does not have our finalizer,
		// then let's add it and update the object. This is equivalent to registering our finalizer.
		if !controllerutil.ContainsFinalizer(webapp, webAppFinalizer) {
			log.Info("Adding Finalizer for the WebApp")
			controllerutil.AddFinalizer(webapp, webAppFinalizer)
			if err := r.Update(ctx, webapp); err != nil {
				return ctrl.Result{}, err
			}
		}
	} else {
		// The object is being deleted
		if controllerutil.ContainsFinalizer(webapp, webAppFinalizer) {
			log.Info("Performing Finalizer Operations for WebApp before deletion")
			// Perform all cleanup tasks before removing the finalizer.
			if err := r.cleanupExternalResources(ctx, webapp); err != nil {
				// If cleanup fails, do not remove the finalizer so that we can retry during the next reconciliation.
				log.Error(err, "Failed to cleanup external resources")
				return ctrl.Result{}, err
			}
			// Once all finalizer operations are successful, remove the finalizer.
			log.Info("Removing Finalizer for WebApp after successful cleanup")
			controllerutil.RemoveFinalizer(webapp, webAppFinalizer)
			if err := r.Update(ctx, webapp); err != nil {
				return ctrl.Result{}, err
			}
		}
		// Stop reconciliation as the item is being deleted
		return ctrl.Result{}, nil
	}

	// --- Your normal reconciliation logic for create/update goes here ---
	// For example, ensuring the external monitoring service is configured.
	if err := r.ensureMonitoringService(ctx, webapp); err != nil {
		return ctrl.Result{}, err
	}
	return ctrl.Result{}, nil
}

// cleanupExternalResources simulates deleting an entry from an external monitoring service
func (r *WebAppReconciler) cleanupExternalResources(ctx context.Context, webapp *webappv1.WebApp) error {
	// This is a placeholder for real-world cleanup logic.
	// It must be idempotent. For example, if the external service API
	// returns a 'NotFound' error, we should consider that a success.
	log := log.FromContext(ctx)
	log.Info("Deleting external monitoring entry for WebApp", "WebAppName", webapp.Name)
	// _, err := monitoringClient.DeleteEntry(webapp.Status.MonitoringID)
	// if err != nil && !monitoringClient.IsNotFound(err) {
	//     return err
	// }
	// Simulating success for this example
	return nil
}

// ensureMonitoringService is part of the normal reconcile logic
func (r *WebAppReconciler) ensureMonitoringService(ctx context.Context, webapp *webappv1.WebApp) error {
	// Placeholder for creating/updating the monitoring entry
	log := log.FromContext(ctx)
	log.Info("Ensuring monitoring service is configured for WebApp", "WebAppName", webapp.Name)
	return nil
}
Production Considerations & Edge Cases
- The cleanup logic (cleanupExternalResources in our example) must be idempotent. The Reconcile loop could be triggered multiple times for a deleting object if the finalizer removal step fails. If your cleanup logic tries to delete a resource that's already gone, it should not return an error. Handle NotFound errors from your cloud provider SDKs as success conditions.
- If cleanup keeps failing or the finalizer is never removed, the object remains in the Terminating state forever. Debugging involves inspecting the controller logs to see why cleanupExternalResources is failing or why the finalizer removal Update call is not succeeding. As a last resort, an administrator can manually patch the object to remove the finalizer, but this is dangerous as it can lead to orphaned resources.
# DANGEROUS: Use only when the controller is confirmed broken and resources are cleaned up manually.
# Note that this removes every finalizer on the object, not only the one added by this controller.
kubectl patch webapp my-webapp --type json -p='[{"op": "remove", "path": "/metadata/finalizers"}]'
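The idempotency requirement can be illustrated with a small, self-contained sketch. Everything here is hypothetical: fakeMonitoring stands in for an external service, and errNotFound stands in for whatever "not found" error a real provider SDK returns. The point is only the shape of the error handling, where "already gone" counts as success.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound is a hypothetical sentinel standing in for a provider
// SDK's "not found" error.
var errNotFound = errors.New("entry not found")

// fakeMonitoring is an in-memory stand-in for an external service.
type fakeMonitoring struct {
	entries map[string]bool
}

// DeleteEntry removes an entry and returns a wrapped errNotFound
// when the entry does not exist.
func (f *fakeMonitoring) DeleteEntry(id string) error {
	if !f.entries[id] {
		return fmt.Errorf("delete %q: %w", id, errNotFound)
	}
	delete(f.entries, id)
	return nil
}

// cleanup deletes the entry and treats "already gone" as success,
// so it is safe to call any number of times.
func cleanup(svc *fakeMonitoring, id string) error {
	if err := svc.DeleteEntry(id); err != nil && !errors.Is(err, errNotFound) {
		return err
	}
	return nil
}

func main() {
	svc := &fakeMonitoring{entries: map[string]bool{"mon-123": true}}

	fmt.Println(cleanup(svc, "mon-123")) // first call deletes the entry
	fmt.Println(cleanup(svc, "mon-123")) // second call finds nothing and still succeeds
}
```

Using errors.Is rather than comparing error values directly keeps the check working when the underlying client wraps its errors.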
Deep Dive: Leader Election for High Availability
Running a single replica of your operator is a single point of failure. The standard practice is to run two or more replicas. However, this introduces a new challenge: preventing multiple pods from acting as the controller for the same resource at the same time, a situation known as "split-brain."
controller-runtime solves this with a built-in leader election mechanism. When enabled, all controller-manager pods will contend to acquire a lock on a shared resource object—typically a Lease object in the cluster. Only one pod, the "leader," will succeed. The leader will actively run the Reconcile loops, while the other pods remain on standby, periodically attempting to acquire the lock.
If the leader pod crashes, stops responding, or its lease expires, one of the standby pods will acquire the lock and become the new leader, ensuring the operator remains available.
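The acquire-or-renew rule described above can be sketched as a toy model in plain Go. This is not how controller-runtime or client-go implement it: the lease type, the helper names, and the fixed clock are assumptions made for the illustration, and the real mechanism goes through the API server with optimistic concurrency.

```go
package main

import (
	"fmt"
	"time"
)

// lease is a simplified, hypothetical model of the lock record.
type lease struct {
	Holder    string
	RenewTime time.Time
	Duration  time.Duration
}

// epoch is an arbitrary fixed start time so the example is deterministic.
var epoch = time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)

// at returns the instant the given number of seconds after epoch.
func at(seconds int) time.Time {
	return epoch.Add(time.Duration(seconds) * time.Second)
}

// newLease returns an unheld lease with the given duration in seconds.
func newLease(seconds int) *lease {
	return &lease{Duration: time.Duration(seconds) * time.Second}
}

// tryAcquireOrRenew applies the basic rule: the holder may always renew,
// and another candidate may take over only when the lease is unheld or
// has expired. It reports whether id holds the lease after the call.
func tryAcquireOrRenew(l *lease, id string, now time.Time) bool {
	expired := now.After(l.RenewTime.Add(l.Duration))
	if l.Holder != "" && l.Holder != id && !expired {
		return false
	}
	l.Holder = id
	l.RenewTime = now
	return true
}

func main() {
	l := newLease(15)

	fmt.Println(tryAcquireOrRenew(l, "pod-0", at(0)))  // true: the lease was free
	fmt.Println(tryAcquireOrRenew(l, "pod-1", at(5)))  // false: pod-0 still holds it
	fmt.Println(tryAcquireOrRenew(l, "pod-0", at(10))) // true: the holder renews
	fmt.Println(tryAcquireOrRenew(l, "pod-1", at(30))) // true: no renewal since 10s, expired at 25s
}
```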
Implementation in `main.go`
Leader election is configured at the Manager level in your main.go file. The Kubebuilder scaffold provides the --leader-elect flag out of the box; the three timing flags shown below are additions on top of it.
// main.go
var (
	// ... other variables
	enableLeaderElection        bool
	leaderElectionLeaseDuration time.Duration
	leaderElectionRenewDeadline time.Duration
	leaderElectionRetryPeriod   time.Duration
)

func init() {
	// ... other flag definitions
	flag.BoolVar(&enableLeaderElection, "leader-elect", false,
		"Enable leader election for controller manager. "+
			"Enabling this will ensure there is only one active controller manager.")
	flag.DurationVar(&leaderElectionLeaseDuration, "leader-elect-lease-duration", 60*time.Second,
		"Duration that non-leader candidates will wait to force acquire leadership.")
	flag.DurationVar(&leaderElectionRenewDeadline, "leader-elect-renew-deadline", 40*time.Second,
		"Duration that the acting controlplane will retry refreshing leadership before giving up.")
	flag.DurationVar(&leaderElectionRetryPeriod, "leader-elect-retry-period", 5*time.Second,
		"Duration the LeaderElector clients should wait between tries of actions.")
}

func main() {
	// ... flag parsing and setup ...
	// Note: MetricsBindAddress and Port are Options fields in controller-runtime
	// releases before v0.16; newer releases configure them through the Metrics
	// and WebhookServer options instead.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Scheme:                        scheme,
		MetricsBindAddress:            metricsAddr,
		Port:                          9443,
		HealthProbeBindAddress:        probeAddr,
		LeaderElection:                enableLeaderElection,
		LeaderElectionID:              "d6a20333.my.domain", // Must be unique per operator
		LeaderElectionReleaseOnCancel: true,                 // Recommended for faster failover on graceful shutdown
		LeaseDuration:                 &leaderElectionLeaseDuration,
		RenewDeadline:                 &leaderElectionRenewDeadline,
		RetryPeriod:                   &leaderElectionRetryPeriod,
	})
	// ... setup reconcilers with manager ...
	// ... start manager ...
}
Tuning Leader Election Parameters for Performance and Reliability
The default values are a safe starting point, but in a production environment, tuning these parameters is crucial. They control the trade-off between failover speed and the load placed on the Kubernetes API server.
- --leader-elect: The master switch. Set to true in your production deployment manifests.
- --leader-elect-lease-duration: The total duration of the lease. A non-leader will wait this long before attempting to take over. If the current leader fails to renew its lease within this period, it is considered deposed. Lowering this value leads to faster failover.
- --leader-elect-renew-deadline: The duration within which the current leader must successfully renew its lease. This value must be less than LeaseDuration. The leader will attempt to renew its lease continuously, but if it fails for this entire duration (e.g., due to network partition from the API server), it will relinquish leadership. This is the primary knob for controlling failover time.
- --leader-elect-retry-period: The interval at which both the leader and non-leaders attempt their respective actions (renewing or acquiring the lease). This should be significantly smaller than RenewDeadline to allow for multiple renewal attempts before the deadline is hit.
Tuning Strategy:
The fundamental relationship is: LeaseDuration > RenewDeadline > RetryPeriod.
* Goal: Very fast failover to minimize downtime.
* Configuration:
* LeaseDuration: 15s
* RenewDeadline: 10s
* RetryPeriod: 2s
* Trade-off: This configuration puts more load on the API server, as all operator replicas will be trying to update the Lease object more frequently. It's suitable for clusters with a healthy API server and a small number of high-stakes operators.
* Goal: Reduce API server load; failover time is less critical.
* Configuration:
* LeaseDuration: 60s
* RenewDeadline: 40s
* RetryPeriod: 5s
* Trade-off: Failover can take up to 60 seconds. This is acceptable for controllers where a minute of reconciliation pause has no significant impact.
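As a sanity check on the ordering LeaseDuration > RenewDeadline > RetryPeriod, the two configurations above can be run through a small validator. This is a local sketch, not the validation that client-go performs; the timings type and its methods are hypothetical names chosen for this example.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// timings groups the three leader election durations discussed above.
type timings struct {
	LeaseDuration time.Duration
	RenewDeadline time.Duration
	RetryPeriod   time.Duration
}

// newTimings builds a timings value from whole seconds.
func newTimings(lease, renew, retry int) timings {
	return timings{
		LeaseDuration: time.Duration(lease) * time.Second,
		RenewDeadline: time.Duration(renew) * time.Second,
		RetryPeriod:   time.Duration(retry) * time.Second,
	}
}

// validate checks LeaseDuration > RenewDeadline > RetryPeriod > 0.
func (t timings) validate() error {
	if t.RetryPeriod <= 0 {
		return errors.New("RetryPeriod must be positive")
	}
	if t.RenewDeadline <= t.RetryPeriod {
		return errors.New("RenewDeadline must be greater than RetryPeriod")
	}
	if t.LeaseDuration <= t.RenewDeadline {
		return errors.New("LeaseDuration must be greater than RenewDeadline")
	}
	return nil
}

// renewAttempts estimates how many retry intervals fit inside the
// renew deadline. It returns 0 for a non-positive RetryPeriod.
func (t timings) renewAttempts() int {
	if t.RetryPeriod <= 0 {
		return 0
	}
	return int(t.RenewDeadline / t.RetryPeriod)
}

func main() {
	fast := newTimings(15, 10, 2)
	slow := newTimings(60, 40, 5)
	bad := newTimings(10, 15, 2) // RenewDeadline exceeds LeaseDuration

	fmt.Println(fast.validate(), fast.renewAttempts()) // <nil> 5
	fmt.Println(slow.validate(), slow.renewAttempts()) // <nil> 8
	fmt.Println(bad.validate())
}
```

A check like this could run at startup, right after flag parsing, so that a bad combination is reported by your own code with a clear message.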
Observability and Debugging
When things go wrong, you need to know who the leader is and why a failover might be happening.
The Lease object is the source of truth. It's created in the same namespace as your operator pods.
# The lease name is the LeaderElectionID set in main.go
kubectl get lease d6a20333.my.domain -n my-operator-namespace -o yaml
The output will show you the holderIdentity (which identifies the current leader, typically the pod name followed by a unique suffix), acquireTime, and renewTime.
The controller logs also show election activity:
- Leader Pod: Will log messages like "msg":"successfully acquired lease".
- Standby Pods: Will log messages like "msg":"attempting to acquire leader lease..." while they wait. A "msg":"leader election lost" message comes from a pod that held the lease and failed to renew it.
If you see a pod repeatedly winning and losing the election (leader flapping), it's often a sign that the leader is too busy to renew its lease in time, indicating that your Reconcile loop might be blocked or taking too long. This could be a reason to increase the RenewDeadline.
Tying It All Together: A Production Lifecycle Scenario
Let's trace the complete lifecycle of a DatabaseCluster CR that uses both patterns. Our operator manages a primary and replica PostgreSQL instance in a cloud provider.
1. The DatabaseCluster operator is deployed with 3 replicas and leader-elect=true.
   - operator-pod-0, operator-pod-1, and operator-pod-2 start.
   - They all contend for the Lease object.
   - operator-pod-1 wins and becomes the leader. operator-pod-0 and operator-pod-2 become standbys.
2. A user applies a DatabaseCluster manifest.
   - The Reconcile loop in operator-pod-1 is triggered.
   - It sees the object is new and adds the databasecluster.my.domain/finalizer.
   - It calls the cloud provider's API to provision the primary and replica DB instances.
   - It updates the CR's status with the instance IDs and connection strings.
3. operator-pod-1 fails.
   - operator-pod-1 stops renewing its lease.
   - Once the LeaseDuration elapses without a renewal, its lease is considered expired.
   - operator-pod-0 and operator-pod-2 attempt to acquire the lease. operator-pod-2 succeeds and becomes the new leader.
   - The new leader's Reconcile loop runs for the existing DatabaseCluster CR. It checks the cloud provider and sees the DB instances already exist. Because its logic is idempotent, it simply confirms the state and updates the status if needed, then waits for the next change.
4. A user runs kubectl delete databasecluster my-prod-db.
   - The API server sets deletionTimestamp on the CR. This does not delete the object because the finalizer is present.
   - The update triggers reconciliation in the current leader, operator-pod-2.
   - The Reconcile function detects the deletionTimestamp and the finalizer.
   - It executes its cleanup logic: it calls the cloud provider API to terminate the primary and replica DB instances.
   - Edge Case: The API call to terminate the replica fails due to a transient network error. The cleanup function returns an error.
   - The Reconcile loop returns the error, and controller-runtime requeues the request. The finalizer is not removed. The DatabaseCluster CR remains in the Terminating state.
   - A few seconds later, the Reconcile loop runs again. This time, the cleanup logic successfully terminates both instances.
   - The controller removes its finalizer from the CR and updates it.
   - The API server now sees a Terminating object with no finalizers and permanently deletes it from etcd.
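The deletion edge case in this scenario (one failed cleanup, then a successful retry) can be modeled in a few lines of plain Go. The cluster and flakyCloud types are hypothetical stand-ins for the CR and the cloud provider API, and the loop in main plays the role of the requeue that controller-runtime performs when Reconcile returns an error.

```go
package main

import (
	"errors"
	"fmt"
)

const dbFinalizer = "databasecluster.my.domain/finalizer"

// cluster is a hypothetical in-memory stand-in for a DatabaseCluster CR.
type cluster struct {
	Deleting   bool // models a non-zero deletionTimestamp
	Finalizers []string
}

// flakyCloud fails the first `failures` calls to Terminate, then succeeds.
type flakyCloud struct {
	failures int
	calls    int
}

// Terminate simulates the cloud API call that removes the DB instances.
func (c *flakyCloud) Terminate() error {
	c.calls++
	if c.calls <= c.failures {
		return errors.New("transient network error")
	}
	return nil
}

// reconcileDelete models the deletion branch of Reconcile: the finalizer
// is removed only after cleanup succeeds. This toy model assumes our
// finalizer is the only one on the object.
func reconcileDelete(obj *cluster, cloud *flakyCloud) error {
	if !obj.Deleting || len(obj.Finalizers) == 0 {
		return nil
	}
	if err := cloud.Terminate(); err != nil {
		return err // finalizer stays in place; the caller retries
	}
	obj.Finalizers = nil
	return nil
}

func main() {
	obj := &cluster{Deleting: true, Finalizers: []string{dbFinalizer}}
	cloud := &flakyCloud{failures: 1}

	// Each iteration stands in for one requeued reconciliation.
	for attempt := 1; len(obj.Finalizers) > 0; attempt++ {
		err := reconcileDelete(obj, cloud)
		fmt.Printf("attempt %d: err=%v finalizers=%d\n", attempt, err, len(obj.Finalizers))
	}
}
```

With failures set to 1, the first attempt reports the error and leaves the finalizer in place, and the second attempt clears it, which mirrors the sequence traced above.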
Conclusion
Finalizers and leader election are not optional features for any serious Kubernetes operator. They are the fundamental building blocks of reliability and resilience. Finalizers provide the transactional control needed to manage the lifecycle of external resources, preventing costly leaks and ensuring clean system state. Leader election provides the high availability required for production workloads, allowing your operator to survive node failures and deployments without compromising control or causing destructive race conditions.
By mastering the implementation details, tuning parameters, and edge cases of these two patterns, you elevate your controllers from simple automation scripts to robust, enterprise-grade components of the Kubernetes ecosystem.