Stateful Kubernetes Operators: Finalizers & Leader Election Patterns
The State Management Gap in Standard Controllers
As senior engineers operating in cloud-native ecosystems, we appreciate the declarative power of Kubernetes. We define the desired state, and controllers work tirelessly to make it a reality. While built-in resources like StatefulSet provide foundational blocks for stateful applications, they fall short when operational logic becomes complex. A StatefulSet can manage a Pod's lifecycle, but it cannot provision an external cloud database, perform a complex data migration on termination, or integrate with a corporate backup API.
This is the domain of the Operator Pattern. An operator encodes this human operational knowledge into a custom controller. However, building a production-ready stateful operator requires moving beyond the simple 'create-if-not-exists' reconciliation loop. The true complexity lies in managing the resource's entire lifecycle, especially termination, and ensuring the controller itself is fault-tolerant.
This article bypasses the introductory material. We assume you understand the basics of custom resources (CRs) and controllers. Instead, we will focus on two advanced, mission-critical patterns:
* Finalizers: how to use the metadata.finalizers list and the deletionTimestamp to create a two-phase delete process.
* Leader Election: how the controller-runtime manager uses Lease objects to prevent multiple replicas from reconciling simultaneously (the 'thundering herd' of duplicate work), and how to configure its parameters for optimal performance and resilience.
We will build a ManagedCache operator using Go and Kubebuilder to illustrate these concepts in a practical, production-oriented context.
Initial Scaffolding: The `ManagedCache` Operator
We'll begin by scaffolding our project. This is the only introductory-level step, and we'll move through it quickly. Ensure you have Go, Docker, and Kubebuilder installed.
# 1. Initialize the project
mkdir managedcache-operator && cd managedcache-operator
kubebuilder init --domain my.domain --repo my.domain/managedcache-operator
# 2. Create the API for our ManagedCache resource
kubebuilder create api --group cache --version v1alpha1 --kind ManagedCache
This generates the CRD definition, controller logic boilerplate, and other necessary files. Our ManagedCache spec will be simple, defining the size of a Redis cache deployment.
In api/v1alpha1/managedcache_types.go, we define our Spec and Status:
// api/v1alpha1/managedcache_types.go
package v1alpha1
import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)
// ManagedCacheSpec defines the desired state of ManagedCache
type ManagedCacheSpec struct {
// Size is the number of Redis replicas
// +kubebuilder:validation:Minimum=1
Size int32 `json:"size"`
// Image is the Redis image to use
// +kubebuilder:default="redis:6.2.5"
Image string `json:"image,omitempty"`
}
// ManagedCacheStatus defines the observed state of ManagedCache
type ManagedCacheStatus struct {
// ReadyReplicas is the number of ready Redis pods.
ReadyReplicas int32 `json:"readyReplicas,omitempty"`
// Conditions represent the latest available observations of an object's state
Conditions []metav1.Condition `json:"conditions,omitempty"`
}
//+kubebuilder:object:root=true
//+kubebuilder:subresource:status
// ManagedCache is the Schema for the managedcaches API
type ManagedCache struct {
metav1.TypeMeta `json:",inline"`
metav1.ObjectMeta `json:"metadata,omitempty"`
Spec ManagedCacheSpec `json:"spec,omitempty"`
Status ManagedCacheStatus `json:"status,omitempty"`
}
//+kubebuilder:object:root=true
// ManagedCacheList contains a list of ManagedCache
type ManagedCacheList struct {
metav1.TypeMeta `json:",inline"`
metav1.ListMeta `json:"metadata,omitempty"`
Items []ManagedCache `json:"items"`
}
func init() {
SchemeBuilder.Register(&ManagedCache{}, &ManagedCacheList{})
}
Note the //+kubebuilder:subresource:status marker. This is critical. It tells the API server to treat the status field as a separate subresource, preventing race conditions where spec and status updates conflict. We will leverage this later.
Deep Dive 1: Graceful Deletion with Finalizers
When a user executes kubectl delete managedcache my-cache, the default behavior is for Kubernetes to perform cascading deletion on any objects owned by my-cache. This is fine for stateless resources. But what if our operator needed to perform a critical cleanup task first? For example:
* Flush the cache to a persistent storage backend.
* Deregister the cache endpoint from a service discovery system.
* Notify an external monitoring service.
If the ManagedCache object is deleted immediately, our controller loses its trigger. The reconciliation loop for that object will no longer run. This is where finalizers provide a powerful control mechanism.
A finalizer is simply a string in the metadata.finalizers list of an object. When an object has finalizers, a kubectl delete command does not immediately remove the object. Instead, it sets the metadata.deletionTimestamp field to the current time and updates the object. The object remains in a Terminating state. The controller, upon seeing a non-nil deletionTimestamp, knows it must now execute its cleanup logic.
Once the cleanup is complete, the controller is responsible for removing its finalizer from the list. Only when the metadata.finalizers list is empty will the Kubernetes garbage collector actually delete the object from etcd.
Let's implement this two-phase delete process in our ManagedCache controller.
Implementing the Finalizer Logic
First, we define our finalizer string. It's a best practice to use a domain-qualified name to avoid collisions with other controllers.
// controllers/managedcache_controller.go
const managedCacheFinalizer = "cache.my.domain/finalizer"
Now, we'll modify our Reconcile function. The core logic branches based on the presence of the deletionTimestamp.
// controllers/managedcache_controller.go
func (r *ManagedCacheReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
log := log.FromContext(ctx)
// 1. Fetch the ManagedCache instance
cache := &cachev1alpha1.ManagedCache{}
err := r.Get(ctx, req.NamespacedName, cache)
if err != nil {
if errors.IsNotFound(err) {
log.Info("ManagedCache resource not found. Ignoring since object must be deleted")
return ctrl.Result{}, nil
}
log.Error(err, "Failed to get ManagedCache")
return ctrl.Result{}, err
}
// 2. Check if the object is being deleted
isMarkedForDeletion := cache.GetDeletionTimestamp() != nil
if isMarkedForDeletion {
if controllerutil.ContainsFinalizer(cache, managedCacheFinalizer) {
// Run our finalization logic. If it fails, we return the error
// so we can retry.
if err := r.finalizeManagedCache(ctx, log, cache); err != nil {
return ctrl.Result{}, err
}
// Remove the finalizer from the list and update it.
controllerutil.RemoveFinalizer(cache, managedCacheFinalizer)
if err := r.Update(ctx, cache); err != nil {
return ctrl.Result{}, err
}
}
return ctrl.Result{}, nil
}
// 3. Add finalizer for this CR if it doesn't exist
if !controllerutil.ContainsFinalizer(cache, managedCacheFinalizer) {
controllerutil.AddFinalizer(cache, managedCacheFinalizer)
if err := r.Update(ctx, cache); err != nil {
return ctrl.Result{}, err
}
}
// ... Normal reconciliation logic (create/update Deployment, etc.) goes here ...
// We'll implement this part later
// For now, let's just create the deployment
deployment, err := r.deploymentForManagedCache(cache)
if err != nil {
log.Error(err, "Failed to define deployment for ManagedCache")
return ctrl.Result{}, err
}
// Set ManagedCache instance as the owner and controller
if err := controllerutil.SetControllerReference(cache, deployment, r.Scheme); err != nil {
return ctrl.Result{}, err
}
// Check if this Deployment already exists
found := &appsv1.Deployment{}
err = r.Get(ctx, types.NamespacedName{Name: deployment.Name, Namespace: deployment.Namespace}, found)
if err != nil && errors.IsNotFound(err) {
log.Info("Creating a new Deployment", "Deployment.Namespace", deployment.Namespace, "Deployment.Name", deployment.Name)
err = r.Create(ctx, deployment)
if err != nil {
log.Error(err, "Failed to create new Deployment")
return ctrl.Result{}, err
}
// Deployment created successfully - return and requeue
return ctrl.Result{Requeue: true}, nil
} else if err != nil {
log.Error(err, "Failed to get Deployment")
return ctrl.Result{}, err
}
return ctrl.Result{}, nil
}
// finalizeManagedCache performs cleanup actions before the CR is deleted.
func (r *ManagedCacheReconciler) finalizeManagedCache(ctx context.Context, log logr.Logger, cache *cachev1alpha1.ManagedCache) error {
// This is where you would put your real-world cleanup logic.
// For this example, we'll just log a message.
log.Info("Performing finalization tasks for ManagedCache", "name", cache.Name)
// Example: Call an external API to deregister the cache instance.
// Simulating this with a sleep and log.
log.Info("Simulating call to external backup service...")
time.Sleep(2 * time.Second)
log.Info("Backup service call complete.")
// If your cleanup logic fails, you should return an error here.
// The reconciliation will be retried, and the finalizer will not be removed.
// This prevents the object from being deleted until cleanup is successful.
log.Info("Finalization successful for ManagedCache", "name", cache.Name)
return nil
}
// A helper function to create the deployment
func (r *ManagedCacheReconciler) deploymentForManagedCache(m *cachev1alpha1.ManagedCache) (*appsv1.Deployment, error) {
// Build a minimal Redis Deployment whose replica count and image come from the ManagedCache spec.
ls := map[string]string{"app": "managedcache", "managedcache_cr": m.Name}
depl := &appsv1.Deployment{
ObjectMeta: metav1.ObjectMeta{
Name: m.Name + "-redis",
Namespace: m.Namespace,
},
Spec: appsv1.DeploymentSpec{
Replicas: &m.Spec.Size,
Selector: &metav1.LabelSelector{
MatchLabels: ls,
},
Template: corev1.PodTemplateSpec{
ObjectMeta: metav1.ObjectMeta{
Labels: ls,
},
Spec: corev1.PodSpec{
Containers: []corev1.Container{{
Image: m.Spec.Image,
Name: "redis",
Ports: []corev1.ContainerPort{{
ContainerPort: 6379,
Name: "redis",
}},
}},
},
},
},
}
return depl, nil
}
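One property of this finalization path worth calling out: the deletion branch can run more than once (for example, if the Update that removes the finalizer fails and the request is requeued), so cleanup logic must be idempotent. The sketch below shows the shape of an idempotent cleanup; ExternalBackupClient and its Deregister/IsNotFound methods are hypothetical stand-ins for whatever external system your operator talks to, and it assumes fmt is imported.
// Idempotent finalization sketch. ExternalBackupClient is a hypothetical field on the reconciler.
func (r *ManagedCacheReconciler) finalizeManagedCacheIdempotent(ctx context.Context, log logr.Logger, cache *cachev1alpha1.ManagedCache) error {
	err := r.ExternalBackupClient.Deregister(ctx, cache.Namespace, cache.Name)
	if err != nil {
		// Treat "already deregistered" as success so repeated runs converge.
		if r.ExternalBackupClient.IsNotFound(err) {
			log.Info("External resource already deregistered", "name", cache.Name)
			return nil
		}
		// Any other error blocks finalizer removal and the request is retried with backoff.
		return fmt.Errorf("deregistering cache %s/%s: %w", cache.Namespace, cache.Name, err)
	}
	log.Info("External resource deregistered", "name", cache.Name)
	return nil
}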
Edge Case: Finalizer Logic Failure
What happens if finalizeManagedCache returns an error? The Reconcile function will return this error, and controller-runtime will requeue the request with exponential backoff. Crucially, the finalizer is not removed. The ManagedCache object will be stuck in the Terminating state. A kubectl get managedcache will show its status, and a kubectl describe will show the events indicating repeated reconciliation failures.
This is the desired behavior. It prevents data loss or orphaned external resources by blocking deletion until the operator can successfully complete its cleanup duties. This resilience is a hallmark of a production-grade operator.
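If you want that stuck Terminating state to be visible without digging through controller logs, one option is to record the failure on the Conditions field we defined in the status and retry on a fixed cadence instead of relying solely on exponential backoff. The sketch below assumes the apimachinery helper package k8s.io/apimachinery/pkg/api/meta is imported as meta; the Degraded condition type and FinalizationFailed reason are our own illustrative choices, not a Kubernetes convention.
// handleFinalizeError: a sketch for surfacing a finalization failure in status.
// Call it from the deletion branch instead of returning the error directly:
//   return r.handleFinalizeError(ctx, log, cache, err)
func (r *ManagedCacheReconciler) handleFinalizeError(ctx context.Context, log logr.Logger, cache *cachev1alpha1.ManagedCache, finalizeErr error) (ctrl.Result, error) {
	meta.SetStatusCondition(&cache.Status.Conditions, metav1.Condition{
		Type:    "Degraded",
		Status:  metav1.ConditionTrue,
		Reason:  "FinalizationFailed",
		Message: finalizeErr.Error(),
	})
	// Status updates are still permitted while the object is Terminating.
	if statusErr := r.Status().Update(ctx, cache); statusErr != nil {
		log.Error(statusErr, "Failed to record finalization failure in status")
	}
	// Requeue on a fixed cadence rather than unbounded exponential backoff.
	return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
}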
Deep Dive 2: High Availability via Leader Election
For production deployments, you will run multiple replicas of your operator for high availability. If one pod crashes, another can take over. This introduces a new challenge: how do you prevent all operator pods from reconciling the same ManagedCache object simultaneously? This could lead to race conditions, conflicting API calls, and duplicated resource creation.
This is solved by leader election. Only one operator pod—the leader—is active at any given time. The other pods remain on standby, ready to take over if the leader fails.
Kubebuilder and controller-runtime handle this for you out-of-the-box. When you scaffold a project, the main.go file includes code to enable leader election.
// main.go
// ... imports
func main() {
// ... setup code
mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
Scheme: scheme,
MetricsBindAddress: metricsAddr,
Port: 9443,
HealthProbeBindAddress: probeAddr,
LeaderElection: true, // This is the magic flag!
LeaderElectionID: "a6d7123.my.domain",
})
// ... controller setup
}
How Leader Election Works Internally
Under the hood, controller-runtime uses the Kubernetes API to coordinate. It creates a coordination.k8s.io Lease object named after the LeaderElectionID (by default in the namespace the operator runs in, configurable via LeaderElectionNamespace). All operator pods attempt to acquire a lock on this Lease by writing their identity into it; a sketch of the underlying client-go machinery follows the list below.
* The first pod to successfully update the Lease becomes the leader.
* The leader continuously renews its lease before it expires.
* The other pods periodically try to acquire the lease. They will only succeed if the current leader fails to renew it (e.g., because the pod crashed or became network-partitioned).
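You never have to write this plumbing yourself, but seeing the underlying client-go primitives makes the manager's behavior less opaque. The following standalone sketch approximates what controller-runtime configures when LeaderElection is true; the namespace ("operators") and the decision to exit on losing the lease are illustrative choices.
// A minimal, standalone sketch of the client-go leader election machinery that
// controller-runtime's manager wires up for you. Illustration only.
package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	id, _ := os.Hostname() // each replica presents a unique identity

	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "a6d7123.my.domain", Namespace: "operators"},
		Client:     kubernetes.NewForConfigOrDie(cfg).CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:            lock,
		LeaseDuration:   15 * time.Second, // candidates wait this long before force-acquiring
		RenewDeadline:   10 * time.Second, // leader gives up if it cannot renew within this window
		RetryPeriod:     2 * time.Second,  // interval between renew/acquire attempts
		ReleaseOnCancel: true,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) { /* start controllers here */ },
			OnStoppedLeading: func() { os.Exit(1) }, // lost the lease: stop reconciling immediately
		},
	})
}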
Tuning Leader Election for Performance and Resilience
The default settings for leader election are a safe starting point, but for a stateful operator managing critical infrastructure, you need to understand the trade-offs and tune the parameters. These are configured in ctrl.Options.
* LeaseDuration (default: 15s): The duration that non-leader candidates will wait before force-acquiring leadership from a leader that has stopped renewing. This is the primary knob for failover time; a lower value means faster failover.
* RenewDeadline (default: 10s): The duration that the acting leader will keep retrying to refresh its lease before giving up leadership. This must be less than LeaseDuration.
* RetryPeriod (default: 2s): The duration all clients wait between attempts to acquire or renew the lease.
The relationship is critical: LeaseDuration > RenewDeadline > RetryPeriod.
Scenario 1: High-Speed Failover Required
Imagine our ManagedCache operator is managing a critical in-memory cache for a real-time bidding system. Downtime of even a few seconds is costly. We need to prioritize fast failover.
// main.go - Aggressive tuning
leaseDuration := 10 * time.Second
renewDeadline := 7 * time.Second
retryPeriod := 2 * time.Second
mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
	// ... other options
	LeaderElection:   true,
	LeaderElectionID: "a6d7123.my.domain",
	LeaseDuration:    &leaseDuration,
	RenewDeadline:    &renewDeadline,
	RetryPeriod:      &retryPeriod,
})
* Pros: If the leader pod dies, a standby can acquire the Lease as soon as the 10-second lease expires, so failover typically completes within roughly 10-12 seconds.
* Cons: This configuration increases the load on the Kubernetes API server. The leader renews its lease every ~2 seconds. With hundreds of operators in a cluster, this can contribute to API server performance degradation. This is a classic availability vs. performance trade-off.
Scenario 2: Reducing API Server Load
Now imagine the operator manages a long-term archival data store. A failover time of a minute is perfectly acceptable. Here, we can prioritize reducing the operational chatter.
// main.go - Conservative tuning
leaseDuration := 60 * time.Second
renewDeadline := 45 * time.Second
retryPeriod := 10 * time.Second
mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
	// ... other options
	LeaderElection:   true,
	LeaderElectionID: "a6d7123.my.domain",
	LeaseDuration:    &leaseDuration,
	RenewDeadline:    &renewDeadline,
	RetryPeriod:      &retryPeriod,
})
* Pros: The leader only writes to the API server every ~10 seconds, significantly reducing load.
* Cons: Failover takes much longer; after a leader crash, a standby must wait out the full 60-second lease duration before it can take over.
Choosing the right values requires a deep understanding of your application's SLOs and the operational capacity of your Kubernetes control plane.
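Because the right values differ from cluster to cluster, it is often worth exposing these timings as flags instead of hard-coding them, so the people running the operator can tune failover behavior per environment. A sketch (the flag names are our own choice, not a Kubebuilder convention):
// main.go - expose leader election timings as flags (sketch)
var leaseDuration, renewDeadline, retryPeriod time.Duration
flag.DurationVar(&leaseDuration, "leader-elect-lease-duration", 15*time.Second, "Leader election lease duration")
flag.DurationVar(&renewDeadline, "leader-elect-renew-deadline", 10*time.Second, "Leader election renew deadline")
flag.DurationVar(&retryPeriod, "leader-elect-retry-period", 2*time.Second, "Leader election retry period")
flag.Parse()

mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
	// ... other options as before
	LeaderElection:   true,
	LeaderElectionID: "a6d7123.my.domain",
	LeaseDuration:    &leaseDuration,
	RenewDeadline:    &renewDeadline,
	RetryPeriod:      &retryPeriod,
})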
Tying It Together: The Status Subresource
Our final piece is to provide clear feedback to the user about the state of the ManagedCache resource. We will use the status subresource we defined earlier.
Using the status subresource is not just good practice; it's essential for avoiding conflicts. The Reconcile loop might be trying to update the status (e.g., ReadyReplicas) at the same time a user or another system is modifying the spec (e.g., changing the size). Because they are separate API endpoints (/apis/cache.my.domain/v1alpha1/namespaces/<namespace>/managedcaches/my-cache vs .../managedcaches/my-cache/status), a status write can never clobber a concurrent spec change (and vice versa), which removes the most common source of frustrating 409 Conflict errors.
Here is a more complete Reconcile loop incorporating status updates.
// controllers/managedcache_controller.go
func (r *ManagedCacheReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
log := log.FromContext(ctx)
// ... (1) Fetch instance, (2) Handle finalizer logic as before ...
// (3) Add finalizer logic as before ...
// 4. Reconcile the Deployment
deployment := &appsv1.Deployment{}
err = r.Get(ctx, types.NamespacedName{Name: cache.Name + "-redis", Namespace: cache.Namespace}, deployment)
// If the deployment does not exist, create it
if err != nil && errors.IsNotFound(err) {
dep, err := r.deploymentForManagedCache(cache)
if err != nil {
log.Error(err, "Failed to build new Deployment spec")
return ctrl.Result{}, err
}
if err := controllerutil.SetControllerReference(cache, dep, r.Scheme); err != nil {
return ctrl.Result{}, err
}
log.Info("Creating new Deployment")
if err := r.Create(ctx, dep); err != nil {
log.Error(err, "Failed to create new Deployment")
return ctrl.Result{}, err
}
// Requeue to update status after deployment is created
return ctrl.Result{Requeue: true}, nil
} else if err != nil {
log.Error(err, "Failed to get Deployment")
return ctrl.Result{}, err
}
// 5. Ensure the deployment size matches the spec
size := cache.Spec.Size
if *deployment.Spec.Replicas != size {
deployment.Spec.Replicas = &size
if err = r.Update(ctx, deployment); err != nil {
log.Error(err, "Failed to update Deployment size")
return ctrl.Result{}, err
}
// Requeue to check status after update
return ctrl.Result{Requeue: true}, nil
}
// 6. Update the status
// Work on a deep copy; the default client already returns a copy from the informer cache, but not mutating the fetched object before a status write is good hygiene.
cacheCopy := cache.DeepCopy()
cacheCopy.Status.ReadyReplicas = deployment.Status.ReadyReplicas
// Use the status writer to update the status subresource.
if err := r.Status().Update(ctx, cacheCopy); err != nil {
log.Error(err, "Failed to update ManagedCache status")
return ctrl.Result{}, err
}
return ctrl.Result{}, nil
}
// In SetupWithManager, ensure you own Deployments to trigger reconciliation on their changes.
func (r *ManagedCacheReconciler) SetupWithManager(mgr ctrl.Manager) error {
return ctrl.NewControllerManagedBy(mgr).
For(&cachev1alpha1.ManagedCache{}).
Owns(&appsv1.Deployment{}). // This is key!
Complete(r)
}
The most important line here is r.Status().Update(ctx, cacheCopy). We are using the specialized status writer (r.Status()) to write only the /status subresource. This keeps status writes from touching the spec and adheres to Kubernetes API conventions.
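Note that while the subresource separates spec writes from status writes, optimistic concurrency on resourceVersion still applies: if the object changed between our Get and the status Update, the Update can return a conflict and trigger another reconcile. If that becomes noisy under heavy churn, a merge patch against the originally fetched object is a common alternative. A sketch, assuming meta from k8s.io/apimachinery/pkg/api/meta and the controller-runtime client package (imported as client in the scaffolded controller) are available; the "Available" condition is an illustrative choice:
// Sketch: patch the status subresource instead of updating it wholesale.
base := cache.DeepCopy() // snapshot taken before mutating status
cache.Status.ReadyReplicas = deployment.Status.ReadyReplicas
meta.SetStatusCondition(&cache.Status.Conditions, metav1.Condition{
	Type:    "Available",
	Status:  metav1.ConditionTrue, // a real implementation would report False while replicas are still coming up
	Reason:  "DeploymentReady",
	Message: fmt.Sprintf("%d/%d replicas ready", deployment.Status.ReadyReplicas, cache.Spec.Size),
})
if err := r.Status().Patch(ctx, cache, client.MergeFrom(base)); err != nil {
	log.Error(err, "Failed to patch ManagedCache status")
	return ctrl.Result{}, err
}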
Conclusion: Beyond the Basics
We've moved far beyond a simple controller. By implementing finalizers, we've guaranteed that our operator can perform critical cleanup tasks, preventing orphaned resources and data loss even when its custom resource is deleted. By understanding and tuning leader election, we've ensured our operator can be deployed in a highly available configuration, with predictable failover behavior that can be tailored to meet specific SLOs.
These patterns are not optional extras; they are the bedrock of reliable, production-grade stateful operators. They transform a simple reconciliation loop into a robust lifecycle management system that can be trusted with critical stateful workloads. The next steps in this journey would involve implementing admission webhooks for advanced validation, managing more complex external dependencies, and developing more sophisticated status condition reporting, but the foundation for all of that advanced work lies in the correct implementation of finalizers and a deep understanding of the operator's own high-availability model.