Idempotent Kubernetes Operators with Finalizers for Stateful Services
The Fragility of State in a Declarative World
In the Kubernetes ecosystem, the control plane's primary directive is to converge the cluster's actual state with a user-defined desired state. For stateless applications managed by Deployments, this model is remarkably effective. However, when managing stateful services—databases, message queues, distributed caches—the convergence model reveals its sharp edges. A simple kubectl delete can trigger a cascade of ungraceful terminations, leaving behind orphaned persistent volumes, dangling DNS entries in an external service-discovery system, or a corrupted cluster state.
This is the problem domain where Kubernetes Operators excel. By extending the Kubernetes API with Custom Resource Definitions (CRDs) and implementing custom control loops, we can encode complex, domain-specific operational logic directly into the cluster. But building an operator is not enough. A poorly designed operator can be more dangerous than manual management, introducing race conditions and non-deterministic behavior.
This article bypasses introductory concepts and dives directly into two architectural patterns that are fundamental to building production-grade, reliable operators: idempotency and finalizers. We will construct an operator for a fictional replicated database, RepliDB, to illustrate these concepts with production-ready Go code using the controller-runtime library.
Our goal is to build a controller that can:
* Idempotently create and update the owned StatefulSet and Service to match our RepliDB CR specification.
* Survive controller restarts and intermittent failures without creating duplicate resources or entering error loops.
* Run graceful cleanup logic before allowing the RepliDB resource to be removed from the API server.
Defining the Custom Resource: `RepliDB`
First, let's define the API for our stateful service. This CRD specifies the desired state for our replicated database. We'll use Kubebuilder markers to generate the CRD manifest and boilerplate Go code.
api/v1alpha1/replidb_types.go
package v1alpha1
import (
appsv1 "k8s.io/api/apps/v1"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)
// RepliDBSpec defines the desired state of RepliDB
type RepliDBSpec struct {
// +kubebuilder:validation:Minimum=1
// +kubebuilder:validation:Maximum=9
// Number of database replicas.
Replicas int32 `json:"replicas"`
// Image is the container image for the database node.
Image string `json:"image"`
// StorageClassName is the name of the StorageClass for the PersistentVolumeClaims.
StorageClassName string `json:"storageClassName"`
}
// RepliDBStatus defines the observed state of RepliDB
type RepliDBStatus struct {
// Conditions represent the latest available observations of the RepliDB's state.
// +optional
Conditions []metav1.Condition `json:"conditions,omitempty"`
// ReadyReplicas is the number of database nodes that are considered ready.
ReadyReplicas int32 `json:"readyReplicas,omitempty"`
}
//+kubebuilder:object:root=true
//+kubebuilder:subresource:status
//+kubebuilder:printcolumn:name="Replicas",type=integer,JSONPath=".spec.replicas"
//+kubebuilder:printcolumn:name="Ready",type=string,JSONPath=".status.readyReplicas"
//+kubebuilder:printcolumn:name="Age",type=date,JSONPath=".metadata.creationTimestamp"
// RepliDB is the Schema for the replidbs API
type RepliDB struct {
metav1.TypeMeta `json:",inline"`
metav1.ObjectMeta `json:"metadata,omitempty"`
Spec RepliDBSpec `json:"spec,omitempty"`
Status RepliDBStatus `json:"status,omitempty"`
}
//+kubebuilder:object:root=true
// RepliDBList contains a list of RepliDB
type RepliDBList struct {
metav1.TypeMeta `json:",inline"`
metav1.ListMeta `json:"metadata,omitempty"`
Items []RepliDB `json:"items"`
}
func init() {
SchemeBuilder.Register(&RepliDB{}, &RepliDBList{})
}
This CRD allows a user to specify the number of replicas, the container image, and the storage class. The status subresource will track readiness and conditions, which is critical for observability.
The Core Principle: Idempotent Reconciliation
The heart of any operator is the Reconcile function. It's invoked whenever a change occurs to the custom resource or any of its owned resources. A common novice mistake is to write edge-triggered, procedural logic that reacts to a single event and assumes it runs exactly once. This approach is fragile: the reconciliation loop can be triggered multiple times for the same state, and it can fail or be restarted at any point.
Idempotency is the property where an operation can be applied multiple times without changing the result beyond the initial application. In our operator, the reconciliation must be idempotent. Whether it runs once or five times in a row, the outcome on the cluster should be identical: the actual state should converge to the desired state.
This is achieved by structuring the Reconcile function as a state comparison and convergence engine:
1. Fetch the current state of the primary resource (RepliDB).
2. Compute the desired state of the secondary resources (StatefulSet, Service) in memory based on the RepliDB spec.
3. Compare that desired state with what actually exists in the cluster, and issue only the create or update calls needed to converge the two.
Implementing the Idempotent Loop
Let's implement this pattern for our RepliDB's StatefulSet. We'll create a helper function that builds the desired StatefulSet object. This separates state definition from state enforcement.
controllers/replidb_controller.go (inside the RepliDBReconciler struct)
import (
// ... other imports
appsv1 "k8s.io/api/apps/v1"
corev1 "k8s.io/api/core/v1"
"k8s.io/apimachinery/pkg/api/errors"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/apimachinery/pkg/types"
"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)
// desiredStatefulSetForRepliDB constructs the desired StatefulSet object.
func (r *RepliDBReconciler) desiredStatefulSetForRepliDB(db *repldbv1alpha1.RepliDB) *appsv1.StatefulSet {
labels := map[string]string{
"app": "replidb",
"replidb_cr": db.Name,
}
sts := &appsv1.StatefulSet{
ObjectMeta: metav1.ObjectMeta{
Name: db.Name,
Namespace: db.Namespace,
Labels: labels,
},
Spec: appsv1.StatefulSetSpec{
Replicas: &db.Spec.Replicas,
Selector: &metav1.LabelSelector{MatchLabels: labels},
ServiceName: db.Name, // Headless service name
Template: corev1.PodTemplateSpec{
ObjectMeta: metav1.ObjectMeta{Labels: labels},
Spec: corev1.PodSpec{
Containers: []corev1.Container{{
Name: "database",
Image: db.Spec.Image,
Ports: []corev1.ContainerPort{{
ContainerPort: 5432,
Name: "db-port",
}},
}},
},
},
},
}
// Set RepliDB instance as the owner and controller.
// This is crucial for garbage collection and for the reconciliation loop to be triggered on StatefulSet changes.
controllerutil.SetControllerReference(db, sts, r.Scheme)
return sts
}
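The StatefulSet's ServiceName field points at a headless Service that our controller is also responsible for. A minimal sketch of the corresponding builder, following the same build-then-set-owner pattern (the desiredHeadlessServiceForRepliDB name and the 5432 port are illustrative, not part of any published API):
// desiredHeadlessServiceForRepliDB constructs the headless Service that gives
// each StatefulSet pod a stable DNS record. Illustrative helper.
func (r *RepliDBReconciler) desiredHeadlessServiceForRepliDB(db *repldbv1alpha1.RepliDB) *corev1.Service {
	labels := map[string]string{
		"app":        "replidb",
		"replidb_cr": db.Name,
	}
	svc := &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{
			Name:      db.Name, // must match the StatefulSet's ServiceName
			Namespace: db.Namespace,
			Labels:    labels,
		},
		Spec: corev1.ServiceSpec{
			ClusterIP: corev1.ClusterIPNone, // headless
			Selector:  labels,
			Ports: []corev1.ServicePort{{
				Name: "db-port",
				Port: 5432, // assumed to match the container port above
			}},
		},
	}
	controllerutil.SetControllerReference(db, svc, r.Scheme)
	return svc
}
Reconcile would enforce this Service exactly as it enforces the StatefulSet: fetch it, create it if absent, and patch it if it has drifted from the desired spec.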
Now, the Reconcile function uses this to enforce the state.
controllers/replidb_controller.go (inside the Reconcile method, initial implementation)
func (r *RepliDBReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
log := log.FromContext(ctx)
// 1. Fetch the RepliDB instance
var replidb repldbv1alpha1.RepliDB
if err := r.Get(ctx, req.NamespacedName, &replidb); err != nil {
if errors.IsNotFound(err) {
log.Info("RepliDB resource not found. Ignoring since object must be deleted.")
return ctrl.Result{}, nil
}
log.Error(err, "Failed to get RepliDB")
return ctrl.Result{}, err
}
// 2. Reconcile the StatefulSet
foundSts := &appsv1.StatefulSet{}
err := r.Get(ctx, types.NamespacedName{Name: replidb.Name, Namespace: replidb.Namespace}, foundSts)
desiredSts := r.desiredStatefulSetForRepliDB(&replidb)
if err != nil && errors.IsNotFound(err) {
// StatefulSet does not exist, create it.
log.Info("Creating a new StatefulSet", "StatefulSet.Namespace", desiredSts.Namespace, "StatefulSet.Name", desiredSts.Name)
if err := r.Create(ctx, desiredSts); err != nil {
log.Error(err, "Failed to create new StatefulSet", "StatefulSet.Namespace", desiredSts.Namespace, "StatefulSet.Name", desiredSts.Name)
return ctrl.Result{}, err
}
// StatefulSet created successfully - return and requeue
return ctrl.Result{Requeue: true}, nil
} else if err != nil {
log.Error(err, "Failed to get StatefulSet")
return ctrl.Result{}, err
}
// 3. Ensure the StatefulSet spec is up to date.
// A simple deep equality check is often too naive. Real-world controllers need more sophisticated diffing.
// For this example, we'll check key fields.
if *foundSts.Spec.Replicas != replidb.Spec.Replicas || foundSts.Spec.Template.Spec.Containers[0].Image != replidb.Spec.Image {
log.Info("StatefulSet spec out of sync, updating...")
foundSts.Spec.Replicas = &replidb.Spec.Replicas
foundSts.Spec.Template.Spec.Containers[0].Image = replidb.Spec.Image
if err := r.Update(ctx, foundSts); err != nil {
log.Error(err, "Failed to update StatefulSet", "StatefulSet.Namespace", foundSts.Namespace, "StatefulSet.Name", foundSts.Name)
return ctrl.Result{}, err
}
return ctrl.Result{Requeue: true}, nil
}
// ... update status and finish reconciliation ...
return ctrl.Result{}, nil
}
This logic handles creation and updates idempotently. If you run kubectl apply with the same manifest multiple times, the controller will find that the StatefulSet spec matches the desired state and will simply exit the loop without performing any writes to the API server.
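The "... update status and finish reconciliation ..." step elided above can be sketched as a small helper that mirrors the StatefulSet's readiness into the status subresource. This assumes the standard scaffold where the reconciler embeds client.Client, and it uses meta.SetStatusCondition from k8s.io/apimachinery/pkg/api/meta; the updateRepliDBStatus name and the "Available" condition type are illustrative:
// updateRepliDBStatus copies readiness information from the owned StatefulSet
// into the RepliDB status subresource. Illustrative sketch; requires the
// "context", "fmt", appsv1, metav1 and "k8s.io/apimachinery/pkg/api/meta" imports.
func (r *RepliDBReconciler) updateRepliDBStatus(ctx context.Context, db *repldbv1alpha1.RepliDB, sts *appsv1.StatefulSet) error {
	db.Status.ReadyReplicas = sts.Status.ReadyReplicas

	cond := metav1.Condition{
		Type:    "Available",
		Status:  metav1.ConditionFalse,
		Reason:  "ReplicasNotReady",
		Message: fmt.Sprintf("%d/%d replicas ready", sts.Status.ReadyReplicas, db.Spec.Replicas),
	}
	if sts.Status.ReadyReplicas == db.Spec.Replicas {
		cond.Status = metav1.ConditionTrue
		cond.Reason = "AllReplicasReady"
	}
	meta.SetStatusCondition(&db.Status.Conditions, cond)

	// Writes to the status subresource go through the status client, not r.Update.
	return r.Status().Update(ctx, db)
}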
The Deletion Dilemma: Why Finalizers are Essential
Our current operator correctly manages the lifecycle of the StatefulSet. But what happens when a user executes kubectl delete replidb my-db?
1. The RepliDB object is marked for deletion by the Kubernetes API server.
2. Because we set an owner reference with SetControllerReference, Kubernetes's built-in garbage collector will see that the StatefulSet is owned by the deleted RepliDB and will proceed to delete it.
3. The StatefulSet controller then terminates its Pods.
This process is abrupt. It provides no opportunity for our operator to perform pre-deletion cleanup. For our RepliDB, we might need to:
* Notify a primary node to gracefully hand over leadership.
* Flush in-memory buffers to disk.
* De-register the database nodes from an external service discovery endpoint.
* Backup the data from the PersistentVolumes before they are released.
This is where finalizers become the critical tool. A finalizer is a key in an object's metadata.finalizers list. When a finalizer is present, a kubectl delete command does not immediately delete the object. Instead, it sets the metadata.deletionTimestamp field to the current time. The object remains visible to the API and to our controller, but in a "deleting" state.
It is now the responsibility of the controller that added the finalizer to perform its cleanup tasks and, once complete, remove the finalizer key from the list. Only when the finalizers list is empty will the API server actually delete the object.
Implementing a Finalizer for Graceful Shutdown
Let's integrate a finalizer into our RepliDBReconciler.
Step 1: Define the finalizer name.
It's a best practice to use a domain-qualified name to avoid collisions.
controllers/replidb_controller.go
const replidbFinalizer = "db.example.com/finalizer"
Step 2: Modify the Reconcile function to manage the finalizer.
The reconciliation logic now splits into two main paths:
* DeletionTimestamp is not set: The object is not being deleted. This is the normal reconciliation path. Our first job here is to ensure our finalizer is present.
* DeletionTimestamp is set: The object is being deleted. We must execute our cleanup logic. If cleanup is successful, we remove the finalizer.
Here is the complete, production-grade Reconcile function structure:
controllers/replidb_controller.go
func (r *RepliDBReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
log := log.FromContext(ctx)
// 1. Fetch the RepliDB instance
var replidb repldbv1alpha1.RepliDB
if err := r.Get(ctx, req.NamespacedName, &replidb); err != nil {
if errors.IsNotFound(err) {
return ctrl.Result{}, nil
}
return ctrl.Result{}, err
}
// 2. Check if the object is being deleted
isMarkedForDeletion := replidb.GetDeletionTimestamp() != nil
if isMarkedForDeletion {
if controllerutil.ContainsFinalizer(&replidb, replidbFinalizer) {
// Run our finalizer logic. If it fails, we'll try again later.
if err := r.finalizeRepliDB(ctx, &replidb); err != nil {
log.Error(err, "Failed to finalize RepliDB")
return ctrl.Result{}, err
}
// If successful, remove the finalizer and update the object.
controllerutil.RemoveFinalizer(&replidb, replidbFinalizer)
if err := r.Update(ctx, &replidb); err != nil {
return ctrl.Result{}, err
}
}
return ctrl.Result{}, nil
}
// 3. Add finalizer for this CR if it doesn't exist
if !controllerutil.ContainsFinalizer(&replidb, replidbFinalizer) {
log.Info("Adding finalizer for RepliDB")
controllerutil.AddFinalizer(&replidb, replidbFinalizer)
if err := r.Update(ctx, &replidb); err != nil {
return ctrl.Result{}, err
}
}
// 4. Run the regular reconciliation logic (create/update StatefulSet, etc.)
// ... [The idempotent reconciliation logic from before] ...
foundSts := &appsv1.StatefulSet{}
err := r.Get(ctx, types.NamespacedName{Name: replidb.Name, Namespace: replidb.Namespace}, foundSts)
// ... etc ...
return ctrl.Result{}, nil
}
// finalizeRepliDB contains the logic to run before deleting the CR.
func (r *RepliDBReconciler) finalizeRepliDB(ctx context.Context, db *repldbv1alpha1.RepliDB) error {
log := log.FromContext(ctx)
log.Info("Starting finalization for RepliDB")
// In a real-world scenario, you would add complex cleanup logic here.
// For example, calling an external API to de-register the database.
// We'll simulate this with a log message.
log.Info("De-registering database from external discovery service", "dbName", db.Name)
// http.Post("https://discovery.example.com/deregister", ...)
// The built-in garbage collection will handle the owned StatefulSet, but if you had
// other external resources not managed by owner references, you would clean them up here.
// For example, deleting a DNS record in Route53 or a load balancer in a cloud provider.
log.Info("Finalization for RepliDB successful")
return nil
}
This structure is robust. When a RepliDB is created, the reconciler's first action is to add the finalizer. From that point on, the object cannot be fully deleted from the cluster until our finalizeRepliDB function runs successfully and the finalizer is removed. This guarantees our cleanup logic will execute.
Advanced Edge Cases and Performance Considerations
Building a simple operator is straightforward. Building one that withstands the chaos of a production environment requires thinking about failure modes.
Edge Case 1: Partial Failure During Finalization
What if our finalizeRepliDB function fails? For instance, the external discovery service it calls is unavailable. In our current implementation, the function returns an error, and the Reconcile call is requeued with exponential backoff by controller-runtime. This is the desired behavior. The RepliDB object will remain in a Terminating state, with the deletionTimestamp set, until the discovery service becomes available and our finalizer logic succeeds. This prevents the system from entering an inconsistent state where the database pods are gone but are still registered as active in the discovery service.
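If the external dependency is expected to stay down for a while, one variation (not shown in the controller above) is to swallow the error and requeue with a fixed delay instead of relying on exponential backoff. A sketch of the deletion branch with that change, assuming a 30-second retry interval and the "time" package in the import list:
// Variation of step 2 in Reconcile: retry cleanup on a fixed schedule.
if controllerutil.ContainsFinalizer(&replidb, replidbFinalizer) {
	if err := r.finalizeRepliDB(ctx, &replidb); err != nil {
		// Cleanup is not yet possible (e.g. the discovery service is unreachable);
		// keep the object in Terminating and try again in 30 seconds.
		log.Error(err, "Finalization failed, retrying", "retryAfter", "30s")
		return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
	}
	controllerutil.RemoveFinalizer(&replidb, replidbFinalizer)
	if err := r.Update(ctx, &replidb); err != nil {
		return ctrl.Result{}, err
	}
}
return ctrl.Result{}, nil
Both styles keep the finalizer in place until cleanup succeeds; the difference is the retry cadence and whether the failed attempt is reported as a reconcile error in the controller's logs and metrics.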
Edge Case 2: Controller Restarts
The operator Pod can be evicted, crash, or be rescheduled at any time. How does this affect our logic?
* During normal reconciliation: State is stored in etcd (the CR itself). When the new controller pod starts, it will receive events for all RepliDB objects and its idempotent reconciliation loop will simply verify the state, making no changes if everything is already converged.
* During finalization: The state is also stored in etcd (the deletionTimestamp and the finalizer key). If the controller crashes mid-cleanup, the new instance will see the object is still marked for deletion and still has the finalizer, and it will re-trigger the finalizeRepliDB function. This is why the cleanup logic itself must also be idempotent. For example, the external API call should be a DELETE request, which is typically idempotent, rather than a state-toggling POST, as in the sketch below.
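A sketch of such an idempotent de-registration call against a hypothetical discovery-service endpoint (the URL and path are invented for illustration); treating a 404 as success is what makes re-running the finalizer safe:
// deregisterFromDiscovery issues an idempotent DELETE against a hypothetical
// discovery service. A 404 means the node was already de-registered, which we
// treat as success so the finalizer can be re-run safely.
func deregisterFromDiscovery(ctx context.Context, baseURL, dbName string) error {
	req, err := http.NewRequestWithContext(ctx, http.MethodDelete,
		fmt.Sprintf("%s/databases/%s", baseURL, dbName), nil)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	switch {
	case resp.StatusCode == http.StatusNotFound:
		return nil // already gone: idempotent success
	case resp.StatusCode >= 200 && resp.StatusCode < 300:
		return nil
	default:
		return fmt.Errorf("discovery service returned %s", resp.Status)
	}
}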
Performance: Watching, Predicates, and Caching
By default, controller-runtime sets up watches that trigger reconciliation on any change to the primary resource (RepliDB) or secondary resources (StatefulSet, Service). For an operator managing thousands of CRs, this can be inefficient.
Updates to the status subresource of our RepliDB, which our own controller is writing, shouldn't trigger another reconciliation loop. We can use predicates to filter these events.
controllers/replidb_controller.go (in SetupWithManager)
import "sigs.k8s.io/controller-runtime/pkg/predicate"
func (r *RepliDBReconciler) SetupWithManager(mgr ctrl.Manager) error {
return ctrl.NewControllerManagedBy(mgr).
For(&repldbv1alpha1.RepliDB{}).
Owns(&appsv1.StatefulSet{}).
WithEventFilter(predicate.GenerationChangedPredicate{}).
Complete(r)
}
GenerationChangedPredicate skips update events where the object's metadata.generation has not changed. Since status updates don't increment the generation but spec changes do, this effectively ignores status-only updates, significantly reducing unnecessary reconciliations. Note, however, that WithEventFilter applies the predicate to every watched type, including the owned StatefulSet, so status-only changes to the StatefulSet (such as readyReplicas moving) are filtered out as well.
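If you do want StatefulSet status changes to keep driving reconciliation (for example, to update ReadyReplicas in the RepliDB status), one option is to scope the predicate to the primary resource only via builder.WithPredicates. A sketch:
import (
	"sigs.k8s.io/controller-runtime/pkg/builder"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

func (r *RepliDBReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		// Filter generation-unchanged events only for the RepliDB itself,
		// so updates to the owned StatefulSet still trigger reconciliation.
		For(&repldbv1alpha1.RepliDB{}, builder.WithPredicates(predicate.GenerationChangedPredicate{})).
		Owns(&appsv1.StatefulSet{}).
		Complete(r)
}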
controller-runtime's client is, by default, a caching client: r.Get() calls read from a local in-memory cache (an informer) that is kept in sync with the API server. This is highly efficient and avoids overwhelming the API server and etcd. However, it's crucial to remember that r.Update() and r.Create() calls go directly to the API server. This cache-on-read, direct-on-write pattern is fundamental to operator performance.
For controllers that manage large numbers of objects, throughput can also be increased by raising MaxConcurrentReconciles in the controller options:
main.go
// ...
controller.New("replidb-controller", mgr, controller.Options{
Reconciler: &controllers.RepliDBReconciler{ /* ... */ },
MaxConcurrentReconciles: 5, // Default is 1
})
// ...
Raising MaxConcurrentReconciles must be done with caution. controller-runtime guarantees that two reconciliations for the same object never run concurrently, but reconciliations for different objects do, and they can race when they interact with the same external, non-atomic systems.
Conclusion: From Logic to Reliability
We have moved beyond a basic operator implementation to a robust, production-oriented architecture. The key takeaways are not about the specific code, but the patterns they represent:
* Idempotency is non-negotiable. Structure every reconciliation as a stateless comparison of desired vs. actual state. Your operator must be able to crash and restart at any moment without corrupting the system it manages.
* Finalizers are the canonical mechanism for safe deletion. They provide the essential hook for performing graceful shutdown and cleaning up any external resources that Kubernetes's garbage collector is unaware of. They transform deletion from an abrupt event into a managed, observable process.
By combining these two patterns, you provide a powerful abstraction that allows users to manage the entire lifecycle of a complex, stateful application with the same declarative kubectl apply and kubectl delete commands they use for simple, stateless services. This is the true power of the Operator Framework: encoding deep, resilient operational knowledge into software that runs as a first-class citizen on your Kubernetes cluster.