Advanced Operator Patterns: StatefulSet Management with Finalizers
The Fragility of Default Deletion in Stateful Systems
In the world of Kubernetes, the default controller patterns excel at managing stateless applications. A Deployment can be deleted, and its ReplicaSet and Pods are garbage collected with minimal consequence. The system is designed for ephemeral workloads. However, when managing stateful applications like databases, caches, or message queues with an Operator, this default behavior is not just insufficient, it is dangerous. A naive `kubectl delete my-database` could trigger a cascading deletion that instantly terminates pods, potentially corrupting data, orphaning PersistentVolumeClaims (PVCs), and leaving the cluster in an inconsistent state.
The core issue is that Kubernetes's garbage collection is unaware of the application-specific logic required for a graceful shutdown. It doesn't know it needs to flush a write-ahead log, quiesce connections, take a final backup, or deregister from a discovery service. This is where the Operator pattern must evolve beyond simple resource creation and reconciliation.
This article dissects an advanced, production-critical pattern: using Finalizers to intercept the deletion of a Custom Resource (CR) and inject stateful, application-aware cleanup logic. We will build an operator for a fictional `ShardDB` database, demonstrating how to manage the lifecycle of its underlying `StatefulSet` with precision, ensuring data safety and resource hygiene.
The Scenario: Managing `ShardDB`
Imagine a `ShardDB` Custom Resource Definition (CRD) that provisions a distributed database. The Operator's basic reconciliation loop creates a `StatefulSet` and a `Service`. When a user deletes the `ShardDB` CR, our goal is to prevent immediate resource deletion and instead execute a controlled shutdown sequence:
- Initiate a backup for each pod's data.
- Scale the `StatefulSet` down to zero replicas, one pod at a time.
- Verify that all backup jobs are complete.
- Clean up any external resources (e.g., monitoring dashboards, DNS entries).
- Delete the `StatefulSet`, the `Service`, and the CR itself.
This controlled demolition is impossible without a mechanism to pause Kubernetes's garbage collector. That mechanism is the Finalizer.
The Finalizer Mechanism: An Operator's Deletion Hook
A Finalizer is simply a string added to an object's `metadata.finalizers` list. When a user requests deletion of an object that has finalizers, Kubernetes does not delete it immediately. Instead, it sets the object's `metadata.deletionTimestamp` to the current time and leaves the object in a Terminating state.
The object remains in this state, fully accessible via the API, until its `metadata.finalizers` list is empty. It is the responsibility of the controller (our Operator) that added the finalizer to perform its cleanup logic and then remove its own finalizer from the list. Once the list is empty and the `deletionTimestamp` is set, Kubernetes completes the deletion.
This provides the exact hook we need. Our Operator's reconciliation loop will now have two primary paths:
* Reconciliation Path: If `deletionTimestamp` is `nil`, execute the normal logic: ensure the `StatefulSet` exists, matches the spec, and update the status.
* Finalization Path: If `deletionTimestamp` is not `nil`, execute the cleanup logic. Once complete, remove the finalizer.
Let's implement this pattern.
Initial Operator Setup and CRD Definition
We assume you have a working Go environment and have initialized an operator project with `operator-sdk init`. We'll define the API for our `ShardDB` resource.
api/v1/sharddb_types.go
This file defines the schema for our `ShardDB` CR. Note the detailed `Spec` and, crucially, the `Status` subresource. The `Status` will be essential for making our cleanup logic idempotent and fault-tolerant.
package v1
import (
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)
// ShardDBSpec defines the desired state of ShardDB
type ShardDBSpec struct {
// Number of desired pods. Defaults to 3.
// +kubebuilder:validation:Minimum=1
// +kubebuilder:default=3
// +optional
Replicas *int32 `json:"replicas,omitempty"`
// The database version. Example: "14.2"
Version string `json:"version"`
// StorageClassName for the PersistentVolumeClaims.
StorageClassName string `json:"storageClassName"`
// Volume size for each replica. Example: "10Gi"
VolumeSize string `json:"volumeSize"`
}
// ShardDBStatus defines the observed state of ShardDB
type ShardDBStatus struct {
// The current state of the database cluster. Can be Creating, Ready, Deleting, Failed.
Phase string `json:"phase,omitempty"`
// Total number of non-terminated pods targeted by this ShardDB's StatefulSet (their labels match the selector).
Replicas int32 `json:"replicas,omitempty"`
// Conditions represent the latest available observations of an object's state.
Conditions []metav1.Condition `json:"conditions,omitempty"`
}
//+kubebuilder:object:root=true
//+kubebuilder:subresource:status
//+kubebuilder:printcolumn:name="Replicas",type="integer",JSONPath=".spec.replicas"
//+kubebuilder:printcolumn:name="Version",type="string",JSONPath=".spec.version"
//+kubebuilder:printcolumn:name="Phase",type="string",JSONPath=".status.phase"
//+kubebuilder:printcolumn:name="Age",type="date",JSONPath=".metadata.creationTimestamp"
// ShardDB is the Schema for the sharddbs API
type ShardDB struct {
metav1.TypeMeta `json:",inline"`
metav1.ObjectMeta `json:"metadata,omitempty"`
Spec ShardDBSpec `json:"spec,omitempty"`
Status ShardDBStatus `json:"status,omitempty"`
}
//+kubebuilder:object:root=true
// ShardDBList contains a list of ShardDB
type ShardDBList struct {
metav1.TypeMeta `json:",inline"`
metav1.ListMeta `json:"metadata,omitempty"`
Items []ShardDB `json:"items"`
}
func init() {
SchemeBuilder.Register(&ShardDB{}, &ShardDBList{})
}
After running `make manifests` and `make install`, this CRD is available in the cluster.
Implementing the Finalizer Logic in the Reconciler
Now we modify the core `Reconcile` function in `controllers/sharddb_controller.go`. The logic will be split based on the presence of the `deletionTimestamp`.
First, let's define our finalizer name as a constant.
const shardDBFinalizer = "db.example.com/finalizer"
Here is the skeleton of the updated `Reconcile` function:
// controllers/sharddb_controller.go
func (r *ShardDBReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
log := log.FromContext(ctx)
// 1. Fetch the ShardDB instance
instance := &dbv1.ShardDB{}
err := r.Get(ctx, req.NamespacedName, instance)
if err != nil {
if errors.IsNotFound(err) {
log.Info("ShardDB resource not found. Ignoring since object must be deleted")
return ctrl.Result{}, nil
}
log.Error(err, "Failed to get ShardDB")
return ctrl.Result{}, err
}
// 2. Check if the instance is marked for deletion
isMarkedForDeletion := instance.GetDeletionTimestamp() != nil
if isMarkedForDeletion {
if controllerutil.ContainsFinalizer(instance, shardDBFinalizer) {
// Run finalization logic. If it fails, requeue.
if err := r.finalizeShardDB(ctx, instance); err != nil {
// Don't remove finalizer if cleanup fails
return ctrl.Result{}, err
}
// Cleanup succeeded, remove the finalizer
controllerutil.RemoveFinalizer(instance, shardDBFinalizer)
err := r.Update(ctx, instance)
if err != nil {
return ctrl.Result{}, err
}
}
// Stop reconciliation as the item is being deleted
return ctrl.Result{}, nil
}
// 3. Add finalizer for this CR if it doesn't exist
if !controllerutil.ContainsFinalizer(instance, shardDBFinalizer) {
log.Info("Adding Finalizer for the ShardDB")
controllerutil.AddFinalizer(instance, shardDBFinalizer)
err = r.Update(ctx, instance)
if err != nil {
return ctrl.Result{}, err
}
}
// 4. Run regular reconciliation logic
// ... (code to create/update StatefulSet, Service, etc.)
// For brevity, this part is omitted but would contain standard operator logic.
log.Info("Running standard reconciliation for ShardDB")
return ctrl.Result{}, nil
}
// finalizeShardDB performs the actual cleanup logic.
func (r *ShardDBReconciler) finalizeShardDB(ctx context.Context, db *dbv1.ShardDB) error {
log := log.FromContext(ctx)
log.Info("Starting finalization for ShardDB")
// Here we will implement our multi-step, idempotent cleanup process.
// For now, we'll just log a message.
log.Info("Simulating backup and resource cleanup...")
time.Sleep(5 * time.Second) // Simulate a long-running task (don't block Reconcile like this in production)
log.Info("Finalization for ShardDB completed successfully")
return nil
}
This structure correctly handles the finalizer lifecycle:
* On the first reconciliation of a new CR, we add our finalizer. The `Update` call triggers a new reconciliation.
* On every reconciliation, we check for a `deletionTimestamp`. If it's present, we divert to our `finalizeShardDB` function.
* The `finalizeShardDB` function contains our critical cleanup logic. We'll flesh this out next.
* Once `finalizeShardDB` returns `nil` (success), we remove the finalizer and update the CR. This is the signal to Kubernetes to proceed with deletion.
Building an Idempotent, Multi-Stage Finalizer
The simple `finalizeShardDB` above is not production-ready. An operator can crash and restart at any point. If our cleanup involves multiple steps (e.g., back up pod 0, then pod 1, then pod 2), we need to track our progress. The CR's `Status` subresource is the perfect place for this.
Let's refine our `ShardDBStatus` and the finalizer logic to be robust.
First, we add more detail to the status:
// api/v1/sharddb_types.go
// The ShardDBStatus struct, extended with finalization-tracking fields:
type ShardDBStatus struct {
Phase string `json:"phase,omitempty"`
Replicas int32 `json:"replicas,omitempty"`
// New fields for finalization tracking
// +optional
FinalizationStatus string `json:"finalizationStatus,omitempty"`
// +optional
LastBackupAttempt *metav1.Time `json:"lastBackupAttempt,omitempty"`
Conditions []metav1.Condition `json:"conditions,omitempty"`
}
Now, we build a more sophisticated `finalizeShardDB` function.
// controllers/sharddb_controller.go
const (
finalizationStateBackupsStarted = "BackupsStarted"
finalizationStateScalingDown = "ScalingDown"
finalizationStateComplete = "Complete"
)
func (r *ShardDBReconciler) finalizeShardDB(ctx context.Context, db *dbv1.ShardDB) error {
log := log.FromContext(ctx)
// Update status to indicate deletion is in progress
if db.Status.Phase != "Deleting" {
db.Status.Phase = "Deleting"
if err := r.Status().Update(ctx, db); err != nil {
return err
}
}
// Step 1: Perform Backups
if db.Status.FinalizationStatus != finalizationStateBackupsStarted &&
db.Status.FinalizationStatus != finalizationStateScalingDown &&
db.Status.FinalizationStatus != finalizationStateComplete {
log.Info("Starting backup finalization step")
// This function would trigger backup jobs for each pod.
// It should be idempotent.
// For this example, we'll simulate it and update status.
err := r.triggerBackups(ctx, db)
if err != nil {
log.Error(err, "Backup step failed")
// You might want to update status with an error condition here
return err // Requeue
}
db.Status.FinalizationStatus = finalizationStateBackupsStarted
if err := r.Status().Update(ctx, db); err != nil {
return err
}
// Requeue to check backup job status
return fmt.Errorf("requeuing to check backup status")
}
// Step 2: Check backup status and scale down
if db.Status.FinalizationStatus == finalizationStateBackupsStarted {
log.Info("Checking backup job status")
backupsComplete, err := r.areBackupsComplete(ctx, db)
if err != nil {
return err
}
if !backupsComplete {
log.Info("Backups not yet complete, requeueing")
// Using an error to requeue is a common pattern for polling
return fmt.Errorf("requeuing: backups still in progress")
}
log.Info("Backups complete. Scaling down StatefulSet")
db.Status.FinalizationStatus = finalizationStateScalingDown
if err := r.Status().Update(ctx, db); err != nil {
return err
}
}
// Step 3: Perform scale down and wait for completion
if db.Status.FinalizationStatus == finalizationStateScalingDown {
sts := &appsv1.StatefulSet{}
err := r.Get(ctx, types.NamespacedName{Name: db.Name, Namespace: db.Namespace}, sts)
if err != nil && !errors.IsNotFound(err) {
return err
}
if !errors.IsNotFound(err) {
if sts.Spec.Replicas != nil && *sts.Spec.Replicas != 0 {
log.Info("Setting StatefulSet replicas to 0")
zeroReplicas := int32(0)
sts.Spec.Replicas = &zeroReplicas
if err := r.Update(ctx, sts); err != nil {
return err
}
}
if sts.Status.Replicas != 0 {
log.Info("Waiting for StatefulSet pods to terminate")
return fmt.Errorf("requeuing: waiting for pods to terminate")
}
}
log.Info("StatefulSet scaled down successfully")
db.Status.FinalizationStatus = finalizationStateComplete
if err := r.Status().Update(ctx, db); err != nil {
return err
}
}
log.Info("All finalization steps complete.")
return nil // Success! The finalizer can now be removed.
}
// triggerBackups is a placeholder for actual backup logic
func (r *ShardDBReconciler) triggerBackups(ctx context.Context, db *dbv1.ShardDB) error {
log := log.FromContext(ctx)
log.Info("Triggering backup jobs for all ShardDB pods... (simulation)")
// In a real implementation, you would create batchv1.Job objects for each PVC.
return nil
}
// areBackupsComplete is a placeholder for checking backup status
func (r *ShardDBReconciler) areBackupsComplete(ctx context.Context, db *dbv1.ShardDB) (bool, error) {
log := log.FromContext(ctx)
log.Info("Checking backup job status... (simulation)")
// In a real implementation, you would list Jobs with a specific label selector
// and check their .status.succeeded count.
return true, nil // Simulate immediate success for the example
}
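The two placeholders above gloss over the real work. Below is a minimal sketch of what they might contain, under several assumptions that are not part of the original: a hypothetical backup image (`example.com/sharddb-backup:latest`), a `volumeClaimTemplates` entry named `data` (so each pod's PVC is `data-<name>-<ordinal>`), and illustrative labels. In practice you may prefer to run the backup tool inside the database pod itself, since an RWO volume usually cannot be mounted by a second pod on another node.

```go
// Sketch only: one backup Job per ShardDB replica, plus a completion check.
// Image, labels, and PVC naming are illustrative assumptions.
package controllers

import (
	"context"
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"

	dbv1 "my-operator/api/v1"
)

func replicasOrDefault(db *dbv1.ShardDB) int32 {
	if db.Spec.Replicas != nil {
		return *db.Spec.Replicas
	}
	return 3
}

func (r *ShardDBReconciler) triggerBackups(ctx context.Context, db *dbv1.ShardDB) error {
	for i := int32(0); i < replicasOrDefault(db); i++ {
		job := &batchv1.Job{
			ObjectMeta: metav1.ObjectMeta{
				Name:      fmt.Sprintf("%s-backup-%d", db.Name, i),
				Namespace: db.Namespace,
				Labels:    map[string]string{"app.kubernetes.io/instance": db.Name, "db.example.com/backup": "true"},
			},
			Spec: batchv1.JobSpec{
				Template: corev1.PodTemplateSpec{
					Spec: corev1.PodSpec{
						RestartPolicy: corev1.RestartPolicyOnFailure,
						Containers: []corev1.Container{{
							Name:         "backup",
							Image:        "example.com/sharddb-backup:latest", // hypothetical image
							VolumeMounts: []corev1.VolumeMount{{Name: "data", MountPath: "/data"}},
						}},
						Volumes: []corev1.Volume{{
							Name: "data",
							VolumeSource: corev1.VolumeSource{
								PersistentVolumeClaim: &corev1.PersistentVolumeClaimVolumeSource{
									// Assumes the StatefulSet's volumeClaimTemplate is named "data".
									ClaimName: fmt.Sprintf("data-%s-%d", db.Name, i),
								},
							},
						}},
					},
				},
			},
		}
		// Creating an existing Job returns AlreadyExists, which keeps this idempotent.
		if err := r.Create(ctx, job); err != nil && !errors.IsAlreadyExists(err) {
			return err
		}
	}
	return nil
}

func (r *ShardDBReconciler) areBackupsComplete(ctx context.Context, db *dbv1.ShardDB) (bool, error) {
	jobs := &batchv1.JobList{}
	if err := r.List(ctx, jobs,
		client.InNamespace(db.Namespace),
		client.MatchingLabels{"app.kubernetes.io/instance": db.Name, "db.example.com/backup": "true"},
	); err != nil {
		return false, err
	}
	if int32(len(jobs.Items)) < replicasOrDefault(db) {
		return false, nil
	}
	for _, j := range jobs.Items {
		// A Job is done once at least one of its pods has succeeded.
		if j.Status.Succeeded < 1 {
			return false, nil
		}
	}
	return true, nil
}
```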
The refactored `finalizeShardDB` is far more robust:
* State Machine: The `FinalizationStatus` field creates a simple state machine. If the operator crashes during the BackupsStarted phase, it resumes from that point on the next reconciliation, not from the beginning.
* Idempotency: The logic checks the current state before acting. It won't re-trigger backups if they've already been started.
* Polling via Requeue: Instead of blocking, we check the status of long-running operations (backups, pod termination) and return a temporary error (e.g., `fmt.Errorf("requeuing...")`). This tells controller-runtime to requeue the request, effectively polling without tying up a worker goroutine. A cleaner variant that returns `ctrl.Result{RequeueAfter: ...}` with a nil error is sketched after this list and used in the complete example later.
* Status Updates: The status subresource is updated at each stage, providing excellent observability for users running `kubectl describe sharddb`.
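Returning a synthetic error works, but it shows up in logs and error metrics as a failure. A minimal sketch of the alternative, polling with `RequeueAfter` and a nil error (`waitForBackups` is a hypothetical helper used only for illustration):

```go
// Sketch only: polling via RequeueAfter instead of a synthetic error. A nil
// error combined with a non-zero RequeueAfter re-runs Reconcile later without
// recording a failure.
package controllers

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"

	dbv1 "my-operator/api/v1"
)

func (r *ShardDBReconciler) waitForBackups(ctx context.Context, db *dbv1.ShardDB) (ctrl.Result, error) {
	complete, err := r.areBackupsComplete(ctx, db)
	if err != nil {
		// A genuine failure: the returned error requeues with exponential backoff.
		return ctrl.Result{}, err
	}
	if !complete {
		// Not a failure, just not finished yet: check again in 30 seconds.
		return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
	}
	return ctrl.Result{}, nil
}
```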
Edge Cases and Production Hardening
Senior engineers know that handling the happy path is only half the battle. Here are critical edge cases to consider.
Stuck Finalizers
Problem: What happens if a bug in `finalizeShardDB` causes it to return an error indefinitely? Or if the operator is down? The `ShardDB` CR will be stuck in the Terminating state forever, and `kubectl delete` will hang.
Solution:
* Observability: Emit metrics (e.g., a `finalizer_failures` counter) and alerts. An alert should fire if a CR has been in a Terminating state for an excessive period (e.g., > 1 hour). A sketch of wiring such a metric into the operator follows this list.
* Manual override: As a last resort, an administrator can remove the finalizer by hand: `kubectl patch sharddb <name> -p '{"metadata":{"finalizers":[]}}' --type=merge`. This is a destructive operation and should only be performed after manually verifying that the underlying resources have been cleaned up.
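One way to get that visibility, sketched below, is to register a custom counter with controller-runtime's Prometheus registry; the metric name and labels are illustrative, not part of any standard.

```go
// Sketch only: a custom failure counter exposed through the manager's
// existing /metrics endpoint. The metric name and labels are illustrative.
package controllers

import (
	"github.com/prometheus/client_golang/prometheus"
	"sigs.k8s.io/controller-runtime/pkg/metrics"
)

var finalizerFailures = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "sharddb_finalizer_failures_total",
		Help: "Number of failed finalization attempts per ShardDB object.",
	},
	[]string{"namespace", "name"},
)

func init() {
	// metrics.Registry is the Prometheus registry that controller-runtime serves.
	metrics.Registry.MustRegister(finalizerFailures)
}
```

In `Reconcile`, increment it when finalization fails, e.g. `finalizerFailures.WithLabelValues(req.Namespace, req.Name).Inc()`, and alert both on its rate and on any CR whose `deletionTimestamp` is older than your threshold.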
Controller Concurrency and Leader Election
Problem: The Operator SDK's manager can be configured with `MaxConcurrentReconciles > 1`. While controller-runtime ensures that the `Reconcile` function for a single object instance (e.g., `default/my-db`) is never run concurrently, reconciliations for different objects (`default/db1`, `default/db2`) can run in parallel.
If your finalizer logic interacts with a shared, external system (e.g., a central backup repository, a corporate asset database), you could face race conditions.
Solution:
* For actions scoped to a single CR, no extra locking is needed.
* For actions that affect the entire cluster or an external system, you may need to implement your own locking mechanism. Kubernetes provides the `coordination.k8s.io/v1` API (Leases) for this purpose. You can create a `Lease` object and have your finalizer logic attempt to acquire it before performing a global action; a minimal sketch follows this list.
* The operator manager itself uses a leader-election Lease to ensure only one instance of the operator is active. This is sufficient for most cases, but be mindful of concurrency if you have multiple workers and shared external state.
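For the Lease-based locking mentioned above, a best-effort sketch is shown below. The lease name and 60-second expiry are illustrative assumptions, and renewal/release logic is deliberately omitted.

```go
// Sketch only: a best-effort Lease lock guarding a cluster-wide cleanup action.
package controllers

import (
	"context"
	"time"

	coordinationv1 "k8s.io/api/coordination/v1"
	"k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
)

// tryAcquireCleanupLease reports whether this operator instance now holds the lock.
func (r *ShardDBReconciler) tryAcquireCleanupLease(ctx context.Context, ns, identity string) (bool, error) {
	const leaseName = "sharddb-global-cleanup-lock"
	holder := identity
	duration := int32(60)
	now := metav1.MicroTime{Time: time.Now()}

	lease := &coordinationv1.Lease{}
	err := r.Get(ctx, types.NamespacedName{Namespace: ns, Name: leaseName}, lease)
	if errors.IsNotFound(err) {
		// No lock yet: create one naming us as the holder.
		lease = &coordinationv1.Lease{
			ObjectMeta: metav1.ObjectMeta{Name: leaseName, Namespace: ns},
			Spec: coordinationv1.LeaseSpec{
				HolderIdentity:       &holder,
				LeaseDurationSeconds: &duration,
				AcquireTime:          &now,
				RenewTime:            &now,
			},
		}
		if err := r.Create(ctx, lease); err != nil {
			if errors.IsAlreadyExists(err) {
				return false, nil // another instance won the race
			}
			return false, err
		}
		return true, nil
	}
	if err != nil {
		return false, err
	}
	// Someone else holds a lease that has not expired yet: back off.
	if lease.Spec.HolderIdentity != nil && *lease.Spec.HolderIdentity != identity &&
		lease.Spec.RenewTime != nil && time.Since(lease.Spec.RenewTime.Time) < time.Duration(duration)*time.Second {
		return false, nil
	}
	// Expired or already ours: take it over (optimistic concurrency guards this Update).
	lease.Spec.HolderIdentity = &holder
	lease.Spec.RenewTime = &now
	if err := r.Update(ctx, lease); err != nil {
		return false, err
	}
	return true, nil
}
```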
Finalizer Race Conditions
Problem: What if a user edits the CR to remove the finalizer while your operator is in the middle of a long cleanup operation? The operator might finish its cleanup, try to remove its finalizer (which is already gone), and Kubernetes might have already deleted the object, leading to errors.
Solution:
* The controller-runtime client is designed to handle this. When you call `r.Update(ctx, instance)` or `r.Status().Update(ctx, instance)`, it uses the object's `resourceVersion` for optimistic concurrency control. If another actor has modified the object since you fetched it, the update fails with a conflict error. The manager then requeues the request, causing your `Reconcile` function to run again with the latest version of the object. Your logic should be written to gracefully handle this re-execution; a sketch of retrying a conflicted status update follows this list.
* Always re-fetch the object at the start of your reconciliation loop to ensure you're working with the most recent state.
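When a conflict only needs to be retried within the current reconciliation (for example, a small status mutation), client-go's `retry.RetryOnConflict` helper is a common alternative to waiting for the next requeue. A minimal sketch, with `setPhase` as a hypothetical helper:

```go
// Sketch only: retrying a conflicted status update within one reconciliation.
package controllers

import (
	"context"

	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/util/retry"

	dbv1 "my-operator/api/v1"
)

func (r *ShardDBReconciler) setPhase(ctx context.Context, key types.NamespacedName, phase string) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		// Re-fetch on every attempt so the update carries the latest resourceVersion.
		db := &dbv1.ShardDB{}
		if err := r.Get(ctx, key, db); err != nil {
			return err
		}
		db.Status.Phase = phase
		return r.Status().Update(ctx, db)
	})
}
```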
Complete Code Example
Here is a more complete, production-oriented `sharddb_controller.go` that ties everything together.
package controllers
import (
"context"
"fmt"
"time"
appsv1 "k8s.io/api/apps/v1"
"k8s.io/apimachinery/pkg/api/errors"
"k8s.io/apimachinery/pkg/runtime"
"k8s.io/apimachinery/pkg/types"
ctrl "sigs.k8s.io/controller-runtime"
"sigs.k8s.io/controller-runtime/pkg/client"
"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
"sigs.k8s.io/controller-runtime/pkg/log"
dbv1 "my-operator/api/v1"
)
const (
shardDBFinalizer = "db.example.com/finalizer"
finalizationStateBackupsStarted = "BackupsStarted"
finalizationStateScalingDown = "ScalingDown"
finalizationStateComplete = "Complete"
)
// ShardDBReconciler reconciles a ShardDB object
type ShardDBReconciler struct {
client.Client
Scheme *runtime.Scheme
}
//+kubebuilder:rbac:groups=db.example.com,resources=sharddbs,verbs=get;list;watch;create;update;patch;delete
//+kubebuilder:rbac:groups=db.example.com,resources=sharddbs/status,verbs=get;update;patch
//+kubebuilder:rbac:groups=db.example.com,resources=sharddbs/finalizers,verbs=update
//+kubebuilder:rbac:groups=apps,resources=statefulsets,verbs=get;list;watch;create;update;patch;delete
func (r *ShardDBReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
log := log.FromContext(ctx)
instance := &dbv1.ShardDB{}
if err := r.Get(ctx, req.NamespacedName, instance); err != nil {
if errors.IsNotFound(err) {
return ctrl.Result{}, nil
}
return ctrl.Result{}, err
}
isMarkedForDeletion := instance.GetDeletionTimestamp() != nil
if isMarkedForDeletion {
if controllerutil.ContainsFinalizer(instance, shardDBFinalizer) {
log.Info("Handling deletion for ShardDB")
result, err := r.finalizeShardDB(ctx, instance)
if err != nil {
log.Error(err, "Finalization failed, requeueing")
// A non-nil error requeues with exponential backoff; a Result returned alongside it is ignored.
return ctrl.Result{}, err
}
if result.Requeue || result.RequeueAfter > 0 {
return result, nil
}
log.Info("Finalization complete, removing finalizer")
controllerutil.RemoveFinalizer(instance, shardDBFinalizer)
if err := r.Update(ctx, instance); err != nil {
return ctrl.Result{}, err
}
}
return ctrl.Result{}, nil
}
if !controllerutil.ContainsFinalizer(instance, shardDBFinalizer) {
log.Info("Adding finalizer to ShardDB")
controllerutil.AddFinalizer(instance, shardDBFinalizer)
if err := r.Update(ctx, instance); err != nil {
return ctrl.Result{}, err
}
}
// Regular reconciliation logic goes here.
// Ensure StatefulSet exists and matches spec.
// Update status with current replica count, etc.
return ctrl.Result{}, nil
}
func (r *ShardDBReconciler) finalizeShardDB(ctx context.Context, db *dbv1.ShardDB) (ctrl.Result, error) {
log := log.FromContext(ctx)
if db.Status.Phase != "Deleting" {
db.Status.Phase = "Deleting"
if err := r.Status().Update(ctx, db); err != nil {
return ctrl.Result{}, err
}
}
switch db.Status.FinalizationStatus {
case "":
log.Info("Finalization step: Triggering backups")
if err := r.triggerBackups(ctx, db); err != nil {
return ctrl.Result{}, fmt.Errorf("failed to trigger backups: %w", err)
}
db.Status.FinalizationStatus = finalizationStateBackupsStarted
if err := r.Status().Update(ctx, db); err != nil {
return ctrl.Result{}, err
}
return ctrl.Result{Requeue: true, RequeueAfter: 15 * time.Second}, nil
case finalizationStateBackupsStarted:
log.Info("Finalization step: Checking backup status")
complete, err := r.areBackupsComplete(ctx, db)
if err != nil {
return ctrl.Result{}, fmt.Errorf("failed to check backup status: %w", err)
}
if !complete {
log.Info("Backups not yet complete")
return ctrl.Result{Requeue: true, RequeueAfter: 30 * time.Second}, nil
}
log.Info("Backups complete")
db.Status.FinalizationStatus = finalizationStateScalingDown
if err := r.Status().Update(ctx, db); err != nil {
return ctrl.Result{}, err
}
return ctrl.Result{Requeue: true}, nil
case finalizationStateScalingDown:
log.Info("Finalization step: Scaling down StatefulSet")
sts := &appsv1.StatefulSet{}
err := r.Get(ctx, types.NamespacedName{Name: db.Name, Namespace: db.Namespace}, sts)
if err != nil {
if errors.IsNotFound(err) {
log.Info("StatefulSet already deleted")
db.Status.FinalizationStatus = finalizationStateComplete
return ctrl.Result{}, r.Status().Update(ctx, db)
}
return ctrl.Result{}, err
}
if sts.Spec.Replicas != nil && *sts.Spec.Replicas > 0 {
log.Info("Setting StatefulSet replicas to 0")
zeroReplicas := int32(0)
sts.Spec.Replicas = &zeroReplicas
if err := r.Update(ctx, sts); err != nil {
return ctrl.Result{}, err
}
}
if sts.Status.Replicas > 0 { // count all pods, including terminating ones; ReadyReplicas drops to 0 too early
log.Info("Waiting for StatefulSet pods to terminate", "replicas", sts.Status.Replicas)
return ctrl.Result{Requeue: true, RequeueAfter: 15 * time.Second}, nil
}
log.Info("StatefulSet successfully scaled down")
db.Status.FinalizationStatus = finalizationStateComplete
if err := r.Status().Update(ctx, db); err != nil {
return ctrl.Result{}, err
}
return ctrl.Result{Requeue: true}, nil
case finalizationStateComplete:
log.Info("Finalization complete")
return ctrl.Result{}, nil
default:
return ctrl.Result{}, fmt.Errorf("unknown finalization state: %s", db.Status.FinalizationStatus)
}
}
// Stubs for brevity; see the backup Job sketch earlier in the article for what these might contain.
func (r *ShardDBReconciler) triggerBackups(ctx context.Context, db *dbv1.ShardDB) error { return nil }
func (r *ShardDBReconciler) areBackupsComplete(ctx context.Context, db *dbv1.ShardDB) (bool, error) { return true, nil }
func (r *ShardDBReconciler) SetupWithManager(mgr ctrl.Manager) error {
return ctrl.NewControllerManagedBy(mgr).
For(&dbv1.ShardDB{}).
Owns(&appsv1.StatefulSet{}).
Complete(r)
}
Conclusion: From Controller to Guardian
Implementing a finalizer transforms an Operator from a simple resource provisioner into a true guardian of your stateful application. It elevates the Operator's role to manage the full lifecycle, including the most critical and often overlooked phase: deletion. By intercepting the deletion process, leveraging the status subresource for idempotency, and carefully handling edge cases, you can build production-grade operators that provide the safety and reliability required for running critical stateful workloads on Kubernetes. This pattern is not just a best practice; for stateful systems, it is an absolute necessity.