K8s Operator Deep Dive: Zero-Downtime PostgreSQL Failover with Finalizers

Goh Ling Yong

The StatefulSet Paradox: Why Default Controllers Fall Short for Production Databases

Every senior engineer who has managed a stateful workload on Kubernetes has encountered the limitations of the default StatefulSet controller. While it provides stable network identifiers and ordered pod management, it is fundamentally application-agnostic. It can restart a failed PostgreSQL pod, but it has no intrinsic understanding of primary/replica roles, replication lag, or the delicate ballet of a database failover. A StatefulSet alone cannot safely promote a replica to a new primary, reconfigure other replicas to follow it, and update client-facing services without manual intervention or brittle shell scripts.

This is the chasm where custom Kubernetes Operators thrive. An Operator encodes the domain-specific knowledge of a human operator into software. For a PostgreSQL cluster, this means codifying the exact steps required for high-availability: monitoring health, detecting primary failure, selecting the best promotion candidate, executing the promotion, and re-routing traffic.

This article is not an introduction to Operators. We assume you understand CRDs, controllers, and the basic reconciliation loop. Instead, we will build the core components of a production-grade PostgreSQL Operator, focusing on two critical, advanced patterns:

  • Finalizers for Graceful Deletion: How to prevent orphaned PVCs and ensure clean shutdown procedures when a user executes kubectl delete postgrescluster my-db.
  • Custom Failover Logic: Implementing a robust health-checking and promotion mechanism within the reconciliation loop to achieve automated, near-zero-downtime failover.
We will be working with Go and the Operator SDK/controller-runtime library, the de facto standard for building robust operators.

    Defining Our Contract: The `PostgresCluster` CRD

    First, we define the API for our cluster. The PostgresCluster Custom Resource Definition (CRD) is the user-facing interface. Its spec declares the desired state, and its status reports the actual, observed state. Note the specificity in the status fields—this is critical for the operator's decision-making and for user observability.

    yaml
    # config/crd/bases/db.example.com_postgresclusters.yaml
    apiVersion: apiextensions.k8s.io/v1
    kind: CustomResourceDefinition
    metadata:
      name: postgresclusters.db.example.com
    spec:
      group: db.example.com
      names:
        kind: PostgresCluster
        listKind: PostgresClusterList
        plural: postgresclusters
        singular: postgrescluster
      scope: Namespaced
      versions:
        - name: v1alpha1
          served: true
          storage: true
          schema:
            openAPIV3Schema:
              type: object
              properties:
                spec:
                  type: object
                  properties:
                    replicas:
                      type: integer
                      minimum: 1
                      description: The number of PostgreSQL instances in the cluster.
                    image:
                      type: string
                      description: The PostgreSQL container image to use.
                    dbName:
                      type: string
                    dbUser:
                      type: string
                    storage:
                      type: object
                      properties:
                        size:
                          type: string
                          pattern: '^[0-9]+(Gi|Mi|Ki)$'
                  required: ["replicas", "image"]
                status:
                  type: object
                  properties:
                    readyReplicas:
                      type: integer
                    currentPrimary:
                      type: string
                      description: "The pod name of the current primary instance."
                    conditions:
                      type: array
                      items:
                        type: object
                        properties:
                          type:
                            type: string
                          status:
                            type: string
                          lastTransitionTime:
                            type: string
                            format: date-time
                          message:
                            type: string

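    With the CRD installed, a user requests a cluster declaratively. Here is an illustrative manifest against the schema above (the values are examples, not defaults):

    yaml
    apiVersion: db.example.com/v1alpha1
    kind: PostgresCluster
    metadata:
      name: my-db
    spec:
      replicas: 3
      image: postgres:16
      dbName: app
      dbUser: app_user
      storage:
        size: 10Gi
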
    The Core Reconciliation Loop: From Desired State to Reality

    The Reconcile function is the heart of the operator. It's a level-triggered loop that continuously drives the current state of the system towards the desired state defined in the PostgresCluster CR. Our initial implementation will focus on creating and scaling the underlying StatefulSet and Service resources.

    Here's a simplified view of the controller's main reconciliation logic.

    go
    // internal/controller/postgrescluster_controller.go
    
    import (
        "context"
        // ... other imports
    
        appsv1 "k8s.io/api/apps/v1"
        corev1 "k8s.io/api/core/v1"
        "k8s.io/apimachinery/pkg/api/errors"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/types"
        ctrl "sigs.k8s.io/controller-runtime"
        "sigs.k8s.io/controller-runtime/pkg/client"
        "sigs.k8s.io/controller-runtime/pkg/log"
    
        dbv1alpha1 "example.com/postgres-operator/api/v1alpha1" // adjust to your module path
    )
    
    func (r *PostgresClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
        logger := log.FromContext(ctx)
    
        // 1. Fetch the PostgresCluster instance
        cluster := &dbv1alpha1.PostgresCluster{}
        if err := r.Get(ctx, req.NamespacedName, cluster); err != nil {
            if errors.IsNotFound(err) {
                logger.Info("PostgresCluster resource not found. Ignoring since object must be deleted.")
                return ctrl.Result{}, nil
            }
            logger.Error(err, "Failed to get PostgresCluster")
            return ctrl.Result{}, err
        }
    
        // 2. Reconcile the StatefulSet
        sts := &appsv1.StatefulSet{}
        err := r.Get(ctx, types.NamespacedName{Name: cluster.Name, Namespace: cluster.Namespace}, sts)
        if err != nil && errors.IsNotFound(err) {
            // Define a new StatefulSet
            newSts := r.statefulSetForPostgresCluster(cluster)
            logger.Info("Creating a new StatefulSet", "StatefulSet.Namespace", newSts.Namespace, "StatefulSet.Name", newSts.Name)
            if err := r.Create(ctx, newSts); err != nil {
                logger.Error(err, "Failed to create new StatefulSet")
                return ctrl.Result{}, err
            }
            return ctrl.Result{Requeue: true}, nil
        } else if err != nil {
            logger.Error(err, "Failed to get StatefulSet")
            return ctrl.Result{}, err
        }
    
        // 3. Ensure StatefulSet replicas match the spec
        size := *cluster.Spec.Replicas
        if *sts.Spec.Replicas != size {
            sts.Spec.Replicas = &size
            if err := r.Update(ctx, sts); err != nil {
                logger.Error(err, "Failed to scale StatefulSet")
                return ctrl.Result{}, err
            }
            logger.Info("Scaled StatefulSet", "new size", size)
        }
    
        // ... more reconciliation logic for Services, etc. will go here
        // ... failover logic will be added later
    
        return ctrl.Result{}, nil
    }
    
    // Helper function to define the StatefulSet
    func (r *PostgresClusterReconciler) statefulSetForPostgresCluster(c *dbv1alpha1.PostgresCluster) *appsv1.StatefulSet {
        // ... implementation to generate the StatefulSet manifest from the CR spec ...
        // This would define the container spec, volume claim templates, etc.
        // CRITICAL: Ensure the pod management policy is Parallel for failover scenarios.
        return &appsv1.StatefulSet{
            ObjectMeta: metav1.ObjectMeta{
                Name:      c.Name,
                Namespace: c.Namespace,
            },
            Spec: appsv1.StatefulSetSpec{
                PodManagementPolicy: appsv1.ParallelPodManagement,
                // ... rest of the spec
            },
        }
    }

    This is the baseline: it creates and scales a StatefulSet. But what happens on deletion? Right now, deleting the CR will orphan the StatefulSet because we never set an OwnerReference. And even with an OwnerReference, the PVCs created from volumeClaimTemplates survive deletion by default. We need more control.
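
    Wiring up the owner relationship is a one-liner with controller-runtime. A short sketch inside statefulSetForPostgresCluster, assuming the helper is extended to return an error and that r.Scheme is the manager's scheme as in standard SDK scaffolding:

    go
    // Inside statefulSetForPostgresCluster, after building sts:
    // make the CR the controller owner so that deleting the PostgresCluster
    // garbage-collects the StatefulSet, cascading to its pods.
    if err := ctrl.SetControllerReference(c, sts, r.Scheme); err != nil {
        return nil, err
    }
    return sts, nil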

    Advanced Pattern 1: Finalizers for Graceful Deletion

    When a user deletes a PostgresCluster, we don't want Kubernetes to immediately garbage collect it. We need to perform a sequence of cleanup actions: perhaps trigger a final backup, cleanly demote the primary, and ensure all PVCs are handled according to policy. This is the canonical use case for Finalizers.

    A Finalizer is a key in a resource's metadata that tells Kubernetes to wait for a controller to perform cleanup actions before fully deleting the resource. When a user requests deletion, Kubernetes adds a deletionTimestamp to the object but leaves it in the API server. Our operator must detect this timestamp, perform its cleanup, and then remove the finalizer from the object's metadata. Only then will Kubernetes complete the deletion.

    Step 1: Add the Finalizer

    In our reconciliation loop, we first check if our finalizer is present on the object. If not, we add it and update the object.

    go
    // internal/controller/postgrescluster_controller.go
    
    const postgresClusterFinalizer = "db.example.com/finalizer"
    
    func (r *PostgresClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
        logger := log.FromContext(ctx)
        cluster := &dbv1alpha1.PostgresCluster{}
        // ... fetch cluster logic ...
    
        // Check if the object is being deleted
        isClusterMarkedForDeletion := cluster.GetDeletionTimestamp() != nil
        if isClusterMarkedForDeletion {
            if containsString(cluster.GetFinalizers(), postgresClusterFinalizer) {
                // Run our finalization logic. If it fails, we return the error
                // which will cause the reconciliation to be retried.
                if err := r.finalizePostgresCluster(ctx, cluster); err != nil {
                    logger.Error(err, "Failed to finalize cluster")
                    return ctrl.Result{}, err
                }
    
                // Remove the finalizer from the list and update it.
                cluster.SetFinalizers(removeString(cluster.GetFinalizers(), postgresClusterFinalizer))
                if err := r.Update(ctx, cluster); err != nil {
                    return ctrl.Result{}, err
                }
            }
            return ctrl.Result{}, nil
        }
    
        // Add finalizer for this CR if it doesn't exist
        if !containsString(cluster.GetFinalizers(), postgresClusterFinalizer) {
            cluster.SetFinalizers(append(cluster.GetFinalizers(), postgresClusterFinalizer))
            if err := r.Update(ctx, cluster); err != nil {
                return ctrl.Result{}, err
            }
        }
    
        // ... rest of the reconciliation logic (StatefulSet, Service, Failover) ...
    
        return ctrl.Result{}, nil
    }
    
    func (r *PostgresClusterReconciler) finalizePostgresCluster(ctx context.Context, c *dbv1alpha1.PostgresCluster) error {
        logger := log.FromContext(ctx).WithValues("namespace", c.Namespace, "name", c.Name)
    
        // 1. Trigger a final backup (pseudo-code)
        // err := r.BackupClient.TriggerFinalBackup(c.Name)
        // if err != nil { return err }
        logger.Info("Triggered final backup for cluster")
    
        // 2. Scale down the statefulset to 0 to cleanly terminate pods
        sts := &appsv1.StatefulSet{}
        err := r.Get(ctx, types.NamespacedName{Name: c.Name, Namespace: c.Namespace}, sts)
        if err != nil {
            if errors.IsNotFound(err) {
                // If StatefulSet is already gone, we're good.
                logger.Info("StatefulSet already deleted during finalization.")
                return nil
            }
            return err
        }
    
        if sts.Spec.Replicas != nil && *sts.Spec.Replicas > 0 {
            var zero int32 = 0
            sts.Spec.Replicas = &zero
            if err := r.Update(ctx, sts); err != nil {
                return err
            }
            logger.Info("Scaled statefulset to 0 for graceful shutdown.")
            // We might need to wait here until pods are gone, but for simplicity, we assume
            // the next reconcile will handle the rest once the sts is updated.
        }
    
        // 3. Optionally, handle PVCs based on a retention policy in the spec.
        // This is where you would delete PVCs if the policy is 'Delete'.
        logger.Info("Successfully finalized PostgresCluster")
        return nil
    }
    
    // Helper functions for finalizer string slice manipulation
    func containsString(slice []string, s string) bool { /* ... */ }
    func removeString(slice []string, s string) []string { /* ... */ }
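
    Step 3 deserves a sketch of its own. Assuming a hypothetical spec.storage.retentionPolicy field and that the volume claim templates label PVCs with app=<cluster-name>, finalization can sweep the claims with a single DeleteAllOf call:

    go
    // cleanupPVCs deletes the cluster's PVCs when (and only when) the
    // retention policy says so. RetentionPolicy is a hypothetical field
    // for this sketch; "Retain" leaves the claims for manual recovery.
    func (r *PostgresClusterReconciler) cleanupPVCs(ctx context.Context, c *dbv1alpha1.PostgresCluster) error {
        if c.Spec.Storage.RetentionPolicy != "Delete" {
            return nil
        }
        return r.DeleteAllOf(ctx, &corev1.PersistentVolumeClaim{},
            client.InNamespace(c.Namespace),
            client.MatchingLabels{"app": c.Name},
        )
    }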

    This pattern is incredibly powerful. It guarantees your cleanup logic runs before Kubernetes removes the object, preventing resource leaks and enabling safe, stateful shutdowns.

    Advanced Pattern 2: Engineering Automated Failover

    This is where the operator's application-specific intelligence shines. We need to implement a state machine within our reconciliation loop to handle failover.

    The State Machine Logic:

  • Identify Current State: List all pods belonging to the StatefulSet. Connect to each one to determine its role (primary or replica) and health.
  • Compare with Desired State: The desired state is exactly one healthy primary and N-1 healthy replicas.
  • Act:
    * Healthy State: If everything is fine, update the status field of the PostgresCluster CR with the current primary and ready replica count. Do nothing else.
    * Primary Down: If the pod identified as the primary is unhealthy or missing, initiate failover.
    * No Primary: If no primary can be found (e.g., during initial cluster bootstrap), promote the pod with ordinal 0 (<cluster-name>-0).

    Implementation Details

    To implement this, we need a way to query PostgreSQL's state from within the operator.

    go
    // A simplified client to check PostgreSQL status
    
    import (
        "context"
        "database/sql"
        "fmt"
    
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/rest"
    
        _ "github.com/lib/pq" // any PostgreSQL driver works; lib/pq is assumed here
    )
    
    // PostgresClient bundles what we need to talk to the database over SQL
    // and to the pods themselves over exec. The field names are this
    // article's own convention.
    type PostgresClient struct {
        User       string
        Password   string
        Clientset  *kubernetes.Clientset
        RestConfig *rest.Config
    }
    
    // IsPrimary checks if the postgres instance at the given host is a primary.
    // pg_is_in_recovery() returns false on a primary and true on a replica.
    func (pgClient *PostgresClient) IsPrimary(ctx context.Context, podIP string) (bool, error) {
        dsn := fmt.Sprintf("host=%s port=5432 user=%s password=%s sslmode=disable",
            podIP, pgClient.User, pgClient.Password)
        db, err := sql.Open("postgres", dsn)
        if err != nil {
            return false, err
        }
        defer db.Close()
    
        var inRecovery bool
        if err := db.QueryRowContext(ctx, "SELECT pg_is_in_recovery()").Scan(&inRecovery); err != nil {
            return false, err
        }
        return !inRecovery, nil
    }
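
    Promotion is the trickier half: we must exec into the pod. Below is a sketch using client-go's remotecommand package; the container name (postgres) and the data directory are assumptions about the pod spec, and command output is captured rather than streamed:

    go
    import (
        "bytes"
        "context"
    
        corev1 "k8s.io/api/core/v1"
        "k8s.io/client-go/kubernetes/scheme"
        "k8s.io/client-go/tools/remotecommand"
    )
    
    // PromoteReplica executes `pg_ctl promote` inside the target pod by
    // POSTing to the pod's exec subresource via the clientset's REST client.
    func (pgClient *PostgresClient) PromoteReplica(ctx context.Context, podName, namespace string) error {
        command := []string{
            "pg_ctl", "promote",
            "-D", "/var/lib/postgresql/data", // assumes the standard data directory
        }
    
        req := pgClient.Clientset.CoreV1().RESTClient().Post().
            Resource("pods").
            Name(podName).
            Namespace(namespace).
            SubResource("exec").
            VersionedParams(&corev1.PodExecOptions{
                Container: "postgres", // assumed container name
                Command:   command,
                Stdout:    true,
                Stderr:    true,
            }, scheme.ParameterCodec)
    
        exec, err := remotecommand.NewSPDYExecutor(pgClient.RestConfig, "POST", req.URL())
        if err != nil {
            return err
        }
    
        var stdout, stderr bytes.Buffer
        return exec.StreamWithContext(ctx, remotecommand.StreamOptions{
            Stdout: &stdout,
            Stderr: &stderr,
        })
    }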

    Now, let's integrate this into the reconciliation loop.

    go
    // internal/controller/postgrescluster_controller.go
    
    func (r *PostgresClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
        // ... fetch cluster, handle finalizers, reconcile StatefulSet ...
    
        // --- FAILOVER LOGIC --- 
    
        // 1. List all pods for this cluster
        podList := &corev1.PodList{}
        listOpts := []client.ListOption{
            client.InNamespace(cluster.Namespace),
            client.MatchingLabels{"app": cluster.Name}, // Assuming statefulset adds this label
        }
        if err := r.List(ctx, podList, listOpts...); err != nil {
            logger.Error(err, "Failed to list pods")
            return ctrl.Result{}, err
        }
    
        // 2. Identify roles and health
        var primaryPod *corev1.Pod
        var replicaPods []*corev1.Pod // kept for the follow-up step of re-pointing replicas at a new primary
        var potentialPrimaries []*corev1.Pod
    
        for _, pod := range podList.Items {
            p := pod // copy for reference
            if p.Status.Phase != corev1.PodRunning || p.Status.PodIP == "" {
                continue // Skip non-running pods
            }
    
            isPrimary, err := r.PostgresClient.IsPrimary(ctx, p.Status.PodIP)
            if err != nil {
                logger.Info("Could not determine role for pod, assuming unhealthy", "pod", p.Name, "error", err)
                continue // Skip unhealthy pods
            }
    
            if isPrimary {
                primaryPod = &p
            } else {
                replicaPods = append(replicaPods, &p)
            }
            potentialPrimaries = append(potentialPrimaries, &p)
        }
    
        // 3. Act based on state
        if primaryPod == nil {
            // STATE: No Primary Found. Initiate Promotion.
            logger.Info("No primary found. Attempting to promote a new one.")
            if len(potentialPrimaries) == 0 {
                logger.Info("No running pods available to promote.")
                // Requeue to check again later
                return ctrl.Result{RequeueAfter: 15 * time.Second}, nil
            }
    
            // For simplicity, we promote the first available pod. A production operator
            // would check replication lag (e.g., pg_stat_replication) to find the best candidate.
            candidate := potentialPrimaries[0]
            logger.Info("Promoting pod to new primary", "pod", candidate.Name)
            if err := r.PostgresClient.PromoteReplica(ctx, candidate.Name, candidate.Namespace); err != nil {
                logger.Error(err, "Failed to promote replica")
                return ctrl.Result{}, err
            }
    
            // After promotion, we need to update the primary service to point to the new primary
            // and reconfigure other replicas to follow it. This is a critical step.
            // We also update the status and requeue immediately.
            cluster.Status.CurrentPrimary = candidate.Name
            if err := r.Status().Update(ctx, cluster); err != nil {
                if errors.IsConflict(err) {
                    // Someone else updated the object; requeue and retry on fresh state.
                    return ctrl.Result{Requeue: true}, nil
                }
                return ctrl.Result{}, err
            }
            return ctrl.Result{Requeue: true}, nil
        }
    
        // STATE: Primary exists. Ensure the primary Service selects it.
        // StatefulSet pods carry a stable statefulset.kubernetes.io/pod-name
        // label, so the Service can target exactly one pod. The "-primary"
        // Service name is this operator's own convention.
        primarySvc := &corev1.Service{}
        svcKey := types.NamespacedName{Name: cluster.Name + "-primary", Namespace: cluster.Namespace}
        if err := r.Get(ctx, svcKey, primarySvc); err == nil {
            if primarySvc.Spec.Selector["statefulset.kubernetes.io/pod-name"] != primaryPod.Name {
                if primarySvc.Spec.Selector == nil {
                    primarySvc.Spec.Selector = map[string]string{}
                }
                primarySvc.Spec.Selector["statefulset.kubernetes.io/pod-name"] = primaryPod.Name
                if err := r.Update(ctx, primarySvc); err != nil {
                    return ctrl.Result{}, err
                }
            }
        } else if !errors.IsNotFound(err) {
            return ctrl.Result{}, err
        }
    
        // Finally, update status. Count only the pods we actually verified
        // as running and reachable, not everything the List returned.
        cluster.Status.CurrentPrimary = primaryPod.Name
        cluster.Status.ReadyReplicas = int32(len(potentialPrimaries))
        if err := r.Status().Update(ctx, cluster); err != nil {
            if errors.IsConflict(err) {
                return ctrl.Result{Requeue: true}, nil
            }
            return ctrl.Result{}, err
        }
    
        // Requeue periodically to perform health checks
        return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
    }

    Edge Cases and Production Hardening

    What we've built is the core logic, but production systems require handling the messy realities of distributed systems.

  • Split-Brain Scenarios: This is the most dangerous failure mode. A network partition could isolate the primary from the Kubernetes API server. The operator, unable to reach the primary, might promote a new one. When the partition heals, you have two pods accepting writes. Solution: implement fencing. The old primary must be forcefully terminated (kubectl delete pod --grace-period=0) *before* the new primary is promoted, and the operator must hold a lock (e.g., a Lease object in Kubernetes) so that only one operator instance makes failover decisions; a minimal sketch follows below.
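
    Controller-runtime has leader election built into the manager, backed by a coordination.k8s.io Lease. A minimal sketch; the election ID and namespace are illustrative:

    go
    // cmd/main.go (sketch): with leader election enabled, only the replica
    // holding the Lease runs reconciles, so only one operator instance can
    // ever initiate a failover.
    mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
        LeaderElection:          true,
        LeaderElectionID:        "postgres-operator.db.example.com", // illustrative ID
        LeaderElectionNamespace: "postgres-operator-system",         // illustrative namespace
    })
    if err != nil {
        setupLog.Error(err, "unable to start manager")
        os.Exit(1)
    }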

  • Reconciliation Storms: If a pod is flapping, the reconciliation loop might trigger continuously. Solution: When requeueing due to a transient error, use exponential backoff. controller-runtime handles this automatically if you return an error, but for manual requeues (RequeueAfter), you may need to manage this yourself; one option is sketched below.
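
    One way to tune this, sketched here with the caveat that the exact constructor names vary across controller-runtime/client-go versions, is to give the controller's workqueue an exponential failure rate limiter:

    go
    // Exponential backoff for repeatedly failing objects: 1s base, 5m cap.
    // Pass this via WithOptions(...) in SetupWithManager.
    import (
        "time"
    
        "k8s.io/client-go/util/workqueue"
        "sigs.k8s.io/controller-runtime/pkg/controller"
    )
    
    var controllerOpts = controller.Options{
        RateLimiter: workqueue.NewItemExponentialFailureRateLimiter(time.Second, 5*time.Minute),
    }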

  • RBAC and Security: The operator needs significant privileges. It must be able to get, list, watch, and update pods, statefulsets, and services. Critically, it needs the create permission on pods/exec to run commands. This is a powerful permission that must be tightly scoped.

    yaml
        # config/rbac/role.yaml
        apiVersion: rbac.authorization.k8s.io/v1
        kind: ClusterRole
        metadata:
          name: manager-role
        rules:
        - apiGroups: ["db.example.com"]
          resources: ["postgresclusters", "postgresclusters/status", "postgresclusters/finalizers"]
          verbs: ["create", "delete", "get", "list", "patch", "update", "watch"]
        - apiGroups: ["apps"]
          resources: ["statefulsets"]
          verbs: ["create", "delete", "get", "list", "patch", "update", "watch"]
        - apiGroups: [""]
          resources: ["pods", "services", "events", "persistentvolumeclaims"]
          verbs: ["create", "delete", "get", "list", "patch", "update", "watch"]
        - apiGroups: [""]
          resources: ["pods/exec"]
          verbs: ["create"]

  • Performance and API Server Load: An operator that reconciles every 5 seconds for 1000 clusters will overwhelm the API server. Solution: Use intelligent requeue intervals. Use controller-runtime's Owns and Watches builders in your SetupWithManager function so reconciles are only triggered by relevant events; for example, don't reconcile when an unrelated ConfigMap changes. Use predicates to filter out events that don't affect the desired state (e.g., an annotation change on a pod you don't care about).

    go
        // internal/controller/postgrescluster_controller.go
        // (needs "sigs.k8s.io/controller-runtime/pkg/controller" and
        //  "sigs.k8s.io/controller-runtime/pkg/predicate" in the imports)
        func (r *PostgresClusterReconciler) SetupWithManager(mgr ctrl.Manager) error {
            return ctrl.NewControllerManagedBy(mgr).
                For(&dbv1alpha1.PostgresCluster{}).
                Owns(&appsv1.StatefulSet{}).
                Owns(&corev1.Service{}).
                WithOptions(controller.Options{MaxConcurrentReconciles: 5}). // Tune concurrency
                WithEventFilter(predicate.GenerationChangedPredicate{}).     // Ignore status-only updates
                Complete(r)
        }

    Conclusion: From Controller to Operator

    We have moved from a simple controller that manages a StatefulSet to a true Operator with application-specific intelligence. By implementing Finalizers, we guarantee data safety and operational hygiene during resource termination. By engineering custom failover logic directly into the reconciliation loop, we provide a level of automated high availability that is impossible with default Kubernetes primitives.

    Building a production-ready operator is a significant undertaking that requires a deep understanding of both Kubernetes' control plane mechanics and the stateful application's internal workings. The patterns discussed here—Finalizers for lifecycle control and a state-machine-driven reconciliation loop for failover—are foundational building blocks for managing any complex, stateful service on Kubernetes with the robustness and automation that production environments demand.
