Building a Go Kubernetes Operator for Automated Postgres Failover

Goh Ling Yong

The State Management Gap in Kubernetes

Kubernetes provides powerful primitives for managing stateless applications. Deployments, ReplicaSets, and Services offer a robust foundation for scaling and self-healing. However, when it comes to stateful applications like a PostgreSQL primary-replica cluster, these primitives fall short. A StatefulSet can provide stable network identities and persistent storage, but it has no intrinsic knowledge of application-level concepts like primary status, replication lag, or failover procedures.

This is the gap where the Operator Pattern shines. An Operator is an application-specific controller that extends the Kubernetes API to create, configure, and manage instances of complex stateful applications on behalf of a Kubernetes user. It encodes the operational knowledge of a human administrator into software.

This article will guide you through building a simplified yet production-aware PostgreSQL operator using Go and the Kubebuilder framework. Our goal is not to replicate the full functionality of mature operators like Patroni or CrunchyData's PGO, but to dissect the core mechanics of a custom controller responsible for a critical task: automated primary failover. We will focus on:

  • Defining a Custom API (PostgresCluster CRD): Modeling our database cluster as a first-class Kubernetes resource.
  • Implementing the Reconciliation Loop: The core logic that drives the cluster state towards the desired specification.
  • Automated Primary Election & Failover: The critical logic for detecting primary failure and promoting a replica.
  • Handling Production Edge Cases: Addressing idempotency, finalizers for graceful deletion, and preventing split-brain scenarios.

This guide assumes you are comfortable with Go, Kubernetes architecture (pods, services, controllers), and basic PostgreSQL administration concepts.

    1. Defining Our API: The `PostgresCluster` CRD

    The first step in building an operator is to define a new API using a Custom Resource Definition (CRD). This PostgresCluster resource will be our user-facing abstraction. A user will create a PostgresCluster manifest, and our operator will translate that into the underlying Kubernetes resources (StatefulSet, Service, ConfigMap, etc.).

    We'll use Kubebuilder markers (// +kubebuilder:...) in our Go structs to generate the CRD manifest.

    api/v1/postgrescluster_types.go

    go
    package v1
    
    import (
    	corev1 "k8s.io/api/core/v1"
    	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )
    
    // PostgresClusterSpec defines the desired state of PostgresCluster
    type PostgresClusterSpec struct {
	// Replicas is the number of desired PostgreSQL pods. It is a pointer so that
	// an unset field can be distinguished from an explicit value and defaulted to 3.
	// +kubebuilder:validation:Minimum=1
	// +kubebuilder:default=3
	// +optional
	Replicas *int32 `json:"replicas,omitempty"`
    
    	// Image is the container image to use for the PostgreSQL cluster.
    	Image string `json:"image"`
    
    	// Resources required by each PostgreSQL instance.
    	Resources corev1.ResourceRequirements `json:"resources,omitempty"`
    
    	// StorageSpec defines the storage configuration for the PostgreSQL data.
    	StorageSpec corev1.PersistentVolumeClaimSpec `json:"storageSpec"`
    }
    
    // PostgresClusterStatus defines the observed state of PostgresCluster
    type PostgresClusterStatus struct {
    	// ReadyReplicas is the number of pods in the cluster that are in a Ready state.
    	ReadyReplicas int32 `json:"readyReplicas"`
    
    	// CurrentPrimary is the name of the pod that is currently acting as the primary.
    	CurrentPrimary string `json:"currentPrimary,omitempty"`
    
    	// Conditions represent the latest available observations of the PostgresCluster's state.
    	// +optional
    	Conditions []metav1.Condition `json:"conditions,omitempty"`
    }
    
    //+kubebuilder:object:root=true
    //+kubebuilder:subresource:status
    //+kubebuilder:printcolumn:name="Replicas",type="integer",JSONPath=".spec.replicas"
    //+kubebuilder:printcolumn:name="Ready",type="string",JSONPath=".status.readyReplicas"
    //+kubebuilder:printcolumn:name="Primary",type="string",JSONPath=".status.currentPrimary"
    //+kubebuilder:printcolumn:name="Age",type="date",JSONPath=".metadata.creationTimestamp"
    
    // PostgresCluster is the Schema for the postgresclusters API
    type PostgresCluster struct {
    	metav1.TypeMeta   `json:",inline"`
    	metav1.ObjectMeta `json:"metadata,omitempty"`
    
    	Spec   PostgresClusterSpec   `json:"spec,omitempty"`
    	Status PostgresClusterStatus `json:"status,omitempty"`
    }
    
    //+kubebuilder:object:root=true
    
    // PostgresClusterList contains a list of PostgresCluster
    type PostgresClusterList struct {
    	metav1.TypeMeta `json:",inline"`
    	metav1.ListMeta `json:"metadata,omitempty"`
    	Items           []PostgresCluster `json:"items"`
    }
    
    func init() {
    	SchemeBuilder.Register(&PostgresCluster{}, &PostgresClusterList{})
    }
    

    Key design choices here:

  • Spec vs. Status: The spec is the desired state declared by the user. The status is the observed state of the world, written only by our controller. The +kubebuilder:subresource:status marker is critical; it creates a separate /status endpoint for status updates, which keeps controllers and users from fighting over the same object.

  • Conditions: Using a Conditions array is a standard Kubernetes pattern for reporting detailed status. It allows us to communicate states like Progressing, Available, or Degraded with reasons and timestamps.

  • Pointers for Optional Fields: Replicas is a pointer (*int32) so we can distinguish a field the user left unset from one set explicitly, allowing the API server to default it to 3.

    After running make manifests, Kubebuilder generates the CRD YAML. This is the definition you apply to your cluster to make it aware of the PostgresCluster kind.
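
To make the API shape concrete, here is a minimal sketch in Go of the object a user's manifest corresponds to, as an envtest-based test or client-go caller might build it. The name, namespace, image tag, and the k8sClient/ctx variables are placeholders, not part of the operator itself.

go
// A minimal PostgresCluster instance; all values are illustrative only.
replicas := int32(3)
cluster := &dbv1.PostgresCluster{
	ObjectMeta: metav1.ObjectMeta{
		Name:      "my-cluster",
		Namespace: "default",
	},
	Spec: dbv1.PostgresClusterSpec{
		Replicas: &replicas,
		Image:    "postgres:16", // placeholder image tag
		StorageSpec: corev1.PersistentVolumeClaimSpec{
			AccessModes: []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
			// Storage class and size requests omitted for brevity.
		},
	},
}

// Creating it through a controller-runtime client is equivalent to applying the YAML manifest.
if err := k8sClient.Create(ctx, cluster); err != nil {
	// handle the error
}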

    2. The Heart of the Operator: The Reconciliation Loop

    The Reconcile function in controllers/postgrescluster_controller.go is the core of our operator. It's a control loop that is triggered whenever a PostgresCluster resource (or any resource it owns) changes. Its job is to observe the current state and take actions to converge it with the spec.

    A robust reconciliation loop must be idempotent and stateless. It should not remember its last action; instead, it should re-evaluate the state of the world on every invocation and decide what to do next.

    Here's the high-level structure of our Reconcile function:

    go
    // controllers/postgrescluster_controller.go
    
    func (r *PostgresClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    	log := log.FromContext(ctx)
    
    	// 1. Fetch the PostgresCluster instance
    	var pgCluster dbv1.PostgresCluster
    	if err := r.Get(ctx, req.NamespacedName, &pgCluster); err != nil {
    		// If the resource is not found, it might have been deleted. We can ignore this.
    		return ctrl.Result{}, client.IgnoreNotFound(err)
    	}
    
    	// 2. Handle Deletion: Check for a deletion timestamp and run finalizers
    	// ... (We will implement this later)
    
    	// 3. Reconcile the StatefulSet for PostgreSQL pods
    	sts, err := r.reconcileStatefulSet(ctx, &pgCluster)
    	if err != nil {
    		log.Error(err, "Failed to reconcile StatefulSet")
    		// Update status with a degraded condition
    		return ctrl.Result{}, err // Requeue with error
    	}
    
    	// 4. Reconcile the Services (Primary and Replica)
    	err = r.reconcileServices(ctx, &pgCluster)
    	if err != nil {
    		log.Error(err, "Failed to reconcile Services")
    		return ctrl.Result{}, err
    	}
    
    	// 5. Perform Primary Election and Failover Logic
    	// This is the most complex part
    	primaryPodName, err := r.managePrimaryAndReplicas(ctx, &pgCluster, sts)
    	if err != nil {
    		log.Error(err, "Failed to manage primary/replica roles")
    		return ctrl.Result{}, err
    	}
    
    	// 6. Update the Status Subresource
    	err = r.updateClusterStatus(ctx, &pgCluster, sts, primaryPodName)
    	if err != nil {
    		log.Error(err, "Failed to update PostgresCluster status")
    		return ctrl.Result{}, err
    	}
    
    	// Requeue after a short duration to re-check the state
    	return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
    }
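
The updateClusterStatus helper invoked in step 6 is not shown above. A minimal sketch, assuming we only track ready replicas, the current primary, and a single Available condition (meta here is k8s.io/apimachinery/pkg/api/meta):

go
// updateClusterStatus writes the observed state to the /status subresource.
func (r *PostgresClusterReconciler) updateClusterStatus(ctx context.Context, pgCluster *dbv1.PostgresCluster, sts *appsv1.StatefulSet, primaryPodName string) error {
	pgCluster.Status.ReadyReplicas = sts.Status.ReadyReplicas
	pgCluster.Status.CurrentPrimary = primaryPodName

	condition := metav1.Condition{
		Type:               "Available",
		Status:             metav1.ConditionFalse,
		Reason:             "NoPrimary",
		Message:            "no primary has been elected yet",
		ObservedGeneration: pgCluster.Generation,
	}
	if primaryPodName != "" {
		condition.Status = metav1.ConditionTrue
		condition.Reason = "PrimaryElected"
		condition.Message = "primary is " + primaryPodName
	}
	// SetStatusCondition updates LastTransitionTime only when the status actually changes.
	meta.SetStatusCondition(&pgCluster.Status.Conditions, condition)

	// Status().Update touches only the status subresource, thanks to
	// the +kubebuilder:subresource:status marker.
	return r.Status().Update(ctx, pgCluster)
}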

    We break down the logic into distinct, testable functions. Let's look at reconcileStatefulSet.

    Idempotent Resource Creation (reconcileStatefulSet)

    This function ensures a StatefulSet matching our spec exists. It must handle both creation and updates.

    go
    // controllers/postgrescluster_controller.go (helper function)
    
    func (r *PostgresClusterReconciler) reconcileStatefulSet(ctx context.Context, pgCluster *dbv1.PostgresCluster) (*appsv1.StatefulSet, error) {
    	log := log.FromContext(ctx)
    
    	desiredSts := r.defineStatefulSet(pgCluster)
    
    	// Set PostgresCluster instance as the owner and controller
    	if err := controllerutil.SetControllerReference(pgCluster, desiredSts, r.Scheme); err != nil {
    		return nil, err
    	}
    
    	// Check if the StatefulSet already exists
    	foundSts := &appsv1.StatefulSet{}
    	err := r.Get(ctx, types.NamespacedName{Name: desiredSts.Name, Namespace: desiredSts.Namespace}, foundSts)
    	if err != nil && errors.IsNotFound(err) {
    		log.Info("Creating a new StatefulSet", "StatefulSet.Namespace", desiredSts.Namespace, "StatefulSet.Name", desiredSts.Name)
    		err = r.Create(ctx, desiredSts)
    		if err != nil {
    			return nil, err
    		}
    		// StatefulSet created successfully - return the new STS
    		return desiredSts, nil
    	} else if err != nil {
    		return nil, err
    	}
    
    	// StatefulSet already exists - check for updates
    	// A simple check here: if replica count or image has changed, update it.
    	// Production operators use more sophisticated deep-diffing.
    	if *foundSts.Spec.Replicas != *desiredSts.Spec.Replicas || foundSts.Spec.Template.Spec.Containers[0].Image != desiredSts.Spec.Template.Spec.Containers[0].Image {
    		log.Info("Updating StatefulSet", "StatefulSet.Namespace", foundSts.Namespace, "StatefulSet.Name", foundSts.Name)
    		foundSts.Spec.Replicas = desiredSts.Spec.Replicas
    		foundSts.Spec.Template.Spec.Containers[0].Image = desiredSts.Spec.Template.Spec.Containers[0].Image
    		err = r.Update(ctx, foundSts)
    		if err != nil {
    			return nil, err
    		}
    	}
    
    	return foundSts, nil
    }
    
    // defineStatefulSet creates the StatefulSet definition based on the CR spec.
    func (r *PostgresClusterReconciler) defineStatefulSet(pgCluster *dbv1.PostgresCluster) *appsv1.StatefulSet {
        // ... logic to build the StatefulSet object ...
        // This includes setting labels, container spec, volume mounts, and volume claim templates from pgCluster.Spec.StorageSpec
        // Crucially, we add a label selector that we'll use to find our pods.
        labels := map[string]string{"app": "postgres", "cluster": pgCluster.Name}
    
        return &appsv1.StatefulSet{
            // ... metadata ...
            Spec: appsv1.StatefulSetSpec{
                Replicas: pgCluster.Spec.Replicas,
                Selector: &metav1.LabelSelector{MatchLabels: labels},
                Template: corev1.PodTemplateSpec{
                    ObjectMeta: metav1.ObjectMeta{Labels: labels},
                    Spec: corev1.PodSpec{
                        Containers: []corev1.Container{{
                            Name:  "postgres",
                            Image: pgCluster.Spec.Image,
                            // ... ports, env vars, probes, volumeMounts ...
                        }},
                    },
                },
                VolumeClaimTemplates: []corev1.PersistentVolumeClaim{{
                    ObjectMeta: metav1.ObjectMeta{Name: "pgdata"},
                    Spec:       pgCluster.Spec.StorageSpec,
                }},
            },
        }
    }

    controllerutil.SetControllerReference is vital. It establishes an ownership link. When the PostgresCluster CR is deleted, the Kubernetes garbage collector will automatically delete the StatefulSet and its associated Pods.
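
For that ownership link to be useful at runtime, the controller must also watch the resources it creates so that changes to them re-trigger reconciliation. A minimal SetupWithManager sketch for this controller:

go
// SetupWithManager registers the controller and the resources it watches.
func (r *PostgresClusterReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&dbv1.PostgresCluster{}).
		// Owns() watches objects created by this controller and maps their
		// events back to the owning PostgresCluster.
		Owns(&appsv1.StatefulSet{}).
		Owns(&corev1.Service{}).
		Complete(r)
}

With Owns(&appsv1.StatefulSet{}) in place, a drop in the StatefulSet's ready-replica count re-triggers Reconcile without waiting for the 30-second requeue.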

    3. Implementing Automated Failover

    This is where our operator's intelligence lies. We need to implement a state machine within the managePrimaryAndReplicas function. Our strategy will be:

  • Identify Roles: Use a label, postgres-role: primary or postgres-role: replica, on pods to track their current role.
  • Identify Primary Service: A ClusterIP Service will always point to the current primary pod. Applications connect to this stable service endpoint.
  • The State Machine Logic:
    • List all pods belonging to our cluster.
    • Check for an existing primary pod (one with the primary label).
    • Case 1: No Primary Exists (Initial Bootstrap). Elect the pod with ordinal 0 (e.g., my-cluster-0) as the new primary.
    • Case 2: Primary Exists and is Healthy. Ensure all other pods are configured as replicas, and that the primary Service resolves to this pod.
    • Case 3: Primary Exists but is Unhealthy (Failover Trigger). The pod is not in the Ready state. Promote a healthy replica to be the new primary.

    managePrimaryAndReplicas Implementation

    go
    // controllers/postgrescluster_controller.go
    
    const (
    	primaryRoleLabel = "postgres-role"
    	primaryValue     = "primary"
    	replicaValue     = "replica"
    )
    
    func (r *PostgresClusterReconciler) managePrimaryAndReplicas(ctx context.Context, pgCluster *dbv1.PostgresCluster, sts *appsv1.StatefulSet) (string, error) {
    	log := log.FromContext(ctx)
    
    	// 1. List all pods for this cluster
    	podList := &corev1.PodList{}
    	labels := map[string]string{"app": "postgres", "cluster": pgCluster.Name}
    	if err := r.List(ctx, podList, client.InNamespace(pgCluster.Namespace), client.MatchingLabels(labels)); err != nil {
    		return "", err
    	}
    
    	var primaryPod *corev1.Pod
    	var healthyReplicas []*corev1.Pod
    
    	// 2. Categorize pods
    	for i := range podList.Items {
    		pod := &podList.Items[i]
    		role, exists := pod.Labels[primaryRoleLabel]
    
    		if exists && role == primaryValue {
    			primaryPod = pod
    		} else if isPodReady(pod) {
    			healthyReplicas = append(healthyReplicas, pod)
    		}
    	}
    
    	// 3. The State Machine
    	if primaryPod == nil {
    		// Case 1: No primary. Elect a new one.
    		log.Info("No primary found. Electing a new one.")
    		if len(podList.Items) > 0 {
    			// For simplicity, elect the pod with the lowest ordinal index.
    			// A production system would check WAL positions.
    			sort.Slice(podList.Items, func(i, j int) bool {
    				return podList.Items[i].Name < podList.Items[j].Name
    			})
    			newPrimary := &podList.Items[0]
    			if err := r.promotePod(ctx, newPrimary); err != nil {
    				return "", err
    			}
    			return newPrimary.Name, nil
    		} else {
    			log.Info("No pods available to elect a primary.")
    			return "", nil // No pods yet, will reconcile again
    		}
    	} else if !isPodReady(primaryPod) {
    		// Case 3: Primary exists but is unhealthy. Initiate failover.
    		log.Info("Primary is not ready. Initiating failover.", "primaryPod", primaryPod.Name)
    		if len(healthyReplicas) > 0 {
    			// Sort replicas to have a deterministic promotion candidate.
    			sort.Slice(healthyReplicas, func(i, j int) bool {
    				return healthyReplicas[i].Name < healthyReplicas[j].Name
    			})
    			newPrimary := healthyReplicas[0]
    			log.Info("Promoting new primary", "newPrimaryPod", newPrimary.Name)
    
    			// First, demote the old primary to prevent split-brain
    			if err := r.demotePod(ctx, primaryPod); err != nil {
    				log.Error(err, "Failed to demote old primary, retrying...")
    				return "", err
    			}
    
    			if err := r.promotePod(ctx, newPrimary); err != nil {
    				return "", err
    			}
    			return newPrimary.Name, nil
    		} else {
    			log.Error(nil, "Primary is unhealthy, but no healthy replicas are available for failover.")
    			return primaryPod.Name, fmt.Errorf("failover impossible: no healthy replicas")
    		}
    	} else {
    		// Case 2: Primary is healthy.
    		log.V(1).Info("Primary is healthy.", "primaryPod", primaryPod.Name)
    		// In a real operator, we would now ensure all other pods are configured
    		// to replicate from this primary. This would involve exec-ing into pods
    		// or updating a ConfigMap they read from.
    		return primaryPod.Name, nil
    	}
    }
    
    func (r *PostgresClusterReconciler) promotePod(ctx context.Context, pod *corev1.Pod) error {
    	// Update pod label to mark as primary
    	pod.Labels[primaryRoleLabel] = primaryValue
    	if err := r.Update(ctx, pod); err != nil {
    		return err
    	}
    
    	// In a real operator, you would `exec` into the pod and run `pg_ctl promote`
    	// or a similar command for your replication manager.
    	log.FromContext(ctx).Info("Simulating promotion command for pod", "podName", pod.Name)
    
    	return nil
    }
    
    func (r *PostgresClusterReconciler) demotePod(ctx context.Context, pod *corev1.Pod) error {
    	pod.Labels[primaryRoleLabel] = replicaValue
    	return r.Update(ctx, pod)
    }
    
    func isPodReady(pod *corev1.Pod) bool {
    	for _, cond := range pod.Status.Conditions {
    		if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionTrue {
    			return true
    		}
    	}
    	return false
    }
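
The promotePod above only flips a label and logs a message. In a real operator the promotion has to reach PostgreSQL itself, for example by exec-ing into the pod and running pg_ctl promote (or your replication manager's equivalent). The following is a rough sketch using client-go's exec support; it assumes the reconciler also carries a *rest.Config, that the container is named postgres, and the data directory path is a placeholder:

go
// execInPod runs a command inside the pod's "postgres" container.
// Needs: bytes, fmt, k8s.io/client-go/kubernetes, k8s.io/client-go/kubernetes/scheme,
// k8s.io/client-go/rest, k8s.io/client-go/tools/remotecommand.
func execInPod(ctx context.Context, cfg *rest.Config, pod *corev1.Pod, command []string) (string, error) {
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return "", err
	}

	req := clientset.CoreV1().RESTClient().Post().
		Resource("pods").
		Namespace(pod.Namespace).
		Name(pod.Name).
		SubResource("exec").
		VersionedParams(&corev1.PodExecOptions{
			Container: "postgres",
			Command:   command,
			Stdout:    true,
			Stderr:    true,
		}, scheme.ParameterCodec)

	executor, err := remotecommand.NewSPDYExecutor(cfg, "POST", req.URL())
	if err != nil {
		return "", err
	}

	var stdout, stderr bytes.Buffer
	// StreamWithContext requires a recent client-go; older versions use Stream.
	if err := executor.StreamWithContext(ctx, remotecommand.StreamOptions{Stdout: &stdout, Stderr: &stderr}); err != nil {
		return "", fmt.Errorf("exec failed: %w (stderr: %s)", err, stderr.String())
	}
	return stdout.String(), nil
}

// Illustrative call from promotePod, assuming a hypothetical r.RestConfig field:
//   _, err := execInPod(ctx, r.RestConfig, pod, []string{"pg_ctl", "promote", "-D", "/var/lib/postgresql/data"})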
    

The primary Service ties this together: its selector includes postgres-role: primary, so once promotePod and demotePod move that label, client traffic automatically shifts to the newly promoted pod. reconcileServices is responsible for ensuring this Service exists with that selector.
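
As a sketch, the primary Service that reconcileServices maintains might be defined as follows; the service name and port are assumed conventions, while the label keys reuse the constants defined earlier:

go
// definePrimaryService builds the Service that always resolves to the current
// primary by selecting on the postgres-role: primary label.
func (r *PostgresClusterReconciler) definePrimaryService(pgCluster *dbv1.PostgresCluster) *corev1.Service {
	return &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{
			Name:      pgCluster.Name + "-primary",
			Namespace: pgCluster.Namespace,
		},
		Spec: corev1.ServiceSpec{
			// Only the pod currently labelled as primary matches this selector.
			Selector: map[string]string{
				"app":            "postgres",
				"cluster":        pgCluster.Name,
				primaryRoleLabel: primaryValue,
			},
			Ports: []corev1.ServicePort{{
				Name:       "postgres",
				Port:       5432,
				TargetPort: intstr.FromInt(5432), // k8s.io/apimachinery/pkg/util/intstr
			}},
		},
	}
}

A matching replica Service (selecting postgres-role: replica) can serve read-only traffic in the same way.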

    4. Advanced Topic: Finalizers for Graceful Deletion

    What happens when a user runs kubectl delete postgrescluster my-cluster? Without a finalizer, Kubernetes will immediately delete the CR, and the garbage collector will delete the StatefulSet and Pods. The PersistentVolumeClaims (PVCs) will be orphaned (by default). This is dangerous; we might want to perform a backup or a clean shutdown first.

    A finalizer is a key in the metadata.finalizers list of a resource. When present, a kubectl delete command will only set the metadata.deletionTimestamp. The resource is not actually deleted from etcd until our operator removes the finalizer key. This gives us a hook to run pre-delete logic.

    Implementation Steps:

  • Add the Finalizer: In our main Reconcile function, if the object does not have a deletion timestamp, we add our finalizer.

go
        // controllers/postgrescluster_controller.go - inside Reconcile
    
        const postgresClusterFinalizer = "db.example.com/finalizer"
    
        // ... after fetching the pgCluster ...
    
        // Handle deletion
        if pgCluster.ObjectMeta.DeletionTimestamp.IsZero() {
            // The object is not being deleted, so if it does not have our
            // finalizer, add it and update the object.
            if !controllerutil.ContainsFinalizer(&pgCluster, postgresClusterFinalizer) {
                controllerutil.AddFinalizer(&pgCluster, postgresClusterFinalizer)
                if err := r.Update(ctx, &pgCluster); err != nil {
                    return ctrl.Result{}, err
                }
            }
        } else {
            // The object is being deleted
            if controllerutil.ContainsFinalizer(&pgCluster, postgresClusterFinalizer) {
                // Our finalizer is present, so handle any external cleanup first.
                if err := r.cleanupPostgresCluster(ctx, &pgCluster); err != nil {
                    // If cleanup fails, return the error so the request is retried.
                    return ctrl.Result{}, err
                }
    
                // remove our finalizer from the list and update it.
                controllerutil.RemoveFinalizer(&pgCluster, postgresClusterFinalizer)
                if err := r.Update(ctx, &pgCluster); err != nil {
                    return ctrl.Result{}, err
                }
            }
    
            // Stop reconciliation as the item is being deleted
            return ctrl.Result{}, nil
        }
    
        // ... continue with normal reconciliation ...
  • Implement the Cleanup Logic: The cleanupPostgresCluster function contains our pre-delete actions; a sketch of a possible backup Job helper follows after the code.

go
        func (r *PostgresClusterReconciler) cleanupPostgresCluster(ctx context.Context, pgCluster *dbv1.PostgresCluster) error {
            log := log.FromContext(ctx)
            log.Info("Performing cleanup for PostgresCluster")
    
            // This is where you would implement your cleanup logic.
            // For example, triggering a final backup to S3.
            // Or, if you wanted to delete the PVCs, you would list them and delete them here.
    
            // Example: Triggering a backup job (pseudo-code)
            // backupJob := createBackupJob(pgCluster)
            // if err := r.Create(ctx, backupJob); err != nil {
            //     return err
            // }
            // log.Info("Started final backup job.")
            // You would then need logic to wait for the job to complete.
    
            log.Info("Cleanup finished. Finalizer can be removed.")
            return nil
        }
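
As a sketch of the pseudo-code above, a createBackupJob helper could build a batch/v1 Job that dumps the database through the primary Service. The image, command, and credential handling below are placeholders rather than a real backup strategy:

go
// createBackupJob returns a Job that takes a final dump before deletion.
// Sketch only: command, output destination, and credentials are placeholders.
func createBackupJob(pgCluster *dbv1.PostgresCluster) *batchv1.Job { // batchv1 is k8s.io/api/batch/v1
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{
			Name:      pgCluster.Name + "-final-backup",
			Namespace: pgCluster.Namespace,
		},
		Spec: batchv1.JobSpec{
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyOnFailure,
					Containers: []corev1.Container{{
						Name:  "backup",
						Image: pgCluster.Spec.Image, // reuse the postgres image for pg_dumpall
						Command: []string{
							"sh", "-c",
							// Hypothetical: dump via the primary Service; a real job would
							// authenticate and upload the result to durable storage.
							"pg_dumpall -h " + pgCluster.Name + "-primary > /tmp/final-backup.sql",
						},
					}},
				},
			},
		},
	}
}

The reconcile loop would then requeue until the Job reports completion before removing the finalizer, exactly as the comment above suggests.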

    5. Edge Cases and Production Considerations

  • Split-Brain Prevention: This is the most dangerous failure mode in a primary-replica system. It occurs when a network partition causes two nodes to believe they are the primary, leading to data divergence. Our simple failover logic has a potential flaw: what if the old primary was merely partitioned and comes back online?

    Fencing: A robust solution involves fencing the old primary. When promoting a new primary, the operator must ensure the old one can no longer accept writes. This can be done by:

    1. Immediately changing the old pod's labels and restarting it so that it comes back up as a replica.
    2. Using network policies to block client traffic to the old primary.
    3. Force-deleting the pod (gracePeriodSeconds: 0) to terminate it quickly.

    Our approach of demoting the old primary via a label change before promoting the new one is a step in the right direction, but a real system might need stronger guarantees. A minimal sketch of option 3 is shown below.
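
This sketch uses the controller-runtime client's delete options; in practice it would be combined with verification that the old primary is genuinely unreachable before the new one starts accepting writes:

go
// fenceOldPrimary demotes and then force-deletes the old primary pod; the
// StatefulSet recreates it, and it rejoins the cluster as a replica.
func (r *PostgresClusterReconciler) fenceOldPrimary(ctx context.Context, pod *corev1.Pod) error {
	// Demote first so the primary Service selector no longer matches this pod.
	if err := r.demotePod(ctx, pod); err != nil {
		return err
	}
	// GracePeriodSeconds(0) requests immediate termination, shrinking the
	// window in which stray writes could still land on the old primary.
	return client.IgnoreNotFound(r.Delete(ctx, pod, client.GracePeriodSeconds(0)))
}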

  • Operator Leader Election: What if you run multiple replicas of your operator for high availability? You must ensure only one is active at a time to prevent them from fighting over resources. Kubebuilder handles this for you out of the box using a Lease resource in the cluster; it's still worth understanding that this is happening under the hood.

  • Reconciliation Performance: With a large number of PostgresCluster resources, frequent reconciliation can strain the Kubernetes API server. You can optimize this using:

    • Predicates: Filter which events trigger a reconciliation. For example, you might ignore status updates on child resources if nothing in their spec changed; predicate.GenerationChangedPredicate{} is a common choice.
    • Watches: Configure your controller to react only to changes on specific resources. Kubebuilder's Watches(&source.Kind{Type: &appsv1.StatefulSet{}}, ...) (or the Owns() shorthand for owned resources) sets this up, and you can customize it further. A sketch combining both ideas follows below.
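
For example, the SetupWithManager wiring shown earlier could be extended with a predicate so that only spec changes (which bump metadata.generation) on the PostgresCluster trigger reconciles; builder and predicate are the controller-runtime packages of the same names:

go
func (r *PostgresClusterReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		// Ignore events on the PostgresCluster that only touch status or metadata.
		For(&dbv1.PostgresCluster{},
			builder.WithPredicates(predicate.GenerationChangedPredicate{})).
		Owns(&appsv1.StatefulSet{}).
		Owns(&corev1.Service{}).
		Complete(r)
}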

  • Error Handling and Requeueing: The Reconcile function's return value is critical.

    • ctrl.Result{}, nil: Reconciliation was successful; don't requeue immediately.
    • ctrl.Result{Requeue: true}, nil: Success, but run again immediately.
    • ctrl.Result{RequeueAfter: duration}, nil: Success, but run again after a delay.
    • ctrl.Result{}, err: An error occurred; controller-runtime will requeue with exponential backoff.

    Choosing the right return value is key to building a responsive but not overwhelming controller.

    Conclusion

    We have designed and implemented the core of a Kubernetes operator for managing a stateful PostgreSQL cluster. By defining a CRD, we've extended the Kubernetes API with a high-level abstraction. Within the reconciliation loop, we've encoded the critical operational logic for initial setup, primary election, and automated failover.

    This example, while simplified, demonstrates the immense power of the Operator Pattern. It moves beyond static configuration (like Helm charts) to a dynamic, continuously monitored, and self-healing system. The true potential is realized when you add more operational knowledge: automated backups and restores, version upgrades with minimal downtime, performance tuning, and credential management. Building a custom operator is a significant investment, but for complex, mission-critical applications, it is the definitive way to achieve true cloud-native automation on Kubernetes.
