Custom Kubernetes Operators for Stateful Application Management

20 min read
Goh Ling Yong
Technology enthusiast and software architect specializing in AI-driven development tools and modern software engineering practices. Passionate about the intersection of artificial intelligence and human creativity in building tomorrow's digital solutions.

The Stateful Application Impedance Mismatch in Kubernetes

Kubernetes excels at managing stateless applications. Its core primitives—Deployments, ReplicaSets, and Services—are designed for fungible workloads that can be terminated, scaled, and replaced without data loss. However, stateful applications like databases (PostgreSQL, Cassandra), message queues (Kafka), or distributed caches (Redis) present a fundamental challenge. Their identity, data, and network endpoints are not ephemeral. A naive kubectl delete on a database pod can be catastrophic.

While StatefulSets provide foundational blocks like stable network identities and persistent storage, they only solve part of the problem. They don't understand the application's internal logic: primary/replica roles, backup procedures, version upgrades, or cluster bootstrapping. This gap between Kubernetes' generic primitives and application-specific operational knowledge is where the Operator Pattern becomes indispensable.

This article is not an introduction to operators. We assume you understand the purpose of Custom Resource Definitions (CRDs) and the basic concept of a controller. Instead, we will build a production-grade PostgresCluster operator from the ground up, focusing on the advanced patterns and edge cases that are critical for real-world deployments.

Our goal is to create an operator that manages a primary-replica PostgreSQL cluster, handling:

* Desired State Management: Defining cluster size, PostgreSQL version, and storage via a PostgresCluster CRD.

* Lifecycle Automation: Automatically provisioning a primary StatefulSet, replica Deployments, and necessary Services and Secrets.

* Graceful Deletion: Ensuring that deleting a PostgresCluster resource also cleans up its associated PVCs and other artifacts using finalizers.

* Robust Status Reporting: Providing clear, accurate status updates on the CR itself.

* Performance and Idempotency: Designing a reconciliation loop that is efficient and resilient to failures.


Designing the `PostgresCluster` API (The CRD)

Everything starts with a well-designed API. Our CRD defines the schema for our PostgresCluster custom resource. A production-ready CRD goes beyond simple fields; it includes validation, status subresources, and versioning.

Here is our v1alpha1 PostgresCluster CRD. Note the advanced validation rules and the definition of a status subresource.

yaml
# config/crd/bases/db.my-domain.com_postgresclusters.yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: postgresclusters.db.my-domain.com
spec:
  group: db.my-domain.com
  names:
    kind: PostgresCluster
    listKind: PostgresClusterList
    plural: postgresclusters
    singular: postgrescluster
  scope: Namespaced
  versions:
    - name: v1alpha1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                replicas:
                  type: integer
                  minimum: 1
                  description: The number of PostgreSQL instances in the cluster (1 primary + N-1 replicas).
                postgresVersion:
                  type: string
                  pattern: "^1[3-6]\\.[0-9]+$"
                  description: The PostgreSQL version in major.minor form (e.g., "15.2"). Major versions 13-16 are accepted.
                storage:
                  type: object
                  properties:
                    size:
                      type: string
                      pattern: '^[1-9][0-9]*([EPTGMK]i|[EPTGMk])?$'
                      description: The size of the persistent volume, e.g., 10Gi.
                    storageClassName:
                      type: string
                      description: The storage class to use for the persistent volume.
                  required:
                    - size
                    - storageClassName
              required:
                - replicas
                - postgresVersion
                - storage
            status:
              type: object
              properties:
                phase:
                  type: string
                  enum: ["Creating", "Ready", "Failed"]
                primaryHost:
                  type: string
                replicaHosts:
                  type: array
                  items:
                    type: string
                conditions:
                  type: array
                  items:
                    type: object
                    properties:
                      type:
                        type: string
                      status:
                        type: string
                        enum: ["True", "False", "Unknown"]
                      lastTransitionTime:
                        type: string
                        format: date-time
                      reason:
                        type: string
                      message:
                        type: string
                    required:
                      - type
                      - status
      # This enables the /status subresource, a critical pattern.
      subresources:
        status: {}

Key Production Considerations:

  • OpenAPIv3 Validation: We use minimum, pattern, and required to enforce schema correctness at the API server level. This prevents invalid configurations from ever entering the system. For example, a user cannot create a cluster with 0 replicas or an invalid storage size format.
  • Status Subresource (subresources: {status: {}}): This is non-negotiable for production operators. It creates a separate /status endpoint for the CR. This prevents actors with only update permissions on the main resource from modifying the status, which should be the sole domain of the controller. It also enables more fine-grained RBAC and helps with optimistic concurrency control.
  • Versioning: We start with v1alpha1. As the API matures, we can introduce v1beta1 and v1 and provide conversion webhooks to migrate between versions, a crucial feature for long-term maintainability.
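
In an Operator SDK / kubebuilder project you usually don't write this YAML by hand; it is generated from Go types annotated with validation markers. The following is a minimal sketch of what the corresponding api/v1alpha1/postgrescluster_types.go could look like. The field names mirror the CRD above, but the exact marker set is an assumption, and the List type and generated deepcopy code are omitted.

go
// api/v1alpha1/postgrescluster_types.go (sketch)
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

type StorageSpec struct {
	// +kubebuilder:validation:Pattern=`^[1-9][0-9]*([EPTGMK]i|[EPTGMk])?$`
	Size             string `json:"size"`
	StorageClassName string `json:"storageClassName"`
}

type PostgresClusterSpec struct {
	// +kubebuilder:validation:Minimum=1
	Replicas int32 `json:"replicas"`

	// +kubebuilder:validation:Pattern=`^1[3-6]\.[0-9]+$`
	PostgresVersion string `json:"postgresVersion"`

	Storage StorageSpec `json:"storage"`
}

type PostgresClusterStatus struct {
	// +kubebuilder:validation:Enum=Creating;Ready;Failed
	Phase        string             `json:"phase,omitempty"`
	PrimaryHost  string             `json:"primaryHost,omitempty"`
	ReplicaHosts []string           `json:"replicaHosts,omitempty"`
	Conditions   []metav1.Condition `json:"conditions,omitempty"`
}

// +kubebuilder:object:root=true
// +kubebuilder:subresource:status
type PostgresCluster struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   PostgresClusterSpec   `json:"spec,omitempty"`
	Status PostgresClusterStatus `json:"status,omitempty"`
}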

    The Heart of the Operator: A Resilient Reconciliation Loop

    The Reconcile function is the core of any operator. It is a state machine that is invoked whenever the PostgresCluster CR or any of its owned resources change. Its sole purpose is to observe the current state of the cluster and take action to drive it toward the desired state defined in the CR's spec.

    Our reconciler, written in Go using the Operator SDK (based on controller-runtime), will follow this idempotent logic:

  1. Fetch the PostgresCluster instance.
  2. Handle Deletion: Check for a deletionTimestamp. If present, run cleanup logic (using a finalizer). This is a critical advanced pattern we'll detail later.
  3. Ensure Finalizer: If the resource is not being deleted, ensure our finalizer is present.
  4. Observe Current State: Check for the existence and state of owned resources (Secrets, Services, StatefulSet, Deployments).
  5. Reconcile Sub-resources: For each sub-resource, create it if it doesn't exist; if it exists but is misconfigured (e.g., wrong image tag), update it.
  6. Update Status: After all actions, observe the new state of the system and update the PostgresCluster.status subresource.

    Here is the skeleton of our Reconcile function in controllers/postgrescluster_controller.go:

    go
    // controllers/postgrescluster_controller.go
    
    import (
    	"context"

    	appsv1 "k8s.io/api/apps/v1"
    	corev1 "k8s.io/api/core/v1"
    	"k8s.io/apimachinery/pkg/api/errors"
    	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    	"k8s.io/apimachinery/pkg/runtime"
    	"k8s.io/apimachinery/pkg/types"
    	ctrl "sigs.k8s.io/controller-runtime"
    	"sigs.k8s.io/controller-runtime/pkg/client"
    	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
    	"sigs.k8s.io/controller-runtime/pkg/log"

    	dbv1alpha1 "github.com/my-org/postgres-operator/api/v1alpha1"
    )
    
    // A constant for our finalizer
    const postgresClusterFinalizer = "db.my-domain.com/finalizer"
    
    // PostgresClusterReconciler reconciles a PostgresCluster object
    type PostgresClusterReconciler struct {
    	client.Client
    	Scheme *runtime.Scheme
    }
    
    func (r *PostgresClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    	logger := log.FromContext(ctx)
    
    	// 1. Fetch the PostgresCluster instance
    	cluster := &dbv1alpha1.PostgresCluster{}
    	if err := r.Get(ctx, req.NamespacedName, cluster); err != nil {
    		if errors.IsNotFound(err) {
    			// Request object not found, could have been deleted after reconcile request.
    			// Owned objects are automatically garbage collected. For additional cleanup logic use finalizers.
    			// Return and don't requeue
    			logger.Info("PostgresCluster resource not found. Ignoring since object must be deleted")
    			return ctrl.Result{}, nil
    		}
    		// Error reading the object - requeue the request.
    		logger.Error(err, "Failed to get PostgresCluster")
    		return ctrl.Result{}, err
    	}
    
    	// 2. Handle Deletion with Finalizers (more on this later)
    	isMarkedForDeletion := cluster.GetDeletionTimestamp() != nil
    	if isMarkedForDeletion {
    		if controllerutil.ContainsFinalizer(cluster, postgresClusterFinalizer) {
    			// Run our finalization logic. If it fails, we return the error which will cause a requeue.
    			if err := r.finalizePostgresCluster(ctx, cluster); err != nil {
    				return ctrl.Result{}, err
    			}
    
    			// Remove finalizer. Once all finalizers are removed, the object will be deleted.
    			controllerutil.RemoveFinalizer(cluster, postgresClusterFinalizer)
    			err := r.Update(ctx, cluster)
    			if err != nil {
    				return ctrl.Result{}, err
    			}
    		}
    		return ctrl.Result{}, nil
    	}
    
    	// 3. Ensure Finalizer is present
    	if !controllerutil.ContainsFinalizer(cluster, postgresClusterFinalizer) {
    		controllerutil.AddFinalizer(cluster, postgresClusterFinalizer)
    		if err := r.Update(ctx, cluster); err != nil {
    			return ctrl.Result{}, err
    		}
    	}
    
    	// 4. & 5. Reconcile owned resources
        // This is where the core business logic resides.
        // Example: Reconcile the primary StatefulSet
        sts := &appsv1.StatefulSet{}
        err := r.Get(ctx, types.NamespacedName{Name: cluster.Name + "-primary", Namespace: cluster.Namespace}, sts)
        if err != nil && errors.IsNotFound(err) {
            // Define and create the StatefulSet
            stsToCreate := r.statefulSetForPrimary(cluster)
            logger.Info("Creating a new Primary StatefulSet", "StatefulSet.Namespace", stsToCreate.Namespace, "StatefulSet.Name", stsToCreate.Name)
            if err = r.Create(ctx, stsToCreate); err != nil {
                logger.Error(err, "Failed to create new Primary StatefulSet")
                return ctrl.Result{}, err
            }
            // StatefulSet created successfully - return and requeue to check status later
            return ctrl.Result{Requeue: true}, nil
        } else if err != nil {
            logger.Error(err, "Failed to get Primary StatefulSet")
            return ctrl.Result{}, err
        }
        // ... logic to update the StatefulSet if it's out of sync ...
    
    	// ... reconcile services, secrets, replica deployments etc. ...
    
    	// 6. Update status
        // This is a complex operation that requires careful implementation.
        // We'll dive into this in a dedicated section.
    
    	return ctrl.Result{}, nil
    }
    
    // A helper function to define the desired StatefulSet
    func (r *PostgresClusterReconciler) statefulSetForPrimary(c *dbv1alpha1.PostgresCluster) *appsv1.StatefulSet {
    	sts := &appsv1.StatefulSet{
    		ObjectMeta: metav1.ObjectMeta{Name: c.Name + "-primary", Namespace: c.Namespace},
    	}
    	// ... logic to build the rest of the StatefulSet spec (pod template, volumeClaimTemplates, ...) ...

    	// CRITICAL: Set the owner reference so Kubernetes garbage collection works
    	// and so our reconciler gets triggered on changes to the StatefulSet.
    	// In production code, propagate this error instead of discarding it.
    	_ = ctrl.SetControllerReference(c, sts, r.Scheme)
    	return sts
    }
    
    // SetupWithManager sets up the controller with the Manager.
    func (r *PostgresClusterReconciler) SetupWithManager(mgr ctrl.Manager) error {
    	return ctrl.NewControllerManagedBy(mgr).
    		For(&dbv1alpha1.PostgresCluster{}).
            // Watch for changes to secondary resources and requeue the owner
    		Owns(&appsv1.StatefulSet{}).
    		Owns(&appsv1.Deployment{}).
    		Owns(&corev1.Service{}).
    		Complete(r)
    }

    Idempotency in Practice: Notice the Get call pattern. We always check if a resource IsNotFound before creating it. We don't just blindly issue a Create command. If the resource already exists, we proceed to check if it needs an update. This ensures that if the operator restarts mid-reconciliation, it can pick up where it left off without creating duplicate resources or erroring out.
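
    If you prefer not to hand-roll the get/create/update branching for every owned object, controller-runtime's controllerutil.CreateOrUpdate helper wraps the same idempotent pattern in a single call. Below is a minimal sketch for the primary StatefulSet, reusing the cluster and logger variables from the Reconcile skeleton above; the mutate function only hints at the fields you would actually manage.

    go
    // Sketch: idempotent reconciliation of the primary StatefulSet via CreateOrUpdate.
    sts := &appsv1.StatefulSet{
    	ObjectMeta: metav1.ObjectMeta{Name: cluster.Name + "-primary", Namespace: cluster.Namespace},
    }
    op, err := controllerutil.CreateOrUpdate(ctx, r.Client, sts, func() error {
    	// Set only the fields this operator owns (replicas, image tag, labels, ...).
    	// When the object already exists, this function sees its current state.
    	// ... populate sts.Spec here ...
    	return ctrl.SetControllerReference(cluster, sts, r.Scheme)
    })
    if err != nil {
    	return ctrl.Result{}, err
    }
    logger.Info("Reconciled primary StatefulSet", "operation", op)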

    Error Handling: When a Kubernetes API call fails (e.g., r.Create), we return the error. The controller-runtime manager will automatically requeue the request with exponential backoff, preventing us from hammering the API server in a tight loop during transient failures.
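
    Not every requeue should be driven by an error. When the reconciler is simply waiting for something that produces no watch event (for example, an external backup finishing), returning a result with RequeueAfter schedules a later re-check without reporting a failure. A brief sketch of both paths inside the Reconcile function above (variable names match the skeleton; assumes the time package is imported, and the 30-second interval is arbitrary):

    go
    // Transient API failure: return the error and let controller-runtime
    // requeue with exponential backoff.
    if err := r.Create(ctx, stsToCreate); err != nil {
    	return ctrl.Result{}, err
    }

    // Waiting on state that produces no watch event: schedule a re-check
    // without treating it as an error.
    if cluster.Status.PrimaryHost == "" {
    	return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
    }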


    Advanced Pattern: Finalizers for Graceful Deletion

    What happens when a user runs kubectl delete postgrescluster my-db? By default, Kubernetes deletes the PostgresCluster object, and its owned objects (like the StatefulSet and Deployments) are garbage collected. But what about the PersistentVolumeClaims (PVCs)? By default, they are not deleted, to prevent data loss. What if you need to perform external cleanup, like de-provisioning a cloud database user or taking a final backup?

    This is the problem that finalizers solve. A finalizer is a key in a resource's metadata that tells Kubernetes to wait for a controller to perform cleanup before fully deleting the resource.

    Our implementation has two parts:

  • Adding the Finalizer: When our reconciler first sees a new PostgresCluster CR, it adds our custom finalizer (db.my-domain.com/finalizer) to its metadata.finalizers list and updates the object.
  • Handling the Deletion Timestamp: When a user requests deletion, the Kubernetes API server doesn't delete the object immediately. Instead, it sets the metadata.deletionTimestamp field. Our reconciler detects this and:

    * Executes its cleanup logic. In our case, this could mean scaling down replicas, running a backup job, and explicitly deleting the PVCs associated with the cluster.

    * Removes its finalizer from the metadata.finalizers list and updates the object, but only once the cleanup has succeeded.

    * Once the finalizers list is empty and the deletionTimestamp is set, the Kubernetes garbage collector deletes the object for good.

    Here is the concrete implementation of our cleanup logic:

    go
    // controllers/postgrescluster_controller.go
    
    func (r *PostgresClusterReconciler) finalizePostgresCluster(ctx context.Context, c *dbv1alpha1.PostgresCluster) error {
    	logger := log.FromContext(ctx)
    
    	// In a real-world scenario, you might trigger a backup job here.
    	logger.Info("Starting finalization logic for PostgresCluster", "name", c.Name)
    
    	// Example: Delete associated PVCs. This is a destructive action and should be used with care.
    	// It's often better to leave PVCs for manual cleanup unless explicitly configured otherwise.
    	pvcList := &corev1.PersistentVolumeClaimList{}
    	listOpts := []client.ListOption{
    		client.InNamespace(c.Namespace),
    		client.MatchingLabels{"cluster": c.Name}, // Assuming our PVCs have this label
    	}
    
    	if err := r.List(ctx, pvcList, listOpts...); err != nil {
    		logger.Error(err, "Failed to list PVCs for finalization")
    		return err
    	}
    
    	for _, pvc := range pvcList.Items {
    		logger.Info("Deleting PVC during finalization", "pvcName", pvc.Name)
    		if err := r.Delete(ctx, &pvc); err != nil {
    			// Ignore NotFound errors: the PVC may already have been deleted in a previous reconciliation attempt.
    			if !errors.IsNotFound(err) {
    				logger.Error(err, "Failed to delete PVC", "pvcName", pvc.Name)
    				return err
    			}
    		}
    	}
    
    	logger.Info("PostgresCluster finalization complete", "name", c.Name)
    	return nil
    }

    This pattern is absolutely essential for any operator managing resources with external dependencies or persistent data.


    Managing the Status Subresource Effectively

    The status subresource is the operator's primary feedback mechanism. It informs users and other automation about the state of the resource it manages. A poorly managed status field leads to confusion and makes debugging impossible.

    Best Practices for Status Updates:

  • Never Modify Spec: The reconciler should treat the spec as read-only. Its job is to make reality match the spec, not the other way around.
  • Read-Modify-Write: Always fetch the latest version of the CR before updating its status to avoid race conditions where another controller (or a previous reconciliation loop) has updated it.
  • Use Conditions: The conditions pattern is a Kubernetes standard. It provides a structured way to report the status of various aspects of the resource. Common condition types include Available, Progressing, and Degraded.

    Let's implement a status update at the end of our Reconcile function. This involves observing the real state of our StatefulSet and Deployment pods and reflecting that in the PostgresCluster status.

    go
    // In the Reconcile function, after reconciling all sub-resources...
    
    // 6. Update status
    
    // Fetch the primary pod from the StatefulSet
    primaryPods := &corev1.PodList{}
    listOpts := []client.ListOption{
        client.InNamespace(cluster.Namespace),
        client.MatchingLabels{"app": cluster.Name + "-primary"}, // Label from StatefulSet pod template
    }
    if err := r.List(ctx, primaryPods, listOpts...); err != nil {
        logger.Error(err, "Failed to list primary pods")
        return ctrl.Result{}, err
    }
    
    // In a real operator, you'd check pod conditions (Ready, etc.)
    if len(primaryPods.Items) > 0 {
        primaryPod := primaryPods.Items[0]
        cluster.Status.PrimaryHost = primaryPod.Status.PodIP
    } else {
        cluster.Status.PrimaryHost = ""
    }
    
    // ... similar logic to get replica hosts ...
    
    // Determine the overall phase
    // This is a simplified example. A real implementation would be more robust.
    if cluster.Status.PrimaryHost != "" {
        cluster.Status.Phase = "Ready"
    } else {
        cluster.Status.Phase = "Creating"
    }
    
    // Use the status subresource client to update the status
    // This prevents race conditions with updates to the main object spec
    if err := r.Status().Update(ctx, cluster); err != nil {
        logger.Error(err, "Failed to update PostgresCluster status")
        return ctrl.Result{}, err
    }
    
    return ctrl.Result{}, nil

    Using r.Status().Update() instead of r.Update() is crucial. It specifically targets the /status subresource, ensuring our status changes don't conflict with any concurrent changes to the object's spec or metadata.
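
    The conditions array deserves the same care. The apimachinery condition helpers do the bookkeeping for you, only bumping lastTransitionTime when a condition's status actually changes. Here is a minimal sketch, assuming Status.Conditions is a []metav1.Condition, that k8s.io/apimachinery/pkg/api/meta is imported as meta, and that the "Available" condition type is our own naming convention:

    go
    // Sketch: keep an "Available" condition in sync before calling r.Status().Update().
    available := metav1.Condition{
        Type:    "Available",
        Status:  metav1.ConditionFalse,
        Reason:  "PrimaryNotReady",
        Message: "The primary pod is not running yet",
    }
    if cluster.Status.PrimaryHost != "" {
        available.Status = metav1.ConditionTrue
        available.Reason = "PrimaryReady"
        available.Message = "The primary pod is serving traffic"
    }
    // SetStatusCondition only updates lastTransitionTime when the status value changes.
    meta.SetStatusCondition(&cluster.Status.Conditions, available)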


    Performance and Scalability Considerations

    A naive operator can place significant load on the Kubernetes API server. For an operator that might manage hundreds or thousands of CRs, performance is a production requirement.

    1. Controller-Runtime Caching:

    The client.Client that the manager injects into our reconciler is a caching client. It reads from a local cache that is kept in sync with the API server via Watch events, so most Get and List calls never hit the API server, which is a massive performance win.

    * Edge Case: The cache can be slightly stale. If you need to read the absolute latest state of an object immediately after writing it (which is rare), you can use a non-caching client via mgr.GetAPIReader().
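
    For those rare cases, one option is to give the reconciler a second, non-caching reader alongside the cached client. A sketch of how that could be wired; the APIReader field is our own addition, not part of the scaffolded struct:

    go
    // Sketch: a reconciler variant that carries a direct (non-caching) reader.
    type PostgresClusterReconciler struct {
    	client.Client                 // cached client, set to mgr.GetClient() in main.go
    	APIReader client.Reader       // assumption: extra field, set to mgr.GetAPIReader() in main.go
    	Scheme    *runtime.Scheme
    }

    // A read that must bypass the cache, e.g. immediately after a write:
    //   fresh := &dbv1alpha1.PostgresCluster{}
    //   err := r.APIReader.Get(ctx, req.NamespacedName, fresh)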

    2. Watch Predicates:

    By default, your Reconcile function is triggered for any change to a watched resource. If a label is added to a StatefulSet by another system, it will trigger a reconciliation of the parent PostgresCluster. This is often wasteful.

    We can use predicates to filter these events. A common strategy is to only trigger reconciliation when the generation of an object changes. The generation is a number incremented by the API server only when the object's spec is modified.

    go
    // controllers/postgrescluster_controller.go
    
    import (
        // ...
        "sigs.k8s.io/controller-runtime/pkg/predicate"
    )
    
    func (r *PostgresClusterReconciler) SetupWithManager(mgr ctrl.Manager) error {
    	return ctrl.NewControllerManagedBy(mgr).
    		For(&dbv1alpha1.PostgresCluster{}).
    		Owns(&appsv1.StatefulSet{}).
            // ... other owned resources
            // Use a predicate to ignore status updates on owned resources
    		WithEventFilter(predicate.GenerationChangedPredicate{}).
    		Complete(r)
    }

    This simple addition can dramatically reduce the number of unnecessary reconciliation loops, saving CPU cycles on both the operator and the API server. Be aware that WithEventFilter applies the predicate to every watched type, including the PostgresCluster itself, so metadata-only changes such as label or annotation edits will no longer trigger reconciliation; attach predicates per watch (builder.WithPredicates) if you need finer-grained control.

    3. Leader Election:

    For high availability, you should run multiple replicas of your operator, but only one instance should actively reconcile a given PostgresCluster at any time to prevent race conditions. The Operator SDK scaffolding wires this up through the manager's leader election options: a Lease object in Kubernetes ensures that only one pod (the leader) runs the controllers, while the other pods remain on standby, ready to take over if the leader fails.
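
    A minimal sketch of how this is switched on when constructing the manager in main.go (scheme and setupLog are the variables from the SDK-scaffolded main.go; the election ID just needs to be a stable string that is unique to this operator):

    go
    // main.go (sketch): enable leader election so only one replica runs the controllers.
    mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
    	Scheme:           scheme,
    	LeaderElection:   true,
    	LeaderElectionID: "postgres-operator.db.my-domain.com",
    })
    if err != nil {
    	setupLog.Error(err, "unable to start manager")
    	os.Exit(1)
    }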


    Conclusion: From Controller to Production Operator

    We have moved far beyond a simple controller that just creates a resource. By implementing robust reconciliation logic, finalizers for graceful deletion, meticulous status management, and performance optimizations, we have built the foundation of a production-grade Kubernetes operator.

    The key takeaway is that an operator's value lies in encoding deep, domain-specific operational knowledge. It's not just about creating pods; it's about understanding the entire lifecycle of a complex stateful application and managing it automatically and reliably. The patterns discussed here—idempotency, finalizers, status subresources, and performance tuning—are the building blocks that enable operators to become trusted, autonomous members of your Kubernetes cluster, turning complex manual tasks into declarative, automated workflows.
