Stateful Workload Orchestration with Custom Kubernetes Operators

Goh Ling Yong

The Inadequacy of Standard Primitives for Complex State

For senior engineers operating in a Kubernetes environment, the limitations of default workload APIs like Deployments and even StatefulSets become apparent when managing truly complex stateful systems. A StatefulSet provides stable network identities and persistent storage, which are foundational. However, it lacks the application-specific domain knowledge required for operations like coordinated cluster bootstrapping, graceful node decommissioning, managed topology changes, and automated failover.

Consider a distributed database like etcd, Cassandra, or a custom-built replicated system. Simply scaling down a StatefulSet via kubectl scale is a recipe for disaster. Which pod do you remove? The leader? A replica with unsynced data? How do you inform the rest of the cluster that a member is leaving permanently? This operational logic—the sequence of API calls, health checks, and state transitions—typically resides in runbooks or bespoke automation scripts external to Kubernetes.

The Operator pattern addresses this by extending the Kubernetes API itself. By creating a Custom Resource Definition (CRD) and a corresponding controller (the Operator), we can encapsulate this complex operational logic into a declarative, Kubernetes-native resource. A developer can then simply declare apiVersion: database.mycorp.com/v1alpha1, kind: ReplicatedKVStore, spec: { replicas: 5 }, and the Operator handles the intricate orchestration required to make that desired state a reality.
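
Concretely, a user-facing manifest for this resource might look like the following minimal sketch (the field values are illustrative; the full schema is defined in the next section):

yaml
apiVersion: database.mycorp.com/v1alpha1
kind: ReplicatedKVStore
metadata:
  name: my-kv-store
spec:
  replicas: 5
  version: "1.2.3"
  storageClassName: fast-ssd
  volumeSize: 10Gi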

This article will dissect the construction of such an Operator in Go using the controller-runtime library, focusing on the patterns essential for production-grade stateful management.

Defining the API: The Custom Resource Definition (CRD)

Everything starts with a well-defined API. Our CRD is the public contract for our stateful application. We'll design a CRD for a hypothetical ReplicatedKVStore. The spec defines the desired state, and the status reflects the observed state, which the Operator is responsible for updating.

replicatedkvstore_crd.yaml

yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: replicatedkvstores.database.mycorp.com
spec:
  group: database.mycorp.com
  names:
    kind: ReplicatedKVStore
    listKind: ReplicatedKVStoreList
    plural: replicatedkvstores
    singular: replicatedkvstore
  scope: Namespaced
  versions:
    - name: v1alpha1
      served: true
      storage: true
      subresources:
        status: {} # enables updates via the /status subresource
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                replicas:
                  type: integer
                  minimum: 1
                  description: The desired number of replicas in the key-value store cluster.
                version:
                  type: string
                  description: The container image version to deploy.
                storageClassName:
                  type: string
                  description: The storage class for persistent volumes.
                volumeSize:
                  type: string
                  pattern: '^[0-9]+(Gi|Mi|Ki)$'
                  description: The size of the persistent volume for each replica (e.g., 10Gi).
              required:
                - replicas
                - version
                - storageClassName
                - volumeSize
            status:
              type: object
              properties:
                phase:
                  type: string
                  description: The current phase of the cluster (e.g., Creating, Ready, Failed).
                readyReplicas:
                  type: integer
                  description: The number of replicas that are fully initialized and ready.
                currentVersion:
                  type: string
                  description: The currently deployed version of the cluster.
                conditions:
                  type: array
                  items:
                    type: object
                    properties:
                      type:
                        type: string
                      status:
                        type: string
                      lastTransitionTime:
                        type: string
                        format: date-time
                      message:
                        type: string

This CRD defines our ReplicatedKVStore resource. The spec is what the user controls. The status is what our Operator controls. The status.conditions field is a best practice, providing detailed, observable state transitions for users and other automation.
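
To illustrate the conditions convention, here is a small helper sketch, assuming the generated Go type declares Status.Conditions as []metav1.Condition (note that metav1.Condition also carries reason and observedGeneration fields, which a production schema would add). The meta.SetStatusCondition helper only bumps LastTransitionTime when the condition's status actually changes; the caller would then persist the result with r.Status().Update.

go
import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// setReadyCondition inserts or updates a Ready condition in the CR's
// status. The condition type and reason strings here are illustrative
// assumptions, not part of the article's API.
func setReadyCondition(conds *[]metav1.Condition, ready bool, msg string) {
	status, reason := metav1.ConditionTrue, "AllReplicasReady"
	if !ready {
		status, reason = metav1.ConditionFalse, "ReplicasUnavailable"
	}
	meta.SetStatusCondition(conds, metav1.Condition{
		Type:    "Ready",
		Status:  status,
		Reason:  reason,
		Message: msg,
	})
}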

The Core of the Operator: The Reconciliation Loop

Using a tool like Kubebuilder, we can scaffold the Go project for our Operator. The heart of the implementation lies within the Reconcile method of our controller. This method is invoked by the controller-runtime manager whenever a ReplicatedKVStore resource is created, updated, or deleted, or when a resource it owns (like a StatefulSet) is modified.

The reconciliation logic is idempotent. It compares the desired state (from the CR's spec) with the actual state of the cluster and takes action to converge the two.

Here is a skeleton of the Reconcile function:

internal/controller/replicatedkvstore_controller.go

go
package controller

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/types"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/log"

	databasev1alpha1 "my-operator/api/v1alpha1"
)

// ReplicatedKVStoreReconciler reconciles a ReplicatedKVStore object
type ReplicatedKVStoreReconciler struct {
	client.Client
	Scheme *runtime.Scheme
}

func (r *ReplicatedKVStoreReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	logger := log.FromContext(ctx)

	// 1. Fetch the ReplicatedKVStore instance
	instance := &databasev1alpha1.ReplicatedKVStore{}
	err := r.Get(ctx, req.NamespacedName, instance)
	if err != nil {
		if errors.IsNotFound(err) {
			// Object was deleted. The reconciliation is complete.
			logger.Info("ReplicatedKVStore resource not found. Ignoring since object must be deleted")
			return ctrl.Result{}, nil
		}
		// Error reading the object - requeue the request.
		logger.Error(err, "Failed to get ReplicatedKVStore")
		return ctrl.Result{}, err
	}

	// *** CORE RECONCILIATION LOGIC GOES HERE ***

	// 2. Check for underlying StatefulSet
	foundSts := &appsv1.StatefulSet{}
	err = r.Get(ctx, types.NamespacedName{Name: instance.Name, Namespace: instance.Namespace}, foundSts)
	if err != nil && errors.IsNotFound(err) {
		// Define and create a new StatefulSet
		sts := r.statefulSetForKVStore(instance)
		logger.Info("Creating a new StatefulSet", "StatefulSet.Namespace", sts.Namespace, "StatefulSet.Name", sts.Name)
		err = r.Create(ctx, sts)
		if err != nil {
			logger.Error(err, "Failed to create new StatefulSet", "StatefulSet.Namespace", sts.Namespace, "StatefulSet.Name", sts.Name)
			return ctrl.Result{}, err
		}
		// StatefulSet created successfully - return and requeue
		return ctrl.Result{Requeue: true}, nil
	} else if err != nil {
		logger.Error(err, "Failed to get StatefulSet")
		return ctrl.Result{}, err
	}

	// 3. Ensure StatefulSet spec matches the CR spec
	// Example: Check replica count
	desiredReplicas := instance.Spec.Replicas
	if *foundSts.Spec.Replicas != desiredReplicas {
		foundSts.Spec.Replicas = &desiredReplicas
		err = r.Update(ctx, foundSts)
		if err != nil {
			logger.Error(err, "Failed to update StatefulSet replicas")
			return ctrl.Result{}, err
		}
		// Spec updated, requeue to check status after update
		return ctrl.Result{Requeue: true}, nil
	}

	// 4. Update the status of the ReplicatedKVStore resource
	// ... logic to check pod readiness and update instance.Status ...

	return ctrl.Result{}, nil
}

// statefulSetForKVStore returns a ReplicatedKVStore StatefulSet object
func (r *ReplicatedKVStoreReconciler) statefulSetForKVStore(cr *databasev1alpha1.ReplicatedKVStore) *appsv1.StatefulSet {
	// ... Logic to construct the StatefulSet YAML programmatically ...
	// This is where you define the container spec, volumes, probes, etc.
	// It's CRITICAL to set the OwnerReference here!
	sts := &appsv1.StatefulSet{ /* ... */ }
	ctrl.SetControllerReference(cr, sts, r.Scheme)
	return sts
}

// SetupWithManager sets up the controller with the Manager.
func (r *ReplicatedKVStoreReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&databasev1alpha1.ReplicatedKVStore{}).
		Owns(&appsv1.StatefulSet{}). // This makes the controller watch StatefulSets it creates
		Complete(r)
}

ctrl.SetControllerReference is vital. It establishes an ownership link from our CR to the StatefulSet. This enables Kubernetes garbage collection (deleting the StatefulSet when the CR is deleted) and allows our controller to be triggered when the StatefulSet changes.
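
For reference, once SetControllerReference has run, the created StatefulSet carries an ownerReferences entry roughly like the following (the uid is filled in from the owning CR at runtime):

yaml
metadata:
  ownerReferences:
    - apiVersion: database.mycorp.com/v1alpha1
      kind: ReplicatedKVStore
      name: my-kv-store
      uid: # copied from the owning CR
      controller: true
      blockOwnerDeletion: true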

Advanced State Management: Handling Deletion with Finalizers

A simple OwnerReference is insufficient for stateful workloads. When a user runs kubectl delete replicatedkvstore my-kv-store, the Kubernetes API server immediately marks the object for deletion. The garbage collector will then delete the owned StatefulSet and its pods. For a database, this is catastrophic. It doesn't allow for a graceful shutdown, data draining, or notifying the cluster of a permanent departure.

This is where finalizers become essential. A finalizer is a key in the resource's metadata that tells Kubernetes to wait for a controller to perform cleanup actions before fully deleting the resource.

The Finalizer Workflow:

  • When our Operator first reconciles a new ReplicatedKVStore CR, it adds a finalizer string (e.g., database.mycorp.com/finalizer) to its metadata.finalizers list.
  • When a user requests deletion of the CR, the API server doesn't delete it immediately. Instead, it sets the metadata.deletionTimestamp field.
  • Our Operator's reconciliation loop is triggered. It sees that deletionTimestamp is not nil and executes its custom cleanup logic. This could involve:
    * Calling a pre-stop API endpoint on the pods to drain connections.
    * Instructing the cluster leader to remove the departing nodes from the membership list.
    * Taking a final backup of the Persistent Volumes.
  • Once the cleanup is complete and verified, the Operator removes its finalizer string from the CR's metadata.finalizers list and updates the object.
  • The Kubernetes API server, seeing that the deletionTimestamp is set and the finalizers list is empty, finally deletes the resource.

Implementation in the Reconciler:

go
// Note: controllerutil comes from
// "sigs.k8s.io/controller-runtime/pkg/controller/controllerutil".
func (r *ReplicatedKVStoreReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	logger := log.FromContext(ctx)
	instance := &databasev1alpha1.ReplicatedKVStore{}
	// ... fetch instance ...

	// The finalizer name
	myFinalizerName := "database.mycorp.com/finalizer"

	// Check if the object is being deleted
	if instance.ObjectMeta.DeletionTimestamp.IsZero() {
		// The object is not being deleted, so if it does not have our
		// finalizer, add the finalizer and update the object.
		if !controllerutil.ContainsFinalizer(instance, myFinalizerName) {
			controllerutil.AddFinalizer(instance, myFinalizerName)
			if err := r.Update(ctx, instance); err != nil {
				return ctrl.Result{}, err
			}
		}
	} else {
		// The object is being deleted
		if controllerutil.ContainsFinalizer(instance, myFinalizerName) {
			// Our finalizer is present, so handle any external dependency.
			if err := r.cleanupExternalResources(ctx, instance); err != nil {
				// If the external cleanup fails, return with the error
				// so that it can be retried.
				logger.Error(err, "Failed to run cleanup logic")
				return ctrl.Result{}, err
			}

			// Once cleanup is successful, remove the finalizer from the
			// list and update the object.
			controllerutil.RemoveFinalizer(instance, myFinalizerName)
			if err := r.Update(ctx, instance); err != nil {
				return ctrl.Result{}, err
			}
		}

		// Stop reconciliation as the item is being deleted
		return ctrl.Result{}, nil
	}

	// ... rest of the reconciliation logic for create/update ...

	return ctrl.Result{}, nil
}

func (r *ReplicatedKVStoreReconciler) cleanupExternalResources(ctx context.Context, cr *databasev1alpha1.ReplicatedKVStore) error {
	logger := log.FromContext(ctx)
	logger.Info("Executing finalizer logic for ReplicatedKVStore")

	// This is where you implement the application-specific cleanup.
	// For example, connect to the cluster and gracefully remove nodes.
	// This logic must be idempotent.

	// Example: Call a 'decommission' endpoint on each pod before it's terminated.
	// This would require a more complex orchestration than shown here, possibly
	// by scaling down the statefulset one by one and waiting for confirmation.

	logger.Info("Finalizer logic completed successfully")
	return nil
}
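
To make the elided decommission call concrete, here is a minimal sketch. It assumes each replica exposes an administrative POST /decommission endpoint on port 8080 and that pods are reachable through a headless Service named after the CR; both are illustrative assumptions, not part of the article's API.

go
import (
	"context"
	"fmt"
	"net/http"

	databasev1alpha1 "my-operator/api/v1alpha1"
)

// decommissionPod asks one replica to remove itself from the cluster
// membership before its pod is terminated. Endpoint, port, and the
// headless Service name are hypothetical.
func decommissionPod(ctx context.Context, cr *databasev1alpha1.ReplicatedKVStore, ordinal int32) error {
	// StatefulSet pods get stable DNS names via the headless Service:
	// <pod-name>.<service-name>.<namespace>.svc
	url := fmt.Sprintf("http://%s-%d.%s.%s.svc:8080/decommission",
		cr.Name, ordinal, cr.Name, cr.Namespace)

	req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, nil)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("decommission of ordinal %d failed: %s", ordinal, resp.Status)
	}
	return nil
}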

Edge Case: Coordinated Scaling and Upgrades

This is where an Operator truly shines over a bare StatefulSet. A simple replica count change in the CR spec should not trigger a blind StatefulSet update.

Scenario: Scaling Down

When spec.replicas is decreased from 5 to 3, the reconciliation loop detects the mismatch. Instead of just updating the StatefulSet replica count, the Operator should:

  • Identify which pods will be terminated by the StatefulSet controller (it terminates pods with the highest ordinals first, e.g., my-kv-store-4, then my-kv-store-3).
  • For the pod about to be terminated (my-kv-store-4):
    a. Call the application's API to transfer leadership if it's the current leader.
    b. Call an API to drain its data to other replicas.
    c. Call an API to remove itself from the cluster membership.
  • Monitor the status of this operation. Only when the application confirms that my-kv-store-4 is safely decommissioned should the Operator proceed.
  • Repeat the process for my-kv-store-3.
  • Finally, after all nodes are gracefully removed from the application cluster, update the StatefulSet's replica count to 3.

This requires intricate logic within the reconciler, likely involving annotating pods that are in a 'draining' state and checking those annotations in subsequent reconciliation loops; a minimal sketch of that bookkeeping follows. The process transforms a dangerous operation into a safe, automated procedure.
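
Here is one way the annotation bookkeeping might look, assuming an illustrative annotation key (database.mycorp.com/draining):

go
import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
)

// drainingAnnotation marks a pod whose decommissioning is in flight;
// the key is an assumption for illustration.
const drainingAnnotation = "database.mycorp.com/draining"

// markPodDraining records the draining state on the pod so that
// subsequent reconciliation loops can pick up where this one left off.
func (r *ReplicatedKVStoreReconciler) markPodDraining(ctx context.Context, name, namespace string) error {
	pod := &corev1.Pod{}
	if err := r.Get(ctx, types.NamespacedName{Name: name, Namespace: namespace}, pod); err != nil {
		return err
	}
	if pod.Annotations[drainingAnnotation] == "true" {
		return nil // already marked; keeps the operation idempotent
	}
	if pod.Annotations == nil {
		pod.Annotations = map[string]string{}
	}
	pod.Annotations[drainingAnnotation] = "true"
	return r.Update(ctx, pod)
}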

Scenario: Version Upgrade

When spec.version changes, the Operator should manage the StatefulSet's rolling update. However, a simple pod restart may not be sufficient.

  • The Operator updates the StatefulSet's pod template with the new container image from spec.version.
  • The StatefulSet controller begins its rolling update (e.g., terminating my-kv-store-4 and recreating it with the new version).
  • Our Operator must monitor this process. After the new pod (my-kv-store-4) becomes Ready, the Operator's job isn't done. It should now perform post-upgrade health checks specific to the application:
    a. Does the new pod successfully join the cluster?
    b. Is data replication to the new pod healthy?
    c. Are there any schema migrations or data format changes that need to be run?
  • The Operator can pause the StatefulSet rolling update by setting spec.updateStrategy.rollingUpdate.partition to the ordinal of the next pod to be updated. It only advances the partition (allowing the StatefulSet to update the next pod) after all health checks for the currently updated pod have passed (see the sketch after this list).

This level of control ensures that a bad version rollout is stopped immediately after the first pod fails its post-upgrade checks, preventing a full cluster outage.
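
A sketch of the partition handling under these assumptions (the health-check gate itself is elided):

go
import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
)

// advancePartition lowers the RollingUpdate partition by one ordinal,
// letting the StatefulSet controller upgrade the next pod. It should
// only be called after the current pod passed its post-upgrade checks.
func (r *ReplicatedKVStoreReconciler) advancePartition(ctx context.Context, sts *appsv1.StatefulSet) error {
	ru := sts.Spec.UpdateStrategy.RollingUpdate
	if ru == nil || ru.Partition == nil || *ru.Partition == 0 {
		return nil // no partition configured, or rollout already complete
	}
	next := *ru.Partition - 1
	ru.Partition = &next
	return r.Update(ctx, sts)
}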

Performance and Stability Considerations

  • Requeue Strategy: Naively returning an error on every transient issue will cause an exponential backoff, delaying reconciliation. For recoverable errors (e.g., a temporary API conflict), it's better to return ctrl.Result{Requeue: true} or ctrl.Result{RequeueAfter: 30 * time.Second} to retain control over the retry loop.

  • API Server Load: A poorly written Operator can flood the Kubernetes API server with requests. controller-runtime's client is built on client-go, which includes client-side rate limiting. Be mindful of making excessive GET and UPDATE calls within a single reconciliation. Batch status updates where possible.

  • Leader Election: For high availability, you'll run multiple replicas of your Operator. controller-runtime handles leader election out of the box using a Lease object in the cluster. Only the leader pod runs the reconciliation loop, preventing multiple controllers from fighting over the same resources (a minimal manager-setup sketch follows this list).

  • Testing: Writing unit tests for the reconciliation logic is crucial. For integration testing, the envtest package from controller-runtime spins up a real etcd and kube-apiserver locally, allowing you to test your Operator's interactions with the Kubernetes API without needing a full cluster.
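
A minimal manager-setup sketch with leader election enabled; the election ID and module paths are illustrative choices, not prescribed by controller-runtime:

go
package main

import (
	"k8s.io/apimachinery/pkg/runtime"
	clientgoscheme "k8s.io/client-go/kubernetes/scheme"
	ctrl "sigs.k8s.io/controller-runtime"

	databasev1alpha1 "my-operator/api/v1alpha1"
	"my-operator/internal/controller"
)

func main() {
	scheme := runtime.NewScheme()
	_ = clientgoscheme.AddToScheme(scheme)   // built-in types (Pods, StatefulSets, ...)
	_ = databasev1alpha1.AddToScheme(scheme) // our CRD types

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Scheme:         scheme,
		LeaderElection: true,
		// Only the replica holding the Lease with this ID reconciles.
		LeaderElectionID: "replicatedkvstore.database.mycorp.com",
	})
	if err != nil {
		panic(err)
	}

	if err := (&controller.ReplicatedKVStoreReconciler{
		Client: mgr.GetClient(),
		Scheme: mgr.GetScheme(),
	}).SetupWithManager(mgr); err != nil {
		panic(err)
	}

	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}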

Conclusion

Building a stateful Operator is a significant engineering investment that moves beyond declarative configuration into declarative, automated operations. By embedding the domain-specific logic of a complex stateful application directly into a custom Kubernetes controller, we create a truly cloud-native system. This system is not only self-healing in the face of pod failures but can also manage its own lifecycle—scaling, upgrades, and termination—with a level of safety and automation that is impossible to achieve with standard Kubernetes primitives alone. The patterns of reconciliation, finalizers, and fine-grained lifecycle control are the hallmarks of a production-ready Operator designed for the most critical stateful workloads.
