Custom Kubernetes Operators in Go for Stateful App Management
The Inadequacy of Off-the-Shelf Primitives for Complex Stateful Workloads
As senior engineers, we appreciate the power of Kubernetes primitives like StatefulSet. They provide foundational guarantees for stateful workloads: stable network identities, persistent storage, and ordered deployment. However, their utility plateaus when faced with the complex, domain-specific operational logic required by applications like distributed databases, message queues, or caching clusters.
A StatefulSet can manage pod lifecycle, but it has no intrinsic knowledge of application-level concerns. It cannot perform a coordinated primary-replica failover, orchestrate a schema migration during an upgrade, manage application-level sharding, or execute a consistent, cluster-wide backup. These "Day-2 operations" are typically relegated to a combination of manual intervention and brittle shell scripts, an anti-pattern in the cloud-native ecosystem.
This is the problem domain the Operator Pattern solves. By extending the Kubernetes API with a Custom Resource Definition (CRD) and implementing a custom controller, we can encapsulate this complex operational knowledge into a declarative, automated system. This article will not cover the basics of what an operator is. Instead, we will dive directly into building a production-grade PostgresCluster operator in Go, focusing on the nuanced patterns required for robust lifecycle management.
Our goal: To create a PostgresCluster resource that manages a primary-replica PostgreSQL setup, complete with automated backup scheduling and graceful termination logic.
Defining the Contract: The `PostgresCluster` CRD
The CRD is the API contract between the user and our operator. A well-designed spec and status are paramount for a declarative and observable system. We'll use kubebuilder to scaffold our project and define the API types.
Our PostgresCluster resource will have a spec defining the desired state and a status reflecting the observed state.
File: api/v1alpha1/postgrescluster_types.go
package v1alpha1
import (
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)
// PostgresClusterSpec defines the desired state of PostgresCluster
type PostgresClusterSpec struct {
// Version is the desired PostgreSQL version.
// +kubebuilder:validation:Required
Version string `json:"version"`
// Replicas is the number of PostgreSQL instances in the cluster.
// Must be at least 1.
// +kubebuilder:validation:Minimum=1
Replicas int32 `json:"replicas"`
// StorageSize is the size of the persistent volume for each instance.
// e.g., "10Gi"
// +kubebuilder:validation:Required
StorageSize string `json:"storageSize"`
// BackupSchedule is a cron-formatted string for scheduling backups.
// e.g., "0 0 * * *" for daily at midnight.
// +kubebuilder:validation:Optional
BackupSchedule string `json:"backupSchedule,omitempty"`
}
// PostgresClusterStatus defines the observed state of PostgresCluster
type PostgresClusterStatus struct {
// Phase indicates the current state of the cluster.
// e.g., Creating, Ready, Failing, Updating.
Phase string `json:"phase,omitempty"`
// PrimaryPod is the name of the current primary PostgreSQL pod.
PrimaryPod string `json:"primaryPod,omitempty"`
// ReplicaPods is a list of current replica pod names.
ReplicaPods []string `json:"replicaPods,omitempty"`
// Conditions represent the latest available observations of the cluster's state.
// +patchMergeKey=type
// +patchStrategy=merge
Conditions []metav1.Condition `json:"conditions,omitempty" patchStrategy:"merge" patchMergeKey:"type"`
// LastBackupTime is the timestamp of the last successful backup.
LastBackupTime *metav1.Time `json:"lastBackupTime,omitempty"`
}
//+kubebuilder:object:root=true
//+kubebuilder:subresource:status
//+kubebuilder:printcolumn:name="Version",type="string",JSONPath=".spec.version"
//+kubebuilder:printcolumn:name="Replicas",type="integer",JSONPath=".spec.replicas"
//+kubebuilder:printcolumn:name="Phase",type="string",JSONPath=".status.phase"
//+kubebuilder:printcolumn:name="Primary",type="string",JSONPath=".status.primaryPod"
// PostgresCluster is the Schema for the postgresclusters API
type PostgresCluster struct {
metav1.TypeMeta `json:",inline"`
metav1.ObjectMeta `json:"metadata,omitempty"`
Spec PostgresClusterSpec `json:"spec,omitempty"`
Status PostgresClusterStatus `json:"status,omitempty"`
}
//+kubebuilder:object:root=true
// PostgresClusterList contains a list of PostgresCluster
type PostgresClusterList struct {
metav1.TypeMeta `json:",inline"`
metav1.ListMeta `json:"metadata,omitempty"`
Items []PostgresCluster `json:"items"`
}
func init() {
SchemeBuilder.Register(&PostgresCluster{}, &PostgresClusterList{})
}
Running make manifests generates the CRD YAML from this Go definition, and make install applies it to the cluster, extending the Kubernetes API. The +kubebuilder markers are critical: they add validation, enable the /status subresource, and configure the columns shown by kubectl get.
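The Conditions field uses the standard metav1.Condition type, which is best manipulated through the apimachinery helper rather than by hand. As a minimal sketch (the "Ready" condition type and the setReadyCondition helper name are illustrative, not part of the generated API), the reconciler could record readiness like this before calling r.Status().Update():
import (
	"k8s.io/apimachinery/pkg/api/meta"
	// ...
)
// setReadyCondition records a Ready condition on the cluster status. SetStatusCondition
// de-duplicates by condition type and only bumps LastTransitionTime when the status changes.
func setReadyCondition(cluster *databasev1alpha1.PostgresCluster, ready bool, reason, message string) {
	status := metav1.ConditionFalse
	if ready {
		status = metav1.ConditionTrue
	}
	meta.SetStatusCondition(&cluster.Status.Conditions, metav1.Condition{
		Type:    "Ready",
		Status:  status,
		Reason:  reason,
		Message: message,
	})
}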
The Controller's Core: A Resilient Reconciliation Loop
The Reconcile function is the heart of the operator. It's invoked whenever a PostgresCluster resource (or a resource it owns) changes. Its sole purpose is to drive the current state of the cluster towards the desired state defined in the CR's spec.
A naive implementation might be a simple sequence of CreateOrUpdate calls. A production-grade reconciler is far more complex, idempotent, and aware of deletion lifecycles.
File: controllers/postgrescluster_controller.go
import (
	"context"
	"k8s.io/apimachinery/pkg/api/errors"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
	"sigs.k8s.io/controller-runtime/pkg/log"
	// databasev1alpha1 is the api/v1alpha1 package of your scaffolded module
)
const postgresClusterFinalizer = "database.example.com/finalizer"
func (r *PostgresClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
log := log.FromContext(ctx)
// 1. Fetch the PostgresCluster instance
cluster := &databasev1alpha1.PostgresCluster{}
err := r.Get(ctx, req.NamespacedName, cluster)
if err != nil {
if errors.IsNotFound(err) {
log.Info("PostgresCluster resource not found. Ignoring since object must be deleted")
return ctrl.Result{}, nil
}
log.Error(err, "Failed to get PostgresCluster")
return ctrl.Result{}, err
}
// 2. Handle deletion: Check if the DeletionTimestamp is set.
if !cluster.ObjectMeta.DeletionTimestamp.IsZero() {
// The object is being deleted
if controllerutil.ContainsFinalizer(cluster, postgresClusterFinalizer) {
// Run our finalization logic. If it fails, we'll retry later.
if err := r.finalizePostgresCluster(ctx, cluster); err != nil {
log.Error(err, "Failed to run finalizer")
return ctrl.Result{}, err
}
// Remove finalizer. Once all finalizers are removed, the object will be deleted.
controllerutil.RemoveFinalizer(cluster, postgresClusterFinalizer)
if err := r.Update(ctx, cluster); err != nil {
return ctrl.Result{}, err
}
}
return ctrl.Result{}, nil
}
// 3. Add finalizer for non-deleted objects
if !controllerutil.ContainsFinalizer(cluster, postgresClusterFinalizer) {
log.Info("Adding finalizer to the PostgresCluster")
controllerutil.AddFinalizer(cluster, postgresClusterFinalizer)
if err := r.Update(ctx, cluster); err != nil {
return ctrl.Result{}, err
}
}
// 4. Main reconciliation logic
// ... Reconcile StatefulSet
// ... Reconcile Services
// ... Reconcile CronJob for backups
// ... Update Status
// For this example, we'll stub out the main logic
log.Info("Reconciling PostgresCluster")
// Example of updating status
cluster.Status.Phase = "Ready"
if err := r.Status().Update(ctx, cluster); err != nil {
log.Error(err, "Failed to update PostgresCluster status")
return ctrl.Result{}, err
}
return ctrl.Result{}, nil
}
This structure correctly handles the object lifecycle. The finalizer logic is not optional; it's essential for any operator managing external resources or requiring ordered cleanup.
Production Pattern 1: Graceful Deletion with Finalizers
When a user runs kubectl delete postgrescluster my-cluster, the Kubernetes API server marks the object for deletion by setting the metadata.deletionTimestamp. It does not immediately remove the object. Instead, it waits for all controllers that have registered a finalizer on the object to remove their finalizer. This is our window to perform cleanup.
Without a finalizer, deleting our PostgresCluster would orphan its PersistentVolumeClaims, potentially leading to data loss or dangling cloud resources that incur costs.
Our finalizePostgresCluster function implements the cleanup logic:
func (r *PostgresClusterReconciler) finalizePostgresCluster(ctx context.Context, cluster *databasev1alpha1.PostgresCluster) error {
log := log.FromContext(ctx)
log.Info("Starting finalization for PostgresCluster")
// 1. Perform a final backup before deletion.
// This could involve creating a one-off Kubernetes Job to run pg_dump
// and upload to S3. This logic must be idempotent.
log.Info("Triggering final backup...")
// backupErr := r.triggerFinalBackup(ctx, cluster)
// if backupErr != nil { ... return error to retry ... }
// 2. Delete any external resources not handled by garbage collection.
// For example, if we registered this cluster with an external monitoring service.
log.Info("Cleaning up external resources...")
// 3. The StatefulSet, Services, and CronJob will be garbage collected
// automatically because we will set the Controller Reference. No explicit
// deletion logic is needed for them here.
log.Info("PostgresCluster finalization complete")
return nil
}
Edge Case: What if the final backup job fails? The finalizePostgresCluster function returns an error, the Reconcile call is requeued, and the finalizer is not removed. The PostgresCluster object will remain in a Terminating state until the cleanup logic succeeds. This prevents data loss by ensuring critical cleanup actions are completed before the resource is fully deleted.
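To make that retry behavior concrete, here is one possible shape for the triggerFinalBackup call stubbed out above. The Job name, image, and command are illustrative assumptions (credentials and the S3 upload are omitted); the essential properties are idempotent creation and returning an error until the Job reports success:
func (r *PostgresClusterReconciler) triggerFinalBackup(ctx context.Context, cluster *databasev1alpha1.PostgresCluster) error {
	key := types.NamespacedName{Name: cluster.Name + "-final-backup", Namespace: cluster.Namespace}
	job := &batchv1.Job{}
	err := r.Get(ctx, key, job)
	if errors.IsNotFound(err) {
		// Idempotent: only create the one-off backup Job if it does not already exist.
		job = &batchv1.Job{
			ObjectMeta: metav1.ObjectMeta{Name: key.Name, Namespace: key.Namespace},
			Spec: batchv1.JobSpec{
				Template: corev1.PodTemplateSpec{
					Spec: corev1.PodSpec{
						RestartPolicy: corev1.RestartPolicyOnFailure,
						Containers: []corev1.Container{{
							Name:    "final-backup",
							Image:   "postgres:" + cluster.Spec.Version, // assumption: official image provides pg_dump
							Command: []string{"pg_dumpall"},             // placeholder; the real command dumps and uploads
						}},
					},
				},
			},
		}
		return r.Create(ctx, job)
	} else if err != nil {
		return err
	}
	if job.Status.Succeeded == 0 {
		// Returning an error requeues the reconcile; the finalizer stays in place until the Job succeeds.
		return fmt.Errorf("final backup job %q has not completed yet", key.Name)
	}
	return nil
}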
Production Pattern 2: Managing Dependent Objects with Controller References
Our operator needs to create and manage several Kubernetes resources: a StatefulSet for the pods, a Service for the primary endpoint, a headless Service for peer discovery, and a CronJob for backups.
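As an illustration of one of these dependents, a minimal headless Service for peer discovery might look like the following sketch; the "-headless" suffix and the app label key are assumptions, not a fixed convention:
func (r *PostgresClusterReconciler) defineHeadlessService(cluster *databasev1alpha1.PostgresCluster) *corev1.Service {
	labels := map[string]string{"app": cluster.Name}
	return &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{
			Name:      cluster.Name + "-headless",
			Namespace: cluster.Namespace,
			Labels:    labels,
		},
		Spec: corev1.ServiceSpec{
			ClusterIP: corev1.ClusterIPNone, // headless: per-pod DNS records, no virtual IP
			Selector:  labels,
			Ports: []corev1.ServicePort{{
				Name: "postgres",
				Port: 5432,
			}},
		},
	}
}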
The key to managing these is the controllerutil.SetControllerReference function. This sets the ownerReferences field on the child object, pointing back to our PostgresCluster instance. This has two critical benefits:
* Garbage collection: when the PostgresCluster is deleted, the Kubernetes garbage collector automatically deletes all the objects it owns.
* Event propagation: a change to an owned object (e.g., if the StatefulSet is deleted) will trigger a reconciliation of the owner (PostgresCluster).
Here is a robust function to reconcile the StatefulSet:
import (
appsv1 "k8s.io/api/apps/v1"
corev1 "k8s.io/api/core/v1"
"k8s.io/apimachinery/pkg/api/resource"
// ... other imports
)
func (r *PostgresClusterReconciler) reconcileStatefulSet(ctx context.Context, cluster *databasev1alpha1.PostgresCluster) (*appsv1.StatefulSet, error) {
log := log.FromContext(ctx)
desiredSts := r.defineStatefulSet(cluster)
// Set PostgresCluster instance as the owner and controller
if err := controllerutil.SetControllerReference(cluster, desiredSts, r.Scheme); err != nil {
return nil, err
}
// Check if the StatefulSet already exists
foundSts := &appsv1.StatefulSet{}
err := r.Get(ctx, types.NamespacedName{Name: desiredSts.Name, Namespace: desiredSts.Namespace}, foundSts)
if err != nil && errors.IsNotFound(err) {
log.Info("Creating a new StatefulSet", "StatefulSet.Namespace", desiredSts.Namespace, "StatefulSet.Name", desiredSts.Name)
err = r.Create(ctx, desiredSts)
if err != nil {
return nil, err
}
// StatefulSet created successfully - return and requeue
return desiredSts, nil
} else if err != nil {
return nil, err
}
// StatefulSet already exists - check for updates
// A simple spec comparison is often too naive due to default values being set by the API server.
// We need a more sophisticated check.
// For this example, we'll check for replica count and image changes.
needsUpdate := false
if *foundSts.Spec.Replicas != *desiredSts.Spec.Replicas {
log.Info("StatefulSet replica count mismatch", "found", *foundSts.Spec.Replicas, "desired", *desiredSts.Spec.Replicas)
foundSts.Spec.Replicas = desiredSts.Spec.Replicas
needsUpdate = true
}
foundImage := foundSts.Spec.Template.Spec.Containers[0].Image
desiredImage := desiredSts.Spec.Template.Spec.Containers[0].Image
if foundImage != desiredImage {
log.Info("StatefulSet image mismatch", "found", foundImage, "desired", desiredImage)
foundSts.Spec.Template.Spec.Containers[0].Image = desiredImage
needsUpdate = true
}
if needsUpdate {
log.Info("Updating StatefulSet")
if err := r.Update(ctx, foundSts); err != nil {
return nil, err
}
}
return foundSts, nil
}
// defineStatefulSet creates the desired StatefulSet object
func (r *PostgresClusterReconciler) defineStatefulSet(cluster *databasev1alpha1.PostgresCluster) *appsv1.StatefulSet {
// ... logic to build the StatefulSet struct ...
// This includes setting labels, container image from spec.Version,
// port, readiness/liveness probes, and volumeClaimTemplates based on spec.StorageSize.
// This function is purely declarative.
return &appsv1.StatefulSet{ /* ... full definition ... */ }
}
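For reference, the defineStatefulSet stub might be fleshed out roughly as follows. The label key, container port, and volume claim name are illustrative, the ServiceName matches the headless Service sketched earlier, probes and PostgreSQL configuration are omitted, and production code should validate spec.StorageSize rather than rely on MustParse, which panics on malformed input:
func (r *PostgresClusterReconciler) defineStatefulSet(cluster *databasev1alpha1.PostgresCluster) *appsv1.StatefulSet {
	labels := map[string]string{"app": cluster.Name}
	replicas := cluster.Spec.Replicas
	return &appsv1.StatefulSet{
		ObjectMeta: metav1.ObjectMeta{
			Name:      cluster.Name,
			Namespace: cluster.Namespace,
			Labels:    labels,
		},
		Spec: appsv1.StatefulSetSpec{
			Replicas:    &replicas,
			ServiceName: cluster.Name + "-headless", // assumed headless Service for peer discovery
			Selector:    &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{{
						Name:  "postgres",
						Image: "postgres:" + cluster.Spec.Version, // image tag derived from spec.Version
						Ports: []corev1.ContainerPort{{Name: "postgres", ContainerPort: 5432}},
						// Readiness/liveness probes and PostgreSQL configuration omitted for brevity.
					}},
				},
			},
			VolumeClaimTemplates: []corev1.PersistentVolumeClaim{{
				ObjectMeta: metav1.ObjectMeta{Name: "data"},
				Spec: corev1.PersistentVolumeClaimSpec{
					AccessModes: []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
					Resources: corev1.VolumeResourceRequirements{ // corev1.ResourceRequirements on k8s.io/api < v0.29
						Requests: corev1.ResourceList{
							corev1.ResourceStorage: resource.MustParse(cluster.Spec.StorageSize),
						},
					},
				},
			}},
		},
	}
}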
Advanced Consideration: Immutable Fields. A user might try to change spec.storageSize. However, the volumeClaimTemplates field in a StatefulSet is immutable after creation. A naive operator might try to update the StatefulSet and get an API error, causing an infinite reconciliation loop. A production operator must handle this. The best practice is to implement a validating admission webhook to reject such changes with a clear error message. Barring that, the operator's update logic must compare the existing StatefulSet's storage spec with its desired spec and, if they differ, log a warning and revert the change in the desiredSts object before attempting an update.
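A minimal sketch of that "revert and warn" fallback, assuming a single volume claim template; guardImmutableStorage is a hypothetical helper that reconcileStatefulSet would call on the found and desired StatefulSets before issuing the Update (logr is "github.com/go-logr/logr", the logger type returned by log.FromContext):
func guardImmutableStorage(log logr.Logger, found, desired *appsv1.StatefulSet) {
	if len(found.Spec.VolumeClaimTemplates) == 0 || len(desired.Spec.VolumeClaimTemplates) == 0 {
		return
	}
	existing := found.Spec.VolumeClaimTemplates[0].Spec.Resources.Requests[corev1.ResourceStorage]
	requested := desired.Spec.VolumeClaimTemplates[0].Spec.Resources.Requests[corev1.ResourceStorage]
	if existing.Cmp(requested) != 0 {
		log.Info("spec.storageSize changed, but volumeClaimTemplates are immutable; keeping existing size",
			"existing", existing.String(), "requested", requested.String())
		// Revert the desired template so the subsequent Update does not hit an immutable-field error.
		desired.Spec.VolumeClaimTemplates[0].Spec.Resources.Requests[corev1.ResourceStorage] = existing
	}
}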
Production Pattern 3: Encoding Operational Logic - Automated Backups
This is where the operator truly shines, by automating a task that is traditionally manual or managed by a separate system. We'll use the spec.backupSchedule to manage a CronJob.
Similar to the StatefulSet, we'll have a reconcileBackupCronJob function.
import (
batchv1 "k8s.io/api/batch/v1"
// ...
)
func (r *PostgresClusterReconciler) reconcileBackupCronJob(ctx context.Context, cluster *databasev1alpha1.PostgresCluster) error {
log := log.FromContext(ctx)
// If schedule is empty, ensure CronJob is deleted
if cluster.Spec.BackupSchedule == "" {
cj := &batchv1.CronJob{
ObjectMeta: metav1.ObjectMeta{
Name: cluster.Name + "-backup",
Namespace: cluster.Namespace,
},
}
if err := r.Delete(ctx, cj); err != nil && !errors.IsNotFound(err) {
return fmt.Errorf("failed to delete backup cronjob: %w", err)
}
log.Info("Backup schedule is empty, ensuring CronJob is deleted.")
return nil
}
// Define the desired CronJob
desiredCj := r.defineCronJob(cluster)
if err := controllerutil.SetControllerReference(cluster, desiredCj, r.Scheme); err != nil {
return err
}
// Create or Update logic similar to the StatefulSet reconciler
foundCj := &batchv1.CronJob{}
err := r.Get(ctx, types.NamespacedName{Name: desiredCj.Name, Namespace: desiredCj.Namespace}, foundCj)
if err != nil && errors.IsNotFound(err) {
log.Info("Creating backup CronJob")
return r.Create(ctx, desiredCj)
} else if err != nil {
return err
}
// If found, check if the schedule has changed
if foundCj.Spec.Schedule != desiredCj.Spec.Schedule {
foundCj.Spec.Schedule = desiredCj.Spec.Schedule
log.Info("Updating backup CronJob schedule")
return r.Update(ctx, foundCj)
}
return nil
}
func (r *PostgresClusterReconciler) defineCronJob(cluster *databasev1alpha1.PostgresCluster) *batchv1.CronJob {
// ... logic to build the CronJob struct ...
// The job's pod would contain a container with `pg_dump` and `aws-cli`,
// configured with secrets to connect to the DB and upload to S3.
// The connection details would point to the primary service we manage.
return &batchv1.CronJob{ /* ... full definition ... */ }
}
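For concreteness, the defineCronJob stub might be filled in along these lines. The backup image, environment variable, and the "-primary" Service name are illustrative assumptions; secrets for database credentials and S3 access are omitted:
func (r *PostgresClusterReconciler) defineCronJob(cluster *databasev1alpha1.PostgresCluster) *batchv1.CronJob {
	return &batchv1.CronJob{
		ObjectMeta: metav1.ObjectMeta{
			Name:      cluster.Name + "-backup",
			Namespace: cluster.Namespace,
		},
		Spec: batchv1.CronJobSpec{
			Schedule:          cluster.Spec.BackupSchedule,
			ConcurrencyPolicy: batchv1.ForbidConcurrent, // never run two backups at once
			JobTemplate: batchv1.JobTemplateSpec{
				Spec: batchv1.JobSpec{
					Template: corev1.PodTemplateSpec{
						Spec: corev1.PodSpec{
							RestartPolicy: corev1.RestartPolicyOnFailure,
							Containers: []corev1.Container{{
								Name:  "backup",
								Image: "example.com/pg-backup:latest", // hypothetical image bundling pg_dump and aws-cli
								Env: []corev1.EnvVar{
									// Connection details point at the primary Service the operator manages.
									{Name: "PGHOST", Value: cluster.Name + "-primary"},
								},
							}},
						},
					},
				},
			},
		},
	}
}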
This reconciliation logic ensures that if a user adds, removes, or changes the backupSchedule on the PostgresCluster resource, the operator will automatically create, delete, or update the corresponding CronJob in the cluster.
Stability and Performance in Production
* Controller Caching: The controller-runtime framework, used by kubebuilder, creates a cache of Kubernetes objects using informers. Your Reconcile function reads from this cache, which is nearly instantaneous and avoids hammering the API server. Writes (Create, Update, Delete), however, go directly to the API server. Be aware that this means you might be reconciling based on slightly stale data. The system is eventually consistent.
* Leader Election: For high availability, you'll run multiple replicas of your operator. Leader election is crucial to prevent multiple instances from reconciling the same object simultaneously, which would lead to conflicting updates. The kubebuilder scaffold enables this by default, using a Lease object in the cluster to coordinate leadership; a trimmed sketch of the corresponding manager setup follows the SetupWithManager example below.
* Resource Watching: The operator must be configured to watch not only PostgresCluster resources but also the resources it owns. This ensures the reconciler is triggered when, for example, a user manually deletes the StatefulSet our operator manages. The operator will then detect the deviation and recreate it.
File: controllers/postgrescluster_controller.go
func (r *PostgresClusterReconciler) SetupWithManager(mgr ctrl.Manager) error {
return ctrl.NewControllerManagedBy(mgr).
For(&databasev1alpha1.PostgresCluster{}).
Owns(&appsv1.StatefulSet{}).
Owns(&corev1.Service{}).
Owns(&batchv1.CronJob{}).
Complete(r)
}
This SetupWithManager configuration is declarative and powerful. Owns tells the controller to watch the specified types and, if one changes, to enqueue a reconciliation request for its owner. This is far more efficient than watching all StatefulSets in the cluster.
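Returning to the leader-election point above, the manager setup in the scaffolded main.go looks roughly like this trimmed sketch (the LeaderElectionID value is an arbitrary example; scheme and setupLog come from the scaffold):
mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
	Scheme:           scheme,
	LeaderElection:   true, // only the elected leader runs the reconcile loops
	LeaderElectionID: "postgrescluster.database.example.com",
})
if err != nil {
	setupLog.Error(err, "unable to start manager")
	os.Exit(1)
}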
Conclusion
Building a production-grade Kubernetes Operator is a significant engineering endeavor that goes far beyond basic CRUD operations. It requires a deep understanding of the controller-runtime's reconciliation loop, the critical importance of finalizers for safe resource termination, and the robust management of owned resources via controller references. By encapsulating complex, stateful operational logic into a declarative, self-healing Go application, we transform a high-maintenance stateful service into a true cloud-native citizen. The patterns discussed here—meticulous CRD design, idempotent reconciliation, finalizer-based cleanup, and automated Day-2 operations—are the foundation for building powerful, reliable automation on Kubernetes.