Advanced K8s Operator Tuning: Dynamic Informer Caching & Finalizers
The Reconciliation Bottleneck: Beyond Simple Ownership
In the world of Kubernetes operators, the reconciliation loop is the heart of all logic. The controller-runtime framework provides a powerful abstraction over the client-go machinery, making it deceptively simple to watch a Custom Resource (CR) and its owned resources. A typical SetupWithManager looks like this:
// main.go (simplified)
func (r *CustomAppReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&appv1.CustomApp{}).
		Owns(&appsv1.Deployment{}).
		Owns(&corev1.Service{}).
		Complete(r)
}
Under the hood, For() and Owns() are setting up informers. These informers maintain a local cache of the specified resource types (CustomApp, Deployment, Service) and trigger the Reconcile function when any of these objects change. When your Reconcile function calls r.Client.Get(...), it reads from this local cache, avoiding a costly round trip to the Kubernetes API server. This is the cornerstone of efficient operator design.
But what happens when the relationship isn't direct ownership? Consider a CustomApp CR that needs to pull configuration from a ConfigMap and credentials from a Secret, referenced by name in its spec:
// api/v1/customapp_types.go

// CustomAppSpec defines the desired state of CustomApp
type CustomAppSpec struct {
	Image         string `json:"image"`
	ConfigMapName string `json:"configMapName"`
	SecretName    string `json:"secretName"`
}
The CustomApp operator needs to read this ConfigMap and Secret to correctly configure the Deployment it manages. Crucially, it also needs to react to changes in that ConfigMap or Secret. If an administrator updates a value in the ConfigMap, the operator must trigger a reconciliation for the CustomApp to roll out a new Deployment with the updated configuration.
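How the rollout itself happens is orthogonal to the watching problem this article tackles, but for context: a common technique (an aside, not something the walkthrough below depends on) is to hash the ConfigMap data into a pod-template annotation, so that any change to the data produces a new ReplicaSet. A minimal sketch, with an illustrative helper and annotation name:

// Sketch: stamp a deterministic hash of the ConfigMap data onto the pod
// template, e.g. deploy.Spec.Template.Annotations["app.example.com/config-hash"].
import (
	"crypto/sha256"
	"encoding/hex"
	"sort"

	corev1 "k8s.io/api/core/v1"
)

func configHashAnnotation(cm *corev1.ConfigMap) string {
	// Hash keys in a stable order so the result is deterministic.
	keys := make([]string, 0, len(cm.Data))
	for k := range cm.Data {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	h := sha256.New()
	for _, k := range keys {
		h.Write([]byte(k))
		h.Write([]byte(cm.Data[k]))
	}
	return hex.EncodeToString(h.Sum(nil))
}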
The naive approach is to fetch the referenced object on demand inside the Reconcile function with an uncached read (for example via the manager's APIReader, or via the default client in the common setup where ConfigMaps and Secrets are excluded from the manager's cache to keep its memory footprint bounded):
// controller/customapp_controller.go (NAIVE APPROACH - DO NOT USE)
func (r *CustomAppReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	log := log.FromContext(ctx)

	var customApp appv1.CustomApp
	if err := r.Get(ctx, req.NamespacedName, &customApp); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Fetch the referenced ConfigMap directly from the API server
	var configMap corev1.ConfigMap
	if err := r.Get(ctx, client.ObjectKey{Namespace: customApp.Namespace, Name: customApp.Spec.ConfigMapName}, &configMap); err != nil {
		log.Error(err, "unable to fetch referenced ConfigMap")
		return ctrl.Result{}, err
	}

	// ... logic to use configMap data ...

	return ctrl.Result{}, nil
}
This code works, but it's a performance time bomb. Every single reconciliation for a CustomApp instance results in a direct GET request to the API server for the ConfigMap. With hundreds or thousands of CRs, this generates a storm of API requests, leading to:
* Unnecessary load on the API server and client-side throttling (rate limiting) inside the operator.
* Higher reconciliation latency, since every loop pays for a network round trip instead of a local cache read.
* Staleness: the operator only re-reads the ConfigMap when the CustomApp itself changes. It remains completely unaware of direct changes to the ConfigMap, defeating a key requirement.

We need a way to leverage the caching power of informers for these non-owned, dynamically referenced resources.
The Limits of Static `Watches`
controller-runtime provides the Watches builder function to handle this. You can set up a watch on ConfigMap objects and use a handler.EnqueueRequestsFromMapFunc to map a ConfigMap change back to the CustomApp CRs that reference it.
// main.go (Slightly better, but still flawed)
.Watches(
	&source.Kind{Type: &corev1.ConfigMap{}},
	handler.EnqueueRequestsFromMapFunc(r.findCustomAppsForConfigMap),
)

// controller/customapp_controller.go
func (r *CustomAppReconciler) findCustomAppsForConfigMap(o client.Object) []reconcile.Request {
	// 1. List ALL CustomApp objects in the cluster
	// 2. Iterate through them to see which ones reference the changed ConfigMap
	// 3. Return a list of reconcile.Requests for the matching CustomApps
}
This is an improvement because it reacts to ConfigMap changes. However, it introduces a new scaling problem. The findCustomAppsForConfigMap function has to LIST all CustomApp objects in the cluster (or at least in a namespace) every time any ConfigMap changes. If you have 10,000 CustomApp CRs and 5,000 ConfigMaps, a change to a single, unrelated ConfigMap forces your operator to iterate over all 10,000 CRs. This is inefficient and scales poorly.
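To make that cost concrete, here is roughly what such a mapper ends up looking like (a sketch against the same, older map-function signature used above; error handling is simplified):

// controller/customapp_controller.go (sketch of the brute-force mapper)
func (r *CustomAppReconciler) findCustomAppsForConfigMap(o client.Object) []reconcile.Request {
	// Every ConfigMap event pays for a full List of CustomApps.
	var appList appv1.CustomAppList
	if err := r.List(context.Background(), &appList, client.InNamespace(o.GetNamespace())); err != nil {
		return nil
	}

	var requests []reconcile.Request
	for i := range appList.Items {
		if appList.Items[i].Spec.ConfigMapName == o.GetName() {
			requests = append(requests, reconcile.Request{
				NamespacedName: client.ObjectKeyFromObject(&appList.Items[i]),
			})
		}
	}
	return requests
}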
The core issue is that the watch is static. We are telling the operator to watch all ConfigMaps because we don't know at startup which ones will be referenced. The ideal solution would be to only watch the specific ConfigMaps that are currently referenced by an existing CustomApp CR.
Production Pattern: Dynamic Informer Management
This brings us to the advanced pattern: managing informers dynamically within the lifecycle of your operator. Instead of defining static watches at startup, we will create and destroy informers on-demand as CustomApp CRs are created, updated, and deleted.
Our strategy will be:
- Create a manager struct within our Reconciler to track the active informers.
- When a CustomApp is reconciled, check if an informer for its referenced ConfigMap is already running.
- If not, create a new informer scoped to that specific ConfigMap and start it.
- Attach an event handler to the informer and route its events back through a watch source registered with the controller, so that changes detected by this new informer trigger the correct CustomApp reconciliation.
- When a CustomApp is deleted, stop and clean up the associated informer to prevent resource leaks.

Let's build this out.
1. The Dynamic Informer Manager
First, we'll augment our CustomAppReconciler to include a manager for our dynamic watches.
// controller/customapp_controller.go
import (
	// ... other imports
	"sync"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"sigs.k8s.io/controller-runtime/pkg/event"
)
// activeInformer holds a reference to a running informer and its stop channel.
type activeInformer struct {
	informer cache.SharedIndexInformer
	stopCh   chan struct{}
}
// CustomAppReconciler reconciles a CustomApp object
type CustomAppReconciler struct {
	client.Client
	Scheme *runtime.Scheme

	// Typed clientset; its REST client is used to build the per-ConfigMap ListWatches.
	KubeClient kubernetes.Interface

	// Feeds events from the dynamic informers into the controller's workqueue
	// via a source.Channel watch registered in SetupWithManager.
	Events chan event.GenericEvent

	// A mutex to protect concurrent access to the activeInformers map.
	mutex           sync.RWMutex
	activeInformers map[string]activeInformer // Key: "<namespace>/<configMapName>"

	// We keep a reference to the manager for its logger and configuration.
	Manager ctrl.Manager
}
// We'll initialize the map and wire the channel and clientset in SetupWithManager, shown below.
This structure gives us a thread-safe map to store references to the informers we create; the key uniquely identifies the ConfigMap being watched. The KubeClient and Events fields are additions the pattern needs. The controller-runtime client does not expose a REST client, so we carry a plain client-go clientset for building the per-ConfigMap ListWatches. And because controller-runtime offers no way to push requests into a controller's workqueue directly, the dynamic informers will send GenericEvents on Events, which the controller consumes through a source.Channel watch.
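Wiring those pieces together happens in SetupWithManager, extending the simplified version from the start of the article. The following is a minimal sketch under the same (older) builder/source API used earlier; the exact wiring, like the Events channel itself, is an assumption of this write-up rather than something controller-runtime prescribes. It assumes imports of k8s.io/client-go/kubernetes plus the controller-runtime source, handler, and event packages.

// controller/customapp_controller.go (sketch)
func (r *CustomAppReconciler) SetupWithManager(mgr ctrl.Manager) error {
	r.Manager = mgr
	r.activeInformers = make(map[string]activeInformer)
	r.Events = make(chan event.GenericEvent)

	// A plain client-go clientset gives us the REST client needed to build
	// the per-ConfigMap ListWatches later on.
	kubeClient, err := kubernetes.NewForConfig(mgr.GetConfig())
	if err != nil {
		return err
	}
	r.KubeClient = kubeClient

	return ctrl.NewControllerManagedBy(mgr).
		For(&appv1.CustomApp{}).
		Owns(&appsv1.Deployment{}).
		Owns(&corev1.Service{}).
		// Requests produced by our dynamic informers arrive on this channel
		// and are enqueued like any other event.
		Watches(&source.Channel{Source: r.Events}, &handler.EnqueueRequestForObject{}).
		Complete(r)
}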
2. Implementing the Dynamic Watch Logic
Now, we modify the Reconcile function. Its new responsibility is not just to apply state, but also to manage the lifecycle of its associated watches.
// controller/customapp_controller.go
const customAppFinalizer = "app.example.com/finalizer"

func (r *CustomAppReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	log := log.FromContext(ctx)

	var customApp appv1.CustomApp
	if err := r.Get(ctx, req.NamespacedName, &customApp); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// --- Finalizer Logic for Cleanup ---
	// This will be explained in the next section, but is critical for the pattern.
	if customApp.ObjectMeta.DeletionTimestamp.IsZero() {
		// The object is not being deleted, so if it does not have our finalizer,
		// then let's add the finalizer and update the object.
		if !controllerutil.ContainsFinalizer(&customApp, customAppFinalizer) {
			controllerutil.AddFinalizer(&customApp, customAppFinalizer)
			if err := r.Update(ctx, &customApp); err != nil {
				return ctrl.Result{}, err
			}
		}
	} else {
		// The object is being deleted.
		if controllerutil.ContainsFinalizer(&customApp, customAppFinalizer) {
			// Our finalizer is present, so let's handle external dependency cleanup.
			if err := r.cleanupDynamicWatch(ctx, &customApp); err != nil {
				// If cleanup of the external dependency fails here, return the error
				// so that it can be retried.
				return ctrl.Result{}, err
			}
			// Remove our finalizer from the list and update the object.
			controllerutil.RemoveFinalizer(&customApp, customAppFinalizer)
			if err := r.Update(ctx, &customApp); err != nil {
				return ctrl.Result{}, err
			}
		}
		// Stop reconciliation as the item is being deleted.
		return ctrl.Result{}, nil
	}

	// --- Dynamic Watch Setup Logic ---
	if err := r.setupDynamicWatch(ctx, &customApp); err != nil {
		log.Error(err, "failed to set up dynamic watch for ConfigMap")
		return ctrl.Result{}, err
	}

	// --- Core Reconciliation Logic ---
	// We can now read the ConfigMap from the dynamic informer's local store
	// (one possible helper is sketched below) instead of hitting the API server.
	// Note: the informer might not be synced on the very first reconcile.
	// Production code needs to handle this, perhaps by requeuing.
	log.Info("Reconciling CustomApp and its dependencies")

	// ... existing logic to create/update the Deployment using the ConfigMap data ...

	return ctrl.Result{}, nil
}
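// Reading the referenced ConfigMap deserves a note: the object lives in the
// store of the dynamic informer we started, not in the manager's cache, so we
// look it up there instead of issuing another API call. A minimal helper
// (hypothetical, not part of the original walkthrough) could look like this:
func (r *CustomAppReconciler) getReferencedConfigMap(app *appv1.CustomApp) (*corev1.ConfigMap, error) {
	informerKey := fmt.Sprintf("%s/%s", app.Namespace, app.Spec.ConfigMapName)

	r.mutex.RLock()
	activeInf, found := r.activeInformers[informerKey]
	r.mutex.RUnlock()
	if !found {
		return nil, fmt.Errorf("no informer registered for %s", informerKey)
	}

	// SharedIndexInformer stores key objects as "<namespace>/<name>",
	// which matches the informerKey format used above.
	obj, exists, err := activeInf.informer.GetStore().GetByKey(informerKey)
	if err != nil {
		return nil, err
	}
	if !exists {
		return nil, fmt.Errorf("ConfigMap %s not found in informer store", informerKey)
	}
	return obj.(*corev1.ConfigMap), nil
}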
func (r *CustomAppReconciler) setupDynamicWatch(ctx context.Context, app *appv1.CustomApp) error {
	log := log.FromContext(ctx)
	informerKey := fmt.Sprintf("%s/%s", app.Namespace, app.Spec.ConfigMapName)

	r.mutex.RLock()
	_, found := r.activeInformers[informerKey]
	r.mutex.RUnlock()
	if found {
		log.V(1).Info("Informer for ConfigMap already exists", "key", informerKey)
		return nil
	}

	log.Info("Setting up new dynamic informer for ConfigMap", "key", informerKey)

	// Create a ListWatch scoped to a single ConfigMap in a single namespace.
	// The controller-runtime client does not expose a REST client, so we use
	// the typed clientset stored on the reconciler.
	lw := cache.NewFilteredListWatchFromClient(
		r.KubeClient.CoreV1().RESTClient(),
		"configmaps",
		app.Namespace,
		func(options *metav1.ListOptions) {
			options.FieldSelector = fmt.Sprintf("metadata.name=%s", app.Spec.ConfigMapName)
		},
	)
	inf := cache.NewSharedIndexInformer(lw, &corev1.ConfigMap{}, 0, cache.Indexers{})

	// Add an event handler to the informer. This handler will be called whenever
	// the watched ConfigMap is added, updated, or deleted.
	inf.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    r.enqueueReconciliationForConfigMap,
		UpdateFunc: func(old, new interface{}) { r.enqueueReconciliationForConfigMap(new) },
		DeleteFunc: r.enqueueReconciliationForConfigMap,
	})

	// Start the informer.
	stopCh := make(chan struct{})
	go inf.Run(stopCh)

	// Wait for the cache to sync. In a real operator, you might do this
	// in the background and requeue the reconciliation if it's not ready.
	if !cache.WaitForCacheSync(stopCh, inf.HasSynced) {
		close(stopCh) // clean up on failure
		return fmt.Errorf("failed to sync cache for informer %s", informerKey)
	}

	r.mutex.Lock()
	defer r.mutex.Unlock()
	r.activeInformers[informerKey] = activeInformer{
		informer: inf,
		stopCh:   stopCh,
	}
	log.Info("Successfully started and synced dynamic informer", "key", informerKey)
	return nil
}
// This function is called by the informer's event handler.
func (r *CustomAppReconciler) enqueueReconciliationForConfigMap(obj interface{}) {
	// On deletes the informer may hand us a cache.DeletedFinalStateUnknown
	// tombstone instead of the object; skipping it keeps the example simple.
	cm, ok := obj.(*corev1.ConfigMap)
	if !ok {
		return
	}
	log := r.Manager.GetLogger().WithValues("configmap", cm.Name)

	// Find all CustomApp objects that reference this ConfigMap.
	// This requires an index for efficiency, which we'll set up next.
	var appList appv1.CustomAppList
	if err := r.List(context.Background(), &appList, client.InNamespace(cm.Namespace), client.MatchingFields{".spec.configMapName": cm.Name}); err != nil {
		log.Error(err, "failed to list CustomApps for ConfigMap change")
		return
	}

	for i := range appList.Items {
		app := &appList.Items[i]
		log.V(1).Info("Enqueuing reconciliation for CustomApp due to ConfigMap change", "customapp", app.Name)
		// The manager exposes no API for enqueuing requests directly, so we hand
		// the object to the source.Channel watch registered in SetupWithManager;
		// controller-runtime turns each GenericEvent into a normal reconcile.Request.
		r.Events <- event.GenericEvent{Object: app}
	}
}
3. Indexing for Efficient Lookups
The enqueueReconciliationForConfigMap function needs an efficient way to find CustomApps that reference a given ConfigMap. A full LIST is what we wanted to avoid. The solution is to add an index to the manager's cache at startup.
// main.go
func main() {
	// ... setup code ...

	if err := mgr.GetFieldIndexer().IndexField(context.Background(), &appv1.CustomApp{}, ".spec.configMapName", func(rawObj client.Object) []string {
		app := rawObj.(*appv1.CustomApp)
		if app.Spec.ConfigMapName == "" {
			return nil
		}
		return []string{app.Spec.ConfigMapName}
	}); err != nil {
		setupLog.Error(err, "failed to create index for .spec.configMapName")
		os.Exit(1)
	}

	// ... create and add the reconciler to the manager ...
}
With this index in place, our r.List(..., client.MatchingFields{".spec.configMapName": cm.Name}) call becomes a highly efficient lookup against the local cache, not an API server query.
The Critical Role of Finalizers for Cleanup
Our current implementation has a major flaw: a resource leak. When a CustomApp CR is deleted, its reconciliation loop stops, but the dynamic informer we started continues to run indefinitely. It holds a watch connection to the API server and consumes memory in the operator pod. We must clean it up.
This is the perfect use case for a Kubernetes Finalizer. A finalizer is a key in an object's metadata that tells the API server not to fully delete the object until that key is removed. It's a hook that allows our controller to perform cleanup actions before the object is garbage collected.
The finalizer logic was already included in the Reconcile function above. Let's break down the flow:
1. When a CustomApp is first reconciled and does not have our finalizer (app.example.com/finalizer), we add it and update the object. This is the first action in the reconciliation.
2. When a user runs kubectl delete customapp my-app, the API server doesn't immediately delete it. Instead, it sets the metadata.deletionTimestamp field and triggers a final reconciliation.
3. Our Reconcile function detects that deletionTimestamp is set. It then calls our cleanup function, cleanupDynamicWatch.
4. Once cleanup succeeds, we remove the finalizer. When deletionTimestamp is set and the finalizer list is empty, the Kubernetes garbage collector will permanently delete the object.

Here is the implementation for the cleanup function:
// controller/customapp_controller.go
func (r *CustomAppReconciler) cleanupDynamicWatch(ctx context.Context, app *appv1.CustomApp) error {
	log := log.FromContext(ctx)
	informerKey := fmt.Sprintf("%s/%s", app.Namespace, app.Spec.ConfigMapName)

	r.mutex.Lock()
	defer r.mutex.Unlock()

	if activeInf, found := r.activeInformers[informerKey]; found {
		log.Info("Cleaning up and stopping dynamic informer", "key", informerKey)
		close(activeInf.stopCh) // Signal the informer to stop
		delete(r.activeInformers, informerKey)
	} else {
		log.V(1).Info("No active informer found for cleanup, skipping", "key", informerKey)
	}
	return nil
}
This function safely finds the active informer in our map, closes its stop channel to terminate the Run goroutine, and removes it from the map. The combination of dynamic informers and finalizers creates a robust, self-managing system that is both highly performant and resource-efficient.
Performance Considerations and Benchmarks
Let's analyze the performance difference between the naive approach and our dynamic informer pattern in a hypothetical scenario.
Scenario:
* 1,000 CustomApp CRs deployed in a cluster.
* Each CustomApp references a unique ConfigMap.
* Default controller-runtime resync period is 10 hours (so reconciliations are primarily event-driven).
Analysis A: Naive r.Get() Approach
* Initial State: 1,000 CustomApp CRs exist. No direct API load from the operator at rest.
* Event: A single ConfigMap is updated by a user.
* Operator Reaction: Nothing. The operator is not watching ConfigMaps, so it is completely unaware of the change. The application configuration becomes stale.
* Event: A single CustomApp spec is changed (e.g., its image tag is updated).
* Operator Reaction: A reconciliation is triggered. Inside the loop, r.Get() is called for the ConfigMap. This results in 1 GET API call.
* Worst Case (Thundering Herd): If an operator restart or a global change causes all 1,000 CustomApps to be reconciled around the same time, this would generate 1,000 GET API calls in a short burst.
Analysis B: Dynamic Informer with Finalizers Approach
* Initial State: As the 1,000 CustomApps are created, the operator establishes 1,000 WATCH connections to the API server, one for each referenced ConfigMap. This is a higher initial connection count, but WATCH is a very efficient, long-lived connection type designed for this purpose.
* Event: A single ConfigMap is updated by a user.
* Operator Reaction: The API server pushes the change event down the corresponding WATCH connection. The specific informer for that ConfigMap receives the event. Its event handler fires, looks up the associated CustomApp via the index (a fast, local operation), and enqueues a single reconciliation request. Total API server load: Effectively 0 direct calls, just the push event on an existing connection.
* Event: A single CustomApp spec is changed.
* Operator Reaction: A reconciliation is triggered. The logic reads the ConfigMap data from the local informer cache. Total API server load: 0 GET API calls.
* Worst Case (Thundering Herd): On operator restart, it will reconcile all 1,000 CustomApps. It will re-establish the 1,000 WATCH connections. While this is a burst of activity, it's setting up efficient streams rather than performing inefficient polling. Subsequent operations are near-zero cost from an API call perspective.
Memory Overhead: The primary trade-off is memory consumption in the operator pod. Each informer maintains a local cache of its object. Since we are creating one informer per specific ConfigMap, the cache for each is tiny (it holds just one object). The overhead of the informer machinery itself is non-zero but generally manageable compared to the performance gains and API server stability it provides.
Advanced Edge Cases and Production Hardening
This pattern is powerful, but production environments demand resilience.
1. Handling ConfigMap Reference Changes
What if a CustomApp is updated to point to a different ConfigMap?
# OLD
spec:
  configMapName: my-config-v1

# NEW
spec:
  configMapName: my-config-v2
Our current setupDynamicWatch logic is idempotent; it checks if an informer exists and does nothing if it does. This is insufficient. We need to detect the change and perform a handoff: stop the old informer and start a new one.
This requires storing the last-seen ConfigMap name in the CustomApp's status and performing a small handoff in Reconcile (a sketch follows the list):
* In Reconcile, compare customApp.Spec.ConfigMapName with customApp.Status.ObservedConfigMapName.
* If they differ, call cleanupDynamicWatch for the *old* name from the status.
* Then, call setupDynamicWatch for the *new* name from the spec.
* Finally, update the status field ObservedConfigMapName to the new value.
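A minimal sketch of that handoff, assuming a Status.ObservedConfigMapName string field has been added to the CRD (the field and function names here are illustrative, not part of the original API):

// controller/customapp_controller.go (sketch; call from Reconcile before the core logic)
func (r *CustomAppReconciler) handleConfigMapReferenceChange(ctx context.Context, app *appv1.CustomApp) error {
	observed := app.Status.ObservedConfigMapName

	if observed != "" && observed != app.Spec.ConfigMapName {
		// Stop the informer for the previously referenced ConfigMap.
		oldApp := app.DeepCopy()
		oldApp.Spec.ConfigMapName = observed
		if err := r.cleanupDynamicWatch(ctx, oldApp); err != nil {
			return err
		}
	}

	// Start (or confirm) the informer for the currently referenced ConfigMap.
	if err := r.setupDynamicWatch(ctx, app); err != nil {
		return err
	}

	// Record what we are now watching.
	if observed != app.Spec.ConfigMapName {
		app.Status.ObservedConfigMapName = app.Spec.ConfigMapName
		return r.Status().Update(ctx, app)
	}
	return nil
}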
2. Operator Restarts and State
Our activeInformers map is held in memory. What happens if the operator pod crashes and restarts? The map is lost. This is not a problem; in fact, it's a feature of the declarative nature of Kubernetes.
Upon restart, controller-runtime will automatically enqueue a reconciliation for every existing CustomApp object. As each CustomApp is reconciled, our setupDynamicWatch logic will be called, and the in-memory map of active informers will be rebuilt from the ground up, reflecting the true state of the cluster. Finalizers cover the deletion side: CRs deleted during the downtime are still routed through our cleanup path once the operator is back (a harmless no-op, since their informers died with the old pod) before the objects are finally removed.
3. Required RBAC Permissions
This pattern requires more granular RBAC permissions. Your operator's ClusterRole must have get, list, and watch permissions on the resources it will be dynamically watching. In our case, this means adding configmaps to the rule set:
# config/rbac/role.yaml
- apiGroups: [""]
  resources: ["configmaps"]
  verbs: ["get", "list", "watch"]
Without these permissions, the informer's list and watch requests will be rejected by the API server, the cache will never sync, and setupDynamicWatch will return an error, failing the reconciliation.
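If the project generates its RBAC from kubebuilder markers instead of hand-editing role.yaml, the equivalent marker on the reconciler would be:

// controller/customapp_controller.go
//+kubebuilder:rbac:groups=core,resources=configmaps,verbs=get;list;watch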
Conclusion
While controller-runtime provides excellent defaults for simple ownership models, building sophisticated, production-grade Kubernetes operators often requires delving deeper into the client-go caching and watch machinery. The naive approach of calling Get() for non-owned resources in a reconciliation loop is an anti-pattern that leads to severe performance issues at scale.
By embracing a dynamic informer management strategy, we shift from an inefficient, poll-based model to a highly efficient, event-driven architecture. This pattern, underpinned by the crucial cleanup guarantees of finalizers, allows an operator to maintain a minimal set of targeted watches, reacting instantly to changes in related dependencies without overwhelming the Kubernetes API server. It is a testament to the power and flexibility of the controller pattern and a key technique for any senior engineer working in the Kubernetes ecosystem.