Dynamic Kubernetes Policy Enforcement with Validating Admission Webhooks

Goh Ling Yong

The Governance Gap: Beyond Linting and CI Gates

In any mature Kubernetes environment, maintaining operational and security standards is a perpetual challenge. While static analysis tools (like kubeval or conftest) and CI/CD pipeline gates are essential first lines of defense, they have a fundamental limitation: they operate before a resource is submitted to the cluster. They cannot prevent a user with kubectl access from directly applying non-compliant manifests, nor can they react to the dynamic state of the cluster during an admission request.

This is where Kubernetes Admission Controllers come into play. They are plugins that intercept requests to the Kubernetes API server after the request is authenticated and authorized, but before the object is persisted in etcd. They are the final gatekeepers of cluster state.

While Kubernetes provides several built-in admission controllers (LimitRanger, ResourceQuota, etc.), they often lack the custom, business-specific logic required by organizations. The solution is the ValidatingAdmissionWebhook and MutatingAdmissionWebhook. These allow you to deploy your own custom logic as a webhook server that the API server calls synchronously during the admission process.

This article is a deep dive into building a production-grade Validating Admission Webhook in Go. We will not cover the basics. We assume you know what a webhook is and why you'd want one. Instead, we will focus on the intricate details of a real-world implementation: enforcing a common, critical policy that all containers in Deployments and StatefulSets must have CPU and memory limits and requests specified.

We will cover:

  • The Webhook Server Architecture: Building a lean, efficient HTTP server in Go to handle AdmissionReview objects.
  • Automated TLS Management: Solving the complex certificate lifecycle problem using cert-manager—the de facto production standard.
  • Deployment and Configuration: Crafting the Deployment, Service, and ValidatingWebhookConfiguration manifests with a focus on resilience and scope.
  • Advanced Considerations: Tackling performance bottlenecks, ensuring high availability, and preventing self-invocation deadlocks.

    Core Architecture: The Synchronous API Server Callback

    Before we write a line of code, it's critical to internalize the request flow. When a user runs kubectl apply -f my-deployment.yaml, the following happens in our webhook's context:

  • Authentication & Authorization: The API server authenticates the user and authorizes the action (CREATE or UPDATE a Deployment).
  • Webhook Invocation: The API server checks its configured ValidatingWebhookConfiguration resources. If the incoming request matches a rule, it constructs an AdmissionReview JSON object containing the full resource manifest.
  • Synchronous POST: The API server sends a synchronous HTTP POST request to the Service endpoint defined in the webhook configuration.
  • Webhook Logic Execution: Our custom Go server receives the AdmissionReview object, deserializes it, and executes our validation logic.
  • Admission Response: Our server constructs an AdmissionReview response. For a validating webhook, the crucial fields are allowed (a boolean) and an optional status object with a message and code.
  • API Server Action: The API server receives the response.
    - If allowed: true, the request proceeds, and the object is persisted to etcd.
    - If allowed: false, the API server rejects the request and returns an error to the client (kubectl), including the message from our webhook's response.

    The key takeaway is synchronicity. Our webhook is in the critical path of API requests. If it's slow, the entire cluster's resource management becomes slow. If it's down and the failurePolicy is Fail, a significant portion of cluster operations can be halted.
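
    To make the payload concrete, here is a trimmed AdmissionReview request of the kind our handler will receive. Only the fields the handler actually touches are shown; the uid and object values are placeholders, not real data:

    json
    {
      "apiVersion": "admission.k8s.io/v1",
      "kind": "AdmissionReview",
      "request": {
        "uid": "<unique request id, echoed back in our response>",
        "kind": { "group": "apps", "version": "v1", "kind": "Deployment" },
        "operation": "CREATE",
        "object": { "...": "the full Deployment manifest being applied" }
      }
    }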

    Building the Validating Webhook Server in Go

    Let's implement a webhook that enforces a simple but powerful policy: All containers within Deployments and StatefulSets must have both CPU and memory requests and limits defined.

    Project Structure

    bash
    . 
    ├── cmd/
    │   └── main.go         # Server entrypoint
    ├── internal/
    │   └── webhook/
    │       └── validate.go   # Core validation logic
    ├── pkg/
    │   └── certs/
    │       └── tls.go        # (Placeholder for manual certs, we'll use cert-manager)
    ├── Dockerfile
    ├── go.mod
    └── go.sum

    The HTTP Server (`cmd/main.go`)

    We'll use the standard net/http library. The server needs to handle two things: a /validate endpoint for admission reviews and a /healthz endpoint for liveness/readiness probes.

    go
    // cmd/main.go
    package main
    
    import (
    	"context"
    	"crypto/tls"
    	"fmt"
    	"log"
    	"net/http"
    	"os"
    	"os/signal"
    	"syscall"
    
    	"github.com/your-org/k8s-webhook-example/internal/webhook"
    )
    
    func main() {
    	log.Println("Starting webhook server...")
    
    	// Load TLS certificates. For production, these are mounted from a secret
    	// managed by cert-manager.
    	certPath := "/etc/webhook/certs/tls.crt"
    	keyPath := "/etc/webhook/certs/tls.key"
    
    	pair, err := tls.LoadX509KeyPair(certPath, keyPath)
    	if err != nil {
    		log.Fatalf("Failed to load key pair: %v", err)
    	}
    
    	// Define server and handlers
    	mux := http.NewServeMux()
    	mux.HandleFunc("/validate", webhook.HandleValidate)
    	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
    		w.WriteHeader(http.StatusOK)
    		fmt.Fprint(w, "ok")
    	})
    
    	server := &http.Server{
    		Addr:      ":8443",
    		Handler:   mux,
    		TLSConfig: &tls.Config{Certificates: []tls.Certificate{pair}, MinVersion: tls.VersionTLS12},
    	}
    
    	// Start server in a goroutine
    	go func() {
    		if err := server.ListenAndServeTLS("", ""); err != nil && err != http.ErrServerClosed {
    			log.Fatalf("ListenAndServeTLS failed: %v", err)
    		}
    	}()
    
    	log.Println("Server started on port 8443")
    
    	// Graceful shutdown
    	signalChan := make(chan os.Signal, 1)
    	signal.Notify(signalChan, syscall.SIGINT, syscall.SIGTERM)
    	<-signalChan
    
    	log.Println("Shutdown signal received, exiting gracefully...")
    	server.Shutdown(context.Background())
    }
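
    One operational wrinkle: tls.LoadX509KeyPair reads the certificate exactly once at start-up, so after cert-manager renews the mounted secret the pod keeps serving the old certificate until it restarts. A minimal sketch of one way around this, reloading on each handshake with the same certPath and keyPath as above (when GetCertificate is set, the Certificates slice can be left empty):

    go
    	// Sketch: swap this in for the TLSConfig in main.go above.
    	// Reloading the key pair on every TLS handshake means certificates
    	// renewed by cert-manager are picked up without restarting the pod.
    	// Under heavy load you would cache the pair and re-read it only
    	// when the files on disk change.
    	tlsConfig := &tls.Config{
    		MinVersion: tls.VersionTLS12,
    		GetCertificate: func(_ *tls.ClientHelloInfo) (*tls.Certificate, error) {
    			pair, err := tls.LoadX509KeyPair(certPath, keyPath)
    			if err != nil {
    				return nil, err
    			}
    			return &pair, nil
    		},
    	}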

    The Core Validation Logic (`internal/webhook/validate.go`)

    This is where we decode the AdmissionReview request, inspect the object, apply our policy, and craft the response.

    go
    // internal/webhook/validate.go
    package webhook
    
    import (
    	"encoding/json"
    	"fmt"
    	"io/ioutil"
    	"log"
    	"net/http"
    
    	admissionv1 "k8s.io/api/admission/v1"
    	appsv1 "k8s.io/api/apps/v1"
    	corev1 "k8s.io/api/core/v1"
    	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    	"k8s.io/apimachinery/pkg/runtime"
    	"k8s.io/apimachinery/pkg/runtime/serializer"
    )
    
    var (
    	// UniversalDeserializer is used to decode objects from the AdmissionReview request.
    	UniversalDeserializer = serializer.NewCodecFactory(runtime.NewScheme()).UniversalDeserializer()
    )
    
    func HandleValidate(w http.ResponseWriter, r *http.Request) {
    	body, err := io.ReadAll(r.Body)
    	if err != nil {
    		log.Printf("Error reading request body: %v", err)
    		http.Error(w, "could not read request body", http.StatusBadRequest)
    		return
    	}
    
    	// Decode the AdmissionReview request
    	var admissionReviewReq admissionv1.AdmissionReview
    	if _, _, err := UniversalDeserializer.Decode(body, nil, &admissionReviewReq); err != nil {
    		log.Printf("Error decoding admission review: %v", err)
    		http.Error(w, "could not decode admission review", http.StatusBadRequest)
    		return
    	}
    
    	if admissionReviewReq.Request == nil {
    		http.Error(w, "malformed admission review: request is nil", http.StatusBadRequest)
    		return
    	}
    
    	// Default to allowing the request
    	admissionResponse := &admissionv1.AdmissionResponse{
    		UID:     admissionReviewReq.Request.UID,
    		Allowed: true,
    	}
    
    	// Apply validation logic based on the resource Kind
    	var validationError error
    	switch admissionReviewReq.Request.Kind.Kind {
    	case "Deployment":
    		var deployment appsv1.Deployment
    		if err := json.Unmarshal(admissionReviewReq.Request.Object.Raw, &deployment); err != nil {
    			log.Printf("Error unmarshalling deployment: %v", err)
    			validationError = fmt.Errorf("could not unmarshal Deployment object")
    		} else {
    			validationError = validatePodSpec(deployment.Spec.Template.Spec, deployment.Name, deployment.Kind)
    		}
    	case "StatefulSet":
    		var statefulSet appsv1.StatefulSet
    		if err := json.Unmarshal(admissionReviewReq.Request.Object.Raw, &statefulSet); err != nil {
    			log.Printf("Error unmarshalling statefulset: %v", err)
    			validationError = fmt.Errorf("could not unmarshal StatefulSet object")
    		} else {
    			validationError = validatePodSpec(statefulSet.Spec.Template.Spec, statefulSet.Name, statefulSet.Kind)
    		}
    	}
    
    	if validationError != nil {
    		admissionResponse.Allowed = false
    		admissionResponse.Result = &metav1.Status{
    			Message: validationError.Error(),
    			Code:    http.StatusForbidden,
    		}
    	}
    
    	// Construct the final AdmissionReview response
    	admissionReviewResp := admissionv1.AdmissionReview{
    		TypeMeta: metav1.TypeMeta{
    			APIVersion: "admission.k8s.io/v1",
    			Kind:       "AdmissionReview",
    		},
    		Response: admissionResponse,
    	}
    
    	// Send the response
    	respBytes, err := json.Marshal(admissionReviewResp)
    	if err != nil {
    		log.Printf("Error marshalling response: %v", err)
    		http.Error(w, "could not encode response", http.StatusInternalServerError)
    		return
    	}
    
    	w.Header().Set("Content-Type", "application/json")
    	w.WriteHeader(http.StatusOK)
    	_, _ = w.Write(respBytes)
    }
    
    // validatePodSpec contains the core policy logic.
    func validatePodSpec(podSpec corev1.PodSpec, name, kind string) error {
    	for _, container := range podSpec.Containers {
    		if container.Resources.Limits == nil || container.Resources.Requests == nil {
    			return fmt.Errorf("resource limits and requests are not defined for container '%s' in %s '%s'", container.Name, kind, name)
    		}
    		if _, ok := container.Resources.Limits[corev1.ResourceCPU]; !ok {
    			return fmt.Errorf("cpu limit is not defined for container '%s' in %s '%s'", container.Name, kind, name)
    		}
    		if _, ok := container.Resources.Limits[corev1.ResourceMemory]; !ok {
    			return fmt.Errorf("memory limit is not defined for container '%s' in %s '%s'", container.Name, kind, name)
    		}
    		if _, ok := container.Resources.Requests[corev1.ResourceCPU]; !ok {
    			return fmt.Errorf("cpu request is not defined for container '%s' in %s '%s'", container.Name, kind, name)
    		}
    		if _, ok := container.Resources.Requests[corev1.ResourceMemory]; !ok {
    			return fmt.Errorf("memory request is not defined for container '%s' in %s '%s'", container.Name, kind, name)
    		}
    	}
    	return nil
    }

    Notice we unmarshal admissionReviewReq.Request.Object.Raw into the specific type (appsv1.Deployment). This gives us a strongly-typed object to work with, which is far more robust than digging through untyped maps or raw JSON fields.


    Production-Grade TLS with cert-manager

    The API server must communicate with the webhook over HTTPS. This means our webhook needs a TLS certificate signed by a Certificate Authority (CA) that the API server trusts. Manually generating these certs with openssl, distributing the CA, and managing rotation is an operational nightmare.

    This is a solved problem. We will use cert-manager, a Kubernetes-native certificate management tool.

    The Workflow:

  • Install cert-manager: A one-time setup in your cluster (a typical install command is shown after this list).
  • Create an Issuer or ClusterIssuer: This resource tells cert-manager how to issue certificates. We'll use a self-signed issuer for simplicity; in production, this could be Let's Encrypt or an internal CA such as HashiCorp Vault.
  • Create a Certificate Resource: This resource tells cert-manager what certificate to issue. We'll request a certificate for our webhook's Service DNS name (e.g., resource-validator.validation-ns.svc).
  • cert-manager Magic: cert-manager will see the Certificate resource, generate a key pair, sign it using the configured Issuer, and store the result in a Secret (e.g., webhook-tls-secret). It will also automatically handle renewal before expiration.
  • Mount the Secret: We mount this Secret as a volume into our webhook pod at /etc/webhook/certs/.
  • Inject CA Bundle: The ValidatingWebhookConfiguration needs the CA bundle to verify our webhook's certificate. cert-manager has a cainjector controller that automatically patches our webhook configuration with the correct CA data.
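
    For reference, a typical installation applies the static manifest from the cert-manager releases page. The version below is a placeholder, not a real tag; pin a current release:

    bash
    # One-time cert-manager install; replace <version> with a current release tag
    kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/<version>/cert-manager.yaml
    # Wait until the controller, webhook, and cainjector pods are Running
    kubectl get pods -n cert-manager
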
    Kubernetes Manifests for Deployment

    Here are the complete manifests to deploy our webhook and configure cert-manager.

    yaml
    # 00-namespace.yaml
    apiVersion: v1
    kind: Namespace
    metadata:
      name: validation-ns
    ---
    # 01-cert-manager-issuer.yaml
    # This assumes cert-manager is already installed in the cluster.
    apiVersion: cert-manager.io/v1
    kind: Issuer
    metadata:
      name: selfsigned-issuer
      namespace: validation-ns
    spec:
      selfSigned: {}
    ---
    # 02-cert-manager-certificate.yaml
    apiVersion: cert-manager.io/v1
    kind: Certificate
    metadata:
      name: webhook-server-cert
      namespace: validation-ns
    spec:
      secretName: webhook-tls-secret # The secret where certs will be stored
      dnsNames:
        - resource-validator.validation-ns.svc
        - resource-validator.validation-ns.svc.cluster.local
      issuerRef:
        name: selfsigned-issuer
        kind: Issuer
    ---
    # 03-deployment.yaml
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: resource-validator
      namespace: validation-ns
      labels:
        app: resource-validator
    spec:
      replicas: 2 # Start with HA in mind
      selector:
        matchLabels:
          app: resource-validator
      template:
        metadata:
          labels:
            app: resource-validator
        spec:
          containers:
            - name: webhook
              # Replace with your actual image
              image: your-registry/k8s-webhook-example:latest
              ports:
                - containerPort: 8443
                  name: webhook-api
              volumeMounts:
                - name: webhook-certs
                  mountPath: /etc/webhook/certs
                  readOnly: true
              readinessProbe:
                httpGet:
                  scheme: HTTPS
                  path: /healthz
                  port: 8443
                initialDelaySeconds: 5
                periodSeconds: 10
          volumes:
            - name: webhook-certs
              secret:
                secretName: webhook-tls-secret
    ---
    # 04-service.yaml
    apiVersion: v1
    kind: Service
    metadata:
      name: resource-validator
      namespace: validation-ns
    spec:
      selector:
        app: resource-validator
      ports:
        - port: 443
          targetPort: 8443
    ---
    # 05-validating-webhook.yaml
    apiVersion: admissionregistration.k8s.io/v1
    kind: ValidatingWebhookConfiguration
    metadata:
      name: resource-validator-webhook
      # This annotation tells cert-manager's cainjector to patch the caBundle field.
      annotations:
        cert-manager.io/inject-ca-from: validation-ns/webhook-server-cert
    webhooks:
      - name: validator.your-domain.com
        clientConfig:
          service:
            namespace: validation-ns
            name: resource-validator
            path: "/validate"
          # caBundle will be populated by cert-manager
        rules:
          - operations: ["CREATE", "UPDATE"]
            apiGroups: ["apps"]
            apiVersions: ["v1"]
            resources: ["deployments", "statefulsets"]
        # CRITICAL: This determines what happens if your webhook is down.
        failurePolicy: Fail
        # The webhook has no side effects on the system.
        sideEffects: None
        # Use v1 for production clusters.
        admissionReviewVersions: ["v1"]
        # How long the API server will wait for a response.
        timeoutSeconds: 5

    After applying these, cert-manager creates the secret, our Deployment mounts it, and the cainjector patches the ValidatingWebhookConfiguration with the CA. The webhook is now live.
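
    A quick sanity check before relying on it (resource names match the manifests above):

    bash
    # The TLS secret created by cert-manager
    kubectl get secret webhook-tls-secret -n validation-ns

    # The caBundle injected by cainjector (should print non-empty base64 data)
    kubectl get validatingwebhookconfiguration resource-validator-webhook \
      -o jsonpath='{.webhooks[0].clientConfig.caBundle}'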

    Advanced Topics and Production Hardening

    Getting a simple webhook running is one thing; making it resilient and performant is another.

    Edge Case: Avoiding Webhook Deadlocks

    What happens if your webhook is configured to intercept Secrets, and cert-manager (which creates Secrets) operates in a namespace the webhook watches? cert-manager tries to create the webhook's own TLS Secret, which triggers a call to your webhook. The webhook pod can't start because it's waiting for that Secret to be mounted; the Secret can't be created because, with failurePolicy: Fail, the API server requires an answer from a webhook that isn't running. Deadlock.

    Solution: Use namespace selectors to exclude critical infrastructure namespaces.

    yaml
    # In ValidatingWebhookConfiguration
    # ...
    webhooks:
      - name: validator.your-domain.com
        # ...
        namespaceSelector:
          matchExpressions:
            - key: kubernetes.io/metadata.name
              operator: NotIn
              values: ["kube-system", "cert-manager", "validation-ns"]

    This ensures your webhook ignores resources in its own namespace (validation-ns), the cert-manager namespace, and kube-system.

    Performance and Latency Considerations

    Your webhook's response time is added directly to the kubectl apply latency for every matching request. Milliseconds matter.

  • Scope Rules Tightly: The rules in your ValidatingWebhookConfiguration are your first line of defense. Never use wildcards like apiGroups: ["*"]. Be as specific as possible about the resources and operations you care about; the API server will not even attempt to call your webhook for non-matching requests.

  • Avoid External Calls: The validation logic should be self-contained. If your webhook needs to call another service (e.g., a policy engine like OPA, or even the Kubernetes API itself to check other resources), you introduce network latency and another point of failure. If you absolutely must make external calls, they need to be extremely fast and highly available. Use aggressive timeouts.

  • Set timeoutSeconds Conservatively: The default is 10 seconds, which is far too high for a production webhook. A user waiting 10 seconds for kubectl to respond will assume the cluster is broken. Start with a low value like 2 or 3 seconds and monitor. If your webhook regularly times out, it is too slow and needs optimization.

  • Monitor Everything: Expose Prometheus metrics from your Go application (a sketch follows below). Track:

    - webhook_requests_total: a counter for received requests.
    - webhook_request_duration_seconds: a histogram of request processing time.
    - webhook_allowed_requests_total and webhook_denied_requests_total: counters for outcomes.

    This data is invaluable for SLOs and performance tuning.
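
    A minimal sketch of those metrics using the Prometheus Go client (github.com/prometheus/client_golang); the file name and help strings are illustrative, not part of the project above:

    go
    // internal/webhook/metrics.go (illustrative sketch)
    package webhook

    import (
    	"github.com/prometheus/client_golang/prometheus"
    	"github.com/prometheus/client_golang/prometheus/promauto"
    )

    var (
    	// requestsTotal counts every AdmissionReview the handler receives.
    	requestsTotal = promauto.NewCounter(prometheus.CounterOpts{
    		Name: "webhook_requests_total",
    		Help: "Total AdmissionReview requests received.",
    	})

    	// requestDuration records end-to-end handler latency.
    	requestDuration = promauto.NewHistogram(prometheus.HistogramOpts{
    		Name:    "webhook_request_duration_seconds",
    		Help:    "Time spent processing an AdmissionReview request.",
    		Buckets: prometheus.DefBuckets,
    	})

    	// allowedTotal and deniedTotal count admission outcomes.
    	allowedTotal = promauto.NewCounter(prometheus.CounterOpts{
    		Name: "webhook_allowed_requests_total",
    		Help: "Requests admitted by the validation policy.",
    	})
    	deniedTotal = promauto.NewCounter(prometheus.CounterOpts{
    		Name: "webhook_denied_requests_total",
    		Help: "Requests rejected by the validation policy.",
    	})
    )

    Increment requestsTotal at the top of HandleValidate, observe requestDuration before returning, and expose everything with mux.Handle("/metrics", promhttp.Handler()) from github.com/prometheus/client_golang/prometheus/promhttp -- often on a separate plain-HTTP port so Prometheus does not need the webhook's TLS certificates.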

    High Availability and `failurePolicy`

    The failurePolicy field is critical.

  • failurePolicy: Fail (the default): If the webhook cannot be reached (DNS failure, network issue, all pods down, timeout), the API server rejects the resource request. This ensures your policies are always enforced, but it introduces a risk: an outage of your webhook can block deployments cluster-wide. You must run multiple replicas of your webhook pod across different nodes and have robust monitoring and alerting.

  • failurePolicy: Ignore: If the webhook cannot be reached, the API server allows the resource request to proceed. This prioritizes availability over strict policy enforcement. It is safer during initial rollout or for non-critical validation, but it means your policy can be bypassed during a webhook outage.

    For a critical policy like resource limits, Fail is usually the correct choice, but it comes with the operational burden of ensuring the webhook service is one of the most highly available components in your cluster.
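
    As a sketch of what that looks like in practice (names match the Deployment above; the values are illustrative), a PodDisruptionBudget keeps voluntary disruptions such as node drains from taking out all replicas at once:

    yaml
    # pdb.yaml (illustrative)
    apiVersion: policy/v1
    kind: PodDisruptionBudget
    metadata:
      name: resource-validator-pdb
      namespace: validation-ns
    spec:
      minAvailable: 1
      selector:
        matchLabels:
          app: resource-validator

    Pair this with topologySpreadConstraints or pod anti-affinity in the Deployment's pod template so the two replicas do not land on the same node.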

    Testing Your Webhook

    Testing is non-negotiable.

  • Unit Tests: The core validation logic (the validatePodSpec function) should be pure and easily testable. Feed it various corev1.PodSpec structs and assert the error response is correct (a minimal example follows this list).
  • Integration Tests: Use a framework like kubebuilder's envtest to spin up a real etcd and kube-apiserver in memory during your test suite. This allows you to test the full HTTP request/response flow without needing a real cluster.
  • End-to-End (E2E) Tests: In a staging cluster, deploy the webhook and then run a script that attempts to apply both compliant and non-compliant manifests. Assert that the non-compliant ones fail with the expected error message.
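
    A minimal unit-test sketch for the rejection path, assuming the package layout shown earlier:

    go
    // internal/webhook/validate_test.go (illustrative sketch)
    package webhook

    import (
    	"testing"

    	corev1 "k8s.io/api/core/v1"
    	"k8s.io/apimachinery/pkg/api/resource"
    )

    func TestValidatePodSpecRejectsMissingLimits(t *testing.T) {
    	spec := corev1.PodSpec{
    		Containers: []corev1.Container{{
    			Name: "app",
    			Resources: corev1.ResourceRequirements{
    				// Requests are set, but Limits are intentionally missing.
    				Requests: corev1.ResourceList{
    					corev1.ResourceCPU:    resource.MustParse("100m"),
    					corev1.ResourceMemory: resource.MustParse("64Mi"),
    				},
    			},
    		}},
    	}

    	if err := validatePodSpec(spec, "demo", "Deployment"); err == nil {
    		t.Fatal("expected an error for a container without resource limits")
    	}
    }
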
    Let's see it in action. Create two files:

    yaml
    # compliant-deployment.yaml
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: nginx-compliant
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: nginx
      template:
        metadata:
          labels:
            app: nginx
        spec:
          containers:
          - name: nginx
            image: nginx:1.21
            resources:
              limits:
                cpu: "200m"
                memory: "128Mi"
              requests:
                cpu: "100m"
                memory: "64Mi"

    yaml
    # non-compliant-deployment.yaml
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: nginx-non-compliant
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: nginx
      template:
        metadata:
          labels:
            app: nginx
        spec:
          containers:
          - name: nginx
            image: nginx:1.21
            # Missing resources section

    Apply them:

    bash
    $ kubectl apply -f compliant-deployment.yaml
    deployment.apps/nginx-compliant created
    
    $ kubectl apply -f non-compliant-deployment.yaml
    Error from server: error when creating "non-compliant-deployment.yaml": admission webhook "validator.your-domain.com" denied the request: resource limits and requests are not defined for container 'nginx' in Deployment 'nginx-non-compliant'

    The webhook works exactly as intended, providing immediate, actionable feedback to the user directly from the Kubernetes API.

    Conclusion

    Dynamic admission webhooks are a powerful, Kubernetes-native mechanism for enforcing custom policy and governance. By moving validation logic into a synchronous, API-aware component, you can create a robust, real-time security and operational backstop that complements static analysis and CI/CD checks.

    However, this power comes with significant responsibility. Because webhooks exist in the critical path of the API server, they must be engineered for high performance, high availability, and resilience. By leveraging Go for its performance and concurrency, and cert-manager for production-grade TLS lifecycle management, you can build admission controllers that are both powerful and operationally sound. The patterns discussed here—tightly-scoped rules, careful selection of failure policies, deadlock avoidance, and comprehensive monitoring—are not optional extras; they are the foundation for successfully running validating webhooks in a production environment.
