Dynamic Kubernetes Policy Enforcement with Validating Admission Webhooks
The Governance Gap: Beyond Linting and CI Gates
In any mature Kubernetes environment, maintaining operational and security standards is a perpetual challenge. While static analysis tools (like `kubeval` or `conftest`) and CI/CD pipeline gates are essential first lines of defense, they have a fundamental limitation: they operate before a resource is submitted to the cluster. They cannot prevent a user with `kubectl` access from directly applying non-compliant manifests, nor can they react to the dynamic state of the cluster during an admission request.
This is where Kubernetes Admission Controllers come into play. They are plugins that intercept requests to the Kubernetes API server after the request is authenticated and authorized, but before the object is persisted in etcd. They are the final gatekeepers of cluster state.
While Kubernetes provides several built-in admission controllers (`LimitRanger`, `ResourceQuota`, etc.), they often lack the custom, business-specific logic required by organizations. The solution is the `ValidatingAdmissionWebhook` and `MutatingAdmissionWebhook`. These allow you to deploy your own custom logic as a webhook server that the API server calls synchronously during the admission process.
This article is a deep dive into building a production-grade Validating Admission Webhook in Go. We will not cover the basics. We assume you know what a webhook is and why you'd want one. Instead, we will focus on the intricate details of a real-world implementation: enforcing a common, critical policy that all containers in `Deployments` and `StatefulSets` must have CPU and memory `limits` and `requests` specified.
We will cover:
* Building the webhook server in Go, including decoding and responding to `AdmissionReview` objects.
* TLS certificate management with `cert-manager`—the de facto production standard.
* The `Deployment`, `Service`, and `ValidatingWebhookConfiguration` manifests with a focus on resilience and scope.
* Production hardening: deadlock avoidance, latency, failure policies, and testing.
Core Architecture: The Synchronous API Server Callback
Before we write a line of code, it's critical to internalize the request flow. When a user runs `kubectl apply -f my-deployment.yaml`, the following happens in our webhook's context:
1. The user's request reaches the API server (e.g., to `CREATE` or `UPDATE` a `Deployment`).
2. After authentication and authorization, the API server consults the registered `ValidatingWebhookConfiguration` resources. If the incoming request matches a rule, it constructs an `AdmissionReview` JSON object containing the full resource manifest.
3. The API server sends this object over HTTPS to the `Service` endpoint defined in the webhook configuration.
4. Our webhook server receives the `AdmissionReview` object, deserializes it, and executes our validation logic.
5. The webhook returns an `AdmissionReview` response. For a validating webhook, the crucial fields are `allowed` (a boolean) and an optional `status` object with a message and code.
   - If `allowed: true`, the request proceeds, and the object is persisted to etcd.
   - If `allowed: false`, the API server rejects the request and returns an error to the client (`kubectl`), including the message from our webhook's response.
The key takeaway is synchronicity. Our webhook is in the critical path of API requests. If it's slow, the entire cluster's resource management becomes slow. If it's down and the `failurePolicy` is `Fail`, a significant portion of cluster operations can be halted.
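Because of that synchronicity, it is also worth bounding how long our own server can spend on any single request, independent of the API server's `timeoutSeconds` (covered later). Below is a minimal sketch using the standard library's `http.TimeoutHandler` and server timeouts; the function name and the specific durations are illustrative assumptions, not part of the final server built in the next section.
// A sketch of bounding per-request processing time; not the final main.go.
package webhookserver

import (
	"net/http"
	"time"
)

// newBoundedServer wraps mux so a stuck validation cannot hold an API-server
// connection indefinitely: TimeoutHandler replies 503 once maxHandle elapses.
func newBoundedServer(mux *http.ServeMux, maxHandle time.Duration) *http.Server {
	return &http.Server{
		Addr:         ":8443",
		Handler:      http.TimeoutHandler(mux, maxHandle, "admission review timed out"),
		ReadTimeout:  5 * time.Second,           // time allowed to read the AdmissionReview body
		WriteTimeout: maxHandle + 2*time.Second, // must exceed the handler's budget
		IdleTimeout:  90 * time.Second,          // keep-alive connections from the API server
		// TLSConfig is set exactly as in the full server shown below.
	}
}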
Building the Validating Webhook Server in Go
Let's implement a webhook that enforces a simple but powerful policy: all containers within `Deployments` and `StatefulSets` must have both CPU and memory `requests` and `limits` defined.
Project Structure
.
├── cmd/
│ └── main.go # Server entrypoint
├── internal/
│ └── webhook/
│ └── validate.go # Core validation logic
├── pkg/
│ └── certs/
│ └── tls.go # (Placeholder for manual certs, we'll use cert-manager)
├── Dockerfile
├── go.mod
└── go.sum
The HTTP Server (`cmd/main.go`)
We'll use the standard `net/http` library. The server needs to handle two things: a `/validate` endpoint for admission reviews and a `/healthz` endpoint for liveness/readiness probes.
// cmd/main.go
package main
import (
"context"
"crypto/tls"
"fmt"
"log"
"net/http"
"os"
"os/signal"
"syscall"
"github.com/your-org/k8s-webhook-example/internal/webhook"
)
func main() {
log.Println("Starting webhook server...")
// Load TLS certificates. For production, these are mounted from a secret
// managed by cert-manager.
certPath := "/etc/webhook/certs/tls.crt"
keyPath := "/etc/webhook/certs/tls.key"
pair, err := tls.LoadX509KeyPair(certPath, keyPath)
if err != nil {
log.Fatalf("Failed to load key pair: %v", err)
}
// Define server and handlers
mux := http.NewServeMux()
mux.HandleFunc("/validate", webhook.HandleValidate)
mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
w.WriteHeader(http.StatusOK)
fmt.Fprint(w, "ok")
})
server := &http.Server{
Addr: ":8443",
Handler: mux,
TLSConfig: &tls.Config{Certificates: []tls.Certificate{pair}, MinVersion: tls.VersionTLS12},
}
// Start server in a goroutine
go func() {
if err := server.ListenAndServeTLS("", ""); err != nil && err != http.ErrServerClosed {
log.Fatalf("ListenAndServeTLS failed: %v", err)
}
}()
log.Println("Server started on port 8443")
// Graceful shutdown
signalChan := make(chan os.Signal, 1)
signal.Notify(signalChan, syscall.SIGINT, syscall.SIGTERM)
<-signalChan
log.Println("Shutdown signal received, exiting gracefully...")
server.Shutdown(context.Background())
}
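One subtlety in the server above: `tls.LoadX509KeyPair` reads the certificate files once at startup, so when cert-manager later renews the mounted Secret, the pod keeps serving the old certificate until it is restarted. A hedged sketch of one way to avoid that is to reload the pair on each TLS handshake via `tls.Config.GetCertificate`; the function name is illustrative, though the placeholder `pkg/certs/tls.go` in the tree above could host it.
// pkg/certs/tls.go (illustrative; reload certs without a pod restart)
package certs

import "crypto/tls"

// ReloadingTLSConfig returns a tls.Config that re-reads the key pair from disk
// on every handshake, so a certificate renewed by cert-manager is picked up
// automatically. For very high handshake rates, cache the parsed pair and
// refresh it on a timer or fsnotify event instead.
func ReloadingTLSConfig(certPath, keyPath string) *tls.Config {
	return &tls.Config{
		MinVersion: tls.VersionTLS12,
		GetCertificate: func(_ *tls.ClientHelloInfo) (*tls.Certificate, error) {
			pair, err := tls.LoadX509KeyPair(certPath, keyPath)
			if err != nil {
				return nil, err
			}
			return &pair, nil
		},
	}
}
main.go would then set TLSConfig with this helper instead of loading the pair up front.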
The Core Validation Logic (`internal/webhook/validate.go`)
This is where we decode the `AdmissionReview` request, inspect the object, apply our policy, and craft the response.
// internal/webhook/validate.go
package webhook
import (
"encoding/json"
"fmt"
"io/ioutil"
"log"
"net/http"
admissionv1 "k8s.io/api/admission/v1"
appsv1 "k8s.io/api/apps/v1"
corev1 "k8s.io/api/core/v1"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/apimachinery/pkg/runtime"
"k8s.io/apimachinery/pkg/runtime/serializer"
)
var (
// UniversalDeserializer is used to decode objects from the AdmissionReview request.
UniversalDeserializer = serializer.NewCodecFactory(runtime.NewScheme()).UniversalDeserializer()
)
func HandleValidate(w http.ResponseWriter, r *http.Request) {
body, err := io.ReadAll(r.Body)
if err != nil {
log.Printf("Error reading request body: %v", err)
http.Error(w, "could not read request body", http.StatusBadRequest)
return
}
// Decode the AdmissionReview request
var admissionReviewReq admissionv1.AdmissionReview
if _, _, err := UniversalDeserializer.Decode(body, nil, &admissionReviewReq); err != nil {
log.Printf("Error decoding admission review: %v", err)
http.Error(w, "could not decode admission review", http.StatusBadRequest)
return
}
if admissionReviewReq.Request == nil {
http.Error(w, "malformed admission review: request is nil", http.StatusBadRequest)
return
}
// Default to allowing the request
admissionResponse := &admissionv1.AdmissionResponse{
UID: admissionReviewReq.Request.UID,
Allowed: true,
}
// Apply validation logic based on the resource Kind
var validationError error
switch admissionReviewReq.Request.Kind.Kind {
case "Deployment":
var deployment appsv1.Deployment
if err := json.Unmarshal(admissionReviewReq.Request.Object.Raw, &deployment); err != nil {
log.Printf("Error unmarshalling deployment: %v", err)
validationError = fmt.Errorf("could not unmarshal Deployment object")
} else {
validationError = validatePodSpec(deployment.Spec.Template.Spec, deployment.Name, deployment.Kind)
}
case "StatefulSet":
var statefulSet appsv1.StatefulSet
if err := json.Unmarshal(admissionReviewReq.Request.Object.Raw, &statefulSet); err != nil {
log.Printf("Error unmarshalling statefulset: %v", err)
validationError = fmt.Errorf("could not unmarshal StatefulSet object")
} else {
validationError = validatePodSpec(statefulSet.Spec.Template.Spec, statefulSet.Name, statefulSet.Kind)
}
}
if validationError != nil {
admissionResponse.Allowed = false
admissionResponse.Result = &metav1.Status{
Message: validationError.Error(),
Code: http.StatusForbidden,
}
}
// Construct the final AdmissionReview response
admissionReviewResp := admissionv1.AdmissionReview{
TypeMeta: metav1.TypeMeta{
APIVersion: "admission.k8s.io/v1",
Kind: "AdmissionReview",
},
Response: admissionResponse,
}
// Send the response
respBytes, err := json.Marshal(admissionReviewResp)
if err != nil {
log.Printf("Error marshalling response: %v", err)
http.Error(w, "could not encode response", http.StatusInternalServerError)
return
}
w.Header().Set("Content-Type", "application/json")
w.WriteHeader(http.StatusOK)
_, _ = w.Write(respBytes)
}
// validatePodSpec contains the core policy logic.
func validatePodSpec(podSpec corev1.PodSpec, name, kind string) error {
for _, container := range podSpec.Containers {
if container.Resources.Limits == nil || container.Resources.Requests == nil {
return fmt.Errorf("resource limits and requests are not defined for container '%s' in %s '%s'", container.Name, kind, name)
}
if _, ok := container.Resources.Limits[corev1.ResourceCPU]; !ok {
return fmt.Errorf("cpu limit is not defined for container '%s' in %s '%s'", container.Name, kind, name)
}
if _, ok := container.Resources.Limits[corev1.ResourceMemory]; !ok {
return fmt.Errorf("memory limit is not defined for container '%s' in %s '%s'", container.Name, kind, name)
}
if _, ok := container.Resources.Requests[corev1.ResourceCPU]; !ok {
return fmt.Errorf("cpu request is not defined for container '%s' in %s '%s'", container.Name, kind, name)
}
if _, ok := container.Resources.Requests[corev1.ResourceMemory]; !ok {
return fmt.Errorf("memory request is not defined for container '%s' in %s '%s'", container.Name, kind, name)
}
}
return nil
}
Notice we unmarshal `admissionReviewReq.Request.Object.Raw` into the specific type (`appsv1.Deployment`). This gives us a strongly-typed object to work with, which is far more robust than parsing raw JSON.
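Because `validatePodSpec` is a pure function over a `corev1.PodSpec`, the policy can be exercised without a cluster at all. Here is a minimal table-driven sketch; the file name and test cases are illustrative, not part of the project tree above.
// internal/webhook/validate_test.go (illustrative)
package webhook

import (
	"testing"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

func TestValidatePodSpec(t *testing.T) {
	// A fully specified resources block that satisfies the policy.
	full := corev1.ResourceRequirements{
		Limits: corev1.ResourceList{
			corev1.ResourceCPU:    resource.MustParse("200m"),
			corev1.ResourceMemory: resource.MustParse("128Mi"),
		},
		Requests: corev1.ResourceList{
			corev1.ResourceCPU:    resource.MustParse("100m"),
			corev1.ResourceMemory: resource.MustParse("64Mi"),
		},
	}
	cases := []struct {
		name    string
		spec    corev1.PodSpec
		wantErr bool
	}{
		{"compliant container", corev1.PodSpec{Containers: []corev1.Container{{Name: "app", Resources: full}}}, false},
		{"missing resources", corev1.PodSpec{Containers: []corev1.Container{{Name: "app"}}}, true},
	}
	for _, tc := range cases {
		t.Run(tc.name, func(t *testing.T) {
			err := validatePodSpec(tc.spec, "demo", "Deployment")
			if (err != nil) != tc.wantErr {
				t.Fatalf("validatePodSpec() error = %v, wantErr %v", err, tc.wantErr)
			}
		})
	}
}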
Production-Grade TLS with cert-manager
The API server must communicate with the webhook over HTTPS. This means our webhook needs a TLS certificate signed by a Certificate Authority (CA) that the API server trusts. Manually generating these certs with `openssl`, distributing the CA, and managing rotation is an operational nightmare.
This is a solved problem. We will use `cert-manager`, a Kubernetes-native certificate management tool.
The Workflow:
1. Install `cert-manager`: A one-time setup in your cluster.
2. Create an `Issuer` or `ClusterIssuer`: This resource tells `cert-manager` how to issue certificates. We'll use a self-signed issuer for simplicity, but in production, this could be Let's Encrypt or an internal Vault/HashiCorp CA.
3. Create a `Certificate` resource: This resource tells `cert-manager` what certificate to issue. We'll request a certificate for our webhook's `Service` DNS name (e.g., `resource-validator.validation-ns.svc`).
4. `cert-manager` magic: `cert-manager` will see the `Certificate` resource, generate a key pair, sign it using the configured `Issuer`, and store the result in a `Secret` (e.g., `webhook-tls-secret`). It will also automatically handle renewal before expiration.
5. Mount the `Secret` as a volume into our webhook pod at `/etc/webhook/certs/`.
6. Inject the CA: The `ValidatingWebhookConfiguration` needs the CA bundle to verify our webhook's certificate. `cert-manager` has a `cainjector` controller that automatically patches our webhook configuration with the correct CA data.
Kubernetes Manifests for Deployment
Here are the complete manifests to deploy our webhook and configure `cert-manager`.
# 00-namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
name: validation-ns
---
# 01-cert-manager-issuer.yaml
# This assumes cert-manager is already installed in the cluster.
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
name: selfsigned-issuer
namespace: validation-ns
spec:
selfSigned: {}
---
# 02-cert-manager-certificate.yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: webhook-server-cert
namespace: validation-ns
spec:
secretName: webhook-tls-secret # The secret where certs will be stored
dnsNames:
- resource-validator.validation-ns.svc
- resource-validator.validation-ns.svc.cluster.local
issuerRef:
name: selfsigned-issuer
kind: Issuer
---
# 03-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: resource-validator
namespace: validation-ns
labels:
app: resource-validator
spec:
replicas: 2 # Start with HA in mind
selector:
matchLabels:
app: resource-validator
template:
metadata:
labels:
app: resource-validator
spec:
containers:
- name: webhook
# Replace with your actual image
image: your-registry/k8s-webhook-example:latest
ports:
- containerPort: 8443
name: webhook-api
volumeMounts:
- name: webhook-certs
mountPath: /etc/webhook/certs
readOnly: true
readinessProbe:
httpGet:
scheme: HTTPS
path: /healthz
port: 8443
initialDelaySeconds: 5
periodSeconds: 10
volumes:
- name: webhook-certs
secret:
secretName: webhook-tls-secret
---
# 04-service.yaml
apiVersion: v1
kind: Service
metadata:
name: resource-validator
namespace: validation-ns
spec:
selector:
app: resource-validator
ports:
- port: 443
targetPort: 8443
---
# 05-validating-webhook.yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
name: resource-validator-webhook
# This annotation tells cert-manager's cainjector to patch the caBundle field.
annotations:
cert-manager.io/inject-ca-from: validation-ns/webhook-server-cert
webhooks:
- name: validator.your-domain.com
clientConfig:
service:
namespace: validation-ns
name: resource-validator
path: "/validate"
# caBundle will be populated by cert-manager
rules:
- operations: ["CREATE", "UPDATE"]
apiGroups: ["apps"]
apiVersions: ["v1"]
resources: ["deployments", "statefulsets"]
# CRITICAL: This determines what happens if your webhook is down.
failurePolicy: Fail
# The webhook has no side effects on the system.
sideEffects: None
# Use v1 for production clusters.
admissionReviewVersions: ["v1"]
# How long the API server will wait for a response.
timeoutSeconds: 5
After applying these, `cert-manager` creates the secret, our `Deployment` mounts it, and the `cainjector` patches the `ValidatingWebhookConfiguration` with the CA. The webhook is now live.
Advanced Topics and Production Hardening
Getting a simple webhook running is one thing; making it resilient and performant is another.
Edge Case: Avoiding Webhook Deadlocks
What happens if your webhook is configured to intercept `Secrets`, and `cert-manager` (which creates `Secrets`) is running in a namespace that the webhook is watching? `cert-manager` tries to create a `Secret`, which calls your webhook. Your webhook pod can't start because it's waiting for the `Secret` with its TLS certs. The `Secret` can't be created because the webhook isn't running to approve it. Deadlock.
Solution: Use namespace selectors to exclude critical infrastructure namespaces.
# In ValidatingWebhookConfiguration
# ...
webhooks:
- name: validator.your-domain.com
# ...
namespaceSelector:
matchExpressions:
- key: kubernetes.io/metadata.name
operator: NotIn
values: ["kube-system", "cert-manager", "validation-ns"]
This ensures your webhook ignores resources in its own namespace (`validation-ns`), the `cert-manager` namespace, and `kube-system`.
Performance and Latency Considerations
Your webhook's response time is added directly to the `kubectl apply` latency for every matching request. Milliseconds matter.
* Scope Rules Tightly: The `rules` in your `ValidatingWebhookConfiguration` are your first line of defense. Never use wildcards like `apiGroups: ["*"]`. Be as specific as possible about the resources and operations you care about. The API server will not even attempt to call your webhook for non-matching requests.
* Avoid External Calls: The validation logic should be self-contained. If your webhook needs to call another service (e.g., a policy engine like OPA, or even the Kubernetes API itself to check other resources), you introduce network latency and another point of failure. If you *must* make external calls, they need to be extremely fast and highly available. Use aggressive timeouts.
* Set `timeoutSeconds` Conservatively: The default is 10 seconds, which is far too high for a production webhook. A user waiting 10 seconds for `kubectl` to respond will assume the cluster is broken. Start with a low value like 2 or 3 seconds and monitor. If your webhook regularly times out, it is too slow and needs optimization.
* Monitor Everything: Expose Prometheus metrics from your Go application (a minimal instrumentation sketch follows this list). Track:
  * `webhook_requests_total`: A counter for received requests.
  * `webhook_request_duration_seconds`: A histogram of request processing time.
  * `webhook_allowed_requests_total`, `webhook_denied_requests_total`: Counters for outcomes.
This data is invaluable for SLOs and performance tuning.
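As a starting point, here is a hedged sketch of that instrumentation using the prometheus/client_golang library. The package path, the "outcome" label, the wrapper function, and the separate plaintext metrics listener are illustrative assumptions rather than part of the project layout shown earlier; only the metric names come from the list above.
// internal/metrics/metrics.go (illustrative location)
package metrics

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	requestsTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "webhook_requests_total", Help: "Admission requests received."},
		[]string{"outcome"},
	)
	requestDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "webhook_request_duration_seconds",
		Help:    "Time spent processing an admission request.",
		Buckets: prometheus.DefBuckets,
	})
)

func init() {
	prometheus.MustRegister(requestsTotal, requestDuration)
}

// Instrument wraps an admission handler and records how long it took. Recording
// the precise allow/deny outcome would happen inside the handler itself, where
// the decision is known; here we only count that a request was handled.
func Instrument(next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		next(w, r)
		requestDuration.Observe(time.Since(start).Seconds())
		requestsTotal.WithLabelValues("handled").Inc()
	}
}

// ServeMetrics exposes /metrics on its own plaintext listener so Prometheus
// does not need the webhook's serving certificate. The address is an assumption.
func ServeMetrics(addr string) {
	mux := http.NewServeMux()
	mux.Handle("/metrics", promhttp.Handler())
	go func() { _ = http.ListenAndServe(addr, mux) }()
}
In main.go this could slot in as mux.HandleFunc("/validate", metrics.Instrument(webhook.HandleValidate)) plus a metrics.ServeMetrics(":9090") call before starting the TLS server.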
High Availability and `failurePolicy`
The `failurePolicy` field is critical.
* `failurePolicy: Fail` (Default): If the webhook cannot be reached (DNS failure, network issue, all pods down, timeout), the API server rejects the resource request. This ensures your policies are always enforced but introduces a risk: an outage of your webhook can block deployments cluster-wide. You *must* run multiple replicas of your webhook pod across different nodes and have robust monitoring and alerting.
* `failurePolicy: Ignore`: If the webhook cannot be reached, the API server allows the resource request to proceed. This prioritizes availability over strict policy enforcement. This is safer during initial rollout or for non-critical validation, but it means your policy can be bypassed during a webhook outage.
For a critical policy like resource limits, `Fail` is usually the correct choice, but it comes with the operational burden of ensuring the webhook service is one of the most highly available components in your cluster.
Testing Your Webhook
Testing is non-negotiable.
* Unit tests: Your core policy logic (the `validatePodSpec` function) should be pure and easily testable. Feed it various `corev1.PodSpec` structs and assert the `error` response is correct.
* Integration tests: Use `kubebuilder`'s `envtest` to spin up a real `etcd` and `kube-apiserver` locally during your test suite. This allows you to test the full HTTP request/response flow without needing a real cluster.
Let's see it in action. Create two files:
# compliant-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: nginx-compliant
spec:
replicas: 1
selector:
matchLabels:
app: nginx
template:
metadata:
labels:
app: nginx
spec:
containers:
- name: nginx
image: nginx:1.21
resources:
limits:
cpu: "200m"
memory: "128Mi"
requests:
cpu: "100m"
memory: "64Mi"
# non-compliant-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: nginx-non-compliant
spec:
replicas: 1
selector:
matchLabels:
app: nginx
template:
metadata:
labels:
app: nginx
spec:
containers:
- name: nginx
image: nginx:1.21
# Missing resources section
Apply them:
$ kubectl apply -f compliant-deployment.yaml
deployment.apps/nginx-compliant created
$ kubectl apply -f non-compliant-deployment.yaml
Error from server: error when creating "non-compliant-deployment.yaml": admission webhook "validator.your-domain.com" denied the request: resource limits and requests are not defined for container 'nginx' in Deployment 'nginx-non-compliant'
The webhook works exactly as intended, providing immediate, actionable feedback to the user directly from the Kubernetes API.
Conclusion
Dynamic admission webhooks are a powerful, Kubernetes-native mechanism for enforcing custom policy and governance. By moving validation logic into a synchronous, API-aware component, you can create a robust, real-time security and operational backstop that complements static analysis and CI/CD checks.
However, this power comes with significant responsibility. Because webhooks exist in the critical path of the API server, they must be engineered for high performance, high availability, and resilience. By leveraging Go for its performance and concurrency, and `cert-manager` for production-grade TLS lifecycle management, you can build admission controllers that are both powerful and operationally sound. The patterns discussed here—tightly-scoped rules, careful selection of failure policies, deadlock avoidance, and comprehensive monitoring—are not optional extras; they are the foundation for successfully running validating webhooks in a production environment.