Kernel-Level K8s Security: eBPF for Granular L7 Network Policies
The Scaling Ceiling of `iptables` in Modern Kubernetes
For years, iptables has been the bedrock of Kubernetes networking, powering both service routing via kube-proxy and network policy enforcement. Its integration with the kernel's Netfilter framework is well-understood. However, in large-scale, dynamic microservice environments, this bedrock begins to crack. Senior engineers managing clusters with thousands of services and tens of thousands of pods have felt these pain points firsthand:
* Linear rule evaluation (O(n)): iptables processes rules in sequential chains. As the number of services (-A KUBE-SERVICES...) or policy rules grows, the traversal time for each packet increases linearly. This introduces non-trivial latency and CPU overhead on every node.
* Lock contention on updates: Modifying iptables rules requires taking a lock on the entire rule set. In a highly dynamic environment with frequent pod churn, this leads to lock contention, delaying updates and impacting control plane responsiveness.
* No L7 awareness: The native NetworkPolicy resource is limited to L3/L4 (IP addresses and ports). Enforcing policies like "allow GET requests to /api/v1/metrics but deny POST" requires a service mesh, which introduces its own complexity, resource overhead (sidecar proxies), and latency.
* Debugging opacity: A large iptables ruleset is a notorious operational burden. Determining why a packet was dropped can involve manually parsing hundreds of rules, a process that doesn't scale.
Consider a simple packet lookup in a 10,000-service cluster. The packet hits the PREROUTING chain, jumps to the KUBE-SERVICES chain, and then must potentially traverse thousands of rules to find a match. This is repeated for every new connection.
# A glimpse into the complexity on a node
$ iptables-save | grep KUBE-SERVICES | wc -l
10001
# Each packet for a new connection must traverse this chain.
This is where eBPF (extended Berkeley Packet Filter) represents a fundamental paradigm shift. Instead of routing packets through complex chains in a generic framework, eBPF allows us to attach sandboxed, event-driven programs directly to kernel hooks, processing network traffic with the efficiency of compiled code.
This article will demonstrate how to leverage eBPF via Cilium to implement high-performance, L7-aware network policies, completely bypassing iptables and gaining unprecedented observability directly from the kernel.
Architectural Shift: From Netfilter Chains to eBPF Hooks
To appreciate the performance and capability gains, it's crucial to understand the architectural difference between the two models.
The iptables / Netfilter Model:
Packets traverse a series of well-defined hooks in the kernel's networking stack (PREROUTING, INPUT, FORWARD, OUTPUT, POSTROUTING). kube-proxy and CNI plugins insert chains of rules at these hooks. The kernel must walk these chains for each packet.
Diagrammatically:
Packet -> NIC -> Netfilter (PREROUTING -> KUBE-SERVICES chain walk) -> ...
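If you have a cluster still running kube-proxy in iptables mode, you can watch this chain walk for yourself. A quick inspection sketch (assumes shell access to a node; 10.96.0.10 is simply the typical cluster-DNS ClusterIP in kubeadm/kind defaults):
# Count the per-service dispatch rules kube-proxy maintains in the NAT table
sudo iptables -t nat -L KUBE-SERVICES -n --line-numbers | wc -l
# Follow one service: its KUBE-SERVICES rule jumps to a per-service KUBE-SVC-* chain,
# which in turn jumps to per-endpoint KUBE-SEP-* chains
sudo iptables -t nat -L KUBE-SERVICES -n | grep 10.96.0.10
sudo iptables-save -t nat | grep -E 'KUBE-SVC-|KUBE-SEP-' | head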
The eBPF Model:
eBPF programs are attached to different, often earlier, hooks: primarily the Traffic Control (TC) ingress/egress layer, or even earlier with XDP (eXpress Data Path) directly at the NIC driver level.
* TC (Traffic Control) Hook: An eBPF program attached here can see all network traffic entering or leaving a network device (physical or virtual, like a veth pair for a pod). It can make decisions—allow, drop, redirect—before the packet even enters the iptables PREROUTING stage.
* BPF Maps: The key to eBPF's performance is its use of BPF maps. These are highly efficient key/value stores accessible from both kernel-space eBPF programs and user-space control planes. Instead of linear rule scans, eBPF programs perform O(1) hash map lookups to determine service IPs, policy rules, and connection tracking state.
Diagrammatically:
Packet -> NIC -> TC Hook -> eBPF Program (O(1) map lookup) -> Decision (Allow/Drop/Redirect)
Cilium leverages this by replacing kube-proxy entirely. It watches the Kubernetes API for Services and Endpoints and populates BPF maps with this information. When a packet from a pod arrives at the TC hook on the veth interface, the attached eBPF program performs a quick lookup in a BPF map to find the destination backend pod's IP and performs DNAT directly, bypassing iptables entirely.
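Once Cilium is installed (as we do below), you can inspect that service table directly. This is a quick sketch run from a Cilium agent pod; exact output columns vary by Cilium version:
# Pick a Cilium agent pod
CILIUM_POD=$(kubectl get pods -n kube-system -l k8s-app=cilium -o jsonpath='{.items[0].metadata.name}')
# Dump the eBPF load-balancing map: each service frontend (ClusterIP:port) maps to its backends
kubectl exec -n kube-system $CILIUM_POD -- cilium bpf lb list
# List the BPF maps the agent has open (policy, connection tracking, NAT, ...)
kubectl exec -n kube-system $CILIUM_POD -- cilium map list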
Practical Implementation: L7 Policies with Cilium
Let's move from theory to a production-grade implementation. We will set up a local Kubernetes cluster using Kind and install Cilium with kube-proxy replacement enabled to unlock the full power of eBPF.
Step 1: Environment Setup
First, create a Kind cluster configuration that disables the default CNI and kube-proxy to allow Cilium to take over.
kind-config.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  disableDefaultCNI: true   # We will install Cilium
  kubeProxyMode: "none"     # We will replace kube-proxy with eBPF
nodes:
  - role: control-plane
  - role: worker
  - role: worker
Now, create the cluster:
kind create cluster --config=kind-config.yaml --name=ebpf-l7-demo
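Until a CNI is installed, the nodes will report NotReady; that is expected at this stage (the output below is illustrative):
kubectl get nodes
# NAME                         STATUS     ROLES           AGE   VERSION
# ebpf-l7-demo-control-plane   NotReady   control-plane   60s   ...
# ebpf-l7-demo-worker          NotReady   <none>          40s   ...
# ebpf-l7-demo-worker2         NotReady   <none>          40s   ...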
Next, install the Cilium CLI and deploy Cilium to the cluster. We'll use Helm for a configurable installation.
helm repo add cilium https://helm.cilium.io/
helm install cilium cilium/cilium --version 1.15.5 \
--namespace kube-system \
--set kubeProxyReplacement=strict \
--set bpf.masquerade=true \
--set securityContext.capabilities.ciliumAgent="{CHOWN,KILL,NET_ADMIN,NET_RAW,IPC_LOCK,SYS_ADMIN,SYS_RESOURCE,DAC_OVERRIDE,FOWNER,SETGID,SETUID}" \
--set securityContext.capabilities.cleanCiliumState="{NET_ADMIN,SYS_ADMIN,SYS_RESOURCE}" \
--set cgroup.autoMount.enabled=false \
--set cgroup.hostRoot=/sys/fs/cgroup
Key Configuration Flags:
* kubeProxyReplacement=strict: This tells Cilium to completely handle service translation using eBPF, ensuring iptables is not used for Kubernetes Services.
* bpf.masquerade=true: Enables eBPF-based masquerading for traffic leaving the cluster, another task typically handled by iptables.
Verify the installation. You should see kube-proxy is absent and Cilium is running in kube-system.
cilium status --wait
# Expected output (abbreviated):
# ...
# KubeProxyReplacement:    Strict
# ...
# All Cilium pods are ready
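To double-check that service handling really is out of iptables' hands, confirm there is no kube-proxy DaemonSet and no KUBE-SERVICES chain on the nodes. A small sanity-check sketch (the node container name follows the kind cluster name we chose above):
# kube-proxy was never installed in this cluster
kubectl -n kube-system get daemonset kube-proxy
# Error from server (NotFound): daemonsets.apps "kube-proxy" not found
# The node's NAT table contains no KUBE-SERVICES chain for traffic to fall back on
docker exec ebpf-l7-demo-worker iptables-save -t nat | grep -c KUBE-SERVICES
# 0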
Step 2: Deploying a Sample Microservice Application
To demonstrate L7 policies, we'll use a simple scenario: a client pod trying to access an api-server pod. The api-server exposes two endpoints: /public and /private.
app.yaml
apiVersion: v1
kind: Pod
metadata:
  name: api-server
  labels:
    app: api-server
    policy-group: backend
spec:
  containers:
    - name: server
      # mendhak/http-https-echo serves HTTP on port 8080 by default
      image: mendhak/http-https-echo
      ports:
        - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: api-service
spec:
  selector:
    app: api-server
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080
---
apiVersion: v1
kind: Pod
metadata:
  name: client
  labels:
    app: client
    policy-group: frontend
spec:
  containers:
    - name: client
      image: appropriate/curl
      command: ["sleep", "3600"]
Deploy the application:
kubectl apply -f app.yaml
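Wait for both pods to become Ready before testing:
kubectl wait --for=condition=Ready pod/api-server pod/client --timeout=120s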
By default, with no policies in place, the client can access both endpoints on the api-server.
# Exec into the client pod
kubectl exec -it client -- sh
# Test access - both should succeed
/ # curl -s http://api-service/public
/ # echo $?
0
/ # curl -s http://api-service/private
/ # echo $?
0
Step 3: Implementing an L7-Aware `CiliumNetworkPolicy`
Now, we'll enforce a policy: allow GET requests to /public but deny everything else. This requires L7 awareness, which is impossible with standard NetworkPolicy objects. We use a CiliumNetworkPolicy CRD.
l7-policy.yaml
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
name: "api-l7-policy"
spec:
endpointSelector:
matchLabels:
app: api-server
ingress:
- fromEndpoints:
- matchLabels:
app: client
toPorts:
- ports:
- port: "80"
protocol: TCP
rules:
http:
- method: "GET"
path: "/public"
Dissecting the Policy:
* endpointSelector: The policy applies to pods with the label app: api-server.
* fromEndpoints: It allows ingress traffic only from pods with the label app: client.
* toPorts.rules.http: This is the L7 magic. It specifies that for traffic to TCP port 8080 (the pod's container port, after service translation), the HTTP request must have the method GET and the path /public.
Apply the policy:
kubectl apply -f l7-policy.yaml
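Before testing, it is worth confirming that the policy object exists and that the Cilium agent has compiled it into the endpoint's state. A small verification sketch (the jsonpath below just grabs the first agent pod; ideally pick the one on the node running api-server):
kubectl get ciliumnetworkpolicies
# NAME            AGE
# api-l7-policy   15s
CILIUM_POD=$(kubectl get pods -n kube-system -l k8s-app=cilium -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n kube-system $CILIUM_POD -- cilium endpoint list
# The api-server endpoint should now show ingress policy enforcement as Enabled.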
Step 4: Verifying the L7 Policy Enforcement
Let's re-run our tests from the client pod.
kubectl exec -it client -- sh
# This should SUCCEED (matches the policy)
/ # curl -s -o /dev/null -w "%{http_code}" http://api-service/public
200
# This should FAIL (path mismatch) - the request is rejected at L7
/ # curl -s -o /dev/null -w "%{http_code}" http://api-service/private
403
# This should FAIL (method mismatch)
/ # curl -X POST -s -o /dev/null -w "%{http_code}" http://api-service/public
403
The policy is enforced correctly. The key here is how Cilium does this. When the policy is applied, Cilium's eBPF programs on the api-server's node are updated. For traffic on the policy's port, instead of simply allowing it, the eBPF program redirects the TCP stream to a lightweight, node-local proxy (Envoy for HTTP, or a purpose-built parser for protocols like DNS and Kafka) managed by the Cilium agent. The proxy parses each request, makes a per-request policy decision, forwards allowed requests, and answers denied ones with an HTTP 403, as we saw above. Because a single shared proxy per node handles only the flows that need L7 inspection, this is far more efficient than running a full sidecar proxy next to every pod.
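You can watch those proxy verdicts being made in real time from the agent itself. A minimal sketch using cilium monitor (run it, then repeat the curl commands from the client pod; the output lines are illustrative):
CILIUM_POD=$(kubectl get pods -n kube-system -l k8s-app=cilium -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n kube-system $CILIUM_POD -- cilium monitor --type l7
# <- Request http from [client] to [api-server], verdict Forwarded  GET http://api-service/public
# <- Request http from [client] to [api-server], verdict Denied     GET http://api-service/private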
Kernel-Level Observability with Hubble
One of the most powerful benefits of an eBPF-based data plane is the deep, low-overhead observability it provides. Since eBPF programs see every packet, we can capture rich metadata without any application instrumentation. Cilium's observability layer, Hubble, consumes these flow events directly from the eBPF datapath.
First, enable the Hubble UI:
cilium hubble enable --ui
cilium hubble port-forward &
Now, let's use the Hubble CLI to see exactly why our requests were dropped.
# Run this command and then re-run the failed curl from the client pod
hubble observe --from-pod default/client --to-pod default/api-server -f
You will see real-time flow data. When the allowed request runs, you'll see an L7 FORWARDED verdict:
TIMESTAMP SOURCE:PORT -> DESTINATION:PORT TYPE VERDICT SUMMARY
May 20 15:30:01.123 default/client:34567 -> default/api-server:80 http-request FORWARDED GET http://api-service/public
When the denied request for /private runs, you'll see a clear DROPPED verdict with a reason:
TIMESTAMP SOURCE:PORT -> DESTINATION:PORT TYPE VERDICT SUMMARY
May 20 15:30:15.456 default/client:34589 -> default/api-server:80 http-request DROPPED Policy denied (L7)
This is a game-changer for debugging. Instead of guessing or sifting through iptables counters and LOG output, you get a definitive, human-readable reason for the drop, including the L7 metadata that triggered the policy. You can even open the Hubble UI (cilium hubble ui) to see a graphical service map with live flows.
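The CLI is also scriptable, which is handy during incident response. Two hedged examples of narrowing the firehose, using filters from current Hubble CLI releases:
# Only dropped flows involving the client pod, as machine-readable JSON
hubble observe --from-pod default/client --verdict DROPPED -o json
# Only HTTP traffic to the api-server, most recent 20 flows
hubble observe --to-pod default/api-server --protocol http --last 20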
Performance Considerations and Benchmarking Insights
While functionality is great, the primary driver for adopting eBPF in large clusters is performance. Let's quantify the difference.
Theoretical Analysis:
* iptables kube-proxy: Service lookup is O(n), where n is the number of services; the worst-case lookup for a new connection traverses all n rules.
* Cilium eBPF: Service lookup is O(1). The service virtual IP is a key in a BPF hash map. The lookup is constant time, regardless of the number of services.
Practical Impact:
In a benchmark conducted by the Cilium community on a 30-node cluster, kube-proxy in iptables mode was compared against Cilium's eBPF mode as the number of services increased.
| Number of Services | iptables p99 Latency (ms) | Cilium eBPF p99 Latency (ms) |
|---|---|---|
| 1,000 | ~2 | ~0.5 |
| 5,000 | ~8 | ~0.5 |
| 10,000 | ~16 | ~0.5 |
| 20,000 | ~30+ | ~0.5 |
Source: Cilium project benchmarks (data is illustrative of typical results)
The results are stark. Cilium's latency remains flat and sub-millisecond, while iptables latency grows linearly and becomes a significant performance bottleneck in large clusters.
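If you want to get a feel for this curve on your own cluster, a rough way to reproduce the trend is to flood the cluster with dummy Services and compare request latency under each data plane. This is a deliberately crude sketch, not a rigorous benchmark; the dummy-svc names and counts are arbitrary, and creating thousands of Services in a kind cluster will be slow:
# Create a large number of throwaway ClusterIP services
for i in $(seq 1 2000); do
  kubectl create service clusterip dummy-svc-$i --tcp=80:80 >/dev/null
done
# Sample request latency from the client pod and eyeball the tail
for i in $(seq 1 200); do
  kubectl exec client -- curl -s -o /dev/null -w "%{time_total}\n" http://api-service/public
done | sort -n | tail -n 3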
Edge Case: JIT Compilation Overhead
While per-packet processing is faster, eBPF is not without overhead. eBPF bytecode is loaded into the kernel and often Just-In-Time (JIT) compiled into native machine code for maximum performance. This JIT compilation consumes CPU cycles when a program is first loaded or updated. In environments with extremely high pod churn, this can lead to noticeable CPU spikes on nodes as Cilium constantly regenerates and loads new eBPF programs for new pod interfaces. This is a trade-off: a small, one-time cost for a massive gain in steady-state data plane performance.
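You can observe both sides of this trade-off from the agent pod. A quick sketch (assumes bpftool is available in the Cilium agent image; program name prefixes vary by Cilium version):
CILIUM_POD=$(kubectl get pods -n kube-system -l k8s-app=cilium -o jsonpath='{.items[0].metadata.name}')
# 1 means the kernel JIT-compiles eBPF bytecode to native code (the common default)
kubectl exec -n kube-system $CILIUM_POD -- cat /proc/sys/net/core/bpf_jit_enable
# Loaded eBPF programs with their translated and JITed sizes; watch this list churn
# as pods are created and deleted on the node
kubectl exec -n kube-system $CILIUM_POD -- bpftool prog show | head -n 20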
Advanced Production Patterns and Troubleshooting
For senior engineers, deploying a new technology means understanding how to debug it when things go wrong.
Pattern: Direct Server Return (DSR)
In standard NAT-based service routing (both iptables and eBPF), the return packet must travel back through the node that made the load balancing decision to have its source IP un-NAT'd. eBPF enables a more advanced mode called Direct Server Return (DSR). In DSR mode, the backend pod sends the reply directly to the original client, bypassing the load-balancing node entirely on the return path. This further reduces latency and removes potential bottlenecks. This is configured in Cilium via loadBalancer.mode=dsr.
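Switching modes is a Helm value change, but DSR requires Cilium's native-routing (non-tunneled) datapath and a network that delivers pod traffic without encapsulation. The flags below are therefore a sketch of what's involved, not a drop-in command for the kind cluster above; the CIDR is a placeholder for your pod network:
helm upgrade cilium cilium/cilium --version 1.15.5 \
  --namespace kube-system \
  --reuse-values \
  --set routingMode=native \
  --set ipv4NativeRoutingCIDR=10.0.0.0/8 \
  --set autoDirectNodeRoutes=true \
  --set loadBalancer.mode=dsr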
Troubleshooting: Inspecting BPF Maps Directly
When Hubble isn't enough and you need to go deeper, you can inspect the BPF maps that Cilium uses to store state. This is the eBPF equivalent of dumping iptables rules.
CILIUM_POD=$(kubectl get pods -n kube-system -l k8s-app=cilium -o jsonpath='{.items[0].metadata.name}')
kubectl exec -it -n kube-system $CILIUM_POD -- bash
# cilium endpoint list
# Find the endpoint ID of your api-server pod in the output, then:
# cilium bpf ct list <endpoint-id>
# This shows you the live TCP connections being tracked by eBPF
# for that specific pod, including NAT state.
# cilium bpf policy get <endpoint-id>
# This dumps the policy rules applied to the endpoint as they exist
# within the BPF map, showing allowed source identities and ports.
This level of inspection allows you to verify, at the lowest level, exactly what rules the kernel is enforcing for a given pod, providing ultimate clarity during complex troubleshooting scenarios.
Conclusion: A New Foundation for Cloud-Native Networking
Moving from iptables to eBPF for Kubernetes networking is more than an optimization; it's an architectural evolution. It addresses the fundamental scaling limitations of the Netfilter-based model while simultaneously unlocking capabilities like L7-aware policies and low-overhead observability that were previously the exclusive domain of heavy service meshes.
For senior engineers responsible for the stability, performance, and security of large-scale clusters, understanding and harnessing eBPF is becoming a non-negotiable skill. By replacing kube-proxy and leveraging tools like Cilium, we can build a data plane that is not only faster and more efficient but also more secure and transparent, providing a solid foundation for the next generation of cloud-native applications.