eBPF-Powered K8s Networking: A Cilium Policy Enforcement Deep Dive
The `iptables` Bottleneck in Large-Scale Kubernetes
For years, kube-proxy has been the unsung hero of Kubernetes networking, translating abstract Service and Endpoint objects into concrete packet forwarding rules. In its most common mode, iptables, it achieves this by manipulating the kernel's netfilter chains. While functional, this model reveals critical scaling limitations in clusters with thousands of services and pods.
Every time a packet destined for a ClusterIP arrives at a node, it must traverse a series of iptables chains. The KUBE-SERVICES chain, for instance, contains a rule for every service. A packet must be linearly matched against these rules. For a service with many backend pods, the packet is then sent to a corresponding KUBE-SVC-* chain, which in turn contains rules for each endpoint, performing DNAT (Destination Network Address Translation) and random selection.
This architecture's core performance issues are twofold:
* Rule count and traversal cost: the number of iptables rules scales linearly with the number of services and endpoints. In a cluster with 5,000 services, the KUBE-SERVICES chain has at least 5,000 rules. High packet rates combined with this linear traversal lead to significant CPU overhead in kernel space (visible as time spent in the ksoftirqd threads).
* Update contention: iptables rulesets are a single data structure protected by a single lock. When services or pods are added, removed, or updated, kube-proxy must acquire this lock to update the rules. In a dynamic environment with frequent deployments and scaling events, this lock becomes a point of contention, delaying network programming and potentially impacting performance across the entire node.

Consider a perf trace on a node under heavy service churn. You would likely see significant time spent in functions like ipt_do_table, indicating the CPU cost of rule traversal. This isn't just a theoretical problem; at scale, it's a tangible ceiling on performance and cluster dynamism.
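A quick way to gauge this on a live node is to count the kube-proxy NAT rules and sample the kernel under load. A rough sketch (standard kube-proxy chain names; exact counts and hot functions depend on your cluster and traffic):

# Count NAT rules generated by kube-proxy: one per service, plus per-endpoint fan-out
sudo iptables-save -t nat | grep -c '^-A KUBE-SERVICES'
sudo iptables-save -t nat | grep -c '^-A KUBE-SVC-'
# Sample the kernel while traffic flows; look for ipt_do_table / nf_hook_slow near the top
sudo perf top -g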
This is the problem space where eBPF (extended Berkeley Packet Filter) offers a revolutionary alternative. Instead of routing packets through complex, static chains, eBPF allows us to attach small, efficient, JIT-compiled programs directly to kernel hooks, processing network packets at the earliest possible point.
eBPF and Cilium: A Kernel-Native Paradigm Shift
Cilium leverages eBPF to completely reimplement Kubernetes networking, policy, and observability, often bypassing kube-proxy and iptables entirely.
eBPF is not just a networking tool; it's a general-purpose, sandboxed virtual machine within the Linux kernel. It allows user-space applications to load and execute custom code on various kernel events, such as system calls, function entries/exits, and, most relevantly here, network packet processing.
Cilium's datapath primarily uses these key eBPF attachment points:
* Traffic Control (TC) Hooks: Cilium attaches eBPF programs to the ingress and egress hooks of network devices, including the virtual ethernet (veth) pairs connected to pods. This allows it to inspect, filter, redirect, and NAT every packet entering or leaving a pod's network namespace *before* it hits the iptables PREROUTING chain. This is the foundation of Cilium's policy enforcement and load balancing.
* Express Data Path (XDP): For maximum performance, eBPF programs can be attached at the XDP layer, directly within the network driver. This is the earliest point a packet can be processed, even before the kernel allocates a socket buffer (sk_buff). It's ideal for high-speed packet dropping, such as DDoS mitigation, as it bypasses most of the kernel's network stack.
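Both attachment points can be verified directly on a node. A small sketch (the lxc* device name is illustrative; Cilium generates one host-side veth per pod, and bpftool must be installed on the host):

# List all eBPF programs attached to networking hooks (XDP, TC) on this node
sudo bpftool net show
# Inspect the TC ingress filter on the host-side veth of a pod
sudo tc filter show dev lxc1234abcd ingress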
Cilium's architecture consists of a few key components:
* Cilium Agent: A DaemonSet that runs on every node. It's the core component that manages the eBPF programs on the node, watches the Kubernetes API for changes to pods, services, and policies, and compiles/loads the necessary eBPF bytecode into the kernel.
* Cilium Operator: A Deployment that handles cluster-wide tasks that only need to be done once, such as IPAM (IP Address Management) for pods across multiple nodes.
* Hubble: The observability component, which leverages the same eBPF datapath to provide deep network flow visibility with minimal overhead.
When Cilium is configured to replace kube-proxy, it uses eBPF maps—a generic key/value store in the kernel accessible by both eBPF programs and user-space—to implement service load balancing. Instead of iptables chains, there's an eBPF map containing Service IP:Port -> [Backend Pod IP 1, Backend Pod IP 2, ...]. The eBPF program attached to the TC hook performs a highly efficient hash table lookup in this map to find a backend and performs the DNAT directly. This is an O(1) operation, regardless of the number of services, providing a massive performance and scalability improvement.
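You can inspect this service map from a Cilium agent pod. A minimal sketch, assuming the agent runs in kube-system with the standard k8s-app=cilium label (output columns vary by Cilium version):

CILIUM_POD=$(kubectl -n kube-system get pods -l k8s-app=cilium -o jsonpath='{.items[0].metadata.name}')
# Dump the eBPF load-balancing map: frontend (ServiceIP:port) -> backend pod addresses
kubectl -n kube-system exec $CILIUM_POD -- cilium bpf lb list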
Deep Dive: From `CiliumNetworkPolicy` to eBPF Bytecode
Let's trace the lifecycle of a complex CiliumNetworkPolicy to understand how Cilium translates high-level intent into low-level kernel enforcement.
Consider a multi-tenant application where we need to enforce strict ingress and egress rules for a backend API service.
Step 1: Defining the Advanced Policy
We'll create a policy in the production-api namespace for pods labeled app: api-server. The requirements are:
* Ingress: accept traffic only from frontend pods within the same namespace, on TCP port 8080.
* Egress: allow traffic to the internal network range (10.100.0.0/16).
* Egress: allow DNS queries (UDP port 53) to kube-dns.
* Egress: allow HTTPS (TCP port 443) only to hosts matching *.api.internal.corp. This is a critical L7 rule.
* Egress: allow traffic to the shared-database service in the shared-services namespace.

Here is the CiliumNetworkPolicy manifest:
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "advanced-api-policy"
  namespace: "production-api"
spec:
  endpointSelector:
    matchLabels:
      app: "api-server"
  ingress:
  - fromEndpoints:
    - matchLabels:
        "k8s:io.kubernetes.pod.namespace": "production-api"
        "app": "frontend"
    toPorts:
    - ports:
      - port: "8080"
        protocol: TCP
  egress:
  - toCIDR:
    - "10.100.0.0/16"
  - toEndpoints:
    - matchLabels:
        "k8s:io.kubernetes.pod.namespace": "kube-system"
        "k8s:k8s-app": "kube-dns"
    toPorts:
    - ports:
      - port: "53"
        protocol: UDP
      rules:
        dns:
        - matchPattern: "*"
  # FQDN-based egress: the DNS rule above lets Cilium's DNS proxy observe lookups,
  # and this rule restricts HTTPS egress to names matching the pattern.
  - toFQDNs:
    - matchPattern: "*.api.internal.corp"
    toPorts:
    - ports:
      - port: "443"
        protocol: TCP
  - toServices:
    - k8sService:
        serviceName: "shared-database"
        namespace: "shared-services"
Step 2: The Role of Identity
When this policy is applied, the Cilium agent does not think in terms of IP addresses. This is a fundamental concept. IP addresses are ephemeral in Kubernetes. Instead, Cilium assigns a numeric security identity to every pod based on its set of labels; pods with the same relevant labels share the same identity. The identity-to-labels mapping is allocated cluster-wide (in CRD mode it is visible as CiliumIdentity objects) rather than being stored in the pod itself.
You can see this mapping by inspecting Cilium's endpoints:
# Find a cilium agent pod
CILIUM_POD=$(kubectl -n kube-system get pods -l k8s-app=cilium -o jsonpath='{.items[0].metadata.name}')
# List endpoints and their identities. Note the 'IDENTITY' column.
kubectl -n kube-system exec $CILIUM_POD -- cilium endpoint list
# Example Output:
# ENDPOINT POLICY (ingress/egress) IDENTITY LABELS (source:key=value)
# 1234 Enabled/Enabled 15872 k8s:app=api-server
# k8s:io.kubernetes.pod.namespace=production-api
# 5678 Enabled/Enabled 23456 k8s:app=frontend
# k8s:io.kubernetes.pod.namespace=production-api
The api-server pod has identity 15872 and the frontend pod has identity 23456. The policy is now a relationship between these numeric identities, not IPs.
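Because identities are derived from labels, they are allocated cluster-wide rather than per node, and you can list the mapping from either side. A sketch, assuming the default CRD-based identity allocation:

# Label set -> numeric identity, as seen by the agent
kubectl -n kube-system exec $CILIUM_POD -- cilium identity list
# The same allocations as Kubernetes objects (CRD identity mode)
kubectl get ciliumidentities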
Step 3: Policy Compilation to eBPF Map
The Cilium agent translates our YAML policy into rules stored in eBPF policy maps, one per endpoint (named cilium_policy_ followed by the endpoint ID). Each map's key is a composite of the remote peer's identity (or a reserved identity for CIDRs, world, etc.) and the port/protocol; together with the local endpoint that owns the map, this expresses which identities may communicate and on which ports.
Conceptually, our policy creates entries like:
* Ingress Rule: Key: {LocalIdentity: 15872, RemoteIdentity: 23456}, Value: {Port: 8080, Protocol: TCP}
* Egress Rule: Key: {LocalIdentity: 15872, RemoteCIDR: 10.100.0.0/16}, Value: {Allow: true}
When a packet arrives at the TC hook for the api-server pod, the eBPF program extracts the source IP. It then looks up the source IP in another eBPF map (cilium_ipcache) to find its security identity. With the source and destination identities known, it performs a single, highly efficient lookup in the cilium_policy_ map. If a matching entry is found, the packet is allowed. If not, it's dropped.
We can inspect the computed policy for our specific endpoint:
# Get the endpoint ID for our api-server pod
ENDPOINT_ID=$(kubectl -n kube-system exec $CILIUM_POD -- cilium endpoint list | grep 'k8s:app=api-server' | awk '{print $1}')
# Dump the policy that applies to this endpoint.
# You will see the numeric identities and allowed peers.
kubectl -n kube-system exec $CILIUM_POD -- cilium endpoint policy get $ENDPOINT_ID
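The same decision data is visible in the raw eBPF maps themselves, which is useful when you want to confirm what the datapath (rather than the agent) believes. A sketch using the agent's bpf subcommands:

# IP -> security identity mapping used for the source-identity lookup
kubectl -n kube-system exec $CILIUM_POD -- cilium bpf ipcache list
# The per-endpoint policy map: allowed remote identities, ports, and protocols
kubectl -n kube-system exec $CILIUM_POD -- cilium bpf policy get $ENDPOINT_ID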
Step 4: The L7 DNS-Aware Policy Magic
The rule matchPattern: "*.api.internal.corp" is where Cilium's advanced capabilities shine. This cannot be implemented with a simple identity/IP lookup.
Here's how it works under the hood:
1. The eBPF program on the api-server's veth pair is programmed to identify DNS traffic (UDP port 53). When the pod makes a DNS request for payments.api.internal.corp, the eBPF program intercepts it and redirects it to the agent's DNS proxy.
2. The proxy checks whether payments.api.internal.corp matches the allowed pattern *.api.internal.corp. It does, so the agent allows the DNS request to proceed to kube-dns.
3. When the DNS response comes back (payments.api.internal.corp -> 10.1.2.3), it is intercepted again. The agent now updates an eBPF map on the node, creating an entry that effectively says: for identity 15872, traffic to IP 10.1.2.3 on port 443/TCP is allowed for a TTL of X seconds.
4. When the api-server application subsequently tries to connect to 10.1.2.3 on port 443, the regular eBPF policy enforcement program on the TC hook finds this dynamically created rule in the map and allows the packet. An attempt to connect to any other IP, or to the same IP on a different port, is dropped.

This entire process provides highly specific, L7-aware security enforcement at L3/L4 speeds in the kernel datapath.
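You can watch this mechanism populate state in real time. A sketch using the agent's FQDN cache (entries appear only after the pod has actually resolved a matching name):

# FQDN -> IP mappings learned by the DNS proxy, with their remaining TTLs
kubectl -n kube-system exec $CILIUM_POD -- cilium fqdn cache list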
Performance Tuning and Production Patterns
Mastering Cilium in production involves understanding and leveraging its advanced datapath features.
Pattern 1: XDP for Host-Level DDoS Mitigation
By default, Cilium's eBPF programs attach at the TC layer, which is already very fast. For ultimate performance, especially for dropping unwanted traffic, you can enable XDP mode.
In XDP mode, Cilium loads an eBPF program directly onto the physical network driver. This program can drop packets before they even enter the main kernel networking stack, making it incredibly efficient for mitigating volumetric attacks like SYN floods.
Implementation:
Enable XDP acceleration for Cilium's eBPF load balancer. This requires kube-proxy replacement and a network driver that supports native XDP. A minimal sketch using Helm values (the same setting can also be expressed in the cilium-config ConfigMap):
# Helm values for Cilium installation
kubeProxyReplacement: strict
loadBalancer:
  # acceleration modes: "native" (driver-level XDP), "best-effort", "disabled"
  acceleration: native
Performance Consideration:
XDP provides a massive performance boost for packet dropping. Benchmarks can show an increase from a few million packets per second (PPS) with iptables or TC to tens of millions of PPS with XDP on the same hardware. However, the trade-off is hardware/driver dependency. If the driver doesn't support native XDP, Cilium may fall back to generic XDP, which has less of a performance benefit.
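It is worth verifying that native XDP actually engaged rather than silently falling back. A sketch (eth0 is an illustrative interface name):

# The agent reports the acceleration mode it selected
kubectl -n kube-system exec $CILIUM_POD -- cilium status --verbose | grep -i xdp
# An attached XDP program is also visible on the physical NIC
ip link show dev eth0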
Pattern 2: High-Performance Masquerading with eBPF
When a pod sends traffic to an external IP, the packet's source IP must be translated (SNAT) to the node's IP for the return traffic to find its way back. This is called masquerading. Traditionally, this is handled by iptables using the MASQUERADE target, which relies on the kernel's connection tracking system (conntrack). conntrack tables have size limits and can be a source of lock contention and performance issues on nodes with a high rate of new connections.
Cilium implements this in eBPF, bypassing conntrack for connections originating from pods.
Implementation:
This is enabled via the bpf.masquerade option.
# Helm values for Cilium installation
# or set via cilium-config ConfigMap
kubeProxyReplacement: strict
bpf:
masquerade: true
Performance Consideration:
By using an eBPF map to track the pod_ip:port -> node_ip:port translations, Cilium avoids the overhead and lock contention of the central conntrack system. This results in lower latency for new connections and higher overall connection throughput per node, a critical advantage for egress-heavy workloads.
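To confirm that masquerading is handled in eBPF and to inspect the translations, you can query the agent directly. A sketch:

# Should report BPF-based masquerading rather than iptables
kubectl -n kube-system exec $CILIUM_POD -- cilium status | grep -i masquerading
# Dump the eBPF NAT table tracking pod_ip:port -> node_ip:port translations
kubectl -n kube-system exec $CILIUM_POD -- cilium bpf nat list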
Pattern 3: Debugging Policies with Hubble
With policies enforced deep in the kernel, debugging can be challenging. tcpdump can show you packets, but not why they were dropped. Hubble is Cilium's purpose-built observability tool that taps into the eBPF datapath to provide flow-level visibility.
Implementation:
When a connection from a frontend pod to the api-server fails, you can use the Hubble CLI to see exactly why.
# Ensure hubble-relay is port-forwarded or accessible
kubectl port-forward -n kube-system svc/hubble-relay 4245:80 &
# Observe for dropped packets from any frontend pod to any api-server pod
hubble observe --from-label app=frontend --to-label app=api-server --verdict DROPPED -f
# Example Output:
# TIMESTAMP SOURCE DESTINATION TYPE VERDICT REASON
# Oct 26 12:30:00.123 production-api/frontend-7b8c... -> production-api/api-server-5f4d... L4 DROPPED Policy denied, Port 8081 not allowed on ingress
The output is unambiguous. The packet was dropped by the policy engine because the destination port was 8081, but our policy only allows 8080. This level of detail, obtained with zero performance overhead on the source/destination pods, is invaluable for operating a secure, complex microservices environment.
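Hubble also surfaces L7 events, which is handy for verifying the DNS-aware egress policy from earlier. A sketch (flag names per recent Hubble CLI versions):

# Watch DNS queries and responses from the namespace as L7 flows
hubble observe --namespace production-api --protocol dns -f
# Watch L7-level drops, e.g. lookups rejected by the DNS proxy
hubble observe --namespace production-api --type l7 --verdict DROPPED -f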
Advanced Edge Cases and Gotchas
* hostNetwork: true Pods: Pods running in the host network namespace are not attached to a veth pair that Cilium can hook into; they share the node's identity. To apply policy to them, enable Cilium's host firewall feature and define a CiliumClusterwideNetworkPolicy that uses a nodeSelector, which Cilium then enforces on the node's network interfaces. This is a common requirement for node-level daemons like node-exporter.
* Kernel Version Dependencies: eBPF is a rapidly evolving kernel feature. Advanced Cilium functionality is directly tied to kernel versions. For example, efficient BPF-based host routing requires kernel 5.10+. Always consult the Cilium documentation for the feature-to-kernel-version mapping before planning a production deployment. Running on an old kernel (e.g., 4.14) will cause Cilium to fall back to less efficient implementations for certain features.
* Multi-Cluster Communication: Cilium's Cluster Mesh feature connects multiple Kubernetes clusters. It works by synchronizing service definitions and Cilium identities across clusters. An eBPF program on the egress path of a node can then directly route a packet destined for a remote cluster's service to a node in that cluster (often over a tunnel), completely bypassing traditional gateways and ingress controllers. This provides direct, policy-enforced pod-to-pod communication across cluster boundaries.
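For the kernel-version and Cluster Mesh caveats above, a couple of quick checks help before and after rollout. A sketch (cilium clustermesh status uses the cluster-management cilium CLI on your workstation, not the agent binary):

# Node kernel version, to compare against the feature matrix in the Cilium docs
uname -r
# Which host-routing implementation the agent selected (BPF vs. legacy)
kubectl -n kube-system exec $CILIUM_POD -- cilium status --verbose | grep -i "host routing"
# Mesh connectivity and synchronized clusters, once Cluster Mesh is enabled
cilium clustermesh status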
Conclusion
Moving from iptables to an eBPF-powered datapath like Cilium is more than a performance optimization; it's a paradigm shift in how we implement networking, security, and observability in Kubernetes. By moving logic from slow, generic kernel subsystems to fast, programmable, and context-aware hooks, we gain unprecedented scalability and visibility.
For senior engineers, mastering this technology means understanding the journey from a high-level policy manifest to the eBPF bytecode executing in the kernel. It means leveraging identity-based controls over brittle IP rules, utilizing L7-awareness for fine-grained security, and using tools like Hubble to debug policies at their point of enforcement. As clusters continue to grow in scale and complexity, the eBPF datapath is no longer an emerging trend—it is the production standard for high-performance Kubernetes networking.