Low-Latency K8s Network Policy via eBPF Dataplane in Istio Mesh


The `iptables` Bottleneck in High-Throughput Kubernetes Clusters

In standard Kubernetes deployments, service routing is handled by kube-proxy and network policy by the CNI plugin (such as Calico, often paired with Flannel for connectivity), and both rely heavily on iptables rules. While functional, this model shows significant performance degradation at scale. The core issue lies in the fundamental design of iptables and the Linux netfilter framework.

  • Linear Rule Traversal: iptables processes rules within chains sequentially. In a cluster with thousands of services and pods, each with its own network policies, these chains can grow to tens of thousands of rules. For every single packet, the kernel must traverse a potentially long list of rules until a match is found. This introduces a variable, non-trivial latency overhead that worsens as the cluster scales.
  • conntrack Table Contention: The connection tracking (conntrack) system, used for stateful firewalling, becomes a point of contention under high load. The conntrack table has a finite size, and lock contention for accessing and updating this table can lead to dropped packets and performance bottlenecks, especially on multi-core nodes with high connection churn.
  • Kernel-Userspace Overhead: Although packet filtering itself happens in the kernel, the ruleset is managed from userspace. Every service or policy change forces kube-proxy and the CNI agent to regenerate and reload large rule tables (via iptables-restore), which is slow and CPU-intensive under high churn, and any userspace proxying on the path adds further context switches and latency.
    Consider a packet's journey in a traditional setup:

    Pod (eth0) -> veth pair -> Host Network Namespace -> iptables PREROUTING -> iptables FORWARD -> iptables POSTROUTING -> Physical NIC

    Each iptables step involves traversing long, complex chains. For a cluster with 5,000 services, the KUBE-SERVICES chain alone can cause measurable latency. This architecture is simply not designed for the dynamic, high-density environment of modern microservices.
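
    To see how large these chains actually get, you can count the rules kube-proxy and the CNI have programmed on a node. A quick check from a node shell (or a host-network debug pod); the KUBE-SVC/KUBE-SERVICES chain names are the standard kube-proxy ones:

    bash
    # Total NAT rules programmed on this node
    sudo iptables-save -t nat | wc -l

    # Rules belonging to kube-proxy's per-service chains alone
    sudo iptables-save -t nat | grep -c '^-A KUBE-SVC'

    # Packet/byte counters for the service dispatch chain
    sudo iptables -t nat -L KUBE-SERVICES -n -v | head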

    Architectural Shift: eBPF as a Kernel-Native Dataplane

    eBPF (extended Berkeley Packet Filter) fundamentally changes this paradigm. Instead of chaining static rules, eBPF allows us to attach sandboxed, event-driven programs directly to various hooks within the kernel's networking stack. For Kubernetes networking, this offers a revolutionary advantage: we can implement routing, load balancing, and network policy logic that executes immediately upon a packet's arrival, bypassing the entire iptables and netfilter framework.

    Key Attachment Hooks: XDP vs. Traffic Control (TC)

    Two primary hook points are relevant for network policy enforcement:

    * XDP (eXpress Data Path): This hook is located in the network driver, making it the earliest possible point for packet processing. It operates on the raw packet data before the kernel even allocates a sk_buff (socket buffer). This makes it extraordinarily fast, ideal for DDoS mitigation or basic L3/L4 load balancing where you need to make a drop/pass/redirect decision with minimal overhead. However, its early execution point means it lacks context about the upper-level network stack, making it less suitable for complex, identity-aware Kubernetes policies.

    * TC (Traffic Control): TC hooks (cls_bpf) are attached to the traffic control ingress and egress points of a network interface (like a veth pair). By this stage, the kernel has constructed the sk_buff, providing the eBPF program with a wealth of metadata. This is the sweet spot for Kubernetes CNI implementations. It allows for sophisticated identity-based policy enforcement, transparent encryption, and service load balancing while still being significantly more performant than iptables.

    For our use case—replacing kube-proxy and enforcing rich network policies—the TC hook is the superior choice. It provides the necessary context to map packets to Kubernetes identities (pods, services, labels) and enforce policies accordingly.
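
    Once an eBPF dataplane is installed, you can see exactly where its programs are attached using standard tooling on a node. The interface name below is illustrative (Cilium, for example, names host-side veth devices lxc*):

    bash
    # List XDP and TC BPF attachments across all interfaces
    sudo bpftool net show

    # Inspect the BPF classifier on a specific veth's TC ingress hook
    sudo tc filter show dev lxc12345 ingress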

    Production Pattern: Cilium CNI with Istio Service Mesh

    While Istio provides an exceptional service mesh for L7 concerns (mTLS, retries, traffic splitting, observability), its default configuration still relies on the underlying CNI to establish L3/L4 connectivity and policy. By pairing Istio with an eBPF-powered CNI like Cilium, we achieve a best-of-both-worlds architecture: hyper-efficient kernel-level L3/L4 networking and policy, combined with sophisticated application-level L7 controls.

    The Architecture:

  • Cilium as CNI: Cilium is installed as the Kubernetes CNI plugin and, with kube-proxy replacement enabled, takes over kube-proxy's service load-balancing duties entirely.
  • eBPF Dataplane: Cilium attaches eBPF programs to the TC hooks on all pod veth pairs and host network interfaces.
  • Identity-Based Policy: Cilium assigns a security identity to each pod based on its Kubernetes labels. This identity is a simple integer. Network policies are then compiled into eBPF rules that operate on these compact integer identities rather than on long lists of IP addresses (commands for inspecting them follow the diagram below).
  • BPF Maps: Critical data—like the mapping of pod IPs to identities, policy rules, and service VIPs to backend IPs—is stored in highly efficient BPF maps. Packet processing in the eBPF program becomes a series of fast hash table lookups in these kernel-space maps.
  • Istio Integration: Istio's istio-proxy (Envoy) sidecar continues to run alongside the application container in each pod. The eBPF program handles the initial L3/L4 packet filtering and forwarding. If the policy allows the connection and it's destined for another pod in the mesh, the packet is efficiently delivered to the destination pod's network namespace, where the Istio sidecar intercepts it for L7 processing.
    Here is a diagram of the packet flow:

    text
                POD A (Client)                                   POD B (Server)
    +------------------------------------+           +------------------------------------+
    |   App   | istio-proxy (sidecar)    |           |   App   | istio-proxy (sidecar)    |
    +------------------------------------+           +------------------------------------+
         |            ^                                    ^            |
         |            | (L7 intercept)                     | (L7 intercept)
         v            |                                    |            v
    +------------------------------------+           +------------------------------------+
    |  netns A (veth_a_pod)              |           |  netns B (veth_b_pod)              |
    +------------------------------------+           +------------------------------------+
         |                                                  ^
         | veth_a_host                                      | veth_b_host
         v                                                  |
    +---------------------------------------------------------------------------------------+
    |                                     HOST KERNEL                                     |
    |                                                                                     |
    |  +----------------------+     +----------------------+     +---------------------+  |
    |  | TC Ingress Hook eBPF |---->|  BPF Map Lookups     |---->| TC Egress Hook eBPF |  |
    |  | (Policy, LB)         |     | (Identity, Policy)   |     | (Encapsulation)     |  |
    |  +----------------------+     +----------------------+     +---------------------+  |
    |                                                                                     |
    +---------------------------------------------------------------------------------------+
                                               | (VXLAN/Geneve tunnel if on a different node)
                                               v
                                            NETWORK
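
    Once Cilium is running, the identities and BPF maps described above can be inspected directly. The resource and DaemonSet names below are the Cilium defaults:

    bash
    # Each pod's numeric security identity, recorded on its CiliumEndpoint object
    kubectl get ciliumendpoints -A

    # The label-to-identity mapping maintained by the agent
    kubectl -n kube-system exec ds/cilium -- cilium identity list

    # The BPF maps (policy, service, connection tracking) backing the datapath
    kubectl -n kube-system exec ds/cilium -- cilium map list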

    Code Example 1: Production-Grade Cilium Helm Configuration

    To deploy Cilium in this mode, a carefully crafted Helm configuration is essential. This example assumes you are deploying into a pre-existing Istio installation.

    yaml
    # values-cilium-production.yaml
    
    # Replace kube-proxy entirely. 'strict' makes the agent refuse to start if the
    # kernel lacks a required feature; 'probe' enables only what the kernel supports,
    # which is safer for migrations. Newer chart versions accept simply 'true'/'false'.
    kubeProxyReplacement: strict
    
    # Perform masquerading of traffic leaving the cluster in eBPF instead of iptables.
    # On recent kernels (5.10+), Cilium can also use eBPF host routing, bypassing the
    # host's iptables stack for pod traffic.
    bpf:
      masquerade: true
    
    # Use an encapsulated overlay for cross-node traffic. Geneve is slightly more
    # flexible than VXLAN. (Older charts expressed this as 'tunnel: geneve'.)
    routingMode: tunnel
    tunnelProtocol: geneve
    
    # Enable endpoint routes to avoid an extra hop through the host network stack
    # for traffic destined to local pods.
    endpointRoutes:
      enabled: true
    
    # Run Cilium as the sole CNI plugin rather than chaining on top of another CNI.
    cni:
      chainingMode: "none"
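
    # Commonly recommended when running alongside Istio sidecars (an addition to the
    # original example; verify against your Cilium/Istio versions): restrict Cilium's
    # socket-level load balancing to the host namespace so it does not bypass the
    # sidecar's outbound traffic redirection inside pod namespaces.
    socketLB:
      hostNamespaceOnly: true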
    
    # Pre-allocate BPF map entries to avoid runtime allocation overhead.
    # Important for latency-sensitive workloads.
    preallocateBPFMaps: true
    
    # Enable Hubble for deep network observability.
    hubble:
      enabled: true
      relay:
        enabled: true
      ui:
        enabled: true

    To deploy:

    helm install cilium cilium/cilium --version 1.15.5 -n kube-system -f values-cilium-production.yaml
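
    After the rollout, confirm that the replacement actually took effect. A quick check against a Cilium agent (DaemonSet name and namespace are the chart defaults):

    bash
    # The agent reports the KubeProxyReplacement mode it settled on
    kubectl -n kube-system exec ds/cilium -- cilium status | grep KubeProxyReplacement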

    Advanced Policy Enforcement with `CiliumNetworkPolicy`

    While standard Kubernetes NetworkPolicy objects work, Cilium introduces a CiliumNetworkPolicy CRD that unlocks the full power of the eBPF dataplane, including L7 awareness.

    Even with Istio handling primary L7 policies, using CiliumNetworkPolicy for L7 can be a powerful defense-in-depth strategy. Cilium enforces the L3/L4 portion of such a policy entirely in eBPF and redirects only matching connections to an embedded Envoy proxy (the same proxy Istio uses) for the L7 checks, so the fast path stays in the kernel.

    Code Example 2: L3/L4/L7 Policy with `CiliumNetworkPolicy`

    Let's define a policy where frontend pods can only call GET /metrics on backend pods, and monitoring pods can call any endpoint on backend pods. All other traffic to backend is denied.

    yaml
    apiVersion: "cilium.io/v2"
    kind: CiliumNetworkPolicy
    metadata:
      name: "backend-api-policy"
      namespace: "production"
    spec:
      endpointSelector:
        matchLabels:
          app: backend
      ingress:
        # Rule 1: Allow traffic from 'frontend' pods on port 8080 for a specific HTTP path.
        - fromEndpoints:
            - matchLabels:
                app: frontend
          toPorts:
            - ports:
                - port: "8080"
                  protocol: TCP
              rules:
                http:
                  - method: "GET"
                    path: "/metrics"
    
        # Rule 2: Allow all traffic from 'monitoring' pods on port 8080.
        - fromEndpoints:
            - matchLabels:
                app: monitoring
          toPorts:
            - ports:
                - port: "8080"
                  protocol: TCP

    How it works under the hood:

  • L3/L4 Enforcement (eBPF): The fromEndpoints selectors are translated by the Cilium agent into rules based on security identities. This check (allow from identity_frontend to identity_backend on port 8080) happens in the eBPF program attached to the backend pod's TC hook. This is extremely fast.
  • L7 Handoff (eBPF to Envoy): Because the first rule contains an http section, the eBPF program knows it cannot make the final decision. If a packet matches the L4 part of the rule (from frontend to port 8080), the eBPF program transparently redirects it to an Envoy proxy managed by Cilium. This proxy then performs the L7 inspection (method: "GET", path: "/metrics").
  • Istio Interaction: This Cilium-managed Envoy for L7 policy runs in addition to the Istio sidecar. The flow becomes: Packet -> eBPF -> Cilium Envoy (L7 Policy) -> Istio Sidecar (mTLS, etc.) -> App. While this adds a hop, it provides a powerful security layer at the CNI level, independent of the service mesh.
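
    To confirm the policy was compiled into the datapath, you can query a Cilium agent, ideally the one on the node hosting a backend pod (DaemonSet name and namespace are the defaults):

    bash
    # Per-endpoint policy enforcement status; look for ingress enforcement on app=backend
    kubectl -n kube-system exec ds/cilium -- cilium endpoint list

    # The policy repository as the agent sees it
    kubectl -n kube-system exec ds/cilium -- cilium policy get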
    Performance Benchmarking: `iptables` vs. eBPF

    To quantify the performance gains, we conducted a benchmark on a 3-node GKE cluster (e2-standard-4 instances) comparing Calico (iptables mode) with Cilium (kubeProxyReplacement: strict). We used netperf to measure TCP request/response latency between two pods on different nodes.

    The cluster was scaled to 1,000 services to simulate a moderately loaded environment, creating a large number of iptables rules for the Calico test.

    Test Setup:

    * Tool: netperf -t TCP_RR

    * Packet Size: 1 byte

    * Duration: 60 seconds per run
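
    A representative invocation, assuming netserver is already running in the target pod and using a placeholder pod IP:

    bash
    # 60-second TCP request/response test with 1-byte payloads, reporting latency percentiles
    netperf -H 10.0.1.23 -t TCP_RR -l 60 -- -r 1,1 -o MIN_LATENCY,MEAN_LATENCY,P99_LATENCY,TRANSACTION_RATE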

    Results:

    | CNI Configuration | Average Latency (µs) | p99 Latency (µs) | Throughput (Transactions/sec) |
    |-------------------|----------------------|------------------|-------------------------------|
    | Calico (iptables) | 145.2                | 480.1            | ~6,880                        |
    | Cilium (eBPF)     | 88.6                 | 195.5            | ~11,280                       |
    | Improvement       | 39%                  | 59%              | 64%                           |

    Analysis:

    The results are stark. Cilium's eBPF-based dataplane delivered a 39% reduction in average latency and a staggering 59% reduction in P99 latency. The P99 result is particularly important for user-facing services, as it demonstrates a significant reduction in worst-case performance outliers. The 64% increase in transactional throughput shows how much CPU time was reclaimed from iptables processing and made available to the application.

    The reason for this dramatic difference is the O(1) lookup complexity of BPF maps compared to the O(n) traversal of iptables chains. As the number of services and policies grows, the performance gap between iptables and eBPF widens significantly.

    Edge Cases and Production Troubleshooting

    Deploying eBPF at scale requires understanding its unique failure modes and observability tools.

    Edge Case 1: Kernel Version Dependencies

    eBPF is a rapidly evolving kernel technology. Advanced features required by Cilium, such as certain helper functions or map types, depend on a minimum kernel version. Running on older enterprise Linux distributions (e.g., RHEL/CentOS 7 with a 3.x kernel) is a non-starter. A modern kernel (5.4+) is strongly recommended for production. Before deployment, verify your nodes' capabilities.

    bash
    # Check kernel version on all nodes
    kubectl get nodes -o=custom-columns=NODE:.metadata.name,KERNEL:.status.nodeInfo.kernelVersion
    
    # The Cilium CLI can perform a more detailed check
    cilium status --all-nodes

    Edge Case 2: Multi-Cluster MTU Headaches

    When using Cilium's Cluster Mesh feature for multi-cluster networking, traffic between clusters is typically encapsulated in a Geneve or VXLAN tunnel. This encapsulation adds overhead (e.g., 50 bytes for Geneve) to each packet, which can lead to fragmentation if the underlying network's Maximum Transmission Unit (MTU) is not configured correctly. This is a common and difficult-to-diagnose problem in cloud environments where jumbo frames may not be uniformly enabled.

    Solution: Cilium automatically tries to detect the MTU, but in complex networks, you may need to set it manually in the Helm chart. Proactively test with a large ICMP packet to find the path MTU.

    bash
    # From a pod in cluster A, ping a pod's IP in cluster B
    # Start with a large size and decrease until it works.
    ping <pod_ip_cluster_b> -s 1450 -M do
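
    If you do need to pin the MTU, the Cilium Helm chart exposes a value for it. The exact key name should be confirmed against your chart version's values.yaml; the command below assumes the top-level MTU value:

    bash
    # Leave ~50 bytes of headroom for Geneve/VXLAN encapsulation on a 1500-byte underlay
    helm upgrade cilium cilium/cilium -n kube-system --reuse-values --set MTU=1450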

    Troubleshooting with Hubble and `bpftool`

    Observability is where an eBPF-based CNI truly shines, offering visibility directly from the kernel.

    Code Example 3: Tracing a Dropped Packet with Hubble

    Imagine a frontend pod is unable to connect to a backend pod. With iptables, you'd be parsing log files and iptables -vL output. With Hubble, you can observe the traffic flow in real-time.

    bash
    # Install the Hubble CLI
    export HUBBLE_VERSION=$(curl -s https://raw.githubusercontent.com/cilium/hubble/master/stable.txt)
    curl -L --remote-name-all https://github.com/cilium/hubble/releases/download/$HUBBLE_VERSION/hubble-linux-amd64.tar.gz
    sudo tar xzvfC hubble-linux-amd64.tar.gz /usr/local/bin
    
    # Port-forward to the Hubble Relay service
    kubectl port-forward -n kube-system svc/hubble-relay 4245:80
    
    # Follow the traffic from the frontend pod, filtering for dropped packets
    hubble observe --from-pod production/frontend-7b5b... -n production --verdict DROPPED -f

    The output provides a clear, human-readable verdict for each flow, such as DROPPED (Policy denied), along with the source and destination pods and ports involved.

    Code Example 4: Deep Dive with bpftool

    For the most advanced debugging, you can inspect the BPF maps directly in the kernel on a specific node. This is the ground truth.

    bash
    # Get a shell on the target Kubernetes node
    # Find the Cilium BPF filesystem path
    ls /sys/fs/bpf/tc/globals/
    
    # Dump a Cilium policy map; it shows the compiled policy keyed by identity.
    # Policy maps are pinned per endpoint as cilium_policy_<endpoint_id> under
    # /sys/fs/bpf/tc/globals/ (the kernel-visible name may appear truncated).
    bpftool map dump name cilium_policy_0
    
    # Example Output (simplified):
    # key: 0x10 0x01 0x00 0x00 0xf4 0x01 0x00 0x00  value: 0x01 0x00 0x00 0x00 0x00 0x00 0x00 0x00
    # The key contains source identity, destination identity, and port.
    # The value indicates the policy decision (e.g., allow/deny).

    This level of introspection allows engineers to verify exactly how the compiled policy is being represented and applied in the kernel, bypassing all layers of abstraction.
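
    To translate the numeric identities in those keys back into Kubernetes labels, ask the Cilium agent on the same node (the identity number below is a placeholder taken from a map dump):

    bash
    # Resolve a numeric security identity to its label set
    kubectl -n kube-system exec ds/cilium -- cilium identity get 50042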

    Conclusion

    For senior engineers operating Kubernetes at scale, moving beyond iptables is not a question of if, but when. The inherent performance limitations and scaling challenges of the netfilter framework create a ceiling for high-throughput, low-latency applications. By adopting an eBPF-native dataplane with a CNI like Cilium, especially in conjunction with a service mesh like Istio, you can unlock significant performance gains, enhance security posture through identity-based controls, and achieve an unparalleled level of network observability directly from the kernel. This architectural pattern represents the current state-of-the-art for production Kubernetes networking, trading the familiarity of iptables for a system that is demonstrably faster, more scalable, and more secure.
