eBPF & Cilium: High-Throughput K8s Networking by Bypassing iptables

Goh Ling Yong

The `iptables` Bottleneck in Large-Scale Kubernetes

For any engineer who has operated Kubernetes at scale, the performance characteristics of kube-proxy in its default iptables mode become a tangible liability. While functional for smaller clusters, this model introduces significant performance degradation as the number of Services and Pods grows. The core issue lies in the fundamental design of iptables and the Linux Netfilter framework.

Every Kubernetes Service translates into a series of iptables rules within the KUBE-SERVICES chain. For each packet destined for a Service ClusterIP, the kernel must linearly traverse this chain to find a matching rule. Subsequently, it jumps to a corresponding KUBE-SVC-* chain, which contains endpoint rules. This chain is then traversed to select a backend Pod IP. The complexity of this lookup is directly proportional to the number of Services, resulting in O(n) overhead for service routing.

Consider a cluster with 5,000 Services. A packet destined for the 4,999th Service in the chain must pass through 4,998 non-matching rules before finding its target. This traversal happens in kernel space for every single packet, consuming non-trivial CPU cycles and introducing measurable latency.
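
To make the scale concrete, a rough way to count how many Service-related rules kube-proxy has programmed on a node (assuming kube-proxy is running in its default iptables mode):

bash
# On a worker node: count the Service-related NAT rules programmed by kube-proxy
sudo iptables-save -t nat | grep -c '^-A KUBE-SERVICES'   # one match rule per Service/port
sudo iptables-save -t nat | grep -c '^-A KUBE-SVC-'       # per-Service load-balancing rules
sudo iptables-save -t nat | grep -c '^-A KUBE-SEP-'       # one entry per Service endpoint

# Total size of the ruleset; on large clusters this runs to tens of thousands of lines
sudo iptables-save | wc -l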

Visualizing the Problem:

On a large cluster, the output of iptables-save becomes unwieldy, often spanning tens of thousands of lines. The performance impact isn't just theoretical. We can observe it through kernel profiling tools like perf during periods of high network traffic.

bash
# On a cluster node under load, sample kernel-wide; the iptables traversal
# happens in kernel/softirq context, not inside the kube-proxy process itself
perf top -F 99

# Illustrative output showing significant time spent in netfilter/iptables code
# ...
#    25.17%  [kernel]  [k] nf_hook_slow
#    15.89%  [kernel]  [k] ipt_do_table
# ...

This overhead manifests in several ways:

  • Increased Latency: Higher per-packet processing time directly adds to network latency.
  • Reduced Throughput: CPU contention on worker nodes, as ksoftirqd processes spend more time on NET_RX softirqs for packet processing, can cap network throughput.
  • Slow Pod Startup: When a new Pod is created, kube-proxy on every node must update its iptables rules. With thousands of rules, acquiring the necessary locks and applying changes can become a slow, serialized process, delaying the readiness of new endpoints.

This fundamental scaling limitation necessitates a more efficient data plane, one that avoids the linear traversal cost of iptables. This is precisely the problem that Cilium solves using eBPF.

    eBPF and Cilium: A Paradigm Shift in Kernel Networking

    eBPF (extended Berkeley Packet Filter) allows sandboxed programs to be attached to various hook points within the Linux kernel, enabling developers to safely and efficiently extend kernel functionality. Cilium leverages this capability to create a highly efficient networking and security data plane for Kubernetes.

    Instead of routing packets through Netfilter's complex chains, Cilium attaches eBPF programs directly to network device drivers and the Traffic Control (TC) subsystem. The key hook points are:

    * XDP (eXpress Data Path): This hook is located directly within the network driver's receive path, making it the earliest possible point for packet processing. Cilium uses XDP primarily for high-performance DDoS mitigation and LoadBalancer service implementation, as it can make decisions before the packet even enters the main kernel networking stack.

    * TC (Traffic Control): The TC hook cls_bpf allows an eBPF program to be attached to the ingress and egress paths of a network interface (both physical and virtual, like veth pairs). This is where Cilium implements the bulk of its functionality, including routing, policy enforcement, and load balancing.
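
To make the TC hook concrete, here is a generic sketch (not Cilium's own loader) of attaching a compiled eBPF object to an interface's ingress path; the object file and section names are placeholders:

bash
# Generic illustration (not Cilium itself): attach a compiled eBPF object
# to the TC ingress hook of eth0 using the clsact qdisc.
# 'my_prog.o' and the 'tc_ingress' section name are hypothetical placeholders.
tc qdisc add dev eth0 clsact
tc filter add dev eth0 ingress bpf direct-action obj my_prog.o sec tc_ingress

# Verify what is attached (Cilium's own programs show up the same way)
tc filter show dev eth0 ingress
bpftool net show dev eth0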

    The Cilium Data Path:

    A packet entering a Pod's network namespace follows this simplified path:

  • Packet arrives at the host's physical NIC.
  • (Optional) An XDP eBPF program performs initial filtering or LoadBalancer redirection.
  • The packet is passed up to the TC ingress hook on the physical device.
  • The kernel routes the packet to the Pod's veth pair.
  • An eBPF program on the TC ingress hook of the veth endpoint executes.
  • This is where the magic happens: instead of traversing iptables chains, the eBPF program performs a highly efficient lookup in an eBPF map.

Identity-Based Security and O(1) Lookups:

    Cilium's core innovation is its identity-based security model. It decouples security from network location (IP addresses).

  • Cilium monitors the Kubernetes API server for Pods and their labels.
  • It assigns a unique, 16-bit numerical Cilium Identity to each unique set of labels (e.g., app=frontend,env=prod might get identity 12345).
  • This mapping (labels -> identity) is shared across the cluster.
  • When a CiliumNetworkPolicy is created, Cilium translates it into rules based on these numerical identities, not IP addresses.
  • These rules are stored in an eBPF map, which is a key-value store in the kernel. The lookup key might be a tuple of (source_identity, dest_identity, dest_port), and the value is the policy decision (e.g., ALLOW or DENY).
  • Because eBPF maps are implemented as hash tables, lookups are extremely fast—effectively O(1). The eBPF program simply extracts the source identity (already attached to the packet as metadata by Cilium), looks up the policy in the map, and enforces the result. The number of services or policies in the cluster has no impact on the per-packet lookup time.
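
On a live cluster, these identities and the resulting per-endpoint policy maps can be inspected from any Cilium agent pod; the pod name below is a placeholder:

bash
# Run inside a Cilium agent pod (the pod name is a placeholder)
kubectl -n kube-system exec -it cilium-xxxxx -- cilium identity list    # label set -> numeric identity
kubectl -n kube-system exec -it cilium-xxxxx -- cilium endpoint list    # endpoints with their identities

# Dump the BPF policy map for a specific endpoint ID taken from the list above
kubectl -n kube-system exec -it cilium-xxxxx -- cilium bpf policy get <endpoint-id>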

    Production Implementation: Migrating from an `iptables`-based CNI

    A live migration from a CNI like Calico (in iptables mode) to Cilium is a non-trivial but feasible operation that requires careful planning.

    High-Level Migration Strategy:

  1. Preparation: Ensure your kernel version is sufficient (5.4+ recommended for full features). Taint or cordon each node before changing its CNI so that new pods are not scheduled onto it mid-migration.
  2. Installation: Install the Cilium Helm chart in CNI chaining mode (e.g., cni.chainingMode=generic-veth). In this mode, Cilium co-exists with the old CNI but does not manage pod networking itself.
  3. Policy Translation: Convert existing networking.k8s.io NetworkPolicy objects to cilium.io/v2 CiliumNetworkPolicy objects. This is a good time to leverage advanced L7 features.
  4. Cordon & Drain: One by one, cordon and drain each node. This evicts existing pods. (Steps 4-7 are sketched as a per-node script after this list.)
  5. CNI Reconfiguration: On the drained node, remove the old CNI configuration (/etc/cni/net.d/10-calico.conflist) and restart kubelet.
  6. Cilium Full Mode: Update the Cilium DaemonSet on that node to run in full mode (not chained). This is typically done by labeling the node and having the Helm chart use a nodeSelector.
  7. Uncordon: Uncordon the node. New pods scheduled here will now be managed by Cilium.
  8. Repeat: Repeat this process for all nodes in the cluster.
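
A minimal per-node sketch of steps 4-7, assuming a Calico-based node and a cni=cilium node label that the Cilium DaemonSet's nodeSelector keys on; the node name, CNI config path, and label are illustrative:

bash
# Minimal per-node sketch (steps 4-7); node name, CNI config path and the
# 'cni=cilium' label are illustrative assumptions.
NODE=worker-01

kubectl cordon "$NODE"
kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data

# On the node itself: remove the old CNI config and restart kubelet
ssh "$NODE" 'sudo rm /etc/cni/net.d/10-calico.conflist && sudo systemctl restart kubelet'

# Switch the Cilium agent on this node to full (non-chained) mode via node label
kubectl label node "$NODE" cni=cilium --overwrite

kubectl uncordon "$NODE"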
Code Example 1: Production-Grade Cilium Helm Configuration

    This values.yaml snippet demonstrates a configuration for a high-performance cluster, enabling kube-proxy replacement and other advanced features.

yaml
# values-production.yaml for Cilium Helm chart

# Replace kube-proxy entirely with Cilium's eBPF implementation.
# This provides a massive performance boost for services.
kubeProxyReplacement: "strict"

# Required for kube-proxy replacement to talk to the API server
k8sServiceHost: "your-api-server-endpoint.internal"
k8sServicePort: "6443"

# Enable BPF-based masquerading for traffic leaving the cluster,
# and increase eBPF map sizes for large clusters.
# The map sizes are examples; tune based on cluster size and connection rate
# (inspect the live maps with: cilium bpf map list).
bpf:
  masquerade: true
  # Connection tracking table sizes
  ctTcpMax: 2097152
  ctAnyMax: 1048576
  # NAT table size
  natMax: 1048576
  # Per-endpoint policy map size
  policyMapMax: 32768

# Enable Hubble for deep observability
hubble:
  enabled: true
  relay:
    enabled: true
  ui:
    enabled: true

# Enable high-performance XDP acceleration for NodePort/LoadBalancer traffic
# NOTE: Requires a network driver with native XDP support
loadBalancer:
  acceleration: native

# Set resource requests and limits for the cilium-agent daemonset
resources:
  requests:
    cpu: 200m
    memory: 256Mi
  limits:
    cpu: 2
    memory: 2Gi

    Deploying this requires careful validation using cilium status and cilium connectivity test to ensure all components are healthy and the data plane is functioning as expected.
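
A sketch of rolling this configuration out and validating it, assuming the cilium CLI is installed locally; the chart version below is a placeholder:

bash
# Install or upgrade Cilium with the production values above
# (the chart version is a placeholder; pin one appropriate for your cluster)
helm repo add cilium https://helm.cilium.io/
helm upgrade --install cilium cilium/cilium \
  --namespace kube-system \
  --version 1.14.5 \
  -f values-production.yaml

# Validate with the cilium CLI
cilium status --wait
cilium connectivity test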

    Advanced `CiliumNetworkPolicy` Patterns

    The standard Kubernetes NetworkPolicy is limited to L3/L4 (IP/Port). The CiliumNetworkPolicy CRD unlocks L7-aware enforcement directly in eBPF.

    Code Example 2: L7 Policy Enforcement for Kafka

Imagine a scenario where a billing-service needs to produce messages to the payments and refunds topics in Kafka, but should be prevented from accessing any other topic. This fine-grained control is impossible with standard network policies.

    yaml
    apiVersion: "cilium.io/v2"
    kind: CiliumNetworkPolicy
    metadata:
      name: "kafka-producer-policy"
      namespace: "kafka"
    spec:
      # Apply this policy to Kafka broker pods
      endpointSelector:
        matchLabels:
          app: kafka-broker
      # Ingress rules: what can connect TO the brokers
      ingress:
      - fromEndpoints:
        - matchLabels:
            # Allow connections FROM billing-service pods
            io.kubernetes.pod.namespace: billing
            app: billing-service
        toPorts:
        - ports:
          - port: "9092"
            protocol: TCP
          # L7-aware rules for the Kafka protocol
          rules:
            kafka:
            # This is a list of allowed Kafka API requests
            - role: produce
              topic: "payments"
            - role: produce
              topic: "refunds"
            # Also allow necessary metadata requests
            - apiVersion: "*"
              apiKey: "Metadata"
            - apiVersion: "*"
              apiKey: "ApiVersions"

When a connection arrives at the Kafka port, the eBPF program on the TC hook does not parse the payload itself; it transparently redirects the traffic to Cilium's L7 proxy, which understands the Kafka protocol. The proxy extracts the API key and topic name from each request and compares them against the allowed rules in the policy. If the billing-service attempts to produce to a logs topic, the request is denied before it ever reaches the Kafka broker process.

    Code Example 3: DNS-aware Egress Policies

    A common security requirement is to restrict egress traffic to specific external domains. IP-based rules are brittle, as IPs can change. Cilium solves this with DNS-aware policies.

    yaml
    apiVersion: "cilium.io/v2"
    kind: CiliumNetworkPolicy
    metadata:
      name: "egress-to-github-api"
      namespace: "ci-cd"
    spec:
      endpointSelector:
        matchLabels:
          app: git-runner
      egress:
      - toEndpoints:
        - matchLabels:
            # Allow communication to pods in the same namespace
            'k8s:io.kubernetes.pod.namespace': ci-cd
      - toFQDNs:
        - matchNames:
          # Allow DNS lookups and subsequent TCP connections to this domain
          - "api.github.com"
        toPorts:
        - ports:
          - port: "443"
            protocol: TCP
      # Also need to allow DNS lookups themselves
      - toEndpoints:
        - matchLabels:
            'k8s:io.kubernetes.pod.namespace': kube-system
            'k8s:k8s-app': kube-dns
        toPorts:
        - ports:
          - port: "53"
            protocol: UDP
          rules:
            dns:
            - matchPattern: "*.github.com"

Cilium enforces this with a transparent DNS proxy built into the agent. Because the policy contains a dns rule, DNS queries from the git-runner pod are redirected to this proxy, which forwards them to the cluster DNS (e.g., CoreDNS) and inspects the responses. When the pod resolves api.github.com, Cilium sees the returned IP addresses and dynamically populates an eBPF map with these allowed egress IPs, associating them with the DNS name and a TTL. The eBPF program on the Pod's egress path checks outgoing packets against this map. If the destination IP is in the map for api.github.com, the packet is allowed. This handles dynamic IP changes gracefully, up to the TTL of the DNS record.
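
The learned FQDN-to-IP mappings can be inspected from inside the agent; the pod name is a placeholder:

bash
# Inside the Cilium agent pod on the node running the git-runner pod
# (the cilium pod name is a placeholder)
kubectl -n kube-system exec -it cilium-xxxxx -- cilium fqdn cache list

# Filter for the domain in question
kubectl -n kube-system exec -it cilium-xxxxx -- cilium fqdn cache list | grep github.com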

    Performance Benchmarking and Tuning

    To quantify the benefits, we can run a benchmark using netperf between two pods on different nodes.

    Benchmark Scenario:

    * Cluster: 500 nodes, 10,000 pods, 5,000 services.

    * Tool: netperf running in TCP_RR (TCP Request/Response) mode to measure latency and transaction rate (a sample invocation is shown after this list).

    * Configurations:

    1. Baseline (iptables): A CNI like Calico in iptables mode.

    2. Cilium (BPF): Cilium with kubeProxyReplacement: "strict".

    3. Cilium (XDP): Cilium with XDP acceleration enabled for LoadBalancer traffic.
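
A minimal sketch of the measurement itself, assuming netperf is available in both pods; the server pod IP is a placeholder:

bash
# Server side (in one pod): run netserver in the foreground
netserver -D

# Client side (in a pod on a different node): 10.0.2.10 is a placeholder pod IP.
# TCP_RR measures round-trip request/response transactions per second.
netperf -H 10.0.2.10 -t TCP_RR -l 60 -- -r 1,1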

    Hypothetical Results (Transactions per Second):

| Test Type                   | Baseline (iptables) | Cilium (BPF) | Cilium (XDP) | Performance Gain (vs Baseline) |
|-----------------------------|---------------------|--------------|--------------|--------------------------------|
| Pod-to-Pod (Intra-node)     | 45,000 tps          | 65,000 tps   | N/A          | +44%                           |
| Pod-to-Pod (Inter-node)     | 38,000 tps          | 52,000 tps   | N/A          | +37%                           |
| Pod-to-Service (ClusterIP)  | 22,000 tps          | 48,000 tps   | N/A          | +118%                          |
| External-to-Service (LB)    | 18,000 tps          | 41,000 tps   | 55,000 tps   | +205% (XDP)                    |

    The most dramatic improvement is seen in Service routing (Pod-to-Service and External-to-Service), where Cilium's eBPF-based load balancing completely avoids the iptables traversal bottleneck. The gains in Pod-to-Pod traffic come from a more optimized data path and more efficient policy enforcement.

    Advanced Tuning:

    For extreme workloads, tuning eBPF map sizes is crucial. If the connection tracking table fills up, new connections will be dropped. Monitor map pressure via Cilium's metrics or the CLI:

    bash
    # Check map pressure for a specific cilium agent pod
    kubectl exec -it -n kube-system cilium-xxxx -- cilium bpf map list
    
    # Look for maps like 'cilium_ct_any4_global' that are near their MaxEntries limit.

    If pressure is high, increase the corresponding bpf-ct-global-tcp-max or other map size parameters in the Cilium ConfigMap and restart the agents.
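
For example, the connection-tracking limits can be raised through the same Helm values used earlier (which Cilium renders into its ConfigMap), followed by a rolling restart of the agents; the numbers are illustrative:

bash
# Raise the connection tracking map sizes via Helm (values are examples)
helm upgrade cilium cilium/cilium -n kube-system --reuse-values \
  --set bpf.ctTcpMax=4194304 \
  --set bpf.ctAnyMax=2097152

# Roll the agents so the new map sizes take effect
kubectl -n kube-system rollout restart daemonset/cilium
kubectl -n kube-system rollout status daemonset/cilium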

    Observability and Troubleshooting with Hubble

    Because eBPF operates at the kernel level, traditional tools like tcpdump may not show packets that are dropped by an eBPF program. Hubble is Cilium's purpose-built observability tool that provides deep insights into the eBPF data path.

    Code Example 4: Tracing a Dropped Packet with Hubble CLI

    Imagine a developer reports that their frontend pod cannot connect to the backend pod on port 8080. A CiliumNetworkPolicy is likely the cause.

    bash
    # Follow network flows from 'frontend' to 'backend' and only show dropped verdicts
    hubble observe --from-pod my-app/frontend --to-pod my-app/backend --port 8080 -f --verdict DROPPED
    
    # Sample Output
    TIMESTAMP            SOURCE                            DESTINATION                       TYPE      VERDICT   REASON
    Jan 10 14:32:10.123  my-app/frontend-7b8c... (10.0.1.5) -> my-app/backend-5d4f... (10.0.2.10)   L3/L4     DROPPED   Policy denied on ingress

    This immediately confirms a policy issue. To find the exact rule, we can use a more verbose output:

    bash
# Get the full details of the drop events as JSON
hubble observe --from-pod my-app/frontend --to-pod my-app/backend \
  --verdict DROPPED -o json | jq '.flow'

# ... JSON output will contain details like:
# "traffic_direction": "INGRESS",
# "verdict": "DROPPED",
# "drop_reason_desc": "POLICY_DENIED"

    Hubble's UI provides a graphical representation of these flows, making it trivial to visualize service dependencies and identify unexpected or denied traffic patterns across the entire cluster.

    Edge Cases and Production Caveats

    * Host-Networking Pods: Pods running with hostNetwork: true do not have their own network namespace. Applying policy to them is complex. Cilium's Host-aware policies can target these pods, but the eBPF programs must be attached to the physical host device rather than a veth pair, which can have different performance implications.

    * Direct Server Return (DSR) for LoadBalancers: When using kubeProxyReplacement, Cilium can operate LoadBalancer services in DSR mode. In this mode, the request from an external client arrives at a node, the eBPF program selects a backend pod and forwards the packet, and the pod then replies directly to the client, bypassing the original node. This is extremely efficient and preserves the original client source IP, but the return path is asymmetric, which can be problematic for some stateful network devices.
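
A sketch of switching to DSR mode via Helm, assuming kube-proxy replacement is already enabled and the cluster uses native routing (DSR generally does not work with encapsulation); values are illustrative:

bash
# Switch LoadBalancer/NodePort handling to DSR mode (illustrative)
helm upgrade cilium cilium/cilium -n kube-system --reuse-values \
  --set loadBalancer.mode=dsr

# Verify: the KubeProxyReplacement details should report DSR as the LB mode
# (the cilium pod name is a placeholder)
kubectl -n kube-system exec -it cilium-xxxxx -- cilium status --verbose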

    * Kernel Version Dependencies: This is the most critical operational concern. eBPF is an evolving kernel feature. Advanced Cilium features are explicitly tied to kernel versions. For example:

    * BPF-based Host Routing (forwarding between devices entirely in eBPF, bypassing the host stack's iptables and routing layers): Requires kernel >= 5.10.

    * XDP Acceleration: Requires a network driver with eBPF/XDP support.

    * L7 Policy on egress: Requires kernel >= 4.19.

    A heterogeneous cluster with nodes running different kernel versions can lead to inconsistent behavior. Always consult the Cilium documentation for the feature-to-kernel version matrix before enabling advanced functionality.
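
A quick way to audit kernel heterogeneity across the fleet before enabling kernel-dependent features:

bash
# List the kernel version reported by each node
kubectl get nodes -o custom-columns=NAME:.metadata.name,KERNEL:.status.nodeInfo.kernelVersion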

    * Resource Management: eBPF programs and maps consume kernel memory, which is not managed by the container runtime's cgroups. The memory used by maps is determined by the size parameters in the Cilium ConfigMap. Over-provisioning can waste significant memory on each node, while under-provisioning can lead to dropped connections. Careful monitoring and tuning based on cluster scale are essential for production stability.
