Tuning Cilium's eBPF Datapath for Ultra-Low-Latency Microservices

12 min read
Goh Ling Yong
Technology enthusiast and software architect specializing in AI-driven development tools and modern software engineering practices. Passionate about the intersection of artificial intelligence and human creativity in building tomorrow's digital solutions.

The Unacceptable Cost of Sidecar Latency

For distributed systems where P99 latency is measured in single-digit milliseconds or microseconds—such as high-frequency trading, real-time bidding, or telco network functions—the conventional wisdom of service mesh architecture breaks down. The sidecar proxy model, popularized by Istio and Linkerd, injects a user-space proxy (typically Envoy) into the network path of every application pod. While this provides powerful L7 capabilities, it comes at a steep, non-negotiable performance cost.

Each packet traversing a service-to-service connection must pass through the TCP/IP stack four times (Pod -> Node Kernel -> Sidecar -> Node Kernel -> Pod) instead of twice. This journey involves multiple context switches between user space and kernel space, memory copies, and the processing overhead of the proxy itself. For latency-sensitive workloads, this can add hundreds of microseconds or even milliseconds to the request path, a penalty that is simply unacceptable.

This article assumes you're already aware of this problem. We will not cover the basics of eBPF or Cilium. Instead, we will focus exclusively on the advanced configuration and tuning techniques required to squeeze the maximum performance out of Cilium's eBPF-powered datapath, transforming it from a mere CNI into a high-performance, kernel-native service mesh fabric.

Section 1: Eradicating `kube-proxy` for a Direct eBPF Datapath

The first and most impactful optimization is the complete removal of kube-proxy. By default, Kubernetes services (ClusterIP, NodePort) are implemented by kube-proxy, which manipulates iptables or IPVS rules on each node. This adds another layer of kernel processing (netfilter hooks) that every service-bound packet must traverse.

Cilium can entirely replace this functionality by using eBPF hash maps in the kernel to store service-to-backend mappings. When a packet destined for a ClusterIP arrives at the TC (Traffic Control) hook, Cilium's eBPF program performs a direct map lookup and forwards the packet to a backend pod, completely bypassing iptables and IPVS.
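To see what this looks like in practice, you can dump the eBPF load-balancing state that backs the lookup. A quick sketch (assumes a running Cilium agent pod and bpftool installed on the node; the pinned map names are internal details and vary by version):

bash
# Service-to-backend mappings as programmed into the datapath
kubectl -n kube-system exec -it <cilium-pod> -- cilium bpf lb list

# The underlying pinned maps, viewed from the node itself
sudo bpftool map show | grep -i cilium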

Implementation: `kubeProxyReplacement`

This feature is enabled via the Cilium Helm chart or ConfigMap. The most robust mode is strict, which ensures Cilium is fully managing service routing.

Helm values.yaml Configuration:

yaml
# values.yaml for Cilium Helm chart
kubeProxyReplacement: strict # Options: disabled, probe, partial, strict

# For direct routing performance, especially on bare-metal or cloud VNIs
tunnel: disabled
autoDirectNodeRoutes: true

# Enable BPF masquerading for traffic leaving the cluster
bpf:
  masquerade: true

# Required for NodePort implementations without kube-proxy
enableNodePort: true
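Applying these values is a standard Helm operation; a minimal sketch, assuming the upstream cilium/cilium chart and a kube-system install:

bash
# Add the Cilium chart repository (skip if already configured)
helm repo add cilium https://helm.cilium.io
helm repo update

# Install or upgrade Cilium with the tuned values
helm upgrade --install cilium cilium/cilium \
  --namespace kube-system \
  -f values.yaml

# Restart the agents so every node picks up the new datapath configuration
kubectl -n kube-system rollout restart daemonset/cilium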

Deploying this configuration instructs the Cilium agent on each node to take over service handling. You can verify the replacement by checking for the absence of kube-proxy pods and iptables rules related to Kubernetes services.

bash
# Verify kube-proxy is not running
kubectl -n kube-system get pods -l k8s-app=kube-proxy
# Should return no resources found

# Inspect iptables rules on a node (should be minimal)
ssh node-1 'sudo iptables-save | grep KUBE-SERVICES'
# Should return nothing

# Inspect Cilium's eBPF service map
kubectl exec -it -n kube-system <cilium-pod> -- cilium service list
# You will see ClusterIPs mapped directly to backend Pod IPs

Performance Impact and Edge Cases

* Performance Gain: Removing netfilter traversal reduces per-packet CPU overhead and latency. Benchmarks commonly show a 10-20% latency reduction for service-to-service traffic compared to a kube-proxy baseline.

* Edge Case: hostNetwork Pods: Pods running with hostNetwork: true traditionally relied on kube-proxy's iptables rules to access ClusterIPs. With kubeProxyReplacement, Cilium uses eBPF programs attached to the host's network devices and cgroups to transparently redirect this traffic, ensuring seamless compatibility.

* Edge Case: Direct Server Return (DSR): For NodePort services, Cilium's eBPF implementation supports DSR (SNAT is the default load-balancer mode, so DSR must be enabled explicitly). When a request arrives at Node A for a service whose backend pod is on Node B, Node A's eBPF program forwards the packet directly to Node B's pod. The pod on Node B then replies directly to the external client, bypassing Node A on the return path. This significantly reduces latency for north-south traffic, but it requires the underlying network to allow packets whose source IP differs from the sending node's IP.
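If you want DSR, the relevant Helm knob is, to the best of my knowledge, loadBalancer.mode (verify against your chart version); a hedged sketch:

bash
# Switch the eBPF load balancer to DSR (requires native routing, i.e. tunnel: disabled,
# and a fabric that accepts packets with "foreign" source IPs)
helm upgrade cilium cilium/cilium \
  --namespace kube-system \
  --reuse-values \
  --set loadBalancer.mode=dsr

# Confirm the mode the agent actually applied
kubectl -n kube-system exec -it <cilium-pod> -- cilium status --verbose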

Section 2: XDP Acceleration for Pre-emptive Packet Filtering

While the TC (Traffic Control) eBPF hook is powerful, it operates after a packet has been received by the kernel's network stack (SKB allocation). For certain tasks, like high-volume DDoS mitigation or basic L3/L4 filtering, we can act even earlier using the Express Data Path (XDP).

XDP eBPF programs attach directly to the network driver's receive queue, processing packets before the kernel allocates significant resources to them. This is the earliest possible point for programmable packet processing in the Linux kernel.

Cilium uses XDP primarily for two purposes:

* High-performance L3/L4 load balancing (as a replacement for IPVS).
* Dropping traffic that violates CiliumNetworkPolicy at the driver level.

Implementation: Enabling XDP Mode

Enabling XDP mode requires a compatible network driver. You can check for compatibility with cilium status.

Helm values.yaml Configuration:

yaml
# values.yaml
devices: ["eth0"] # NIC(s) to attach to; omit to let Cilium auto-detect
loadBalancer:
  acceleration: native # "native" attaches at the driver's XDP hook, "best-effort" falls back where the driver lacks support

Once enabled, Cilium will load its load-balancing and filtering logic into the XDP hook.
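After the agents restart, it is worth confirming that the program really attached in native driver mode rather than silently falling back. A sketch using standard tooling (eth0 and <cilium-pod> are placeholders):

bash
# bpftool lists XDP programs attached per interface
sudo bpftool net show

# "ip link" also flags the attachment ("xdp" = native driver mode, "xdpgeneric" = generic fallback)
ip -details link show dev eth0 | grep -i xdp

# The agent reports its own view of XDP acceleration
kubectl -n kube-system exec -it <cilium-pod> -- cilium status --verbose | grep -i xdp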

Advanced Use Case: DDoS Mitigation with `CiliumNetworkPolicy`

Consider a scenario where you have a public-facing service that is being targeted by a UDP amplification attack from a known IP block. We can create a policy that drops this traffic at the XDP layer, preventing it from ever consuming kernel resources or reaching the application pod.

yaml
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "deny-udp-attack-at-xdp"
spec:
  endpointSelector:
    matchLabels:
      app: public-facing-service
  ingress:
  - fromCIDR:
    - "0.0.0.0/0"
    toPorts:
    - ports:
      - port: "8000"
        protocol: "UDP"
  ingressDeny:
  - fromCIDR:
    # Malicious IP block: denied even though the allow rule above matches it
    - "203.0.113.0/24"

While this looks like a standard network policy, when Cilium is in XDP mode, the rule to drop traffic from 203.0.113.0/24 is compiled into the eBPF program attached at the XDP hook. The XDP_DROP action is incredibly efficient.
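To confirm the policy is actually shedding the attack traffic, Hubble's drop verdicts and the agent's drop counters are more useful than tcpdump here. A hedged sketch (whether --from-ip accepts a CIDR depends on your Hubble version; fall back to a single IP if it does not):

bash
# Watch flows from the offending block being dropped in real time
hubble observe --verdict DROPPED --from-ip 203.0.113.0/24 -f

# Per-reason drop counters from the agent
kubectl -n kube-system exec -it <cilium-pod> -- cilium metrics list | grep -i drop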

Performance Considerations & Benchmarks

To quantify the impact, we can simulate a high-volume packet flood using a tool like pktgen.

* Scenario: Flood a node with 10 million packets per second (Mpps) of UDP traffic from a blocked CIDR.

* Without XDP (TC-based drop): The target node's CPU usage on the core handling network interrupts will spike significantly as the kernel processes each packet up to the TC hook before dropping it.

* With XDP: The CPU usage will be drastically lower. The XDP program drops the packets so early that the overhead is minimal. The node remains stable and can continue processing legitimate traffic.

| Mode          | Packets Dropped/sec | CPU Utilization (ksoftirqd) |
|---------------|---------------------|-----------------------------|
| TC eBPF Drop  | 10 Mpps             | 85-100%                     |
| XDP eBPF Drop | 10 Mpps             | 10-15%                      |
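A rough way to generate the flood described above is the kernel's pktgen module. The snippet below is a sketch, not a tuned traffic generator: the interface name, target IP/MAC, and spoofed source range are placeholders, and the samples/pktgen scripts shipped with the kernel source are a friendlier starting point.

bash
# Load pktgen and bind the test NIC to one of its kernel threads
sudo modprobe pktgen
echo "rem_device_all" | sudo tee /proc/net/pktgen/kpktgend_0
echo "add_device eth0" | sudo tee /proc/net/pktgen/kpktgend_0

# Configure a UDP flood: small packets, sources spoofed from the blocked CIDR
echo "count 0"                   | sudo tee /proc/net/pktgen/eth0   # 0 = run until stopped
echo "pkt_size 64"               | sudo tee /proc/net/pktgen/eth0
echo "dst 10.0.0.50"             | sudo tee /proc/net/pktgen/eth0   # target node IP
echo "dst_mac 00:11:22:33:44:55" | sudo tee /proc/net/pktgen/eth0   # target node MAC or gateway MAC
echo "src_min 203.0.113.1"       | sudo tee /proc/net/pktgen/eth0
echo "src_max 203.0.113.254"     | sudo tee /proc/net/pktgen/eth0
echo "udp_dst_min 8000"          | sudo tee /proc/net/pktgen/eth0
echo "udp_dst_max 8000"          | sudo tee /proc/net/pktgen/eth0

# Start the flood ("echo stop" to the same file ends it), then watch ksoftirqd on the target
echo "start" | sudo tee /proc/net/pktgen/pgctrl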

Caveat: XDP is not a silver bullet. It is best for L3/L4 filtering. Complex L7 policies still require the full TCP/IP stack and are handled at the TC hook or by a proxy. The key is to use XDP to shed illegitimate or unwanted load as early as possible.

Section 3: Accelerating Same-Node Pod-to-Pod Communication with Sockops Bypass

For services that communicate intensely with other pods on the same node, Cilium offers a powerful optimization that can virtually eliminate the networking stack from the data path: socket-level acceleration (sockops).

When two pods on the same node establish a TCP connection, Cilium's eBPF sockops program, attached to a cgroup, intercepts the connection at the socket level. It recognizes that both ends of the connection are local and managed by Cilium. Instead of sending packets down the node's full TCP/IP stack, it short-circuits the path, copying data directly from one pod's socket send buffer to the other's receive buffer.

This bypasses:

* TCP/IP stack processing

* Packet encapsulation/decapsulation

* TC-layer eBPF processing

* The network device layer
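Under the hood, this acceleration is implemented with eBPF programs of type sock_ops and sk_msg attached to the pods' cgroup. A quick, hedged way to check that they are actually loaded on a node (the cgroup mount path below is an assumption and differs between installs):

bash
# Loaded sockops/sk_msg programs on the node
sudo bpftool prog show | grep -E 'sock_ops|sk_msg'

# Programs attached to the cgroup hierarchy Cilium manages
sudo bpftool cgroup tree /run/cilium/cgroupv2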

Implementation: Enabling Sockops

This feature is not enabled by default and must be switched on explicitly.

Helm values.yaml Configuration:

yaml
# values.yaml
sockops:
  enabled: true

Verification is crucial. You can inspect the Cilium agent logs or use the cilium CLI to confirm that connections are being accelerated.

bash
# Tail the logs of a Cilium agent pod
kubectl -n kube-system logs -f <cilium-pod> | grep "Accelerating TCP socket"

# Check the eBPF map for established accelerated connections
kubectl exec -it -n kube-system <cilium-pod> -- cilium bpf sockops list

Benchmarking Same-Node Latency

Let's demonstrate the impact with a simple gRPC client/server application. We'll deploy both pods to the same node using a podAffinity rule and measure request latency with ghz.

gRPC Server/Client Pod Spec Snippet:

yaml
# ... pod spec
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - my-grpc-app
      topologyKey: "kubernetes.io/hostname"

Benchmark Command:

bash
# Run from a pod in the cluster
ghz --insecure --proto=api.proto --call=api.Service.MyCall -n 1000000 -c 100 <grpc-server-cluster-ip>:50051

Expected Results:

| Sockops Status | Average Latency | P99 Latency |
|----------------|-----------------|-------------|
| Disabled       | ~45µs           | ~120µs      |
| Enabled        | ~15µs           | ~35µs       |

This ~3x reduction in average and P99 latency is purely from bypassing the kernel's network stack for same-node communication. For applications with high-volume, chatty, same-node traffic patterns (e.g., a sidecar logging agent, or tightly coupled services in a data plane), this optimization is critical.

Section 4: Production Pitfalls and Advanced Observability

Tuning for ultra-low latency introduces its own set of complex challenges and failure modes. Debugging issues within the eBPF datapath requires specialized tools.

Edge Case 1: MTU Mismatches in `tunnel: disabled` Mode

When running in a non-encapsulated mode (tunnel: disabled) for maximum performance, pods communicate directly over the underlying network fabric. A common and difficult-to-diagnose issue is an MTU mismatch. If the node's network interface has a standard MTU of 1500, but an intermediate network device (such as a cloud provider's virtual switch) enforces a lower MTU (e.g., 1460 for its own encapsulation), you will experience packet fragmentation or drops, leading to high latency and connection timeouts.

Diagnosis:

* Cilium Health: cilium status might not report an issue directly.

* Hubble: Use Hubble, Cilium's observability tool, to trace flows. Look for dropped packets and TCP retransmissions between specific pods on different nodes.

bash
# Watch for dropped flows between the two pods
hubble observe --from-pod <namespace>/<client-pod> --to-pod <namespace>/<server-pod> --verdict DROPPED -f

* tracepath: Use tracepath from inside a pod to determine the path MTU to another pod's IP.

bash
kubectl exec -it <client-pod> -- tracepath <server-pod-ip>
# Look for the "pmtu" value in the output.
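Beyond tracepath, a quick sanity check is to compare the MTU Cilium gave the pod's interface with what the node's uplink reports. A sketch (assumes the ip utility exists in the pod image; node-1 and eth0 are placeholders):

bash
# MTU as seen inside the pod (set by Cilium on the endpoint device)
kubectl exec -it <client-pod> -- ip link show eth0 | grep -o 'mtu [0-9]*'

# MTU of the node's uplink for comparison
ssh node-1 'ip link show eth0 | grep -o "mtu [0-9]*"'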

Solution: Ensure that the MTU configured by Cilium matches the effective MTU of your underlying network. This can be set in the Cilium ConfigMap (mtu) or the Helm chart (MTU).

yaml
# values.yaml
tunnel: disabled
MTU: 1450 # Set a safe value based on your network's constraints

Edge Case 2: Kernel Version Dependencies

eBPF is a rapidly evolving kernel technology. Advanced Cilium features have minimum kernel version requirements. For example, some sockops accelerations or more efficient eBPF map types may only be available in Linux 5.10+. Running a heterogeneous cluster with nodes on different kernel versions can lead to inconsistent performance and behavior.

Diagnosis & Solution:

* Use cilium status on each node to check for warnings or disabled features due to an old kernel.

* Standardize the kernel version across your cluster for predictable performance.

* When a feature is unavailable, Cilium gracefully degrades, but this means your performance assumptions may be violated on older nodes. Use node taints and tolerations to schedule latency-critical workloads only on nodes with the required kernel version.
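One way to audit kernels across the cluster and steer latency-critical pods accordingly (the label key below is just an example, not a Cilium convention):

bash
# Kernel version reported by each node's kubelet
kubectl get nodes -o custom-columns=NAME:.metadata.name,KERNEL:.status.nodeInfo.kernelVersion

# Label nodes that meet the requirement, then target them via nodeSelector/affinity
kubectl label node node-1 datapath-tier=low-latency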

Advanced Observability with Hubble

When debugging low-latency issues, tcpdump can be misleading because much of the action happens before the packet reaches a point where tcpdump can capture it (in the case of XDP) or bypasses the stack entirely (in the case of sockops).

Hubble provides visibility directly from the eBPF programs.

Advanced Hubble Query Example:

Let's trace a DNS request from a specific pod, observing it at every step in the Cilium datapath, including policy decisions.

bash
# See the full journey of a DNS request
hubble observe --from-pod my-app-pod --to-pod kube-system/kube-dns -f --protocol dns --print-flow-id

# Example Output:
# FLOW_ID: 12345 SRC: my-app-pod:34567 -> DST: kube-dns:53 (L3/L4) VERDICT: FORWARDED
# FLOW_ID: 12345 SRC: my-app-pod:34567 -> DST: kube-dns:53 (L7 DNS) REQ: A? example.com
# FLOW_ID: 12345 SRC: kube-dns:53 -> DST: my-app-pod:34567 (L7 DNS) RESP: 93.184.216.34
# FLOW_ID: 12345 SRC: kube-dns:53 -> DST: my-app-pod:34567 (L3/L4) VERDICT: FORWARDED

This level of introspection is invaluable for confirming that policies are being correctly applied and that traffic is not being unexpectedly dropped or redirected by the eBPF datapath.

Conclusion: A Surgical Approach to Network Performance

Achieving ultra-low-latency networking in Kubernetes is not about choosing a single "fast" CNI. It requires a deep, architectural understanding of the Linux kernel's networking stack and a surgical approach to optimization. By moving network routing, policy enforcement, and load balancing from user-space proxies and iptables into the kernel with eBPF, Cilium provides the necessary tools. However, unlocking its full potential demands more than a default installation.

By systematically replacing kube-proxy, leveraging XDP for early-stage filtering, enabling sockops for same-node traffic, and mastering eBPF-native observability with Hubble, senior engineers can build a network fabric that meets the stringent demands of latency-critical applications. This is the future of high-performance cloud-native networking: not bolted on, but deeply integrated into the operating system itself.
