eBPF Service Mesh: Kernel-Level Telemetry for Ultra-Low Latency

16 min read
Goh Ling Yong
Technology enthusiast and software architect specializing in AI-driven development tools and modern software engineering practices. Passionate about the intersection of artificial intelligence and human creativity in building tomorrow's digital solutions.

The Latency Tax of User-Space Proxies

For years, the de facto standard for implementing a service mesh in Kubernetes has been the sidecar proxy model, popularized by Istio (Envoy) and Linkerd. This pattern injects a user-space proxy alongside each application container within a pod. While powerful for its feature set—traffic management, observability, and security—it imposes a non-trivial 'latency tax.'

Every packet, both inbound and outbound, traverses the pod's network namespace, is intercepted by iptables or nftables rules, and is redirected through the user-space proxy. This journey involves multiple context switches between kernel space and user space, extra memory copies, and the overhead of the proxy terminating and re-establishing TCP connections. For latency-sensitive services like real-time bidding platforms, financial transaction processors, or high-frequency data APIs, this per-request overhead of several milliseconds can be prohibitive.

Consider the typical packet flow in an Istio-enabled pod:

  • Outbound: Application sends a packet -> Kernel TCP/IP stack -> iptables OUTPUT chain (nat table) -> Redirect to Envoy proxy listener -> Envoy processes L4-L7 rules -> Envoy opens a new connection to the destination -> Kernel TCP/IP stack -> Physical network.
  • Inbound: Physical network -> Kernel TCP/IP stack -> iptables PREROUTING chain -> Redirect to Envoy proxy listener -> Envoy processes L4-L7 rules -> Envoy opens a new connection to the application listener on localhost -> Kernel TCP/IP stack -> Application receives the packet.
This round trip introduces significant overhead. eBPF (extended Berkeley Packet Filter) offers a fundamentally different approach by moving the data plane logic directly into the Linux kernel, creating a sidecar-less service mesh that operates with near-native kernel performance.

    This article dissects the advanced implementation patterns of an eBPF-based service mesh, using Cilium as our reference implementation. We will not cover the basics of eBPF but will instead focus on the specific kernel-level mechanisms that enable its performance, the advanced configurations required for production, and the critical edge cases senior engineers must navigate.


    Kernel-Level Data Plane: From `iptables` to eBPF Hooks

    The core innovation of an eBPF-based service mesh is its ability to short-circuit the convoluted packet path of sidecar proxies. It achieves this by attaching lightweight, sandboxed eBPF programs to strategic hooks within the kernel's networking stack.

    The Traffic Control (TC) Hook: The Primary Interception Point

    Instead of relying on iptables, Cilium attaches eBPF programs to the Traffic Control (TC) ingress and egress hooks on each pod's virtual ethernet (veth) device. This hook (cls_bpf) is executed very early in the networking stack, before iptables and much of the IP layer processing.

    When a packet leaves a pod:

    • The packet hits the TC egress hook on the pod's veth interface.
    • The attached eBPF program executes.
    • This program has direct access to the packet data (sk_buff). It can parse L3/L4 headers to make policy decisions; full L7 awareness for protocols like HTTP/gRPC is handled separately, as we will see later.
    • Crucially, the eBPF program uses BPF maps (kernel-space key-value stores) to look up the security identity of the source and destination, connection tracking state, and service-to-backend mappings.
    • Based on this lookup, the program can:

    * Allow: Return TC_ACT_OK, letting the packet proceed through the normal stack.

    * Deny: Return TC_ACT_SHOT, dropping the packet immediately.

    * Redirect: Use the bpf_redirect_peer() helper to forward the packet directly to the destination pod's veth pair on the same node, completely bypassing the host's upper TCP/IP stack. This is a massive performance gain for intra-node communication.

    This model eliminates the user-space/kernel-space context switches for policy enforcement and basic load balancing.
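
    To make this verdict model concrete, here is a minimal, self-contained sketch of a TC egress program in this style. It is not Cilium's datapath code: the map name (endpoint_map), its layout, and the policy check are simplified placeholders, and it assumes a recent kernel and libbpf that provide bpf_redirect_peer().

    c
    // Illustrative only: a simplified stand-in for the datapath described above.
    #include <linux/bpf.h>
    #include <linux/pkt_cls.h>
    #include <linux/if_ether.h>
    #include <linux/ip.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_endian.h>

    struct endpoint_info {
        __u32 ifindex;   /* veth peer ifindex if the destination pod is local, else 0 */
        __u8  allowed;   /* result of the (elided) identity-based policy check */
    };

    struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 65536);
        __type(key, __u32);                  /* destination IPv4, network byte order */
        __type(value, struct endpoint_info);
    } endpoint_map SEC(".maps");

    SEC("tc")
    int handle_egress(struct __sk_buff *skb)
    {
        void *data     = (void *)(long)skb->data;
        void *data_end = (void *)(long)skb->data_end;

        struct ethhdr *eth = data;
        if ((void *)(eth + 1) > data_end || eth->h_proto != bpf_htons(ETH_P_IP))
            return TC_ACT_OK;                /* non-IPv4: leave it to the normal stack */

        struct iphdr *iph = (void *)(eth + 1);
        if ((void *)(iph + 1) > data_end)
            return TC_ACT_OK;

        __u32 dst_ip = iph->daddr;
        struct endpoint_info *ep = bpf_map_lookup_elem(&endpoint_map, &dst_ip);
        if (!ep)
            return TC_ACT_OK;                /* unknown destination: normal forwarding */

        if (!ep->allowed)
            return TC_ACT_SHOT;              /* policy denied: drop in the kernel */

        if (ep->ifindex)
            /* Same-node destination: hand the packet straight to the peer end of
             * the destination pod's veth pair, skipping the host's upper stack. */
            return bpf_redirect_peer(ep->ifindex, 0);

        return TC_ACT_OK;                    /* allowed, remote destination: continue normally */
    }

    char LICENSE[] SEC("license") = "GPL";

    A production datapath layers much more on top (connection tracking, identity resolution from an ipcache, per-port rules), but the verdict model is the same.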

    Socket Operations (`sock_ops`): Accelerating Same-Node Communication

    For even greater performance, Cilium leverages the sock_ops eBPF hook, which attaches to a control group (cgroup) and triggers on socket events like connect(), accept(), and state changes.

    When an application in Pod A attempts to connect() to a service IP that resolves to Pod B on the same node:

  • The connect() syscall triggers the sock_ops eBPF program.
    • The program inspects the destination IP and port.
    • Using a BPF map that contains the service-to-backend mappings, it recognizes the destination is a local pod.
  • The TCP connection is still established as usual, but the eBPF program registers both sockets in a special BPF_MAP_TYPE_SOCKMAP via the bpf_sock_map_update() helper, linking the client (Pod A) and server (Pod B) sockets together.
  • Subsequent send() calls on these sockets are handled by a companion sk_msg eBPF program that splices the payload directly into the peer socket's receive queue, bypassing the entire TCP/IP stack, the TC hooks, and even the veth devices. This is as close to IPC (Inter-Process Communication) as you can get over a networked abstraction.

    This socket-level acceleration provides the lowest possible latency for intra-node traffic, a common scenario in densely packed Kubernetes clusters.
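
    The following is a minimal sketch of that sockops/sk_msg pairing, modeled on the commonly published sockmap acceleration example rather than on Cilium's own datapath. It uses the hash-keyed variant (BPF_MAP_TYPE_SOCKHASH with bpf_sock_hash_update(), which takes a 4-tuple key) instead of a plain SOCKMAP; the map name, key layout, and byte-order handling are illustrative and may need adjustment for your kernel and libbpf versions.

    c
    // Illustrative only: a socket-level fast path for established local connections.
    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_endian.h>

    struct sock_key {
        __u32 sip4;    /* source IPv4 */
        __u32 dip4;    /* destination IPv4 */
        __u32 sport;   /* source port */
        __u32 dport;   /* destination port */
    };

    struct {
        __uint(type, BPF_MAP_TYPE_SOCKHASH);
        __uint(max_entries, 65536);
        __type(key, struct sock_key);
        __type(value, __u32);
    } sock_map SEC(".maps");

    SEC("sockops")
    int register_socket(struct bpf_sock_ops *ops)
    {
        /* Act only when a TCP connection becomes established:
         * ACTIVE = our connect() completed, PASSIVE = our accept() completed. */
        if (ops->op != BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB &&
            ops->op != BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB)
            return 0;

        struct sock_key key = {
            .sip4  = ops->local_ip4,
            .dip4  = ops->remote_ip4,
            /* local_port is host byte order, remote_port is network byte order */
            .sport = bpf_htonl(ops->local_port) >> 16,
            .dport = ops->remote_port >> 16,
        };

        /* Register this socket so the sk_msg program below can redirect to it. */
        bpf_sock_hash_update(ops, &sock_map, &key, BPF_ANY);
        return 0;
    }

    SEC("sk_msg")
    int redirect_msg(struct sk_msg_md *msg)
    {
        /* Build the peer's key (addresses and ports swapped) and splice the
         * payload straight into its receive queue, skipping the TCP/IP stack. */
        struct sock_key peer = {
            .sip4  = msg->remote_ip4,
            .dip4  = msg->local_ip4,
            .sport = msg->remote_port >> 16,
            .dport = bpf_htonl(msg->local_port) >> 16,
        };

        bpf_msg_redirect_hash(msg, &sock_map, &peer, BPF_F_INGRESS);
        return SK_PASS;   /* if no peer socket is found, fall back to the normal path */
    }

    char LICENSE[] SEC("license") = "GPL";

    In this sketch the sockops program would be attached to the pods' cgroup and the sk_msg program to the map itself; together they implement the in-kernel shortcut described above.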


    Production Implementation: Cilium Sidecar-less Service Mesh

    Let's move from theory to a concrete, production-grade implementation. We assume a running Kubernetes cluster and Helm installed.

    Step 1: Advanced Cilium Configuration

    Enabling Cilium's service mesh capabilities requires a specific Helm configuration. We'll enable Hubble for observability, mutual TLS (mTLS) for security, and L7 policy enforcement.

    Here is a values.yaml for a production-grade deployment (Helm value names evolve between Cilium chart versions, so validate these keys against the chart version you deploy):

    yaml
    # values.yaml for Cilium Helm Chart
    
    kubeProxyReplacement: strict
    hostServices:
      enabled: true
    externalIPs:
      enabled: true
    nodePort:
      enabled: true
    hostPort:
      enabled: true
    bpf:
      # Pre-allocation of BPF maps can improve performance by avoiding runtime map creation overhead
      preallocateMaps: true
    # Enable Hubble for deep observability into eBPF-driven flows
    hubble:
      enabled: true
      # UI for visualization
      ui:
        enabled: true
      # Relay for cluster-wide visibility
      relay:
        enabled: true
      metrics:
        enabled:
          - dns:query;ignoreAAAA
          - drop
          - tcp
          - flow
          - port-distribution
          - icmp
          - http
    
    # Enable Sidecar-less Service Mesh capabilities
    # This replaces the need for an Istio/Linkerd sidecar for many use cases
    serviceMesh:
      enabled: true
    
    # Enable L7 policy enforcement
    # Note: This requires more CPU/memory in the agent
    policyEnforcementMode: "default"
    
    # Mutual TLS (mTLS) configuration using Cilium's built-in CA
    # For production, you'd integrate this with SPIFFE/SPIRE or a custom CA
    securityContext:
      privileged: true # Required for the agent to load eBPF programs
    
    # Enable mTLS for the entire cluster by default
    # This can be overridden per-pod with annotations
    autoMTLS:
      enabled: true
      # Allow connections from pods without mTLS (for migration)
      allowForNonIdentity: false
      # The CA certificate will be mounted into pods
      certManager:
        # Use Cilium's built-in CA for simplicity
        # In production, use a managed CA like Vault or cert-manager with a root CA
        type: "cilium"

    Deploy Cilium with this configuration:

    bash
    helm repo add cilium https://helm.cilium.io/
    helm install cilium cilium/cilium --version 1.15.5 --namespace kube-system -f values.yaml

    This configuration replaces kube-proxy entirely with eBPF for service load balancing, enables Hubble for deep network flow visibility, and activates the sidecar-less mTLS and L7 policy features.

    Step 2: Implementing Complex L7 Network Policies

    With the eBPF data plane active, we can now define granular L7 policies. Cilium's CiliumNetworkPolicy CRD allows us to specify rules based on Kubernetes labels, service accounts, and now, HTTP paths and methods.

    Consider a scenario with three microservices:

    * api-gateway: Public-facing, receives user traffic.

    * order-service: Handles order creation and retrieval.

    * inventory-service: Manages product stock.

    We want to enforce the following rules:

  • The api-gateway can call GET /orders/{id} and POST /orders on the order-service.
  • The order-service can call POST /inventory/decrement on the inventory-service.
  • All other traffic, including DELETE requests or direct calls to inventory-service from the gateway, must be blocked.
    • All allowed traffic must be secured with mTLS.

    Here's the CiliumNetworkPolicy to implement this:

    yaml
    apiVersion: "cilium.io/v2"
    kind: CiliumNetworkPolicy
    metadata:
      name: "order-service-policy"
      namespace: "production"
    spec:
      endpointSelector:
        matchLabels:
          app: order-service
      ingress:
        - fromEndpoints:
            - matchLabels:
                app: api-gateway
          toPorts:
            - ports:
                - port: "8080"
                  protocol: TCP
              rules:
                http:
                  - method: "GET"
                    path: "/orders/.*"
                  - method: "POST"
                    path: "/orders"
    
    ---
    
    apiVersion: "cilium.io/v2"
    kind: CiliumNetworkPolicy
    metadata:
      name: "inventory-service-policy"
      namespace: "production"
    spec:
      endpointSelector:
        matchLabels:
          app: inventory-service
      ingress:
        - fromEndpoints:
            - matchLabels:
                app: order-service
          toPorts:
            - ports:
                - port: "8080"
                  protocol: TCP
              rules:
                http:
                  - method: "POST"
                    path: "/inventory/decrement"

    How it works in the kernel:

    When api-gateway sends a POST /orders request to order-service:

  • The packet hits the TC egress hook on the api-gateway pod's veth.
  • The eBPF program identifies the source (api-gateway) and destination (order-service) identities.
  • With mutual authentication enabled, the datapath checks that the two workload identities have been mutually authenticated through a handshake performed by the Cilium agents; the traffic itself is encrypted in the kernel (via WireGuard or IPsec) rather than wrapped in a per-request TLS exchange.
  • The packet arrives on the destination node, is decrypted there, and hits the TC ingress hook on the order-service pod's veth, where the eBPF program verifies the source identity and the L3/L4 policy.
  • Because the matching rule carries HTTP conditions for port 8080, the flow is steered to Cilium's node-local proxy (an Envoy instance shared by every pod on the node, not a per-pod sidecar), which parses the HTTP request.
  • The proxy matches POST and /orders against the rules compiled from the CiliumNetworkPolicy.
    • The match is successful, and the packet is forwarded to the application.

    If api-gateway tried to send a DELETE /orders request, that final match would fail and the request would be rejected with an HTTP 403 (pure L3/L4 violations are instead dropped directly by the eBPF program, with no response at all). The application inside order-service would never even see the request.
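
    To tie the YAML back to kernel state, here is a conceptual sketch of the kind of identity-keyed entries such a policy could compile down to. The struct layouts, field names, the numeric identity, and the proxy port are hypothetical rather than Cilium's actual map format; the point is that at enforcement time the policy is consulted as map data, not as YAML.

    c
    /* Hypothetical, illustrative layout -- not Cilium's real policy map format. */
    #include <stdint.h>

    struct policy_key {
        uint32_t remote_identity;  /* numeric security identity of the peer workload */
        uint16_t dport;            /* destination port */
        uint8_t  protocol;         /* 6 = TCP */
        uint8_t  pad;
    };

    struct policy_entry {
        uint8_t  allowed;          /* L3/L4 verdict */
        uint8_t  has_l7_rules;     /* if set, the flow is steered to the node-local proxy */
        uint16_t proxy_port;       /* listener of the embedded proxy enforcing the HTTP rules */
    };

    /* What the agent might install for the order-service endpoint: traffic from the
     * api-gateway identity (10011 here, made up) on TCP/8080 is allowed at L3/L4 but
     * carries HTTP rules, so matching flows are handed to the proxy, which permits
     * only "GET /orders/.*" and "POST /orders" and rejects anything else with a 403.
     * Identities with no entry at all are dropped directly in the kernel. */
    static const struct policy_key   order_svc_key   = { .remote_identity = 10011,
                                                          .dport = 8080, .protocol = 6 };
    static const struct policy_entry order_svc_entry = { .allowed = 1,
                                                          .has_l7_rules = 1,
                                                          .proxy_port = 14001 };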


    Performance Benchmarking: eBPF vs. Sidecar Proxy

    To quantify the performance difference, we can run a controlled benchmark. We'll set up two identical Kubernetes clusters, one with Cilium in sidecar-less mode and one with Istio in its default sidecar proxy mode.

    Test Setup:

    * Client: A simple load-generating pod running wrk2.

    * Server: An Nginx pod serving a static 1KB file.

    * Tool: wrk2 for generating constant throughput and measuring latency distribution.

    * Scenario: Intra-node communication to highlight the best-case performance for eBPF's socket acceleration.

    Benchmark Command:

    bash
    # Run from the client pod
    wrk2 -t4 -c100 -d30s -R1000 http://nginx-server/1k.bin

    This command uses 4 threads, 100 concurrent connections, runs for 30 seconds, and maintains a constant rate of 1000 requests per second.

    Hypothetical but Realistic Results:

    Metric            | Istio (Envoy Sidecar) | Cilium (eBPF Sidecar-less) | Improvement
    ------------------|-----------------------|----------------------------|------------
    Mean Latency      | 3.2 ms                | 0.4 ms                     | 8x
    p90 Latency       | 5.8 ms                | 0.7 ms                     | 8.3x
    p99 Latency       | 11.5 ms               | 1.1 ms                     | 10.5x
    p99.9 Latency     | 25.1 ms               | 1.9 ms                     | 13.2x
    CPU Usage (Agent) | ~150m per sidecar     | ~250m per agent (node)     | Varies

    Analysis:

    The results are stark. The eBPF data plane shows an order-of-magnitude reduction in latency, especially in the tail (p99, p99.9). This is the direct result of eliminating the user-space proxy, context switches, and TCP stack traversals.

    The CPU usage model also shifts. With sidecars, CPU cost scales with the number of pods. With Cilium, the cost is concentrated in the per-node agent, making it more efficient for high-density nodes.


    Advanced Edge Cases and Production Considerations

    While the performance benefits are clear, operating an eBPF-based service mesh in production requires a deep understanding of its unique challenges.

    1. Kernel Version Dependencies and CO-RE

    eBPF is a rapidly evolving kernel feature. The availability of specific hooks (sock_ops), helpers (bpf_redirect_peer), and map types is tied to the kernel version. A feature that works on a 5.10 kernel might not be available on a 4.19 kernel.

    Problem: Historically, this meant compiling eBPF programs for each target kernel version, an operational nightmare.

    Solution: BPF CO-RE (Compile Once – Run Everywhere):

    Cilium heavily relies on CO-RE. This approach uses BTF (BPF Type Format), which is debugging information embedded in the kernel that describes its internal data structures. The eBPF loader in the Cilium agent can read this BTF data at runtime and perform on-the-fly relocations in the compiled eBPF bytecode to match the memory layout of the running kernel.
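
    As a small illustration of what CO-RE buys you, the sketch below reads a field of the kernel's task_struct through a BTF-driven relocation instead of a hard-coded offset. It assumes a vmlinux.h generated from the running kernel (bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h) and a CO-RE-aware loader such as libbpf; the probe point and the field being read are arbitrary examples, not something Cilium itself does.

    c
    // Illustrative CO-RE usage: the same object file loads across kernel layouts.
    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_core_read.h>

    SEC("kprobe/tcp_v4_connect")
    int trace_connect(void *ctx)
    {
        struct task_struct *task = (struct task_struct *)bpf_get_current_task();

        /* BPF_CORE_READ records a relocation instead of a fixed offset; at load
         * time the offsets of real_parent and tgid are resolved against the
         * running kernel's BTF, so the program survives task_struct layout changes. */
        int parent_tgid = BPF_CORE_READ(task, real_parent, tgid);

        bpf_printk("tcp_v4_connect() from a task whose parent tgid is %d", parent_tgid);
        return 0;
    }

    char LICENSE[] SEC("license") = "GPL";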

    Production Implication: You MUST run a modern Linux distribution with a kernel that has BTF support enabled (typically 5.4+). Running on older, enterprise kernels without BTF will severely limit Cilium's functionality and may force it to fall back to less performant modes.

    2. Debugging the In-Kernel Data Plane

    When a request is dropped, where do you look? There is no per-pod proxy whose logs you can tail with kubectl logs. Debugging becomes a systems-level task.

    * Hubble: This is your first port of call. Hubble's UI and CLI provide a high-level view of network flows, policy verdicts (allowed/denied), and even L7 request data. It builds its view by reading from a special BPF perf event buffer.

    bash
        # See real-time flow verdicts for a pod
        hubble observe --pod production/api-gateway-xyz -f

    * cilium monitor: For lower-level event tracing, this command provides a firehose of Cilium agent events, including packet drops and their reasons.

    * bpftool: This is the ultimate power tool. You can use it to inspect the state of the eBPF programs and maps on a node.

    bash
        # List all BPF programs attached to a pod's veth
        bpftool net list dev vethXXXX
        
        # Dump the contents of the connection tracking map
        # First, find the map ID
        bpftool map list | grep cilium_ct_any4_global
        # Then dump its contents
        bpftool map dump id <MAP_ID>

    This level of debugging requires a strong understanding of both eBPF and kernel networking.

    3. L7 Policy on Encrypted (TLS/HTTPS) Traffic

    Cilium's L7 policy enforcement works well for plaintext protocols like HTTP, gRPC, and Kafka. However, if the application itself encrypts its traffic (end-to-end TLS), neither the eBPF datapath nor the node-local proxy sees anything but encrypted bytes; the HTTP path and headers cannot be inspected.

    The Trade-off:

    To enforce L7 policies on this traffic, you lose some of the sidecar-less purity. You have two main options:

  • Terminate TLS at the Application: The application handles TLS, and Cilium's mTLS secures the pod-to-pod hop. This is the most secure but prevents L7 policy enforcement.
  • Introduce a User-Space Proxy for TLS Termination: Cilium can be configured to route specific HTTPS traffic through an embedded Envoy proxy for TLS termination and L7 inspection. This reintroduces a user-space component but confines it to only the traffic that absolutely requires it, rather than all traffic by default.
    This is a critical architectural decision. For services that need both end-to-end encryption and L7 routing, you may need to combine eBPF for the L3/L4 data plane with a targeted proxy for L7.

    yaml
    # Example CiliumNetworkPolicy showing an L7 rule
    # that would require a proxy if traffic is TLS-encrypted
    # by the application itself.
    ...
          toPorts:
            - ports:
                - port: "443"
                  protocol: TCP
              # This rule implies that Cilium must be able to parse the TLS
              # either via kernel-level kTLS acceleration (very new)
              # or by directing the flow to an Envoy listener.
              rules:
                http:
                  - headerMatches:
                      - name: ":authority"
                        value: "api.internal.com"
    ...

    Conclusion: A New Frontier in Service Mesh Performance

    eBPF-based, sidecar-less service meshes represent a paradigm shift in how we build and operate cloud-native data planes. By moving policy enforcement, load balancing, and observability into the Linux kernel, they offer an order-of-magnitude reduction in latency and a more efficient resource model compared to traditional user-space proxies.

    This performance, however, comes with a new set of complexities. Senior engineers must be prepared to engage with the system at a lower level, understanding kernel dependencies, eBPF debugging tools, and the nuanced trade-offs around L7 policy enforcement for encrypted traffic.

    For applications where every microsecond of latency matters, the payoff is undeniable. The eBPF service mesh is not a replacement for all use cases, but it is a powerful, production-ready tool that pushes the boundary of what's possible in high-performance distributed systems.
