eBPF Service Mesh: High-Performance Networking with Cilium Sidecarless

Goh Ling Yong

The Latency Tax: Re-evaluating the Sidecar Pattern

For years, the sidecar proxy—popularized by service meshes like Istio and Linkerd—has been the de facto standard for introducing observability, security, and reliability into microservices architectures. By injecting a proxy into each application's pod, we gained powerful capabilities without modifying application code. However, for engineers operating at scale, the inherent performance and resource costs of this pattern are no longer negligible. This is the "sidecar tax."

This tax manifests in several ways:

  • Increased Latency: Every network call, both ingress and egress from the pod, must traverse the user-space proxy. This involves multiple context switches between the kernel and user space and two additional TCP stack traversals (Kernel -> Proxy -> Kernel -> App). At the 99th percentile, this added latency becomes a significant performance bottleneck, especially in service chains with deep call graphs.
  • Resource Overhead: Each sidecar is a running process, consuming non-trivial amounts of CPU and memory. In a cluster with thousands of pods, this translates to a substantial resource footprint dedicated solely to mesh infrastructure, driving up operational costs.
  • Complex Traffic Path: The iptables rules required to hijack traffic and redirect it to the sidecar are complex, brittle, and can become a performance bottleneck themselves in clusters with high connection churn.
  • Operational Complexity: Managing the lifecycle of sidecar injection, handling updates, and debugging traffic flow issues adds a layer of operational burden that teams must carry.
While effective, the sidecar pattern feels like a clever workaround for limitations in the underlying OS. The fundamental question senior engineers are now asking is: can we achieve the goals of a service mesh—mTLS, L7 traffic policies, observability—without paying the sidecar tax? The answer lies in moving this functionality from user-space proxies into the Linux kernel itself, using eBPF.
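To make the traffic hijack concrete, the rules below sketch the shape of the NAT redirection an Istio-style init container installs in each pod's network namespace. This is an illustrative simplification: the chain names, listener ports (15001 outbound, 15006 inbound), and proxy UID follow Istio's conventions, but the exact rule set varies by version.

```bash
# Illustrative sketch of sidecar traffic interception (Istio-style conventions).

# Redirect all inbound TCP to the sidecar's inbound listener (port 15006)
iptables -t nat -N ISTIO_INBOUND
iptables -t nat -A PREROUTING -p tcp -j ISTIO_INBOUND
iptables -t nat -A ISTIO_INBOUND -p tcp -j REDIRECT --to-ports 15006

# Redirect all outbound TCP to the sidecar's outbound listener (port 15001),
# skipping traffic generated by the proxy itself (identified by its UID)
iptables -t nat -N ISTIO_OUTPUT
iptables -t nat -A OUTPUT -p tcp -j ISTIO_OUTPUT
iptables -t nat -A ISTIO_OUTPUT -m owner --uid-owner 1337 -j RETURN
iptables -t nat -A ISTIO_OUTPUT -p tcp -j REDIRECT --to-ports 15001
```

Every connection in and out of the pod passes through these rules and then through the user-space proxy, which is exactly the per-hop cost the list above describes.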

    Cilium and eBPF: A Kernel-Native Service Mesh

    eBPF allows us to run sandboxed programs within the Linux kernel, triggered by various hooks (e.g., system calls, network events). Cilium leverages this capability to build a CNI (Container Network Interface) and service mesh that operates primarily at the kernel level.

    Instead of a proxy per pod, Cilium runs a single agent per node. This agent installs eBPF programs at key points in the node's networking stack, such as the network interface (XDP) and socket layers. These eBPF programs can understand Kubernetes identities (CiliumIdentity), enforce network policies, perform load balancing, and provide deep observability—all without redirecting packets to a user-space proxy for most L3/L4 operations.
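You can inspect this kernel-level state directly. The commands below (a sketch, assuming a default Helm install where the agent runs as a DaemonSet named `cilium` in `kube-system`) show the eBPF-backed tables the agent maintains:

```bash
# eBPF service load-balancing table (what replaces kube-proxy's iptables rules)
kubectl -n kube-system exec ds/cilium -- cilium bpf lb list

# Identity-to-label mappings used for policy decisions
kubectl -n kube-system exec ds/cilium -- cilium identity list

# Per-endpoint policy enforcement status
kubectl -n kube-system exec ds/cilium -- cilium endpoint list
```

There is no per-pod proxy to exec into; the node-local agent is the single point of inspection.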

    This fundamentally changes the data path:

    * Sidecar Model: App -> Pod Kernel -> Sidecar Proxy (User Space) -> Pod Kernel -> Wire

    * Cilium eBPF Model: App -> Pod Kernel (with eBPF) -> Wire

    The result is a dramatically shorter, more efficient data path. For L7 policies (e.g., HTTP-aware routing), Cilium still uses an Envoy proxy, but it's a shared, highly optimized instance on the node, not a dedicated one per pod, offering a hybrid model that provides the best of both worlds.

    Production Implementation: Deploying a Sidecarless Cilium Mesh

    Let's move from theory to a production-grade deployment. We will install Cilium, replacing kube-proxy entirely and enabling its sidecarless service mesh capabilities.

    Prerequisites: A Kubernetes cluster with a Linux kernel version >= 5.10 is recommended for the best feature set and performance. You can check with uname -r on your nodes.
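Rather than SSH-ing into each node, you can read the kernel version from the Kubernetes API, since it is reported in each node's status:

```bash
# Kernel version per node, straight from the node status
kubectl get nodes -o custom-columns=NAME:.metadata.name,KERNEL:.status.nodeInfo.kernelVersion
```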

    We'll use Helm to deploy Cilium with a configuration optimized for performance and security.

    Code Example 1: `values.yaml` for a Production Cilium Installation

    This configuration is not a default setup. It's tailored for high-performance, identity-aware networking.

    yaml
    # values.yaml for production-grade Cilium deployment
    kubeProxyReplacement: strict # Fully replace kube-proxy with eBPF
    k8sServiceHost: "REPLACE_WITH_API_SERVER_IP" # Use direct IP to avoid startup race conditions
    k8sServicePort: "REPLACE_WITH_API_SERVER_PORT"
    
    # Enable eBPF Host Routing for maximum performance
    bpf:
      masquerade: true
    
    # Performance and Scalability Tuning
    endpointRoutes:
      enabled: true # Use per-endpoint routes instead of a single large routing table
    
    # Security and Identity
    identityAllocationMode: crd # Use CRDs for identity management, scalable beyond 4k identities
# Remote-node identity is enabled by default in recent Cilium releases.
    
    # Service Mesh & L7 Features
    # Note: This does not enable a sidecar by default. It enables the capability.
    # We will use CiliumEnvoyConfig for L7 policies.
    envoy:
      enabled: true
      # Envoy is deployed as a DaemonSet, not a sidecar.
    
    # Hubble Observability
    hubble:
      enabled: true
      relay:
        enabled: true
      ui:
        enabled: true
    
# Mutual Authentication (mTLS)
# Cilium's mutual authentication builds on CiliumIdentity and SPIFFE
# certificates, avoiding a dedicated per-pod proxy for the handshake.
# (In recent chart versions the Helm key is `authentication.mutual`.)
authentication:
  mutual:
    spiffe:
      enabled: true
# For production, integrate with a proper CA like cert-manager or Vault.
    
# Enable Bandwidth Manager for QoS
bandwidthManager:
  enabled: true
  bbr: true # BBR congestion control for better throughput (requires kernel >= 5.18)
    
    # Operator configuration
    operator:
      replicas: 2 # HA setup for the operator

    To apply this configuration:

    bash
    helm repo add cilium https://helm.cilium.io/
    helm install cilium cilium/cilium --version 1.15.5 \
      --namespace kube-system \
      -f values.yaml

    After installation, verify the takeover. Note that installing Cilium does not delete an existing kube-proxy DaemonSet; on a cluster that was provisioned with kube-proxy, remove it first (kubectl -n kube-system delete ds kube-proxy). Once removed, kubectl get pods -n kube-system | grep kube-proxy should return nothing, and the Cilium pods should be healthy.
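A quick way to confirm the eBPF datapath is active (a sketch, assuming the cilium CLI is installed locally):

```bash
# Cluster-wide health summary, waiting until all components are ready
cilium status --wait

# Or query an agent directly; the output should report kube-proxy
# replacement as active (shown as "Strict" or "True" depending on version)
kubectl -n kube-system exec ds/cilium -- cilium status | grep KubeProxyReplacement
```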

    Advanced L7 Policy Enforcement without Sidecars

    With Cilium, we can enforce powerful HTTP-aware network policies. Let's consider a realistic scenario: an e-commerce application with a frontend service, a products service, and an inventory service.

    Policy Requirements:

  • The frontend can make GET requests to /products on the products service.
  • The products service can make GET requests to /inventory/{id} on the inventory service.
  • All other traffic, including other HTTP methods or paths, should be denied.

    Instead of annotating pods to inject a sidecar, we define a CiliumNetworkPolicy.

    Code Example 2: Advanced `CiliumNetworkPolicy` for L7 Rules

    First, let's label our deployments for identity-based selection:

    yaml
    # products-deployment.yaml
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: products
    spec:
      selector:
        matchLabels:
          app: products
      template:
        metadata:
          labels:
            app: products
            # ... other labels
    # ... similar labeling for 'frontend' and 'inventory' deployments

    Now, the policy itself:

    yaml
    apiVersion: "cilium.io/v2"
    kind: CiliumNetworkPolicy
    metadata:
      name: "api-l7-policy"
      namespace: "ecommerce"
    spec:
      endpointSelector:
        matchLabels:
          app: products # This policy applies to the 'products' service
      ingress:
      - fromEndpoints:
        - matchLabels:
            app: frontend
        toPorts:
        - ports:
          - port: "8080"
            protocol: TCP
          rules:
            http:
            - method: "GET"
              path: "/products"
    
    --- 
    apiVersion: "cilium.io/v2"
    kind: CiliumNetworkPolicy
    metadata:
      name: "inventory-l7-policy"
      namespace: "ecommerce"
    spec:
      endpointSelector:
        matchLabels:
          app: inventory # This policy applies to the 'inventory' service
      ingress:
      - fromEndpoints:
        - matchLabels:
            app: products
        toPorts:
        - ports:
          - port: "8080"
            protocol: TCP
          rules:
            http:
            - method: "GET"
              path: "/inventory/.*" # Use regex for path matching

    How it works: When traffic destined for a pod covered by this policy arrives at the node, Cilium's eBPF programs identify it. Because an L7 rule exists, the traffic is efficiently handed off to the node-local Envoy proxy for deep packet inspection and enforcement. If the request doesn't match the policy, Envoy drops it. This is far more efficient than every pod having its own proxy to make the same decision.
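You can exercise the policy end to end. The deployment and service names below are illustrative, matching the labels used in the policy; an L7 deny from Cilium's Envoy typically surfaces to the client as an HTTP 403 rather than a dropped connection:

```bash
# Allowed by the policy: GET /products from the frontend
kubectl -n ecommerce exec deploy/frontend -- \
  curl -s -o /dev/null -w "%{http_code}\n" http://products:8080/products

# Denied by the policy: POST to the same path
kubectl -n ecommerce exec deploy/frontend -- \
  curl -s -o /dev/null -w "%{http_code}\n" -X POST http://products:8080/products

# Watch the L7 verdicts in real time with Hubble
hubble observe --namespace ecommerce --protocol http
hubble observe --namespace ecommerce --verdict DROPPED
```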

    Performance Benchmarking: Istio Sidecar vs. Cilium eBPF

    Talk is cheap. Let's quantify the performance difference. We'll set up a test with a simple client-server application and use fortio to measure latency and throughput.

    Test Setup:

    * Application: A simple httpbin service.

    * Client: A fortio pod that will bombard httpbin with requests.

    * Scenario 1: A standard Kubernetes cluster with Istio 1.21 installed, with sidecars automatically injected into both fortio and httpbin pods.

    * Scenario 2: An identical cluster, but with Cilium 1.15 installed using our production values.yaml (no sidecars).

    * Metrics: P99 latency and requests per second (QPS) over a 60-second test run.

    Code Example 3: Kubernetes Manifests for Benchmarking

    yaml
    # httpbin-deployment.yaml
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: httpbin
      namespace: benchmark
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: httpbin
      template:
        metadata:
          labels:
            app: httpbin
        spec:
          containers:
          - name: httpbin
            image: kennethreitz/httpbin
            ports:
            - containerPort: 80
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: httpbin
      namespace: benchmark
    spec:
      ports:
      - port: 80
        targetPort: 80
      selector:
        app: httpbin
    ---
    # fortio-client-job.yaml
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: fortio-benchmark
      namespace: benchmark
    spec:
      template:
        metadata:
          annotations:
            # For Istio, sidecar.istio.io/inject: "true" would be active
            # For Cilium, no annotation is needed
            sidecar.istio.io/inject: "false" # Explicitly disable for Cilium test
        spec:
          containers:
          - name: fortio
            image: fortio/fortio
            command: ["fortio", "load", "-qps", "1000", "-t", "60s", "-c", "64", "-json", "/tmp/fortio_report.json", "http://httpbin.benchmark.svc.cluster.local/get"]
          restartPolicy: Never
      backoffLimit: 4

    Execution and Results Analysis:

    We run the fortio-benchmark job in both clusters and extract the P99 latency and final QPS from the report.
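Extracting the numbers can be scripted. The field names below follow fortio's JSON report format (percentile values appear under DurationHistogram.Percentiles, in seconds); adjust the report path to wherever you copied the file:

```bash
# fortio also prints a human-readable summary to the job logs
kubectl -n benchmark logs job/fortio-benchmark

# Pull P99 latency and the achieved QPS out of the JSON report
jq '{p99: (.DurationHistogram.Percentiles[] | select(.Percentile == 99) | .Value),
     qps: .ActualQPS}' /tmp/fortio_report.json
```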

    | Metric           | Istio 1.21 (Sidecar) | Cilium 1.15 (eBPF Sidecarless) | Improvement |
    |------------------|----------------------|--------------------------------|-------------|
    | P99 Latency      | ~8.2 ms              | ~1.9 ms                        | ~76% lower  |
    | Throughput (QPS) | ~985                 | ~1000 (limited by test)        | Maintained  |
    | Sidecar CPU/Pod  | ~0.15 vCPU           | 0 (N/A)                        | Eliminated  |
    | Sidecar Mem/Pod  | ~50 MiB              | 0 (N/A)                        | Eliminated  |

    Note: These are representative results. Actual numbers will vary based on hardware, cluster size, and workload.

    The results are stark. The P99 latency sees a dramatic reduction. This is the direct result of eliminating the two extra user-space hops from the data path. While throughput is similar (as we capped it at 1000 QPS), the latency win is critical for user-facing applications and complex service chains. Furthermore, the complete elimination of per-pod resource overhead is a massive operational and cost-saving victory.

    Edge Cases and Operational Considerations

    A production migration requires thinking about the edge cases.

  • Kernel Version Dependencies: The most significant operational hurdle is ensuring your nodes run a sufficiently modern Linux kernel. Advanced features like BPF host routing and mTLS have specific kernel version requirements. This necessitates a robust node image management and upgrade strategy, which might be a challenge in environments with heterogeneous or older node pools.
  • Debugging eBPF: When things go wrong, you can't just tcpdump inside a sidecar. You need to learn a new set of tools. cilium monitor is your best friend, providing a real-time stream of packet-level events, including policy verdicts. For a higher-level view, hubble observe provides a filterable, identity-aware view of traffic flows. Learning to interpret this output is a critical skill for any team running Cilium in production.
    bash
    # See dropped packets on a node, with drop reasons
    # (cilium monitor runs per agent; exec into the cilium pod on the node of interest)
    cilium monitor --type drop

    # Launch a real-time UI of all network flows
    hubble ui
  • Interoperability and Migration: A big-bang migration is rarely feasible. How does Cilium coexist with an existing mesh like Istio? A common pattern is to run both CNIs/meshes on different node pools. You can use node selectors to schedule new, performance-sensitive workloads onto Cilium-managed nodes while legacy services remain on Istio-managed nodes. Traffic between them is handled via standard Kubernetes services and ingress gateways, allowing for a gradual, controlled migration.
  • Handling Encrypted (TLS) Traffic: eBPF operates at L3/L4 and cannot, by itself, inspect encrypted L7 payloads. If your application pods are communicating over TLS and you need to apply HTTP-aware policies (e.g., path-based routing), Cilium's node-local Envoy proxy is still required to terminate TLS. This is an explicit trade-off: you re-introduce a proxy hop in exchange for L7 visibility into encrypted traffic. The key difference is that this is a conscious, policy-driven decision, not a default tax on all traffic.
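The node-pool migration pattern above uses ordinary scheduling primitives. A minimal sketch, assuming a hypothetical node label mesh=cilium applied to the Cilium-managed pool (the label name and image are illustrative, not Cilium built-ins):

```yaml
# Pin a performance-sensitive workload to Cilium-managed nodes.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout
spec:
  selector:
    matchLabels:
      app: checkout
  template:
    metadata:
      labels:
        app: checkout
    spec:
      nodeSelector:
        mesh: cilium   # only schedule onto the Cilium node pool
      containers:
      - name: checkout
        image: example/checkout:latest # placeholder image
```

Workloads without the selector continue to land on the Istio-managed pool, and cross-pool traffic flows through standard Services and gateways during the migration window.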

    Conclusion: The Future is Kernel-Native

    The sidecar pattern was a brilliant innovation that brought service mesh capabilities to the masses. However, for organizations pushing the boundaries of scale and performance, its inherent overhead is a tangible constraint. eBPF-based service meshes, with Cilium leading the charge, represent the next logical evolution. By moving networking, security, and observability logic directly into the Linux kernel, we can build platforms that are not only faster and more resource-efficient but also conceptually simpler.

    The transition requires new skills and operational practices, particularly around kernel management and eBPF-native debugging tools. But the performance gains and resource savings are not marginal—they are order-of-magnitude improvements that can redefine a platform's capabilities and cost structure. For senior engineers building the next generation of cloud-native infrastructure, the future is not in a sidecar; it's in the kernel.
