Kernel-Level Service Mesh Telemetry with eBPF: A Sidecar-less Future

Goh Ling Yong

The Inherent Overhead of the Sidecar Pattern

For years, the sidecar proxy model, popularized by service meshes like Istio and Linkerd, has been the de facto standard for bringing observability, security, and reliability to microservices. By injecting a user-space proxy (typically Envoy) alongside each application container, we gain a language-agnostic way to manage traffic. However, for senior engineers operating at scale, the performance and resource tax of this pattern is a well-understood and often painful reality.

Every network packet destined for an application pod must traverse the following path:

  • Kernel TCP/IP stack -> User-space Envoy proxy (sidecar)
  • Envoy processes the packet (mTLS, routing, metrics collection)
  • Envoy -> Kernel loopback interface -> User-space application

This round trip involves multiple context switches between kernel and user-space, data copies between kernel and user memory, and a separate TCP connection termination inside the pod. For latency-sensitive services, this added hop can introduce a non-trivial p99 latency penalty. Furthermore, the aggregate CPU and memory consumption of thousands of sidecar proxies across a large cluster represents a significant operational cost.

This is the fundamental problem that eBPF aims to solve in the cloud-native networking space. By moving the data plane logic from a user-space proxy into the Linux kernel itself, we can create a 'sidecar-less' architecture that is fundamentally more efficient. This article will bypass the introductory concepts and dive straight into the advanced implementation patterns for achieving this.

Architectural Showdown: Sidecar vs. eBPF Data Plane

Let's visualize the data path difference. We assume a baseline familiarity with Kubernetes networking.

Sidecar Model (e.g., Istio with Envoy):

mermaid
graph TD
    subgraph Node
        subgraph Pod
            App[Application Container]
            Envoy[Envoy Sidecar]
        end
        Kernel[Kernel Space]
    end

    IngressPacket[External Request] -->|1. NIC| Kernel
    Kernel -->|2. iptables redirect| Envoy
    Envoy -->|3. Process & Forward| Kernel
    Kernel -->|4. Loopback| App

    style Envoy fill:#f9f,stroke:#333,stroke-width:2px
    style App fill:#ccf,stroke:#333,stroke-width:2px
  • Path: The request hits the node's kernel, where iptables rules redirect it to the Envoy proxy's listener in user-space. Envoy applies its logic and then sends the traffic back through the kernel's loopback interface to the application container. The response path is the reverse.
  • Overhead: Two full TCP/IP stack traversals per request inside the pod, multiple data copies between kernel and user-space, and the resource footprint of the Envoy process itself.

eBPF Sidecar-less Model (e.g., Cilium):

    mermaid
    graph TD
        subgraph Node
            subgraph Pod
                App[Application Container]
            end
            subgraph Kernel Space
                TC[TC Hook + eBPF]
                Socket[Socket Hook + eBPF]
            end
        end
    
        IngressPacket[External Request] -->|1. NIC| TC
        TC -->|2. eBPF processes packet| Socket
        Socket -->|3. Direct delivery| App
    
        style TC fill:#9f9,stroke:#333,stroke-width:2px
        style App fill:#ccf,stroke:#333,stroke-width:2px
  • Path: The request hits the Traffic Control (TC) layer of the kernel, where our eBPF program is attached. The program can inspect, modify, and make policy decisions on the packet (sk_buff) directly in the kernel. It can then forward the packet directly to the application's socket, bypassing the user-space detour entirely.
  • Efficiency: Minimal context switching, no extra data copies, and the logic runs in the highly optimized kernel execution context. Identity and policy are managed via eBPF maps, which provide a high-speed key/value store accessible from both kernel and user-space (a minimal sketch follows).
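
    To make the map-based model concrete, here is a minimal sketch of what an identity-aware policy lookup could look like. The map name, key, and value layouts below are illustrative assumptions for this article, not Cilium's actual data structures; a user-space agent would populate the map with identities assigned by the control plane.

    c
    // Illustrative identity/policy map; names and layouts are assumptions.
    #include <vmlinux.h>
    #include <bpf/bpf_helpers.h>

    struct identity_key_t {
        __u32 src_identity;   // numeric workload identity from the control plane
        __u32 dst_identity;
        __u16 dst_port;
        __u8  protocol;
        __u8  pad;
    };

    struct policy_value_t {
        __u8 allow;           // 1 = allow, 0 = drop
    };

    struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 65536);
        __type(key, struct identity_key_t);
        __type(value, struct policy_value_t);
    } policy_map SEC(".maps");

    // Called from a TC program once the packet has been mapped to identities.
    static __always_inline int policy_verdict(struct identity_key_t *key) {
        struct policy_value_t *val = bpf_map_lookup_elem(&policy_map, key);
        if (!val)
            return 0;   // default deny when no entry exists
        return val->allow;
    }

    Because the same map is visible to user-space through the bpf(2) syscall, the agent can update policy entries at runtime without reloading the program.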

Implementation Deep Dive: Capturing HTTP/1.1 Telemetry

    The most challenging aspect of eBPF-based observability is parsing L7 protocols. Kernel-level eBPF programs see a stream of TCP packets (sk_buff structs), not a clean, reassembled HTTP request. A single HTTP request can be fragmented across multiple TCP segments. Our eBPF program must handle this reassembly to extract meaningful telemetry like method, path, and status code.

    We will implement a simplified system that:

    • Attaches an eBPF program to the TC ingress/egress hooks on a pod's veth pair.
    • Identifies and tracks HTTP/1.1 connections.
    • Parses HTTP requests and responses from raw TCP data.
    • Tracks in-flight transaction state (start timestamp, method, path) in an eBPF hash map keyed by connection.
    • Emits completed transactions (adding status code and end timestamp) to user-space through a perf buffer, where a Go agent reads them and exports metrics.

    The eBPF Kernel-Space Program (C)

    This C code is compiled using Clang/LLVM into eBPF bytecode. We use libbpf headers for helpers and annotations.

    Key Concepts in the Code:

    * SEC("tc"): Attaches the handle_tc function to the TC hook.

    * BPF Maps: We use several maps:

    * http_transactions: A hash map to store ongoing transaction state, keyed by a connection tuple (struct conn_tuple_t).

    * perf_events: A perf buffer to send completed transaction data to user-space asynchronously.

    * Direct packet access: The program reads packet bytes via the skb->data / skb->data_end pointers; every read must be explicitly bounded against data_end or the verifier rejects the program. (bpf_skb_load_bytes() is an alternative helper for safely copying bytes out of the packet buffer.)

    * Bounded Loops: The eBPF verifier forbids loops it cannot prove will terminate. We use #pragma unroll to have the compiler unroll the loops (bounded loops are also supported natively on kernels 5.3+, but unrolling keeps verification simple for small, fixed iteration counts).

    c
    // SPDX-License-Identifier: GPL-2.0
    #include <vmlinux.h>
    #include <bpf/bpf_helpers.h> 
    #include <bpf/bpf_tracing.h> 
    #include <bpf/bpf_core_read.h>
    
    #define MAX_PATH_LEN 256
    #define MAX_METHOD_LEN 8
    
    // Connection identifier tuple
    struct conn_tuple_t {
        __u32 saddr;
        __u32 daddr;
        __u16 sport;
        __u16 dport;
    };
    
    // Transaction state stored in the map
    struct http_transaction_t {
        __u64 start_time_ns;
        char method[MAX_METHOD_LEN];
        char path[MAX_PATH_LEN];
    };
    
    // Data sent to user-space via perf buffer
    struct http_event_t {
        struct conn_tuple_t tuple;
        __u64 start_time_ns;
        __u64 end_time_ns;
        __u16 status_code;
        char method[MAX_METHOD_LEN];
        char path[MAX_PATH_LEN];
    };
    
    // BPF hash map to store ongoing HTTP transactions
    struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 10240);
        __type(key, struct conn_tuple_t);
        __type(value, struct http_transaction_t);
    } http_transactions SEC(".maps");
    
    // BPF perf buffer to send events to user-space
    struct {
        __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
        __uint(key_size, sizeof(int));
        __uint(value_size, sizeof(int));
    } perf_events SEC(".maps");
    
    // Simple helper to check for HTTP method prefixes
    static __always_inline bool is_http_method(const char *buf) {
        if ((buf[0] == 'G' && buf[1] == 'E' && buf[2] == 'T') ||
            (buf[0] == 'P' && buf[1] == 'O' && buf[2] == 'S' && buf[3] == 'T') ||
            (buf[0] == 'P' && buf[1] == 'U' && buf[2] == 'T') ||
            (buf[0] == 'D' && buf[1] == 'E' && buf[2] == 'L' && buf[3] == 'E' && buf[4] == 'T' && buf[5] == 'E')) {
            return true;
        }
        return false;
    }
    
    // Simple helper to parse HTTP status code
    static __always_inline __u16 parse_status_code(const char* buf) {
        // Expected prefix: "HTTP/1.1 200 ..." - the three status digits sit at offsets 9-11.
        if (buf[0] != 'H' || buf[1] != 'T' || buf[2] != 'T' || buf[3] != 'P') {
            return 0;
        }
        __u16 status = (buf[9] - '0') * 100 + (buf[10] - '0') * 10 + (buf[11] - '0');
        return status;
    }
    
    SEC("tc")
    int handle_tc(struct __sk_buff *skb) {
        void *data_end = (void *)(long)skb->data_end;
        void *data = (void *)(long)skb->data;
    
        struct ethhdr *eth = data;
        if ((void *)eth + sizeof(*eth) > data_end) return TC_ACT_OK;
    
        struct iphdr *iph = (struct iphdr *)(eth + 1);
        if ((void *)iph + sizeof(*iph) > data_end) return TC_ACT_OK;
    
        if (iph->protocol != IPPROTO_TCP) return TC_ACT_OK;
    
        struct tcphdr *tcph = (struct tcphdr *)((void *)iph + iph->ihl * 4);
        if ((void *)tcph + sizeof(*tcph) > data_end) return TC_ACT_OK;
    
        // Calculate payload start and length
        char *payload = (char *)((void *)tcph + tcph->doff * 4);
        int payload_len = skb->len - (payload - (char *)data);
        if (payload_len <= 0) return TC_ACT_OK;
    
        // Every direct packet read must be bounded by data_end for the verifier;
        // require at least the method token / "HTTP/1.1 xyz" prefix to be present.
        if ((void *)payload + 12 > data_end) return TC_ACT_OK;
    
        struct conn_tuple_t conn_tuple = {};
        conn_tuple.saddr = iph->saddr;
        conn_tuple.daddr = iph->daddr;
        conn_tuple.sport = tcph->source;
        conn_tuple.dport = tcph->dest;
    
        // Check if it's a request (client -> server)
        if (is_http_method(payload)) {
            struct http_transaction_t tx = {};
            tx.start_time_ns = bpf_ktime_get_ns();
    
            // Naive parsing for demonstration. A real implementation needs more robustness.
            // Each byte is re-checked against data_end so the verifier accepts the access.
            int i, j;
            #pragma unroll
            for (i = 0; i < MAX_METHOD_LEN - 1; i++) {
                if ((void *)(payload + i) + 1 > data_end) break;
                if (payload[i] == ' ') break;
                tx.method[i] = payload[i];
            }
            tx.method[i] = '\0';
            i++; // skip the space after the method
    
            #pragma unroll
            for (j = 0; j < MAX_PATH_LEN - 1; j++) {
                if ((void *)(payload + i + j) + 1 > data_end) break;
                if (payload[i + j] == ' ') break;
                tx.path[j] = payload[i + j];
            }
            tx.path[j] = '\0';
    
            bpf_map_update_elem(&http_transactions, &conn_tuple, &tx, BPF_ANY);
            return TC_ACT_OK;
        }
    
        // Check if it's a response (server -> client)
        __u16 status_code = parse_status_code(payload);
        if (status_code >= 100 && status_code < 600) {
            // Invert tuple to find the original request
            struct conn_tuple_t original_tuple = {};
            original_tuple.saddr = conn_tuple.daddr;
            original_tuple.daddr = conn_tuple.saddr;
            original_tuple.sport = conn_tuple.dport;
            original_tuple.dport = conn_tuple.sport;
            
            struct http_transaction_t *tx = bpf_map_lookup_elem(&http_transactions, &original_tuple);
            if (tx) {
                struct http_event_t event = {};
                event.tuple = original_tuple;
                event.start_time_ns = tx->start_time_ns;
                event.end_time_ns = bpf_ktime_get_ns();
                event.status_code = status_code;
                __builtin_memcpy(event.method, tx->method, MAX_METHOD_LEN);
                __builtin_memcpy(event.path, tx->path, MAX_PATH_LEN);
    
                bpf_perf_event_output(skb, &perf_events, BPF_F_CURRENT_CPU, &event, sizeof(event));
                bpf_map_delete_elem(&http_transactions, &original_tuple);
            }
        }
    
        return TC_ACT_OK;
    }
    
    char _license[] SEC("license") = "GPL";

    Critical Limitation & Edge Case: The code above is deliberately simplified. It assumes a full HTTP request or response line fits within a single TCP segment. In production, this is rarely true. A robust implementation needs a per-connection reassembly buffer, for example a BPF_MAP_TYPE_LRU_HASH keyed by the connection tuple (with a BPF_MAP_TYPE_PERCPU_ARRAY often used as scratch space to work around the 512-byte eBPF stack limit), so that TCP segments can be reassembled before the HTTP headers are parsed. This is a non-trivial engineering challenge within the constraints of the eBPF verifier; a rough sketch of the required state follows.
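
    The sketch below shows only the map shapes such a design might use; the names, sizes, and the LRU choice are assumptions for illustration, and the actual reassembly logic (sequence tracking, out-of-order segments, eviction) is omitted.

    c
    // Illustrative reassembly state; names and sizes are assumptions.
    #include <vmlinux.h>
    #include <bpf/bpf_helpers.h>

    #define REASM_BUF_SIZE 4096

    struct conn_tuple_t {          // same key shape as http_transactions above
        __u32 saddr;
        __u32 daddr;
        __u16 sport;
        __u16 dport;
    };

    struct reasm_state_t {
        __u32 next_seq;            // next expected TCP sequence number
        __u32 len;                 // bytes currently buffered
        char  buf[REASM_BUF_SIZE]; // partial HTTP header bytes
    };

    // Per-connection buffers; an LRU map lets stale connections age out.
    struct {
        __uint(type, BPF_MAP_TYPE_LRU_HASH);
        __uint(max_entries, 10240);
        __type(key, struct conn_tuple_t);
        __type(value, struct reasm_state_t);
    } http_reassembly SEC(".maps");

    // A single per-CPU scratch element sidesteps the 512-byte eBPF stack limit
    // when assembling a reasm_state_t before inserting it into the hash map.
    struct {
        __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
        __uint(max_entries, 1);
        __type(key, __u32);
        __type(value, struct reasm_state_t);
    } reasm_scratch SEC(".maps");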

    The User-Space Agent (Go)

    This Go program uses the cilium/ebpf library to load, attach, and read data from our eBPF program.

    go
    package main
    
    import (
    	"bytes"
    	"encoding/binary"
    	"errors"
    	"log"
    	"net"
    	"os"
    	"os/signal"
    	"syscall"
    
    	"github.com/cilium/ebpf"
    	"github.com/cilium/ebpf/link"
    	"github.com/cilium/ebpf/perf"
    	"github.com/cilium/ebpf/rlimit"
    )
    
    //go:generate go run github.com/cilium/ebpf/cmd/bpf2go bpf http_parser.c -- -I/usr/include/bpf
    
    const (
    	IfceName = "eth0" // Interface to attach the TC program to
    )
    
    func main() {
    	// Allow the current process to lock memory for eBPF maps.
    	if err := rlimit.RemoveMemlock(); err != nil {
    		log.Fatal(err)
    	}
    
    	// Load pre-compiled programs and maps into the kernel.
    	objs := bpfObjects{}
    	if err := loadBpfObjects(&objs, nil); err != nil {
    		log.Fatalf("loading objects: %s", err)
    	}
    	defer objs.Close()
    
    	iface, err := net.InterfaceByName(IfceName)
    	if err != nil {
    		log.Fatalf("lookup network iface %q: %s", IfceName, err)
    	}
    
    	// Attach the TC program to both the ingress and egress hooks of the
    	// interface so we see requests and responses. TCX links require kernel
    	// 6.6+; on older kernels a netlink-based clsact/TC filter attachment
    	// would be used instead.
    	ingress, err := link.AttachTCX(link.TCXOptions{
    		Program:   objs.HandleTc,
    		Interface: iface.Index,
    		Attach:    ebpf.AttachTCXIngress,
    	})
    	if err != nil {
    		log.Fatalf("could not attach TC ingress program: %s", err)
    	}
    	defer ingress.Close()
    
    	egress, err := link.AttachTCX(link.TCXOptions{
    		Program:   objs.HandleTc,
    		Interface: iface.Index,
    		Attach:    ebpf.AttachTCXEgress,
    	})
    	if err != nil {
    		log.Fatalf("could not attach TC egress program: %s", err)
    	}
    	defer egress.Close()
    
    	log.Printf("Attached TC program to iface %q (index %d)", iface.Name, iface.Index)
    
    	// Open a perf reader from the BPF map.
    	rd, err := perf.NewReader(objs.PerfEvents, os.Getpagesize())
    	if err != nil {
    		log.Fatalf("creating perf event reader: %s", err)
    	}
    	defer rd.Close()
    
    	// Set up a signal handler to exit gracefully.
    	stopper := make(chan os.Signal, 1)
    	signal.Notify(stopper, os.Interrupt, syscall.SIGTERM)
    
    	go func() {
    		<-stopper
    		log.Println("Received signal, exiting...")
    		rd.Close()
    	}()
    
    	log.Println("Waiting for events...")
    
    	var event bpfHttpEventT
    	for {
    		record, err := rd.Read()
    		if err != nil {
    			if errors.Is(err, perf.ErrClosed) {
    				return
    			}
    			log.Printf("reading from perf buffer: %s", err)
    			continue
    		}
    
    		if record.LostSamples != 0 {
    			log.Printf("perf event ring buffer full, dropped %d samples", record.LostSamples)
    			continue
    		}
    
    		// Parse the raw data from the perf buffer into our Go struct.
    		if err := binary.Read(bytes.NewBuffer(record.RawSample), binary.LittleEndian, &event); err != nil {
    			log.Printf("parsing perf event: %s", err)
    			continue
    		}
    
    		printEvent(event)
    	}
    }
    
    // The Go struct must match the C struct layout exactly, including the
    // padding the C compiler inserts: conn_tuple_t is 12 bytes, so 4 bytes of
    // padding precede the 8-byte-aligned start_time_ns field. binary.Read does
    // not add implicit padding, so it is spelled out explicitly.
    type bpfHttpEventT struct {
    	Tuple struct {
    		Saddr uint32
    		Daddr uint32
    		Sport uint16
    		Dport uint16
    	}
    	_             [4]byte // alignment padding emitted by the C compiler
    	Start_time_ns uint64
    	End_time_ns   uint64
    	Status_code   uint16
    	Method        [8]byte
    	Path          [256]byte
    }
    
    func printEvent(e bpfHttpEventT) {
    	durationMs := float64(e.End_time_ns-e.Start_time_ns) / 1e6
    
    	// Addresses and ports were captured in network byte order; rebuild the
    	// original byte sequence for the IPs and byte-swap the ports.
    	srcIP := make(net.IP, 4)
    	dstIP := make(net.IP, 4)
    	binary.LittleEndian.PutUint32(srcIP, e.Tuple.Saddr)
    	binary.LittleEndian.PutUint32(dstIP, e.Tuple.Daddr)
    	srcPort := e.Tuple.Sport>>8 | e.Tuple.Sport<<8
    	dstPort := e.Tuple.Dport>>8 | e.Tuple.Dport<<8
    
    	method := string(e.Method[:bytes.IndexByte(e.Method[:], 0)])
    	path := string(e.Path[:bytes.IndexByte(e.Path[:], 0)])
    
    	log.Printf("HTTP Tx: %s:%d -> %s:%d | %s %s | %d | %.3fms",
    		srcIP, srcPort, dstIP, dstPort, method, path, e.Status_code, durationMs)
    }

    This agent handles loading the eBPF bytecode, attaching it, and then entering a loop to read events from the perf buffer. In a production system, instead of logging, printEvent would format this data into Prometheus metrics or OpenTelemetry traces.

    Performance Analysis: The Kernel Advantage

    Real-world benchmarks (e.g., from the Cilium project) consistently show significant performance gains. In a sidecar-less model, you can expect:

    * Lower Latency: P99 latency can be reduced by 50-75% for services that are sensitive to network hops, as the entire user-space proxy detour is eliminated.

    * Reduced CPU Overhead: The per-request CPU cost drops because packet processing stays in the kernel, avoiding the context switches and kernel/user data copies of the proxy hop. This frees up CPU cycles on the node for the actual applications, and a large cluster can reclaim a substantial number of cores.

    * Lower Memory Footprint: Eliminating thousands of Envoy proxy processes drastically reduces the memory overhead per node, allowing for higher pod density.

    These gains are not theoretical. They are the primary driver for the adoption of eBPF-based networking and service meshes in high-performance computing environments and large-scale cloud providers.

    Advanced Edge Case: Handling TLS Encrypted Traffic

    Our TC-based example has a glaring flaw: it cannot inspect L7 data if the traffic is encrypted with TLS. The eBPF program at the TC layer sees only gibberish ciphertext.

    This is a major challenge. The solution is to move the observation point from the network socket to the encryption library itself using uprobes (user-space probes).

    The Uprobe Strategy:

  • Identify Target Functions: We target the read/write functions within the user-space TLS library that the application is using (e.g., SSL_read and SSL_write in OpenSSL).
  • Attach eBPF Probes: We use uprobes and uretprobes (user-space probes) to attach eBPF programs to the entry and return points of these library functions.
  • Capture Plaintext Data: The eBPF program attached to the return of SSL_read can access the buffer containing the decrypted, plaintext data. Similarly, the program attached to the entry of SSL_write can access the plaintext data just before it's encrypted.
  • Correlate with Connection: The program then uses BPF helpers like bpf_get_current_pid_tgid() and inspects file descriptors to correlate this plaintext data with the underlying kernel socket connection tuple (IPs/ports), allowing it to be matched with the network-level data captured at the TC layer.
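
    The snippet below is a minimal sketch of the probe side of this strategy, assuming OpenSSL's standard int SSL_write(SSL *ssl, const void *buf, int num) signature; attaching the probe to the right library path from the loader, the SSL_read side, and the socket-correlation step are all omitted.

    c
    // Illustrative uprobe capturing plaintext at the OpenSSL boundary.
    #include <vmlinux.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>

    #define MAX_DATA_LEN 256

    struct tls_chunk_t {
        __u32 pid;
        __u32 len;
        char  data[MAX_DATA_LEN];   // truncated plaintext sample
    };

    struct {
        __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
        __uint(key_size, sizeof(int));
        __uint(value_size, sizeof(int));
    } tls_events SEC(".maps");

    // Per-CPU scratch element to avoid the 512-byte eBPF stack limit.
    struct {
        __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
        __uint(max_entries, 1);
        __type(key, __u32);
        __type(value, struct tls_chunk_t);
    } tls_scratch SEC(".maps");

    SEC("uprobe/SSL_write")
    int BPF_KPROBE(probe_ssl_write, void *ssl, const void *buf, int num) {
        __u32 zero = 0;
        struct tls_chunk_t *chunk = bpf_map_lookup_elem(&tls_scratch, &zero);
        if (!chunk || num <= 0)
            return 0;

        __u32 len = num;
        if (len > MAX_DATA_LEN)
            len = MAX_DATA_LEN;

        chunk->pid = bpf_get_current_pid_tgid() >> 32;
        chunk->len = len;
        // Copy the plaintext from user-space memory before it is encrypted.
        bpf_probe_read_user(chunk->data, len, buf);
        bpf_perf_event_output(ctx, &tls_events, BPF_F_CURRENT_CPU,
                              chunk, sizeof(*chunk));
        return 0;
    }

    char _license[] SEC("license") = "GPL";

    The same pattern, attached as a uretprobe on SSL_read, covers the receive side.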

Implementation Complexity & Trade-offs:

    * Fragility: This technique is highly dependent on the specific version and implementation of the TLS library. An update to OpenSSL could change function signatures or struct layouts, breaking the eBPF program. CO-RE (Compile Once - Run Everywhere) with BTF (BPF Type Format) helps mitigate this but doesn't eliminate the risk entirely.

    * Language Specificity: This approach is more complex for garbage-collected languages like Go, which has its own crypto/tls implementation. You would need to probe the specific functions within Go's runtime.

    * Alternative: Projects like Cilium and Istio's ambient mesh solve this by managing mTLS themselves. When they are the endpoint for mTLS termination, their data plane has access to the plaintext and can apply L7 logic before re-encrypting traffic to the application. This is a more robust but also more architecturally integrated solution.

    Production Considerations

    Deploying eBPF at scale requires careful consideration:

    * Kernel Version: eBPF is a rapidly evolving subsystem. The availability of crucial BPF helpers, map types, and program types is tied to specific kernel versions. A minimum of Linux 4.19 is often required for basic networking, with 5.10+ preferable for modern features (and 6.6+ for the TCX attach API used in the Go example above).

    * Verifier Constraints: The eBPF verifier is your biggest hurdle. It enforces strict rules to ensure kernel safety: no unbounded loops, limited stack size (~512 bytes), and strict memory access checks. Writing complex logic requires breaking it down into smaller, verifiable pieces.

    * CO-RE and BTF: For portable eBPF programs that don't need to be recompiled for every kernel version, relying on CO-RE is essential. This requires that the target kernels are compiled with BTF type information enabled; a short sketch of what CO-RE field access looks like follows this list.

    * Memory Locking (RLIMIT_MEMLOCK): eBPF maps are locked in memory to prevent them from being swapped to disk. The user-space process that loads the eBPF programs needs an adequate memlock limit, which often has to be raised in the systemd unit or container runtime. (On kernels 5.11 and newer, BPF memory is accounted to the cgroup memory controller instead, but loaders such as cilium/ebpf still call rlimit.RemoveMemlock() for compatibility with older kernels.)
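
    To make the CO-RE point concrete, here is a minimal sketch: BPF_CORE_READ records BTF relocations so field offsets are resolved against the running kernel at load time rather than hard-coded at compile time. The kprobe target (tcp_connect) and the fields read are just convenient examples, not part of the telemetry pipeline above.

    c
    // Illustrative CO-RE field access via a kprobe on tcp_connect().
    #include <vmlinux.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>
    #include <bpf/bpf_core_read.h>
    #include <bpf/bpf_endian.h>

    SEC("kprobe/tcp_connect")
    int BPF_KPROBE(trace_tcp_connect, struct sock *sk) {
        // Offsets of skc_daddr / skc_dport come from the running kernel's BTF,
        // so the same object file loads on kernels with different struct layouts.
        __u32 daddr = BPF_CORE_READ(sk, __sk_common.skc_daddr);
        __u16 dport = BPF_CORE_READ(sk, __sk_common.skc_dport);

        bpf_printk("tcp_connect to %x:%d", daddr, bpf_ntohs(dport));
        return 0;
    }

    char _license[] SEC("license") = "GPL";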

    Conclusion: The Inevitable Shift

    The move from sidecar proxies to in-kernel data planes with eBPF represents a fundamental architectural shift in cloud-native infrastructure. While the implementation is undeniably more complex and requires a deeper understanding of Linux internals, the performance and efficiency benefits are too significant to ignore at scale. By eliminating the user-space detour, eBPF-based service meshes provide a data plane that is markedly faster and more resource-efficient.

    For senior engineers and architects, understanding the low-level mechanics of eBPF—from TC and XDP hooks to map-based state management and the challenges of L7 parsing—is no longer a niche skill. It is the key to building the next generation of performant, secure, and observable distributed systems.
