eBPF for Service Mesh: Kernel-Level Sidecar Latency Analysis
The User-Space Blind Spot in Service Mesh Observability
In any production Kubernetes environment running a service mesh, the P99 latency of the sidecar proxy is a metric of paramount importance. Engineering teams rely on metrics scraped from the proxy's admin endpoint (e.g., /stats/prometheus in Envoy) to build SLOs and diagnose performance regressions. However, these metrics, while valuable, are fundamentally incomplete. They measure the time from when a request is fully read from a socket by the proxy to when the proxy finishes writing the corresponding response to another socket. This measurement omits the significant time spent within the kernel's network stack.
Consider the lifecycle of a single request from Pod A to Pod B:
1. App A write()s data to its socket.
2. The kernel moves the data through its network stack to the socket of the local sidecar (Sidecar A).
3. Sidecar A read()s data from its socket, processes it (TLS termination, routing, policy enforcement), and write()s it to the outbound socket.
4. The kernel processes the outbound data and transmits it onto the network.
5. ... (Network transit) ...
6. The kernel on the destination node receives the data and queues it on Sidecar B's socket.
7. Sidecar B read()s the data, processes it, and write()s it to the socket connected to App B.
8. The kernel delivers the data to App B's socket.
9. App B read()s the data from its socket.
Standard proxy metrics only cover the duration of steps 3 and 7. The kernel-level transit times in steps 2, 4, 6, and 8 are completely invisible. This "observability gap" can conceal critical latency sources, such as buffer contention, context switching overhead, and TCP retransmissions, leading to an inaccurate understanding of the true performance cost of the mesh.
This is where eBPF (extended Berkeley Packet Filter) provides a superior solution. By attaching lightweight, sandboxed programs directly to kernel functions, we can trace network operations at their source, capturing high-precision timestamps with negligible overhead and without any modification to the application or the sidecar proxy.
Our goal is to precisely measure the following two latencies for every pod in the mesh:
* App-to-sidecar latency: the time from the application's write() until the corresponding read() completes in its local sidecar.
* Sidecar-to-app latency: the time from the sidecar's write() until the corresponding read() completes in the local application.
This article details the implementation of an eBPF-based monitor to capture these metrics, using libbpf with CO-RE (Compile Once - Run Everywhere) for maximum portability across kernel versions.
Architecture of the eBPF Latency Monitor
Our solution consists of two components:
* The kernel-side eBPF program (tcplat.bpf.c): A C program containing the eBPF logic. It will be compiled into an eBPF object file. We will attach kprobes (kernel probes) to tcp_sendmsg and tcp_recvmsg to intercept socket operations.
* The user-space agent (tcplat.c): A C program that loads the eBPF object file into the kernel, manages BPF maps, and reads data from the kernel via a ring buffer. This agent is responsible for correlating events, calculating latencies, and exporting them.
We will leverage several BPF map types:
* BPF_MAP_TYPE_HASH: To store in-flight request data, keyed by a unique socket identifier.
* BPF_MAP_TYPE_RINGBUF: A high-performance, lock-free mechanism for sending event data from kernel to user-space.
We will also use CO-RE with BTF (BPF Type Format) to ensure our eBPF program can run on different kernel versions without recompilation. This is a non-negotiable requirement for any production-grade eBPF application.
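As a quick sanity check, the user-space agent can verify at startup that the running kernel actually exposes BTF before it attempts to load the skeleton. The snippet below is a minimal sketch (the helper name is illustrative); /sys/kernel/btf/vmlinux is where kernels built with CONFIG_DEBUG_INFO_BTF publish their type information.
// Sketch: pre-flight check that kernel BTF is available for CO-RE relocations.
// Kernels built with CONFIG_DEBUG_INFO_BTF expose it at /sys/kernel/btf/vmlinux;
// without it, libbpf needs an external BTF file supplied at load time.
#include <stdio.h>
#include <unistd.h>

static int check_kernel_btf(void)
{
    if (access("/sys/kernel/btf/vmlinux", R_OK) != 0) {
        fprintf(stderr, "kernel BTF not found; CO-RE relocation will fail "
                        "unless an external BTF blob is provided\n");
        return -1;
    }
    return 0;
}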
Kernel-Side Implementation: Tracing TCP Events
First, let's define the data structures and maps in our kernel program. The core challenge is correlating a send event with its corresponding recv event. A naive approach might use a 4-tuple (src/dst IP/port), but this is brittle in the face of connection reuse (HTTP Keep-Alive, HTTP/2). A more robust method involves using the TCP sequence and acknowledgment numbers.
Here is the header and map definition for tcplat.bpf.c:
// tcplat.bpf.c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>
#include <bpf/bpf_endian.h> /* for __bpf_ntohs() used below */
// Event structure sent to user-space
struct event {
u64 sk; // Kernel address of the socket struct, used as a unique ID
u64 ts_us; // Timestamp in microseconds
u32 pid;
u32 tcp_seq;
u32 tcp_ack_seq;
u16 sport;
u16 dport;
u8 is_send; // 1 for send, 0 for recv
};
// BPF ring buffer for sending events to user-space
struct {
__uint(type, BPF_MAP_TYPE_RINGBUF);
__uint(max_entries, 256 * 1024); // 256 KB
} rb SEC(".maps");
// A map to track the PID for a given socket. This helps us
// attribute events to either the app or the sidecar.
struct {
__uint(type, BPF_MAP_TYPE_HASH);
__uint(max_entries, 8192);
__type(key, u64);
__type(value, u32);
} sock_pids SEC(".maps");
const volatile u64 target_cgroup = 0;
The event struct captures all necessary information. We use the kernel address of the struct sock (sk) as a stable identifier for a connection. The target_cgroup is a placeholder for filtering; in our user-space loader, we will set this to the cgroup ID of the pod we want to monitor.
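How the loader obtains that cgroup ID is worth spelling out. On cgroup v2, the value returned by bpf_get_current_cgroup_id() corresponds to the inode number of the pod's cgroup directory, so a minimal (and simplified) helper can stat() the path. The path in the usage comment is illustrative; a real agent would discover it via the CRI or the kubelet.
// Sketch: resolve a cgroup v2 directory to the 64-bit ID that
// bpf_get_current_cgroup_id() reports inside the eBPF program. On cgroup v2
// the cgroup ID matches the inode number of the cgroup directory.
#include <stdint.h>
#include <stdio.h>
#include <sys/stat.h>

static uint64_t get_cgroup_id(const char *cgroup_path)
{
    struct stat st;

    if (stat(cgroup_path, &st) != 0) {
        perror("stat cgroup path");
        return 0;
    }
    return (uint64_t)st.st_ino;
}

// Usage (must happen before tcplat_bpf__load(), since target_cgroup is read-only data):
//   skel->rodata->target_cgroup =
//       get_cgroup_id("/sys/fs/cgroup/kubepods.slice/.../cri-containerd-....scope");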
Now, let's implement the kprobes.
Probing `tcp_sendmsg`
We attach a kprobe to tcp_sendmsg to capture outbound data events. We must filter these events to only capture those originating from processes within our target cgroup (i.e., our target pod).
// tcplat.bpf.c (continued)
SEC("kprobe/tcp_sendmsg")
int BPF_KPROBE(tcp_sendmsg, struct sock *sk, struct msghdr *msg, size_t size)
{
u64 cgroup_id = bpf_get_current_cgroup_id();
if (target_cgroup && cgroup_id != target_cgroup) {
return 0;
}
u32 pid = bpf_get_current_pid_tgid() >> 32;
u64 sk_addr = (u64)sk;
bpf_map_update_elem(&sock_pids, &sk_addr, &pid, BPF_ANY);
// Get socket details
u16 sport = BPF_CORE_READ(sk, __sk_common.skc_num);
u16 dport = BPF_CORE_READ(sk, __sk_common.skc_dport);
dport = __bpf_ntohs(dport);
// Get TCP sequence number
struct tcp_sock *ts = (struct tcp_sock *)sk;
u32 seq = BPF_CORE_READ(ts, write_seq);
// Populate and submit the event
struct event *e;
e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
if (!e) {
return 0;
}
e->sk = sk_addr;
e->ts_us = bpf_ktime_get_ns() / 1000;
e->pid = pid;
e->sport = sport;
e->dport = dport;
e->is_send = 1;
e->tcp_seq = seq;
e->tcp_ack_seq = 0; // Not relevant for send
bpf_ringbuf_submit(e, 0);
return 0;
}
Key Implementation Details:
* CO-RE: We use BPF_CORE_READ macros to access kernel struct members. This allows libbpf to perform relocations at load time, adapting our program to the specific kernel version on the host.
* Cgroup Filtering: bpf_get_current_cgroup_id() is the key to isolating traffic to a specific pod. The user-space agent will find the cgroup ID for a target pod and pass it to the eBPF program.
* PID Tracking: We store the PID associated with a socket in the sock_pids map. This is crucial for distinguishing between the application process and the sidecar process when they use the same socket (e.g., on the loopback interface).
* TCP Sequence Number: We capture write_seq from struct tcp_sock. This value represents the sequence number of the next byte of data to be sent.
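Expanding on the PID-tracking point above: one simple, admittedly heuristic way for the user-space agent to classify a PID is to read its process name from /proc. The sketch below assumes the sidecar process is named envoy, which is an assumption about the deployment rather than something the mesh guarantees; a production agent would match against container metadata instead.
// Sketch: classify an event's PID as sidecar or application by process name.
// Requires hostPID (see the deployment section) so the agent can read the
// host's /proc. "envoy" is an assumed process name, not a guarantee.
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

static bool pid_is_sidecar(unsigned int pid)
{
    char path[64], comm[32] = {0};
    FILE *f;

    snprintf(path, sizeof(path), "/proc/%u/comm", pid);
    f = fopen(path, "r");
    if (!f)
        return false; // process may have exited already
    if (fgets(comm, sizeof(comm), f))
        comm[strcspn(comm, "\n")] = '\0';
    fclose(f);
    return strcmp(comm, "envoy") == 0;
}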
Probing `tcp_recvmsg`
The tcp_recvmsg probe is more complex. We need to capture the acknowledgment number to correlate it with a prior send.
// tcplat.bpf.c (continued)
SEC("kprobe/tcp_recvmsg")
int BPF_KPROBE(tcp_recvmsg, struct sock *sk, struct msghdr *msg, size_t len, int nonblock, int flags, int *addr_len)
{
u64 cgroup_id = bpf_get_current_cgroup_id();
if (target_cgroup && cgroup_id != target_cgroup) {
return 0;
}
u32 pid = bpf_get_current_pid_tgid() >> 32;
u64 sk_addr = (u64)sk;
bpf_map_update_elem(&sock_pids, &sk_addr, &pid, BPF_ANY);
// Get socket details
u16 sport = BPF_CORE_READ(sk, __sk_common.skc_num);
u16 dport = BPF_CORE_READ(sk, __sk_common.skc_dport);
dport = __bpf_ntohs(dport);
// Get TCP acknowledgment sequence number
struct tcp_sock *ts = (struct tcp_sock *)sk;
u32 ack_seq = BPF_CORE_READ(ts, rcv_nxt);
// Populate and submit the event
struct event *e;
e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
if (!e) {
return 0;
}
e->sk = sk_addr;
e->ts_us = bpf_ktime_get_ns() / 1000;
e->pid = pid;
e->sport = sport;
e->dport = dport;
e->is_send = 0;
e->tcp_seq = 0; // Not relevant for recv
e->tcp_ack_seq = ack_seq;
bpf_ringbuf_submit(e, 0);
return 0;
}
char LICENSE[] SEC("license") = "GPL";
Here, we capture rcv_nxt, which is the sequence number the kernel expects to receive next, effectively acting as an acknowledgment for data received up to that point.
User-Space Agent: Correlation and Metrics Export
The user-space agent has several responsibilities:
- Parse command-line arguments (e.g., target pod name).
- Find the cgroup ID of the target pod.
- Load and attach the eBPF program using libbpf.
- Poll the ring buffer for events.
- Correlate send and recv events to calculate latency.
- Aggregate latencies into a histogram.
- Periodically expose metrics for Prometheus scraping.
Below is a conceptual C implementation for the core logic. A production agent might be written in Go or Rust for better concurrency and ecosystem support, but C demonstrates the direct libbpf interaction.
Loading and Setup
// tcplat.c (simplified)
#include <stdio.h>
#include <stdbool.h>
#include <errno.h>
#include <unistd.h>
#include <bpf/libbpf.h>
#include "tcplat.skel.h"
// Event struct must match the one in the BPF program
struct event {
__u64 sk;
__u64 ts_us;
__u32 pid;
__u32 tcp_seq;
__u32 tcp_ack_seq;
__u16 sport;
__u16 dport;
__u8 is_send;
};
// A simple hash table to store pending send events
// In production, use a more robust hash table implementation
#define MAX_PENDING 1024
struct pending_send {
__u64 sk;
__u32 seq;
__u64 ts_us;
};
struct pending_send pending_sends[MAX_PENDING];
int handle_event(void *ctx, void *data, size_t data_sz) {
const struct event *e = data;
if (e->is_send) {
// Store send event, keyed by socket + sequence number
// This is a simplified example. A real implementation needs a proper hash map.
for (int i=0; i < MAX_PENDING; i++) {
if (pending_sends[i].sk == 0) { // Find empty slot
pending_sends[i].sk = e->sk;
pending_sends[i].seq = e->tcp_seq;
pending_sends[i].ts_us = e->ts_us;
break;
}
}
} else { // is_recv
// Find matching send event
for (int i=0; i < MAX_PENDING; i++) {
// A recv with ack_seq=X correlates to a send with seq=X
if (pending_sends[i].sk == e->sk && pending_sends[i].seq == e->tcp_ack_seq) {
__u64 latency_us = e->ts_us - pending_sends[i].ts_us;
printf("Latency for PID %u on sk 0x%llx: %llu us\n", e->pid, e->sk, latency_us);
// TODO: Add latency_us to a histogram
// Clear the pending send slot
pending_sends[i].sk = 0;
break;
}
}
}
return 0;
}
int main(int argc, char **argv) {
struct tcplat_bpf *skel;
struct ring_buffer *rb = NULL;
int err;
// ... libbpf setup, error handling ...
skel = tcplat_bpf__open();
if (!skel) { /* handle error */ }
// --- ADVANCED: Set target cgroup ID ---
// In a real agent, you would query the Kubernetes API or CRI
// to get the cgroup path for a pod, then read the cgroup ID.
// For example: cat /sys/fs/cgroup/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podXXX.slice/cri-containerd-YYY.scope/cgroup.procs
// Then find the cgroup inode number.
// skel->rodata->target_cgroup = get_cgroup_id_for_pod("my-pod");
err = tcplat_bpf__load(skel);
if (err) { /* handle error */ }
err = tcplat_bpf__attach(skel);
if (err) { /* handle error */ }
rb = ring_buffer__new(bpf_map__fd(skel->maps.rb), handle_event, NULL, NULL);
if (!rb) { /* handle error */ }
while (true) {
err = ring_buffer__poll(rb, 100 /* timeout, ms */);
if (err == -EINTR) {
err = 0;
break;
}
if (err < 0) {
printf("Error polling ring buffer: %d\n", err);
break;
}
// TODO: Periodically export metrics from histogram
}
tcplat_bpf__destroy(skel);
return -err;
}
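The TODO comments above gesture at a latency histogram. A minimal sketch is shown below; the bucket boundaries (in microseconds) and the struct and function names are illustrative choices, not part of libbpf. handle_event() would call hist_observe() instead of printf().
// Sketch: a fixed-bucket latency histogram the agent can update from
// handle_event() and export on scrape. Counts are kept per bucket; the
// Prometheus exporter later accumulates them into cumulative "le" buckets.
#include <stddef.h>
#include <stdint.h>

#define NUM_BUCKETS 8
static const uint64_t bucket_bounds_us[NUM_BUCKETS] = {
    50, 100, 250, 500, 1000, 2500, 5000, 10000
};

struct latency_hist {
    uint64_t buckets[NUM_BUCKETS + 1]; // last slot is the +Inf bucket
    uint64_t sum_us;
    uint64_t count;
};

static void hist_observe(struct latency_hist *h, uint64_t latency_us)
{
    size_t i;

    for (i = 0; i < NUM_BUCKETS; i++) {
        if (latency_us <= bucket_bounds_us[i])
            break;
    }
    h->buckets[i]++; // i == NUM_BUCKETS means latency > largest bound
    h->sum_us += latency_us;
    h->count++;
}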
The Correlation Logic: A Deeper Look
The most complex part is handle_event. The logic relies on the fundamental TCP property that the acknowledgment number (ack_seq) in a receiver's segment indicates the next sequence number (seq) it expects from the sender.
* When we see a send event, we record its sk (socket address), tcp_seq, and timestamp.
* When we see a recv event on the same socket (sk), we look for a stored send event where the send's tcp_seq matches the recv's tcp_ack_seq.
* When a match is found, the latency is recv_timestamp - send_timestamp.
Edge Cases: This simple correlation works conceptually for basic request/response patterns, but it glosses over several details. The application and the sidecar read from different sockets of the same connection, so a production correlator should match the peer socket via the mirrored source/destination port pair rather than an identical sk. Likewise, because rcv_nxt advances by the length of the received data, an exact seq == ack_seq match rarely holds; comparing the receiver's rcv_nxt against the send's starting sequence number plus its length is more robust, and TCP windowing and delayed ACKs add further ambiguity. For HTTP/2 or gRPC, where multiple streams are multiplexed over a single TCP connection, this TCP-level correlation measures the latency of raw data chunks, not logical application-level messages. A more advanced solution might require attaching uprobes to the gRPC or Envoy libraries to understand message boundaries, but this significantly increases complexity and reduces portability. For measuring raw sidecar proxying overhead, the TCP-level analysis is a powerful and less intrusive starting point.
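As a sketch of the more robust matching described above (illustrative types and names; it assumes the send event additionally records the sequence number just past the last byte written, i.e. write_seq plus the size argument of tcp_sendmsg):
// Sketch: a wraparound-safe check that a recv event acknowledges a pending
// send on the peer socket of the same app<->sidecar connection. Ports are
// mirrored between the two endpoints; seq_end is the send's starting
// sequence number plus its length.
#include <stdbool.h>
#include <stdint.h>

struct pending {
    uint16_t sport, dport; // as seen from the sending socket
    uint32_t seq_end;      // first sequence number after the sent bytes
};

static bool recv_matches_send(const struct pending *p,
                              uint16_t recv_sport, uint16_t recv_dport,
                              uint32_t recv_ack_seq)
{
    // The receiver's local port is the sender's destination port, and vice versa.
    if (recv_sport != p->dport || recv_dport != p->sport)
        return false;
    // Sequence numbers wrap at 2^32; the signed difference handles wraparound.
    return (int32_t)(recv_ack_seq - p->seq_end) >= 0;
}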
Performance Considerations and Production Deployment
Overhead Analysis
The performance impact of this solution is exceptionally low:
* CPU Overhead: Each eBPF program execution is measured in nanoseconds. The kprobes on tcp_sendmsg/tcp_recvmsg sit on the networking hot path, but their simplicity (a few reads, a map update, and a ring buffer reserve/submit) keeps the overhead to a minimum, typically under 1% CPU even under high network load. This is orders of magnitude less than the cost of the sidecar proxy itself.
* Memory Overhead: The sock_pids hash map and the ring buffer consume a pre-allocated, fixed amount of non-swappable kernel memory. A 256KB ring buffer and a map for 8192 connections consume a trivial amount of memory per node.
Deployment in Kubernetes
This monitoring agent should be deployed as a DaemonSet in your Kubernetes cluster. This ensures an instance of the agent runs on every node, capable of monitoring any pod scheduled on it.
The DaemonSet's pod spec requires elevated privileges:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ebpf-sidecar-monitor
  namespace: kube-system
spec:
  # ... selector, template ...
  template:
    spec:
      hostPID: true
      containers:
      - name: monitor-agent
        image: my-registry/ebpf-monitor-agent:latest
        securityContext:
          privileged: true # or specific capabilities
          capabilities:
            add:
            - SYS_ADMIN
            - BPF
        volumeMounts:
        - name: bpf-fs
          mountPath: /sys/fs/bpf
        - name: cgroup-fs
          mountPath: /sys/fs/cgroup
          readOnly: true
      volumes:
      - name: bpf-fs
        hostPath:
          path: /sys/fs/bpf
      - name: cgroup-fs
        hostPath:
          path: /sys/fs/cgroup
* Privileges: CAP_BPF and CAP_PERFMON (or the broader CAP_SYS_ADMIN, which older kernels require) are needed to load the eBPF programs, attach the kprobes, and interact with the BPF filesystem.
* Filesystem Mounts: Access to /sys/fs/bpf and /sys/fs/cgroup from the host is necessary.
* HostPID: hostPID: true is needed so the agent can see all processes on the host to map PIDs to pods if necessary.
Integrating with Prometheus
The user-space agent should expose an HTTP endpoint (e.g., /metrics) for Prometheus to scrape. The agent would maintain a histogram of the calculated latencies (e.g., using a library like HdrHistogram_c). On scrape, it would format this data into the Prometheus histogram format.
Example Prometheus Metrics:
# HELP sidecar_app_to_proxy_latency_us Latency from application write to sidecar read
# TYPE sidecar_app_to_proxy_latency_us histogram
sidecar_app_to_proxy_latency_us_bucket{pod="my-app-pod-1", namespace="default", le="100"} 23
sidecar_app_to_proxy_latency_us_bucket{pod="my-app-pod-1", namespace="default", le="500"} 150
# ... other buckets
sidecar_app_to_proxy_latency_us_sum{pod="my-app-pod-1", namespace="default"} 123456
sidecar_app_to_proxy_latency_us_count{pod="my-app-pod-1", namespace="default"} 180
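A sketch of how the agent might render that output from the in-memory histogram described earlier (again with illustrative names; any embedded HTTP server, or even a plain socket, can serve the resulting buffer on /metrics). Note that the text format requires cumulative bucket counts:
// Sketch: render the illustrative latency_hist from the earlier sketch in the
// Prometheus text exposition format. Buckets are emitted cumulatively, as the
// format requires; buffer truncation handling is omitted for brevity.
#include <stdint.h>
#include <stdio.h>

#define NUM_BUCKETS 8
static const uint64_t bucket_bounds_us[NUM_BUCKETS] = {
    50, 100, 250, 500, 1000, 2500, 5000, 10000
};

struct latency_hist {
    uint64_t buckets[NUM_BUCKETS + 1];
    uint64_t sum_us;
    uint64_t count;
};

static int render_histogram(char *buf, size_t buf_sz, const struct latency_hist *h,
                            const char *pod, const char *ns)
{
    const char *name = "sidecar_app_to_proxy_latency_us";
    uint64_t cumulative = 0;
    int off = 0, i;

    off += snprintf(buf + off, buf_sz - off, "# TYPE %s histogram\n", name);
    for (i = 0; i < NUM_BUCKETS; i++) {
        cumulative += h->buckets[i];
        off += snprintf(buf + off, buf_sz - off,
                        "%s_bucket{pod=\"%s\",namespace=\"%s\",le=\"%llu\"} %llu\n",
                        name, pod, ns, (unsigned long long)bucket_bounds_us[i],
                        (unsigned long long)cumulative);
    }
    cumulative += h->buckets[NUM_BUCKETS];
    off += snprintf(buf + off, buf_sz - off,
                    "%s_bucket{pod=\"%s\",namespace=\"%s\",le=\"+Inf\"} %llu\n",
                    name, pod, ns, (unsigned long long)cumulative);
    off += snprintf(buf + off, buf_sz - off, "%s_sum{pod=\"%s\",namespace=\"%s\"} %llu\n",
                    name, pod, ns, (unsigned long long)h->sum_us);
    off += snprintf(buf + off, buf_sz - off, "%s_count{pod=\"%s\",namespace=\"%s\"} %llu\n",
                    name, pod, ns, (unsigned long long)h->count);
    return off;
}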
The agent would need to enrich the events with Kubernetes metadata (pod name, namespace) by querying the Kubelet API on the node or watching the main Kubernetes API server.
Using this data, you can build powerful Grafana dashboards and precise alerts based on P99 sidecar latency, as measured from the kernel, giving you a complete and accurate picture of your service mesh's performance.
Conclusion
Standard service mesh observability tooling provides a valuable but incomplete view of performance. The latency introduced by the kernel's network stack represents a significant blind spot that can hide the true cost of a sidecar proxy. By leveraging eBPF with libbpf and CO-RE, senior engineers can build highly efficient, portable, and non-intrusive monitoring tools that close this observability gap.
This kernel-level approach, tracing TCP socket operations and correlating them with sequence numbers, provides a ground-truth measurement of the latency between an application and its sidecar. While the implementation requires a deep understanding of both kernel internals and user-space tooling, the resulting data fidelity is unparalleled, enabling precise performance tuning, capacity planning, and robust SLOs for critical, mesh-enabled infrastructure.