eBPF for Zero-Overhead Kubernetes Network Observability

Goh Ling Yong
Technology enthusiast and software architect specializing in AI-driven development tools and modern software engineering practices. Passionate about the intersection of artificial intelligence and human creativity in building tomorrow's digital solutions.

The Observability Tax: Moving Beyond the Sidecar

For years, the service mesh sidecar pattern, built around proxies such as Envoy (used by Istio) or Linkerd's linkerd2-proxy, has been the de facto standard for achieving deep network observability in Kubernetes. It's a powerful model that provides rich L7 metrics, distributed tracing, and mutual TLS without requiring application code changes. However, this power comes at a significant cost—the "observability tax."

Every pod running a sidecar proxy incurs non-trivial CPU and memory overhead. More critically, every network packet to and from the application container must traverse the user-space proxy, adding latency to the request path. In a complex microservices architecture with dozens of service hops, this latency accumulates. Scaling this model means scaling the number of proxies, leading to significant resource consumption and operational complexity across the cluster.

This is where eBPF (extended Berkeley Packet Filter) presents a paradigm shift. By running sandboxed programs directly within the Linux kernel, eBPF allows us to observe network traffic at its source, eliminating the need for per-pod user-space proxies. This article is not an introduction to eBPF; it's a guide for senior engineers on the architectural patterns and implementation details required to build a production-viable, eBPF-based network observability agent for Kubernetes.

We will dissect the data path, build a functional traffic tracer in C and Go, tackle the critical challenge of correlating kernel data with Kubernetes metadata, and explore advanced techniques for inspecting encrypted traffic.

Architectural Contrast: Sidecar vs. eBPF Data Path

To appreciate the performance gains, it's essential to visualize the difference in data flow.

Sidecar-based Data Path:

A request from Pod A to Pod B in a typical service mesh follows a convoluted path:

  • The application in Pod A sends a request to service-b.default.svc.cluster.local.
  • The request is intercepted by iptables rules in the pod's network namespace.
  • iptables redirects the traffic to the Envoy sidecar proxy running within the same pod.
  • Envoy (a user-space process) receives the request, processes it (for metrics, tracing, policy), and then initiates a new connection to the destination.
  • The new connection goes back through the pod's network stack and out to the node's root network namespace.
  • The packet is routed to Pod B's node, where it's again intercepted by Pod B's Envoy sidecar before finally reaching the application.

This involves multiple context switches between user space and kernel space, plus additional full TCP stack traversals at each proxy hop, for a single logical request.

    eBPF-based Data Path:

    With an eBPF agent running as a DaemonSet on each node:

    • Application in Pod A sends a request to Pod B.
    • The packet traverses the kernel's networking stack as usual.
    • An eBPF program, attached to a kernel hook (e.g., a Traffic Control hook or a kernel function probe on socket calls), executes.
    • The eBPF program observes the packet/data, records relevant metadata (source/destination IP/port, etc.), and sends this information to a user-space agent via a high-performance ring buffer.
    • The original packet continues on its path to Pod B, completely unmodified and without leaving the kernel for observability purposes.

    The key distinction is observation without interception. The eBPF program acts as a tap, not a proxy. This fundamentally reduces latency, eliminates per-pod resource overhead, and simplifies the networking path.

    Implementing a Kernel-Level Traffic Tracer

    Let's build the core components of our observability agent. We need two parts: an eBPF program written in C that runs in the kernel, and a user-space controller written in Go that loads the eBPF program and processes the data it collects.

    Choosing the Right eBPF Hooks

    Our goal is to capture TCP connection data. Several hook points are viable, each with trade-offs:

* Traffic Control (TC): eBPF programs can be attached to the TC ingress/egress hooks. This is ideal for packet-level analysis, as it sees every sk_buff. It's powerful for L3/L4 metrics but requires parsing TCP segments to reconstruct L7 data (a minimal sketch follows this list).

    * Socket Operations: Attaching to BPF_PROG_TYPE_SOCK_OPS allows you to trigger eBPF programs on socket events like state changes. Useful for tracking connection lifecycles.

* Kernel Probes (kprobes): These allow us to instrument almost any function in the kernel. By attaching to kernel functions in the socket path, such as tcp_sendmsg and tcp_recvmsg, we can inspect data as it is sent or received by an application. This is a powerful way to get at application-level data buffers.
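
To make the TC option concrete, here is a minimal, illustrative sketch of a classifier attached at a TC hook. It is standalone (not part of the agent built below), only validates the Ethernet and IPv4 headers before logging TCP packets, and assumes the same vmlinux.h/libbpf toolchain used in Code Example 1. The UAPI constants are defined manually because vmlinux.h does not carry macros.

c
// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
// Minimal TC classifier sketch: observe IPv4/TCP packets without touching them.
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

#define TC_ACT_OK   0      // verdict: let the packet continue unchanged
#define ETH_P_IP    0x0800 // IPv4 ethertype
#define IPPROTO_TCP 6

SEC("tc")
int observe_tcp(struct __sk_buff *skb)
{
    void *data     = (void *)(long)skb->data;
    void *data_end = (void *)(long)skb->data_end;

    // Bounds checks are mandatory: the verifier rejects unchecked accesses.
    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return TC_ACT_OK;
    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return TC_ACT_OK;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return TC_ACT_OK;

    if (ip->protocol == IPPROTO_TCP) {
        // A real agent would record L3/L4 metadata into a ring buffer here,
        // as the kprobe-based example below does.
        bpf_printk("tc: tcp packet saddr=%x daddr=%x",
                   bpf_ntohl(ip->saddr), bpf_ntohl(ip->daddr));
    }

    return TC_ACT_OK;
}

char LICENSE[] SEC("license") = "GPL";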

    For our purposes, kprobes on socket functions provide the best balance, giving us access to the data buffers just before they are segmented into TCP packets.

    Code Example 1: The eBPF C Program (`bpf_tracer.c`)

    We will use libbpf and the CO-RE (Compile Once – Run Everywhere) pattern to ensure our program is portable across different kernel versions. The program will:

  • Use kprobes to trace tcp_connect and tcp_close to track connection lifecycles.
  • Use a kretprobe on inet_csk_accept to capture incoming connections.
  • Use a kprobe on tcp_sendmsg to capture outbound data events.
  • Use a BPF_MAP_TYPE_HASH to store active connection information.
  • Use a BPF_MAP_TYPE_RINGBUF to send events efficiently to our user-space Go application.

c
    // SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>
    #include <bpf/bpf_core_read.h>
    
    // Event structure sent to user-space
    struct event {
        u32 pid;
        u32 saddr;
        u32 daddr;
        u16 sport;
        u16 dport;
        u8 comm[16];
        u8 event_type; // 1: connect, 2: accept, 3: send, 4: close
    };
    
    // Force emitting struct event into the ELF
    struct event *unusedevent __attribute__((unused));
    
    // Ring buffer for sending events to user-space
    struct {
        __uint(type, BPF_MAP_TYPE_RINGBUF);
        __uint(max_entries, 256 * 1024); // 256 KB
    } rb SEC(".maps");
    
    // Hash map to store information about active connections
    struct sock_key {
        u32 saddr;
        u32 daddr;
        u16 sport;
        u16 dport;
    };
    
    struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 8192);
        __type(key, struct sock *);
        __type(value, struct sock_key);
    } active_conns SEC(".maps");
    
    // Helper to submit an event to user-space
    static __always_inline void submit_event(void *ctx, struct sock *sk, u8 type) {
        struct event *e;
        e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
        if (!e) {
            return;
        }
    
        struct sock_key key = {};
        BPF_CORE_READ_INTO(&key.saddr, &sk->__sk_common, skc_rcv_saddr);
        BPF_CORE_READ_INTO(&key.daddr, &sk->__sk_common, skc_daddr);
        BPF_CORE_READ_INTO(&key.sport, &sk->__sk_common, skc_num);
        BPF_CORE_READ_INTO(&key.dport, &sk->__sk_common, skc_dport);
        key.dport = bpf_ntohs(key.dport); // dport is network byte order
    
        e->saddr = key.saddr;
        e->daddr = key.daddr;
        e->sport = key.sport;
        e->dport = key.dport;
        e->pid = bpf_get_current_pid_tgid() >> 32;
        e->event_type = type;
        bpf_get_current_comm(&e->comm, sizeof(e->comm));
    
        bpf_ringbuf_submit(e, 0);
    }
    
    SEC("kprobe/tcp_connect")
    int BPF_KPROBE(tcp_connect, struct sock *sk)
    {
        submit_event(ctx, sk, 1);
        return 0;
    }
    
    SEC("kretprobe/inet_csk_accept")
    int BPF_KRETPROBE(inet_csk_accept_ret, struct sock *newsk)
    {
        if (newsk == NULL) {
            return 0;
        }
        submit_event(ctx, newsk, 2);
        return 0;
    }
    
    SEC("kprobe/tcp_sendmsg")
    int BPF_KPROBE(tcp_sendmsg, struct sock *sk, struct msghdr *msg, size_t size)
    {
        // In a real implementation, you might capture some of the `msg` data here.
        // For this example, we just log the send event.
        submit_event(ctx, sk, 3);
        return 0;
    }
    
    SEC("kprobe/tcp_close")
    int BPF_KPROBE(tcp_close, struct sock *sk)
    {
        submit_event(ctx, sk, 4);
        return 0;
    }
    
    char LICENSE[] SEC("license") = "GPL";

    Key Points in the C Code:

* vmlinux.h: This header is generated with bpftool from the running kernel's BTF information (e.g., bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h) and contains kernel type definitions, which is essential for CO-RE. It allows us to access kernel structs like struct sock in a portable way.

    * BPF_CORE_READ_INTO: This helper macro is used for CO-RE. It safely reads fields from kernel structs, even if the struct layout changes between kernel versions.

* bpf_ringbuf_reserve/submit: The modern, preferred way to send data to user-space. Compared to the older per-CPU bpf_perf_event_output, the ring buffer is more memory-efficient and preserves event ordering across CPUs, though events can still be dropped if the buffer fills faster than user-space drains it.

    * We are tracing key TCP functions to get a complete picture of a connection's lifecycle.

    The User-space Controller: Ingesting Data and Adding Context

    The kernel program provides raw, low-level data (IPs, ports, PIDs). This data is useless for Kubernetes observability without context. A pod's IP can change, and we need to know which Service, Deployment, and Namespace it belongs to.

    This is the primary job of our user-space Go controller: to enrich the eBPF data with Kubernetes metadata.

    Code Example 2: The Go User-space Controller (`main.go`)

    This Go program will use the cilium/ebpf library to load and manage the eBPF program and the official Kubernetes Go client to watch for cluster changes.

    go
    package main
    
    import (
    	"bytes"
    	"context"
    	"encoding/binary"
    	"fmt"
    	"log"
    	"net"
    	"os"
    	"os/signal"
    	"sync"
    	"syscall"
    
    	"github.com/cilium/ebpf/link"
    	"github.com/cilium/ebpf/ringbuf"
    	"github.com/cilium/ebpf/rlimit"
    	"k8s.io/api/core/v1"
    	"k8s.io/apimachinery/pkg/fields"
    	"k8s.io/client-go/kubernetes"
    	"k8s.io/client-go/rest"
    	"k8s.io/client-go/tools/cache"
    )
    
    //go:generate go run github.com/cilium/ebpf/cmd/bpf2go bpf bpf_tracer.c -- -I./headers
    
    // Event mirrors the C struct
    type Event struct {
    	Pid      uint32
    	Saddr    uint32
    	Daddr    uint32
    	Sport    uint16
    	Dport    uint16
    	Comm     [16]byte
    	EventType uint8
    }
    
    // K8sMetadataCache holds the mapping from IP to Pod info
    type K8sMetadataCache struct {
    	mu   sync.RWMutex
    	pods map[string]PodInfo
    }
    
    type PodInfo struct {
    	Name      string
    	Namespace string
    	NodeName  string
    }
    
    func (c *K8sMetadataCache) Get(ip string) (PodInfo, bool) {
    	c.mu.RLock()
    	defer c.mu.RUnlock()
    	info, found := c.pods[ip]
    	return info, found
    }
    
    func (c *K8sMetadataCache) Run(ctx context.Context) {
    	config, err := rest.InClusterConfig()
    	if err != nil {
    		log.Fatalf("Failed to get in-cluster config: %v", err)
    	}
    	clientset, err := kubernetes.NewForConfig(config)
    	if err != nil {
    		log.Fatalf("Failed to create clientset: %v", err)
    	}
    
    	podListWatcher := cache.NewListWatchFromClient(
    		clientset.CoreV1().RESTClient(),
    		"pods",
    		v1.NamespaceAll,
    		fields.Everything(),
    	)
    
    	_, controller := cache.NewInformer(
    		podListWatcher,
    		&v1.Pod{},
    		0, // resync period
    		cache.ResourceEventHandlerFuncs{
    			AddFunc: func(obj interface{}) {
    				pod := obj.(*v1.Pod)
    				c.mu.Lock()
    				defer c.mu.Unlock()
    				if pod.Status.PodIP != "" {
    					c.pods[pod.Status.PodIP] = PodInfo{Name: pod.Name, Namespace: pod.Namespace, NodeName: pod.Spec.NodeName}
    				}
    			},
    			UpdateFunc: func(oldObj, newObj interface{}) {
    				pod := newObj.(*v1.Pod)
    				c.mu.Lock()
    				defer c.mu.Unlock()
    				if pod.Status.PodIP != "" {
    					c.pods[pod.Status.PodIP] = PodInfo{Name: pod.Name, Namespace: pod.Namespace, NodeName: pod.Spec.NodeName}
    				}
    			},
    			DeleteFunc: func(obj interface{}) {
    				pod := obj.(*v1.Pod)
    				c.mu.Lock()
    				defer c.mu.Unlock()
    				if pod.Status.PodIP != "" {
    					delete(c.pods, pod.Status.PodIP)
    				}
    			},
    		},
    	)
    
    	log.Println("Starting Kubernetes metadata cache...")
    	go controller.Run(ctx.Done())
    }
    
    func main() {
    	ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
    	defer stop()
    
    	if err := rlimit.RemoveMemlock(); err != nil {
    		log.Fatal(err)
    	}
    
    	objs := bpfObjects{}
    	if err := loadBpfObjects(&objs, nil); err != nil {
    		log.Fatalf("loading objects: %v", err)
    	}
    	defer objs.Close()
    
    	// Attach kprobes
    	kpConnect, err := link.Kprobe("tcp_connect", objs.TcpConnect, nil)
    	if err != nil {
    		log.Fatalf("attaching tcp_connect: %v", err)
    	}
    	defer kpConnect.Close()
    
    	kpClose, err := link.Kprobe("tcp_close", objs.TcpClose, nil)
    	if err != nil {
    		log.Fatalf("attaching tcp_close: %v", err)
    	}
    	defer kpClose.Close()
    
    	kpSend, err := link.Kprobe("tcp_sendmsg", objs.TcpSendmsg, nil)
    	if err != nil {
    		log.Fatalf("attaching tcp_sendmsg: %v", err)
    	}
    	defer kpSend.Close()
    
    	kretAccept, err := link.Kretprobe("inet_csk_accept", objs.InetCskAcceptRet, nil)
    	if err != nil {
    		log.Fatalf("attaching inet_csk_accept: %v", err)
    	}
    	defer kretAccept.Close()
    
    	rd, err := ringbuf.NewReader(objs.Rb)
    	if err != nil {
    		log.Fatalf("opening ringbuf reader: %s", err)
    	}
	defer rd.Close()

	// Close the ring buffer reader when the context is cancelled so the
	// blocking rd.Read() call below returns and the program exits cleanly.
	go func() {
		<-ctx.Done()
		rd.Close()
	}()
    
    	// Start K8s metadata cache
    	k8sCache := &K8sMetadataCache{pods: make(map[string]PodInfo)}
    	k8sCache.Run(ctx)
    
    	log.Println("Waiting for events...")
    
    	for {
    		select {
    		case <-ctx.Done():
    			return
    		default:
			record, err := rd.Read()
			if err != nil {
				if errors.Is(err, ringbuf.ErrClosed) {
					return
				}
				log.Printf("reading from ring buffer: %s", err)
				continue
			}
    
    			var event Event
    			if err := binary.Read(bytes.NewBuffer(record.RawSample), binary.LittleEndian, &event); err != nil {
    				log.Printf("parsing ringbuf event: %s", err)
    				continue
    			}
    
    			// Enrich with K8s metadata
    			srcIP := intToIP(event.Saddr).String()
    			dstIP := intToIP(event.Daddr).String()
    			srcInfo, srcFound := k8sCache.Get(srcIP)
    			dstInfo, dstFound := k8sCache.Get(dstIP)
    
    			fmt.Printf("[%s] PID: %d, Comm: %s, %s:%d (%s) -> %s:%d (%s)\n",
    				getEventType(event.EventType),
    				event.Pid,
    				string(event.Comm[:bytes.IndexByte(event.Comm[:], 0)]),
    				srcIP, event.Sport,
    				getPodName(srcInfo, srcFound),
    				dstIP, event.Dport,
    				getPodName(dstInfo, dstFound),
    			)
    		}
    	}
    }
    
    // Helper functions
    func intToIP(ipNum uint32) net.IP {
    	ip := make(net.IP, 4)
    	binary.LittleEndian.PutUint32(ip, ipNum)
    	return ip
    }
    
    func getEventType(t uint8) string {
    	switch t {
    	case 1: return "CONNECT"
    	case 2: return "ACCEPT"
    	case 3: return "SEND"
    	case 4: return "CLOSE"
    	default: return "UNKNOWN"
    	}
    }
    
    func getPodName(info PodInfo, found bool) string {
    	if !found {
    		return "<external>"
    	}
    	return fmt.Sprintf("%s/%s", info.Namespace, info.Name)
    }

    Key Points in the Go Code:

    * bpf2go: This command from the cilium/ebpf library is crucial. It compiles the C code, generates Go bindings for the eBPF maps and programs, and embeds the compiled eBPF bytecode into the Go binary. This creates a self-contained agent.

    * K8sMetadataCache: This is the brain of the enrichment process. It uses a client-go Informer to efficiently watch for Pod events from the Kubernetes API server. It maintains an in-memory map of Pod IP -> Pod Metadata. This is far more efficient than querying the API for every event.

    * The Enrichment Loop: The main loop reads events from the ring buffer, parses the binary data, and then uses the source and destination IPs to look up metadata in the K8sMetadataCache. This transforms raw network tuples into meaningful, Kubernetes-aware observability data.

    Advanced Challenge: Inspecting Encrypted TLS/HTTPS Traffic

    Our current kprobe on tcp_sendmsg sees the raw TCP payload. For HTTPS traffic, this payload is encrypted. Kernel-level probes are blind to the application-level data (e.g., HTTP headers, gRPC method names).

    This is a significant limitation. The production-grade solution is to move our probes from the kernel's TCP layer up to the user-space encryption libraries. We can use uprobes (user-space probes) to instrument functions within shared libraries like OpenSSL or GnuTLS.

    The most common functions to trace are SSL_read and SSL_write. By attaching a uprobe to the return of SSL_read and the entry of SSL_write, we can access the memory buffers containing the unencrypted, plaintext data.

    Code Example 3: eBPF C Program with Uprobes for OpenSSL

    Let's modify our eBPF program to include a uprobe. Note that attaching uprobes is more complex as you need to specify the path to the shared library.

    c
    // Add this to your bpf_tracer.c
    
    // Event for plaintext data
    struct data_event {
        u32 pid;
        u64 len;
        u8 data[256]; // Capture a sample of the data
    };
    
    struct {
        __uint(type, BPF_MAP_TYPE_RINGBUF);
        __uint(max_entries, 256 * 1024);
    } data_rb SEC(".maps");
    
    // Uprobe for SSL_write - captures plaintext before encryption
    SEC("uprobe//usr/lib/x86_64-linux-gnu/libssl.so.1.1:SSL_write")
    int BPF_UPROBE(probe_ssl_write, void *ssl, const void *buf, int num) {
        struct data_event *e;
        e = bpf_ringbuf_reserve(&data_rb, sizeof(*e), 0);
        if (!e) {
            return 0;
        }
    
        e->pid = bpf_get_current_pid_tgid() >> 32;
        e->len = num;
        
        // Safely read a portion of the user-space buffer
        bpf_probe_read_user(&e->data, sizeof(e->data), buf);
    
        bpf_ringbuf_submit(e, 0);
        return 0;
    }
    
    // Uretprobe for SSL_read - captures plaintext after decryption
    SEC("uretprobe//usr/lib/x86_64-linux-gnu/libssl.so.1.1:SSL_read")
    int BPF_URETPROBE(probe_ssl_read_ret, int ret) {
        // The buffer is passed as an argument, but accessing it on return is complex.
        // A common pattern is to store the buffer pointer in a map on entry
        // and retrieve it on exit. For simplicity, we'll omit the full implementation here.
        // This highlights the advanced nature of uprobe programming.
        return 0;
    }
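
The stub above can be completed with a common entry/exit pattern: an entry uprobe stashes the destination buffer pointer in a hash map keyed by pid_tgid, and the uretprobe reads the plaintext once SSL_read has returned. The sketch below is illustrative rather than definitive; the probe names are hypothetical and the hard-coded libssl path is the same assumption made above. It reuses the data_event struct and data_rb ring buffer already defined, and replaces the probe_ssl_read_ret stub.

c
// Hypothetical completion of the SSL_read stub: remember the destination
// buffer on entry, read the decrypted bytes on return.
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 8192);
    __type(key, u64);   // pid_tgid
    __type(value, u64); // user-space buffer pointer passed to SSL_read
} ssl_read_args SEC(".maps");

SEC("uprobe//usr/lib/x86_64-linux-gnu/libssl.so.1.1:SSL_read")
int BPF_UPROBE(probe_ssl_read_entry, void *ssl, void *buf, int num) {
    u64 id = bpf_get_current_pid_tgid();
    u64 bufp = (u64)buf;
    bpf_map_update_elem(&ssl_read_args, &id, &bufp, BPF_ANY);
    return 0;
}

SEC("uretprobe//usr/lib/x86_64-linux-gnu/libssl.so.1.1:SSL_read")
int BPF_URETPROBE(probe_ssl_read_exit, int ret) {
    u64 id = bpf_get_current_pid_tgid();
    u64 *bufp = bpf_map_lookup_elem(&ssl_read_args, &id);
    if (bufp && ret > 0) {
        struct data_event *e = bpf_ringbuf_reserve(&data_rb, sizeof(*e), 0);
        if (e) {
            e->pid = id >> 32;
            e->len = ret; // SSL_read returns the number of decrypted bytes
            // Safely copy a sample of the decrypted data from user-space memory.
            bpf_probe_read_user(&e->data, sizeof(e->data), (const void *)*bufp);
            bpf_ringbuf_submit(e, 0);
        }
    }
    bpf_map_delete_elem(&ssl_read_args, &id);
    return 0;
}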

    Production Considerations for Uprobes:

    * Library Path: The path to libssl.so can vary between distributions and container images. A production agent needs to dynamically find this path for each process it wants to instrument.

    * Symbol Offsets: The function name SSL_write is a symbol. The agent must resolve this symbol to a memory offset within the library for the target process.

    * Argument Passing: Different CPU architectures have different calling conventions for passing function arguments (e.g., in registers vs. on the stack). The eBPF program must correctly read these arguments.

    * Data Correlation: Correlating the data from SSL_write with the underlying TCP connection from our kprobes requires passing connection identifiers between the probes, typically using a pid_tgid keyed eBPF map.
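
A minimal sketch of such a correlation map, assuming it lives in the same bpf_tracer.c alongside the kprobes and uprobes above (the map and helper names here are illustrative):

c
// Hypothetical correlation map keyed by pid_tgid: the tcp_sendmsg kprobe
// records the connection 4-tuple for the calling thread, and the SSL_write
// uprobe looks it up so plaintext events can be tagged with the connection.
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 8192);
    __type(key, u64);               // pid_tgid
    __type(value, struct sock_key); // 4-tuple captured by the kprobe
} thread_conns SEC(".maps");

// Call from BPF_KPROBE(tcp_sendmsg, ...) with the struct sock it receives.
static __always_inline void record_thread_conn(struct sock *sk) {
    u64 id = bpf_get_current_pid_tgid();
    struct sock_key key = {};
    BPF_CORE_READ_INTO(&key.saddr, &sk->__sk_common, skc_rcv_saddr);
    BPF_CORE_READ_INTO(&key.daddr, &sk->__sk_common, skc_daddr);
    BPF_CORE_READ_INTO(&key.sport, &sk->__sk_common, skc_num);
    BPF_CORE_READ_INTO(&key.dport, &sk->__sk_common, skc_dport);
    bpf_map_update_elem(&thread_conns, &id, &key, BPF_ANY);
}

// Call from BPF_UPROBE(probe_ssl_write, ...) to fetch the thread's 4-tuple.
static __always_inline const struct sock_key *lookup_thread_conn(void) {
    u64 id = bpf_get_current_pid_tgid();
    return bpf_map_lookup_elem(&thread_conns, &id);
}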

    Performance and Edge Cases in Production

* CPU Overhead: A well-written eBPF observability agent has minimal overhead. Unlike a sidecar, which proxies every byte of every request, eBPF programs run only for the brief moments their hooks fire. The per-node CPU cost is typically in the range of 1-2% even under heavy network load, a fraction of the aggregate 10-20% overhead often attributed to sidecar-based service meshes.

* Kernel Version Drift & CO-RE: We've used CO-RE, but it's critical to have a build pipeline that generates vmlinux.h from kernel BTF data and tests the agent against a range of kernel versions. Without CO-RE, you would need to recompile your eBPF program for every kernel version you support, which is operationally untenable.

* Short-Lived Connections: High-frequency, short-lived connections (e.g., in some RPC protocols) can churn eBPF maps. Using BPF_MAP_TYPE_LRU_HASH can help automatically evict stale entries, preventing map exhaustion (see the sketch after this list).

    * TCP Packet Reassembly: Our simple tcp_sendmsg probe doesn't handle the case where a single L7 request (e.g., an HTTP POST with a large body) is split across multiple sendmsg calls. A production agent needs a user-space component that understands TCP sequencing and can reassemble these segments into a complete application-level message before parsing.

* Security Context: Loading tracing eBPF programs requires elevated capabilities: CAP_BPF plus CAP_PERFMON on recent kernels (5.8+), or CAP_SYS_ADMIN on older ones. The agent must be deployed as a DaemonSet with a tightly scoped securityContext that grants only these capabilities. The eBPF verifier provides a strong safety boundary by ensuring the loaded program cannot crash the kernel, but the ability to load programs is still highly privileged.
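
As an illustration of the LRU change mentioned above, the active_conns map from Code Example 1 could be declared as follows (same key and value types, different map type):

c
// Drop-in alternative to the active_conns hash map: when the map is full,
// the least-recently-used entries are evicted automatically instead of
// new inserts failing, at the cost of some LRU bookkeeping overhead.
struct {
    __uint(type, BPF_MAP_TYPE_LRU_HASH);
    __uint(max_entries, 8192);
    __type(key, struct sock *);
    __type(value, struct sock_key);
} active_conns SEC(".maps");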

    Conclusion: The Future is in the Kernel

    eBPF is not a simple replacement for a service mesh; it is a fundamental shift in how we approach cloud-native observability. It trades the apparent simplicity of the sidecar model for unparalleled performance and system-wide visibility. The initial development complexity of building an eBPF agent is higher, requiring deep expertise in systems programming, kernel internals, and Kubernetes.

    However, the operational benefits are immense: reduced cluster-wide resource consumption, lower request latency, and a single point of instrumentation per node that is completely decoupled from application deployment lifecycles. For organizations operating at scale where performance is paramount, the observability tax of sidecars is becoming untenable. By moving intelligence from user-space proxies into the kernel, eBPF provides a path to a more efficient and powerful future for network observability in Kubernetes.
