eBPF for Granular K8s Pod Network Observability Without Sidecars
The Observability Gap and The Sidecar Tax
In modern Kubernetes environments, understanding pod-to-pod communication is non-negotiable for debugging, security, and performance tuning. The default solution for achieving this level of L4/L7 visibility has been the service mesh, with Istio and Linkerd leading the charge. By injecting a proxy sidecar (like Envoy) into every application pod, they intercept all network traffic, providing rich telemetry, mTLS, and advanced traffic management.
However, this power comes at a cost, a phenomenon often called the "sidecar tax." This tax manifests in several ways:
*   Latency: every packet traverses a userspace proxy on both ends of a connection, adding per-hop latency.
*   Resource consumption: each sidecar burns CPU and memory, multiplied by every pod in the cluster.
*   Operational complexity: proxy injection modifies pod specs, complicates upgrades and network policy, and changes the application's view of the network.
This is where eBPF (extended Berkeley Packet Filter) offers a revolutionary alternative. By running sandboxed programs directly within the Linux kernel, eBPF can observe and manipulate network traffic with near-native performance, completely bypassing the need for userspace proxies for observability. This article presents a production-focused pattern for building a lightweight, high-performance pod-to-pod network observability agent using eBPF, Go, and the Kubernetes API.
We will not cover the basics of eBPF. This guide assumes you understand what eBPF is, the verifier, and the general architecture of kernel-space programs and user-space controllers. Instead, we dive directly into a non-trivial, end-to-end implementation.
Core Architecture: Kernel Hooks, CO-RE, and a Go Controller
Our goal is to capture every IPv4 TCP connection initiated or accepted by any process in our Kubernetes cluster and correlate that activity with the source and destination pods.
Our architecture consists of two main components deployed as a Kubernetes DaemonSet:
*   Kernel-space eBPF program (C): Attaches to kernel tracing hooks (kprobes) related to TCP connections. It collects raw data like source/destination IPs, ports, and process IDs (PIDs).
*   Userspace controller (Go): Loads the eBPF program, consumes its events from a ring buffer, and enriches them with pod metadata from the Kubernetes API.
To ensure our eBPF program is portable across different kernel versions without needing to be recompiled on each node, we will leverage CO-RE (Compile Once - Run Everywhere). This relies on BTF (BPF Type Format), a debugging data format that allows our eBPF loader to understand kernel data structures at runtime and perform the necessary relocations. This is a critical pattern for deploying eBPF in production across a potentially heterogeneous fleet of nodes.
We will attach our probes to the following kernel functions:
*   tcp_v4_connect: A kprobe at the entry of this function gives us the destination IP and port when a connection is initiated. A kretprobe at its exit tells us if the connection was successful.
*   inet_csk_accept: A kretprobe here captures the newly created socket returned when a listening socket accepts an incoming connection. (The new socket is the function's return value, so an entry kprobe cannot see it.)
Why these functions instead of attaching at the network interface (e.g., TC hooks)? Because these socket-layer hooks give us the crucial process context—the PID of the process initiating or accepting the connection. This PID is our key to linking kernel activity back to a specific Kubernetes pod.
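Both are long-standing kernel functions, but a kprobe only works if the symbol is actually present and not inlined away on a given build, so it is worth a quick sanity check on a target node before relying on them:
# Confirm the attach points exist in this kernel's symbol table
grep -w tcp_v4_connect /proc/kallsyms
grep -w inet_csk_accept /proc/kallsyms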
The eBPF Program (C)
Let's build the kernel-side logic. We'll use standard C with libbpf headers. The program will define BPF maps to store state and communicate with userspace.
File: bpf_trace.c
// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>
#include <bpf/bpf_endian.h> // provides bpf_ntohs()
// vmlinux.h carries no preprocessor macros, so define AF_INET ourselves
#define AF_INET 2
// Event structure sent to userspace
struct event {
    u64 ts_ns;
    u32 pid;
    u32 net_ns_inum;
    u8 comm[16];
    u32 saddr;
    u32 daddr;
    u16 sport;
    u16 dport;
    u8 event_type; // 1 for connect, 2 for accept, 3 for close
};
// BPF ring buffer for sending events to userspace
struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 256 * 1024); // 256 KB
} rb SEC(".maps");
// Map to track ongoing connection attempts
struct connect_info {
    struct sock *sk;
    u16 dport;
    u32 daddr;
};
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 10240);
    __type(key, u64);
    __type(value, struct connect_info);
} active_connects SEC(".maps");
// Helper to get network namespace inode number
static __always_inline u32 get_netns_inum(void) {
    struct task_struct *task = (struct task_struct *)bpf_get_current_task();
    struct nsproxy *ns_proxy;
    struct net *net_ns;
    unsigned int inum;
    BPF_CORE_READ_INTO(&ns_proxy, task, nsproxy);
    if (!ns_proxy) return 0;
    BPF_CORE_READ_INTO(&net_ns, ns_proxy, net_ns);
    if (!net_ns) return 0;
    BPF_CORE_READ_INTO(&inum, net_ns, ns.inum);
    return inum;
}
// Kprobe on tcp_v4_connect
SEC("kprobe/tcp_v4_connect")
int BPF_KPROBE(kprobe__tcp_v4_connect, struct sock *sk, struct sockaddr *uaddr)
{
    u64 id = bpf_get_current_pid_tgid();
    struct sockaddr_in *addr = (struct sockaddr_in *)uaddr;
    // uaddr is a raw pointer argument, so it must be read with probe helpers
    // rather than dereferenced directly, or the verifier rejects the program.
    u16 family = 0;
    bpf_probe_read_kernel(&family, sizeof(family), &addr->sin_family);
    if (family != AF_INET) {
        return 0;
    }
    struct connect_info info = {};
    info.sk = sk;
    bpf_probe_read_kernel(&info.daddr, sizeof(info.daddr), &addr->sin_addr.s_addr);
    u16 dport = 0;
    bpf_probe_read_kernel(&dport, sizeof(dport), &addr->sin_port);
    info.dport = bpf_ntohs(dport);
    bpf_map_update_elem(&active_connects, &id, &info, BPF_ANY);
    return 0;
}
// Kretprobe on tcp_v4_connect
SEC("kretprobe/tcp_v4_connect")
int BPF_KRETPROBE(kretprobe__tcp_v4_connect, int ret)
{
    u64 id = bpf_get_current_pid_tgid();
    struct connect_info *info = bpf_map_lookup_elem(&active_connects, &id);
    if (!info) {
        return 0; // Not tracked
    }
    // Connection failed, cleanup and return
    if (ret != 0) {
        bpf_map_delete_elem(&active_connects, &id);
        return 0;
    }
    // Connection successful, get full tuple and send event
    struct sock *sk = info->sk;
    // inet_sk() is a kernel inline we cannot call from BPF; struct inet_sock
    // embeds struct sock as its first member, so a plain cast is safe.
    struct inet_sock *inet = (struct inet_sock *)sk;
    u16 sport = 0;
    u32 saddr = 0;
    BPF_CORE_READ_INTO(&sport, inet, inet_sport);
    BPF_CORE_READ_INTO(&saddr, sk, __sk_common.skc_rcv_saddr);
    struct event *e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
    if (!e) {
        bpf_map_delete_elem(&active_connects, &id);
        return 0;
    }
    e->ts_ns = bpf_ktime_get_ns();
    e->pid = id >> 32;
    e->net_ns_inum = get_netns_inum();
    bpf_get_current_comm(&e->comm, sizeof(e->comm));
    e->saddr = saddr;
    e->daddr = info->daddr;
    e->sport = bpf_ntohs(sport);
    e->dport = info->dport;
    e->event_type = 1; // connect
    bpf_ringbuf_submit(e, 0);
    bpf_map_delete_elem(&active_connects, &id);
    return 0;
}
// Kretprobe on inet_csk_accept: the newly accepted socket is the function's
// return value, so we must hook the exit rather than the entry.
SEC("kretprobe/inet_csk_accept")
int BPF_KRETPROBE(kretprobe__inet_csk_accept, struct sock *new_sk)
{
    u64 id = bpf_get_current_pid_tgid();
    if (!new_sk) {
        return 0;
    }
    u16 family = 0;
    BPF_CORE_READ_INTO(&family, new_sk, __sk_common.skc_family);
    if (family != AF_INET) {
        return 0;
    }
    u16 sport = 0, dport = 0;
    u32 saddr = 0, daddr = 0;
    BPF_CORE_READ_INTO(&sport, new_sk, __sk_common.skc_num);
    BPF_CORE_READ_INTO(&dport, new_sk, __sk_common.skc_dport);
    BPF_CORE_READ_INTO(&saddr, new_sk, __sk_common.skc_rcv_saddr);
    BPF_CORE_READ_INTO(&daddr, new_sk, __sk_common.skc_daddr);
    struct event *e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
    if (!e) {
        return 0;
    }
    e->ts_ns = bpf_ktime_get_ns();
    e->pid = id >> 32;
    e->net_ns_inum = get_netns_inum();
    bpf_get_current_comm(&e->comm, sizeof(e->comm));
    e->saddr = saddr;
    e->daddr = daddr;
    e->sport = sport;
    e->dport = bpf_ntohs(dport);
    e->event_type = 2; // accept
    bpf_ringbuf_submit(e, 0);
    return 0;
}
char LICENSE[] SEC("license") = "Dual BSD/GPL";
Key Implementation Details:
*   vmlinux.h: This header is generated by bpftool and contains all kernel type definitions for a specific kernel version. CO-RE uses this to understand the structure of things like struct sock and struct task_struct at compile time.
*   BPF_CORE_READ_INTO: This macro is the heart of CO-RE. It safely reads fields from kernel structs, even if their layout changes across kernel versions.
*   active_connects map: We need a temporary map to correlate the entry and exit of tcp_v4_connect. At the entry (kprobe), we store the destination details. At the exit (kretprobe), we retrieve them, get the source details (which are only available after the connection is established), and send the full event.
*   get_netns_inum(): This helper function is critical. It walks the task_struct to find the network namespace inode number. This inode is a unique identifier for a pod's network sandbox on a given node, which our userspace controller will use for correlation.
*   rb (Ring Buffer): We use a BPF_MAP_TYPE_RINGBUF, a modern and efficient mechanism for sending data from kernel to userspace. It's lock-free and less prone to event loss than older perf buffers.
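One practical note before compiling: CO-RE relocations happen at load time against the running kernel's BTF, so each node must expose it. A quick check (paths may vary by distribution):
# CO-RE requires the kernel to expose BTF type information
ls -l /sys/kernel/btf/vmlinux
# The kernel must be built with CONFIG_DEBUG_INFO_BTF=y
grep CONFIG_DEBUG_INFO_BTF /boot/config-$(uname -r)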
To compile this, you'll need clang, llvm, and libbpf. You'll also need to generate the vmlinux.h header.
# Install dependencies (on Debian/Ubuntu)
apt-get install -y clang llvm libelf-dev linux-headers-$(uname -r) libbpf-dev
# Generate vmlinux.h
bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h
# Compile the eBPF program
clang -g -O2 -target bpf -D__TARGET_ARCH_x86 -I. -c bpf_trace.c -o bpf_trace.o
The Userspace Controller (Go)
Now for the Go application that loads and interacts with our eBPF program. We will use the excellent cilium/ebpf library.
File: main.go
package main
import (
	"bytes"
	"context"
	"encoding/binary"
	"errors"
	"fmt"
	"log"
	"net"
	"os"
	"os/signal"
	"syscall"

	"github.com/cilium/ebpf/link"
	"github.com/cilium/ebpf/ringbuf"
	"github.com/cilium/ebpf/rlimit"
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)
//go:generate go run github.com/cilium/ebpf/cmd/bpf2go -cc clang -cflags "-O2 -g -Wall" bpf ./bpf_trace.c -- -I./
// Event mirrors the C struct
type Event struct {
	TsNs      uint64
	Pid       uint32
	NetNsInum uint32
	Comm      [16]byte
	Saddr     uint32
	Daddr     uint32
	Sport     uint16
	Dport     uint16
	EventType uint8
}
// PodInfo holds enriched Kubernetes metadata
type PodInfo struct {
	Namespace string
	Name      string
	PodIP     string
}
// netNsCache maps network namespace inode number to PodInfo
type netNsCache struct {
	store cache.SharedIndexInformer
}
func newNetNsCache(ctx context.Context, clientset *kubernetes.Clientset) *netNsCache {
	podListWatcher := cache.NewListWatchFromClient(
		clientset.CoreV1().RESTClient(),
		"pods",
		metav1.NamespaceAll,
		fields.Everything(), // a nil field selector would panic inside the ListWatch
	)
	informer := cache.NewSharedIndexInformer(
		podListWatcher,
		&v1.Pod{},
		0, // resync period
		cache.Indexers{"podip": func(obj interface{}) ([]string, error) {
			p := obj.(*v1.Pod)
			if p.Status.PodIP == "" || p.Status.Phase != v1.PodRunning {
				return nil, nil
			}
			// This is a simplification. A robust implementation would read /proc/[pid]/ns/net
			// after finding a PID in the pod's cgroup, and index by the netns inode.
			// For this example, we index by pod IP instead.
			return []string{p.Status.PodIP}, nil
		}},
	)
	go informer.Run(ctx.Done())
	if !cache.WaitForCacheSync(ctx.Done(), informer.HasSynced) {
		log.Fatal("Failed to sync pod cache")
	}
	return &netNsCache{store: informer}
}
// A real implementation would map NetNS inode to Pod. Here we simplify by mapping IP.
func (c *netNsCache) GetPodByIP(ip string) (PodInfo, bool) {
	items, err := c.store.GetIndexer().ByIndex("podip", ip)
	if err != nil || len(items) == 0 {
		return PodInfo{}, false
	}
	p := items[0].(*v1.Pod)
	return PodInfo{Namespace: p.Namespace, Name: p.Name, PodIP: p.Status.PodIP}, true
}
func main() {
	ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
	defer stop()
	// Allow the current process to lock memory for eBPF maps.
	if err := rlimit.RemoveMemlock(); err != nil {
		log.Fatal(err)
	}
	// Load pre-compiled programs and maps into the kernel.
	objs := bpfObjects{}
	if err := loadBpfObjects(&objs, nil); err != nil {
		log.Fatalf("loading objects: %v", err)
	}
	defer objs.Close()
	// Attach kprobes
	kpConnect, err := link.Kprobe("tcp_v4_connect", objs.KprobeTcpV4Connect, nil)
	if err != nil {
		log.Fatalf("attaching kprobe tcp_v4_connect: %s", err)
	}
	defer kpConnect.Close()
	kretpConnect, err := link.Kretprobe("tcp_v4_connect", objs.KretprobeTcpV4Connect, nil)
	if err != nil {
		log.Fatalf("attaching kretprobe tcp_v4_connect: %s", err)
	}
	defer kretpConnect.Close()
	kretpAccept, err := link.Kretprobe("inet_csk_accept", objs.KretprobeInetCskAccept, nil)
	if err != nil {
		log.Fatalf("attaching kretprobe inet_csk_accept: %s", err)
	}
	defer kretpAccept.Close()
	// Set up Kubernetes client
	config, err := rest.InClusterConfig()
	if err != nil {
		log.Fatalf("getting in-cluster config: %s", err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatalf("creating clientset: %s", err)
	}
	podCache := newNetNsCache(ctx, clientset)
	// Open a ringbuf reader from userspace RINGBUF map.
	rd, err := ringbuf.NewReader(objs.Rb)
	if err != nil {
		log.Fatalf("opening ringbuf reader: %s", err)
	}
	defer rd.Close()
	go func() {
		<-ctx.Done()
		rd.Close()
	}()
	log.Println("Waiting for events...")
	var event Event
	for {
		record, err := rd.Read()
		if err != nil {
			if errors.Is(err, ringbuf.ErrClosed) {
				log.Println("Received signal, exiting...")
				return
			}
			log.Printf("reading from reader: %s", err)
			continue
		}
		if err := binary.Read(bytes.NewBuffer(record.RawSample), binary.LittleEndian, &event); err != nil {
			log.Printf("parsing ringbuf event: %s", err)
			continue
		}
		processEvent(event, podCache)
	}
}
func processEvent(event Event, podCache *netNsCache) {
	srcIP := intToIP(event.Saddr).String()
	dstIP := intToIP(event.Daddr).String()
	srcPod, srcFound := podCache.GetPodByIP(srcIP)
	dstPod, dstFound := podCache.GetPodByIP(dstIP)
	var srcID, dstID string
	if srcFound {
		srcID = fmt.Sprintf("%s/%s", srcPod.Namespace, srcPod.Name)
	} else {
		srcID = srcIP
	}
	if dstFound {
		dstID = fmt.Sprintf("%s/%s", dstPod.Namespace, dstPod.Name)
	} else {
		dstID = dstIP
	}
	eventType := "UNKNOWN"
	switch event.EventType {
	case 1:
		eventType = "CONNECT"
	case 2:
		eventType = "ACCEPT"
	}
	comm := event.Comm[:]
	if i := bytes.IndexByte(comm, 0); i >= 0 {
		comm = comm[:i] // trim at the NUL terminator, if present
	}
	log.Printf("[%s] %s -> %s | %s:%d -> %s:%d | PID: %d | Comm: %s",
		eventType,
		srcID,
		dstID,
		srcIP, event.Sport,
		dstIP, event.Dport,
		event.Pid,
		string(comm),
	)
}
func intToIP(ipNum uint32) net.IP {
	// The kernel hands us the address in network byte order; since we decoded
	// the raw struct as little-endian, writing the integer back little-endian
	// restores the original byte sequence.
	ip := make(net.IP, 4)
	binary.LittleEndian.PutUint32(ip, ipNum)
	return ip
}
Key Implementation Details:
*   go:generate: This command uses bpf2go to compile the C code and embed it into a Go file (e.g., bpf_bpfel.go for little-endian targets), along with Go structs that mirror the eBPF maps and types. This simplifies loading and interaction immensely.
*   rlimit.RemoveMemlock(): eBPF requires locked memory for its maps. This function raises the memlock rlimit for our process.
*   cilium/ebpf/link: This package provides a clean, high-level API for attaching eBPF programs to kernel hooks (Kprobe, Kretprobe, etc.). It handles the low-level details and ensures cleanup on exit.
*   Kubernetes Enrichment: This is the most critical part of the userspace controller. We create a SharedIndexInformer from client-go to maintain a local, in-memory cache of all pods in the cluster. When we receive an event from the kernel, we use the source and destination IPs to look up the corresponding pod information from our cache. This is far more efficient than querying the API server for every event.
*   NetNS to Pod Mapping (Simplification): The code above indexes pods by IP. This is a simplification that works in many CNI configurations but isn't foolproof (e.g., hostNetwork pods). A truly robust implementation would list processes in /proc, check each /proc/[pid]/cgroup to map it to a pod's cgroup, and read /proc/[pid]/ns/net to get the network namespace inode. That inode would then be the key in our cache. For clarity, the IP-based approach is shown here; a sketch of the namespace-inode half follows this list.
*   ringbuf.NewReader: We create a reader to efficiently pull event data from the BPF ring buffer map defined in our C code.
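To make the robust path concrete, here is a minimal, self-contained sketch of resolving a network namespace inode from /proc. It assumes the agent runs with hostPID (as in the DaemonSet below) so host PIDs are visible; the inode it returns is the same value our eBPF program reports in net_ns_inum:
package main

import (
	"fmt"
	"os"
)

// netNsInode returns the network namespace inode for a host PID by reading
// the /proc/<pid>/ns/net symlink, which resolves to "net:[<inode>]".
func netNsInode(pid int) (uint32, error) {
	target, err := os.Readlink(fmt.Sprintf("/proc/%d/ns/net", pid))
	if err != nil {
		return 0, err
	}
	var inum uint32
	if _, err := fmt.Sscanf(target, "net:[%d]", &inum); err != nil {
		return 0, fmt.Errorf("unexpected ns link %q: %w", target, err)
	}
	return inum, nil
}

func main() {
	// Example: resolve our own network namespace inode
	inum, err := netNsInode(os.Getpid())
	if err != nil {
		panic(err)
	}
	fmt.Printf("netns inode: %d\n", inum)
}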
Production Deployment as a DaemonSet
To monitor all nodes in the cluster, we deploy our agent as a DaemonSet. This ensures one instance of our Go controller runs on every node, loading the eBPF program into that node's kernel.
File: daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ebpf-net-observer
  namespace: kube-system
  labels:
    app: ebpf-net-observer
spec:
  selector:
    matchLabels:
      app: ebpf-net-observer
  template:
    metadata:
      labels:
        app: ebpf-net-observer
    spec:
      tolerations:
      - operator: Exists
      hostPID: true
      hostNetwork: true
      containers:
      - name: observer
        image: <your-registry>/ebpf-net-observer:latest
        securityContext:
          privileged: true
          # Or more fine-grained capabilities:
          # capabilities:
          #   add:
          #   - SYS_ADMIN
          #   - BPF
        volumeMounts:
        - name: bpf-fs
          mountPath: /sys/fs/bpf
      serviceAccountName: ebpf-observer-sa
      volumes:
      - name: bpf-fs
        hostPath:
          path: /sys/fs/bpf
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ebpf-observer-sa
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: ebpf-observer-role
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: ebpf-observer-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: ebpf-observer-role
subjects:
- kind: ServiceAccount
  name: ebpf-observer-sa
  namespace: kube-system
Key Deployment Details:
*   DaemonSet: Ensures our agent runs on every node.
*   hostPID: true: Allows the agent to see all process IDs on the host, which is necessary to map PIDs from eBPF events to containers.
*   securityContext: { privileged: true }: This is the simplest way to grant the necessary permissions. On kernels 5.8+, loading and attaching tracing programs requires CAP_BPF plus CAP_PERFMON; older kernels need CAP_SYS_ADMIN. In a production environment, you should avoid full privileged mode and grant only the necessary capabilities (see the sketch after this list). This is a critical security consideration.
*   RBAC: We create a ServiceAccount, ClusterRole, and ClusterRoleBinding to grant our agent read-only access to Pod objects across the entire cluster. This is required for the enrichment process.
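As a sketch of that tighter posture, assuming nodes run kernel 5.8 or newer (the exact capability set varies by kernel and distribution, and older kernels fall back to SYS_ADMIN), the container's securityContext could look like:
        securityContext:
          privileged: false
          capabilities:
            add:
            - BPF      # create and load eBPF programs and maps
            - PERFMON  # attach kprobes via the perf subsystem
            drop:
            - ALL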
Performance Analysis vs. Sidecar Proxies
Let's quantify the "sidecar tax" and compare it to our eBPF approach.
| Metric | Sidecar Proxy (e.g., Envoy) | eBPF Observability Agent | 
|---|---|---|
| Data Path Latency | High. Adds 0.5ms - 5ms+ per hop. Every packet is redirected to a userspace process, copied, processed, and sent back to the kernel. | Near-zero. eBPF probes are passive hooks. They read data but do not intercept or modify packets. The data path is untouched. Latency addition is measured in nanoseconds. | 
| CPU Overhead | Medium to High. Each sidecar runs as a separate process, consuming CPU for proxying, TLS termination, and telemetry generation. | Very Low. The in-kernel eBPF program is JIT-compiled and highly efficient. The main overhead is the userspace Go agent, which is lightweight and primarily waits for events and processes a local cache. | 
| Memory Overhead | Medium. Each sidecar can consume 50MB - 200MB+ of RAM. This is multiplied by the number of pods in the cluster. | Low. The eBPF maps consume a fixed, pre-allocated amount of locked kernel memory (e.g., a few MB). The Go agent's memory usage is dominated by the pod cache, which is a single instance per node, not per pod. | 
| Intrusiveness | High. Requires pod spec modifications (injection), complicates network policies, and changes the application's network view (traffic appears to originate from localhost). | Low. Completely transparent to the application. No code or configuration changes are needed in the application pods. | 
Benchmark Scenario: Imagine a simple request-response service. A wrk benchmark might show:
* Baseline (No Proxy): p99 latency of 2ms.
* With Istio Sidecar: p99 latency of 4.5ms.
* With eBPF Agent: p99 latency of 2.05ms.
The eBPF agent adds negligible latency, whereas the sidecar more than doubles it. For high-throughput, low-latency services, this difference is monumental.
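Those numbers are illustrative rather than measured, but for reproducibility this is the shape of run they would come from (the service URL is hypothetical):
# 4 threads, 128 connections, 60 seconds, with latency percentiles
wrk -t4 -c128 -d60s --latency http://payment-svc.default.svc.cluster.local:8080/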
Advanced Edge Cases and Considerations
This implementation is a solid foundation, but a production-ready system must handle several edge cases:
*   Encrypted traffic and L7 visibility: The kprobe approach operates at L4. It sees the encrypted TCP stream but has no visibility into the L7 data (e.g., HTTP headers, gRPC methods). To get L7 visibility, you must move up the stack and use uprobes (userspace probes) to attach to SSL/TLS library functions in application processes (e.g., SSL_read, SSL_write in OpenSSL); a sketch follows this list. This adds significant complexity, as you need to handle different libraries, versions, and languages (e.g., Go's built-in crypto stack).
*   Map sizing and event loss: Under heavy connection churn, the hash map (active_connects) can become a bottleneck. You must carefully size your maps and consider more advanced per-CPU map types to reduce lock contention. Similarly, a high event rate can overrun the ring buffer: your userspace agent must be fast enough to consume events, and the buffer must be sized appropriately.
*   Process-to-pod correlation: A short-lived process can exit before the agent has resolved its metadata from /proc. The most robust solutions, like those in Cilium or Falco, build a sophisticated in-memory graph of containers, processes, and network identifiers on each host and update it in real time.
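To give a flavor of the uprobe approach, here is a minimal sketch of hooking OpenSSL's SSL_write to capture a plaintext prefix before encryption. The l7_event struct and MAX_DATA constant are illustrative additions, and the probe reuses the ring buffer pattern from earlier; a real agent must also locate libssl.so inside each container's mount namespace to attach the probe (e.g., with link.OpenExecutable in cilium/ebpf):
#define MAX_DATA 64

// Illustrative event type for L7 payload prefixes
struct l7_event {
    u32 pid;
    u32 len;
    u8 data[MAX_DATA];
};

// Uprobe on OpenSSL's SSL_write(SSL *ssl, const void *buf, int num)
SEC("uprobe/SSL_write")
int BPF_KPROBE(uprobe__ssl_write, void *ssl, const void *buf, int num)
{
    struct l7_event *e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
    if (!e) {
        return 0;
    }
    e->pid = bpf_get_current_pid_tgid() >> 32;
    u32 len = num;
    if (len > MAX_DATA) {
        len = MAX_DATA; // bound the copy so the verifier accepts it
    }
    e->len = len;
    // buf is a userspace pointer, so it needs the _user read helper
    bpf_probe_read_user(e->data, len, buf);
    bpf_ringbuf_submit(e, 0);
    return 0;
}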
By leveraging eBPF, we've built a powerful, low-overhead network observability tool that provides deep insights without the performance penalty of traditional service meshes. It's a prime example of how eBPF is shifting the paradigm of cloud-native networking, security, and observability, moving logic from complex userspace sidecars into the efficient, programmable kernel.