Advanced eBPF for Kubernetes Pod-to-Pod Traffic Observability
The Observability Gap: Beyond Sidecar Proxies
In modern Kubernetes environments, understanding pod-to-pod communication is non-negotiable for debugging, security, and performance tuning. The default solution for L7 observability has been the service mesh, which injects a sidecar proxy (like Envoy) into every application pod. This model, while powerful, introduces significant overhead: increased resource consumption (CPU/memory), added network latency for every request, and operational complexity. For many use cases, particularly those focused on L3/L4 network flow analysis, the sidecar is overkill.
This is where eBPF (extended Berkeley Packet Filter) provides a paradigm shift. By running sandboxed programs directly within the Linux kernel, eBPF allows us to instrument network activity at its source with minimal performance impact. We can achieve deep visibility without modifying application code or injecting proxies. This post dissects the implementation of a production-grade eBPF-based traffic monitor, moving from kernel-level packet data to rich, Kubernetes-aware observability.
We will not cover the basics of eBPF. The assumption is you understand what eBPF is, the role of the verifier, and the general concept of maps and probes. Instead, we focus on the architectural patterns and code required to build a real-world tool.
Architectural Contrast: Sidecar vs. eBPF
Let's visualize the data path difference:
Sidecar (e.g., Istio) Data Path:
[ Pod A ] -> [ vethA ] -> [ iptables redirect ] -> [ Envoy Proxy (userspace) ] -> [ iptables ] -> [ Network ]
Every packet traverses the network stack, is redirected by iptables to a userspace process (the proxy), processed, and then re-injected into the stack. This context switching and data copying is the primary source of overhead.
eBPF Data Path:
[ Pod A ] -> [ vethA (eBPF `tc` hook) ] -> [ Network ]
Our eBPF program attaches to the Traffic Control (`tc`) hook on the pod's virtual ethernet device (`veth`). It can read (and even manipulate) packet data directly as it transits the kernel's network stack. Data is passed to a userspace controller via highly efficient eBPF maps, avoiding per-packet context switching.
The performance gains are substantial, often reducing added latency from milliseconds to microseconds.
Core Implementation: A Pod-to-Pod Traffic Monitor
Our goal is to build a tool that reports traffic flows like `(Source Pod, Source Namespace) -> (Destination Pod, Destination Namespace)`. This requires two main components:
* A kernel-side eBPF program attached to the `tc` hook.
* A userspace controller that reads the resulting events and enriches them with Kubernetes metadata.
We'll use the `cilium/ebpf` library in Go, which provides excellent abstractions for managing eBPF objects and CO-RE (Compile Once - Run Everywhere).
1. The Kernel-Side eBPF Program (`bpf_program.c`)
This C code will be compiled into an eBPF object file. It defines the map used to communicate with userspace and the program that attaches to the `tc` hook.
We use a `BPF_MAP_TYPE_PERF_EVENT_ARRAY` map. This is a highly efficient, lock-free, per-CPU buffer for sending event data to userspace. It's ideal for high-throughput event streams like network flows.
// bpf_program.c
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <linux/pkt_cls.h> // for TC_ACT_OK
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>
// Define a struct for the data we want to send to userspace.
// The __attribute__((packed)) is important to prevent padding issues.
struct flow_data {
__u32 src_ip;
__u32 dst_ip;
__u16 src_port;
__u16 dst_port;
__u64 bytes;
} __attribute__((packed));
// Perf event array used to stream data to userspace. The key is the CPU
// index; the values are per-CPU perf ring buffer file descriptors that the
// userspace reader (here, cilium/ebpf's perf.NewReader) sets up.
struct {
__uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
__uint(key_size, sizeof(__u32));
__uint(value_size, sizeof(__u32));
__uint(max_entries, 1024);
} perf_events SEC(".maps");
// The SEC("tc") annotation tells the loader to treat this as a TC program.
SEC("tc")
int tc_monitor(struct __sk_buff *skb) {
// Get pointers to the start and end of the packet data.
void *data_end = (void *)(long)skb->data_end;
void *data = (void *)(long)skb->data;
// First, parse the Ethernet header.
struct ethhdr *eth = data;
if ((void *)eth + sizeof(*eth) > data_end) {
return TC_ACT_OK; // Not a valid Ethernet frame
}
// We only care about IPv4 for this example.
if (eth->h_proto != bpf_htons(ETH_P_IP)) {
return TC_ACT_OK;
}
// Parse the IP header.
struct iphdr *ip = data + sizeof(*eth);
if ((void *)ip + sizeof(*ip) > data_end) {
return TC_ACT_OK; // Not a valid IP packet
}
// We only care about TCP traffic for simplicity.
if (ip->protocol != IPPROTO_TCP) {
return TC_ACT_OK;
}
// Parse the TCP header. For simplicity we assume a fixed 20-byte IP header
// (no IP options); a more robust parser would offset by ip->ihl * 4.
struct tcphdr *tcp = (void *)ip + sizeof(*ip);
if ((void *)tcp + sizeof(*tcp) > data_end) {
return TC_ACT_OK; // Not a valid TCP segment
}
// Populate our data struct.
struct flow_data flow = {};
flow.src_ip = ip->saddr;
flow.dst_ip = ip->daddr;
flow.src_port = tcp->source;
flow.dst_port = tcp->dest;
flow.bytes = skb->len;
// Submit the event to the perf buffer.
// The BPF_F_CURRENT_CPU flag uses the current CPU as the map key.
// The first argument is the program context (the skb); the second is the map.
bpf_perf_event_output(skb, &perf_events, BPF_F_CURRENT_CPU, &flow, sizeof(flow));
// TC_ACT_OK tells the kernel to let the packet proceed unmodified.
return TC_ACT_OK;
}
// Required license for the eBPF program to be loaded.
char LICENSE[] SEC("license") = "GPL";
To compile this, you need `clang` and `llvm`, plus the kernel headers (typically from a package like `linux-headers-generic`) and the libbpf headers that provide `bpf/bpf_helpers.h` (e.g., `libbpf-dev` on Debian/Ubuntu).
# Compile the C code into an eBPF object file
clang -O2 -g -target bpf -c bpf_program.c -o bpf_program.o
2. The Userspace Controller (`main.go`)
This Go program is the brain of the operation. It performs several critical tasks:
* It parses `bpf_program.o` and loads the program and map definitions into the kernel.
* It attaches the `tc_monitor` program to the `tc` ingress and egress hooks on all relevant network interfaces (e.g., all `veth` pairs).
* It uses `client-go` to watch for pod creations, deletions, and updates, maintaining an in-memory cache that maps Pod IPs to their metadata (name, namespace, labels).
* It reads `flow_data` structs from the eBPF map.
* For each `flow_data` event, it looks up the source and destination IPs in its Kubernetes cache and logs a human-readable flow description.
Here's a simplified but functional implementation:
// main.go
package main
import (
	"context"
	"encoding/binary"
	"errors"
	"log"
	"net"
	"os"
	"os/signal"
	"sync"
	"syscall"

	"github.com/cilium/ebpf"
	"github.com/cilium/ebpf/perf"
	"github.com/vishvananda/netlink"
	// A full pod watcher (sketched after this listing) would also pull in
	// k8s.io/client-go packages; the stub below does not, so those imports
	// are omitted to keep this file compiling.
)
//go:generate go run github.com/cilium/ebpf/cmd/bpf2go bpf bpf_program.c -- -I/usr/include -I/usr/include/x86_64-linux-gnu
// PodInfo holds the metadata we care about.
type PodInfo struct {
Name string
Namespace string
IP string
}
// ipCache is a thread-safe cache for mapping IP addresses to PodInfo.
var (
ipCache = make(map[string]PodInfo)
cacheLock sync.RWMutex
)
func main() {
ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
defer stop()
// Start the Kubernetes pod watcher in the background.
go watchKubernetesPods(ctx)
// Load the eBPF objects from the compiled file.
objs := bpfObjects{}
if err := loadBpfObjects(&objs, nil); err != nil {
log.Fatalf("loading eBPF objects: %v", err)
}
defer objs.Close()
// Attach the eBPF program to all existing and future veth interfaces.
if err := attachToAllVeth(ctx, objs.TcMonitor); err != nil {
log.Fatalf("attaching to veth interfaces: %v", err)
}
// Set up the perf event reader.
rd, err := perf.NewReader(objs.PerfEvents, os.Getpagesize())
if err != nil {
log.Fatalf("creating perf event reader: %v", err)
}
	defer rd.Close()

	// Close the reader when the context is cancelled so the blocking Read
	// call below returns instead of waiting for the next packet.
	go func() {
		<-ctx.Done()
		rd.Close()
	}()
log.Println("eBPF Traffic Monitor is running. Press Ctrl+C to exit.")
// Main event loop.
for {
select {
case <-ctx.Done():
log.Println("Received stop signal, exiting.")
return
default:
record, err := rd.Read()
if err != nil {
if errors.Is(err, perf.ErrClosed) {
return
}
log.Printf("reading from perf buffer: %v", err)
continue
}
			if record.LostSamples > 0 {
				log.Printf("dropped %d samples due to full buffer", record.LostSamples)
				continue // a lost-samples record carries no payload
			}

			var flow struct {
				SrcIP   uint32
				DstIP   uint32
				SrcPort uint16
				DstPort uint16
				Bytes   uint64
			}
			raw := record.RawSample
			if len(raw) < 20 { // 4+4+2+2+8 bytes, packed
				log.Printf("short flow sample: %d bytes", len(raw))
				continue
			}
			// IPs and ports are written by the kernel in network byte order
			// (big endian), while skb->len is in host byte order, so the
			// fields are decoded separately. binary.NativeEndian needs Go 1.21+.
			flow.SrcIP = binary.BigEndian.Uint32(raw[0:4])
			flow.DstIP = binary.BigEndian.Uint32(raw[4:8])
			flow.SrcPort = binary.BigEndian.Uint16(raw[8:10])
			flow.DstPort = binary.BigEndian.Uint16(raw[10:12])
			flow.Bytes = binary.NativeEndian.Uint64(raw[12:20])
// Correlate IPs with Pod info.
logFlow(flow.SrcIP, flow.DstIP, flow.SrcPort, flow.DstPort, flow.Bytes)
}
}
}
// logFlow enriches kernel-level data with Kubernetes metadata.
func logFlow(srcIP, dstIP uint32, srcPort, dstPort uint16, bytes uint64) {
cacheLock.RLock()
defer cacheLock.RUnlock()
srcPod, srcFound := ipCache[intToIP(srcIP).String()]
dstPod, dstFound := ipCache[intToIP(dstIP).String()]
srcDesc := "<External>"
if srcFound {
srcDesc = srcPod.Namespace + "/" + srcPod.Name
}
dstDesc := "<External>"
if dstFound {
dstDesc = dstPod.Namespace + "/" + dstPod.Name
}
log.Printf("Flow: %s:%d -> %s:%d (%d bytes)",
srcDesc, srcPort, dstDesc, bytes)
}
// watchKubernetesPods sets up a watcher to keep the ipCache updated.
func watchKubernetesPods(ctx context.Context) {
// ... (Implementation for connecting to Kubernetes API)
// This function would use client-go to create an Informer for Pods.
// On Add/Update events, it would add/update the pod's IP in the ipCache.
// On Delete events, it would remove the pod's IP from the ipCache.
// This is a standard but non-trivial piece of code, omitted for brevity;
// a minimal informer-based sketch is shown after this listing.
// A real implementation would handle multiple IPs per pod and dual-stack.
log.Println("Kubernetes pod watcher started.")
// For demonstration, let's add a dummy entry.
cacheLock.Lock()
ipCache["10.42.0.15"] = PodInfo{Name: "my-app-pod-1", Namespace: "default", IP: "10.42.0.15"}
ipCache["10.42.0.16"] = PodInfo{Name: "my-db-pod-1", Namespace: "default", IP: "10.42.0.16"}
cacheLock.Unlock()
<-ctx.Done()
log.Println("Kubernetes pod watcher stopped.")
}
// attachToAllVeth attaches the TC program to the ingress and egress hooks of
// all existing veth interfaces. (A production version would also watch for
// interfaces created later.)
func attachToAllVeth(ctx context.Context, prog *ebpf.Program) error {
	links, err := netlink.LinkList()
	if err != nil {
		return err
	}
	for _, l := range links {
		if l.Type() != "veth" {
			continue
		}
		// A clsact qdisc exposes the ingress and egress classifier hooks
		// without shaping traffic.
		qdisc := &netlink.GenericQdisc{
			QdiscAttrs: netlink.QdiscAttrs{
				LinkIndex: l.Attrs().Index,
				Handle:    netlink.MakeHandle(0xffff, 0),
				Parent:    netlink.HANDLE_CLSACT,
			},
			QdiscType: "clsact",
		}
		if err := netlink.QdiscReplace(qdisc); err != nil {
			log.Printf("could not add clsact qdisc to %s: %v", l.Attrs().Name, err)
			continue
		}
		// Attach the same program as both an ingress and an egress filter.
		for name, parent := range map[string]uint32{
			"tc-monitor-ingress": netlink.HANDLE_MIN_INGRESS,
			"tc-monitor-egress":  netlink.HANDLE_MIN_EGRESS,
		} {
			filter := &netlink.BpfFilter{
				FilterAttrs: netlink.FilterAttrs{
					LinkIndex: l.Attrs().Index,
					Parent:    parent,
					Handle:    netlink.MakeHandle(0, 1),
					Protocol:  syscall.ETH_P_ALL,
					Priority:  1,
				},
				Fd:           prog.FD(),
				Name:         name,
				DirectAction: true,
			}
			if err := netlink.FilterReplace(filter); err != nil {
				log.Printf("could not attach %s filter to %s: %v", name, l.Attrs().Name, err)
			}
		}
	}
	return nil
}
// intToIP converts a uint32 IP address to a net.IP.
func intToIP(ipInt uint32) net.IP {
ip := make(net.IP, 4)
binary.BigEndian.PutUint32(ip, ipInt)
return ip
}
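The pod watcher above is only a stub. As a minimal sketch of what a real implementation could look like, the following uses a `client-go` shared informer and assumes in-cluster credentials (`rest.InClusterConfig`); the function and helper names (`watchKubernetesPodsInformer`, `upsertPod`) are illustrative, and it reuses the `ipCache`, `cacheLock`, and `PodInfo` definitions from `main.go`.
// watcher_sketch.go (illustrative; same package as main.go)
package main

import (
	"context"
	"log"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

// watchKubernetesPodsInformer keeps ipCache in sync with the API server.
func watchKubernetesPodsInformer(ctx context.Context) {
	config, err := rest.InClusterConfig()
	if err != nil {
		log.Printf("not running in-cluster: %v", err)
		return
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Printf("creating clientset: %v", err)
		return
	}

	factory := informers.NewSharedInformerFactory(clientset, 0)
	podInformer := factory.Core().V1().Pods().Informer()

	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    func(obj interface{}) { upsertPod(obj) },
		UpdateFunc: func(_, newObj interface{}) { upsertPod(newObj) },
		DeleteFunc: func(obj interface{}) {
			if pod, ok := obj.(*corev1.Pod); ok && pod.Status.PodIP != "" {
				cacheLock.Lock()
				delete(ipCache, pod.Status.PodIP)
				cacheLock.Unlock()
			}
		},
	})

	factory.Start(ctx.Done())
	factory.WaitForCacheSync(ctx.Done())
	<-ctx.Done()
}

// upsertPod records a pod's primary IP in the cache.
func upsertPod(obj interface{}) {
	pod, ok := obj.(*corev1.Pod)
	if !ok || pod.Status.PodIP == "" {
		return
	}
	cacheLock.Lock()
	ipCache[pod.Status.PodIP] = PodInfo{
		Name:      pod.Name,
		Namespace: pod.Namespace,
		IP:        pod.Status.PodIP,
	}
	cacheLock.Unlock()
}
This sketch deliberately keys the cache on `pod.Status.PodIP` only; handling multiple pod IPs and dual-stack, as noted in the stub, is left out.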
This setup provides a powerful foundation. When a pod sends a TCP packet, our eBPF program captures the details, sends them to the Go controller, which then enriches the data with Kubernetes context and logs it.
Advanced Scenario 1: Handling Cross-Node Traffic
Our current implementation works well for pods on the same node. But what happens when Pod A on Node 1 talks to Pod B on Node 2? In their overlay/tunnel modes, CNI plugins (e.g., Cilium, Calico, Flannel) encapsulate the original packet inside another packet (e.g., using VXLAN or Geneve) to route it between nodes.
The Problem: Depending on the CNI and on where in the stack the packet is observed, a hook may only see the *outer* packet's destination IP, which is the IP of Node 2, not Pod B. The `tc` hook on Pod A's `veth` typically still sees the original pod addresses, but anything observed after encapsulation would be incorrectly reported as `Pod A -> Node 2`.
The Solution: We must attach another eBPF program at a later stage in the packet processing pipeline, specifically on the physical network interface (`eth0`) where the encapsulated traffic leaves the node. This program's job is to parse the encapsulation header (e.g., VXLAN) to find the *inner* packet's true destination IP.
Here's a conceptual eBPF C snippet for parsing a VXLAN header:
#include <linux/udp.h>
#include <linux/if_vlan.h>
// Simplified VXLAN header struct
struct vxlanhdr {
__be32 vx_flags;
__be32 vx_vni;
};
SEC("tc_vxlan")
int tc_vxlan_parser(struct __sk_buff *skb) {
void *data_end = (void *)(long)skb->data_end;
void *data = (void *)(long)skb->data;
// Assume we've already parsed Eth and IP headers to get here.
struct udphdr *udp = /* ... pointer to UDP header ... */;
if ((void *)udp + sizeof(*udp) > data_end) {
return TC_ACT_OK;
}
// Check if it's the standard VXLAN port (4789)
if (udp->dest != bpf_htons(4789)) {
return TC_ACT_OK;
}
// Find the inner packet after the VXLAN header
struct vxlanhdr *vxlan = (void *)udp + sizeof(*udp);
if ((void *)vxlan + sizeof(*vxlan) > data_end) {
return TC_ACT_OK;
}
// The inner packet starts after the VXLAN header.
// We need to re-parse from the inner Ethernet header.
struct ethhdr *inner_eth = (void *)vxlan + sizeof(*vxlan);
if ((void *)inner_eth + sizeof(*inner_eth) > data_end) {
return TC_ACT_OK;
}
// Now we can parse the inner IP and TCP headers as before.
struct iphdr *inner_ip = (void *)inner_eth + sizeof(*inner_eth);
if ((void *)inner_ip + sizeof(*inner_ip) > data_end) {
    return TC_ACT_OK; // the verifier requires this bounds check before dereferencing
}
// ... parse the inner TCP header the same way ...
// Now we have the true source and destination IPs from the inner packet.
// We can send this data to userspace.
__u32 true_src_ip = inner_ip->saddr;
__u32 true_dst_ip = inner_ip->daddr;
// ... populate flow struct and bpf_perf_event_output ...
return TC_ACT_OK;
}
A production system would need both hooks: one on the `veth` for same-node traffic and one on the physical NIC for cross-node traffic, with logic in the userspace controller to de-duplicate or merge the data.
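As a rough illustration of that de-duplication logic, here is a hedged Go sketch that suppresses repeat reports of the same 5-tuple within a short window; the `flowKey` type and the TTL value are assumptions, not part of the code above.
// dedup_sketch.go (illustrative)
package dedup

import (
	"sync"
	"time"
)

// flowKey identifies a flow independent of which hook reported it.
type flowKey struct {
	SrcIP, DstIP     uint32
	SrcPort, DstPort uint16
}

// dedup remembers recently reported flows for a short TTL.
type dedup struct {
	mu   sync.Mutex
	seen map[flowKey]time.Time
	ttl  time.Duration
}

func newDedup(ttl time.Duration) *dedup {
	return &dedup{seen: make(map[flowKey]time.Time), ttl: ttl}
}

// shouldEmit returns true the first time a flow key is seen within the TTL
// window, and false for duplicates reported by the other hook.
func (d *dedup) shouldEmit(k flowKey) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	now := time.Now()
	last, ok := d.seen[k]
	d.seen[k] = now
	return !ok || now.Sub(last) >= d.ttl
}
A real controller would also periodically sweep expired keys out of the map to bound its memory use.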
Performance Considerations and Production Hardening
Building an eBPF tool is one thing; running it reliably in production is another.
* Kernel Version Compatibility & CO-RE: eBPF capabilities are tightly coupled to kernel versions. Hard-coding struct offsets will break as kernels are updated. The solution is CO-RE (Compile Once - Run Everywhere). By compiling with BTF (BPF Type Format) debug information, the eBPF loader (like `cilium/ebpf`) can perform runtime relocations, adjusting the eBPF bytecode to match the struct layouts of the running kernel. Our use of `bpf2go` and `cilium/ebpf` handles this automatically, but it's crucial to understand the mechanism.
* CPU Overhead: eBPF programs are designed to be fast, but they are not free. The eBPF verifier enforces constraints (e.g., no unbounded loops, limited instruction count) to prevent kernel lockups. However, a complex program on a high-throughput interface can still consume significant CPU. Always profile your eBPF programs using tools like `bpftool prog profile`.
* Memory Overhead: The size of eBPF maps is a primary driver of memory usage. A perf buffer needs to be large enough to handle bursts of traffic without dropping samples, but not so large that it wastes memory. For stateful maps (e.g., tracking active flows), consider `BPF_MAP_TYPE_LRU_HASH` to automatically evict old entries (a minimal sketch follows this list).
* Race Conditions: Our userspace IP cache is a classic source of race conditions. A pod can be deleted, and its IP reassigned, while a network event for the old pod is still in the perf buffer. The controller must handle this gracefully. A common pattern is to timestamp events in the kernel and correlate them with the state of the Kubernetes object at that time, or to simply accept a small window of potential inaccuracy during pod churn.
* Security Context: Loading and attaching eBPF programs requires elevated privileges (on newer kernels, `CAP_BPF` together with `CAP_NET_ADMIN` and `CAP_PERFMON`; on older kernels, `CAP_SYS_ADMIN`). Your controller should be deployed as a DaemonSet with a carefully restricted security context. The principle of least privilege is paramount. The eBPF verifier is your primary safety net, ensuring that your program cannot corrupt kernel memory or crash the system.
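To make the LRU suggestion from the memory bullet concrete, here is a minimal Go sketch that creates a `BPF_MAP_TYPE_LRU_HASH` map from userspace with `cilium/ebpf`; the map name, key/value layout, and entry count are illustrative, and in a real build you would more likely declare the map in the C program and let `bpf2go` generate the accessors.
// lru_map_sketch.go (illustrative)
package main

import (
	"log"

	"github.com/cilium/ebpf"
	"github.com/cilium/ebpf/rlimit"
)

func main() {
	// Allow locking map memory on kernels that still enforce RLIMIT_MEMLOCK.
	if err := rlimit.RemoveMemlock(); err != nil {
		log.Fatalf("removing memlock limit: %v", err)
	}

	// An LRU hash map keyed by a packed flow tuple (12 bytes) with a u64 byte
	// counter as the value. When the map is full, the least recently used
	// entries are evicted automatically.
	flowState, err := ebpf.NewMap(&ebpf.MapSpec{
		Name:       "flow_state",
		Type:       ebpf.LRUHash,
		KeySize:    12, // 4+4+2+2: src IP, dst IP, src port, dst port
		ValueSize:  8,  // u64 byte counter
		MaxEntries: 65536,
	})
	if err != nil {
		log.Fatalf("creating LRU map: %v", err)
	}
	defer flowState.Close()

	log.Println("created BPF_MAP_TYPE_LRU_HASH map flow_state")
}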
A Note on L7 and Encrypted Traffic
What about HTTP/gRPC visibility? This is significantly more complex.
* Plaintext: For unencrypted traffic, you can use `kprobes` on `tcp_sendmsg`/`tcp_recvmsg` and attempt to reassemble the TCP stream in your eBPF program or in userspace. This is stateful, complex, and resource-intensive.
* Encrypted (TLS): You cannot inspect TLS traffic at the `tc` hook. The only viable eBPF approach is to use `uprobes` (user-space probes) to attach to the read/write functions of the SSL/TLS library used by the application (e.g., OpenSSL's `SSL_read`/`SSL_write`). This gives you access to the data *before* encryption and *after* decryption. However, this is extremely fragile: it depends on the specific library, its version, and the symbols it exposes. It is a powerful but brittle technique reserved for highly controlled environments.
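To make the uprobe approach concrete, here is a hedged Go sketch that attaches an already-loaded eBPF program (here called `prog`, not defined in this post) to OpenSSL's `SSL_write` using `cilium/ebpf/link`; the library path is an assumption and varies by distribution and OpenSSL version.
// tls_uprobe_sketch.go (illustrative)
package tlsprobe

import (
	"log"

	"github.com/cilium/ebpf"
	"github.com/cilium/ebpf/link"
)

// attachSSLWriteUprobe attaches a uprobe program to OpenSSL's SSL_write so the
// eBPF program sees plaintext buffers before they are encrypted.
func attachSSLWriteUprobe(prog *ebpf.Program) (link.Link, error) {
	// Open the shared library whose symbols we want to probe. The path
	// differs across distributions and OpenSSL builds.
	ex, err := link.OpenExecutable("/usr/lib/x86_64-linux-gnu/libssl.so.3")
	if err != nil {
		return nil, err
	}
	up, err := ex.Uprobe("SSL_write", prog, nil)
	if err != nil {
		return nil, err
	}
	log.Println("uprobe attached to SSL_write")
	return up, nil
}
A matching uprobe on `SSL_read` (and uretprobes to capture return values) would be needed for the inbound, post-decryption direction.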
While the challenges are significant, the payoff of using eBPF for network observability is immense. It provides a level of performance and transparency that is impossible to achieve with traditional userspace agents. By moving instrumentation out of the application pod and into the shared kernel, we build a more efficient, secure, and powerful foundation for the future of cloud-native systems.