eBPF for Granular K8s Network Policy & High-Fidelity Observability
Beyond iptables: The Kernel-Level Revolution in Cloud-Native Networking
As senior engineers managing complex Kubernetes environments, we've all encountered the limitations of default NetworkPolicy. While essential for L3/L4 segmentation, it operates on a primitive understanding of traffic: IP addresses and ports. In a dynamic microservices architecture where pods are ephemeral and IP addresses are meaningless, this model quickly breaks down. We need identity-based, application-aware security.
The typical next step has been to adopt a service mesh like Istio. While powerful, this introduces significant operational complexity: sidecar injection, proxy configuration management, increased resource consumption, and an added layer of latency for every network call. What if we could achieve L7-aware policy enforcement and deep observability with near-zero overhead, directly within the Linux kernel?
This is the promise of eBPF (extended Berkeley Packet Filter). By attaching small, sandboxed programs to kernel hooks, we can inspect, filter, and even modify network packets before they traverse the traditional networking stack. This article is not an introduction to eBPF; it's a deep dive into its practical, production-grade application for security and observability in Kubernetes. We will dissect how tools like Cilium leverage eBPF for advanced L7 policies and then build our own custom eBPF-based observability tool from scratch using C and Go to solve a real-world debugging challenge.
The Kernel-Level Advantage: eBPF Data Plane vs. Sidecar Proxies
The fundamental difference between an eBPF-based data plane (like Cilium) and a sidecar-based one (like Istio) lies in the execution context. A sidecar is a user-space proxy running alongside your application container. An eBPF program is JIT-compiled and runs inside the kernel.
Data Path Comparison
The sidecar proxy path (Istio):
* Packet from Pod A leaves its network namespace.
* iptables rules redirect the packet to the Envoy sidecar's listener in Pod A.
* Envoy (user-space) processes the packet, applies L7 policies, and performs TLS termination/origination.
* Envoy sends the packet back into the kernel to be routed to Pod B.
* The packet arrives at Pod B's Envoy sidecar.
* Envoy in Pod B processes the packet and forwards it to the application container via the loopback interface.
This involves multiple kernel-to-user-space context switches, adding measurable latency (typically milliseconds) and consuming significant CPU/memory for each proxy.
The eBPF data path (Cilium):
* Packet from Pod A is generated by the application.
* An eBPF program attached to a low-level kernel hook (e.g., the TC hook or a socket hook) intercepts the packet.
* The eBPF program, running in the kernel context, inspects the packet.
* It makes a policy decision based on pod identity (derived from CNI and label information) and, when needed, L7 data (protocols like HTTP/gRPC are parsed either in-kernel or by a shared node-local proxy, never a per-pod sidecar).
* If allowed, the packet is forwarded directly to Pod B's network interface, often bypassing iptables and large parts of the upper networking stack.
This model is orders of magnitude more efficient. Context switches are eliminated, and since the logic runs in the kernel, the security boundary is stronger—policy is enforced before the packet even enters the target pod's network namespace.
| Metric | Sidecar Proxy (Istio) | eBPF Data Plane (Cilium) | Advantage |
|---|---|---|---|
| Added Latency | 1-10 ms per hop | < 1 ms per hop | eBPF (10x-100x lower latency) |
| CPU Overhead | High (per-proxy) | Low (shared per-node) | eBPF (Dramatically lower resource cost) |
| Memory Overhead | High (per-proxy) | Low (shared per-node) | eBPF (Efficient memory usage via maps) |
| Security Context | User-space | Kernel-space | eBPF (Earlier, more secure enforcement) |
Deep Dive: Implementing L7-Aware Policies with Cilium
Let's move from theory to a concrete production scenario. Imagine a multi-tenant environment with a billing-service. We need to enforce a critical security rule: only pods with the label app: payment-processor are allowed to make POST requests to the /api/v1/charge endpoint. All other pods, even if they are in the same namespace, should be blocked from accessing this specific endpoint, and the payment-processor itself should be blocked from accessing any other endpoints on the billing-service.
A standard Kubernetes NetworkPolicy is useless here. It can restrict access to the billing-service pod on its port, but it has no visibility into the HTTP method or path.
CiliumNetworkPolicy Implementation
Cilium extends Kubernetes with a CiliumNetworkPolicy CRD that leverages eBPF to understand application-layer protocols. Here is the manifest to enforce our rule:
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
name: "billing-service-l7-policy"
namespace: "finance"
spec:
endpointSelector:
matchLabels:
app: billing-service
ingress:
- fromEndpoints:
- matchLabels:
app: payment-processor
toPorts:
- ports:
- port: "8080"
protocol: TCP
rules:
http:
- method: "POST"
path: "/api/v1/charge"
Deconstructing the eBPF Magic
How does this YAML translate to kernel-level enforcement?
* Identity, not IP addresses: Cilium derives a numeric security identity for each pod from its labels and keeps the mapping in eBPF maps, rather than maintaining ipBlock rules in iptables that constantly change.
* In-kernel interception: An eBPF program is attached to the network interface (TC hook) of the billing-service pod. This hook allows eBPF to see all network packets entering or leaving the pod's network namespace.
* L7 enforcement: When the L7 data path sees the POST /api/v1/charge HTTP/1.1 line, it checks its policy rules (also stored in an eBPF map). It looks up the source pod's numeric identity and confirms it corresponds to app: payment-processor. It matches the method and path against the allowlist. If all conditions match, it allows the packet to proceed. If not, it drops the packet and can even inject a TCP reset to terminate the connection cleanly. All of this happens in microseconds, without the application or a per-pod sidecar ever seeing the denied request. (For HTTP parsing specifically, Cilium's eBPF data path hands the flow to a shared node-local proxy; identity and L3/L4 checks are enforced entirely in-kernel.)
This is the power of eBPF: expressing high-level, application-aware rules that are compiled down to highly efficient, sandboxed kernel bytecode.
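To make the identity model concrete, here is a deliberately simplified sketch of the L3/L4 stage only, written as a standalone TC program. It is not Cilium's actual source; the map names (identity_by_ip, allowed_identities) and the single-stage structure are illustrative assumptions, and a user-space agent is assumed to populate both maps from the Kubernetes API as pods come and go.
// Conceptual sketch only -- NOT Cilium's actual source. A TC ingress program
// that enforces identity-based L3/L4 policy: resolve the sender's numeric
// identity from its IP, then check an allowlist before admitting the packet.
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

// IPv4 source address -> numeric security identity (kept in sync by a
// user-space agent watching pod labels).
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 65536);
    __type(key, __u32);   // IPv4 address, network byte order
    __type(value, __u32); // numeric identity
} identity_by_ip SEC(".maps");

// Identities allowed to reach this endpoint at the L3/L4 stage.
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 1024);
    __type(key, __u32); // identity
    __type(value, __u8);
} allowed_identities SEC(".maps");

SEC("tc")
int ingress_policy(struct __sk_buff *skb) {
    void *data = (void *)(long)skb->data;
    void *data_end = (void *)(long)skb->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return TC_ACT_SHOT;
    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return TC_ACT_OK; // not IPv4: out of scope for this sketch

    struct iphdr *iph = (void *)(eth + 1);
    if ((void *)(iph + 1) > data_end)
        return TC_ACT_SHOT;

    __u32 saddr = iph->saddr;
    __u32 *identity = bpf_map_lookup_elem(&identity_by_ip, &saddr);
    if (!identity)
        return TC_ACT_SHOT; // unknown source identity: drop

    __u32 id = *identity;
    if (!bpf_map_lookup_elem(&allowed_identities, &id))
        return TC_ACT_SHOT; // identity not on the allowlist: drop

    // An L7 stage (HTTP method/path) would sit behind this check.
    return TC_ACT_OK;
}

char LICENSE[] SEC("license") = "GPL";
Cilium's real datapath is of course far richer (connection tracking, encapsulation, per-direction policy maps keyed by identity and port), but the principle is the same: policy decisions become O(1) map lookups executed in kernel context.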
Building a Custom eBPF Observability Tool for DNS Latency
While Cilium is phenomenal, the true power of eBPF is realized when you build custom tools to solve unique problems. Let's tackle a common and frustrating issue: debugging intermittent service discovery latency.
Problem: A service is experiencing occasional high latency. You suspect slow DNS lookups from within the pod, but instrumenting every application with DNS timing metrics is impractical, and tools like tcpdump are too crude and generate massive files that are difficult to analyze at scale.
Solution: We will build a lightweight, eBPF-based tool that traces all DNS queries (over UDP port 53) originating from a node, measures the latency between the request and response, and reports any queries exceeding a certain threshold. We will use C for the eBPF kernel program and Go with the cilium/ebpf library for the user-space controller.
1. The eBPF Kernel Program (C)
This C code will be compiled into eBPF bytecode. We'll use kprobes to attach to kernel functions responsible for sending and receiving UDP packets.
dns_tracker.c
// SPDX-License-Identifier: GPL-2.0
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>
// Max DNS packet size
#define MAX_DNS_SIZE 512
// TASK_COMM_LEN is a kernel macro, so it is not carried in BTF/vmlinux.h; define it here.
#define TASK_COMM_LEN 16
// Key for correlating a DNS query with its response.
// The PID is deliberately not part of the key: the response is processed in
// softirq context, where the current PID is not the original requester's.
struct dns_query_key_t {
    u32 net_ns;
    u16 sport;
    u16 id; // DNS transaction ID
};
// Value storing the requester's PID, command name, and the start timestamp
struct dns_query_val_t {
    u64 ts;
    u32 pid;
    char comm[TASK_COMM_LEN];
};
// Data sent to user-space via perf buffer
struct dns_event_t {
u64 latency_ns;
u32 pid;
u32 net_ns;
u16 id;
char comm[TASK_COMM_LEN];
char qname[128];
};
// Hash map to store start timestamps of DNS queries
struct {
__uint(type, BPF_MAP_TYPE_HASH);
__uint(max_entries, 4096);
__type(key, struct dns_query_key_t);
__type(value, struct dns_query_val_t);
} ongoing_dns_queries SEC(".maps");
// Perf event array to send data to user-space
struct {
__uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
__uint(key_size, sizeof(int));
__uint(value_size, sizeof(u32));
} events SEC(".maps");
// Helper to get the network namespace inode number for a socket
static __always_inline u32 get_net_ns(struct sock *sk) {
    if (!sk)
        return 0;
    // sk_net is only a macro in kernel headers; with vmlinux.h we must
    // read the underlying member via __sk_common.
    return BPF_CORE_READ(sk, __sk_common.skc_net.net, ns.inum);
}
// Attach to the entry of the kernel's udp_send_skb() function
SEC("kprobe/udp_send_skb")
int BPF_KPROBE(kprobe__udp_send_skb, struct sk_buff *skb) {
    struct udphdr udph;
    unsigned char *head;
    u16 transport_header, id;
    u64 pid_tgid = bpf_get_current_pid_tgid();
    u32 pid = pid_tgid >> 32;
    // Copy the UDP header out of the skb. In a kprobe we cannot use
    // bpf_skb_load_bytes(); we have to read kernel memory explicitly.
    head = BPF_CORE_READ(skb, head);
    transport_header = BPF_CORE_READ(skb, transport_header);
    if (bpf_probe_read_kernel(&udph, sizeof(udph), head + transport_header))
        return 0;
    // We only care about DNS queries
    if (bpf_ntohs(udph.dest) != 53)
        return 0;
    // The DNS transaction ID is the first 16 bits after the UDP header
    if (bpf_probe_read_kernel(&id, sizeof(id), head + transport_header + sizeof(udph)))
        return 0;
    struct dns_query_key_t key = {};
    key.net_ns = get_net_ns(BPF_CORE_READ(skb, sk));
    key.sport = bpf_ntohs(udph.source);
    key.id = id;
    struct dns_query_val_t val = {};
    val.ts = bpf_ktime_get_ns();
    val.pid = pid;
    bpf_get_current_comm(&val.comm, sizeof(val.comm));
    bpf_map_update_elem(&ongoing_dns_queries, &key, &val, BPF_ANY);
    return 0;
}
// Attach to the entry of the kernel's udp_queue_rcv_skb() function
SEC("kprobe/udp_queue_rcv_skb")
int BPF_KPROBE(kprobe__udp_queue_rcv_skb, struct sock *sk, struct sk_buff *skb) {
    struct udphdr udph;
    unsigned char *head;
    u16 transport_header, id;
    head = BPF_CORE_READ(skb, head);
    transport_header = BPF_CORE_READ(skb, transport_header);
    if (bpf_probe_read_kernel(&udph, sizeof(udph), head + transport_header))
        return 0;
    // We only care about DNS responses
    if (bpf_ntohs(udph.source) != 53)
        return 0;
    if (bpf_probe_read_kernel(&id, sizeof(id), head + transport_header + sizeof(udph)))
        return 0;
    // Rebuild the key from the response: the response's destination port is
    // the client's source port. The current PID is irrelevant here, since the
    // response is handled in softirq context.
    struct dns_query_key_t key = {};
    key.net_ns = get_net_ns(sk);
    key.sport = bpf_ntohs(udph.dest);
    key.id = id;
    struct dns_query_val_t *val_ptr = bpf_map_lookup_elem(&ongoing_dns_queries, &key);
    if (!val_ptr)
        return 0;
    u64 latency = bpf_ktime_get_ns() - val_ptr->ts;
    struct dns_event_t event = {};
    event.latency_ns = latency;
    event.pid = val_ptr->pid;      // PID and comm were captured at send time
    event.net_ns = key.net_ns;
    event.id = key.id;
    __builtin_memcpy(event.comm, val_ptr->comm, sizeof(event.comm));
    bpf_map_delete_elem(&ongoing_dns_queries, &key);
    // Only report slow queries (e.g., > 50ms)
    if (latency < 50 * 1000 * 1000)
        return 0;
    // In a real tool, we would parse the qname from the packet data here.
    // For brevity, we skip the parsing logic.
    bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU, &event, sizeof(event));
    return 0;
}
char LICENSE[] SEC("license") = "GPL";
Key Concepts in the C Code:
* vmlinux.h: This header is generated by bpftool and contains all kernel type definitions. It's essential for CO-RE (Compile Once – Run Everywhere) to work, making our program portable across different kernel versions.
* kprobe: A dynamic tracing mechanism to attach our eBPF program to the entry (kprobe) or exit (kretprobe) of almost any kernel function.
* BPF_MAP_TYPE_HASH: A key-value store accessible from both our eBPF program and user-space. We use it to store the requester's PID, command name, and the timestamp at which each DNS query was sent.
* BPF_MAP_TYPE_PERF_EVENT_ARRAY: A high-performance, lockless way to send data from the kernel program to our user-space application.
* bpf_ktime_get_ns(): A helper function to get a monotonic timestamp.
* CO-RE Helpers (BPF_CORE_READ): Safely read kernel struct members, even if the struct layout changes between kernel versions.
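To see why the CO-RE helpers matter, here is a tiny, self-contained kprobe sketch (illustrative only, not part of the DNS tracker; the probe target inet_csk_accept is just an example) that reads nested struct sock fields without hard-coding offsets. libbpf records each access as a BTF relocation and fixes up the offsets at load time for the kernel it actually runs on.
// Minimal CO-RE sketch (illustrative, not part of the DNS tracker): read
// nested struct sock fields from a kprobe without hard-coding offsets.
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>

SEC("kprobe/inet_csk_accept")
int BPF_KPROBE(trace_accept, struct sock *sk) {
    // Each step of the chained read is recorded as a BTF relocation, so
    // libbpf can adjust the offsets for whatever kernel this runs on.
    u16 port = BPF_CORE_READ(sk, __sk_common.skc_num);
    u32 netns = BPF_CORE_READ(sk, __sk_common.skc_net.net, ns.inum);

    bpf_printk("accept: port=%d netns=%u", port, netns);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";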
2. The User-space Controller (Go)
This Go application will load, attach, and listen for events from our eBPF program.
main.go
package main
import (
"bytes"
"encoding/binary"
"errors"
"fmt"
"log"
"os"
"os/signal"
"syscall"
"github.com/cilium/ebpf/link"
"github.com/cilium/ebpf/perf"
"golang.org/x/sys/unix"
)
//go:generate go run github.com/cilium/ebpf/cmd/bpf2go -cc clang -cflags "-O2 -g -Wall -Werror" bpf dns_tracker.c -- -I./headers
const TASK_COMM_LEN = 16
type dnsEventT struct {
LatencyNs uint64
Pid uint32
NetNs uint32
Id uint16
Comm [TASK_COMM_LEN]byte
Qname [128]byte
}
func main() {
// Handle Ctrl+C
stopper := make(chan os.Signal, 1)
signal.Notify(stopper, os.Interrupt, syscall.SIGTERM)
// Increase rlimit memory lock for eBPF maps
if err := unix.Setrlimit(unix.RLIMIT_MEMLOCK, &unix.Rlimit{Cur: unix.RLIM_INFINITY, Max: unix.RLIM_INFINITY}); err != nil {
log.Fatalf("failed to set rlimit: %v", err)
}
// Load pre-compiled BPF objects
objs := bpfObjects{}
if err := loadBpfObjects(&objs, nil); err != nil {
log.Fatalf("loading objects: %v", err)
}
defer objs.Close()
// Attach kprobe for sending UDP packets
kpSend, err := link.Kprobe("udp_send_skb", objs.KprobeUdpSendSkb, nil)
if err != nil {
log.Fatalf("attaching kprobe udp_send_skb: %v", err)
}
defer kpSend.Close()
// Attach kprobe for receiving UDP packets
kpRecv, err := link.Kprobe("udp_queue_rcv_skb", objs.KprobeUdpQueueRcvSkb, nil)
if err != nil {
log.Fatalf("attaching kprobe udp_queue_rcv_skb: %v", err)
}
defer kpRecv.Close()
// Open a perf event reader from the BPF map
rd, err := perf.NewReader(objs.Events, os.Getpagesize())
if err != nil {
log.Fatalf("creating perf event reader: %v", err)
}
defer rd.Close()
go func() {
<-stopper
log.Println("Received signal, exiting...")
rd.Close()
}()
log.Println("Waiting for events... Press Ctrl+C to exit.")
var event dnsEventT
for {
record, err := rd.Read()
if err != nil {
if errors.Is(err, perf.ErrClosed) {
return
}
log.Printf("reading from perf buffer: %s", err)
continue
}
if record.LostSamples > 0 {
log.Printf("perf buffer dropped %d samples", record.LostSamples)
continue
}
// Parse the raw data into our struct
if err := binary.Read(bytes.NewBuffer(record.RawSample), binary.LittleEndian, &event); err != nil {
log.Printf("parsing perf event: %s", err)
continue
}
fmt.Printf("Slow DNS Query Detected! Latency: %.2fms, PID: %d, Comm: %s, NetNS: %d\n",
float64(event.LatencyNs)/1000000.0,
event.Pid,
unix.ByteSliceToString(event.Comm[:]),
event.NetNs,
)
}
}
3. Compilation and Deployment in Kubernetes
Build Steps:
* Generate vmlinux.h: bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h
* go generate will execute the bpf2go command, compiling dns_tracker.c and embedding it into a Go file (bpf_bpfel.go).
* go build -o dns-tracker builds the user-space controller binary.
Production Deployment Pattern:
This tool must run on every node to capture all DNS activity. The correct Kubernetes pattern is a DaemonSet.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dns-tracker
  namespace: kube-system
  labels:
    app: dns-tracker
spec:
  selector:
    matchLabels:
      app: dns-tracker
  template:
    metadata:
      labels:
        app: dns-tracker
    spec:
      hostPID: true
      hostNetwork: true
      tolerations:
      - operator: Exists
      containers:
      - name: dns-tracker-container
        image: your-repo/dns-tracker:latest
        securityContext:
          privileged: true
        volumeMounts:
        - name: bpf-fs
          mountPath: /sys/fs/bpf
          readOnly: false
      volumes:
      - name: bpf-fs
        hostPath:
          path: /sys/fs/bpf
Critical DaemonSet settings:
* hostPID: true: Lets the container see all PIDs on the host, so the PIDs reported by our kprobes can be resolved to actual host processes.
* securityContext.privileged: true: Required to load eBPF programs and access kernel debugging features.
* Volume Mount for /sys/fs/bpf: This is the virtual filesystem where eBPF maps and programs are pinned, allowing them to persist.
Running this DaemonSet across your cluster provides a powerful, low-overhead tool for instantly identifying which pods and processes are suffering from slow DNS resolution, a task that is notoriously difficult with traditional tools.
Advanced Edge Cases and Performance Considerations
Building robust eBPF tooling requires a deep understanding of the kernel and the eBPF runtime's constraints.
* The eBPF Verifier: Before any eBPF program is loaded, it undergoes a rigorous static analysis by the kernel's verifier. The verifier ensures the program is safe to run by checking for unbounded loops, out-of-bounds memory access, and null pointer dereferences. A common mistake is iterating over a packet's data without explicit boundary checks, which the verifier will reject; likewise, a for loop whose upper bound the verifier cannot prove (and that is not unrolled with #pragma unroll) will fail verification. A minimal verifier-friendly loop is sketched after this list.
* Map Contention at Scale: Our ongoing_dns_queries map is a BPF_MAP_TYPE_HASH. On a node with tens of thousands of DNS queries per second, this single map can become a point of lock contention as multiple CPUs try to update it simultaneously. For high-throughput accounting, BPF_MAP_TYPE_PERCPU_HASH is the usual answer: it keeps a separate copy of each value per CPU, the eBPF program updates only its local copy (a lock-free operation), and user-space iterates over the per-CPU values to get the full picture, trading a small amount of memory for a significant performance gain. One caveat for this particular map: a lookup from eBPF only sees the current CPU's copy, so per-CPU maps suit counters and statistics better than cross-CPU request/response correlation; for the correlation map itself, an LRU hash (BPF_MAP_TYPE_LRU_HASH) is often the more practical relief valve. A per-CPU counter sketch follows after this list.
* Kernel Version Dependencies and CO-RE: Historically, a major pain point of eBPF was that programs had to be recompiled for the specific kernel version they were running on due to changes in kernel data structures. CO-RE (Compile Once – Run Everywhere) solves this. By using vmlinux.h which contains BTF (BPF Type Format) data, libbpf can perform runtime relocations. It understands the structure of the kernel on the target machine and automatically adjusts memory offsets in your eBPF program to match. This is why our C code uses BPF_CORE_READ—it's a macro that enables this relocation magic.
* Tail Calls for Complex Logic: eBPF programs have a stack size limit of 512 bytes and an instruction limit (1 million instructions since kernel 5.3). To implement complex logic, such as a multi-stage protocol parser, you can use bpf_tail_call. This helper effectively replaces the current eBPF program with another one, similar to an exec() system call, without returning to the original program. This allows you to chain programs together to perform complex tasks that would otherwise exceed the verifier's limits; a minimal sketch appears after this list.
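To make the verifier's expectations tangible, the following sketch (a generic XDP packet scan, unrelated to the DNS tracker; the 0xFF "marker" rule is purely illustrative) shows the shape of a loop that passes verification: a compile-time bound, #pragma unroll, and an explicit bounds check against data_end before every read. Remove either the bound or the check and the same program is rejected.
// A loop the verifier accepts: constant bound, unrolled, and a bounds
// check against data_end before every packet access.
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define SCAN_LIMIT 64

SEC("xdp")
int scan_packet(struct xdp_md *ctx) {
    unsigned char *data = (void *)(long)ctx->data;
    unsigned char *data_end = (void *)(long)ctx->data_end;

    #pragma unroll
    for (int i = 0; i < SCAN_LIMIT; i++) {
        // Without this explicit check, the verifier rejects the program
        // for a potential out-of-bounds packet access.
        if (data + i + 1 > data_end)
            break;
        if (data[i] == 0xFF)
            return XDP_DROP; // arbitrary rule, purely for illustration
    }
    return XDP_PASS;
}

char LICENSE[] SEC("license") = "GPL";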
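Next, the per-CPU pattern. This is a hypothetical drop-in addition to dns_tracker.c (the map name dns_query_counts and the per-netns u32 key are assumptions for the example): a lock-free counter map that each CPU updates independently. On the Go side, looking up a per-CPU map with cilium/ebpf yields one value per possible CPU, which the reader sums to get the node-wide total.
// Per-CPU counter sketch: each CPU gets its own slot for every key,
// so increments never contend across CPUs.
struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_HASH);
    __uint(max_entries, 1024);
    __type(key, u32);   // e.g. network namespace inode
    __type(value, u64); // DNS queries observed on this CPU
} dns_query_counts SEC(".maps");

static __always_inline void count_query(u32 net_ns) {
    u64 one = 1;
    u64 *cnt = bpf_map_lookup_elem(&dns_query_counts, &net_ns);
    if (cnt)
        // Only the current CPU ever touches this slot, so a plain
        // read-modify-write is sufficient.
        *cnt += 1;
    else
        bpf_map_update_elem(&dns_query_counts, &net_ns, &one, BPF_NOEXIST);
}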
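Finally, a minimal tail-call sketch (hypothetical program and map names): a classifier program jumps into a second-stage parser through a BPF_MAP_TYPE_PROG_ARRAY, and the second stage is verified independently with its own instruction budget. The jump table is assumed to be populated by user-space (or declaratively via libbpf) before traffic flows.
// Sketch: chaining eBPF programs with bpf_tail_call().
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define PARSER_HTTP 0

// Jump table of follow-up programs, populated before attach.
struct {
    __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
    __uint(max_entries, 8);
    __uint(key_size, sizeof(__u32));
    __uint(value_size, sizeof(__u32));
} parsers SEC(".maps");

SEC("xdp")
int http_parser(struct xdp_md *ctx) {
    // Second stage: continue parsing with a fresh instruction budget.
    return XDP_PASS;
}

SEC("xdp")
int classifier(struct xdp_md *ctx) {
    // First stage: decide which parser should handle the packet, then jump.
    // On success, bpf_tail_call never returns; the fallthrough below only
    // runs if the jump-table slot is empty.
    bpf_tail_call(ctx, &parsers, PARSER_HTTP);
    return XDP_PASS;
}

char LICENSE[] SEC("license") = "GPL";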
Conclusion: eBPF as the Future of Cloud-Native Infrastructure
eBPF is not just another tool; it represents a fundamental shift in how we build and manage cloud-native systems. It allows us to push observability, security, and networking logic into the most privileged and performant location: the kernel itself.
We've seen how high-level tools like Cilium use eBPF to provide sophisticated L7 policies that outperform sidecar proxies. More importantly, we've demonstrated that senior engineers can and should leverage eBPF directly to build bespoke solutions for their unique challenges, like our DNS latency tracker.
Understanding the principles of eBPF programming, the role of the verifier, the importance of CO-RE, and advanced concepts like map types and tail calls is becoming a critical skill. As the ecosystem matures with projects like Tetragon for security observability and Parca for continuous profiling, proficiency in eBPF will be a defining characteristic of the next generation of elite infrastructure and security engineers.